Lecture Notes in Computer Science
q = lim_{j→∞} f_-^{h_j}(p_{k_i}) = lim_{j→∞} f_-^{h_j − k_i}(s_{k_i}), where k_i is a function of h_j with k_i > h_j. Let s_{k_i} = f_-^{k_i}(p_{k_i}) ∈ D_0 and let s_0 ∈ D_0 be an accumulation point of {s_{k_i}}. Then, since f_-^{−1} is continuous, letting n_j = −h_j + k_i > 0, q = lim_{j→∞} f_-^{−n_j}(s_0), i.e., q ∈ Q_ω.

Lemma 8. D_ω = ⋃_{q ∈ Q_ω} W^{u,−1}(q).

Proof. Let p be any point in D_ω. Since f_-(D_ω) ⊆ (I[−1, 1])^2, where I[−1, 1] is the interval [−1, 1], i.e., f_-(D_ω) is bounded, and f_-(D_ω) ⊆ D_ω, the sequence {f_-^n(p)} has an accumulation point q in D_ω, which is, by Lemma 7, in Q_ω; that is, q is expressed as q = lim_{j→∞} f_-^{n_j}(p). Since Q_ω is a finite set of hyperbolic fixed points, q = lim_{n→∞} f_-^n(p), i.e., p ∈ W^{s,f_-}(q) = W^{u,f_-^{−1}}(q) = W^{u,−1}(q).
Since P_ω ⊆ D_ω, the next theorem holds.

Theorem 9. A point in P_ω is either a point in Q_ω or in W^{u,−1}(q) for some q ∈ Q_ω.

Please note that since q ∈ W^{u,−1}(q), the theorem statement is just "If p ∈ P_ω then p ∈ W^{u,−1}(q) for some q ∈ Q_ω."
An Example of a Recognizer

To construct an SRN recognizer for {a^n b^n | n > 0}, the SRN should satisfy the conditions stated in Theorem 9 and Postulate 4, which are summarized as:

1. f_+(D_i) ⊆ D_{i+1},
2. the D_i are disjoint,
3. P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_-, respectively,
4. W^{u,f_-}_loc(q) and W^{s,f_-}_loc(q) for q ∈ Q_ω are one-dimensional, and
5. if p ∈ P_ω then p ∈ W^{u,−1}(q) for some q ∈ Q_ω.
442 A. Iwata, Y. Shinozawa, and A. Sakurai

Let us consider the simplest possible case, so the first choice is to think about a point p ∈ P_ω and a point q ∈ Q_ω, that is, f_+(p) = p and f_-(q) = q. Since p cannot be the same as q (because f_-^{−1} ∘ f_+(p) = p + w_s^{−1} · w_x · (x_+ − x_-) ≠ p), we have to find a way to let p ∈ W^{u,−1}(q).
Since it is in general very hard to calculate stable or unstable manifolds from a function and its fixed point, we had better try to make W^{u,−1}(q) a "simple" manifold. There is one more reason to do so: we have to define D_0 = {x | h(x) ≥ 0}, but if W^{u,−1}(q) is not simple, a suitable h may not exist. We decided that W^{u,−1}(q) should be a line (if possible). Considering the function form f(s) = σ(w_s · s + w_x · x_-), it is not difficult to see that the line could be one of the axes or one of the bisectors of the right angles at the origin (i.e., one of the lines y = x and y = −x). We chose the bisector in the first (and third) quadrant, i.e., the line y = x. Accordingly, q was chosen to be the origin and p was chosen arbitrarily as (0.8, 0.8).

Item 4 is satisfied by setting one of the two eigenvalues of Df_- at the origin to be greater than one and the other smaller than one. We chose 1/0.6 for one and 1/μ for the other, where μ is to be set so that Items 1 and 2 are satisfied, by considering the eigenvalues of Df_+ at p.

The design consideration that we have skipped is how to design D_0 = {x | h(x) ≥ 0}. A simple way is to make the boundary h(x) = 0 parallel to W^{u,−1}(q) for our intended q ∈ Q_ω. If we do so, by setting the largest eigenvalue of Df_- at q equal to the inverse of the eigenvalue of Df_+ at p along the normal to W^{u,−1}, we can make the points s ∈ D_0, f_- ∘ f_+(s), f_-^2 ∘ f_+^2(s), …, f_-^i ∘ f_+^i(s), …, which correspond to strings in {a^n b^n | n > 0}, reside at approximately equal distances from W^{u,−1}. Needless to say, the points corresponding to, say, {a^{n+1} b^n | n > 0} also lie at approximately equal distances from W^{u,−1} among themselves, and this distance is different from that for {a^n b^n | n > 0}.
Let f_-(x) = σ(Ax + B_0) and f_+(x) = σ(Ax + B_1). We plan to put Q_ω = {(0, 0)}, P_ω = {(0.8, 0.8)}, and W^{u,−1} = {(x, y) | y = x}. The eigenvalues of the tangent map of f_-^{−1} at (0, 0) are 1/λ = 1/0.6 and 1/μ (where the eigenvector on y = x is expanding), and the eigenvalues of the tangent map of f_+ at (0.8, 0.8) are 1/μ and any value. Then, considering the derivatives at (0, 0) and (0.8, 0.8), it is easy to see that

  (1/2) A = ρ(π/4) [[λ, 0], [0, μ]] ρ(−π/4),   1/μ = (1 − 0.8^2) μ,

where ρ(θ) is a rotation by θ. Then

  A = [[λ + μ, λ − μ], [λ − μ, λ + μ]].

Next, from σ(B_0) = (0, 0)^T and σ((1.6λ, 1.6λ)^T + B_1) = (0.8, 0.8)^T (note that A(0.8, 0.8)^T = (1.6λ, 1.6λ)^T, since (1, 1)^T is an eigenvector of A with eigenvalue 2λ),

  B_0 = (0, 0)^T,   B_1 = (σ^{−1}(0.8) − 1.6λ, σ^{−1}(0.8) − 1.6λ)^T.

These give us μ = 5/3, λ = 0.6, and B_1 ≈ (1.23722, 1.23722)^T.
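The construction above can be checked numerically. The sketch below assumes σ(x) = tanh(x/2) (this specific σ is our assumption, not stated in this section; it is the choice consistent with the factor 1/2 in front of A and with the derivative factor (1 − 0.8^2) at p). It builds A, B_0, B_1 from λ = 0.6 and μ = 5/3 and verifies the designed fixed points and eigenvalues.

```python
import math

# Assumption: sigma(x) = tanh(x/2), so that sigma'(x) = (1 - sigma(x)^2) / 2.
def sigma(x): return math.tanh(x / 2.0)
def sigma_inv(y): return 2.0 * math.atanh(y)

lam, mu = 0.6, 5.0 / 3.0                # 1/mu = (1 - 0.8**2) * mu holds for mu = 5/3
A = [[lam + mu, lam - mu],
     [lam - mu, lam + mu]]              # eigenvalues 2*lam along y=x, 2*mu along y=-x
b1 = sigma_inv(0.8) - 1.6 * lam         # since A (0.8, 0.8)^T = (1.6*lam, 1.6*lam)^T
B0, B1 = [0.0, 0.0], [b1, b1]

def f(B, p):
    return [sigma(A[i][0] * p[0] + A[i][1] * p[1] + B[i]) for i in range(2)]

f_minus = lambda p: f(B0, p)
f_plus = lambda p: f(B1, p)

# Fixed points: q = (0, 0) for f_-, p = (0.8, 0.8) for f_+.
q = f_minus([0.0, 0.0])
p = f_plus([0.8, 0.8])

# Jacobian of f_- at the origin is sigma'(0) * A = A/2: its eigenvalues are
# lam = 0.6 (contracting) and mu = 5/3 (expanding), so the origin is hyperbolic.
jac_minus_eigs = ((lam + mu) / 2 + (lam - mu) / 2,
                  (lam + mu) / 2 - (lam - mu) / 2)

print(q, p, round(b1, 5), jac_minus_eigs)
```

Note that b1 comes out as approximately 1.23722, matching the value of B_1 given above.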
A Characterization of Simple RNNs with Two Hidden Units 443
Fig. 1. The vector field representation of f_+ (left) and f_- (right).

Fig. 2. {f_-^n ∘ f_+^n(p) | n ≥ 1} (upper), {f_-^n ∘ f_+^{n+1}(p) | n ≥ 1} (lower left), and {f_-^{n+1} ∘ f_+^n(p) | n ≥ 1} (lower right), where p = (0.5, 0.95).

In Fig. 1, the left image shows the vector field of f_+, where an arrow starting at x ends at f_+(x), and the right image shows the vector field of f_-. In Fig. 2, the upper plot shows points corresponding to strings in {a^n b^n | n > 0}, the lower-left plot to {a^{n+1} b^n | n > 0}, and the lower-right plot to {a^n b^{n+1} | n > 0}.
The initial point was set to p = (0.5, 0.95) in Fig. 2. All plots are for n = 1 to n = 40, and as n grows the points cluster, so we can say that they stay in narrow stripes, i.e., in D_n for each n.

5 Discussion

We obtained a necessary condition for an SRN to implement a recognizer for the language {a^n b^n | n > 0} by analyzing its behavior from the viewpoint of discrete dynamical systems. The condition supposes that the D_i are disjoint, f_+(D_i) ⊆ D_{i+1}, and Q_ω is finite. It suggests that an implementation is possible, and in fact we have successfully built a recognizer for the language, thereby showing that the learning problem for the language has at least one solution.

The instability of any solution under learning is suggested (but not derived) to be due to the necessity of P_ω lying in an unstable manifold W^{u,−1}(q) for q ∈ Q_ω. Since P_ω is attractive in the above example, f_+^n(s_0) for s_0 ∈ D_0 comes exponentially close to P_ω as n grows. Under even a small fluctuation of P_ω, since f_+^n(s_0), too, is close to W^{u,−1}(q), the point f_-^n(f_+^n(s_0)), which should be in D_0, is strongly perturbed. This means that even if we are close to a solution, a small fluctuation of P_ω may easily push f_-^n(f_+^n(s_0)) out of D_0.
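This sensitivity can be illustrated numerically: a tiny perturbation of the f_+ bias (which shifts P_ω slightly) moves the return point f_-^n(f_+^n(s_0)) by a much larger amount, because f_-^n expands along the normal to W^{u,−1}. The sketch below uses the same assumed network as before (σ(x) = tanh(x/2), λ = 0.6, μ = 5/3); the perturbation size and n are illustrative choices:

```python
import math

sigma = lambda x: math.tanh(x / 2.0)
lam, mu = 0.6, 5.0 / 3.0
A = [[lam + mu, lam - mu], [lam - mu, lam + mu]]
b1 = 2.0 * math.atanh(0.8) - 1.6 * lam

def ret(s, n, B_plus):
    # f_-^n(f_+^n(s)), with bias vector B_plus for the f_+ phase and 0 for f_-.
    for _ in range(n):
        s = [sigma(A[i][0] * s[0] + A[i][1] * s[1] + B_plus[i]) for i in range(2)]
    for _ in range(n):
        s = [sigma(A[i][0] * s[0] + A[i][1] * s[1]) for i in range(2)]
    return s

s0, n, eps = [0.5, 0.95], 15, 1e-4
base = ret(s0, n, [b1, b1])
pert = ret(s0, n, [b1 + eps, b1])   # asymmetric perturbation of one bias component
diff = math.hypot(pert[0] - base[0], pert[1] - base[1])
print(diff / eps)                    # amplification of the perturbation
```

The printed ratio is well above 1: the fluctuation of the fixed-point structure is amplified by the expanding phase, which is the mechanism behind the learning instability discussed above.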
Since Rodriguez et al. [14] showed that languages that do not belong to the context-free class can be learned to some degree, we have to study the discrepancies further. The instability of grammar learning by SRNs shown above might not be seen in our natural language learning, which suggests that the SRN might not be an appropriate model of language learning.

References

1. Bodén, M., Wiles, J., Tonkes, B., Blair, A.: Learning to predict a context-free language: analysis of dynamics in recurrent hidden units. In: Artificial Neural Networks: Proc. ICANN 1999, vol. 1, pp. 359–364 (1999)
2. Bodén, M., Wiles, J.: Context-free and context-sensitive dynamics in recurrent neural networks. Connection Science 12(3/4), 197–210 (2000)
3. Bodén, M., Blair, A.: Learning the dynamics of embedded clauses. Applied Intelligence: Special issue on natural language and machine learning 19(1/2), 51–63 (2003)
4. Casey, M.: Correction to proof that recurrent neural networks can robustly recognize only regular languages. Neural Computation 10, 1067–1069 (1998)
5. Chalup, S.K., Blair, A.D.: Incremental training of first order recurrent neural networks to predict a context-sensitive language. Neural Networks 16(7), 955–972 (2003)
6. Elman, J.L.: Distributed representations, simple recurrent networks and grammatical structure. Machine Learning 7, 195–225 (1991)
7. Elman, J.L.: Language as a dynamical system. In: Mind as Motion: Explorations in the Dynamics of Cognition, pp. 195–225. MIT Press, Cambridge
8. Gers, F.A., Schmidhuber, J.: LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks 12(6), 1333–1340 (2001)
9. Guckenheimer, J., Holmes, P.: Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer, Heidelberg (Corr. 5th print, 1997)
10. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading (1979)
11. Katok, A., Hasselblatt, B.: Introduction to the Modern Theory of Dynamical Systems. Cambridge University Press, Cambridge (1996)
12. Maass, W., Orponen, P.: On the effect of analog noise in discrete-time analog computations. Neural Computation 10, 1071–1095 (1998)
13. Rodriguez, P., Wiles, J., Elman, J.L.: A recurrent neural network that learns to count. Connection Science 11, 5–40 (1999)
14. Rodriguez, P.: Simple recurrent networks learn context-free and context-sensitive languages by counting. Neural Computation 13(9), 2093–2118 (2001)
15. Schmidhuber, J., Gers, F., Eck, D.: Learning nonregular languages: a comparison of simple recurrent networks and LSTM. Neural Computation 14(9), 2039–2041 (2002)
16. Siegelmann, H.T.: Neural Networks and Analog Computation: Beyond the Turing Limit. Birkhäuser (1999)
17. Wiles, J., Blair, A.D., Bodén, M.: Representation beyond finite states: alternatives to push-down automata. In: A Field Guide to Dynamical Recurrent Networks
Unbiased Likelihood Backpropagation Learning

Masashi Sekino and Katsumi Nitta
Tokyo Institute of Technology, Japan

Abstract. Error backpropagation is one of the popular methods for training an artificial neural network. When error backpropagation is used for training an artificial neural network, overfitting occurs in the latter half of the training. This paper provides an explanation of why overfitting occurs, within the model selection framework. The explanation leads to a new method for training an artificial neural network, unbiased likelihood backpropagation learning. Several results are shown.

1 Introduction

An artificial neural network is one of the models for function approximation. It can approximate an arbitrary function when the number of basis functions is large. Error backpropagation learning [1], a well-known method for training an artificial neural network, is a gradient descent method with the squared error on the learning data as its target function. Therefore, error backpropagation learning can obtain a local optimum while monotonically decreasing the error. However, although the error on the learning data decreases monotonically, the error on test data increases in the latter half of training. This phenomenon is called overfitting.

Early stopping is one method for preventing overfitting. This method stops the training when an estimator of the generalization error no longer decreases. For example, the technique of stopping the training when the error on hold-out data no longer decreases is often applied. However, early stopping basically minimizes the error on the learning data, so there is no guarantee of obtaining the optimal parameter that minimizes the estimator of the generalization error.

When the parameters of the basis functions (the model parameters) are fixed, an artificial neural network becomes a linear regression model.
If a regularization parameter is introduced to assure the regularity of this linear regression model, the artificial neural network becomes a set of regular linear regression models. The reason an artificial neural network tends to overfit is that maximum likelihood estimation with respect to the model parameters is a model selection over regular linear regression models based on the empirical likelihood.

In this paper, we propose unbiased likelihood backpropagation learning, a gradient descent method that modifies the model parameters with the unbiased likelihood (an information criterion) as the target function. It is expected

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 446–455, 2008. © Springer-Verlag Berlin Heidelberg 2008
that the proposed method has better approximation performance because it explicitly minimizes an estimator of the generalization error.

In the following, section 2 explains statistical learning and maximum likelihood estimation, and section 3 briefly explains information criteria. Next, section 4 describes artificial neural networks, regularized maximum likelihood estimation, and why error backpropagation learning causes overfitting. The proposed method is explained in section 5. We show the effectiveness of the proposed method by applying it to the DELVE data set [4] in section 6. Finally, we conclude the paper in section 7.

2 Statistical Learning

Statistical learning aims to construct an optimal approximation p̂(x) of a true distribution q(x) from a set of hypotheses M ≡ {p(x|θ) | θ ∈ Θ}, using learning data D ≡ {x_n | n = 1, …, N} obtained from q(x). M is called the model and the approximation p̂(x) is called the estimation. When we want to state clearly that the estimation p̂(x) is constructed using the learning data D, we use the notation p̂(x|D).

The Kullback-Leibler divergence

  D(q‖p) ≡ ∫ q(x) log (q(x)/p(x)) dx    (1)

is used as the distance from q(x) to p(x). The probability density p(x) is called the likelihood, and in particular the likelihood of the estimation p̂(x) on the learning data D,

  p̂(D|D) ≡ ∏_{n=1}^{N} p̂(x_n|D),    (2)

is called the empirical likelihood. The sample mean of the log-likelihood

  (1/N) ∑_{n=1}^{N} log p(x_n)    (3)

asymptotically converges in probability to the mean log-likelihood

  E_{q(x)}[log p(x)] ≡ ∫ q(x) log p(x) dx    (4)

according to the law of large numbers, where E_{q(x)} denotes an expectation under q(x). Because the Kullback-Leibler divergence can be decomposed as

  D(q‖p) = E_{q(x)}[log q(x)] − E_{q(x)}[log p(x)],    (5)

maximization of the mean log-likelihood is equal to minimization of the Kullback-Leibler divergence. Therefore, statistical learning methods such as maximum likelihood estimation, maximum a posteriori estimation, and Bayesian estimation are based on the likelihood.
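For intuition, the decomposition (5) can be checked directly on a small discrete example, with sums in place of integrals (the two distributions below are arbitrary illustrative choices):

```python
import math

q = [0.2, 0.5, 0.3]   # "true" distribution (illustrative)
p = [0.3, 0.4, 0.3]   # model distribution (illustrative)

# Kullback-Leibler divergence, eq. (1), as a sum over a finite sample space.
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

# Decomposition (5): D(q||p) = E_q[log q] - E_q[log p].
eq_log_q = sum(qi * math.log(qi) for qi in q)                 # E_q[log q]
eq_log_p = sum(qi * math.log(pi) for qi, pi in zip(q, p))     # E_q[log p]

print(kl, eq_log_q - eq_log_p)
```

Both printed values agree, and the divergence is positive since q ≠ p, so maximizing E_q[log p] over p indeed minimizes D(q‖p).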
448 M. Sekino and K. Nitta

Maximum Likelihood Estimation

The maximum likelihood estimation p̂_ML(x) is the hypothesis p(x|θ̂_ML) given by the maximizer θ̂_ML of the likelihood p(D|θ):

  p̂_ML(x) ≡ p(x|θ̂_ML),    (6)
  θ̂_ML ≡ argmax_θ p(D|θ).    (7)

3 Information Criterion and Model Selection

3.1 Information Criterion

Because the sample mean of the log-likelihood asymptotically converges in probability to the mean log-likelihood, statistical learning methods are based on the likelihood. However, because the learning data D is a finite set in practice, the empirical likelihood p̂(D|D) contains bias. This bias b(ˆ
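As a small concrete illustration of (6), (7) and of the optimism just mentioned: for a Gaussian model with unit variance, the maximizer θ̂_ML of p(D|θ) is the sample mean, and by construction the empirical log-likelihood at θ̂_ML is at least as large as at the true parameter; that gap is the bias of the empirical likelihood. The data and the true parameter below are illustrative choices:

```python
import math

data = [0.31, -0.12, 0.45, 0.08, -0.27]   # illustrative learning data D
theta_true = 0.0                           # assumed true mean of q(x)

def log_lik(theta):
    # log p(D|theta) for the Gaussian model N(theta, 1)
    return sum(-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)
               for x in data)

theta_ml = sum(data) / len(data)   # closed-form maximizer of (7) for this model

# The empirical log-likelihood at the MLE is never below its value at the
# true parameter; this optimism is the bias of the empirical likelihood.
gap = log_lik(theta_ml) - log_lik(theta_true)
print(theta_ml, gap)
```

For this model the gap equals (N/2)·θ̂_ML², which is strictly positive whenever the sample mean differs from the true mean, so the training-data likelihood systematically overstates how good the fitted model is.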