Lecture Notes in Computer Science
q = lim_{j→∞} f_-^{h_j}(p_{k_i}) = lim_{j→∞} f_-^{h_j − k_i}(s_{k_i}), where k_i is a function of h_j with k_i > h_j. Let s_{k_i} = f_-^{k_i}(p_{k_i}) ∈ D_0 and let s_0 ∈ D_0 be an accumulation point of {s_{k_i}}. Then, since f_-^{−1} is continuous, letting n_j = −h_j + k_i > 0, q = lim_{j→∞} f_-^{−n_j}(s_0), i.e., q ∈ Q_ω.

Lemma 8. D_ω = ⋃_{q ∈ Q_ω} W^{u,−1}(q).

Proof. Let p be any point in D_ω. Since f_-(D_ω) ⊆ (I[−1, 1])^2, where I[−1, 1] is the interval [−1, 1], i.e., f_-(D_ω) is bounded, and f_-(D_ω) ⊆ D_ω, the sequence {f_-^n(p)} has an accumulation point q in D_ω, which is, by Lemma 7, in Q_ω; that is, q is expressed as q = lim_{j→∞} f_-^{n_j}(p). Since Q_ω is a finite set of hyperbolic fixed points, q = lim_{n→∞} f_-^n(p), i.e., p ∈ W^{s,f_-}(q) = W^{u,f_-^{−1}}(q) = W^{u,−1}(q).
Since P_ω ⊆ D_ω, the next theorem holds.

Theorem 9. A point in P_ω is either a point in Q_ω or in W^{u,−1}(q) for some q ∈ Q_ω.

Please note that since q ∈ W^{u,−1}(q), the theorem statement is just "If p ∈ P_ω then p ∈ W^{u,−1}(q) for some q ∈ Q_ω."
An Example of a Recognizer

To construct an SRN recognizer for {a^n b^n | n > 0}, the SRN should satisfy the conditions stated in Theorem 9 and Postulate 4, which are summarized as:

1. f_+(D_i) ⊆ D_{i+1},
2. the D_i are disjoint,
3. P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_-, respectively,
4. W^{u,f_-}_loc(q) and W^{s,f_-}_loc(q) for q ∈ Q_ω are one-dimensional, and
5. if p ∈ P_ω then p ∈ W^{u,−1}(q) for some q ∈ Q_ω.
442 A. Iwata, Y. Shinozawa, and A. Sakurai

Let us consider the simplest possible case, so the first choice is to think about a point p ∈ P_ω and a point q ∈ Q_ω, that is, f_+(p) = p and f_-(q) = q. Since p cannot be the same as q (because f_-^{−1} ∘ f_+(p) = p + w_s^{−1} · w_x · (x_+ − x_-) ≠ p), we have to find a way to let p ∈ W^{u,−1}(q).
Since it is in general very hard to calculate stable or unstable manifolds from a function and its fixed point, we had better try to make W^{u,−1}(q) a "simple" manifold. There is one more reason to do so: we have to define D_0 = {x | h(x) ≥ 0}, but if W^{u,−1}(q) is not simple, a suitable h may not exist. We decided that W^{u,−1}(q) should be a line (if possible). Considering the function form f(s) = σ(w_s · s + w_x · x_-), it is not difficult to see that the line could be one of the axes or one of the bisectors of the right angles at the origin (i.e., one of the lines y = x and y = −x). We chose the bisector in the first (and third) quadrant, i.e., the line y = x. Accordingly, q was chosen to be the origin and p was chosen arbitrarily as (0.8, 0.8).

Item 4 is satisfied by setting one of the two eigenvalues of Df_- at the origin to be greater than one and the other smaller than one. We chose 1/0.6 for one and 1/μ for the other, where μ is to be set so that Items 1 and 2 are satisfied, by considering the eigenvalues of Df_+ at p.

The design consideration that we have skipped is how to design D_0 = {x | h(x) ≥ 0}. A simple way is to make the boundary h(x) = 0 parallel to W^{u,−1}(q) for our intended q ∈ Q_ω. If we do so, by setting the largest eigenvalue of Df_- at q equal to the inverse of the eigenvalue of Df_+ at p along the normal to W^{u,−1}, we can make the points s ∈ D_0, f_- ∘ f_+(s), f_-^2 ∘ f_+^2(s), …, f_-^i ∘ f_+^i(s), …, which correspond to strings in {a^n b^n | n > 0}, reside at approximately equal distances from W^{u,−1}. Needless to say, the points corresponding to, say, {a^{n+1} b^n | n > 0} also lie at approximately equal distances from W^{u,−1} among themselves, and this distance is different from that for {a^n b^n | n > 0}.
Let f_-(x) = σ(Ax + B_0) and f_+(x) = σ(Ax + B_1). We plan to put Q_ω = {(0, 0)}, P_ω = {(0.8, 0.8)}, and W^{u,−1} = {(x, y) | y = x}. The eigenvalues of the tangent map of f_-^{−1} at (0, 0) are 1/λ = 1/0.6 and 1/μ (where the eigenvector on y = x is expanding), and the eigenvalues of the tangent map of f_+ at (0.8, 0.8) are 1/μ and any value. Then, considering the derivatives at (0, 0) and (0.8, 0.8), it is easy to see that

  (1/2) A = ρ(π/4) [[λ, 0], [0, μ]] ρ(−π/4),   1/μ = (1 − 0.8^2) μ,

where ρ(θ) is a rotation by θ. Then

  A = [[λ + μ, λ − μ], [λ − μ, λ + μ]].

Next, from σ(B_0) = (0, 0)^T and σ((1.6λ, 1.6λ)^T + B_1) = (0.8, 0.8)^T (note that A(0.8, 0.8)^T = (1.6λ, 1.6λ)^T, since (1, 1)^T is an eigenvector of A with eigenvalue 2λ),

  B_0 = (0, 0)^T,   B_1 = (σ^{−1}(0.8) − 1.6λ, σ^{−1}(0.8) − 1.6λ)^T.

These give us μ = 5/3, λ = 0.6, and B_1 ≈ (1.23722, 1.23722)^T.
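The construction above can be checked numerically. The sketch below assumes σ(x) = tanh(x/2) (this specific σ is our assumption, not stated in this section; it is the choice consistent with the factor 1/2 in front of A and with the derivative factor (1 − 0.8^2) at p). It builds A, B_0, B_1 from λ = 0.6 and μ = 5/3 and verifies the designed fixed points and eigenvalues.

```python
import math

# Assumption: sigma(x) = tanh(x/2), so that sigma'(x) = (1 - sigma(x)^2) / 2.
def sigma(x): return math.tanh(x / 2.0)
def sigma_inv(y): return 2.0 * math.atanh(y)

lam, mu = 0.6, 5.0 / 3.0                # 1/mu = (1 - 0.8**2) * mu holds for mu = 5/3
A = [[lam + mu, lam - mu],
     [lam - mu, lam + mu]]              # eigenvalues 2*lam along y=x, 2*mu along y=-x
b1 = sigma_inv(0.8) - 1.6 * lam         # since A (0.8, 0.8)^T = (1.6*lam, 1.6*lam)^T
B0, B1 = [0.0, 0.0], [b1, b1]

def f(B, p):
    return [sigma(A[i][0] * p[0] + A[i][1] * p[1] + B[i]) for i in range(2)]

f_minus = lambda p: f(B0, p)
f_plus = lambda p: f(B1, p)

# Fixed points: q = (0, 0) for f_-, p = (0.8, 0.8) for f_+.
q = f_minus([0.0, 0.0])
p = f_plus([0.8, 0.8])

# Jacobian of f_- at the origin is sigma'(0) * A = A/2: its eigenvalues are
# lam = 0.6 (contracting) and mu = 5/3 (expanding), so the origin is hyperbolic.
jac_minus_eigs = ((lam + mu) / 2 + (lam - mu) / 2,
                  (lam + mu) / 2 - (lam - mu) / 2)

print(q, p, round(b1, 5), jac_minus_eigs)
```

Note that b1 comes out as approximately 1.23722, matching the value of B_1 given above.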
A Characterization of Simple RNNs with Two Hidden Units 443
Fig. 1. The vector field representation of f_+ (left) and f_- (right).

Fig. 2. {f_-^n ∘ f_+^n(p) | n ≥ 1} (upper), {f_-^n ∘ f_+^{n+1}(p) | n ≥ 1} (lower left), and {f_-^{n+1} ∘ f_+^n(p) | n ≥ 1} (lower right), where p = (0.5, 0.95).

In Fig. 1, the left image shows the vector field of f_+, where an arrow starting at x ends at f_+(x), and the right image shows the vector field of f_-. In Fig. 2, the upper plot shows points corresponding to strings in {a^n b^n | n > 0}, the lower-left plot to {a^{n+1} b^n | n > 0}, and the lower-right plot to {a^n b^{n+1} | n > 0}.
The initial point was set to p = (0.5, 0.95) in Fig. 2. All plots are for n = 1 to n = 40, and as n grows the points cluster, so we can say that they stay in narrow stripes, i.e., in D_n for each n.

5 Discussion

We obtained a necessary condition for an SRN to implement a recognizer for the language {a^n b^n | n > 0} by analyzing its behavior from the viewpoint of discrete dynamical systems. The condition supposes that the D_i are disjoint, f_+(D_i) ⊆ D_{i+1}, and Q_ω is finite. It suggests that an implementation is possible, and in fact we have successfully built a recognizer for the language, thereby showing that the learning problem for the language has at least one solution.

The instability of any solution under learning is suggested (but not derived) to be due to the necessity of P_ω lying in an unstable manifold W^{u,−1}(q) for q ∈ Q_ω. Since P_ω is attractive in the above example, f_+^n(s_0) for s_0 ∈ D_0 comes exponentially close to P_ω as n grows. Under even a small fluctuation of P_ω, since f_+^n(s_0), too, is close to W^{u,−1}(q), the point f_-^n(f_+^n(s_0)), which should be in D_0, is strongly perturbed. This means that even if we are close to a solution, a small fluctuation of P_ω may easily push f_-^n(f_+^n(s_0)) out of D_0.
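This sensitivity can be illustrated numerically: a tiny perturbation of the f_+ bias (which shifts P_ω slightly) moves the return point f_-^n(f_+^n(s_0)) by a much larger amount, because f_-^n expands along the normal to W^{u,−1}. The sketch below uses the same assumed network as before (σ(x) = tanh(x/2), λ = 0.6, μ = 5/3); the perturbation size and n are illustrative choices:

```python
import math

sigma = lambda x: math.tanh(x / 2.0)
lam, mu = 0.6, 5.0 / 3.0
A = [[lam + mu, lam - mu], [lam - mu, lam + mu]]
b1 = 2.0 * math.atanh(0.8) - 1.6 * lam

def ret(s, n, B_plus):
    # f_-^n(f_+^n(s)), with bias vector B_plus for the f_+ phase and 0 for f_-.
    for _ in range(n):
        s = [sigma(A[i][0] * s[0] + A[i][1] * s[1] + B_plus[i]) for i in range(2)]
    for _ in range(n):
        s = [sigma(A[i][0] * s[0] + A[i][1] * s[1]) for i in range(2)]
    return s

s0, n, eps = [0.5, 0.95], 15, 1e-4
base = ret(s0, n, [b1, b1])
pert = ret(s0, n, [b1 + eps, b1])   # asymmetric perturbation of one bias component
diff = math.hypot(pert[0] - base[0], pert[1] - base[1])
print(diff / eps)                    # amplification of the perturbation
```

The printed ratio is well above 1: the fluctuation of the fixed-point structure is amplified by the expanding phase, which is the mechanism behind the learning instability discussed above.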
Since Rodriguez et al. [14] showed that languages that do not belong to the context-free class can be learned to some degree, we have to study the discrepancies further. The instability of grammar learning by SRNs shown above might not be seen in our natural language learning, which suggests that the SRN might not be an appropriate model of language learning.

References

1. Bodén, M., Wiles, J., Tonkes, B., Blair, A.: Learning to predict a context-free language: analysis of dynamics in recurrent hidden units. In: Artificial Neural Networks: Proc. ICANN 1999, vol. 1, pp. 359–364 (1999)
2. Bodén, M., Wiles, J.: Context-free and context-sensitive dynamics in recurrent neural networks. Connection Science 12(3/4), 197–210 (2000)
3. Bodén, M., Blair, A.: Learning the dynamics of embedded clauses. Applied Intelligence: Special issue on natural language and machine learning 19(1/2), 51–63 (2003)
4. Casey, M.: Correction to proof that recurrent neural networks can robustly recognize only regular languages. Neural Computation 10, 1067–1069 (1998)
5. Chalup, S.K., Blair, A.D.: Incremental training of first order recurrent neural networks to predict a context-sensitive language. Neural Networks 16(7), 955–972 (2003)
6. Elman, J.L.: Distributed representations, simple recurrent networks and grammatical structure. Machine Learning 7, 195–225 (1991)
7. Elman, J.L.: Language as a dynamical system. In: Mind as Motion: Explorations in the Dynamics of Cognition, pp. 195–225. MIT Press, Cambridge
8. Gers, F.A., Schmidhuber, J.: LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks 12(6), 1333–1340 (2001)
9. Guckenheimer, J., Holmes, P.: Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer, Heidelberg (Corr. 5th print, 1997)
10. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading (1979)
11. Katok, A., Hasselblatt, B.: Introduction to the Modern Theory of Dynamical Systems. Cambridge University Press, Cambridge (1996)
12. Maass, W., Orponen, P.: On the effect of analog noise in discrete-time analog computations. Neural Computation 10, 1071–1095 (1998)
13. Rodriguez, P., Wiles, J., Elman, J.L.: A recurrent neural network that learns to count. Connection Science 11, 5–40 (1999)
14. Rodriguez, P.: Simple recurrent networks learn context-free and context-sensitive languages by counting. Neural Computation 13(9), 2093–2118 (2001)
15. Schmidhuber, J., Gers, F., Eck, D.: Learning nonregular languages: a comparison of simple recurrent networks and LSTM. Neural Computation 14(9), 2039–2041 (2002)
16. Siegelmann, H.T.: Neural Networks and Analog Computation: Beyond the Turing Limit. Birkhäuser (1999)
17. Wiles, J., Blair, A.D., Bodén, M.: Representation beyond finite states: alternatives to push-down automata. In: A Field Guide to Dynamical Recurrent Networks
Unbiased Likelihood Backpropagation Learning

Masashi Sekino and Katsumi Nitta
Tokyo Institute of Technology, Japan

Abstract. Error backpropagation is one of the popular methods for training an artificial neural network. When error backpropagation is used for training an artificial neural network, overfitting occurs in the latter half of the training. This paper provides an explanation of why overfitting occurs, within the model selection framework. The explanation leads to a new method for training an artificial neural network, unbiased likelihood backpropagation learning. Several results are shown.

1 Introduction

An artificial neural network is one of the models for function approximation. It can approximate an arbitrary function when the number of basis functions is large. Error backpropagation learning [1], a well-known method for training an artificial neural network, is a gradient descent method with the squared error on the learning data as its target function. Therefore, error backpropagation learning can obtain a local optimum while monotonically decreasing the error. However, although the error on the learning data decreases monotonically, the error on test data increases in the latter half of training. This phenomenon is called overfitting.

Early stopping is one method for preventing overfitting. This method stops the training when an estimator of the generalization error no longer decreases. For example, the technique of stopping the training when the error on hold-out data no longer decreases is often applied. However, early stopping basically minimizes the error on the learning data, so there is no guarantee of obtaining the optimal parameter that minimizes the estimator of the generalization error.

When the parameters of the basis functions (the model parameters) are fixed, an artificial neural network becomes a linear regression model.
If a regularization parameter is introduced to assure the regularity of this linear regression model, the artificial neural network becomes a set of regular linear regression models. The reason an artificial neural network tends to overfit is that maximum likelihood estimation with respect to the model parameters is a model selection over regular linear regression models based on the empirical likelihood.

In this paper, we propose unbiased likelihood backpropagation learning, a gradient descent method that modifies the model parameters with the unbiased likelihood (an information criterion) as the target function. It is expected

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 446–455, 2008. © Springer-Verlag Berlin Heidelberg 2008
that the proposed method has better approximation performance because it explicitly minimizes an estimator of the generalization error.

In the following, section 2 explains statistical learning and maximum likelihood estimation, and section 3 briefly explains information criteria. Next, section 4 describes artificial neural networks, regularized maximum likelihood estimation, and why error backpropagation learning causes overfitting. The proposed method is explained in section 5. We show the effectiveness of the proposed method by applying it to the DELVE data set [4] in section 6. Finally, we conclude the paper in section 7.

2 Statistical Learning

Statistical learning aims to construct an optimal approximation p̂(x) of a true distribution q(x) from a set of hypotheses M ≡ {p(x|θ) | θ ∈ Θ}, using learning data D ≡ {x_n | n = 1, …, N} obtained from q(x). M is called the model and the approximation p̂(x) is called the estimation. When we want to state clearly that the estimation p̂(x) is constructed using the learning data D, we use the notation p̂(x|D).

The Kullback-Leibler divergence

  D(q‖p) ≡ ∫ q(x) log (q(x)/p(x)) dx    (1)

is used as the distance from q(x) to p(x). The probability density p(x) is called the likelihood, and in particular the likelihood of the estimation p̂(x) on the learning data D,

  p̂(D|D) ≡ ∏_{n=1}^{N} p̂(x_n|D),    (2)

is called the empirical likelihood. The sample mean of the log-likelihood

  (1/N) ∑_{n=1}^{N} log p(x_n)    (3)

asymptotically converges in probability to the mean log-likelihood

  E_{q(x)}[log p(x)] ≡ ∫ q(x) log p(x) dx    (4)

according to the law of large numbers, where E_{q(x)} denotes an expectation under q(x). Because the Kullback-Leibler divergence can be decomposed as

  D(q‖p) = E_{q(x)}[log q(x)] − E_{q(x)}[log p(x)],    (5)

maximization of the mean log-likelihood is equal to minimization of the Kullback-Leibler divergence. Therefore, statistical learning methods such as maximum likelihood estimation, maximum a posteriori estimation, and Bayesian estimation are based on the likelihood.
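For intuition, the decomposition (5) can be checked directly on a small discrete example, with sums in place of integrals (the two distributions below are arbitrary illustrative choices):

```python
import math

q = [0.2, 0.5, 0.3]   # "true" distribution (illustrative)
p = [0.3, 0.4, 0.3]   # model distribution (illustrative)

# Kullback-Leibler divergence, eq. (1), as a sum over a finite sample space.
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

# Decomposition (5): D(q||p) = E_q[log q] - E_q[log p].
eq_log_q = sum(qi * math.log(qi) for qi in q)                 # E_q[log q]
eq_log_p = sum(qi * math.log(pi) for qi, pi in zip(q, p))     # E_q[log p]

print(kl, eq_log_q - eq_log_p)
```

Both printed values agree, and the divergence is positive since q ≠ p, so maximizing E_q[log p] over p indeed minimizes D(q‖p).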
448 M. Sekino and K. Nitta

Maximum Likelihood Estimation

The maximum likelihood estimation p̂_ML(x) is the hypothesis p(x|θ̂_ML) given by the maximizer θ̂_ML of the likelihood p(D|θ):

  p̂_ML(x) ≡ p(x|θ̂_ML),    (6)
  θ̂_ML ≡ argmax_θ p(D|θ).    (7)

3 Information Criterion and Model Selection

3.1 Information Criterion

Because the sample mean of the log-likelihood asymptotically converges in probability to the mean log-likelihood, statistical learning methods are based on the likelihood. However, because the learning data D is a finite set in practice, the empirical likelihood p̂(D|D) contains bias. This bias b(ˆ
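As a small concrete illustration of (6), (7) and of the optimism just mentioned: for a Gaussian model with unit variance, the maximizer θ̂_ML of p(D|θ) is the sample mean, and by construction the empirical log-likelihood at θ̂_ML is at least as large as at the true parameter; that gap is the bias of the empirical likelihood. The data and the true parameter below are illustrative choices:

```python
import math

data = [0.31, -0.12, 0.45, 0.08, -0.27]   # illustrative learning data D
theta_true = 0.0                           # assumed true mean of q(x)

def log_lik(theta):
    # log p(D|theta) for the Gaussian model N(theta, 1)
    return sum(-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)
               for x in data)

theta_ml = sum(data) / len(data)   # closed-form maximizer of (7) for this model

# The empirical log-likelihood at the MLE is never below its value at the
# true parameter; this optimism is the bias of the empirical likelihood.
gap = log_lik(theta_ml) - log_lik(theta_true)
print(theta_ml, gap)
```

For this model the gap equals (N/2)·θ̂_ML², which is strictly positive whenever the sample mean differs from the true mean, so the training-data likelihood systematically overstates how good the fitted model is.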