Lecture Notes in Computer Science
S → b. The rule S → Sa means that "the letter S should be rewritten to Sa," and S → b means that "the letter S should be rewritten to b." If more than one rule is applicable, we have to try all the possibilities. A string generated this way to which no further rewriting rule applies is called a sentence. The set of all possible sentences is called the language generated by the grammar. Although the word "language" here would be better termed formal language, in contrast to natural language, we follow the custom of the formal language theory field (e.g. [10]). In everyday usage a sentence is a sequence of words, whereas in the above example it is a string of characters; nevertheless the essence is common. The idea that a language is defined by a grammar in this way has been a major paradigm in a wide range of formal language studies and related fields.

A study of grammar learning focuses on restoring a grammar from finite samples of sentences of the language associated with that grammar. In contrast to the ease of generating sample sentences, learning a grammar from sample sentences is a hard problem; in fact it is impossible except in some very restrictive cases, e.g. when the language is finite. As is well known, humans do learn languages whose grammars are very complicated. The difference between the two situations might be attributed to the possible existence of some unknown restrictions on the types of natural language grammars in our brain.

A Characterization of Simple RNNs with Two Hidden Units
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 436–445, 2008. © Springer-Verlag Berlin Heidelberg 2008

Since a neural network is a model of the brain, some researchers think that a neural network could learn a grammar from exemplar sentences of a language. Research on grammar learning by neural networks is characterized by: 1. a focus on learning self-embedding rules, since simpler rules that can be expressed by finite state automata are understood to be learnable; 2. the use of simple recurrent neural networks (SRNs) as the basic mechanism; and 3. the adoption of target languages such as {a^n b^n | n > 0} and {a^n b^n c^n | n > 0}, which are clearly the results of, though not representative of, self-embedding rules. Here a^n denotes the n-times repetition of a character a; the language {a^n b^n | n > 0} is generated by the grammar {S → ab, S → aSb}, and {a^n b^n c^n | n > 0} is generated by a context-sensitive grammar.

The adoption of simple languages such as {a^n b^n | n > 0} and {a^n b^n c^n | n > 0} as target languages is inevitable, since it is not easy to see whether enough generalization has been achieved by learning when realistic grammars are used. Although these languages are simple, many times what the networks learned was just what they were taught, not a grammar; i.e., they did almost rote learning. In other words, their generalization capability in learning was limited. Their learned results were also unstable, in the sense that when the networks were given new training sentences that were longer than the ones they had learned but still in the same language, the learned network changed unexpectedly; the change was more than a mere refinement of the learned network. Considering these situations, we may doubt that there really exists a solution, or a correct learned network. Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), Chalup et al. ([5]), and others tried to clarify the reasons and to explore the possibility of network learning of these languages, but their results were not conclusive and did not give clear conditions that the learned networks satisfy in common.

In this paper we describe a condition that SRNs with two hidden units that have learned to recognize the language {a^n b^n | n > 0} have in common, that is, a condition that has to be met for an SRN to qualify as having successfully learned the language. The condition implies, incidentally, the instability of learning results. Moreover, by utilizing the condition, we realize a language recognizer. In doing so we found that learning recognizers of languages more complicated than {a^n b^n | n > 0} is hard.
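To make the rewriting process concrete, here is a small Python sketch (ours, not from the paper) that exhaustively applies the rules of the grammar {S → ab, S → aSb}, collecting the strings to which no rule applies any further, i.e. the sentences:

```python
# Sketch (not from the paper): exhaustively apply the rewriting rules
# S -> ab and S -> aSb, collecting strings with no 'S' left (sentences).
def sentences(max_len):
    found = set()
    frontier = ["S"]
    while frontier:
        form = frontier.pop()
        if "S" not in form:
            found.add(form)  # no applicable rule: a sentence
        elif len(form) <= max_len:
            # try every applicable rule at the (only) occurrence of 'S'
            frontier.append(form.replace("S", "ab", 1))
            frontier.append(form.replace("S", "aSb", 1))
    return sorted(found, key=len)

# The generated sentences are exactly a^n b^n for n = 1, 2, 3, ...
print(sentences(6))  # ['ab', 'aabb', 'aaabbb']
```

Every derivation applies S → aSb some number of times and then S → ab once, which is precisely the self-embedding behavior discussed above.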
An RNN (recurrent neural network) is a type of network that has recurrent connections in addition to feedforward connections. The calculation is done at once for the feedforward part, and after a one-time-unit delay the recurrent connections give rise to additional inputs (i.e., additional to the external inputs) to the feedforward part. An RNN is thus a kind of discrete-time system. Starting from an initial state (the initial outputs of the feedforward part, i.e. its outputs without external inputs), the network accepts the successive characters of a string given to the external inputs, reaches a final state, and produces the final external output from that state. An SRN (simple recurrent network) is a simple type of RNN that has only one layer of hidden units in its feedforward part.

A. Iwata, Y. Shinozawa, and A. Sakurai

Rodriguez et al. [13] showed that an SRN learns the languages {a^n b^n | n > 0} (or, more correctly, its subset {a^i b^i | n ≥ i > 0}) and {a^n b^n c^n | n > 0} (or, more correctly, {a^i b^i c^i | n ≥ i > 0}). For {a^n b^n | n > 0}, they found an SRN that generalized to n ≤ 16 after being trained for n ≤ 11. They analyzed how the languages are processed by the SRN, but the results were not conclusive, so it is still an open problem whether it is really possible to realize recognizers of languages such as {a^n b^n | n > 0} or {a^n b^n c^n | n > 0} and whether an SRN could learn or implement more complicated languages.

Siegelmann [16] showed that an RNN has computational ability superior to a Turing machine. But the difficulty of learning {a^n b^n | n > 0} by an SRN suggests that at least the SRN might not be able to realize a Turing machine. The difference might be attributed to the difference in the output functions of the neurons: a piecewise linear function in Siegelmann's case and a sigmoidal function in standard SRN cases, since we cannot find other substantial differences. On the other hand, Casey [4] and Maass [12] showed that in a noisy environment an RNN is equivalent to a finite automaton or something less powerful. These results suggest that exact (i.e., infinite-precision) computation is to be considered when we investigate the possibility of computations by RNNs, or specifically SRNs. Therefore, in the research we conducted and report in this paper, we adopted RNN models with infinite-precision calculation and with a sigmoidal function (tanh(x)) as the output function of the neurons.

From the viewpoints above, we discuss two things in this paper: a necessary condition for an SRN with two hidden units to be a recognizer of the language {a^n
b^n | n > 0}, and the fact that the condition is sufficient enough to guide us in building an SRN recognizing the language.

2 Preliminaries

An SRN (simple recurrent network) is a simple type of recurrent neural network whose function is expressed by

  s_{n+1} = σ(w_s · s_n + w_x · x_n),   N_n(s_n) = w_os · s_n + w_oc,

where σ is a standard sigmoid function (tanh(x) = (1 − exp(−2x))/(1 + exp(−2x))) applied component-wise.

A counter is a device that keeps an integer, allows +1 and −1 operations, and answers yes or no to the inquiry whether its content is 0 (a 0-test). A stack is a device that allows the operations push i (store i) and pop (recall the last-stored content, discard it, and restore the device as if it were just before the corresponding push operation). Clearly a stack is more general than a counter, so if a counter is not implementable, neither is a stack. Elman used the SRN as a predictor, but Rodriguez et al. [13] and others consider it a counter. In this paper we mainly take the latter viewpoint.

We will first explain how a predictor is used as a recognizer or a counter. When we try to train an SRN to recognize a language, we train it to predict correctly the next character to come in an input string (see e.g. [6]). As usual we adopted the "one-hot vector" representation for the network output. A one-hot vector is a vector in which a single element is one and the others are zero.
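For concreteness, the SRN defined above can be sketched in a few lines of Python. The weight values below are arbitrary random placeholders of our own choosing (not trained values from the paper), and the inputs use the one-hot coding just described:

```python
import numpy as np

rng = np.random.default_rng(0)
W_s = rng.normal(size=(2, 2))   # recurrent weights (two hidden units)
W_x = rng.normal(size=(2, 2))   # input weights (two characters, one-hot)
w_os = rng.normal(size=2)       # readout weights
w_oc = 0.0                      # readout bias

ONE_HOT = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}

def run(string, s0=np.zeros(2)):
    # s_{n+1} = tanh(W_s s_n + W_x x_n); readout h(s) = w_os . s + w_oc
    s = s0
    for c in string:
        s = np.tanh(W_s @ s + W_x @ ONE_HOT[c])
    return s, w_os @ s + w_oc

final_state, readout = run("aabb")
```

Note that because of the tanh output function, the hidden state always stays inside the open square (−1, 1)^2, a fact used repeatedly below.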
The network is trained so that the sum of squared errors (differences between actual and desired outputs) is minimized. With this method, if the training data contains, for the same input, two possible outputs with the same occurrence frequency, the network will learn to output 0.5 at the two corresponding elements of the output vector, with the others being 0, since that output vector gives the minimum of the sum of squared errors. It is easily seen that if a network correctly predicts the next character to come for a string in the language {a^n b^n | n > 0}, we can regard it as a counter with limited function. Let us add a new network output whose value is positive if the original network predicts only a to come (which happens only when the input string up to that time was a^n b^n for some n, which is common practice since Elman's work) and is negative otherwise. The modified network seems to count up for a character a and count down for b, since it outputs a positive value when the numbers of a's and b's coincide and a negative value otherwise. Its counting capability, though, is limited, since it could output any value if a is fed before the due number of b's has been fed, that is, if a count-up action is required before the counter has returned to the 0-state.

A (discrete-time) dynamical system is represented as the iteration of a function application: s_{i+1}
= f(s_i), i ∈ N, s_i ∈ R^n. A point s is called a fixed point of f if f(s) = s. A point s is an attracting fixed point of f if s is a fixed point and there exists a neighborhood U_s around s such that lim_{i→∞} f^i(x) = s for all x ∈ U_s. A point s is a repelling fixed point of f if s is an attracting fixed point of f^{−1}. A point s is called a periodic point of f if f^n(s) = s for some n. A point s is an ω-limit point of x for f if lim_{i→∞} f^{n_i}(x) = s for some sequence n_i with lim_{i→∞} n_i = ∞. A fixed point x of f is hyperbolic if all of the eigenvalues of Df at x have absolute values different from one, where Df = [∂f_i/∂x_j] is the Jacobian matrix of first partial derivatives of the function f. A set D is invariant under f if for any s ∈ D, f(s) ∈ D.

The following theorem plays an important role in the current paper.

Theorem 1 (Stable Manifold Theorem for a Fixed Point [9]). Let f : R^n → R^n be a C^r (r ≥ 1) diffeomorphism with a hyperbolic fixed point x. Then there are local stable and unstable manifolds W^{s,f}_loc(x), W^{u,f}_loc(x), tangent to the eigenspaces E^{s,f}_x, E^{u,f}_x of Df at x and of corresponding dimension. W^{s,f}_loc(x) and W^{u,f}_loc(x) are as smooth as the map f, i.e. of class C^r.

The local stable and unstable manifolds for f are defined as follows:

  W^{s,f}_loc(q) = {y ∈ U_q | lim_{m→∞} dist(f^m(y), q) = 0}
  W^{u,f}_loc(q) = {y ∈ U_q | lim_{m→∞} dist(f^{−m}(y), q) = 0}

where U_q is a neighborhood of q and dist is a distance function. The global stable and unstable manifolds for f are then defined as:

  W^{s,f}(q) = ∪_{i≥0} f^{−i}(W^{s,f}_loc(q))  and  W^{u,f}(q) =
∪_{i≥0} f^i(W^{u,f}_loc(q)).

As defined, an SRN is a pair of a discrete-time dynamical system s_{n+1} = σ(w_s · s_n + w_x · x_n) and an external output part N_n(s_n) = w_os · s_n + w_oc. We simply write the former (the dynamical-system part) as s_{n+1} = f(s_n, x_n) and the external output part as h(s_n). When an RNN (or SRN) is used as a recognizer of the language {a^n b^n | n > 0},
as described in the Introduction, it is seen as a counter where the input character a stands for a count-up operation (i.e., +1) and b for a count-down operation (i.e., −1). In the following we may use x_+ for a and x_− for b. For abbreviation, in the following, we use f_+ = f(·, x_+) and f_− = f(·, x_−). Please note that f_−^{−1} is undefined for points outside and on the border of the square (I[−1, 1])^2, where I[−1, 1] is the closed interval [−1, 1]. In the following, though, we do not mention this for simplicity.

Let D_0 = {s | h(s) ≥ 0}, that is, the region where the counter value is 0. Let D_i = f_−^{−i}(D_0), that is, the region where the counter value is i. We postulate that f_+(D_i) ⊆ D_{i+1}. This means that any point in D_i is eligible as a state where the counter content is i. This may seem rather demanding. An alternative would be that a point p is for counter content c if and only if p = f_−^{m_1} ∘ f_+^{p_1} ∘ … ∘ f_−^{m_i} ∘ f_+^{p_i}(s_0) for a predefined s_0, with m_j ≥ 0 and p_j ≥ 0 for 1 ≤ j ≤ i, and i ≥ 0 such that Σ_{j=1}^{i} (p_j − m_j) = c. This, unfortunately, has not led to fruitful results.

We also postulate that the D_i's are disjoint. Since we defined the D_i's as closed sets, the postulate is natural. The point therefore lies in how we have chosen the D_i. The postulate requires that we keep a margin between D_0, D_1, and the others.
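The counting scheme behind these regions can be illustrated by a deliberately simplified one-dimensional caricature (ours, not the paper's SRN): f_+ halves the state, f_− doubles it, and D_0 = [0.75, 1.0] is the region where the counter reads 0. Then D_i = f_−^{−i}(D_0) = [0.75/2^i, 1.0/2^i], f_+(D_i) ⊆ D_{i+1}, and the D_i are disjoint with a margin between them, as the postulate requires:

```python
# A 1-D caricature (not the paper's SRN): f_plus counts up by contracting
# toward 0, f_minus counts down by expanding, D_0 = [0.75, 1.0].
f_plus = lambda s: s / 2.0
f_minus = lambda s: 2.0 * s

def counter_reads_zero(string, s0=1.0):
    s = s0
    for c in string:
        s = f_plus(s) if c == "a" else f_minus(s)
    return 0.75 <= s <= 1.0   # state back in D_0: counter content is 0

assert counter_reads_zero("aaabbb")      # counts balance
assert not counter_reads_zero("aaabb")   # one b short
```

As discussed in Section 2, such a counter is limited: it checks only that the counts balance (it would, for instance, also accept abab), which is exactly the "counter with limited function" behavior of the predictor-based recognizer.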
3 A Necessary Condition

We consider only SRNs with two hidden units, i.e., all the vectors concerning s, such as w_s, s_n, and w_os, are two-dimensional.

Definition 2. D_ω is the set of the accumulation points of ∪_{i≥0} D_i, i.e. s ∈ D_ω iff s = lim_{i→∞} s_{k_i} for some s_{k_i} ∈ D_{k_i}.

Definition 3. P_ω is the set of ω-limit points of points in D_0 for f_+, i.e. s ∈ P_ω iff s = lim_{i→∞} f_+^{k_i}(s_0) for some k_i and s_0 ∈ D_0. Q_ω is the set of ω-limit points of points in D_0 for f_−^{−1}, i.e. s ∈ Q_ω iff s = lim_{i→∞} f_−^{−k_i}(s_0) for some k_i and s_0 ∈ D_0.

Considering the results obtained in Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), and Chalup et al. ([5]), it is natural, at least as a first consideration, to postulate that f_+^i(x) and f_−^i(x) are not wandering, so that they converge to periodic points. Therefore P_ω and Q_ω are postulated to be finite sets of hyperbolic periodic points for f_+ and f_−, respectively. In the following, though, for simplicity of presentation, we postulate that P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively. Moreover, according to the same literature, the points in Q_ω are saddle points for f_−, so we further postulate that W^{u,f_−}_loc(q) and W^{s,f_−}_loc(q) for q ∈ Q_ω are one-dimensional, where their existence is guaranteed by Theorem 1.
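To make the hyperbolicity postulate concrete, the following numerical sketch (with illustrative weights of our own choosing, not a trained network) verifies that a tanh map of the SRN's form can have a hyperbolic saddle fixed point, i.e., one Jacobian eigenvalue inside and one outside the unit circle, giving one-dimensional stable and unstable manifolds as in Theorem 1:

```python
import numpy as np

# Illustrative map f(s) = tanh(W s); the origin is a fixed point since tanh(0) = 0.
W = np.array([[0.5, 0.0],
              [0.0, 2.0]])
f = lambda s: np.tanh(W @ s)

s_fix = np.zeros(2)
assert np.allclose(f(s_fix), s_fix)          # the origin is a fixed point

# Jacobian of tanh(W s) at s is diag(1 - tanh(W s)^2) @ W
J = np.diag(1.0 - np.tanh(W @ s_fix) ** 2) @ W
eigvals = np.linalg.eigvals(J)               # here: 0.5 and 2.0
# One eigenvalue has modulus < 1 and one > 1, none equal to 1: the origin
# is a hyperbolic saddle, so W^s_loc and W^u_loc are both one-dimensional.
```

A trained SRN would of course have different weights; the sketch only illustrates the kind of fixed point the postulate demands of the points in Q_ω.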
Postulate 4. We postulate that f_+(D_i) ⊆ D_{i+1}, that the D_i's are disjoint, that P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively, and that W^{u,f_−}_loc(q) and W^{s,f_−}_loc(q) for q ∈ Q_ω are one-dimensional.

Lemma 5. f_−^{−1} ∘ f_+(D_ω) = D_ω; f_−^{−1}(D_ω ∩ (I(−1, 1))^2) = D_ω and D_ω ∩ (I(−1, 1))^2 = f_−(D_ω), where I(−1, 1) is the open interval (−1, 1); and f_+(D_ω) ⊆ D_ω. Moreover P_ω ⊆ D_ω and Q_ω ⊆ D_ω.

Definition 6. W^{u,−1}(q) is the global unstable manifold at q ∈ Q_ω for f_−^{−1}, i.e., W^{u,−1}(q) = W^{u,f_−^{−1}}(q) = W^{s,f_−}(q).

Lemma 7. For any p ∈ D_ω, any accumulation point of {f_−^i(p) | i > 0} is in Q_ω.

Proof. Since p is in D_ω, there exist p_{k_i} ∈ D_{k_i} such that p = lim_{i→∞} p_{k_i}. Suppose q is in D_ω with q = lim_{j→∞} f_−^{h_j}(p). We take k_i large enough for any h_j so that p_{k_i} exists in any neighborhood of q in which f_−^{h_j}(p) lies. Then