Lecture Notes in Computer Science
S → b. The rule S → Sa means that "the letter S should be rewritten to Sa," and S → b means that "the letter S should be rewritten to b." If more than one rule is applicable, we have to try all the possibilities. A string generated this way to which no further rewriting rule applies is called a sentence. The set of all possible sentences is called the language generated by the grammar. Although the word "language" here would be better termed formal language, in contrast to natural language, we follow the custom of the formal language theory field (e.g. [10]). In everyday usage a sentence is a sequence of words, whereas in the above example it is a string of characters; nevertheless the essence is common. The idea that a language is defined by a grammar in this way has been a major paradigm in a wide range of formal language studies and related fields.

A study of grammar learning focuses on restoring a grammar from finite samples of sentences of the language associated with that grammar. In contrast to the ease of generating sample sentences, learning a grammar from sample sentences is a hard problem; in fact it is impossible except in some very restrictive cases, e.g. when the language is finite. As is well known, humans do learn languages whose grammars are very complicated. The difference between the two situations might be attributed to the possible existence of some unknown restrictions on the types of natural language grammars in our brain.

A Characterization of Simple RNNs with Two Hidden Units
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 436–445, 2008. © Springer-Verlag Berlin Heidelberg 2008

Since a neural network is a model of the brain, some researchers think that a neural network could learn a grammar from exemplar sentences of a language. Research on grammar learning by neural networks is characterized by: 1. a focus on learning self-embedding rules, since simpler rules that can be expressed by finite state automata are understood to be learnable; 2. the use of simple recurrent neural networks (SRNs) as the basic mechanism; and 3. the adoption of target languages such as {a^n b^n | n > 0} and {a^n b^n c^n | n > 0}, which are clearly the results of, though not representative of, self-embedding rules. Here a^n denotes the n-times repetition of a character a; the language {a^n b^n | n > 0} is generated by the grammar {S → ab, S → aSb}, and {a^n b^n c^n | n > 0} is generated by a context-sensitive grammar.

The adoption of simple languages such as {a^n b^n | n > 0} and {a^n b^n c^n | n > 0} as target languages is inevitable, since it is not easy to see whether enough generalization has been achieved by learning when realistic grammars are used. Although these languages are simple, many times what the networks learned was just what they were taught, not a grammar; i.e., they did almost rote learning. In other words, their generalization capability in learning was limited. Their learned results were also unstable, in the sense that when the networks were given new training sentences that were longer than the ones they had learned but still in the same language, the learned network changed unexpectedly; the change was more than a mere refinement of the learned network. Considering these situations, we may doubt that there really exists a solution, or a correct learned network. Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), Chalup et al. ([5]), and others tried to clarify the reasons and to explore the possibility of network learning of these languages, but their results were not conclusive and did not give clear conditions that the learned networks satisfy in common.

In this paper we describe a condition that SRNs with two hidden units that have learned to recognize the language {a^n b^n | n > 0} have in common, that is, a condition that has to be met for an SRN to qualify as having successfully learned the language. The condition implies, incidentally, the instability of learning results. Moreover, by utilizing the condition, we realize a language recognizer. In doing so we found that learning recognizers of languages more complicated than {a^n b^n | n > 0} is hard.
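To make the rewriting process concrete, here is a small Python sketch (ours, not from the paper) that exhaustively applies the rules of the grammar {S → ab, S → aSb}, collecting the strings to which no rule applies any further, i.e. the sentences:

```python
# Sketch (not from the paper): exhaustively apply the rewriting rules
# S -> ab and S -> aSb, collecting strings with no 'S' left (sentences).
def sentences(max_len):
    found = set()
    frontier = ["S"]
    while frontier:
        form = frontier.pop()
        if "S" not in form:
            found.add(form)  # no applicable rule: a sentence
        elif len(form) <= max_len:
            # try every applicable rule at the (only) occurrence of 'S'
            frontier.append(form.replace("S", "ab", 1))
            frontier.append(form.replace("S", "aSb", 1))
    return sorted(found, key=len)

# The generated sentences are exactly a^n b^n for n = 1, 2, 3, ...
print(sentences(6))  # ['ab', 'aabb', 'aaabbb']
```

Every derivation applies S → aSb some number of times and then S → ab once, which is precisely the self-embedding behavior discussed above.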
An RNN (recurrent neural network) is a type of network that has recurrent connections in addition to feedforward connections. The calculation is done at once for the feedforward part, and after a one-time-unit delay the recurrent connections give rise to additional inputs (i.e., additional to the external inputs) to the feedforward part. An RNN is thus a kind of discrete-time system. Starting from an initial state (the initial outputs of the feedforward part, i.e. its outputs without external inputs), the network accepts the successive characters of a string given to the external inputs, reaches a final state, and produces the final external output from that state. An SRN (simple recurrent network) is a simple type of RNN that has only one layer of hidden units in its feedforward part.

A. Iwata, Y. Shinozawa, and A. Sakurai

Rodriguez et al. [13] showed that an SRN learns the languages {a^n b^n | n > 0} (or, more correctly, its subset {a^i b^i | n ≥ i > 0}) and {a^n b^n c^n | n > 0} (or, more correctly, {a^i b^i c^i | n ≥ i > 0}). For {a^n b^n | n > 0}, they found an SRN that generalized to n ≤ 16 after being trained for n ≤ 11. They analyzed how the languages are processed by the SRN, but the results were not conclusive, so it is still an open problem whether it is really possible to realize recognizers of languages such as {a^n b^n | n > 0} or {a^n b^n c^n | n > 0} and whether an SRN could learn or implement more complicated languages.

Siegelmann [16] showed that an RNN has computational ability superior to a Turing machine. But the difficulty of learning {a^n b^n | n > 0} by an SRN suggests that at least the SRN might not be able to realize a Turing machine. The difference might be attributed to the difference in the output functions of the neurons: a piecewise linear function in Siegelmann's case and a sigmoidal function in standard SRN cases, since we cannot find other substantial differences. On the other hand, Casey [4] and Maass [12] showed that in a noisy environment an RNN is equivalent to a finite automaton or something less powerful. These results suggest that exact (i.e., infinite-precision) computation is to be considered when we investigate the possibility of computations by RNNs, or specifically SRNs. Therefore, in the research we conducted and report in this paper, we adopted RNN models with infinite-precision calculation and with a sigmoidal function (tanh(x)) as the output function of the neurons.

From the viewpoints above, we discuss two things in this paper: a necessary condition for an SRN with two hidden units to be a recognizer of the language {a^n
b^n | n > 0}, and the fact that the condition is sufficient enough to guide us in building an SRN recognizing the language.

2 Preliminaries

An SRN (simple recurrent network) is a simple type of recurrent neural network whose function is expressed by

  s_{n+1} = σ(w_s · s_n + w_x · x_n),   N_n(s_n) = w_os · s_n + w_oc,

where σ is a standard sigmoid function (tanh(x) = (1 − exp(−2x))/(1 + exp(−2x))) applied component-wise.

A counter is a device that keeps an integer, allows +1 and −1 operations, and answers yes or no to the inquiry whether its content is 0 (a 0-test). A stack is a device that allows the operations push i (store i) and pop (recall the last-stored content, discard it, and restore the device as if it were just before the corresponding push operation). Clearly a stack is more general than a counter, so if a counter is not implementable, neither is a stack. Elman used the SRN as a predictor, but Rodriguez et al. [13] and others consider it a counter. In this paper we mainly take the latter viewpoint.

We will first explain how a predictor is used as a recognizer or a counter. When we try to train an SRN to recognize a language, we train it to predict correctly the next character to come in an input string (see e.g. [6]). As usual we adopted the "one-hot vector" representation for the network output. A one-hot vector is a vector in which a single element is one and the others are zero.
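For concreteness, the SRN defined above can be sketched in a few lines of Python. The weight values below are arbitrary random placeholders of our own choosing (not trained values from the paper), and the inputs use the one-hot coding just described:

```python
import numpy as np

rng = np.random.default_rng(0)
W_s = rng.normal(size=(2, 2))   # recurrent weights (two hidden units)
W_x = rng.normal(size=(2, 2))   # input weights (two characters, one-hot)
w_os = rng.normal(size=2)       # readout weights
w_oc = 0.0                      # readout bias

ONE_HOT = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}

def run(string, s0=np.zeros(2)):
    # s_{n+1} = tanh(W_s s_n + W_x x_n); readout h(s) = w_os . s + w_oc
    s = s0
    for c in string:
        s = np.tanh(W_s @ s + W_x @ ONE_HOT[c])
    return s, w_os @ s + w_oc

final_state, readout = run("aabb")
```

Note that because of the tanh output function, the hidden state always stays inside the open square (−1, 1)^2, a fact used repeatedly below.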
The network is trained so that the sum of squared errors (differences between actual and desired outputs) is minimized. With this method, if the training data contains, for the same input, two possible outputs with the same occurrence frequency, the network will learn to output 0.5 at the two corresponding elements of the output vector, with the others being 0, since that output vector gives the minimum of the sum of squared errors. It is easily seen that if a network correctly predicts the next character to come for a string in the language {a^n b^n | n > 0}, we can regard it as a counter with limited function. Let us add a new network output whose value is positive if the original network predicts only a to come (which happens only when the input string up to that time was a^n b^n for some n, which is common practice since Elman's work) and is negative otherwise. The modified network seems to count up for a character a and count down for b, since it outputs a positive value when the numbers of a's and b's coincide and a negative value otherwise. Its counting capability, though, is limited, since it could output any value if a is fed before the due number of b's has been fed, that is, if a count-up action is required before the counter has returned to the 0-state.

A (discrete-time) dynamical system is represented as the iteration of a function application: s_{i+1}
= f(s_i), i ∈ N, s_i ∈ R^n. A point s is called a fixed point of f if f(s) = s. A point s is an attracting fixed point of f if s is a fixed point and there exists a neighborhood U_s around s such that lim_{i→∞} f^i(x) = s for all x ∈ U_s. A point s is a repelling fixed point of f if s is an attracting fixed point of f^{−1}. A point s is called a periodic point of f if f^n(s) = s for some n. A point s is an ω-limit point of x for f if lim_{i→∞} f^{n_i}(x) = s for some sequence n_i with lim_{i→∞} n_i = ∞. A fixed point x of f is hyperbolic if all of the eigenvalues of Df at x have absolute values different from one, where Df = [∂f_i/∂x_j] is the Jacobian matrix of first partial derivatives of the function f. A set D is invariant under f if for any s ∈ D, f(s) ∈ D.

The following theorem plays an important role in the current paper.

Theorem 1 (Stable Manifold Theorem for a Fixed Point [9]). Let f : R^n → R^n be a C^r (r ≥ 1) diffeomorphism with a hyperbolic fixed point x. Then there are local stable and unstable manifolds W^{s,f}_loc(x), W^{u,f}_loc(x), tangent to the eigenspaces E^{s,f}_x, E^{u,f}_x of Df at x and of corresponding dimension. W^{s,f}_loc(x) and W^{u,f}_loc(x) are as smooth as the map f, i.e. of class C^r.

The local stable and unstable manifolds for f are defined as follows:

  W^{s,f}_loc(q) = {y ∈ U_q | lim_{m→∞} dist(f^m(y), q) = 0}
  W^{u,f}_loc(q) = {y ∈ U_q | lim_{m→∞} dist(f^{−m}(y), q) = 0}

where U_q is a neighborhood of q and dist is a distance function. The global stable and unstable manifolds for f are then defined as:

  W^{s,f}(q) = ∪_{i≥0} f^{−i}(W^{s,f}_loc(q))  and  W^{u,f}(q) =
∪_{i≥0} f^i(W^{u,f}_loc(q)).

As defined, an SRN is a pair of a discrete-time dynamical system s_{n+1} = σ(w_s · s_n + w_x · x_n) and an external output part N_n(s_n) = w_os · s_n + w_oc. We simply write the former (the dynamical-system part) as s_{n+1} = f(s_n, x_n) and the external output part as h(s_n). When an RNN (or SRN) is used as a recognizer of the language {a^n b^n | n > 0},
as described in the Introduction, it is seen as a counter where the input character a stands for a count-up operation (i.e., +1) and b for a count-down operation (i.e., −1). In the following we may use x_+ for a and x_− for b. For abbreviation, in the following, we use f_+ = f(·, x_+) and f_− = f(·, x_−). Please note that f_−^{−1} is undefined for points outside and on the border of the square (I[−1, 1])^2, where I[−1, 1] is the closed interval [−1, 1]. In the following, though, we do not mention this for simplicity.

Let D_0 = {s | h(s) ≥ 0}, that is, the region where the counter value is 0. Let D_i = f_−^{−i}(D_0), that is, the region where the counter value is i. We postulate that f_+(D_i) ⊆ D_{i+1}. This means that any point in D_i is eligible as a state where the counter content is i. This may seem rather demanding. An alternative would be that a point p is for counter content c if and only if p = f_−^{m_1} ∘ f_+^{p_1} ∘ … ∘ f_−^{m_i} ∘ f_+^{p_i}(s_0) for a predefined s_0, with m_j ≥ 0 and p_j ≥ 0 for 1 ≤ j ≤ i, and i ≥ 0 such that Σ_{j=1}^{i} (p_j − m_j) = c. This, unfortunately, has not led to fruitful results.

We also postulate that the D_i's are disjoint. Since we defined the D_i's as closed sets, the postulate is natural. The point therefore lies in how we have chosen the D_i. The postulate requires that we keep a margin between D_0, D_1, and the others.
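The counting scheme behind these regions can be illustrated by a deliberately simplified one-dimensional caricature (ours, not the paper's SRN): f_+ halves the state, f_− doubles it, and D_0 = [0.75, 1.0] is the region where the counter reads 0. Then D_i = f_−^{−i}(D_0) = [0.75/2^i, 1.0/2^i], f_+(D_i) ⊆ D_{i+1}, and the D_i are disjoint with a margin between them, as the postulate requires:

```python
# A 1-D caricature (not the paper's SRN): f_plus counts up by contracting
# toward 0, f_minus counts down by expanding, D_0 = [0.75, 1.0].
f_plus = lambda s: s / 2.0
f_minus = lambda s: 2.0 * s

def counter_reads_zero(string, s0=1.0):
    s = s0
    for c in string:
        s = f_plus(s) if c == "a" else f_minus(s)
    return 0.75 <= s <= 1.0   # state back in D_0: counter content is 0

assert counter_reads_zero("aaabbb")      # counts balance
assert not counter_reads_zero("aaabb")   # one b short
```

As discussed in Section 2, such a counter is limited: it checks only that the counts balance (it would, for instance, also accept abab), which is exactly the "counter with limited function" behavior of the predictor-based recognizer.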
3 A Necessary Condition

We consider only SRNs with two hidden units, i.e., all the vectors concerning s, such as w_s, s_n, and w_os, are two-dimensional.

Definition 2. D_ω is the set of the accumulation points of ∪_{i≥0} D_i, i.e. s ∈ D_ω iff s = lim_{i→∞} s_{k_i} for some s_{k_i} ∈ D_{k_i}.

Definition 3. P_ω is the set of ω-limit points of points in D_0 for f_+, i.e. s ∈ P_ω iff s = lim_{i→∞} f_+^{k_i}(s_0) for some k_i and s_0 ∈ D_0. Q_ω is the set of ω-limit points of points in D_0 for f_−^{−1}, i.e. s ∈ Q_ω iff s = lim_{i→∞} f_−^{−k_i}(s_0) for some k_i and s_0 ∈ D_0.

Considering the results obtained in Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), and Chalup et al. ([5]), it is natural, at least as a first consideration, to postulate that f_+^i(x) and f_−^i(x) are not wandering, so that they converge to periodic points. Therefore P_ω and Q_ω are postulated to be finite sets of hyperbolic periodic points for f_+ and f_−, respectively. In the following, though, for simplicity of presentation, we postulate that P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively. Moreover, according to the same literature, the points in Q_ω are saddle points for f_−, so we further postulate that W^{u,f_−}_loc(q) and W^{s,f_−}_loc(q) for q ∈ Q_ω are one-dimensional, where their existence is guaranteed by Theorem 1.
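To make the hyperbolicity postulate concrete, the following numerical sketch (with illustrative weights of our own choosing, not a trained network) verifies that a tanh map of the SRN's form can have a hyperbolic saddle fixed point, i.e., one Jacobian eigenvalue inside and one outside the unit circle, giving one-dimensional stable and unstable manifolds as in Theorem 1:

```python
import numpy as np

# Illustrative map f(s) = tanh(W s); the origin is a fixed point since tanh(0) = 0.
W = np.array([[0.5, 0.0],
              [0.0, 2.0]])
f = lambda s: np.tanh(W @ s)

s_fix = np.zeros(2)
assert np.allclose(f(s_fix), s_fix)          # the origin is a fixed point

# Jacobian of tanh(W s) at s is diag(1 - tanh(W s)^2) @ W
J = np.diag(1.0 - np.tanh(W @ s_fix) ** 2) @ W
eigvals = np.linalg.eigvals(J)               # here: 0.5 and 2.0
# One eigenvalue has modulus < 1 and one > 1, none equal to 1: the origin
# is a hyperbolic saddle, so W^s_loc and W^u_loc are both one-dimensional.
```

A trained SRN would of course have different weights; the sketch only illustrates the kind of fixed point the postulate demands of the points in Q_ω.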
Postulate 4. We postulate that f_+(D_i) ⊆ D_{i+1}, that the D_i's are disjoint, that P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively, and that W^{u,f_−}_loc(q) and W^{s,f_−}_loc(q) for q ∈ Q_ω are one-dimensional.

Lemma 5. f_−^{−1} ∘ f_+(D_ω) = D_ω; f_−^{−1}(D_ω ∩ (I(−1, 1))^2) = D_ω and D_ω ∩ (I(−1, 1))^2 = f_−(D_ω), where I(−1, 1) is the open interval (−1, 1); and f_+(D_ω) ⊆ D_ω. Moreover P_ω ⊆ D_ω and Q_ω ⊆ D_ω.

Definition 6. W^{u,−1}(q) is the global unstable manifold at q ∈ Q_ω for f_−^{−1}, i.e., W^{u,−1}(q) = W^{u,f_−^{−1}}(q) = W^{s,f_−}(q).

Lemma 7. For any p ∈ D_ω, any accumulation point of {f_−^i(p) | i > 0} is in Q_ω.

Proof. Since p is in D_ω, there exist p_{k_i} ∈ D_{k_i} such that p = lim_{i→∞} p_{k_i}. Suppose q is in D_ω with q = lim_{j→∞} f_−^{h_j}(p). We take k_i large enough for any h_j so that p_{k_i} exists in any neighborhood of q in which f_−^{h_j}(p) lies. Then