vector, given by

w = [ w_{1,1,2}, ⋯, w_{1,(n_1+1),2}, ⋯, w_{n_L,1,L}, ⋯, w_{n_L,(n_{L−1}+1),L} ]^T .    (1)

The energy function up to the t-th training sample is given by

E(w) = Σ_{τ=1}^{t} ‖d(τ) − h(w, x(τ))‖^2 + [w − ŵ(0)]^T P^{−1}(0) [w − ŵ(0)]    (2)
where x(τ) is an n_1-dimensional input vector, d(τ) is the desired n_L-dimensional output, and h(w, x(τ)) is a nonlinear function that describes the mapping realized by the network. The matrix P(0) is the error covariance matrix and is usually set to δ^{−1} I_{M×M}, where I_{M×M} is an M × M identity matrix.
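As a reading aid (not part of the original paper), the sketch below shows how the per-neuron weight vectors of an MFNN with layer sizes n_1, ..., n_L can be stacked into the single M-dimensional vector w of (1). The helper name and the initialization values are illustrative assumptions only.

```python
import numpy as np

def pack_weights(W):
    """Stack the per-neuron weight vectors (each of length n_{l-1}+1, the extra
    entry being the bias) into one M-dimensional vector, layer by layer as in (1).
    W[l][i] is assumed to hold the weight vector of the i-th neuron in layer l."""
    return np.concatenate([W[l][i] for l in sorted(W) for i in sorted(W[l])])

# Example: a 2-10-1 network (the architecture used later in Section 4.1).
rng = np.random.default_rng(0)
W = {2: {i: 0.1 * rng.standard_normal(2 + 1) for i in range(10)},   # hidden layer
     3: {0: 0.1 * rng.standard_normal(10 + 1)}}                     # output layer
w = pack_weights(W)
print(w.shape)   # (41,): M = sum_l n_l (n_{l-1}+1) = 10*3 + 1*11 = 41
```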
The minimization of (2) leads to the standard RLS equations [1, 3, 6, 7], given by

K(t) = P(t−1) H^T(t) [ I_{n_L×n_L} + H(t) P(t−1) H^T(t) ]^{−1}    (3)
P(t) = P(t−1) − K(t) H(t) P(t−1)    (4)
ŵ(t) = ŵ(t−1) + K(t) [ d(t) − h(ŵ(t−1), x(t)) ],    (5)

where H(t) = [ ∂h(w, x(t)) / ∂w |_{w=ŵ(t−1)} ]^T is the gradient matrix (n_L × M) of h(w, x(t)), and K(t) is the so-called Kalman gain matrix (M × n_L) of classical control theory. The matrix P(t) is the so-called error covariance matrix; it is symmetric positive definite. As mentioned in [5], the standard RLS algorithm has only a limited weight decay effect of δ/t_o per training iteration, where t_o is the number of training iterations. The decay effect decreases linearly as the number of training iterations increases. Hence, the more training presentations take place, the less smoothing effect there is in the data fitting process.
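A minimal sketch of one update of the standard RLS equations (3)-(5), assuming the caller supplies the network map h(w, x) and its gradient matrix H(t); the function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def rls_step(w, P, x, d, h, jac):
    """One standard RLS update, equations (3)-(5).

    w    : current weight estimate, shape (M,)
    P    : error covariance matrix, shape (M, M)
    x, d : current input and desired output, shapes (n_1,), (n_L,)
    h    : forward map, h(w, x) -> array of shape (n_L,)
    jac  : H(t), the gradient of h w.r.t. w at the current estimate, shape (n_L, M)
    """
    H = jac(w, x)                                   # (n_L, M)
    S = np.eye(H.shape[0]) + H @ P @ H.T            # I + H P H^T
    K = P @ H.T @ np.linalg.inv(S)                  # Kalman gain (M, n_L)      (3)
    P_new = P - K @ H @ P                           #                            (4)
    w_new = w + K @ (d - h(w, x))                   #                            (5)
    return w_new, P_new

# Initialization as stated in the text: P(0) = delta^{-1} I (delta is a small constant).
# M, delta = 41, 0.1                      # illustrative values
# P0 = (1.0 / delta) * np.eye(M)
```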
A true weight decay RLS algorithm, namely TWDRLS, was then proposed in [5], where a decay term is added to the original energy function. The new energy function is given by

E(w) = Σ_{τ=1}^{t} ‖d(τ) − h(w, x(τ))‖^2 + α w^T w + [w − ŵ(0)]^T P^{−1}(0) [w − ŵ(0)]    (6)

where α is a regularization parameter. The gradient of E(w) is given by

∂E(w)/∂w = P^{−1}(0) [w − ŵ(0)] + Σ_{τ=1}^{t} { α w − H^T(τ) [ d(τ) − H(τ) w − ξ(τ) ] } .    (7)

In the above, we linearize h(w, x(τ)) around the estimate ŵ(τ−1). That means h(w, x(τ)) = H(τ) w + ξ(τ), where ξ(τ) = h(ŵ(τ−1), x(τ)) − H(τ) ŵ(τ−1) + ρ(τ). To minimize the energy function, we set the gradient to zero. Hence, we have

ŵ(t) = P(t) r(t)    (8)

where
P^{−1}(t) = P^{−1}(t−1) + H^T(t) H(t) + α I_{M×M}    (9)
r(t) = r(t−1) + H^T(t) [ d(t) − ξ(t) ] .    (10)

Furthermore, we define P^*(t−1) ≜ [ I_{M×M} + α P(t−1) ]^{−1} P(t−1). Hence, we have P^{*−1}(t−1) = P^{−1}(t−1) + α I_{M×M}.
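As a reading aid (this intermediate step is spelled out here and is not taken verbatim from the paper), the matrix inversion lemma invoked below is the identity

(A + U C V)^{−1} = A^{−1} − A^{−1} U (C^{−1} + V A^{−1} U)^{−1} V A^{−1} .

Applying it to (9) with A = P^{*−1}(t−1), U = H^T(t), C = I_{n_L×n_L}, and V = H(t) gives

P(t) = [ P^{*−1}(t−1) + H^T(t) H(t) ]^{−1} = P^*(t−1) − P^*(t−1) H^T(t) [ I_{n_L×n_L} + H(t) P^*(t−1) H^T(t) ]^{−1} H(t) P^*(t−1),

which is exactly the covariance-form update appearing in (12)-(13).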
With the matrix inversion lemma [7] in the recursive calculation of P(t), (8) becomes

P^*(t−1) = [ I_{M×M} + α P(t−1) ]^{−1} P(t−1)    (11)
K(t) = P^*(t−1) H^T(t) [ I_{n_L×n_L} + H(t) P^*(t−1) H^T(t) ]^{−1}    (12)
P(t) = P^*(t−1) − K(t) H(t) P^*(t−1)    (13)
ŵ(t) = ŵ(t−1) − α P(t) ŵ(t−1) + K(t) [ d(t) − h(ŵ(t−1), x(t)) ] .    (14)

Equations (11)-(14) are the general global TWDRLS equations. They are more compact than the equations presented in [5]; in particular, the weight updating equation in [5] (i.e., (14) in this paper) is more complicated there. When the regularization parameter α is set to zero, the decay term α w^T w vanishes and (11)-(14) reduce to the standard RLS equations. The decay effect can be easily understood from the decay term α w^T w in the energy function (6). As mentioned in [5], the decay effect per training iteration is equal to α w^T w, which does not decrease with the number of training iterations. The energy function of TWDRLS is the same as that of batch-mode weight decay methods. Hence, existing heuristic methods [8, 9] for choosing the value of α can also be used in the TWDRLS case. We can also explain the weight decay effect from the recursive equations (11)-(14). The main difference between the standard RLS equations and the TWDRLS equations is the introduction of a decay term −α P(t) ŵ(t−1) in (14). This term makes the magnitude of the updated weight vector decay by an amount proportional to α P(t). Since P(t) is positive definite, the magnitude of the weight vector cannot become too large, so the generalization ability of the trained networks is improved [8, 9].
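A minimal sketch of one global TWDRLS update (11)-(14), written in the same style as the RLS sketch above with a caller-supplied h and H(t); names are illustrative, not the authors' implementation. Setting alpha = 0 recovers the standard RLS step, as noted in the text.

```python
import numpy as np

def twdrls_step(w, P, x, d, h, jac, alpha):
    """One global TWDRLS update, equations (11)-(14)."""
    M = w.size
    H = jac(w, x)                                            # (n_L, M)
    P_star = np.linalg.solve(np.eye(M) + alpha * P, P)       # (11): [I + alpha*P]^{-1} P
    S = np.eye(H.shape[0]) + H @ P_star @ H.T
    K = P_star @ H.T @ np.linalg.inv(S)                      # (12)
    P_new = P_star - K @ H @ P_star                          # (13)
    w_new = w - alpha * (P_new @ w) + K @ (d - h(w, x))      # (14)
    return w_new, P_new
```

The M × M solve in (11) is the source of the O(M^3) cost discussed next.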
A drawback of TWDRLS is the requirement of computing the inverse of the M-dimensional matrix (I_{M×M} + α P(t−1)). This complexity is O(M^3), which is much larger than that of the standard RLS, O(M^2). Hence, the TWDRLS algorithm is computationally prohibitive even for a network of moderate size. In the next section, a local version of the TWDRLS algorithm is proposed to solve this complexity problem.
3 Localization of the TWDRLS Algorithm

To localize the TWDRLS algorithm, we first divide the weight vector into several small vectors, where w_{i,l} = [ w_{i,1,l}, ⋯, w_{i,(n_{l−1}+1),l} ]^T denotes the weights connecting all the neurons of layer l−1 to the i-th neuron of layer l. We consider the estimation of each weight vector separately. When we consider the i-th neuron in layer l, we assume that the other weight vectors are constant. Such a technique is commonly used in numerical methods [10]. At each training iteration, we update each weight vector separately. Each neuron has its own energy function. The energy function of the i-th neuron in layer l is given by

E(w_{i,l}) = Σ_{τ=1}^{t} ‖d(τ) − h(w_{i,l}, x(τ))‖^2 + α w_{i,l}^T w_{i,l} + [ w_{i,l} − ŵ_{i,l}(0) ]^T P_{i,l}^{−1}(0) [ w_{i,l} − ŵ_{i,l}(0) ] .    (15)
Utilizing a derivation process similar to the previous analysis, we obtain the following recursive equations for the local TWDRLS algorithm. Each neuron (except the input neurons) has its own set of TWDRLS equations. For the i-th neuron in layer l, the TWDRLS equations are given by

P_{i,l}^*(t−1) = [ I_{(n_{l−1}+1)×(n_{l−1}+1)} + α P_{i,l}(t−1) ]^{−1} P_{i,l}(t−1)    (16)
K_{i,l}(t) = P_{i,l}^*(t−1) H_{i,l}^T(t) [ I_{n_L×n_L} + H_{i,l}(t) P_{i,l}^*(t−1) H_{i,l}^T(t) ]^{−1}    (17)
P_{i,l}(t) = P_{i,l}^*(t−1) − K_{i,l}(t) H_{i,l}(t) P_{i,l}^*(t−1)    (18)
ŵ_{i,l}(t) = ŵ_{i,l}(t−1) − α P_{i,l}(t) ŵ_{i,l}(t−1) + K_{i,l}(t) [ d(t) − h(ŵ_{i,l}(t−1), x(t)) ],    (19)

where H_{i,l}(t) is the n_L × (n_{l−1}+1) local gradient matrix; in this matrix, only the one row associated with the considered neuron is nonzero for the output layer L. K_{i,l}(t) is the (n_{l−1}+1) × n_L local Kalman gain, and P_{i,l}(t) is the (n_{l−1}+1) × (n_{l−1}+1) local error covariance matrix.

The training process of the local TWDRLS algorithm is as follows. There are Σ_{l=2}^{L} n_l neurons (excluding input neurons). Hence, there are Σ_{l=2}^{L} n_l sets of TWDRLS equations.
We update the local weight vectors using (16)-(19), in descending order of l and then ascending order of i. At each training stage, only the local weight vector concerned is updated, while all other local weight vectors remain unchanged.
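A sketch of one training presentation of the local algorithm under this update order; the per-neuron bookkeeping and the caller-supplied h and H_{i,l} are illustrative assumptions, not the authors' code.

```python
import numpy as np

def local_twdrls_sweep(W, P, x, d, h, local_jac, alpha, layers):
    """One training presentation of the local TWDRLS equations (16)-(19).

    W[l][i], P[l][i]      : per-neuron weight vector, shape (n_{l-1}+1,), and local
                            error covariance, shape (n_{l-1}+1, n_{l-1}+1)
    h(W, x)               : network output for the current weights, shape (n_L,)
    local_jac(W, x, i, l) : local gradient H_{i,l}(t), shape (n_L, n_{l-1}+1)
    layers                : [n_1, ..., n_L]
    """
    L = len(layers)
    for l in range(L, 1, -1):              # descending order of l
        for i in range(layers[l - 1]):     # ascending order of i; layer l has n_l neurons
            w_il, P_il = W[l][i], P[l][i]
            H = local_jac(W, x, i, l)
            n = w_il.size
            P_star = np.linalg.solve(np.eye(n) + alpha * P_il, P_il)          # (16)
            S = np.eye(H.shape[0]) + H @ P_star @ H.T
            K = P_star @ H.T @ np.linalg.inv(S)                               # (17)
            P[l][i] = P_star - K @ H @ P_star                                 # (18)
            W[l][i] = w_il - alpha * (P[l][i] @ w_il) + K @ (d - h(W, x))     # (19)
    return W, P
```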
In the global TWDRLS, the complexity mainly comes from computing the inverse of the M-dimensional matrix (I_{M×M} + α P(t−1)). This complexity is O(M^3). So, the computational complexity is

T_{CC}^{global} = O(M^3) = O( ( Σ_{l=2}^{L} n_l (n_{l−1}+1) )^3 ).

Since the size of the matrix is M × M, the space complexity (storage requirement) is

T_{CS}^{global} = O(M^2) = O( ( Σ_{l=2}^{L} n_l (n_{l−1}+1) )^2 ).

From (16), the computational cost of the local TWDRLS algorithm mainly comes from the inversion of an (n_{l−1}+1) × (n_{l−1}+1) matrix. In this way, the computational complexity of each set of local TWDRLS equations is O((n_{l−1}+1)^3) and the corresponding space complexity is O((n_{l−1}+1)^2). Hence, the total computational complexity of the local TWDRLS is

T_{CC}^{local} = O( Σ_{l=2}^{L} n_l (n_{l−1}+1)^3 ),

and the space complexity (storage requirement) is

T_{CS}^{local} = O( Σ_{l=2}^{L} n_l (n_{l−1}+1)^2 ).

They are much smaller than the computational and space complexities of the global case.
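These complexity expressions can be evaluated directly for a given architecture; the sketch below (illustrative helper name, not from the paper) does so and, for the 2-10-1 network of Section 4.1, reproduces the orders of magnitude reported in Table 1.

```python
def twdrls_complexities(layers):
    """Evaluate the complexity expressions of Section 3 for layer sizes [n_1, ..., n_L]."""
    sizes = [n_prev + 1 for n_prev in layers[:-1]]   # (n_{l-1} + 1) for l = 2, ..., L
    counts = layers[1:]                              # n_l           for l = 2, ..., L
    M = sum(n * s for n, s in zip(counts, sizes))
    return {"global_time": M ** 3, "global_space": M ** 2,
            "local_time": sum(n * s ** 3 for n, s in zip(counts, sizes)),
            "local_space": sum(n * s ** 2 for n, s in zip(counts, sizes))}

# For the 2-10-1 network (M = 41): global 68921 / 1681, local 1601 / 211,
# matching the orders of magnitude in Table 1.
print(twdrls_complexities([2, 10, 1]))
```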
4 Simulations

Two problems, the generalized XOR and the sunspot data prediction, are considered. We use three-layer networks. The initial weights are small zero-mean independent identically distributed Gaussian random variables. The transfer function of the hidden neurons is the hyperbolic tangent. Since the generalized XOR is a classification problem, its output neurons use the hyperbolic tangent function. For the sunspot data prediction problem, the output neurons use the linear activation function. The training for each problem is performed 10 times with different random initial weights.

4.1 Generalized XOR Problem

The generalized XOR problem is formulated as d = sign(x_1 x_2) with inputs in the range [−1, 1]. The network has 2 input neurons, 10 hidden neurons, and 1 output neuron. As a result, there are 41 weights. The training set and test set, shown in Figure 1, consist of 50 and 2,000 samples, respectively. The total number of training cycles is set to 200. In each cycle, the training samples from the training set are fed to the network one by one. The decision boundaries obtained from typical networks trained with the global and local TWDRLS and standard RLS algorithms are plotted in Figure 2.

Fig. 1. Training and test samples for the generalized XOR problem: (a) training samples, (b) test samples.
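As an illustration of the task, the following sketch generates such a dataset; the sample counts follow the text, while the uniform sampling over [−1, 1]^2 and the random seed are assumptions.

```python
import numpy as np

def make_generalized_xor(n_samples, seed=0):
    """d = sign(x1 * x2) with inputs drawn from [-1, 1]^2 (uniform sampling assumed)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n_samples, 2))
    d = np.sign(x[:, 0] * x[:, 1])        # targets in {-1, +1}
    return x, d

x_train, d_train = make_generalized_xor(50)        # training set size from the text
x_test, d_test = make_generalized_xor(2000, 1)     # test set size from the text
```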
Table 1. Computational and space complexities of the global and local TWDRLS algorithms for solving the generalized XOR problem.

Algorithm   Computational complexity   Space complexity
Global      O(6.89 × 10^4)             O(1.68 × 10^3)
Local       O(1.60 × 10^3)             O(2.11 × 10^2)

Fig. 2. Decision boundaries of various trained networks for the generalized XOR problem: (a) global TWDRLS, α = 0; (b) local TWDRLS, α = 0; (c) global TWDRLS, α = 0.00178; (d) local TWDRLS, α = 0.00178. Note that when α = 0, TWDRLS is identical to RLS.

From Figures 1 and 2, the decision boundaries obtained from the networks trained with the TWDRLS algorithm are closer to the ideal ones than those obtained with the standard RLS algorithm. Also, both the local and the global TWDRLS algorithms produce decision boundaries of a similar shape. Figure 3 summarizes the average test set false rates over the 10 runs. The average test set false rates obtained by the global and local TWDRLS algorithms are usually lower than those obtained by the standard RLS algorithm over a wide range of regularization parameters. That means both the global and the local TWDRLS algorithms can improve the generalization ability. In terms of average false rate, the performance of the local TWDRLS algorithm is quite similar to that of the global one. The computational and space complexities of the global and local algorithms are listed in Table 1. From Figure 3 and Table 1, we can conclude that the performance of the local TWDRLS is comparable to that of the global one, while its complexities are much smaller.

Fig. 3. Average test set false rate of 10 runs for the generalized XOR problem.

Figure 3 indicates that the average test set false rate first decreases with the regularization parameter α and then increases with it. This shows that a proper selection of α will indeed improve the generalization ability of the network. On the other hand, we observe that the test set false rate becomes very high at large values of α, especially for the networks trained with the global TWDRLS algorithm. This is due to the fact that when the value of α is too large, the weight decay effect is very substantial and the trained network cannot learn the target function. To further illustrate this, we plot in Figure 4 the decision boundary obtained from the network trained with the global TWDRLS algorithm for α = 0.0178. The figure shows that the network has already converged while the decision boundary is still quite far from the ideal one. This is because when the value of α is too large, the weight decay effect is too strong. That means the regularization parameter α cannot be too large; otherwise the network cannot learn the target function.
4.2 Sunspot Data Prediction

The sunspot data from 1700 to 1979 are normalized to the range [0, 1] and taken as the training and test sets. Following the common practice, we divide the data into a training set (1700−1920) and two test sets, namely, Test-set 1 (1921−1955) and Test-set 2 (1956−1979). The sunspot series is rather non-stationary, and Test-set 2 is atypical for the series as a whole. In the simulation, we assume that the series is generated from the following auto-regressive model, given by

d(t) = ϕ(d(t−1), ⋯, d(t−12)) + ε(t)    (20)

where ε(t) is noise and ϕ(·, ⋯, ·) is an unknown nonlinear function.
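For concreteness, the lagged input-target pairs for model (20) could be assembled as in the sketch below; the variable `sunspots`, the min-max normalization, and the exact slicing are assumptions, not taken from the paper.

```python
import numpy as np

def make_lagged_pairs(series, order=12):
    """Build pairs (x(t), d(t)) with x(t) = [d(t-1), ..., d(t-order)], as in (20)."""
    x = np.array([series[t - order:t][::-1] for t in range(order, len(series))])
    d = np.array(series[order:])
    return x, d

# sunspots: assumed 1-D array of the 280 annual values for 1700-1979 (placeholder).
# s = (sunspots - sunspots.min()) / (sunspots.max() - sunspots.min())   # normalize to [0, 1]
# x_train, d_train = make_lagged_pairs(s[:221])   # 1700-1920 used as the training set
```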
A network with 12 input neurons, 8 hidden neurons (with the hyperbolic tangent activation function), and one output neuron (with the linear activation function) is used to approximate ϕ(·, ⋯, ·). The total number of training cycles is 200. As this is a time series problem, the training samples are fed to the network sequentially in each iteration. The criterion used to evaluate the model performance is the root mean squared error (RMSE) on the test set. The experiments are repeated 10 times with different initial weights. Figure 5 summarizes the average RMSE over the 10 runs. The computational and space complexities of the global and local algorithms are listed in Table 2.

Table 2. Computational and space complexities of the global and local TWDRLS algorithms for solving the sunspot data prediction problem.

Algorithm   Computational complexity   Space complexity
Global      O(1.44 × 10^6)             O(1.28 × 10^4)
Local       O(1.83 × 10^4)             O(1.43 × 10^3)

Fig. 4. Decision boundary of a network trained with local TWDRLS where α = 0.0178. In this case, the value of the regularization parameter is too large, so the network cannot form a good decision boundary.

Fig. 5. RMSE of networks trained by the global and local TWDRLS algorithms: (a) Test-set 1 average RMSE, (b) Test-set 2 average RMSE. Note that when α = 0, TWDRLS is identical to RLS.

We observe from Figure 5 that over a wide range of the regularization parameter α, both the global and the local TWDRLS algorithms greatly improve the generalization ability of the trained networks, especially for Test-set 2, which is quite different from the training set. However, the test RMSE becomes very large at large values of α. The reasons are similar to those stated in the last subsection: at large values of α, the weight decay effect is too strong and the network cannot learn the target function. In most cases, the performance of the local training is found to be comparable to that of the global one. Also, Table 2 shows that the complexities of the local training are much smaller than those of the global one.

5 Conclusion

We have investigated the problem of training the MFNN model using the TWDRLS algorithms. We derived a set of concise equations for the local TWDRLS algorithm. The computational complexity and the storage requirement are reduced considerably when using the local approach. Computer simulations indicate that both the local and the global TWDRLS algorithms can improve the generalization ability of MFNNs. The performance of the local TWDRLS algorithm is comparable to that of the global one.

Acknowledgement

The work is supported by the Hong Kong Special Administrative Region RGC Earmarked Grant (Project No. CityU 115606).

References

1. Shah, S., Palmieri, F., Datum, M.: Optimal filtering algorithm for fast learning in feedforward neural networks. Neural Networks 5, 779–787 (1992)
2. Leung, C.S., Wong, K.W., Sum, J., Chan, L.W.: A pruning method for recursive least square algorithm. Neural Networks 14, 147–174 (2001)
3. Scalero, R., Tepedelelenlioglu, N.: Fast new algorithm for training feedforward neural networks. IEEE Trans. Signal Processing 40, 202–210 (1992)
4. Leung, C.S., Sum, J., Young, G., Kan, W.K.: On the Kalman filtering method in neural networks training and pruning. IEEE Trans. Neural Networks 10, 161–165 (1999)
5. Leung, C.S., Tsoi, A.H., Chan, L.W.: Two regularizers for recursive least squared algorithms in feedforward multilayered neural networks. IEEE Trans. Neural Networks 12, 1314–1332 (2001)
6. Mosca, E.: Optimal Predictive and Adaptive Control. Prentice-Hall, Englewood Cliffs, NJ (1995)
7. Haykin, S.: Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ (1991)
8. MacKay, D.: Bayesian interpolation. Neural Computation 4, 415–447 (1992)
9. MacKay, D.: A practical Bayesian framework for backpropagation networks. Neural Computation 4, 448–472 (1992)
10. Hager, W.W.: Applied Numerical Linear Algebra. Prentice-Hall, Englewood Cliffs, NJ (1989)