The bias $b(\hat{p})$ of the empirical log-likelihood is defined as
$$ b(\hat{p}) \equiv E_{q(D)}\!\left[\log \hat{p}(D) - N \cdot E_{q(x)}\!\left[\log \hat{p}(x|D)\right]\right], \quad (8) $$
where $q(D) \equiv \prod_{n=1}^{N} q(x_n)$. Because of this bias, the model most overfitted to the learning data is selected when a regular model is chosen from a candidate set of regular models on the basis of the empirical likelihood. To address this problem, many information criteria have been proposed that evaluate learning results by correcting the bias. In general, an information criterion has the form
$$ IC(\hat{p}, D) \equiv -2 \log \hat{p}(D) + 2\hat{b}(\hat{p}), \quad (9) $$
where $\hat{b}(\hat{p})$ is an estimator of the bias $b(\hat{p})$.
Corrected AIC (cAIC) [3] provides an exact estimate of the bias of the empirical log-likelihood,
$$ \hat{b}_{\mathrm{cAIC}}(\hat{p}_{ML}) = \frac{N(M+1)}{N - M - 2}, \quad (10) $$
under the assumptions that the learning model is a normal linear regression model, that the true model is included in the learning model, and that the estimation is obtained by maximum likelihood estimation. Here, $M$ is the number of explanatory variables and $\dim\theta = M+1$ (the additional parameter is the variance estimator). cAIC asymptotically equals AIC [2]: $\hat{b}_{\mathrm{cAIC}}(\hat{p}_{ML}) \to \hat{b}_{\mathrm{AIC}}(\hat{p}_{ML})$ as $N \to \infty$.
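As a concrete illustration (ours, not from the paper), the bias estimate (10) and the general criterion form (9) translate into a few lines of Python; the handling of a non-positive denominator follows the convention stated in the footnote of Section 4.1.

```python
import numpy as np

def caic_bias(N, M):
    """cAIC estimate of the bias of the empirical log-likelihood, eq. (10).

    N: number of learning samples; M: number of explanatory variables.
    Returns infinity when the denominator N - M - 2 is not positive.
    """
    denominator = N - M - 2
    if denominator <= 0:
        return np.inf
    return N * (M + 1) / denominator

def caic(log_empirical_likelihood, N, M):
    """Information criterion of the general form (9): IC = -2 log p(D) + 2 b."""
    return -2.0 * log_empirical_likelihood + 2.0 * caic_bias(N, M)
```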
4 Artificial Neural Network

In a regression problem, we want to estimate the true function using learning data $D = \{(x_n, y_n) \mid n = 1, \cdots, N\}$, where $x_n \in \mathbb{R}^d$ is an input and $y_n \in \mathbb{R}$ is the corresponding output. An artificial neural network is defined as
$$ f(x; \theta_M) = \sum_{i=1}^{M} a_i \phi(x; \varphi_i), \quad (11) $$
where $a_i$ $(i = 1, \cdots, M)$ are regression coefficients and $\phi(x; \varphi_i)$ are basis functions parameterized by $\varphi_i$. The model parameter of this neural network is $\theta_M \equiv (\varphi_1^T, \cdots, \varphi_M^T)^T$ ($^T$ denotes transpose). The design matrix $X$, the coefficient vector $a$, and the output vector $y$ are defined as
$$ X_{ij} \equiv \phi(x_i; \varphi_j), \quad (12) $$
$$ a \equiv (a_1, \cdots, a_M)^T, \quad (13) $$
$$ y \equiv (y_1, \cdots, y_N)^T. \quad (14) $$
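For concreteness, the design matrix of (12) can be assembled as follows. This is a sketch of ours: the callable `basis(x, varphi)` stands in for the basis function $\phi(x; \varphi)$, and its signature is an assumption rather than something specified in the paper.

```python
import numpy as np

def design_matrix(inputs, phis, basis):
    """Design matrix of eq. (12): X[i, j] = phi(x_i; varphi_j).

    inputs: sequence of N input vectors x_n; phis: sequence of M basis
    parameters varphi_j; basis: callable basis(x, varphi) -> float.
    """
    X = np.empty((len(inputs), len(phis)))
    for i, x in enumerate(inputs):
        for j, varphi in enumerate(phis):
            X[i, j] = basis(x, varphi)
    return X
```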
When the model parameter $\theta_M$ is fixed, the artificial neural network (11) becomes a linear regression model parameterized by $a$. The normal linear regression model
$$ p(y|x; \theta_M) \equiv \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - f(x; \theta_M))^2}{2\sigma^2}\right) \quad (15) $$
is usually used when the noise included in the output $y$ is assumed to follow a normal distribution. In this paper, we call the parameter $\theta_R \equiv (a^T, \sigma^2)^T$ the regular parameter.

4.1 Regularized Maximum Likelihood Estimation

To assure the regularity of the normal linear regression model (15), regularized maximum likelihood estimation is usually used for estimating $\theta_R \equiv (a^T, \sigma^2)^T$. Regularized maximum likelihood estimation maximizes the regularized log-likelihood
$$ \log p(D) - \exp(\lambda) \|a\|^2, \quad (16) $$
where $\lambda \in \mathbb{R}$ is called the regularization parameter. The regularized maximum likelihood estimators of the coefficient vector $a$ and the variance $\sigma^2$ are
$$ \hat{a} = Ly \quad (17) $$
and
$$ \hat{\sigma}^2 = \frac{1}{N} \|y - X\hat{a}\|^2, \quad (18) $$
where
$$ L \equiv (X^T X + \exp(\lambda) I)^{-1} X^T \quad (19) $$
and $I$ is the identity matrix. Here, the effective number of regression coefficients is
$$ M_{\mathrm{eff}} = \mathrm{tr}(XL). \quad (20) $$
Therefore, $M_{\mathrm{eff}}$ is used in place of the number of explanatory variables $M$ in (10).¹

¹ When the denominator of (10) is not positive, we use $\hat{b}_{\mathrm{cAIC}}(\hat{p}) = \infty$.
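Equations (17)-(20) map directly onto NumPy operations. The following sketch is ours (the function and variable names are not from the paper); the regularized normal equations are solved directly instead of forming the inverse in (19) explicitly.

```python
import numpy as np

def regularized_ml(X, y, lam):
    """Regularized maximum likelihood estimators, eqs. (17)-(20).

    X: (N, M) design matrix; y: (N,) output vector;
    lam: regularization parameter lambda of eq. (16).
    """
    N, M = X.shape
    L = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(M), X.T)  # eq. (19)
    a_hat = L @ y                                                # eq. (17)
    residual = y - X @ a_hat
    sigma2_hat = residual @ residual / N                         # eq. (18)
    M_eff = np.trace(X @ L)                                      # eq. (20)
    return a_hat, sigma2_hat, M_eff
```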
4.2 Overfitting of the Error Backpropagation Learning

Error backpropagation learning is usually used for training an artificial neural network. This method is equivalent to a gradient descent method based on the likelihood of the linear regression model (15), because the target function is the squared error on the learning data. In what follows, we assume that the noise follows a normal distribution and that the normal linear regression model (15) is regular for all $\theta_M$. Then, an artificial neural network becomes a set of regular linear regression models.
For simplicity, let us consider a set of regular models $H \equiv \{M(\theta_M) \mid \theta_M \in \Theta_M\}$, where $M(\theta_M) \equiv \{p(x|\theta_R; \theta_M) \mid \theta_R \in \Theta_R\}$ is a regular model. We can define a new model $M_C \equiv \{\hat{p}(x; \theta_M) \mid \theta_M \in \Theta_M\}$, where $\hat{p}(x; \theta_M)$ is the estimation of $M(\theta_M)$. Concerning the model parameter $\theta_M$, we construct the estimation of the model $M_C$ based on the empirical likelihood $\hat{p}(D; \theta_M)$. For example, maximum likelihood estimation selects $\hat{\theta}_{M\mathrm{ML}}$, the maximizer of $\hat{p}(D; \theta_M)$:
$$ \hat{p}_{ML}(x) = \hat{p}(x; \hat{\theta}_{M\mathrm{ML}}), \quad (21) $$
$$ \hat{\theta}_{M\mathrm{ML}} \equiv \mathop{\mathrm{argmax}}_{\theta_M} \hat{p}(D; \theta_M). \quad (22) $$
Thus, maximum likelihood estimation with respect to the model parameter $\theta_M$ is model selection from $H$ based on the empirical likelihood. Because error backpropagation realizes maximum likelihood estimation by gradient descent, the model $M(\theta_M)$ gradually becomes overfitted in the latter half of training. Therefore, we propose a learning method for the model parameter $\theta_M$ based on the unbiased likelihood, which is the empirical likelihood corrected by an appropriate information criterion.
5 Unbiased Likelihood Backpropagation Learning

5.1 Unbiased Likelihood

Using an information criterion $IC(\hat{p}, D)$, we define the unbiased likelihood as
$$ \hat{p}_{\mathrm{ub}}(D) = \exp\!\left(-\frac{1}{2} IC(\hat{p}, D)\right). \quad (23) $$
This unbiased likelihood satisfies
$$ E_{q(D)}\!\left[\frac{1}{N} \log \hat{p}_{\mathrm{ub}}(D)\right] = E_{q(x)}\!\left[\log \hat{p}(x)\right] \quad (24) $$
when the assumptions of the information criterion are satisfied.
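Substituting (9) into (23) gives $\log \hat{p}_{\mathrm{ub}}(D) = -\frac{1}{2} IC(\hat{p}, D) = \log \hat{p}(D) - \hat{b}(\hat{p})$, so the unbiased log-likelihood is simply the empirical log-likelihood shifted down by the bias estimate. A minimal sketch (ours), reusing the `caic_bias` helper sketched earlier:

```python
def log_unbiased_likelihood(log_empirical_likelihood, N, M):
    """Log of the unbiased likelihood, eq. (23), with the cAIC bias of eq. (10):
    log p_ub(D) = -IC/2 = log p(D) - b_cAIC."""
    return log_empirical_likelihood - caic_bias(N, M)
```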
5.2 Regular Hierarchical Model

In this paper, we consider a certain type of hierarchical model, which we call a regular hierarchical model, defined as a set of regular models. A concise definition of a regular hierarchical model is as follows.

Regular Hierarchical Model
– $H \equiv \{M(\theta_M) \mid \theta_M \in \Theta_M\}$
– $M(\theta_M) \equiv \{p(x|\theta_R; \theta_M) \mid \theta_R \in \Theta_R\}$
– $M(\theta_M)$ is a regular model with respect to $\theta_R$.

An artificial neural network is one of the regular hierarchical models. We also define unbiased maximum likelihood estimation as follows.

Unbiased Maximum Likelihood Estimation
Unbiased maximum likelihood estimation $\hat{p}_{\mathrm{ubML}}(x)$ is the estimation $\hat{p}(x; \hat{\theta}_{M\mathrm{ubML}})$ given by the maximizer $\hat{\theta}_{M\mathrm{ubML}}$ of the unbiased likelihood $\hat{p}_{\mathrm{ub}}(D; \theta_M)$:
$$ \hat{p}_{\mathrm{ubML}}(x) \equiv \hat{p}(x; \hat{\theta}_{M\mathrm{ubML}}), \quad (25) $$
$$ \hat{\theta}_{M\mathrm{ubML}} \equiv \mathop{\mathrm{argmax}}_{\theta_M} \hat{p}_{\mathrm{ub}}(D; \theta_M). \quad (26) $$

5.3 Unbiased Likelihood Backpropagation Learning

The partial differential of the unbiased likelihood is
$$ \frac{\partial}{\partial \theta_M} \log \hat{p}_{\mathrm{ub}}(D; \theta_M) = \frac{\partial}{\partial \theta_M} \log \hat{p}(D; \theta_M) - \frac{\partial}{\partial \theta_M} \hat{b}(\hat{p}; \theta_M). \quad (27) $$
We define the unbiased likelihood estimation based on the gradient method with this partial differential as unbiased likelihood backpropagation learning.
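The learning rule implied by (27) is plain gradient ascent on the unbiased log-likelihood. The schematic sketch below is ours; the callables `grad_log_empirical` and `grad_bias` are assumed to return the two terms on the right-hand side of (27), and the fixed step size is only illustrative (the experiments in Section 6 use a quasi-Newton method instead).

```python
def unbiased_likelihood_backprop(theta_M, grad_log_empirical, grad_bias,
                                 learning_rate=0.01, n_iterations=100):
    """Gradient ascent on the unbiased log-likelihood, eq. (27)."""
    for _ in range(n_iterations):
        gradient = grad_log_empirical(theta_M) - grad_bias(theta_M)
        theta_M = theta_M + learning_rate * gradient
    return theta_M
```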
5.4 Unbiased Likelihood Backpropagation Learning for an Artificial Neural Network

In this paper, we derive the unbiased likelihood backpropagation learning for an artificial neural network when the bias of the empirical likelihood is estimated by cAIC (10). The partial differential of the empirical likelihood with respect to $\theta_M$, which is the first term of (27), is
$$ \frac{\partial}{\partial \theta_M} \log \hat{p}(D; \theta_M) = \frac{1}{\hat{\sigma}^2} \left(\frac{\partial X\hat{a}}{\partial \theta_M}\right)^{T} (y - X\hat{a}), \quad (28) $$
$$ \frac{\partial X\hat{a}}{\partial \theta_M} = \left(\frac{\partial X}{\partial \theta_M} L + L^T \frac{\partial X^T}{\partial \theta_M}\right) y - 2 L^T X^T \frac{\partial X}{\partial \theta_M} L y. \quad (29) $$
The partial differential of cAIC (10) with respect to $\theta_M$, which is the second term of (27), is
$$ \frac{\partial}{\partial \theta_M} \hat{b}_{\mathrm{cAIC}}(\hat{p}; \theta_M) = \frac{N(N-1)}{(N - M_{\mathrm{eff}} - 2)^2} \frac{\partial M_{\mathrm{eff}}}{\partial \theta_M}, \quad (30) $$
$$ \frac{\partial M_{\mathrm{eff}}}{\partial \theta_M} = 2\, \mathrm{tr}\!\left(\frac{\partial X}{\partial \theta_M} L - L^T X^T \frac{\partial X}{\partial \theta_M} L\right). \quad (31) $$
We can also obtain the partial differentials of the unbiased likelihood with respect to $\lambda$ as
$$ \frac{\partial}{\partial \lambda} \log \hat{p}(D; \theta_M) = -\frac{1}{\hat{\sigma}^2} y^T L^T L (y - X\hat{a}) \exp(\lambda) \quad (32) $$
and
$$ \frac{\partial M_{\mathrm{eff}}}{\partial \lambda} = -\mathrm{tr}(L^T L) \exp(\lambda). \quad (33) $$
Now that we have obtained the partial differentials of the unbiased likelihood with respect to $\theta_M$ and $\lambda$, it is possible to apply unbiased likelihood backpropagation learning to an artificial neural network.

6 Application to Kernel Regression Model

The kernel regression model is one of the artificial neural networks. The kernel regression model using Gaussian kernels has a single model parameter, the size of the Gaussian kernels, which makes the behavior of learning methods easy to understand. Therefore, the kernel regression model with Gaussian kernels is used in the following simulations. In the implementation of the gradient method, we adopted a quasi-Newton method with the BFGS update for estimating the Hessian matrix and golden section search for determining the step length.

6.1 Kernel Regression Model

The kernel regression model is
$$ f(x; \theta_M) = \sum_{n=1}^{N} a_n K(x, x_n; \theta_M), \quad (34) $$
where $K(x, x_n; \theta_M)$ are kernel functions parameterized by the model parameter $\theta_M$. The Gaussian kernel
$$ K(x, x_n; c) = \exp\!\left(-\frac{\|x - x_n\|^2}{2c^2}\right) \quad (35) $$
is used in the following simulations, where $c$ is a parameter which determines the size of a Gaussian kernel. The model parameter is $\theta_M = c$.
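Because the Gaussian-kernel model has the single model parameter $c$, the derivative $\partial X / \partial c$ is available in closed form and the whole gradient (27)-(31) can be assembled compactly. The sketch below is ours, not the authors' code; the `templates` argument reflects the simulation setup of Section 6.2, where kernels are placed on randomly chosen template points. The gradient with respect to $\lambda$ follows the same pattern using (32) and (33).

```python
import numpy as np

def gaussian_kernel_matrices(inputs, templates, c):
    """Kernel design matrix of eq. (35) and its elementwise derivative w.r.t. c."""
    diff = inputs[:, None, :] - templates[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)             # ||x_i - x_j||^2
    K = np.exp(-sq_dist / (2.0 * c ** 2))            # eq. (35)
    dK_dc = K * sq_dist / c ** 3
    return K, dK_dc

def grad_unbiased_loglik_wrt_c(inputs, y, templates, c, lam):
    """Gradient of the unbiased log-likelihood with respect to c,
    assembled from eqs. (17)-(20) and (27)-(31)."""
    X, dX = gaussian_kernel_matrices(inputs, templates, c)
    N, M = X.shape
    L = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(M), X.T)      # eq. (19)
    a_hat = L @ y                                                    # eq. (17)
    resid = y - X @ a_hat
    sigma2 = resid @ resid / N                                       # eq. (18)
    M_eff = np.trace(X @ L)                                          # eq. (20)
    dXa = (dX @ L + L.T @ dX.T) @ y - 2.0 * L.T @ X.T @ dX @ L @ y   # eq. (29)
    grad_empirical = dXa @ resid / sigma2                            # eq. (28)
    dM_eff = 2.0 * np.trace(dX @ L - L.T @ X.T @ dX @ L)             # eq. (31)
    grad_bias = N * (N - 1) / (N - M_eff - 2) ** 2 * dM_eff          # eq. (30)
    return grad_empirical - grad_bias                                # eq. (27)
```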
Fig. 1. An example of the transition of the mean log empirical likelihood (emp), the mean log test likelihood (true), and the mean log unbiased likelihood (cAIC) against training iterations: (a) empirical likelihood backpropagation; (b) unbiased likelihood backpropagation.

6.2 Simulations

For the purpose of evaluation, empirical likelihood backpropagation learning and unbiased likelihood backpropagation learning are applied to the 8-dimensional-input, 1-dimensional-output regression problems of the "kin family" and "pumadyn family" in the DELVE data set [4]. Each family has four combinations of a fairly linear (f) or nonlinear (n) target and medium (m) or high (h) noise. We use 128 samples as learning data. Fifty templates are chosen randomly, and kernel functions are placed on the templates.

Fig. 1 shows an example of the transition of the mean log empirical likelihood, the mean log test likelihood, and the mean log unbiased likelihood (cAIC). The test likelihood of empirical likelihood backpropagation learning (a) decreases in the latter half of training, and overfitting occurs. In contrast, unbiased likelihood backpropagation learning (b) keeps the test likelihood close to the unbiased likelihood (cAIC), and overfitting does not occur.

Table 1 shows the means and standard deviations of the mean log test likelihood over 100 experiments. Numbers in bold face indicate significantly better results by the t-test at the 1% significance level.

Table 1. Means and standard deviations of the mean log test likelihood over 100 experiments. Numbers in bold face indicate significantly better results by the t-test at the 1% significance level.
Data          Empirical BP       Unbiased BP
kin-8fm        2
kin-8fh        1.108 ± 0.240      1.599 ± 0.100
kin-8nm       −0.394 ± 0.415      0.078 ± 0.287
kin-8nh       −0.435 ± 0.236      0.064 ± 0.048
pumadyn-8fm   −2.235 ± 0.249     −1.708 ± 0.056
pumadyn-8fh   −3.168 ± 0.187     −2.626 ± 0.021
pumadyn-8nm   −3.012 ± 0.238     −2.762 ± 0.116
pumadyn-8nh   −3.287 ± 0.255     −2.934 ± 0.055

6.3 Discussion

The better results of unbiased likelihood backpropagation learning in Table 1 are attributed to the fact that the method maximizes the true likelihood on average, because the mean of the log unbiased likelihood equals the log true likelihood (see (24)).

The standard deviations of the test likelihood for unbiased likelihood backpropagation learning are smaller than those for empirical likelihood backpropagation learning, presumably because the empirical likelihood prefers models with a larger degree of freedom, whereas the unbiased likelihood prefers models with an appropriate degree of freedom. Therefore, the variance of the estimation obtained by unbiased likelihood backpropagation learning becomes smaller than that of empirical likelihood backpropagation learning.
7 Conclusion

In this paper, we provide an explanation of why overfitting occurs within the model selection framework. We propose unbiased likelihood backpropagation learning, which is a gradient descent method for modifying the model parameter with the unbiased likelihood (an information criterion) as the target function. We confirm the effectiveness of the proposed method by applying it to the DELVE data set.

References

1. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L., et al. (eds.) Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press, Cambridge (1987)
2. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6), 716–723 (1974)
3. Sugiura, N.: Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics A7, 13–26 (1978)
4. Rasmussen, C.E., Neal, R.M., Hinton, G.E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., Tibshirani, R.: The DELVE manual (1996), http://www.cs.toronto.edu/~delve/

The Local True Weight Decay Recursive Least Square Algorithm

Chi Sing Leung, Kwok-Wo Wong, and Yong Xu

Department of Electronic Engineering, City University of Hong Kong, Hong Kong
eeleungc@cityu.edu.hk

Abstract. The true weight decay recursive least square (TWDRLS) algorithm is an efficient fast online training algorithm for feedforward neural networks. However, its computational and space complexities are very large. This paper first presents a set of more compact TWDRLS equations. Afterwards, we propose a local version of TWDRLS to reduce the computational and space complexities. The effectiveness of this local version is demonstrated by simulations. Our analysis shows that the computational and space complexities of the local TWDRLS are much smaller than those of the global TWDRLS.

1 Introduction

Training multilayered feedforward neural networks (MFNNs) using recursive least square (RLS) algorithms has attracted much attention in the literature [1, 2, 3, 4]. This is because these RLS algorithms are efficient second-order gradient descent training methods. They lead to faster convergence than first-order methods such as the backpropagation (BP) algorithm. Moreover, fewer parameters need to be tuned during training. Recently, Leung et al. found that the standard RLS algorithm has an implicit weight decay effect [2]. However, this decay effect is not substantial, and so the generalization ability is not very good. A true weight decay RLS (TWDRLS) algorithm was therefore proposed [5]. However, the computational complexity of TWDRLS is $O(M^3)$ at each iteration, where $M$ is the number of weights. It is therefore necessary to reduce the complexity of TWDRLS so that it can be used for large-scale practical problems. The main goal of this paper is to reduce both the computational complexity and the storage requirement. In Section 2, we derive a set of concise equations for TWDRLS and give some discussion of them. We then describe a local TWDRLS algorithm in Section 3. Simulation results are presented in Section 4. We summarize our findings in Section 5.

2 TWDRLS Algorithm

A general MFNN is composed of $L$ layers, indexed by $1, \cdots, L$ from input to output. There are $n_l$ neurons in layer $l$. The output of the $i$-th neuron in the
$l$-th layer is denoted by $y_{i,l}$. That means the $i$-th neuron of the output layer is represented by $y_{i,L}$, while the $i$-th input of the network is represented by $y_{i,1}$. The connection weight from the $j$-th neuron of layer $l-1$ to the $i$-th neuron of layer $l$ is denoted by $w_{i,j,l}$. Biases are implemented as weights and are specified by $w_{i,(n_{l-1}+1),l}$, where $l = 2, \cdots, L$. Hence, the number of weights in an MFNN is given by $M = \sum_{l=2}^{L} (n_{l-1} + 1) n_l$.
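As a quick check of this counting, the formula for $M$ can be evaluated directly; the short sketch below is ours and takes the layer widths $n_1, \cdots, n_L$ as a list (the 8-16-1 architecture in the example is purely illustrative).

```python
def num_weights(layer_sizes):
    """Number of weights M = sum_{l=2}^{L} (n_{l-1} + 1) * n_l in an MFNN,
    counting one bias per neuron in each of the layers 2, ..., L."""
    return sum((n_prev + 1) * n_curr
               for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:]))

print(num_weights([8, 16, 1]))  # (8+1)*16 + (16+1)*1 = 161
```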
In the standard RLS algorithm, we arrange all weights as an $M$-dimensional