The bias $b(\hat{p})$ of the empirical log-likelihood is defined as
$$ b(\hat{p}) \equiv E_{q(D)}\!\left[\log \hat{p}(D) - N \cdot E_{q(x)}\!\left[\log \hat{p}(x|D)\right]\right], \quad (8) $$
where $q(D) \equiv \prod_{n=1}^{N} q(x_n)$. Because of this bias, the model most overfitted to the learning data is selected when a regular model is chosen from a candidate set of regular models on the basis of the empirical likelihood. To address this problem, many information criteria have been proposed that evaluate learning results by correcting the bias. In general, an information criterion has the form
$$ IC(\hat{p}, D) \equiv -2 \log \hat{p}(D) + 2\hat{b}(\hat{p}), \quad (9) $$
where $\hat{b}(\hat{p})$ is an estimator of the bias $b(\hat{p})$.
Corrected AIC (cAIC) [3] provides an exact estimate of the bias of the empirical log-likelihood,
$$ \hat{b}_{\mathrm{cAIC}}(\hat{p}_{ML}) = \frac{N(M+1)}{N - M - 2}, \quad (10) $$
under the assumptions that the learning model is a normal linear regression model, that the true model is included in the learning model, and that the estimation is obtained by maximum likelihood estimation. Here, $M$ is the number of explanatory variables and $\dim\theta = M+1$ (the additional parameter is the variance estimator). cAIC asymptotically equals AIC [2]: $\hat{b}_{\mathrm{cAIC}}(\hat{p}_{ML}) \to \hat{b}_{\mathrm{AIC}}(\hat{p}_{ML})$ as $N \to \infty$.
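As a concrete illustration (ours, not from the paper), the bias estimate (10) and the general criterion form (9) translate into a few lines of Python; the handling of a non-positive denominator follows the convention stated in the footnote of Section 4.1.

```python
import numpy as np

def caic_bias(N, M):
    """cAIC estimate of the bias of the empirical log-likelihood, eq. (10).

    N: number of learning samples; M: number of explanatory variables.
    Returns infinity when the denominator N - M - 2 is not positive.
    """
    denominator = N - M - 2
    if denominator <= 0:
        return np.inf
    return N * (M + 1) / denominator

def caic(log_empirical_likelihood, N, M):
    """Information criterion of the general form (9): IC = -2 log p(D) + 2 b."""
    return -2.0 * log_empirical_likelihood + 2.0 * caic_bias(N, M)
```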
4 Artificial Neural Network

In a regression problem, we want to estimate the true function using learning data $D = \{(x_n, y_n) \mid n = 1, \cdots, N\}$, where $x_n \in \mathbb{R}^d$ is an input and $y_n \in \mathbb{R}$ is the corresponding output. An artificial neural network is defined as
$$ f(x; \theta_M) = \sum_{i=1}^{M} a_i \phi(x; \varphi_i), \quad (11) $$
where $a_i$ $(i = 1, \cdots, M)$ are regression coefficients and $\phi(x; \varphi_i)$ are basis functions parameterized by $\varphi_i$. The model parameter of this neural network is $\theta_M \equiv (\varphi_1^T, \cdots, \varphi_M^T)^T$ ($^T$ denotes transpose). The design matrix $X$, the coefficient vector $a$, and the output vector $y$ are defined as
$$ X_{ij} \equiv \phi(x_i; \varphi_j), \quad (12) $$
$$ a \equiv (a_1, \cdots, a_M)^T, \quad (13) $$
$$ y \equiv (y_1, \cdots, y_N)^T. \quad (14) $$
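For concreteness, the design matrix of (12) can be assembled as follows. This is a sketch of ours: the callable `basis(x, varphi)` stands in for the basis function $\phi(x; \varphi)$, and its signature is an assumption rather than something specified in the paper.

```python
import numpy as np

def design_matrix(inputs, phis, basis):
    """Design matrix of eq. (12): X[i, j] = phi(x_i; varphi_j).

    inputs: sequence of N input vectors x_n; phis: sequence of M basis
    parameters varphi_j; basis: callable basis(x, varphi) -> float.
    """
    X = np.empty((len(inputs), len(phis)))
    for i, x in enumerate(inputs):
        for j, varphi in enumerate(phis):
            X[i, j] = basis(x, varphi)
    return X
```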
When the model parameter $\theta_M$ is fixed, the artificial neural network (11) becomes a linear regression model parameterized by $a$. The normal linear regression model
$$ p(y|x; \theta_M) \equiv \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - f(x; \theta_M))^2}{2\sigma^2}\right) \quad (15) $$
is usually used when the noise included in the output $y$ is assumed to follow a normal distribution. In this paper, we call the parameter $\theta_R \equiv (a^T, \sigma^2)^T$ the regular parameter.

4.1 Regularized Maximum Likelihood Estimation

To assure the regularity of the normal linear regression model (15), regularized maximum likelihood estimation is usually used for estimating $\theta_R \equiv (a^T, \sigma^2)^T$. Regularized maximum likelihood estimation maximizes the regularized log-likelihood
$$ \log p(D) - \exp(\lambda) \|a\|^2, \quad (16) $$
where $\lambda \in \mathbb{R}$ is called the regularization parameter. The regularized maximum likelihood estimators of the coefficient vector $a$ and the variance $\sigma^2$ are
$$ \hat{a} = Ly \quad (17) $$
and
$$ \hat{\sigma}^2 = \frac{1}{N} \|y - X\hat{a}\|^2, \quad (18) $$
where
$$ L \equiv (X^T X + \exp(\lambda) I)^{-1} X^T \quad (19) $$
and $I$ is the identity matrix. Here, the effective number of regression coefficients is
$$ M_{\mathrm{eff}} = \mathrm{tr}(XL). \quad (20) $$
Therefore, $M_{\mathrm{eff}}$ is used in place of the number of explanatory variables $M$ in (10).¹

¹ When the denominator of (10) is not positive, we use $\hat{b}_{\mathrm{cAIC}}(\hat{p}) = \infty$.
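Equations (17)-(20) map directly onto NumPy operations. The following sketch is ours (the function and variable names are not from the paper); the regularized normal equations are solved directly instead of forming the inverse in (19) explicitly.

```python
import numpy as np

def regularized_ml(X, y, lam):
    """Regularized maximum likelihood estimators, eqs. (17)-(20).

    X: (N, M) design matrix; y: (N,) output vector;
    lam: regularization parameter lambda of eq. (16).
    """
    N, M = X.shape
    L = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(M), X.T)  # eq. (19)
    a_hat = L @ y                                                # eq. (17)
    residual = y - X @ a_hat
    sigma2_hat = residual @ residual / N                         # eq. (18)
    M_eff = np.trace(X @ L)                                      # eq. (20)
    return a_hat, sigma2_hat, M_eff
```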
4.2 Overfitting of the Error Backpropagation Learning

Error backpropagation learning is usually used for training an artificial neural network. This method is equivalent to a gradient descent method based on the likelihood of the linear regression model (15), because the target function is the squared error on the learning data. In what follows, we assume that the noise follows a normal distribution and that the normal linear regression model (15) is regular for all $\theta_M$. Then, an artificial neural network becomes a set of regular linear regression models.
For simplicity, let us consider a set of regular models $H \equiv \{M(\theta_M) \mid \theta_M \in \Theta_M\}$, where $M(\theta_M) \equiv \{p(x|\theta_R; \theta_M) \mid \theta_R \in \Theta_R\}$ is a regular model. We can define a new model $M_C \equiv \{\hat{p}(x; \theta_M) \mid \theta_M \in \Theta_M\}$, where $\hat{p}(x; \theta_M)$ is the estimation of $M(\theta_M)$. Concerning the model parameter $\theta_M$, we construct the estimation of the model $M_C$ based on the empirical likelihood $\hat{p}(D; \theta_M)$. For example, maximum likelihood estimation selects $\hat{\theta}_{M\mathrm{ML}}$, the maximizer of $\hat{p}(D; \theta_M)$:
$$ \hat{p}_{ML}(x) = \hat{p}(x; \hat{\theta}_{M\mathrm{ML}}), \quad (21) $$
$$ \hat{\theta}_{M\mathrm{ML}} \equiv \mathop{\mathrm{argmax}}_{\theta_M} \hat{p}(D; \theta_M). \quad (22) $$
Thus, maximum likelihood estimation with respect to the model parameter $\theta_M$ is model selection from $H$ based on the empirical likelihood. Because error backpropagation realizes maximum likelihood estimation by gradient descent, the model $M(\theta_M)$ gradually becomes overfitted in the latter half of training. Therefore, we propose a learning method for the model parameter $\theta_M$ based on the unbiased likelihood, which is the empirical likelihood corrected by an appropriate information criterion.
5 Unbiased Likelihood Backpropagation Learning

5.1 Unbiased Likelihood

Using an information criterion $IC(\hat{p}, D)$, we define the unbiased likelihood as
$$ \hat{p}_{\mathrm{ub}}(D) = \exp\!\left(-\frac{1}{2} IC(\hat{p}, D)\right). \quad (23) $$
This unbiased likelihood satisfies
$$ E_{q(D)}\!\left[\frac{1}{N} \log \hat{p}_{\mathrm{ub}}(D)\right] = E_{q(x)}\!\left[\log \hat{p}(x)\right] \quad (24) $$
when the assumptions of the information criterion are satisfied.
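Substituting (9) into (23) gives $\log \hat{p}_{\mathrm{ub}}(D) = -\frac{1}{2} IC(\hat{p}, D) = \log \hat{p}(D) - \hat{b}(\hat{p})$, so the unbiased log-likelihood is simply the empirical log-likelihood shifted down by the bias estimate. A minimal sketch (ours), reusing the `caic_bias` helper sketched earlier:

```python
def log_unbiased_likelihood(log_empirical_likelihood, N, M):
    """Log of the unbiased likelihood, eq. (23), with the cAIC bias of eq. (10):
    log p_ub(D) = -IC/2 = log p(D) - b_cAIC."""
    return log_empirical_likelihood - caic_bias(N, M)
```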
5.2 Regular Hierarchical Model

In this paper, we consider a certain type of hierarchical model, which we call a regular hierarchical model, defined as a set of regular models. A concise definition of a regular hierarchical model is as follows.

Regular Hierarchical Model
– $H \equiv \{M(\theta_M) \mid \theta_M \in \Theta_M\}$
– $M(\theta_M) \equiv \{p(x|\theta_R; \theta_M) \mid \theta_R \in \Theta_R\}$
– $M(\theta_M)$ is a regular model with respect to $\theta_R$.

An artificial neural network is one of the regular hierarchical models. We also define unbiased maximum likelihood estimation as follows.

Unbiased Maximum Likelihood Estimation
Unbiased maximum likelihood estimation $\hat{p}_{\mathrm{ubML}}(x)$ is the estimation $\hat{p}(x; \hat{\theta}_{M\mathrm{ubML}})$ given by the maximizer $\hat{\theta}_{M\mathrm{ubML}}$ of the unbiased likelihood $\hat{p}_{\mathrm{ub}}(D; \theta_M)$:
$$ \hat{p}_{\mathrm{ubML}}(x) \equiv \hat{p}(x; \hat{\theta}_{M\mathrm{ubML}}), \quad (25) $$
$$ \hat{\theta}_{M\mathrm{ubML}} \equiv \mathop{\mathrm{argmax}}_{\theta_M} \hat{p}_{\mathrm{ub}}(D; \theta_M). \quad (26) $$

5.3 Unbiased Likelihood Backpropagation Learning

The partial differential of the unbiased likelihood is
$$ \frac{\partial}{\partial \theta_M} \log \hat{p}_{\mathrm{ub}}(D; \theta_M) = \frac{\partial}{\partial \theta_M} \log \hat{p}(D; \theta_M) - \frac{\partial}{\partial \theta_M} \hat{b}(\hat{p}; \theta_M). \quad (27) $$
We define the unbiased likelihood estimation based on the gradient method with this partial differential as unbiased likelihood backpropagation learning.
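The learning rule implied by (27) is plain gradient ascent on the unbiased log-likelihood. The schematic sketch below is ours; the callables `grad_log_empirical` and `grad_bias` are assumed to return the two terms on the right-hand side of (27), and the fixed step size is only illustrative (the experiments in Section 6 use a quasi-Newton method instead).

```python
def unbiased_likelihood_backprop(theta_M, grad_log_empirical, grad_bias,
                                 learning_rate=0.01, n_iterations=100):
    """Gradient ascent on the unbiased log-likelihood, eq. (27)."""
    for _ in range(n_iterations):
        gradient = grad_log_empirical(theta_M) - grad_bias(theta_M)
        theta_M = theta_M + learning_rate * gradient
    return theta_M
```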
5.4 Unbiased Likelihood Backpropagation Learning for an Artificial Neural Network

In this paper, we derive the unbiased likelihood backpropagation learning for an artificial neural network when the bias of the empirical likelihood is estimated by cAIC (10). The partial differential of the empirical likelihood with respect to $\theta_M$, which is the first term of (27), is
$$ \frac{\partial}{\partial \theta_M} \log \hat{p}(D; \theta_M) = \frac{1}{\hat{\sigma}^2} \left(\frac{\partial X\hat{a}}{\partial \theta_M}\right)^{T} (y - X\hat{a}), \quad (28) $$
$$ \frac{\partial X\hat{a}}{\partial \theta_M} = \left(\frac{\partial X}{\partial \theta_M} L + L^T \frac{\partial X^T}{\partial \theta_M}\right) y - 2 L^T X^T \frac{\partial X}{\partial \theta_M} L y. \quad (29) $$
The partial differential of cAIC (10) with respect to $\theta_M$, which is the second term of (27), is
$$ \frac{\partial}{\partial \theta_M} \hat{b}_{\mathrm{cAIC}}(\hat{p}; \theta_M) = \frac{N(N-1)}{(N - M_{\mathrm{eff}} - 2)^2} \frac{\partial M_{\mathrm{eff}}}{\partial \theta_M}, \quad (30) $$
$$ \frac{\partial M_{\mathrm{eff}}}{\partial \theta_M} = 2\, \mathrm{tr}\!\left(\frac{\partial X}{\partial \theta_M} L - L^T X^T \frac{\partial X}{\partial \theta_M} L\right). \quad (31) $$
We can also obtain the partial differentials of the unbiased likelihood with respect to $\lambda$ as
$$ \frac{\partial}{\partial \lambda} \log \hat{p}(D; \theta_M) = -\frac{1}{\hat{\sigma}^2} y^T L^T L (y - X\hat{a}) \exp(\lambda) \quad (32) $$
and
$$ \frac{\partial M_{\mathrm{eff}}}{\partial \lambda} = -\mathrm{tr}(L^T L) \exp(\lambda). \quad (33) $$
Now that we have obtained the partial differentials of the unbiased likelihood with respect to $\theta_M$ and $\lambda$, it is possible to apply unbiased likelihood backpropagation learning to an artificial neural network.

6 Application to Kernel Regression Model

The kernel regression model is one of the artificial neural networks. The kernel regression model using Gaussian kernels has a single model parameter, the size of the Gaussian kernels, which makes the behavior of learning methods easy to understand. Therefore, the kernel regression model with Gaussian kernels is used in the following simulations. In the implementation of the gradient method, we adopted a quasi-Newton method with the BFGS update for estimating the Hessian matrix and golden section search for determining the step length.

6.1 Kernel Regression Model

The kernel regression model is
$$ f(x; \theta_M) = \sum_{n=1}^{N} a_n K(x, x_n; \theta_M), \quad (34) $$
where $K(x, x_n; \theta_M)$ are kernel functions parameterized by the model parameter $\theta_M$. The Gaussian kernel
$$ K(x, x_n; c) = \exp\!\left(-\frac{\|x - x_n\|^2}{2c^2}\right) \quad (35) $$
is used in the following simulations, where $c$ is a parameter which determines the size of a Gaussian kernel. The model parameter is $\theta_M = c$.
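Because the Gaussian-kernel model has the single model parameter $c$, the derivative $\partial X / \partial c$ is available in closed form and the whole gradient (27)-(31) can be assembled compactly. The sketch below is ours, not the authors' code; the `templates` argument reflects the simulation setup of Section 6.2, where kernels are placed on randomly chosen template points. The gradient with respect to $\lambda$ follows the same pattern using (32) and (33).

```python
import numpy as np

def gaussian_kernel_matrices(inputs, templates, c):
    """Kernel design matrix of eq. (35) and its elementwise derivative w.r.t. c."""
    diff = inputs[:, None, :] - templates[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)             # ||x_i - x_j||^2
    K = np.exp(-sq_dist / (2.0 * c ** 2))            # eq. (35)
    dK_dc = K * sq_dist / c ** 3
    return K, dK_dc

def grad_unbiased_loglik_wrt_c(inputs, y, templates, c, lam):
    """Gradient of the unbiased log-likelihood with respect to c,
    assembled from eqs. (17)-(20) and (27)-(31)."""
    X, dX = gaussian_kernel_matrices(inputs, templates, c)
    N, M = X.shape
    L = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(M), X.T)      # eq. (19)
    a_hat = L @ y                                                    # eq. (17)
    resid = y - X @ a_hat
    sigma2 = resid @ resid / N                                       # eq. (18)
    M_eff = np.trace(X @ L)                                          # eq. (20)
    dXa = (dX @ L + L.T @ dX.T) @ y - 2.0 * L.T @ X.T @ dX @ L @ y   # eq. (29)
    grad_empirical = dXa @ resid / sigma2                            # eq. (28)
    dM_eff = 2.0 * np.trace(dX @ L - L.T @ X.T @ dX @ L)             # eq. (31)
    grad_bias = N * (N - 1) / (N - M_eff - 2) ** 2 * dM_eff          # eq. (30)
    return grad_empirical - grad_bias                                # eq. (27)
```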
Fig. 1. An example of the transition of the mean log empirical likelihood (emp), the mean log test likelihood (true), and the mean log unbiased likelihood (cAIC) against training iterations: (a) empirical likelihood backpropagation; (b) unbiased likelihood backpropagation.

6.2 Simulations

For the purpose of evaluation, empirical likelihood backpropagation learning and unbiased likelihood backpropagation learning are applied to the 8-dimensional-input, 1-dimensional-output regression problems of the "kin family" and "pumadyn family" in the DELVE data set [4]. Each family has four combinations of a fairly linear (f) or nonlinear (n) target and medium (m) or high (h) noise. We use 128 samples as learning data. Fifty templates are chosen randomly, and kernel functions are placed on the templates.

Fig. 1 shows an example of the transition of the mean log empirical likelihood, the mean log test likelihood, and the mean log unbiased likelihood (cAIC). The test likelihood of empirical likelihood backpropagation learning (a) decreases in the latter half of training, and overfitting occurs. In contrast, unbiased likelihood backpropagation learning (b) keeps the test likelihood close to the unbiased likelihood (cAIC), and overfitting does not occur.

Table 1 shows the means and standard deviations of the mean log test likelihood over 100 experiments. Numbers in bold face indicate significantly better results by the t-test at the 1% significance level.

Table 1. Means and standard deviations of the mean log test likelihood over 100 experiments. Numbers in bold face indicate significantly better results by the t-test at the 1% significance level.
Data          Empirical BP       Unbiased BP
kin-8fm        2
kin-8fh        1.108 ± 0.240      1.599 ± 0.100
kin-8nm       −0.394 ± 0.415      0.078 ± 0.287
kin-8nh       −0.435 ± 0.236      0.064 ± 0.048
pumadyn-8fm   −2.235 ± 0.249     −1.708 ± 0.056
pumadyn-8fh   −3.168 ± 0.187     −2.626 ± 0.021
pumadyn-8nm   −3.012 ± 0.238     −2.762 ± 0.116
pumadyn-8nh   −3.287 ± 0.255     −2.934 ± 0.055

6.3 Discussion

The better results of unbiased likelihood backpropagation learning in Table 1 are attributed to the fact that the method maximizes the true likelihood on average, because the mean of the log unbiased likelihood equals the log true likelihood (see (24)).

The standard deviations of the test likelihood for unbiased likelihood backpropagation learning are smaller than those for empirical likelihood backpropagation learning, presumably because the empirical likelihood prefers models with a larger degree of freedom, whereas the unbiased likelihood prefers models with an appropriate degree of freedom. Therefore, the variance of the estimation obtained by unbiased likelihood backpropagation learning becomes smaller than that of empirical likelihood backpropagation learning.
7 Conclusion

In this paper, we provide an explanation of why overfitting occurs within the model selection framework. We propose unbiased likelihood backpropagation learning, which is a gradient descent method for modifying the model parameter with the unbiased likelihood (an information criterion) as the target function. We confirm the effectiveness of the proposed method by applying it to the DELVE data set.

References

1. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L., et al. (eds.) Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press, Cambridge (1987)
2. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6), 716–723 (1974)
3. Sugiura, N.: Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics A7, 13–26 (1978)
4. Rasmussen, C.E., Neal, R.M., Hinton, G.E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., Tibshirani, R.: The DELVE manual (1996), http://www.cs.toronto.edu/~delve/

The Local True Weight Decay Recursive Least Square Algorithm

Chi Sing Leung, Kwok-Wo Wong, and Yong Xu

Department of Electronic Engineering, City University of Hong Kong, Hong Kong
eeleungc@cityu.edu.hk

Abstract. The true weight decay recursive least square (TWDRLS) algorithm is an efficient fast online training algorithm for feedforward neural networks. However, its computational and space complexities are very large. This paper first presents a set of more compact TWDRLS equations. Afterwards, we propose a local version of TWDRLS to reduce the computational and space complexities. The effectiveness of this local version is demonstrated by simulations. Our analysis shows that the computational and space complexities of the local TWDRLS are much smaller than those of the global TWDRLS.

1 Introduction

Training multilayered feedforward neural networks (MFNNs) using recursive least square (RLS) algorithms has attracted much attention in the literature [1, 2, 3, 4]. This is because these RLS algorithms are efficient second-order gradient descent training methods. They lead to faster convergence than first-order methods such as the backpropagation (BP) algorithm. Moreover, fewer parameters need to be tuned during training. Recently, Leung et al. found that the standard RLS algorithm has an implicit weight decay effect [2]. However, this decay effect is not substantial, and so the generalization ability is not very good. A true weight decay RLS (TWDRLS) algorithm was therefore proposed [5]. However, the computational complexity of TWDRLS is $O(M^3)$ at each iteration, where $M$ is the number of weights. It is therefore necessary to reduce the complexity of TWDRLS so that it can be used for large-scale practical problems. The main goal of this paper is to reduce both the computational complexity and the storage requirement. In Section 2, we derive a set of concise equations for TWDRLS and give some discussion of them. We then describe a local TWDRLS algorithm in Section 3. Simulation results are presented in Section 4. We summarize our findings in Section 5.

2 TWDRLS Algorithm

A general MFNN is composed of $L$ layers, indexed by $1, \cdots, L$ from input to output. There are $n_l$ neurons in layer $l$. The output of the $i$-th neuron in the
$l$-th layer is denoted by $y_{i,l}$. That means the $i$-th neuron of the output layer is represented by $y_{i,L}$, while the $i$-th input of the network is represented by $y_{i,1}$. The connection weight from the $j$-th neuron of layer $l-1$ to the $i$-th neuron of layer $l$ is denoted by $w_{i,j,l}$. Biases are implemented as weights and are specified by $w_{i,(n_{l-1}+1),l}$, where $l = 2, \cdots, L$. Hence, the number of weights in an MFNN is given by $M = \sum_{l=2}^{L} (n_{l-1} + 1) n_l$.
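As a quick check of this counting, the formula for $M$ can be evaluated directly; the short sketch below is ours and takes the layer widths $n_1, \cdots, n_L$ as a list (the 8-16-1 architecture in the example is purely illustrative).

```python
def num_weights(layer_sizes):
    """Number of weights M = sum_{l=2}^{L} (n_{l-1} + 1) * n_l in an MFNN,
    counting one bias per neuron in each of the layers 2, ..., L."""
    return sum((n_prev + 1) * n_curr
               for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:]))

print(num_weights([8, 16, 1]))  # (8+1)*16 + (16+1)*1 = 161
```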
In the standard RLS algorithm, we arrange all weights as an $M$-dimensional