vector, given by

w = [ w_{1,1,2}, ⋯, w_{1,(n_1+1),2}, ⋯, w_{n_L,1,L}, ⋯, w_{n_L,(n_{L−1}+1),L} ]^T .    (1)

The energy function up to the t-th training sample is given by

E(w) = Σ_{τ=1}^{t} ‖d(τ) − h(w, x(τ))‖^2 + [w − ŵ(0)]^T P^{−1}(0) [w − ŵ(0)]    (2)
where x(τ) is an n_1-dimensional input vector, d(τ) is the desired n_L-dimensional output, and h(w, x(τ)) is a nonlinear function that describes the mapping realized by the network. The matrix P(0) is the error covariance matrix and is usually set to δ^{−1} I_{M×M}, where I_{M×M} is an M × M identity matrix.
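As a reading aid (not part of the original paper), the sketch below shows how the per-neuron weight vectors of an MFNN with layer sizes n_1, ..., n_L can be stacked into the single M-dimensional vector w of (1). The helper name and the initialization values are illustrative assumptions only.

```python
import numpy as np

def pack_weights(W):
    """Stack the per-neuron weight vectors (each of length n_{l-1}+1, the extra
    entry being the bias) into one M-dimensional vector, layer by layer as in (1).
    W[l][i] is assumed to hold the weight vector of the i-th neuron in layer l."""
    return np.concatenate([W[l][i] for l in sorted(W) for i in sorted(W[l])])

# Example: a 2-10-1 network (the architecture used later in Section 4.1).
rng = np.random.default_rng(0)
W = {2: {i: 0.1 * rng.standard_normal(2 + 1) for i in range(10)},   # hidden layer
     3: {0: 0.1 * rng.standard_normal(10 + 1)}}                     # output layer
w = pack_weights(W)
print(w.shape)   # (41,): M = sum_l n_l (n_{l-1}+1) = 10*3 + 1*11 = 41
```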
The minimization of (2) leads to the standard RLS equations [1, 3, 6, 7], given by

K(t) = P(t−1) H^T(t) [ I_{n_L×n_L} + H(t) P(t−1) H^T(t) ]^{−1}    (3)
P(t) = P(t−1) − K(t) H(t) P(t−1)    (4)
ŵ(t) = ŵ(t−1) + K(t) [ d(t) − h(ŵ(t−1), x(t)) ],    (5)

where H(t) = [ ∂h(w, x(t)) / ∂w |_{w=ŵ(t−1)} ]^T is the gradient matrix (n_L × M) of h(w, x(t)), and K(t) is the so-called Kalman gain matrix (M × n_L) of classical control theory. The matrix P(t) is the so-called error covariance matrix; it is symmetric positive definite. As mentioned in [5], the standard RLS algorithm has only a limited weight decay effect of δ/t_o per training iteration, where t_o is the number of training iterations. The decay effect decreases linearly as the number of training iterations increases. Hence, the more training presentations take place, the less smoothing effect there is in the data fitting process.
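A minimal sketch of one update of the standard RLS equations (3)-(5), assuming the caller supplies the network map h(w, x) and its gradient matrix H(t); the function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def rls_step(w, P, x, d, h, jac):
    """One standard RLS update, equations (3)-(5).

    w    : current weight estimate, shape (M,)
    P    : error covariance matrix, shape (M, M)
    x, d : current input and desired output, shapes (n_1,), (n_L,)
    h    : forward map, h(w, x) -> array of shape (n_L,)
    jac  : H(t), the gradient of h w.r.t. w at the current estimate, shape (n_L, M)
    """
    H = jac(w, x)                                   # (n_L, M)
    S = np.eye(H.shape[0]) + H @ P @ H.T            # I + H P H^T
    K = P @ H.T @ np.linalg.inv(S)                  # Kalman gain (M, n_L)      (3)
    P_new = P - K @ H @ P                           #                            (4)
    w_new = w + K @ (d - h(w, x))                   #                            (5)
    return w_new, P_new

# Initialization as stated in the text: P(0) = delta^{-1} I (delta is a small constant).
# M, delta = 41, 0.1                      # illustrative values
# P0 = (1.0 / delta) * np.eye(M)
```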
A true weight decay RLS algorithm, namely TWDRLS, was then proposed in [5], where a decay term is added to the original energy function. The new energy function is given by

E(w) = Σ_{τ=1}^{t} ‖d(τ) − h(w, x(τ))‖^2 + α w^T w + [w − ŵ(0)]^T P^{−1}(0) [w − ŵ(0)]    (6)

where α is a regularization parameter. The gradient of E(w) is given by

∂E(w)/∂w = P^{−1}(0) [w − ŵ(0)] + Σ_{τ=1}^{t} { α w − H^T(τ) [ d(τ) − H(τ) w − ξ(τ) ] } .    (7)

In the above, we linearize h(w, x(τ)) around the estimate ŵ(τ−1). That means h(w, x(τ)) = H(τ) w + ξ(τ), where ξ(τ) = h(ŵ(τ−1), x(τ)) − H(τ) ŵ(τ−1) + ρ(τ). To minimize the energy function, we set the gradient to zero. Hence, we have

ŵ(t) = P(t) r(t)    (8)

where
P^{−1}(t) = P^{−1}(t−1) + H^T(t) H(t) + α I_{M×M}    (9)
r(t) = r(t−1) + H^T(t) [ d(t) − ξ(t) ] .    (10)

Furthermore, we define P^*(t−1) ≜ [ I_{M×M} + α P(t−1) ]^{−1} P(t−1). Hence, we have P^{*−1}(t−1) = P^{−1}(t−1) + α I_{M×M}.
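As a reading aid (this intermediate step is spelled out here and is not taken verbatim from the paper), the matrix inversion lemma invoked below is the identity

(A + U C V)^{−1} = A^{−1} − A^{−1} U (C^{−1} + V A^{−1} U)^{−1} V A^{−1} .

Applying it to (9) with A = P^{*−1}(t−1), U = H^T(t), C = I_{n_L×n_L}, and V = H(t) gives

P(t) = [ P^{*−1}(t−1) + H^T(t) H(t) ]^{−1} = P^*(t−1) − P^*(t−1) H^T(t) [ I_{n_L×n_L} + H(t) P^*(t−1) H^T(t) ]^{−1} H(t) P^*(t−1),

which is exactly the covariance-form update appearing in (12)-(13).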
With the matrix inversion lemma [7] in the recursive calculation of P(t), (8) becomes

P^*(t−1) = [ I_{M×M} + α P(t−1) ]^{−1} P(t−1)    (11)
K(t) = P^*(t−1) H^T(t) [ I_{n_L×n_L} + H(t) P^*(t−1) H^T(t) ]^{−1}    (12)
P(t) = P^*(t−1) − K(t) H(t) P^*(t−1)    (13)
ŵ(t) = ŵ(t−1) − α P(t) ŵ(t−1) + K(t) [ d(t) − h(ŵ(t−1), x(t)) ] .    (14)

Equations (11)-(14) are the general global TWDRLS equations. They are more compact than the equations presented in [5]; in particular, the weight updating equation in [5] (i.e., (14) in this paper) is more complicated there. When the regularization parameter α is set to zero, the decay term α w^T w vanishes and (11)-(14) reduce to the standard RLS equations. The decay effect can be easily understood from the decay term α w^T w in the energy function (6). As mentioned in [5], the decay effect per training iteration is equal to α w^T w, which does not decrease with the number of training iterations. The energy function of TWDRLS is the same as that of batch-mode weight decay methods. Hence, existing heuristic methods [8, 9] for choosing the value of α can also be used in the TWDRLS case. We can also explain the weight decay effect from the recursive equations (11)-(14). The main difference between the standard RLS equations and the TWDRLS equations is the introduction of a decay term −α P(t) ŵ(t−1) in (14). This term makes the magnitude of the updated weight vector decay by an amount proportional to α P(t). Since P(t) is positive definite, the magnitude of the weight vector cannot become too large, so the generalization ability of the trained networks is improved [8, 9].
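A minimal sketch of one global TWDRLS update (11)-(14), written in the same style as the RLS sketch above with a caller-supplied h and H(t); names are illustrative, not the authors' implementation. Setting alpha = 0 recovers the standard RLS step, as noted in the text.

```python
import numpy as np

def twdrls_step(w, P, x, d, h, jac, alpha):
    """One global TWDRLS update, equations (11)-(14)."""
    M = w.size
    H = jac(w, x)                                            # (n_L, M)
    P_star = np.linalg.solve(np.eye(M) + alpha * P, P)       # (11): [I + alpha*P]^{-1} P
    S = np.eye(H.shape[0]) + H @ P_star @ H.T
    K = P_star @ H.T @ np.linalg.inv(S)                      # (12)
    P_new = P_star - K @ H @ P_star                          # (13)
    w_new = w - alpha * (P_new @ w) + K @ (d - h(w, x))      # (14)
    return w_new, P_new
```

The M × M solve in (11) is the source of the O(M^3) cost discussed next.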
A drawback of TWDRLS is the requirement of computing the inverse of the M-dimensional matrix (I_{M×M} + α P(t−1)). This complexity is O(M^3), which is much larger than that of the standard RLS, O(M^2). Hence, the TWDRLS algorithm is computationally prohibitive even for a network of moderate size. In the next section, a local version of the TWDRLS algorithm is proposed to solve this complexity problem.
3 Localization of the TWDRLS Algorithm

To localize the TWDRLS algorithm, we first divide the weight vector into several small vectors, where w_{i,l} = [ w_{i,1,l}, ⋯, w_{i,(n_{l−1}+1),l} ]^T denotes the weights connecting all the neurons of layer l−1 to the i-th neuron of layer l. We consider the estimation of each weight vector separately. When we consider the i-th neuron in layer l, we assume that the other weight vectors are constant. Such a technique is commonly used in numerical methods [10]. At each training iteration, we update each weight vector separately. Each neuron has its own energy function. The energy function of the i-th neuron in layer l is given by

E(w_{i,l}) = Σ_{τ=1}^{t} ‖d(τ) − h(w_{i,l}, x(τ))‖^2 + α w_{i,l}^T w_{i,l} + [ w_{i,l} − ŵ_{i,l}(0) ]^T P_{i,l}^{−1}(0) [ w_{i,l} − ŵ_{i,l}(0) ] .    (15)
Utilizing a derivation process similar to the previous analysis, we obtain the following recursive equations for the local TWDRLS algorithm. Each neuron (except the input neurons) has its own set of TWDRLS equations. For the i-th neuron in layer l, the TWDRLS equations are given by

P_{i,l}^*(t−1) = [ I_{(n_{l−1}+1)×(n_{l−1}+1)} + α P_{i,l}(t−1) ]^{−1} P_{i,l}(t−1)    (16)
K_{i,l}(t) = P_{i,l}^*(t−1) H_{i,l}^T(t) [ I_{n_L×n_L} + H_{i,l}(t) P_{i,l}^*(t−1) H_{i,l}^T(t) ]^{−1}    (17)
P_{i,l}(t) = P_{i,l}^*(t−1) − K_{i,l}(t) H_{i,l}(t) P_{i,l}^*(t−1)    (18)
ŵ_{i,l}(t) = ŵ_{i,l}(t−1) − α P_{i,l}(t) ŵ_{i,l}(t−1) + K_{i,l}(t) [ d(t) − h(ŵ_{i,l}(t−1), x(t)) ],    (19)

where H_{i,l}(t) is the n_L × (n_{l−1}+1) local gradient matrix; in this matrix, only the one row associated with the considered neuron is nonzero for the output layer L. K_{i,l}(t) is the (n_{l−1}+1) × n_L local Kalman gain, and P_{i,l}(t) is the (n_{l−1}+1) × (n_{l−1}+1) local error covariance matrix.

The training process of the local TWDRLS algorithm is as follows. There are Σ_{l=2}^{L} n_l neurons (excluding input neurons). Hence, there are Σ_{l=2}^{L} n_l sets of TWDRLS equations.
We update the local weight vectors using (16)-(19), in descending order of l and then ascending order of i. At each training stage, only the local weight vector concerned is updated, while all other local weight vectors remain unchanged.
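A sketch of one training presentation of the local algorithm under this update order; the per-neuron bookkeeping and the caller-supplied h and H_{i,l} are illustrative assumptions, not the authors' code.

```python
import numpy as np

def local_twdrls_sweep(W, P, x, d, h, local_jac, alpha, layers):
    """One training presentation of the local TWDRLS equations (16)-(19).

    W[l][i], P[l][i]      : per-neuron weight vector, shape (n_{l-1}+1,), and local
                            error covariance, shape (n_{l-1}+1, n_{l-1}+1)
    h(W, x)               : network output for the current weights, shape (n_L,)
    local_jac(W, x, i, l) : local gradient H_{i,l}(t), shape (n_L, n_{l-1}+1)
    layers                : [n_1, ..., n_L]
    """
    L = len(layers)
    for l in range(L, 1, -1):              # descending order of l
        for i in range(layers[l - 1]):     # ascending order of i; layer l has n_l neurons
            w_il, P_il = W[l][i], P[l][i]
            H = local_jac(W, x, i, l)
            n = w_il.size
            P_star = np.linalg.solve(np.eye(n) + alpha * P_il, P_il)          # (16)
            S = np.eye(H.shape[0]) + H @ P_star @ H.T
            K = P_star @ H.T @ np.linalg.inv(S)                               # (17)
            P[l][i] = P_star - K @ H @ P_star                                 # (18)
            W[l][i] = w_il - alpha * (P[l][i] @ w_il) + K @ (d - h(W, x))     # (19)
    return W, P
```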
In the global TWDRLS, the complexity mainly comes from computing the inverse of the M-dimensional matrix (I_{M×M} + α P(t−1)). This complexity is O(M^3). So, the computational complexity is

T_{CC}^{global} = O(M^3) = O( ( Σ_{l=2}^{L} n_l (n_{l−1}+1) )^3 ).

Since the size of the matrix is M × M, the space complexity (storage requirement) is

T_{CS}^{global} = O(M^2) = O( ( Σ_{l=2}^{L} n_l (n_{l−1}+1) )^2 ).

From (16), the computational cost of the local TWDRLS algorithm mainly comes from the inversion of an (n_{l−1}+1) × (n_{l−1}+1) matrix. In this way, the computational complexity of each set of local TWDRLS equations is O((n_{l−1}+1)^3) and the corresponding space complexity is O((n_{l−1}+1)^2). Hence, the total computational complexity of the local TWDRLS is

T_{CC}^{local} = O( Σ_{l=2}^{L} n_l (n_{l−1}+1)^3 ),

and the space complexity (storage requirement) is

T_{CS}^{local} = O( Σ_{l=2}^{L} n_l (n_{l−1}+1)^2 ).

They are much smaller than the computational and space complexities of the global case.
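These complexity expressions can be evaluated directly for a given architecture; the sketch below (illustrative helper name, not from the paper) does so and, for the 2-10-1 network of Section 4.1, reproduces the orders of magnitude reported in Table 1.

```python
def twdrls_complexities(layers):
    """Evaluate the complexity expressions of Section 3 for layer sizes [n_1, ..., n_L]."""
    sizes = [n_prev + 1 for n_prev in layers[:-1]]   # (n_{l-1} + 1) for l = 2, ..., L
    counts = layers[1:]                              # n_l           for l = 2, ..., L
    M = sum(n * s for n, s in zip(counts, sizes))
    return {"global_time": M ** 3, "global_space": M ** 2,
            "local_time": sum(n * s ** 3 for n, s in zip(counts, sizes)),
            "local_space": sum(n * s ** 2 for n, s in zip(counts, sizes))}

# For the 2-10-1 network (M = 41): global 68921 / 1681, local 1601 / 211,
# matching the orders of magnitude in Table 1.
print(twdrls_complexities([2, 10, 1]))
```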
4 Simulations

Two problems, the generalized XOR and the sunspot data prediction, are considered. We use three-layer networks. The initial weights are small zero-mean independent identically distributed Gaussian random variables. The transfer function of the hidden neurons is the hyperbolic tangent. Since the generalized XOR is a classification problem, its output neurons use the hyperbolic tangent function. For the sunspot data prediction problem, the output neurons use the linear activation function. The training for each problem is performed 10 times with different random initial weights.

4.1 Generalized XOR Problem

The generalized XOR problem is formulated as d = sign(x_1 x_2) with inputs in the range [−1, 1]. The network has 2 input neurons, 10 hidden neurons, and 1 output neuron. As a result, there are 41 weights. The training set and test set, shown in Figure 1, consist of 50 and 2,000 samples, respectively. The total number of training cycles is set to 200. In each cycle, the training samples from the training set are fed to the network one by one. The decision boundaries obtained from typical networks trained with the global and local TWDRLS and standard RLS algorithms are plotted in Figure 2.

Fig. 1. Training and test samples for the generalized XOR problem: (a) training samples, (b) test samples.
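As an illustration of the task, the following sketch generates such a dataset; the sample counts follow the text, while the uniform sampling over [−1, 1]^2 and the random seed are assumptions.

```python
import numpy as np

def make_generalized_xor(n_samples, seed=0):
    """d = sign(x1 * x2) with inputs drawn from [-1, 1]^2 (uniform sampling assumed)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n_samples, 2))
    d = np.sign(x[:, 0] * x[:, 1])        # targets in {-1, +1}
    return x, d

x_train, d_train = make_generalized_xor(50)        # training set size from the text
x_test, d_test = make_generalized_xor(2000, 1)     # test set size from the text
```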
Table 1. Computational and space complexities of the global and local TWDRLS algorithms for solving the generalized XOR problem.

Algorithm   Computational complexity   Space complexity
Global      O(6.89 × 10^4)             O(1.68 × 10^3)
Local       O(1.60 × 10^3)             O(2.11 × 10^2)

Fig. 2. Decision boundaries of various trained networks for the generalized XOR problem: (a) global TWDRLS, α = 0; (b) local TWDRLS, α = 0; (c) global TWDRLS, α = 0.00178; (d) local TWDRLS, α = 0.00178. Note that when α = 0, TWDRLS is identical to RLS.

From Figures 1 and 2, the decision boundaries obtained from the networks trained with the TWDRLS algorithm are closer to the ideal ones than those obtained with the standard RLS algorithm. Also, both the local and the global TWDRLS algorithms produce decision boundaries of a similar shape. Figure 3 summarizes the average test set false rates over the 10 runs. The average test set false rates obtained by the global and local TWDRLS algorithms are usually lower than those obtained by the standard RLS algorithm over a wide range of regularization parameters. That means both the global and the local TWDRLS algorithms can improve the generalization ability. In terms of average false rate, the performance of the local TWDRLS algorithm is quite similar to that of the global one. The computational and space complexities of the global and local algorithms are listed in Table 1. From Figure 3 and Table 1, we can conclude that the performance of the local TWDRLS is comparable to that of the global one, while its complexities are much smaller.

Fig. 3. Average test set false rate of 10 runs for the generalized XOR problem.

Figure 3 indicates that the average test set false rate first decreases with the regularization parameter α and then increases with it. This shows that a proper selection of α will indeed improve the generalization ability of the network. On the other hand, we observe that the test set false rate becomes very high at large values of α, especially for the networks trained with the global TWDRLS algorithm. This is due to the fact that when the value of α is too large, the weight decay effect is very substantial and the trained network cannot learn the target function. To further illustrate this, we plot in Figure 4 the decision boundary obtained from the network trained with the global TWDRLS algorithm for α = 0.0178. The figure shows that the network has already converged while the decision boundary is still quite far from the ideal one. This is because when the value of α is too large, the weight decay effect is too strong. That means the regularization parameter α cannot be too large; otherwise the network cannot learn the target function.
4.2 Sunspot Data Prediction

The sunspot data from 1700 to 1979 are normalized to the range [0, 1] and taken as the training and test sets. Following the common practice, we divide the data into a training set (1700−1920) and two test sets, namely, Test-set 1 (1921−1955) and Test-set 2 (1956−1979). The sunspot series is rather non-stationary, and Test-set 2 is atypical for the series as a whole. In the simulation, we assume that the series is generated from the following auto-regressive model, given by

d(t) = ϕ(d(t−1), ⋯, d(t−12)) + ε(t)    (20)

where ε(t) is noise and ϕ(·, ⋯, ·) is an unknown nonlinear function.
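For concreteness, the lagged input-target pairs for model (20) could be assembled as in the sketch below; the variable `sunspots`, the min-max normalization, and the exact slicing are assumptions, not taken from the paper.

```python
import numpy as np

def make_lagged_pairs(series, order=12):
    """Build pairs (x(t), d(t)) with x(t) = [d(t-1), ..., d(t-order)], as in (20)."""
    x = np.array([series[t - order:t][::-1] for t in range(order, len(series))])
    d = np.array(series[order:])
    return x, d

# sunspots: assumed 1-D array of the 280 annual values for 1700-1979 (placeholder).
# s = (sunspots - sunspots.min()) / (sunspots.max() - sunspots.min())   # normalize to [0, 1]
# x_train, d_train = make_lagged_pairs(s[:221])   # 1700-1920 used as the training set
```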
A network with 12 input neurons, 8 hidden neurons (with the hyperbolic tangent activation function), and one output neuron (with the linear activation function) is used to approximate ϕ(·, ⋯, ·). The total number of training cycles is 200. As this is a time series problem, the training samples are fed to the network sequentially in each iteration. The criterion used to evaluate the model performance is the root mean squared error (RMSE) on the test set. The experiments are repeated 10 times with different initial weights. Figure 5 summarizes the average RMSE over the 10 runs. The computational and space complexities of the global and local algorithms are listed in Table 2.

Table 2. Computational and space complexities of the global and local TWDRLS algorithms for solving the sunspot data prediction problem.

Algorithm   Computational complexity   Space complexity
Global      O(1.44 × 10^6)             O(1.28 × 10^4)
Local       O(1.83 × 10^4)             O(1.43 × 10^3)

Fig. 4. Decision boundary of a network trained with local TWDRLS where α = 0.0178. In this case, the value of the regularization parameter is too large, so the network cannot form a good decision boundary.

Fig. 5. RMSE of networks trained by the global and local TWDRLS algorithms: (a) Test-set 1 average RMSE, (b) Test-set 2 average RMSE. Note that when α = 0, TWDRLS is identical to RLS.

We observe from Figure 5 that over a wide range of the regularization parameter α, both the global and the local TWDRLS algorithms greatly improve the generalization ability of the trained networks, especially for Test-set 2, which is quite different from the training set. However, the test RMSE becomes very large at large values of α. The reasons are similar to those stated in the last subsection: at large values of α, the weight decay effect is too strong and the network cannot learn the target function. In most cases, the performance of the local training is found to be comparable to that of the global one. Also, Table 2 shows that the complexities of the local training are much smaller than those of the global one.

5 Conclusion

We have investigated the problem of training the MFNN model using the TWDRLS algorithms. We derived a set of concise equations for the local TWDRLS algorithm. The computational complexity and the storage requirement are reduced considerably when using the local approach. Computer simulations indicate that both the local and the global TWDRLS algorithms can improve the generalization ability of MFNNs. The performance of the local TWDRLS algorithm is comparable to that of the global one.

Acknowledgement

The work is supported by the Hong Kong Special Administrative Region RGC Earmarked Grant (Project No. CityU 115606).

References

1. Shah, S., Palmieri, F., Datum, M.: Optimal filtering algorithm for fast learning in feedforward neural networks. Neural Networks 5, 779–787 (1992)
2. Leung, C.S., Wong, K.W., Sum, J., Chan, L.W.: A pruning method for recursive least square algorithm. Neural Networks 14, 147–174 (2001)
3. Scalero, R., Tepedelelenlioglu, N.: Fast new algorithm for training feedforward neural networks. IEEE Trans. Signal Processing 40, 202–210 (1992)
4. Leung, C.S., Sum, J., Young, G., Kan, W.K.: On the Kalman filtering method in neural networks training and pruning. IEEE Trans. Neural Networks 10, 161–165 (1999)
5. Leung, C.S., Tsoi, A.H., Chan, L.W.: Two regularizers for recursive least squared algorithms in feedforward multilayered neural networks. IEEE Trans. Neural Networks 12, 1314–1332 (2001)
6. Mosca, E.: Optimal Predictive and Adaptive Control. Prentice-Hall, Englewood Cliffs, NJ (1995)
7. Haykin, S.: Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ (1991)
8. MacKay, D.: Bayesian interpolation. Neural Computation 4, 415–447 (1992)
9. MacKay, D.: A practical Bayesian framework for backpropagation networks. Neural Computation 4, 448–472 (1992)
10. Hager, W.W.: Applied Numerical Linear Algebra. Prentice-Hall, Englewood Cliffs, NJ (1989)