Lecture Notes in Computer Science
4 Simulation Results
We now examine the results of small-scale numerical simulations. The number of inputs was K = 5 and the number of output neurons was N = 3. Artificial zero-mean vectors with uncorrelated elements were generated by the following equations (MATLAB notation):
x(1,i) = 0.45*…;
x(2,i) = 0.45*sin(i/2);
x(3,i) = 0.35*((rem(i,23) - 11)/9).^5;
x(4,i) = 0.145*sin(i/17.8);
x(5,i) = 0.18*((rand(1,1) < 0.5)*2 - 1).*log(rand(1,1) + 0.5);
The input signal is constructed as s = 0.47*mix*x, where the mixing matrix mix is defined as mix = -0.5 + rand(K, K).
In Fig. 2, the cosines of the angles between the column vectors of matrix W and the numerically calculated eigenvectors of the input-signal covariance matrix are used to illustrate the efficiency of the algorithm. The learning rate was held constant at γ = 3.45 for the first 15000 iterations and then set to γ = 0.115. The individual part was taken as α = 0.5. The initial value of matrix W was selected as W_init = -0.5 + rand(5,3). In Fig. 3 we can see that the proposed method can be efficiently used for extraction of deterministic input signals (under the assumption that the mixing matrix is orthogonal). The first column in Fig. 3 represents the original signals, the second column represents the input signals obtained from the original signals by multiplication with the orthonormal matrix mix (mix = mix(mix^T mix)^(-1/2)), and the last column represents the output, or reconstructed, signals obtained by implementation of MMHO.
Fig. 2. Cosine of angles between column vectors of W and three principal eigenvectors of input covariance matrix versus the number of iterations
Modified Modulated Hebb-Oja Learning Rule 533
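The orthonormalization step mix = mix(mix^T mix)^(-1/2) can be sketched as follows. This is an illustrative NumPy stand-in for the paper's MATLAB-style code; the seed and the symmetric-orthogonalization route via eigendecomposition are my choices, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random square mixing matrix with entries in [-0.5, 0.5),
# mirroring the MATLAB-style initialization -.5 + rand(K, K)
K = 5
mix = -0.5 + rng.random((K, K))

# Symmetric orthonormalization: mix <- mix (mix^T mix)^(-1/2),
# computed through the eigendecomposition of mix^T mix
evals, evecs = np.linalg.eigh(mix.T @ mix)
mix = mix @ (evecs @ np.diag(evals ** -0.5) @ evecs.T)

# The columns of mix are now orthonormal: mix^T mix = I
print(np.round(mix.T @ mix, 6))
```

This symmetric route keeps the resulting matrix as close as possible to the original one, which is why it is often preferred over Gram-Schmidt for this purpose.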
Fig. 3. Blind signal extraction of deterministic components (mixing matrix is orthonormal)
5 Comparison to a Part of the Frog Retinal Wiring
Equation (6) can be rewritten as

w_k(i+1) = w_k(i) + γ(i) [ y_k(i) x(i) − ( α ( Σ_{j=1}^{k−1} y_j²(i) + y_k²(i) ) + (1 − α) y^T(i) y(i) ) w_k(i) ],  k = 1, . . . , N.   (7)
The proposed learning rule can be implemented by the network structure shown in Fig. 4. By analyzing the proposed structure, we can see that some degree of similarity with a part of the frog's retinal wiring exists [6]. We can make the following analogies:
• the inputs x_1, . . . , x_K can be viewed as photoreceptors
• the outputs y_1, . . . , y_N correspond to bipolar cells (BC)
• the summation unit Σ can be viewed as a horizontal cell
• the units labeled A_k, which calculate the partial output powers A_k(i) = Σ_{j=1}^{k−1} y_j²(i) + y_k²(i), could be related to the amacrine cells (AC)
• the IP unit can be related to an interplexiform cell (output power = IP = A_N)
534 M. Jankovic et al.
Fig. 4. Structure of the computational circuit that can be used for realization of the MMHO algorithm (a small empty circle at the end of an arrow denotes the square function)
• we can notice the presence of backward inhibitory projections from the amacrine cells to the bipolar cells
• the A_k cells are serially connected, which can possibly explain the serial synapsing between amacrine cells in the frog's retina.
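The serial connection of the A_k units amounts to a running sum of squared outputs, with the final value A_N being the total output power attributed to the IP unit. A minimal sketch (the function name and the example values are mine):

```python
import numpy as np

def partial_powers(y):
    """Serially accumulate partial output powers A_k = sum_{j<=k} y_j^2.

    Each A_k unit adds its own squared output y_k^2 to the running sum
    received from A_{k-1}, mirroring the serial connection between the
    amacrine-cell-like units; the last value, A_N, is the total output
    power delivered by the IP unit.
    """
    A = []
    running = 0.0
    for y_k in y:
        running += y_k ** 2   # contribution of the k-th unit
        A.append(running)
    return np.array(A)

y = np.array([0.5, -1.0, 2.0])
A = partial_powers(y)
print(A)   # -> [0.25 1.25 5.25]
```

Only one extra value per unit is passed along the chain, which is what makes the global quantity A_N computable with purely local, serial wiring.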
Although there is no physiological evidence supporting the similarity with this part of the frog's retinal processing, the high degree of structural similarity can be a sign of a possible similarity in the physiological sense. Although some further refinement of the circuit structure is necessary in order to match known results (such as the reciprocal connections between ACs, or the fact that back-projections of an amacrine cell terminate on BC terminals rather than on their dendrites), here we suggest that species with damaged IP cells will have significantly deteriorated selectivity abilities.
6 Conclusion
In this paper a biologically plausible method that performs PCA has been proposed. In the proposed method, the learning rule for an individual synaptic efficacy does not require explicit information about the other synaptic efficacies, especially those related to other neurons. The simplicity of the "neural circuits" that perform global computations, and the fact that their number does not depend on the number of input and output neurons, can be seen as good features of the proposed method. In contrast to some other recursive PCA methods that use local feedback connections in order to maintain stability, the MMHO method uses global and local feedback connections. Some degree of similarity between the circuit that realizes the proposed learning rule and a part of the wiring of the frog's retina has been suggested.
It is suggested that the role of the interplexiform cells in overall retinal processing could be significant, although it is usually assumed that their role is not important.
References
1. Abed-Meraim, K., Attallah, S., Chkeif, A., Hua, Y.: Orthogonal Oja algorithm. IEEE Signal Processing Letters 7, 116–119 (2000)
2. Baldi, P., Hornik, K.: Learning in linear neural networks: A survey. IEEE Trans. Neural Networks 6, 837–858 (1995)
3. Bienenstock, E., Cooper, L.N., Munro, P.W.: Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience 2, 32–48 (1982)
4. Chen, T.-P., Amari, S.-I.: Unified stabilization approach to principal and minor components. Neural Networks 14, 1377–1387 (2001)
5. Cichocki, A., Amari, S.-I.: Adaptive Blind Signal and Image Processing – Learning Algorithms and Applications. John Wiley and Sons, Chichester (2003)
6. Dowling, J.: The Retina: An Approachable Part of the Brain. The Belknap Press of Harvard University Press (1987)
7. Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications 20, 303–353 (1998)
8. Douglas, S.C., Kung, S.Y., Amari, S.-I.: A self-stabilized minor subspace rule. IEEE Signal Processing Letters 5, 328–330 (1998)
9. Fiori, S.: A theory for learning by weight flow on Stiefel-Grassman manifold. Neural Computation 13, 1625–1647 (2001)
10. Földiák, P.: Adaptive network for optimal linear feature extraction. In: IJCNN 1989, Washington, D.C., USA, pp. 401–405 (1989)
11. Jankovic, M.: A new modulated Hebbian learning rule – Method for local computation of a principal subspace. In: ICONIP 2001, Shanghai, China, vol. 1, pp. 470–475 (2001)
12. Jankovic, M.: A new simple ∞OH neuron model as a biologically plausible principal component analyzer. IEEE Trans. on Neural Networks 14, 853–859 (2003)
13.
Jankovic, M., Ogawa, H.: A new modulated Hebb learning rule – Biologically plausible method for local computation of a principal subspace. Int. J. Neural Systems 13, 215–224 (2003)
14. Jankovic, M., Reljin, B.: Neural learning on Grassman/Stiefel principal/minor submanifold. In: EUROCON 2005, Serbia, pp. 249–252 (2005)
15. Jankovic, M., Ogawa, H.: Modulated Hebb-Oja learning rule – A method for principal subspace analysis. IEEE Trans. on Neural Networks 17, 345–356 (2006)
16. Ljung, L.: Analysis of recursive stochastic algorithms. IEEE Trans. Automat. Contr. 22, 551–575 (1977)
17. Möller, R., Konies, A.: Coupled principal component analysis. IEEE Trans. on Neural Networks 15, 214–222 (2004)
18. Oja, E.: A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267–273 (1982)
19. Oja, E.: Subspace Methods of Pattern Recognition. Research Studies Press and J. Wiley, Letchworth (1983)
20. Oja, E., Karhunen, J.: On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J. Math. Anal. Appl. 106, 69–84 (1985)
21. Oja, E., Ogawa, H., Wangviwattana, J.: Principal component analysis by homogeneous neural networks, Part I: The weighted subspace criterion. IEICE Trans. Inf. & Syst. E75-D, 366–375 (1992)
22. Peltonen, J., Kaski, S.: Discriminative components of data. IEEE Trans. on Neural Networks 16, 68–83 (2005)
23. Plumbley, M.D., Oja, E.: A "nonnegative PCA" algorithm for independent component analysis. IEEE Trans. on Neural Networks 15, 66–76 (2004)
24. Tanaka, T.: Generalized weighted rules for principal components tracking. IEEE Trans. on Signal Processing 53, 1243–1253 (2005)
25. Waheed, K., Salem, F.M.: Blind information-theoretic multiuser detection algorithms for DS-CDMA and WCDMA downlink systems. IEEE Trans. on Neural Networks 16, 937–948 (2005)
26. Xu, L.: Least mean square error reconstruction principle for self-organizing neural nets. Neural Networks 6, 627–648 (1993)
27. Yang, B.: Projection approximation subspace tracking. IEEE Trans. on Signal Processing 43, 95–107 (1995)
28. Zhang, Y., Weng, J., Hwang, W.-S.: Auditory learning: a developmental method. IEEE Trans. on Neural Networks 16, 601–616 (2005)
Orthogonal Shrinkage Methods for Nonparametric Regression under Gaussian Noise
Katsuyuki Hagiwara
Faculty of Education, Mie University, 1577 Kurima-Machiya-cho, Tsu 514-8507, Japan
hagi@edu.mie-u.ac.jp
Abstract. In this article, for regression problems, we propose shrinkage methods for training a machine in terms of a regularized cost function. The machine considered in this article is represented by a linear combination of fixed basis functions, in which the number of basis functions, or equivalently, the number of adjustable weights, is identical to the number of training data. This setting can be viewed as a nonparametric regression method in statistics.
In the regularized cost function employed in this article, the error function is defined by the sum of squared errors and the regularization term is defined by the quadratic form of the weight vector. By assuming i.i.d. Gaussian noise, we propose three thresholding methods for the orthogonal components which are obtained by eigendecomposition of the Gram matrix of the vectors of basis function outputs. The final weight values are obtained by a linear transformation of the thresholded orthogonal components and are shrinkage estimators. The proposed methods are quite simple and automatic, in that it is enough to fix the regularization parameter at a small constant value. Simple numerical experiments showed that, compared with the leave-one-out cross-validation method, the computational costs of the proposed methods are strictly lower and the generalization capabilities of the trained machines are comparable when the number of data is relatively large.
1 Introduction
This article considers a regression method using learning machines that are defined by a linear combination of fixed basis functions. In particular, we focus on machines in which the number of basis functions, or equivalently, the number of adjustable weights, is identical to the number of training data. This problem is viewed as a nonparametric regression method in statistics. In machine learning, support vector regression belongs to this type of regression, in which the kernel trick together with the representer theorem yields a linear combination of kernel functions (e.g. [4]). In recent works, variations of support vector machines have also been proposed, such as least squares support vector machines [8], or equivalently, regularized least squares [6]. In training the above stated machines, regularization methods are naturally introduced because of the following two reasons. The first reason is to stabilize
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 537–546, 2008.
© Springer-Verlag Berlin Heidelberg 2008
the training process. Since the number of basis functions is large, the basis functions can be nearly linearly dependent, depending on the choice of basis functions. In this case, for example, the least squares estimators of the weights are numerically unstable. The second reason is to avoid over-fitting. For example, the outputs of the above stated machine with the least squares estimators are identical to the output data. Obviously, such outputs may not generalize well to unseen data due to over-fitting. The generalization capability of the trained machine is generally sensitive to the choice of the regularization parameter value, which is often chosen by a cross-validation method in applications. In this article, we consider other methods for improving the generalization performance, while the regularizer is retained for stabilizing the training process. In this article, we propose shrinkage methods for training the above stated machine in terms of a regularized cost function. In the regularized cost function, the error function is defined by the sum of squared errors and the regularizer is defined by the quadratic form of the weight vector. Therefore, the formulation in this article is almost equivalent to least squares support vector machines/regularized least squares. However, it is slightly more general, since the basis functions are not assumed to be Mercer kernels. Eigendecomposition is first applied to the Gram matrix of the vectors of basis function outputs. It is viewed as orthogonalization of the vectors of basis function outputs. This procedure has also been applied for reducing computational costs in training kernel machines [9]. We then obtain the coefficients of the orthogonalized basis outputs by a linear transformation of the regularized estimators of the weights. The coefficients are referred to as orthogonal components here. By assuming i.i.d.
Gaussian noise, we consider keeping or removing the orthogonal components by a hard thresholding method, in which the orthogonal components to be removed are set to zero. By introducing the idea of the universal threshold level used in wavelet denoising [5], we propose three hard thresholding methods for the orthogonal components. The final weights are obtained by a linear transformation of the coefficients of the thresholded orthogonal components and are shrinkage estimators. Although our methods include a regularization parameter, it can be fixed at a small value. This is because the machines trained by our methods have compact representations in an orthogonal domain, and the contribution of the regularizer to improving the generalization performance is not significant. Therefore, we need the regularizer only for ensuring the numerical stability of training, and it is enough to fix it at a small value. There are several related works. The lasso implements the l1 regularizer, which is known to act as a thresholding method [7]. The thresholding operation of the lasso is applied directly to the weights of the basis functions, while our methods are applied to the orthogonal components. In the lasso, the threshold level is determined by a regularization parameter, while in our methods the levels are given with a theoretical implication. The regularization parameter may be chosen by cross-validation methods in the lasso, which is time consuming compared to our methods, since it can be fixed in our methods. Recently, a Gram-Schmidt type orthogonalization procedure has been introduced to least squares support
vector machines (e.g. [2]), where the orthogonal components are chosen in a greedy manner according to the residual error. [2] further proposed a method in which regularization parameters are assigned individually to the orthogonal components and are updated based on the Bayesian log evidence. This method includes some parameters, such as a criterion for determining the number of orthogonal components and the number of updates, which are heuristically determined. Our methods have an advantage at this point, since they do not include parameters that need to be empirically adjusted. This article is organized as follows. In Section 2, we formulate a learning machine and its training procedure using a regularization method. We also introduce eigendecomposition, or orthogonalization, of the basis function outputs. In Section 3, we give the shrinkage methods. Some numerical experiments for the proposed methods, including comparisons with the leave-one-out cross-validation method, are shown in Section 4. Conclusions and future works are given in Section 5.
2 Formulation
2.1 Regression Problem and Regularized Cost
Let {(x_i, y_i) : x_i ∈ R^d, y_i ∈ R, i = 1, . . . , n} be a set of input-output training data. We assume that the output data are generated according to the following rule:

y_i = h(x_i) + e_i,  i = 1, . . . , n,   (1)

where h is a fixed target function on R^d and e_1, . . . , e_n are i.i.d. noise variables sampled from a Gaussian distribution N(0, σ²). x_1, . . . , x_n are independent samples from a probability distribution on R^d and are independent of e_1, . . . , e_n.
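The data-generating rule (1) can be sketched as follows. This is illustrative only: the target function h, the input distribution and all constants are my placeholders, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, sigma = 100, 1, 0.5
h = lambda x: np.sin(3.0 * x).sum(axis=1)   # placeholder fixed target on R^d

x = rng.uniform(-1.0, 1.0, size=(n, d))     # inputs drawn independently
e = rng.normal(0.0, sigma, size=n)          # i.i.d. Gaussian noise N(0, sigma^2)
y = h(x) + e                                # y_i = h(x_i) + e_i
print(y.shape)
```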
We consider a machine defined by

f_w(x) = Σ_{j=1}^{n} w_j g_j(x),  x ∈ R^d,   (2)

where w = (w_1, . . . , w_n) ∈ R^n is a coefficient or weight vector. To simplify the subsequent discussion, we define y = (y_1, . . . , y_n)^T, where ^T denotes the transpose. We also redefine w as a column vector; i.e. w = (w_1, . . . , w_n)^T. Let G be the n × n matrix whose (i, j) element is g_j(x_i). We define g_j = (g_j(x_1), . . . , g_j(x_n))^T, which is the jth column vector of G. We assume that g_1, . . . , g_n are linearly independent. We train the machine in terms of the regularized cost function defined by

C(w) = E(w) + λR(w),  λ ∈ (0, ∞),   (3)

where E is an error function, R is a regularization term and λ is a regularization parameter. In this article, we employ

E(w) = ‖y − Gw‖²   (4)
R(w) = ‖w‖²,   (5)

where
‖·‖ denotes the Euclidean norm.
2.2 Training Via Eigendecomposition of the Gram Matrix
We define F_λ = G^T G + λI_n, where I_n is the n × n identity matrix. Under (3), (4) and (5), the estimate of the weight vector is given by

w_λ = (w_{λ,1}, . . . , w_{λ,n})^T = F_λ^{−1} G^T y   (6)

and the minimum of the regularized cost function is written as

C(w_λ) = y^T y − w_λ^T F_λ w_λ.   (7)

Since the column vectors of G are linearly independent, G^T G is positive definite, where G^T G is the Gram matrix of (g_1, . . . , g_n). Then there exists an orthonormal matrix Q such that Q^T (G^T G) Q = Γ, where Γ = diag(γ_1, . . . , γ_n) with γ_k > 0 for any k. Here diag(a_1, . . . , a_n) denotes a diagonal matrix whose (k, k) element is a_k; γ_k is the kth eigenvalue and the kth column vector of Q is the corresponding eigenvector. We define

Γ_λ = diag(γ_1 + λ, . . . , γ_n + λ).   (8)

Then Q^T F_λ Q = Γ_λ and Q Γ_λ Q^T = F_λ hold. By using Q and Γ_λ, (6) and (7) can be written as

w_λ = Q Γ_λ^{−1} Q^T G^T y   (9)

C(w_λ) = y^T y − v_λ^T v_λ   (10)

where

v_λ = (v_{λ,1}, . . . , v_{λ,n})^T = Γ_λ^{−1/2} Q^T G^T y.   (11)

By (9) and (11), w_λ and v_λ are linked by the equation

w_λ = Q Γ_λ^{−1/2} v_λ   (12)

or

v_λ = Γ_λ^{1/2} Q^T w_λ.   (13)

w_λ is a coordinate vector relative to G, while v_λ is one relative to Q, whose column vectors are orthonormal. We refer to v_{λ,1}, . . . , v_{λ,n}
as orthogonal components.
3 Shrinkage Methods
We define w* = (G^T G)^{−1} G^T h, where h = (h(x_1), . . . , h(x_n))^T. We also define

v*_λ = (v*_{λ,1}, . . . , v*_{λ,n})^T = Γ_λ^{−1/2} Γ Q^T w*   (14)

Γ̃_λ = diag( γ_1/(γ_1 + λ), . . . , γ_n/(γ_n + λ) ).   (15)

Then the conditional distribution of v_λ = v_λ(x^n) given x^n = (x_1, . . . , x_n) is shown to be v_λ | x^n ∼ N(v*_λ, σ² Γ̃_λ). The components with v*_{λ,i} = 0 can be viewed
as purely noise components, which we need to remove. We now consider keeping or removing the orthogonal components by a hard thresholding method. Let T_{n,i}(v_{λ,i}) be the ith component after thresholding; i.e. T_{n,i}(v_{λ,i}) = v_{λ,i} if it is kept and T_{n,i}(v_{λ,i}) = 0 if it is removed. We define T_n(v_λ) = (T_{n,1}(v_{λ,1}), . . . , T_{n,n}(v_{λ,n}))^T. Then, training by using a thresholding procedure is as follows:
1. Fix a small positive value for λ.
2. By eigendecomposition of G^T G, obtain the eigenvalues (γ_1, . . . , γ_n) and an orthonormal matrix Q whose column vectors are the corresponding eigenvectors. By using (11) with (8), obtain v_λ.
3. Compute T_{n,i}(v_{λ,i}), i = 1, . . . , n.
4. By (12), obtain ŵ_λ = (ŵ_{λ,1}, . . . , ŵ_{λ,n})^T = Q Γ_λ^{−1/2} T_n(v_λ) as the trained weight vector after thresholding.
Then the output of the trained machine is given by f_{ŵ_λ}(x) = Σ_{j=1}^{n} ŵ_{λ,j} g_j(x) for any x ∈ R^d. By the definition of T_{n,i}, i = 1, . . . , n, it is easy to see that the resulting ŵ_λ is a shrinkage estimator of w_λ. Typically, if we set threshold levels {θ_{n,1}, . . . , θ_{n,n}} for each of the orthogonal components, then we can calculate the thresholded components by

T_{n,i}(v_{λ,i}) = v_{λ,i} if v²_{λ,i} > θ_{n,i}, and T_{n,i}(v_{λ,i}) = 0 if v²_{λ,i} ≤ θ_{n,i},  i = 1, . . . , n.   (16)

We propose three methods as T_n.
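Steps 1–4 can be sketched end to end in NumPy. This is illustrative: G and y are random stand-ins, and the threshold levels are arbitrary placeholders rather than the levels derived in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for the basis output matrix G and the output data y
n, lam = 30, 1e-3
G = rng.normal(size=(n, n))
y = rng.normal(size=n)

# Step 2: eigendecomposition of the Gram matrix, G^T G = Q diag(gamma) Q^T
gamma, Q = np.linalg.eigh(G.T @ G)
gam_lam = gamma + lam                              # diagonal of (8)
v = (Q.T @ (G.T @ y)) / np.sqrt(gam_lam)           # orthogonal components (11)

# Sanity check: (12) applied to v recovers the regularized estimator (6)
w_reg = np.linalg.solve(G.T @ G + lam * np.eye(n), G.T @ y)
assert np.allclose(Q @ (v / np.sqrt(gam_lam)), w_reg)

# Step 3: hard thresholding (16) with placeholder levels theta_i
theta = np.full(n, np.median(v ** 2))              # NOT the proposed levels
v_thr = np.where(v ** 2 > theta, v, 0.0)

# Step 4: trained weights after thresholding, (12) applied to T_n(v)
w_hat = Q @ (v_thr / np.sqrt(gam_lam))
print(np.count_nonzero(v_thr), "of", n, "components kept")
```

The eigendecomposition is done once; thresholding and reconstruction are then cheap linear operations, which is the computational point of the scheme.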
We need appropriate threshold levels for this. To do so, we show a basic result on i.i.d. Gaussian random variables. Let W_1, . . . , W_n be i.i.d. random variables having N(0, σ²). Define

C_{n,ε} = 2 log n + (−1 + ε) log log n,

where ε is a constant. Then it can be shown that

lim_{n→∞} P( max_{1≤i≤n} |W_i|² > σ² C_{n,ε} ) = 0,  for ε > 0   (17)

lim_{n→∞} P( max_{1≤i≤n} |W_i|² ≤ σ² C_{n,ε} ) = 0,  for ε < 0.   (18)
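As a small numeric illustration of the level C_{n,ε} (a sketch; the helper name is mine), ε = 0 sits exactly log log n below the classical universal threshold 2 log n:

```python
import math

def C(n, eps):
    """The level C_{n,eps} = 2 log n + (-1 + eps) log log n from the text."""
    return 2.0 * math.log(n) + (-1.0 + eps) * math.log(math.log(n))

# For eps = 0 the level is log log n below the universal threshold 2 log n
for n in (100, 10_000, 1_000_000):
    print(n, round(C(n, 0.0), 3), round(2.0 * math.log(n), 3))
```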
The strong convergence result with C_{n,1} is known, and C_{n,1} is used as the universal threshold level for removing irrelevant wavelet coefficients in wavelet denoising [5]. The universal threshold level is shown to be asymptotically equivalent to the minimax one [5]. (17) and (18) are weaker results, but they evaluate the O(log log n) term. We return to our problem and consider the case of v*_λ = 0_n; i.e. all of the v_{λ,i}'s are purely noise components, where 0_n denotes the n-dimensional zero vector. We define u = (u_1, . . . , u_n)^T that satisfies u ∼ N(0_n, σ² Γ̃_λ). We define σ_i² = σ² γ_i/(γ_i + λ), i = 1, . . . , n, where σ_i² ≠ 0 for any i. We also define ũ = (ũ_1, . . . , ũ_n)^T, where ũ_i = u_i/σ_i. Then ũ ∼ N(0_n, I_n). By (17) and the definition of ũ, we have

P( ∪_{i=1}^{n} { u_i² > σ_i² C_{n,ε} } ) = P( max_{1≤i≤n} ũ_i² > C_{n,ε} ) → 0  (n → ∞),   (19)

if ε > 0. On the other hand, by using (18) and the definition of ũ, we have

P( ∩_{i=1}^{n} { u_i² ≤ σ_i² C_{n,ε} } ) = P( max_{1≤i≤n} ũ_i² ≤ C_{n,ε} ) → 0  (n → ∞),   (20)

if ε < 0. (19) tells us that, for any i, u_i² cannot exceed σ_i² C_{n,ε} with high probability when n is large and ε > 0. Therefore, if we employ σ_i² C_{n,ε} with ε > 0 as component-wise threshold levels, (19) implies that they remove a noise component if it is one. On the other hand, (20) tells us that there are some noise components which satisfy u_i² > σ_i² C_{n,ε} with high probability when n is large and ε < 0. Therefore, the ith component would be recognized as a non-noise component if u_i² > σ_i² C_{n,ε}, even when it is a noise component. In other words, we cannot distinguish non-zero-mean components from zero-mean components in this case. Hence, σ_i² C_{n,0}, i = 1, . . . , n, are critical levels for identifying noise components. We note that these results are still valid when v*_{λ,i} are not zero for some i but the number of such components is very small. This is the case of assuming sparseness of the representation of h in the orthogonal domain. As a conclusion, we propose a hard thresholding method by putting θ_{n,i} = σ_i² C_{n,0}, i = 1, . . . , n in (16), where we set ε = 0. We refer to this method as component-wise hard thresholding (CHT). In practical applications of the method, we need an estimate of the noise variance σ². Fortunately, for nonparametric regression methods, [1] suggested applying

σ̂² = y^T (I − H_λ)² y / trace[(I − H_λ)²],   (21)

where H_λ is defined by H_λ = G F_λ^{−1} G^T = G Q Γ_λ^{−1} Q^T G^T.
Although the method includes a regularization parameter, it can be fixed at a small value. This is because the thresholding method keeps the trained machine compact in the orthogonal domain, by which the contribution of the regularizer may not be significant in improving the generalization performance. Therefore it is needed only for ensuring the numerical stability of the matrix calculations. Since the basis functions can be nearly linearly dependent in practical applications, small eigenvalues are less reliable. We therefore ignore the orthogonal components whose eigenvalues are less than a predetermined small value, say, 10^{−16}. Although the run time of eigendecomposition is O(n³), the subsequent procedures of CHT, such as the calculations of σ̂² and ŵ_λ, are achieved with less computational cost given the eigendecomposition.

Hard thresholding with the universal threshold level (HTU). Basically, eigendecomposition of G^T G corresponds to principal component analysis of g_1, . . . , g_n. Therefore, for nearly linearly dependent basis functions, only several eigenvectors contribute substantially. On the other hand, the components with small eigenvalues are largely affected by numerical errors. Therefore, it is natural to take care of only the components with large eigenvalues. For a component with a large eigenvalue, γ_i ≫ λ holds, since we can choose a small value for λ. Thus σ_i² ≈ σ² holds by the definition of σ_i² in CHT. We then consider applying a single threshold level σ² C_{n,0} instead of σ_i² C_{n,0}, i = 1, . . . , n in CHT; i.e. we set θ_{n,i} = σ² C_{n,0} in (16). This is a direct application of the universal threshold level in wavelet denoising [5]. This method is referred to as hard thresholding with the universal threshold level (HTU).

Backward hard thresholding (BHT). On the other hand, since the threshold level derived here is a worst-case evaluation of the noise level, CHT and HTU may introduce a bias between f_{ŵ_λ} and h through unexpected removal of contributing components. Since a component with a large eigenvalue is composed of a linear sum of many basis functions, it may be a smooth component. Therefore, removal of these components may yield a large bias. Actually, in wavelet denoising, fast/detail components are the target of the thresholding method and slow/approximation components are left untouched by it [5], which may be a device for reducing the bias. For our method, we also introduce this idea and consider the following procedure. We assume that γ_1 ≤ γ_2 ≤ · · · ≤ γ_n. The method is that, by increasing j from 1 to n, we find the first j = ĵ for which v²_{λ,j} > σ̂² C_{n,0} occurs. Then thresholding is made by T_{n,j}(v_{λ,j}) = v_{λ,j} if j ≥ ĵ and T_{n,j}(v_{λ,j}) = 0 if j < ĵ. This keeps components with large eigenvalues and possibly reduces the bias. We refer to this method as backward hard thresholding (BHT). BHT can be viewed as a stopping criterion for choosing contributing components among orthogonal components enumerated in order of the magnitudes of the eigenvalues.

4 Numerical Experiments
4.1 Choice of Regularization Parameter
CHT, HTU and BHT do not include parameters to be adjusted except the regularization parameter. Since thresholding of orthogonal components yields a certain simple representation of a machine, the regularization parameter may not be significant in improving the generalization performance. To demonstrate this property of our methods, through a simple numerical experiment, we examine the relationship between the generalization performance of trained machines and regularization parameter values. The target function is h(x) = 5 sinc(8x) for x ∈ R. x_1, . . . , x_n are randomly drawn from the interval [−1, 1]. We assume i.i.d. Gaussian noise with mean zero and variance σ² = 1. The basis functions are Gaussian basis functions defined by g_j(x) = exp{−(x − x_j)²/(2τ²)}, j = 1, . . . , n, where we set τ² = 0.05.
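The Gaussian basis matrix of this experiment can be set up as follows (a sketch; the seed and the sample size are mine):

```python
import numpy as np

rng = np.random.default_rng(5)

n, tau2 = 200, 0.05
x = rng.uniform(-1.0, 1.0, size=n)

# g_j(x) = exp(-(x - x_j)^2 / (2 tau^2)): one basis function per data point,
# evaluated at every data point, giving the n x n matrix G
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * tau2))

# The Gram matrix G^T G is symmetric positive semi-definite; its rapidly
# decaying eigenvalues reflect the near linear dependence of the basis
evals = np.linalg.eigvalsh(G.T @ G)
print(evals.min(), evals.max())
```

The wide spread between the smallest and largest eigenvalues printed here is exactly the near-linear-dependence that motivates both the small ridge term and the eigenvalue cutoff mentioned above.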
In this experiment, under a fixed value of the regularization parameter, we trained machines on 1000 sets of training data of size n. At each trial, the test error is measured by the mean squared error between the target function and the trained machine, calculated on 1000 equally spaced input points in [−1, 1]. Figure 1 (a) and (b) depict the results for n = 200 and n = 400 respectively, showing the relationship between the averaged test errors of the trained machines and the regularization parameter values.
Fig. 1. The dependence of test errors on regularization parameters: (a) n = 200 and (b) n = 400. The filled circle, open circle and open square indicate the results for the raw estimate, CHT and BHT respectively.
The filled circle, open circle and open square indicate the results for the raw estimator, CHT and BHT respectively, where the raw estimate is obtained by (6) at each fixed value of the regularization parameter. We do not show the result for HTU, since it is almost the same as that for CHT. In these figures, we can see that the averaged test errors of our methods are almost unchanged for small values of the regularization parameter, while those of the raw estimates are sensitive to the regularization parameter values. We can also see that BHT is entirely superior to the raw estimate and CHT, while CHT is worse than the raw estimate around λ = 10^1 for both n = 200 and 400. In practical applications, the regularization parameter of the raw estimate should be determined from training data; performance comparisons with the leave-one-out cross-validation method are shown below.
4.2
Comparison with LOOCV
We compare the performances of the proposed methods with the performance of the leave-one-out cross-validation (LOOCV) choice of the regularization parameter value. We examine not only generalization performance but also computational time. For the regularized estimator considered in this article, it is known that the LOOCV error can be calculated without training on validation sets [3,6]. We assume the same conditions as in the previous experiment. The CPU time is measured only for the estimation procedure. The experiments are conducted using Matlab on a computer with a 2.13 GHz Core2 CPU and 1 GByte of memory. Table 1 (a) and (b) show the averaged test errors and averaged CPU times of LOOCV and our methods respectively, with the standard deviations (divided by 2) also appended. The examined values of the regularization parameter in LOOCV are {m × 10^{−j} : m = 1, 2, 5, j = −4, −3, . . . , 3}. In our methods, the
Table 1. Test errors and CPU times of LOOCV, CHT, HTU and BHT

(a) Test errors
 n    LOOCV          CHT            HTU            BHT
100   0.079 ± 0.027  0.101 ± 0.034  0.100 ± 0.034  0.076 ± 0.027
200   0.040 ± 0.013  0.046 ± 0.014  0.045 ± 0.014  0.035 ± 0.011
400   0.021 ± 0.006  0.023 ± 0.007  0.023 ± 0.007  0.017 ± 0.005

(b) CPU times
 n    LOOCV          CHT            HTU            BHT
100   0.079 ± 0.002  0.013 ± 0.002  0.014 ± 0.002  0.014 ± 0.002
200   0.533 ± 0.003  0.080 ± 0.002  0.080 ± 0.002  0.080 ± 0.002
400   3.657 ± 0.003  0.523 ± 0.004  0.523 ± 0.004  0.523 ± 0.004

regularization parameter is fixed at 1 × 10^{−4}, the minimum of the
candidate values for LOOCV. Based on Table 1 (a), we first discuss the generalization performances of the methods. CHT and HTU are almost comparable. This implies that only the com- ponents corresponding to large eigen values are contributed. CHT and HTU are entirely worse than LOOCV in average while the differences are within the stan- dard deviations for n = 200 and n = 400. BHT entirely outperforms LOOCV, CHT and HTU in average while the difference between the averaged test error of LOOCV and that of BHT is almost within the standard deviations. As pointed out previously, CHT and HTU have a possibility to remove smooth components accidentally since the threshold levels were determined based on the worst case evaluation of dispersion of noise. The better generalization performance of BHT compared with CHT and HTU in Table 1 (a) is caused by this fact. On the other hand, as shown in Table 1 (b), our methods completely outperform LOOCV in terms of the CPU times. 5 Conclusions and Future Works In this article, we proposed shrinkage methods in training a machine by using a regularization method. The machine is represented by a linear combination of fixed basis functions, in which the number of basis functions, or equivalently, the number of weights is identical to that of training data. In the regularized cost function, the error function is defined by the sum of squared errors and the regularization term is defined by the quadratic form of the weight vector. In the proposed shrinkage procedures, basis functions are orthogonalized by eigen- decomposition of the Gram matrix of the vectors of basis function outputs. Then, the orthogonal components are kept or removed according to the proposed thresholding methods. The proposed methods are based on the statistical prop- erties of regularized estimators of weights, which are derived by assuming i.i.d. Gaussian noise. 
The final weights are obtained by a linear transformation of the thresholded orthogonal components and are therefore shrinkage estimators of the weights. We
proposed three versions of the thresholding method: component-wise hard thresholding, hard thresholding with the universal threshold level, and backward hard thresholding. Since the regularization parameter can be fixed at a small value in our methods, the methods are automatic. Moreover, since eigendecomposition algorithms are included in many software packages and the thresholding methods themselves are simple, our methods are easy to implement. The numerical experiments showed that our methods achieve relatively good generalization capabilities in strictly less computational time than the LOOCV method. In particular, the backward hard thresholding method outperformed the LOOCV method on average in terms of generalization performance. As future work, we need to investigate the performance of our methods on real-world problems, and to evaluate the generalization error attained by the proposed shrinkage methods.

A Subspace Method Based on Data Generation Model with Class Information

Minkook Cho, Dongwoo Yoon, and Hyeyoung Park
School of Electrical Engineering and Computer Science, Kyungpook National University, Daegu, Korea
ucaresoft@paran.com, dongwoo0926@hanmail.net, hypark@knu.ac.kr

Abstract. Subspace methods have been used widely to reduce memory requirements and system complexity and to improve classification performance in pattern recognition and signal processing. We propose a new subspace method based on a data generation model with an intra-class factor and an extra-class factor. The extra-class factor is associated with the distribution of the classes and is important for discriminating between them. The intra-class factor is associated with the distribution within a class and should be diminished to obtain high class separability. In the proposed method, we first estimate the intra-class factors and remove them from the original data. We then extract the extra-class factors by PCA. To verify the proposed method, we conducted computational experiments on real facial data and show that it gives better performance than conventional methods.

1 Introduction

Subspace methods aim to find a low-dimensional subspace that preserves some meaningful information of the input data. They are widely used for high-dimensional pattern classification, such as image classification, for two main reasons. First, by applying a subspace method, we can reduce the memory requirements and the complexity of the system.
Second, we can expect improved classification performance, because useless information is eliminated and the information essential for classification is emphasized.

The most popular subspace methods are PCA (Principal Component Analysis) [10,11,8] and FA (Factor Analysis) [6,14], both of which are based on data generation models. PCA finds a subspace of linear combinations of the input variables (the principal components) that retains as much of the information in the original variables as possible. However, PCA is an unsupervised method that does not use class information, which may cause a loss of information critical for classification. In contrast, LDA (Linear Discriminant Analysis) [1,4,5] is a supervised method that uses the target labels of the data set. LDA attempts to find basis vectors of a subspace that maximizes linear class separability. It is generally known that LDA can give better classification performance than PCA by exploiting class information. However, LDA gives at most k - 1 basis vectors of the subspace for a k-class problem, and it cannot extract features stably when each class contains only a limited number of data.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 547-555, 2008. © Springer-Verlag Berlin Heidelberg 2008

Another subspace method for classification is the intra-person space method [9], which was developed for face recognition. The intra-person space is defined by the differences between two facial images of the same person. For dimension reduction, a low-dimensional eigenspace is obtained by applying PCA to the intra-person space. In classification tasks, raw input data are projected onto the intra-personal eigenspace to obtain low-dimensional features. The intra-person method showed better performance than PCA and LDA on the FERET data [9].
However, the intra-person method is not based on a data generation model, and it cannot give a sound theoretical reason why the intra-person space provides good information for classification.

On the other hand, a data generation model with class information has recently been developed [12]. It is a variant of the factor analysis model with two types of factors: a class factor (what we call the extra-class factor) and an environment factor (what we call the intra-class factor). Based on this model, the intra-class factor is estimated using difference vectors between two data points in the same class, and the estimated probability distribution of the intra-class factor is used to measure the similarity of data for classification. Although this method takes a similar approach to the intra-person method, in the sense that it uses difference vectors to obtain the intra-class information and a similarity measure, it is based on a data generation model that can explain the derived similarity measure. Still, it does not include a dimension reduction process, so another subspace method is necessary for high-dimensional data.

In this paper, we propose a subspace method appropriate for the data generation model developed in [12]. The proposed method finds a subspace that suppresses the effect of the intra-class factors and emphasizes the effect of the extra-class factors. In Section 2, the model is explained in detail.

2 Data Generation Model

Before defining the data generation model, consider several images (i.e., data) obtained from different persons (i.e., classes). Pictures of different persons are obviously different, while pictures of the same person are not exactly the same, due to environmental conditions such as illumination.
Therefore, it is natural to assume that each data point consists of an intra-class factor, which represents within-class variations such as the variation among pictures of the same person, and an extra-class factor, which represents between-class variations such as the differences between two persons. Under this assumption, a random variable x for the observed data can be written as a function of two distinct random variables ξ and η:

x = f(ξ, η), (1)
where ξ represents an extra-class factor, which keeps some information unique to each class, and η represents an intra-class factor, which represents environmental variation within a class. In [12], it was assumed that η captures any variation within a class and that its distribution is common to all classes. To define the generation model explicitly, a linear additive factor model was applied:

x_i = W ξ_i + η. (2)

This means that a random sample x_i in class C_i is generated as the sum of a linearly transformed class prototype ξ_i and a random, class-independent variation η. In this paper, as an extension of this model, we assume that the low-dimensional intra-class factor is also linearly transformed when generating an observed data point x, so that the function f is defined as

x_i = W ξ_i + V η_i. (3)

This model implies that a data point of a specific class is generated from the extra-class factor ξ, which provides discriminative information among classes, and the intra-class factor η, which represents variation within a class. In this equation, W and V are the transformation matrices of the corresponding factors; we call W the extra-class factor loading and V the intra-class factor loading. Figure 1 illustrates this data generation model. To find a good subspace for classification, we seek V and W using the class information given with the input data. In Section 3, we explain how to find this subspace based on the data generation model.
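The pipeline described in the abstract, estimating the intra-class subspace from within-class difference vectors, removing it, and then extracting the extra-class part by PCA, can be loosely illustrated as follows. This is our own sketch, not the authors' algorithm: the function and parameter names are ours, and the actual estimation procedure is given later in the paper.

```python
import numpy as np

def fit_subspaces(X, labels, d_intra=2, d_extra=2):
    """Illustrative sketch (names are ours): estimate the intra-class
    loading V from within-class difference vectors, remove that
    component, then extract the extra-class loading W by PCA."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Within-class difference vectors approximate the intra-class variation.
    diffs = []
    for c in np.unique(labels):
        Xc = X[labels == c]
        diffs.extend(Xc[i] - Xc[j] for i in range(len(Xc)) for j in range(i))
    D = np.array(diffs)
    # PCA on the difference vectors -> intra-class factor loading V.
    _, _, Vt = np.linalg.svd(D - D.mean(0), full_matrices=False)
    V = Vt[:d_intra].T                       # (dim, d_intra), orthonormal
    # Remove the intra-class component: project onto its complement.
    P = np.eye(X.shape[1]) - V @ V.T
    Xr = (X - X.mean(0)) @ P
    # PCA on the residual -> extra-class factor loading W.
    _, _, Wt = np.linalg.svd(Xr, full_matrices=False)
    W = Wt[:d_extra].T
    return V, W
```

The design choice sketched here is that difference vectors within a class cancel the class prototype W ξ_i, leaving only the intra-class term, which is why PCA on those differences recovers the intra-class subspace.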