Lecture Notes in Computer Science

bet	51/88
Sana	16.12.2017
Hajmi	12.42 Mb.
	#22381

1 ... 47 48 49 50 51 52 53 54 ... 88

Fig. 3.
6 Conclusion

4 Simulation Results

Now we will examine the small scale numerical simulations results. The number of

inputs was K = 5 and the number of output neurons was N = 3. Artificial zero-mean

vectors with uncorrelated elements were generated by the following equations:

.

randn(1,1)

.18

)

(5,

.5);

,1)

log(rand(1

1).

.5)

<

)

((rand(1,1

.145

)

(4,

/17.8);

.35sin(

)

(3,

11)/9).^5;

,23)

((rem(

.45

)

(2,

/2);

sin(

.45

)

(1,

i

x

i

x

i

i

x

i

i

x

i

i

x

⋅

Input signal is constructed as s = 0.47*mix*x, where mixing matrix mix is defined as

).

rand(

mix

In Fig.2 cosine of the angles between column vectors of matrix W and numerically

calculated eigenvectors of input signal covariance matrix were used as illustration of

the algorithm efficiency. Learning rate was chosen constant,

γ

i

= 3.45, for the first

15000 iterations, and then was set at

γ

i

= 0.115. Individual part was taken as α=0.5.

Initial value for matrix W was selected as W

init

=-.5+rand(5,3). In Fig. 3 we can see

that proposed method can be efficiently used for extraction of the deterministic input

signals (under the assumption that mixing matrix is orthogonal).

First column in Fig. 3 represents original signals, second column represents input

signals obtained from original signals by multiplication with orthonormal matrix mix

( mix = mix (mix

T

mix)

-.5

), and last column represents output, or reconstructed signals,

obtained by implementation of MMHO.

0.5

1.5

2.5

x 10

0.2

0.4

0.6

0.8

1.2

1.4

Number of iterations

s(

w

i)

Fig. 2. Cosine of angles between column vectors of W and three principal eigenvectors of input

covariance matrix versus the number of iterations

Modified Modulated Hebb-Oja Learning Rule

533

0

50

100

150

200

250

-1

Output

100

150

200

250

-0.5

0.5

100

150

200

250

-0.2

0.2

100

150

200

250

-1

Input

100

150

200

250

-0.5

0.5

100

150

200

250

-0.5

0.5

100

150

200

250

-0.5

0.5

100

150

200

250

-0.5

0.5

100

150

200

250

-0.5

0.5

Original signals

100

150

200

250

-2

100

150

200

250

-0.5

0.5

100

150

200

250

-0.1

0.1

100

150

200

250

-0.5

0.5

Fig. 3. Blind signal extraction of deterministic components (mixing matrix is orthonormal)

5 Comparison to a Part of the Frog Retinal Wiring

Equation (6) can be rewritten as

(

)

(

)

(

)

1,..,

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

2

N

k

i

y

i

y

i

y

i

i

y

i

i

i

i

i

y

i

i

y

i

i

i

i

k

k

j

j

k

k

k

k

k

k

k

k

⎟

⎠

⎞

⎜

⎝

⎛

−

∑

−

=

w

x

y

x

w

x

w

w

αγ

(7)

Proposed learning rule can be implemented by the network structure shown in Fig. 4.

By analyzing the proposed structure, we can see that some degree of similarity

with the part of the frog’s retinal wiring exists [6]. We can make the following

analogies:

•

the inputs x

,…,x

can be viewed as photoreceptors

•

the outputs y

,…,y

correspond to bipolar cells (BC)

•

circuit for calculation of the input energy marked as

can be viewed as

horizontal cell

•

units labeled as A

and which calculate the output partial powers

⎟

⎠

⎞

⎜

⎝

⎛

∑

−

)

(

)

(

)

(

2

i

y

i

y

i

A

k

k

j

j

k

could be related to the amacrine cells (AC)

•

IP unit can be related to interplexiform cell (output power = IP = A

)

534

M. Jankovic et al.

N

y

Input

Output

AN

Fig. 4. Structure of the computational circuit that can be used for realization of MMHO

algorithm (small empty circle at the end of an arrow denotes the square function)

•

we can notice presence of backward inhibitory projections from amacrine cells

to the bipolar cells

•

it is possible to see that A

cells are serially connected which can possibly

explain serial synapsing between amacrine cells in frog’s retina.

Although there is no physiological evidence which can support the similarity with the

part of the frog’s retinal processing, high degree of structural similarity can be sign of

possible similarity in the physiological sense. Although, some further refinement of

the circuit structure is necessary in order to match known results (like reciprocal

connections between ACs, or fact that back projections of amacrine cell will finish on

BC terminals rather than on their dendrite) here we will suggest that species which

have IP cells damaged will have significantly deteriorated selectivity abilities.

6 Conclusion

In this paper a biologically plausible method that performs PCA has been proposed.

In the proposed method, learning rule for individual synaptic efficacy does not require

explicit information about the other synaptic efficacies, especially those related to

other neurons. Simplicity of the “neural circuits” that perform global computations

and a fact that their number does not depend on the number of input and output

neurons, could be seen as good features of the proposed method. Comparing to some

other recursive PCA methods that use local feedback connections in order to maintain

stability, the MMHO method uses global and local feedback connections. Some

degree of similarity of the circuit that can be used for the realization of the proposed

learning rule with the part of the retinal wiring of the frog’s retina has been suggested.

Modified Modulated Hebb-Oja Learning Rule

535

It is suggested that the role of the interplexiform cells in overall retinal processing

could be significant, although it is usually assumed that their role is not important.

References

1. Abed-Meraim, K., Attallah, V., Ckheif, A., Hua, Y.: Orthogonal Oja algorithm. IEEE

Signal Processing Letters 7, 116–119 (2000)

2. Baldi, P., Hornik, K.: Learning in linear neural networks: A survey. IEEE Trans. Neural

Networks 6, 837–858 (1995)

3. Bienenstock, E., Cooper, L.N., Munro, P.W.: Theory of the development of the neuron

selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of

Neuroscience, 32-48 (1982)

4. Chen, T.-P., Amari, S.-I.: Unified stabilization approach to principal and minor

components. Neural Networks 14, 1377–1387 (2001)

5. Cichocki, A., Amari, S.–I.: Adaptive Blind Signal and Image Processing – Learning

Algorithms and Applications. John Wiley and Sons, Chichester (2003)

6. Dowling, J.: The Retina: An approachable part of the brain. The Belknap Press of Harward

University Press (1987)

7. Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality

constraints. SIAM Journal on Matrix Analysis Applications 20, 303–353 (1998)

8. Douglas, S.C., Kung, S.Y., Amari, S.-I.: A self-stabilized minor subspace rule. IEEE

Signal Processing Letters 5, 328–330 (1998)

9. Fiori, S.: A theory for learning by weight flow on Stiefel-Grassman Manifold. Neural

Computation 13, 1625–1647 (2001)

10. Földiák, P.: Adaptive network for optimal linear feature extraction. In: IJCNN 1989,

Washington, D.C., USA, pp. 401–405 (1989)

11. Jankovic, M.: A new modulated Hebbian learning rule – Method for local computation of

a principal subspace. In: ICONIP2001, Shanghai, China, vol. 1, pp. 470–475 (2001)

12. Jankovic, M.: A new simple ∞OH neuron model as a biologically plausible principal

component analyzer. IEEE Trans. on Neural Networks 14, 853–859 (2003)

13. Jankovic, M., Ogawa, H.: A new modulated Hebb learning rule – Biologically plausible

method for local computation of principal subspace. Int. J. Neural Systems 13, 215–224

(2003)

14. Jankovic, M., Reljin, B.: Neural learning on Grassman/Stiefel principal/minor

submanifold. In: EUROCON 2005, Serbia, pp. 249–252 (2005)

15. Jankovic, M., Ogawa, H.: Modulated Hebb-Oja learning rule – A method for principal

subspace analysis. IEEE Trans. on Neural Networks 17, 345–356 (2006)

16. Ljung, L.: Analysis of recursive stochastic algorithms. IEEE Trans. Automat. Contr. 22,

551–575 (1977)

17. Möller, R., Konies, A.: Coupled principal component analysis. IEEE Trans. on Neural

Networks 15, 214–222 (2004)

18. Oja, E.: A simplified neuron model as a principal component analyzer. J. Math. Biol. 15,

267–273 (1982)

19. Oja, E.: Subspace Method of Pattern Recognition. Research Studies Press and J. Wiley,

Letchworth (1983)

20. Oja, E., Karhunen, J.: On stochastic approximation of the eigenvectors and eigenvalues of

the expectation of a random matrix. J. Math. Anal., Appl. 106, 69–84 (1985)

536

M. Jankovic et al.

21. Oja, E., Ogawa, H., Wangviwattana, J.: Principal component analysis by homogeneous

neural networks, Part I: The weighted subspace criterion. IEICE Trans. Inf.&Syst. E75-D,

366–375 (1992)

22. Peltonen, J., Kaski, S.: Discriminative components of data. IEEE Trans. on Neural

Networks 16, 68–83 (2005)

23. Plumbley, M.D., Oja, E.: A “nonnegative PCA” algorithm for independent component

analysis. IEEE Trans. on Neural Networks 15, 66–76 (2004)

24. Tanaka, T.: Generalized weighted rules for principal components tracking. IEEE Trans. on

Signal Processing 53, 1243–1253 (2005)

25. Waheed, K., Salem, F.M.: Blind information-theoretic multiuser detection algorithms for

DS-CMDA and WCDMA downlink systems. IEEE Trans. on Neural Networks 16, 937–

948 (2005)

26. Xu, L.: Least mean square error reconstruction principle for self-organizing neural nets.

Neural Networks 6, 627–648 (1993)

27. Yang, B.: Projection approximation subspace tracking. IEEE Trans. on Signal

Processing 43, 95–107 (1995)

28. Zhang, Y., Weng, J., Hwang, W.-S.: Auditory learning: a developmental method. IEEE

Trans. on Neural Networks 16, 601–616 (2005)

Orthogonal Shrinkage Methods for

Nonparametric Regression under Gaussian Noise

Katsuyuki Hagiwara

Faculty of Education, Mie University,

1577 Kurima-Machiya-cho, Tsu 514-8507 Japan

hagi@edu.mie-u.ac.jp

Abstract. In this article, for regression problems, we proposed shrink-

age methods in training a machine in terms of a regularized cost func-

tion. The machine considered in this article is represented by a linear

combination of ﬁxed basis functions, in which the number of basis func-

tions, or equivalently, the number of adjustable weights is identical to

the number of training data. This setting can be viewed as a nonpara-

metric regression method in statistics. In the regularized cost function

employed in this article, the error function is deﬁned by the sum of

squared errors and the regularization term is deﬁned by the quadratic

form of the weight vector. By assuming i.i.d. Gaussian noise, we pro-

posed three thresholding methods for the orthogonal components which

are obtained by eigendecomposition of the Gram matrix of the vectors

of basis function outputs. The ﬁnal weight values are obtained by a lin-

ear transformation of the thresholded orthogonal components and are

shrinkage estimators. The proposed methods are quite simple and auto-

matic, in which the regularization parameter is enough to be ﬁxed for

a small constant value. Simple numerical experiments showed that, by

comparing the leave-one-out cross validation method, the computational

costs of the proposed methods are strictly low and the generalization

capabilities of the trained machines are comparable when the number of

data is relatively large.

Introduction

This article considers a regression method using learning machines that are de-

ﬁned by a linear combination of ﬁxed basis functions. Especially, we focus on the

machines in which the number of basis functions, or equivalently, the number of

adjustable weights is identical to the number of training data. This problem is

viewed as a nonparametric regression method in statistics. In machine learning,

support vector regressions belong to this type of regression, in which the ker-

nel trick together with the representer theorem yields a linear combination of

kernel functions(e.g.[4]). In recent works, it has also been proposed variations of

support vector machines, such as least squared support vector machines[8], or

equivalently, regularized least squares[6].

In training the above stated machines, regularization methods are naturally

introduced because of the following two reasons. The ﬁrst reason is to stabilize

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 537–546, 2008.

c Springer-Verlag Berlin Heidelberg 2008

538

K. Hagiwara

the training process. Since the number of basis functions is large, the basis func-

tions can be nearly linearly dependent depending on choices of basis functions.

In this case, for example, the least squares estimators of weights are numerically

unstable solutions. The second reason is to avoid over-ﬁtting. For example, the

outputs of the above stated machine with the least squares estimators are iden-

tical to output data. Obviously, such outputs may not generalize to unseen data

well due to over-ﬁtting. The generalization capability of the trained machine

is generally sensitive to the choice of a regularization parameter value which is

often chosen by using a cross validation method in applications. In this article,

we consider rather other methods for improving the generalization performance

while the regularizer is needed for stabilizing the training process.

In this article, we propose shrinkage methods for training the above stated

machine in terms of a regularized cost function. In the regularized cost func-

tion, the error function is deﬁned by the sum of squared errors and the reg-

ularizer is deﬁned by the quadratic form of the weight vector. Therefore, the

formulation in this article is almost equivalent to least squares support vector

machines/regularized least squares. However, it is slightly general since the ba-

sis functions are not assumed to be the Mercer kernels. Eigendecomposition is

ﬁrstly applied to the Gram matrix of the vectors of basis function outputs. It

is viewed as orthogonalization of the vectors of basis function outputs. This

procedure has also been applied for reducing computational costs in training

kernel machines[9]. We then obtain the coeﬃcients of the orthogonalized basis

outputs by a linear transformation of the regularized estimators of weights. The

coeﬃcients are referred to as orthogonal components here. By assuming i.i.d.

Gaussian noise, we consider to keep or remove the orthogonal components by a

hard thresholding method, in which the orthogonal components to be removed

are set to zero. By introducing an idea of the universal threshold level used in

wavelet denoising[5], we propose three hard thresholding methods for the or-

thogonal components. The ﬁnal weights are obtained by a linear transformation

of the coeﬃcients of the thresholded orthogonal components and are shrinkage

estimators. Although our methods include a regularization parameter, it can be

ﬁxed for a small value. This is because the machines trained by our methods

have compact representations in an orthogonal domain and the contribution of

the regularizer in improving the generalization performance is not signiﬁcant.

Therefore, we need the regularizer only for ensuring the numerical stability of

training and it is enough to be ﬁxed for a small value.

There are several related works to our methods. The lasso implements the l

regularizer which is known to act as a thresholding method[7]. A thresholding

operation of the lasso is directly applied on weights of basis functions while our

methods are applied on orthogonal components. In the lasso, the threshold level

is determined by a regularization parameter while the levels are given with a the-

oretical implication in our methods. The regularization parameter may be chosen

by cross validation methods in the lasso, which is time consuming compared to

our methods since it can be ﬁxed in our methods. Recently, the Gram-Schmidt

type orthogonalization procedure has been introduced to least squares support

Orthogonal Shrinkage Methods for Nonparametric Regression

539

vector machines(e.g. [2]), where the orthogonal components are chosen via a

greedy manner according to the residual error. [2] further proposed a method in

which regularization parameters are assigned individually to orthogonal compo-

nents and are updated based on the Bayesian log evidence. This method includes

some parameters such as a criterion for determining the number of orthogonal

components and the number of updates, which are heuristically determined. Our

methods have an advantage at this point since they do not include parameters

that are needed to be empirically adjusted.

This article is organized as follows. In Section 2, we formulate a learning machine

and its training procedure using a regularization method. We also introduce eigen-

decomposition or orthogonalization of basis function outputs here. In Section 3,

we give shrinkage methods. Some numerical experiments for the proposed methods

including comparisons with the leave-one-out cross validation method are shown

in Section 4. Conclusions and future works are given in Section 5.

Formulation

2.1

Regression Problem and Regularized Cost

Let

{(x

, y

) : x

∈ R

, y

∈ R, i = 1, . . . , n} be a set of input-output training

data. We assume that the output data are generated according to the following

rule.

= h(x

) + e

, i = 1, . . . , n,

(1)

where h is a ﬁxed target function on

and e

, . . . , e

are i.i.d. noise that are

assumed to be sampled from a Gaussian distribution N (0, σ

). x

, . . . , x

are

independent samples from a probability distribution on

and are independent

of y

, . . . , y

.

We consider curve-ﬁtting by using a machine:

(x) =

j=1

(x), x

∈ R

(2)

where w = (w

, . . . , w

)

∈ R

is a coeﬃcient or weight vector. For simplifying the

subsequent discussions, we deﬁne y = (y

, . . . , y

) , where denotes the transpose.

Also we redeﬁne w as a vertical vector; i.e. w = (w

, . . . , w

) . Let G be the n

× n

matrix whose (i, j) element is g

). We deﬁne g

= (g

), . . . , g

)) that is

the jth column vector of G. We assume that g

, . . . , g

are linearly independent.

We train the machine in terms of the regularized cost function deﬁned by

C(w) = E(w) + λR(w), λ

∈ (0, ∞),

(3)

where E is an error function, R is a regularization term and λ is a regularization

parameter. In this article, we employ

E(w) = y

− Gw

(4)

R(w) = w

(5)

where

· denotes the Euclidean norm.

540

K. Hagiwara

2.2

Training Via Eigendecomposition of Gram Matrix

We deﬁne F

= (G G + λI

), where I

is the n

× n identity matrix. Under (3),

(4) and (5), the estimate of the weight vector is given by

λ

= (w

λ,1

, . . . , w

λ,n

) = F

−1

G y

(6)

and the minimum of the regularized cost function is written by

C(w

λ

) = y y

− w

(7)

Since the column vectors of G are linearly independent, G G is positive deﬁnite,

where G G is the Gram matrix of (g

, . . . , g

). Then, there exists an orthonormal

matrix Q such that Q (G G)Q = Γ where Γ = diag(γ

, . . . , γ

) with γ

> 0 for

any k. Here, diag(a

, . . . , a

) denotes a diagonal matrix whose (k, k) element is

k

. Note that QΓ Q = G G holds since Q is orthonormal. γ

is the kth eigen

value and the kth column vector of Q is the corresponding eigen vector. We

deﬁne

= diag(γ

+ λ, . . . , γ

+ λ).

(8)

Then, Q F

Q = Γ

and QΓ

Q = F

hold. By using Q and Γ

, (6) and (7) are

written by

= QΓ

−1

Q G y

(9)

C(w

) = y y

− v

(10)

where

= (v

λ,1

, . . . , v

λ,n

) = Γ

−1/2

Q G y.

(11)

By (9) and (11), w

and v

are linked by the equation

λ

= QΓ

−1/2

(12)

= Γ

1/2

Q w

(13)

is a coordinate vector relative to G while v

is one relative to Q whose column

vectors are orthonormal. We refer to v

λ,1

, . . . , v

λ,n

as orthogonal components.

Shrinkage Methods

We deﬁne w

∗

= (G G)

−1

G h, where h = (h(x

), . . . , h(x

)) . We also deﬁne

∗

λ

= (v

∗

λ,1

, . . . , v

∗

λ,n

) = Γ

−1/2

Γ Q w

∗

(14)

= diag

+ λ

, . . . ,

+ λ

(15)

Then, the conditional distribution of v

= v

) given x

= (x

, . . . , x

) is

shown to be v

∼ N(v

∗

, σ

). The components with v

∗

λ,i

= 0 can be viewed

Orthogonal Shrinkage Methods for Nonparametric Regression

541

as purely noise components which we need to remove. We now consider to keep or

remove the orthogonal components by a hard thresholding method. Let T

n,i

(v

λ,i

)

be the ith component after thresholding; i.e. T

n,i

(v

λ,i

) = v

λ,i

if it is kept

n,i

λ,i

) = 0 if it is removed. We deﬁne T

) = (T

1,n

λ,1

), . . . , T

n,n

λ,n

)) .

Then, training by using a thresholding procedure is as follows:

1. Fix a small positive value for λ.

2. By eigendecomposition of G G, we obtain eigen values (γ

, . . . , γ

) and an

orthonormal matrix Q whose column vectors are the corresponding eigen

vectors. By using (11) with (8), we obtain v

.

3. Calculate T

n,i

λ,i

), i = 1, . . . , n.

4. By (12), we obtain w

= (w

λ,1

, . . . , w

λ,n

) = QΓ

−1/2

) as a trained

weight vector after thresholding. Then, the output of a trained machine is

given by f

(x) =

j=1

λ,j

(x) for any x

∈ R

By the deﬁnition of T

n,i

, i = 1, . . . , n, it is easy to see that the resulting w

a shrinkage estimator of w

. Typically, if we set threshold levels

{θ

n,1

, . . . , θ

n,n

}

for each of orthogonal components then we can calculate the thresholded com-

ponents by

i,n

λ,i

) =

λ,i

> θ

n,i

λ,i

≤ θ

n,i

, i = 1, . . . , n.

(16)

We propose three methods as T

.

Component-wise hard thresholding(CHT). We ﬁrst consider to derive ap-

propriate threshold levels. To do this, we show a basic result on i.i.d. Gaussian

random variables. Let W

, . . . , W

be i.i.d. random variables having N (0, σ

).

We deﬁne C

= 2 log n + (

−1 + ) log log n, where is a constant. Then, it can

be shown that

lim

n

→∞

P max

≤i≤n

> C

= 0,

for

> 0

(17)

lim

→∞

P max

≤i≤n

≤ C

= 0,

for

< 0.

(18)

The strong convergence result with C

n,1

is known and C

n,1

is used as the

universal threshold level for removing irrelevant wavelet coeﬃcients in wavelet

denoising[5]. The universal threshold level is shown to be asymptotically equiv-

alent to the minimax one[5]. (17) and (18) are the weaker results while they

evaluate the O(log log n) term.

We return to our problem and consider the case of v

∗

λ

= 0

; i.e. all of v

λ,i

’s

are noise components. Here, 0

denotes the n-dimensional zero vector. We deﬁne

u = (u

1

, . . . , u

) that satisﬁes u

∼ N(0

n

, σ

). We deﬁne σ

= σ

/(γ

+ λ),

i = 1, . . . , n, where σ

= 0 for any i. We also deﬁne u = (u

, . . . , u

), where

= u

/σ

. Then, u

∼ N(0

, I

). By (17) and the deﬁnition of u, we have

n

i=1

> σ

P max

≤i≤n

> C

→ 0 (n → ∞),

(19)

542

K. Hagiwara

> 0. On the other hand, by using (18) and the deﬁnition of u, we have

i=1

≤ σ

P max

≤i≤n

≤ C

→ 0 (n → ∞),

(20)

if

< 0.

(19) tells us that, for any i, u

cannot exceed σ

with high probabil-

ity when n is large and

> 0. Therefore, if we employ σ

C

n,

with

> 0 as

component-wise threshold levels, (19) implies that they remove noise compo-

nents if it is. On the other hand, (20) tells us that there are some noise com-

ponents which satisfy u

> σ

with high probability when n is large and

< 0. Therefore, the ith component should be recognized as a noise component

if u

≤ σ

even when it is not. In other words, we can not distinguish non-

zero mean components from zero mean components in this case. Hence, σ

n,0

i = 1, . . . , n are critical levels for identifying noise components. We note that

these results are still valid when v

∗

λ,i

are not zero for some i but the number of

such components is very small. This is the case of assuming the sparseness of

the representation of h in the orthogonal domain.

As a conclusion, we propose a hard thresholding method by putting θ

n,i

=

σ

n,0

, i = 1, . . . , n in (16), where we set

= 0. We refer to this method

by component-wise hard thresholding(CHT). In practical applications of the

method, we need an estimate of noise variance σ

. Fortunately, in nonparametric

regression methods, [1] suggested to apply

y [I

− H

]

trace[(I

− H

)

]

(21)

where H

is deﬁned by H

= GF

−1

G = GQΓ

Q G .

Although the method includes a regularization parameter, it can be ﬁxed for

a small value. This is because the thresholding method keeps a trained machine

compact on the orthogonal domain, by which the contribution of the regularizer

may not be signiﬁcant in improving the generalization performance. Therefore

it is needed only for ensuring the numerical stability of the matrix calculations.

Since the basis functions can be nearly linearly dependent in practical appli-

cations, small eigen values are less reliable. We therefore ignore the orthogonal

components whose eigen values are less than a predetermined small value, say,

−16

. Although the run time of eigendecomposition is O(n

), the subsequent

procedures of CHT such as calculations of σ

and w

are achieved with less

computational costs by the eigendecomposition.

Hard thresholding with the universal threshold level(HTU). Basically

eigendecomposition of G G corresponds to the principal component analysis of

, . . . , g

. Therefore, for nearly linearly dependent basis functions, only several

eigen vectors are largely contributed. On the other hand, the components with

small eigen values are largely aﬀected by numerical errors. Therefore, it is natural

to take care of only the components with large eigen values. For a component

Orthogonal Shrinkage Methods for Nonparametric Regression

543

with a large eigen value, γ

λ holds since we can choose a small value for λ.

Thus, σ

holds by the deﬁnition of σ

in CHT. We then consider to apply

a single threshold level σ

n,0

instead of σ

i

C

n,0

, i = 1, . . . , n in CHT; i.e. we

set θ

n,i

= σ

n,0

in (16). This is a direct application of the universal threshold

level in wavelet denoising[5]. This method is referred to by hard thresholding

with the universal threshold level(HTU).

Backward hard thresholding(BHT). On the other hand, since the thresh-

old level derived here is the worst case evaluation for a noise level, CHT and

HTU have a possibility of yielding a bias between f

and h by unexpected

removes of contributed components. The component with a large eigen value is

composed by a linear sum of many basis functions, it may be a smooth compo-

nent. Therefore, removes of these components may yield a large bias. Actually,

in wavelet denoising, fast/detail components are the target of a thresholding

method and slow/approximation components are harmless by the thresholding

method[5], which may be a device for reducing the bias. For our method, we

also introduce this idea and consider the following procedure. We assume that

1

≤ γ

≤ · · · ≤ γ

. The method is that, by increasing j from 1 to n, we ﬁnd

the ﬁrst j = j for which v

λ,j

> σ

n,0

occurs. Then, thresholding is made by

j

(v

λ,j

) = v

λ,j

if j

≥ j and T

λ,j

) = 0 if j < j. This keeps components

with large eigen values and possibly reduces the bias. We refer to this method

by backward hard thresholding(BHT). BHT can be viewed as a stopping cri-

terion for choosing contributed components in orthogonal components that are

enumerated in order of the magnitudes of eigen values.

Numerical Experiments

4.1

Choice of Regularization Parameter

CHT, HTU and BHT do not include parameters to be adjusted except the

regularization parameter. Since thresholding of orthogonal components yields a

certain simple representation of a machine, the regularization parameter may

not be signiﬁcant in improving the generalization performance. To demonstrate

this property of our methods, through a simple numerical experiment, we see

the relationship between generalization performances of trained machines and

regularization parameter values.

The target function is h(x) = 5 sinc(8 x) for x

∈ R. x

1

, . . . , x

are randomly

drawn in the interval [

−1, 1]. We assume i.i.d. Gaussian noise with mean zero

and variance σ

= 1. The basis functions are Gaussian basis functions that are

deﬁned by g

(x) = exp

{−(x−x

)

/(2τ

)

}, j = 1, . . . , n, where we set τ

= 0.05.

In this experiment, under a ﬁxed value of a regularization parameter, we trained

machines for 1000 sets of training data of size n. At each trial, the test error is

measured by the mean squared error between the target function and the trained

machine, which is calculated on 1000 equally spaced input points in [

−1, 1].

Figure 1 (a) and (b) depict the results for n = 200 and n = 400 respectively,

which show the relationship between the averaged test errors of trained machines

544

K. Hagiwara

10

−6

10

−4

10

−2

10

0

10

2

0

0.05

0.1

reguralization parameter

averaged test error

(a)

n = 200

10

−6

10

−4

10

−2

10

0

10

2

0

0.05

0.1

reguralization parameter

averaged test error

(b)

n = 400

Fig. 1. The dependence of test errors on regularization parameters. (a) n = 200 and

(b) n = 400. The ﬁlled circle, open circle and open square indicate the results for the

raw estimate, CHT and BHT respectively.

and regularization parameter values. The ﬁlled circle, open circle and open square

indicate the results for the raw estimator, CHT and BHT respectively, where the

raw estimate is obtained by (6) at each ﬁxed value of a regularization parameter.

We do not show the result for HTU since it is almost the same as the result for

CHT. In these ﬁgures, we can see that the averaged test errors of our methods

are almost unchanged for small values of a regularization parameter while those

of the row estimates are sensitive to regularization parameter values. We can

also see that BHT is entirely superior to the raw estimate and CHT while CHT

is worse than the raw estimate around λ = 10

for both of n = 200 and 400. In

practical applications, the regularization parameter of the raw estimate should

be determined based on training data and the performance comparisons to the

leave-one-out cross validation method are shown in below.

4.2

Comparison with LOOCV

We compare the performances of the proposed methods to the performance of

the leave-one-out cross validation(LOOCV) choice of a regularization parame-

ter value. We see not only generalization performances but also computational

times of the methods. For the regularized estimator considered in this article,

it is known that the LOOCV error is calculated without training on validation

sets[3,6].

We assume the same conditions as the previous experiment. The CPU time is

measured only for the estimation procedure. The experiments are conducted by

using Matlab on the computer that has a 2.13 GHz Core2 CPU, 1 GByte memory.

Table 1 (a) and (b) show the averaged test errors and averaged CPU times of

LOOCV and our methods respectively, in which the standard deviations(divided

by 2) are also appended. The examined values for a regularization parameter in

LOOCV is

{m × 10

−j

: m = 1, 2, 5, j =

−4, −3, . . . , 3}. In our methods, the

Orthogonal Shrinkage Methods for Nonparametric Regression

545

Table 1. Test errors and CPU times of LOOCV, CHT, HTU and BHT

LOOCV

CHT

HTU

BHT

100

0.079

± 0.027

0.101

± 0.034

0.100

± 0.034

0.076

± 0.027

200

0.040

± 0.013

0.046

± 0.014

0.045

± 0.014

0.035

± 0.011

400

0.021

± 0.006

0.023

± 0.007

0.023

± 0.007

0.017

± 0.005

(a) Test errors

LOOCV

CHT

HTU

BHT

100

0.079

± 0.002

0.013

± 0.002

0.014

± 0.002

0.014

± 0.002

200

0.533

± 0.003

0.080

± 0.002

0.080

± 0.002

0.080

± 0.002

400

3.657

± 0.003

0.523

± 0.004

0.523

± 0.004

0.523

± 0.004

(b) CPU times

regularization parameter is ﬁxed at 1

× 10

−4

which is the smallest value in

candidate values for LOOCV.

Based on Table 1 (a), we ﬁrst discuss the generalization performances of the

methods. CHT and HTU are almost comparable. This implies that only the com-

ponents corresponding to large eigen values are contributed. CHT and HTU are

entirely worse than LOOCV in average while the diﬀerences are within the stan-

dard deviations for n = 200 and n = 400. BHT entirely outperforms LOOCV,

CHT and HTU in average while the diﬀerence between the averaged test error of

LOOCV and that of BHT is almost within the standard deviations. As pointed

out previously, CHT and HTU have a possibility to remove smooth components

accidentally since the threshold levels were determined based on the worst case

evaluation of dispersion of noise. The better generalization performance of BHT

compared with CHT and HTU in Table 1 (a) is caused by this fact. On the other

hand, as shown in Table 1 (b), our methods completely outperform LOOCV in

terms of the CPU times.

Conclusions and Future Works

In this article, we proposed shrinkage methods in training a machine by using

a regularization method. The machine is represented by a linear combination of

ﬁxed basis functions, in which the number of basis functions, or equivalently,

the number of weights is identical to that of training data. In the regularized

cost function, the error function is deﬁned by the sum of squared errors and

the regularization term is deﬁned by the quadratic form of the weight vector. In

the proposed shrinkage procedures, basis functions are orthogonalized by eigen-

decomposition of the Gram matrix of the vectors of basis function outputs.

Then, the orthogonal components are kept or removed according to the proposed

thresholding methods. The proposed methods are based on the statistical prop-

erties of regularized estimators of weights, which are derived by assuming i.i.d.

Gaussian noise. The ﬁnal weights are obtained by a linear transformation of the

thresholded orthogonal components and are shrinkage estimators of weights. We

546

K. Hagiwara

proposed three versions of thresholding methods which are component-wise hard

thresholding, hard thresholding with the universal threshold level and backward

hard thresholding. Since the regularization parameter can be ﬁxed for a small

value in our methods, our methods are automatic. Additionally, since eigende-

composition algorithms are included in many software packages and the thresh-

olding methods are simple, the implementations of our methods are quite easy.

The numerical experiments showed that our methods achieve relatively good

generalization capabilities in strictly less computational time by comparing with

the LOOCV method. Especially, the backward hard thresholding method out-

performed the LOOCV method in average in terms of the generalization perfor-

mance. As future works, we need to investigate the performance of our methods

on real world problems. Furthermore, we need to evaluate the generalization

error when applying the proposed shrinkage methods.

References

1. Carter, C.K., Eagleson, G.K.: A comparison of variance estimators in nonparametric

regression. J. R. Statist. Soc. B 54, 773–780 (1992)

2. Chen, S.: ‘Local regularization assisted orthogonal least squares regression. Neuro-

computing 69, 559–585 (2006)

3. Craven, P., Wahba, G.: Smoothing noisy data with spline functions. Numerische

Mathematik 31, 377–403 (1979)

4. Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines

and other kernel-based learning methods. Cambridge University Press, Cambridge

(2000)

5. Donoho, D.L., Johnstone, I.M.: Ideal spatial adaptation by wavelet shrinkage.

Biometrika 81, 425–455 (1994)

6. Rifkin, R.: Everything old is new again: a fresh look at historical approaches in

machine learning. Ph.D thesis, MIT (2002)

7. Tibshirani, R.: Regression shrinkage and selection via lasso. J.R. Statist. Soc. B 58,

267–288 (1996)

8. Suykens, J.A.K., Brabanter, J.D., Lukas, L., Vandewalle, J.: Weighted least squares

support vector machines: robustness and sparse approximation. Neurocomputing 48,

85–105 (2002)

9. Williams, C.K.I., Seeger, M.: Using the Nystr¨

om method to speed up kernel ma-

chines. In: Leen, T.K., Diettrich, T.G., Tresp, V. (eds.) Advances in Neural Infor-

mation Processing Systems, vol. 13, pp. 682–688 (2001)

A Subspace Method Based on

Data Generation Model with Class Information

Minkook Cho, Dongwoo Yoon, and Hyeyoung Park

School of Electrical Engineering and Computer Science

Kyungpook National University, Deagu, Korea

ucaresoft@paran.com, dongwoo0926@hanmail.net, hypark@knu.ac.kr

Abstract. Subspace methods have been used widely for reduction ca-

pacity of memory or complexity of system and increasing classiﬁcation

performances in pattern recognition and signal processing. We propose a

new subspace method based on a data generation model with intra-class

factor and extra-class factor. The extra-class factor is associated with

the distribution of classes and is important for discriminating classes.

The intra-class factor is associated with the distribution within a class,

and is required to be diminished for obtaining high class-separability. In

the proposed method, we ﬁrst estimate the intra-class factors and reduce

them from the original data. We then extract the extra-class factors by

PCA. For veriﬁcation of proposed method, we conducted computational

experiments on real facial data, and show that it gives better performance

than conventional methods.

Introduction

Subspace methods are for ﬁnding a low dimensional subspace which presents

some meaningful information of input data. They are widely used for high di-

mensional pattern classiﬁcation such as image data owing to two main reasons.

First, by applying a subspace method, we can reduce capacity of memory or

complexity of system. Also, we can expect to increase classiﬁcation performances

by eliminating useless information and by emphasizing essential information for

classiﬁcation.

The most popular subspace method is PCA(Principal Component Analysis)

[10,11,8] and FA(Factor Analysis)[6,14] which are based on data generation mod-

els. The PCA ﬁnds a subspace of independent linear combinations (principal

components) that retains as much of the information in the original variables

as possible. However, PCA method is an unsupervised method, which does not

use class information. This may cause some loss of critical information for clas-

siﬁcation. Contrastively, LDA(Linear discriminant analysis)[1,4,5] method is a

supervised learning method which uses information of the target label of data

set. The LDA method attempts to ﬁnd basis vectors of subspace maximizing the

linear class separability. It is generally known that LDA can give better classiﬁ-

cation performance than PCA by using class information. However, LDA gives

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 547–555, 2008.

c Springer-Verlag Berlin Heidelberg 2008

548

M. Cho, D. Yoon, and H. Park

at most k-1 basis of subspace for k-class, and cannot extract features stably for

the data set with limited number of data in each class.

Another subspace method for classiﬁcation is the intra-person space method[9],

which is developed for face recognition. The intra-person space is deﬁned as a

diﬀerence between two facial data from same person. For dimension reduction,

the low dimensional eigenspace is obtained by applying PCA to the intra-person

space. In the classiﬁcation tasks, raw input data are projected to the intra-personal

eigenspace to get the low dimensional features. The intra-person method showed

better performance than PCA and LDA in the FERET Data[9]. However, it is not

based on data generation model and it cannot give a sound theoretical reason why

the intra-person space gives good information for classiﬁcation.

On the other hand, a data generation model with class information has re-

cently been developed[12]. It is a variant of the factor analysis model with two

type of factors; class factor(what we call extra-class factor) and environment

factor(what we call intra-class factor). Based on the data generation model, the

intra-class factor is estimated by using diﬀerence vectors between two data in the

same class. The estimated probability distribution of intra-class factor is applied

to measuring the similarity of data for classiﬁcation. Through this method takes

similar approaches to the intra-person method in the sense that it is using the

diﬀerence vectors to get the intra-class information and similarity measure, it is

based on the data generation model which can give an explanation on the devel-

oped similarity measure. Still, it does not include dimension reduction process

and another subspace method is necessary for high-dimensional data.

In this paper, we propose an appropriate subspace method for the data gen-

eration model developed in [12]. The proposed method ﬁnds subspace which

undercuts the eﬀect of intra-class factors and enlarges the eﬀect of extra-class

factors based on the data generation model. In Section 2, the model will be

explained in detail.

Data Generation Model

In advance of deﬁning the data generation model, let us consider that we ob-

tained several images(i.e. data) from diﬀerent persons(i.e. class). We know that

pictures of diﬀerent persons are obviously diﬀerent. Also, we know that the pic-

tures of a person is not exactly same due to some environmental condition such

as illumination. Therefore, it is natural to assume that a data consists of the

intra-class factor which represents within-class variations such as the variation

of pictures of same person and the extra-class factor which represents between-

class variations such as the diﬀerences between two persons.

Under this condition, a random variable x for observed data can be written

as a function of two distinct random variable ξ and η, which is the form,

x = f (ξ, η),

(1)

where ξ represents an extra-class factor, which keeps some unique information in

each class, and η represents an intra-class factor which represents environmental

A Subspace Method Based on Data Generation Model

549

variation in the same class. In [12], it was assumed that η keeps information of

any variation within class and its distribution is common for all classes. In order

to explicitly deﬁne the generation model, a linear additive factor model was

applied such as

i

= W ξ

+ η.

(2)

This means that a random sample x

in class C

is generated by the summation

of a linearly transformed class prototype ξ

and a random class-independent

variation η.

In this paper, as an extension of the model, we assume that the low dimen-

sional intra-class factor is linearly transformed to generate an observed data x.

Therefore, function f is deﬁned as

i

= W ξ

+ V η

(3)

This model implies that a data of a speciﬁc class is generated by the extra-class

factor ξ which gives some discriminative information among class and the intra-

class factor η which represents some variation in a class. In this equation, W and V

are transformation matrices of corresponding factors. We call W the extra-class

factor loading and call V the intra-class factor loading. Figure 1 represents this

data generation model. To ﬁnd a good subspace for classiﬁcation, we try to ﬁnd V

and W using class information which is given with the input data. In Section 3,

we explain how to ﬁnd the subspace based on the data generation model.

Download 12.42 Mb.

Do'stlaringiz bilan baham:

1 ... 47 48 49 50 51 52 53 54 ... 88