Parallel Reinforcement Learning for Weighted Multi-criteria Model with Adaptive Margin

Kazuyuki Hiraoka, Manabu Yoshida, and Taketoshi Mishima
Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Japan
hira@mail.saitama-u.ac.jp

Abstract. Reinforcement learning (RL) for a linear family of tasks is studied in this paper. The key of our discussion is nonlinearity of the optimal solution even though the task family is linear; we cannot obtain the optimal policy by a naive approach. Though there exists an algorithm that calculates the result equivalent to Q-learning for each task all together, it suffers from explosion of set sizes. We introduce adaptive margins to overcome this difficulty.

1 Introduction

Reinforcement learning (RL) for a linear family of tasks is studied in this paper. Such learning is useful for time-varying environments, multi-criteria problems, and inverse RL [5,6]. The family is defined as a weighted sum of several criteria; it is linear in the sense that the reward is linear with respect to the weight parameters. For instance, criteria of network routing include end-to-end delay, loss of packets, and the power level associated with a node [5]. Selecting appropriate weights beforehand is difficult in practice and requires trial and error. In addition, appropriate weights may change over time. Parallel RL for all possible weight values is desirable in such cases.

The key of our discussion is nonlinearity of the optimal solution; it is in fact not linear but piecewise-linear. This fact implies that we cannot obtain the best policy by the following naive approach:

1. Find the value function for each criterion.
2. Calculate the weighted sum of these functions to obtain the total value function.
3. Construct a policy on the basis of the total value function.

A typical example is presented in section 5. Piecewise-linearity of the optimal solution has been pointed out independently in [4] and [5].
The latter aims at fast adaptation under time-varying environments. The former is our previous report, in which we tried to obtain the optimal solutions for various weight values all together. Though we developed an algorithm that gives a solution exactly equivalent to Q-learning for each weight value, it has a difficulty with explosion of set size. This difficulty is not a problem of the algorithm but an intrinsic nature of Q-learning for the weighted criterion model.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 487–496, 2008.
© Springer-Verlag Berlin Heidelberg 2008

We introduced a simple approximation with a 'margin' into the decision of convexity first [6]. Then we improved it so that we obtain an interval estimation and can monitor the effect of the approximation [7]. In this paper, we propose adaptive adjustment of margins. In the margin-based approach, we have to manage large sets of vectors in the first stage of learning, and the peak of the set size tends to be large if we set a small margin to obtain an accurate final result. The proposed method reduces this trade-off: by changing margins appropriately through the learning steps, we can enjoy small set sizes in the first stage with large margins, and an accurate result in the final stage with small margins.

The weighted criterion model is defined in section 2, and parallel RL for it is described in section 3. Then the difficulty of set size is pointed out and margins are introduced in section 4. Adaptive adjustment of margins is also proposed there. Its behavior is verified with experiments in section 5. Finally, a conclusion is given in section 6.

2 Weighted Criterion Model

An "orthodox" RL setting is assumed for states and actions as follows.

– The time step is discrete (t = 0, 1, 2, 3, ...).
– The state set S and the action set A are finite and known.
– The state transition rule P is unknown.
– The state s_t is observable.
– The task is a Markov decision process (MDP).

The reward r_{t+1} is given as a weighted sum of partial rewards r^1_{t+1}, ..., r^M_{t+1}:

    r_{t+1}(β) = Σ_{i=1}^M β_i r^i_{t+1} = β · r_{t+1},   (1)

with

    weight vector β ≡ (β_1, ..., β_M) ∈ R^M,   (2)
    reward vector r_{t+1} ≡ (r^1_{t+1}, ..., r^M_{t+1}) ∈ R^M.   (3)

We assume that the partial rewards r^1_{t+1}, ..., r^M_{t+1} are also observable, whereas their reward rules R(1), ..., R(M) are unknown. Multi-criteria RL problems of this type have been introduced independently in [3] and [5]. We hope to find the optimal policy π*_β for each weight β that maximizes the expected cumulative reward with a given discount factor 0 < γ < 1,

    π*_β = argmax_π E_π[ Σ_{τ=0}^∞ γ^τ r_{τ+1}(β) ],   (4)

where E_π[·] denotes the expectation under a policy π. To be exact, π*_β is defined as a policy that attains Q^{π*_β}_β(s, a; γ) = Q*_β(s, a; γ) ≡ max_π Q^π_β(s, a; γ) for all state-action pairs (s, a), where the action-value function Q^π_β is defined as

    Q^π_β(s, a; γ) ≡ E_π[ Σ_{τ=0}^∞ γ^τ r_{τ+1}(β) | s_0 = s, a_0 = a ].   (5)

It is well known that an MDP has a deterministic policy π*_β satisfying this condition; such a π*_β is obtained from the optimal value function [2],

    π*_β : S → A : s ↦ argmax_{a∈A} Q*_β(s, a; γ).   (6)
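As a small concrete check of the scalarization in (1)–(4), the sketch below (Python assumed; the trajectory, weights and partial rewards are invented for illustration, not taken from the paper) computes the weighted reward β · r_{t+1} and the discounted return that π*_β maximizes, and illustrates that the return of a fixed trajectory is linear in β.

```python
# Sketch of the weighted criterion model in (1)-(4); numbers are illustrative.

def weighted_reward(beta, partial_rewards):
    """r_{t+1}(beta) = beta . r_{t+1}, eq. (1)."""
    return sum(b * r for b, r in zip(beta, partial_rewards))

def discounted_return(beta, trajectory_rewards, gamma):
    """Sum_tau gamma^tau * r_{tau+1}(beta), the quantity maximized in (4)."""
    return sum((gamma ** tau) * weighted_reward(beta, r)
               for tau, r in enumerate(trajectory_rewards))

# Two partial criteria (M = 2) over a short fixed trajectory.
traj = [(1.0, 0.0), (0.0, 4.0), (1.0, 1.0)]
print(discounted_return((0.5, 2.0), traj, gamma=0.8))  # ~ 8.5
```

For a fixed trajectory the return is linear in β; the nonlinearity of Q*_β discussed below only appears once the policy is optimized per weight.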
Thus we concentrate on estimation of Q ∗ β . Note that Q ∗ β is nonlinear with respect to β. A typical example is presented in section 5. Basic properties of the action-value function Q are described briefly in the rest of this section [4,5,6]. The discount factor γ is fixed through this paper, and it is omitted below. Proposition 1. Q π β
β for a fixed policy π. Proof. Neither P nor π depend on β from assumptions. Hence, joint distribution of (s
0 , a
0 ), (s
1 , a
1 ), (s
2 , a
2 ), . . . is independent of β. It implies linearity. Definition 1. If f : R M → R can be written as f(β) = max q∈Ω ( q · β) with a nonempty finite set Ω ⊂ R
M , we call f Finite-Max-Linear (FML) and write it as f = FML Ω . It is trivial that f is convex and piecewise-linear if f is FML. Proposition 2. The optimal action-value function is FML as a function of the weight β. Namely, there exists a nonempty finite set Ω ∗ (s, a)
⊂ R M for each state-action pair (s, a), and Q ∗ β is written as Q ∗ β (s, a) =
max q∈Ω
∗ ( s,a) q · β. (7)
Proof. We have assumed MDP. It is well known that Q ∗ β can be written as Q ∗ β (s, a) = max π∈Π Q
β (s, a) for the set Π of all deterministic policies. Π is finite, and Q π β is linear with respect to β from proposition 1. Hence, Q ∗ β
FML. Proposition 3. Assume that an estimated action-value function Q β is FML as a function of the weight β. If we apply Q-learning, the updated Q new
β (s t , a t ) = (1 − α)Q β (s t , a
t ) + α
β · r t+1
+ γ max a∈A
Q β (s t+1 , a)
(8) is still FML as a function of β, where α > 0 is the learning rate. Proof. There exists a nonempty finite set Ω(s, a) ⊂ R M
β (s, a) =
max q∈Ω(s,a)
( q · β) for each (s, a). Then (8) implies Q new β
t , a
t ) = max
˜ q∈ ˜
Ω ˜ q · β, where ˜ Ω ≡ (1 − α)q + α(r t+1 + γ
q ) a ∈ A, q ∈ Ω(s t , a t ), q ∈ Ω(s t+1 , a) , (9) because max x f(x) + max y g(y) = max x,y (f (x) + g(y)) holds in general. The set ˜ Ω is finite, and Q new β
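The construction of Ω̃ in (9) is directly implementable as set arithmetic on vectors. A naive sketch (Python assumed; vectors are stored as tuples, and function names are ours, not the paper's), together with a check that the FML value of the updated set matches the scalar Q-learning update (8) for a chosen β:

```python
# Naive FML update of eq. (9): every combination of q in Omega(s_t, a_t)
# and q' in Omega(s_{t+1}, a), over all actions a, yields one new vector.

def fml_value(omega, beta):
    """FML_Omega(beta) = max_{q in Omega} q . beta."""
    return max(sum(qi * bi for qi, bi in zip(q, beta)) for q in omega)

def naive_update(omega_sa, omegas_next, r, alpha, gamma):
    """Return Omega~ of eq. (9); omegas_next lists Omega(s_{t+1}, a) per action."""
    new = set()
    for omega_next in omegas_next:            # union over a in A
        for q in omega_sa:
            for qp in omega_next:
                new.add(tuple((1 - alpha) * qi + alpha * (ri + gamma * qpi)
                              for qi, ri, qpi in zip(q, r, qp)))
    return new
```

For any fixed β, `fml_value(naive_update(...), beta)` reproduces the right-hand side of (8) evaluated at that β, which is exactly the equivalence the proof asserts.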
These propositions imply that (1) the true Q*_β is FML, and (2) its estimation Q_β is also FML as long as the initial estimation is FML.

3 Parallel Q-Learning for All Weights

A parallel Q-learning method for the weighted criterion model has been proposed in [6]. The estimations Q_β for all β ∈ R^M are updated all together in parallel Q-learning. In this method, Q_β(s, a) for each (s, a) is kept in an FML expression:

    Q_β(s, a) = max_{q∈Ω(s,a)} q · β = FML_{Ω(s,a)}(β)   (10)

with a certain set Ω(s, a) ⊂ R^M. We store and update Ω(s, a) instead of Q_β(s, a) on the basis of propositions 2 and 3. Though a naive updating rule has been suggested in the proof of proposition 3, it is extremely redundant and inefficient. We need several definitions to describe a better algorithm.

Definition 2. An element c ∈ Ω is redundant if FML_{Ω−{c}} = FML_Ω.

Definition 3. We use Ω† to denote the set of non-redundant elements of Ω. Note that FML_{Ω†} = FML_Ω [5].

Definition 4. We define the following operations:

    cΩ ≡ {cq | q ∈ Ω},   c + Ω ≡ {c + q | q ∈ Ω},   (11)
    Ω ⊎ Ω' ≡ (Ω ∪ Ω')†,   ⊎_{k=1}^K Ω_k ≡ (∪_{k=1}^K Ω_k)†,   (12)
    Ω ⊕ Ω' ≡ {q + q' | q ∈ Ω, q' ∈ Ω'},   Ω ⊞ Ω' ≡ (Ω ⊕ Ω')†.   (13)

With these operations, the updating rule of Ω is described as follows [6]:

    Ω^new(s_t, a_t) = (1 − α)Ω(s_t, a_t) ⊞ α( r_{t+1} + γ ⊎_{a∈A} Ω(s_{t+1}, a) ).   (14)

The initial value of Ω at t = 0 is Ω(s, a) = {o} ⊂ R^M for all (s, a)
∈ S × A, where o is the zero vector. It corresponds to the constant initial function Q_β(s, a) = 0.

Proposition 4. When (10) holds for all states s ∈ S and actions a ∈ A, Q^new_β(s_t, a_t) in (8) is equal to FML_{Ω^new(s_t,a_t)}(β) for (14). Namely, parallel Q-learning is equivalent to Q-learning for each β: the update {Q_β(s, a)} → Q^new_β(s_t, a_t) and the update {Ω(s, a)} → Ω^new(s_t, a_t) commute with taking FML expressions,

    {Q_β(s, a)}  --(update)-->  Q^new_β(s_t, a_t)
         |                            |
    FML expression               FML expression
         |                            |
    {Ω(s, a)}    --(update)-->  Ω^new(s_t, a_t).   (15)
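For M = 2, the pruning Ω† and the sum operation of (13)–(14) reduce to convex hull computations on the plane. A sketch (Python assumed; function names are ours, and a naive O(|Ω||Ω'|) pairwise sum is used instead of the edge-merging procedure of Fig. 1):

```python
# Omega† for M = 2 via the convex hull (Andrew's monotone chain), and
# Omega [+] Omega' of eq. (13) as hull(pairwise sums).  For FML only the
# upper part of the hull matters, but the full hull is a safe superset.

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hull(points):
    pts = sorted(set(points))
    if len(pts) <= 2:
        return set(pts)
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return set(lower[:-1] + upper[:-1])

def boxsum(omega, omega2):
    """Omega [+] Omega' = (Omega ⊕ Omega')†, eq. (13), for M = 2."""
    return hull([(p[0] + q[0], p[1] + q[1]) for p in omega for q in omega2])
```

Edge-merging (Fig. 1) brings the Minkowski sum down to linear time in the number of vertices, but the hull-of-sums version above already realizes the same set Ω ⊞ Ω'.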
Fig. 1. Calculation of Ω ⊞ Ω' in (14) for two-dimensional convex polygons: (1) set directions of edges; (2) merge and sort edges according to their arguments; (3) connect edges to generate a polygon; (4) shift the origin. The maximal x (resp. y) coordinate of Ω ⊞ Ω' is the sum of the maximal x (resp. y) coordinates of Ω and Ω'. Vertices of the polygons correspond to Ω, Ω' and Ω ⊞ Ω'.

Proof. We introduced the set Ω̃ in (9) to prove proposition 3. With the above operations, (9) is written as

    Ω̃ = (1 − α)Ω(s_t, a_t) ⊕ α( r_{t+1} + γ ∪_{a∈A} Ω(s_{t+1}, a) ).

Then (Ω̃)† = Ω^new(s_t, a_t) is obtained, and FML_{Ω^new(s_t,a_t)}(β) = FML_{Ω̃}(β) = Q^new_β(s_t, a_t) is implied.

It is well known that Ω† is equal to the set of vertices of the convex hull of Ω [6]. Efficient convex hull algorithms have been developed in computational geometry [8]. Using them, we can calculate the merged set Ω ⊎ Ω' = (Ω ∪ Ω')†. The sum set Ω ⊞ Ω' has also been studied in the form of Minkowski sum algorithms [9,10,11]. Its calculation is particularly easy for two-dimensional convex polygons (Fig. 1).

Before closing the present section, we note an FML version of the Bellman equation in our notation. Theoretically, we can use successive iteration of this equation to find the optimal policy when we know P and R, though we must take care of numerical error in practice.

Proposition 5. The FML expression Q*_β = FML_{Ω*}(β) satisfies
    Ω*(s, a)† = R^a_s + γ ⊞⁺_{s'∈S} P^a_{ss'} ⊎_{a'∈A} Ω*(s', a'),   (16)

where

    R^a_s = ( R^a_s(1), ..., R^a_s(M) ),   R^a_s(i) = E[ r^i_{t+1} | s_t = s, a_t = a ],   (17)
    P^a_{ss'} = P( s_{t+1} = s' | s_t = s, a_t = a ),   (18)
    ⊞⁺_{s'∈{s_1,...,s_k}} X_{s'} = X_{s_1} ⊞ X_{s_2} ⊞ ··· ⊞ X_{s_k}.   (19)

In particular, the next equation holds if the state transition is deterministic:

    Ω*(s, a)† = R^a_s + γ ⊎_{a'∈A} Ω*(s', a'),   (20)

where s' is the next state for the action a at the current state s.

Proof. Substituting (7) and R^a_{s,β} ≡ E[ r_{t+1}(β) | s_t = s, a_t = a ] = R^a_s · β into the Bellman equation Q*_β(s, a) = R^a_{s,β} + γ Σ_{s'∈S} P^a_{ss'} max_{a'∈A} Q*_β(s', a'), we obtain

    max_{q∈Ω*(s,a)} q · β = max_{q'∈Ω'(s,a)} q' · β,   (21)
    Ω'(s, a) = { R^a_s + γ Σ_{s'∈S} P^a_{ss'} q_{s'} | q_{s'} ∈ ∪_{a'∈A} Ω*(s', a') },   (22)

in the same way as (9). Hence, Ω* is equal to Ω' except for redundancy.

4 Interval Operations

Under regularity conditions, Q-learning has been proved to converge to Q* [1]. That result implies pointwise convergence of parallel Q-learning to Q*_β for each β because of proposition 3. From proposition 2, Q*_β is FML with a finite Ω*(s, a). However, as we can see in Fig. 1, the number of elements in the set Ω(s, a) increases monotonically and it never 'converges' to Ω*(s, a). This is not a paradox; the following assertions can be true at the same time.

1. The numbers of vertices of the polygons P_1, P_2, ... increase monotonically.
2. P_t converges to P* in the sense that the volume of the difference (P_t ∪ P*) − (P_t ∩ P*) converges to 0.
2'. The function FML_{P_t}(·) converges pointwise to FML_{P*}(·).

In short, pointwise convergence of a piecewise-linear function does not imply convergence of the number of pieces. Note that this is not a problem of the algorithm; it is an intrinsic nature of pointwise Q-learning of the weighted criterion model for each weight β.

To overcome this difficulty, we tried a simple approximation with a small 'margin' at first [6]. Then we introduced interval operations to monitor the approximation error [7]. A pair of sets Ω^L(s, a) and Ω^U(s, a) are updated instead of the original Ω(s, a) so that CH Ω^L(s, a) ⊂ CH Ω(s, a) ⊂ CH Ω^U(s, a) holds, where CH Z represents the convex hull of Z. This relation implies lower and upper bounds Q^L_β(s, a) ≤ Q_β(s, a) ≤ Q^U_β(s, a), where Q^X_β(s, a) = FML_{Ω^X(s,a)}(β) for X = L, U. When the difference between Q^L and Q^U is sufficiently small, it is guaranteed that the effect of the approximation can be ignored.

The updating rules of Ω^L and Ω^U are the same as those of Ω, except for the following approximations after every calculation of ⊎ and ⊞. We assume M = 2 here.

Lower approximation for Ω^L: A vertex is removed if the change of the area of CH Ω^L(s, a) is smaller than a threshold ε^L/2 (Fig. 2 left).
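The lower approximation can be sketched as follows (Python assumed; this is a simplified reading in which the area change caused by dropping a vertex is the area of the triangle it forms with its two neighbours, and the vertex list is the hull boundary in order — not the authors' implementation):

```python
# Simplified lower approximation: drop an interior vertex whenever the area
# change of the hull -- the triangle (prev, v, next) -- is below eps/2.

def triangle_area(a, b, c):
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])) / 2.0

def lower_approx(vertices, eps):
    """vertices: hull boundary in order; returns a thinned copy."""
    out = list(vertices)
    changed = True
    while changed and len(out) > 3:
        changed = False
        for i in range(1, len(out) - 1):
            if triangle_area(out[i - 1], out[i], out[i + 1]) < eps / 2.0:
                del out[i]
                changed = True
                break
    return out
```

A nearly collinear vertex contributes almost no area, so it is the first to go; with ε = 0 nothing is removed and the exact hull is kept.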
Fig. 2. Lower approximation (left) and upper approximation (right). Left: if the area of the triangle formed by a vertex c and its neighbours is small, remove c. Right: if the area of the triangle is small, remove the vertices b, c and add a new vertex z.

Upper approximation for Ω^U: An edge is removed if the change of the area of CH Ω^U(s, a) is smaller than a threshold ε^U/2 (Fig. 2 right).

In this paper, we propose an automatic adjustment of the margins ε^L, ε^U. The procedures below are performed at every step t after the updating of Ω^L, Ω^U. The symbol X represents L or U here; ξ_s, ξ_w ≥ 1 and θ_Q, θ_Ω ≥ 0 are constants.

1. Check the changes of the set sizes and the interval width compared with the previous ones. Namely, check these values:

    Δ^X_Ω = |Ω^{X,new}(s_t, a_t)| − |Ω^X(s_t, a_t)|,   (23)
    Δ_Q = ( Q^{U,new}_β̄(s_t, a_t) − Q^{L,new}_β̄(s_t, a_t) ) − ( Q^U_β̄(s_t, a_t) − Q^L_β̄(s_t, a_t) ),   (24)

where |Z| is the number of elements in Z, and β̄ is selected beforehand.

2. An increase of the set size suggests a need for thinning, whereas an increase of the interval width suggests a need for more accurate calculation. Modify the margins as

    ε^{X,new} = ε̃^X        (Δ_Q ≤ θ_Q)
                ε̃^X / ξ_w  (Δ_Q > θ_Q),
    where ε̃^X = ε^X        (Δ^X_Ω ≤ θ_Ω)
                ξ_s ε^X    (Δ^X_Ω > θ_Ω).   (25)

To avoid underflow, we set ε^{X,new} = ε_min if ε^{X,new} is smaller than a constant ε_min
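The margin schedule (23)–(25) is a one-line-per-case update; a sketch (Python assumed; the default parameter values are illustrative, not prescribed at this point of the paper):

```python
# Adaptive margin update of eq. (25): grow the margin when the set size
# grows too much (thinning needed), shrink it when the interval widens.

def update_margin(eps, delta_omega, delta_q,
                  xi_s=1.7, xi_w=1.015, theta_omega=2, theta_q=0.0,
                  eps_min=1e-14):
    eps_tilde = xi_s * eps if delta_omega > theta_omega else eps
    eps_new = eps_tilde / xi_w if delta_q > theta_q else eps_tilde
    return max(eps_new, eps_min)   # underflow guard
```

Here `delta_omega` is Δ^X_Ω of (23) (the change of |Ω^X(s_t, a_t)|) and `delta_q` is Δ_Q of (24); the function is applied once per step for each of X = L, U.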
5 Experiments with a Basic Task of Weighted Criterion

We have verified the behavior of the proposed method. We set S = {S, G, A, B, X, Y}, A = {Up, Down, Left, Right}, s_0 = S, and γ = 0.8 (Fig. 3) [6]. Each action causes a deterministic state transition in the corresponding direction except at G, where the agent is moved to S regardless of its action. Rewards 1, 4b, and b are offered at the states marked in Fig. 3. If a_t is an action toward the 'outside wall' at s_t ≠ G, the state is unchanged and a negative reward (−1) is added further. This is a weighted criterion model with M = 2, because the reward can be written in the form r_{t+1} = β · r_{t+1} for r_{t+1} = (r^1_{t+1}, r^2_{t+1}) and β = (b, 1). The optimal policy changes depending on the weight b. Hence, the optimal value function is
Fig. 3. Task for experiments: a grid of the states S, X, A, Y, B (and the goal G), with rewards (4b), (b) and (1) attached to particular states and (−1) for actions into the outside wall. Numbers in parentheses are reward values.

Table 1. Optimal state-value functions and optimal policies

    Range of weight          | Optimal V*_b(S)       | Optimal state transition
    b < −16/25               | 0                     | S → A → S → ···
    −16/25 ≤ b < −225/1796   | (2000b + 1280)/2101   | S → A → Y → B → G → S → ···
    −225/1796 ≤ b < 15/47    | (400b + 80)/61        | S → X → G → S → ···
    15/47 ≤ b < 3/4          | 32b/3                 | S → X → Y → X → ···
    3/4 ≤ b                  | 16b − 4               | S → X → X → ···
Fig. 4. Transition of the margins ε^L (left) and ε^U (right) vs. step t, from various initial margins (10^-13, 10^-10, 10^-7, 10^-4, 0.1).
Fig. 5. Total number of elements Σ_{s,a} |Ω^X(s, a)| vs. step t for various initial margins. (Left: X = L, Right: X = U.)
Fig. 6. Interval width Q^U_{(0.2,1)}(A, Up) − Q^L_{(0.2,1)}(A, Up) vs. step t for various initial margins.
Fig. 7. Fixed-margin algorithm (ε^U = ε^L = 10^-2 and ε^U = ε^L = 10^-9). Left: total number of elements Σ_{s,a} |Ω^X(s, a)| for X = U, L. Right: interval width.
Fig. 8. Average of 100 trials with inappropriate factors ξ_s = 1.5, ξ_w = 1.015 for γ = 0.5. Left: total number of elements in the upper approximation. Right: interval width.

nonlinear with respect to b (Table 1). Note that the second pattern (S → A → Y)
in Table 1 cannot appear with the naive approach of section 1.

The proposed algorithm is applied to this task with random actions a_t and parameters α = 0.7, (ξ_s, ξ_w) = (1.7, 1.015), (θ_Q, θ_Ω) = (0, 2), β̄ = (0.2, 1), ε_min = 10^-14. The initial margin ε^L = ε^U at t = 0 is one of 10^-1, 10^-4, 10^-7, 10^-10, 10^-13. On this task, we can replace convex hulls with upper convex hulls in our algorithm because β is restricted to the upper half plane [6]. We also assume |b| ≤ 10 ≡ b_max, and in the lower approximation we can safely remove the edges on both ends in Fig. 2 if the absolute value of their slope is greater than b_max.

Averages of 100 trials are shown in Figs. 4–6. The proposed algorithm is robust to a wide range of initial margins. It realizes reduced set sizes and a small interval width at the same time; these requirements are a trade-off in the conventional fixed-margin algorithm [7] (Fig. 7). A problem of the proposed algorithm is sensitivity to the factors ξ_s, ξ_w. When they are inappropriate, instability is observed after a long run (Fig. 8). Another problem is slow convergence of the interval width Q^U − Q^L compared with the fixed-margin algorithm.

6 Conclusion

A parallel RL method with adaptive margins is proposed for the weighted criterion model, and its behavior is verified experimentally with a basic task. Adaptive margins realize reduced set sizes and accurate results. A problem of the adaptive margins is instability for inappropriate parameters: though the method is robust to initial margins, it needs tuning of the factor parameters. Another problem is slow convergence of the interval between the upper and lower estimations. These points must be studied further.

References

1. Jaakkola, T., et al.: Neural Computation 6, 1185–1201 (1994)
2. Sutton, R.S., Barto, A.G.: Reinforcement Learning. The MIT Press, Cambridge (1998)
3. Kaneko, Y., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. 167 (2004)
4. Kaneko, N., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. A-2-10 (2005)
5. Natarajan, S., et al.: In: Proc. Intl. Conf. on Machine Learning, pp. 601–608 (2005)
6. Hiraoka, K., et al.: The Brain & Neural Networks (in Japanese). Japanese Neural Network Society 13, 137–145 (2006)
7. Yoshida, M., et al.: Proc. FIT (in Japanese) (to appear, 2007)
8. Preparata, F.P., et al.: Computational Geometry. Springer, Heidelberg (1985)
9. Alexandrov, V.N., Dongarra, J., Juliano, B.A., Renner, R.S., Tan, C.J.K. (eds.): ICCS 2001. LNCS, vol. 2073. Springer, Heidelberg (2001)
10. Fukuda, K.: J. Symbolic Computation 38, 1261–1272 (2004)
11. Fogel, E., et al.: In: Proc. ALENEX, pp. 3–15 (2006)

Convergence Behavior of Competitive Repetition-Suppression Clustering

Davide Bacciu^{1,2}
and Antonina Starita^2

^1 IMT Lucca Institute for Advanced Studies, P.zza San Ponziano 6, 55100 Lucca, Italy
d.bacciu@imtlucca.it
^2 Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy
starita@di.unipi.it

Abstract. Competitive Repetition-suppression (CoRe) clustering is a bio-inspired learning algorithm that is capable of automatically determining the unknown cluster number from the data. In a previous work it has been shown how CoRe clustering represents a robust generalization of rival penalized competitive learning (RPCL) by means of M-estimators. This paper studies the convergence behavior of the CoRe model, based on the analysis proposed for the distance-sensitive RPCL (DSRPCL) algorithm. Furthermore, we propose a global minimum criterion for learning vector quantization in kernel space that is used to assess the correct location property of the CoRe algorithm.

1 Introduction

CoRe learning has been proposed as a biologically inspired learning model mimicking a memory mechanism of the visual cortex, i.e. repetition suppression [1]. CoRe is a soft-competitive model that allows only a subset of the most active units to learn in proportion to their activation strength, while it penalizes the least active units, driving them away from the patterns producing low firing strengths. This feature has been exploited in [2] to derive a clustering algorithm that is capable of automatically determining the unknown cluster number from the data by means of a reward-punishment procedure that resembles the rival penalization mechanism of RPCL [3]. Recently, Ma and Wang [4] have proposed a generalized loss function for the RPCL algorithm, named DSRPCL, which has been used for studying the convergence behavior of the rival penalization scheme.
In this paper, we present a convergence analysis for CoRe clustering that is founded on Ma and Wang's approach, describing how CoRe satisfies the three properties of separation nature, correct division and correct location [4]. The intuitive analysis presented in [4] for DSRPCL is reinforced with theoretical considerations showing that CoRe pursues a global optimality criterion for vector quantization algorithms. In order to do this, we introduce a kernel interpretation of the CoRe loss that is used to generalize the results given in [5] for hard vector quantization to kernel-based algorithms.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 497–506, 2008.
© Springer-Verlag Berlin Heidelberg 2008
2 A Kernel Based Loss Function for CoRe Clustering

A CoRe clustering network consists of cluster detector units that are characterized by a prototype c_i, which identifies the preferred stimulus for the unit u_i and represents the learned cluster centroid. In addition, units are characterized by an activation function φ_i(x_k, λ_i), defined in terms of a set of parameters λ_i and of the representation of an input pattern x_k ∈ χ. Such an activation function measures the similarity between the prototype c_i and the inputs, determining whether the pattern x_k belongs to the i-th cluster. In the remainder of the paper we will use an activation function that is a Gaussian centered at c_i with spread σ_i, i.e.

    φ_i(x_k | {c_i, σ_i}) = exp( −0.5 ||x_k − c_i||² / σ_i² ).

CoRe clustering works essentially by evolving a small set of highly selective cluster detectors out of an initially larger population by means of a competitive reward-punishment procedure that resembles the rival penalization mechanism [3]. Such a competition is engaged between two sets of units: at each step the most active units are selected to form the winners pool, while the remainder is inserted into the losers pool. More formally, we define the winners pool for the input x_k as the set of units u_i that fire more than θ_win, or the single unit that is maximally active for the pattern, that is,

    win_k = {i | φ_i(x_k, {c_i, σ_i}) ≥ θ_win} ∪ {i | i = argmax_{j∈U} φ_j(x_k | {c_j, σ_j})},   (1)

where the second term of the union ensures that win_k is non-empty. Conversely, the losers pool for x_k is lose_k = U \ win_k, that is, the complement of win_k with respect to the neuron set U. The units belonging to the losers pool are penalized and their response is suppressed. The strength of the penalization for the pattern x_k, at time t, is regulated by the repetition suppression RS^t_k ∈ [0, 1] and is proportional to the frequency of the pattern that has elicited the suppressive effect (see [2,6] for details). The repetition suppression is used to define a pseudo-target activation for the units in the losers pool as φ̂^t_i(x_k) = φ_i(x_k, {c_i, σ_i})(1 − RS^t_k), forcing the losers to decrease their activation proportionally to the amount of repetition suppression they receive. The error of the i-th loser unit can thus be written as

    E^t_{i,k} = (1/2)( φ̂^t_i(x_k) − φ_i(x_k, {c_i, σ_i}) )² = (1/2)( −φ_i(x_k, {c_i, σ_i}) RS^t_k )².   (2)

Conversely, in order to strengthen the activation of the winner units, we set the target activation for the neurons u_i (i ∈ win_k) to M, the maximum of the activation function φ_i(·). The error, in this case, can be written as

    E^t_{i,k} = M − φ_i(x_k, {c_i, σ_i}).   (3)

To analyze the CoRe convergence, we give an error formulation that accumulates the residuals in (2) and (3) for a given epoch e: summing up over all CoRe units in U and the dataset χ = (x_1, ..., x_k, ..., x_K) yields
    J^e(χ, U) = Σ_{i=1}^I Σ_{k=1}^K δ_ik (1 − φ_i(x_k)) + (1/2) Σ_{i=1}^I Σ_{k=1}^K ( (1 − δ_ik) φ_i(x_k) RS^{(e|χ|+k)}_k )²,   (4)

where δ_ik is the indicator function of the set win_k, and where {c_i, σ_i} has been omitted from φ_i to ease the notation. Note that, in (4), we have implicitly used the fact that the units can be treated as independent. The CoRe learning equations can be derived using gradient descent to minimize J^e with respect to the parameters {c_i, σ_i} [2]. Hence, the prototype increment for the e-th epoch can be calculated as

    Δc^e_i = α_c Σ_{k=1}^K [ δ_ik φ_i(x_k) (x_k − c^e_i)/(σ^e_i)² − (1 − δ_ik) ( φ_i(x_k) RS^{(e|χ|+k)}_k )² (x_k − c^e_i)/(σ^e_i)² ],   (5)

where α_c is a suitable learning rate ensuring that J^e decreases with e. Similarly, the spread update can be calculated as

    Δσ^e_i = α_σ Σ_{k=1}^K [ δ_ik φ_i(x_k) ||x_k − c^e_i||²/(σ^e_i)³ − (1 − δ_ik) ( φ_i(x_k) RS^{(e|χ|+k)}_k )² ||x_k − c^e_i||²/(σ^e_i)³ ].   (6)

As one would expect, unit prototypes are attracted by similar patterns (first term in (5)) and repelled by dissimilar inputs (second term in (5)). Moreover, the neural selectivity is enhanced by reducing the Gaussian spread each time the corresponding unit happens to be a winner. Conversely, the variance of loser neurons is enlarged, reducing the units' selectivity and penalizing them for not having sharp responses.

The error formulation introduced so far can be restated by exploiting the kernel trick [7] to express the CoRe loss in terms of differences in a given feature space F. Kernel methods are algorithms that exploit a nonlinear mapping Φ : χ → F to project the data from the input space χ onto a convenient, implicit feature space F. The kernel trick is used to express all operations on Φ(x_1), Φ(x_2) ∈ F in terms of the inner product ⟨Φ(x_1), Φ(x_2)⟩. Such an inner product can be calculated without explicitly using the mapping Φ, by means of the kernel κ(x_1, x_2) = ⟨Φ(x_1), Φ(x_2)⟩.
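Before moving to the kernel view, the competition in (1) and the updates (5)–(6) can be sketched for a single pattern presentation. This is a hypothetical sketch (Python assumed): the unit representation, threshold and learning rates are invented for illustration, the repetition suppression RS is passed in as a precomputed value, and the per-pattern gradients follow the per-k terms of (5) and (6).

```python
import math

# One CoRe presentation step for a single pattern x; rs is the repetition
# suppression RS_k^t in [0, 1].  Not the authors' implementation.

def activation(x, c, sigma):
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-0.5 * d2 / sigma ** 2)

def core_step(units, x, rs, theta_win=0.5, lr_c=0.1, lr_s=0.01):
    """units: list of dicts with keys 'c' (list) and 'sigma' (float)."""
    acts = [activation(x, u['c'], u['sigma']) for u in units]
    best = max(range(len(units)), key=lambda i: acts[i])
    winners = {i for i, a in enumerate(acts) if a >= theta_win} | {best}  # eq. (1)
    for i, u in enumerate(units):
        phi, s = acts[i], u['sigma']
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, u['c']))
        if i in winners:   # first terms of (5)-(6): attraction
            g_c, g_s = phi / s ** 2, phi * d2 / s ** 3
        else:              # second terms of (5)-(6): RS-weighted repulsion
            g_c = -(phi * rs) ** 2 / s ** 2
            g_s = -(phi * rs) ** 2 * d2 / s ** 3
        u['c'] = [ci + lr_c * g_c * (xi - ci) for xi, ci in zip(x, u['c'])]
        u['sigma'] = s + lr_s * g_s
    return winners
```

A nearby unit ends up in the winners pool and its prototype moves toward x, while a far-away loser is pushed (very weakly, through the (φ RS)² factor) in the opposite direction, matching the attraction/repulsion reading of (5).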
To derive the kernel interpretation of the CoRe loss in (4), consider first the distance d_{F_κ} between x_1, x_2 ∈ χ in the feature space F_κ induced by the mapping Φ : χ → F_κ, that is,

    d_{F_κ}(x_1, x_2) = ||Φ(x_1) − Φ(x_2)||²_{F_κ} = κ(x_1, x_1) − 2κ(x_1, x_2) + κ(x_2, x_2).

The kernel trick [7] has been used to substitute the inner products in feature space with a suitable kernel κ calculated in the data space. If κ is chosen to be a Gaussian kernel, then κ(x, x) = 1. Hence d_{F_κ} can be rewritten as d_{F_κ} = ||Φ(x_1) − Φ(x_2)||²_{F_κ} = 2 − 2κ(x_1, x_2). Now, if we take x_1 to be an element of the input dataset, e.g. x_k ∈ χ, and x_2 to be the prototype c_i of the i-th CoRe unit, we can rewrite d_{F_κ} so that it depends on the activation function φ_i. Applying the substitution κ(x_k, c_i) = φ_i(x_k, {c_i, σ_i}), we obtain φ_i(x_k, {c_i, σ_i}) = 1 − (1/2)||Φ(x_k) − Φ(c_i)||²_{F_κ}. Substituting this result into the formulation of the CoRe loss in (4), we obtain

    J^e(χ, U) = (1/2) Σ_{i=1}^I Σ_{k=1}^K δ_ik ||Φ(x_k) − Φ(c_i)||²_{F_κ}
              + (1/2) Σ_{i=1}^I Σ_{k=1}^K (1 − δ_ik) [ RS^{(e|χ|+k)}_k ( 1 − (1/2)||Φ(x_k) − Φ(c_i)||²_{F_κ} ) ]².   (7)

Equation (7) states that CoRe minimizes the feature-space distance between the prototype c_i and those x_k that are close in the kernel space induced by the activation functions φ_i, while it maximizes the feature-space distance between the prototypes and those x_k that are far from c_i in the kernel space.

3 Separation Nature

To prove the separation nature of the CoRe process we need to demonstrate that, given a bounded hypersphere G containing all the sample data, after sufficiently many iterations of the algorithm the cluster prototypes will finally either fall into G or remain outside it and never get into G. In particular, the prototypes remaining outside the hypersphere will be driven far away from the samples by the RS repulsion. We consider a prototype c_i to be far away from the data if, for a given epoch e, it is in the loser pool for every x_k ∈ χ. To prove the CoRe separation nature we first demonstrate the following lemma.

Lemma 1. When a prototype c_i is far away from the data at a given epoch e, then it will always be a loser for every x_k ∈ χ and will be driven away from the data samples.

Proof. The definition of far away implies that, given c^e_i,
∀x k ∈ χ. i ∈ lose e k , where the e in the superscript refers to the learning epoch. Given the prototype update in (5), we obtain the weight vector increment Δc e i
c e i = −α σ K k=1
ϕ i (x k )RS
(e |χ|+k)
k σ e i 2 (x k − c
e i ). (8) As a result of (8), the prototype c e+1 i
other hand, by definition (1), for each of the data samples there exists at least one winner unit at every epoch $e$, whose prototype is moved towards the samples for which it has been a winner. Moreover, not every prototype can be deflected from the data, since this would make the first term of $J_e$ (see (4)) grow and, consequently, the whole $J_e(\chi, U)$ would diverge, because the loser error term in (4) is lower bounded. This would contradict the fact that $J_e(\chi, U)$ decreases with $e$, since CoRe applies gradient descent to the loss function. Therefore, there must exist at least one winning prototype $c_l^e$ that remains close to the samples at epoch $e$. On the other hand, $c_i^e$ is already far away from the samples and, by (8), $c_i^{e+1}$ won't be a winner for any $x_k \in \chi$.

Convergence Behavior of Competitive Repetition-Suppression Clustering 501

To prove this, consider the definition of $\mathrm{win}_k$
in (1): for $c_i^{e+1}$ to be a winner, it must hold either (i) $\varphi_i(x_k) \ge \theta_{\mathrm{win}}$ or (ii) $i = \arg\max_{j \in U} \varphi_j(x_k, \lambda_j)$. The former does not hold because the receptive-field area where the firing strength of the $i$-th unit is above the threshold $\theta_{\mathrm{win}}$ does not contain any sample at epoch $e$; consequently, it cannot contain any sample at epoch $e+1$, since its center $c_i^{e+1}$ has been deflected further from the data. The latter does not hold since there exists at least one prototype, namely $c_l$, that remains close to the data, generating higher activations than unit $u_i$. As a consequence, a far-away prototype $c_i$ will be deflected away from the data until it reaches a stable point where the corresponding firing strength $\varphi_i$ is negligible. □

Now we can proceed to demonstrate the following theorem.

Theorem 1. For a CoRe process there exists a hypersphere $G$ surrounding the sample data $\chi$ such that, after sufficiently many iterations, each prototype $c_i$ will either (i) fall into $G$ or (ii) stay outside $G$ and reach a stable point.

Proof. The CoRe process is a gradient descent (GD) algorithm on $J_e(\chi, U)$; hence, for a sufficiently small learning step, the loss decreases with the number of epochs. Therefore, since $J_e(\chi, U)$ is always positive, the GD process will converge to a minimum $J^*$.
The sequences of prototype vectors $\{c_i^e\}$ will converge either to a point close to the samples or to a point of negligible activation far away from the data. If a unit $u_i$ has a sufficiently long subsequence of prototypes $\{c_i^e\}$ diverging from the dataset then, at a certain time, it will no longer be a winner for any sample and, by Lemma 1, it will converge to a point far away from the data. The attractors for the sequences $\{c_i^e\}$ of the diverging units lie at a certain distance $r$ from the samples, which is determined by those points $x$ where the Gaussian unit centered in $x$ produces a negligible activation in response to any pattern $x_k \in \chi$. Hence, $G$ can be chosen as any hypersphere surrounding the samples with radius smaller than $r$.
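This repulsion mechanism can be sketched numerically. The toy script below is our own illustration, not code from the paper: the 1-D data, the step size, and the simplification of setting the RS factor to 1 are all assumptions. It iterates a loser update in the spirit of Eq. (8) and shows that a far-away prototype drifts monotonically away from the samples while its Gaussian activation becomes negligible, which is exactly the stable far-away point described by Lemma 1.

```python
import math

# Toy 1-D data clustered near the origin (values assumed for illustration).
samples = [-0.2, 0.0, 0.1, 0.3]
sigma = 1.0   # Gaussian unit width (assumed)
alpha = 0.5   # learning step (assumed)

def activation(c, x, sigma):
    """Gaussian firing strength of a unit with prototype c."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

c = 3.0       # prototype starting far away: a loser for every sample
dists = []
for epoch in range(50):
    # Loser update in the spirit of Eq. (8): step *away* from each sample,
    # weighted by the unit's (shrinking) activation; RS factor set to 1 here.
    grad = sum(activation(c, x, sigma) * (x - c) for x in samples)
    c -= alpha * grad / sigma ** 2   # minus sign: repulsion from the data
    dists.append(min(abs(c - x) for x in samples))

# The prototype only moves further out, and its firing strength decays.
print(dists[0], dists[-1], max(activation(c, x, sigma) for x in samples))
```

The monotone drift makes the point of the lemma concrete: once a unit is a loser for every sample, nothing in the update can pull it back towards the data.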
On the other hand, since $J_e(\chi, U)$ decreases to $J^*$, there must exist at least one prototype that is not far away from the data (otherwise the first term of $J_e$ would grow), and the corresponding sequences $\{c_i^e\}$ must have accumulation points close to the samples. Therefore, any hypersphere $G$ enclosing all the samples will also surround the accumulation points of $\{c_i^e\}$ and, after a certain epoch $E$, the sequence will always remain within such a hypersphere. □

In summary, Theorem 1 tells us that the separation nature holds for a CoRe process: some prototypes are possibly pushed away from the data until their contribution to the error in (4) becomes negligible. Far-away prototypes will always be losers and will never head back to the data. Conversely, the remaining prototypes will converge to the samples, heading to a saddle point of the loss $J_e(\chi, U)$ by means of a gradient descent process.

4 Correct Division and Location

Following the convergence analysis in [4], we now turn our attention to the issues of correct division and location of the weight vectors. This means that the number of prototypes falling into $G$ will be $n_c$, i.e. the number of actual clusters in the sample data, and that they will finally converge to the centers of the clusters. At this point, we leave the intuitive study presented for DSRPCL [4] and introduce a sound analysis of the properties of the saddle points identified by CoRe, giving a necessary and sufficient condition for identifying the global minimum of a vector quantization loss in feature space.

4.1
A Global Minimum Condition for Vector Quantization in Kernel Space

The classical problem of hard vector quantization (VQ) in Euclidean space is to determine a codebook $V = \{v_1, \dots, v_N\}$ minimizing the total distortion, calculated by Euclidean norms, resulting from the approximation of the inputs $x_k \in \chi$ by the code vectors $v_i$. Here, we focus on the more general problem of vector quantization in feature space. Given the nonlinear mapping $\Phi$ and the induced feature-space norm $\|\cdot\|_{F_\kappa}$ introduced in the previous sections, we aim at optimizing the distortion

\[
\min\; D(\chi, \Phi_V) = \frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{N} \delta_{ik}^1 \, \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} \tag{9}
\]

where $\Phi_V = \{\Phi_{v_1}, \dots, \Phi_{v_N}\}$ represents the codebook in the kernel space and $\delta_{ik}^1$ equals 1 if the $i$-th cluster is the closest to the $k$-th pattern in the feature space $F_\kappa$, and 0 otherwise. It is widely known that VQ generates a Voronoi tessellation of the quantized space and that a necessary condition for the minimization of the distortion requires the code vectors to be selected as the centroids of the Voronoi regions [8]. In [5], a necessary and sufficient condition is given for the global minimum of a Euclidean VQ distortion function. In the following, we generalize this result to vector quantization in feature space. To prove the global minimum condition in kernel space we need to extend the results in [9] (Propositions 3.1.7 and 3.2.4) to the more general case of a kernel-induced distance metric. Therefore we introduce the following lemma.

Lemma 2. Let $\kappa$ be a kernel and $\Phi : \chi \to F_\kappa$ a map into the corresponding feature space $F_\kappa$. Given a dataset $\chi = \{x_1, \dots, x_K\}$ partitioned into $N$ subsets $C_i$, define the feature-space mean $\Phi_\chi = \frac{1}{K}\sum_{k=1}^{K} \Phi(x_k)$ and the $i$-th partition centroid $\Phi_{v_i} = \frac{1}{|C_i|}\sum_{k \in C_i} \Phi(x_k)$; then

\[
\sum_{k=1}^{K} \|\Phi(x_k) - \Phi_\chi\|^2_{F_\kappa} = \sum_{i=1}^{N}\sum_{k \in C_i} \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} \;+\; \sum_{i=1}^{N} |C_i|\, \|\Phi_{v_i} - \Phi_\chi\|^2_{F_\kappa}. \tag{10}
\]
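Lemma 2 can be checked numerically without ever computing $\Phi$ explicitly, since every norm in (10) expands into kernel evaluations via the kernel trick. The following sketch is our own illustration (the toy data and kernel width are assumptions, not from the paper); it verifies the decomposition for a Gaussian kernel:

```python
import math

def kappa(x, y, s=1.0):
    """Gaussian kernel; implicitly defines the feature map Phi."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * s * s))

X = [(0.0, 0.1), (0.2, -0.1), (3.0, 3.1), (2.9, 2.8), (3.2, 3.0)]
C = [[0, 1], [2, 3, 4]]          # index sets of the N = 2 partitions
K = len(X)

def mean_sq_dist(k_idx, idx):
    """||Phi(x_k) - mean_{j in idx} Phi(x_j)||^2 via the kernel trick."""
    m = len(idx)
    return (kappa(X[k_idx], X[k_idx])
            - 2.0 / m * sum(kappa(X[k_idx], X[j]) for j in idx)
            + sum(kappa(X[j], X[l]) for j in idx for l in idx) / m ** 2)

def mean_mean_sq_dist(idx_a, idx_b):
    """||mean Phi over idx_a - mean Phi over idx_b||^2 via the kernel trick."""
    ma, mb = len(idx_a), len(idx_b)
    return (sum(kappa(X[j], X[l]) for j in idx_a for l in idx_a) / ma ** 2
            + sum(kappa(X[j], X[l]) for j in idx_b for l in idx_b) / mb ** 2
            - 2.0 * sum(kappa(X[j], X[l]) for j in idx_a for l in idx_b) / (ma * mb))

everything = list(range(K))
lhs = sum(mean_sq_dist(k, everything) for k in range(K))        # total scatter
within = sum(mean_sq_dist(k, Ci) for Ci in C for k in Ci)       # first term of (10)
between = sum(len(Ci) * mean_mean_sq_dist(Ci, everything) for Ci in C)  # second term
print(abs(lhs - (within + between)))   # identity (10): difference is ~0
```

Since (10) is an exact identity, the residual is zero up to floating-point error for any partition, not just the "good" one.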
Proof. Given a generic feature vector $\Phi_1$, consider the identity $\Phi(x_k) - \Phi_1 = (\Phi(x_k) - \Phi_{v_i}) + (\Phi_{v_i} - \Phi_1)$: its squared norm in feature space is

\[
\|\Phi(x_k) - \Phi_1\|^2_{F_\kappa} = \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} + \|\Phi_{v_i} - \Phi_1\|^2_{F_\kappa} + 2(\Phi(x_k) - \Phi_{v_i})^T(\Phi_{v_i} - \Phi_1).
\]

Summing over all the elements in the $i$-th partition we obtain

\[
\sum_{k \in C_i} \|\Phi(x_k) - \Phi_1\|^2_{F_\kappa} = \sum_{k \in C_i} \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} + |C_i|\,\|\Phi_{v_i} - \Phi_1\|^2_{F_\kappa} + 2\sum_{k \in C_i} (\Phi(x_k) - \Phi_{v_i})^T(\Phi_{v_i} - \Phi_1), \tag{11}
\]

where the last term in (11) vanishes since $\sum_{k \in C_i} (\Phi(x_k) - \Phi_{v_i}) = 0$ by definition of $\Phi_{v_i}$. Now, applying the substitution $\Phi_1 = \Phi_\chi$ and summing up over all the $N$ partitions yields

\[
\sum_{k=1}^{K} \|\Phi(x_k) - \Phi_\chi\|^2_{F_\kappa} = \sum_{i=1}^{N}\sum_{k \in C_i} \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} + \sum_{i=1}^{N} |C_i|\,\|\Phi_{v_i} - \Phi_\chi\|^2_{F_\kappa}, \tag{12}
\]

where the left-hand side of the equality holds since $\bigcup_{i=1}^{N} C_i = \chi$ and $\bigcap_{i=1}^{N} C_i = \emptyset$. □

Using the results of Lemma 2 we can proceed with the formulation of the global minimum criterion, generalizing Proposition 1 in [5] to vector quantization in feature space.

Proposition 1. Let $\{\Phi_{v_1^g}, \dots, \Phi_{v_N^g}\}$ be a global minimum solution to the problem in (9); then

\[
\sum_{i=1}^{N} |C_i^g|\, \|\Phi_{v_i^g} - \Phi_\chi\|^2_{F_\kappa} \;\ge\; \sum_{i=1}^{N} |C_i|\, \|\Phi_{v_i} - \Phi_\chi\|^2_{F_\kappa} \tag{13}
\]

for any locally optimal solution $\{\Phi_{v_1}, \dots, \Phi_{v_N}\}$, where $\{C_1^g, \dots, C_N^g\}$ and $\{C_1, \dots, C_N\}$ are the partitions of $\chi$ corresponding to the centroids $\Phi_{v_i^g} = \frac{1}{|C_i^g|}\sum_{k \in C_i^g} \Phi(x_k)$ and $\Phi_{v_i} = \frac{1}{|C_i|}\sum_{k \in C_i} \Phi(x_k)$ respectively, and where $\Phi_\chi$ is the dataset mean (see the definition in Lemma 2).

Proof. Since $\{\Phi_{v_1^g}, \dots, \Phi_{v_N^g}\}$ is a global minimum for (9) we have

\[
\sum_{i=1}^{N}\sum_{k \in C_i^g} \|\Phi(x_k) - \Phi_{v_i^g}\|^2_{F_\kappa} \;\le\; \sum_{i=1}^{N}\sum_{k \in C_i} \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} \tag{14}
\]

for any local minimum $\{\Phi_{v_1}, \dots, \Phi_{v_N}\}$. Applying Lemma 2 to both partitions gives

\[
\sum_{k=1}^{K} \|\Phi(x_k) - \Phi_\chi\|^2_{F_\kappa} = \sum_{i=1}^{N}\sum_{k \in C_i^g} \|\Phi(x_k) - \Phi_{v_i^g}\|^2_{F_\kappa} + \sum_{i=1}^{N} |C_i^g|\,\|\Phi_{v_i^g} - \Phi_\chi\|^2_{F_\kappa} \tag{15}
\]

\[
\sum_{k=1}^{K} \|\Phi(x_k) - \Phi_\chi\|^2_{F_\kappa} = \sum_{i=1}^{N}\sum_{k \in C_i} \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} + \sum_{i=1}^{N} |C_i|\,\|\Phi_{v_i} - \Phi_\chi\|^2_{F_\kappa}. \tag{16}
\]

Since (14) holds, comparing (15) and (16) we obtain

\[
\sum_{i=1}^{N} |C_i^g|\, \|\Phi_{v_i^g} - \Phi_\chi\|^2_{F_\kappa} \;\ge\; \sum_{i=1}^{N} |C_i|\, \|\Phi_{v_i} - \Phi_\chi\|^2_{F_\kappa}. \qquad \square
\]
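Proposition 1 has an intuitive reading: because the total feature-space scatter is fixed by Lemma 2, a partition with lower kernel VQ distortion (9) necessarily has a larger between-cluster term, and the global minimum maximizes that term. The small numerical illustration below is our own sketch (the data, kernel width, and the two candidate partitions are assumptions, not from the paper):

```python
import math

def kappa(x, y, s=1.0):
    """Gaussian kernel on R^2."""
    return math.exp(-((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2) / (2 * s * s))

X = [(0.0, 0.0), (0.1, 0.2), (4.0, 4.0), (4.1, 3.9)]   # two obvious clusters
K = len(X)

def sq_dist_to_centroid(k, idx):
    """||Phi(x_k) - Phi_{v_i}||^2 expanded with the kernel trick."""
    m = len(idx)
    return (kappa(X[k], X[k])
            - 2.0 / m * sum(kappa(X[k], X[j]) for j in idx)
            + sum(kappa(X[j], X[l]) for j in idx for l in idx) / m ** 2)

def centroid_to_mean(idx):
    """||Phi_{v_i} - Phi_chi||^2, Phi_chi being the dataset mean in feature space."""
    m = len(idx)
    return (sum(kappa(X[j], X[l]) for j in idx for l in idx) / m ** 2
            + sum(kappa(X[j], X[l]) for j in range(K) for l in range(K)) / K ** 2
            - 2.0 * sum(kappa(X[j], X[l]) for j in idx for l in range(K)) / (m * K))

def distortion(partition):
    """Kernel VQ loss (9) for a given partition of the indices."""
    return sum(sq_dist_to_centroid(k, Ci) for Ci in partition for k in Ci) / K

def between(partition):
    """Between-cluster term of Proposition 1 / Lemma 2."""
    return sum(len(Ci) * centroid_to_mean(Ci) for Ci in partition)

good = [[0, 1], [2, 3]]   # matches the two natural clusters
bad = [[0, 2], [1, 3]]    # mixes them
print(distortion(good) < distortion(bad))   # lower distortion ...
print(between(good) > between(bad))         # ... larger between-cluster term
```

Since the within and between terms always sum to the same constant, the two comparisons are equivalent, which is precisely the content of the proposition.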
4.2 Correct Division and Location for CoRe Clustering

To evaluate the correct division and location properties we first analyze the case where the number of units $N$ equals the true cluster number $n_c$. The CoRe loss can be split into a winner-dependent and a loser-dependent term, i.e. $J_e(\chi, U) = J_e^{win}(\chi, U) + J_e^{lose}(\chi, U)$. By definition, $J_e^{win}(\chi, U) = \sum_{i=1}^{n_c}\sum_{k=1}^{K} \delta_{ik}(1 - \varphi_i(x_k))$ must have at least one minimum point. Applying the necessary condition $\partial J_e^{win}(\chi, U)/\partial c_i = 0$ we obtain an estimate of the prototypes by means of a fixed-point iteration, that is

\[
c_i^e = \frac{\sum_{k=1}^{K} \delta_{ik}\, \varphi_i(x_k)\, x_k}{\sum_{k=1}^{K} \delta_{ik}\, \varphi_i(x_k)}. \tag{17}
\]

When the number of prototypes equals the number of clusters, the fixed-point iteration in (17) converges by positioning each unit's weight vector close to the true cluster centroids. In addition, it can be shown that (17) approximates a local minimum of the kernel vector quantization loss in (9). To prove this, consider the CoRe loss formulation in kernel space (7): we have $J_e^{win}(\chi, U) = \frac{1}{2}\sum_{i=1}^{n_c}\sum_{k=1}^{K} \delta_{ik}\, \|\Phi(x_k) - \Phi(c_i)\|^2_{F_\kappa}$, where $c_i$ is estimated by (17). Now, consider the VQ loss in (9): a necessary condition for its minimization requires the computation of the cluster centroids as $\Phi_{v_i} = \frac{1}{|C_i|}\sum_{k \in C_i} \Phi(x_k)$. The exact calculation of $\Phi_{v_i}$ requires knowing the form of the implicit nonlinear mapping $\Phi$ in order to solve the so-called pre-image problem [10], that is, determining $z$ such that $\Phi(z) = \Phi_{v_i}$, which cannot be solved exactly in the general case [10]. However, instead of calculating the exact pre-image, we can search for an approximation by seeking the $z$ minimizing $\rho(z) = \|\Phi_{v_i} - \Phi(z)\|^2_{F_\kappa}$, that is, the feature-space distance between the centroid in kernel space and the mapping of its approximated pre-image. Rather than optimizing $\rho(z)$ directly, it is easier to minimize the distance between $\Phi_{v_i}$ and the projection of $\Phi(z)$ onto its direction; due to space limitations, we omit the technicalities of this calculation (see [10] for further details). It turns out that the minimization of $\rho(z)$ reduces to the evaluation of the gradient of $\tilde{\rho}(z) = \langle \Phi_{v_i}, \Phi(z) \rangle$. By substituting the definition of $\Phi_{v_i}$ and applying the kernel trick we obtain

\[
\rho(z) = \frac{1}{|C_i|^2}\sum_{k,j \in C_i} \kappa(x_k, x_j) + \kappa(z, z) - \frac{2}{|C_i|}\sum_{k \in C_i} \kappa(x_k, z),
\]

where $\kappa(z, z) = 1$ since we are using a Gaussian kernel. Differentiating $\tilde{\rho}(z)$ with respect to $z$ and solving by fixed-point iteration yields

\[
z^e = \frac{\sum_{k \in C_i} \kappa(x_k, z^{e-1})\, x_k}{\sum_{k \in C_i} \kappa(x_k, z^{e-1})}, \tag{18}
\]

which is the same as the prototype estimate obtained in (17) for Gaussian kernels centered in $z^e$. The indicator function $\delta_{ik}$ in (17) is non-null only for those points $x_k$ for which unit
i was in the winner set. This does not ensure the partition conditions over $\chi$, since, by definition of $\mathrm{win}_k$, some points can be associated with two or more winners. However, by (6) we know that the variance of the winners tends to reduce as learning proceeds. Therefore, using the same arguments as Gersho [8], it can be demonstrated that, after a certain epoch $E$, the CoRe winner competition becomes a WTA process in which $\delta_{ik}$ ensures the partition conditions over $\chi$.

Summarizing, the minimization of the CoRe winner error $J_e^{win}(\chi, U)$ generates an approximate solution to the vector quantization problem in feature space in (9). As a consequence, the prototypes $c_i$ become a local solution satisfying the conditions of Proposition 1. Hence, substituting the definition of $\Phi_\chi$ into the results of Proposition 1, we obtain that $\{c_1, \dots, c_{n_c}\}$ is an approximated global minimum for (9) if and only if

\[
\sum_{i=1}^{n_c}\sum_{k=1}^{K} \frac{|C_i|}{K}\, \|\Phi(c_i) - \Phi(x_k)\|^2_{F_\kappa} \;\ge\; \sum_{i=1}^{n_c}\sum_{k=1}^{K} \frac{|\tilde{C}_i|}{K}\, \|\Phi(\tilde{c}_i) - \Phi(x_k)\|^2_{F_\kappa} \tag{19}
\]

holds for every $\{\tilde{c}_1, \dots, \tilde{c}_{n_c}\}$ that are approximated pre-images of a local minimum of (9). In summary, a global optimum of (9) should minimize the feature-space distance between the prototypes and the samples belonging to their clusters, while maximizing the weight vectors' distance from the sample mean or, equivalently, the distance from all the samples in the dataset $\chi$.

The loser component $J_e^{lose}(\chi, U)$ of the kernel CoRe loss (7) depends on the term $(1 - \tfrac{1}{2}\|\Phi(c_i) - \Phi(x_k)\|^2_{F_\kappa})$, which maximizes the distance between the prototypes $c_i$ and those $x_k$ that do not fall in the respective Voronoi sets $C_i$. Hence, $J_e^{lose}(\chi, U)$ produces a distortion in the estimate of $c_i$ that pursues the global optimality criterion, except that it discounts the repulsive effect of the $x_k \in C_i$. In fact, (19) suggests that $c_i$ has to be repelled by all the $x_k \in \chi$. On the other hand, the estimate $c_i$ is a linear combination of the $x_k \in C_i$: applying the repulsive effect in (19) would subtract their contribution, either canceling the attractive effect (which would be catastrophic) or simply scaling the magnitude of the learning step without changing the final direction. Hence, the CoRe loss makes a reasonable assumption in discarding the repulsive effect of the $x_k \in C_i$ when calculating the estimate of $c_i$. Summarizing, CoRe locates the prototypes close to the centroids of the $n_c$ clusters by means of (17), escaping from local minima of the loss function by approximating the global minimum condition of Proposition 1.

Finally, we need to study the behavior of $J_e$ as the number of units
varies with respect to the true cluster number $n_c$. Following the same reasoning as in [4], we see that the winner-dependent loss $J_e^{win}$ tends to decrease as the number of units increases. However, if the number of units falling into $G$ is larger than $n_c$, several units compete for the same clusters, and the samples of those clusters tend to produce an increased level of error in $J_e^{lose}$, contrasting the reduction of $J_e^{win}$. On the other hand, $J_e^{lose}$ tends to decrease when the number of units inside $G$ is lower than $n_c$. This, however, produces increased levels of $J_e^{win}$, since the prototype allocation won't match the underlying sample distribution. Hence, the CoRe error has its minimum when the number of units inside $G$ approximates $n_c$.

5 Conclusion

The paper presents a sound analysis of the convergence behavior of CoRe clustering, showing how the minimization of the CoRe cost function satisfies the properties of separation nature and correct division and location [4]. As the loss decreases to a minimum, the CoRe algorithm is shown to converge, allocating the correct number of prototypes to the centers of the clusters. Moreover, a sound optimality criterion is given that shows how CoRe gradient descent pursues a global minimum of the vector quantization problem in feature space. The results presented in the paper hold for a batch gradient descent process. However, it can be proved that, under Ljung's conditions [11], they extend to stochastic (online) gradient descent. Moreover, we plan to investigate further the properties of the CoRe kernel formulation, extending the convergence analysis to a wider class of activation functions other than Gaussians, i.e. normalized kernels.

References

1. Grill-Spector, K., Henson, R., Martin, A.: Repetition and the brain: neural models of stimulus-specific effects. Trends in Cognitive Sciences 10(1), 14–23 (2006)
2. Bacciu, D., Starita, A.: A robust bio-inspired clustering algorithm for the automatic determination of unknown cluster number. In: Proceedings of the 2007 International Joint Conference on Neural Networks, pp. 1314–1319. IEEE, Los Alamitos (2007)
3. Xu, L., Krzyzak, A., Oja, E.: Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans. on Neural Networks 4(4) (1993)
4. Ma, J., Wang, T.: A cost-function approach to rival penalized competitive learning (RPCL). IEEE Trans. on Systems, Man, and Cybernetics 36(4), 722–737 (2006)
5. Munoz-Perez, J., Gomez-Ruiz, J.A., Lopez-Rubio, E., Garcia-Bernal, M.A.: Expansive and competitive learning for vector quantization. Neural Processing Letters 15(3), 261–273 (2002)
6. Bacciu, D., Starita, A.: Competitive repetition suppression learning. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 130–139. Springer, Heidelberg (2006)
7. Scholkopf, B., Smola, A., Muller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5), 1299–1319 (1998)
8. Yair, E., Zeger, K., Gersho, A.: Competitive learning and soft competition for vector quantizer design. IEEE Trans. on Signal Processing 40(2), 294–309 (1992)
9. Spath, H.: Cluster Analysis Algorithms. Ellis Horwood (1980)
10. Scholkopf, B., Mika, S., Burges, C.J.C., Knirsch, P., Muller, K.R., Ratsch, G., Smola, A.J.: Input space versus feature space in kernel-based methods. IEEE Trans. on Neural Networks 10(5), 1000–1017 (1999)
11. Ljung, L.: Strong convergence of a stochastic approximation algorithm. The Annals of Statistics 6(3), 680–696 (1978)
Self-Organizing Clustering with Map of Nonlinear Varieties Representing Variation in One Class

Hideaki Kawano, Hiroshi Maeda, and Norikazu Ikoma

Kyushu Institute of Technology, Faculty of Engineering, 1-1 Sensui-cho, Tobata-ku, Kitakyushu, 804-8550, Japan
kawano@ecs.kyutech.ac.jp

Abstract. The Adaptive Subspace Self-Organizing Map (ASSOM) is an evolution of the Self-Organizing Map in which each computational unit defines a linear subspace. Recently, a modified version has been proposed in which each unit defines a linear manifold instead of a linear subspace. The linear manifold in a unit is represented by a mean vector and a set of basis vectors. After training, these units result in a set of linear variety detectors. From another point of view, we can consider that the AMSOM represents the latent commonality of data as linear structures. In numerous cases, however, these are not enough to describe the latent commonality of data because of their linearity. In this paper, nonlinear varieties are considered in order to represent the diversity of data within a class. The effectiveness of the proposed method is verified by applying it to some simple classification problems.

1 Introduction

The subspace method is popular in pattern recognition, feature extraction, compression, classification, and signal processing [1]. Unlike other techniques, where classes are primarily defined as regions or zones in the feature space, the subspace method uses linear subspaces that are defined by a set of normalized basis vectors. One linear subspace is usually associated with one class. An input vector is classified to a particular class if its projection error onto the subspace associated with that class is the minimum. The subspace method, as compared to other pattern recognition techniques, has advantages in applications where the relative intensities or energies of the vector components are more important than the overall level of the signal.
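The scale-invariance advantage just mentioned can be made concrete: classifying by the relative projection energy $\|P_i x\|^2 / \|x\|^2$ depends only on the direction of the input, not on its overall level. The following minimal sketch is our own illustration (the basis vectors and test point are assumed values, not from the paper):

```python
# Two classes, each modeled by a 1-D linear subspace of R^3
# spanned by a unit-norm basis vector (values assumed for illustration).
bases = {
    "A": [(1.0, 0.0, 0.0)],
    "B": [(0.0, 0.6, 0.8)],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_score(x, basis):
    """Relative projection energy ||P x||^2 / ||x||^2.

    The ratio depends only on the direction of x, not on its magnitude,
    which is the scale-invariance property discussed in the text."""
    proj_sq = sum(dot(x, b) ** 2 for b in basis)  # basis assumed orthonormal
    return proj_sq / dot(x, x)

def classify(x):
    return max(bases, key=lambda c: fit_score(x, bases[c]))

x = (0.1, 0.5, 0.7)                      # roughly along class B's direction
x_scaled = tuple(5.0 * v for v in x)     # same pattern, 5x the signal level
print(classify(x), classify(x_scaled))   # same label at either level
```

Scaling the input multiplies numerator and denominator of the score by the same factor, so the decision is unchanged, unlike a raw nearest-prototype rule.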
It also provides an economical representation for groups of vectors with high dimensionality, since one can often use a small set of basis vectors to approximate the subspace where the vectors reside. Another paradigm is to use a mixture of local subspaces to collectively model the data space.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 507–516, 2008. © Springer-Verlag Berlin Heidelberg 2008

The Adaptive-Subspace Self-Organizing Map (ASSOM) [2][3] is a mixture-of-local-subspaces method for pattern recognition. ASSOM, an evolution of the Self-Organizing Map (SOM) [4], consists of an input layer and a competitive layer arranging computational units in a line or a lattice structure. Each computational unit defines a subspace spanned by some basis vectors. ASSOM creates a set of subspace representations by competitive selection and cooperative learning. In SOM, a set of reference vectors is spatially organized to partition the input space. In ASSOM, a set of reference sub-models is topologically ordered, with each sub-model responsible for describing a specific region of the input space by its local principal subspace. The ASSOM is attractive not only because it inherits the topographic representation property of the SOM, but also because the learning results of ASSOM can faithfully describe the core features of various transformation groups. The simulation results in references [2] and [3] have illustrated that different feature filters can self-organize to different low-dimensional subspaces and that a wavelet-type representation emerges in the learning.

Recently, the Adaptive Manifold Self-Organizing Map (AMSOM), a modified version of ASSOM, has been proposed [5]. AMSOM has the same structure as ASSOM, except for the way each computational unit is represented. Each unit in AMSOM defines an affine subspace, which is composed of a mean vector and a set of basis vectors. By incorporating a mean vector into each unit, the recognition performance has been improved significantly. The simulation results in reference [5] have shown that AMSOM outperforms a linear PCA-based method and ASSOM in a face recognition problem.

In both ASSOM and AMSOM, the local subspace in each unit can be adapted by linear PCA learning algorithms. On the other hand, it is known that there are a number of advantages in introducing nonlinearities into a PCA-type network with reproducing kernels [6][13]. For example, the performance of the subspace method is affected by the dimensionality of the intersections of subspaces [1]. In other words, the dimensionality of each subspace should be as low as possible in order to achieve successful performance. It is, however, not enough to describe the variation in a class of patterns by a low-dimensional subspace because of its linearity. From this consideration, we propose a nonlinear extension of the AMSOM with reproducing kernels. The proposed method can be expected to construct nonlinear varieties so that an effective representation of data belonging to the same category is achieved with low dimensionality. The effectiveness of the proposed method is verified by applying it to some simple pattern classification problems.

2 Adaptive Manifold Self-Organizing Map (AMSOM)

In this section, we give a brief review of the original AMSOM. Fig. 1 shows the structure of the AMSOM. It consists of an input layer and a competitive layer, which contain $n$ and $M$ units respectively. Suppose $i \in \{1, \dots, M\}$ indexes the computational units in the competitive layer, and the dimensionality of the input vector is $n$. The $i$-th computational unit constructs an affine subspace, which is composed of a mean vector $\mu^{(i)}$ and a subspace spanned by $H$ basis vectors.
Fig. 1. A structure of the Adaptive Manifold Self-Organizing Map (AMSOM): an input layer $x_1, \dots, x_j, \dots, x_n$ feeding a competitive layer of units
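A unit of the kind just described can be sketched as follows. This is our own illustrative code, not from the paper: each unit stores a mean vector and an orthonormal basis, and an input is assigned to the unit whose affine subspace yields the smallest residual. The example also shows why the mean vector matters: two units with the same basis direction but different means are distinguishable, which a pure linear-subspace model (both means at the origin) could not achieve.

```python
# Two AMSOM-style units in R^2 (all values assumed for illustration):
# same subspace direction, different mean vectors.
units = [
    {"mu": (0.0, 0.0), "basis": [(1.0, 0.0)]},   # manifold: the line y = 0
    {"mu": (0.0, 3.0), "basis": [(1.0, 0.0)]},   # manifold: the line y = 3
]

def residual(x, unit):
    """Distance from x to the affine subspace mu + span(basis)."""
    d = tuple(a - m for a, m in zip(x, unit["mu"]))
    proj = [0.0, 0.0]
    for b in unit["basis"]:                      # basis assumed orthonormal
        coef = sum(di * bi for di, bi in zip(d, b))
        proj = [p + coef * bi for p, bi in zip(proj, b)]
    return sum((di - pi) ** 2 for di, pi in zip(d, proj)) ** 0.5

def winner(x):
    """Competitive selection: unit with the smallest projection error."""
    return min(range(len(units)), key=lambda i: residual(x, units[i]))

# Points near y = 0 match unit 0; points near y = 3 match unit 1.
print(winner((2.0, 0.1)), winner((-1.0, 2.9)))   # 0 1
```

In the full AMSOM, the means and bases would of course be learned by competitive and cooperative updates rather than fixed by hand; the sketch only illustrates the matching (winner-selection) step.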