Parallel Reinforcement Learning for Weighted Multi-criteria Model with Adaptive Margin

Kazuyuki Hiraoka, Manabu Yoshida, and Taketoshi Mishima
Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Japan
hira@mail.saitama-u.ac.jp

Abstract. Reinforcement learning (RL) for a linear family of tasks is studied in this paper. The key of our discussion is nonlinearity of the optimal solution even though the task family is linear; we cannot obtain the optimal policy by a naive approach. Though there exists an algorithm that calculates the result equivalent to Q-learning for each task all together, it suffers from explosion of set sizes. We introduce adaptive margins to overcome this difficulty.

1 Introduction

Reinforcement learning (RL) for a linear family of tasks is studied in this paper. Such learning is useful for time-varying environments, multi-criteria problems, and inverse RL [5,6]. The family is defined as a weighted sum of several criteria; it is linear in the sense that the reward is linear with respect to the weight parameters. For instance, criteria of network routing include end-to-end delay, loss of packets, and the power level associated with a node [5]. Selecting appropriate weights beforehand is difficult in practice and requires trial and error. In addition, appropriate weights may change over time. Parallel RL for all possible weight values is desirable in such cases.

The key of our discussion is nonlinearity of the optimal solution; it is in fact not linear but piecewise-linear. This fact implies that we cannot obtain the best policy by the following naive approach:

1. Find the value function for each criterion.
2. Calculate the weighted sum of these functions to obtain the total value function.
3. Construct a policy on the basis of the total value function.

A typical example is presented in section 5. Piecewise-linearity of the optimal solution has been pointed out independently in [4] and [5].
The latter aims at fast adaptation under time-varying environments. The former is our previous report, in which we tried to obtain the optimal solutions for various weight values all together. Though we developed an algorithm that gives a solution exactly equivalent to Q-learning for each weight value, it has a difficulty with explosion of set size. This difficulty is not a problem of the algorithm but an intrinsic nature of Q-learning for the weighted criterion model.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 487–496, 2008.
© Springer-Verlag Berlin Heidelberg 2008

We introduced a simple approximation with a 'margin' into the decision of convexity first [6]. Then we improved it so that we obtain an interval estimation and can monitor the effect of the approximation [7]. In this paper, we propose adaptive adjustment of margins. In the margin-based approach, we have to manage large sets of vectors in the first stage of learning, and the peak of the set size tends to be large if we set a small margin to obtain an accurate final result. The proposed method reduces this trade-off: by changing margins appropriately through the learning steps, we can enjoy small set sizes in the first stage with large margins, and an accurate result in the final stage with small margins.

The weighted criterion model is defined in section 2, and parallel RL for it is described in section 3. Then the difficulty of set size is pointed out and margins are introduced in section 4. Adaptive adjustment of margins is also proposed there. Its behavior is verified with experiments in section 5. Finally, a conclusion is given in section 6.

2 Weighted Criterion Model

An "orthodox" RL setting is assumed for states and actions as follows.

– The time step is discrete (t = 0, 1, 2, 3, ...).
– The state set S and the action set A are finite and known.
– The state transition rule P is unknown.
– The state s_t is observable.
– The task is a Markov decision process (MDP).

The reward r_{t+1} is given as a weighted sum of partial rewards r^1_{t+1}, ..., r^M_{t+1}:

    r_{t+1}(β) = Σ_{i=1}^M β_i r^i_{t+1} = β · r_{t+1},   (1)

with

    weight vector β ≡ (β_1, ..., β_M) ∈ R^M,   (2)
    reward vector r_{t+1} ≡ (r^1_{t+1}, ..., r^M_{t+1}) ∈ R^M.   (3)

We assume that the partial rewards r^1_{t+1}, ..., r^M_{t+1} are also observable, whereas their reward rules R(1), ..., R(M) are unknown. Multi-criteria RL problems of this type have been introduced independently in [3] and [5]. We hope to find the optimal policy π*_β for each weight β that maximizes the expected cumulative reward with a given discount factor 0 < γ < 1,

    π*_β = argmax_π E_π[ Σ_{τ=0}^∞ γ^τ r_{τ+1}(β) ],   (4)

where E_π[·] denotes the expectation under a policy π. To be exact, π*_β is defined as a policy that attains Q^{π*_β}_β(s, a; γ) = Q*_β(s, a; γ) ≡ max_π Q^π_β(s, a; γ) for all state-action pairs (s, a), where the action-value function Q^π_β is defined as

    Q^π_β(s, a; γ) ≡ E_π[ Σ_{τ=0}^∞ γ^τ r_{τ+1}(β) | s_0 = s, a_0 = a ].   (5)

It is well known that an MDP has a deterministic policy π*_β satisfying this condition; such a π*_β is obtained from the optimal value function [2],

    π*_β : S → A : s ↦ argmax_{a∈A} Q*_β(s, a; γ).   (6)
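As a small concrete check of the scalarization in (1)–(4), the sketch below (Python assumed; the trajectory, weights and partial rewards are invented for illustration, not taken from the paper) computes the weighted reward β · r_{t+1} and the discounted return that π*_β maximizes, and illustrates that the return of a fixed trajectory is linear in β.

```python
# Sketch of the weighted criterion model in (1)-(4); numbers are illustrative.

def weighted_reward(beta, partial_rewards):
    """r_{t+1}(beta) = beta . r_{t+1}, eq. (1)."""
    return sum(b * r for b, r in zip(beta, partial_rewards))

def discounted_return(beta, trajectory_rewards, gamma):
    """Sum_tau gamma^tau * r_{tau+1}(beta), the quantity maximized in (4)."""
    return sum((gamma ** tau) * weighted_reward(beta, r)
               for tau, r in enumerate(trajectory_rewards))

# Two partial criteria (M = 2) over a short fixed trajectory.
traj = [(1.0, 0.0), (0.0, 4.0), (1.0, 1.0)]
print(discounted_return((0.5, 2.0), traj, gamma=0.8))  # ~ 8.5
```

For a fixed trajectory the return is linear in β; the nonlinearity of Q*_β discussed below only appears once the policy is optimized per weight.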
Thus we concentrate on estimation of Q ∗ β . Note that Q ∗ β is nonlinear with respect to β. A typical example is presented in section 5. Basic properties of the action-value function Q are described briefly in the rest of this section [4,5,6]. The discount factor γ is fixed through this paper, and it is omitted below. Proposition 1. Q π β
β for a fixed policy π. Proof. Neither P nor π depend on β from assumptions. Hence, joint distribution of (s
0 , a
0 ), (s
1 , a
1 ), (s
2 , a
2 ), . . . is independent of β. It implies linearity. Definition 1. If f : R M → R can be written as f(β) = max q∈Ω ( q · β) with a nonempty finite set Ω ⊂ R
M , we call f Finite-Max-Linear (FML) and write it as f = FML Ω . It is trivial that f is convex and piecewise-linear if f is FML. Proposition 2. The optimal action-value function is FML as a function of the weight β. Namely, there exists a nonempty finite set Ω ∗ (s, a)
⊂ R M for each state-action pair (s, a), and Q ∗ β is written as Q ∗ β (s, a) =
max q∈Ω
∗ ( s,a) q · β. (7)
Proof. We have assumed MDP. It is well known that Q ∗ β can be written as Q ∗ β (s, a) = max π∈Π Q
β (s, a) for the set Π of all deterministic policies. Π is finite, and Q π β is linear with respect to β from proposition 1. Hence, Q ∗ β
FML. Proposition 3. Assume that an estimated action-value function Q β is FML as a function of the weight β. If we apply Q-learning, the updated Q new
β (s t , a t ) = (1 − α)Q β (s t , a
t ) + α
β · r t+1
+ γ max a∈A
Q β (s t+1 , a)
(8) is still FML as a function of β, where α > 0 is the learning rate. Proof. There exists a nonempty finite set Ω(s, a) ⊂ R M
β (s, a) =
max q∈Ω(s,a)
( q · β) for each (s, a). Then (8) implies Q new β
t , a
t ) = max
˜ q∈ ˜
Ω ˜ q · β, where ˜ Ω ≡ (1 − α)q + α(r t+1 + γ
q ) a ∈ A, q ∈ Ω(s t , a t ), q ∈ Ω(s t+1 , a) , (9) because max x f(x) + max y g(y) = max x,y (f (x) + g(y)) holds in general. The set ˜ Ω is finite, and Q new β
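The construction of Ω̃ in (9) is directly implementable as set arithmetic on vectors. A naive sketch (Python assumed; vectors are stored as tuples, and function names are ours, not the paper's), together with a check that the FML value of the updated set matches the scalar Q-learning update (8) for a chosen β:

```python
# Naive FML update of eq. (9): every combination of q in Omega(s_t, a_t)
# and q' in Omega(s_{t+1}, a), over all actions a, yields one new vector.

def fml_value(omega, beta):
    """FML_Omega(beta) = max_{q in Omega} q . beta."""
    return max(sum(qi * bi for qi, bi in zip(q, beta)) for q in omega)

def naive_update(omega_sa, omegas_next, r, alpha, gamma):
    """Return Omega~ of eq. (9); omegas_next lists Omega(s_{t+1}, a) per action."""
    new = set()
    for omega_next in omegas_next:            # union over a in A
        for q in omega_sa:
            for qp in omega_next:
                new.add(tuple((1 - alpha) * qi + alpha * (ri + gamma * qpi)
                              for qi, ri, qpi in zip(q, r, qp)))
    return new
```

For any fixed β, `fml_value(naive_update(...), beta)` reproduces the right-hand side of (8) evaluated at that β, which is exactly the equivalence the proof asserts.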
These propositions imply that (1) the true Q*_β is FML, and (2) its estimation Q_β is also FML as long as the initial estimation is FML.

3 Parallel Q-Learning for All Weights

A parallel Q-learning method for the weighted criterion model has been proposed in [6]. The estimations Q_β for all β ∈ R^M are updated all together in parallel Q-learning. In this method, Q_β(s, a) for each (s, a) is kept in an FML expression:

    Q_β(s, a) = max_{q∈Ω(s,a)} q · β = FML_{Ω(s,a)}(β)   (10)

with a certain set Ω(s, a) ⊂ R^M. We store and update Ω(s, a) instead of Q_β(s, a) on the basis of propositions 2 and 3. Though a naive updating rule has been suggested in the proof of proposition 3, it is extremely redundant and inefficient. We need several definitions to describe a better algorithm.

Definition 2. An element c ∈ Ω is redundant if FML_{Ω−{c}} = FML_Ω.

Definition 3. We use Ω† to denote the set of non-redundant elements of Ω. Note that FML_{Ω†} = FML_Ω [5].

Definition 4. We define the following operations:

    cΩ ≡ {cq | q ∈ Ω},   c + Ω ≡ {c + q | q ∈ Ω},   (11)
    Ω ⊎ Ω' ≡ (Ω ∪ Ω')†,   ⊎_{k=1}^K Ω_k ≡ (∪_{k=1}^K Ω_k)†,   (12)
    Ω ⊕ Ω' ≡ {q + q' | q ∈ Ω, q' ∈ Ω'},   Ω ⊞ Ω' ≡ (Ω ⊕ Ω')†.   (13)

With these operations, the updating rule of Ω is described as follows [6]:

    Ω^new(s_t, a_t) = (1 − α)Ω(s_t, a_t) ⊞ α( r_{t+1} + γ ⊎_{a∈A} Ω(s_{t+1}, a) ).   (14)

The initial value of Ω at t = 0 is Ω(s, a) = {o} ⊂ R^M for all (s, a)
∈ S × A, where o is the zero vector. It corresponds to the constant initial function Q_β(s, a) = 0.

Proposition 4. When (10) holds for all states s ∈ S and actions a ∈ A, Q^new_β(s_t, a_t) in (8) is equal to FML_{Ω^new(s_t,a_t)}(β) for (14). Namely, parallel Q-learning is equivalent to Q-learning for each β: the update {Q_β(s, a)} → Q^new_β(s_t, a_t) and the update {Ω(s, a)} → Ω^new(s_t, a_t) commute with taking FML expressions,

    {Q_β(s, a)}  --(update)-->  Q^new_β(s_t, a_t)
         |                            |
    FML expression               FML expression
         |                            |
    {Ω(s, a)}    --(update)-->  Ω^new(s_t, a_t).   (15)
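For M = 2, the pruning Ω† and the sum operation of (13)–(14) reduce to convex hull computations on the plane. A sketch (Python assumed; function names are ours, and a naive O(|Ω||Ω'|) pairwise sum is used instead of the edge-merging procedure of Fig. 1):

```python
# Omega† for M = 2 via the convex hull (Andrew's monotone chain), and
# Omega [+] Omega' of eq. (13) as hull(pairwise sums).  For FML only the
# upper part of the hull matters, but the full hull is a safe superset.

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hull(points):
    pts = sorted(set(points))
    if len(pts) <= 2:
        return set(pts)
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return set(lower[:-1] + upper[:-1])

def boxsum(omega, omega2):
    """Omega [+] Omega' = (Omega ⊕ Omega')†, eq. (13), for M = 2."""
    return hull([(p[0] + q[0], p[1] + q[1]) for p in omega for q in omega2])
```

Edge-merging (Fig. 1) brings the Minkowski sum down to linear time in the number of vertices, but the hull-of-sums version above already realizes the same set Ω ⊞ Ω'.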
Fig. 1. Calculation of Ω ⊞ Ω' in (14) for two-dimensional convex polygons: (1) set directions of edges; (2) merge and sort edges according to their arguments; (3) connect edges to generate a polygon; (4) shift the origin. The maximal x (resp. y) coordinate of Ω ⊞ Ω' is the sum of the maximal x (resp. y) coordinates of Ω and Ω'. Vertices of the polygons correspond to Ω, Ω' and Ω ⊞ Ω'.

Proof. We introduced the set Ω̃ in (9) to prove proposition 3. With the above operations, (9) is written as

    Ω̃ = (1 − α)Ω(s_t, a_t) ⊕ α( r_{t+1} + γ ∪_{a∈A} Ω(s_{t+1}, a) ).

Then (Ω̃)† = Ω^new(s_t, a_t) is obtained, and FML_{Ω^new(s_t,a_t)}(β) = FML_{Ω̃}(β) = Q^new_β(s_t, a_t) is implied.

It is well known that Ω† is equal to the set of vertices of the convex hull of Ω [6]. Efficient convex hull algorithms have been developed in computational geometry [8]. Using them, we can calculate the merged set Ω ⊎ Ω' = (Ω ∪ Ω')†. The sum set Ω ⊞ Ω' has also been studied in the form of Minkowski sum algorithms [9,10,11]. Its calculation is particularly easy for two-dimensional convex polygons (Fig. 1).

Before closing the present section, we note an FML version of the Bellman equation in our notation. Theoretically, we can use successive iteration of this equation to find the optimal policy when we know P and R, though we must take care of numerical error in practice.

Proposition 5. The FML expression Q*_β = FML_{Ω*}(β) satisfies
    Ω*(s, a)† = R^a_s + γ ⊞⁺_{s'∈S} P^a_{ss'} ⊎_{a'∈A} Ω*(s', a'),   (16)

where

    R^a_s = ( R^a_s(1), ..., R^a_s(M) ),   R^a_s(i) = E[ r^i_{t+1} | s_t = s, a_t = a ],   (17)
    P^a_{ss'} = P( s_{t+1} = s' | s_t = s, a_t = a ),   (18)
    ⊞⁺_{s'∈{s_1,...,s_k}} X_{s'} = X_{s_1} ⊞ X_{s_2} ⊞ ··· ⊞ X_{s_k}.   (19)

In particular, the next equation holds if the state transition is deterministic:

    Ω*(s, a)† = R^a_s + γ ⊎_{a'∈A} Ω*(s', a'),   (20)

where s' is the next state for the action a at the current state s.

Proof. Substituting (7) and R^a_{s,β} ≡ E[ r_{t+1}(β) | s_t = s, a_t = a ] = R^a_s · β into the Bellman equation Q*_β(s, a) = R^a_{s,β} + γ Σ_{s'∈S} P^a_{ss'} max_{a'∈A} Q*_β(s', a'), we obtain

    max_{q∈Ω*(s,a)} q · β = max_{q'∈Ω'(s,a)} q' · β,   (21)
    Ω'(s, a) = { R^a_s + γ Σ_{s'∈S} P^a_{ss'} q_{s'} | q_{s'} ∈ ∪_{a'∈A} Ω*(s', a') },   (22)

in the same way as (9). Hence, Ω* is equal to Ω' except for redundancy.

4 Interval Operations

Under regularity conditions, Q-learning has been proved to converge to Q* [1]. That result implies pointwise convergence of parallel Q-learning to Q*_β for each β because of proposition 3. From proposition 2, Q*_β is FML with a finite Ω*(s, a). However, as we can see in Fig. 1, the number of elements in the set Ω(s, a) increases monotonically and it never 'converges' to Ω*(s, a). This is not a paradox; the following assertions can be true at the same time.

1. The numbers of vertices of the polygons P_1, P_2, ... increase monotonically.
2. P_t converges to P* in the sense that the volume of the difference (P_t ∪ P*) − (P_t ∩ P*) converges to 0.
2'. The function FML_{P_t}(·) converges pointwise to FML_{P*}(·).

In short, pointwise convergence of a piecewise-linear function does not imply convergence of the number of pieces. Note that this is not a problem of the algorithm; it is an intrinsic nature of pointwise Q-learning of the weighted criterion model for each weight β.

To overcome this difficulty, we tried a simple approximation with a small 'margin' at first [6]. Then we introduced interval operations to monitor the approximation error [7]. A pair of sets Ω^L(s, a) and Ω^U(s, a) are updated instead of the original Ω(s, a) so that CH Ω^L(s, a) ⊂ CH Ω(s, a) ⊂ CH Ω^U(s, a) holds, where CH Z represents the convex hull of Z. This relation implies lower and upper bounds Q^L_β(s, a) ≤ Q_β(s, a) ≤ Q^U_β(s, a), where Q^X_β(s, a) = FML_{Ω^X(s,a)}(β) for X = L, U. When the difference between Q^L and Q^U is sufficiently small, it is guaranteed that the effect of the approximation can be ignored.

The updating rules of Ω^L and Ω^U are the same as those of Ω, except for the following approximations after every calculation of ⊎ and ⊞. We assume M = 2 here.

Lower approximation for Ω^L: A vertex is removed if the change of the area of CH Ω^L(s, a) is smaller than a threshold ε^L/2 (Fig. 2 left).
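The lower approximation can be sketched as follows (Python assumed; this is a simplified reading in which the area change caused by dropping a vertex is the area of the triangle it forms with its two neighbours, and the vertex list is the hull boundary in order — not the authors' implementation):

```python
# Simplified lower approximation: drop an interior vertex whenever the area
# change of the hull -- the triangle (prev, v, next) -- is below eps/2.

def triangle_area(a, b, c):
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])) / 2.0

def lower_approx(vertices, eps):
    """vertices: hull boundary in order; returns a thinned copy."""
    out = list(vertices)
    changed = True
    while changed and len(out) > 3:
        changed = False
        for i in range(1, len(out) - 1):
            if triangle_area(out[i - 1], out[i], out[i + 1]) < eps / 2.0:
                del out[i]
                changed = True
                break
    return out
```

A nearly collinear vertex contributes almost no area, so it is the first to go; with ε = 0 nothing is removed and the exact hull is kept.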
Fig. 2. Lower approximation (left) and upper approximation (right). Left: if the area of the triangle formed by a vertex c and its neighbours is small, remove c. Right: if the area of the triangle is small, remove the vertices b, c and add a new vertex z.

Upper approximation for Ω^U: An edge is removed if the change of the area of CH Ω^U(s, a) is smaller than a threshold ε^U/2 (Fig. 2 right).

In this paper, we propose an automatic adjustment of the margins ε^L, ε^U. The procedures below are performed at every step t after the updating of Ω^L, Ω^U. The symbol X represents L or U here; ξ_s, ξ_w ≥ 1 and θ_Q, θ_Ω ≥ 0 are constants.

1. Check the changes of the set sizes and the interval width compared with the previous ones. Namely, check these values:

    Δ^X_Ω = |Ω^{X,new}(s_t, a_t)| − |Ω^X(s_t, a_t)|,   (23)
    Δ_Q = ( Q^{U,new}_β̄(s_t, a_t) − Q^{L,new}_β̄(s_t, a_t) ) − ( Q^U_β̄(s_t, a_t) − Q^L_β̄(s_t, a_t) ),   (24)

where |Z| is the number of elements in Z, and β̄ is selected beforehand.

2. An increase of the set size suggests a need for thinning, whereas an increase of the interval width suggests a need for more accurate calculation. Modify the margins as

    ε^{X,new} = ε̃^X        (Δ_Q ≤ θ_Q)
                ε̃^X / ξ_w  (Δ_Q > θ_Q),
    where ε̃^X = ε^X        (Δ^X_Ω ≤ θ_Ω)
                ξ_s ε^X    (Δ^X_Ω > θ_Ω).   (25)

To avoid underflow, we set ε^{X,new} = ε_min if ε^{X,new} is smaller than a constant ε_min
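The margin schedule (23)–(25) is a one-line-per-case update; a sketch (Python assumed; the default parameter values are illustrative, not prescribed at this point of the paper):

```python
# Adaptive margin update of eq. (25): grow the margin when the set size
# grows too much (thinning needed), shrink it when the interval widens.

def update_margin(eps, delta_omega, delta_q,
                  xi_s=1.7, xi_w=1.015, theta_omega=2, theta_q=0.0,
                  eps_min=1e-14):
    eps_tilde = xi_s * eps if delta_omega > theta_omega else eps
    eps_new = eps_tilde / xi_w if delta_q > theta_q else eps_tilde
    return max(eps_new, eps_min)   # underflow guard
```

Here `delta_omega` is Δ^X_Ω of (23) (the change of |Ω^X(s_t, a_t)|) and `delta_q` is Δ_Q of (24); the function is applied once per step for each of X = L, U.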
5 Experiments with a Basic Task of Weighted Criterion

We have verified the behavior of the proposed method. We set S = {S, G, A, B, X, Y}, A = {Up, Down, Left, Right}, s_0 = S, and γ = 0.8 (Fig. 3) [6]. Each action causes a deterministic state transition in the corresponding direction except at G, where the agent is moved to S regardless of its action. Rewards 1, 4b, and b are offered at the states marked in Fig. 3. If a_t is an action toward the 'outside wall' at s_t ≠ G, the state is unchanged and a negative reward (−1) is added further. This is a weighted criterion model with M = 2, because the reward can be written in the form r_{t+1} = β · r_{t+1} for r_{t+1} = (r^1_{t+1}, r^2_{t+1}) and β = (b, 1). The optimal policy changes depending on the weight b. Hence, the optimal value function is
Fig. 3. Task for experiments: a grid of the states S, X, A, Y, B (and the goal G), with rewards (4b), (b) and (1) attached to particular states and (−1) for actions into the outside wall. Numbers in parentheses are reward values.

Table 1. Optimal state-value functions and optimal policies

    Range of weight          | Optimal V*_b(S)       | Optimal state transition
    b < −16/25               | 0                     | S → A → S → ···
    −16/25 ≤ b < −225/1796   | (2000b + 1280)/2101   | S → A → Y → B → G → S → ···
    −225/1796 ≤ b < 15/47    | (400b + 80)/61        | S → X → G → S → ···
    15/47 ≤ b < 3/4          | 32b/3                 | S → X → Y → X → ···
    3/4 ≤ b                  | 16b − 4               | S → X → X → ···
Fig. 4. Transition of the margins ε^L (left) and ε^U (right) vs. step t, from various initial margins (10^-13, 10^-10, 10^-7, 10^-4, 0.1).
Fig. 5. Total number of elements Σ_{s,a} |Ω^X(s, a)| vs. step t for various initial margins. (Left: X = L, Right: X = U.)
Fig. 6. Interval width Q^U_{(0.2,1)}(A, Up) − Q^L_{(0.2,1)}(A, Up) vs. step t for various initial margins.
Fig. 7. Fixed-margin algorithm (ε^U = ε^L = 10^-2 and ε^U = ε^L = 10^-9). Left: total number of elements Σ_{s,a} |Ω^X(s, a)| for X = U, L. Right: interval width.
Fig. 8. Average of 100 trials with inappropriate factors ξ_s = 1.5, ξ_w = 1.015 for γ = 0.5. Left: total number of elements in the upper approximation. Right: interval width.

nonlinear with respect to b (Table 1). Note that the second pattern (S → A → Y)
in Table 1 cannot appear with the naive approach of section 1.

The proposed algorithm is applied to this task with random actions a_t and parameters α = 0.7, (ξ_s, ξ_w) = (1.7, 1.015), (θ_Q, θ_Ω) = (0, 2), β̄ = (0.2, 1), ε_min = 10^-14. The initial margin ε^L = ε^U at t = 0 is one of 10^-1, 10^-4, 10^-7, 10^-10, 10^-13. On this task, we can replace convex hulls with upper convex hulls in our algorithm because β is restricted to the upper half plane [6]. We also assume |b| ≤ 10 ≡ b_max, and in the lower approximation we can safely remove the edges on both ends in Fig. 2 if the absolute value of their slope is greater than b_max.

Averages of 100 trials are shown in Figs. 4–6. The proposed algorithm is robust to a wide range of initial margins. It realizes reduced set sizes and a small interval width at the same time; these requirements are a trade-off in the conventional fixed-margin algorithm [7] (Fig. 7). A problem of the proposed algorithm is sensitivity to the factors ξ_s, ξ_w. When they are inappropriate, instability is observed after a long run (Fig. 8). Another problem is slow convergence of the interval width Q^U − Q^L compared with the fixed-margin algorithm.

6 Conclusion

A parallel RL method with adaptive margins is proposed for the weighted criterion model, and its behavior is verified experimentally with a basic task. Adaptive margins realize reduced set sizes and accurate results. A problem of the adaptive margins is instability for inappropriate parameters: though the method is robust to initial margins, it needs tuning of the factor parameters. Another problem is slow convergence of the interval between the upper and lower estimations. These points must be studied further.

References

1. Jaakkola, T., et al.: Neural Computation 6, 1185–1201 (1994)
2. Sutton, R.S., Barto, A.G.: Reinforcement Learning. The MIT Press, Cambridge (1998)
3. Kaneko, Y., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. 167 (2004)
4. Kaneko, N., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. A-2-10 (2005)
5. Natarajan, S., et al.: In: Proc. Intl. Conf. on Machine Learning, pp. 601–608 (2005)
6. Hiraoka, K., et al.: The Brain & Neural Networks (in Japanese). Japanese Neural Network Society 13, 137–145 (2006)
7. Yoshida, M., et al.: Proc. FIT (in Japanese) (to appear, 2007)
8. Preparata, F.P., et al.: Computational Geometry. Springer, Heidelberg (1985)
9. Alexandrov, V.N., Dongarra, J., Juliano, B.A., Renner, R.S., Tan, C.J.K. (eds.): ICCS 2001. LNCS, vol. 2073. Springer, Heidelberg (2001)
10. Fukuda, K.: J. Symbolic Computation 38, 1261–1272 (2004)
11. Fogel, E., et al.: In: Proc. ALENEX, pp. 3–15 (2006)

Convergence Behavior of Competitive Repetition-Suppression Clustering

Davide Bacciu^{1,2}
and Antonina Starita^2

^1 IMT Lucca Institute for Advanced Studies, P.zza San Ponziano 6, 55100 Lucca, Italy
d.bacciu@imtlucca.it
^2 Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy
starita@di.unipi.it

Abstract. Competitive Repetition-suppression (CoRe) clustering is a bio-inspired learning algorithm that is capable of automatically determining the unknown cluster number from the data. In a previous work it has been shown how CoRe clustering represents a robust generalization of rival penalized competitive learning (RPCL) by means of M-estimators. This paper studies the convergence behavior of the CoRe model, based on the analysis proposed for the distance-sensitive RPCL (DSRPCL) algorithm. Furthermore, we propose a global minimum criterion for learning vector quantization in kernel space that is used to assess the correct location property of the CoRe algorithm.

1 Introduction

CoRe learning has been proposed as a biologically inspired learning model mimicking a memory mechanism of the visual cortex, i.e. repetition suppression [1]. CoRe is a soft-competitive model that allows only a subset of the most active units to learn in proportion to their activation strength, while it penalizes the least active units, driving them away from the patterns producing low firing strengths. This feature has been exploited in [2] to derive a clustering algorithm that is capable of automatically determining the unknown cluster number from the data by means of a reward-punishment procedure that resembles the rival penalization mechanism of RPCL [3]. Recently, Ma and Wang [4] have proposed a generalized loss function for the RPCL algorithm, named DSRPCL, which has been used for studying the convergence behavior of the rival penalization scheme.
In this paper, we present a convergence analysis for CoRe clustering that is founded on Ma and Wang's approach, describing how CoRe satisfies the three properties of separation nature, correct division and correct location [4]. The intuitive analysis presented in [4] for DSRPCL is reinforced with theoretical considerations showing that CoRe pursues a global optimality criterion for vector quantization algorithms. In order to do this, we introduce a kernel interpretation of the CoRe loss that is used to generalize the results given in [5] for hard vector quantization to kernel-based algorithms.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 497–506, 2008.
© Springer-Verlag Berlin Heidelberg 2008
2 A Kernel Based Loss Function for CoRe Clustering

A CoRe clustering network consists of cluster detector units that are characterized by a prototype c_i, which identifies the preferred stimulus for the unit u_i and represents the learned cluster centroid. In addition, units are characterized by an activation function φ_i(x_k, λ_i), defined in terms of a set of parameters λ_i and of the representation of an input pattern x_k ∈ χ. Such an activation function measures the similarity between the prototype c_i and the inputs, determining whether the pattern x_k belongs to the i-th cluster. In the remainder of the paper we will use an activation function that is a Gaussian centered at c_i with spread σ_i, i.e.

    φ_i(x_k | {c_i, σ_i}) = exp( −0.5 ||x_k − c_i||² / σ_i² ).

CoRe clustering works essentially by evolving a small set of highly selective cluster detectors out of an initially larger population by means of a competitive reward-punishment procedure that resembles the rival penalization mechanism [3]. Such a competition is engaged between two sets of units: at each step the most active units are selected to form the winners pool, while the remainder is inserted into the losers pool. More formally, we define the winners pool for the input x_k as the set of units u_i that fire more than θ_win, or the single unit that is maximally active for the pattern, that is,

    win_k = {i | φ_i(x_k, {c_i, σ_i}) ≥ θ_win} ∪ {i | i = argmax_{j∈U} φ_j(x_k | {c_j, σ_j})},   (1)

where the second term of the union ensures that win_k is non-empty. Conversely, the losers pool for x_k is lose_k = U \ win_k, that is, the complement of win_k with respect to the neuron set U. The units belonging to the losers pool are penalized and their response is suppressed. The strength of the penalization for the pattern x_k, at time t, is regulated by the repetition suppression RS^t_k ∈ [0, 1] and is proportional to the frequency of the pattern that has elicited the suppressive effect (see [2,6] for details). The repetition suppression is used to define a pseudo-target activation for the units in the losers pool as φ̂^t_i(x_k) = φ_i(x_k, {c_i, σ_i})(1 − RS^t_k), forcing the losers to decrease their activation proportionally to the amount of repetition suppression they receive. The error of the i-th loser unit can thus be written as

    E^t_{i,k} = (1/2)( φ̂^t_i(x_k) − φ_i(x_k, {c_i, σ_i}) )² = (1/2)( −φ_i(x_k, {c_i, σ_i}) RS^t_k )².   (2)

Conversely, in order to strengthen the activation of the winner units, we set the target activation for the neurons u_i (i ∈ win_k) to M, the maximum of the activation function φ_i(·). The error, in this case, can be written as

    E^t_{i,k} = M − φ_i(x_k, {c_i, σ_i}).   (3)

To analyze the CoRe convergence, we give an error formulation that accumulates the residuals in (2) and (3) for a given epoch e: summing up over all CoRe units in U and the dataset χ = (x_1, ..., x_k, ..., x_K) yields
    J^e(χ, U) = Σ_{i=1}^I Σ_{k=1}^K δ_ik (1 − φ_i(x_k)) + (1/2) Σ_{i=1}^I Σ_{k=1}^K ( (1 − δ_ik) φ_i(x_k) RS^{(e|χ|+k)}_k )²,   (4)

where δ_ik is the indicator function of the set win_k, and where {c_i, σ_i} has been omitted from φ_i to ease the notation. Note that, in (4), we have implicitly used the fact that the units can be treated as independent. The CoRe learning equations can be derived using gradient descent to minimize J^e with respect to the parameters {c_i, σ_i} [2]. Hence, the prototype increment for the e-th epoch can be calculated as

    Δc^e_i = α_c Σ_{k=1}^K [ δ_ik φ_i(x_k) (x_k − c^e_i)/(σ^e_i)² − (1 − δ_ik) ( φ_i(x_k) RS^{(e|χ|+k)}_k )² (x_k − c^e_i)/(σ^e_i)² ],   (5)

where α_c is a suitable learning rate ensuring that J^e decreases with e. Similarly, the spread update can be calculated as

    Δσ^e_i = α_σ Σ_{k=1}^K [ δ_ik φ_i(x_k) ||x_k − c^e_i||²/(σ^e_i)³ − (1 − δ_ik) ( φ_i(x_k) RS^{(e|χ|+k)}_k )² ||x_k − c^e_i||²/(σ^e_i)³ ].   (6)

As one would expect, unit prototypes are attracted by similar patterns (first term in (5)) and repelled by dissimilar inputs (second term in (5)). Moreover, the neural selectivity is enhanced by reducing the Gaussian spread each time the corresponding unit happens to be a winner. Conversely, the variance of loser neurons is enlarged, reducing the units' selectivity and penalizing them for not having sharp responses.

The error formulation introduced so far can be restated by exploiting the kernel trick [7] to express the CoRe loss in terms of differences in a given feature space F. Kernel methods are algorithms that exploit a nonlinear mapping Φ : χ → F to project the data from the input space χ onto a convenient, implicit feature space F. The kernel trick is used to express all operations on Φ(x_1), Φ(x_2) ∈ F in terms of the inner product ⟨Φ(x_1), Φ(x_2)⟩. Such an inner product can be calculated without explicitly using the mapping Φ, by means of the kernel κ(x_1, x_2) = ⟨Φ(x_1), Φ(x_2)⟩.
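Before moving to the kernel view, the competition in (1) and the updates (5)–(6) can be sketched for a single pattern presentation. This is a hypothetical sketch (Python assumed): the unit representation, threshold and learning rates are invented for illustration, the repetition suppression RS is passed in as a precomputed value, and the per-pattern gradients follow the per-k terms of (5) and (6).

```python
import math

# One CoRe presentation step for a single pattern x; rs is the repetition
# suppression RS_k^t in [0, 1].  Not the authors' implementation.

def activation(x, c, sigma):
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-0.5 * d2 / sigma ** 2)

def core_step(units, x, rs, theta_win=0.5, lr_c=0.1, lr_s=0.01):
    """units: list of dicts with keys 'c' (list) and 'sigma' (float)."""
    acts = [activation(x, u['c'], u['sigma']) for u in units]
    best = max(range(len(units)), key=lambda i: acts[i])
    winners = {i for i, a in enumerate(acts) if a >= theta_win} | {best}  # eq. (1)
    for i, u in enumerate(units):
        phi, s = acts[i], u['sigma']
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, u['c']))
        if i in winners:   # first terms of (5)-(6): attraction
            g_c, g_s = phi / s ** 2, phi * d2 / s ** 3
        else:              # second terms of (5)-(6): RS-weighted repulsion
            g_c = -(phi * rs) ** 2 / s ** 2
            g_s = -(phi * rs) ** 2 * d2 / s ** 3
        u['c'] = [ci + lr_c * g_c * (xi - ci) for xi, ci in zip(x, u['c'])]
        u['sigma'] = s + lr_s * g_s
    return winners
```

A nearby unit ends up in the winners pool and its prototype moves toward x, while a far-away loser is pushed (very weakly, through the (φ RS)² factor) in the opposite direction, matching the attraction/repulsion reading of (5).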
To derive the kernel interpretation of the CoRe loss in (4), consider first the distance d_{F_κ} between x_1, x_2 ∈ χ in the feature space F_κ induced by the mapping Φ : χ → F_κ, that is,

    d_{F_κ}(x_1, x_2) = ||Φ(x_1) − Φ(x_2)||²_{F_κ} = κ(x_1, x_1) − 2κ(x_1, x_2) + κ(x_2, x_2).

The kernel trick [7] has been used to substitute the inner products in feature space with a suitable kernel κ calculated in the data space. If κ is chosen to be a Gaussian kernel, then κ(x, x) = 1. Hence d_{F_κ} can be rewritten as d_{F_κ} = ||Φ(x_1) − Φ(x_2)||²_{F_κ} = 2 − 2κ(x_1, x_2). Now, if we take x_1 to be an element of the input dataset, e.g. x_k ∈ χ, and x_2 to be the prototype c_i of the i-th CoRe unit, we can rewrite d_{F_κ} so that it depends on the activation function φ_i. Applying the substitution κ(x_k, c_i) = φ_i(x_k, {c_i, σ_i}), we obtain φ_i(x_k, {c_i, σ_i}) = 1 − (1/2)||Φ(x_k) − Φ(c_i)||²_{F_κ}. Substituting this result into the formulation of the CoRe loss in (4), we obtain

    J^e(χ, U) = (1/2) Σ_{i=1}^I Σ_{k=1}^K δ_ik ||Φ(x_k) − Φ(c_i)||²_{F_κ}
              + (1/2) Σ_{i=1}^I Σ_{k=1}^K (1 − δ_ik) [ RS^{(e|χ|+k)}_k ( 1 − (1/2)||Φ(x_k) − Φ(c_i)||²_{F_κ} ) ]².   (7)

Equation (7) states that CoRe minimizes the feature-space distance between the prototype c_i and those x_k that are close in the kernel space induced by the activation functions φ_i, while it maximizes the feature-space distance between the prototypes and those x_k that are far from c_i in the kernel space.

3 Separation Nature

To prove the separation nature of the CoRe process we need to demonstrate that, given a bounded hypersphere G containing all the sample data, after sufficiently many iterations of the algorithm the cluster prototypes will finally either fall into G or remain outside it and never get into G. In particular, the prototypes remaining outside the hypersphere will be driven far away from the samples by the RS repulsion. We consider a prototype c_i to be far away from the data if, for a given epoch e, it is in the loser pool for every x_k ∈ χ. To prove the CoRe separation nature we first demonstrate the following lemma.

Lemma 1. When a prototype c_i is far away from the data at a given epoch e, then it will always be a loser for every x_k ∈ χ and will be driven away from the data samples.

Proof. The definition of far away implies that, given c^e_i,
∀x k ∈ χ. i ∈ lose e k , where the e in the superscript refers to the learning epoch. Given the prototype update in (5), we obtain the weight vector increment Δc e i
c e i = −α σ K k=1
ϕ i (x k )RS
(e |χ|+k)
k σ e i 2 (x k − c
e i ). (8) As a result of (8), the prototype c e+1 i
other hand, by definition (1), for each of the data samples there exists at least one winner unit at every epoch $e$, whose prototype is moved towards the samples for which it has been a winner. Moreover, not every prototype can be deflected from the data, since this would make the first term of $J_e$ (see (4)) grow and, consequently, the whole $J_e(\chi, U)$ would diverge, because the loser error term in (4) is lower bounded. This would contradict the fact that $J_e(\chi, U)$ decreases with $e$, since CoRe applies gradient descent to the loss function. Therefore, there must exist at least one winning prototype $c_l^e$ that remains close to the samples at epoch $e$. On the other hand, $c_i^e$ is already far away from the samples and, by (8), $c_i^{e+1}$ won't be a winner for any $x_k \in \chi$.

Convergence Behavior of Competitive Repetition-Suppression Clustering 501

To prove this, consider the definition of $\mathrm{win}_k$
in (1): for $c_i^{e+1}$ to be a winner, it must hold either (i) $\varphi_i(x_k) \ge \theta_{\mathrm{win}}$ or (ii) $i = \arg\max_{j \in U} \varphi_j(x_k, \lambda_j)$. The former does not hold because the receptive-field area where the firing strength of the $i$-th unit is above the threshold $\theta_{\mathrm{win}}$ does not contain any sample at epoch $e$; consequently, it cannot contain any sample at epoch $e+1$, since its center $c_i^{e+1}$ has been deflected further from the data. The latter does not hold since there exists at least one prototype, namely $c_l$, that remains close to the data, generating higher activations than unit $u_i$. As a consequence, a far-away prototype $c_i$ will be deflected away from the data until it reaches a stable point where the corresponding firing strength $\varphi_i$ is negligible. □

Now we can proceed to demonstrate the following theorem.

Theorem 1. For a CoRe process there exists a hypersphere $G$ surrounding the sample data $\chi$ such that, after sufficiently many iterations, each prototype $c_i$ will either (i) fall into $G$ or (ii) stay outside $G$ and reach a stable point.

Proof. The CoRe process is a gradient descent (GD) algorithm on $J_e(\chi, U)$; hence, for a sufficiently small learning step, the loss decreases with the number of epochs. Therefore, since $J_e(\chi, U)$ is always positive, the GD process will converge to a minimum $J^*$.
The sequences of prototype vectors $\{c_i^e\}$ will converge either to a point close to the samples or to a point of negligible activation far away from the data. If a unit $u_i$ has a sufficiently long subsequence of prototypes $\{c_i^e\}$ diverging from the dataset then, at a certain time, it will no longer be a winner for any sample and, by Lemma 1, it will converge to a point far away from the data. The attractors for the sequences $\{c_i^e\}$ of the diverging units lie at a certain distance $r$ from the samples, which is determined by those points $x$ where the Gaussian unit centered in $x$ produces a negligible activation in response to any pattern $x_k \in \chi$. Hence, $G$ can be chosen as any hypersphere surrounding the samples with radius smaller than $r$.
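This repulsion mechanism can be sketched numerically. The toy script below is our own illustration, not code from the paper: the 1-D data, the step size, and the simplification of setting the RS factor to 1 are all assumptions. It iterates a loser update in the spirit of Eq. (8) and shows that a far-away prototype drifts monotonically away from the samples while its Gaussian activation becomes negligible, which is exactly the stable far-away point described by Lemma 1.

```python
import math

# Toy 1-D data clustered near the origin (values assumed for illustration).
samples = [-0.2, 0.0, 0.1, 0.3]
sigma = 1.0   # Gaussian unit width (assumed)
alpha = 0.5   # learning step (assumed)

def activation(c, x, sigma):
    """Gaussian firing strength of a unit with prototype c."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

c = 3.0       # prototype starting far away: a loser for every sample
dists = []
for epoch in range(50):
    # Loser update in the spirit of Eq. (8): step *away* from each sample,
    # weighted by the unit's (shrinking) activation; RS factor set to 1 here.
    grad = sum(activation(c, x, sigma) * (x - c) for x in samples)
    c -= alpha * grad / sigma ** 2   # minus sign: repulsion from the data
    dists.append(min(abs(c - x) for x in samples))

# The prototype only moves further out, and its firing strength decays.
print(dists[0], dists[-1], max(activation(c, x, sigma) for x in samples))
```

The monotone drift makes the point of the lemma concrete: once a unit is a loser for every sample, nothing in the update can pull it back towards the data.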
On the other hand, since $J_e(\chi, U)$ decreases to $J^*$, there must exist at least one prototype that is not far away from the data (otherwise the first term of $J_e$ would grow), and the corresponding sequences $\{c_i^e\}$ must have accumulation points close to the samples. Therefore, any hypersphere $G$ enclosing all the samples will also surround the accumulation points of $\{c_i^e\}$ and, after a certain epoch $E$, the sequence will always remain within such a hypersphere. □

In summary, Theorem 1 tells us that the separation nature holds for a CoRe process: some prototypes are possibly pushed away from the data until their contribution to the error in (4) becomes negligible. Far-away prototypes will always be losers and will never head back to the data. Conversely, the remaining prototypes will converge to the samples, heading to a saddle point of the loss $J_e(\chi, U)$ by means of a gradient descent process.

4 Correct Division and Location

Following the convergence analysis in [4], we now turn our attention to the issues of correct division and location of the weight vectors. This means that the number of prototypes falling into $G$ will be $n_c$, i.e. the number of actual clusters in the sample data, and that they will finally converge to the centers of the clusters. At this point, we leave the intuitive study presented for DSRPCL [4] and introduce a sound analysis of the properties of the saddle points identified by CoRe, giving a necessary and sufficient condition for identifying the global minimum of a vector quantization loss in feature space.

4.1
A Global Minimum Condition for Vector Quantization in Kernel Space

The classical problem of hard vector quantization (VQ) in Euclidean space is to determine a codebook $V = \{v_1, \dots, v_N\}$ minimizing the total distortion, calculated by Euclidean norms, resulting from the approximation of the inputs $x_k \in \chi$ by the code vectors $v_i$. Here, we focus on the more general problem of vector quantization in feature space. Given the nonlinear mapping $\Phi$ and the induced feature-space norm $\|\cdot\|_{F_\kappa}$ introduced in the previous sections, we aim at optimizing the distortion

\[
\min\; D(\chi, \Phi_V) = \frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{N} \delta_{ik}^1 \, \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} \tag{9}
\]

where $\Phi_V = \{\Phi_{v_1}, \dots, \Phi_{v_N}\}$ represents the codebook in the kernel space and $\delta_{ik}^1$ equals 1 if the $i$-th cluster is the closest to the $k$-th pattern in the feature space $F_\kappa$, and 0 otherwise. It is widely known that VQ generates a Voronoi tessellation of the quantized space and that a necessary condition for the minimization of the distortion requires the code vectors to be selected as the centroids of the Voronoi regions [8]. In [5], a necessary and sufficient condition is given for the global minimum of a Euclidean VQ distortion function. In the following, we generalize this result to vector quantization in feature space. To prove the global minimum condition in kernel space we need to extend the results in [9] (Propositions 3.1.7 and 3.2.4) to the more general case of a kernel-induced distance metric. Therefore we introduce the following lemma.

Lemma 2. Let $\kappa$ be a kernel and $\Phi : \chi \to F_\kappa$ a map into the corresponding feature space $F_\kappa$. Given a dataset $\chi = \{x_1, \dots, x_K\}$ partitioned into $N$ subsets $C_i$, define the feature-space mean $\Phi_\chi = \frac{1}{K}\sum_{k=1}^{K} \Phi(x_k)$ and the $i$-th partition centroid $\Phi_{v_i} = \frac{1}{|C_i|}\sum_{k \in C_i} \Phi(x_k)$; then

\[
\sum_{k=1}^{K} \|\Phi(x_k) - \Phi_\chi\|^2_{F_\kappa} = \sum_{i=1}^{N}\sum_{k \in C_i} \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} \;+\; \sum_{i=1}^{N} |C_i|\, \|\Phi_{v_i} - \Phi_\chi\|^2_{F_\kappa}. \tag{10}
\]
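Lemma 2 can be checked numerically without ever computing $\Phi$ explicitly, since every norm in (10) expands into kernel evaluations via the kernel trick. The following sketch is our own illustration (the toy data and kernel width are assumptions, not from the paper); it verifies the decomposition for a Gaussian kernel:

```python
import math

def kappa(x, y, s=1.0):
    """Gaussian kernel; implicitly defines the feature map Phi."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * s * s))

X = [(0.0, 0.1), (0.2, -0.1), (3.0, 3.1), (2.9, 2.8), (3.2, 3.0)]
C = [[0, 1], [2, 3, 4]]          # index sets of the N = 2 partitions
K = len(X)

def mean_sq_dist(k_idx, idx):
    """||Phi(x_k) - mean_{j in idx} Phi(x_j)||^2 via the kernel trick."""
    m = len(idx)
    return (kappa(X[k_idx], X[k_idx])
            - 2.0 / m * sum(kappa(X[k_idx], X[j]) for j in idx)
            + sum(kappa(X[j], X[l]) for j in idx for l in idx) / m ** 2)

def mean_mean_sq_dist(idx_a, idx_b):
    """||mean Phi over idx_a - mean Phi over idx_b||^2 via the kernel trick."""
    ma, mb = len(idx_a), len(idx_b)
    return (sum(kappa(X[j], X[l]) for j in idx_a for l in idx_a) / ma ** 2
            + sum(kappa(X[j], X[l]) for j in idx_b for l in idx_b) / mb ** 2
            - 2.0 * sum(kappa(X[j], X[l]) for j in idx_a for l in idx_b) / (ma * mb))

everything = list(range(K))
lhs = sum(mean_sq_dist(k, everything) for k in range(K))        # total scatter
within = sum(mean_sq_dist(k, Ci) for Ci in C for k in Ci)       # first term of (10)
between = sum(len(Ci) * mean_mean_sq_dist(Ci, everything) for Ci in C)  # second term
print(abs(lhs - (within + between)))   # identity (10): difference is ~0
```

Since (10) is an exact identity, the residual is zero up to floating-point error for any partition, not just the "good" one.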
Proof. Given a generic feature vector $\Phi_1$, consider the identity $\Phi(x_k) - \Phi_1 = (\Phi(x_k) - \Phi_{v_i}) + (\Phi_{v_i} - \Phi_1)$: its squared norm in feature space is

\[
\|\Phi(x_k) - \Phi_1\|^2_{F_\kappa} = \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} + \|\Phi_{v_i} - \Phi_1\|^2_{F_\kappa} + 2(\Phi(x_k) - \Phi_{v_i})^T(\Phi_{v_i} - \Phi_1).
\]

Summing over all the elements in the $i$-th partition we obtain

\[
\sum_{k \in C_i} \|\Phi(x_k) - \Phi_1\|^2_{F_\kappa} = \sum_{k \in C_i} \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} + |C_i|\,\|\Phi_{v_i} - \Phi_1\|^2_{F_\kappa} + 2\sum_{k \in C_i} (\Phi(x_k) - \Phi_{v_i})^T(\Phi_{v_i} - \Phi_1), \tag{11}
\]

where the last term in (11) vanishes since $\sum_{k \in C_i} (\Phi(x_k) - \Phi_{v_i}) = 0$ by definition of $\Phi_{v_i}$. Now, applying the substitution $\Phi_1 = \Phi_\chi$ and summing up over all the $N$ partitions yields

\[
\sum_{k=1}^{K} \|\Phi(x_k) - \Phi_\chi\|^2_{F_\kappa} = \sum_{i=1}^{N}\sum_{k \in C_i} \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} + \sum_{i=1}^{N} |C_i|\,\|\Phi_{v_i} - \Phi_\chi\|^2_{F_\kappa}, \tag{12}
\]

where the left-hand side of the equality holds since $\bigcup_{i=1}^{N} C_i = \chi$ and $\bigcap_{i=1}^{N} C_i = \emptyset$. □

Using the results of Lemma 2 we can proceed with the formulation of the global minimum criterion, generalizing Proposition 1 in [5] to vector quantization in feature space.

Proposition 1. Let $\{\Phi_{v_1^g}, \dots, \Phi_{v_N^g}\}$ be a global minimum solution to the problem in (9); then

\[
\sum_{i=1}^{N} |C_i^g|\, \|\Phi_{v_i^g} - \Phi_\chi\|^2_{F_\kappa} \;\ge\; \sum_{i=1}^{N} |C_i|\, \|\Phi_{v_i} - \Phi_\chi\|^2_{F_\kappa} \tag{13}
\]

for any locally optimal solution $\{\Phi_{v_1}, \dots, \Phi_{v_N}\}$, where $\{C_1^g, \dots, C_N^g\}$ and $\{C_1, \dots, C_N\}$ are the partitions of $\chi$ corresponding to the centroids $\Phi_{v_i^g} = \frac{1}{|C_i^g|}\sum_{k \in C_i^g} \Phi(x_k)$ and $\Phi_{v_i} = \frac{1}{|C_i|}\sum_{k \in C_i} \Phi(x_k)$ respectively, and where $\Phi_\chi$ is the dataset mean (see the definition in Lemma 2).

Proof. Since $\{\Phi_{v_1^g}, \dots, \Phi_{v_N^g}\}$ is a global minimum for (9) we have

\[
\sum_{i=1}^{N}\sum_{k \in C_i^g} \|\Phi(x_k) - \Phi_{v_i^g}\|^2_{F_\kappa} \;\le\; \sum_{i=1}^{N}\sum_{k \in C_i} \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} \tag{14}
\]

for any local minimum $\{\Phi_{v_1}, \dots, \Phi_{v_N}\}$. Applying Lemma 2 to both partitions gives

\[
\sum_{k=1}^{K} \|\Phi(x_k) - \Phi_\chi\|^2_{F_\kappa} = \sum_{i=1}^{N}\sum_{k \in C_i^g} \|\Phi(x_k) - \Phi_{v_i^g}\|^2_{F_\kappa} + \sum_{i=1}^{N} |C_i^g|\,\|\Phi_{v_i^g} - \Phi_\chi\|^2_{F_\kappa} \tag{15}
\]

\[
\sum_{k=1}^{K} \|\Phi(x_k) - \Phi_\chi\|^2_{F_\kappa} = \sum_{i=1}^{N}\sum_{k \in C_i} \|\Phi(x_k) - \Phi_{v_i}\|^2_{F_\kappa} + \sum_{i=1}^{N} |C_i|\,\|\Phi_{v_i} - \Phi_\chi\|^2_{F_\kappa}. \tag{16}
\]

Since (14) holds, comparing (15) and (16) we obtain

\[
\sum_{i=1}^{N} |C_i^g|\, \|\Phi_{v_i^g} - \Phi_\chi\|^2_{F_\kappa} \;\ge\; \sum_{i=1}^{N} |C_i|\, \|\Phi_{v_i} - \Phi_\chi\|^2_{F_\kappa}. \qquad \square
\]
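Proposition 1 has an intuitive reading: because the total feature-space scatter is fixed by Lemma 2, a partition with lower kernel VQ distortion (9) necessarily has a larger between-cluster term, and the global minimum maximizes that term. The small numerical illustration below is our own sketch (the data, kernel width, and the two candidate partitions are assumptions, not from the paper):

```python
import math

def kappa(x, y, s=1.0):
    """Gaussian kernel on R^2."""
    return math.exp(-((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2) / (2 * s * s))

X = [(0.0, 0.0), (0.1, 0.2), (4.0, 4.0), (4.1, 3.9)]   # two obvious clusters
K = len(X)

def sq_dist_to_centroid(k, idx):
    """||Phi(x_k) - Phi_{v_i}||^2 expanded with the kernel trick."""
    m = len(idx)
    return (kappa(X[k], X[k])
            - 2.0 / m * sum(kappa(X[k], X[j]) for j in idx)
            + sum(kappa(X[j], X[l]) for j in idx for l in idx) / m ** 2)

def centroid_to_mean(idx):
    """||Phi_{v_i} - Phi_chi||^2, Phi_chi being the dataset mean in feature space."""
    m = len(idx)
    return (sum(kappa(X[j], X[l]) for j in idx for l in idx) / m ** 2
            + sum(kappa(X[j], X[l]) for j in range(K) for l in range(K)) / K ** 2
            - 2.0 * sum(kappa(X[j], X[l]) for j in idx for l in range(K)) / (m * K))

def distortion(partition):
    """Kernel VQ loss (9) for a given partition of the indices."""
    return sum(sq_dist_to_centroid(k, Ci) for Ci in partition for k in Ci) / K

def between(partition):
    """Between-cluster term of Proposition 1 / Lemma 2."""
    return sum(len(Ci) * centroid_to_mean(Ci) for Ci in partition)

good = [[0, 1], [2, 3]]   # matches the two natural clusters
bad = [[0, 2], [1, 3]]    # mixes them
print(distortion(good) < distortion(bad))   # lower distortion ...
print(between(good) > between(bad))         # ... larger between-cluster term
```

Since the within and between terms always sum to the same constant, the two comparisons are equivalent, which is precisely the content of the proposition.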
4.2 Correct Division and Location for CoRe Clustering

To evaluate the correct division and location properties we first analyze the case where the number of units $N$ equals the true cluster number $n_c$. The CoRe loss can be split into a winner-dependent and a loser-dependent term, i.e. $J_e(\chi, U) = J_e^{win}(\chi, U) + J_e^{lose}(\chi, U)$. By definition, $J_e^{win}(\chi, U) = \sum_{i=1}^{n_c}\sum_{k=1}^{K} \delta_{ik}(1 - \varphi_i(x_k))$ must have at least one minimum point. Applying the necessary condition $\partial J_e^{win}(\chi, U)/\partial c_i = 0$ we obtain an estimate of the prototypes by means of a fixed-point iteration, that is

\[
c_i^e = \frac{\sum_{k=1}^{K} \delta_{ik}\, \varphi_i(x_k)\, x_k}{\sum_{k=1}^{K} \delta_{ik}\, \varphi_i(x_k)}. \tag{17}
\]

When the number of prototypes equals the number of clusters, the fixed-point iteration in (17) converges by positioning each unit's weight vector close to the true cluster centroids. In addition, it can be shown that (17) approximates a local minimum of the kernel vector quantization loss in (9). To prove this, consider the CoRe loss formulation in kernel space (7): we have $J_e^{win}(\chi, U) = \frac{1}{2}\sum_{i=1}^{n_c}\sum_{k=1}^{K} \delta_{ik}\, \|\Phi(x_k) - \Phi(c_i)\|^2_{F_\kappa}$, where $c_i$ is estimated by (17). Now, consider the VQ loss in (9): a necessary condition for its minimization requires the computation of the cluster centroids as $\Phi_{v_i} = \frac{1}{|C_i|}\sum_{k \in C_i} \Phi(x_k)$. The exact calculation of $\Phi_{v_i}$ requires knowing the form of the implicit nonlinear mapping $\Phi$ in order to solve the so-called pre-image problem [10], that is, determining $z$ such that $\Phi(z) = \Phi_{v_i}$, which cannot be solved exactly in the general case [10]. However, instead of calculating the exact pre-image, we can search for an approximation by seeking the $z$ minimizing $\rho(z) = \|\Phi_{v_i} - \Phi(z)\|^2_{F_\kappa}$, that is, the feature-space distance between the centroid in kernel space and the mapping of its approximated pre-image. Rather than optimizing $\rho(z)$ directly, it is easier to minimize the distance between $\Phi_{v_i}$ and the projection of $\Phi(z)$ onto its direction; due to space limitations, we omit the technicalities of this calculation (see [10] for further details). It turns out that the minimization of $\rho(z)$ reduces to the evaluation of the gradient of $\tilde{\rho}(z) = \langle \Phi_{v_i}, \Phi(z) \rangle$. By substituting the definition of $\Phi_{v_i}$ and applying the kernel trick we obtain

\[
\rho(z) = \frac{1}{|C_i|^2}\sum_{k,j \in C_i} \kappa(x_k, x_j) + \kappa(z, z) - \frac{2}{|C_i|}\sum_{k \in C_i} \kappa(x_k, z),
\]

where $\kappa(z, z) = 1$ since we are using a Gaussian kernel. Differentiating $\tilde{\rho}(z)$ with respect to $z$ and solving by fixed-point iteration yields

\[
z^e = \frac{\sum_{k \in C_i} \kappa(x_k, z^{e-1})\, x_k}{\sum_{k \in C_i} \kappa(x_k, z^{e-1})}, \tag{18}
\]

which is the same as the prototype estimate obtained in (17) for Gaussian kernels centered in $z^e$. The indicator function $\delta_{ik}$ in (17) is non-null only for those points $x_k$ for which unit
i was in the winner set. This does not ensure the partition conditions over $\chi$, since, by definition of $\mathrm{win}_k$, some points can be associated with two or more winners. However, by (6) we know that the variance of the winners tends to reduce as learning proceeds. Therefore, using the same arguments as Gersho [8], it can be demonstrated that, after a certain epoch $E$, the CoRe winner competition becomes a WTA process in which $\delta_{ik}$ ensures the partition conditions over $\chi$.

Summarizing, the minimization of the CoRe winner error $J_e^{win}(\chi, U)$ generates an approximate solution to the vector quantization problem in feature space in (9). As a consequence, the prototypes $c_i$ become a local solution satisfying the conditions of Proposition 1. Hence, substituting the definition of $\Phi_\chi$ into the results of Proposition 1, we obtain that $\{c_1, \dots, c_{n_c}\}$ is an approximated global minimum for (9) if and only if

\[
\sum_{i=1}^{n_c}\sum_{k=1}^{K} \frac{|C_i|}{K}\, \|\Phi(c_i) - \Phi(x_k)\|^2_{F_\kappa} \;\ge\; \sum_{i=1}^{n_c}\sum_{k=1}^{K} \frac{|\tilde{C}_i|}{K}\, \|\Phi(\tilde{c}_i) - \Phi(x_k)\|^2_{F_\kappa} \tag{19}
\]

holds for every $\{\tilde{c}_1, \dots, \tilde{c}_{n_c}\}$ that are approximated pre-images of a local minimum of (9). In summary, a global optimum of (9) should minimize the feature-space distance between the prototypes and the samples belonging to their clusters, while maximizing the weight vectors' distance from the sample mean or, equivalently, the distance from all the samples in the dataset $\chi$.

The loser component $J_e^{lose}(\chi, U)$ of the kernel CoRe loss (7) depends on the term $(1 - \tfrac{1}{2}\|\Phi(c_i) - \Phi(x_k)\|^2_{F_\kappa})$, which maximizes the distance between the prototypes $c_i$ and those $x_k$ that do not fall in the respective Voronoi sets $C_i$. Hence, $J_e^{lose}(\chi, U)$ produces a distortion in the estimate of $c_i$ that pursues the global optimality criterion, except that it discounts the repulsive effect of the $x_k \in C_i$. In fact, (19) suggests that $c_i$ has to be repelled by all the $x_k \in \chi$. On the other hand, the estimate $c_i$ is a linear combination of the $x_k \in C_i$: applying the repulsive effect in (19) would subtract their contribution, either canceling the attractive effect (which would be catastrophic) or simply scaling the magnitude of the learning step without changing the final direction. Hence, the CoRe loss makes a reasonable assumption in discarding the repulsive effect of the $x_k \in C_i$ when calculating the estimate of $c_i$. Summarizing, CoRe locates the prototypes close to the centroids of the $n_c$ clusters by means of (17), escaping from local minima of the loss function by approximating the global minimum condition of Proposition 1.

Finally, we need to study the behavior of $J_e$ as the number of units
varies with respect to the true cluster number $n_c$. Following the same reasoning as in [4], we see that the winner-dependent loss $J_e^{win}$ tends to decrease as the number of units increases. However, if the number of units falling into $G$ is larger than $n_c$, several units compete for the same clusters, and the samples of those clusters tend to produce an increased level of error in $J_e^{lose}$, contrasting the reduction of $J_e^{win}$. On the other hand, $J_e^{lose}$ tends to decrease when the number of units inside $G$ is lower than $n_c$. This, however, produces increased levels of $J_e^{win}$, since the prototype allocation won't match the underlying sample distribution. Hence, the CoRe error has its minimum when the number of units inside $G$ approximates $n_c$.

5 Conclusion

The paper presents a sound analysis of the convergence behavior of CoRe clustering, showing how the minimization of the CoRe cost function satisfies the properties of separation nature and correct division and location [4]. As the loss decreases to a minimum, the CoRe algorithm is shown to converge, allocating the correct number of prototypes to the centers of the clusters. Moreover, a sound optimality criterion is given that shows how CoRe gradient descent pursues a global minimum of the vector quantization problem in feature space. The results presented in the paper hold for a batch gradient descent process. However, it can be proved that, under Ljung's conditions [11], they extend to stochastic (online) gradient descent. Moreover, we plan to investigate further the properties of the CoRe kernel formulation, extending the convergence analysis to a wider class of activation functions other than Gaussians, i.e. normalized kernels.

References

1. Grill-Spector, K., Henson, R., Martin, A.: Repetition and the brain: neural models of stimulus-specific effects. Trends in Cognitive Sciences 10(1), 14–23 (2006)
2. Bacciu, D., Starita, A.: A robust bio-inspired clustering algorithm for the automatic determination of unknown cluster number. In: Proceedings of the 2007 International Joint Conference on Neural Networks, pp. 1314–1319. IEEE, Los Alamitos (2007)
3. Xu, L., Krzyzak, A., Oja, E.: Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans. on Neural Networks 4(4) (1993)
4. Ma, J., Wang, T.: A cost-function approach to rival penalized competitive learning (RPCL). IEEE Trans. on Systems, Man, and Cybernetics 36(4), 722–737 (2006)
5. Munoz-Perez, J., Gomez-Ruiz, J.A., Lopez-Rubio, E., Garcia-Bernal, M.A.: Expansive and competitive learning for vector quantization. Neural Processing Letters 15(3), 261–273 (2002)
6. Bacciu, D., Starita, A.: Competitive repetition suppression learning. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 130–139. Springer, Heidelberg (2006)
7. Scholkopf, B., Smola, A., Muller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5), 1299–1319 (1998)
8. Yair, E., Zeger, K., Gersho, A.: Competitive learning and soft competition for vector quantizer design. IEEE Trans. on Signal Processing 40(2), 294–309 (1992)
9. Spath, H.: Cluster Analysis Algorithms. Ellis Horwood (1980)
10. Scholkopf, B., Mika, S., Burges, C.J.C., Knirsch, P., Muller, K.R., Ratsch, G., Smola, A.J.: Input space versus feature space in kernel-based methods. IEEE Trans. on Neural Networks 10(5), 1000–1017 (1999)
11. Ljung, L.: Strong convergence of a stochastic approximation algorithm. The Annals of Statistics 6(3), 680–696 (1978)
Self-Organizing Clustering with Map of Nonlinear Varieties Representing Variation in One Class

Hideaki Kawano, Hiroshi Maeda, and Norikazu Ikoma

Kyushu Institute of Technology, Faculty of Engineering, 1-1 Sensui-cho, Tobata-ku, Kitakyushu, 804-8550, Japan
kawano@ecs.kyutech.ac.jp

Abstract. The Adaptive Subspace Self-Organizing Map (ASSOM) is an evolution of the Self-Organizing Map in which each computational unit defines a linear subspace. Recently, a modified version has been proposed in which each unit defines a linear manifold instead of a linear subspace. The linear manifold in a unit is represented by a mean vector and a set of basis vectors. After training, these units result in a set of linear variety detectors. From another point of view, we can consider that the AMSOM represents the latent commonality of data as linear structures. In numerous cases, however, these are not enough to describe the latent commonality of data because of their linearity. In this paper, nonlinear varieties are considered in order to represent the diversity of data within a class. The effectiveness of the proposed method is verified by applying it to some simple classification problems.

1 Introduction

The subspace method is popular in pattern recognition, feature extraction, compression, classification, and signal processing [1]. Unlike other techniques, where classes are primarily defined as regions or zones in the feature space, the subspace method uses linear subspaces that are defined by a set of normalized basis vectors. One linear subspace is usually associated with one class. An input vector is classified to a particular class if its projection error onto the subspace associated with that class is the minimum. The subspace method, as compared to other pattern recognition techniques, has advantages in applications where the relative intensities or energies of the vector components are more important than the overall level of the signal.
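The scale-invariance advantage just mentioned can be made concrete: classifying by the relative projection energy $\|P_i x\|^2 / \|x\|^2$ depends only on the direction of the input, not on its overall level. The following minimal sketch is our own illustration (the basis vectors and test point are assumed values, not from the paper):

```python
# Two classes, each modeled by a 1-D linear subspace of R^3
# spanned by a unit-norm basis vector (values assumed for illustration).
bases = {
    "A": [(1.0, 0.0, 0.0)],
    "B": [(0.0, 0.6, 0.8)],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_score(x, basis):
    """Relative projection energy ||P x||^2 / ||x||^2.

    The ratio depends only on the direction of x, not on its magnitude,
    which is the scale-invariance property discussed in the text."""
    proj_sq = sum(dot(x, b) ** 2 for b in basis)  # basis assumed orthonormal
    return proj_sq / dot(x, x)

def classify(x):
    return max(bases, key=lambda c: fit_score(x, bases[c]))

x = (0.1, 0.5, 0.7)                      # roughly along class B's direction
x_scaled = tuple(5.0 * v for v in x)     # same pattern, 5x the signal level
print(classify(x), classify(x_scaled))   # same label at either level
```

Scaling the input multiplies numerator and denominator of the score by the same factor, so the decision is unchanged, unlike a raw nearest-prototype rule.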
It also provides an economical representation for groups of vectors with high dimensionality, since one can often use a small set of basis vectors to approximate the subspace where the vectors reside. Another paradigm is to use a mixture of local subspaces to collectively model the data space.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 507–516, 2008. © Springer-Verlag Berlin Heidelberg 2008

The Adaptive-Subspace Self-Organizing Map (ASSOM) [2][3] is a mixture-of-local-subspaces method for pattern recognition. ASSOM, an evolution of the Self-Organizing Map (SOM) [4], consists of an input layer and a competitive layer arranging computational units in a line or a lattice structure. Each computational unit defines a subspace spanned by some basis vectors. ASSOM creates a set of subspace representations by competitive selection and cooperative learning. In SOM, a set of reference vectors is spatially organized to partition the input space. In ASSOM, a set of reference sub-models is topologically ordered, with each sub-model responsible for describing a specific region of the input space by its local principal subspace. The ASSOM is attractive not only because it inherits the topographic representation property of the SOM, but also because the learning results of ASSOM can faithfully describe the core features of various transformation groups. The simulation results in references [2] and [3] have illustrated that different feature filters can self-organize to different low-dimensional subspaces and that a wavelet-type representation emerges in the learning.

Recently, the Adaptive Manifold Self-Organizing Map (AMSOM), a modified version of ASSOM, has been proposed [5]. AMSOM has the same structure as ASSOM, except for the way each computational unit is represented. Each unit in AMSOM defines an affine subspace, which is composed of a mean vector and a set of basis vectors. By incorporating a mean vector into each unit, the recognition performance has been improved significantly. The simulation results in reference [5] have shown that AMSOM outperforms a linear PCA-based method and ASSOM in a face recognition problem.

In both ASSOM and AMSOM, the local subspace in each unit can be adapted by linear PCA learning algorithms. On the other hand, it is known that there are a number of advantages in introducing nonlinearities into a PCA-type network with reproducing kernels [6][13]. For example, the performance of the subspace method is affected by the dimensionality of the intersections of subspaces [1]. In other words, the dimensionality of each subspace should be as low as possible in order to achieve successful performance. It is, however, not enough to describe the variation in a class of patterns by a low-dimensional subspace because of its linearity. From this consideration, we propose a nonlinear extension of the AMSOM with reproducing kernels. The proposed method can be expected to construct nonlinear varieties so that an effective representation of data belonging to the same category is achieved with low dimensionality. The effectiveness of the proposed method is verified by applying it to some simple pattern classification problems.

2 Adaptive Manifold Self-Organizing Map (AMSOM)

In this section, we give a brief review of the original AMSOM. Fig. 1 shows the structure of the AMSOM. It consists of an input layer and a competitive layer, which contain $n$ and $M$ units respectively. Suppose $i \in \{1, \dots, M\}$ indexes the computational units in the competitive layer, and the dimensionality of the input vector is $n$. The $i$-th computational unit constructs an affine subspace, which is composed of a mean vector $\mu^{(i)}$ and a subspace spanned by $H$ basis vectors.
Fig. 1. A structure of the Adaptive Manifold Self-Organizing Map (AMSOM): an input layer $x_1, \dots, x_j, \dots, x_n$ feeding a competitive layer of units
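A unit of the kind just described can be sketched as follows. This is our own illustrative code, not from the paper: each unit stores a mean vector and an orthonormal basis, and an input is assigned to the unit whose affine subspace yields the smallest residual. The example also shows why the mean vector matters: two units with the same basis direction but different means are distinguishable, which a pure linear-subspace model (both means at the origin) could not achieve.

```python
# Two AMSOM-style units in R^2 (all values assumed for illustration):
# same subspace direction, different mean vectors.
units = [
    {"mu": (0.0, 0.0), "basis": [(1.0, 0.0)]},   # manifold: the line y = 0
    {"mu": (0.0, 3.0), "basis": [(1.0, 0.0)]},   # manifold: the line y = 3
]

def residual(x, unit):
    """Distance from x to the affine subspace mu + span(basis)."""
    d = tuple(a - m for a, m in zip(x, unit["mu"]))
    proj = [0.0, 0.0]
    for b in unit["basis"]:                      # basis assumed orthonormal
        coef = sum(di * bi for di, bi in zip(d, b))
        proj = [p + coef * bi for p, bi in zip(proj, b)]
    return sum((di - pi) ** 2 for di, pi in zip(d, proj)) ** 0.5

def winner(x):
    """Competitive selection: unit with the smallest projection error."""
    return min(range(len(units)), key=lambda i: residual(x, units[i]))

# Points near y = 0 match unit 0; points near y = 3 match unit 1.
print(winner((2.0, 0.1)), winner((-1.0, 2.9)))   # 0 1
```

In the full AMSOM, the means and bases would of course be learned by competitive and cooperative updates rather than fixed by hand; the sketch only illustrates the matching (winner-selection) step.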