3 Results

The behaviour of the model is best illustrated by studying examples of its time series. Fig.2 describes a time series of the Q layer activities q_i(t) for all seven cells, two cues and five distractors, over a 100 trial experiment. This figure illustrates successive cue testing, where the stimulus with maximum salience x_max is gated into working memory, drives an action, and is then reactivated at the end period and reinforced if it drove the correct action and punished if not. Fig.3 describes in more detail time series of individual trials late in learning, after the correct cues have been found. This system is able to extract the correct cues and find the correct cue-action allocations. If a cue is extracted once and happens to drive the correct action, its salience is easily strengthened as described above and the action it is driving is reinforced, as described in more detail in [21,22]. Should a cue be extracted but not drive the correct action, the situation is different. If there is no positive dopamine signal at the end period, not only will the correct action not be reinforced, but the salience will also not be strengthened, even though it is in fact the correct cue and is simply not driving the correct action.

[Figure 2: Q layer activities q_i(t) and dopamine modulator D(t) versus trial number.]
Fig. 2. (a) Time series of 100 trials of the Q layer activities, q_i(t), overlaid on top of each other, each consisting of three stages separated by intertrial delay periods. In this figure the three trial stages of cue presentation, action selection reactivation, and end period reactivation are difficult to see each trial. Both cues are found by around trial number 80. Also shown is the dopamine modulator. (b) Magnification from panel (a) showing trials 6 to 18, illustrating successive cue testing as a distractor is gated into
working memory. Some trials it drives the correct action in stage 2 and some not, dependent on which particular cue it is presented with, and therefore it is not consistently reinforced, as can be seen from the dopamine time series. By trial 12 this distractor is punished sufficiently for its salience to become non-maximal and another new distractor becomes the maximally salient one, until trial 18 when it too is punished sufficiently. In trial 18 one of the correct cues becomes the dominant stimulus for the first time.

[Figure 3: Q layer activities q_i(t), M layer activities M_1(t) and M_2(t), and dopamine modulator D(t) versus time.]
Fig. 3. (a) Time series of Q layer activities q_i(t) for two consecutive trials after the cues have been recognised, in two different panels, one for each cue type. (b) A similar figure in log scale for clarity. The selective activation of a single cue at the presentation stage 1, the action selection stage 2 and end period stage 3 is clear. The other q_i(t)
are only weakly activated and invisible near zero. Also shown is the dopamine D(t) time series, which is depressed below baseline during the cue presentation stage 1 and reactivation stage 2. It is excited above baseline during the end stage 3 by primary reward because these cues are driving the correct actions. Also shown are the two M layer action selection cells, M_1(t) and M_2(t). Each of the two cues activates a different M cell as required, while the other is suppressed. The M cells are activated in the end stage 3 for reinforcement as well as in the action selection stage 2. Notice that the M cells are activated late in the two reactivation periods to allow the cue reactivation to reach its fixed point level. This is modeled by setting the term G(t) to unity in Eq.6 late in the action selection periods.

A parsimonious point about this model is that this situation is easily dealt with. In fact, supposing the correct cue is reactivated at stage 2 but drives the wrong action, in this model the cue is still reactivated by the end signal stage 3, but there is no primary reward, and in fact the reactivation suppresses dopamine via the inhibitory projection in Eq.5. This negative dopamine signal targets the action selected by the cue for punishment. Since the cue in fact selects the incorrect action, this is an appropriate behaviour of the model. Although the cue will not be given salience by this action, the next time the cue is extracted from the distractors by successive testing it will more likely drive the correct action. Once it has driven the correct action its salience will be forever strengthened and the correct action reinforced and stabilized. Indeed this situation is shown in Fig.2. One of the two cues is found for the first time at trial 18; however, as can be seen from the below-baseline dopamine trace during the end stage 3 of trial 18 in Fig.2(b), it happens to be driving the incorrect action. However, the negative dopamine is sufficient to depress the incorrect cue-action allocation so that at a later trial, when this correct cue is again tested, it drives the correct action. The other cue is similarly extracted, tested and its cue-action allocation reversed, and by trial 80 they form a stable attractor state and remain at high salience for the rest of the experiment, while the distractors are suppressed, as can be seen in Fig.2(a). Another parsimonious point of this model is that the cues are preferentially picked out over the distractors, or pop out, due to the fact that they exclude each other in trials. Exclusion means that cues are not presented during the stimulus presentation stage 1 as often as the distractors are. As described above, the combination of Eqs. 4 and 5 means that q_i
cells active during the cue presentation stage cause inhibition of dopamine
D(t) in Eq.5, which causes LTD in their corresponding x_i(t).
Cues which are not active do not receive this LTD punishment. This exclusion property of the cues means the excluded cue is favoured over the distractors and is picked out more readily as a form of novelty response.

4 Discussion

We have presented a model of cue extraction in a general approach to the problem of temporal credit assignment and shown that, though simple, the model behaves as desired with several parsimonious aspects. We have described cue extraction and action binding by successive testing enhanced by cue mutual exclusion. We describe constant distractors but note that extraction will also occur if distractors are sampled randomly each trial, provided distractors have a non-zero probability of co-occurrence with each other in any given trial. Even in the case where the distractors have exclusive sets, since they are not predictive of reward, the system will still work.

One drawback of the model presented above is the necessity for an end signal to drive final stage 3 reactivation. At times when primary reward is found, the stage 3 reactivation could in principle be driven by primary reward; however, the problem with this idea is that no reactivation would occur when primary reward was not found, in contradiction to the requirement described above for discovery of the correct action from the punishment signal. To deal with this difficulty we extend the model to include another set of cells N. These cells learn to fire at the reward locations even when there is no reward and are used to drive the thalamus reactivations. Here we only have one and it is given by

dN/dt = −N + Σ_j J^PN_j(t) P_j,

where the weights J^PN_j evolve by

dJ^PN_j/dt = (D(t) − D_K) P_j.

It is activated by projection from the primary P layer representing the external environment, and LTP occurs only in the presence of excess dopamine. Accordingly, the dopamine signal Eq.5 is modified further to include an inhibitory projection −k_9 N(t) from this cell. Therefore, when the N cell has been bound to a given location represented by the P_j by positive R(t) and the animal returns to this location when primary reward is absent (since the cue was wrong), the N unit acts to generate a negative dopamine signal. This enhances the action punishment behaviour described above when reward is not found. Furthermore, we modify the thalamus signal T(t) in Eq.3 by replacing the end signal by this N cell activation,

dT/dt = −T(t) + T_K + T_junction + N(t),

so that the locations where reward has been experienced activate replays even if reward is not found on subsequent visits. We do not implement this model here, but as shown in [25] it works well without the necessity for an end signal. In [25] we also include working memory maintenance of actions as well as cues to allow rapid reversal of cue-action allocations without any noise, accounting for some of the known properties of the dopamine signal [1]. In future work we will show how to produce the junction thalamus signal in an internally defined way. We have not specified the location of the P and Q layers, but anywhere in cortex or hippocampus is possible. At a certain level the Q layer may even represent the hippocampus (CA3) while the P layer may correspond to cortex projecting to hippocampus, the M and N region the nucleus accumbens, and the dopamine system the VTA. Our 'gating' of PFC layer input into hippocampus would agree with Lisman and Grace [7] suggestions concerning control of behaviourally significant information flow.
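For concreteness, a minimal Euler-integration sketch of this extension might look like the following. The time step, the dopamine baseline D_K, the thalamic constant T_K and the externally supplied signals P(t), D(t) and T_junction(t) are assumptions standing in for quantities defined by the full model equations, which are not reproduced in this section.

```python
import numpy as np

def step_n_extension(N, J_PN, T, P, D, T_junction, D_K=1.0, T_K=0.0, dt=0.01):
    """One Euler step of the proposed N-cell extension (illustrative sketch).

    N    : scalar N-cell activity
    J_PN : weights from the primary P layer onto the N cell
    T    : thalamus signal
    P    : current P-layer activity vector
    D    : current dopamine modulator D(t)
    """
    dN = -N + J_PN @ P              # dN/dt = -N + sum_j J^PN_j(t) P_j
    dJ = (D - D_K) * P              # LTP of J^PN only under excess dopamine
    dT = -T + T_K + T_junction + N  # dT/dt = -T + T_K + T_junction + N(t)
    return N + dt * dN, J_PN + dt * dJ, T + dt * dT
```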
The reactivation of the cue in the Q layer at the junction stage 2 for action selection means that, after learning, the cell that is firing at the junction depends on where the animal is coming from and also, after learning of the task, where the animal is going to. This may therefore correspond to the hippocampal splitter cells [14,15] described in the introduction. Since the plus maze in both allocentric and egocentric conditions will obey similar principles (where the cue presentation stage corresponds to the spatial view of the animal from the cross junction, which depends on the direction from which the animal approaches, encoded in the place cells), this work may also explain the plus maze retrospective and prospective coding [17]. Furthermore, the striatal MSN neurons show response characteristics of context or task neurons, such as are found in the dorsal striatum [20] and ventral striatum [19], described in the introduction. These may correspond to both the M action-selecting cells and the N reactivation-generating cells. N cells may also generate the specific spatial response characteristic of expert neurons at the T-maze junction seen in Barnes et al. [20].

References

1. Schultz, W.: J. Neurophysiol. 80(1), 1–27 (1998)
2. Wickens, J.: 8, R77–R109 (1997)
3. Pasupathy, A., Miller, E.K.: Nature 433, 873 (2005)
4. Kawagoe, R., Takikawa, Y., Hikosaka, O.: J. Neurophysiol. 91(2), 1013–1024 (2004)
5. Hollerman, J.R., Tremblay, L., Schultz, W.: J. Neurophysiol. 80, 947–963 (1998)
6. Nicola, S.M., Surmeier, J., Malenka, R.C.: Ann. Rev. Neuroscience 23, 185 (2000)
7. Lisman, J.E., Grace, A.A.: Neuron 46, 703 (2005)
8. Goldman-Rakic, P.S.: Neuron 14(3), 477–485 (1995)
9. Gruber, A.J., Dayan, P., Gutkin, B.S., Solla, S.A.: J. Comput. Neurosci. 20, 153–166 (2006)
10. Cohen, J.D., Braver, T.S., Brown, J.W.: Curr. Opin. Neurobiol. 12(2), 223–229 (2002)
11. Dreher, J.C., Guigon, E., Burnod, Y.: J. Cog. Neurosci. 14(6), 853–865 (2002)
12. Frank, M.J., Loughry, B., O'Reilly, R.C.: Cog. A. & B. Neurosci. 1(2), 137–160 (2001)
13. Kiyatkin, E.A., Rebec, G.V.: J. Neurophysiol. 75(1), 142–153 (1996)
14. Hasselmo, M.E., Eichenbaum, H.: Neural Networks 18, 1172 (2005)
15. Wood, E.R., Dudchenko, P.A., Robitsek, R.J., Eichenbaum, H.: Neuron 27, 623–633 (2000)
16. Kunec, S., Hasselmo, M.E., Kopell, N.: J. Neurophysiol. 94 (2005)
17. Ferbinteanu, J., Shapiro, M.L.: Neuron 40, 1227–1239 (2003)
18. Anderson, M.I., Jeffery, K.J.: J. Neurosci. 23, 8827–8835 (2003)
19. Mulder, A.B., Tabuchi, E., Wiener, S.I.: European J. Neuroscience 19, 1923–1932 (2001)
20. Barnes, T.D., Kubota, Y., Hu, D., Jin, D.Z., Graybiel, A.M.: Nature 437, 1158 (2005)
21. Ponzi, A.: Proceedings of IJCNN 2007 (2007)
22. Ponzi, A.: Neural Networks (to appear, 2008)
23. Reynolds, J.N.J., Hyland, B.I., Wickens, J.R.: Nature 413, 67 (2001)
24. Nakahara, H., Amari, S., Hikosaka, O.: Neural Computation 14, 819–844 (2002)
25. Ponzi, A.: Proceedings of ICCN 2007 (2007)
26. Ponzi, A.: IEICE Technical Report 103(163), 19–24 (2006)
PLS Mixture Model for Online Dimension Reduction

Jiro Hayami and Koichiro Yamauchi

Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Hokkaido 060-0814, Japan

Abstract. This article presents an online learning method for modeling high dimensional input data. This method approximates a nonlinear function by summing up several local linear functions. Each linear function is represented as the weighted sum of a small number of dominant variables, which are extracted by the partial least squares (PLS) regression method. Moreover, a radial function, which represents the respective input area of each linear function, is also redefined using the dominant variables. This article also presents an online deterministic annealing expectation maximization (DAEM) algorithm, which includes a temperature control mechanism for acquiring the most suitable system parameters. Experimental results show the effective learning behavior of the new method.

Keywords: Partial Least Squares (PLS) Method, PLS Mixture Model, Dimension Reduction, Online DAEM Algorithm.

1 Introduction

In real-world applications, learning systems usually have to perform online learning of high-dimensional data. In such learning tasks, the learning machines may have to overcome the following problems.

– The learning machine must achieve a quick response to newly presented samples.
– High-dimensional data usually have multicollinearity or include irrelevant dimensions. If the learning machines learn them, they normally waste a huge amount of resources, so they need some way to reduce the amount of resources for learning such datasets.

To solve these problems, we propose a variation of the mixture of experts (ME). The ME used here is a mixture of local linear models. Each linear model is tuned by the learning samples distributed in a local narrow area so that it shows relatively quick responses to newly presented learning samples. Moreover, in the proposed method, each local linear model and its corresponding kernel reduces the number of useless dimensions by using both the partial least squares method (PLS) (e.g. [1]) and the online DAEM algorithm proposed later in Section 4.4. PLS is specialized for selecting dimensions from multi-dimensional patterns which have multicollinearity. Using PLS, the proposed system achieves online dimension selection not only for each linear model but also for the corresponding Gaussian kernel.

2 Related Work

Schaal et al. have proposed Receptive Field Weighted Regression (RFWR) [2]. This has a similar architecture to our model, but it does not reduce useless dimensions. Later, RFWR was improved to Locally Weighted Projection Regression (LWPR) [3]. The improved model reduces useless dimensions of each linear model by using the PLS method but does not reduce the dimensions of each kernel. Ishii et al. have constructed an online EM (expectation maximization) algorithm on a normalized Gaussian network, which has a similar architecture to that of RFWR [4]. However, their model may not reduce the input dimensions if the data have multicollinearity.

3 PLS Regression

The PLS regression method extracts latent variables, which correlate with both inputs and outputs, and approximates the output function. Let x = (x_1 ··· x_D) be the input vector and let D = dim(x) be its dimension.
In this paper, the output of the learning machine is a one-dimensional output value y. Now, let N and r be the number of learning samples and the number of latent variables, respectively. Then, a sample matrix can be written as X = (x_1 ··· x_N)^T, y = (y_1 ··· y_N)^T. We need to standardize the learning samples before PLS regression. In the standardization process, the averaged vector is subtracted from each learning sample. The input and output vectors are represented using latent vectors t_i and the two loading matrices P and q as follows:

X − X̄ = t_1 p_1^T + t_2 p_2^T + ··· + t_r p_r^T + E = T P^T + E,   (1)
y − ȳ = t_1 q_1 + t_2 q_2 + ··· + t_r q_r + f = T q + f,   (2)

where E and f denote residual errors, and X̄ and ȳ denote the average input vector and the average output value. In the initial state, the number of latent variables is zero. The latent variables are increased one by one until the residual errors are small enough to ignore. For example, let us consider the case where there is one latent variable. First of all, we have to generate the standardized input and output matrices X̃ = X − X̄ and ỹ = y − ȳ. We assume that the latent variable t_1 is generated by a linear transformation:

t_1 = X̃ w_1.   (3)

Now, we derive the vector w_1 so as to maximize ỹ^T t_1 subject to ||w_1|| = 1 using the Lagrangian technique:
w_1 = X̃^T ỹ / ||X̃^T ỹ||.   (4)

The loading vectors p_1 and q_1 can be derived by least squares estimation:

p_1 = X̃^T t_1 / (t_1^T t_1),   q_1 = ỹ^T t_1 / (t_1^T t_1).   (5)

The residual errors E and f are

E = X − X̄ − t_1 p_1^T,   f = y − ȳ − t_1 q_1.   (6)

If we replace X̃ and ỹ by E and f, we can derive the next latent variable by executing Eqs. (3)-(6) again. Now we can describe the output value y using the latent variables as follows [5]:

f(x) = ȳ + (x − x̄) W (P^T W)^{-1} q,   (7)

where P = (p_1 ··· p_r), q = (q_1 ··· q_r)^T and W = (w_1 ··· w_r).
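A compact NumPy sketch of this batch procedure (Eqs. (3)-(7)) might look as follows; fixing the number of latent variables r in advance is a simplification, since the text above increases them until the residual errors are negligible.

```python
import numpy as np

def pls_fit(X, y, r):
    """Batch PLS regression following Eqs. (3)-(7) (sketch)."""
    x_bar, y_bar = X.mean(axis=0), y.mean()
    E, f = X - x_bar, y - y_bar                 # mean-centred samples
    Ws, Ps, Qs = [], [], []
    for _ in range(r):
        w = E.T @ f
        w = w / np.linalg.norm(w)               # Eq. (4)
        t = E @ w                               # Eq. (3): latent scores
        p = E.T @ t / (t @ t)                   # Eq. (5): input loading
        q = f @ t / (t @ t)                     # Eq. (5): output loading
        E = E - np.outer(t, p)                  # Eq. (6): deflate the residuals
        f = f - q * t
        Ws.append(w); Ps.append(p); Qs.append(q)
    W, P, q = np.column_stack(Ws), np.column_stack(Ps), np.array(Qs)
    coef = W @ np.linalg.inv(P.T @ W) @ q       # regression vector of Eq. (7)
    return x_bar, y_bar, coef

def pls_predict(x, x_bar, y_bar, coef):
    """Eq. (7): f(x) = y_bar + (x - x_bar) W (P^T W)^{-1} q."""
    return y_bar + (x - x_bar) @ coef
```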
3.1 Online PLS Method

So far, we have explained the PLS method using all the learning samples: X = (x_1 ··· x_N)^T, y = (y_1 ··· y_N)^T. To execute the PLS method in an online manner, we would need a buffer for storing all the learning samples. However, this approach not only wastes a huge amount of storage space but also has a high computational cost. To solve this problem, the modified PLS method [6][7] derives the latent variables from two covariance matrices, Σ^Xy and Σ. Here, Σ^Xy is the covariance matrix of x and y, given by Σ^Xy = X̃^T ỹ / N, while Σ denotes the covariance matrix of x, given by Σ = X̃^T X̃ / N, where X̃ and ỹ are the standardized matrices X̃ = X − X̄ and ỹ = y − ȳ. Therefore, w_r, p_r and q_r are sequentially derived as below:

w_r = Σ^Xy_r / ||Σ^Xy_r||,   p_r = Σ_r w_r / c_r,   q_r = (Σ^Xy_r)^T w_r / c_r,   (8)

where c_r = w_r^T Σ_r w_r. To get the next (the (r+1)-th) new latent variable, we just repeat the above equations using the updated covariance matrices:

Σ^Xy_{r+1} ← Σ^Xy_r − c_r p_r q_r,   Σ_{r+1} ← Σ_r − c_r p_r p_r^T.   (9)

Σ^Xy_1 and Σ_1 can be derived in an online manner, so the latent variables can also be derived in an online manner:

Σ^Xy_1 = ⟨xy⟩ / ⟨1⟩ − ⟨x⟩⟨y⟩ / ⟨1⟩²,   (10)
Σ_1 = ⟨xx^T⟩ / ⟨1⟩ − ⟨x⟩⟨x⟩^T / ⟨1⟩²,

where the notation ⟨x⟩ denotes an approximated average of x, which will be defined in Eq. (21). Equations (8) and (9) are repeated until ||Σ^Xy_r|| < ε or ||Σ_r|| < ε, where ε is a small threshold. In this study, we executed the online PLS method combined with a new online learning method, namely the 'online-DAEM' algorithm described in 4.4.
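The same latent directions can be read off the two covariance matrices alone, which is what makes the online variant possible. A sketch of the deflation loop of Eqs. (8)-(9) is shown below; the stopping threshold ε is an assumption.

```python
import numpy as np

def pls_from_covariances(Sigma, Sigma_xy, eps=1e-8):
    """Derive w_r, p_r, q_r from Sigma and Sigma_xy only (Eqs. (8)-(9))."""
    Ws, Ps, Qs = [], [], []
    for _ in range(Sigma.shape[0]):
        if np.linalg.norm(Sigma_xy) < eps or np.linalg.norm(Sigma) < eps:
            break                                  # residual covariance negligible
        w = Sigma_xy / np.linalg.norm(Sigma_xy)    # Eq. (8)
        c = w @ Sigma @ w
        p = Sigma @ w / c
        q = Sigma_xy @ w / c
        Sigma_xy = Sigma_xy - c * q * p            # Eq. (9): deflation
        Sigma = Sigma - c * np.outer(p, p)
        Ws.append(w); Ps.append(p); Qs.append(q)
    return np.column_stack(Ws), np.column_stack(Ps), np.array(Qs)
```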
4 Proposed System

The proposed system architecture is similar to that of normalized Gaussian networks (NGnet) [8]. The PLS regression method is embedded in the learning method for reducing dimensions. In particular, the PLS regression method is applied not only to the learning method for the linear regression model but also to that for the kernel. In this paper, we call the kernel the 'PLS kernel.' Section 4.1 explains the whole system structure. Sections 4.2 and 4.3 explain the linear function and the PLS kernel, respectively. Section 4.4 explains the 'online DAEM' algorithm, which determines the PLS parameters. Section 4.5 explains the unit manipulation procedure.
4.1 Outline

[Figure 1: block diagram of the PLS mixture model; inputs pass through dimension reduction into PLS regression and PLS kernel units, which produce the output.]
Fig. 1. PLS mixture model

The proposed system consists of a linear function and a corresponding Gaussian kernel function, as shown in Fig.1. The Gaussian kernel function determines the respective input area for the corresponding linear function. Let f_c(x) be the output of each model unit and g(x|c) be the kernel function. Then, the ultimate output is

ŷ = Σ_c f_c(x) g(x|c) / Σ_c g(x|c),   (11)

where g(x|c) denotes the kernel function. It is expressed by

g(x|c) = exp( −(x − x̄_c) Σ_c^{-1} (x − x̄_c)^T / 2 ).   (12)

Note that the kernel function above does not depend on the scaling of the data. By this modification, we cannot deal with the network as a statistical model.

4.2 Linear Function

In this model, the PLS regression method explained in Section 3 is applied to construct the linear function. As a result, the linear function overcomes the problem of multicollinearity. The output of the c-th unit is

f_c(x) = ȳ_c + (x − x̄_c) W_c (P_c^T W_c)^{-1} q_c,   (13)

where q_c denotes the loading matrix of the c-th unit.
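A sketch of the mixture output of Eq. (11), with each unit supplying its own linear function and kernel, might look as follows; the (f_c, g_c) callable interface is an assumption made for illustration.

```python
import numpy as np

def mixture_output(x, units):
    """Eq. (11): normalized-kernel combination of the local linear models.
    `units` is a list of (f_c, g_c) pairs of callables (assumed interface)."""
    g = np.array([g_c(x) for _, g_c in units])
    f = np.array([f_c(x) for f_c, _ in units])
    return float((f * g).sum() / g.sum())
```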
4.3 PLS Kernel

To define the kernel function, we have to calculate the inverse of the covariance matrix Σ_c. However, if there are too few samples, Σ_c cannot be a regular matrix, so Σ_c^{-1} cannot be derived. To overcome this problem, the proposed method derives a pseudo inverse matrix of Σ_c. Let Σ*^PLS_c be the pseudo inverse matrix:

Σ*^PLS_c = W_c (W_c^T Σ_c W_c)^{-1} W_c^T,   (14)
|x − x̄_c|_PLS = (x − x̄_c) Σ*^PLS_c (x − x̄_c)^T,   (15)
g(x|c) = exp( −|x − x̄_c|_PLS / 2 ),   (16)

where W_c is the c-th unit's matrix defined in Eq. (7). Then we can obtain the Mahalanobis distance in the input space, |x − x̄_c|_PLS. In this paper, we call the kernel defined in Eq. (16) the 'PLS kernel.' Figures 2, 3 and 4 show examples of the PLS kernel where the numbers of latent variables are 0, 1, and 2, respectively. If the number of latent variables is zero, the PLS kernel is a uniform distribution. In the case of one latent variable, the PLS kernel has variance along the 1st PLS direction but has a uniform distribution along the other directions. In the case of two latent variables, the PLS kernel has two variances along the two PLS directions. Therefore, the respective area of the PLS kernel is larger when there are fewer latent variables. Note that this causes the problem that the units allocated in the early steps of the learning become dominant in the network, which degrades the generalization ability of the network. To overcome this problem, the system restricts the width of the PLS kernel according to the number of learning iterations of the unit. Therefore, in the early steps of the learning, the shape of the PLS kernel is a hyper sphere, but it gradually changes to an oval-shaped hyper ellipse reflecting the PLS directions. Therefore,

Σ*_c = λ_c(t) k_c I_D + (1 − λ_c(t)) Σ*^PLS_c,

where

λ_c(t) = 1 / ( 1 + exp( (⟨1⟩_c − t̃) / 2 ) ),   (17)

where ⟨1⟩_c denotes an approximated average of the c-th unit's respective ratio, which approximately reflects the number of parameter updates of the unit.
[Figures 2-4: example PLS kernels plotted over (x0, x1).]
Fig. 2. PLS kernel without any latent variable. Fig. 3. PLS kernel with one latent variable. Fig. 4. PLS kernel with two latent variables.
Σ*_c is thus a weighted sum of the hyper sphere and the hyper ellipse reflecting the latent variables. The weight for the hyper sphere, λ_c(t), is large in the early steps of the learning but gradually decreases to zero, whereas the weight for the hyper ellipse, 1 − λ_c(t), gradually increases to 1. The parameter k_c determines the initial variance of the hyper sphere, t̃ denotes the averaged fitness value required for the weights of the 1st and 2nd terms to become 0.5, and I_D denotes the D-dimensional unit matrix. Thus, the modified PLS kernel is

g(x|c) = exp( −(x − x̄_c) Σ*_c (x − x̄_c)^T / 2 ).   (18)
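A sketch of the modified PLS kernel of Eqs. (14)-(18) is given below; λ_c(t) is passed in rather than computed from Eq. (17), since ⟨1⟩_c and t̃ are maintained by the learning algorithm of Section 4.4.

```python
import numpy as np

def pls_kernel(x, x_bar_c, W_c, Sigma_c, k_c, lam_c):
    """Modified PLS kernel g(x|c) of Eq. (18) (sketch)."""
    D = x.shape[0]
    # Eq. (14): pseudo inverse of Sigma_c restricted to the PLS directions W_c
    Sigma_pls = W_c @ np.linalg.inv(W_c.T @ Sigma_c @ W_c) @ W_c.T
    # Eq. (17): blend a hyper sphere with the PLS hyper ellipse
    Sigma_star = lam_c * k_c * np.eye(D) + (1.0 - lam_c) * Sigma_pls
    d = x - x_bar_c
    return float(np.exp(-(d @ Sigma_star @ d) / 2.0))   # Eqs. (15)-(16), (18)
```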
4.4 Online DAEM Algorithm

The PLS mixture model is a variation of the mixture of experts (ME). There are various learning methods suitable for learning the ME. In this method, we selected a variation of the EM algorithm as the learning method because its convergence speed is fast. Although the PLS mixture model is not a statistical model, we can apply an online EM algorithm for learning the PLS mixture model. An online EM algorithm for NGnet parameter estimation, which has a similar structure to our model, has been proposed by Ishii et al. [4]. Generally speaking, however, the quality of the resultant parameters is affected by their initial values. Ueda et al. proposed the Deterministic Annealing EM (DAEM) algorithm to overcome this problem [9]. The original DAEM algorithm, however, was not an online learning algorithm. To apply DAEM to our model, we must extend it to an online learning algorithm. The following text explains the online DAEM learning algorithm for a mixture of PLS models. The covariance matrix of x and y, Σ^Xy_c, the covariance matrix of x, Σ_c, the average vector of x, x̄_c, and the average output ȳ_c are modified by the online DAEM algorithm. The online DAEM algorithm executes E- and M-steps alternately to update the parameters every time it receives one instance.

E-step. In the E-step, the respective ratio of each unit to the i-th instance (the current instance) is calculated. Let p(c | x_i, y_i, θ^(t)) be the respective ratio of the c-th unit to the i-th instance:

p(c | x_i, y_i, θ^(t)) = g(x_i|c)^τ y_c(x_i, y_i) / Σ_{m=1}^{C} g(x_i|m)^τ y_m(x_i, y_i),   (19)

where τ denotes the inverted temperature and is gradually increased as τ ← τ · β (β > 1) every time the system receives one instance.¹ The term y_c(x_i, y_i) denotes the suitability of the c-th linear unit for the current input (x_i, y_i):

¹ Although the PLS kernel function is not a probability density function and Eq. (19) is also not a precise probability density function, they have similar properties to those of a probability density function.
y_c(x_i, y_i) = exp( −(f_c(x_i) − y_i)² / (2σ²) ).   (20)

M-step. In the M-step, the parameters are updated using the value calculated in the E-step. Let p^(t)_c be the respective value of module c at time t, p^(t)_c = p(c | x_t, y_t, θ^(t)), and let ⟨x⟩_c be the approximated average of x for module c. If p^(t)_c > θ_learning, the c-th unit executes the procedure below:

⟨x⟩^(t)_c = ρ ⟨x⟩^(t−1)_c + p^(t)_c x_t,   (21)

where ρ denotes the forgetting coefficient and 0 < ρ ≤ 1. Then, the parameters that maximize the likelihood for incomplete data are

Σ_c = ⟨xx^T⟩_c / ⟨1⟩_c − ⟨x⟩_c ⟨x⟩_c^T / ⟨1⟩_c²,   (22)
Σ^Xy_c = ⟨xy⟩_c / ⟨1⟩_c − ⟨x⟩_c ⟨y⟩_c / ⟨1⟩_c²,   (23)
x̄_c = ⟨x⟩_c / ⟨1⟩_c,   ȳ_c = ⟨y⟩_c / ⟨1⟩_c.   (24)

Note that the notation 't' is omitted in the above equations. Thus, x̄_c, ȳ_c, Σ_c and Σ^Xy_c are derived. From Σ_c and Σ^Xy_c, the latent variables are derived by repeating Eqs. (8) and (9).
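A sketch of one online DAEM update is given below. Extending the discounted update of Eq. (21) to the remaining sufficient statistics ⟨1⟩_c, ⟨y⟩_c, ⟨xx^T⟩_c and ⟨xy⟩_c is an assumption (the text above only writes out the ⟨x⟩_c case), and the (f_c, g_c) unit interface matches the earlier sketches.

```python
import numpy as np

def daem_e_step(x, y, units, tau, sigma2):
    """Eqs. (19)-(20): tempered responsibilities p(c|x, y) for all units."""
    num = np.array([g(x) ** tau * np.exp(-(f(x) - y) ** 2 / (2.0 * sigma2))
                    for f, g in units])
    return num / num.sum()

def daem_m_step(stats, x, y, p_c, rho):
    """Eqs. (21)-(24): discounted statistics and parameters of one unit.
    `stats` holds the unit's <1>, <x>, <y>, <xx^T> and <xy> accumulators."""
    stats["one"] = rho * stats["one"] + p_c
    stats["x"]   = rho * stats["x"]   + p_c * x
    stats["y"]   = rho * stats["y"]   + p_c * y
    stats["xx"]  = rho * stats["xx"]  + p_c * np.outer(x, x)
    stats["xy"]  = rho * stats["xy"]  + p_c * x * y
    one = stats["one"]
    x_bar, y_bar = stats["x"] / one, stats["y"] / one                         # Eq. (24)
    Sigma    = stats["xx"] / one - np.outer(stats["x"], stats["x"]) / one**2  # Eq. (22)
    Sigma_xy = stats["xy"] / one - stats["x"] * stats["y"] / one**2           # Eq. (23)
    return x_bar, y_bar, Sigma, Sigma_xy
```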
4.5 Unit Manipulation

Each unit is allocated adaptively. If the suitabilities of all units for a new learning sample i are less than a threshold, a new unit is allocated. Therefore, if the condition

(∀c) ( p(x_i, y_i, c | θ) < ε )   (25)

is satisfied, a new unit N + 1 is allocated:

x̄_{N+1} = x_i,   ȳ_{N+1} = y_i,   Σ*_{N+1} = k_{N+1} I_D,   (26)

where k_{N+1} is the parameter used in Eq. (17) and ε denotes a small positive threshold. Note that the initial output function of the newly allocated linear model is f(x) = ȳ. Although unit pruning or merging strategies can also be applied to the PLS mixture model, we do not use such strategies in this paper, in order to show the effect of the PLS clearly.

5 Experiments

To examine the performance of the proposed system, we tested it using two datasets: one including irrelevant dimensions and one including correlated dimensions.

5.1 Performance on a Dataset Including Irrelevant Dimensions

The proposed PLS mixture model is good at reducing variables in the case where there are several correlated dimensions. However, it can also reduce irrelevant dimensions if its parameters are optimized. To verify this property, we tested our model using the following synthetic dataset:
y = cos( 2(x_0 + x_1) ),   (27)

where x_0 and x_1 are independent of each other. We set both x_0 and x_1 to uniform random values in the intervals 0 ≤ x_0 ≤ 1 and 0 ≤ x_1 ≤ 1, respectively.

[Figures 5 and 6: PLS kernel distributions over (x0, x1).]
Fig. 5. PLS kernels after 1000 observations. Fig. 6. PLS kernels after 5000 observations.
[Figure 7: RMSE and number of models versus learning step, for NG-OEM and PLS-ODAEM.]
Fig. 7. RMSE and number of units versus number of iterations.
Note that x_0 and x_1 can be converted to

t_0 = (1/√2) x_0 + (1/√2) x_1,   t_1 = (1/√2) x_0 − (1/√2) x_1.   (28)

From these equations, we can see that t_0 is related to y but t_1 is not. Therefore, y is a one-dimensional function in nature. PLS-kernel distributions after 1000 and 5000 observations are shown in Figs. 5 and 6. We can see that, in the early steps of the learning, the PLS kernels were shaped like spheres, but they became ellipse shaped in the later steps of the learning. In particular, the widths along t_0 became narrow, while those along t_1 became wide. This means that the kernels are forced to ignore the irrelevant t_1. Using this strategy, the model achieves a high generalization ability.
5.2 Performances for a Multicollinearity Dataset

To verify the effect of the PLS kernels, we tested the proposed model using a multicollinearity dataset. This is a 10-dimensional dataset including two essential dimensions x_0 and x_1, which have the following relation to y:

y = max{ exp(−10 x_0²), exp(−50 x_1²), 1.25 exp( −5(x_0² + x_1²) ) },   (29)

where x_0 and x_1 are independent of each other. We set them to uniform random values in the intervals −1 ≤ x_0 ≤ 1 and −1 ≤ x_1 ≤ 1, respectively. The other dimensions correlate with x_0 and x_1 as follows:

x_2 = x_0,   x_4 = (1/√2)(x_0 − x_1),   x_6 = (1/√5)(x_0 − 2x_1),   x_8 = (1/√5)(2x_0 − x_1),
x_3 = x_1,   x_5 = (1/√2)(x_0 + x_1),   x_7 = (1/√5)(x_0 + 2x_1),   x_9 = (1/√5)(2x_0 + x_1).

We also compared the performance of our model with that of an NGnet using the online EM algorithm (NG-OEM) [4] without pruning, merging, or dividing strategies.²
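For reference, a data-generation sketch for this multicollinearity dataset (Eq. (29) and the correlated dimensions above) could be written as follows; the sample size and random seed are arbitrary.

```python
import numpy as np

def make_multicollinearity_data(n_samples, seed=0):
    """Generate the 10-dimensional dataset of Section 5.2 (sketch)."""
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(-1.0, 1.0, n_samples)
    x1 = rng.uniform(-1.0, 1.0, n_samples)
    y = np.maximum.reduce([np.exp(-10.0 * x0 ** 2),           # Eq. (29)
                           np.exp(-50.0 * x1 ** 2),
                           1.25 * np.exp(-5.0 * (x0 ** 2 + x1 ** 2))])
    X = np.column_stack([
        x0, x1,                                                  # x_0, x_1
        x0, x1,                                                  # x_2 = x_0, x_3 = x_1
        (x0 - x1) / np.sqrt(2), (x0 + x1) / np.sqrt(2),          # x_4, x_5
        (x0 - 2 * x1) / np.sqrt(5), (x0 + 2 * x1) / np.sqrt(5),  # x_6, x_7
        (2 * x0 - x1) / np.sqrt(5), (2 * x0 + x1) / np.sqrt(5),  # x_8, x_9
    ])
    return X, y
```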
The root mean square errors (RMSEs) were averaged over five trials. The number of units was also examined in the same manner as the RMSE. The performances of PLS-ODAEM (our model) and NG-OEM are shown in Fig. 7. We can see that the number of units was smaller for PLS-ODAEM than for NG-OEM. The convergence speed of the RMSE of PLS-ODAEM was slower than that of NG-OEM, but the ultimate RMSE of PLS-ODAEM was less than that of NG-OEM.

² There are cases in which the pruning and merging strategies also bring a dimension reduction effect. To investigate the PLS effect clearly, we examined the performances without the pruning and merging strategies.

6 Conclusion

In this paper, we proposed a PLS mixture model using an online DAEM algorithm. The new model reduces the number of input dimensions using the partial least squares method (PLS). The PLS was applied to both the linear model and the Gaussian kernel, and the parameters were optimized by the online DAEM algorithm. As it stands, this model does not satisfy the conditions for a statistical model. However, if we restrict λ_c(t) in Eq. (17) to be larger than 0, we will be able to use p(x|c) = g(x|c) / ∫ g(x|c) dx instead of g(x|c) to make our method satisfy the conditions. This model can be applied not only to a multicollinearity dataset but also to datasets that include irrelevant variables. We plan to improve the model by adding pruning, merging and dividing strategies.

References

1. Wold, H.: Soft modeling by latent variables; the nonlinear iterative partial least squares approach. In: Gani, J. (ed.) Perspectives in Probability and Statistics, Papers in Honour of M.S. Bartlett, pp. 520–540. Academic Press, Inc., London (1975)
2. Schaal, S., Atkeson, C.G.: Constructive incremental learning from only local information. Neural Computation 10(8), 2047–2084 (1998)
3. Vijayakumar, S., Souza, A.D., Schaal, S.: Incremental online learning in high dimensions. Neural Computation 17, 2602–2634 (2005)
4. Sato, M.-a., Ishii, S.: On-line EM algorithm for the normalized Gaussian network. Neural Computation 12, 407–432 (2000)
5. Manne, R.: Analysis of two partial-least-squares algorithms for multivariate calibration. Chemometrics and Intelligent Laboratory Systems 22, 187–197 (1987)
6. Lindgren, S.W.F., Geladi, P.: The kernel algorithm for PLS. Journal of Chemometrics 7(1), 45–59 (1993)
7. Ter Braak, C.J.F., De Jong, S.: Comments on the PLS kernel algorithm. Journal of Chemometrics 8(2), 169–174 (1994)
8. Xu, L., Jordan, M.I., Hinton, G.E.: An alternative model for mixtures of experts. In: Tesauro, G., Touretzky, D., Leen, T. (eds.) Advances in Neural Information Processing Systems, vol. 7, pp. 633–640. The MIT Press, Cambridge (1995)
9. Ueda, N., Nakano, R.: Deterministic annealing variant of the EM algorithm. Advances in Neural Information Processing Systems 7, 545–552 (1995)

Analysis on Bidirectional Associative Memories with Multiplicative Weight Noise

Chi Sing Leung¹, Pui Fai Sum², and Tien-Tsin Wong³
¹ eeleungc@cityu.edu.hk
² Institute of Electronic Commerce, National Chung Hsing University, Taiwan
³ Department of Computer Science and Engineering, The Chinese University of Hong Kong

Abstract. In neural networks, network faults can be exhibited in different forms, such as node faults and weight faults. One kind of weight fault is due to the hardware or software precision. This kind of weight fault can be modelled as multiplicative weight noise. This paper analyzes the capacity of a bidirectional associative memory (BAM) affected by multiplicative weight noise. Assuming that weights are corrupted by multiplicative noise, we study how many pattern pairs can be stored as fixed points. Since capacity is not meaningful without considering the error correction capability, we also present the capacity of a BAM with multiplicative noise when there are some errors in the input pattern. Simulations have been carried out to confirm our derivations.

1 Introduction

Associative memories have a wide range of applications including content addressable memory and pattern recognition [1,2]. An important feature of associative memories is the ability to recall the stored patterns based on partial or noisy inputs. One form of associative memories is the bivalent additive bidirectional associative memory (BAM) [3] model. There are two layers, F_X and F_Y, of neurons in a BAM. Layer F_X has n neurons and layer F_Y has p neurons. A BAM is used to store pairs of bipolar patterns, (X_h, Y_h)'s, where h = 1, 2, ..., m; X_h ∈ {+1, −1}^n; Y_h ∈ {+1, −1}^p; and m is the number of patterns stored. We shall refer to these pattern pairs as library pairs. The recall process is an iterative one starting with a stimulus pair (X^(0), Y^(0)) in F_X. After a number of iterations, the patterns in F_X and F_Y should converge to a fixed point, which is desired to be one of the library pairs. BAM has three important features [3]. Firstly, BAM can perform both heteroassociative and autoassociative data recalls: the final state in layer F_X represents the autoassociative recall, while the final state in layer F_Y represents the heteroassociative recall. Secondly, the initial input can be presented in either of the two layers. Lastly, BAM is stable during recall. In other words, for any connection matrix, a BAM always converges to a stable state. Several methods have been proposed to improve its capacity [4,5,6,7,8].

Although the capacity of BAM has been intensively studied under a perfect laboratory environment [9,10,11,12,13], practical realization of BAM may encounter the problem of inaccuracy in the stored weights. All the previous studies assume that the stored weight matrix is noiseless. However, this is not always the case when training a BAM for some real applications. One kind of weight fault is due to the hardware or software precision [14,15]. For example, in the digital implementation, when we use a low precision floating point format, such as 16-bit half-float [16], to represent trained weights, truncation errors will be introduced. The magnitude of truncation errors is proportional to that of the trained weights. Hence, truncation errors can be modelled as multiplicative weight noise [17,18]. This paper focuses on the quantitative impact of multiplicative weight noise on the BAM capacity. We will study how many pattern pairs can be stored as fixed points when multiplicative weight noise is present. Since capacity is not meaningful without considering the error correction capability, we also present the capacity of BAM with multiplicative noise when there are some errors in the input pattern.

The rest of this paper is organized as follows. Section 2 introduces the BAM model and multiplicative weight noise. In Section 3, the capacity analysis on BAM with multiplicative weight noise is presented. Simulation examples are given in Section 4. Then, we conclude our work in Section 5.

2 BAM with Multiplicative Weight Noise

2.1 BAM
The BAM, as proposed by Kosko [3], is a two-layer nonlinear feedback heteroassociative memory in which m library pairs (X_1, Y_1), ..., (X_m, Y_m) are stored, where X_h ∈ {−1, 1}^n and Y_h ∈ {−1, 1}^p. There are two layers of neurons in the BAM; layer F_X has n neurons and layer F_Y has p neurons. The connection matrix between the two layers is denoted as W. The encoding equation, as proposed by Kosko, is given by

W = Σ_{h=1}^{m} Y_h X_h^T,   (1)

which can be rewritten as

w_ji = Σ_{h=1}^{m} x_ih y_jh,   (2)

where X_h = (x_1h, x_2h, ..., x_nh)^T and Y_h = (y_1h, y_2h, ..., y_ph)^T. The recall process employs interlayer feedback. An initial pattern X^(0) presented to F_X is passed through W and is thresholded, and a new state Y^(1) in F_Y is obtained, which is then passed back through W^T and is thresholded again, leading to a new state X^(1) in F_X. The process repeats until the state of the BAM converges. Mathematically, the recall process is:

Y^(t+1) = sgn( W X^(t) ),   and   X^(t+1) = sgn( W^T Y^(t+1) ),   (3)
where sgn(·) is the sign operator:

sgn(x) = +1 if x > 0,   −1 if x < 0,   state unchanged if x = 0.

Using an element-by-element notation, the recall process can be written as:

y_j^(t+1) = sgn( Σ_{i=1}^{n} w_ji x_i^(t) ),   and   x_i^(t+1) = sgn( Σ_{j=1}^{p} w_ji y_j^(t+1) ),   (4)

where x_i^(t) is the state of the i-th F_X neuron and y_j^(t) is the state of the j-th F_Y neuron. The above bidirectional process produces a sequence of pattern pairs (X^(t), Y^(t)): (X^(1), Y^(1)), (X^(2), Y^(2)), .... This sequence converges to one of the fixed points (X_f, Y_f), and this fixed point ideally should be one of the library pairs or nearly so. A fixed point (X_f, Y_f) has the following properties:

Y_f = sgn( W X_f )   and   X_f = sgn( W^T Y_f ).   (5)

Hence, a library pair can be retrieved only if it is a fixed point. One of the advantages of Kosko's encoding method is the ability of incremental learning, i.e., the ability to encode new library pairs into the model based on the current connection matrix only. With Kosko's encoding method, a BAM can only correctly store up to min(n,p) / (2 log min(n,p)) library pairs. When the number of library pairs exceeds that value, a library pair may not be stored as a fixed point.
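A compact NumPy sketch of Kosko's encoding (Eq. (1)) and the bidirectional recall of Eqs. (3)-(4) is given below; the initial F_Y state used to implement the 'state unchanged' rule and the fixed iteration count are assumptions.

```python
import numpy as np

def kosko_encode(X, Y):
    """Eq. (1): W = sum_h Y_h X_h^T; X is (m, n), Y is (m, p), entries are +/-1."""
    return Y.T @ X

def bam_recall(W, x0, n_iter=50):
    """Eqs. (3)-(4): bidirectional recall starting from an F_X pattern x0."""
    def sgn(u, prev):
        # sign operator with the state left unchanged when the net input is zero
        return np.where(u > 0, 1, np.where(u < 0, -1, prev))
    x = x0.copy()
    y = np.ones(W.shape[0], dtype=int)     # assumed initial F_Y state
    for _ in range(n_iter):
        y = sgn(W @ x, y)
        x = sgn(W.T @ y, x)
    return x, y
```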
2.2 Multiplicative Weight Noise

In some electrical circuits, inaccuracy occurs in the implementation of trained weights. Errors in the weights may be caused by quantization error, due to the limited number of bits used to store the trained weights, or by percentage error, due to voltage perturbation. In digital implementations, such as DSP and FPGA, the trained weights are usually stored in a floating-point format. Floating-point representation of real numbers is more desirable than integer representation because floating-point provides a large dynamic range. When we use a low precision floating point format, such as 16-bit half-float [16], to represent trained weights, truncation errors will be introduced. The magnitude of the truncation errors is proportional to that of the trained weights [19]. Hence, the truncation errors can be modelled as multiplicative weight noise. An implementation of a weight w_ji is denoted by w̃_ji. In multiplicative weight noise, each implemented weight deviates from its nominal value by a random percentage, i.e.,

w̃_ji = w_ji + b_ji w_ji,   (6)

where the b_ji's are identical independent zero-mean random variables with variance σ_b². The density function of the b_ji's is symmetrical.
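As a sketch, the multiplicative weight noise of Eq. (6) can be injected as follows; a Gaussian b_ji is only one admissible choice, since the analysis merely requires a symmetric zero-mean density with variance σ_b².

```python
import numpy as np

def add_multiplicative_noise(W, sigma2_b, seed=0):
    """Eq. (6): w~_ji = w_ji + b_ji w_ji with zero-mean noise of variance sigma2_b."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, np.sqrt(sigma2_b), size=W.shape)  # Gaussian b is an assumption
    return W + b * W

def is_fixed_point(W_tilde, x, y):
    """Eq. (5): the pair (x, y) is retrievable only if it is a fixed point of W~."""
    return bool(np.all(np.sign(W_tilde @ x) == y) and
                np.all(np.sign(W_tilde.T @ y) == x))
```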
3 Analysis on BAM with Multiplicative Weight Noise

3.1 Capacity

We will investigate the BAM's memory capacity when multiplicative weight noise is present. The following assumptions and notations are used.

– The dimensions, n and p, are large. Also, p = rn, where r is a positive constant.
– Each component of the library pairs (X_h, Y_h) is a ±1 equiprobable independent random variable.
– EU_{j,h} is the event that Σ_{i=1}^{n} w̃_ji x_ih is equal to y_jh (the j-th component of the library pattern Y_h). Also, ĒU_{j,h} is the complement event of EU_{j,h}.
– EV_{i,h} is the event that Σ_{j=1}^{p} w̃_ji y_jh is equal to x_ih (the i-th component of the library pattern X_h). Also, ĒV_{i,h} is the complement event of EV_{i,h}.

We first introduce Lemma 1 and Lemma 2. They will assist us in deriving the capacity of BAM with multiplicative weight noise.

Lemma 1: The probability Prob(ĒU_{j,h}) is approximately equal to Q( √( n / ((1 + σ_b²) m) ) ) for j = 1, ..., p and h = 1, ..., m, where Q(z) = (1/√(2π)) ∫_z^∞ exp(−u²/2) du.
Proof: The event ĒU_{j,h} means that sgn( Σ_{i=1}^{n} w̃_ji x_ih ) ≠ y_jh. From (2) and (6), we have

Σ_{i=1}^{n} w̃_ji x_ih = Σ_{i=1}^{n} w_ji (1 + b_ji) x_ih
  = Σ_{i=1}^{n} ( Σ_{h'=1}^{m} y_jh' x_ih' )(1 + b_ji) x_ih
  = n y_jh + Σ_{i=1}^{n} ( Σ_{h'≠h} y_jh' x_ih' ) x_ih + Σ_{i=1}^{n} ( Σ_{h'=1}^{m} y_jh' x_ih' ) b_ji x_ih.   (7)

Without loss of generality, we consider the library pair (X_h, Y_h) having all components positive: X_h = (1, ..., 1)^T and Y_h = (1, ..., 1)^T. This consideration is commonly used [13] and does not affect our results; we can easily verify this by use of conditional probability. Now, (7) becomes

Σ_{i=1}^{n} w̃_ji x_ih = n + Σ_{i=1}^{n} ( Σ_{h'≠h} y_jh' x_ih' ) + Σ_{i=1}^{n} ( Σ_{h'=1}^{m} y_jh' x_ih' ) b_ji   (8)
  = n + Σ_{i=1}^{n} α_ji + Σ_{i=1}^{n} β_ji,   (9)

where the α_ji's are independent identical zero-mean random variables (i.e., E[α_ji] = 0 and E[α_ji α_ji'] = 0 for i ≠ i') and the variance of the α_ji's, denoted as Var[α_ji], is equal to (m − 1). The b_ji's are independent identical zero-mean random variables and they are independent of the ( Σ_{h'=1}^{m} y_jh' x_ih' )'s. Hence, the β_ji's are independent identical zero-mean random variables, where E[β_ji] = 0, E[β_ji β_ji'] = 0 for i ≠ i', and Var[β_ji] = σ_b² m. For large n, the summations Σ_{i=1}^{n} α_ji and Σ_{i=1}^{n} β_ji tend to normal with variances equal to (m − 1)n and σ_b² m n, respectively, and the sum of two normal random variables is still a normal random variable. Hence, (9) becomes

Σ_{i=1}^{n} w̃_ji x_ih = n + Σ_{i=1}^{n} α_ji + Σ_{i=1}^{n} β_ji = n + α + β.   (10)

Note that the β_ji's are independent of the ( Σ_{h'≠h} y_jh' x_ih' )'s. We have E[αβ] = 0, E[α + β] = 0 and Var[α + β] = (m − 1)n + σ_b² m n. The event ĒU_{j,h} means that α + β < −n. Hence,

Prob(ĒU_{j,h}) ≈ Q( √( n / ((m − 1) + σ_b² m) ) ).

For large m, Q( √( n / ((m − 1) + σ_b² m) ) ) ≈ Q( √( n / ((1 + σ_b²) m) ) ).
(Proof completed.) In a similar way, we obtain Lemma 2.

Lemma 2: The probability Prob(ĒV_{i,h}) is approximately equal to Q( √( p / ((1 + σ_b²) m) ) ) for i = 1, ..., n and h = 1, ..., m.

Now, we start to estimate the capacity. Let the probability that a library pair (X_h, Y_h) is a fixed point be P*:

P* = Prob( EU_{1h} ∩ ··· ∩ EU_{ph} ∩ EV_{1h} ∩ ··· ∩ EV_{nh} )
   = 1 − Prob( ĒU_{1h} ∪ ··· ∪ ĒU_{ph} ∪ ĒV_{1h} ∪ ··· ∪ ĒV_{nh} )
   ≥ 1 − p Prob(ĒU_{jh}) − n Prob(ĒV_{ih}).   (11)

From Lemmas 1 and 2, (11) becomes

P* ≥ 1 − p Q( √( n / ((1 + σ_b²) m) ) ) − n Q( √( p / ((1 + σ_b²) m) ) ).   (12)
Letting P_A = p Q( √( n / ((1 + σ_b²) m) ) ) and P_B = n Q( √( p / ((1 + σ_b²) m) ) ), we get

P* ≥ 1 − P_A − P_B.   (13)

If z is large,

Q(z) ≈ exp( −z²/2 − log z − (1/2) log 2π ),   (14)

which is quite accurate for z > 3. Using the approximation (14),

P_A = exp( log p − n/(2(1 + σ_b²)m) − (1/2) log( n/((1 + σ_b²)m) ) − (1/2) log 2π )
    = exp( log r + log n − n/(2(1 + σ_b²)m) − (1/2) log( n/((1 + σ_b²)m) ) − (1/2) log 2π )
    = exp( log n − n/(2(1 + σ_b²)m) − (1/2) log( n/((1 + σ_b²)m) ) + constant ).   (15)
Clearly, if m < n / (2(1 + σ_b²) log n), then P_A tends to zero as n tends to infinity. Similarly, we can show that as p → ∞ and m < p / (2(1 + σ_b²) log p), P_B → 0. To sum up, for large n and p, if

m < min(n, p) / ( 2(1 + σ_b²) log min(n, p) ),   (16)

then P* → 1. That means that if the number m of library pairs is less than min(n,p) / (2(1+σ_b²) log min(n,p)), a library pair has a very high chance of being a fixed point. So the capacity of a BAM with multiplicative weight noise is equal to min(n,p) / (2(1+σ_b²) log min(n,p)).
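As a quick numerical check, the bound of Eq. (16) (and its noisy-input counterpart, Eq. (21) in the next subsection) can be evaluated directly; with the natural logarithm it reproduces the pair counts quoted in Section 4 (41/34/29 for n = p = 512 and 73/61/52 for n = p = 1024).

```python
import numpy as np

def bam_capacity(n, p, sigma2_b, rho=0.0):
    """Capacity bound of Eq. (16); rho > 0 gives the bound of Eq. (21)."""
    d = min(n, p)
    return (1.0 - 2.0 * rho) * d / (2.0 * (1.0 + sigma2_b) * np.log(d))

for sigma2_b in (0.0, 0.2, 0.4):
    print(sigma2_b,
          int(bam_capacity(512, 512, sigma2_b)),
          int(bam_capacity(1024, 1024, sigma2_b)))
```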
3.2 Error Correction

In this section, we will investigate the capacity of a BAM with weight noise when the initial input is a noisy version X^noise_h of a library pattern X_h. Let X^noise_h contain ρn bit errors, where ρ is the input noise level. If

Y_h = sgn( W̃ X^noise_h ),   (17)
X_h = sgn( W̃^T Y_h ),   (18)
Y_h = sgn( W̃ X_h ),   (19)

then the noisy version X^noise_h can successfully recall the desired library pair (X_h, Y_h). Similarly, we hope that a noisy version Y^noise_h of a library pattern Y_h can successfully recall the desired library pair (X_h, Y_h). We will study under what condition on m the probability of successful recall tends to one. Define EU^noise_{j,h} to be the event that Σ_{i=1}^{n} w̃_ji x^noise_ih is equal to y_jh (the j-th component of the library pattern Y_h). Also, ĒU^noise_{j,h} is the complement event of EU^noise_{j,h}. Similarly, define EV^noise_{i,h} to be the event that Σ_{j=1}^{p} w̃_ji y^noise_jh is equal to x_ih (the i-th component of the library pattern X_h). Also, ĒV^noise_{i,h} is the complement event of EV^noise_{i,h}.
With the above definitions, we can follow the proof of Lemma 1 to get the following two lemmas.

Lemma 3: The probability Prob(ĒU^noise_{j,h}) is approximately equal to Q( √( (1 − 2ρ) n / ((1 + σ_b²) m) ) ) for j = 1, ..., p and h = 1, ..., m.

Lemma 4: The probability Prob(ĒV^noise_{i,h}) is approximately equal to Q( √( (1 − 2ρ) p / ((1 + σ_b²) m) ) ) for i = 1, ..., n and h = 1, ..., m.

Define P** to be the probability that a noisy initial input successfully recalls the desired library pair. It is not difficult to show that

P** ≥ 1 − p [ Prob(ĒU_{jh}) + Prob(ĒU^noise_{jh}) ] − n [ Prob(ĒV_{ih}) + Prob(ĒV^noise_{ih}) ].   (20)

From Lemmas 1-4, for large n and p, if

m < (1 − 2ρ) min(n, p) / ( 2(1 + σ_b²) log min(n, p) ),   (21)

then P** → 1. That means that, when there are ρn (or ρp) bit errors in the initial input, the capacity of a BAM with multiplicative weight noise is equal to (1 − 2ρ) min(n,p) / (2(1+σ_b²) log min(n,p)).
of weight noise are set to 0, 0.2, 0.4. Figure 1 shows the percentage of a library pair being successfully stored. From the figure, as the variance of weight noise increases, the successful rate decreases. This phenomena agrees with our expectation. From our analysis, i.e., (16), for n = p = 512, a BAM can store up to 41, 34, and 29 pairs for σ 2 b
all the corresponding successful rates are very large. Also, there are a sharply decreasing changes in successful for {m > 41, σ 2 b = 0 }, {m > 34, σ 2 b
}, and {m > 29, σ 2 b
}. 296 C.S. Leung, P.F. Sum, and T.-T. Wong 0 10
30 40 50 60 70 80 0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1 Successful Rate σ b 2 =0
σ b 2 =0.2 σ b 2 =0.4
0 20 40 60 80 100 120 140
160 0 0.1 0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1 m: Number of library pairs Successful Rate σ b 2 =0 σ b 2 =0.2 σ b 2 =0.4 (a) n = p = 512 (b) n = p = 1024 Fig. 1. Successful rate of a library pair being a fixed point. (a) The dimension is 512. (b) The dimension is 1024. For each value of m, we generate 1000 sets of library pairs. 0 10 20 30 40 50 60 70 80 0 0.1 0.2 0.3
[Figure 2: successful recall rate versus m for ρ = 0.03125, 0.0625, 0.125; panel (a) σ_b² = 0.2, panel (b) σ_b² = 0.4.]
Fig. 2. Successful recall rate from a noisy input. The dimension is 512. For each value of m, we generate 1000 sets of library pairs. For each library pattern, we generate 10 noise versions.

Similarly, from (16), for n = p = 1024, a BAM can store up to 73, 61, and 52 pairs for σ_b² equal to 0, 0.2 and 0.4, respectively. From Figure 1(b), all the corresponding successful rates are also large. Also, there is a sharp decrease in the successful rate for {m > 73, σ_b² = 0}, {m > 61, σ_b² = 0.2}, and {m > 52, σ_b² = 0.4}. To sum up, the simulation results are consistent with our analysis (16).

4.2 Error Correction

The dimension is 512. We consider two weight noise levels, σ_b² = 0.2 and 0.4, and three input error levels, ρ = 0.03125, 0.0625 and 0.125. For each m, we randomly generate 1000 sets of library pairs. Kosko's rule is then used to encode the matrices. Afterwards, we add the multiplicative weight noise to the matrices. For each library pair, we generate ten noisy versions. We then feed the noisy versions as the initial input and check whether the desired library pair can be recalled or not.
[Figure 3: successful recall rate versus m for ρ = 0.015625, 0.03125, 0.0625; panel (a) σ_b² = 0.2, panel (b) σ_b² = 0.4.]
Fig. 3. Successful recall rate from a noisy input. The dimension is 1024. For each value of m, we generate 1000 sets of library pairs. For each library pattern, we generate 10 noise versions.

Figures 2 and 3 show the successful recall rate. From the figures, as the input error level ρ increases, the successful rate decreases. This phenomenon agrees with our expectation. From our analysis, i.e., (21), for the dimension n = p = 512 and weight noise level σ_b² = 0.2,
a BAM can store up to 32, 30, and 26 pairs for the input error level ρ equal to 0.03125, 0.0625 and 0.125, respectively. From Figure 2(a), all the corresponding successful rates are large. Also, there is a sharp decrease in the successful recall rate for {m > 32, ρ = 0.03125}, {m > 30, ρ = 0.0625}, and {m > 26, ρ = 0.125}. For other weight noise levels and dimensions, we obtained similar phenomena.

5 Conclusion

We have examined the statistical storage behavior of a BAM with multiplicative weight noise. The capacity of such a BAM is m < min(n,p) / (2(1+σ_b²) log min(n,p)). When the number of library pairs is less than that value, the chance of a library pair being a fixed point is very high. Since we expect a BAM to have a certain error correction ability, we have also investigated the capacity of a BAM with weight noise when the initial input is a noisy version of a library pattern. We show that if m < (1−2ρ) min(n,p) / (2(1+σ_b²) log min(n,p)),
a noisy version with ρn (or ρp) errors has a high chance of recalling the desired library pair. Computer simulations have been carried out to confirm our derivations. The results presented here can be extended to the Hopfield network: by adopting the approach set out above, we can easily obtain the corresponding result for the Hopfield network by replacing min(n, p) with n in the above equations.

Acknowledgement. The work is supported by the Hong Kong Special Administrative Region RGC Earmarked Grants (Project No. CityU 115606) and (Project No. CUHK 416806).

References

1. Kohonen, T.: Correlation matrix memories. IEEE Transactions on Computers 21, 353–359 (1972)
2. Palm, G.: On associative memory. Biolog. Cybern. 36, 19–31 (1980)
3. Kosko, B.: Bidirectional associative memories. IEEE Trans. Syst. Man, and Cybern. 18, 49–60 (1988)
4. Leung, C.S.: Encoding method for bidirectional associative memory using projection on convex sets. IEEE Trans. Neural Networks 4, 879–991 (1993)
5. Leung, C.S.: Optimum learning for bidirectional associative memory in the sense of capacity. IEEE Trans. Syst. Man, and Cybern. 24, 791–796 (1994)
6. Wang, Y.F., Cruz, J.B., Mulligan, J.H.: Two coding strategies for bidirectional associative memory. IEEE Trans. Neural Networks 1, 81–92 (1990)
7. Lenze, B.: Improving Leung's bidirectional learning rule for associative memories. IEEE Trans. Neural Networks 12, 1222–1226 (2001)
8. Shen, D., Cruz, J.B.: Encoding strategy for maximum noise tolerance bidirectional associative memory. IEEE Trans. Neural Networks 16, 293–300 (2005)
9. Leung, C.S., Chan, L.W.: The behavior of forgetting learning in bidirectional associative memory. Neural Computation 9, 385–401 (1997)
10. Leung, C.S., Chan, L.W., Lai, E.: Stability and statistical properties of second-order bidirectional associative memory. IEEE Transactions on Neural Networks 8, 267–277 (1997)
11. Wang, B.H., Vachtsevanos, G.: Storage capacity of bidirectional associative memories. In: Proc. IJCNN 1991, Singapore, pp. 1831–1836 (1991)
12. Haines, K., Hecht-Nielsen, R.: A BAM with increased information storage capacity. In: Proc. of the 1988 IEEE Int. Conf. on Neural Networks, pp. 181–190 (1988)
13. Amari, S.: Statistical neurodynamics of various versions of correlation associative memory. In: Proc. of the 1988 IEEE Int. Conf. on Neural Networks, pp. 181–190 (1988)
14. Burr, J.: Digital neural network implementations. In: Neural Networks, Concepts, Applications, and Implementations, vol. III. Prentice Hall, Englewood Cliffs, New Jersey (1991)
15. Holt, J., Hwang, J.-N.: Finite precision error analysis of neural network hardware implementations. IEEE Transactions on Computers 42(3), 281–290 (1993)
16. Lam, P.M., Leung, C.S., Wong, T.T.: Noise-resistant fitting for spherical harmonics. IEEE Transactions on Visualization and Computer Graphics 12(2), 254–265 (2006)
17. Bernier, J.L., Ortega, J., Rodriguez, M.M., Rojas, I., Prieto, A.: An accurate measure for multilayer perceptron tolerance to weight deviations. Neural Processing Letters 10(2), 121–130 (1999)
18. Bernier, J.L., Diaz, A.F., Fernandez, F.J., Canas, A., Gonzalez, J., Martin-Smith, P., Ortega, J.: Assessing the noise immunity and generalization of radial basis function networks. Neural Processing Letters 18(1), 35–48 (2003)
19. Sripad, A., Snyder, D.: Quantization errors in floating-point arithmetic. IEEE Transactions on Acoustics, Speech, and Signal Processing 26, 456–463 (1978)

Fuzzy ARTMAP with Explicit and Implicit Weights
Takeshi Kamio¹, Kenji Mori¹, Kunihiko Mitsubori², Chang-Jun Ahn¹, Hisato Fujisaka¹, and Kazuhisa Haeiwa¹
¹ Asaminami-ku, Hiroshima-shi, Hiroshima, 731-3194, Japan
² Dept. of Electronics and Computer Systems, Takushoku University, 815-1, Tatemachi, Hachioji-shi, Tokyo, 193-0985, Japan
kamio@im.hiroshima-cu.ac.jp

Abstract. ARTMAP is one of the most famous supervised learning systems. Many learning methods for ARTMAP have been proposed since it was devised as a system to solve the Stability-Plasticity Dilemma. AL-SLMAP was implemented by slightly modifying FCSR, which was the original learning method for fuzzy ARTMAP (FAM). Although AL-SLMAP can solve pattern recognition problems in a noisy environment more effectively than FCSR, AL-SLMAP is less suitable for region classification problems than FCSR. This means that AL-SLMAP has some problems which do not exist in FCSR. In this paper, we propose a learning method for FAM with explicit and implicit weights to overcome these problems.

Keywords: ARTMAP, FAM, Explicit weight, Implicit weight.

1 Introduction

Adaptive resonance theory neural network (ART) is an unsupervised learning system which can generate and grow categories by comparing the similarity between an input and memories with the fineness of classification called the 'vigilance parameter'. In contrast, ARTMAP is a supervised learning system which consists of a learning ART (ART_a), a supervising ART (ART_b) and a map field (MF). ART_a and ART_b classify sample data and recognition codes, respectively. MF maps each category in ART_a to the corresponding one in ART_b. An erroneous mapping is corrected by the process called 'match tracking (MT)'. Many learning methods for ART and ARTMAP have been proposed since they were devised as systems to solve the Stability-Plasticity Dilemma. Here, we discuss learning methods for fuzzy ARTMAP (FAM) [1], which can classify not only binary but also analog inputs.

Fast commit slow recode (FCSR) [1] is the original learning method for FAM. Since FCSR has little ability to reduce the influence of input noise, it may generate many categories unnecessarily and decrease the recognition performance. To solve this category proliferation problem, J. S. Lee et al. proposed AL-SLMAP [2], which is the combination of average learning (AL) [2] and slow learning of the map field weight vector (SLMAP) [3]. Also, they reported that AL-SLMAP could solve the character recognition problem in a noisy environment more effectively than FCSR [2].

We have investigated the ability of AL-SLMAP in detail since we were very interested in it from the viewpoint of simplicity and performance. As a result, we have confirmed that AL-SLMAP cannot inhibit the category proliferation for the character recognition problem in a highly noisy environment [4] and that AL-SLMAP is less suitable for the region classification problem than FCSR. These facts made us realize that AL-SLMAP had problems in category selection, weight update, and MT. In this paper, we propose a novel FAM model and its learning method. Each node of ART_a has a weight set for each individual recognition code. When a new node is created in ART_a, all its weight sets define its category. That is to say, they work as explicit weights. As the learning proceeds, each node of ART_a comes to have weight sets which do not define its category. Such weight sets are called implicit weights. Our learning method uses both explicit and implicit weights to solve the problems in AL-SLMAP.
Finally, it has been shown by simulations that our learning method is better than FCSR and AL-SLMAP.

2 Fuzzy ARTMAP Learned by AL-SLMAP

2.1 System
Here, we review fuzzy ARTMAP (FAM) [1] and AL-SLMAP [2]. First, we explain the behavior of fuzzy ART. As shown in Fig.1, it has an attentional subsystem (AS) and an orienting subsystem (OS). AS consists of an input layer (F_0), a matching layer (F_1) and a category layer (F_2). If a normalized original input a ∈ [0, 1]^n is given, F_0 provides F_1 with A ∈ [0, 1]^{2n}. A is the complement code of a (i.e., A ≡ [a, a^c] = [a_1, ..., a_n, 1 − a_1, ..., 1 − a_n]). After A goes through F_1 and reaches F_2, F_2 node j calculates the choice strength T_j:

T_j = |A ∧ U_j| / (α + |U_j|),   (j = 1, ..., m),   (1)

where (p ∧ q)_i = min(p_i, q_i), |p| = Σ_i |p_i|, U_j ∈ [0, 1]^{2n} is the bottom-up weight vector, α > 0 is the choice parameter and m is the number of F_2 nodes. The node J with the maximal choice strength is activated in F_2. If several nodes have the maximal choice strength, the node with the minimal index is selected from them. The F_2 activity y ∈ {0, 1}^m is given by y_J = 1 and y_{j≠J} = 0. The activated F_2 node J provides F_1 with D_J ∈ [0, 1]^{2n}, which is the top-down weight vector, and then the F_1 activity x becomes A ∧ D_J. Then, OS calculates the matching degree S_J from A and x:

S_J = |A ∧ D_J| / |A|.   (2)

In addition, OS compares S_J with the vigilance parameter ρ ∈ [0, 1]. If S_J ≥ ρ, the node J has the category for the present input a. The category is defined by U_J and D_J. If S_J < ρ, the node J is reset. The reset node J remains inactivated while the present input a is applied.
2
1
2