Lecture Notes in Computer Science
Step 5) Perform backtracking: the most recently deleted pair (or single) of features is restored, together with all of its associated components.
Step 6) If the single-deletion process has not yet finished, attempt to delete features one at a time using Step 4 to filter out the unnecessary ones; otherwise, go to Step 7. If the SC mentioned in subsection 2.1.2.3 is satisfied, then continue. Otherwise, go to Step 5 and then Step 7.
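The deletion loop of Steps 4-6 can be sketched as follows. This is a hedged sketch under stated assumptions: `delete_and_check` and `restore` are hypothetical callbacks standing in for retraining with the SC test and for the component-restoring backtracking of Step 5; the real method's ranking and training details are not reproduced here.

```python
def backward_elimination(subset, delete_and_check, restore):
    """Sketch of steps 4-6: double deletion first, falling back to single
    deletion, with backtracking when the stopping criterion (SC) fails.

    subset: feature list ordered best-first (worst-ranked at the end)
    delete_and_check(feats): hypothetical callback; retrains without `feats`
        and returns True if the SC is still satisfied
    restore(feats): hypothetical callback; undoes the last deletion,
        restoring the features with all associated components (step 5)
    """
    step = 2                                  # start with double deletion
    while len(subset) > step:
        candidates = subset[-step:]           # worst-ranked features
        if delete_and_check(candidates):
            subset = subset[:-step]           # deletion accepted
        else:
            restore(candidates)               # step 5: backtrack
            if step == 2:
                step = 1                      # fall back to single deletion
            else:
                break                         # single deletion also failed
    return subset
```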
Feature Subset Selection Using Constructive Neural Nets 379
Before selecting the final subset, we check the remaining features once more for any irrelevant ones. The following steps accomplish this task:
i) Delete the feature with the worst rank, and then retrain the network.
ii) Validate the remaining features using the test set. For the next stage, save the classification accuracy A''_C, the deleted feature responsible, and all its components.
iii) If DF < (TF - 1), continue the deletion process; otherwise, go to the next step. Here, DF and TF refer to the number of deleted features and the total number of features, respectively.
iv) Rank the saved classification accuracies A''_C for which A''_C >= A_C.
v) If better values of A''_C are available, identify the highest-ranked feature among them; otherwise, stop. Delete that feature, together with the corresponding worst-ranked ones, from the subset obtained at step 7-i. After that, recall the components of the network obtained at step 7-ii into the current network and stop.
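Steps i-v above can be sketched as follows. This is a minimal sketch, not the authors' implementation: `rank` and `train_and_score` are hypothetical stand-ins for the contribution ranking and for retraining-plus-validation on the test set.

```python
def final_feature_check(features, rank, train_and_score):
    """Delete worst-ranked features one at a time, keep the best subset seen.

    features: feature indices currently in the subset
    rank: dict mapping feature index -> contribution (lower = worse)
    train_and_score: hypothetical callable(subset) -> classification accuracy
    """
    base_acc = train_and_score(features)
    history = []                        # saved (A'', deleted feature, subset)
    subset = list(features)
    while len(subset) > 1:              # step iii: stop when DF == TF - 1
        worst = min(subset, key=lambda f: rank[f])   # step i: worst rank
        subset = [f for f in subset if f != worst]
        acc = train_and_score(subset)                # step ii: validate
        history.append((acc, worst, list(subset)))
    better = [h for h in history if h[0] >= base_acc]   # step iv
    if better:                                          # step v
        best_acc, _, best_subset = max(better, key=lambda h: h[0])
        return best_subset, best_acc
    return list(features), base_acc
```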
Finally, we obtain the relevant feature subset together with a compact network.

3 Experimental Analysis

This section evaluates CFSS's performance on four well-known benchmark problems, Breast cancer (BCR), Diabetes (DBT), Glass (GLS), and Iris (IRS), obtained from [13]. The numbers of input attributes and output classes are 9 and 2 for BCR, 8 and 2 for DBT, 9 and 6 for GLS, and 4 and 3 for IRS, respectively. Each dataset is partitioned into three sets: a training set, a validation set, and a test set. The first 50% is used as the training set; the second 25% is used as the validation set to monitor the training and stop it when the overall training error rises. The last 25% is used as the test set to evaluate the generalization performance of the network and is never seen by the network during training. In all experiments, two bias nodes with a fixed input of +1 are connected to the hidden layer and the output layer. The learning rate η is set within [0.1, 0.3] and the weights are initialized to random values in [-1.0, +1.0]. Each experiment was carried out 10 times and the reported results are the averages of these 10 runs. All experiments were run on a Pentium IV 3 GHz desktop PC.
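The data handling described above can be sketched as follows. The function names are our own; the 50/25/25 split and the initialization ranges are taken from the text.

```python
import random

def partition(dataset):
    """Split examples 50/25/25 into training/validation/test, as in the text."""
    n = len(dataset)
    n_train, n_val = n // 2, n // 4
    return (dataset[:n_train],                  # first 50%: training
            dataset[n_train:n_train + n_val],   # next 25%: validation
            dataset[n_train + n_val:])          # last 25%: test (held out)

def init_network(n_weights, rng=random):
    """Learning rate drawn from [0.1, 0.3]; weights random in [-1.0, +1.0]."""
    eta = rng.uniform(0.1, 0.3)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_weights)]
    return eta, weights
```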
3.1 Experimental Results

Table 1 shows the average results of CFSS, including the number of selected features and the classification accuracies, before and after the feature-selection process for the BCR, DBT, GLS, and IRS problems. For clarity, the training error in this table is measured following Section 2.5 of [12], and the classification accuracy CA is the ratio of correctly classified examples to the total number of examples in the particular dataset.

380 M.M. Kabir, M. Shahjahan, and K. Murase

Fig. 2. Contribution of attributes in the breast cancer problem for a single run

Fig. 3. Classification accuracy versus the size of the subset in the diabetes problem for a single run
Table 1. Average results of CFSS for the BCR, DBT, GLS, and IRS problems. Values within parentheses are standard deviations. Here, TERR, CA, FS, HN, ITN, and CTN refer to training error, classification accuracy, feature selection, hidden neuron, iteration, and connection, respectively.

       ------------- Before FS -------------   ---------------------- After FS ----------------------
Name   Feature#   TERR (%)       CA (%)        Feature#    TERR (%)       CA (%)        HN#   ITN#    CTN#
BCR    9 (0.00)   3.71 (0.002)   98.34 (0.31)  3.2 (0.56)  3.76 (0.003)   98.57 (0.52)  2.60  249.4   18.12
DBT    8 (0.00)   16.69 (0.006)  75.04 (2.06)  2.5 (0.52)  14.37 (0.024)  76.13 (1.74)  2.66  271.3   16.63
GLS    9 (0.00)   26.44 (0.026)  64.71 (1.70)  4.1 (1.19)  26.56 (0.028)  66.62 (2.18)  3.3   1480.9  42.63
IRS    4 (0.00)   4.69 (0.013)   97.29 (0.00)  1.3 (0.48)  8.97 (0.022)   97.29 (0.00)  3.2   2317.1  19.96
Table 2. Computational time of CFSS for the BCR, DBT, GLS, and IRS problems

Name   Computational Time (Second)
BCR    22.10
DBT    30.07
GLS    22.30
IRS     7.90
Table 3. Connection decrease and accuracy increase for the BCR, DBT, GLS, and IRS problems

Name   Connection Decrease (%)   Accuracy Increase (%)
BCR    45.42                     0.23
DBT    46.80                     1.45
GLS    27.5                      2.95
IRS    30.2                      0.00
In each experiment, CFSS generates a subset containing the minimum number of relevant features available. CFSS thereby produces better CA with a smaller number of hidden neurons (HN). Table 1 shows, for example, that in BCR the network selects a minimal subset of 3.2 relevant features on average, which yields 98.57% CA while using only 2.6 hidden neurons to build a minimal network structure. The results for the remaining problems are broadly similar, with the exception of IRS: for the IRS problem, the network cannot improve accuracy any further after feature selection, but it is able to generate a subset of 1.3 relevant features out of 4 attributes, which is sufficient to maintain the same network performance. In addition, Fig. 2 shows the arrangement of attributes during training according to their contribution in BCR; CFSS easily deletes the unnecessary ones and generates the subset {6, 7, 1}. In contrast, Fig. 3 shows the relationship between classification accuracy and subset size for DBT. We calculated the number of connections of the pruned network at the final stage according to Section 2.3, and the results are shown in Table 1. The computational time for completing the entire FSS process is given in Table 2. We then estimated the decrease in network connections and the corresponding increase in accuracy due to FSS, according to Sections 2.3.1 and 2.3.2, as shown in Table 3. This exhibits the relation between the reduction of network connections and the increase in accuracy.
3.2 Comparison with Other Methods

In this section, we compare the results of CFSS with those obtained by two other methods, NNFS and ICFS, reported in [4] and [6], respectively. The results are summarized in Tables 4-7. Some caution is warranted in these comparisons, since each method employs a different feature-selection technique.
Table 4. Comparison on the average number of selected relevant features for the BCR, DBT, GLS, and IRS data sets

Name   CFSS   NNFS   ICFS (M1)   ICFS (M2)
BCR    3.2    2.70   5           5
DBT    2.5    2.03   2           3
GLS    4.1    -      5           4
IRS    1.3    -      -           -
Table 5. Comparison on the average testing CA (%) for the BCR, DBT, GLS, and IRS data sets

Name   CFSS    NNFS    ICFS (M1)   ICFS (M2)
BCR    98.57   94.10   98.25       98.25
DBT    76.13   74.30   78.88       78.70
GLS    66.62   -       63.77       66.61
IRS    97.29   -       -           -
Table 6. Comparison on the average number of hidden neurons for the BCR, DBT, GLS, and IRS data sets

Name   CFSS   NNFS   ICFS (M1)   ICFS (M2)
BCR    2.60   12     33.55       42.05
DBT    2.66   12     8.15        21.45
GLS    3.3    -      62.5        53.95
IRS    3.2    -      -           -
Table 7. Comparison on the average number of connections for the BCR, DBT, GLS, and IRS data sets

Name   CFSS    NNFS    ICFS (M1)   ICFS (M2)
BCR    18.12   70.4    270.4       338.5
DBT    16.63   62.36   42.75       130.7
GLS    42.63   -       756         599.4
IRS    19.96   -       -           -
Table 4 shows the discriminative capability of CFSS, in which the dimensionality of the input layer is reduced for the four problems mentioned above. For BCR, the result of CFSS is clearly better than both ICFS methods, though not better than NNFS. The result for DBT is average in comparison with the others, while for GLS the result is comparable to or better than the two ICFS methods.

The comparison of the average testing CA for all problems is shown in Table 5. The results for BCR and GLS are better than those of NNFS and ICFS, and the result for DBT is better than that of NNFS. The most important aspect of our proposed method, however, is the smaller number of connections in the final network, due to the smaller number of HNs in the NN architecture. The results in Tables 6 and 7 show that the numbers of HNs and connections of CFSS are much smaller than those of the other three methods. Note that some entries in Tables 4-7 are missing, since the IRS problem was not tested by NNFS and ICFS, and the GLS problem was not tested by NNFS.

4 Discussion

This paper presents a new combinatorial method for feature selection that generates a subset with minimal computation, owing to the minimal size of the hidden layer as well as the input layer. A constructive technique is used to achieve a compact hidden layer, and a straightforward contribution measure leads to a reduced input layer with better performance. Moreover, the composite combination of BE by means of double and single elimination, backtracking, and validation helps CFSS generate subsets proficiently. The results in Table 1 show that CFSS generates subsets with a small number of relevant features while producing better performance on the four benchmark problems. The results for relevant subset generation and generalization performance are better than or comparable to those of the other three methods, as shown in Tables 4 and 5. Wrapper approaches to FSS have long been avoided because of their heavy computational cost.
In CFSS, the computational cost is much lower. As seen in Table 3, FSS reduces the network connections by 37.48% while gaining 1.18% in network accuracy, averaged over the four problems. The computational time needed to complete the entire process in CFSS for the different problems is shown in Table 2. We believe these values are sufficiently low, especially for the clinical field: the system can deliver a patient's diagnostic result to the doctor within a minute. Although an exact comparison is difficult, other methods such as NNFS and ICFS may take 4-10 times longer, since their numbers of hidden neurons and connections are much larger, as seen in Tables 6 and 7, respectively.
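The averaged figures quoted here can be checked against the per-problem values of Table 3. Note that the mean accuracy gain computed this way lands near, though not exactly at, the quoted 1.18%; rounding in the source is a plausible cause.

```python
# Per-problem values copied from Table 3.
conn_decrease = {"BCR": 45.42, "DBT": 46.80, "GLS": 27.5, "IRS": 30.2}
acc_increase  = {"BCR": 0.23,  "DBT": 1.45,  "GLS": 2.95, "IRS": 0.00}

# Averages over the four problems.
avg_conn = sum(conn_decrease.values()) / len(conn_decrease)   # -> 37.48
avg_acc  = sum(acc_increase.values())  / len(acc_increase)    # close to 1.18
```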
CFSS thus provides feature subset selection with minimal computation. In addition, during subset generation, the generated subset is validated at every step. The reason is that during BE we build a composite SC, which detects the local minimum at which network training should be stopped. By implementing this criterion, the network achieves significant performance and there is no need for a separate final validation of the generated subset. Furthermore, the additional check for irrelevant features in BE adds to the completeness of CFSS. In this study we applied CFSS to datasets with up to 9 features. To obtain results more relevant to real tasks, we intend to apply CFSS to datasets with larger numbers of features in the future. Extracting rules from an NN is always in demand for interpreting how the network works, and a compact NN is desirable for this purpose. Since CFSS can help fulfill this requirement, rule extraction from NNs is a further task for future work.
5 Conclusion

This paper presents a new approach to feature subset selection based on the contribution of input attributes in an NN. The combination of constructive learning, contribution measurement, and backward elimination underlies the success of CFSS. Initially, a basic constructive algorithm is used to determine a minimal and optimal NN structure. Subsequently, one-by-one removal of input attributes is adopted, which is not computationally expensive. Finally, backward elimination with new stopping criteria is used to generate the relevant feature subset efficiently. To evaluate CFSS, we tested it on four real-world problems: breast cancer, diabetes, glass, and iris. The experimental results confirm that CFSS has a strong feature-selection capability: it removes irrelevant features from the network and generates a feature subset while producing a compact network with minimal computational cost.

Acknowledgements

Supported by grants to KM from the Japanese Society for the Promotion of Science, the Yazaki Memorial Foundation for Science and Technology, and the University of Fukui.
References

1. Liu, H., Yu, L.: Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering 17(4), 491–502 (2005)
2. Dash, M., Liu, H.: Feature Selection for Classification. Intelligent Data Analysis 1(3), 131–156 (1997)
3. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97, 273–324 (1997)
4. Setiono, R., Liu, H.: Neural Network Feature Selector. IEEE Transactions on Neural Networks 8 (1997)
5. Milne, L.: Feature Selection using Neural Networks with Contribution Measures. In: 8th Australian Joint Conference on Artificial Intelligence, Canberra, November 27 (1995)
6. Guan, S., Liu, J., Qi, Y.: An incremental approach to Contribution-based Feature Selection. Journal of Intelligent Systems 13(1) (2004)
7. Schuschel, D., Hsu, C.: A weight analysis-based wrapper approach to neural nets feature subset selection. In: Proceedings of the 10th IEEE International Conference on Tools with Artificial Intelligence (1998)
8. Hsu, C., Huang, H., Schuschel, D.: The ANNIGMA-Wrapper Approach to Fast Feature Selection for Neural Nets. IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics 32(2), 207–212 (2002)
9. Dunne, K., Cunningham, P., Azuaje, F.: Solutions to Instability Problems with Sequential Wrapper-based Approaches to Feature Selection. Journal of Machine Learning Research (2002)
10. Phatak, D.S., Koren, I.: Connectivity and performance tradeoffs in the cascade correlation learning architecture. Technical Report TR-92-CSE-27, ECE Department, UMASS, Amherst (1994)
11. Rumelhart, D.E., McClelland, J.: Parallel Distributed Processing. MIT Press, Cambridge (1986)
12. Prechelt, L.: PROBEN1 - A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Faculty of Informatics, University of Karlsruhe, Germany (1994)
13. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. Dept. of Information and Computer Sciences, University of California, Irvine (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
Dynamic Link Matching between Feature Columns for Different Scale and Orientation

Yasuomi D. Sato 1, Christian Wolff 1, Philipp Wolfrum 1, and Christoph von der Malsburg 1,2

1 Frankfurt Institute for Advanced Studies (FIAS), Johann Wolfgang Goethe University, Max-von-Laue-Str. 1, 60438 Frankfurt am Main, Germany
2 Computer Science Department, University of Southern California, LA 90089-2520, USA

Abstract. Object recognition in the presence of changing scale and orientation requires mechanisms to deal with the corresponding feature transformations. Using Gabor wavelets as an example, we approach this problem in a correspondence-based setting. We present a mechanism for finding feature-to-feature matches between corresponding points in pairs of images taken at different scale and/or orientation (leaving aside for the moment the problem of simultaneously finding point correspondences). The mechanism is based on a macro-columnar cortical model and dynamic links. We present tests of the ability to find the correct feature transformation in spite of added noise.

1 Introduction

When trying to set two images of the same object or scene into correspondence with each other, so that they can be compared in terms of similarity, it is necessary to find point-to-point correspondences in the presence of changes in scale or orientation (see Fig. 1). It is also necessary to transform local features (unless one chooses to work with features that are invariant to scale and orientation, accepting the reduced information content of such features). Correspondence-based object recognition systems [1,2,3,4] have so far mainly addressed the issue of finding point-to-point correspondences, leaving local features unchanged in the process [5,6]. In this paper, we propose a system that can not only transform features for comparison purposes, but also recognize the transformation parameters that best match two sets of local features, each taken from one point in an image.
Our eventual aim is to find point correspondences and feature correspondences simultaneously in one homogeneous dynamic link matching system, but here we take point correspondences as given for the time being. Both theoretical [7] and experimental [8] investigations suggest 2D Gabor-based wavelets as the features used in visual cortex. These are best sampled in a log-polar manner [11]. This representation has been shown to be particularly useful for face recognition [12,13], and due to its inherent symmetry it is highly appropriate for implementing a transformation system for scale and orientation.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 385–394, 2008. © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. When images of an object differ in (a) scale or (b) orientation, correspondence between them can only be established if the features extracted at a given point (e.g., at the dot in the middle of the images) are also transformed during comparison

The work described here is based entirely on the macro-columnar model of [14]. In computer simulations we demonstrate feature transformation and transformation parameter recognition.

2 Concept of Scale and Rotation Invariance

Let there be two images, called I and M (for image and model). The Gabor-based wavelet transform of an image point yields a jet J̄^L, whose components are defined as convolutions of the image with a family of Gabor functions:

$$\bar{J}^L_{k,l}(x_0, a, \theta) = \int_{\mathbb{R}^2} I(x)\, \frac{1}{a^2}\, \psi\!\left(\frac{1}{a}\, Q(\theta)(x_0 - x)\right) d^2x, \qquad (1)$$

$$\psi(x) = \frac{1}{D^2}\, \exp\!\left(-\frac{1}{2D^2}\, |x|^2\right) \left[\exp\!\left(i\, x^T \cdot e_1\right) - \exp\!\left(-\frac{D^2}{2}\right)\right], \qquad (2)$$

$$Q(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}. \qquad (3)$$

Here a simultaneously represents the spatial frequency and controls the width of a Gaussian envelope window, and D represents the standard deviation of the Gaussian. The individual components of the jet are indexed by n possible orientations and (m + m_1 + m_2) possible spatial frequencies, which are θ = πk/n (k ∈ {0, 1, ..., n−1}) and a = a_0^l (0 < a_0 < 1, l ∈ {0, 1, ..., m + m_1 + m_2 − 1}). The parameters m_1 and m_2 fix the number of scale steps by which the image can be scaled up or down, respectively, and m (> 1) is the number of scales in the jets of the model domain. The Gabor jet J̄^L is defined as the set {J̄^L_{k,l}} of N_L Gabor wavelet components extracted from the position x_0 of the image. In order to test the ability to find the correct feature transformation described below, we add noise of strength σ_1 to the Gabor wavelet components, J̃^L_{k,l} = J̄^L_{k,l}(1 + σ_1 S_ran), where the S_ran are random numbers between 0 and 1. Instead of the Gabor jet J̄^L itself, we employ the sum-normalized J̃^L. Jets can be visualized in angular-eccentricity coordinates, in which the orientation and spatial frequency of a jet are arranged circularly and radially, respectively (Fig. 2).
Fig. 2. Comparison of Gabor jets in the model domain M and the image domain I (left and right, respectively, in the two part-figures). Jets are visualized in log-polar coordinates: the orientation θ of the jets is arranged circularly while the spatial frequency l is set radially. The jet J^M of the model domain M is shown on the right, while the jet J^I from the transformed image in the image domain I is shown on the left. Arrows indicate which components of J^I and J^M are to be compared. (a) Comparison of scaled jets; (b) comparison of rotated jets.

The components of the jets taken from corresponding points in images I and M may be arranged according to orientation (rows) and scale (columns):

$$J^I = \begin{pmatrix} 0 & \cdots & 0 & J^I_{0,m_1} & \cdots & J^I_{0,m_1+m-1} & 0 & \cdots & 0 \\ \vdots & & \vdots & \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & J^I_{n-1,m_1} & \cdots & J^I_{n-1,m_1+m-1} & 0 & \cdots & 0 \end{pmatrix}, \qquad (4)$$

$$J^M = \begin{pmatrix} J^M_{0,m_1} & \cdots & J^M_{0,m_1+m-1} \\ \vdots & & \vdots \\ J^M_{n-1,m_1} & \cdots & J^M_{n-1,m_1+m-1} \end{pmatrix}. \qquad (5)$$

Let us assume the two images M and I to be compared have exactly the same structure, apart from being transformed relative to each other, and that the transformation conforms to the sample grid of jet components. Then there will be pair-wise identities between components in Eqs. (4) and (5). If the jet in I is scaled relative to the jet in M, the non-zero components of J^I are shifted along the horizontal (or, in Fig. 2(a), radial) coordinate. If the image I, and correspondingly the jet J^I, is rotated, then the jet components are circularly permuted along the vertical axis in Eq. (4) (see Fig. 2(b)). When comparing scaled and rotated jets, the non-zero components of J^I are shifted along both axes simultaneously. There are model jet components of m different scales, and to allow for m_1 steps of scaling the image I down and m_2 steps of scaling it up, the jet in Eq. (4) is padded on the left and right with m_1 and m_2 columns of zeros, respectively.

Fig. 3. A dynamic link matching network model of a pair of single macro-columns. In the I and M domains, the network consists of, respectively, N_I and N_M feature units (mini-columns). The units in the I and M domains represent components of the jets J^I and J^M. The control domain C controls the interconnections between the I and M macro-columns, which are all-to-all at the beginning of a ν cycle. By comparing the two jets and through competition among the control units in C, the initial all-to-all interconnections are dynamically reduced to a one-to-one interconnection.

3 Modelling a Columnar Network

In analogy to [14], we set up a dynamical system of variables, both for jet components and for all possible correspondences between jet components. These variables are interpreted as the activities of cortical mini-columns (here called "units", in this case "feature units"). The set of units corresponding to a jet, either in I or in M, forms a cortical macro-column (or "column" for short), and its units inhibit each other, the strength of inhibition being controlled by a parameter ν, which cyclically ramps up and falls down. In addition, there is a column of control units, forming the control domain C. Each control unit stands for one pair of possible values of relative scale and orientation between I and M and evaluates the similarity in the activity of feature units under the corresponding transformation. The whole network is shown schematically in Fig. 3.

Each domain, or column in our simple case, consists of a set of unit activities {P^L_1, ..., P^L_{N_L}} in the respective domain (L = I or M). N_I = (m + m_1 + m_2) × n and N_M = m × n represent the total numbers of units in the I and M domains, respectively. The control unit activations are P^C = (P^C_1, ..., P^C_{N_C})^T, where N_C = (m_1 + m_2 + 1) × n. The equation system for the activations in the domains is given by

$$\frac{dP^L_\alpha}{dt} = f_\alpha(P^L) + \kappa_L C_E J^L_\alpha + \kappa_L (1 - C_E)\, E^{LL'}_\alpha + \xi_\alpha(t), \quad (\alpha = 1, \ldots, N_L \text{ and } L' = \{M, I\} \setminus L), \qquad (6)$$

$$\frac{dP^C_\gamma}{dt} = f_\gamma(P^C) + \kappa_C F_\gamma + \xi_\gamma(t), \quad (\gamma = 1, \ldots, N_C), \qquad (7)$$
Fig. 4. Time course of all activities in the domains I, M, and C over a ν cycle for two jets without relative rotation and scale, for a system of size m = 5, n = 8, m_1 = m_2 = 4. The activity of the units in the M and I columns is shown within the bars on the right and at the bottom of the panels, respectively, each of the eight sub-blocks corresponding to one orientation, with individual components corresponding to different scales. The indices k and l at the top label the orientation and scale of the I jet, respectively. The large matrix C contains, in the top version, as non-zero (white) entries all allowed component-to-component correspondences between the two jets (the empty, black entries would correspond to forbidden relative scales outside the range set by m_1 and m_2). The matrix consists of 8 × 8 blocks, corresponding to the 8 × 8 possible correspondences between jet orientations, and each of these sub-matrices contains m × (m + m_1 + m_2) entries to make room for all possible scale correspondences. Each control unit controls (and evaluates) a whole jet-to-jet correspondence with the help of its entries in this matrix. The matrix in the lowest panel has active entries corresponding to just one control unit (for identical scale and orientation). Active units are white, silent units black. Unit activity is shown for three moments within one ν cycle. At t = 3.4 ms, the system has already singled out the correct scale, but is still totally undecided as to relative orientation. System parameters are κ_I = κ_M = 2, κ_C = 5, C_E = 0.99, σ = 0.015 and σ_1 = 0.1.
where ξ_α(t) and ξ_γ(t) are Gaussian noise of strength σ_2. C_E ∈ [0, 1] scales the ratio between the Gabor jet J^L_α and the total input E^{LL'}_α from units in the other domain L', and κ_L controls the strength of the projection from L' to L. κ_C controls the strength of the influence of the similarity F_γ on the control units.

Here we explain the remaining terms in the above equations. The function f_α(·) is

$$f_\alpha(P^L) = a_1 P^L_\alpha \left( P^L_\alpha - \nu(t) \max_{\beta = 1, \ldots, N_L} \{P^L_\beta\} - (P^L_\alpha)^2 \right), \qquad (8)$$

where a_1 = 25. The function ν(t) describes a cycle in time during which the feedback is periodically modulated:

$$\nu(t) = \begin{cases} (\nu_{max} - \nu_{min})\, \dfrac{t - T_k}{T_1} + \nu_{min}, & (T_k \le t < T_1 + T_k) \\ \nu_{max}, & (T_1 + T_k \le t < T_1 + T_k + T_{relax}). \end{cases} \qquad (9)$$

Here T_1 [ms] is the duration during which ν increases with t, while the T_k (k = 1, 2, 3, ...) are the periodic times at which the inhibition ν starts to increase. T_relax is a relaxation time after ν has been increased. In our study we set ν_min = 0.4, ν_max = 1, T_1 = 36.0 ms and T_relax = T_1/6. Because of the increasing ν(t) and the noise in Eqs. (6) and (7), only one mini-column in each domain remains active at the end of the cycle, while the others are deactivated, as shown in Fig. 4. If we define the mean-free column activities as $\tilde{P}^L_\alpha := P^L_\alpha - (1/N_L) \sum_{\alpha'} P^L_{\alpha'}$, the interaction terms in Eqs. (6) and (7) are

$$E^{LL'}_\alpha = \sum_{\beta=1}^{N_C} \sum_{\alpha'=1}^{N_{L'}} P^C_\beta\, C^\beta_{\alpha\alpha'}\, \tilde{P}^{L'}_{\alpha'}, \qquad (10)$$

$$F_\beta = \sum_{\gamma=1}^{N_M} \sum_{\gamma'=1}^{N_I} \tilde{P}^M_\gamma\, C^\beta_{\gamma\gamma'}\, \tilde{P}^I_{\gamma'}. \qquad (11)$$
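The winner-take-all effect of the inhibition ramp can be sketched with plain Euler integration of Eq. (8) under Eq. (9). This is a sketch under stated assumptions: a single column in isolation, the jet drive folded into the initial conditions, the relaxation phase omitted; a_1, ν_min, ν_max and T_1 follow the text, while the step size, seeding and clamping are illustrative choices.

```python
import random

def run_nu_cycle(jet, a1=25.0, nu_min=0.4, nu_max=1.0, T1=36.0,
                 dt=0.01, sigma=0.001, seed=0):
    """Euler-integrate Eq. (8) over the ramp phase of one nu cycle, Eq. (9)."""
    rng = random.Random(seed)
    p = [0.4 + 0.2 * j for j in jet]          # activities seeded by the jet
    t = 0.0
    while t < T1:
        nu = (nu_max - nu_min) * t / T1 + nu_min    # Eq. (9), ramp phase
        pmax = max(p)
        p = [pi + dt * (a1 * pi * (pi - pi * pi - nu * pmax)
                        + sigma * rng.gauss(0.0, 1.0))
             for pi in p]                     # Eq. (8) plus Gaussian noise
        p = [min(max(pi, 0.0), 1.0) for pi in p]    # keep activities bounded
        t += dt
    return p
```

By the end of the cycle only the unit seeded by the strongest jet component should remain clearly active, mirroring the deactivation of all but one mini-column described above.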
Here N_C = n × (m_1 + m_2 + 1) is the number of permitted transformations between the two domains (n different orientations and m_1 + m_2 + 1 possible scales). The expression C^β_{αα'} designates a family of matrices rotating and scaling one jet into another. If we write β = (o, s), so that o identifies the rotation and s the scaling, then we can write the C matrices as a tensor product, C^β = A_o ⊗ B_s, where A_o is an n × n matrix with a wrap-around shifted diagonal of entries 1 and zeros otherwise, implementing rotation o of jet components of any orientation, and B_s is an m × (m + m_1 + m_2) matrix with a shifted diagonal of entries 1 and zeros otherwise, implementing an up or down shift of jet components of any scale. The C matrix implementing identity looks like the lowest panel of Fig. 4. The function E of Eq. (10) transfers feature unit activity from one domain to the other and is a superposition of all possible transformations, weighted with the activity of the control units. The function F in Eq. (11) evaluates the similarity of the feature unit activity in one domain with that in the other under mutual transformation.
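These correspondence matrices can be sketched directly; the function names and the dense-matrix representation are our own illustrative choices.

```python
import numpy as np

def A(o, n):
    """n x n wrap-around shifted diagonal: circular rotation of orientations."""
    return np.eye(n)[:, [(i + o) % n for i in range(n)]]

def B(s, m, m1, m2):
    """m x (m + m1 + m2) shifted diagonal: scale shift by s columns."""
    mat = np.zeros((m, m + m1 + m2))
    for i in range(m):
        mat[i, i + s] = 1.0                 # ones on the offset diagonal
    return mat

def C(o, s, n, m, m1, m2):
    """Control matrix C^beta = A_o (x) B_s for rotation o and scale shift s."""
    return np.kron(A(o, n), B(s, m, m1, m2))
```

Applied to a flattened, zero-padded jet of Eq. (4), C(o, s, ...) picks out the components that the transformation (o, s) puts into register with the model jet of Eq. (5).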
Through the columnar dynamics of Eq. (6), the relative strengths of the jet components are expressed in the history of the feature unit activities (the strongest jet component, for instance, letting its unit stay on longest), so that the sum of products in Eq. (11), integrated over time in Eq. (7), indeed expresses jet similarities. The time course of all unit activities in our network model during a ν cycle is shown in Fig. 4. In this figure, the non-zero components of the jets used in the I and M domains are identical, without any rotation or scale. At the beginning of the ν cycle, all units in the I, M and C domains are active, as indicated in Fig. 4 by the white jet-bars on the right and at the bottom, and by all admissible correspondences in the matrix being on. The activity of the units in each domain is gradually reduced following Eqs. (6)-(11). In the intermediate state at t = 3.4 ms, all control units for the wrong scale have switched off, but the control units for all orientations are still on. At t = 12.0 ms, finally, only one control unit has survived, the one that correctly identifies the identity map. This state remains stable during the rest of the ν cycle.

4 Results
In this section, we describe tests of the functionality of our macro-columnar model, studying the robustness of matching to noise introduced into the image jets (parameter σ_1) and into the dynamics of the units, Eqs. (6) and (7) (parameter σ), as well as the robustness to the admixture of input from the other domain, Eq. (6), due to a non-vanishing (1 - C_E). A match is counted as correct when the surviving control unit corresponds to the correct transformation pair by which the I jet differs from the M jet. Jet noise models differences between jets actually extracted from natural images. Dynamic noise reflects the signal fluctuations to be expected in units (mini-columns) composed of spiking neurons, and a non-zero admixture parameter may be relevant in a system in which the transfer of jet information between the domains is desired.

For both of the experiments described below, we used Gabor jets extracted from the centers of real facial images. The transformed jets used for matching were extracted from images that had been scaled and rotated relative to the center position by exactly the same factors and angles that are used in the Gabor transform. This ensures that there exists a perfect match between any two jets produced from two different rotated and scaled versions of the same face.

The first experiment uses a single facial image and investigates the influence of noise in the units (σ) and of the parameter C_E. Using a pair of identical jets, we obtain temporal averages of the correct matching over 10 ν cycles for each size (n = 4, 5, ..., 9, 10, with m = 4, m_1 = 0 and m_2 = m - 1) of the macro-column. From these temporal averages, we calculate the sampling average over all n. The standard error is estimated as σ_d/√N_1, where σ_d is the standard deviation of the sampling average and N_1 is the sample size. Fig. 5(a) shows results for κ_I = κ_M = 1 and κ_C = 5. For strong noise (σ = 0.015 to 0.02), the correct matching probabilities for C_E = 0.98 to C_E = 0.96
take better values than those for C_E = 1. Interestingly, for these low C_E values, matching gets worse for weaker noise, collapsing in the case of σ = 0. This effect requires further investigation. However, for κ_I = 2.3 and κ_M = 5, we found that the correct matching probability takes higher values than in the case of Fig. 5(a). In particular, with the higher noise level σ = 0.015 our system even demonstrates perfect correct matching for C_E = 0.98, independent of n.

Fig. 5. Probability of correct match between units in the I and M domains. (a) κ_I = κ_M = 1 and κ_C = 5. (b) κ_I = 2.3, κ_M = 5 and κ_C = 5.

Next, we investigated the robustness of the matching against the second type of noise (σ_1), using 6 different face types. Since the original image is resized and/or rotated with m_1 + m_2 + 1 = 7 scales and n = 8 orientations, a total of 56 images could be employed in the I domain. Here our system is set with κ_I = κ_M = 6.5, κ_C = 5.0 and C_E = 0.98. For each face image, we take temporal averages of the correct matching probability for each size-orientation of the image, in a manner similar to that described above. Averages and standard errors of the temporal averages over the 6 different facial images are plotted as functions of σ_1 (Fig. 6). The result of this experiment is independent of σ as long as σ ≤ 0.02.

Fig. 6. Probability of correct matching for comparisons of two identical faces. 6 facial images are used. Our system is set with m = 4, m_1 = m_2 = m - 1, n = 8, κ_I = κ_M = 6.5, κ_C = 5.0, C_E = 0.98 and σ = 0.0.
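The averaging procedure used in these experiments (per-condition temporal averages pooled into a sample mean, with standard error σ_d/√N_1 as stated above) can be sketched as follows; the function name is our own.

```python
import statistics

def mean_and_stderr(samples):
    """samples: per-condition average matching probabilities.

    Returns the sampling average and its standard error sigma_d / sqrt(N_1),
    where sigma_d is the sample standard deviation and N_1 the sample size.
    """
    n1 = len(samples)
    mean = statistics.fmean(samples)
    sigma_d = statistics.stdev(samples)
    return mean, sigma_d / n1 ** 0.5
```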
Dynamic Link Matching between Feature Columns 393

As Fig. 6 shows, as random noise in the jets is increased, the correct matching probability either decreases smoothly for one image or abruptly drops to around 0 at a certain σ_1 for another. We have also obtained perfect matching, independent of σ_1, when using the same but rotated or resized face. The point we would most like to emphasize is that the probability stays above 87.74% for σ < 0.02 and σ_1
we can say that the network model has a high recognition ability for scale and orientation invariance, independent of facial type.

5 Discussion and Conclusion

The main purpose of this communication is to convey a simple concept. Much more work needs to be done on the way to practical applications, for instance, more experiments with features extracted from independently taken images of different scale and orientation, to better bring out the strengths and weaknesses of the approach. In addition to comparing and transforming local packages of feature values (our jets), it will be necessary to also handle spatial maps, that is, sets of point-to-point correspondences, a task we will approach next. Real applications involve, of course, a continuous range of transformation parameters, whereas here we have admitted only transformations from the same sample grid used for defining the family of wavelets in a jet. We hope to address this problem by working with continuous superpositions of transformation matrices for the neighboring transformation parameter values that straddle the correct value. Our approach may be seen as using brute force, as it requires many separate control units to sample the space of transformation parameters with enough density. However, the same set of control units can be used in a whole region of visual space and, in addition, for controlling point-to-point maps between regions, as spatial maps and feature maps stand in one-to-one correspondence to each other. The number of control units can be reduced even further if transformations are performed in sequence, by consecutive, separate layers of dynamic links. Thus, the transformation from an arbitrary segment of primary visual cortex to an invariant window could be done in the sequence translation – scaling – rotation. In this case, the number of required control units would be the sum, not the product, of the number of samples for each individual transformation.
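The sum-versus-product counting argument above is easy to make concrete. With hypothetical numbers of samples per transformation parameter (the figures below are not from the paper), sampling all parameter combinations simultaneously needs one control unit per combination, while consecutive layers of dynamic links need only one set of units per transformation:

```python
from math import prod

# Hypothetical numbers of control samples per transformation
samples = {"translation": 30, "scaling": 7, "rotation": 8}

units_simultaneous = prod(samples.values())  # one control unit per combination
units_sequential = sum(samples.values())     # separate layer per transformation

print(units_simultaneous, units_sequential)  # 1680 45
```

Even for these modest counts, the sequential scheme reduces the control population by well over an order of magnitude, at the possible cost (noted above) of harder correspondence finding between the intermediate domains.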
A disadvantage of that approach might be added difficulties in finding correspondences between the domains. Further work is required to elucidate these issues.

Acknowledgements

This work was supported by the EU project Daisy, FP6-2005-015803, and by the Hertie Foundation. We would like to thank C. Weber for help with preparing the manuscript.

Perturbational Neural Networks for Incremental Learning in Virtual Learning System

Eiichi Inohira 1, Hiromasa Oonishi 2, and Hirokazu Yokoi 1
1
{inohira,yokoi}@life.kyutech.ac.jp
2 Mitsubishi Heavy Industries, Ltd., Japan

Abstract. This paper presents a new type of neural network, the perturbational neural network, to realize incremental learning in autonomous humanoid robots. In our previous work, a virtual learning system was provided to realize the exploration of plausible behavior in a robot's brain. Neural networks can generate plausible behavior in an unknown environment without time-consuming exploration. Although an autonomous robot should grow step by step, conventional neural networks forget prior learning when trained with a new dataset. The proposed network adds the outputs of a sub neural network to the weights and thresholds of a main neural network. Incremental learning and high generalization capability are realized by slightly changing the mapping of the main neural network. We show that the proposed network realizes incremental learning without forgetting, through numerical experiments with a two-dimensional stair-climbing bipedal robot.

1 Introduction

Recently, humanoid robots such as ASIMO [1] have developed dramatically in terms of hardware and show promise for working just like humans. Although researchers have studied artificial intelligence for a long time, humanoid robots do not yet have autonomy high enough to make experts unnecessary. Humanoid robots should accomplish missions by themselves rather than having experts supply solutions such as models, algorithms, and programs.

Researchers have studied how to realize a robot with learning ability through trial and error. Such studies use so-called soft computing techniques such as reinforcement learning [2] and central pattern generators (CPG) [3]. Learning by a robot saves the expert's work to some degree but takes much time. These techniques are less efficient than humans, who instantly act depending on the situation by using imagination and experience. Even if a human fails, the failure serves as experience.
In particular, humans know the characteristics of the environment and of their own behavior, and simulate trial and error in their brains to explore plausible behavior. In our previous work, Yamashita et al. [4] proposed a bipedal walking control system with a virtual environment, based on the motion control mechanism of primates. In that study, exploring plausible behavior by trial and error is carried out in a robot's brain. However, the problem is that exploring takes much time. Kawano et al. [5] introduced into the previous work the learning of a relation between environmental data and an optimized behavior, to save time. Generating behavior becomes fast because a trained neural network (NN) immediately outputs plausible behavior due to its generalization capability, i.e., the ability to produce a desired output for an unknown input. However, a problem in this previous work is that conventional neural networks cannot realize incremental learning: a trained NN forgets prior learning when trained with a new dataset. Realizing incremental learning is indispensable for learning various behaviors step by step.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 395–404, 2008.
© Springer-Verlag Berlin Heidelberg 2008

We present a new type of NN, the perturbational neural network (PNN), to achieve incremental learning. A PNN consists of a main NN and a sub NN. The main NN learns with a representative dataset and never changes after that. The sub NN learns with a new dataset so that the output error of the main NN for that dataset is canceled. The sub NN is connected to the main NN as additional terms on the weights and thresholds of the main NN. We expect that these connections slightly change the characteristics of the main NN, meaning that a perturbation is given to the mapping of the main NN, and that incremental learning therefore does not sacrifice generalization capability.
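The paper gives no equations for the PNN at this point, but the connection scheme can be sketched schematically. The toy model below is our own construction, not the authors' exact architecture: the main network is a single frozen tanh layer, and a linear sub network emits one additive perturbation per main-network weight and threshold. With the sub network at zero, the original mapping is reproduced exactly, which is what preserves prior learning.

```python
import math

def tanh_layer(x, W, b):
    # y_j = tanh(sum_i W[j][i] * x[i] + b[j])
    return [math.tanh(sum(wji * xi for wji, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

class PerturbationalNN:
    """Main layer is frozen; the sub network's outputs are added to the
    main layer's weights and thresholds, perturbing its mapping."""

    def __init__(self, W_main, b_main):
        self.W_main, self.b_main = W_main, b_main
        n_out, n_in = len(W_main), len(W_main[0])
        n_pert = n_out * n_in + n_out      # one output per weight + threshold
        # Sub network: a single linear layer, initialized to zero so the
        # untrained PNN equals the main network.
        self.sub_W = [[0.0] * n_in for _ in range(n_pert)]
        self.sub_b = [0.0] * n_pert

    def forward(self, x):
        d = [sum(w * xi for w, xi in zip(row, x)) + bj
             for row, bj in zip(self.sub_W, self.sub_b)]
        n_out, n_in = len(self.W_main), len(self.W_main[0])
        W = [[self.W_main[j][i] + d[j * n_in + i] for i in range(n_in)]
             for j in range(n_out)]
        b = [bj + dj for bj, dj in zip(self.b_main, d[n_out * n_in:])]
        return tanh_layer(x, W, b)

W, b = [[0.5, -0.3], [0.2, 0.8]], [0.1, -0.1]
pnn = PerturbationalNN(W, b)
x = [1.0, 2.0]
assert pnn.forward(x) == tanh_layer(x, W, b)   # zero sub net: unchanged

pnn.sub_b[-1] = 0.05                           # perturb one threshold
assert pnn.forward(x) != tanh_layer(x, W, b)   # mapping shifts slightly
```

Training the sub network (omitted here) would adjust `sub_W` and `sub_b` by backpropagation on the new dataset while `W_main` and `b_main` stay fixed, which is how the scheme avoids overwriting prior learning.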
2 A Virtual Learning System for a Biped Robot

We assume that a robot tries to achieve a task in an unknown environment. Virtual learning is defined as learning from virtual experience in a virtual environment in a robot's brain. The virtual environment is generated from sensory information on the real environment around the robot and is used for exploring behavior fit for a task without real action. Exploring such behavior in the virtual environment is regarded as virtual experience gained by trial and error or ingenuity. Virtual learning means memorizing a relation between environment and behavior under a particular task. The benefits of virtual learning are (1) reducing the risk of the robot failing in the real world, such as by falling or colliding, and (2) enabling the robot to act immediately and achieve learned tasks in similar environments. The former concerns the robot's safety and physical wear; the latter saves work and time in achieving a task.

We presented a virtual learning system for a biped robot [4,5]. This virtual learning system has a virtual environment and NNs for learning, as shown in Fig. 1. We assume that a central pattern generator (CPG) generates the motion of a biped robot. The CPG of a biped robot consists of neural oscillators corresponding to its joints. Behavior is then described by CPG parameters rather than joint angles. The CPG parameters controlling the robot are called the CPG input. An optimized CPG input for a plausible behavior is obtained by exploring in the virtual environment.

Virtual learning is realized by NNs. An NN learns a mapping from environmental data to the optimized CPG input, i.e., a mapping from the real environment to the robot's behavior. NNs have generalization capability, which means a desired output can be generated for an unknown input. When an NN has learned a mapping
[Fig. 1: block diagram of the virtual learning system; labels include "Exploring optimized motion", "Does it serve a purpose?", "NN", and "CPG input".]
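Section 2 states that the CPG consists of neural oscillators, one per joint, with behavior encoded by CPG parameters, but the excerpt does not give the oscillator equations. As an illustration only, the sketch below integrates a standard two-neuron mutual-inhibition (Matsuoka-style) oscillator; the choice of this model and all parameter values are our assumptions, standing in for whatever "CPG input" the NN would emit.

```python
import math

def oscillator_step(state, dt, tau=0.05, tau_p=0.3, beta=2.5, w=2.0, s=1.0):
    """One Euler step of a two-neuron mutual-inhibition oscillator.
    tau, tau_p, beta, w, s play the role of CPG parameters."""
    u1, v1, u2, v2 = state
    y1, y2 = max(0.0, u1), max(0.0, u2)    # firing rates (half-rectified)
    du1 = (-u1 - beta * v1 - w * y2 + s) / tau
    dv1 = (-v1 + y1) / tau_p               # slow self-adaptation
    du2 = (-u2 - beta * v2 - w * y1 + s) / tau
    dv2 = (-v2 + y2) / tau_p
    return (u1 + dt * du1, v1 + dt * dv1, u2 + dt * du2, v2 + dt * dv2)

state = (0.1, 0.0, 0.0, 0.0)   # slight asymmetry starts the rhythm
joint_signal = []               # y1 - y2 would drive one joint
for _ in range(5000):           # 5 s at dt = 1 ms
    state = oscillator_step(state, dt=0.001)
    joint_signal.append(max(0.0, state[0]) - max(0.0, state[2]))
```

In a setup like the one described in Section 2, an NN would map environmental data to parameters such as `tau`, `w`, and `s` for each joint's oscillator, rather than to joint-angle trajectories directly.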