Lecture Notes in Computer Science
neurons in hidden and output layers.

3.1 Experimental Results

Table 2 shows the results of PDCA for 50 independent runs on several classification problems. The generalization performance of the produced ANNs is measured by the testing error rate (TER), which refers to the percentage of wrong classifications produced by trained ANNs. The number of hidden layers (HL) and hidden neurons (HN), respectively, refer to the total number of hidden layers and hidden neurons in the produced ANNs. The SD refers to the standard deviation of the different results produced by PDCA in 50 independent runs.

It is quite natural that complex architectures will be needed for solving difficult problems and simple architectures for solving easy problems. PDCA shows this characteristic, as can be seen from Table 2. For example, ANNs produced by PDCA had on average 2.84 hidden layers for the glass problem, but only 1.06 hidden layers for the cancer problem. It is well known that glass and cancer are among the most difficult and easiest problems, respectively, in the realm of machine learning [13]. This is also apparent from the error rates achieved by PDCA: for the glass problem it is 0.323, while for the cancer problem it is 0.009.

324 M.A. Sattar, M.M. Islam, and K. Murase

Table 2. Performance of PDCA based on number of hidden layers (HL), number of hidden neurons (HN), number of epochs and testing error rate (TER) for eight classification problems. The results were averaged over 50 independent runs.

Problem    HL (Mean, SD)   HN (Mean, SD)   Epochs (Mean, SD)   TER (Mean, SD)
Cancer     1.06  0.7       3.36   1.2      289.34   20         0.009  0.003
Card       1.54  1.1       2.14   1.3      195.82   18         0.116  0.021
Diabetes   2.04  1.2       4.20   1.5      220.76   27         0.183  0.032
Glass      2.84  1.5       4.54   1.6      311.50   30         0.323  0.035
Gene       1.00  1.1       1.28   0.9      135.90   17         0.092  0.021
Iris       1.46  0.8       2.46   1.0      125.30   11         0.035  0.007
Letter     2.76  1.3       24.32  3.8      8037.84  280        0.128  0.010
Thyroid    2.04  0.9       4.28   1.7      260.34   16         0.011  0.001

Table 3. Performance of PDCAWL based on number of hidden neurons (HN), number of epochs and testing error rate (TER) for four classification problems. The results were averaged over 50 independent runs.

Problem    HN (Mean, SD)   Epochs (Mean, SD)   TER (Mean, SD)
Diabetes   5.64   1.3      261.32   25         0.214  0.035
Glass      6.58   1.6      370.90   30         0.364  0.029
Letter     30.64  4.2      8590.18  291        0.167  0.021
Thyroid    6.86   1.4      301.42   16         0.015  0.001

Table 4. Performance of PDCAST based on number of hidden layers (HL), number of hidden neurons (HN), number of epochs and testing error rate (TER) for five classification problems. The results were averaged over 50 independent runs.

Problem    HL (Mean, SD)   HN (Mean, SD)   Epochs (Mean, SD)   TER (Mean, SD)
Cancer     1.20  0.9       4.13   1.2      315.77   23         0.013  0.004
Glass      2.78  1.5       4.17   1.4      301.33   23         0.298  0.030
Gene       1.61  1.3       3.13   1.6      290.17   19         0.128  0.023
Iris       1.45  0.7       2.53   1.0      130.84   9          0.038  0.007
Letter     2.82  1.8       36.93  4.1      9220.02  311        0.176  0.035

In order to observe the effect of the layer stopping criterion, a new set of experiments was carried out in which PDCA without the layer stopping criterion (PDCAWL) was used; the results are shown in Table 3. Since PDCAWL does not use the layer stopping criterion, it can only design single-hidden-layer ANNs. A comparison of Table 3 with Table 2 indicates the positive effect of using the layer stopping criterion in PDCA.

To observe the effect of using different training sets in PDCA, another new set of experiments was carried out in which all the newly added neurons were trained
A New Constructive Algorithm for Designing and Training ANNs 325
with the same training set. We call this version of PDCA PDCAST. The results of this experiment are shown in Table 4. The effect of using different training sets can easily be understood when the results of PDCAST (Table 4) are compared with those of PDCA (Table 2). It is observed that PDCA produces more compact ANN architectures and takes fewer epochs than its counterpart PDCAST. t-tests based on the number of hidden neurons, the number of epochs and the testing error rate indicated that PDCA was significantly better than its counterparts PDCAWL and PDCAST at the 95% confidence level.

4 Comparison

This section compares the performance of PDCA with CCA [1] and aCasper [19]. The reason for selecting these algorithms is that they are able to design multiple-hidden-layer ANNs like PDCA. CCA and aCasper use the quickprop [2] and RPROP [14] algorithms, respectively, for training ANNs. It is known that the quickprop and RPROP algorithms are faster than the backpropagation algorithm used in PDCA. In this section, we therefore compare PDCA with CCA and aCasper in terms of the average TER and HN over 50 independent runs.

Table 5 compares PDCA's results with those produced by CCA and aCasper on six classification problems. It can be seen from the table that PDCA achieved the smallest TER for five out of six problems, and the second smallest, next to aCasper [19], for one problem. The improved performance of PDCA can be attributed to a couple of factors, such as determining the number of hidden layers in ANNs and of neurons in hidden layers automatically, and using different data for training different hidden neurons.

Table 5. Comparison among PDCA, CCA [1], and aCasper [19], in terms of average testing error rate (TER) and number of hidden neurons (HN).

Algorithm        Cancer  Card   Diabetes  Glass  Gene   Thyroid
PDCA     HN      3.36    2.14   4.20      4.54   1.28   4.28
         TER     0.009   0.116  0.183     0.323  0.092  0.011
CCA      HN      5.18    1.07   9.78      8.07   2.73   25.04
         TER     0.019   0.137  0.245     0.347  0.133  0.030
aCasper  HN      4.86    0.12   3.02      4.18   0.00   4.64
         TER     0.018   0.135  0.231     0.306  0.117  0.016
5 Conclusions

This paper proposes a new constructive algorithm to design as well as train ANNs. Neither the number of hidden layers in an ANN nor the number of neurons in hidden layers needs to be predefined and fixed; they are determined automatically during the training of an ANN. Experiments have been carried out on several benchmark classification problems to evaluate how well PDCA performed on different problems in comparison with other ANN design algorithms. In almost all cases, PDCA outperformed the others. However, our experimental study appears to have revealed one weakness of PDCA in dealing with the glass problem, which has a small number of training examples. It would be interesting in the future to analyze PDCA more rigorously in order to gain more insight into when PDCA is most likely to perform well and for what kind of problems. The use of different activation functions, like the one used in [8], in association with different training sets would also be an interesting future research topic.

Acknowledgement. MMI is currently a Visiting Associate Professor at the University of Fukui, supported by a Fellowship from the Japan Society for the Promotion of Science (JSPS). This work was in part supported by grants to KM from JSPS, the Yazaki Memorial Foundation for Science and Technology, and the University of Fukui.

References

1. Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Systems, vol. 2, pp. 524-532. Morgan Kaufmann, San Mateo, CA (1990)
2. Fahlman, S.E.: An empirical study of learning speed in back-propagation networks. Tech. Report CMU-CS-88-162, Carnegie Mellon University (1988)
3. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366 (1989)
4. Kwok, T.Y., Yeung, D.Y.: Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks 8, 630-645 (1997)
5. Lehtokangas, M.: Modified cascade-correlation learning for classification. IEEE Transactions on Neural Networks 11, 795-798 (2000)
6. Lehtokangas, M.: Fast initialization for cascade-correlation learning. IEEE Transactions on Neural Networks 10, 410-413 (1999)
7. Lehtokangas, M.: Modeling with constructive backpropagation. Neural Networks 12, 707-716 (1999)
8. Ma, L., Khorasani, K.: Constructive feedforward neural networks using Hermite polynomial activation functions. IEEE Transactions on Neural Networks 16, 821-833 (2005)
9. Monirul Islam, Md., Murase, K.: A new algorithm to design compact two-hidden-layer artificial neural networks. Neural Networks 14, 1265-1278 (2001)
10. Phatak, D.S., Koren, I.: Connectivity and performance tradeoffs in the cascade correlation learning architecture. IEEE Transactions on Neural Networks 5, 930-935 (1994)
11. Prechelt, L.: Automatic early stopping using cross validation: quantifying the criteria. Neural Networks 11, 761-767 (1998)
12. Reed, R.: Pruning algorithms - a survey. IEEE Transactions on Neural Networks 4, 740-747 (1993)
13. Prechelt, L.: PROBEN1 - A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Faculty of Informatics, University of Karlsruhe, Germany (1994)
14. Riedmiller, M., Braun, H.: RPROP - A fast adaptive learning algorithm. In: Proc. of the 1992 International Symposium on Computer and Information Sciences, Turkey, pp. 279-285 (November 1992)
15. Schapire, R.E.: The strength of weak learnability. Machine Learning 5, 197-227 (1990)
16. Schwenk, H., Bengio, Y.: Boosting neural networks. Neural Computation 12, 1869-1887 (2000)
17. Setiono, R., Hui, L.C.K.: Use of a quasi-Newton method in a feedforward neural network construction algorithm. IEEE Transactions on Neural Networks 6, 273-277 (1995)
18. Tamura, S., Tateishi, M.: Capabilities of a four-layered feedforward neural network: four layers versus three. IEEE Transactions on Neural Networks 8, 251-255 (1997)
19. Treadgold, N.K., Gedeon, T.D.: Exploring constructive cascade networks. IEEE Transactions on Neural Networks 10, 1335-1350 (1999)
20. Yao, X.: A review of evolutionary artificial neural networks. International Journal of Intelligent Systems 8, 529-567 (1993)
Effective Learning with Heterogeneous Neural Networks
Lluís A. Belanche-Muñoz

Dept. de Llenguatges i Sistemes Informàtics,
Universitat Politècnica de Catalunya, Barcelona, Spain
belanche@lsi.upc.edu

Abstract. This paper introduces a class of neuron models accepting heterogeneous inputs and weights. The neuron model computes a user-defined similarity function between inputs and weights. The neuron transfer function is formed by composition of an adapted logistic function with the power mean of the partial input-weight similarities. The resulting neuron model is capable of dealing directly with mixtures of continuous quantities (crisp or fuzzy) and discrete quantities (ordinal, integer, binary or nominal). There is also provision for missing values. An artificial neural network using these neuron models is trained using a breeder genetic algorithm until convergence. A number of experiments are carried out using several real-world benchmarking problems. The network is compared to a standard radial basis function network and to a multi-layer perceptron, and shown to learn from non-trivial data sets with superior generalization ability in most cases, at a comparable computational cost. A further advantage is the interpretability of the learned weights.

1 Introduction

In many important domains from the real world, objects are described by a mixture of continuous and discrete variables, usually containing missing information and characterized by an underlying uncertainty or imprecision. For example, in the well-known UCI repository [1] over half of the problems contain explicitly declared nominal attributes, let alone other data types, usually unreported. In the case of artificial neural networks (ANNs), this heterogeneous information has to be encoded in the form of real-valued quantities, although in most cases there is enough domain knowledge to characterize the nature of the variables.
We present a general framework for dealing with data heterogeneity in ANNs under the conceptual cover of similarity, in which a class of neuron models accepting heterogeneous inputs and weights computes a user-defined similarity function between these inputs and weights. The similarity function is defined by composition of a specific power mean of the partial input-weight similarities plus a logistic function. The resulting neuron model then accepts mixtures of continuous and discrete quantities, with explicit provision for missing information. Other data types are possible by extension of the model.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 328-337, 2008.
© Springer-Verlag Berlin Heidelberg 2008
An artificial neural network using these neuron models is trained using a breeder genetic algorithm until convergence. A number of experiments are carried out that illustrate the validity of the approach, using several benchmarking problems (classification and regression), selected as representatives because of the diversity in the richness and kind of data heterogeneity. The network is compared to a standard radial basis function neural network and to a multi-layer perceptron, and shown to learn from non-trivial data sets with a generalization ability that is superior in most cases to both networks. A further advantage of the approach is the interpretability of the learned weights.

The paper is organized as follows. Section 2 further motivates the basis of the approach and reviews some widespread ways of coping with data heterogeneity in ANNs; Section 3 introduces material on similarity measures; Section 4 builds the heterogeneous neural network and proposes specific similarity measures. Finally, Section 5 presents experimental work.

2 Motivation

The action of a feed-forward ANN may be viewed as a mapping through which points in the input space are transformed into corresponding points in an output space. The task is to find structure in the data, transforming it from the input space to a new hidden space (the space spanned by the hidden units) in such a way that the problem is simpler when projected to this space. This processing is repeated for all subsequent hidden layers. It seems clear that adequate transformations will require fewer units and therefore a hidden space of fewer dimensions. This in turn leads to simpler mappings, which are less likely to overfit.

The integration of heterogeneous data sources in information processing systems has been advocated elsewhere [2]. In this sense, a shortcoming of existing neuron models is the difficulty of adding prior knowledge to the model.
Current ANN practice assumes that an input may be faithfully represented as a point in R^n, and that the geometry of this space captures the meaningful relations in input space. There is no particular reason why this should be the case. Knowledge representation influences the practical success of a learning task [3], but has received little attention in the ANN community. The form in which data are presented to the network is of primary relevance. In essence, the task of the hidden layer(s) is to find a new, more convenient representation of the problem given the data representation chosen, a crucial factor for a successful learning process that can have a great impact on generalization ability [4]. As Anderson and Rosenfeld put it, "prior information and invariances should be built into the design of an ANN, thereby simplifying the network design by not having to learn them" [5]. Further, the activity of the units should have a well-defined meaning in terms of the input patterns [6].

Without the aim of being exhaustive, commonly used methods (see, e.g., [7,4]) are the following.

Ordinal variables are coded as real-valued or using a thermometer representation. Nominal variables are coded using a 1-out-of-c or c - 1 representation.

Missing information is difficult to handle, especially when the lost parts are of significant size. Entire examples can be removed, or the values "filled in" with the mean, median, nearest neighbour, or by adding another input equal to one only if the value is absent. Statistical approaches need to model the input distribution itself [4], or are computationally intensive [8]. Vagueness and uncertainty are considerations usually put aside.

These encodings being intuitive, their precise effect on network performance is not clear, because of the change in input distribution, the increase (sometimes acute) in input dimension and other subtler mathematical effects derived from imposing an order or a continuum where there was none.
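The two standard codings just mentioned can be sketched as follows (a minimal illustration; the function names are ours):

```python
def one_out_of_c(value, categories):
    """1-out-of-c coding of a nominal value: one unit per category."""
    return [1.0 if value == c else 0.0 for c in categories]

def thermometer(value, levels):
    """Thermometer coding of an ordinal value: all units up to its rank are on."""
    rank = levels.index(value)
    return [1.0 if i <= rank else 0.0 for i in range(len(levels))]

print(one_out_of_c("green", ["red", "green", "blue"]))   # [0.0, 1.0, 0.0]
print(thermometer("medium", ["low", "medium", "high"]))  # [1.0, 1.0, 0.0]
```

Note how both codings multiply one input variable into as many units as there are categories or levels, which is precisely the growth in input dimension discussed in the text.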
This issue can dramatically affect generalization for the worse, due to the curse of dimensionality. In this scenario, the use of hidden units which can adequately capture the desired mapping and do not add input dimension is thus of great practical importance.

3 Similarity Measures

Let us represent patterns belonging to a space X ≠ ∅ as a vector x_i of n components, where each component x_ik represents the value of a particular feature k. A similarity measure is a unique number expressing how "like" two patterns are, given these features. A similarity measure may be defined as a function s : X × X → I_s ⊂ R. Assume that s is upper bounded, exhaustive and total. This implies that I_s is upper bounded and also that sup I_s exists. Define s_max ≡ sup I_s, and define s_min ≡ inf I_s if it exists. Let us denote by s_ij the similarity between x_i and x_j, that is, s_ij = s(x_i, x_j). A similarity measure may be required to satisfy the following axioms, for any x_i, x_j ∈ X:

Axiom S1. (Reflexivity) s(x_i, x_i) = s_max. This implies sup I_s ∈ I_s.
Axiom S2. (Consistency) s(x_i, x_j) = s_max implies x_i = x_j.
Axiom S3. (Symmetry) s(x_j, x_i) = s(x_i, x_j).
Axiom S4. (Boundedness) s is lower bounded when there exists a ∈ R such that s(x_i, x_j) ≥ a for all x_i, x_j ∈ X. This is equivalent to asking that inf I_s exists.
Axiom S5. (Closedness) A lower bounded function s is closed if there exist x_i, x_j ∈ X such that s(x_i, x_j) = s_min. This is equivalent to asking that inf I_s ∈ I_s.

The similarity axioms enumerated above ensure a consistent definition of such functions, but should be taken as desiderata: some similarity relations may fulfill part or all of them [9]. Other properties (like transitivity) may be of great interest in some contexts, but are not relevant for this work. The only requirement expressed so far about the set X is the existence of an equivalence relation. However, elements in this set can be simple or complex (i.e. composed of one or more variables). For atomic elements the similarity can be trivially computed, but for complex elements we need to define a way to combine the partial similarities s_ijk = s_k(x_ik, x_jk) for each variable k to get a meaningful value s_ij. This combination has an important semantic role and it is not a trivial choice. We use in this work the concept of an A-average [10], as
follows. Let [a, b] be a non-empty real interval, and call A(x_1, ..., x_n) the A-average of x_1, ..., x_n ∈ [a, b] for every n-place real function A fulfilling:

Axiom A1. A is continuous, symmetric and strictly increasing in each x_i.
Axiom A2. A(x, ..., x) = x.
Axiom A3. For any positive integer k ≤ n: A(x_1, ..., x_n) = A(y_k, ..., y_k [k times], x_{i_{k+1}}, ..., x_{i_n}), where y_k = A(x_{i_1}, ..., x_{i_k}) and (i_1, ..., i_n) is a permutation of (1, ..., n).

The means defined by these axioms fulfill Cauchy's property of means, namely that min_i x_i ≤ A(x_1, ..., x_n) ≤ max_i x_i, with equality only if x_1 = x_2 = ... = x_n. This property is stronger than idempotency; thus, the previous bounds actually impose idempotency (Axiom A2). A further interesting property of these means is that we can add averaged elements to an A-average without changing the overall result. Formally, A(x_1, ..., x_n) = A(z_1, ..., z_m, x_1, ..., x_n) if and only if A(z_1, ..., z_m) = A(x_1, ..., x_n). As a consequence, if y = A(x_1, ..., x_n), then A(x_1, ..., x_n, A(x_1, ..., x_n)) = y.
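These properties can be checked numerically. The sketch below uses the power means M_q introduced next in the text (the function name and the sample values are ours):

```python
def power_mean(xs, q):
    """Power mean M_q of positive numbers; q = 0 is treated as the geometric limit."""
    n = len(xs)
    if q == 0:
        prod = 1.0
        for x in xs:
            prod *= x
        return prod ** (1.0 / n)
    return (sum(x ** q for x in xs) / n) ** (1.0 / q)

xs = [0.2, 0.5, 0.9]
y = power_mean(xs, 1)  # arithmetic mean
# Adding the average itself leaves the A-average unchanged:
assert abs(power_mean(xs + [y], 1) - y) < 1e-12
# M_q is increasing in q for non-equal inputs (harmonic < geometric < arithmetic < quadratic):
assert power_mean(xs, -1) < power_mean(xs, 0) < power_mean(xs, 1) < power_mean(xs, 2)
```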
Theorem. Let f : [a, b] → R be a continuous, strictly monotone mapping, and let g be the inverse function of f. Then

A(x_1, ..., x_n) ≡ g( (1/n) Σ_{i=1}^{n} f(x_i) )

is a well-defined A-average for all n ∈ N and x_i ∈ [a, b] [10].

An important class of A-averages is formed by choosing f(z) = z^q, q ∈ R, to obtain:

M_q(x_1, ..., x_n) = ( (1/n) Σ_{i=1}^{n} (x_i)^q )^{1/q}

Several well-known means are found by choosing particular values of q: specifically, the arithmetic mean for q = 1, the geometric mean for q = 0, the quadratic mean for q = 2 and the harmonic mean for q = -1. A property of the means M_q is that M_q(x_1, ..., x_n) ≥ M_{q'}(x_1, ..., x_n) if and only if q > q', with equality only if x_1 = x_2 = ... = x_n. We shall refer to these as the power means.

4 Heterogeneous Neural Networks

4.1 The Heterogeneous Neuron Model

We develop in this section neuron models allowing for heterogeneous and imprecise inputs, defined as a mapping h : Ĥ^n → I_out ⊆ R. Here R denotes the reals and Ĥ^n is a cartesian product of an arbitrary number n of source sets Ĥ^(k), k = 1, ..., n. These source sets may include extended reals R̂_k = R_k ∪ {X} (where each R_k is a real interval), extended families of fuzzy sets F̂_k = F_k ∪ {X}, and extended finite sets of the form Ô_k = O_k ∪ {X} and M̂_k = M_k ∪ {X}, where the O_k have a full order relation while the M_k have not. The special symbol X extends the source sets and denotes the unknown element (missing information), behaving as an incomparable element w.r.t. any ordering relation. According to this definition, neuron inputs are vectors composed of n elements among which there might be reals, fuzzy sets, ordinals, categorical values and missing data.

Consider now a function s : Ĥ^n × Ĥ^n → I_s, a similarity function in Ĥ^n; we take I_s ⊆ [0, 1] for simplicity. This function is formed by combination of n partial similarities s_k : Ĥ^(k) × Ĥ^(k) → I_s. The s_k measures are normalized to a common real interval ([0, 1] in this case) and computed according to different formulas for different variables (possibly, but not necessarily, determined by variable type alone). The desired neuron model, which is both similarity-based and handles data heterogeneity and missing values, can be devised as follows. Let Γ_i be a neuron with input x ∈ Ĥ^n, weight vector μ_i ∈ Ĥ^n and smoothing parameter γ_i. Then

Γ_i(x) = š( γ_i M_q( [s_k(x_k, μ_{i,k})]_{k=1}^{n} ) ),   q ∈ R,  γ_i ∈ (0, 1]
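The computation above can be sketched as follows. This is an illustrative sketch: the particular logistic-style activation and the partial similarity functions are our own choices, and missing values are marked with None.

```python
import math

def logistic01(z, k=8.0):
    """A sigmoid-like monotonic bijection of [0, 1] onto itself (illustrative)."""
    lo = 1.0 / (1.0 + math.exp(k / 2.0))
    hi = 1.0 / (1.0 + math.exp(-k / 2.0))
    return (1.0 / (1.0 + math.exp(-k * (z - 0.5))) - lo) / (hi - lo)

def power_mean(xs, q=1.0):
    return (sum(x ** q for x in xs) / len(xs)) ** (1.0 / q)

def neuron(x, mu, sims, gamma=1.0, q=1.0):
    """Heterogeneous neuron: activation of gamma times the power mean M_q of the
    partial similarities. Pairs with a missing element (None) are skipped, which
    by the A-average property equals filling them with the mean s_X (see text)."""
    s = [sk(xk, mk) for xk, mk, sk in zip(x, mu, sims)
         if xk is not None and mk is not None]
    return logistic01(gamma * power_mean(s, q))

# Hypothetical 3-variable pattern: continuous in [0, 10], nominal, and a missing value.
sims = [lambda a, b: 1.0 - abs(a - b) / 10.0,   # normalized continuous similarity
        lambda a, b: 1.0 if a == b else 0.0,    # nominal overlap
        lambda a, b: 1.0 if a == b else 0.0]
out = neuron([3.0, "red", None], [4.0, "red", "blue"], sims)
print(round(out, 3))
```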
The activation function š is any sigmoid-like automorphism (a monotonic bijection) of [0, 1]. Any missing partial similarities (because one or both of their arguments is missing) are regarded as ignorance and must contribute neither in favour of nor against the overall measure. To this end, let C = {k_1, ..., k_m} be the set of indexes of the similarities that could not be computed, and define s_X(x, μ_i) = M_q( [s_k(x_k, μ_{i,k})]_{k=1, k∉C}^{n} ), that is, the power mean of the successfully computed similarities. Thanks to the previous properties of A-averages,

M_q( [s_k(x_k, μ_{i,k})]_{k=1}^{n} ) = M_q( [s_k(x_k, μ_{i,k})]_{k=1, k∉C}^{n}, s_X(x, μ_i), ..., s_X(x, μ_i) )   (s_X repeated m times)

4.2 Heterogeneous Neural Networks

Heterogeneous neural networks are neural architectures built out of the previously defined neuron models, thus allowing for heterogeneous or missing inputs. A feed-forward architecture, with a hidden layer composed of heterogeneous neurons and a linear output layer, is a straightforward choice, thus conforming a hybrid structure. The heterogeneous neural network computes the function:

f_hnn(x) = Σ_{i=1}^{h} α_i Γ_i(x)    (1)

where h > 0 is the number of (heterogeneous) hidden neurons and {α_i} the set of mixing coefficients. The HNN thus keeps linearity in the output layer and interpretability in the hidden layer. It can be naturally seen as a generalization of the RBF network. This is so because the response of the hidden neurons is localized: centered at a given object (the neuron weight, where the response is maximum), falling off
as the input is less and less similar to that center.

The general training procedure for the heterogeneous neural network (HNN for short) is based on evolutionary algorithms, due to the missing information and the likely non-differentiability of the similarity measure. Specifically, the training procedure for the HNN used in this work is based on a Breeder Genetic Algorithm (BGA) [11], a method midway between Genetic Algorithms (GAs) and Evolution Strategies. While in GAs selection is stochastic and meant to mimic Darwinian evolution, BGA selection is driven by a deterministic breeding mechanism, an artificial method in which only the best individuals (usually a fixed percentage τ of the total population size μ) are selected to be recombined and mutated, as the basis to form a new generation. The BGA used here does not need any coding scheme, being able to cope with heterogeneous or missing genes [12].

4.3 Heterogeneous Similarity Measures

We present in this section specific measures defined to belong to the interval [0, 1] for the sake of illustration, but there are many variations that may be superior by making better use of available expert knowledge.

Ordinal variables. It is assumed that the values of the variable form a linearly ordered space (O, ≼). Let x_ik, x_jk ∈ O with x_ik ≼ x_jk, let P_lk be the fraction of values of variable k that take on the value x_lk, and let the sums below run through all the ordinal values x_lk such that x_ik ≼ x_lk and x_lk ≼ x_jk [13]:

s_ijk = 2 log(P_ik + ... + P_jk) / (log P_ik + ... + log P_jk)    (2)

Nominal variables. It is assumed that no partial order exists among these values and the only possible comparison is equality. The basic similarity measure for these variables is the overlap. Let N be a nominal space and x_ik, x_jk ∈ N. Then s_ijk = 1 if x_ik = x_jk, and 0 otherwise.

Continuous variables. Let x, y ∈ Γ = [r⁻, r⁺] ⊂ R, with r⁺ > r⁻. The standard metric in R is a metric in Γ. Therefore, for any two values x_ik, x_jk ∈ R:

s_ijk = ŝ_R( |x_ik − x_jk| / sup_{x,y ∈ Γ} |x − y| )    (3)

where ŝ_R : [0, 1] → [0, 1] is a decreasing continuous function. The family ŝ_R(z) = (1 − z^β)^α, 0 < β ≤ 1, α ≥ 1, is used here, with α = 2, β = 1.

Integer variables. Given that N ⊂ R, any distance-based similarity in R is also valid in N. A flexible choice does not limit the range of integer values (assumed positive for convenience). In this case, a self-normalizing distance-based similarity s_ijk = ŝ_N(|x_ik − x_jk|) is indicated, where ŝ_N : [0, ∞) → (0, 1] is a decreasing continuous function. In particular, the function ŝ_N(z) = 1/(1 + z) can be used.
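The nominal, continuous and integer measures above can be sketched directly (the parameter defaults follow the choices in the text; the function names are ours):

```python
def s_nominal(a, b):
    """Overlap similarity for nominal values."""
    return 1.0 if a == b else 0.0

def s_continuous(a, b, lo, hi, alpha=2.0, beta=1.0):
    """Eq. (3): normalized distance through s_R(z) = (1 - z**beta)**alpha."""
    z = abs(a - b) / (hi - lo)  # hi - lo plays the role of the supremum distance
    return (1.0 - z ** beta) ** alpha

def s_integer(a, b):
    """Self-normalizing similarity s_N(z) = 1 / (1 + z) for non-negative integers."""
    return 1.0 / (1.0 + abs(a - b))

print(s_continuous(2.0, 7.0, 0.0, 10.0))  # (1 - 0.5)**2 = 0.25
print(s_integer(3, 5))                    # 1/3
print(s_nominal("x", "y"))                # 0.0
```

All three return values in [0, 1] and reach 1 only for identical arguments, matching axioms S1-S3.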
Binary variables. In the data analysis literature there are many similarity measures defined on collections of binary variables (see, e.g., [14]). This is mostly due to the uncertainty over how to accommodate negative (i.e. false-false) matches. The present situation is that of comparison of a single binary variable rather than a whole vector. In general, one should know which of the two matches is the really relevant one (true-true or false-false); for this reason, treating the variable as purely nominal can result in a loss of relevant information. Since this meta-knowledge is usually not at hand, we use in this work a frequency-based approach, as follows. Let x_ik, x_jk be two binary values and let P_lk be the fraction of values of variable k that take on the value x_lk. We define s_ijk = h(1 − P_ik, 1 − P_jk), where h(x, y) = 2xy/(x + y) is the harmonic mean of x and y. This measure compares the agreement on the rarity of each value: the similarity is higher the less frequent the values are.

Fuzzy variables. For variables representing fuzzy sets, similarity relations from the point of view of fuzzy theory have been defined elsewhere [15], and different choices are possible. In possibility theory, the possibility expresses the likeliness of co-occurrence of two vague propositions, with a value of one standing for absolute certainty. For two fuzzy sets Ã, B̃ of a reference set X, it is defined as:

Π_Ã(B̃) = sup_{u ∈ X} μ_{Ã∩B̃}(u) = sup_{u ∈ X} min( μ_Ã(u), μ_B̃(u) )

In our case, if F_k is an arbitrary family of fuzzy sets and x_ik, x_jk ∈ F_k, the similarity relation s_ijk = Π_{x_ik}(x_jk) can be used. Notice that this measure indeed fulfills axioms S1, S2 and S3.

5 An Experimental Comparison

Table 1. Basic features of the data sets. 'Missing' refers to the percentage of missing values, p to the number of examples and n to that of variables. Types are: C (continuous), N (nominal), I (integer), D (ordinal), B (binary) and F (fuzzy). 'Output' is R for regression or a number indicating the number of classes.

Name                 p     n   Data Types        Output  Missing
Annealing            898   28  6C,19N,3I         6       none
Audiology            226   69  8D,61N            4       2.3%
Boston Housing       506   13  11C,1I,1B         R       none
Credit Screening     690   15  6C,6N,3B          2       0.6%
Heart Disease        920   13  3C,5N,3I,2B       2       16.2%
Horse Colic          364   20  6D,5N,2I,5C,2B    3       26.1%
South African Heart  462   9   7C,1I,1N          2       none
Servo Data           167   4   2I,2N             R       none
Solar Flares         1066  9   4N,3D,2B          2       none

A number of experiments are carried out to illustrate the validity of the approach, using nine benchmarking problems, selected as representatives because
of the richness in data heterogeneity, taken from [7] and [1]. Their main characteristics are displayed in Table 1. For every data set, the available documentation is analysed for an assessment of the most appropriate treatment. Missing information is also properly identified. The HNN is compared to a standard radial basis function network (RBF) and to a multi-layer perceptron (MLP).

[Fig. 1. Generalization results for the networks: (a) Audiology, (b) Boston Housing, (c) Credit Screening, (d) Heart Disease, (e) Horse Colic, (f) Annealing, (g) South African Heart, (h) Solar Flares. Curves: RBF, MLP, HNN. See text for an explanation of the axes.]

The three networks are trained using
336 L.A. Belanche-Mu˜ noz exactly the same procedure to exclude this source of variation from the analysis. The RBF and MLP networks need a pre-processing of the data, following the recommendations in [7]. The RBF neuron model has provision for a smoothing parameter, different for every hidden neuron. The network architecture is fixed to one single layer of hidden neurons ranging from 1 to 30 plus as many output neurons as required by the task, logistic for classification problems and linear otherwise. The input variables for the RBF and MLP network are standardized to zero mean, unit standard deviation. This is not needed by the heterogeneous neurons because they compute a normalized measure, but is beneficial for the other networks. The output is presented for all networks in 1-out-of-c mode (that is, c outputs) for classification problems with c classes. The error measure in this case is generalized cross-entropy (GCE) [4]. For regression problems, the mean is subtracted to the continuous output and the normalized root mean square (NRMS) error is reported. Each data set is split in three parts, for training (TR), validation (VA) and test (TE), as 50%-25%-25%. The BGA task is to minimize error (either NRMS or GCE) on the TR part, until 500 generations or convergence. Then the network having the lowest error on the VA part is returned. The reported performance is the NRMS or GCE of that network on the TE part. This process is repeated ten times and the average is taken. We present performance results in Fig. 1(a) to (h). In each figure, the abs- cissa represents the number of neurons in the hidden layer, and the ordinate the average NRMS or GCE of that network on the TE part, as explained above. For classification problems, the abscissa value 0 is used to indicate the percentage of examples in the majority class (that is, the minimum acceptable performance, as a reference). For regression problems, this value is set to 1.0. 
Note that for classification problems, higher values are better, whereas for regression problems, lower values are better. The HNN shows enhanced performance for most of the problems, especially for those displaying higher heterogeneity or missingness, and less so when this is not the case. This is reasonable, since the HNN behaves as an RBF network when there is no heterogeneity or missingness. The interpretability of the learned weights is also enhanced, since the weights are of the same type as their matching input. We additionally present performance results for the Servo Data (Fig. 2), which are especially interesting. This is a small data set with only 4 variables (none of them continuous) and fewer than 200 examples.

Fig. 2. Generalization performance results for the Servo Data. Left: heavy overfit by the MLP network. Right: only RBF and HNN.

Effective Learning with Heterogeneous Neural Networks 337

For
this data set the MLP overfits heavily, due to the amplification in the number of variables introduced by the coding scheme of the nominal ones, which are many-valued.

6 Conclusions

A heterogeneous neuron that computes a similarity index in [0, 1], followed by a logistic function, has been proposed. Neural architectures making use of such neurons are termed heterogeneous neural networks. A body of experimental evidence has shown very promising results in several difficult learning tasks. These results also illustrate the importance of data preparation and of using a more adequate similarity measure that captures relations in the data. As further work, we are currently developing a clustering-based alternative training scheme.

References

1. Murphy, P.M., Aha, D.: UCI Repository of Machine Learning Databases. UCI Dept. of Information and Computer Science (1991)
2. Kaburlassos, V., Petridis, V.: Learning and decision-making in the framework of fuzzy lattices. In: New Learning Paradigms in Soft Computing. Physica (2002)
3. Natarajan, B.: Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Francisco (1991)
4. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
5. Anderson, J.A., Rosenfeld, E. (eds.): Neurocomputing: Foundations of Research. MIT Press, Cambridge (1988)
6. Omohundro, S.M.: Geometric Learning Algorithms. Technical Report 89-041, Univ. of Berkeley, CA (1989)
7. Prechelt, L.: Proben1: A Set of Neural Network Benchmark Problems and Benchmarking Rules. Fakultät für Informatik, Univ. Karlsruhe, T.R. 21/94
8. Tresp, V., Ahmad, S., Neuneier, R.: Training Neural Networks with Deficient Data. In: NIPS, vol. 6. Morgan Kaufmann, San Francisco (1994)
9. Chandon, J.L., Pinson, S.: Analyse Typologique. Masson (1981)
10. Hille, E.: Methods in Classical and Functional Analysis. Addison-Wesley, Reading (1971)
11. Mühlenbein, H., Schlierkamp-Voosen, D.: Predictive Models for the Breeder Genetic Algorithm. Evolutionary Computation 1(1) (1993)
12. Belanche, L.: Evolutionary Optimization of Heterogeneous Problems. In: Guervós, J.J.M., Adamidis, P.A., Beyer, H.-G., Fernández-Villacañas, J.-L., Schwefel, H.-P. (eds.) PPSN 2002. LNCS, vol. 2439, pp. 475–484. Springer, Heidelberg (2002)
13. Lin, D.: An Information-Theoretic Definition of Similarity. In: Proceedings of the International Conference on Machine Learning (1998)
14. Everitt, B.: Cluster Analysis. Heinemann Books (1977)
15. Zadeh, L.: Fuzzy Sets as a Basis for a Theory of Possibility. Fuzzy Sets and Systems 1, 3–28 (1978)
Pattern-Based Reasoning System Using Self-incremental Neural Network for Propositional Logic

Akihito Sudo, Manabu Tsuboyama, Chenli Zhang, Akihiro Sato, and Osamu Hasegawa

Dept. of Computer Intelligence and Systems Science, Tokyo Institute of Technology
Imaging Science and Engineering Lab., Tokyo Institute of Technology
4259 Nagatsuta-cho, Midori-ku, Yokohama, Japan
{sudo,hasegawa}@isl.titech.ac.jp
http://www.isl.titech.ac.jp/~hasegawalab/

Abstract. We propose an architecture for reasoning with pattern-based if-then rules that is effective for intelligent systems, such as robots, solving varying tasks autonomously in a real environment. The proposed system can store pattern-based if-then rules of propositional logic, including conjunctions, disjunctions, negations, and implications. Naive pattern-based reasoning can store pattern-based if-then rules and make inferences using them, but it remains insufficient for intelligent systems operating in a real environment. The proposed system uses an algorithm inspired by self-incremental neural networks such as SOINN and SOINN-AM in order to achieve incremental learning, generalization, avoidance of duplicate results, and robustness to noise, which are important properties for intelligent systems.

Keywords: Reasoning, neural network, intelligent agent, intelligent robot.

1 Introduction
Reasoning is an essential process of human intelligence. Although many reasoning systems have been proposed, they remain insufficient for intelligent systems that must act in the real world, such as robots. We consider that one reason is that most of them address only symbol-based if-then rules. Several reasoning systems have been recognized for their effectiveness and have been commercialized, but they perform well only in a specific domain. Although production systems are a widely used conventional reasoning approach, they require human experts to input their knowledge in advance. Consequently, such systems perform well only in the specific domain for which an expert has provided knowledge. In contrast, nobody can correctly predict the complete knowledge list for an intelligent system operating in a varying environment; it must therefore learn knowledge that is provided sequentially from its environment in an incremental manner. If the intelligent system uses a conventional symbol-based reasoning

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 338–347, 2008.
© Springer-Verlag Berlin Heidelberg 2008
Pattern-Based Reasoning System Using Self-incremental Neural Network 339
system to make inferences using knowledge learned from its environment, it must convert learned patterns to symbols before reasoning, because what it obtains from the environment through its sensors are patterns, not symbols. This strategy seems impractical at present and might remain impossible even in the future. Converting patterns to symbols remains a difficult task, in spite of the many studies that have addressed it as a classification problem; discriminating cats from dogs is not an easy problem for current classifiers. Problems would remain even after one invents a classifier as powerful as a human. Intelligent systems must generate new symbols when they encounter something that should be addressed by a symbol other than those they already know. Humans must think or act with or about an object before assigning a symbol to it; at that stage they must form inferences using patterns obtained from the object, because it has not yet been symbolized. Intelligent systems likewise must make inferences using patterns in order to yield new symbols.
Several approaches have been proposed to realize pattern-based reasoning systems. Tsukimoto proposed pattern-based reasoning in which perceptrons that have learned patterns are employed as the atomic propositions of if-then rules [4]. In [3], a pattern-based reasoning system was proposed in which binary vectors are employed as atomic propositions using a non-monotone neural network. The method proposed in [4] can deal with conjunction, disjunction, and negation, although the method proposed by Yamane et al. [3] cannot. On the other hand, patterns should be represented with vectors rather than perceptrons for robots, because it is difficult to determine when a new perceptron should be generated as a new proposition. Moreover, real-valued data should be handled, because sensor readings take real values, whereas [3] deals only with binary vectors. No previous study has realized a pattern-based reasoning system that can deal with conjunction, disjunction, and negation and that employs real-valued vectors as the representation of a pattern. In this paper, we propose a novel reasoning system with exactly these properties. The proposed method can learn if-then rules whose atomic propositions are patterns represented as real-valued vectors. For instance, the proposed method can learn "((A∧B)∨C)→(D∧E∧F)", where A–F are patterns represented as real-valued vectors.¹ When the proposed system, having learned if-then rules, obtains a fact, it produces an inference from the fact by picking up and connecting learned if-then rules. For example, if the system has learned (A∧B)→(C∨D) and C→(E∧¬F) and obtains "A∧B" as a fact, it outputs C∨D and (E∧¬F)∨D.
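A compound rule like "((A∧B)∨C)→(D∧E∧F)" can be decomposed into rules whose parts are single conjunctions by pairing every clause of the conditional part with every clause of the sequential part (this is the resolution step described in Sect. 2.1). A minimal sketch of that decomposition, assuming both parts arrive already in disjunctive normal form with a clause represented as a tuple of literal names (our illustration, not the authors' code):

```python
from itertools import product

def resolve_rule(cond_dnf, seq_dnf):
    """Resolve a rule whose conditional and sequential parts are in
    disjunctive normal form (lists of clauses; a clause is a tuple of
    literal names) into rules whose parts are single conjunctions."""
    return [(c, s) for c, s in product(cond_dnf, seq_dnf)]

# "((A∧B)∨C) → (D∧E∧F)" resolves into two conjunctive rules:
resolved = resolve_rule([("A", "B"), ("C",)], [("D", "E", "F")])
# [(("A", "B"), ("D", "E", "F")), (("C",), ("D", "E", "F"))]
```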
In addition to the basic function of pattern-based reasoning described above, some additional functions are realized by the proposed method for application to an intelligent system, such as a robot, operating in dynamic environments: incremental learning, generalization ability, avoidance of duplication of reasoning results, and robustness to noise. The proposed system achieves them through the use of a self-incremental neural network inspired by SOINN [1] and SOINN-AM [2].

¹ We use ∧, ∨, ¬, →, respectively, for conjunction, disjunction, negation, and implication.

340 A. Sudo et al.

Fig. 1. Two types of noisy data

Incremental learning of if-then rules is required because a user cannot provide the system with complete knowledge in advance. The proposed system can learn if-then rules that are provided incrementally without destroying previously learned knowledge. In [3], about 20% of inferences are incorrect after the system has incrementally learned as many new if-then rules as had previously been learned in a batch manner. The proposed system does not store knowledge that is the same as (or very similar to) previously stored knowledge. This enables the proposed system to avoid redundant memory consumption when it learns much knowledge incrementally. In a real environment, robots rarely obtain data identical to previously learned data; the proposed system can retrieve similar learned if-then rules when different data are given. The proposed system categorizes learned if-then rules in an online manner to avoid duplication of reasoning results. If the if-then rules were not categorized, similar if-then rules would be used as completely separate rules, which engenders duplicated results. For example, if a system that does not categorize knowledge stores "A→B" and "A'→B'", where (A, A') and (B, B') are very similar pattern pairs, both B and B' are output even when they can be interpreted as the same result.
Note that our approach categorizes if-then rules, not individual patterns. We consider that this approach enables intelligent systems to generate symbols as a result of their own reasoning. Two types of noise exist in real environments, as shown in Fig. 1. The noise shown in Fig. 1(a) taints the original data; the noise shown in Fig. 1(b) is a pattern that is independent of the original patterns. The proposed system is robust to both types of noise.

2 Proposed Method

The proposed system consists of long-term memory, short-term memory, a learning machine, and a reasoning machine, as shown in Fig. 2. The long-term memory stores if-then rules learned from environments. The short-term memory temporarily stores input data. The learning machine executes a learning algorithm when if-then rules are provided; during learning, unnecessary if-then rules are eliminated and the remaining rules are categorized into several clusters in an online manner. The reasoning machine executes a reasoning algorithm and outputs an OR tree as a reasoning result when a fact like A∧B is given. Not only this minimal functionality, but also incremental learning, generalization, avoidance of duplication of reasoning results, and robustness to noise are achieved through the use of
algorithms that are inspired by a self-incremental neural network based on SOINN [1] and SOINN-AM [2].

Fig. 2. Architecture of the proposed system

2.1 Learning Algorithm

The learning machine executes the learning algorithm when the system obtains an if-then rule. First, the input if-then rule is resolved into several if-then rules in which both the conditional part and the sequential part are conjunctions of literals. This is accomplished as follows: both the conditional part and the sequential part are transformed into disjunctive normal form; then every combination of clauses is picked up. For example, if "(A∧B)∨(C∧¬D)→(¬E∧F)" is input, it is resolved into "(A∧B)→(¬E∧F)" and "(C∧¬D)→(¬E∧F)". Note that the conditional part and the sequential part of any if-then rule can be transformed into disjunctive normal form, because any logical formula can be. The resolved if-then rules are stored temporarily in the short-term memory.

The system would suffer a high computational load and duplication of reasoning results if all the resolved if-then rules were simply stored in the long-term memory. The steps described below let the learning machine determine which if-then rules should be stored in the long-term memory and categorize the if-then rules in the long-term memory into several clusters using edges, all in an online manner. These steps are executed repeatedly for every resolved if-then rule. Note that a "learning datum" below denotes a single resolved if-then rule.

First, the most similar if-then rule (first winner) and the second-most similar if-then rule (second winner) to a learning datum are sought among the if-then rules in the long-term memory that have the same form as the learning datum.
Here, having the same form means that the numbers of positive and negative propositions in the conditional part and in the sequential part are respectively identical. For example, "A∧¬B→C" is regarded as having the same form as "D∧¬E→F", but not the same form as "¬D∧¬E→F". The similarity between same-form if-then rules is calculated using the following distance measure d:

d(k, k′) = (1/√D) [ min_{{k_1,…,k_M} ⊆ N_M} Σ_{i=1}^{M} ‖P_i − P′_{k_i}‖ + min_{{k_1,…,k_N} ⊆ N_N} Σ_{i=1}^{N} ‖Q_i − Q′_{k_i}‖ + min_{{k_1,…,k_M̃} ⊆ N_M̃} Σ_{i=1}^{M̃} ‖P̃_i − P̃′_{k_i}‖ + min_{{k_1,…,k_Ñ} ⊆ N_Ñ} Σ_{i=1}^{Ñ} ‖Q̃_i − Q̃′_{k_i}‖ ],

where D represents the dimension of the patterns, N_M = {1, 2, …, M}, N_N = {1, 2, …, N} (N_M̃ and N_Ñ are defined analogously), and the learning datum k and an if-then rule k′ stored in the long-term memory are denoted respectively as

k:  (⋀_{i=1}^{M} P_i) ∧ (⋀_{j=1}^{N} ¬Q_j) → (⋀_{i=1}^{M̃} P̃_i) ∧ (⋀_{j=1}^{Ñ} ¬Q̃_j),
k′: (⋀_{i=1}^{M} P′_i) ∧ (⋀_{j=1}^{N} ¬Q′_j) → (⋀_{i=1}^{M̃} P̃′_i) ∧ (⋀_{j=1}^{Ñ} ¬Q̃′_j).

Then, whether the learning datum is a member of the same cluster as both the first winner and the second winner is determined; the method of determination is described at the end of this subsection. If the learning datum is not judged to belong to the same cluster as both winners, it is added to the long-term memory as new knowledge and the learning process for this learning datum finishes. If the learning datum is a member of the same cluster as both winners, it is not added to the long-term memory, and the previously stored knowledge is updated as follows. An edge between the first winner and the second winner is generated if it does not yet exist, and the age of that edge is set to zero. Edges are used to generate clusters of if-then rules in the long-term memory: if-then rules connected by an edge are regarded as members of the same cluster. Subsequently, the age of every edge emanating from the first winner is increased by 1, and every edge with age greater than a parameter Λ_edge is removed. If-then rules in the long-term memory that have fewer than two edges are removed whenever the number of learning data provided so far is an integer multiple of a parameter λ. The noise depicted in Fig. 1(b) is expected to be eliminated by this step.

Here, we describe the method for determining whether a learning datum is a member of the same cluster as both the first and the second winner. The similarity threshold of the first (or second) winner is calculated as follows:

s = max_{k′ ∈ N} d(k′, k_w)   (if N ≠ ∅),
s = max_{k′ ∈ A} d(k′, k_w)   (if N = ∅),
where N is the set of all if-then rules connected to the first (or second) winner by an edge, A is the set of all if-then rules in the long-term memory, and k_w denotes the first (or second) winner. The learning datum is regarded as a member of the same cluster as both winners if its distance to the first winner is less than the similarity threshold of the first winner and its distance to the second winner is less than the similarity threshold of the second winner.

2.2 The Reasoning Algorithm

When a fact such as A∧B is input, the system generates a tree whose root is the input fact and which initially has no other node. The distance between the fact and the conditional part of each if-then rule in the long-term memory is calculated. If conditional parts with distance less than δ_r exist, their sequential parts are added to the tree as children of the root, where only the nearest sequential part is added if several sufficiently close ones exist in the same cluster. The reasoning process terminates if no child is added to the tree. If new children were added, the same steps are applied to the children instead of the fact, and this is repeated as long as new children are generated. Figure 3 shows an example of an output tree.

Fig. 3. Example of an OR tree as a result of reasoning
The reasoning machine outputs the tree of Fig. 3 when A'∧B' is obtained as a fact, if the long-term memory holds "A∧B→C∨(D∧¬E)", "C'→F∨G∨(H∧I)", and "D'∧¬E'→¬J", where (A, A') to (E, E') are pairs of sufficiently similar patterns. An output tree should be read as an OR tree; a deduced proposition is then a disjunctive normal form generated by combining several propositions in the nodes. If the reasoning machine outputs the tree shown in Fig. 3, then "C∨(D∧¬E)", "C∨¬J", "F∨G∨(H∧I)∨(D∧¬E)", and "F∨G∨(H∧I)∨¬J" are all correct reasoning results. A reasoning result represented as a disjunctive normal form can be interpreted as a possibility about the environment: if the present location is input as a fact and "the supermarket or the station or the drug store" is the reasoning result, the system has discovered the places it may be able to reach from the present location.
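The learning-phase distance measure, the similarity threshold, and the reasoning loop can be sketched together. This is a simplified illustration under stated assumptions, not the authors' implementation: patterns are tuples of floats, a rule's four parts are lists of such patterns, and the reasoning step treats each tree node as a single conjunctive pattern (negation flags and full OR-tree bookkeeping are omitted).

```python
from itertools import permutations
from math import dist, sqrt

def min_matching(xs, ys):
    """Min over one-to-one index assignments of the summed distances
    ||x_i - y_{k_i}||; brute force is fine for the few literals per rule."""
    if not xs:
        return 0.0
    return min(
        sum(dist(x, ys[k]) for x, k in zip(xs, perm))
        for perm in permutations(range(len(ys)), len(xs))
    )

def rule_distance(a, b, dim):
    """Distance d between two same-form rules; each rule is a 4-tuple
    (positive cond., negated cond., positive seq., negated seq. patterns)."""
    return sum(min_matching(p, q) for p, q in zip(a, b)) / sqrt(dim)

def similarity_threshold(winner, neighbors, all_rules, dim):
    """Threshold s: max distance to edge-connected neighbors, or to all
    stored rules when the winner has no neighbors."""
    pool = neighbors if neighbors else all_rules
    return max(rule_distance(winner, r, dim) for r in pool)

def build_or_tree(fact, rules, delta_r, depth=10):
    """Reasoning phase: grow an OR tree from `fact`.  `rules` is a list of
    (conditional pattern, sequential pattern, cluster id); per frontier
    node, each cluster contributes at most its nearest rule within delta_r."""
    if depth == 0:               # guard against rule cycles
        return {}
    best = {}                    # cluster id -> (distance, sequential part)
    for cond, seq, cid in rules:
        d = dist(fact, cond)
        if d < delta_r and (cid not in best or d < best[cid][0]):
            best[cid] = (d, seq)
    return {seq: build_or_tree(seq, rules, delta_r, depth - 1)
            for _, seq in best.values()}

# Toy 1-D patterns chaining 0.0 -> 1.0 -> 2.0, queried with a noisy fact:
toy = [((0.0,), (1.0,), 0), ((1.0,), (2.0,), 1)]
tree = build_or_tree((0.05,), toy, delta_r=0.3)
```

The per-cluster "nearest rule only" restriction is what prevents many near-duplicate stored rules from flooding the tree with interchangeable children.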
Fig. 4. Neural Network Model of the Proposed Method

2.3 Neural Network Model

The architecture of the proposed system is as shown in Fig. 2 in terms of knowledge engineering; the proposed system can also be interpreted as a self-incremental neural network like SOINN [1], SOINN-AM [2], and Growing Neural Gas [5]. The proposed neural network model has an input layer and a competitive layer, as shown in Fig. 4. The input layer obtains an if-then rule or a fact. New nodes are generated adaptively when the input layer obtains an if-then rule. Each if-then rule is represented by several nodes connected by edges, whereas other self-incremental neural networks represent each learned datum with a single node. These edges are of a different type from those used to generate clusters. The proposed model requires m × n neurons to represent an if-then rule whose conditional part has m literals and whose sequential part has n literals. Each neuron holds a pair of literals, picked one each from the conditional part and the sequential part, which means that each neuron holds two vectors and two negation flags. The neurons representing A∧¬B→¬C∧D are shown in Fig. 4.

3 Simulation Results

In the experiment, the proposed system learned sequentially provided if-then rules whose atomic propositions were images taken in a real environment; it then produced inferences using the learned knowledge. We used 280 images of the 14 objects shown in Fig. 5, taken from 20 different angles. Each image has 56 × 46 pixels. The if-then rules the system learned were "A→B",
F. Elevator G. Desk-1 H. Desk-2 I. Desk drawer-1 (close) J. Desk drawer-2 (close) K. Desk drawer-1 (open) L. Desk drawer-2 (open) M. Down stairs N. Room plate D. Wall E. Hallway O. White Noise Fig. 5. Images for the experiment Pattern-Based Reasoning System Using Self-incremental Neural Network 345
Table 1. Number of members in the clusters and compression rate

Cluster No.   If-then rule     Number of members   Compression rate (%)
 1            A→B                     87                 78.25
 2            B→D                    145                 63.75
 3            B→E                     50                 87.5
 4            E→(C∧N)                 64                 84
 5            E→F                     31                 92.25
 6            E→M                    146                 63.5
 7            (C∧N)→(G∧I)             84                 79
 8            (C∧N)→(H∧J)            111                 72.25
 9            (G∧I)→K                140                 65
10            (H∧J)→L                122                 69.5
Total         —                      980                 75.5

Fig. 6. Reasoning results

Fig. 7. Ratio of correct reasoning when noisy data are given (abscissa: SN ratio; ordinate: correct reasoning ratio (%); one curve each for δ_r = 0.1, 0.2, and 0.3)

"B→D∨E", "E→(C∧N)∨F∨M", "(C∧N)→(G∧I)∨(H∧J)", "(G∧I)→K", and "(H∧J)→L". They mean "the closed door → the open door", "the open door → the wall ∨ the hallway", "the hallway → (the laboratory ∧ the room plate) ∨ the elevator ∨ the down stairs", "the laboratory ∧ the room plate → (the desk 1 ∧ the closed drawer 1) ∨ (the desk 2 ∧ the closed drawer 2)", "(the desk 1 ∧ the
closed drawer 1) → the open drawer 1", and "(the desk 2 ∧ the closed drawer 2) → the open drawer 2". The system obtained if-then rules, each atomic proposition of which was selected randomly from the 20 different-angled images. Random selection from among them provided learning data 4400 times for each if-then rule. The parameters were set to Λ_edge = 100 and λ = 50.

The number of if-then rules in the long-term memory after learning was 980; these rules were categorized into 10 different clusters. Each cluster had members corresponding only to one of "A→B", "B→D", "B→E", "E→(C∧N)", "E→F", "E→M", "(C∧N)→(G∧I)", "(C∧N)→(H∧J)", "(G∧I)→K", or "(H∧J)→L"; for example, one cluster consisted only of if-then rules corresponding to "A→B". Consequently, the system was able to categorize into the same cluster exactly those if-then rules that are the same rule in a symbol-based sense. Table 1 shows the number of members of each cluster and the data compression rates. We estimated the compression rate as (1 − N_m/N_l), where N_m is the number of members of the cluster and N_l is the number of if-then rules provided to the system as learning data. Although the learning data consisted of 4000 if-then rules whose atomic propositions all differed from one another, only 980 rules in all were stored in the long-term memory; the system eliminated 3020 rules as similar to previously stored rules. Note that the system obtained the learning data sequentially. These results indicate that the system learns new data without destroying previously learned data and avoids consuming memory by eliminating similar data.

We provided one learned image as a fact to test whether the system can produce inferences correctly. The parameter was set to δ_r = 0.3. The reasoning result is shown in Fig. 6. This tree has neither omissions nor duplications. The system was able to avoid duplication of results because it generated clusters appropriately; when we made the system regard every if-then rule in the long-term memory as belonging to a different cluster, 500,376 nodes were created in the output tree. The reasoning result can be interpreted as follows: the system discovered where it can reach from the front of the closed door. The reasoning about the contents of the closed drawers shows the importance of storing if-then rules that contain a conjunction: because the two closed drawers appear almost identical, the system would not make a correct inference if it were unable to deal with the conjunction. To examine whether the system can reason correctly under the type of noisy data shown in Fig. 1(b), we provided ten white noise samples, one example of which is shown in Fig. 5O. The system made no child node for the white noises, because it was able to judge that all of them consisted of unknown patterns. The system proposed in [3] cannot judge whether an input fact has been learned.
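The compression rates in Table 1 follow directly from the formula just given. A quick check, using N_l = 400 learning data per resolved rule (our reading of the 4000 total spread over ten rules) and recovering cluster 1's member count from the 980 total:

```python
# Cluster member counts N_m from Table 1 (cluster 1's value, 87, is
# recovered as 980 minus the sum of the other nine clusters).
members = [87, 145, 50, 64, 31, 146, 84, 111, 140, 122]
n_l = 400  # assumed learning data per resolved rule: 4000 total / 10 rules

rates = [round(100 * (1 - m / n_l), 2) for m in members]       # per cluster
total = round(100 * (1 - sum(members) / (10 * n_l)), 2)        # overall
# rates[1] reproduces 63.75 and total reproduces 75.5, as in Table 1
```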
It outputs a wrong pattern, entirely unrelated to the learned patterns, when it obtains an unknown pattern. We generated noisy data by adding Gaussian noise to the 20 images of Fig. 5A to examine whether the system can reason correctly under the type of noisy data shown in Fig. 1(a). Ten noisy images were generated from each image at each noise level, which means that the system obtained 200 noisy images per noise level. We estimated the correct reasoning ratio as "the number of
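Generating test images at a prescribed SN ratio can be sketched as follows. The paper does not state which SNR convention it uses; the sketch assumes the common power-ratio definition SNR = 10·log10(P_signal/P_noise) in decibels, and treats an image as a flat list of pixel intensities:

```python
import math
import random

def add_gaussian_noise(image, snr_db, rng=random.Random(0)):
    """Return a copy of `image` (flat list of floats) with zero-mean
    Gaussian noise scaled so the signal-to-noise ratio is `snr_db` dB.
    The SNR convention (10*log10 of the power ratio) is our assumption."""
    p_signal = sum(x * x for x in image) / len(image)
    p_noise = p_signal / (10 ** (snr_db / 10))
    sigma = math.sqrt(p_noise)
    return [x + rng.gauss(0.0, sigma) for x in image]

# A flat 56x46 "image" of constant intensity, corrupted at 20 dB:
clean = [0.5] * (56 * 46)
noisy = add_gaussian_noise(clean, snr_db=20)
```

Sweeping `snr_db` over a range and counting correct inferences per level is then enough to reproduce a curve of the kind shown in Fig. 7.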
correct reasoning / the number of reasoning trials with noisy images". Figure 7 shows that a boundary SN ratio exists: if the SN ratio is greater than the boundary, the system outputs the same result as it does for the original image; if the SN ratio of a fact is less than the boundary, the system regards the fact as an unknown pattern and generates no child node for it. The boundary can be configured through parameter setting. The system proposed in [3] also has a noise-level boundary for producing correct inferences, but a user cannot configure it.

4 Conclusion and Future Work

We proposed a novel reasoning system that can learn pattern-based if-then rules including conjunction, disjunction, and negation. The proposed system is suitable for systems operating in the real world, such as intelligent robots, because it achieves incremental learning, generalization, avoidance of duplication of results, and robustness to noise. We will develop this study in two directions. The first is applying the results of this study to the development of a robot: using the pattern-based reasoning system, a robot is expected to learn knowledge autonomously from its environment and to solve tasks by reasoning with the learned knowledge. The second is an extension of the proposed system that allows a robot to solve more complicated problems.

Acknowledgments. This study was supported by the Industrial Technology Research Grant Program in 2004 from the New Energy and Industrial Technology Development Organization (NEDO) of Japan.

References

1. Shen, F., Hasegawa, O.: An Incremental Network for On-line Unsupervised Classification and Topology Learning. Neural Networks 19, 90–106 (2006)
2. Sudo, A., Sato, A., Hasegawa, O.: Associative Memory for Online Incremental Learning in Noisy Environments. In: Proc. of the 2005 International Joint Conference on Neural Networks (IJCNN 2005) (accepted, 2007)
3. Yamane, K., Hasuo, T., Suemitsu, A., Morita, M.: Pattern-Based Reasoning Using Trajectory Attractors. IEICE Trans. Information and Systems J90-D, 933–944 (2007)
4. Tsukimoto, H.: Pattern Reasoning: Logical Reasoning of Neural Networks. IEICE Trans. Information and Systems J83-D-II, 744–753 (2000)
5. Fritzke, B.: A Growing Neural Gas Network Learns Topologies. In: Proc. of Advances in Neural Information Processing Systems (NIPS), pp. 625–632 (1995)