Lecture Notes in Computer Science
neurons in hidden and output layers.

3.1 Experimental Results

Table 2 shows the results of PDCA for 50 independent runs on several classification problems. The generalization performance of the produced ANNs is measured by the testing error rate (TER), which refers to the percentage of wrong classifications produced by trained ANNs. The number of hidden layers (HL) and hidden neurons (HN), respectively, refer to the total number of hidden layers and hidden neurons in the produced ANNs. The SD refers to the standard deviation of the different results produced by PDCA in 50 independent runs.

It is quite natural that complex architectures will be needed for solving difficult problems and simple architectures for solving easy problems. PDCA shows this characteristic, as can be seen from Table 2. For example, ANNs produced by PDCA had on average 2.84 hidden layers for the glass problem, but only 1.06 hidden layers for the cancer problem. It is well known that glass and cancer are among the most difficult and easiest problems, respectively, in the realm of machine learning [13]. This is also apparent from the error rates achieved by PDCA: for the glass problem it is 0.323, while for the cancer problem it is 0.009.

324 M.A. Sattar, M.M. Islam, and K. Murase

Table 2. Performance of PDCA based on number of hidden layers (HL), number of hidden neurons (HN), number of epochs and testing error rate (TER) for eight classification problems. The results were averaged over 50 independent runs.

Problem    HL (Mean, SD)   HN (Mean, SD)   Epochs (Mean, SD)   TER (Mean, SD)
Cancer     1.06  0.7       3.36   1.2      289.34   20         0.009  0.003
Card       1.54  1.1       2.14   1.3      195.82   18         0.116  0.021
Diabetes   2.04  1.2       4.20   1.5      220.76   27         0.183  0.032
Glass      2.84  1.5       4.54   1.6      311.50   30         0.323  0.035
Gene       1.00  1.1       1.28   0.9      135.90   17         0.092  0.021
Iris       1.46  0.8       2.46   1.0      125.30   11         0.035  0.007
Letter     2.76  1.3       24.32  3.8      8037.84  280        0.128  0.010
Thyroid    2.04  0.9       4.28   1.7      260.34   16         0.011  0.001

Table 3. Performance of PDCAWL based on number of hidden neurons (HN), number of epochs and testing error rate (TER) for four classification problems. The results were averaged over 50 independent runs.

Problem    HN (Mean, SD)   Epochs (Mean, SD)   TER (Mean, SD)
Diabetes   5.64   1.3      261.32   25         0.214  0.035
Glass      6.58   1.6      370.90   30         0.364  0.029
Letter     30.64  4.2      8590.18  291        0.167  0.021
Thyroid    6.86   1.4      301.42   16         0.015  0.001

Table 4. Performance of PDCAST based on number of hidden layers (HL), number of hidden neurons (HN), number of epochs and testing error rate (TER) for five classification problems. The results were averaged over 50 independent runs.

Problem    HL (Mean, SD)   HN (Mean, SD)   Epochs (Mean, SD)   TER (Mean, SD)
Cancer     1.20  0.9       4.13   1.2      315.77   23         0.013  0.004
Glass      2.78  1.5       4.17   1.4      301.33   23         0.298  0.030
Gene       1.61  1.3       3.13   1.6      290.17   19         0.128  0.023
Iris       1.45  0.7       2.53   1.0      130.84   9          0.038  0.007
Letter     2.82  1.8       36.93  4.1      9220.02  311        0.176  0.035

In order to observe the effect of the layer stopping criterion, a new set of experiments was carried out in which PDCA without the layer stopping criterion (PDCAWL) was used; the results are shown in Table 3. Since PDCAWL does not use the layer stopping criterion, it can only design single-hidden-layer ANNs. A comparison of Table 3 with Table 2 indicates the positive effect of using the layer stopping criterion in PDCA.

To observe the effect of using different training sets in PDCA, another new set of experiments was carried out in which all the newly added neurons were trained
A New Constructive Algorithm for Designing and Training ANNs 325
with the same training set. We call this version of PDCA PDCAST. The results of this experiment are shown in Table 4. The effect of using different training sets can easily be understood when the results of PDCAST (Table 4) are compared with those of PDCA (Table 2). It is observed that PDCA produces more compact ANN architectures and takes fewer epochs than its counterpart PDCAST. t-tests based on the number of hidden neurons, the number of epochs and the testing error rate indicated that PDCA was significantly better than its counterparts PDCAWL and PDCAST at the 95% confidence level.

4 Comparison

This section compares the performance of PDCA with CCA [1] and aCasper [19]. The reason for selecting these algorithms is that they are able to design multiple-hidden-layer ANNs like PDCA. CCA and aCasper use the quickprop [2] and RPROP [14] algorithms, respectively, for training ANNs. It is known that the quickprop and RPROP algorithms are faster than the backpropagation algorithm used in PDCA. In this section, we therefore compare PDCA with CCA and aCasper in terms of the average TER and HN over 50 independent runs.

Table 5 compares PDCA's results with those produced by CCA and aCasper on six classification problems. It can be seen from the table that PDCA achieved the smallest TER for five out of six problems, and the second smallest, next to aCasper [19], for one problem. The improved performance of PDCA can be attributed to a couple of factors, such as determining the number of hidden layers in ANNs and of neurons in hidden layers automatically, and using different data for training different hidden neurons.

Table 5. Comparison among PDCA, CCA [1], and aCasper [19], in terms of average testing error rate (TER) and number of hidden neurons (HN).

Algorithm        Cancer  Card   Diabetes  Glass  Gene   Thyroid
PDCA     HN      3.36    2.14   4.20      4.54   1.28   4.28
         TER     0.009   0.116  0.183     0.323  0.092  0.011
CCA      HN      5.18    1.07   9.78      8.07   2.73   25.04
         TER     0.019   0.137  0.245     0.347  0.133  0.030
aCasper  HN      4.86    0.12   3.02      4.18   0.00   4.64
         TER     0.018   0.135  0.231     0.306  0.117  0.016
5 Conclusions

This paper proposes a new constructive algorithm to design as well as train ANNs. Neither the number of hidden layers in an ANN nor the number of neurons in hidden layers needs to be predefined and fixed; they are determined automatically during the training of an ANN. Experiments have been carried out on several benchmark classification problems to evaluate how well PDCA performed on different problems in comparison with other ANN design algorithms. In almost all cases, PDCA outperformed the others. However, our experimental study appears to have revealed one weakness of PDCA in dealing with the glass problem, which has a small number of training examples. It would be interesting in the future to analyze PDCA more rigorously in order to gain more insight into when PDCA is most likely to perform well and for what kind of problems. The use of different activation functions, like the one used in [8], in association with different training sets would also be an interesting future research topic.

Acknowledgement. MMI is currently a Visiting Associate Professor at the University of Fukui, supported by a Fellowship from the Japan Society for the Promotion of Science (JSPS). This work was in part supported by grants to KM from JSPS, the Yazaki Memorial Foundation for Science and Technology, and the University of Fukui.

References

1. Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Systems, vol. 2, pp. 524-532. Morgan Kaufmann, San Mateo, CA (1990)
2. Fahlman, S.E.: An empirical study of learning speed in back-propagation networks. Tech. Report CMU-CS-88-162, Carnegie Mellon University (1988)
3. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366 (1989)
4. Kwok, T.Y., Yeung, D.Y.: Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks 8, 630-645 (1997)
5. Lehtokangas, M.: Modified cascade-correlation learning for classification. IEEE Transactions on Neural Networks 11, 795-798 (2000)
6. Lehtokangas, M.: Fast initialization for cascade-correlation learning. IEEE Transactions on Neural Networks 10, 410-413 (1999)
7. Lehtokangas, M.: Modeling with constructive backpropagation. Neural Networks 12, 707-716 (1999)
8. Ma, L., Khorasani, K.: Constructive feedforward neural networks using Hermite polynomial activation functions. IEEE Transactions on Neural Networks 16, 821-833 (2005)
9. Monirul Islam, Md., Murase, K.: A new algorithm to design compact two-hidden-layer artificial neural networks. Neural Networks 14, 1265-1278 (2001)
10. Phatak, D.S., Koren, I.: Connectivity and performance tradeoffs in the cascade correlation learning architecture. IEEE Transactions on Neural Networks 5, 930-935 (1994)
11. Prechelt, L.: Automatic early stopping using cross validation: quantifying the criteria. Neural Networks 11, 761-767 (1998)
12. Reed, R.: Pruning algorithms - a survey. IEEE Transactions on Neural Networks 4, 740-747 (1993)
13. Prechelt, L.: PROBEN1 - A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Faculty of Informatics, University of Karlsruhe, Germany (1994)
14. Riedmiller, M., Braun, H.: RPROP - A fast adaptive learning algorithm. In: Proc. of the 1992 International Symposium on Computer and Information Sciences, Turkey, pp. 279-285 (November 1992)
15. Schapire, R.E.: The strength of weak learnability. Machine Learning 5, 197-227 (1990)
16. Schwenk, H., Bengio, Y.: Boosting neural networks. Neural Computation 12, 1869-1887 (2000)
17. Setiono, R., Hui, L.C.K.: Use of a quasi-Newton method in a feedforward neural network construction algorithm. IEEE Transactions on Neural Networks 6, 273-277 (1995)
18. Tamura, S., Tateishi, M.: Capabilities of a four-layered feedforward neural network: four layers versus three. IEEE Transactions on Neural Networks 8, 251-255 (1997)
19. Treadgold, N.K., Gedeon, T.D.: Exploring constructive cascade networks. IEEE Transactions on Neural Networks 10, 1335-1350 (1999)
20. Yao, X.: A review of evolutionary artificial neural networks. International Journal of Intelligent Systems 8, 529-567 (1993)
Effective Learning with Heterogeneous Neural Networks
Lluís A. Belanche-Muñoz

Dept. de Llenguatges i Sistemes Informàtics,
Universitat Politècnica de Catalunya, Barcelona, Spain
belanche@lsi.upc.edu

Abstract. This paper introduces a class of neuron models accepting heterogeneous inputs and weights. The neuron model computes a user-defined similarity function between inputs and weights. The neuron transfer function is formed by composition of an adapted logistic function with the power mean of the partial input-weight similarities. The resulting neuron model is capable of dealing directly with mixtures of continuous quantities (crisp or fuzzy) and discrete quantities (ordinal, integer, binary or nominal). There is also provision for missing values. An artificial neural network using these neuron models is trained using a breeder genetic algorithm until convergence. A number of experiments are carried out using several real-world benchmarking problems. The network is compared to a standard radial basis function network and to a multi-layer perceptron, and shown to learn from non-trivial data sets with superior generalization ability in most cases, at a comparable computational cost. A further advantage is the interpretability of the learned weights.

1 Introduction

In many important domains from the real world, objects are described by a mixture of continuous and discrete variables, usually containing missing information and characterized by an underlying uncertainty or imprecision. For example, in the well-known UCI repository [1] over half of the problems contain explicitly declared nominal attributes, let alone other data types, usually unreported. In the case of artificial neural networks (ANNs), this heterogeneous information has to be encoded in the form of real-valued quantities, although in most cases there is enough domain knowledge to characterize the nature of the variables.
We present a general framework for dealing with data heterogeneity in ANNs under the conceptual cover of similarity, in which a class of neuron models accepting heterogeneous inputs and weights computes a user-defined similarity function between these inputs and weights. The similarity function is defined by composition of a specific power mean of the partial input-weight similarities plus a logistic function. The resulting neuron model then accepts mixtures of continuous and discrete quantities, with explicit provision for missing information. Other data types are possible by extension of the model.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 328-337, 2008.
© Springer-Verlag Berlin Heidelberg 2008
An artificial neural network using these neuron models is trained using a breeder genetic algorithm until convergence. A number of experiments are carried out that illustrate the validity of the approach, using several benchmarking problems (classification and regression), selected as representatives because of the diversity in the richness and kind of data heterogeneity. The network is compared to a standard radial basis function neural network and to a multi-layer perceptron, and shown to learn from non-trivial data sets with a generalization ability that is superior in most cases to both networks. A further advantage of the approach is the interpretability of the learned weights.

The paper is organized as follows. Section 2 further motivates the basis of the approach and reviews some widespread ways of coping with data heterogeneity in ANNs; Section 3 introduces material on similarity measures; Section 4 builds the heterogeneous neural network and proposes specific similarity measures. Finally, Section 5 presents experimental work.

2 Motivation

The action of a feed-forward ANN may be viewed as a mapping through which points in the input space are transformed into corresponding points in an output space. The task is to find structure in the data, transforming it from the input space to a new hidden space (the space spanned by the hidden units) in such a way that the problem is simpler when projected to this space. This processing is repeated for all subsequent hidden layers. It seems clear that adequate transformations will require fewer units and therefore a hidden space of fewer dimensions. This in turn leads to simpler mappings, which are less likely to overfit.

The integration of heterogeneous data sources in information processing systems has been advocated elsewhere [2]. In this sense, a shortcoming of existing neuron models is the difficulty of adding prior knowledge to the model.
Current ANN practice assumes that an input may be faithfully represented as a point in R^n, and that the geometry of this space captures the meaningful relations in input space. There is no particular reason why this should be the case. Knowledge representation influences the practical success of a learning task [3], but has received little attention in the ANN community. The form in which data are presented to the network is of primary relevance. In essence, the task of the hidden layer(s) is to find a new, more convenient representation of the problem given the data representation chosen, a crucial factor for a successful learning process that can have a great impact on generalization ability [4]. As Anderson and Rosenfeld put it, "prior information and invariances should be built into the design of an ANN, thereby simplifying the network design by not having to learn them" [5]. Further, the activity of the units should have a well-defined meaning in terms of the input patterns [6].

Without the aim of being exhaustive, commonly used methods (see, e.g., [7,4]) are the following.

Ordinal variables are coded as real-valued or using a thermometer representation. Nominal variables are coded using a 1-out-of-c or c - 1 representation.

Missing information is difficult to handle, especially when the lost parts are of significant size. Entire examples can be removed, or the values "filled in" with the mean, median, nearest neighbour, or by adding another input equal to one only if the value is absent. Statistical approaches need to model the input distribution itself [4], or are computationally intensive [8]. Vagueness and uncertainty are considerations usually put aside.

These encodings being intuitive, their precise effect on network performance is not clear, because of the change in input distribution, the increase (sometimes acute) in input dimension and other subtler mathematical effects derived from imposing an order or a continuum where there was none.
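The two standard codings just mentioned can be sketched as follows (a minimal illustration; the function names are ours):

```python
def one_out_of_c(value, categories):
    """1-out-of-c coding of a nominal value: one unit per category."""
    return [1.0 if value == c else 0.0 for c in categories]

def thermometer(value, levels):
    """Thermometer coding of an ordinal value: all units up to its rank are on."""
    rank = levels.index(value)
    return [1.0 if i <= rank else 0.0 for i in range(len(levels))]

print(one_out_of_c("green", ["red", "green", "blue"]))   # [0.0, 1.0, 0.0]
print(thermometer("medium", ["low", "medium", "high"]))  # [1.0, 1.0, 0.0]
```

Note how both codings multiply one input variable into as many units as there are categories or levels, which is precisely the growth in input dimension discussed in the text.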
This issue can dramatically affect generalization for the worse, due to the curse of dimensionality. In this scenario, the use of hidden units which can adequately capture the desired mapping and do not add input dimension is thus of great practical importance.

3 Similarity Measures

Let us represent patterns belonging to a space X ≠ ∅ as a vector x_i of n components, where each component x_ik represents the value of a particular feature k. A similarity measure is a unique number expressing how "like" two patterns are, given these features. A similarity measure may be defined as a function s : X × X → I_s ⊂ R. Assume that s is upper bounded, exhaustive and total. This implies that I_s is upper bounded and also that sup I_s exists. Define s_max ≡ sup I_s, and define s_min ≡ inf I_s if it exists. Let us denote by s_ij the similarity between x_i and x_j, that is, s_ij = s(x_i, x_j). A similarity measure may be required to satisfy the following axioms, for any x_i, x_j ∈ X:

Axiom S1. (Reflexivity) s(x_i, x_i) = s_max. This implies sup I_s ∈ I_s.
Axiom S2. (Consistency) s(x_i, x_j) = s_max implies x_i = x_j.
Axiom S3. (Symmetry) s(x_j, x_i) = s(x_i, x_j).
Axiom S4. (Boundedness) s is lower bounded when there exists a ∈ R such that s(x_i, x_j) ≥ a for all x_i, x_j ∈ X. This is equivalent to asking that inf I_s exists.
Axiom S5. (Closedness) A lower bounded function s is closed if there exist x_i, x_j ∈ X such that s(x_i, x_j) = s_min. This is equivalent to asking that inf I_s ∈ I_s.

The similarity axioms enumerated above ensure a consistent definition of such functions, but should be taken as desiderata: some similarity relations may fulfill part or all of them [9]. Other properties (like transitivity) may be of great interest in some contexts, but are not relevant for this work. The only requirement expressed so far about the set X is the existence of an equivalence relation. However, elements in this set can be simple or complex (i.e. composed of one or more variables). For atomic elements the similarity can be trivially computed, but for complex elements we need to define a way to combine the partial similarities s_ijk = s_k(x_ik, x_jk) for each variable k to get a meaningful value s_ij. This combination has an important semantic role and it is not a trivial choice. We use in this work the concept of an A-average [10], as
follows. Let [a, b] be a non-empty real interval, and call A(x_1, ..., x_n) the A-average of x_1, ..., x_n ∈ [a, b] for every n-place real function A fulfilling:

Axiom A1. A is continuous, symmetric and strictly increasing in each x_i.
Axiom A2. A(x, ..., x) = x.
Axiom A3. For any positive integer k ≤ n: A(x_1, ..., x_n) = A(y_k, ..., y_k [k times], x_{i_{k+1}}, ..., x_{i_n}), where y_k = A(x_{i_1}, ..., x_{i_k}) and (i_1, ..., i_n) is a permutation of (1, ..., n).

The means defined by these axioms fulfill Cauchy's property of means, namely that min_i x_i ≤ A(x_1, ..., x_n) ≤ max_i x_i, with equality only if x_1 = x_2 = ... = x_n. This property is stronger than idempotency; thus, the previous bounds actually impose idempotency (Axiom A2). A further interesting property of these means is that we can add averaged elements to an A-average without changing the overall result. Formally, A(x_1, ..., x_n) = A(z_1, ..., z_m, x_1, ..., x_n) if and only if A(z_1, ..., z_m) = A(x_1, ..., x_n). As a consequence, if y = A(x_1, ..., x_n), then A(x_1, ..., x_n, A(x_1, ..., x_n)) = y.
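These properties can be checked numerically. The sketch below uses the power means M_q introduced next in the text (the function name and the sample values are ours):

```python
def power_mean(xs, q):
    """Power mean M_q of positive numbers; q = 0 is treated as the geometric limit."""
    n = len(xs)
    if q == 0:
        prod = 1.0
        for x in xs:
            prod *= x
        return prod ** (1.0 / n)
    return (sum(x ** q for x in xs) / n) ** (1.0 / q)

xs = [0.2, 0.5, 0.9]
y = power_mean(xs, 1)  # arithmetic mean
# Adding the average itself leaves the A-average unchanged:
assert abs(power_mean(xs + [y], 1) - y) < 1e-12
# M_q is increasing in q for non-equal inputs (harmonic < geometric < arithmetic < quadratic):
assert power_mean(xs, -1) < power_mean(xs, 0) < power_mean(xs, 1) < power_mean(xs, 2)
```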
Theorem. Let f : [a, b] → R be a continuous, strictly monotone mapping, and let g be the inverse function of f. Then

A(x_1, ..., x_n) ≡ g( (1/n) Σ_{i=1}^{n} f(x_i) )

is a well-defined A-average for all n ∈ N and x_i ∈ [a, b] [10].

An important class of A-averages is formed by choosing f(z) = z^q, q ∈ R, to obtain:

M_q(x_1, ..., x_n) = ( (1/n) Σ_{i=1}^{n} (x_i)^q )^{1/q}

Several well-known means are found by choosing particular values of q: specifically, the arithmetic mean for q = 1, the geometric mean for q = 0, the quadratic mean for q = 2 and the harmonic mean for q = -1. A property of the means M_q is that M_q(x_1, ..., x_n) ≥ M_{q'}(x_1, ..., x_n) if and only if q > q', with equality only if x_1 = x_2 = ... = x_n. We shall refer to these as the power means.

4 Heterogeneous Neural Networks

4.1 The Heterogeneous Neuron Model

We develop in this section neuron models allowing for heterogeneous and imprecise inputs, defined as a mapping h : Ĥ^n → I_out ⊆ R. Here R denotes the reals and Ĥ^n is a cartesian product of an arbitrary number n of source sets Ĥ^(k), k = 1, ..., n. These source sets may include extended reals R̂_k = R_k ∪ {X} (where each R_k is a real interval), extended families of fuzzy sets F̂_k = F_k ∪ {X}, and extended finite sets of the form Ô_k = O_k ∪ {X} and M̂_k = M_k ∪ {X}, where the O_k have a full order relation while the M_k have not. The special symbol X extends the source sets and denotes the unknown element (missing information), behaving as an incomparable element w.r.t. any ordering relation. According to this definition, neuron inputs are vectors composed of n elements among which there might be reals, fuzzy sets, ordinals, categorical values and missing data.

Consider now a function s : Ĥ^n × Ĥ^n → I_s, a similarity function in Ĥ^n; we take I_s ⊆ [0, 1] for simplicity. This function is formed by combination of n partial similarities s_k : Ĥ^(k) × Ĥ^(k) → I_s. The s_k measures are normalized to a common real interval ([0, 1] in this case) and computed according to different formulas for different variables (possibly, but not necessarily, determined by variable type alone). The desired neuron model, which is both similarity-based and handles data heterogeneity and missing values, can be devised as follows. Let Γ_i be a neuron with input x ∈ Ĥ^n, weight vector μ_i ∈ Ĥ^n and smoothing parameter γ_i. Then

Γ_i(x) = š( γ_i M_q( [s_k(x_k, μ_{i,k})]_{k=1}^{n} ) ),   q ∈ R,  γ_i ∈ (0, 1]
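The computation above can be sketched as follows. This is an illustrative sketch: the particular logistic-style activation and the partial similarity functions are our own choices, and missing values are marked with None.

```python
import math

def logistic01(z, k=8.0):
    """A sigmoid-like monotonic bijection of [0, 1] onto itself (illustrative)."""
    lo = 1.0 / (1.0 + math.exp(k / 2.0))
    hi = 1.0 / (1.0 + math.exp(-k / 2.0))
    return (1.0 / (1.0 + math.exp(-k * (z - 0.5))) - lo) / (hi - lo)

def power_mean(xs, q=1.0):
    return (sum(x ** q for x in xs) / len(xs)) ** (1.0 / q)

def neuron(x, mu, sims, gamma=1.0, q=1.0):
    """Heterogeneous neuron: activation of gamma times the power mean M_q of the
    partial similarities. Pairs with a missing element (None) are skipped, which
    by the A-average property equals filling them with the mean s_X (see text)."""
    s = [sk(xk, mk) for xk, mk, sk in zip(x, mu, sims)
         if xk is not None and mk is not None]
    return logistic01(gamma * power_mean(s, q))

# Hypothetical 3-variable pattern: continuous in [0, 10], nominal, and a missing value.
sims = [lambda a, b: 1.0 - abs(a - b) / 10.0,   # normalized continuous similarity
        lambda a, b: 1.0 if a == b else 0.0,    # nominal overlap
        lambda a, b: 1.0 if a == b else 0.0]
out = neuron([3.0, "red", None], [4.0, "red", "blue"], sims)
print(round(out, 3))
```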
The activation function š is any sigmoid-like automorphism (a monotonic bijection) of [0, 1]. Any missing partial similarities (because one or both of their arguments is missing) are regarded as ignorance and must contribute neither in favour of nor against the overall measure. To this end, let C = {k_1, ..., k_m} be the set of indexes of the similarities that could not be computed, and define s_X(x, μ_i) = M_q( [s_k(x_k, μ_{i,k})]_{k=1, k∉C}^{n} ), that is, the power mean of the successfully computed similarities. Thanks to the previous properties of A-averages,

M_q( [s_k(x_k, μ_{i,k})]_{k=1}^{n} ) = M_q( [s_k(x_k, μ_{i,k})]_{k=1, k∉C}^{n}, s_X(x, μ_i), ..., s_X(x, μ_i) )   (s_X repeated m times)

4.2 Heterogeneous Neural Networks

Heterogeneous neural networks are neural architectures built out of the previously defined neuron models, thus allowing for heterogeneous or missing inputs. A feed-forward architecture, with a hidden layer composed of heterogeneous neurons and a linear output layer, is a straightforward choice, thus conforming a hybrid structure. The heterogeneous neural network computes the function:

f_hnn(x) = Σ_{i=1}^{h} α_i Γ_i(x)    (1)

where h > 0 is the number of (heterogeneous) hidden neurons and {α_i} the set of mixing coefficients. The HNN thus keeps linearity in the output layer and interpretability in the hidden layer. It can be naturally seen as a generalization of the RBF network. This is so because the response of the hidden neurons is localized: centered at a given object (the neuron weight, where the response is maximum), falling off
as the input is less and less similar to that center.

The general training procedure for the heterogeneous neural network (HNN for short) is based on evolutionary algorithms, due to the missing information and the likely non-differentiability of the similarity measure. Specifically, the training procedure for the HNN used in this work is based on a Breeder Genetic Algorithm (BGA) [11], a method midway between Genetic Algorithms (GAs) and Evolution Strategies. While in GAs selection is stochastic and meant to mimic Darwinian evolution, BGA selection is driven by a deterministic breeding mechanism, an artificial method in which only the best individuals (usually a fixed percentage τ of the total population size μ) are selected to be recombined and mutated, as the basis to form a new generation. The BGA used here does not need any coding scheme, being able to cope with heterogeneous or missing genes [12].

4.3 Heterogeneous Similarity Measures

We present in this section specific measures defined to belong to the interval [0, 1] for the sake of illustration, but there are many variations that may be superior by making better use of available expert knowledge.

Ordinal variables. It is assumed that the values of the variable form a linearly ordered space (O, ≼). Let x_ik, x_jk ∈ O with x_ik ≼ x_jk, let P_lk be the fraction of values of variable k that take on the value x_lk, and let the sums below run through all the ordinal values x_lk such that x_ik ≼ x_lk and x_lk ≼ x_jk [13]:

s_ijk = 2 log(P_ik + ... + P_jk) / (log P_ik + ... + log P_jk)    (2)

Nominal variables. It is assumed that no partial order exists among these values and the only possible comparison is equality. The basic similarity measure for these variables is the overlap. Let N be a nominal space and x_ik, x_jk ∈ N. Then s_ijk = 1 if x_ik = x_jk, and 0 otherwise.

Continuous variables. Let x, y ∈ Γ = [r⁻, r⁺] ⊂ R, with r⁺ > r⁻. The standard metric in R is a metric in Γ. Therefore, for any two values x_ik, x_jk ∈ R:

s_ijk = ŝ_R( |x_ik − x_jk| / sup_{x,y ∈ Γ} |x − y| )    (3)

where ŝ_R : [0, 1] → [0, 1] is a decreasing continuous function. The family ŝ_R(z) = (1 − z^β)^α, 0 < β ≤ 1, α ≥ 1, is used here, with α = 2, β = 1.

Integer variables. Given that N ⊂ R, any distance-based similarity in R is also valid in N. A flexible choice does not limit the range of integer values (assumed positive for convenience). In this case, a self-normalizing distance-based similarity s_ijk = ŝ_N(|x_ik − x_jk|) is indicated, where ŝ_N : [0, ∞) → (0, 1] is a decreasing continuous function. In particular, the function ŝ_N(z) = 1/(1 + z) can be used.
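The nominal, continuous and integer measures above can be sketched directly (the parameter defaults follow the choices in the text; the function names are ours):

```python
def s_nominal(a, b):
    """Overlap similarity for nominal values."""
    return 1.0 if a == b else 0.0

def s_continuous(a, b, lo, hi, alpha=2.0, beta=1.0):
    """Eq. (3): normalized distance through s_R(z) = (1 - z**beta)**alpha."""
    z = abs(a - b) / (hi - lo)  # hi - lo plays the role of the supremum distance
    return (1.0 - z ** beta) ** alpha

def s_integer(a, b):
    """Self-normalizing similarity s_N(z) = 1 / (1 + z) for non-negative integers."""
    return 1.0 / (1.0 + abs(a - b))

print(s_continuous(2.0, 7.0, 0.0, 10.0))  # (1 - 0.5)**2 = 0.25
print(s_integer(3, 5))                    # 1/3
print(s_nominal("x", "y"))                # 0.0
```

All three return values in [0, 1] and reach 1 only for identical arguments, matching axioms S1-S3.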
Binary variables. In the data analysis literature there are many similarity measures defined on collections of binary variables (see, e.g., [14]). This is mostly due to the uncertainty over how to accommodate negative (i.e. false-false) matches. The present situation is that of comparison of a single binary variable rather than a whole vector. In general, one should know which of the two matches is the really relevant one (true-true or false-false); for this reason, treating the variable as purely nominal can result in a loss of relevant information. Since this meta-knowledge is usually not at hand, we use in this work a frequency-based approach, as follows. Let x_ik, x_jk be two binary values and let P_lk be the fraction of values of variable k that take on the value x_lk. We define s_ijk = h(1 − P_ik, 1 − P_jk), where h(x, y) = 2xy/(x + y) is the harmonic mean of x and y. This measure compares the agreement on the rarity of each value: the similarity is higher the less frequent the values are.

Fuzzy variables. For variables representing fuzzy sets, similarity relations from the point of view of fuzzy theory have been defined elsewhere [15], and different choices are possible. In possibility theory, the possibility expresses the likeliness of co-occurrence of two vague propositions, with a value of one standing for absolute certainty. For two fuzzy sets Ã, B̃ of a reference set X, it is defined as:

Π_Ã(B̃) = sup_{u ∈ X} μ_{Ã∩B̃}(u) = sup_{u ∈ X} min( μ_Ã(u), μ_B̃(u) )

In our case, if F_k is an arbitrary family of fuzzy sets and x_ik, x_jk ∈ F_k, the similarity relation s_ijk = Π_{x_ik}(x_jk) can be used. Notice that this measure indeed fulfills axioms S1, S2 and S3.

5 An Experimental Comparison

Table 1. Basic features of the data sets. 'Missing' refers to the percentage of missing values, p to the number of examples and n to that of variables. Types are: C (continuous), N (nominal), I (integer), D (ordinal), B (binary) and F (fuzzy). 'Output' is R for regression or a number indicating the number of classes.

Name                 p     n   Data Types        Output  Missing
Annealing            898   28  6C,19N,3I         6       none
Audiology            226   69  8D,61N            4       2.3%
Boston Housing       506   13  11C,1I,1B         R       none
Credit Screening     690   15  6C,6N,3B          2       0.6%
Heart Disease        920   13  3C,5N,3I,2B       2       16.2%
Horse Colic          364   20  6D,5N,2I,5C,2B    3       26.1%
South African Heart  462   9   7C,1I,1N          2       none
Servo Data           167   4   2I,2N             R       none
Solar Flares         1066  9   4N,3D,2B          2       none

A number of experiments are carried out to illustrate the validity of the approach, using nine benchmarking problems, selected as representatives because
of the richness in data heterogeneity, taken from [7] and [1]. Their main characteristics are displayed in Table 1. For every data set, the available documentation is analysed for an assessment of the most appropriate treatment. Missing information is also properly identified. The HNN is compared to a standard radial basis function network (RBF) and to a multi-layer perceptron (MLP).

[Fig. 1. Generalization results for the networks: (a) Audiology, (b) Boston Housing, (c) Credit Screening, (d) Heart Disease, (e) Horse Colic, (f) Annealing, (g) South African Heart, (h) Solar Flares. Curves: RBF, MLP, HNN. See text for an explanation of the axes.]

The three networks are trained using
336 L.A. Belanche-Mu˜ noz exactly the same procedure to exclude this source of variation from the analysis. The RBF and MLP networks need a pre-processing of the data, following the recommendations in [7]. The RBF neuron model has provision for a smoothing parameter, different for every hidden neuron. The network architecture is fixed to one single layer of hidden neurons ranging from 1 to 30 plus as many output neurons as required by the task, logistic for classification problems and linear otherwise. The input variables for the RBF and MLP network are standardized to zero mean, unit standard deviation. This is not needed by the heterogeneous neurons because they compute a normalized measure, but is beneficial for the other networks. The output is presented for all networks in 1-out-of-c mode (that is, c outputs) for classification problems with c classes. The error measure in this case is generalized cross-entropy (GCE) [4]. For regression problems, the mean is subtracted to the continuous output and the normalized root mean square (NRMS) error is reported. Each data set is split in three parts, for training (TR), validation (VA) and test (TE), as 50%-25%-25%. The BGA task is to minimize error (either NRMS or GCE) on the TR part, until 500 generations or convergence. Then the network having the lowest error on the VA part is returned. The reported performance is the NRMS or GCE of that network on the TE part. This process is repeated ten times and the average is taken. We present performance results in Fig. 1(a) to (h). In each figure, the abs- cissa represents the number of neurons in the hidden layer, and the ordinate the average NRMS or GCE of that network on the TE part, as explained above. For classification problems, the abscissa value 0 is used to indicate the percentage of examples in the majority class (that is, the minimum acceptable performance, as a reference). For regression problems, this value is set to 1.0. 
Note that for classification problems, higher values are better, whereas for regression problems, lower values are better. The HNN shows enhanced performance for most of the problems, especially for those displaying higher heterogeneity or missingness, and less so when this is not the case. This is reasonable, since the HNN behaves as an RBF network when there is no heterogeneity or missingness. The interpretability of the learned weights is also enhanced, since the weights are of the same type as their matching input. We additionally present performance results for the Servo Data (Fig. 2), which are especially interesting. This is a small data set with only 4 variables (none of them continuous) and fewer than 200 examples.

Fig. 2. Generalization performance results for the Servo Data. Left: heavy overfit by the MLP network. Right: only RBF and HNN.

Effective Learning with Heterogeneous Neural Networks 337

For
this data set the MLP overfits heavily, due to the amplification in the number of variables introduced by the coding scheme of the nominal ones, which are many-valued.

6 Conclusions

A heterogeneous neuron that computes a similarity index in [0, 1], followed by a logistic function, has been proposed. Neural architectures making use of such neurons are termed heterogeneous neural networks. A body of experimental evidence has shown very promising results in several difficult learning tasks. These results also illustrate the importance of data preparation and of using a more adequate similarity measure that captures relations in the data. As further work, we are currently developing a clustering-based alternative training scheme.

References

1. Murphy, P.M., Aha, D.: UCI Repository of Machine Learning Databases. UCI Dept. of Information and Computer Science (1991)
2. Kaburlassos, V., Petridis, V.: Learning and decision-making in the framework of fuzzy lattices. In: New Learning Paradigms in Soft Computing. Physica (2002)
3. Natarajan, B.: Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Francisco (1991)
4. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
5. Anderson, J.A., Rosenfeld, E. (eds.): Neurocomputing: Foundations of Research. MIT Press, Cambridge (1988)
6. Omohundro, S.M.: Geometric Learning Algorithms. Technical Report 89-041, Univ. of Berkeley, CA (1989)
7. Prechelt, L.: Proben1: A Set of Neural Network Benchmark Problems and Benchmarking Rules. Fakultät für Informatik, Univ. Karlsruhe, T.R. 21/94
8. Tresp, V., Ahmad, S., Neuneier, R.: Training Neural Networks with Deficient Data. In: NIPS, vol. 6. Morgan Kaufmann, San Francisco (1994)
9. Chandon, J.L., Pinson, S.: Analyse Typologique. Masson (1981)
10. Hille, E.: Methods in Classical and Functional Analysis. Addison-Wesley, Reading (1971)
11. Mühlenbein, H., Schlierkamp-Voosen, D.: Predictive Models for the Breeder Genetic Algorithm. Evolutionary Computation 1(1) (1993)
12. Belanche, L.: Evolutionary Optimization of Heterogeneous Problems. In: Guervós, J.J.M., Adamidis, P.A., Beyer, H.-G., Fernández-Villacañas, J.-L., Schwefel, H.-P. (eds.) PPSN 2002. LNCS, vol. 2439, pp. 475–484. Springer, Heidelberg (2002)
13. Lin, D.: An Information-Theoretic Definition of Similarity. In: Proceedings of the International Conference on Machine Learning (1998)
14. Everitt, B.: Cluster Analysis. Heinemann Books (1977)
15. Zadeh, L.: Fuzzy Sets as a Basis for a Theory of Possibility. Fuzzy Sets and Systems 1, 3–28 (1978)
Pattern-Based Reasoning System Using Self-incremental Neural Network for Propositional Logic

Akihito Sudo, Manabu Tsuboyama, Chenli Zhang, Akihiro Sato, and Osamu Hasegawa

Dept. of Computer Intelligence and Systems Science, Tokyo Institute of Technology
Imaging Science and Engineering Lab., Tokyo Institute of Technology
4259 Nagatsuta-cho, Midori-ku, Yokohama, Japan
{sudo,hasegawa}@isl.titech.ac.jp
http://www.isl.titech.ac.jp/~hasegawalab/

Abstract. We propose an architecture for reasoning with pattern-based if-then rules that is effective for intelligent systems, such as robots, solving varying tasks autonomously in a real environment. The proposed system can store pattern-based if-then rules of propositional logic, including conjunctions, disjunctions, negations, and implications. Naive pattern-based reasoning can store pattern-based if-then rules and make inferences using them, but it remains insufficient for intelligent systems operating in a real environment. The proposed system uses an algorithm inspired by self-incremental neural networks such as SOINN and SOINN-AM in order to achieve incremental learning, generalization, avoidance of duplicate results, and robustness to noise, which are important properties for intelligent systems.

Keywords: Reasoning, neural network, intelligent agent, intelligent robot.

1 Introduction
Reasoning is an essential process of human intelligence. Although many reasoning systems have been proposed, they remain insufficient for intelligent systems that must act in the real world, such as robots. We consider that one reason is that most of them address only symbol-based if-then rules. Several reasoning systems have been recognized for their effectiveness and have been commercialized, but they perform well only in a specific domain. Although production systems are a widely used conventional reasoning approach, they require human experts to input their knowledge in advance. Consequently, such systems perform well only in the specific domain for which an expert has provided knowledge. In contrast, nobody can correctly predict the complete knowledge list for an intelligent system operating in a varying environment; it must therefore learn knowledge that is provided sequentially from its environment in an incremental manner. If the intelligent system uses a conventional symbol-based reasoning

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 338–347, 2008.
© Springer-Verlag Berlin Heidelberg 2008
Pattern-Based Reasoning System Using Self-incremental Neural Network 339
system to make inferences using knowledge learned from its environment, it must convert learned patterns to symbols before reasoning, because what it obtains from the environment through its sensors are patterns, not symbols. This strategy seems impractical at present and might remain impossible even in the future. Converting patterns to symbols remains a difficult task, in spite of the many studies that have addressed it as a classification problem; discriminating cats from dogs is not an easy problem for current classifiers. Problems would remain even after one invents a classifier as powerful as a human. Intelligent systems must generate new symbols when they encounter something that should be addressed by a symbol other than those they already know. Humans must think or act with or about an object before assigning a symbol to it; at that stage they must form inferences using patterns obtained from the object, because it has not yet been symbolized. Intelligent systems likewise must make inferences using patterns in order to yield new symbols.
Several approaches have been proposed to realize pattern-based reasoning systems. Tsukimoto proposed pattern-based reasoning in which perceptrons that have learned patterns are employed as the atomic propositions of if-then rules [4]. In [3], a pattern-based reasoning system was proposed in which binary vectors are employed as atomic propositions using a non-monotone neural network. The method proposed in [4] can deal with conjunction, disjunction, and negation, although the method proposed by Yamane et al. [3] cannot. On the other hand, patterns should be represented with vectors rather than perceptrons for robots, because it is difficult to determine when a new perceptron should be generated as a new proposition. Moreover, real-valued data should be handled, because sensor readings take real values, whereas [3] deals only with binary vectors. No previous study has realized a pattern-based reasoning system that can deal with conjunction, disjunction, and negation and that employs real-valued vectors as the representation of a pattern. In this paper, we propose a novel reasoning system with exactly these properties. The proposed method can learn if-then rules whose atomic propositions are patterns represented as real-valued vectors. For instance, the proposed method can learn "((A∧B)∨C)→(D∧E∧F)", where A–F are patterns represented as real-valued vectors.¹ When the proposed system, having learned if-then rules, obtains a fact, it produces an inference from the fact by picking up and connecting learned if-then rules. For example, if the system has learned (A∧B)→(C∨D) and C→(E∧¬F) and obtains "A∧B" as a fact, it outputs C∨D and (E∧¬F)∨D.
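A compound rule like "((A∧B)∨C)→(D∧E∧F)" can be decomposed into rules whose parts are single conjunctions by pairing every clause of the conditional part with every clause of the sequential part (this is the resolution step described in Sect. 2.1). A minimal sketch of that decomposition, assuming both parts arrive already in disjunctive normal form with a clause represented as a tuple of literal names (our illustration, not the authors' code):

```python
from itertools import product

def resolve_rule(cond_dnf, seq_dnf):
    """Resolve a rule whose conditional and sequential parts are in
    disjunctive normal form (lists of clauses; a clause is a tuple of
    literal names) into rules whose parts are single conjunctions."""
    return [(c, s) for c, s in product(cond_dnf, seq_dnf)]

# "((A∧B)∨C) → (D∧E∧F)" resolves into two conjunctive rules:
resolved = resolve_rule([("A", "B"), ("C",)], [("D", "E", "F")])
# [(("A", "B"), ("D", "E", "F")), (("C",), ("D", "E", "F"))]
```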
In addition to the basic function of pattern-based reasoning described above, some additional functions are realized by the proposed method for application to an intelligent system, such as a robot, operating in dynamic environments: incremental learning, generalization ability, avoidance of duplication of reasoning results, and robustness to noise. The proposed system achieves them through the use of a self-incremental neural network inspired by SOINN [1] and SOINN-AM [2].

¹ We use ∧, ∨, ¬, →, respectively, for conjunction, disjunction, negation, and implication.

340 A. Sudo et al.

Fig. 1. Two types of noisy data

Incremental learning of if-then rules is required because a user cannot provide the system with complete knowledge in advance. The proposed system can learn if-then rules that are provided incrementally without destroying previously learned knowledge. In [3], about 20% of inferences are incorrect after the system has incrementally learned as many new if-then rules as had previously been learned in a batch manner. The proposed system does not store knowledge that is the same as (or very similar to) previously stored knowledge. This enables the proposed system to avoid redundant memory consumption when it learns much knowledge incrementally. In a real environment, robots rarely obtain data identical to previously learned data; the proposed system can retrieve similar learned if-then rules when different data are given. The proposed system categorizes learned if-then rules in an online manner to avoid duplication of reasoning results. If the if-then rules were not categorized, similar if-then rules would be used as completely separate rules, which engenders duplicated results. For example, if a system that does not categorize knowledge stores "A→B" and "A'→B'", where (A, A') and (B, B') are very similar pattern pairs, both B and B' are output even when they can be interpreted as the same result.
Note that our approach categorizes if-then rules, not individual patterns. We consider that this approach enables intelligent systems to generate symbols as a result of their own reasoning. Two types of noise exist in real environments, as shown in Fig. 1. The noise shown in Fig. 1(a) taints the original data; the noise shown in Fig. 1(b) is a pattern that is independent of the original patterns. The proposed system is robust to both types of noise.

2 Proposed Method

The proposed system consists of long-term memory, short-term memory, a learning machine, and a reasoning machine, as shown in Fig. 2. The long-term memory stores if-then rules learned from environments. The short-term memory temporarily stores input data. The learning machine executes a learning algorithm when if-then rules are provided; during learning, unnecessary if-then rules are eliminated and the remaining rules are categorized into several clusters in an online manner. The reasoning machine executes a reasoning algorithm and outputs an OR tree as a reasoning result when a fact like A∧B is given. Not only this minimal functionality, but also incremental learning, generalization, avoidance of duplication of reasoning results, and robustness to noise are achieved through the use of
algorithms that are inspired by a self-incremental neural network based on SOINN [1] and SOINN-AM [2].

Fig. 2. Architecture of the proposed system

2.1 Learning Algorithm

The learning machine executes the learning algorithm when the system obtains an if-then rule. First, the input if-then rule is resolved into several if-then rules in which both the conditional part and the sequential part are conjunctions of literals. This is accomplished as follows: both the conditional part and the sequential part are transformed into disjunctive normal form; then every combination of clauses is picked up. For example, if "(A∧B)∨(C∧¬D)→(¬E∧F)" is input, it is resolved into "(A∧B)→(¬E∧F)" and "(C∧¬D)→(¬E∧F)". Note that the conditional part and the sequential part of any if-then rule can be transformed into disjunctive normal form, because any logical formula can be. The resolved if-then rules are stored temporarily in the short-term memory.

The system would suffer a high computational load and duplication of reasoning results if all the resolved if-then rules were simply stored in the long-term memory. The steps described below let the learning machine determine which if-then rules should be stored in the long-term memory and categorize the if-then rules in the long-term memory into several clusters using edges, all in an online manner. These steps are executed repeatedly for every resolved if-then rule. Note that a "learning datum" below denotes a single resolved if-then rule.

First, the most similar if-then rule (first winner) and the second-most similar if-then rule (second winner) to a learning datum are sought among the if-then rules in the long-term memory that have the same form as the learning datum.
Here, having the same form means that the numbers of positive and negative propositions in the conditional part and in the sequential part are respectively identical. For example, "A∧¬B→C" is regarded as having the same form as "D∧¬E→F", but not the same form as "¬D∧¬E→F". The similarity between same-form if-then rules is calculated using the following distance measure d:

d(k, k′) = (1/√D) [ min_{{k_1,…,k_M} ⊆ N_M} Σ_{i=1}^{M} ‖P_i − P′_{k_i}‖ + min_{{k_1,…,k_N} ⊆ N_N} Σ_{i=1}^{N} ‖Q_i − Q′_{k_i}‖ + min_{{k_1,…,k_M̃} ⊆ N_M̃} Σ_{i=1}^{M̃} ‖P̃_i − P̃′_{k_i}‖ + min_{{k_1,…,k_Ñ} ⊆ N_Ñ} Σ_{i=1}^{Ñ} ‖Q̃_i − Q̃′_{k_i}‖ ],

where D represents the dimension of the patterns, N_M = {1, 2, …, M}, N_N = {1, 2, …, N} (N_M̃ and N_Ñ are defined analogously), and the learning datum k and an if-then rule k′ stored in the long-term memory are denoted respectively as

k:  (⋀_{i=1}^{M} P_i) ∧ (⋀_{j=1}^{N} ¬Q_j) → (⋀_{i=1}^{M̃} P̃_i) ∧ (⋀_{j=1}^{Ñ} ¬Q̃_j),
k′: (⋀_{i=1}^{M} P′_i) ∧ (⋀_{j=1}^{N} ¬Q′_j) → (⋀_{i=1}^{M̃} P̃′_i) ∧ (⋀_{j=1}^{Ñ} ¬Q̃′_j).

Then, whether the learning datum is a member of the same cluster as both the first winner and the second winner is determined; the method of determination is described at the end of this subsection. If the learning datum is not judged to belong to the same cluster as both winners, it is added to the long-term memory as new knowledge and the learning process for this learning datum finishes. If the learning datum is a member of the same cluster as both winners, it is not added to the long-term memory, and the previously stored knowledge is updated as follows. An edge between the first winner and the second winner is generated if it does not yet exist, and the age of that edge is set to zero. Edges are used to generate clusters of if-then rules in the long-term memory: if-then rules connected by an edge are regarded as members of the same cluster. Subsequently, the age of every edge emanating from the first winner is increased by 1, and every edge with age greater than a parameter Λ_edge is removed. If-then rules in the long-term memory that have fewer than two edges are removed whenever the number of learning data provided so far is an integer multiple of a parameter λ. The noise depicted in Fig. 1(b) is expected to be eliminated by this step.

Here, we describe the method for determining whether a learning datum is a member of the same cluster as both the first and the second winner. The similarity threshold of the first (or second) winner is calculated as follows:

s = max_{k′ ∈ N} d(k′, k_w)   (if N ≠ ∅),
s = max_{k′ ∈ A} d(k′, k_w)   (if N = ∅),
where N is the set of all if-then rules connected to the first (or second) winner by an edge, A is the set of all if-then rules in the long-term memory, and k_w denotes the first (or second) winner. The learning datum is regarded as a member of the same cluster as both winners if its distance to the first winner is less than the similarity threshold of the first winner and its distance to the second winner is less than the similarity threshold of the second winner.

2.2 The Reasoning Algorithm

When a fact such as A∧B is input, the system generates a tree whose root is the input fact and which initially has no other node. The distance between the fact and the conditional part of each if-then rule in the long-term memory is calculated. If conditional parts with distance less than δ_r exist, their sequential parts are added to the tree as children of the root, where only the nearest sequential part is added if several sufficiently close ones exist in the same cluster. The reasoning process terminates if no child is added to the tree. If new children were added, the same steps are applied to the children instead of the fact, and this is repeated as long as new children are generated. Figure 3 shows an example of an output tree.

Fig. 3. Example of an OR tree as a result of reasoning
The reasoning machine outputs the tree of Fig. 3 when A'∧B' is obtained as a fact, if the long-term memory holds "A∧B→C∨(D∧¬E)", "C'→F∨G∨(H∧I)", and "D'∧¬E'→¬J", where (A, A') to (E, E') are pairs of sufficiently similar patterns. An output tree should be read as an OR tree; a deduced proposition is then a disjunctive normal form generated by combining several propositions in the nodes. If the reasoning machine outputs the tree shown in Fig. 3, then "C∨(D∧¬E)", "C∨¬J", "F∨G∨(H∧I)∨(D∧¬E)", and "F∨G∨(H∧I)∨¬J" are all correct reasoning results. A reasoning result represented as a disjunctive normal form can be interpreted as a possibility about the environment: if the present location is input as a fact and "the supermarket or the station or the drug store" is the reasoning result, the system has discovered the places it may be able to reach from the present location.
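The learning-phase distance measure, the similarity threshold, and the reasoning loop can be sketched together. This is a simplified illustration under stated assumptions, not the authors' implementation: patterns are tuples of floats, a rule's four parts are lists of such patterns, and the reasoning step treats each tree node as a single conjunctive pattern (negation flags and full OR-tree bookkeeping are omitted).

```python
from itertools import permutations
from math import dist, sqrt

def min_matching(xs, ys):
    """Min over one-to-one index assignments of the summed distances
    ||x_i - y_{k_i}||; brute force is fine for the few literals per rule."""
    if not xs:
        return 0.0
    return min(
        sum(dist(x, ys[k]) for x, k in zip(xs, perm))
        for perm in permutations(range(len(ys)), len(xs))
    )

def rule_distance(a, b, dim):
    """Distance d between two same-form rules; each rule is a 4-tuple
    (positive cond., negated cond., positive seq., negated seq. patterns)."""
    return sum(min_matching(p, q) for p, q in zip(a, b)) / sqrt(dim)

def similarity_threshold(winner, neighbors, all_rules, dim):
    """Threshold s: max distance to edge-connected neighbors, or to all
    stored rules when the winner has no neighbors."""
    pool = neighbors if neighbors else all_rules
    return max(rule_distance(winner, r, dim) for r in pool)

def build_or_tree(fact, rules, delta_r, depth=10):
    """Reasoning phase: grow an OR tree from `fact`.  `rules` is a list of
    (conditional pattern, sequential pattern, cluster id); per frontier
    node, each cluster contributes at most its nearest rule within delta_r."""
    if depth == 0:               # guard against rule cycles
        return {}
    best = {}                    # cluster id -> (distance, sequential part)
    for cond, seq, cid in rules:
        d = dist(fact, cond)
        if d < delta_r and (cid not in best or d < best[cid][0]):
            best[cid] = (d, seq)
    return {seq: build_or_tree(seq, rules, delta_r, depth - 1)
            for _, seq in best.values()}

# Toy 1-D patterns chaining 0.0 -> 1.0 -> 2.0, queried with a noisy fact:
toy = [((0.0,), (1.0,), 0), ((1.0,), (2.0,), 1)]
tree = build_or_tree((0.05,), toy, delta_r=0.3)
```

The per-cluster "nearest rule only" restriction is what prevents many near-duplicate stored rules from flooding the tree with interchangeable children.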
Fig. 4. Neural Network Model of the Proposed Method

2.3 Neural Network Model

The architecture of the proposed system is as shown in Fig. 2 in terms of knowledge engineering; the proposed system can also be interpreted as a self-incremental neural network like SOINN [1], SOINN-AM [2], and Growing Neural Gas [5]. The proposed neural network model has an input layer and a competitive layer, as shown in Fig. 4. The input layer obtains an if-then rule or a fact. New nodes are generated adaptively when the input layer obtains an if-then rule. Each if-then rule is represented by several nodes connected by edges, whereas other self-incremental neural networks represent each learned datum with a single node. These edges are of a different type from those used to generate clusters. The proposed model requires m × n neurons to represent an if-then rule whose conditional part has m literals and whose sequential part has n literals. Each neuron holds a pair of literals, picked one each from the conditional part and the sequential part, which means that each neuron holds two vectors and two negation flags. The neurons representing A∧¬B→¬C∧D are shown in Fig. 4.

3 Simulation Results

In the experiment, the proposed system learned sequentially provided if-then rules whose atomic propositions were images taken in a real environment; it then produced inferences using the learned knowledge. We used 280 images of the 14 objects shown in Fig. 5, taken from 20 different angles. Each image has 56 × 46 pixels. The if-then rules the system learned were "A→B",
F. Elevator G. Desk-1 H. Desk-2 I. Desk drawer-1 (close) J. Desk drawer-2 (close) K. Desk drawer-1 (open) L. Desk drawer-2 (open) M. Down stairs N. Room plate D. Wall E. Hallway O. White Noise Fig. 5. Images for the experiment Pattern-Based Reasoning System Using Self-incremental Neural Network 345
Table 1. Number of members in the clusters and compression rate

Cluster No.   If-then rule     Number of members   Compression rate (%)
 1            A→B                     87                 78.25
 2            B→D                    145                 63.75
 3            B→E                     50                 87.5
 4            E→(C∧N)                 64                 84
 5            E→F                     31                 92.25
 6            E→M                    146                 63.5
 7            (C∧N)→(G∧I)             84                 79
 8            (C∧N)→(H∧J)            111                 72.25
 9            (G∧I)→K                140                 65
10            (H∧J)→L                122                 69.5
Total         —                      980                 75.5

Fig. 6. Reasoning results

Fig. 7. Ratio of correct reasoning when noisy data are given (abscissa: SN ratio; ordinate: correct reasoning ratio (%); one curve each for δ_r = 0.1, 0.2, and 0.3)

"B→D∨E", "E→(C∧N)∨F∨M", "(C∧N)→(G∧I)∨(H∧J)", "(G∧I)→K", and "(H∧J)→L". They mean "the closed door → the open door", "the open door → the wall ∨ the hallway", "the hallway → (the laboratory ∧ the room plate) ∨ the elevator ∨ the down stairs", "the laboratory ∧ the room plate → (the desk 1 ∧ the closed drawer 1) ∨ (the desk 2 ∧ the closed drawer 2)", "(the desk 1 ∧ the
closed drawer 1) → the open drawer 1", and "(the desk 2 ∧ the closed drawer 2) → the open drawer 2". The system obtained if-then rules, each atomic proposition of which was selected randomly from the 20 different-angled images. Random selection from among them provided learning data 4400 times for each if-then rule. The parameters were set to Λ_edge = 100 and λ = 50.

The number of if-then rules in the long-term memory after learning was 980; these rules were categorized into 10 different clusters. Each cluster had members corresponding only to one of "A→B", "B→D", "B→E", "E→(C∧N)", "E→F", "E→M", "(C∧N)→(G∧I)", "(C∧N)→(H∧J)", "(G∧I)→K", or "(H∧J)→L"; for example, one cluster consisted only of if-then rules corresponding to "A→B". Consequently, the system was able to categorize into the same cluster exactly those if-then rules that are the same rule in a symbol-based sense. Table 1 shows the number of members of each cluster and the data compression rates. We estimated the compression rate as (1 − N_m/N_l), where N_m is the number of members of the cluster and N_l is the number of if-then rules provided to the system as learning data. Although the learning data consisted of 4000 if-then rules whose atomic propositions all differed from one another, only 980 rules in all were stored in the long-term memory; the system eliminated 3020 rules as similar to previously stored rules. Note that the system obtained the learning data sequentially. These results indicate that the system learns new data without destroying previously learned data and avoids consuming memory by eliminating similar data.

We provided one learned image as a fact to test whether the system can produce inferences correctly. The parameter was set to δ_r = 0.3. The reasoning result is shown in Fig. 6. This tree has neither omissions nor duplications. The system was able to avoid duplication of results because it generated clusters appropriately; when we made the system regard every if-then rule in the long-term memory as belonging to a different cluster, 500,376 nodes were created in the output tree. The reasoning result can be interpreted as follows: the system discovered where it can reach from the front of the closed door. The reasoning about the contents of the closed drawers shows the importance of storing if-then rules that contain a conjunction: because the two closed drawers appear almost identical, the system would not make a correct inference if it were unable to deal with the conjunction. To examine whether the system can reason correctly under the type of noisy data shown in Fig. 1(b), we provided ten white noise samples, one example of which is shown in Fig. 5O. The system made no child node for the white noises, because it was able to judge that all of them consisted of unknown patterns. The system proposed in [3] cannot judge whether an input fact has been learned.
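The compression rates in Table 1 follow directly from the formula just given. A quick check, using N_l = 400 learning data per resolved rule (our reading of the 4000 total spread over ten rules) and recovering cluster 1's member count from the 980 total:

```python
# Cluster member counts N_m from Table 1 (cluster 1's value, 87, is
# recovered as 980 minus the sum of the other nine clusters).
members = [87, 145, 50, 64, 31, 146, 84, 111, 140, 122]
n_l = 400  # assumed learning data per resolved rule: 4000 total / 10 rules

rates = [round(100 * (1 - m / n_l), 2) for m in members]       # per cluster
total = round(100 * (1 - sum(members) / (10 * n_l)), 2)        # overall
# rates[1] reproduces 63.75 and total reproduces 75.5, as in Table 1
```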
It outputs a wrong pattern, entirely unrelated to the learned patterns, when it obtains an unknown pattern. We generated noisy data by adding Gaussian noise to the 20 images of Fig. 5A to examine whether the system can reason correctly under the type of noisy data shown in Fig. 1(a). Ten noisy images were generated from each image at each noise level, which means that the system obtained 200 noisy images per noise level. We estimated the correct reasoning ratio as "the number of
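Generating test images at a prescribed SN ratio can be sketched as follows. The paper does not state which SNR convention it uses; the sketch assumes the common power-ratio definition SNR = 10·log10(P_signal/P_noise) in decibels, and treats an image as a flat list of pixel intensities:

```python
import math
import random

def add_gaussian_noise(image, snr_db, rng=random.Random(0)):
    """Return a copy of `image` (flat list of floats) with zero-mean
    Gaussian noise scaled so the signal-to-noise ratio is `snr_db` dB.
    The SNR convention (10*log10 of the power ratio) is our assumption."""
    p_signal = sum(x * x for x in image) / len(image)
    p_noise = p_signal / (10 ** (snr_db / 10))
    sigma = math.sqrt(p_noise)
    return [x + rng.gauss(0.0, sigma) for x in image]

# A flat 56x46 "image" of constant intensity, corrupted at 20 dB:
clean = [0.5] * (56 * 46)
noisy = add_gaussian_noise(clean, snr_db=20)
```

Sweeping `snr_db` over a range and counting correct inferences per level is then enough to reproduce a curve of the kind shown in Fig. 7.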
correct reasoning / the number of reasoning trials with noisy images". Figure 7 shows that a boundary SN ratio exists: if the SN ratio is greater than the boundary, the system outputs the same result as it does for the original image; if the SN ratio of a fact is less than the boundary, the system regards the fact as an unknown pattern and generates no child node for it. The boundary can be configured through parameter setting. The system proposed in [3] also has a noise-level boundary for producing correct inferences, but a user cannot configure it.

4 Conclusion and Future Work

We proposed a novel reasoning system that can learn pattern-based if-then rules including conjunction, disjunction, and negation. The proposed system is suitable for systems operating in the real world, such as intelligent robots, because it achieves incremental learning, generalization, avoidance of duplication of results, and robustness to noise. We will develop this study in two directions. The first is applying the results of this study to the development of a robot: using the pattern-based reasoning system, a robot is expected to learn knowledge autonomously from its environment and to solve tasks by reasoning with the learned knowledge. The second is an extension of the proposed system that allows a robot to solve more complicated problems.

Acknowledgments. This study was supported by the Industrial Technology Research Grant Program in 2004 from the New Energy and Industrial Technology Development Organization (NEDO) of Japan.

References

1. Shen, F., Hasegawa, O.: An Incremental Network for On-line Unsupervised Classification and Topology Learning. Neural Networks 19, 90–106 (2006)
2. Sudo, A., Sato, A., Hasegawa, O.: Associative Memory for Online Incremental Learning in Noisy Environments. In: Proc. of the 2005 International Joint Conference on Neural Networks (IJCNN 2005) (accepted, 2007)
3. Yamane, K., Hasuo, T., Suemitsu, A., Morita, M.: Pattern-Based Reasoning Using Trajectory Attractors. IEICE Trans. Information and Systems J90-D, 933–944 (2007)
4. Tsukimoto, H.: Pattern Reasoning: Logical Reasoning of Neural Networks. IEICE Trans. Information and Systems J83-D-II, 744–753 (2000)
5. Fritzke, B.: A Growing Neural Gas Network Learns Topologies. In: Proc. of Advances in Neural Information Processing Systems (NIPS), pp. 625–632 (1995)