$$\pi_s(u \mid \mathcal{D}_s) = \frac{p_u(T_s \mid X_s)\,\pi_s(u)}{\sum_{u=1}^{U} p_u(T_s \mid X_s)\,\pi_s(u)} = \frac{\prod_{n=1}^{N_s} p_u(t_s^n \mid x_s^n)\,\pi_s(u)}{\sum_{u=1}^{U} \prod_{n=1}^{N_s} p_u(t_s^n \mid x_s^n)\,\pi_s(u)}, \qquad (3)$$
where π_s denotes the subjective belief of the agent for user s, and π_s(u) is the prior belief over the models. With this posterior, the Bayesian predictive distribution is given as

$$\bar{p}_s(t_s \mid x_s) \equiv \sum_{u=1}^{U} \pi_s(u \mid \mathcal{D}_s)\, p_u(t_s \mid x_s). \qquad (4)$$

The final prediction is made, according to the predictive distribution, as

$$\hat{t}_s = \operatorname*{argmax}_{t_s}\; \bar{p}_s(t_s \mid x_s = x_s^{\mathrm{new}}), \qquad (5)$$

which we refer to as (Bayesian) collaborative prediction in this study. We note that in Eq. (4) the predictive distribution no longer depends on the other users' data. This may be an advantage in some distributed environments, such as when the users are at different sites and communication between them is costly.
The posterior probability of other users' models becomes large when user s has only a limited number of training data and there exist users similar to user s, in the sense that they generate similar outputs given the same inputs. The posterior integration then incorporates additional knowledge from the similar users' models into the prediction for user s. This introduced knowledge is expected to improve the prediction accuracy. In addition, the model averaging reduces the uncertainty/variance of the model. On the other hand, if user s has a sufficient amount of data or there are no similar users, the posterior probability p(u = s) will be close to one, which reduces the collaborative predictor to the individual predictor; this is quite natural, because either the model is successfully learned individually or user s is isolated.
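To make Eqs. (3)-(5) concrete, the following is a minimal sketch of collaborative prediction for discrete targets. The function names and the predictor interface are our own illustration (the paper's predictors are Bayesian networks trained with the deal package in R), and working in log space is an implementation detail not discussed in the paper.

```python
import numpy as np

def collaborative_predict(models, D_s, x_new, targets, prior=None):
    """Bayesian collaborative prediction, Eqs. (3)-(5).

    models  -- list of U conditional models; each is a callable p(t, x)
               returning the probability of target t given input x
    D_s     -- training pairs [(x, t), ...] of user s
    x_new   -- new input for user s
    targets -- all possible target realizations
    """
    U = len(models)
    prior = np.full(U, 1.0 / U) if prior is None else np.asarray(prior, float)

    # Eq. (3): posterior over models, proportional to prior * likelihood of D_s.
    # Log space avoids underflow when N_s grows.
    log_post = np.log(prior)
    for u, p_u in enumerate(models):
        log_post[u] += sum(np.log(p_u(t, x)) for x, t in D_s)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # Eq. (4): predictive distribution as a posterior-weighted model average.
    pred = {t: sum(post[u] * models[u](t, x_new) for u in range(U))
            for t in targets}

    # Eq. (5): predict the mode of the predictive distribution.
    return max(pred, key=pred.get), post
```

As the paragraph above describes, when D_s is large the posterior mass concentrates on user s's own model and the collaborative predictor reduces to the individual one.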
4 Experimental Results

4.1 Printer Usage Data

We evaluated the proposed approach on a real-world user modeling task. The dataset used here, which we refer to as the "printer usage" dataset, is a set of electronic log data collected through the daily use of printer devices shared via a network within a section of the company to which one of the authors belongs. The motivation has been presented in detail in [5], although the current task setting and dataset differ slightly from the previous ones.

The log data consist of many records, where a single record corresponds to one printing job output via the network to one of the shared printers. Each record originally consists of a number of attributes, including user ID, context-related identifiers such as date or time, and content- or function-related identifiers such as the number of document pages or the use of duplex/simplex printing. The aim of the experiment here, however, is not to construct full user models but to evaluate the basic performance of our approach. We therefore pre-processed the original log data to produce a compact task setting appropriate for the evaluation purpose.

The log data were first separated into those of each individual according to the user ID. Then, in each individual log, only the records that meet the following condition were extracted: paper size is A4, with only one copy and only one page per single sheet. This is usually the default setting of printer interfaces, and the usage of most users is strongly biased toward this setting. By limiting to the default condition, the distribution of the other attributes became regularly balanced, and the dataset thus became suitable for normative performance evaluation.

In this reduced dataset, we consider the five attributes other than the fixed attributes, which are summarized in Table 1. The values of the attribute modelName were replaced by anonymous ones. We quantized the values of the attribute docPages, which can take any natural number, as shown in the table. The attribute docExt may originally take various values, but the number of frequently appearing values is small, such as pdf, xls, html, etc. We thus extracted only the records that include these frequent values and removed the others from the experiment. Finally, we removed the records including missing values. In this experiment, we also fixed both the input and target variables within the five attributes; that is, we consistently set x = {docExt, docPages} and t = {modelName, duplex, colorMode}, where the numbers of possible realizations are |x| = 25 and |t| = 20. After the pre-processing, the total number of users became 76, and the numbers of individual training data varied widely, within the range from 2 to 1,192 (Fig. 2).

Table 1. Five attributes in printer logs

Attribute   Description                            Values
modelName   Anonymous form of model names          Pr1, Pr2, Pr3, Pr4, Pr5
docExt      File extension of original document    doc, html, pdf, ppt, xls
docPages    Number of pages in original document   1, 2-5, 6-20, 21-50, 51-over
duplex      Duplex or simplex                      duplex, simplex
colorMode   Color mode                             monochrome, fullColor

[Fig. 2. The original numbers of data for 76 users. Horizontal axis: user index (sorted); vertical axis: number of training data. Plot omitted.]

4.2 Simulation Setting

Although our method is particularly expected to improve the prediction performance of users located toward the right of Fig. 2 (those with small sample sizes) in real situations, the limited number of data for such users prevents a quantitative evaluation of prediction performance. In this experiment, therefore, we tested the collaborative predictors for the top 16 users (the leftmost users in Fig. 2), artificially varying the number of training data, N_u, from relatively small to large by random selection. For each setting of N_u, twenty runs were conducted, in each of which both the collaborative and individual predictors were constructed from the same data and their performances evaluated (see below). The number of test data was commonly set at 200 in every run.

In each single run for a user s, both the training data and test data were randomly selected from the original data without overlap. The BN for user s was first constructed from the training dataset. The BN was trained in a standard manner as described in Sec. 2, where our implementation used the "deal" package written in the language R. For structure learning, we simply used the heuristic optimization provided by "deal" as the function heuristic() (see [1] for details), which searches for a good structure by a partially greedy stepwise ascent of the score function starting from multiple initial conditions. The Dirichlet hyperparameters were commonly set at a constant such that the total number of prior pseudo-counts [6] was five. To obtain the conditional model p_s(t | x_s) from the resultant joint distribution, we used the enumerative method.

After learning, with this individual predictor p_s, we constructed the corresponding collaborative predictor p̄_s as follows. First, prior to the simulation, we trained individual predictors for all the 76 users based on the numbers of data shown in Fig. 2, without limiting the number of the top 16 users' data. An ensemble of 76 individual predictors, p_1^org, p_2^org, ..., p_U^org, which we refer to as the original ensemble, was thus prepared in advance. The collaborative predictor p̄_s was then obtained by replacing p_s^org in the original ensemble with p_s, and making p̄_s based on the replaced
ensemble. The prior π_s(u) was simply set to be uniform. To evaluate the prediction performance on the test dataset, we calculated the test accuracy, defined as the fraction of test cases in which the predictions of the three target variables were all correct.
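As a small worked illustration of this metric (our own sketch; the tuple encoding is hypothetical), a test case counts as correct only when all three targets match:

```python
def test_accuracy(predictions, truths):
    """Fraction of test cases in which the predicted (modelName, duplex,
    colorMode) triple matches the true one exactly."""
    hits = sum(p == t for p, t in zip(predictions, truths))
    return hits / len(truths)

preds = [("Pr1", "simplex", "monochrome"), ("Pr2", "duplex", "fullColor")]
trues = [("Pr1", "simplex", "monochrome"), ("Pr2", "duplex", "monochrome")]
print(test_accuracy(preds, trues))  # 0.5 -- the second case misses colorMode
```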
4.3 Results

Fig. 3 shows the test accuracies of the individual and collaborative predictors. Each panel corresponds to one of the 16 users; the vertical axis denotes the test accuracy and the horizontal axis the number of training data, where only the cases with N_u = 5, 10, 20, 30, 40, and 50 are shown. In this figure, the individual predictors often exhibited relatively lower accuracy and larger variances when N_u was less than about 20, in comparison with the cases with larger N_u. In contrast, the collaborative predictors improved on these undesirable results of the individual predictors: they showed a higher mean accuracy for a number of users, and smaller variances for almost all users.

[Fig. 3. Test accuracy for 16 users. The vertical axis denotes the test accuracy and the horizontal the number of training data. The solid (black) and dashed (gray) lines respectively show the mean over the 20 runs by the collaborative (proposed) and individual predictors. The errorbar represents ±SD (standard deviation). Panels omitted.]

Fig. 4 (left) shows the improvement in test accuracy achieved by the Bayesian collaboration over the individual predictor (i.e., the accuracy of the collaborative predictor minus that of the corresponding individual one). This figure collectively plots the results of the 16 users, where each point denotes the improvement in a single run for a single user. An improvement was achieved in many runs, especially with small samples. Fig. 4 (right) shows the variance of test accuracy of each individual user against the number of training data. Each errorbar denotes the standard deviation over the 16 users. This figure clearly shows that the variance of test performance was substantially reduced by collaborative prediction in comparison with individual prediction, especially in the cases of small samples.

[Fig. 4. Left: The difference in test accuracy (Collaborative − Individual), collectively showing the 16 users' results; a mark x denotes an actual value of test accuracy, and a solid line denotes the mean value, with an errorbar of ±SD. Right: The individual variance of test accuracy; solid and dashed lines respectively denote the collaborative and individual predictors, and an errorbar denotes the standard deviation over the 16 users. Plots omitted.]

5 Summary

In this article, we proposed a new collaborative framework for user modeling, with special interest in its application to BNs, which have recently become a popular modeling tool in general user modeling tasks. Our method is essentially a simple application of the Bayesian principle, but the key idea of regarding the other users' models as candidates for the target user's model is consistent with the basic assumption of collaborative methods, i.e., that there may be other users similar to the target user. The effectiveness of the proposed method was demonstrated on a real-world dataset related to user modeling.
While the improvements by our method were shown only in a rather limited range of sample sizes (say, less than 20), it should be noted that this range is likely to extend in more realistic problems with larger numbers of variables, where the required amount of training data increases. More detailed performance evaluation, as well as the investigation of remaining issues such as the computational cost and the effective setting of the prior distribution π_s(u), are left as future work.

References

1. Böttcher, S.G., Dethlefsen, C.: deal: A package for learning Bayesian networks. Journal of Statistical Software 8(20) (2003)
2. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proc. 14th Conf. on Uncertainty in Artificial Intelligence, pp. 43–52. Morgan Kaufmann, San Francisco (1998)
3. Godoy, D., Amandi, A.: User profiling in personal information agents: a survey. Knowl. Eng. Rev. 20(4), 329–361 (2005)
4. Lauritzen, S.L., Spiegelhalter, D.J.: Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B 50(2), 157–224 (1988)
5. Nakatomi, M., Iga, S., Shinnishi, M., Nagatsuka, T., Shimada, A.: What affects printing options? – Toward personalization & recommendation system for printing devices. In: International Conference on Intelligent User Interfaces (Workshop: Beyond Personalization 2005) (2005)
6. Neapolitan, R.E.: Learning Bayesian Networks. Prentice-Hall, Inc., Upper Saddle River (2003)
7. Niculescu-Mizil, A., Caruana, R.: Inductive transfer for Bayesian network structure learning. In: Proc. 11th International Conf. on AI and Statistics (2007)
8. Pearl, J.: Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo (1988)
9. Schafer, J.B., Konstan, J.A., Riedl, J.: E-commerce recommendation applications. Data Mining and Knowledge Discovery 5(1–2), 115–153 (2001)
10. Webb, G.I., et al.: Machine learning for user modeling. User Modeling and User-Adapted Interaction 11(1–2), 19–29 (2001)
11. Zhang, Y., Burer, S., Nick Street, W.: Ensemble pruning via semi-definite programming. Journal of Machine Learning Research 7, 1315–1338 (2006)
12. Zukerman, I., Albrecht, D.: Predictive statistical models for user modeling. User Modeling and User-Adapted Interaction 11(1) (2001)

Discovery of Linear Non-Gaussian Acyclic Models in the Presence of Latent Classes

Shohei Shimizu^{1,2}
and Aapo Hyvärinen^{1}

^{1} Helsinki Institute for Information Technology, Finland
^{2} The Institute of Statistical Mathematics, Japan
http://www.hiit.fi/neuroinf

Abstract. An effective way to examine causality is to conduct an experiment with random assignment. However, in many cases it is impossible or too expensive to perform controlled experiments, and hence one often has to resort to methods for discovering good initial causal models from data which do not come from such controlled experiments. We have recently proposed such a discovery method based on independent component analysis (ICA), called LiNGAM, and shown how to completely identify the data-generating process under the assumptions of linearity, non-gaussianity, and no latent variables. In this paper, after briefly recapitulating this approach, we extend the framework to cases where latent classes (hidden groups) are present. The model identification can be accomplished using a method based on ICA mixtures. Simulations confirm the validity of the proposed method.

1 Introduction

An effective way to examine causality is to conduct an experiment with random assignment [1]. However, in many cases it is impossible or too expensive to perform controlled experiments. Hence one often has to resort to methods for discovering good initial causal models from data which do not come from such controlled experiments, though obviously one can never fully prove the validity of a causal model from such uncontrolled data alone. Thus, developing methods for causal inference from uncontrolled data is a fundamental problem with a very large number of potential applications, such as the social sciences [2], gene network estimation [3] and brain connectivity analysis [4].

Previous methods developed for statistical causal analysis of non-experimental data [2, 5, 6] generally work in one of two settings. In the case of discrete data, no functional form for the dependencies is usually assumed. On the other hand, when working with continuous variables, a linear-Gaussian approach is almost invariably taken and has hence been based solely on the covariance structure of the data. Because of this, additional information (such as the time order of the variables and prior information) is usually required to obtain a full causal model of the variables. Without such information, algorithms based on the Gaussian assumption cannot in most cases distinguish between multiple equally possible causal models.
We have recently shown that when working with continuous-valued data, a significant advantage can be achieved by departing from the Gaussianity assumption [7,8,9]. The linear-Gaussian approach usually only leads to a set of possible models that are equivalent in their covariance structure. The simplest such case is that of two variables, x_1 and x_2. A method based only on the covariance matrix has no way of preferring x_1 → x_2 over the reverse model x_1 ← x_2 [2, 7].
However, a linear non-Gaussian setting actually allows the linear acyclic model to be uniquely identified [9]. In this paper, we extend our previous work to cases where latent classes (hidden groups) are present. The paper is structured as follows. In Section 2 we briefly describe the basics of LiNGAM and subsequently extend the framework in Section 3. Some illustrative examples are provided in Section 4, and the proposed method is empirically evaluated in Section 5. Section 6 concludes the paper.

2 LiNGAM

Here we provide a brief review of our previous work [9]. Assume that we observe data generated from a process with the following properties:

1. The observed variables x_i, i = {1, ..., n}, can be arranged in a causal order k(i), defined to be an ordering of the variables such that no later variable in the order participates in generating the value of any earlier variable. That is, the generating process is recursive [2], meaning it can be represented graphically by a directed acyclic graph (DAG) [5, 6].

2. The value assigned to each variable x_i is a linear function of the values already assigned to the earlier variables, plus a 'disturbance' (noise) term e_i, and plus an optional constant term μ_i, that is

$$x_i = \sum_{k(j) < k(i)} b_{ij} x_j + e_i + \mu_i. \qquad (1)$$

3. The disturbances e_i are all continuous random variables having non-Gaussian distributions with zero means and non-zero variances, and the e_i are independent of each other, i.e. $p(e_1, \ldots, e_n) = \prod_i p_i(e_i)$.

A model with these three properties we call a Linear, Non-Gaussian, Acyclic Model, abbreviated LiNGAM. We assume that we observe a large number of data vectors x (containing the components x_i), each generated according to the above process, with the same causal order k(i), the same coefficients b_ij, the same constants μ_i, and the disturbances e_i sampled independently from the same distributions. Note that the above assumptions imply that there are no unobserved (latent) confounders [5] (hidden variables). Spirtes et al. [6] call this the causally sufficient case.

To see how we can identify the parameters of the model from the set of data vectors x, we start by subtracting out the mean of each variable x_i, leaving us with the following system of equations:

$$x = Bx + e, \qquad (2)$$
where B is a matrix that contains the coefficients b_ij and that could be permuted (by simultaneous equal row and column permutations) to strict lower triangularity if one knew a causal ordering k(i) of the variables. (Strict lower triangularity is here defined as lower triangular with all zeros on the diagonal.) Solving for x one obtains

$$x = Ae, \qquad (3)$$

where A = (I − B)^{−1}. Again, A could be permuted to lower triangularity (although not strict lower triangularity; in this case all diagonal elements will in fact be non-zero) with an appropriate permutation k(i). Taken together, equation (3) and the independence and non-Gaussianity of the components of e define the standard linear independent component analysis (ICA) model [10,11], which is known to be identifiable.

While ICA is essentially able to estimate A (and W = A^{−1}), there are two important indeterminacies that ICA cannot solve. First and foremost, the order of the independent components is in no way defined or fixed. Thus, we could reorder the independent components and, correspondingly, the columns of A (and rows of W) and get an equivalent ICA model (the same probability density for the data). In most applications of ICA, this indeterminacy is of no significance and can be ignored, but in LiNGAM we can and must find the correct permutation, as described in [9]: the correct permutation is the only one which has no zeros in the diagonal.

The second indeterminacy of ICA concerns the scaling of the independent components. In ICA, this is usually handled by assuming all independent components to have unit variance, and scaling W and A appropriately. On the other hand, in LiNGAM (as in structural equation modeling, SEM [2]) we allow the disturbance variables to have arbitrary (non-zero) variances, but fix their weight (connection strength) to their corresponding observed variable to unity. This requires us to re-normalize the rows of W so that all the diagonal elements equal unity in order to obtain B.

Our LiNGAM discovery algorithm [9] can thus be briefly summarized: first, use a standard ICA algorithm to obtain an estimate of the demixing matrix W; permute its rows such that there are no zeros on its diagonal; rescale each row by dividing by the element on the diagonal; and finally compute B = I − W̃, where W̃ denotes the permuted and rescaled W. To find a causal order k(i) we must subsequently find a second permutation, to be applied equally to both the rows and columns of B, which yields strict lower triangularity.
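A minimal sketch of the discovery algorithm just summarized, assuming NumPy, SciPy, and scikit-learn's FastICA; the assignment-based permutation search is one convenient way to realize the "no zeros on the diagonal" step and is our own choice, not necessarily the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import FastICA

def lingam_discover(X):
    """Estimate B in x = Bx + e from a data matrix X (rows are samples)."""
    n = X.shape[1]

    # Step 1: ICA estimates the demixing matrix W = A^{-1},
    # up to a row permutation and row scaling.
    W = FastICA(n_components=n, random_state=0).fit(X).components_

    # Step 2: permute the rows of W so the diagonal has no zeros.  With
    # finite data no entry is exactly zero, so we choose the permutation
    # maximizing the absolute diagonal, solved as an assignment problem.
    _, col = linear_sum_assignment(-np.abs(W))
    W_perm = W[np.argsort(col)]

    # Step 3: rescale each row so its diagonal element equals one, fixing
    # the weight of each disturbance on its own variable to unity.
    W_perm = W_perm / np.diag(W_perm)[:, None]

    # Step 4: B = I - W~.  A further simultaneous row/column permutation
    # of B toward strict lower triangularity then gives a causal order k(i).
    return np.eye(n) - W_perm
```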
3 LiNGAM in the Presence of Latent Classes

In this section, we extend the basic LiNGAM above to cases where latent (hidden) classes are present.

3.1 Motivation

Let us begin with an example. Regarding a child's and a parent's height, earlier studies (e.g., [12]) pointed out that there is a hereditary effect on height, which is especially strong between a child and the same-sex parent. This implies that the connection strengths from the parent's height to the child's height (and possibly the network structures) could differ between the two classes (same-sex and different-sex children). This is a nonlinear relation between the child's and the parent's height even if the relations are still linear within each class, and it cannot be found if the class membership is ignored (see Section 4 for some artificial examples). In cases where such class membership is observed, we only have to analyze each class separately. However, in many cases it is quite difficult to detect and observe class membership, especially before collecting data. Thus, we need a method to estimate latent classes in a data-driven way. In the following, we extend the basic LiNGAM so that the method can estimate latent classes of samples that have similar network structures.

3.2 Model
Let us assume that the data are generated by the following mixture density:

$$p(x \mid \Theta) = \sum_{k=1}^{K} p(x \mid \mu_k, B_k)\, p(C = k), \qquad (4)$$

where Θ = [θ_1, ..., θ_K], θ_k = [μ_k^T, vec(B_k)^T]^T, μ_k is a mean vector, B_k is a connection strength matrix for class k, and C is a discrete variable that indicates the class k = 1, ..., K. (Here vec(·) denotes the vectorization operator, which creates a column vector from a matrix by stacking its columns.) We do not assume that we know the number of latent classes K or the a priori probabilities p(C = k). Moreover, the data within class k are assumed to be generated by the LiNGAM model:

$$x = B_k x + (I - B_k)\mu_k + e_k, \qquad (5)$$

where e_k is the disturbance (error) vector for class k. Note that the means, connection strengths and network structures (μ_k and B_k) can be different between classes. See Section 4 for some illustrative examples.

3.3 Model Identification Using ICA Mixtures

We propose that the new model above can be estimated using ICA mixture models [13]. As in the basic LiNGAM, the ICA model holds for each class:

$$x = \mu_k + A_k e_k, \qquad (6)$$

where A_k = (I − B_k)^{−1}. The mixture density is then just the ICA mixture model [13]. After μ_k and A_k are estimated, we can obtain estimates of B_k and causal orderings k(i) for class k in the same manner as in the basic LiNGAM (Section 2).

Some estimation methods for ICA mixtures have been proposed [13, 14]. Here we employ the minimum β-divergence method [14], since it does not require that the number of classes K and the a priori probabilities p(C = k) be known, which is a big advantage over [13]. Some drawbacks are that one has to tell the algorithm whether the disturbances e_i are super- or sub-Gaussian, and that a tuning parameter β must be selected using a cross-validation technique [15]. Fortunately, the first problem can be solved by (possibly non-parametric) estimation of the source densities [16, 14].

4 Illustrative Examples

In this section, we provide two illustrative examples of the LiNGAM in the presence of latent classes (abbreviated as LcLiNGAM) proposed above. We selected μ_k and B_k manually, as explained below. The disturbances followed the Laplace distribution with zero means, and we selected the variances so that the observed variables had unit variances. The number of latent classes was 2, and 250 data points were generated for each class. Note that the number of latent classes was estimated as well by the β-divergence method [14]. The scatterplots of the observed variables are shown in Figure 1.
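As a hedged sketch of this generative setup (Eqs. (4)-(6)), the following mirrors Example 1 below: two classes of 250 points, Laplace disturbances, and disturbance variances (1, 0.91) chosen so that both observed variables have unit variance within each class. The variance values and function names are our own reconstruction of the stated design, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(mu, B, n, var_e):
    """Draw n points of one class via x = mu + A e, A = (I - B)^{-1} (Eq. 6)."""
    m = len(mu)
    A = np.linalg.inv(np.eye(m) - B)
    # A Laplace(0, b) variable has variance 2 b^2, so b = sqrt(var / 2).
    e = rng.laplace(0.0, np.sqrt(np.asarray(var_e) / 2.0), size=(n, m))
    return mu + e @ A.T

# Example 1: the same structure x_1 -> x_2 in both classes, different means.
mu1, B1 = np.array([0.0, 0.0]), np.array([[0.0, 0.0], [0.3, 0.0]])
mu2, B2 = np.array([4.0, 5.0]), np.array([[0.0, 0.0], [0.3, 0.0]])
X = np.vstack([sample_class(mu1, B1, 250, [1.0, 0.91]),
               sample_class(mu2, B2, 250, [1.0, 0.91])])
```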
4.1 Example 1

We generated data using the following means, connection strengths and network structures:

$$\text{Class 1: } \mu_1 = \mathbf{0}, \quad B_1 = \begin{pmatrix} 0 & 0 \\ 0.3 & 0 \end{pmatrix} \qquad (7)$$

$$\text{Class 2: } \mu_2 = \begin{pmatrix} 4 \\ 5 \end{pmatrix}, \quad B_2 = \begin{pmatrix} 0 & 0 \\ 0.3 & 0 \end{pmatrix}. \qquad (8)$$

[Fig. 1. Left: Scatterplots of the observed variables in Example 1. Right: Scatterplots of the observed variables in Example 2. In the scatterplots, "." denotes members of class 1 and "+" those of class 2. Plots omitted.]
Both classes 1 and 2 had the same causal order x_1 → x_2, but different means (μ_1 ≠ μ_2). The different mean structures created a strong correlation (0.88) in the whole data, although the connection strength within each class was rather weak (0.3). The estimation results of the LcLiNGAM and the basic LiNGAM were as follows:

LcLiNGAM^1:

$$\text{Class 1: } \hat{\mu}_1 = \begin{pmatrix} -0.02 \\ 0.06 \end{pmatrix}, \quad \hat{B}_1 = \begin{pmatrix} 0 & 0 \\ 0.30 & 0 \end{pmatrix} \qquad (9)$$

$$\text{Class 2: } \hat{\mu}_2 = \begin{pmatrix} \cdot \\ 5.01 \end{pmatrix}, \quad \hat{B}_2 = \begin{pmatrix} 0 & 0 \\ 0.41 & 0 \end{pmatrix}, \qquad (10)$$

LiNGAM:

$$\hat{\mu} = \begin{pmatrix} -0.09 \\ 3.99 \end{pmatrix}, \quad \hat{B} = \begin{pmatrix} 0 & 0.81 \\ 0 & 0 \end{pmatrix}. \qquad (11)$$

The LcLiNGAM successfully recovered the means and network structures and estimated the connection strengths fairly well for both latent classes. However, the basic LiNGAM failed to find the correct causal order and overestimated the connection strength.
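The reported pooled correlation of 0.88 follows directly from the mean structure: with unit within-class variances, the pooled covariance of (x_1, x_2) is 0.3 + 2 · 2.5 = 5.3, while the pooled variances are 1 + 4 = 5 and 1 + 6.25 = 7.25, giving 5.3/√(5 · 7.25) ≈ 0.88. This can be checked numerically with the sampler sketched before Example 1 (our own illustration):

```python
# Reusing X from the sampler sketched before Example 1.
pooled_r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]          # close to 0.88
class1_r = np.corrcoef(X[:250, 0], X[:250, 1])[0, 1]    # close to 0.3
print(round(pooled_r, 2), round(class1_r, 2))
```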
4.2 Example 2

Next, we tried data whose means, connection strengths and network structures were as follows:

$$\text{Class 1: } \mu_1 = \mathbf{0}, \quad B_1 = \begin{pmatrix} 0 & 0 \\ 0.3 & 0 \end{pmatrix} \qquad (12)$$

$$\text{Class 2: } \mu_2 = \begin{pmatrix} 5 \\ 4 \end{pmatrix}, \quad B_2 = \begin{pmatrix} 0 & 0.3 \\ 0 & 0 \end{pmatrix}. \qquad (13)$$

Now the two classes had different causal orders: x_1 → x_2 for class 1 and x_1 ← x_2 for class 2. The connection strengths were the same, but the mean structures were different between the classes. The estimation results were as follows:

LcLiNGAM:

$$\text{Class 1: } \hat{\mu}_1 = \begin{pmatrix} \cdot \\ 0.07 \end{pmatrix}, \quad \hat{B}_1 = \begin{pmatrix} 0 & 0 \\ 0.39 & 0 \end{pmatrix} \qquad (14)$$

$$\text{Class 2: } \hat{\mu}_2 = \begin{pmatrix} \cdot \\ 4.01 \end{pmatrix}, \quad \hat{B}_2 = \begin{pmatrix} 0 & 0.41 \\ 0 & 0 \end{pmatrix}, \qquad (15)$$

LiNGAM:

$$\hat{\mu} = \begin{pmatrix} 3.88 \\ 0.08 \end{pmatrix}, \quad \hat{B} = \begin{pmatrix} 0 & 0.78 \\ 0 & 0 \end{pmatrix}. \qquad (16)$$
^1 Obviously, the orders of the latent classes themselves are not recovered. In the examples, for clarity, we permuted the classes so that the differences between the estimates and the true values were minimized.

The LcLiNGAM estimated the connection strengths fairly well and found the correct causal orders for each class. However, the basic LiNGAM could not find that the two classes have different causal orders, because it cannot represent any difference between the classes; it estimated, rather arbitrarily, one single causal order x_1 ← x_2.

5 Simulation

To further verify the validity of our method, we performed experiments with simulated data. We repeatedly performed the following experiment:

1. First, we randomly constructed a strictly lower-triangular matrix B for each class, where the number of classes was 2 and the number of variables was 4. We also randomly selected the variances of the disturbance variables. We further generated values for the constants μ_i making the classes have small overlap.^2

2. Next, we generated data with sample size 500 by independently drawing the disturbance variables e_i from the uniform distribution with zero mean and unit variance for each class. The observed data X were generated according to the assumed recursive process and were combined to create the whole dataset.

3. Finally, we fed the data to our discovery algorithm. The β-divergence method was employed to estimate the ICA mixtures. Here we told the algorithm that the disturbances were sub-Gaussian.

4. We compared the estimated parameters to the generating parameters. In particular, we made scatterplots of the entries in the estimates μ̂_k and B̂_k against the corresponding ones in μ_k and B_k. (Note that the numbers of latent classes were estimated as well.)

^2 We first set μ_1 = 0 and took as the elements of μ_2 1.5 times the sum of the standard deviations of the corresponding observed variables of each class, multiplied by −1 with probability 50%.

[Fig. 2. Left: Scatterplot of the estimated μ_i versus the original (generating) values. Right: Scatterplot of the estimated b_ij versus the original (generating) values. Five datasets were generated for the scatterplots. Plots omitted.]

Figure 2 gives scatterplots of the elements of the estimated μ_k and B_k versus the generating ones. The left panel shows the estimated means μ_i versus the original (generating) values; the right panel shows the estimated connection strengths b_ij versus the original (generating) values. We can see that the estimation works well, as evidenced by the grouping of the data points onto the main diagonal.

6 Conclusion

Developing statistical causal inference methods based on non-experimental data is a fundamental problem with a large number of potential applications. Previous methods developed for linear causal models [2, 6, 5] have been based on an explicit or implicit assumption of Gaussianity, and hence solely on the covariance structure of the data. Therefore, algorithms based on the Gaussian assumption cannot in most cases distinguish between multiple equally possible causal models. In previous work, we have shown that an assumption of non-Gaussianity of the disturbance variables, together with the assumptions of linearity and no latent variables, allows the linear acyclic model to be completely identified [9]. In this paper, we extended our previous work to cases where latent (hidden) classes are present. The new method can identify the DAG structures within latent classes and would enjoy a wider variety of applications.

Although in the artificial experiments our method worked well, we obviously need to evaluate its empirical performance through more extensive simulations as well as on real-world data. For example, in many real situations latent classes would overlap much more than in our simulations. Unfortunately, for such heavily overlapping cases, ICA estimation methods are still under development [14]. These are important topics for future research.

As a further analysis, it is quite important to investigate what characterizes the latent classes in order to understand how the model can be applied, for example, in the design of practical interventions. The estimated means, connection strengths and network structures could provide an interpretation of the latent classes. For example, in Examples 1 and 2 above (Section 4), the differences in the means would be useful for interpreting the difference between the two classes (and probably for classifying new samples). An additional or alternative way is to analyze the samples classified into the latent classes using logistic regression analysis, if covariates such as sex and age are available. One direction of future research would be to combine the latent class LiNGAM and logistic regression to improve the class distinction ability.^3

^3 [...] in the context of structural equation modeling (SEM) [17], although such SEMs based on Gaussian mixtures require that causal orders be prespecified, since non-Gaussianity is not utilized for the model identification.
A related topic is the case where hidden confounding (continuous) variables are present (latent variable LiNGAM) [18]. We would like to mention a useful connection between the two extensions of the basic LiNGAM. In the latent class LiNGAM discussed here, we basically have a binary (discrete) hidden confounding variable (the class membership) which determines the connection strengths when the structure of the network is the same for the different classes. In future work, we will consider a unifying framework that combines the two extensions.

Acknowledgment

This work was partially carried out at the Transdisciplinary Research Integration Center, Research Organization of Information and Systems. The authors would like to thank Patrik Hoyer for his valuable comments and Nurul Mollah, Mihoko Minami and Shinto Eguchi for providing access to their Matlab code for ICA mixtures. S.S. was supported by a Grant-in-Aid for Scientific Research from the Japan Society for the Promotion of Science.

References

1. Holland, P.W.: Statistics and causal inference (with discussion). Journal of the American Statistical Association 81, 945–970 (1986)
2. Bollen, K.A.: Structural Equations with Latent Variables. John Wiley & Sons, Chichester (1989)
3. Imoto, S., Goto, T., Miyano, S.: Estimation of genetic networks and functional structures between genes by using Bayesian network and nonparametric regression. In: Proc. Pacific Symposium on Biocomputing, vol. 7, pp. 175–186 (2002)
4. Kim, J., Zhu, W., Chang, L., Bentler, P.M., Ernst, T.: Unified structural equation modeling approach for the analysis of multisubject, multivariate functional MRI data. Human Brain Mapping 28, 85–93 (2007)
5. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge (2000)
6. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search, 2nd edn. MIT Press, Cambridge (2000)
7. Shimizu, S., Kano, Y.: Use of non-normality in structural equation modeling: Application to direction of causation. Journal of Statistical Planning and Inference (in press, 2006)
8. Shimizu, S., Hyvärinen, A., Hoyer, P.O., Kano, Y.: Finding a causal ordering via independent component analysis. Computational Statistics & Data Analysis 50(11), 3278–3293 (2006)
9. Shimizu, S., Hoyer, P.O., Hyvärinen, A., Kerminen, A.: A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7, 2003–2030 (2006)
10. Comon, P.: Independent component analysis, a new concept? Signal Processing 36, 62–83 (1994)
11. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
12. Tanner, J., Israelsohn, W.: Parent-child correlation for body measurements of children between the ages one month and seven years. Ann. Hum. Genet. 26, 245–259 (1963)
13. Lee, T.W., Lewicki, M., Sejnowski, T.: ICA mixture models for unsupervised classification of non-gaussian sources and automatic context switching in blind signal separation. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(10), 1–12 (2000)
14. Mollah, M.N.H., Minami, M., Eguchi, S.: Exploring latent structure of mixture ICA models by the minimum β-divergence method. Neural Computation 18, 166–190 (2006)
15. Minami, M., Eguchi, S.: Adaptive selection for minimum β-divergence method. In: Proc. Fourth Int. Conf. on Independent Component Analysis and Blind Signal Separation (ICA 2003), Nara, Japan, pp. 475–480 (2003)
16. Pham, D.T., Garrat, P.: Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. Signal Processing 45, 1457–1482 (1997)
17. Muthén, B.O.: Beyond SEM: General latent variable modeling. Behaviormetrika 29, 81–117 (2002)
18. Hoyer, P.O., Shimizu, S., Kerminen, A.: Estimation of linear, non-gaussian causal models in the presence of confounding latent variables. In: Proc. Third European Workshop on Probabilistic Graphical Models (PGM 2006), pp. 155–162 (2006)

Efficient Incremental Learning Using Self-Organizing Neural Grove

Hirotaka Inoue^{1} and Hiroyuki Narihisa^{2}

^{1} Department of Electrical Engineering and Information Science, Kure College of Technology, 2-2-11 Agaminami, Kure-shi, Hiroshima, 737-8506 Japan, hiro@kure-nct.ac.jp
^{2} Department of Information and Computer Engineering, Okayama University of Science, 1-1 Ridai-cho, Okayama-shi, Okayama, 700-0005 Japan, narihisa@ice.ous.ac.jp

Abstract. Multiple classifier systems (MCS) have become popular during the last decade. The self-generating neural tree (SGNT) is one of the suitable base classifiers for MCS because of its simple setup and fast learning. In an earlier paper, we proposed a pruning method for the structure of the SGNT in the MCS to reduce the computational cost, and we called this model the self-organizing neural grove (SONG). In this paper, we investigate the performance of incremental learning using the SONG on a large-scale classification problem. The results show that the SONG can ensure rapid and efficient incremental learning.

1 Introduction

Recently, to improve classification accuracy, multiple classifier systems (MCS) such as neural network ensembles, bagging, and boosting have been used for practical data mining applications [1]. When developing classifiers using learning methods, while more training data can reduce the prediction error, the learning process can itself become computationally intractable. This issue is becoming more evident today, because there are complex classification problems waiting to be solved in many domains where large amounts of data are already available [2]. Ideally, it is desirable to consider all the examples simultaneously, to get the best possible estimate of the class distribution. On the other hand, when the training set is large, all the examples cannot be loaded into memory at once. One approach to overcome this constraint is to train the classifier using an incremental learning technique, whereby only subsets of the data are considered at any one time and the results are subsequently combined.

Neural networks have great advantages of adaptability, flexibility, and universal nonlinear input-output mapping capability. However, to apply these neural networks, it is necessary to determine the network structure and some parameters by human experts, and it is quite difficult to choose the right network structure suitable for a particular application at hand. Moreover, they require
a long training time to learn the input-output relation of the given data. These drawbacks prevent neural networks from being the base classifier of the MCS for practical applications.

Self-generating neural networks (SGNN) [3] have a simple network design and high-speed learning. SGNN are an extension of the self-organizing maps (SOM) of Kohonen [4] and utilize competitive learning, which is implemented as a self-generating neural tree (SGNT). These abilities make SGNN suitable as the base classifier of the MCS. In order to improve the accuracy of SGNN, we proposed ensemble self-generating neural networks (ESGNN) for classification [5] as one form of MCS. Although the accuracy of ESGNN improves by using various SGNN, the computation cost, that is, the computation time and the memory capacity, increases in proportion to the number of SGNN in the MCS. Therefore, we proposed a pruning method for the structure of the SGNN in the MCS to reduce the computation time and the memory capacity, and we called this model the self-organizing neural grove (SONG) [6,7].

In this paper, we investigate the performance of incremental learning using the SONG on a large-scale classification problem from the UCI machine learning repository [8]. We use the letter recognition dataset as the classification problem. We investigate the relation between the number of training data and the classification accuracy, the number of nodes, and the computation time. The results show that the SONG can ensure rapid and efficient incremental learning.

The rest of the paper is organized as follows: the next section shows how to construct the SONG; Section 3 shows experimental results; finally, we present some conclusions and outline plans for future work.

2 Self-Organizing Neural Grove

In this section, we describe how to prune redundant leaves in the SONG. We implement the pruning method in two stages: the on-line pruning method and the off-line optimization method. First, we describe the on-line pruning method used during learning of the SGNT. Second, we show the optimization method used in constructing the SONG. Finally, we show a simple example of the pruning method on a two-dimensional classification problem.

2.1 On-line Pruning of the Self-Generating Neural Tree

The SGNT is based on SOM and implemented as competitive learning. The SGNT can be constructed directly from the given training data without any intervening human effort. The SGNT algorithm is defined as a tree construction problem: how to construct a tree structure from the given data, which consist of multiple attributes, under the condition that the final leaves correspond to the given data.

Before we describe the SGNT algorithm, we introduce some notation:

– input data vector: e_i ∈ R^m.
– root, leaf, and node in the SGNT: n_j.
– weight vector of n_j: w_j ∈ R^m.
– number of the leaves in n_j: c_j.
– distance measure: d(e_i, w_j).
– winner leaf for e_i in the SGNT: n_win.

The code of the SGNT algorithm is given in Fig. 1, which uses several sub procedures; Table 1 shows the sub procedures of the SGNT algorithm and their specifications.

    Input: A set of training examples E = {e_i}, i = 1, ..., N.
           A distance measure d(e_i, w_j).
    Program Code:
        copy(n_1, e_1);
        for (i = 2, j = 2; i <= N; i++) {
            n_win = choose(e_i, n_1);
            if (leaf(n_win)) {
                copy(n_j, w_win);
                connect(n_j, n_win);
                j++;
            }
            copy(n_j, e_i);
            connect(n_j, n_win);
            j++;
            prune(n_win);
        }
    Output: Constructed SGNT by E.

Fig. 1. SGNT algorithm

Table 1. Sub procedures of the SGNT algorithm

Sub procedure           Specification
copy(n_j, e_i/w_win)    Create n_j; copy e_i/w_win as w_j in n_j.
choose(e_i, n_1)        Decide n_win for e_i.
leaf(n_win)             Check whether n_win is a leaf or not.
connect(n_j, n_win)     Connect n_j as a child leaf of n_win.
prune(n_win)            Prune leaves if the leaves have the same class.

In order to decide the winner leaf n_win in the sub procedure choose(e_i, n_1), competitive learning is used. If a node n_j includes the n_win as its descendant in the SGNT, the weight w_jk (k = 1, 2, ..., m) of n_j is updated as follows:

$$w_{jk} \leftarrow w_{jk} + \frac{1}{c_j}\,(e_{ik} - w_{jk}), \qquad 1 \le k \le m. \qquad (1)$$
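For readers who prefer runnable code to the pseudocode of Fig. 1, here is a compact Python sketch of the SGNT construction and the winner-path update of Eq. (1). The class-based prune() step and class labels are omitted and all names are our own, so treat this as an illustration of the algorithm's shape rather than the authors' implementation.

```python
import numpy as np

class Node:
    def __init__(self, w, parent=None):
        self.w = np.array(w, dtype=float)  # weight vector w_j (copied)
        self.parent = parent
        self.children = []                 # empty list => n_j is a leaf
        self.c = 1                         # number of leaves covered by n_j

def choose(e, root):
    """Competitive search: descend to the leaf whose weight is nearest e."""
    node = root
    while node.children:
        node = min(node.children, key=lambda n: np.linalg.norm(e - n.w))
    return node

def grow_sgnt(examples):
    """Simplified SGNT construction (Fig. 1), without the prune() step."""
    it = iter(examples)
    root = Node(next(it))                          # copy(n_1, e_1)
    for e in it:
        win = choose(e, root)                      # n_win = choose(e_i, n_1)
        if not win.children:                       # leaf(n_win): split it by
            win.children.append(Node(win.w, win))  # copying w_win as a child
        win.children.append(Node(e, win))          # connect the new leaf e_i
        # Eq. (1): every node on the path to the winner moves toward e by
        # 1/c_j, keeping w_j near the centroid of the leaves it covers.
        node = win
        while node is not None:
            node.c += 1
            node.w += (e - node.w) / node.c
            node = node.parent
    return root

tree = grow_sgnt(np.random.default_rng(0).random((100, 3)))
```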