$$\pi_s(u \mid \mathcal{D}_s) = \frac{p_u(T_s \mid X_s)\,\pi_s(u)}{\sum_{u=1}^{U} p_u(T_s \mid X_s)\,\pi_s(u)} = \frac{\prod_{n=1}^{N_s} p_u(t_s^n \mid x_s^n)\,\pi_s(u)}{\sum_{u=1}^{U} \prod_{n=1}^{N_s} p_u(t_s^n \mid x_s^n)\,\pi_s(u)}, \qquad (3)$$
where π_s denotes the subjective belief of the agent for user s, and π_s(u) is the prior belief over the models. With this posterior, the Bayesian predictive distribution is given as

$$\bar{p}_s(t_s \mid x_s) \equiv \sum_{u=1}^{U} \pi_s(u \mid \mathcal{D}_s)\, p_u(t_s \mid x_s). \qquad (4)$$

The final prediction is made, according to the predictive distribution, as

$$\hat{t}_s = \operatorname*{argmax}_{t_s}\; \bar{p}_s(t_s \mid x_s = x_s^{\mathrm{new}}), \qquad (5)$$

which we refer to as (Bayesian) collaborative prediction in this study. We note that in Eq. (4) the predictive distribution no longer depends on the other users' data. This may be an advantage in some distributed environments, such as when the users are at different sites and communication between them is costly.
The posterior probability of other users' models becomes large when user s has only a limited number of training data and there exist users similar to user s, in the sense that they generate similar outputs given the same inputs. The posterior integration then incorporates additional knowledge from the similar users' models into the prediction for user s. This introduced knowledge is expected to improve the prediction accuracy. In addition, the model averaging reduces the uncertainty/variance of the model. On the other hand, if user s has a sufficient amount of data or there are no similar users, the posterior probability p(u = s) will be close to one, which reduces the collaborative predictor to the individual predictor; this is quite natural, because either the model is successfully learned individually or user s is isolated.
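To make Eqs. (3)-(5) concrete, the following is a minimal sketch of collaborative prediction for discrete targets. The function names and the predictor interface are our own illustration (the paper's predictors are Bayesian networks trained with the deal package in R), and working in log space is an implementation detail not discussed in the paper.

```python
import numpy as np

def collaborative_predict(models, D_s, x_new, targets, prior=None):
    """Bayesian collaborative prediction, Eqs. (3)-(5).

    models  -- list of U conditional models; each is a callable p(t, x)
               returning the probability of target t given input x
    D_s     -- training pairs [(x, t), ...] of user s
    x_new   -- new input for user s
    targets -- all possible target realizations
    """
    U = len(models)
    prior = np.full(U, 1.0 / U) if prior is None else np.asarray(prior, float)

    # Eq. (3): posterior over models, proportional to prior * likelihood of D_s.
    # Log space avoids underflow when N_s grows.
    log_post = np.log(prior)
    for u, p_u in enumerate(models):
        log_post[u] += sum(np.log(p_u(t, x)) for x, t in D_s)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # Eq. (4): predictive distribution as a posterior-weighted model average.
    pred = {t: sum(post[u] * models[u](t, x_new) for u in range(U))
            for t in targets}

    # Eq. (5): predict the mode of the predictive distribution.
    return max(pred, key=pred.get), post
```

As the paragraph above describes, when D_s is large the posterior mass concentrates on user s's own model and the collaborative predictor reduces to the individual one.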
4 Experimental Results

4.1 Printer Usage Data

We evaluated the proposed approach on a real-world user modeling task. The dataset used here, which we refer to as the "printer usage" dataset, is a set of electronic log data collected through the daily use of printer devices shared via a network within a section of the company to which one of the authors belongs. The motivation has been presented in detail in [5], although the current task setting and dataset differ slightly from the previous ones.

The log data consist of many records, where a single record corresponds to one printing job output via the network to one of the shared printers. Each record originally consists of a number of attributes, including user ID, context-related identifiers such as date or time, and content- or function-related identifiers such as the number of document pages or the use of duplex/simplex printing. The aim of the experiment here, however, is not to construct full user models but to evaluate the basic performance of our approach. We therefore pre-processed the original log data to produce a compact task setting appropriate for the evaluation purpose.

The log data were first separated into those of each individual according to the user ID. Then, in each individual log, only the records that meet the following condition were extracted: paper size is A4, with only one copy and only one page per single sheet. This is usually the default setting of printer interfaces, and the usage of most users is strongly biased toward this setting. By limiting to the default condition, the distribution of the other attributes became regularly balanced, and the dataset thus became suitable for normative performance evaluation.

In this reduced dataset, we consider the five attributes other than the fixed attributes, which are summarized in Table 1. The values of the attribute modelName were replaced by anonymous ones. We quantized the values of the attribute docPages, which can take any natural number, as shown in the table. The attribute docExt may originally take various values, but the number of frequently appearing values is small, such as pdf, xls, html, etc. We thus extracted only the records that include these frequent values and removed the others from the experiment. Finally, we removed the records including missing values. In this experiment, we also fixed both the input and target variables within the five attributes; that is, we consistently set x = {docExt, docPages} and t = {modelName, duplex, colorMode}, where the numbers of possible realizations are |x| = 25 and |t| = 20. After the pre-processing, the total number of users became 76, and the numbers of individual training data varied widely, within the range from 2 to 1,192 (Fig. 2).

Table 1. Five attributes in printer logs

Attribute   Description                            Values
modelName   Anonymous form of model names          Pr1, Pr2, Pr3, Pr4, Pr5
docExt      File extension of original document    doc, html, pdf, ppt, xls
docPages    Number of pages in original document   1, 2-5, 6-20, 21-50, 51-over
duplex      Duplex or simplex                      duplex, simplex
colorMode   Color mode                             monochrome, fullColor

[Fig. 2. The original numbers of data for 76 users. Horizontal axis: user index (sorted); vertical axis: number of training data. Plot omitted.]

4.2 Simulation Setting

Although our method is particularly expected to improve the prediction performance of users located toward the right of Fig. 2 (those with small sample sizes) in real situations, the limited number of data for such users prevents a quantitative evaluation of prediction performance. In this experiment, therefore, we tested the collaborative predictors for the top 16 users (the leftmost users in Fig. 2), artificially varying the number of training data, N_u, from relatively small to large by random selection. For each setting of N_u, twenty runs were conducted, in each of which both the collaborative and individual predictors were constructed from the same data and their performances evaluated (see below). The number of test data was commonly set at 200 in every run.

In each single run for a user s, both the training data and test data were randomly selected from the original data without overlap. The BN for user s was first constructed from the training dataset. The BN was trained in a standard manner as described in Sec. 2, where our implementation used the "deal" package written in the language R. For structure learning, we simply used the heuristic optimization provided by "deal" as the function heuristic() (see [1] for details), which searches for a good structure by a partially greedy stepwise ascent of the score function starting from multiple initial conditions. The Dirichlet hyperparameters were commonly set at a constant such that the total number of prior pseudo-counts [6] was five. To obtain the conditional model p_s(t | x_s) from the resultant joint distribution, we used the enumerative method.

After learning, with this individual predictor p_s, we constructed the corresponding collaborative predictor p̄_s as follows. First, prior to the simulation, we trained individual predictors for all the 76 users based on the numbers of data shown in Fig. 2, without limiting the number of the top 16 users' data. An ensemble of 76 individual predictors, p_1^org, p_2^org, ..., p_U^org, which we refer to as the original ensemble, was thus prepared in advance. The collaborative predictor p̄_s was then obtained by replacing p_s^org in the original ensemble with p_s, and making p̄_s based on the replaced
ensemble. The prior π_s(u) was simply set to be uniform. To evaluate the prediction performance on the test dataset, we calculated the test accuracy, defined as the fraction of test cases in which the predictions of the three target variables were all correct.
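As a small worked illustration of this metric (our own sketch; the tuple encoding is hypothetical), a test case counts as correct only when all three targets match:

```python
def test_accuracy(predictions, truths):
    """Fraction of test cases in which the predicted (modelName, duplex,
    colorMode) triple matches the true one exactly."""
    hits = sum(p == t for p, t in zip(predictions, truths))
    return hits / len(truths)

preds = [("Pr1", "simplex", "monochrome"), ("Pr2", "duplex", "fullColor")]
trues = [("Pr1", "simplex", "monochrome"), ("Pr2", "duplex", "monochrome")]
print(test_accuracy(preds, trues))  # 0.5 -- the second case misses colorMode
```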
4.3 Results

Fig. 3 shows the test accuracies of the individual and collaborative predictors. Each panel corresponds to one of the 16 users; the vertical axis denotes the test accuracy and the horizontal axis the number of training data, where only the cases with N_u = 5, 10, 20, 30, 40, and 50 are shown. In this figure, the individual predictors often exhibited relatively lower accuracy and larger variances when N_u was less than about 20, in comparison with the cases with larger N_u. In contrast, the collaborative predictors improved on these undesirable results of the individual predictors: they showed a higher mean accuracy for a number of users, and smaller variances for almost all users.

[Fig. 3. Test accuracy for 16 users. The vertical axis denotes the test accuracy and the horizontal the number of training data. The solid (black) and dashed (gray) lines respectively show the mean over the 20 runs by the collaborative (proposed) and individual predictors. The errorbar represents ±SD (standard deviation). Panels omitted.]

Fig. 4 (left) shows the improvement in test accuracy achieved by the Bayesian collaboration over the individual predictor (i.e., the accuracy of the collaborative predictor minus that of the corresponding individual one). This figure collectively plots the results of the 16 users, where each point denotes the improvement in a single run for a single user. An improvement was achieved in many runs, especially with small samples. Fig. 4 (right) shows the variance of test accuracy of each individual user against the number of training data. Each errorbar denotes the standard deviation over the 16 users. This figure clearly shows that the variance of test performance was substantially reduced by collaborative prediction in comparison with individual prediction, especially in the cases of small samples.

[Fig. 4. Left: The difference in test accuracy (Collaborative − Individual), collectively showing the 16 users' results; a mark x denotes an actual value of test accuracy, and a solid line denotes the mean value, with an errorbar of ±SD. Right: The individual variance of test accuracy; solid and dashed lines respectively denote the collaborative and individual predictors, and an errorbar denotes the standard deviation over the 16 users. Plots omitted.]

5 Summary

In this article, we proposed a new collaborative framework for user modeling, with special interest in its application to BNs, which have recently become a popular modeling tool in general user modeling tasks. Our method is essentially a simple application of the Bayesian principle, but the key idea of regarding the other users' models as candidates for the target user's model is consistent with the basic assumption of collaborative methods, i.e., that there may be other users similar to the target user. The effectiveness of the proposed method was demonstrated on a real-world dataset related to user modeling.
While the improvements by our method were shown only in a rather limited range of sample sizes (say, less than 20), it should be noted that this range is likely to extend in more realistic problems with larger numbers of variables, where the required amount of training data increases. More detailed performance evaluation, as well as the investigation of remaining issues such as the computational cost and the effective setting of the prior distribution π_s(u), are left as future work.

References

1. Böttcher, S.G., Dethlefsen, C.: deal: A package for learning Bayesian networks. Journal of Statistical Software 8(20) (2003)
2. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proc. 14th Conf. on Uncertainty in Artificial Intelligence, pp. 43–52. Morgan Kaufmann, San Francisco (1998)
3. Godoy, D., Amandi, A.: User profiling in personal information agents: a survey. Knowl. Eng. Rev. 20(4), 329–361 (2005)
4. Lauritzen, S.L., Spiegelhalter, D.J.: Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B 50(2), 157–224 (1988)
5. Nakatomi, M., Iga, S., Shinnishi, M., Nagatsuka, T., Shimada, A.: What affects printing options? – Toward personalization & recommendation system for printing devices. In: International Conference on Intelligent User Interfaces (Workshop: Beyond Personalization 2005) (2005)
6. Neapolitan, R.E.: Learning Bayesian Networks. Prentice-Hall, Inc., Upper Saddle River (2003)
7. Niculescu-Mizil, A., Caruana, R.: Inductive transfer for Bayesian network structure learning. In: Proc. 11th International Conf. on AI and Statistics (2007)
8. Pearl, J.: Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo (1988)
9. Schafer, J.B., Konstan, J.A., Riedl, J.: E-commerce recommendation applications. Data Mining and Knowledge Discovery 5(1–2), 115–153 (2001)
10. Webb, G.I., et al.: Machine learning for user modeling. User Modeling and User-Adapted Interaction 11(1–2), 19–29 (2001)
11. Zhang, Y., Burer, S., Nick Street, W.: Ensemble pruning via semi-definite programming. Journal of Machine Learning Research 7, 1315–1338 (2006)
12. Zukerman, I., Albrecht, D.: Predictive statistical models for user modeling. User Modeling and User-Adapted Interaction 11(1) (2001)

Discovery of Linear Non-Gaussian Acyclic Models in the Presence of Latent Classes

Shohei Shimizu^{1,2}
and Aapo Hyvärinen^{1}

^{1} Helsinki Institute for Information Technology, Finland
^{2} The Institute of Statistical Mathematics, Japan
http://www.hiit.fi/neuroinf

Abstract. An effective way to examine causality is to conduct an experiment with random assignment. However, in many cases it is impossible or too expensive to perform controlled experiments, and hence one often has to resort to methods for discovering good initial causal models from data which do not come from such controlled experiments. We have recently proposed such a discovery method based on independent component analysis (ICA), called LiNGAM, and shown how to completely identify the data-generating process under the assumptions of linearity, non-gaussianity, and no latent variables. In this paper, after briefly recapitulating this approach, we extend the framework to cases where latent classes (hidden groups) are present. The model identification can be accomplished using a method based on ICA mixtures. Simulations confirm the validity of the proposed method.

1 Introduction

An effective way to examine causality is to conduct an experiment with random assignment [1]. However, in many cases it is impossible or too expensive to perform controlled experiments. Hence one often has to resort to methods for discovering good initial causal models from data which do not come from such controlled experiments, though obviously one can never fully prove the validity of a causal model from such uncontrolled data alone. Thus, developing methods for causal inference from uncontrolled data is a fundamental problem with a very large number of potential applications, such as the social sciences [2], gene network estimation [3] and brain connectivity analysis [4].

Previous methods developed for statistical causal analysis of non-experimental data [2, 5, 6] generally work in one of two settings. In the case of discrete data, no functional form for the dependencies is usually assumed. On the other hand, when working with continuous variables, a linear-Gaussian approach is almost invariably taken and has hence been based solely on the covariance structure of the data. Because of this, additional information (such as the time order of the variables and prior information) is usually required to obtain a full causal model of the variables. Without such information, algorithms based on the Gaussian assumption cannot in most cases distinguish between multiple equally possible causal models.
We have recently shown that when working with continuous-valued data, a significant advantage can be achieved by departing from the Gaussianity assumption [7,8,9]. The linear-Gaussian approach usually only leads to a set of possible models that are equivalent in their covariance structure. The simplest such case is that of two variables, x_1 and x_2. A method based only on the covariance matrix has no way of preferring x_1 → x_2 over the reverse model x_1 ← x_2 [2, 7].
However, a linear non-Gaussian setting actually allows the linear acyclic model to be uniquely identified [9]. In this paper, we extend our previous work to cases where latent classes (hidden groups) are present. The paper is structured as follows. In Section 2 we briefly describe the basics of LiNGAM and subsequently extend the framework in Section 3. Some illustrative examples are provided in Section 4, and the proposed method is empirically evaluated in Section 5. Section 6 concludes the paper.

2 LiNGAM

Here we provide a brief review of our previous work [9]. Assume that we observe data generated from a process with the following properties:

1. The observed variables x_i, i = {1, ..., n}, can be arranged in a causal order k(i), defined to be an ordering of the variables such that no later variable in the order participates in generating the value of any earlier variable. That is, the generating process is recursive [2], meaning it can be represented graphically by a directed acyclic graph (DAG) [5, 6].

2. The value assigned to each variable x_i is a linear function of the values already assigned to the earlier variables, plus a 'disturbance' (noise) term e_i, and plus an optional constant term μ_i, that is

$$x_i = \sum_{k(j) < k(i)} b_{ij} x_j + e_i + \mu_i. \qquad (1)$$

3. The disturbances e_i are all continuous random variables having non-Gaussian distributions with zero means and non-zero variances, and the e_i are independent of each other, i.e. $p(e_1, \ldots, e_n) = \prod_i p_i(e_i)$.

A model with these three properties we call a Linear, Non-Gaussian, Acyclic Model, abbreviated LiNGAM. We assume that we observe a large number of data vectors x (containing the components x_i), each generated according to the above process, with the same causal order k(i), the same coefficients b_ij, the same constants μ_i, and the disturbances e_i sampled independently from the same distributions. Note that the above assumptions imply that there are no unobserved (latent) confounders [5] (hidden variables). Spirtes et al. [6] call this the causally sufficient case.

To see how we can identify the parameters of the model from the set of data vectors x, we start by subtracting out the mean of each variable x_i, leaving us with the following system of equations:

$$x = Bx + e, \qquad (2)$$
where B is a matrix that contains the coefficients b_ij and that could be permuted (by simultaneous equal row and column permutations) to strict lower triangularity if one knew a causal ordering k(i) of the variables. (Strict lower triangularity is here defined as lower triangular with all zeros on the diagonal.) Solving for x one obtains

$$x = Ae, \qquad (3)$$

where A = (I − B)^{−1}. Again, A could be permuted to lower triangularity (although not strict lower triangularity; in this case all diagonal elements will in fact be non-zero) with an appropriate permutation k(i). Taken together, equation (3) and the independence and non-Gaussianity of the components of e define the standard linear independent component analysis (ICA) model [10,11], which is known to be identifiable.

While ICA is essentially able to estimate A (and W = A^{−1}), there are two important indeterminacies that ICA cannot solve. First and foremost, the order of the independent components is in no way defined or fixed. Thus, we could reorder the independent components and, correspondingly, the columns of A (and rows of W) and get an equivalent ICA model (the same probability density for the data). In most applications of ICA, this indeterminacy is of no significance and can be ignored, but in LiNGAM we can and must find the correct permutation, as described in [9]: the correct permutation is the only one which has no zeros in the diagonal.

The second indeterminacy of ICA concerns the scaling of the independent components. In ICA, this is usually handled by assuming all independent components to have unit variance, and scaling W and A appropriately. On the other hand, in LiNGAM (as in structural equation modeling, SEM [2]) we allow the disturbance variables to have arbitrary (non-zero) variances, but fix their weight (connection strength) to their corresponding observed variable to unity. This requires us to re-normalize the rows of W so that all the diagonal elements equal unity in order to obtain B.

Our LiNGAM discovery algorithm [9] can thus be briefly summarized: first, use a standard ICA algorithm to obtain an estimate of the demixing matrix W; permute its rows such that there are no zeros on its diagonal; rescale each row by dividing by the element on the diagonal; and finally compute B = I − W̃, where W̃ denotes the permuted and rescaled W. To find a causal order k(i) we must subsequently find a second permutation, to be applied equally to both the rows and columns of B, which yields strict lower triangularity.
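A minimal sketch of the discovery algorithm just summarized, assuming NumPy, SciPy, and scikit-learn's FastICA; the assignment-based permutation search is one convenient way to realize the "no zeros on the diagonal" step and is our own choice, not necessarily the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import FastICA

def lingam_discover(X):
    """Estimate B in x = Bx + e from a data matrix X (rows are samples)."""
    n = X.shape[1]

    # Step 1: ICA estimates the demixing matrix W = A^{-1},
    # up to a row permutation and row scaling.
    W = FastICA(n_components=n, random_state=0).fit(X).components_

    # Step 2: permute the rows of W so the diagonal has no zeros.  With
    # finite data no entry is exactly zero, so we choose the permutation
    # maximizing the absolute diagonal, solved as an assignment problem.
    _, col = linear_sum_assignment(-np.abs(W))
    W_perm = W[np.argsort(col)]

    # Step 3: rescale each row so its diagonal element equals one, fixing
    # the weight of each disturbance on its own variable to unity.
    W_perm = W_perm / np.diag(W_perm)[:, None]

    # Step 4: B = I - W~.  A further simultaneous row/column permutation
    # of B toward strict lower triangularity then gives a causal order k(i).
    return np.eye(n) - W_perm
```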
3 LiNGAM in the Presence of Latent Classes

In this section, we extend the basic LiNGAM above to cases where latent (hidden) classes are present.

3.1 Motivation

Let us begin with an example. Regarding a child's and a parent's height, earlier studies (e.g., [12]) pointed out that there is a hereditary effect on height, which is especially strong between a child and the same-sex parent. This implies that the connection strengths from the parent's height to the child's height (and possibly the network structures) could differ between the two classes (same-sex and different-sex children). This is a nonlinear relation between the child's and the parent's height even if the relations are still linear within each class, and it cannot be found if the class membership is ignored (see Section 4 for some artificial examples). In cases where such class membership is observed, we only have to analyze each class separately. However, in many cases it is quite difficult to detect and observe class membership, especially before collecting data. Thus, we need a method to estimate latent classes in a data-driven way. In the following, we extend the basic LiNGAM so that the method can estimate latent classes of samples that have similar network structures.

3.2 Model
Let us assume that the data are generated by the following mixture density:

$$p(x \mid \Theta) = \sum_{k=1}^{K} p(x \mid \mu_k, B_k)\, p(C = k), \qquad (4)$$

where Θ = [θ_1, ..., θ_K], θ_k = [μ_k^T, vec(B_k)^T]^T, μ_k is a mean vector, B_k is a connection strength matrix for class k, and C is a discrete variable that indicates the class k = 1, ..., K. (Here vec(·) denotes the vectorization operator, which creates a column vector from a matrix by stacking its columns.) We do not assume that we know the number of latent classes K or the a priori probabilities p(C = k). Moreover, the data within class k are assumed to be generated by the LiNGAM model:

$$x = B_k x + (I - B_k)\mu_k + e_k, \qquad (5)$$

where e_k is the disturbance (error) vector for class k. Note that the means, connection strengths and network structures (μ_k and B_k) can be different between classes. See Section 4 for some illustrative examples.

3.3 Model Identification Using ICA Mixtures

We propose that the new model above can be estimated using ICA mixture models [13]. As in the basic LiNGAM, the ICA model holds for each class:

$$x = \mu_k + A_k e_k, \qquad (6)$$

where A_k = (I − B_k)^{−1}. The mixture density is then just the ICA mixture model [13]. After μ_k and A_k are estimated, we can obtain estimates of B_k and causal orderings k(i) for class k in the same manner as in the basic LiNGAM (Section 2).

Some estimation methods for ICA mixtures have been proposed [13, 14]. Here we employ the minimum β-divergence method [14], since it does not require that the number of classes K and the a priori probabilities p(C = k) be known, which is a big advantage over [13]. Some drawbacks are that one has to tell the algorithm whether the disturbances e_i are super- or sub-Gaussian, and that a tuning parameter β must be selected using a cross-validation technique [15]. Fortunately, the first problem can be solved by (possibly non-parametric) estimation of the source densities [16, 14].

4 Illustrative Examples

In this section, we provide two illustrative examples of the LiNGAM in the presence of latent classes (abbreviated as LcLiNGAM) proposed above. We selected μ_k and B_k manually, as explained below. The disturbances followed the Laplace distribution with zero means, and we selected the variances so that the observed variables had unit variances. The number of latent classes was 2, and 250 data points were generated for each class. Note that the number of latent classes was estimated as well by the β-divergence method [14]. The scatterplots of the observed variables are shown in Figure 1.
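As a hedged sketch of this generative setup (Eqs. (4)-(6)), the following mirrors Example 1 below: two classes of 250 points, Laplace disturbances, and disturbance variances (1, 0.91) chosen so that both observed variables have unit variance within each class. The variance values and function names are our own reconstruction of the stated design, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(mu, B, n, var_e):
    """Draw n points of one class via x = mu + A e, A = (I - B)^{-1} (Eq. 6)."""
    m = len(mu)
    A = np.linalg.inv(np.eye(m) - B)
    # A Laplace(0, b) variable has variance 2 b^2, so b = sqrt(var / 2).
    e = rng.laplace(0.0, np.sqrt(np.asarray(var_e) / 2.0), size=(n, m))
    return mu + e @ A.T

# Example 1: the same structure x_1 -> x_2 in both classes, different means.
mu1, B1 = np.array([0.0, 0.0]), np.array([[0.0, 0.0], [0.3, 0.0]])
mu2, B2 = np.array([4.0, 5.0]), np.array([[0.0, 0.0], [0.3, 0.0]])
X = np.vstack([sample_class(mu1, B1, 250, [1.0, 0.91]),
               sample_class(mu2, B2, 250, [1.0, 0.91])])
```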
4.1 Example 1

We generated data using the following means, connection strengths and network structures:

$$\text{Class 1: } \mu_1 = \mathbf{0}, \quad B_1 = \begin{pmatrix} 0 & 0 \\ 0.3 & 0 \end{pmatrix} \qquad (7)$$

$$\text{Class 2: } \mu_2 = \begin{pmatrix} 4 \\ 5 \end{pmatrix}, \quad B_2 = \begin{pmatrix} 0 & 0 \\ 0.3 & 0 \end{pmatrix}. \qquad (8)$$

[Fig. 1. Left: Scatterplots of the observed variables in Example 1. Right: Scatterplots of the observed variables in Example 2. In the scatterplots, "." denotes members of class 1 and "+" those of class 2. Plots omitted.]
Both classes 1 and 2 had the same causal order x_1 → x_2, but different means (μ_1 ≠ μ_2). The different mean structures created a strong correlation (0.88) in the whole data, although the connection strength within each class was rather weak (0.3). The estimation results of the LcLiNGAM and the basic LiNGAM were as follows:

LcLiNGAM^1:

$$\text{Class 1: } \hat{\mu}_1 = \begin{pmatrix} -0.02 \\ 0.06 \end{pmatrix}, \quad \hat{B}_1 = \begin{pmatrix} 0 & 0 \\ 0.30 & 0 \end{pmatrix} \qquad (9)$$

$$\text{Class 2: } \hat{\mu}_2 = \begin{pmatrix} \cdot \\ 5.01 \end{pmatrix}, \quad \hat{B}_2 = \begin{pmatrix} 0 & 0 \\ 0.41 & 0 \end{pmatrix}, \qquad (10)$$

LiNGAM:

$$\hat{\mu} = \begin{pmatrix} -0.09 \\ 3.99 \end{pmatrix}, \quad \hat{B} = \begin{pmatrix} 0 & 0.81 \\ 0 & 0 \end{pmatrix}. \qquad (11)$$

The LcLiNGAM successfully recovered the means and network structures and estimated the connection strengths fairly well for both latent classes. However, the basic LiNGAM failed to find the correct causal order and overestimated the connection strength.
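The reported pooled correlation of 0.88 follows directly from the mean structure: with unit within-class variances, the pooled covariance of (x_1, x_2) is 0.3 + 2 · 2.5 = 5.3, while the pooled variances are 1 + 4 = 5 and 1 + 6.25 = 7.25, giving 5.3/√(5 · 7.25) ≈ 0.88. This can be checked numerically with the sampler sketched before Example 1 (our own illustration):

```python
# Reusing X from the sampler sketched before Example 1.
pooled_r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]          # close to 0.88
class1_r = np.corrcoef(X[:250, 0], X[:250, 1])[0, 1]    # close to 0.3
print(round(pooled_r, 2), round(class1_r, 2))
```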
4.2 Example 2

Next, we tried data whose means, connection strengths and network structures were as follows:

$$\text{Class 1: } \mu_1 = \mathbf{0}, \quad B_1 = \begin{pmatrix} 0 & 0 \\ 0.3 & 0 \end{pmatrix} \qquad (12)$$

$$\text{Class 2: } \mu_2 = \begin{pmatrix} 5 \\ 4 \end{pmatrix}, \quad B_2 = \begin{pmatrix} 0 & 0.3 \\ 0 & 0 \end{pmatrix}. \qquad (13)$$

Now the two classes had different causal orders: x_1 → x_2 for class 1 and x_1 ← x_2 for class 2. The connection strengths were the same, but the mean structures were different between the classes. The estimation results were as follows:

LcLiNGAM:

$$\text{Class 1: } \hat{\mu}_1 = \begin{pmatrix} \cdot \\ 0.07 \end{pmatrix}, \quad \hat{B}_1 = \begin{pmatrix} 0 & 0 \\ 0.39 & 0 \end{pmatrix} \qquad (14)$$

$$\text{Class 2: } \hat{\mu}_2 = \begin{pmatrix} \cdot \\ 4.01 \end{pmatrix}, \quad \hat{B}_2 = \begin{pmatrix} 0 & 0.41 \\ 0 & 0 \end{pmatrix}, \qquad (15)$$

LiNGAM:

$$\hat{\mu} = \begin{pmatrix} 3.88 \\ 0.08 \end{pmatrix}, \quad \hat{B} = \begin{pmatrix} 0 & 0.78 \\ 0 & 0 \end{pmatrix}. \qquad (16)$$
^1 Obviously, the orders of the latent classes themselves are not recovered. In the examples, for clarity, we permuted the classes so that the differences between the estimates and the true values were minimized.

The LcLiNGAM estimated the connection strengths fairly well and found the correct causal orders for each class. However, the basic LiNGAM could not find that the two classes have different causal orders, because it cannot represent any difference between the classes; it estimated, rather arbitrarily, one single causal order x_1 ← x_2.

5 Simulation

To further verify the validity of our method, we performed experiments with simulated data. We repeatedly performed the following experiment:

1. First, we randomly constructed a strictly lower-triangular matrix B for each class, where the number of classes was 2 and the number of variables was 4. We also randomly selected the variances of the disturbance variables. We further generated values for the constants μ_i making the classes have small overlap.^2

2. Next, we generated data with sample size 500 by independently drawing the disturbance variables e_i from the uniform distribution with zero mean and unit variance for each class. The observed data X were generated according to the assumed recursive process and were combined to create the whole dataset.

3. Finally, we fed the data to our discovery algorithm. The β-divergence method was employed to estimate the ICA mixtures. Here we told the algorithm that the disturbances were sub-Gaussian.

4. We compared the estimated parameters to the generating parameters. In particular, we made scatterplots of the entries in the estimates μ̂_k and B̂_k against the corresponding ones in μ_k and B_k. (Note that the numbers of latent classes were estimated as well.)

^2 We first set μ_1 = 0 and took as the elements of μ_2 1.5 times the sum of the standard deviations of the corresponding observed variables of each class, multiplied by −1 with probability 50%.

[Fig. 2. Left: Scatterplot of the estimated μ_i versus the original (generating) values. Right: Scatterplot of the estimated b_ij versus the original (generating) values. Five datasets were generated for the scatterplots. Plots omitted.]

Figure 2 gives scatterplots of the elements of the estimated μ_k and B_k versus the generating ones. The left panel shows the estimated means μ_i versus the original (generating) values; the right panel shows the estimated connection strengths b_ij versus the original (generating) values. We can see that the estimation works well, as evidenced by the grouping of the data points onto the main diagonal.

6 Conclusion

Developing statistical causal inference methods based on non-experimental data is a fundamental problem with a large number of potential applications. Previous methods developed for linear causal models [2, 6, 5] have been based on an explicit or implicit assumption of Gaussianity, and hence solely on the covariance structure of the data. Therefore, algorithms based on the Gaussian assumption cannot in most cases distinguish between multiple equally possible causal models. In previous work, we have shown that an assumption of non-Gaussianity of the disturbance variables, together with the assumptions of linearity and no latent variables, allows the linear acyclic model to be completely identified [9]. In this paper, we extended our previous work to cases where latent (hidden) classes are present. The new method can identify the DAG structures within latent classes and would enjoy a wider variety of applications.

Although in the artificial experiments our method worked well, we obviously need to evaluate its empirical performance through more extensive simulations as well as on real-world data. For example, in many real situations latent classes would overlap much more than in our simulations. Unfortunately, for such heavily overlapping cases, ICA estimation methods are still under development [14]. These are important topics for future research.

As a further analysis, it is quite important to investigate what characterizes the latent classes in order to understand how the model can be applied, for example, in the design of practical interventions. The estimated means, connection strengths and network structures could provide an interpretation of the latent classes. For example, in Examples 1 and 2 above (Section 4), the differences in the means would be useful for interpreting the difference between the two classes (and probably for classifying new samples). An additional or alternative way is to analyze the samples classified into the latent classes using logistic regression analysis, if covariates such as sex and age are available. One direction of future research would be to combine the latent class LiNGAM and logistic regression to improve the class distinction ability.^3

^3 [...] in the context of structural equation modeling (SEM) [17], although such SEMs based on Gaussian mixtures require that causal orders be prespecified, since non-Gaussianity is not utilized for the model identification.
A related topic is the case where hidden confounding (continuous) variables are present (latent variable LiNGAM) [18]. We would like to mention a useful connection between the two extensions of the basic LiNGAM. In the latent class LiNGAM discussed here, we basically have a binary (discrete) hidden confounding variable (the class membership) which determines the connection strengths when the structure of the network is the same for the different classes. In future work, we will consider a unifying framework that combines the two extensions.

Acknowledgment

This work was partially carried out at the Transdisciplinary Research Integration Center, Research Organization of Information and Systems. The authors would like to thank Patrik Hoyer for his valuable comments and Nurul Mollah, Mihoko Minami and Shinto Eguchi for providing access to their Matlab code for ICA mixtures. S.S. was supported by a Grant-in-Aid for Scientific Research from the Japan Society for the Promotion of Science.

References

1. Holland, P.W.: Statistics and causal inference (with discussion). Journal of the American Statistical Association 81, 945–970 (1986)
2. Bollen, K.A.: Structural Equations with Latent Variables. John Wiley & Sons, Chichester (1989)
3. Imoto, S., Goto, T., Miyano, S.: Estimation of genetic networks and functional structures between genes by using Bayesian network and nonparametric regression. In: Proc. Pacific Symposium on Biocomputing, vol. 7, pp. 175–186 (2002)
4. Kim, J., Zhu, W., Chang, L., Bentler, P.M., Ernst, T.: Unified structural equation modeling approach for the analysis of multisubject, multivariate functional MRI data. Human Brain Mapping 28, 85–93 (2007)
5. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge (2000)
6. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search, 2nd edn. MIT Press, Cambridge (2000)
7. Shimizu, S., Kano, Y.: Use of non-normality in structural equation modeling: Application to direction of causation. Journal of Statistical Planning and Inference (in press, 2006)
8. Shimizu, S., Hyvärinen, A., Hoyer, P.O., Kano, Y.: Finding a causal ordering via independent component analysis. Computational Statistics & Data Analysis 50(11), 3278–3293 (2006)
9. Shimizu, S., Hoyer, P.O., Hyvärinen, A., Kerminen, A.: A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7, 2003–2030 (2006)
10. Comon, P.: Independent component analysis, a new concept? Signal Processing 36, 62–83 (1994)
11. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
12. Tanner, J., Israelsohn, W.: Parent-child correlation for body measurements of children between the ages one month and seven years. Ann. Hum. Genet. 26, 245–259 (1963)
13. Lee, T.W., Lewicki, M., Sejnowski, T.: ICA mixture models for unsupervised classification of non-gaussian sources and automatic context switching in blind signal separation. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(10), 1–12 (2000)
14. Mollah, M.N.H., Minami, M., Eguchi, S.: Exploring latent structure of mixture ICA models by the minimum β-divergence method. Neural Computation 18, 166–190 (2006)
15. Minami, M., Eguchi, S.: Adaptive selection for minimum β-divergence method. In: Proc. Fourth Int. Conf. on Independent Component Analysis and Blind Signal Separation (ICA 2003), Nara, Japan, pp. 475–480 (2003)
16. Pham, D.T., Garrat, P.: Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. Signal Processing 45, 1457–1482 (1997)
17. Muthén, B.O.: Beyond SEM: General latent variable modeling. Behaviormetrika 29, 81–117 (2002)
18. Hoyer, P.O., Shimizu, S., Kerminen, A.: Estimation of linear, non-gaussian causal models in the presence of confounding latent variables. In: Proc. Third European Workshop on Probabilistic Graphical Models (PGM 2006), pp. 155–162 (2006)

Efficient Incremental Learning Using Self-Organizing Neural Grove

Hirotaka Inoue^{1} and Hiroyuki Narihisa^{2}

^{1} Department of Electrical Engineering and Information Science, Kure College of Technology, 2-2-11 Agaminami, Kure-shi, Hiroshima, 737-8506 Japan, hiro@kure-nct.ac.jp
^{2} Department of Information and Computer Engineering, Okayama University of Science, 1-1 Ridai-cho, Okayama-shi, Okayama, 700-0005 Japan, narihisa@ice.ous.ac.jp

Abstract. Multiple classifier systems (MCS) have become popular during the last decade. The self-generating neural tree (SGNT) is one of the suitable base classifiers for MCS because of its simple setup and fast learning. In an earlier paper, we proposed a pruning method for the structure of the SGNT in the MCS to reduce the computational cost, and we called this model the self-organizing neural grove (SONG). In this paper, we investigate the performance of incremental learning using the SONG on a large-scale classification problem. The results show that the SONG can ensure rapid and efficient incremental learning.

1 Introduction

Recently, to improve classification accuracy, multiple classifier systems (MCS) such as neural network ensembles, bagging, and boosting have been used for practical data mining applications [1]. When developing classifiers using learning methods, while more training data can reduce the prediction error, the learning process can itself become computationally intractable. This issue is becoming more evident today, because there are complex classification problems waiting to be solved in many domains where large amounts of data are already available [2]. Ideally, it is desirable to consider all the examples simultaneously, to get the best possible estimate of the class distribution. On the other hand, when the training set is large, all the examples cannot be loaded into memory at once. One approach to overcome this constraint is to train the classifier using an incremental learning technique, whereby only subsets of the data are considered at any one time and the results are subsequently combined.

Neural networks have great advantages of adaptability, flexibility, and universal nonlinear input-output mapping capability. However, to apply these neural networks, it is necessary to determine the network structure and some parameters by human experts, and it is quite difficult to choose the right network structure suitable for a particular application at hand. Moreover, they require
a long training time to learn the input-output relation of the given data. These drawbacks prevent neural networks from being the base classifier of the MCS for practical applications.

Self-generating neural networks (SGNN) [3] have a simple network design and high-speed learning. SGNN are an extension of the self-organizing maps (SOM) of Kohonen [4] and utilize competitive learning, which is implemented as a self-generating neural tree (SGNT). These abilities make SGNN suitable as the base classifier of the MCS. In order to improve the accuracy of SGNN, we proposed ensemble self-generating neural networks (ESGNN) for classification [5] as one form of MCS. Although the accuracy of ESGNN improves by using various SGNN, the computation cost, that is, the computation time and the memory capacity, increases in proportion to the number of SGNN in the MCS. Therefore, we proposed a pruning method for the structure of the SGNN in the MCS to reduce the computation time and the memory capacity, and we called this model the self-organizing neural grove (SONG) [6,7].

In this paper, we investigate the performance of incremental learning using the SONG on a large-scale classification problem from the UCI machine learning repository [8]. We use the letter recognition dataset as the classification problem. We investigate the relation between the number of training data and the classification accuracy, the number of nodes, and the computation time. The results show that the SONG can ensure rapid and efficient incremental learning.

The rest of the paper is organized as follows: the next section shows how to construct the SONG; Section 3 shows experimental results; finally, we present some conclusions and outline plans for future work.

2 Self-Organizing Neural Grove

In this section, we describe how to prune redundant leaves in the SONG. We implement the pruning method in two stages: the on-line pruning method and the off-line optimization method. First, we describe the on-line pruning method used during learning of the SGNT. Second, we show the optimization method used in constructing the SONG. Finally, we show a simple example of the pruning method on a two-dimensional classification problem.

2.1 On-line Pruning of the Self-Generating Neural Tree

The SGNT is based on SOM and implemented as competitive learning. The SGNT can be constructed directly from the given training data without any intervening human effort. The SGNT algorithm is defined as a tree construction problem: how to construct a tree structure from the given data, which consist of multiple attributes, under the condition that the final leaves correspond to the given data.

Before we describe the SGNT algorithm, we introduce some notation:

– input data vector: e_i ∈ R^m.
– root, leaf, and node in the SGNT: n_j.
– weight vector of n_j: w_j ∈ R^m.
– number of the leaves in n_j: c_j.
– distance measure: d(e_i, w_j).
– winner leaf for e_i in the SGNT: n_win.

The code of the SGNT algorithm is given in Fig. 1, which uses several sub procedures; Table 1 shows the sub procedures of the SGNT algorithm and their specifications.

    Input: A set of training examples E = {e_i}, i = 1, ..., N.
           A distance measure d(e_i, w_j).
    Program Code:
        copy(n_1, e_1);
        for (i = 2, j = 2; i <= N; i++) {
            n_win = choose(e_i, n_1);
            if (leaf(n_win)) {
                copy(n_j, w_win);
                connect(n_j, n_win);
                j++;
            }
            copy(n_j, e_i);
            connect(n_j, n_win);
            j++;
            prune(n_win);
        }
    Output: Constructed SGNT by E.

Fig. 1. SGNT algorithm

Table 1. Sub procedures of the SGNT algorithm

Sub procedure           Specification
copy(n_j, e_i/w_win)    Create n_j; copy e_i/w_win as w_j in n_j.
choose(e_i, n_1)        Decide n_win for e_i.
leaf(n_win)             Check whether n_win is a leaf or not.
connect(n_j, n_win)     Connect n_j as a child leaf of n_win.
prune(n_win)            Prune leaves if the leaves have the same class.

In order to decide the winner leaf n_win in the sub procedure choose(e_i, n_1), competitive learning is used. If a node n_j includes the n_win as its descendant in the SGNT, the weight w_jk (k = 1, 2, ..., m) of n_j is updated as follows:

$$w_{jk} \leftarrow w_{jk} + \frac{1}{c_j}\,(e_{ik} - w_{jk}), \qquad 1 \le k \le m. \qquad (1)$$
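For readers who prefer runnable code to the pseudocode of Fig. 1, here is a compact Python sketch of the SGNT construction and the winner-path update of Eq. (1). The class-based prune() step and class labels are omitted and all names are our own, so treat this as an illustration of the algorithm's shape rather than the authors' implementation.

```python
import numpy as np

class Node:
    def __init__(self, w, parent=None):
        self.w = np.array(w, dtype=float)  # weight vector w_j (copied)
        self.parent = parent
        self.children = []                 # empty list => n_j is a leaf
        self.c = 1                         # number of leaves covered by n_j

def choose(e, root):
    """Competitive search: descend to the leaf whose weight is nearest e."""
    node = root
    while node.children:
        node = min(node.children, key=lambda n: np.linalg.norm(e - n.w))
    return node

def grow_sgnt(examples):
    """Simplified SGNT construction (Fig. 1), without the prune() step."""
    it = iter(examples)
    root = Node(next(it))                          # copy(n_1, e_1)
    for e in it:
        win = choose(e, root)                      # n_win = choose(e_i, n_1)
        if not win.children:                       # leaf(n_win): split it by
            win.children.append(Node(win.w, win))  # copying w_win as a child
        win.children.append(Node(e, win))          # connect the new leaf e_i
        # Eq. (1): every node on the path to the winner moves toward e by
        # 1/c_j, keeping w_j near the centroid of the leaves it covers.
        node = win
        while node is not None:
            node.c += 1
            node.w += (e - node.w) / node.c
            node = node.parent
    return root

tree = grow_sgnt(np.random.default_rng(0).random((100, 3)))
```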