Lecture Notes in Computer Science
[Fig. 2. An MCS constructed from K SGNTs. The test dataset T is entered into each SGNT, each output o_i is computed as the output of the winner leaf for the input data, and the MCS's output is decided by voting over the outputs of the K SGNTs.]

After all training data are inserted into the SGNT as leaves, each leaf carries its class label as its output, and the weights of each node are the averages of the corresponding weights of all its leaves. The whole network of the SGNT reflects the given feature space through its topology. For more details on how to construct and apply the SGNT, see [3]. Note that, to optimize the structure of the SGNT effectively, we remove the threshold value of the original SGNT algorithm in [3], which controls the number of leaves based on distance, because of the trade-off between memory capacity and classification accuracy. To avoid the resulting problem, we introduce a new pruning method in the sub-procedure prune(n_win), which uses the class label to prune leaves: for the leaves connected to n_win, if those leaves have the same class label, then their parent node is given that class label and the leaves are pruned. In the next subsection, we describe how to optimize the structure of the SGNT in the MCS to improve the classification accuracy.
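The combining step in Fig. 2 is plain majority voting over the K leaf outputs. The short sketch below illustrates this step; the SGNT objects and their `predict` method are hypothetical stand-ins for the winner-leaf lookup described above, since the paper does not fix a programming interface.

```python
from collections import Counter

def mcs_predict(sgnts, x):
    """Combine K trained SGNTs by majority vote, as in Fig. 2.

    `sgnts` is a list of trained trees; each is assumed to expose a
    predict(x) method that returns the class label of its winner leaf.
    """
    outputs = [t.predict(x) for t in sgnts]      # o_1, ..., o_K
    votes = Counter(outputs)
    label, _ = votes.most_common(1)[0]           # majority class wins
    return label
```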
2.2 Optimization of the SONG

The SGNT is capable of high-speed processing. However, its accuracy is inferior to that of conventional approaches such as nearest neighbor, because the SGNT gives no guarantee of reaching the nearest leaf for unknown data. Hence, we construct an MCS that takes the majority of several SGNTs' outputs to improve the accuracy (Figure 2). Although the accuracy of the SONG is comparable to that of conventional approaches, the computational cost increases in proportion to the number of SGNTs in the SONG. In particular, the huge memory requirement prevents the use of the SONG for large datasets even on current computers.

In order to improve the classification accuracy, we propose an optimization method of the SONG for classification. The method has two parts, the merge phase and the evaluation phase.

Fig. 3. The merge phase
1 begin initialize j = the height of the SGNT
2   do for each subtree's leaves at height j
3     if the ratio of the most common class ≥ α,
4       then merge all leaves into the parent node
5     if all subtrees at height j have been traversed,
6       then j ← j − 1
7   until j = 0
8 end.

Fig. 4. The evaluation phase
1 begin initialize α = 0.5
2   do for each α
3     evaluate the merge phase with 10-fold CV
4     if the best classification accuracy is obtained,
5       then record this α as the optimal value
6     α ← α + 0.05
7   until α = 1
8 end.

The merge phase is performed as a pruning algorithm that reduces dense leaves (Figure 3). This phase uses the class information and a threshold value α to decide which subtree's leaves to prune. For leaves that share the same parent node, if the proportion of the most common class is greater than or equal to the threshold value α, then these leaves are pruned and the parent node is given the most common class. The optimum threshold α differs from problem to problem, so the evaluation phase chooses the best threshold value by 10-fold cross-validation (Figure 4).
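The following sketch illustrates the merge phase of Fig. 3. The node fields (`children`, `is_leaf`, `class_label`) are a hypothetical tree representation chosen for illustration, and the height-by-height loop of Fig. 3 is expressed here as a repeat-until-stable pass, which performs the same bottom-up merges.

```python
from collections import Counter

def merge_phase(root, alpha):
    """Bottom-up pruning of an SGNT (Fig. 3): if the leaves under a parent
    agree on one class with ratio >= alpha, the leaves are pruned and the
    parent takes that class label."""
    changed = True
    while changed:                       # repeat until no merge is possible
        changed = False
        for node in internal_nodes(root):
            if node.children and all(c.is_leaf for c in node.children):
                counts = Counter(c.class_label for c in node.children)
                label, n = counts.most_common(1)[0]
                if n / len(node.children) >= alpha:
                    node.children = []   # prune the leaves
                    node.is_leaf = True
                    node.class_label = label
                    changed = True

def internal_nodes(node):
    """Yield every non-leaf node of the tree."""
    if not node.is_leaf:
        yield node
        for c in node.children:
            yield from internal_nodes(c)
```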
2.3 Simple Example of the Pruning Method

We show an example of the pruning algorithm in Figure 5. This is a two-dimensional classification problem with two equal circular Gaussian distributions that overlap. The shaded plane is the decision region of class 0 and the other plane is the decision region of class 1, as produced by the SGNT; the dotted line is the ideal decision boundary. The number of training samples is 200 (class 0: 100, class 1: 100) (Figure 5(a)). The unpruned SGNT is given in Figure 5(b). In this case, 200 leaves and 120 nodes are automatically generated by the SGNT algorithm, the height is 7, and the number of units is 320. Here we define a unit as any element counted in the sum of the root, nodes, and leaves of the SGNT; the root is the node at height 0. The number of units is used as a measure of the memory requirement in the next section.

[Fig. 5. An example of the SGNT's pruning algorithm: (a) a two-dimensional classification problem with two equal circular Gaussian distributions, (b) the structure of the unpruned SGNT, (c) the structure of the pruned SGNT (α = 1), and (d) the structure of the pruned SGNT (α = 0.6). The shaded plane is the decision region of class 0 by the SGNT and the dotted line shows the ideal decision boundary.]

Figure 5(c) shows the pruned SGNT after the merge phase with α = 1. In this case, 159 leaves and 107 nodes are pruned away and 54 units remain. The decision boundary is the same as that of the unpruned SGNT. Figure 5(d) shows the pruned SGNT after the merge phase with α = 0.6. In this case, 182 leaves and 115 nodes are pruned away and only 23 units remain. Moreover, the decision boundary improves over that of the unpruned SGNT, because pruning reduces the effect of the overlapping classes.

In the above example, we use all training data to construct the SGNT. The structure of the SGNT changes with the order of the training data. Hence, we can construct the MCS from the same training data by changing the input order; we call this approach "shuffling". To show how well the MCS is optimized by the pruning algorithm, we show an example of the MCS on the same problem used above. Figure 6(a) and Figure 6(b) show the decision region of the MCS for α = 1 and α = 0.6, respectively, with the number of SGNTs K set to 25. The result in Figure 6(b) is a better estimate of the ideal decision region than the result in Figure 6(a).
[Fig. 6. An example of the MCS's decision boundary (K = 25): (a) α = 1 and (b) α = 0.6. The shaded plane is the decision region of class 0 by the MCS and the dotted line shows the ideal decision boundary.]

3 Experimental Results

We investigate the relation between the number of training data and the classification accuracy, the number of nodes, and the computation time of the SONG with bagging on a benchmark problem from the UCI repository [8]. In this experiment, we use a modified Euclidean distance measure defined as
$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{m} a_i \cdot (x_i - y_i)^2, \qquad (2)$$
$$a_i = \frac{1}{\max_j - \min_j}, \qquad (1 \le j \le N), \qquad (3)$$
where the maximum and minimum for attribute $i$ are taken over the $N$ training samples.
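As a concrete reading of Eqs. (2)–(3), the sketch below computes the attribute-scaled distance with NumPy, taking the per-attribute maxima and minima over the N training samples; the function names are illustrative only.

```python
import numpy as np

def attribute_weights(X_train):
    """a_i = 1 / (max - min) of attribute i over the N training samples, Eq. (3)."""
    ranges = X_train.max(axis=0) - X_train.min(axis=0)
    return 1.0 / np.where(ranges > 0, ranges, 1.0)   # guard constant attributes

def modified_distance(x, y, a):
    """d(x, y) = sum_i a_i * (x_i - y_i)^2, Eq. (2)."""
    return float(np.sum(a * (x - y) ** 2))

# usage: a = attribute_weights(X_train); d = modified_distance(x, y, a)
```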
We use the letter recognition dataset in this experiment because it is a large-scale dataset (16 input dimensions, 26 classes, and 20,000 entries). The objective of this dataset is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters of the English alphabet. The character images were based on 20 different fonts, and each letter within these fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to integer values from 0 through 15. Results for other benchmark problems and a comparative study are given in [7,9].

First, we divide the letter recognition dataset into ten parts. Second, we select one of the ten parts as the testing data. Third, we enter one of the remaining nine parts into the SONG for training. Fourth, we test the SONG using the testing data. Finally, we continue the training and testing until all nine remaining parts have been entered into the SONG. We set the number of SGNTs K in the SONG to 1, 3, 5, 9, 15, and 25. To select the optimum threshold value α, we sweep α from 0.5 to 1; α = [0.5, 0.55, 0.6, . . . , 1]. All computations for the SONG are performed on an IBM PC/AT compatible machine (CPU: Intel Pentium 4, 2.26 GHz; memory: 512 MB).
[Fig. 7. The relation between the number of training data and the classification accuracy for K = 1, 3, 5, 9, 15, and 25.]

Figure 7 shows the relation between the number of training data and the classification accuracy. As the number of training data increases, the classification accuracy improves for every ensemble size K; the improvement is widest for small K across all values of N.

As the memory requirement, we count the number of units, i.e., the sum of the root, nodes, and leaves of the SGNT. In this paper, we use the compression ratio defined as
$$\text{compression ratio} = \frac{\text{number of remaining units}}{\text{number of total units}}. \qquad (4)$$

Table 2. The compression ratio of the SONG for the letter dataset

Number of training data N (×10³)   2     4     6     8     10    12    14    16    18
Compression ratio (%)              34.8  29.2  26.6  24.6  23.3  22.1  21.0  20.1  19.4

Table 2 shows the compression ratio of the memory requirement in the SONG. The compression ratio decreases gradually as the number of training data increases. This means that the SONG compresses large-scale data well, which supports the use of the SONG for large-scale datasets.

Finally, Table 3 shows the computation time for training (total and interval) and for testing at K = 25, α = 1.0. The training time and the testing time are the same at N = 2000. Except at N = 2000, the interval training time is slightly larger than the testing time, because new leaves must be generated and redundant leaves pruned in the SONG. The testing time increases only slightly with the number of training data, since the SONG searches for the nearest node within a particular subtree that has already been pruned. In conclusion, the SONG is practical for incremental learning and large-scale data mining.
Table 3. The computation time of training and testing at K = 25, α = 1.0

Number of training data N (×10³)   2     4     6     8     10    12    14    16    18
Total training time (s)            0.25  0.61  1.02  1.5   1.98  2.48  3.01  3.55  4.12
Interval training time (s)         0.25  0.36  0.41  0.48  0.48  0.5   0.53  0.54  0.57
Testing time (s)                   0.25  0.31  0.36  0.34  0.39  0.43  0.48  0.45  0.44

4 Conclusions

In this paper, we investigated the performance of incremental learning with the SONG. Experimental results showed that the memory requirement is reduced effectively and that the accuracy increases with the number of training data. In conclusion, the SONG is a useful and practical incremental learning method for classifying large datasets. In future work, we will study a more effective pruning algorithm and parallel and distributed processing of the SONG for large-scale data mining.

References

1. Quinlan, J.R.: Bagging, Boosting, and C4.5. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence, August 4–8, 1996, pp. 725–730. AAAI Press / MIT Press (1996)
2. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery: an overview. In: Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge (1996)
3. Wen, W.X., Jennings, A., Liu, H.: Learning a neural tree. In: Proc. of the International Joint Conference on Neural Networks, Beijing, China, November 3–6, 1992, vol. 2, pp. 751–756 (1992)
4. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1995)
5. Inoue, H., Narihisa, H.: Improving generalization ability of self-generating neural networks through ensemble averaging. In: Terano, T., Chen, A.L.P. (eds.) PAKDD 2000. LNCS, vol. 1805, pp. 177–180. Springer, Berlin (2000)
6. Inoue, H., Narihisa, H.: Effective pruning method for a multiple classifier system based on self-generating neural networks. In: Kaynak, O., Alpaydın, E., Oja, E., Xu, L. (eds.) ICANN 2003 and ICONIP 2003. LNCS, vol. 2714, pp. 11–18. Springer, Berlin (2003)
7. Inoue, H., Narihisa, H.: Self-organizing neural grove: Efficient multiple classifier system using pruned self-generating neural trees. In: Yao, X., Burke, E.K., Lozano, J.A., Smith, J., Merelo-Guervós, J.J., Bullinaria, J.A., Rowe, J.E., Tiňo, P., Kabán, A., Schwefel, H.-P. (eds.) PPSN 2004. LNCS, vol. 3242, pp. 1113–1122. Springer, Berlin (2004)
8. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Science (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
9. Inoue, H.: Self-organizing neural grove. WSEAS Trans. on Computers 5(10), 2238–2244 (2006)

Design of an Unsupervised Weight Parameter Estimation Method in Ensemble Learning

Masato Uchida¹, Yousuke Maehara², and Hiroyuki Shioya³
¹ 3–8–1 Asano, Kokurakita-ku, Kitakyushu-shi, Fukuoka 801–0001, Japan
m.uchida@ndrc.kyutech.ac.jp
² Graduate School of Computer Science and Systems Engineering, Muroran Institute of Technology, 27–1 Mizumoto-cho, Muroran-shi, Hokkaido 050–8585, Japan
maehara@saint.csse.muroran-it.ac.jp
³ Department of Computer Science and Systems Engineering, Muroran Institute of Technology, 27–1 Mizumoto-cho, Muroran-shi, Hokkaido 050–8585, Japan
shioya@csse.muroran-it.ac.jp

Abstract. A learning method that uses an integration of multiple component predictors as an ultimate predictor is generically referred to as ensemble learning. The present paper proposes a weight parameter estimation method for ensemble learning under the constraint that we do not have any information on the desirable (true) output. The proposed method is naturally derived from a mathematical model of ensemble learning, which is based on an exponential mixture type probabilistic model and Kullback divergence. The proposed method provides a legitimate strategy for weight parameter estimation under the abovementioned constraint if it is assumed that the accuracies of all the multiple predictors are the same. We verify the effectiveness of the proposed method through numerical experiments.

1 Introduction

A learning method in which an ultimate predictor, called an ensemble predictor, which is an integration of multiple component predictors, is built is generically referred to as ensemble learning. The purpose of ensemble learning is to enhance accuracy and to reduce performance fluctuation through integration. Representative ensemble learning methods include Bagging [1], Boosting [2], and their derivatives. These methods perturb the data, which is composed of input information and corresponding desirable (true) output information, by resampling in order to induce diversity of the component predictors. For example, in a boosting method called AdaBoost [2], a weak learner/predictor (slightly better than random guessing) is trained iteratively while increasing the intensity of misclassified samples and decreasing the intensity of correctly classified samples. The ensemble predictor
is then built by integrating the multiple trained component predictors according to their performance with respect to the input/output data. On the other hand, the simple ensemble learning method proposed in [3] builds an ensemble predictor using the weighted average of multiple component predictors, where the weights are determined according to the squared loss of the ensemble predictor with respect to the input/output data.

The above observation indicates that a number of existing studies concerning ensemble learning have considered how to use the given input/output data to obtain an efficient ensemble predictor. However, even when the input/output data is not given, the essential strategy of ensemble learning (i.e., the integration of multiple component predictors) itself remains promising if multiple trained component predictors for building the ensemble predictor are on hand. That is, it is expected that the ensemble strategy will provide an effective solution, by using the multiple available trained component predictors, even for a prediction problem whose answer (i.e., true output) is unknown and will only be revealed in the future. Such prediction problems are not unusual, and typical examples include "Who will win the next Academy Awards?", "How long will the current cabinet be in power?", and "Will a new item of a certain company make a big hit?". Although these social/political/economic examples are not from laboratory experiments, even for these kinds of problems it is expected that a convincing prediction can be obtained by aggregating opinions collected from various individuals. However, a number of previous studies on ensemble learning cannot be applied to solving the abovementioned problems, despite the expectation for the essential strategy of ensemble learning, because these previous studies implicitly assumed the input/output data to be available for building the ensemble predictor, while the output information of the abovementioned problems cannot be obtained in principle. Thus, the question arises as to how to construct an ensemble predictor without referring to output information, where it is assumed that we can apply both the knowledge of the problem to be solved (i.e., input information) and opinions on the problem (i.e., multiple trained component predictors). By answering such a question, the current application fields of ensemble learning would be expanded beyond laboratory experiments.

Therefore, the present paper proposes a new ensemble learning method, called unsupervised ensemble learning, that builds an ensemble predictor without using output information. The proposed unsupervised ensemble learning method is based on an ensemble learning model proposed in [4,5], which is a generalization of the method proposed in [3]. Although the generalized model is designed under the assumption that output data can be used, as was the case in previous studies, the proposed unsupervised ensemble learning method is naturally derived from the generalized model by focusing on its structure, which is characterized by an exponential mixture type probabilistic model and Kullback divergence. In addition, if the accuracies of the individual multiple predictors are assumed to be approximately the same, the proposed method becomes a legitimate strategy under the constraint that there is no output information.
2 Ensemble Learning with Supervised Weight Parameter Estimation

2.1 Fundamental Model

First, we briefly review the formulation of the ensemble learning method proposed in [3]. Let us consider a component predictor that outputs $f_{\theta_i}(\mathbf{x})$ $(\in \mathbb{R})$ for the input $\mathbf{x} = (x_1, \ldots, x_m)^T$ $(\in \mathbb{R}^m)$, where $\theta_i = (\theta_i^{(1)}, \ldots, \theta_i^{(k_i)})^T$ $(\in \mathbb{R}^{k_i})$ is the set of modifiable parameters of the predictor $f_{\theta_i}(\mathbf{x})$ for $i = 1, \ldots, M$. In addition, assume that i.i.d. (independent and identically distributed) sample data composed of input $\mathbf{x}$ and corresponding desirable (true) output $y$ are obtained from a certain probability density function (pdf) $p^*(\mathbf{x}, y)$ $(= p^*(\mathbf{x})\, p^*(y|\mathbf{x}))$, where the set of i.i.d. sample data is denoted by $D_{n_i}(\mathbf{x}, y) = \{(\mathbf{x}^{(i)}_1, y^{(i)}_1), \ldots, (\mathbf{x}^{(i)}_{n_i}, y^{(i)}_{n_i})\}$.

Here, the ensemble learning method based on the square loss function, which was proposed in [3], involves finding
$$\hat{\theta}_i = \arg\min_{\theta_i} \sum_{(\mathbf{x}, y) \in D_{n_i}(\mathbf{x}, y)} (y - f_{\theta_i}(\mathbf{x}))^2 \qquad (1)$$
and then using
$$\bar{f}_{\hat{\theta}, \beta}(\mathbf{x}) = \sum_{i=1}^{M} \beta_i f_{\hat{\theta}_i}(\mathbf{x}) \qquad (2)$$
as an ensemble predictor, where $\theta = (\theta_1^T, \ldots, \theta_M^T)^T \in \mathbb{R}^{\sum_{i=1}^{M} k_i}$, $\beta = (\beta_1, \ldots, \beta_M)^T \in \mathbb{R}^M$, $\sum_{i=1}^{M} \beta_i = 1$, $\beta_i > 0$. Note that $\beta_M$ is defined using $\beta_1, \ldots, \beta_{M-1}$ as $\beta_M = 1 - \sum_{i=1}^{M-1} \beta_i$ without loss of generality. This means that the essential dimension of $\beta$ is $M - 1$. Hereafter, the parameter $\beta$ is referred to as the weight parameter.

Now, if we can use the extra input/output data $D_n(\mathbf{x}', y') = \{(\mathbf{x}'_1, y'_1), \ldots, (\mathbf{x}'_n, y'_n)\}$, which is a set of i.i.d. sample data obtained from $p^*(\mathbf{x}, y)$, the value of $\beta$ can be estimated as
$$\hat{\beta}_S = \arg\min_{\beta} \sum_{(\mathbf{x}', y') \in D_n(\mathbf{x}', y')} \big(y' - \bar{f}_{\hat{\theta}, \beta}(\mathbf{x}')\big)^2, \qquad (3)$$
where $\hat{\beta}_S = (\hat{\beta}_{S,1}, \ldots, \hat{\beta}_{S,M-1})$ and $\hat{\beta}_{S,M}$ is defined as $\hat{\beta}_{S,M} = 1 - \sum_{i=1}^{M-1} \hat{\beta}_{S,i}$. Note that the estimation in Eq. (3) is supervised in the sense that it is executed by using the output data $y'$.
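A minimal sketch of the supervised estimate in Eq. (3): with β_M eliminated through the sum-to-one constraint, Eq. (3) becomes an ordinary least-squares problem in (β_1, …, β_{M−1}). The matrix F of component predictions on the extra data is an assumed input, and the positivity constraint β_i > 0 is not enforced in this sketch.

```python
import numpy as np

def supervised_weights(F, y):
    """Least-squares estimate of beta in Eq. (3).

    F : (n, M) array, F[k, i] = f_i(x'_k) on the extra data D_n(x', y')
    y : (n,) array of the corresponding true outputs y'
    """
    n, M = F.shape
    G = F[:, :M - 1] - F[:, [M - 1]]            # columns f_i - f_M, i < M
    r = y - F[:, M - 1]                          # residual after the last predictor
    b, *_ = np.linalg.lstsq(G, r, rcond=None)    # beta_1, ..., beta_{M-1}
    return np.append(b, 1.0 - b.sum())           # beta_M from the sum-to-one constraint
```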
The present paper proposes an unsupervised weight parameter estimation method. That is, the proposed method estimates the value of $\beta$ without using the output data $y$. The proposed estimation method is based on the generalized model of ensemble learning [4,5] that includes the fundamental ensemble learning model mentioned in this section as a special case. A brief review of [4,5] is given in the next section.

2.2 General Model Based on the Exponential Mixture Model

Gaussian Exponential Mixture Model. Let us define a conditional Gaussian pdf of $y$ given $\mathbf{x}$ using a function $g(\mathbf{x})$ $(\in \mathbb{R})$ as follows:
$$p_{G,g}(y|\mathbf{x}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} (y - g(\mathbf{x}))^2 \right),$$
where $\sigma$ is a positive constant value. Using the above definition, we identify $p_{G,g}(y|\mathbf{x})$ with $g(\mathbf{x})$. In addition, Eq. (1) can be rewritten as
$$\hat{\theta}_i = \arg\min_{\theta_i} \left( - \sum_{(\mathbf{x}, y) \in D_{n_i}(\mathbf{x}, y)} \log p_{G, f_{\theta_i}}(y|\mathbf{x}) \right). \qquad (4)$$
On the other hand, we know the following relationship [6], which corresponds to Eq. (2):
$$p_{G, \bar{f}_{\hat{\theta}, \beta}}(y|\mathbf{x}) = \frac{\prod_{i=1}^{M} p_{G, f_{\hat{\theta}_i}}(y|\mathbf{x})^{\beta_i}}{\int_{\mathbb{R}} \prod_{i=1}^{M} p_{G, f_{\hat{\theta}_i}}(y|\mathbf{x})^{\beta_i}\, dy}. \qquad (5)$$
We herein refer to $p_{G, \bar{f}_{\hat{\theta}, \beta}}(y|\mathbf{x})$ as a Gaussian exponential mixture model. In addition, Eq. (3) can be rewritten using Eq. (5) as
$$\hat{\beta}_S = \arg\min_{\beta} \left( - \sum_{(\mathbf{x}', y') \in D_n(\mathbf{x}', y')} \log p_{G, \bar{f}_{\hat{\theta}, \beta}}(y'|\mathbf{x}') \right). \qquad (6)$$

General Exponential Mixture Model. Let us denote the set of all pdfs on $\mathcal{X}$ $(\subset \mathbb{R}^m)$ by $\mathcal{P}(\mathcal{X})$ and the set of all conditional pdfs on $\mathcal{Y}$ $(\subset \mathbb{R}^l)$ given $\mathcal{X}$ $(\subset \mathbb{R}^m)$ by $\mathcal{P}(\mathcal{Y}|\mathcal{X})$. Here, let us define a new conditional pdf, which can be regarded as a generalization of Eq. (5), using $p_i(y|\mathbf{x})$ $(\in \mathcal{P}_i(\mathcal{Y}|\mathcal{X}) \subset \mathcal{P}(\mathcal{Y}|\mathcal{X})$, $i = 1, \ldots, M)$ as
$$\bar{p}_{\beta}(y|\mathbf{x}) \stackrel{\mathrm{def}}{=} \frac{\prod_{i=1}^{M} p_i(y|\mathbf{x})^{\beta_i}}{\int_{\mathcal{Y}} \prod_{i=1}^{M} p_i(y|\mathbf{x})^{\beta_i}\, dy}, \qquad (7)$$
where we assume $\beta = (\beta_1, \ldots, \beta_{M-1})^T \in \mathbb{R}^{M-1}$, $\sum_{i=1}^{M} \beta_i = 1$, and $\int_{\mathcal{Y}} \prod_{i=1}^{M} p_i(y|\mathbf{x})^{\beta_i}\, dy < \infty$ $(\forall \mathbf{x} \in \mathcal{X})$. We herein refer to $\bar{p}_{\beta}(y|\mathbf{x})$ as an exponential mixture model.
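A quick numerical check of Eq. (5): for Gaussian components with a common σ and weights summing to one, the normalized product of powered densities is again a Gaussian with the same σ, centred at the weighted mean Σ_i β_i f_i(x). The sketch below verifies this on a grid; the numerical values are arbitrary examples.

```python
import numpy as np

# component predictions f_i(x) at one fixed input x, and weights summing to one
mu = np.array([0.0, 1.0, 3.0])        # arbitrary illustrative values
beta = np.array([0.5, 0.3, 0.2])
sigma = 1.0

y = np.linspace(-6.0, 9.0, 4001)
dy = y[1] - y[0]

def gauss(y, m):
    """Gaussian density with mean m and standard deviation sigma."""
    return np.exp(-(y - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# exponential mixture, Eq. (5)/(7): product of powered densities, then normalize
num = np.prod([gauss(y, m) ** b for m, b in zip(mu, beta)], axis=0)
mix = num / (num.sum() * dy)

# Gaussian centred at the weighted mean sum_i beta_i f_i(x)
direct = gauss(y, beta @ mu)

print(np.max(np.abs(mix - direct)))   # tiny, limited only by the grid resolution
```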
On the other hand, for $\forall p^*(\mathbf{x}), \forall q(\mathbf{x}) \in \mathcal{P}(\mathcal{X})$ and $\forall p^*(y|\mathbf{x}), \forall q(y|\mathbf{x}) \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$, the Kullback divergence $D(\cdot \,\|\, \cdot)$ between $p^*(y|\mathbf{x})\, p^*(\mathbf{x})$ and $q(y|\mathbf{x})\, q(\mathbf{x})$ satisfies the following chain rule:
$$D(p^*(y|\mathbf{x})\, p^*(\mathbf{x}) \,\|\, q(y|\mathbf{x})\, q(\mathbf{x})) = D(p^*(\mathbf{x}) \,\|\, q(\mathbf{x})) + D(p^*(y|\mathbf{x}) \,\|\, q(y|\mathbf{x})). \qquad (8)$$
If $p^*(\mathbf{x}) \equiv q(\mathbf{x})$, the first term on the right-hand side of Eq. (8) is 0. In the context of learning, this condition means that the same information (problem) is input to both the teacher and the learner. In the present paper, we assume this condition and focus only on the second term on the right-hand side of Eq. (8). Under this assumption, the process of ensemble learning with supervised weight parameter estimation results in a sequence of three operations:
$$\hat{p}_i(y|\mathbf{x}) \stackrel{\mathrm{def}}{=} \arg\min_{p(y|\mathbf{x}) \in \mathcal{P}_i(\mathcal{Y}|\mathcal{X})} D(p^*(y|\mathbf{x}) \,\|\, p(y|\mathbf{x})), \qquad (9)$$
$$\bar{\hat{p}}_{\beta}(y|\mathbf{x}) \stackrel{\mathrm{def}}{=} \arg\min_{p(y|\mathbf{x}) \in \mathcal{P}(\mathcal{Y}|\mathcal{X})} \sum_{i=1}^{M} \beta_i D(p(y|\mathbf{x}) \,\|\, \hat{p}_i(y|\mathbf{x})), \qquad (10)$$
$$\hat{\beta}_S \stackrel{\mathrm{def}}{=} \arg\min_{\beta} D(p^*(y|\mathbf{x}) \,\|\, \bar{\hat{p}}_{\beta}(y|\mathbf{x})). \qquad (11)$$
Here, Eqs. (1) and (4) correspond to Eq. (9), Eqs. (2) and (5) correspond to Eq. (10), and Eqs. (3) and (6) correspond to Eq. (11). Note that it is easily confirmed using Jensen's inequality that Eq. (10) is well defined. As for the efficiency of $\bar{p}_{\hat{\beta}_S}(y|\mathbf{x})$, the following relation is derived from the fact that $\bar{p}_{\beta}(y|\mathbf{x})$ is a kind of exponential family [7]:
$$D(p^*(y|\mathbf{x}) \,\|\, \bar{p}_{\hat{\beta}_S}(y|\mathbf{x})) = D(p^*(y|\mathbf{x}) \,\|\, p_i(y|\mathbf{x})) - D(\bar{p}_{\hat{\beta}_S}(y|\mathbf{x}) \,\|\, p_i(y|\mathbf{x})), \quad (i = 1, \ldots, M)$$
$$\leq \min_{i=1,\ldots,M} D(p^*(y|\mathbf{x}) \,\|\, p_i(y|\mathbf{x})). \qquad (12)$$
3 Ensemble Learning with Unsupervised Weight Parameter Estimation

3.1 Motivation

The efficiency shown in Eq. (12) indicates that supervised weight parameter estimation is the recommended strategy. However, this efficiency can be enjoyed only when the output data (i.e., a sample from $p^*(y|\mathbf{x})$) is available, because it is needed in order to obtain $\hat{\beta}_S$ through Eq. (11). This limitation narrows the application range of ensemble learning. For example, the supervised method cannot work for the prediction problems considered in Section 1, although not only are such problems quite common but the essence of the ensemble strategy would also be useful for solving them. This has motivated us to propose an unsupervised weight parameter estimation method that does not need the output data at all.
The underlying assumptions of the proposed method are two-fold. First, it is assumed that we have knowledge about the problem to be solved. This knowledge is vital input information for the predictors/solvers, because the predictors cannot do anything if the problem itself is unknown; the first assumption is therefore quite natural. Note that, in the context of learning, the input information is usually given as numerical data. Second, it is assumed that we have multiple opinions on the solution of the given problem. In the context of learning, this assumption means that we have multiple component predictors that have been trained in some way in advance. On the other hand, in the context of the examples considered in Section 1, the assumption means that we can collect opinions from various people. The second assumption is primary because the purpose of the present paper is to provide a method for integrating the multiple trained component predictors or for aggregating opinions from various people.

3.2 Proposed Method

General Case: Exponential Mixture Model. The following equation, which is derived in [4], yields the key idea of the proposed method:
$$\underbrace{D(p^*(y|\mathbf{x}) \,\|\, \bar{p}_{\beta}(y|\mathbf{x}))}_{\mathrm{I}} = \underbrace{\sum_{i=1}^{M} \beta_i D(p^*(y|\mathbf{x}) \,\|\, p_i(y|\mathbf{x}))}_{\mathrm{II}} - \underbrace{\sum_{i=1}^{M} \beta_i D(\bar{p}_{\beta}(y|\mathbf{x}) \,\|\, p_i(y|\mathbf{x}))}_{\mathrm{III}}. \qquad (13)$$
Equation (13) reveals two important points. First, the minimization of I with respect to $\beta$ is equivalent to the maximization of III with respect to $\beta$ when $D(p^*(y|\mathbf{x}) \,\|\, p_i(y|\mathbf{x})) = D(p^*(y|\mathbf{x}) \,\|\, p_j(y|\mathbf{x}))$ holds for $i, j = 1, \ldots, M$, $i \neq j$, because II is then constant with respect to $\beta$. Second, the maximization of III with respect to $\beta$ does not depend on $p^*(y|\mathbf{x})$, which is required in the supervised weight parameter estimation method. Therefore, the maximization of III can be realized in an unsupervised manner. Based on these two points, we propose a new weight parameter estimation method formulated as
$$\hat{\beta}_U = \arg\max_{\beta} \sum_{i=1}^{M} \beta_i D(\bar{p}_{\beta}(y|\mathbf{x}) \,\|\, p_i(y|\mathbf{x})),$$
where $\hat{\beta}_U = \{\hat{\beta}_{U,1}, \ldots, \hat{\beta}_{U,M-1}\}$ and $\hat{\beta}_{U,M}$ is defined as $\hat{\beta}_{U,M} = 1 - \sum_{i=1}^{M-1} \hat{\beta}_{U,i}$. The weight $\hat{\beta}_U$ is a legitimate estimate when the component predictors $p_i(y|\mathbf{x})$, $(i = 1, \ldots, M)$ have similar efficiency (i.e., $D(p^*(y|\mathbf{x}) \,\|\, p_i(y|\mathbf{x})) = D(p^*(y|\mathbf{x}) \,\|\, p_j(y|\mathbf{x}))$, $i, j = 1, \ldots, M$ and $i \neq j$). We hereafter use the following definition:
$$c(\beta) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{M} \beta_i D(\bar{p}_{\beta}(y|\mathbf{x}) \,\|\, p_i(y|\mathbf{x})).$$
The present paper assumes that the input data $D_n(\mathbf{x}') = \{\mathbf{x}'_1, \ldots, \mathbf{x}'_n\}$, a set of i.i.d. sample data obtained from $p^*(\mathbf{x})$, is available. In this case, we replace $c(\beta)$ by $\hat{c}(\beta)$, which is defined as follows:
$$\hat{c}(\beta) \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{\mathbf{x} \in D_n(\mathbf{x}')} \sum_{i=1}^{M} \beta_i \int_{\mathcal{Y}} \bar{p}_{\beta}(y|\mathbf{x}) \log \frac{\bar{p}_{\beta}(y|\mathbf{x})}{p_i(y|\mathbf{x})}\, dy.$$
The value of $\beta$ that maximizes $\hat{c}(\beta)$ can be found using a gradient method given by
$$\beta^{(t+1)} = \beta^{(t)} + \eta\, \nabla_{\beta}\, \hat{c}(\beta)\big|_{\beta = \beta^{(t)}},$$
where $\beta^{(t)} = (\beta^{(t)}_1, \ldots, \beta^{(t)}_{M-1})$ is the value of $\beta$ at update time $t$, $\eta$ is a sufficiently small positive value, and $\nabla_{\beta} = (\partial/\partial\beta_1, \ldots, \partial/\partial\beta_{M-1})$ is the gradient with respect to $\beta$.

Special Case: Gaussian Exponential Mixture Model. The proposed method has additional interesting properties if $p_i(y|\mathbf{x})$ is modeled by a conditional Gaussian pdf. In the following, we consider the case in which $p_i(y|\mathbf{x}) = p_{G, f_{\hat{\theta}_i}}(y|\mathbf{x})$. In this case, the following equation holds, where, for simplicity, we set $\sigma = 1$:
$$\hat{c}(\beta) = \frac{1}{n} \sum_{\mathbf{x} \in D_n(\mathbf{x}')} \frac{1}{2} \sum_{i=1}^{M} \beta_i \big( \bar{f}_{\hat{\theta}, \beta}(\mathbf{x}) - f_{\hat{\theta}_i}(\mathbf{x}) \big)^2 = \frac{1}{n} \sum_{\mathbf{x} \in D_n(\mathbf{x}')} \frac{1}{4} \sum_{i,j=1}^{M} \beta_i \beta_j \big( f_{\hat{\theta}_i}(\mathbf{x}) - f_{\hat{\theta}_j}(\mathbf{x}) \big)^2.$$
Here, using $\beta_M = 1 - \sum_{i=1}^{M-1} \beta_i$, we obtain $\nabla_{\beta}\, \hat{c}(\beta) = \delta - A\beta$, where $A = (a_{i,j}) \in \mathbb{R}^{(M-1) \times (M-1)}$ and $\delta = (\delta_1, \ldots, \delta_{M-1})$ are defined as
$$a_{i,j} = \frac{1}{n} \sum_{\mathbf{x} \in D_n(\mathbf{x}')} \frac{1}{2} \Big\{ \big( f_{\hat{\theta}_j}(\mathbf{x}) - f_{\hat{\theta}_M}(\mathbf{x}) \big)^2 + \big( f_{\hat{\theta}_M}(\mathbf{x}) - f_{\hat{\theta}_i}(\mathbf{x}) \big)^2 - \big( f_{\hat{\theta}_j}(\mathbf{x}) - f_{\hat{\theta}_i}(\mathbf{x}) \big)^2 \Big\},$$
$$\delta_i = \frac{1}{n} \sum_{\mathbf{x} \in D_n(\mathbf{x}')} \frac{1}{2} \big( f_{\hat{\theta}_M}(\mathbf{x}) - f_{\hat{\theta}_i}(\mathbf{x}) \big)^2.$$
As a result, $\nabla_{\beta}\, \hat{c}(\beta) = 0$ yields the following equation if $A$ is non-singular:
$$\hat{\beta}_U = A^{-1} \delta.$$
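Under this Gaussian special case, the unsupervised weights depend only on the pairwise disagreements between the component predictions on the unlabeled inputs. The sketch below builds A and δ exactly as defined above and solves Aβ = δ; the prediction matrix F is an assumed input, A is assumed non-singular, and positivity of the weights is not enforced.

```python
import numpy as np

def unsupervised_weights(F):
    """Closed-form beta_U = A^{-1} delta for the Gaussian case (sigma = 1).

    F : (n, M) array, F[k, i] = f_i(x_k) evaluated on the unlabeled inputs.
    Returns (beta_1, ..., beta_M) with beta_M = 1 - sum of the others.
    """
    n, M = F.shape
    fM = F[:, M - 1]
    A = np.empty((M - 1, M - 1))
    delta = np.empty(M - 1)
    for i in range(M - 1):
        delta[i] = np.mean(0.5 * (fM - F[:, i]) ** 2)
        for j in range(M - 1):
            A[i, j] = np.mean(0.5 * ((F[:, j] - fM) ** 2
                                     + (fM - F[:, i]) ** 2
                                     - (F[:, j] - F[:, i]) ** 2))
    b = np.linalg.solve(A, delta)      # assumes A is non-singular
    return np.append(b, 1.0 - b.sum())

# usage: beta = unsupervised_weights(F); ensemble_prediction = F @ beta
```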
As an interesting example, let us consider a case that satisfies
$$\frac{1}{n} \sum_{\mathbf{x} \in D_n(\mathbf{x}')} \frac{1}{2} \big( f_{\hat{\theta}_i}(\mathbf{x}) - f_{\hat{\theta}_j}(\mathbf{x}) \big)^2 = \varepsilon, \qquad (\forall i, j = 1, 2, \ldots, M,\ i \neq j) \qquad (14)$$
where $\varepsilon$ is a positive constant value. In this case, we obtain
$$A^{-1} = \frac{1}{\varepsilon}
\begin{pmatrix}
\frac{M-1}{M} & -\frac{1}{M} & \cdots & -\frac{1}{M} \\
-\frac{1}{M} & \frac{M-1}{M} & \cdots & -\frac{1}{M} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{M} & -\frac{1}{M} & \cdots & \frac{M-1}{M}
\end{pmatrix}, \qquad
\delta = (\varepsilon, \varepsilon, \ldots, \varepsilon)^T.$$
Let $\hat{\beta}_A$ denote the value of $\beta$ obtained when Eq. (14) is satisfied. We then obtain
$$\hat{\beta}_A = \left( \frac{1}{M}, \frac{1}{M}, \ldots, \frac{1}{M} \right)^T.$$
The above property characterizes the meaning of the simple average in the context of unsupervised weight parameter estimation.

4 Numerical Examples

4.1 Experimental Conditions

In order to evaluate the practical performance of the proposed method, we conducted numerical experiments on pattern recognition tasks using the spam e-mail dataset and the diabetes dataset provided by the UCI repository of machine learning databases [8]. The experiments are based on the Gaussian exponential mixture model described in Sections 2.2 and 3.2. We generated the multiple trained component predictors, which are modeled by multi-layer perceptrons (MLPs), by training them with the gradient descent method and the square loss function using $D_{n_i}(\mathbf{x}, y)$. Then,
we executed the weight parameter estimation using $D_n(\mathbf{x}')$ for the proposed unsupervised method. In this way, we can emulate the situation described in Section 1: the output information of the problem is not given, while the knowledge about the problem to be solved (i.e., $D_n(\mathbf{x}')$) and the opinions regarding its solution (i.e., $f_{\hat{\theta}_i}(\mathbf{x})$) are on hand. For reference, we also executed the supervised weight parameter estimation using $D_n(\mathbf{x}', y')$. Unknown samples are used to evaluate the generalization ability of the ensemble predictor. Detailed information on the experimental parameters is shown in Table 1.
Table 1. Parameters of Numerical Experiments

Parameter                                                        Spam E-mail   Diabetes
Number of instances: total                                       4601          768
Number of instances: $|D_{n_i}(\mathbf{x}, y)|$                  100           50
Number of instances: known sample $|D_n(\mathbf{x}', y')|$ or $|D_n(\mathbf{x}')|$   100   50
Number of instances: unknown sample                              4001          518
Number of input attributes: m                                    57            8
Number of output attributes: l                                   1             1
Number of hidden units in MLP                                    7             3
Number of component predictors (MLPs): M                         5             5

Table 2. Rate of Correct Answers (%)

                                                     Spam E-mail                            Diabetes
Sample    Predictor(s)                        Condition 1         Condition 2        Condition 1         Condition 2
                                              c = 65    c = 85    a = 85, b = 10     c = 65    c = 85    a = 75, b = 5
Known     $f_{\hat\theta_i}(\mathbf{x})$ (ave.)            57.04     80.80     65.50              57.84     66.80     49.92
          $\bar f_{\hat\theta,\hat\beta_A}(\mathbf{x})$    68.80     83.00     78.70              63.60     68.20     58.80
          $\bar f_{\hat\theta,\hat\beta_U}(\mathbf{x})$    73.40     84.20     82.50              70.00     70.00     66.80
          $\bar f_{\hat\theta,\hat\beta_S}(\mathbf{x})$    74.00     84.60     83.90              71.40     72.80     70.80
Unknown   $f_{\hat\theta_i}(\mathbf{x})$ (ave.)            56.72     79.54     67.56              58.41     71.19     49.81
          $\bar f_{\hat\theta,\hat\beta_A}(\mathbf{x})$    67.41     83.37     82.47              67.41     73.89     55.62
          $\bar f_{\hat\theta,\hat\beta_U}(\mathbf{x})$    70.14     85.12     83.68              69.53     74.49     69.74
          $\bar f_{\hat\theta,\hat\beta_S}(\mathbf{x})$    71.40     85.37     83.93              71.67     75.51     72.47
4.2 Results and Discussion

We used two conditions to terminate the training of the component predictors. One condition is for making multiple component predictors with similar accuracy (Condition 1), and the other is for making multiple component predictors with diverse accuracy (Condition 2). In Condition 1, we stop the training of all component predictors when the rate of correct answers with respect to $D_{n_i}(\mathbf{x}, y)$ reaches the same value $c$ during the training process. In Condition 2, we stop the training of component predictor $i$ when the rate reaches $a - b(i-1)\%$ for $i = 1, \ldots, M$ $(= 5)$, where $i$ is the index of the component predictor. We executed numerical experiments using 10 different sets of samples for each experimental condition, where each set of samples is made by random sampling from the original dataset. The results of the numerical experiments shown in Table 2 are the averages of the results obtained for these 10 sets. As shown in the table, performance improves in the order $f_{\hat{\theta}_i}(\mathbf{x})$ (ave.), $\bar{f}_{\hat{\theta}, \hat{\beta}_A}(\mathbf{x})$, $\bar{f}_{\hat{\theta}, \hat{\beta}_U}(\mathbf{x})$, $\bar{f}_{\hat{\theta}, \hat{\beta}_S}(\mathbf{x})$ for all experimental conditions considered herein.

In the following, we examine the experimental results in more detail. The degree of performance improvement in Condition 1 becomes greater as the value of $c$ becomes smaller. This means that the effect of the proposed method
becomes larger as the performance of the component predictors becomes worse. Of course, the performance itself becomes better as the value of $c$ increases. On the other hand, the result in Condition 2 shows that the proposed method provides a reasonable weight parameter even when the assumption of the proposed method mentioned in Section 3.2 (i.e., that all component predictors have similar efficiency) does not hold exactly.

5 Conclusion

We proposed an unsupervised weight parameter estimation method for ensemble learning using a mathematical model that is based on an exponential mixture model and Kullback divergence. The proposed method is feasible even when we do not know the information about the desired output in advance. In addition, the proposed method is a legitimate strategy for weight parameter estimation in the abovementioned unsupervised situation if the performances of the component predictors, which are used to build the ensemble predictor, are almost the same. The results of numerical experiments revealed that the performance of the proposed method is much better than that of simple ensemble learning, even when the assumption on the performances of the component predictors is not satisfied.

Acknowledgment

This study was supported in part by the Japan Society for the Promotion of Science through a Grant-in-Aid for Scientific Research (S) (18100001).

References

1. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
2. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119–139 (1997)
3. Ueda, N., Nakano, R.: Generalization error of ensemble estimators. In: Proceedings of the International Conference on Neural Networks 1996 (ICNN 1996), Washington, D.C., USA, vol. 3, pp. 90–95 (1996)
4. Uchida, M., Shioya, H., Da-te, T.: Analysis and extension of ensemble learning (in Japanese). IEICE Transactions on Information and Systems, Pt. 2 (Japanese Edition) J84-D-II, 1537–1542 (2001)
5. Uchida, M., Shioya, H.: A study on assignment of weight parameters in ensemble learning model (in Japanese). IEICE Transactions on Information and Systems, Pt. 2 (Japanese Edition) J86-D-II, 1131–1134 (2003)
6. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, Chichester (1991)
7. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society (2000)
8. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI repository of machine learning databases (1998)
Sparse Super Symmetric Tensor Factorization

Andrzej Cichocki*, Marko Jankovic, Rafal Zdunek**, and Shun-ichi Amari

RIKEN Brain Science Institute, Wako-shi, Saitama, Japan
cia@brain.riken.jp
* Also with the Systems Research Institute PAN and Warsaw University of Technology, Dept. of EXE, Warsaw, Poland.
** Also with the Institute of Telecommunications, Teleinformatics, and Acoustics, Wroclaw University of Technology, Poland.

Abstract. In this paper we derive and discuss a wide class of algorithms for 3D Super-symmetric Nonnegative Tensor Factorization (SNTF), or nonnegative symmetric PARAFAC, and, as a special case, Symmetric Nonnegative Matrix Factorization (SNMF), which have many potential applications, including multi-way clustering, feature extraction, multi-sensory or multi-dimensional data analysis, and nonnegative neural sparse coding. The main advantage of the derived algorithms is their relatively low complexity and, in the case of the multiplicative algorithms, the possibility of a straightforward extension to the factorization of L-order tensors thanks to a convenient symmetry property. We also propose the use of a wide class of cost functions such as the squared Euclidean distance, the Kullback-Leibler I-divergence, the Alpha divergence, and the Beta divergence. Preliminary experimental results confirm the validity and good performance of some of these algorithms, especially when the data have sparse representations.

1 Introduction – Problem Formulation

Tensors (also known as n-way arrays or multidimensional arrays) are used in a variety of applications ranging from neuroscience and psychometrics to chemometrics [1, 2, 3, 4, 5, 6, 7, 8]. Non-negative Matrix Factorization (NMF), Non-negative Tensor Factorization (NTF), and parallel factor analysis (PARAFAC) models with non-negativity constraints have recently been proposed as sparse and quite efficient representations of signals, images, or general data [3, 4, 2, 5, 9, 10, 11, 12, 13, 14, 15]. From the viewpoint of data analysis, NTF is very attractive because it takes into account spatial and temporal correlations between variables more accurately than 2D matrix factorizations such as NMF, and it usually provides sparse common factors or hidden (latent) components with physiological meaning and interpretation [5]. In most applications, especially in neuroscience (EEG, fMRI), the PARAFAC models have been used [12, 16, 17]. In this paper, we consider a special form of the PARAFAC model (referred to here as the SNTF model), but with additional nonnegativity and sparsity constraints [6, 7, 18, 19]. In the general case, the PARAFAC model can be described as a factorization of a
given 3D tensor $\mathbf{Y} \in \mathbb{R}^{I \times J \times Q}$ into three unknown matrices: $\mathbf{A} \in \mathbb{R}^{I \times J}$, representing the common factors, basis matrix, dictionary matrix, or mixing matrix (depending on the application); $\mathbf{D}$ (with $Q$ rows), usually representing a scaling matrix; and $\mathbf{X}$ (with $J$ rows), representing the second common factors, hidden components, or sources (see Fig. 1).