Lecture Notes in Computer Science
Probabilistic Tensor Analysis with Akaike and Bayesian Information Criteria

Dacheng Tao¹, Jimeng Sun², Xindong Wu³, Xuelong Li⁴, Jialie Shen⁵,
Stephen J. Maybank⁴, and Christos Faloutsos²

¹ Department of Computing, Hong Kong Polytechnic University, Hong Kong
csdct@comp.polyu.edu.hk
² Department of Computer Science, Carnegie Mellon University, Pittsburgh, USA
jimeng@cs.cmu.edu, christos+@cs.cmu.edu
³ Department of Computer Science, University of Vermont, Burlington, USA
xwu@cs.uvm.edu
⁴ School of Computer Science & Information Systems, Birkbeck, University of London, London, UK
xuelong@dcs.bbk.ac.uk, sjmaybank@dcs.bbk.ac.uk
⁵ School of Information Systems, Singapore Management University, Singapore
jlshen@smu.edu.sg

Abstract. From data mining to computer vision, from visual surveillance to biometrics research, from biomedical imaging to bioinformatics, and from multimedia retrieval to information management, a large amount of data is naturally represented by multidimensional arrays, i.e., tensors. However, conventional probabilistic graphical models with probabilistic inference only model data in vector format, even though probabilistic inference is very important in many statistical problems, e.g., model selection. Is it possible to construct multilinear probabilistic graphical models for tensor-format data and to conduct probabilistic inference, e.g., model selection, over them? This paper provides a positive answer based on the proposed decoupled probabilistic model by developing probabilistic tensor analysis (PTA), which selects a suitable model for tensor-format data based on the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Empirical studies demonstrate that PTA associated with AIC and BIC selects the correct number of models.

Keywords: Probabilistic Inference, Akaike Information Criterion, Bayesian Information Criterion, Probabilistic Principal Component Analysis, Tensor.

1 Introduction

In computer vision, data mining, and many other applications, objects are naturally represented by tensors, i.e., multidimensional arrays: gray-level face images in biometrics, colour images in scene classification, colour video shots (as shown in Figure 1) in multimedia information management, TCP flow records in computer networks, and the DBLP bibliography in data mining [6]. Therefore, tensor-based data modeling has become very popular, and a large number of learning models have been developed, from unsupervised learning to supervised learning, e.g., high order singular value decomposition [5] [1] [2], n-mode component analysis or tensor principal component
analysis (TPCA) [5] [10] [6], three-mode data principal component analysis [4], Tucker decomposition [9], tensorface [10], generalized low rank approximations of matrices [12], two-dimensional linear discriminant analysis [11], dynamic tensor analysis [6], and general tensor discriminant analysis [7].

Fig. 1. A colour video shot represented as a tensor: two indices are used for pixel locations; one index is used to locate the colour information; and the other index is used for time. The video shot comes from http://www-nlpir.nist.gov/projects/trecvid/.

However, all these tensor-based learning models lack systematic justification at the probabilistic level. Therefore, it is impossible to conduct conventional statistical tasks, e.g., model selection, over existing tensor-based learning models. The difficulty of applying probability theory, probabilistic inference, and probabilistic graphical modeling to tensor-based learning models comes from the gap between the tensor-based datum representation and the vector-based input required by traditional probabilistic models. Is it possible to apply probability theory and to develop probabilistic graphical models for tensors? Is it possible to apply statistical model selection over specific tensor-based learning models? To answer these questions, we narrow our focus to generalizing probabilistic principal component analysis (PPCA) [8], which is an important generative model [3] for subspace selection or dimension reduction for vectors. This generalization yields probabilistic tensor analysis (PTA), which is a generative model for tensors. The significance of PTA is as follows: 1) It provides a probabilistic analysis for tensors. Based on PTA, a probabilistic graphical model can be constructed, the dimension reduction procedure for tensors can be understood at the probabilistic level, and the parameter estimation can be formulated with the expectation maximization (EM) algorithm under the maximum likelihood framework [3]. 2) It provides statistical methods for model selection based on the Akaike and Bayesian information criteria (AIC and BIC) [3]. Conventional tensor-based subspace analysis offers no criterion for model selection, i.e., for determining the appropriate number of retained dimensions to model the original tensors. By
providing specific utilizations of AIC and BIC, model selection for tensor subspace analysis becomes possible; in PTA associated with AIC/BIC, the number of retained dimensions is chosen by an iterative procedure. 3) It provides a flexible framework for tensor data modeling. PTA assumes that entries in tensor measurements are drawn from multivariate normal distributions. This assumption can be changed for different applications, e.g., to multinomial/binomial distributions in sparse tensor modeling; with different assumptions, different dimension reduction algorithms can be developed for different applications.
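To make the tensor representation concrete, the following minimal numpy sketch builds a fourth-order tensor for a colour video shot as in Figure 1; the array sizes are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

# A colour video shot as a 4th-order tensor (M = 4 modes):
# two modes index pixel locations, one mode indexes the colour
# channels, and one mode indexes time (cf. Figure 1).
height, width, channels, frames = 120, 160, 3, 30  # illustrative sizes
video_shot = np.zeros((height, width, channels, frames))
print(video_shot.ndim, video_shot.shape)  # 4 (120, 160, 3, 30)

# A dataset D of n such observations is a set of tensors
# T_1, ..., T_n, each in R^{l_1 x l_2 x l_3 x l_4}.
n = 50
dataset = [np.random.rand(height, width, channels, frames) for _ in range(n)]
```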
2 Probabilistic Tensor Analysis

In this section, we first construct the latent tensor model, which uses a multilinear mapping to relate the observed tensors to unobserved latent tensors. Based on the proposed latent tensor model, we then develop probabilistic tensor analysis with dimension reduction and data reconstruction. A detailed description of tensor terminology can be found in [9].

2.1 Latent Tensor Model

Similar to the latent variable model [8], the latent tensor model multilinearly relates the high dimensional observation tensors $\mathcal{T}_i \in \mathbb{R}^{l_1 \times l_2 \times \cdots \times l_{M-1} \times l_M}$, for $1 \le i \le n$, to the corresponding latent tensors $\mathcal{X}_i \in \mathbb{R}^{l'_1 \times l'_2 \times \cdots \times l'_{M-1} \times l'_M}$ in the low dimensional space, i.e.,

$$\mathcal{T}_i = \mathcal{X}_i \prod_{d=1}^{M} \times_d U_d + \mathcal{M} + \mathcal{E}_i \,, \qquad (1)$$

where $U_d \in \mathbb{R}^{l_d \times l'_d}$ is the $d$-th projection matrix, $1 \le d \le M$; $\mathcal{M} = \left(\sum_{i=1}^{n} \mathcal{T}_i\right)/n$ is the mean tensor of all observed tensors $\mathcal{T}_i$; and $\mathcal{E}_i$ is the $i$-th residue tensor, every entry of which follows $N(0, \sigma^2)$. Moreover, the number of effective dimensions of the latent tensors is upper bounded by $l_d - 1$ for the $d$-th mode, i.e., $l'_d \le l_d - 1$ for the $d$-th projection matrix $U_d$. Here, $1 \le i \le n$ and $n$ is the number of observed tensors. The projection matrices $U_d$ ($1 \le d \le M$) construct the multilinear mapping between the observations and the latent tensors.
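As a complement to (1), here is a minimal numpy sketch of sampling observed tensors from the latent tensor model; the helper `mode_d_product`, the dimensions, and the noise level are illustrative assumptions, and the convention $U_d \in \mathbb{R}^{l_d \times l'_d}$ follows the reconstruction above:

```python
import numpy as np

def mode_d_product(X, U, d):
    """Mode-d product X x_d U: contracts mode d of X (size l'_d)
    with the columns of U in R^{l_d x l'_d}, yielding mode size l_d."""
    # tensordot contracts X's axis d with U's axis 1; the new axis
    # (appended last by tensordot) is then moved back to position d.
    return np.moveaxis(np.tensordot(X, U, axes=(d, 1)), -1, d)

rng = np.random.default_rng(0)
l = (20, 15, 10)   # observed mode sizes l_d (illustrative)
lp = (5, 4, 3)     # latent mode sizes l'_d <= l_d - 1
sigma = 0.1        # standard deviation of the residue entries

U = [rng.standard_normal((l[d], lp[d])) for d in range(len(l))]  # projection matrices
M_mean = rng.standard_normal(l)                                  # mean tensor M

def sample_observation():
    X = rng.standard_normal(lp)           # latent tensor, entries ~ N(0, 1)
    T = X
    for d in range(len(l)):               # T = X prod_d (x_d U_d)
        T = mode_d_product(T, U[d], d)
    E = sigma * rng.standard_normal(l)    # residue tensor, entries ~ N(0, sigma^2)
    return T + M_mean + E

T1 = sample_observation()
print(T1.shape)  # (20, 15, 10)
```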
2.2 Probabilistic Tensor Analysis

Armed with the latent tensor model and probabilistic principal component analysis (PPCA), a probabilistic treatment of TPCA is constructed by introducing hyper-parameters with the distribution $p\left(U_d|_{d=1}^{M}, \sigma^2 \,\middle|\, D\right)$ over the model parameters, where $D = \{\mathcal{T}_i\}|_{i=1}^{n}$ is the set containing all observed tensors. According to Bayes' theorem, the probabilistic modeling of TPCA, i.e., PTA, is defined by the predictive density

$$p\left(\mathcal{T} \mid D\right) = p\left(\mathcal{T} \,\middle|\, U_d|_{d=1}^{M}, \sigma^2\right) p\left(U_d|_{d=1}^{M}, \sigma^2 \,\middle|\, D\right), \qquad (2)$$
where $U_d|_{d=1}^{M} \in \mathbb{R}^{l_d \times l'_d}$, $p\left(\mathcal{T} \,\middle|\, U_d|_{d=1}^{M}, \sigma^2\right)$ is the predictive density under the given full probabilistic model, and $p\left(U_d|_{d=1}^{M}, \sigma^2 \,\middle|\, D\right)$ is the posterior probability. The model parameters $\left(U_d|_{d=1}^{M}, \sigma^2\right)$ can be obtained by applying maximum a posteriori (MAP) estimation.

In PTA, to obtain $\left(U_d|_{d=1}^{M}, \sigma^2\right)$, we have the following concerns: 1) The probabilistic distributions are defined over vectors, not tensors. Although the vectorization operation makes it possible to apply conventional probability theory and inference, the computational cost would be very high and almost intractable in practical applications, so it is important to obtain the probabilistic model in a computationally tractable way; the decoupled probabilistic model developed below significantly reduces the computational complexity. 2) How should the number of retained dimensions used to model the observed tensors be determined? In probability theory and statistics, the Akaike information criterion (AIC) [3] and the Bayesian information criterion (BIC) [3] are popular for model selection; however, both were developed for data represented by vectors, so it is important to generalize AIC and BIC to tensor data. The classical vector-data forms of both criteria are sketched below.

Based on the above discussion, obtaining the projection matrices $U_d|_{d=1}^{M}$ jointly is computationally intractable, because the model requires obtaining all projection matrices simultaneously. To construct a computationally tractable algorithm, we build a decoupled probabilistic model for (1), i.e., we obtain each projection matrix $U_d$ separately and form an alternating optimization procedure.
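For reference, the classical AIC and BIC for vector data are simple functions of the maximized log-likelihood; a minimal sketch follows (the tensor generalization is developed with (7) below):

```python
import numpy as np

def aic(log_likelihood, k):
    """Akaike information criterion: AIC = 2k - 2 log L,
    where k is the number of free model parameters."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: BIC = k log(n) - 2 log L,
    where n is the number of observations."""
    return k * np.log(n) - 2 * log_likelihood
```

In both cases, the candidate model minimizing the criterion is retained.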
Fig. 2. The decoupled probabilistic graphical model of PTA: for each mode $d$, the projected samples $t_{d;j}$ depend on the latent variables $x_{d;j}$, the mode parameters $\mu_d$ and $\sigma_d^2$, the projection matrix $U_d$, and the remaining projection matrices $U_1, \ldots, U_{d-1}, U_{d+1}, \ldots, U_M$, over the $n$ observed tensors $\mathcal{T}_i$.
The decoupled predictive density $p\left(\mathcal{T} \,\middle|\, U_d|_{d=1}^{M}, \sigma^2\right)$ is defined as

$$p\left(\mathcal{T} \,\middle|\, U_d|_{d=1}^{M}, \sigma^2\right) \propto \prod_{d=1}^{M} p\left(\mathcal{T} \,\middle|\, U_d, \mu_d, \sigma_d^2\right), \qquad (3)$$

where $\mu_d \in \mathbb{R}^{l_d}$ is the mean vector for the $d$-th mode and $\sigma_d^2$ is the variance of the noise that models the residue on the $d$-th mode. The decoupled posterior distribution given the observed tensors $D$ is

$$p\left(U_d|_{d=1}^{M}, \sigma^2 \,\middle|\, D\right) \propto \prod_{d=1}^{M} p\left(U_d, \mu_d, \sigma_d^2 \,\middle|\, D, U_k|_{k \ne d, 1 \le k \le M}\right). \qquad (4)$$

Therefore, based on (3) and (4), the decoupled predictive density is

$$p\left(\mathcal{T} \mid D\right) \propto \prod_{d=1}^{M} p\left(\mathcal{T} \,\middle|\, U_d, \mu_d, \sigma_d^2\right) \times p\left(U_d, \mu_d, \sigma_d^2 \,\middle|\, D, U_k|_{k \ne d, 1 \le k \le M}\right). \qquad (5)$$
With (4) and (5), we have the decoupled probabilistic graphical model shown in Figure 2.

Consider the column space of the projected data set $D_d = \{t_{d;j}\}$, where the sample $t_{d;j} \in \mathbb{R}^{l_d}$ is the $j$-th column of $T_d = \mathrm{mat}_d\!\left(\mathcal{T} \prod_{k \ne d} \times_k U_k^T\right)$ and $T_d \in \mathbb{R}^{l_d \times \bar{l}_d}$ with $\bar{l}_d = \prod_{k \ne d} l'_k$; here $\mathrm{mat}_d(\cdot)$ denotes mode-$d$ matricization [9]. Over the $n$ observed tensors, $1 \le j \le n\bar{l}_d$. Correspondingly, let $x_{d;j} \in \mathbb{R}^{l'_d}$ be the $j$-th column of $X_d = \mathrm{mat}_d(\mathcal{X})$, the mode-$d$ matricization of the latent tensor. A sketch of this construction is given below.
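A minimal numpy sketch of the mode-$d$ construction above, under the conventions of this reconstruction ($U_k \in \mathbb{R}^{l_k \times l'_k}$, so $U_k^T$ projects mode $k$ down to $l'_k$); the helper names are assumptions:

```python
import numpy as np

def unfold(T, d):
    """Mode-d matricization mat_d(T): rows are indexed by mode d."""
    return np.moveaxis(T, d, 0).reshape(T.shape[d], -1)

def project_except(T, U, d):
    """Project T on every mode except d: T prod_{k != d} x_k U_k^T."""
    for k in range(T.ndim):
        if k != d:
            # U[k].T in R^{l'_k x l_k} maps mode k from size l_k to l'_k.
            T = np.moveaxis(np.tensordot(T, U[k].T, axes=(k, 1)), -1, k)
    return T

def mode_d_samples(tensors, U, d):
    """Stack the columns t_{d;j} of T_d over all n observed tensors.
    (The mean tensor M can be subtracted from each tensor beforehand;
    it is omitted here for brevity.)"""
    cols = [unfold(project_except(T, U, d), d) for T in tensors]
    return np.concatenate(cols, axis=1)  # shape: l_d x (n * lbar_d)
```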
Suppose $t_d \mid x_d \sim N\!\left(U_d x_d + \mu_d, \sigma_d^2 I\right)$ and the marginal distribution of the latent variable is $x \sim N(0, I)$. Then the marginal distribution of the observed projected mode-$d$ vector $t_d$ is also Gaussian, i.e., $t_d \sim N(\mu_d, C_d)$. The mean vector and the corresponding covariance matrix are $\mu_d = \left(\sum_{j=1}^{n\bar{l}_d} t_{d;j}\right)\big/\left(n\bar{l}_d\right)$ and $C_d = U_d U_d^T + \sigma_d^2 I$, respectively. Based on the properties of the Gaussian distribution, we have

$$x_d \mid t_d \sim N\!\left(M_d^{-1} U_d^T (t_d - \mu_d),\; \sigma_d^2 M_d^{-1}\right), \qquad (6)$$

where $M_d = U_d^T U_d + \sigma_d^2 I$.
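Eq. (6) transcribes directly into code; a minimal sketch (the function name is an assumption):

```python
import numpy as np

def latent_posterior(t, U_d, mu_d, sigma2_d):
    """Posterior of x_d given t_d, eq. (6):
    mean = M_d^{-1} U_d^T (t_d - mu_d), cov = sigma_d^2 M_d^{-1},
    with M_d = U_d^T U_d + sigma_d^2 I."""
    lp = U_d.shape[1]                                  # latent size l'_d
    M_d = U_d.T @ U_d + sigma2_d * np.eye(lp)
    M_inv = np.linalg.inv(M_d)
    mean = M_inv @ U_d.T @ (t - mu_d)
    cov = sigma2_d * M_inv
    return mean, cov
```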
In this decoupled model, the objective is to find the $d$-th projection matrix $U_d$ by MAP with $U_k|_{k \ne d, 1 \le k \le M}$ given, i.e., to calculate $U_d$ by maximizing

$$\log p\left(U_d \mid D\right) \propto -\frac{n\bar{l}_d}{2}\left[\log \det\left(C_d\right) + \mathrm{tr}\!\left(C_d^{-1} S_d\right)\right], \qquad (7)$$

where $\log(a)$ is the natural logarithm of $a$ and $S_d$ is the sample covariance matrix of $t_d$. Eq. (7) benefits from the decoupled definition of the probabilistic model in (5). Based on (7), the total log posterior distribution is

$$L = \sum_{d=1}^{M} \log p\left(U_d \mid D\right) \propto -\sum_{d=1}^{M} \frac{n\bar{l}_d}{2}\left[\log \det\left(C_d\right) + \mathrm{tr}\!\left(C_d^{-1} S_d\right)\right].$$