Fig. 1. General 3D PARAFAC model described by the set of matrix equations $\mathbf{Y}_q = \mathbf{A}\mathbf{D}_q\mathbf{X} + \mathbf{N}_q$, $q = 1, 2, \ldots, Q$, where $\mathbf{D}_q$ is a diagonal matrix that holds on its main diagonal the $q$-th row of $\mathbf{D}$. In the special case of the SNTF, we impose nonnegativity and additional constraints, with $I = T = Q$, $\mathbf{A} = \mathbf{D} = \mathbf{G} \in \mathbb{R}^{I \times J}$, and $\mathbf{X} = \mathbf{G}^T$.

A super-symmetric tensor is a tensor whose entries are invariant under any permutation of the indices. For example, a third-order super-symmetric tensor $\mathcal{Y} \in \mathbb{R}^{I \times T \times Q}$ (with $I = T = Q$) has $y_{itq} = y_{iqt} = y_{tiq} = y_{tqi} = y_{qit} = y_{qti}$. Super-symmetric tensors arise naturally in multi-way clustering, where they represent generalized affinity tensors, in higher-order statistics, and in blind source separation. Pierre Comon [20] has shown a nice relationship between super-symmetric tensors and polynomials. Zass and Shashua applied them to multi-way clustering problems [6,19,7], and Hazan et al. developed multiplicative algorithms for the NTF [2].

We formulate the SNTF decomposition of a third-order super-symmetric tensor $\mathcal{Y} \in \mathbb{R}^{I \times I \times I}$
as three identical sparse nonnegative matrices $\mathbf{G} = [\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_J] \in \mathbb{R}^{I \times J}$ with $J \ll I$, according to the following factorization:

$$\mathcal{Y} = \sum_{j=1}^{J} \mathbf{g}_j \circ \mathbf{g}_j \circ \mathbf{g}_j + \mathcal{N}, \qquad (1)$$

where $\mathbf{g}_j \in \mathbb{R}^I$ is the $j$-th column vector of the matrix $\mathbf{G}$, the operator $\circ$ denotes the outer product (note that if $\mathbf{u}, \mathbf{v}, \mathbf{w}$ are vectors, then $[\mathbf{u} \circ \mathbf{v} \circ \mathbf{w}]_{ijq} = u_i v_j w_q$), and $\mathcal{N}$ is a tensor representing error. The SNTF model can be described in the equivalent matrix form as

$$\mathbf{Y}_q = \mathbf{G}\mathbf{D}_q\mathbf{G}^T + \mathbf{N}_q, \qquad (q = 1, 2, \ldots, I) \qquad (2)$$
where $\mathbf{Y}_q = \mathcal{Y}_{:,:,q} = [y_{itq}] \in \mathbb{R}^{I \times I}_+$ are the frontal slices of the given tensor $\mathcal{Y} \in \mathbb{R}^{I \times I \times I}_+$, $I = Q = T$ is the number of (horizontal, vertical, frontal) slices, $\mathbf{G} = [g_{ij}] \in \mathbb{R}^{I \times J}_+$ is the unknown matrix (super-common factor) to be estimated, $\mathbf{D}_q \in \mathbb{R}^{J \times J}_+$ is a diagonal matrix that holds the $q$-th row of $\mathbf{G}$ on its main diagonal, and $\mathbf{N}_q = \mathcal{N}_{:,:,q} \in \mathbb{R}^{I \times I}$ is the $q$-th frontal slice of a tensor $\mathcal{N} \in \mathbb{R}^{I \times I \times I}$ (not necessarily super-symmetric) representing error or noise, depending upon the application. The above algebraic system can be represented in an equivalent scalar form as follows:

$$y_{itq} = z_{itq} + n_{itq} = \sum_{j=1}^{J} g_{ij}\, g_{tj}\, g_{qj} + n_{itq}. \qquad (3)$$
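To make the model concrete, the following NumPy sketch (our own illustration; the dimensions and random seed are arbitrary) builds a noiseless super-symmetric tensor from a nonnegative factor $\mathbf{G}$ and checks both the permutation invariance of its entries and the slice form (2):

import numpy as np

rng = np.random.default_rng(0)
I, J = 10, 3
G = rng.random((I, J))                      # nonnegative factor, columns g_j

# Model (1) without noise: y_itq = sum_j g_ij g_tj g_qj
Y = np.einsum('ij,tj,qj->itq', G, G, G)

# Super-symmetry: entries are invariant under any permutation of the indices
assert np.allclose(Y, Y.transpose(1, 0, 2))
assert np.allclose(Y, Y.transpose(2, 1, 0))

# Frontal slices satisfy (2): Y_q = G D_q G^T with D_q = diag(q-th row of G)
q = 4
assert np.allclose(Y[:, :, q], G @ np.diag(G[q, :]) @ G.T)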
The objective is to estimate a sparse matrix $\mathbf{G}$, subject to constraints such as scaling to unit-length vectors and nonnegativity, and possibly other natural constraints such as orthogonality, sparseness, and/or smoothness of all or some of the columns $\mathbf{g}_j$. The $ij$-th element of the matrix $\mathbf{G}$ is denoted by $g_{ij}$, and its $j$-th column by $\mathbf{g}_j$; $y_{itq} = [\mathbf{Y}_q]_{it}$ denotes the $it$-th element of the $q$-th frontal slice $\mathbf{Y}_q$, and $z_{itq} = \sum_{j=1}^{J} g_{ij}\, g_{tj}\, g_{qj}$ with
$(i = 1, 2, \ldots, I;\ t = 1, 2, \ldots, I;\ q = 1, 2, \ldots, I)$.

2 Multiplicative SNTF Algorithms

2.1 Generalized Alpha Divergence

The most widely known and often used adaptive algorithms for NTF/NMF, and also for the SNTF, are based on alternating minimization of the squared Euclidean distance and the generalized Kullback-Leibler divergence [15,13,9]. In this paper, we propose to use a more general cost function: the alpha divergence. The 3D generalized alpha divergence can be defined for our purpose as follows [1]:

$$D_A^{(\alpha)}(\mathcal{Y}\,\|\,\mathcal{Z}) =
\begin{cases}
\displaystyle\sum_{itq} \frac{y_{itq}\left[\left(y_{itq}/z_{itq}\right)^{\alpha-1} - 1\right]}{\alpha(\alpha-1)} - \frac{y_{itq} - z_{itq}}{\alpha}, & \alpha \neq 0, 1,\\[1ex]
\displaystyle\sum_{itq} y_{itq}\ln\frac{y_{itq}}{z_{itq}} - y_{itq} + z_{itq}, & \alpha = 1,\\[1ex]
\displaystyle\sum_{itq} z_{itq}\ln\frac{z_{itq}}{y_{itq}} + y_{itq} - z_{itq}, & \alpha = 0,
\end{cases} \qquad (4)$$

where $y_{itq} = [\mathcal{Y}]_{itq}$ and $z_{itq} = [\mathbf{G}\mathbf{D}_q\mathbf{G}^T]_{it}$ for $(i = 1, 2, \ldots, I)$, $(t = 1, 2, \ldots, I)$, $(q = 1, 2, \ldots, I)$.

The choice of the parameter $\alpha \in \mathbb{R}$ depends on the statistical distribution of the noise and the data. We recall that, as special cases of the alpha divergence for $\alpha = 2, 0.5, -1$, we obtain Pearson's chi-squared, the Hellinger, and Neyman's chi-squared distances, respectively, while for $\alpha = 1$ and $\alpha = 0$ the divergence has to be defined by the limits of the first case of (4) as $\alpha \to 1$ and $\alpha \to 0$, respectively. When these limits are evaluated, one obtains the generalized Kullback-Leibler divergences given by the second and third cases of (4).
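In code, the three branches of (4) can be evaluated directly. The following NumPy sketch is our own illustration; the small eps safeguard against zero entries is an assumption, not part of the original formulation:

import numpy as np

def alpha_divergence(Y, Z, alpha, eps=1e-12):
    # Generalized alpha divergence (4) between nonnegative tensors Y and Z.
    # The alpha = 1 and alpha = 0 branches implement the limits of the generic case.
    Y = Y + eps
    Z = Z + eps
    if alpha == 1:                              # generalized Kullback-Leibler
        return np.sum(Y * np.log(Y / Z) - Y + Z)
    if alpha == 0:                              # dual Kullback-Leibler
        return np.sum(Z * np.log(Z / Y) + Y - Z)
    return np.sum(Y * ((Y / Z) ** (alpha - 1) - 1) / (alpha * (alpha - 1))
                  - (Y - Z) / alpha)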
The gradient of the alpha divergence (4), for $\alpha \neq 0$, can be expressed in the compact form

$$\frac{\partial D_A^{(\alpha)}}{\partial g_{ij}} = \frac{1}{\alpha}\sum_{tq} g_{qj}\, g_{tj}\left[1 - \left(\frac{y_{itq}}{z_{itq}}\right)^{\alpha}\right], \qquad \alpha \neq 0. \qquad (5)$$

However, instead of applying the standard gradient descent here, we use a projected (nonlinearly transformed) gradient approach, which can be considered as a generalization of the exponentiated gradient:

$$\Phi(g_{ij}) \leftarrow \Phi(g_{ij}) - \eta_{ij}\frac{\partial D_A^{(\alpha)}}{\partial g_{ij}}, \qquad (6)$$

where $\Phi(x)$ is a suitably chosen function. Hence, we have

$$g_{ij} \leftarrow \Phi^{-1}\!\left(\Phi(g_{ij}) - \eta_{ij}\frac{\partial D_A^{(\alpha)}}{\partial g_{ij}}\right). \qquad (7)$$
It can be shown that using such a nonlinear scaling or transformation provides a stable solution, and that the gradients are much better behaved in the $\Phi$ space. In our case, we employ $\Phi(x) = x^{\alpha}$ (for $\alpha = 0$ we use $\Phi(x) = \ln(x)$ instead) and choose the learning rates $\eta_{ij} = \alpha\Phi(g_{ij}) / \sum_{tq} g_{qj}\, g_{tj}$, which leads to the generalized multiplicative alpha algorithm

$$g_{ij} \leftarrow g_{ij}\left[\frac{\sum_{tq} g_{qj}\, g_{tj}\,(y_{itq}/z_{itq})^{\alpha}}{\sum_{tq} g_{qj}\, g_{tj}}\right]^{1/\alpha}, \qquad (8)$$

with normalization of the columns of $\mathbf{G}$ to unit length at each iteration, i.e., $g_{ij} \leftarrow g_{ij} / \sum_{p=1}^{I} g_{pj}$. This SNTF algorithm can be considered as a generalization of the EMML algorithm (for $\alpha = 1$) proposed in [2,6]. We may apply nonlinear projections or filtering via suitable nonlinear monotonic functions which increase or decrease the sparseness. In the simplest case, we can apply the elementwise transformation $g_{tj} \leftarrow (g_{tj})^{1+\alpha_{sp}}$, where $\alpha_{sp}$ is a small coefficient, typically from 0.001 to 0.005, taken positive if we want to increase the sparseness of an estimated component and negative if we want to decrease it. Hence, the generalized alpha algorithm for the SNTF with sparsity control can take the following form:
$$g_{ij} \leftarrow \left[g_{ij}\left(\frac{\sum_{tq} g_{qj}\, g_{tj}\,(y_{itq}/z_{itq})^{\alpha}}{\sum_{tq} g_{qj}\, g_{tj}}\right)^{\omega/\alpha}\right]^{1+\alpha_{sp}}, \qquad (9)$$

where $\omega$ is an over-relaxation parameter (typically in the range $(0, 2)$) which controls the convergence speed, and $\alpha_{sp}$ is a small parameter which controls the sparsity of the estimated matrix $\mathbf{G}$.
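A hedged NumPy sketch of one sweep of update (9) follows; the function name and default values are our own illustrative choices, $\alpha \neq 0$ is assumed, and the columns are normalized with the scaling given after (8):

import numpy as np

def alpha_update(Y, G, alpha=2.0, omega=1.0, alpha_sp=0.002, eps=1e-12):
    # One sweep of the multiplicative alpha algorithm (9) with sparsity control.
    # Y: I x I x I data tensor; G: current I x J nonnegative factor; alpha != 0.
    Z = np.einsum('ij,tj,qj->itq', G, G, G) + eps          # model z_itq
    ratio = ((Y + eps) / Z) ** alpha
    num = np.einsum('itq,tj,qj->ij', ratio, G, G)          # sum_tq g_qj g_tj (y/z)^alpha
    den = np.einsum('tj,qj->j', G, G)                      # sum_tq g_qj g_tj
    G = (G * (num / den) ** (omega / alpha)) ** (1.0 + alpha_sp)
    return G / G.sum(axis=0, keepdims=True)                # normalize columns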
2.2 SMART Algorithm

Alternative multiplicative SNTF algorithms can be derived using exponentiated gradient (EG) descent updates instead of the standard additive gradient descent. For example, by using the alpha divergence (4) for $\alpha = 0$, we have

$$g_{ij} \leftarrow g_{ij}\exp\!\left(-\tilde{\eta}_j\frac{\partial D_A^{(0)}}{\partial g_{ij}}\right), \qquad (10)$$

$$\frac{\partial D_A^{(0)}}{\partial g_{ij}} = \sum_{tq} g_{qj}\, g_{tj}\left(\ln z_{itq} - \ln y_{itq}\right). \qquad (11)$$

Hence, we obtain the simple multiplicative learning rules:

$$g_{ij} \leftarrow g_{ij}\exp\!\left(\sum_{tq} \eta_{ij}\, g_{qj}\, g_{tj}\ln\frac{y_{itq}}{z_{itq}}\right) = g_{ij}\prod_{tq}\left(\frac{y_{itq}}{z_{itq}}\right)^{\eta_{ij}\, g_{qj}\, g_{tj}}. \qquad (12)$$
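For illustration, a NumPy sketch of one exponentiated-gradient sweep (12) is given below; it anticipates the column-wise stabilizing learning rate discussed in the next paragraph, and all names and safeguards are our own assumptions:

import numpy as np

def smart_update(Y, G, omega=1.0, eps=1e-12):
    # One SMART-type exponentiated-gradient sweep, cf. (12),
    # with eta_ij = eta_j = omega * (sum_t g_tj)^(-2).
    Z = np.einsum('ij,tj,qj->itq', G, G, G) + eps
    log_ratio = np.log((Y + eps) / Z)
    grad = np.einsum('itq,tj,qj->ij', log_ratio, G, G)     # sum_tq g_qj g_tj ln(y/z)
    eta = omega / (G.sum(axis=0) ** 2 + eps)               # per-column learning rate
    return G * np.exp(eta[None, :] * grad)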
The nonnegative learning rates $\eta_{ij}$ can take different forms. Typically, in order to guarantee stability of the algorithm, we assume $\eta_{ij} = \tilde{\eta}_j = \omega\left(\sum_{t=1}^{T} g_{tj}\right)^{-2}$, where $\omega \in (0, 2)$ is an over-relaxation parameter. The above SNTF multiplicative algorithm can be considered as an alternating minimization/projection extension of the well-known SMART (Simultaneous Multiplicative Algebraic Reconstruction Technique) [11,21].

2.3 Generalized Beta Divergence

The generalized beta divergence can be considered as a complementary cost function to the generalized alpha divergence and can be defined as follows:

$$D_B^{(\beta)}(\mathcal{Y}\,\|\,\mathcal{Z}) =
\begin{cases}
\displaystyle\sum_{itq} y_{itq}\,\frac{y_{itq}^{\beta} - z_{itq}^{\beta}}{\beta} - \frac{y_{itq}^{\beta+1} - z_{itq}^{\beta+1}}{\beta+1}, & \beta > 0,\\[1ex]
\displaystyle\sum_{itq} y_{itq}\ln\!\left(\frac{y_{itq}}{z_{itq}}\right) - y_{itq} + z_{itq}, & \beta = 0,\\[1ex]
\displaystyle\sum_{itq} \ln\!\left(\frac{z_{itq}}{y_{itq}}\right) + \frac{y_{itq}}{z_{itq}} - 1, & \beta = -1.
\end{cases} \qquad (13)$$
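As with the alpha divergence, (13) is straightforward to evaluate in code; the sketch below is ours (only the three listed branches are covered, and the eps safeguard is an assumption):

import numpy as np

def beta_divergence(Y, Z, beta, eps=1e-12):
    # Generalized beta divergence (13) between nonnegative tensors Y and Z.
    Y = Y + eps
    Z = Z + eps
    if beta == 0:                               # Kullback-Leibler limit
        return np.sum(Y * np.log(Y / Z) - Y + Z)
    if beta == -1:                              # Itakura-Saito-type case
        return np.sum(np.log(Z / Y) + Y / Z - 1.0)
    return np.sum(Y * (Y ** beta - Z ** beta) / beta
                  - (Y ** (beta + 1) - Z ** (beta + 1)) / (beta + 1))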
The choice of the parameter $\beta$ depends on the statistical distribution of the data, and the beta divergence corresponds to the Tweedie models [22]. For example, the optimal choice of the parameter $\beta$ for a normal distribution is $\beta = 1$, for a gamma distribution $\beta = -1$, for a Poisson distribution $\beta = 0$, and for the compound Poisson $\beta \in (-1, 0)$. From the generalized beta divergence, we can derive various kinds of SNTF algorithms: multiplicative algorithms based on standard gradient descent or exponentiated gradient (EG) updates, and additive algorithms using the Projected Gradient (PG), Interior Point Gradient (IPG), quasi-Newton, and Fixed Point (FP) ALS approaches [23,24,25,26,27,28,9,13].

In order to derive a multiplicative SNTF learning algorithm for a sparse factorization, we compute the gradient of the regularized beta divergence (13), with the additional regularization (sparsification) term $J(\mathbf{G}) = \alpha_G\|\mathbf{G}\|_1 = \alpha_G\sum_{ij} g_{ij}$, as

$$\frac{\partial D_{Breg}^{(\beta)}}{\partial g_{ij}} = \sum_{tq}\left(z_{itq}^{\beta} - y_{itq}\, z_{itq}^{\beta-1}\right) g_{qj}\, g_{tj} + \alpha_G. \qquad (14)$$

Applying the simple (first-order) gradient descent approach

$$g_{ij} \leftarrow g_{ij} - \eta_{ij}\frac{\partial D_{Breg}^{(\beta)}}{\partial g_{ij}} \qquad (15)$$

and choosing suitable learning rates $\eta_{ij} = g_{ij} / \sum_{tq} z_{itq}^{\beta}\, g_{qj}\, g_{tj}$, we obtain the generalized SNTF beta algorithm

$$g_{ij} \leftarrow g_{ij}\,\frac{\left[\sum_{tq} g_{qj}\, g_{tj}\left(y_{itq}/z_{itq}^{1-\beta}\right) - \alpha_G\right]_{\varepsilon}}{\sum_{tq} z_{itq}^{\beta}\, g_{qj}\, g_{tj}}, \qquad (16)$$

where $[x]_{\varepsilon} = \max\{\varepsilon, x\}$ with a small $\varepsilon = 10^{-16}$ introduced to avoid zero and negative values. In the special case $\beta = 0$, the above algorithm simplifies to a generalized alternating EMML algorithm similar to the algorithm derived by Hazan et al. [2,29]:

$$g_{ij} \leftarrow g_{ij}\,\frac{\left[\sum_{tq} g_{qj}\, g_{tj}\left(y_{itq}/z_{itq}\right) - \alpha_G\right]_{\varepsilon}}{\sum_{tq} g_{qj}\, g_{tj}}. \qquad (17)$$
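A NumPy sketch of one sweep of the beta algorithm (16) might look as follows; names and defaults are illustrative, and $[\cdot]_{\varepsilon}$ is realized with np.maximum:

import numpy as np

def beta_update(Y, G, beta=1.0, alpha_G=0.0, eps=1e-16):
    # One sweep of the multiplicative beta algorithm (16).
    # Y: I x I x I data tensor; G: current I x J nonnegative factor.
    Z = np.einsum('ij,tj,qj->itq', G, G, G) + eps
    num = np.einsum('itq,tj,qj->ij', Y * Z ** (beta - 1.0), G, G) - alpha_G
    den = np.einsum('itq,tj,qj->ij', Z ** beta, G, G)
    return G * np.maximum(num, eps) / den                  # [x]_eps = max(eps, x)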
3 Simple Alternative Approaches for Super-Symmetric Tensor Decomposition

3.1 Averaging Approach

For tensors of large dimension ($I \gg 1$), the local algorithms derived above can be computationally very time consuming.
In this section, we propose an alternative simple approach which converts the problem to a simple tri-NMF model:

$$\bar{\mathbf{Y}} = \mathbf{G}\bar{\mathbf{D}}\mathbf{G}^T + \bar{\mathbf{N}}, \qquad (18)$$

where $\bar{\mathbf{Y}} = \sum_{q=1}^{Q}\mathbf{Y}_q \in \mathbb{R}^{I \times I}$, $\bar{\mathbf{D}} = \sum_{q=1}^{Q}\mathbf{D}_q = \mathrm{diag}\{\bar{d}_1, \bar{d}_2, \ldots, \bar{d}_J\}$, and $\bar{\mathbf{N}} = \sum_{q=1}^{Q}\mathbf{N}_q \in \mathbb{R}^{I \times I}$. The above system of linear algebraic equations can be represented in the equivalent scalar form $\bar{y}_{it} = \sum_j g_{ij}\, g_{tj}\,\bar{d}_j + \bar{n}_{it}$, or equivalently in the vector form $\bar{\mathbf{Y}} = \sum_j \mathbf{g}_j\,\bar{d}_j\,\mathbf{g}_j^T + \bar{\mathbf{N}}$, where the $\mathbf{g}_j$ are the columns of $\mathbf{G}$. Such a simple model is justified if the noise in the frontal slices is uncorrelated. It is interesting to note that the model can be written in the equivalent form

$$\bar{\mathbf{Y}} = \tilde{\mathbf{G}}\tilde{\mathbf{G}}^T + \bar{\mathbf{N}}, \qquad (19)$$

where $\tilde{\mathbf{G}} = \mathbf{G}\bar{\mathbf{D}}^{1/2}$, assuming that $\bar{\mathbf{D}} \in \mathbb{R}^{J \times J}$ is a non-singular matrix. Thus, the problem can be converted to a standard symmetric NMF problem, and any available NMF algorithm (multiplicative, FP-ALS, or PG) can be used to estimate the matrix $\tilde{\mathbf{G}}$. For example, by minimizing the regularized cost function

$$D(\bar{\mathbf{Y}}\,\|\,\tilde{\mathbf{G}}\tilde{\mathbf{G}}^T) = \frac{1}{2}\|\bar{\mathbf{Y}} - \tilde{\mathbf{G}}\tilde{\mathbf{G}}^T\|_F^2 + \alpha_G\|\tilde{\mathbf{G}}\|_1 \qquad (20)$$

and applying the FP-ALS approach, we obtain the simple algorithm

$$\tilde{\mathbf{G}} \leftarrow \left[(\bar{\mathbf{Y}}^T\tilde{\mathbf{G}} - \alpha_G\mathbf{E})(\tilde{\mathbf{G}}^T\tilde{\mathbf{G}})^{-1}\right]_+, \qquad (21)$$

subject to normalization of the columns of $\tilde{\mathbf{G}}$ to unit length in each iteration step, where $\mathbf{E}$ is a matrix of all ones of appropriate size.
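A compact NumPy sketch of the whole averaging approach (18)-(21) follows; the iteration count, the random initialization, and the small ridge added before the inversion are our own safeguards, not part of the original derivation:

import numpy as np

def averaging_symmetric_nmf(Y, J, alpha_G=0.0, n_iter=100, eps=1e-12):
    # Collapse the frontal slices as in (18), then iterate the FP-ALS update (21).
    Ybar = Y.sum(axis=2)                                   # Ybar = sum_q Y_q
    I = Ybar.shape[0]
    rng = np.random.default_rng(0)
    Gt = rng.random((I, J))                                # estimate of G * Dbar^(1/2)
    E = np.ones((I, J))
    for _ in range(n_iter):
        Gt = np.maximum(
            (Ybar.T @ Gt - alpha_G * E)
            @ np.linalg.inv(Gt.T @ Gt + eps * np.eye(J)),
            eps,
        )
        Gt /= np.linalg.norm(Gt, axis=0, keepdims=True)    # unit-length columns
    return Gt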
3.2 Row-Wise and Column-Wise Unfolding Approach

It is worth noting that the diagonal matrices $\mathbf{D}_q$ are scaling matrices that can be absorbed by the matrix $\mathbf{G}$. By defining the column-normalized matrices $\mathbf{G}_q = \mathbf{G}\mathbf{D}_q$, we can use the following simplified models:

$$\mathbf{Y}_q = \mathbf{G}_q\mathbf{G}^T + \mathbf{N}_q, \qquad (q = 1, \ldots, Q) \qquad (22)$$

or equivalently

$$\mathbf{Y}_q = \mathbf{G}\mathbf{G}_q^T + \mathbf{N}_q, \qquad (q = 1, \ldots, Q). \qquad (23)$$

These simplified models can be described by a single compact matrix equation using column-wise or row-wise unfolding as follows:

$$\mathbf{Y}_c = \mathbf{G}_c\mathbf{G}^T \qquad (24)$$

or

$$\mathbf{Y}_r = \mathbf{G}\,\mathbf{G}_r, \qquad (25)$$
where $\mathbf{Y}_c = \mathbf{Y}_r^T = [\mathbf{Y}_1; \mathbf{Y}_2; \ldots; \mathbf{Y}_Q] \in \mathbb{R}^{I^2 \times I}$ is the column-wise unfolded matrix of the slices $\mathbf{Y}_q$, and $\mathbf{G}_c = \mathbf{G}_r^T = [\mathbf{G}_1; \mathbf{G}_2; \ldots; \mathbf{G}_Q] \in \mathbb{R}^{I^2 \times J}$ is the column-wise unfolded matrix of the matrices $\mathbf{G}_q = \mathbf{G}\mathbf{D}_q$ $(q = 1, 2, \ldots, I)$. Using any efficient NMF algorithm (multiplicative, IPG, quasi-Newton, or FP-ALS) [23,24,25,26,27,28,9,13], we can estimate the matrix $\mathbf{G}$. For example, by minimizing the cost function

$$D(\mathbf{Y}_c\,\|\,\mathbf{G}_c\mathbf{G}^T) = \frac{1}{2}\|\mathbf{Y}_c - \mathbf{G}_c\mathbf{G}^T\|_F^2 + \alpha_G\|\mathbf{G}\|_1 \qquad (26)$$

and applying the FP-ALS approach, we obtain the iterative algorithm

$$\mathbf{G} \leftarrow \left[\left[\mathbf{Y}_c^T\mathbf{G}_c - \alpha_G\mathbf{E}\right]_+\left(\mathbf{G}_c^T\mathbf{G}_c\right)^{-1}\right]_+ \qquad (27)$$

or equivalently

$$\mathbf{G} \leftarrow \left[\left[\mathbf{Y}_r\mathbf{G}_r^T - \alpha_G\mathbf{E}\right]_+\left(\mathbf{G}_r\mathbf{G}_r^T\right)^{-1}\right]_+, \qquad (28)$$

where $\mathbf{G}_c = \mathbf{G}_r^T = [\mathbf{G}\mathbf{D}_1; \mathbf{G}\mathbf{D}_2; \ldots; \mathbf{G}\mathbf{D}_Q]$, $\mathbf{D}_q = \mathrm{diag}\{\mathbf{g}_q\}$, and $\mathbf{g}_q$ denotes the $q$-th row of $\mathbf{G}$.
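The unfolding update (27) can be sketched in NumPy as follows; the names are ours, and a small ridge is added before the matrix inversion purely as a numerical safeguard:

import numpy as np

def unfolding_update(Y, G, alpha_G=0.0, eps=1e-12):
    # One FP-ALS step of (27). Y: I x I x I tensor; G: current I x J factor.
    I, J = G.shape
    # G_c = [G D_1; ...; G D_I] with D_q = diag(q-th row of G); see (24)
    Gc = np.concatenate([G * G[q, :][None, :] for q in range(I)], axis=0)
    # Y_c = [Y_1; ...; Y_I], the column-wise unfolding of the frontal slices
    Yc = np.concatenate([Y[:, :, q] for q in range(I)], axis=0)
    E = np.ones((I, J))
    num = np.maximum(Yc.T @ Gc - alpha_G * E, 0.0)
    return np.maximum(num @ np.linalg.inv(Gc.T @ Gc + eps * np.eye(J)), eps)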
3.3 Semi-orthogonality Constraint

The matrix $\mathbf{G}$ is usually very sparse and may additionally satisfy orthogonality constraints. We can easily impose a semi-orthogonality constraint by incorporating the additional iteration:

$$\mathbf{G} \leftarrow \mathbf{G}\left(\mathbf{G}^T\mathbf{G}\right)^{-1/2}. \qquad (29)$$
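The inverse matrix square root in (29) can be computed from the eigendecomposition of $\mathbf{G}^T\mathbf{G}$; the following sketch reflects our own implementation choice:

import numpy as np

def semi_orthogonalize(G, eps=1e-12):
    # Semi-orthogonality step (29): G <- G (G^T G)^(-1/2).
    w, V = np.linalg.eigh(G.T @ G)                         # symmetric eigendecomposition
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T
    return G @ inv_sqrt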
3.4 Simulation Results

All the NTF algorithms presented in this paper have been tested on many difficult benchmarks for signals and images with various statistical distributions of signals and additive noise. Comparisons and simulation results will be presented at ICONIP 2007.

4 Conclusions and Discussion

We have proposed a generalized and flexible cost function (controlled by sparsity penalty/regularization terms) that allows us to derive a family of SNTF algorithms. The main objective and motivation of this paper is to derive simple multiplicative algorithms which are especially suitable both for very sparse representations and for highly over-determined cases. The basic advantage of the multiplicative algorithms is their simplicity and their relatively straightforward generalization to L-order tensors (L > 3). However, the multiplicative algorithms are relatively slow.

We found that the simple approaches which convert an SNTF problem to a symmetric NMF (SNMF) or symmetric tri-NMF (ST-NMF) problem provide more efficient and faster algorithms, especially for large-scale problems. Moreover, by imposing orthogonality constraints, we can drastically improve performance, especially for noisy data. Obviously, many challenging open issues remain, such as global convergence and the optimal choice of the associated parameters.

References

1. Amari, S.: Differential-Geometrical Methods in Statistics. Springer, Heidelberg (1985)
2. Hazan, T., Polak, S., Shashua, A.: Sparse image coding using a 3D non-negative tensor factorization. In: International Conference on Computer Vision (ICCV), pp. 50-57 (2005)
3. Workshop on Tensor Decompositions and Applications, CIRM, Marseille, France (2005)
4. Heiler, M., Schnoerr, C.: Controlling sparseness in non-negative tensor factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 56-67. Springer, Heidelberg (2006)
5. Smilde, A., Bro, R., Geladi, P.: Multi-way Analysis: Applications in the Chemical Sciences. John Wiley and Sons, New York (2004)
6. Shashua, A., Zass, R., Hazan, T.: Multi-way clustering using super-symmetric non-negative tensor factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 595-608. Springer, Heidelberg (2006)
7. Zass, R., Shashua, A.: A unifying approach to hard and probabilistic clustering. In: International Conference on Computer Vision (ICCV), Beijing, China (2005)
8. Sun, J., Tao, D., Faloutsos, C.: Beyond streams and graphs: dynamic tensor analysis. In: Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006)
9. Berry, M., Browne, M., Langville, A., Pauca, P., Plemmons, R.: Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis (in press, 2006)
10. Cichocki, A., Zdunek, R., Amari, S.: Csiszár's divergences for non-negative matrix factorization: Family of new algorithms. In: Rosca, J.P., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 32-39. Springer, Heidelberg (2006)
11. Cichocki, A., Amari, S., Zdunek, R., Kompass, R., Hori, G., He, Z.: Extended SMART algorithms for non-negative matrix factorization. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Żurada, J.M. (eds.) ICAISC 2006. LNCS (LNAI), vol. 4029, pp. 548-562. Springer, Heidelberg (2006)
12. Cichocki, A., Zdunek, R.: NTFLAB for Signal Processing. Technical report, Laboratory for Advanced Brain Signal Processing, BSI, RIKEN, Saitama, Japan (2006)
13. Dhillon, I., Sra, S.: Generalized nonnegative matrix approximations with Bregman divergences. In: Neural Information Processing Systems (NIPS), Vancouver, Canada, pp. 283-290 (2005)
14. Kim, M., Choi, S.: Monaural music source separation: Nonnegativity, sparseness, and shift-invariance. In: Rosca, J.P., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 617-624. Springer, Heidelberg (2006)
15. Lee, D.D., Seung, H.S.: Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788-791 (1999)
16. Mørup, M., Hansen, L.K., Herrmann, C.S., Parnas, J., Arnfred, S.M.: Parallel factor analysis as an exploratory tool for wavelet transformed event-related EEG. NeuroImage 29, 938-947 (2006)
17. Miwakeichi, F., Martinez-Montes, E., Valdés-Sosa, P., Nishiyama, N., Mizuhara, H., Yamaguchi, Y.: Decomposing EEG data into space-time-frequency components using parallel factor analysis. NeuroImage 22, 1035-1045 (2004)
18. Zass, R., Shashua, A.: Nonnegative sparse PCA. In: Neural Information Processing Systems (NIPS), Vancouver, Canada (2006)
19. Zass, R., Shashua, A.: Doubly stochastic normalization for spectral clustering. In: Neural Information Processing Systems (NIPS), Vancouver, Canada (2006)
20. Comon, P.: Tensor decompositions: state of the art and applications. In: McWhirter, J.G., Proudler, I.K. (eds.) Institute of Mathematics and its Applications Conference on Mathematics in Signal Processing, pp. 18-20. Clarendon Press, Oxford, UK (2001)
21. Byrne, C.L.: Choosing parameters in block-iterative or ordered-subset reconstruction algorithms. IEEE Transactions on Image Processing 14, 321-327 (2005)
22. Minami, M., Eguchi, S.: Robust blind source separation by beta-divergence. Neural Computation 14, 1859-1886 (2002)
23. Cichocki, A., Zdunek, R., Choi, S., Plemmons, R., Amari, S.: Novel multi-layer nonnegative tensor factorization with sparsity constraints. In: Beliczynski, B., Dzielinski, A., Iwanowski, M., Ribeiro, B. (eds.) ICANNGA 2007. LNCS, vol. 4432, pp. 271-280. Springer, Heidelberg (2007)
24. Cichocki, A., Zdunek, R.: Regularized alternating least squares algorithms for non-negative matrix/tensor factorizations. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds.) ISNN 2007. LNCS, vol. 4493, pp. 793-802. Springer, Heidelberg (2007)
25. Cichocki, A., Zdunek, R., Amari, S.: New algorithms for non-negative matrix factorization in applications to blind source separation. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006), Toulouse, France, vol. 5, pp. 621-624 (2006)
26. Cichocki, A., Zdunek, R., Choi, S., Plemmons, R., Amari, S.: Nonnegative tensor factorization using alpha and beta divergences. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007), Honolulu, Hawaii, USA, vol. III, pp. 1393-1396 (2007)
27. Zdunek, R., Cichocki, A.: Nonnegative matrix factorization with constrained second-order optimization. Signal Processing 87, 1904-1916 (2007)
28. Zdunek, R., Cichocki, A.: Nonnegative matrix factorization with quadratic programming. Neurocomputing (accepted, 2007)
29. Shashua, A., Hazan, T.: Non-negative tensor factorization with applications to statistics and computer vision. In: Proc. of the 22nd International Conference on Machine Learning (ICML), Bonn, Germany (2005)