Lecture Notes in Computer Science



(8)

To implement MAP estimation of the $d$-th projection matrix $U_d$, the EM algorithm is applied here. The expectation of the log likelihood of the complete data with respect to $p(x_{d;j} \mid t_{d;j}, U_d, \sigma_d^2)$ is given by

$$E(L_c) = \sum_{i=1}^{n} \sum_{d=1}^{M} E\left[\log p\left(\mathcal{T}_i, \mathcal{X}_i \mid U_j|_{j \neq d}\right)\right] = \sum_{d=1}^{M} \sum_{j=1}^{n\bar{l}_d} E\left[\log p\left(t_{d;j}, x_{d;j}\right)\right]. \tag{9}$$

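Here, $\log p(t_{d;j}, x_{d;j})$, with the given $U_k|_{k \neq d}^{1 \le k \le M}$, is given by

$$\log p\left(t_{d;j}, x_{d;j}\right) \propto -\left( l_d \log \sigma_d^2 + \|x_{d;j}\|^2 + \sigma_d^{-2} \left\| t_{d;j} - U_d^T x_{d;j} - u_d \right\|^2 \right). \tag{10}$$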

It is impossible to maximize $E(L_c)$ with respect to all projection matrices $U_d|_{d=1}^{M}$ simultaneously, because different projection matrices are inter-related [7] during the optimization procedure, i.e., it is required to know $U_j|_{j \neq d}$ to optimize $U_d$. Therefore, we need to apply an alternating optimization procedure [7]. To optimize the $d$-th projection matrix $U_d$ together with $\sigma_d^2$, we need the decoupled expectation of the log likelihood function on the $d$-th mode:

$$\begin{aligned} \sum_{j=1}^{n\bar{l}_d} E\left[\log p\left(t_{d;j}, x_{d;j}\right)\right] \propto -\sum_{j=1}^{n\bar{l}_d} \Big( & l_d \log \sigma_d^2 + \mathrm{tr}\left(E\left[x_{d;j} x_{d;j}^T\right]\right) + \sigma_d^{-2} \left(t_{d;j} - u_d\right)^T \left(t_{d;j} - u_d\right) \\ & - 2\sigma_d^{-2} E\left[x_{d;j}\right]^T U_d \left(t_{d;j} - u_d\right) + \sigma_d^{-2}\, \mathrm{tr}\left(U_d U_d^T E\left[x_{d;j} x_{d;j}^T\right]\right) \Big). \end{aligned} \tag{11}$$

Based on (6), we then have

$$E\left[x_{d;j}\right] = M_d^{-1} U_d \left(t_{d;j} - u_d\right) \tag{12}$$

and

$$E\left[x_{d;j} x_{d;j}^T\right] = \sigma_d^2 M_d^{-1} + E\left[x_{d;j}\right] E\left[x_{d;j}\right]^T. \tag{13}$$

Eqs. (12) and (13) form the expectation step or E-step.

The maximization step or M-step is obtained by maximizing $\sum_{j=1}^{n\bar{l}_d} E\left[\log p\left(t_{d;j}, x_{d;j}\right)\right]$ with respect to $U_d$ and $\sigma_d^2$. In detail, by setting $\partial_{U_d} \left( \sum_{j=1}^{n\bar{l}_d} E\left[\log p\left(t_{d;j}, x_{d;j}\right)\right] \right) = 0$, we have

$$U_d = \left[ \sum_{j=1}^{n\bar{l}_d} E\left[x_{d;j} x_{d;j}^T\right] \right]^{-1} \left[ \sum_{j=1}^{n\bar{l}_d} E\left[x_{d;j}\right] \left(t_{d;j} - u_d\right)^T \right]; \tag{14}$$

Probabilistic Tensor Analysis with Akaike and Bayesian Information Criteria


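and by setting $\partial_{\sigma_d^2} \left( \sum_{j=1}^{n\bar{l}_d} E\left[\log p\left(t_{d;j}, x_{d;j}\right)\right] \right) = 0$, we have

$$\sigma_d^2 = \frac{1}{n\bar{l}_d\, l_d} \sum_{j=1}^{n\bar{l}_d} \left\{ \left\| t_{d;j} - u_d \right\|^2 - 2 E\left[x_{d;j}\right]^T U_d \left(t_{d;j} - u_d\right) + \mathrm{tr}\left( E\left[x_{d;j} x_{d;j}^T\right] U_d U_d^T \right) \right\}. \tag{15}$$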
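The E-step and M-step above can be sketched for a single mode as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the mode-$d$ model is taken to be $t = U_d^T x + u_d + \text{noise}$, $M_d$ is assumed to be $U_d U_d^T + \sigma_d^2 I$ (the usual probabilistic PCA form of the quantity referred to in (6)), the variance update reuses the freshly updated $U_d$, and the function name `em_step` and all sizes are hypothetical.

```python
import numpy as np

def em_step(T, U, sigma2, u):
    """One EM update for the mode-d model t = U^T x + u + noise.

    T: (N, l) array whose rows are the vectors t_{d;j};
    U: (l', l) projection matrix; sigma2: noise variance; u: (l,) mean vector.
    """
    N, l = T.shape
    lp = U.shape[0]
    Tc = T - u                                    # rows are t_{d;j} - u_d
    # E-step, Eqs. (12)-(13). Assumption: M_d = U U^T + sigma^2 I.
    M = U @ U.T + sigma2 * np.eye(lp)
    Minv = np.linalg.inv(M)
    Ex = Tc @ U.T @ Minv                          # row j is E[x_{d;j}]^T
    sum_Exx = N * sigma2 * Minv + Ex.T @ Ex       # sum over j of E[x x^T]
    # M-step for the projection matrix, Eq. (14).
    U_new = np.linalg.solve(sum_Exx, Ex.T @ Tc)   # shape (l', l)
    # M-step for the noise variance, Eq. (15), evaluated with the updated U.
    resid = (np.sum(Tc ** 2)
             - 2.0 * np.sum((Ex @ U_new) * Tc)
             + np.trace(sum_Exx @ U_new @ U_new.T))
    return U_new, resid / (N * l)

# Hypothetical toy data: 2000 vectors of length 8 with a 3-dimensional latent part.
rng = np.random.default_rng(0)
l, lp, N = 8, 3, 2000
U_true = rng.uniform(0.0, 1.0, (lp, l))
T = rng.standard_normal((N, lp)) @ U_true + 0.1 * rng.standard_normal((N, l))
u = T.mean(axis=0)                                # sample mean used as u_d

U, sigma2 = rng.standard_normal((lp, l)), 1.0     # random initialisation
for _ in range(200):
    U, sigma2 = em_step(T, U, sigma2, u)
print(U.shape, sigma2)
```

In the full alternating procedure, a step of this kind would be run for one mode while the projection matrices of the other modes are held fixed, and then repeated for the next mode.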

2.3 Dimension Reduction and Data Reconstruction

After having the projection matrices $U_d|_{d=1}^{M}$, the following operations are important for different applications:

Dimension Reduction: Given the projection matrices $U_d|_{d=1}^{M}$ and an observed tensor $\mathcal{T} \in R^{l_1 \times l_2 \times \cdots \times l_{M-1} \times l_M}$ in the high dimensional space, how do we find the corresponding latent tensor $\mathcal{X} \in R^{l'_1 \times l'_2 \times \cdots \times l'_{M-1} \times l'_M}$ in the low dimensional space? From tensor algebra, the dimension reduction is given by $\mathcal{X} = \mathcal{T} \prod_{d=1}^{M} \times_d U_d$. However, this method lacks a probabilistic perspective. Under the proposed decoupled probabilistic model, $\mathcal{X}$ is obtained by maximizing $p(\mathcal{X} \mid \mathcal{T}) \propto \prod_{d=1}^{M} p(x_d \mid t_d)$. The dimension reduction is

$$\mathcal{X} = \left(\mathcal{T} - \mathcal{M}\right) \prod_{d=1}^{M} \times_d \left(M_d^{-1} U_d\right). \tag{16}$$

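Data Reconstruction: Given the projection matrices $U_d|_{d=1}^{M}$ and the latent tensor $\mathcal{X} \in R^{l'_1 \times l'_2 \times \cdots \times l'_{M-1} \times l'_M}$ in the low dimensional space, how do we approximate the corresponding observed tensor $\mathcal{T} \in R^{l_1 \times l_2 \times \cdots \times l_{M-1} \times l_M}$ in the high dimensional space? Based on (16), the data reconstruction procedure is given by

$$\hat{\mathcal{T}} = \mathcal{X} \prod_{d=1}^{M} \times_d \left( U_d^T \left(U_d U_d^T\right)^{-1} M_d \right) + \mathcal{M}. \tag{17}$$

The reconstruction error is given by $\left\| \mathcal{T} - \hat{\mathcal{T}} \right\|_{Fro}$.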
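The two operations above can be illustrated with a short NumPy sketch. It is an illustration under assumptions rather than the authors' code: the mode-$d$ product is taken in the common convention where a matrix of shape (new size, old size) acts on axis $d$, $M_d$ is assumed to be $U_d U_d^T + \sigma_d^2 I$, and the helper names `mode_product`, `reduce_dim` and `reconstruct` are hypothetical.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """Mode-d product: apply `matrix` of shape (new, old) along axis `mode`."""
    moved = np.moveaxis(tensor, mode, -1)          # bring the axis to the end
    return np.moveaxis(moved @ matrix.T, -1, mode)

def reduce_dim(T, mean, Us, sigma2s):
    """Eq. (16): subtract the mean tensor, then apply M_d^{-1} U_d on each mode."""
    X = T - mean
    for d, (U, s2) in enumerate(zip(Us, sigma2s)):
        M = U @ U.T + s2 * np.eye(U.shape[0])      # assumed form of M_d
        X = mode_product(X, np.linalg.solve(M, U), d)
    return X

def reconstruct(X, mean, Us, sigma2s):
    """Eq. (17): apply U_d^T (U_d U_d^T)^{-1} M_d on each mode, then add the mean."""
    T_hat = X
    for d, (U, s2) in enumerate(zip(Us, sigma2s)):
        M = U @ U.T + s2 * np.eye(U.shape[0])
        A = U.T @ np.linalg.solve(U @ U.T, M)      # shape (l_d, l'_d)
        T_hat = mode_product(T_hat, A, d)
    return T_hat + mean

# Hypothetical noise-free second order example: 8 x 8 observed, 3 x 2 latent.
rng = np.random.default_rng(1)
Us = [rng.uniform(0.0, 1.0, (3, 8)), rng.uniform(0.0, 1.0, (2, 8))]
sigma2s = [0.01, 0.01]
mean = rng.uniform(0.0, 1.0, (8, 8))
T = rng.standard_normal((3, 2))
for d, U in enumerate(Us):
    T = mode_product(T, U.T, d)
T = T + mean

X = reduce_dim(T, mean, Us, sigma2s)
T_hat = reconstruct(X, mean, Us, sigma2s)
error = np.linalg.norm(T - T_hat)                  # Frobenius norm of the residual
print(X.shape, error)
```

Because the toy tensor is generated without noise, it lies in the span of the projection matrices, so reduction followed by reconstruction returns it up to rounding error.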

2.4 Akaike and Bayesian Information Criteria for PTA

AIC and BIC are popular methods for model selection in statistics. However, they are

developed for vector data. In the proposed PTA, data are in tensor form. Therefore, it

is important to find a suitable method to utilize AIC and BIC for tensor based learning

models.

In PTA, the conventional AIC and BIC could be applied to determine the size of $U_d|_{d=1}^{M}$. The exhaustive search based on AIC (BIC) is applied for model selection. In detail, for AIC based model selection, we need to calculate the score of AIC


D. Tao et al.

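$$J_{AIC}\left(U_d, l'_d\right) = 2\left( l_d l'_d + 1 - \frac{l'_d \left(l'_d - 1\right)}{2} \right) + n\bar{l}_d \left[ \log\det\left(U_d^T U_d + \sigma_d^2 I\right) + \mathrm{tr}\left( \left(U_d^T U_d + \sigma_d^2 I\right)^{-1} S_d \right) \right] \tag{18}$$

for each mode $\prod_{d=1}^{M} \left(l_d - 1\right)$ times, because the number of rows $l'_d$ in each projection matrix $U_d$ changes from 1 to $(l_d - 1)$. In the determination stage, the optimal $l_d^{\prime *}$ is

$$l_d^{\prime *} = \arg\min_{l'_d} J_{AIC}\left(U_d, l'_d\right), \tag{19}$$

where $1 \le l'_d \le l_d - 1$.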

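For BIC based model selection in PTA, we have a definition similar to that of AIC,

$$J_{BIC}\left(U_d, l'_d\right) = \log\left(n\bar{l}_d\right) \left( l_d l'_d + 1 - \frac{l'_d \left(l'_d - 1\right)}{2} \right) + n\bar{l}_d \left[ \log\det\left(U_d^T U_d + \sigma_d^2 I\right) + \mathrm{tr}\left( \left(U_d^T U_d + \sigma_d^2 I\right)^{-1} S_d \right) \right] \tag{20}$$

for each mode $\prod_{d=1}^{M} \left(l_d - 1\right)$ times. In the determination stage, the optimal $l_d^{\prime *}$ is

$$l_d^{\prime *} = \arg\min_{l'_d} J_{BIC}\left(U_d, l'_d\right), \tag{21}$$

where $1 \le l'_d \le l_d - 1$.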
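A rough single-mode sketch of the AIC and BIC scores and of the arg min selection in (19) and (21) is given below. Several parts are assumptions and are not taken from the text: the penalty uses the standard probabilistic PCA count of free parameters, $l_d l'_d + 1 - l'_d(l'_d - 1)/2$; the data term is the usual Gaussian form $N[\log\det C + \mathrm{tr}(C^{-1} S)]$ with $C = U^T U + \sigma^2 I$; and each candidate model is fitted with the closed-form maximum-likelihood solution of Tipping and Bishop instead of the EM iterations of PTA. All function names are hypothetical.

```python
import numpy as np

def n_free_params(l, lp):
    """Assumed number of free parameters: an lp-by-l matrix (up to rotation) plus one variance."""
    return l * lp + 1 - lp * (lp - 1) / 2

def neg2_loglik(U, sigma2, S, N):
    """N [log det C + tr(C^{-1} S)] with C = U^T U + sigma^2 I (constants dropped)."""
    C = U.T @ U + sigma2 * np.eye(U.shape[1])
    _, logdet = np.linalg.slogdet(C)
    return N * (logdet + np.trace(np.linalg.solve(C, S)))

def aic_score(U, sigma2, S, N):
    return 2.0 * n_free_params(U.shape[1], U.shape[0]) + neg2_loglik(U, sigma2, S, N)

def bic_score(U, sigma2, S, N):
    return np.log(N) * n_free_params(U.shape[1], U.shape[0]) + neg2_loglik(U, sigma2, S, N)

def closed_form_fit(S, lp):
    """Closed-form maximum-likelihood fit from the eigen-decomposition of S (stand-in for EM)."""
    vals, vecs = np.linalg.eigh(S)
    vals, vecs = vals[::-1], vecs[:, ::-1]          # sort eigenvalues in descending order
    sigma2 = vals[lp:].mean()                       # mean of the discarded eigenvalues
    U = (vecs[:, :lp] * np.sqrt(vals[:lp] - sigma2)).T
    return U, sigma2

# Hypothetical toy data with true latent dimension 3 in an 8-dimensional mode.
rng = np.random.default_rng(2)
l, lp_true, N = 8, 3, 500
U_true = rng.uniform(0.0, 1.0, (lp_true, l))
T = rng.standard_normal((N, lp_true)) @ U_true + 0.1 * rng.standard_normal((N, l))
S = np.cov(T, rowvar=False, bias=True)              # sample covariance of the vectors

bic = {lp: bic_score(*closed_form_fit(S, lp), S, N) for lp in range(1, l)}
aic = {lp: aic_score(*closed_form_fit(S, lp), S, N) for lp in range(1, l)}
best = min(bic, key=bic.get)                        # arg min over 1 <= l' <= l - 1
print(best)
```

The two criteria share the same data term and differ only in how strongly each extra parameter is penalised, so BIC penalises larger models more heavily than AIC once $\log N$ exceeds 2.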

3 Empirical Study

In this Section, we utilize a synthetic data model to evaluate BIC PTA in terms of accuracy for model selection. For AIC PTA, we have very similar experimental results to those of BIC PTA. The accuracy is measured by the model selection error $\sum_{d=1}^{M} \left| l'_d - l_d^{\prime *} \right|$. Here, $l'_d$ is the real model, i.e., the real dimension of the $d$-th mode of the unobserved latent tensor; and $l_d^{\prime *}$ is the selected model, i.e., the selected dimension of the $d$-th mode of the unobserved latent tensor by using BIC PTA. A multilinear transformation is applied to map the tensor from the low dimensional space $R^{l'_1 \times l'_2 \times \cdots \times l'_M}$ to the high dimensional space $R^{l_1 \times l_2 \times \cdots \times l_M}$ by

$$\mathcal{T}_i = \mathcal{X}_i \prod_{d=1}^{M} \times_d U_d^T + \mathcal{M} + \varsigma \mathcal{E}_i,$$

where $\mathcal{X}_i \in R^{l'_1 \times l'_2 \times \cdots \times l'_M}$ and every entry of every unobserved latent tensor $\mathcal{X}_i$ is generated from a single Gaussian with mean zero and variance 1, i.e., $N(0, 1)$; $\mathcal{E}_i$ is the noise tensor and every entry $e_j$ is drawn from $N(0, 1)$; $\varsigma$ is a scalar and we set it as 0.01; the mean tensor $\mathcal{M} \in R^{l_1 \times l_2 \times \cdots \times l_M}$ is a random tensor and every entry in $\mathcal{M}$ is drawn from the uniform distribution on the interval $[0, 1]$; the projection matrices $U_d|_{d=1}^{M} \in R^{l'_d \times l_d}$ are random matrices and every entry in $U_d|_{d=1}^{M}$ is drawn from the uniform distribution on the interval $[0, 1]$; and $i$ denotes the $i$-th tensor measurement.
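The data model described above can be sketched as a small NumPy generator. This is an illustrative sketch, not the authors' script: the function names, the seed and the default sizes are hypothetical, and the scalar 0.01 is assumed to multiply the noise tensor directly.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """Mode-d product: apply `matrix` of shape (new, old) along axis `mode`."""
    moved = np.moveaxis(tensor, mode, -1)
    return np.moveaxis(moved @ matrix.T, -1, mode)

def generate(n, dims, latent_dims, noise_scale=0.01, seed=0):
    """Draw n tensors T_i = X_i x_1 U_1^T ... x_M U_M^T + mean + noise_scale * E_i."""
    rng = np.random.default_rng(seed)
    # Projection matrices U_d of shape (l'_d, l_d) with entries uniform on [0, 1].
    Us = [rng.uniform(0.0, 1.0, (lp, l)) for lp, l in zip(latent_dims, dims)]
    mean = rng.uniform(0.0, 1.0, dims)               # mean tensor, uniform on [0, 1]
    data = []
    for _ in range(n):
        T = rng.standard_normal(latent_dims)         # latent tensor, entries N(0, 1)
        for d, U in enumerate(Us):
            T = mode_product(T, U.T, d)              # map l'_d -> l_d on mode d
        noise = rng.standard_normal(dims)            # noise tensor, entries N(0, 1)
        data.append(T + mean + noise_scale * noise)
    return np.stack(data), mean, Us

# Ten 8 x 8 measurements with a 3 x 2 latent tensor, as in the first experiment.
data, mean, Us = generate(10, (8, 8), (3, 2))
print(data.shape)
```

Setting `noise_scale=0.0` gives tensors whose centered mode-1 unfoldings have rank at most $l'_1$, which is a quick sanity check on the generator.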


Fig. 3. BIC score matrices for the first and the second projection matrices. Each block corresponds to a BIC score. The darker the block, the smaller the BIC score. Based on this figure, we determine $l_1^{\prime *} = 3$ and $l_2^{\prime *} = 2$ based on BIC obtained by PTA, and the model selection error is 0. Figure 4 shows the Hinton diagram of the first and the second projection matrices in the left and the right sub-figures, respectively. Projection matrices are obtained from PTA by setting $l_1^{\prime *} = 7$ and $l_2^{\prime *} = 5$.

In the first experiment, the data generator gives 10 measurements by setting $M = 2$, $l_1 = l_2 = 8$, $l'_1 = 3$, and $l'_2 = 2$. To determine $l_1^{\prime *}$ and $l_2^{\prime *}$ based on BIC for PTA, we need to conduct PTA $(l_1 - 1)(l_2 - 1)$ times and obtain two BIC score matrices, for the first mode projection matrix $U_1$ and the second projection matrix $U_2$ respectively, as shown in Figure 3. In this figure, every block corresponds to a BIC score, and the darker the block, the smaller the corresponding BIC score. We use a light rectangle to mark the darkest block in each BIC score matrix; this block corresponds to the smallest value. In the first BIC score matrix, shown in the left sub-figure of Figure 3, the smallest value is located at $(3, 5)$. Because this BIC score matrix is calculated for the first mode projection matrix based on (20), we can set $l_1^{\prime *} = 3$ according to (21). Similar to the determination of $l_1^{\prime *}$, we determine $l_2^{\prime *} = 2$ according to the second BIC score matrix, shown in the right sub-figure of Figure 3, because the smallest value is located at $(7, 2)$. For this example, the model selection error is $\sum_{d} \left| l'_d - l_d^{\prime *} \right| = 0$.
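The bookkeeping of this experiment can be checked with a few lines of plain Python. The two score matrices below are hypothetical toy grids whose minima are simply placed at the positions reported in the text, $(3, 5)$ and $(7, 2)$; they are not the BIC values of Figure 3. Rows are assumed to index $l'_1$ and columns $l'_2$, both starting at 1.

```python
from math import prod

def num_runs(dims):
    """Number of PTA fits in the exhaustive search: product of (l_d - 1) over modes."""
    return prod(l - 1 for l in dims)

def argmin_2d(scores):
    """1-based (row, column) position of the smallest entry of a score matrix."""
    best = min((value, r, c)
               for r, row in enumerate(scores, start=1)
               for c, value in enumerate(row, start=1))
    return best[1], best[2]

def selection_error(true_dims, selected_dims):
    """Model selection error: sum over modes of |l'_d - l'*_d|."""
    return sum(abs(a - b) for a, b in zip(true_dims, selected_dims))

def toy_scores(size, row, col):
    """Hypothetical score grid with its unique minimum at (row, col)."""
    return [[abs(r - row) + abs(c - col) for c in range(1, size + 1)]
            for r in range(1, size + 1)]

first = toy_scores(7, 3, 5)      # stands in for the score matrix of U_1
second = toy_scores(7, 7, 2)     # stands in for the score matrix of U_2

# The first matrix selects l'_1 from its row index, the second l'_2 from its column index.
selected = (argmin_2d(first)[0], argmin_2d(second)[1])
runs = num_runs((8, 8))
error = selection_error((3, 2), selected)
print(runs, selected, error)
```

With $l_1 = l_2 = 8$ the exhaustive search needs $7 \times 7 = 49$ PTA fits, which is why each score matrix in Figure 3 is a 7 by 7 grid.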

We repeat the experiments with a setting similar to that of the first experiment in this Section 30 times, but $l'_1$ and $l'_2$ are randomly set with the following requirements: $l_1 \le l_2$, $l'_1 \le l'_2$, $l'_1 < l_1$, and $l'_2 < l_2$. The total model selection errors for BIC PTA are 0. We also conduct 30 experiments for third order tensors, with similar setting as described above and $l'_1$, $l_3$, and $l'_3$ are setting
