Lecture Notes in Computer Science
Experimental Bayesian Generalization Error of Non-regular Models
Table 1. Average generalization errors in Example 1

(μ1, σ1)    (0,1)=G0  (0,0.5)   (0,2)     (2,1)     (2,0.5)   (2,2)     (10,1)  (10,0.5)      (10,2)
th G1[f1]   0.001     0.00025   0.004     0.005     0.00425   0.008     0.101   0.10025       0.104
G1[f1]      0.001055  0.000239  0.004356  0.006162  0.005107  0.009532  0.109341  0.106619    0.108042
G1[f2]      0.000874  0.000170  0.003466  0.004523  0.003670  0.006669  0.079280  0.078802    0.080180
G1[f3]      0.000394  0.000059  0.001475  0.002374  0.001912  0.003736  0.040287  0.038840    0.039682
M G0[f1]    —         0.002000  —         —         0.028784  —         —       1.79 × 10^26  —
M G0[f2]    —         0.001356  —         —         0.019521  —         —       1.22 × 10^26  —
M G0[f3]    —         0.000667  —         —         0.009595  —         —       5.98 × 10^25  —
3 × G1[f3]  0.001182  0.000177  0.004425  0.007122  0.005736  0.011208  0.120861  0.116520    0.119046

Table 2. Average generalization errors in Example 2

(μ1, σ1)    (0,1)=G0  (0,0.5)   (0,2)     (2,1)     (2,0.5)   (2,2)     (10,1)  (10,0.5)      (10,2)
G1[f4]      0.000688  0.000260  0.002312  0.002977  0.003094  0.003423  0.013640  0.012769    0.012204
G1[f5]      0.000251  0.000103  0.000934  0.001116  0.001291  0.001318  0.004350  0.003729    0.003743
G1[f6]      0.000146  0.000062  0.000626  0.000705  0.000875  0.000918  0.002896  0.002357    0.002489
M G0[f4]    —         0.001356  —         —         0.019521  —         —       1.22 × 10^26  —
M G0[f5]    —         0.000667  —         —         0.009595  —         —       5.98 × 10^25  —
M G0[f6]    —         0.000400  —         —         0.005757  —         —       3.59 × 10^25  —
3 × G1[f5]  0.000753  0.000309  0.002802  0.003348  0.003873  0.003954  0.013050  0.011187    0.011229
5 × G1[f6]  0.000730  0.000310  0.003130  0.003525  0.004375  0.004590  0.014480  0.011785    0.012445

We can see that the theoretical values 'th G1[f1]' are very close to the experimental values G1[f1], in spite of the fact that the theoretical values are established for the asymptotic case. Based on this fact, the accuracy of the experiments can be evaluated by comparing them. As for f2 and f3, they do not have any comparable theoretical value except for the upper bound. We can confirm that every value of G1[f2] and G1[f3] is indeed smaller than the bound.

Example 2 (Simple Neural Networks). Assume that the true function is the zero function and the learning models are three-layer perceptrons:

g(x) = 0,  f4(x, a, b) = a tanh(bx),  f5(x, a, b) = a^3 tanh(bx),  f6(x, a, b) = a^5 tanh(bx).

Table 2 shows the results. In this example, we can again confirm that the bound holds. Combined with the results of the previous example, the bound tends to be tight when μ1 is small. As a matter of fact, the bound holds in small-sample cases, i.e., the number of training data n does not have to be sufficiently large. Though we omit the details for lack of space, the bound was always larger than the experimental results for n = 100, 200, . . . , 400. The properties of the bound are discussed in the next section.

4 Discussions

First, let us confirm that the sampling from the posterior by the MCMC method was done successfully. Based on the algebraic geometrical method, the coefficients of G0(n) are known theoretically (Table 3); f2, f4 and f3, f5 have the same theoretical error.

474 K. Yamazaki and S. Watanabe

Table 3. The coefficients of the generalization error without the covariate shift

     f1    f2, f4   f3, f5   f6
α    1/2   1/2      1/6      1/10
β    1     2        1        1

According to the examples in the previous section, we can compare the theoretical values with the experimental ones:

G0(n)[f1]     = 0.001    vs. 0.001055,
G0(n)[f2, f4] = 0.000678 vs. 0.000874, 0.000688,
G0(n)[f3, f5] = 0.000333 vs. 0.000394, 0.000251,
G0(n)[f6]     = 0.0002   vs. 0.000146.

In the sense of the generalization error, the MCMC method worked well, though there is some fluctuation in the results. Note that it is still an open problem how to evaluate the method; here we measured it by the generalization error, since the theoretical value of G0(n) is known. However, this index is only a necessary condition. Developing an evaluation of the selected samples is left for future study.

Next, we consider the behavior of G1(n). In the examples, the true function was always the zero function g(x) = 0. Learning the zero function is an important case, because in practice we often prepare a sufficiently rich model. The learning function is then set up as

f(x, w) = Σ_{k=1}^{K} t(w_{1k}) h(x, w_{2k}),

where h is the basis function, t is the parameterization of its weight, and w = {w_{11}, w_{21}, w_{12}, w_{22}, . . . , w_{1K}, w_{2K}}. Note that many practical models are included in this expression. Because of the redundancy of the function, some of the h(x, w_{2k}) learn the zero function. Our examples provided the simplest such situations and highlighted the effect of non-regularity in the learning models. The errors G0(n) and G1(n) are generally expressed as

G0(n) = α/n − (β − 1)/(n log n) + o(1/(n log n)),
G1(n) = R1 α/n − R2 (β − 1)/(n log n) + o(1/(n log n)),

where R1 and R2 depend on f, g, q0, and q1. R1 and R2 cause the difference between G0(n) and G1(n) in this expression. In Eq. (12), the coefficient of 1/n is given by b0 + (d1 − d0), so

R1 = (b0 + (d1 − d0))/b0 = 1 + (d1 − d0)/α,

since b0 = α. Let us denote A ⇒ B to mean "A is the only factor that determines the value of B". As mentioned above, f, g, q0, q1 ⇒ R1, R2. Though f, g ⇒ α, β (cf. around Eqs. (5)–(6)), we should emphasize that α, β, q0, q1 ⇏ R1, R2.
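The expansions for G0(n) and G1(n) are easy to evaluate numerically. The sketch below codes their leading terms with the coefficients of Table 3; taking n = 500 (an assumed sample size, not stated in this excerpt but consistent with the theoretical values quoted in the discussion) reproduces 0.001, 0.000678, 0.000333 and 0.0002.

```python
import math

def G0(n, alpha, beta):
    """Leading terms of the generalization error without covariate shift:
    G0(n) = alpha/n - (beta - 1)/(n log n) + o(1/(n log n))."""
    return alpha / n - (beta - 1) / (n * math.log(n))

def G1(n, alpha, beta, R1, R2):
    """Leading terms under covariate shift; R1, R2 depend on f, g, q0, q1."""
    return R1 * alpha / n - R2 * (beta - 1) / (n * math.log(n))

# Coefficients from Table 3; n = 500 is an assumption of this sketch.
n = 500
table3 = {"f1": (0.5, 1), "f2,f4": (0.5, 2), "f3,f5": (1 / 6, 1), "f6": (0.1, 1)}
for name, (alpha, beta) in table3.items():
    print(name, round(G0(n, alpha, beta), 6))
```

Note that when R1 = R2 = 1 (no covariate shift), G1(n) reduces to G0(n) term by term.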
This fact is easily confirmed by comparing f2 with f4 (and also f3 with f5). It holds that G1(n)[f2] ≠ G1(n)[f4] for every q1, although the two models have the same α and β (so that G0(n)[f2] = G0(n)[f4]). Thus α and β are not informative enough to describe R1 and R2. Comparing the values of G1[f2] with those of G1[f4], the basis function (x in f2 and tanh(bx) in f4) seems to play an important role.

To clarify the effect of the basis function, let us fix the function class: Examples 1 and 2 correspond to h(x, w2) = x and h(x, w2) = tanh(bx), respectively. The values of G1[f1] and 3 × G1[f3] (and likewise 3 × G1[f5] and 5 × G1[f6]) can be regarded as the same under any covariate shift. This implies h, g, q0, q1 ⇒ R1, i.e., the parameterization t(w1) does not affect R1. Instead, it affects the non-regularity, or the multiplicity, and determines α and β. Though it remains an unclear factor, the influence of R2 does not seem to be as large as that of R1.

Last, let us analyze the properties of the upper bound M G0(n). According to the above discussion, it holds that R = G1/G0 ≤ M. The ratio G1/G0 basically depends on g, h, q0 and q1. However, M is determined only by the training and test input distributions: q0, q1 ⇒ M. Therefore the bound gives a worst-case evaluation over all g and h. Considering the tightness of the bound, it may still be improved by using the relation between the true and learning functions.

5 Conclusions

In our former study, we obtained the theoretical generalization error and its upper bound under covariate shift. This paper showed that the theoretical value is supported by experiments, in spite of the fact that it was established for an asymptotic case. We also observed the tightness of the bound and discussed the effect of basis functions in the learning models. In this paper, the non-regular models were simple lines and neural networks; investigating more general models is an interesting issue. Though we mainly considered the value of G1(n), the computational cost of the MCMC method is strongly connected to the form of the learning function f. Taking this cost into account in the evaluation is left for future study.

Acknowledgements. The authors would like to thank Masashi Sugiyama, Motoaki Kawanabe, and Klaus-Robert Müller for fruitful discussions. The software for the MCMC method and technical comments were provided by Kenji Nagata. This research was partly supported by the Alexander von Humboldt Foundation and MEXT 18079007.
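For concreteness, the experimental pipeline discussed above (sampling the posterior by MCMC, averaging the predictive function, and evaluating it under the shifted test input distribution) can be sketched for the model f4(x, a, b) = a tanh(bx) with true function g(x) = 0. The prior, noise level, proposal width, and chain length below are all assumptions of this sketch, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(0.0, 1.0, n)            # training inputs from q0 = N(0, 1)
y = rng.normal(0.0, 1.0, n)            # g(x) = 0 plus unit Gaussian noise

def log_post(a, b):
    # Unnormalized log posterior: Gaussian likelihood for a*tanh(b*x),
    # standard normal prior on (a, b); both are assumptions of this sketch.
    r = y - a * np.tanh(b * x)
    return -0.5 * (r @ r) - 0.5 * (a * a + b * b)

# Random-walk Metropolis over (a, b)
samples, (a, b), lp = [], (0.0, 0.0), log_post(0.0, 0.0)
for t in range(20000):
    a2, b2 = a + 0.2 * rng.normal(), b + 0.2 * rng.normal()
    lp2 = log_post(a2, b2)
    if np.log(rng.random()) < lp2 - lp:
        (a, b), lp = (a2, b2), lp2
    if t >= 10000 and t % 10 == 0:     # discard burn-in, then thin
        samples.append((a, b))

# Posterior-mean prediction evaluated under a shifted test distribution
xt = rng.normal(2.0, 1.0, 2000)        # q1 = N(2, 1)
pred = np.mean([a * np.tanh(b * xt) for a, b in samples], axis=0)
shifted_mse = float(np.mean(pred ** 2))  # squared error against g = 0
```

In the paper, G1(n) is a Kullback–Leibler quantity averaged over training sets; the squared-error summary above is only a stand-in to show the mechanics of the posterior sampling and the shifted evaluation.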
Using Image Stimuli to Drive fMRI Analysis

David R. Hardoon 1, Janaina Mourão-Miranda 2, Michael Brammer 2, and John Shawe-Taylor 1

1 The Centre for Computational Statistics and Machine Learning, Department of Computer Science, University College London, Gower St., London WC1E 6BT {D.Hardoon,jst}@cs.ucl.ac.uk
2 Brain Image Analysis Unit, Centre for Neuroimaging Sciences (PO 89), Institute of Psychiatry, De Crespigny Park, London SE5 8AF {Janaina.Mourao-Miranda,Michael.Brammer}@iop.kcl.ac.uk

Abstract. We introduce a new unsupervised fMRI analysis method based on Kernel Canonical Correlation Analysis (KCCA), which differs from the class of supervised learning methods that are increasingly being employed in fMRI data analysis. Whereas SVM associates properties of the imaging data with simple specific categorical labels, KCCA replaces these simple labels with a label vector for each stimulus containing details of the features of that stimulus. We have compared KCCA and SVM analyses of an fMRI data set involving responses to emotionally salient stimuli. This involved first training each algorithm (SVM, KCCA) on a subset of fMRI data and the corresponding labels/label vectors, then testing the algorithms on data withheld from the original training phase. The classification accuracies of SVM and KCCA proved to be very similar. However, the most important result arising from this study is that KCCA is able, in part, to extract many of the brain regions that SVM identifies as the most important in task discrimination, blind to the categorical task labels.

Keywords: Machine learning methods, Kernel canonical correlation analysis, Support vector machines, Classifiers, Functional magnetic resonance imaging data analysis.

1 Introduction

Recently, machine learning methodologies have been increasingly used to analyse the relationship between stimulus categories and fMRI responses [1,2,3,4,5,6,7,8,9,10].
In this paper, we introduce a new unsupervised machine learning approach to fMRI analysis, in which the simple categorical description of stimulus type (e.g. type of task) is replaced by a more informative vector of stimulus features. We compare this new approach with a standard Support Vector Machine (SVM) analysis of fMRI data using a categorical description of stimulus type.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 477–486, 2008. © Springer-Verlag Berlin Heidelberg 2008
The technology of the present study originates from earlier research carried out in the domain of image annotation [11], where an image annotation methodology learns a direct mapping from image descriptors to keywords. Previous attempts at unsupervised fMRI analysis have been based on Kohonen self-organising maps, fuzzy clustering [12] and nonparametric estimation methods of the hemodynamic response function, such as the general method described in [13]. [14] have reported an interesting study which showed that the discriminability of PCA basis representations of images of multiple object categories is significantly correlated with the discriminability of PCA basis representations of the fMRI volumes based on category labels. The current study differs from conventional unsupervised approaches in that it makes use of the stimulus characteristics as an implicit representation of a complex state label. We use kernel Canonical Correlation Analysis (KCCA) to learn the correlation between an fMRI volume and its corresponding stimulus. Canonical correlation analysis can be seen as the problem of finding basis vectors for two sets of variables such that the correlations of the projections of the variables onto the corresponding basis vectors are maximised. KCCA first projects the data into a higher-dimensional feature space before performing CCA in the new feature space. CCA [15,16] and KCCA [17] have been used in previous fMRI analyses, but only with conventional categorical stimulus descriptions, without exploring the possibility of using complex characteristics of the stimuli as the source for feature selection from the fMRI data. The fMRI data used in the following study originated from an experiment in which the stimuli were designed to evoke different types of emotional responses, pleasant or unpleasant. The pleasant images consisted of women in swimsuits, while the unpleasant images were a collection of images of skin diseases.
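As a concrete illustration of the KCCA construction just described (not the authors' implementation), the sketch below takes kernel matrices for two views, here toy stand-ins for fMRI volumes and stimulus feature vectors with linear kernels, and solves a regularized dual eigenproblem for the top canonical correlation. The regularization constant and the synthetic data are assumptions of this sketch.

```python
import numpy as np

def center_kernel(K):
    """Double-center a kernel matrix (zero mean in feature space)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca_top(Kx, Ky, kappa=1.0):
    """Top canonical correlation and dual weights for regularized KCCA.
    Solves (Kx + kI)^-1 Ky (Ky + kI)^-1 Kx a = rho^2 a, a common
    regularized formulation; kappa is a free parameter of this sketch."""
    n = Kx.shape[0]
    Kx, Ky = center_kernel(Kx), center_kernel(Ky)
    Rx, Ry = Kx + kappa * np.eye(n), Ky + kappa * np.eye(n)
    M = np.linalg.solve(Rx, Ky) @ np.linalg.solve(Ry, Kx)
    evals, evecs = np.linalg.eig(M)
    i = int(np.argmax(evals.real))
    alpha = evecs[:, i].real
    beta = np.linalg.solve(Ry, Kx @ alpha)   # recover the second view's weights
    return float(np.sqrt(max(evals[i].real, 0.0))), alpha, beta

# Toy two-view data sharing a 2-d latent signal (stand-ins for fMRI
# volumes and stimulus feature vectors; linear kernels for simplicity).
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 2))
X = z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))
Y = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(100, 4))
rho, alpha, beta = kcca_top(X @ X.T, Y @ Y.T)
```

In the paper's setting, one view would hold the masked, vectorized fMRI volumes and the other the stimulus feature vectors, and a nonlinear kernel could replace the linear one.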
Each stimulus image was represented using Scale Invariant Feature Transformation (SIFT) [18] features. Interestingly, some of the properties of the SIFT representation have been modeled on the properties of complex neurons in the visual cortex. Although not specifically exploited in the current paper, future studies may be able to utilize this property to probe aspects of brain function such as modularity. In the current study, we present a feasibility study of the possibility of generating new activity maps by using the actual stimuli that had generated the fMRI volume. We show that KCCA is able to extract brain regions identified by supervised methods such as SVM in task discrimination and to achieve similar levels of accuracy, and we discuss some of the challenges in interpreting the results given the complex input feature vectors used by KCCA in place of categorical labels. This work is an extension of the work presented in [19]. The paper is structured as follows. Section 2 gives a review of the fMRI data acquisition as well as the experimental design and the pre-processing, followed by a brief description of the scale invariant feature transformation in Section 2.1. The SVM is briefly described in Section 2.2, while Section 2.3 elaborates on the KCCA methodology. Our results are presented in Section 3. We conclude with a discussion in Section 4.
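The supervised baseline described above can be emulated in a few lines. Below is a toy stand-in for the SVM analysis (not the authors' pipeline): a linear SVM trained with the Pegasos subgradient method on synthetic "voxel" data, with the magnitude of the learned weights playing the role of a discrimination map. The data dimensions and the injected class signal are invented for this sketch.

```python
import numpy as np

def pegasos_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Linear SVM weights via the Pegasos subgradient method (y in {-1, +1})."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, t = np.zeros(d), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * (X[i] @ w)   # evaluate before the shrink step
            w *= (1 - eta * lam)
            if margin < 1:
                w += eta * y[i] * X[i]
    return w

rng = np.random.default_rng(1)
n_scans, n_voxels = 60, 500
X = rng.normal(size=(n_scans, n_voxels))   # stand-in for masked fMRI scans
y = np.repeat([1, -1], n_scans // 2)       # pleasant vs. unpleasant labels
X[y == 1, :30] += 1.0                      # weak signal in the first 30 "voxels"

tr, te = slice(0, n_scans, 2), slice(1, n_scans, 2)   # even/odd split
w = pegasos_svm(X[tr], y[tr])
acc = float(np.mean(np.sign(X[te] @ w) == y[te]))
discrim_map = np.abs(w)                    # voxel-wise importance, cf. SVM maps
```

The real analysis would, of course, train on preprocessed fMRI volumes and map the weight vector back into brain space for visualization.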
2 Materials and Methods

Due to the lack of space we refer the reader to [10] for a detailed account of the subjects, data acquisition and pre-processing applied to the data, as well as the experimental design.

2.1 Scale Invariant Feature Transformation

Scale Invariant Feature Transformation (SIFT) was introduced by [18] and shown to be superior to other descriptors [20]. This is due to the SIFT descriptors being designed to be invariant to small shifts in the position of salient (i.e. prominent) regions. Calculation of the SIFT vector begins with a scale-space search in which local minima and maxima are identified in each image (so-called key locations). The properties of the image at each key location are then expressed in terms of gradient magnitude and orientation. A canonical orientation is then assigned to each key location to maximize rotation invariance. Robustness to reorientation is introduced by representing local image regions around key voxels in a number of orientations. A reference key vector is then computed over all images and the data for each image are represented in terms of distance from this reference.

Image Processing. Let f_i^l
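A rough numpy sketch of the final representation step described in Section 2.1: pool each image's SIFT descriptors, compute a reference key vector over all images, and represent each image by its distance from that reference. The mean-pooling step and the dimensions are assumptions of this sketch, not details from the paper; in practice the per-image descriptors would come from a SIFT implementation such as OpenCV's `cv2.SIFT_create()`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for per-image SIFT output: each image yields a variable number
# of 128-d keypoint descriptors (random here, purely for illustration).
descriptor_sets = [rng.random((int(k), 128)) for k in rng.integers(5, 20, size=6)]

# Pool each image's descriptors into one vector (mean pooling is an assumption).
pooled = np.stack([d.mean(axis=0) for d in descriptor_sets])   # shape (6, 128)

# Reference key vector computed over all images, as described above.
reference = pooled.mean(axis=0)

# Each image represented by its distance from the reference.
features = np.linalg.norm(pooled - reference, axis=1)          # shape (6,)
```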