Lecture Notes in Computer Science
Experimental Bayesian Generalization Error of Non-regular Models
Table 1. Average generalization errors in Example 1

(μ1, σ1)    (0,1)=G0  (0,0.5)   (0,2)     (2,1)     (2,0.5)   (2,2)     (10,1)  (10,0.5)      (10,2)
th G1[f1]   0.001     0.00025   0.004     0.005     0.00425   0.008     0.101   0.10025       0.104
G1[f1]      0.001055  0.000239  0.004356  0.006162  0.005107  0.009532  0.109341  0.106619    0.108042
G1[f2]      0.000874  0.000170  0.003466  0.004523  0.003670  0.006669  0.079280  0.078802    0.080180
G1[f3]      0.000394  0.000059  0.001475  0.002374  0.001912  0.003736  0.040287  0.038840    0.039682
M G0[f1]    —         0.002000  —         —         0.028784  —         —       1.79 × 10^26  —
M G0[f2]    —         0.001356  —         —         0.019521  —         —       1.22 × 10^26  —
M G0[f3]    —         0.000667  —         —         0.009595  —         —       5.98 × 10^25  —
3 × G1[f3]  0.001182  0.000177  0.004425  0.007122  0.005736  0.011208  0.120861  0.116520    0.119046

Table 2. Average generalization errors in Example 2

(μ1, σ1)    (0,1)=G0  (0,0.5)   (0,2)     (2,1)     (2,0.5)   (2,2)     (10,1)  (10,0.5)      (10,2)
G1[f4]      0.000688  0.000260  0.002312  0.002977  0.003094  0.003423  0.013640  0.012769    0.012204
G1[f5]      0.000251  0.000103  0.000934  0.001116  0.001291  0.001318  0.004350  0.003729    0.003743
G1[f6]      0.000146  0.000062  0.000626  0.000705  0.000875  0.000918  0.002896  0.002357    0.002489
M G0[f4]    —         0.001356  —         —         0.019521  —         —       1.22 × 10^26  —
M G0[f5]    —         0.000667  —         —         0.009595  —         —       5.98 × 10^25  —
M G0[f6]    —         0.000400  —         —         0.005757  —         —       3.59 × 10^25  —
3 × G1[f5]  0.000753  0.000309  0.002802  0.003348  0.003873  0.003954  0.013050  0.011187    0.011229
5 × G1[f6]  0.000730  0.000310  0.003130  0.003525  0.004375  0.004590  0.014480  0.011785    0.012445

We can see that the theoretical values 'th G1[f1]' are very close to the experimental values G1[f1], in spite of the fact that the theoretical values are established for the asymptotic case. Based on this fact, the accuracy of the experiments can be evaluated by comparing them. As for f2 and f3, they do not have any comparable theoretical value except for the upper bound. We can confirm that every value of G1[f2] and G1[f3] is indeed smaller than the bound.

Example 2 (Simple Neural Networks). Assume that the true function is the zero function and the learning models are three-layer perceptrons:

g(x) = 0,  f4(x, a, b) = a tanh(bx),  f5(x, a, b) = a^3 tanh(bx),  f6(x, a, b) = a^5 tanh(bx).

Table 2 shows the results. In this example, we can again confirm that the bound holds. Combined with the results of the previous example, the bound tends to be tight when μ1 is small. As a matter of fact, the bound holds in small-sample cases, i.e., the number of training data n does not have to be sufficiently large. Though we omit the details for lack of space, the bound was always larger than the experimental results for n = 100, 200, . . . , 400. The properties of the bound are discussed in the next section.

4 Discussions

First, let us confirm that the sampling from the posterior by the MCMC method was done successfully. Based on the algebraic geometrical method, the coefficients of G0(n) are known theoretically (Table 3); f2, f4 and f3, f5 have the same theoretical error.

474 K. Yamazaki and S. Watanabe

Table 3. The coefficients of the generalization error without the covariate shift

     f1    f2, f4   f3, f5   f6
α    1/2   1/2      1/6      1/10
β    1     2        1        1

According to the examples in the previous section, we can compare the theoretical values with the experimental ones:

G0(n)[f1]     = 0.001    vs. 0.001055,
G0(n)[f2, f4] = 0.000678 vs. 0.000874, 0.000688,
G0(n)[f3, f5] = 0.000333 vs. 0.000394, 0.000251,
G0(n)[f6]     = 0.0002   vs. 0.000146.

In the sense of the generalization error, the MCMC method worked well, though there is some fluctuation in the results. Note that it is still an open problem how to evaluate the method; here we measured it by the generalization error, since the theoretical value of G0(n) is known. However, this index is only a necessary condition. Developing an evaluation of the selected samples is left for future study.

Next, we consider the behavior of G1(n). In the examples, the true function was always the zero function g(x) = 0. Learning the zero function is an important case, because in practice we often prepare a sufficiently rich model. The learning function is then set up as

f(x, w) = Σ_{k=1}^{K} t(w_{1k}) h(x, w_{2k}),

where h is the basis function, t is the parameterization of its weight, and w = {w_{11}, w_{21}, w_{12}, w_{22}, . . . , w_{1K}, w_{2K}}. Note that many practical models are included in this expression. Because of the redundancy of the function, some of the h(x, w_{2k}) learn the zero function. Our examples provided the simplest such situations and highlighted the effect of non-regularity in the learning models. The errors G0(n) and G1(n) are generally expressed as

G0(n) = α/n − (β − 1)/(n log n) + o(1/(n log n)),
G1(n) = R1 α/n − R2 (β − 1)/(n log n) + o(1/(n log n)),

where R1 and R2 depend on f, g, q0, and q1. R1 and R2 cause the difference between G0(n) and G1(n) in this expression. In Eq. (12), the coefficient of 1/n is given by b0 + (d1 − d0), so

R1 = (b0 + (d1 − d0))/b0 = 1 + (d1 − d0)/α,

since b0 = α. Let us denote A ⇒ B to mean "A is the only factor that determines the value of B". As mentioned above, f, g, q0, q1 ⇒ R1, R2. Though f, g ⇒ α, β (cf. around Eqs. (5)–(6)), we should emphasize that α, β, q0, q1 ⇏ R1, R2.
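The expansions for G0(n) and G1(n) are easy to evaluate numerically. The sketch below codes their leading terms with the coefficients of Table 3; taking n = 500 (an assumed sample size, not stated in this excerpt but consistent with the theoretical values quoted in the discussion) reproduces 0.001, 0.000678, 0.000333 and 0.0002.

```python
import math

def G0(n, alpha, beta):
    """Leading terms of the generalization error without covariate shift:
    G0(n) = alpha/n - (beta - 1)/(n log n) + o(1/(n log n))."""
    return alpha / n - (beta - 1) / (n * math.log(n))

def G1(n, alpha, beta, R1, R2):
    """Leading terms under covariate shift; R1, R2 depend on f, g, q0, q1."""
    return R1 * alpha / n - R2 * (beta - 1) / (n * math.log(n))

# Coefficients from Table 3; n = 500 is an assumption of this sketch.
n = 500
table3 = {"f1": (0.5, 1), "f2,f4": (0.5, 2), "f3,f5": (1 / 6, 1), "f6": (0.1, 1)}
for name, (alpha, beta) in table3.items():
    print(name, round(G0(n, alpha, beta), 6))
```

Note that when R1 = R2 = 1 (no covariate shift), G1(n) reduces to G0(n) term by term.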
This fact is easily confirmed by comparing f2 with f4 (and also f3 with f5). It holds that G1(n)[f2] ≠ G1(n)[f4] for every q1, although the two models have the same α and β (so that G0(n)[f2] = G0(n)[f4]). Thus α and β are not informative enough to describe R1 and R2. Comparing the values of G1[f2] with those of G1[f4], the basis function (x in f2 and tanh(bx) in f4) seems to play an important role.

To clarify the effect of the basis function, let us fix the function class: Examples 1 and 2 correspond to h(x, w2) = x and h(x, w2) = tanh(bx), respectively. The values of G1[f1] and 3 × G1[f3] (and likewise 3 × G1[f5] and 5 × G1[f6]) can be regarded as the same under any covariate shift. This implies h, g, q0, q1 ⇒ R1, i.e., the parameterization t(w1) does not affect R1. Instead, it affects the non-regularity, or the multiplicity, and determines α and β. Though it remains an unclear factor, the influence of R2 does not seem to be as large as that of R1.

Last, let us analyze the properties of the upper bound M G0(n). According to the above discussion, it holds that R = G1/G0 ≤ M. The ratio G1/G0 basically depends on g, h, q0 and q1. However, M is determined only by the training and test input distributions: q0, q1 ⇒ M. Therefore the bound gives a worst-case evaluation over all g and h. Considering the tightness of the bound, it may still be improved by using the relation between the true and learning functions.

5 Conclusions

In our former study, we obtained the theoretical generalization error and its upper bound under covariate shift. This paper showed that the theoretical value is supported by experiments, in spite of the fact that it was established for an asymptotic case. We also observed the tightness of the bound and discussed the effect of basis functions in the learning models. In this paper, the non-regular models were simple lines and neural networks; investigating more general models is an interesting issue. Though we mainly considered the value of G1(n), the computational cost of the MCMC method is strongly connected to the form of the learning function f. Taking this cost into account in the evaluation is left for future study.

Acknowledgements. The authors would like to thank Masashi Sugiyama, Motoaki Kawanabe, and Klaus-Robert Müller for fruitful discussions. The software for the MCMC method and technical comments were provided by Kenji Nagata. This research was partly supported by the Alexander von Humboldt Foundation and MEXT 18079007.
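For concreteness, the experimental pipeline discussed above (sampling the posterior by MCMC, averaging the predictive function, and evaluating it under the shifted test input distribution) can be sketched for the model f4(x, a, b) = a tanh(bx) with true function g(x) = 0. The prior, noise level, proposal width, and chain length below are all assumptions of this sketch, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(0.0, 1.0, n)            # training inputs from q0 = N(0, 1)
y = rng.normal(0.0, 1.0, n)            # g(x) = 0 plus unit Gaussian noise

def log_post(a, b):
    # Unnormalized log posterior: Gaussian likelihood for a*tanh(b*x),
    # standard normal prior on (a, b); both are assumptions of this sketch.
    r = y - a * np.tanh(b * x)
    return -0.5 * (r @ r) - 0.5 * (a * a + b * b)

# Random-walk Metropolis over (a, b)
samples, (a, b), lp = [], (0.0, 0.0), log_post(0.0, 0.0)
for t in range(20000):
    a2, b2 = a + 0.2 * rng.normal(), b + 0.2 * rng.normal()
    lp2 = log_post(a2, b2)
    if np.log(rng.random()) < lp2 - lp:
        (a, b), lp = (a2, b2), lp2
    if t >= 10000 and t % 10 == 0:     # discard burn-in, then thin
        samples.append((a, b))

# Posterior-mean prediction evaluated under a shifted test distribution
xt = rng.normal(2.0, 1.0, 2000)        # q1 = N(2, 1)
pred = np.mean([a * np.tanh(b * xt) for a, b in samples], axis=0)
shifted_mse = float(np.mean(pred ** 2))  # squared error against g = 0
```

In the paper, G1(n) is a Kullback–Leibler quantity averaged over training sets; the squared-error summary above is only a stand-in to show the mechanics of the posterior sampling and the shifted evaluation.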
Using Image Stimuli to Drive fMRI Analysis

David R. Hardoon 1, Janaina Mourão-Miranda 2, Michael Brammer 2, and John Shawe-Taylor 1

1 The Centre for Computational Statistics and Machine Learning, Department of Computer Science, University College London, Gower St., London WC1E 6BT {D.Hardoon,jst}@cs.ucl.ac.uk
2 Brain Image Analysis Unit, Centre for Neuroimaging Sciences (PO 89), Institute of Psychiatry, De Crespigny Park, London SE5 8AF {Janaina.Mourao-Miranda,Michael.Brammer}@iop.kcl.ac.uk

Abstract. We introduce a new unsupervised fMRI analysis method based on Kernel Canonical Correlation Analysis (KCCA), which differs from the class of supervised learning methods that are increasingly being employed in fMRI data analysis. Whereas SVM associates properties of the imaging data with simple specific categorical labels, KCCA replaces these simple labels with a label vector for each stimulus containing details of the features of that stimulus. We have compared KCCA and SVM analyses of an fMRI data set involving responses to emotionally salient stimuli. This involved first training each algorithm (SVM, KCCA) on a subset of fMRI data and the corresponding labels/label vectors, then testing the algorithms on data withheld from the original training phase. The classification accuracies of SVM and KCCA proved to be very similar. However, the most important result arising from this study is that KCCA is able, in part, to extract many of the brain regions that SVM identifies as the most important in task discrimination, blind to the categorical task labels.

Keywords: Machine learning methods, Kernel canonical correlation analysis, Support vector machines, Classifiers, Functional magnetic resonance imaging data analysis.

1 Introduction

Recently, machine learning methodologies have been increasingly used to analyse the relationship between stimulus categories and fMRI responses [1,2,3,4,5,6,7,8,9,10].
In this paper, we introduce a new unsupervised machine learning approach to fMRI analysis, in which the simple categorical description of stimulus type (e.g. type of task) is replaced by a more informative vector of stimulus features. We compare this new approach with a standard Support Vector Machine (SVM) analysis of fMRI data using a categorical description of stimulus type.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 477–486, 2008. © Springer-Verlag Berlin Heidelberg 2008
The technology of the present study originates from earlier research carried out in the domain of image annotation [11], where an image annotation methodology learns a direct mapping from image descriptors to keywords. Previous attempts at unsupervised fMRI analysis have been based on Kohonen self-organising maps, fuzzy clustering [12] and nonparametric estimation methods of the hemodynamic response function, such as the general method described in [13]. [14] have reported an interesting study which showed that the discriminability of PCA basis representations of images of multiple object categories is significantly correlated with the discriminability of PCA basis representations of the fMRI volumes based on category labels. The current study differs from conventional unsupervised approaches in that it makes use of the stimulus characteristics as an implicit representation of a complex state label. We use kernel Canonical Correlation Analysis (KCCA) to learn the correlation between an fMRI volume and its corresponding stimulus. Canonical correlation analysis can be seen as the problem of finding basis vectors for two sets of variables such that the correlations of the projections of the variables onto the corresponding basis vectors are maximised. KCCA first projects the data into a higher-dimensional feature space before performing CCA in the new feature space. CCA [15,16] and KCCA [17] have been used in previous fMRI analyses, but only with conventional categorical stimulus descriptions, without exploring the possibility of using complex characteristics of the stimuli as the source for feature selection from the fMRI data. The fMRI data used in the following study originated from an experiment in which the stimuli were designed to evoke different types of emotional responses, pleasant or unpleasant. The pleasant images consisted of women in swimsuits, while the unpleasant images were a collection of images of skin diseases.
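As a concrete illustration of the KCCA construction just described (not the authors' implementation), the sketch below takes kernel matrices for two views, here toy stand-ins for fMRI volumes and stimulus feature vectors with linear kernels, and solves a regularized dual eigenproblem for the top canonical correlation. The regularization constant and the synthetic data are assumptions of this sketch.

```python
import numpy as np

def center_kernel(K):
    """Double-center a kernel matrix (zero mean in feature space)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca_top(Kx, Ky, kappa=1.0):
    """Top canonical correlation and dual weights for regularized KCCA.
    Solves (Kx + kI)^-1 Ky (Ky + kI)^-1 Kx a = rho^2 a, a common
    regularized formulation; kappa is a free parameter of this sketch."""
    n = Kx.shape[0]
    Kx, Ky = center_kernel(Kx), center_kernel(Ky)
    Rx, Ry = Kx + kappa * np.eye(n), Ky + kappa * np.eye(n)
    M = np.linalg.solve(Rx, Ky) @ np.linalg.solve(Ry, Kx)
    evals, evecs = np.linalg.eig(M)
    i = int(np.argmax(evals.real))
    alpha = evecs[:, i].real
    beta = np.linalg.solve(Ry, Kx @ alpha)   # recover the second view's weights
    return float(np.sqrt(max(evals[i].real, 0.0))), alpha, beta

# Toy two-view data sharing a 2-d latent signal (stand-ins for fMRI
# volumes and stimulus feature vectors; linear kernels for simplicity).
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 2))
X = z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))
Y = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(100, 4))
rho, alpha, beta = kcca_top(X @ X.T, Y @ Y.T)
```

In the paper's setting, one view would hold the masked, vectorized fMRI volumes and the other the stimulus feature vectors, and a nonlinear kernel could replace the linear one.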
Each stimulus image was represented using Scale Invariant Feature Transformation (SIFT) [18] features. Interestingly, some of the properties of the SIFT representation have been modeled on the properties of complex neurons in the visual cortex. Although not specifically exploited in the current paper, future studies may be able to utilize this property to probe aspects of brain function such as modularity. In the current study, we present a feasibility study of the possibility of generating new activity maps by using the actual stimuli that had generated the fMRI volume. We show that KCCA is able to extract brain regions identified by supervised methods such as SVM in task discrimination and to achieve similar levels of accuracy, and we discuss some of the challenges in interpreting the results given the complex input feature vectors used by KCCA in place of categorical labels. This work is an extension of the work presented in [19]. The paper is structured as follows. Section 2 gives a review of the fMRI data acquisition as well as the experimental design and the pre-processing, followed by a brief description of the scale invariant feature transformation in Section 2.1. The SVM is briefly described in Section 2.2, while Section 2.3 elaborates on the KCCA methodology. Our results are presented in Section 3. We conclude with a discussion in Section 4.
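The supervised baseline described above can be emulated in a few lines. Below is a toy stand-in for the SVM analysis (not the authors' pipeline): a linear SVM trained with the Pegasos subgradient method on synthetic "voxel" data, with the magnitude of the learned weights playing the role of a discrimination map. The data dimensions and the injected class signal are invented for this sketch.

```python
import numpy as np

def pegasos_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Linear SVM weights via the Pegasos subgradient method (y in {-1, +1})."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, t = np.zeros(d), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * (X[i] @ w)   # evaluate before the shrink step
            w *= (1 - eta * lam)
            if margin < 1:
                w += eta * y[i] * X[i]
    return w

rng = np.random.default_rng(1)
n_scans, n_voxels = 60, 500
X = rng.normal(size=(n_scans, n_voxels))   # stand-in for masked fMRI scans
y = np.repeat([1, -1], n_scans // 2)       # pleasant vs. unpleasant labels
X[y == 1, :30] += 1.0                      # weak signal in the first 30 "voxels"

tr, te = slice(0, n_scans, 2), slice(1, n_scans, 2)   # even/odd split
w = pegasos_svm(X[tr], y[tr])
acc = float(np.mean(np.sign(X[te] @ w) == y[te]))
discrim_map = np.abs(w)                    # voxel-wise importance, cf. SVM maps
```

The real analysis would, of course, train on preprocessed fMRI volumes and map the weight vector back into brain space for visualization.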
2 Materials and Methods

Due to the lack of space we refer the reader to [10] for a detailed account of the subjects, data acquisition and pre-processing applied to the data, as well as the experimental design.

2.1 Scale Invariant Feature Transformation

Scale Invariant Feature Transformation (SIFT) was introduced by [18] and shown to be superior to other descriptors [20]. This is due to the SIFT descriptors being designed to be invariant to small shifts in the position of salient (i.e. prominent) regions. Calculation of the SIFT vector begins with a scale-space search in which local minima and maxima are identified in each image (so-called key locations). The properties of the image at each key location are then expressed in terms of gradient magnitude and orientation. A canonical orientation is then assigned to each key location to maximize rotation invariance. Robustness to reorientation is introduced by representing local image regions around key voxels in a number of orientations. A reference key vector is then computed over all images and the data for each image are represented in terms of distance from this reference.

Image Processing. Let f_i^l
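A rough numpy sketch of the final representation step described in Section 2.1: pool each image's SIFT descriptors, compute a reference key vector over all images, and represent each image by its distance from that reference. The mean-pooling step and the dimensions are assumptions of this sketch, not details from the paper; in practice the per-image descriptors would come from a SIFT implementation such as OpenCV's `cv2.SIFT_create()`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for per-image SIFT output: each image yields a variable number
# of 128-d keypoint descriptors (random here, purely for illustration).
descriptor_sets = [rng.random((int(k), 128)) for k in rng.integers(5, 20, size=6)]

# Pool each image's descriptors into one vector (mean pooling is an assumption).
pooled = np.stack([d.mean(axis=0) for d in descriptor_sets])   # shape (6, 128)

# Reference key vector computed over all images, as described above.
reference = pooled.mean(axis=0)

# Each image represented by its distance from the reference.
features = np.linalg.norm(pooled - reference, axis=1)          # shape (6,)
```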