paper. In this method, 3D volumes are treated directly as third-order tensors, without unfolding them into 1D vectors. Additionally, the output of G3D-PCA is three matrices whose columns are the orthogonal bases in the three mode subspaces, respectively. Bases in the third-order tensor space can be constructed from these three matrices, and the maximal available number of tensor bases is not limited by the number of training samples. Owing to these properties, our method does not suffer from a huge computing cost; moreover, it generalizes well even when few samples are used in training.

Head Pose Estimation Based on Tensor Factorization

Wenlu Yang 1,2, Liqing Zhang 1,⋆, and Wenjun Zhu 1

1 Department of Computer Science and Engineering, 800 Dongchuan Rd., Shanghai Jiaotong University, Shanghai 200240, China
2 Department of Electronic Engineering, 1550 Pudong Rd., Shanghai Maritime University, Shanghai 200135, China
wenluyang@online.sh.cn, lqzhang@sjtu.edu.cn, wilsonzhu@sjtu.edu.cn

Abstract. This paper investigates the head pose estimation problem, considered as a front-end preprocessing step for improving multi-view human face recognition. We propose a computational model for perceiving head poses based on Non-negative Multi-way Factorization (NMWF). The model consists of three components: the tensor representation for multi-view faces, feature selection, and head pose perception. To find the facial representation basis, the NMWF algorithm is applied to the training data of facial images. The face tensor comprises three factors: facial images, poses, and people. The former two factors are used to construct the computational model for pose estimation. The discriminative measure for perceiving the head pose is defined by the similarity between the tensor basis and the representation of the test facial image, which is the projection of the face onto the subspace spanned by the basis "TensorFaces". Computer simulation results show that the proposed model achieves satisfactory accuracy in estimating head poses of facial images in the CAS-PEAL face database.

Keywords: Head pose estimation, Face recognition, Tensor decomposition, Non-negative multi-way factorization, TensorFaces.

1 Introduction

Humans possess a remarkable ability to recognize faces regardless of facial geometries, expressions, head poses, lighting conditions, distances, and ages. Modelling the functions of face recognition and identification remains a difficult open problem in the fields of computer vision and neuroscience. In particular, face recognition robust to head pose variation is still difficult in complex natural environments.
Therefore, head pose estimation is a very useful front-end processing step for multi-view human face recognition. Head pose variation covers three free rotation parameters: yaw (rotation around the neck), tilt (rotation up and down), and roll (rotation from left profile to right profile). In this paper, we focus on yaw rotation, which has many important applications.

⋆ Corresponding author.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 831–840, 2008. © Springer-Verlag Berlin Heidelberg 2008

Previous methods for head pose estimation from 2D images can be roughly divided into two categories: template-based approaches and appearance-based approaches. Template-based approaches rely on nearest-neighbor classification with texture templates [1][2], or are deduced from geometric information such as configurations of facial landmarks [3][5][6]. Appearance-based approaches generally regard head pose estimation as a parameter estimation problem, which can be handled using regression, multi-class classification, or nonlinear dimension reduction. Typical appearance-based approaches include Support Vector Machines (SVM) [7][8], a tree-structured AdaBoost pose classifier [9], soft margin AdaBoost [10], Neural Networks (NN) [11][12], Gabor Wavelet Networks [13], MDS [14], Principal Component Analysis (PCA) [15][16], Kernel PCA [17], Independent Component Analysis (ICA) [18][24], and linear or non-linear embedding and mapping [19][20] (ISOMAP [21], Local Linear Embedding (LLE) [22][23]).

In this paper, we propose a computational model for perceiving head poses from facial images using tensor decomposition. The rest of this paper is organized as follows. In Section 2, we propose the computational model for pose estimation and introduce the tensor decomposition algorithm, the Non-negative Multi-Way Factorization (NMWF). In Section 3, we describe the face tensor representation and the pose estimation method. In Section 4, the tensor bases for different head poses are obtained from training data, and computer simulations are provided to show the performance of the proposed method. Finally, discussions and conclusions are given in Section 5.

2 Tensor Representation and Learning Algorithm

In this section, we first review related linear models for tensor decomposition. We then propose a computational model for perceiving multiple views, or head poses, of facial images. Next, we introduce the Non-negative Multi-way Factorization (NMWF) algorithm for learning the representation basis functions of facial images, and we further define representation bases for different views of facial images.
Furthermore, we propose a method for estimating head poses from facial images.

2.1 Related Models for Tensor Decomposition

PCA has been a popular technique in facial image recognition, and its generalization, ICA, has also been employed in face recognition [24]. The PCA-based face recognition method Eigenfaces works well when the face images are well aligned; if facial images contain other factors, such as lighting, pose, and expression, the recognition performance of Eigenfaces degenerates dramatically. Tensor representation is able to formulate more than two factors in one framework; therefore, tensor models have richer representational capability than the commonly used matrix. Grimes and Rao [25] proposed a bilinear sparse model to learn styles and contents from natural images. When facial images contain more than two factors, such linear and bilinear algebraic models do not work well, and some researchers have proposed multilinear algebra to represent multiple factors of image ensembles. Vasilescu and Terzopoulos [26] applied the higher-order tensor approach to factorize facial images, resulting in the "TensorFaces" representation of facial images. Later, they proposed a multilinear
ICA model to learn the statistically independent components of multiple factors. Mørup et al. proposed the Non-negative Multi-way Factorization (NMWF) algorithm, a generalization of Lee and Seung's non-negative matrix factorization (NMF) algorithm [28], to decompose the time-frequency representation of EEG [27].

Exploiting the multiple-factor representation of the tensor model, we use the tensor factorization approach to find the tensor basis for facial images and further propose a computational model for perceiving head poses from facial images. The model consists of two parts, the training module and the perception module, as shown in Fig. 1. The training module finds the tensor representation using the non-negative multi-way algorithm; using the coefficients of the pose factor, we define the representation bases for different head poses.

Fig. 1. Model for learning the "TensorFaces" and estimating poses

The second part, the perception module, estimates the head pose of an input facial image. It is composed of the testing set, the basis functions, the pose coefficients, similarity analysis, and the output of the resulting pose. Given a facial image randomly selected from the testing set, the corresponding pose coefficients are obtained by projecting the image onto the subspace spanned by the basis functions. The similarity between these coefficients and the pose factors is then analyzed, and the maximal correlation coefficient yields the corresponding pose. The next section gives a brief introduction to the NMWF and the algorithm for pose estimation.

2.2 The NMWF Algorithm

In this section, we briefly introduce the tensor representation model and the Non-negative Multi-way Factorization (NMWF) algorithm. Generally speaking, a tensor is a generalized data representation of a 2-dimensional matrix. Given an n-way array $\mathcal{T}$, the objective of tensor factorization is to decompose $\mathcal{T}$ into the following N components
$$\mathcal{T} \approx \mathcal{T}^h = \sum_{j=1}^{N} U^{(1)}_j \otimes U^{(2)}_j \otimes \cdots \otimes U^{(n)}_j \qquad (1)$$

This factorization is important for many applications, since each rank-1 component represents a certain feature of the tensor data. Several questions need to be discussed for this factorization. The first is how many components are suitable for the decomposition; this is related to matrix rank estimation in the singular value decomposition of matrices. It is usually difficult to estimate the rank of a tensor, because tensor rank estimation is still an open NP-hard problem; refer to [4] for a tutorial on tensor rank estimation. The second issue is which criterion should be used for finding the tensor components. Commonly used criteria include the least-squares error, the relative entropy, and the more general alpha-divergence. In this paper, we use the relative entropy as the criterion. Given a nonnegative tensor $\mathcal{T}$, the relative entropy between $\mathcal{T}$ and its approximation $\mathcal{T}^h$ is defined as

$$\mathrm{Div}(\mathcal{T} \,\|\, \mathcal{T}^h) = \sum_{j_1, j_2, \cdots, j_N} \left( \mathcal{T}_{j_1, j_2, \cdots, j_N} \log \frac{\mathcal{T}_{j_1, j_2, \cdots, j_N}}{\mathcal{T}^h_{j_1, j_2, \cdots, j_N}} - \mathcal{T}_{j_1, j_2, \cdots, j_N} + \mathcal{T}^h_{j_1, j_2, \cdots, j_N} \right) \qquad (2)$$

Denote the inner product of two tensors as $\langle \mathcal{T}, \mathcal{R} \rangle = \sum_{j_1, j_2, \cdots, j_N} \mathcal{T}_{j_1, j_2, \cdots, j_N} \mathcal{R}_{j_1, j_2, \cdots, j_N}$. Then the relative entropy can be rewritten as

$$\mathrm{Div}(\mathcal{T} \,\|\, \mathcal{T}^h) = \langle \mathcal{T}, \log \mathcal{T} \rangle - \langle \mathcal{T}, \log \mathcal{T}^h \rangle - \langle \mathbf{1}, \mathcal{T} \rangle + \langle \mathbf{1}, \mathcal{T}^h \rangle. \qquad (3)$$

To exploit specific properties of the data, such as sparseness and smoothness, some constraints may be imposed as additional conditions of the optimization problem. Using gradient descent, we derive the multiplicative learning rule

$$U^{(p)}_{k_p, j} \Leftarrow U^{(p)}_{k_p, j} \, \frac{\sum_{k_i,\, i \neq p} \left( \mathcal{T}_{k_1, \cdots, k_N} / \mathcal{T}^h_{k_1, \cdots, k_N} \right) \prod_{i \neq p} U^{(i)}_{k_i, j}}{\sum_{k_i,\, i \neq p} \prod_{i \neq p} U^{(i)}_{k_i, j}} \qquad (4)$$

The algorithm is implemented in an alternating fashion, updating one set of parameters while keeping the others unchanged.

3 Face Tensor Representation and Head Pose Estimation

Facial images can be formulated in the framework of tensor representation, with each mode representing a different factor. In this paper, we take
three factors into consideration: facial image, head pose, and person. Thus the facial tensor discussed in this paper consists of three modes. For a data set of facial images with $I_3$ people, $I_2$ poses per person, and $I_1$-dimensional facial images, the tensor $\mathcal{T}_{I_1 I_2 I_3}$ denotes all facial images. In the decomposition of $\mathcal{T}_{I_1 I_2 I_3}$, the $U^{(1)}_j$, $j = 1, 2, \cdots, N$, form the set of basis functions for facial images, while $U^{(2)}_j$ and $U^{(3)}_j$ hold the pose and person coefficients, respectively, with respect to the basis function $U^{(1)}_j$. Using all basis functions $U^{(1)}_j$, $j = 1, 2, \cdots, N$, together with the $k$-th pose coefficients $U^{(2)}_j(k)$ ($k = 1, \cdots, I_2$) and the $l$-th person coefficients $U^{(3)}_j(l)$, the facial image $\mathcal{T}_{k,l}$ is reconstructed as

$$\mathcal{T}^h_{k,l} = \sum_{j=1}^{N} U^{(1)}_j \, U^{(2)}_j(k) \, U^{(3)}_j(l). \qquad (5)$$

Therefore, the parameters $U^{(2)}_j(k)$ and $U^{(3)}_j(l)$, $j = 1, 2, \cdots, N$, are regarded as the feature representation on the tensor basis of $\mathcal{T}_{I_1 I_2 I_3}$, namely $U^{(1)}_j$, $j = 1, 2, \cdots, N$.

To find the tensor representation of an input image, we use the least-squares error method. Given a new facial image $X$, we attempt to find the coefficients $S = (s_1, s_2, \cdots, s_N)^T$ of the tensor representation

$$X = \sum_{j=1}^{N} s_j \, U^{(1)}_j. \qquad (6)$$

The least-squares solution of the above equation is $S = A^{-1} B$, where $A = (A_{i,j}) = \langle U^{(1)}_i, U^{(1)}_j \rangle$ and $B = (B_j) = \langle U^{(1)}_j, X \rangle$.
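As a concrete illustration, the least-squares solution $S = A^{-1}B$ amounts to solving the normal equations. Below is a minimal NumPy sketch, where a random matrix merely stands in for the learned basis $U^{(1)}$; the sizes and names are illustrative, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 1080, 20                 # pixel dimension and number of basis images (toy sizes)
U1 = rng.random((D, N))         # columns play the role of the basis functions U^(1)_j
x = rng.random(D)               # a new facial image, flattened to a vector

# A_{i,j} = <U^(1)_i, U^(1)_j> and B_j = <U^(1)_j, x>
A = U1.T @ U1
B = U1.T @ x
S = np.linalg.solve(A, B)       # coefficients S = A^{-1} B

# S minimizes ||x - U1 S||; compare with NumPy's own least-squares solver
S_ref, *_ = np.linalg.lstsq(U1, x, rcond=None)
```

Solving the linear system is numerically preferable to forming $A^{-1}$ explicitly.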
3.1 Head Pose Estimation

After obtaining the decomposition of the tensor $\mathcal{D}_{I_1 I_2 I_3}$, we can use the second mode of the model to estimate the poses of faces randomly selected from the testing set.

Given a facial image $I = U_1 S$, where $U_1$ denotes the basis functions and $S$ the representation of the image on the subspace spanned by the bases, $S$ is obtained by computing $S = (U_1^T U_1)^{-1} U_1^T I$. The similarity between $S$ and $U_{2,k}$ ($k = 1, 2, \cdots, I_2$) is then calculated by the correlation measure

$$\mathrm{Corr}(S, U_{2,k}) = \frac{\langle S, U_{2,k} \rangle}{\sqrt{\langle S, S \rangle \, \langle U_{2,k}, U_{2,k} \rangle}}. \qquad (7)$$

For facial images with different head poses, the feature representations of $U_{2,k}$ in $S$ differ. Therefore, head poses can be easily estimated by maximizing the similarity between $S$ and $U_{2,k}$.
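This decision rule is easy to sketch. In the toy NumPy fragment below, a random matrix stands in for the learned pose coefficients $U_{2,k}$, and the representation $S$ is constructed near one pose column; the helper name is ours, not the paper's:

```python
import numpy as np

def estimate_pose(S, U2):
    """Pick the pose k that maximizes the correlation Corr(S, U_{2,k}) of Eq. (7)."""
    corr = [S @ U2[:, k] / np.sqrt((S @ S) * (U2[:, k] @ U2[:, k]))
            for k in range(U2.shape[1])]
    return int(np.argmax(corr))

rng = np.random.default_rng(1)
U2 = rng.random((20, 7))                 # hypothetical pose coefficients, 7 poses
S = U2[:, 3] + 0.01 * rng.random(20)     # a representation close to pose index 3
pose = estimate_pose(S, U2)              # should recover pose index 3
```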
.

4 Simulations and Results

In this section, we present simulation results to verify the performance of the computational model.

4.1 Face Database

The facial images are selected from the pose subset of the public CAS-PEAL-R1 face database [29], which contains images of 1040 subjects across 7 poses {±45°, ±30°, ±15°, 0°}, as shown in Fig. 2. In our experiments, we first crop the original facial images, guided by the eye positions, to patches of 300 × 250 pixels that include the whole face, and then resize them to 36 × 30 because of limited computational memory. The tensor $\mathcal{D}$ in Equation (1) is a 3-mode tensor over pixels, poses, and people. Because of limited computing resources, images of one hundred subjects across the seven poses are randomly selected, and the facial tensor $\mathcal{D}_{1080 \times 7 \times 100}$ is generated for finding the tensor basis functions. All other images in the pose subset are used as the testing set.

Fig. 2. Examples of facial images in the CAS-PEAL-R1 face database [29]

4.2 TensorFaces Representation

The tensor $\mathcal{D}$ is decomposed by the NMWF, yielding the basis functions, or "TensorFaces", shown in Fig. 3. Careful examination shows that the "TensorFaces" have favorable characteristics for representing the seven poses: when different facial images are projected onto the tensor basis, the resulting coefficients $S$ of the seven poses are clearly discriminative. Subsets of "TensorFaces" projected onto subspaces at different yaw angles are shown in Fig. 4.

Fig. 3. Basis functions decomposed by the NMWF from facial images across seven poses

Fig. 4. Subsets of "TensorFaces" in different poses. (Left) Yaw angle: -30°. (Right) Yaw angle: 0°.

4.3 Head Pose Estimation

Following the algorithm described in Section 3.1, we estimate the head poses of facial images randomly selected from the testing set; 500 facial images were used in the test stage. Sample pose features are shown in Fig. 5, and the accuracy of pose estimation is shown in Fig. 6.

Fig. 5. The left figure shows sample features of four poses (-45°, -15°, +15°, +45°) projected into a 2-dimensional space. The right figure shows sample features of seven poses (-45°, -30°, -15°, +00°, +15°, +30°, +45°) projected into a 3-dimensional space.

Fig. 6. Pose estimation results. (Left) The X-axis denotes test indexes, and the Y-axis pose estimation accuracy. (Right) The same results rearranged by pose; the X-axis denotes poses in the range {-45°, -30°, -15°, 0°, +15°, +30°, +45°}, and the Y-axis pose estimation accuracy per pose.

Compared with other methods, our method offers simplicity, effectiveness, and high accuracy. First, testing on the same CAS-PEAL face database, Ma [30] used a combination of an SVM classifier and LGBPH features for pose estimation and reported a best accuracy of 97.14%; using the same data, our worst accuracy is 98.6% and the average is 99.2%. Second, in preprocessing, Ma et al. applied geometric normalization, histogram equalization, and a facial mask, cropping the images to 64 × 64 pixels, whereas we simply cropped each image to include the head and resized it to 36 × 30 because of limited computing resources. Our preprocessing is therefore simpler and our results are better. Incidentally, if the facial images are not cropped, the resulting pose estimation accuracy is worse, about 90.4% on average. The training set, randomly selected from the face database, consists of 100 subjects with seven poses each; our supplementary simulations show that as the number of subjects in the training set increases, the accuracy of pose estimation improves further, approaching 100%.
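The training stage above (NMWF factorization of the nonnegative face tensor, Sect. 2.2) can be illustrated on synthetic data. The sketch below implements the alternating multiplicative KL-divergence updates of Eqs. (2)-(4) for a 3-way tensor with einsum; the toy sizes and implementation details are ours, not the authors':

```python
import numpy as np

rng = np.random.default_rng(2)
I1, I2, I3, R = 30, 7, 10, 5     # toy stand-ins for pixels x poses x people, rank R
T = np.einsum('ir,jr,kr->ijk', rng.random((I1, R)),
              rng.random((I2, R)), rng.random((I3, R)))   # synthetic nonnegative tensor

U = [rng.random((n, R)) + 0.1 for n in (I1, I2, I3)]      # factor matrices U^(1..3)
eps = 1e-12

def reconstruct(U):
    return np.einsum('ir,jr,kr->ijk', *U)

def kl_div(T, Th):               # relative entropy of Eq. (2)
    return np.sum(T * np.log((T + eps) / (Th + eps)) - T + Th)

before = kl_div(T, reconstruct(U))
for _ in range(50):
    for p in range(3):           # update one factor, keep the others fixed
        Th = reconstruct(U)
        others = [U[q] for q in range(3) if q != p]
        rest = 'ijk'.replace('ijk'[p], '')
        num = np.einsum(f'ijk,{rest[0]}r,{rest[1]}r->{"ijk"[p]}r',
                        T / (Th + eps), *others)           # numerator of Eq. (4)
        den = others[0].sum(0) * others[1].sum(0)          # denominator of Eq. (4)
        U[p] *= num / (den + eps)
after = kl_div(T, reconstruct(U))  # the divergence decreases over the iterations
```

The multiplicative form keeps all factors nonnegative, since each update multiplies the current entries by a nonnegative ratio.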
5 Discussions and Conclusions

In this paper, we proposed a computational model for head pose estimation based on tensor decomposition. Using the NMWF algorithm, facial images are decomposed into three factors: the "TensorFaces", poses, and people. These facial images are then projected onto the subspace spanned by the "TensorFaces", and the resulting representations can be used for pose estimation and, further, for face recognition. Taking the correlation between the representation of a facial image and the pose components as the similarity measure, the head poses of facial images in the public CAS-PEAL database are estimated. The simulation results show that the model is very efficient.

Acknowledgments. The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301) and the National High-Tech Research Program of China (Grant No. 2006AA01Z125). Portions of the research in this paper use the CAS-PEAL face database collected under the sponsorship of the Chinese National Hi-Tech Program and ISVISION Tech. Co. Ltd.

References

1. Bichsel, M., Pentland, A.: Automatic interpretation of human head movements. In: 13th International Joint Conference on Artificial Intelligence (IJCAI), Workshop on Looking At People, Chambery, France (1993)
2. McKenna, S., Gong, S.: Real time face pose estimation. Real-Time Imaging 4(5), 333–347 (1998)
3. Heinzmann, J., Zelinsky, A.: 3D facial pose and gaze point estimation using a robust real-time tracking paradigm. In: Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 142–147 (1998)
4. Bro, R.: PARAFAC. Tutorial and applications. Chemom. Intell. Lab. Syst. 38, 149–171 (1997)
5. Xu, M., Akatsuka, T.: Detecting head pose from stereo image sequence for active face recognition. In: Proceedings of the 3rd International Conference on Face & Gesture Recognition, pp. 82–87 (1998)
6. Hu, Y.X., Chen, L.B., Zhou, Y., Zhang, H.J.: Estimating face pose by facial asymmetry and geometry. In: Proceedings of Automatic Face and Gesture Recognition, pp. 651–656 (2004)
7. Huang, J., Shao, X., Wechsler, H.: Face pose discrimination using support vector machines (SVM). In: Proceedings of the Fourteenth International Conference on Pattern Recognition, vol. 1, pp. 154–156 (1998)
8. Moon, H., Miller, M.: Estimating facial pose from a sparse representation. In: International Conference on Image Processing, vol. 1, pp. 75–78 (2004)
9. Yang, Z., Ai, H., Okamoto, T.: Multi-view face pose classification by tree-structured classifier. In: International Conference on Image Processing, vol. 2, pp. 358–361 (2005)
10. Guo, Y., Poulton, G., Li, J., Hedley, M.: Soft margin AdaBoost for face pose classification. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 221–224 (2003)
11. Baluja, S., Sahami, M., Rowley, H.A.: Efficient face orientation discrimination. In: International Conference on Image Processing, vol. 1, pp. 589–592 (2004)
12. Voit, M., Nickel, K., Stiefelhagen, R.: Multi-view head pose estimation using neural networks. In: The 2nd Canadian Conference on Computer and Robot Vision, pp. 347–352 (2005)
13. Kruger, V., Sommer, G.: Gabor wavelet networks for object representation. Journal of the Optical Society of America 19(6), 1112–1119 (2002)
14. Chen, L.B., Zhang, L., Hu, Y.X., Li, M.J., Zhang, H.J.: Head pose estimation using Fisher manifold learning. In: Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pp. 203–207 (2003)
15. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 586–591 (1991)
16. Martinez, A.M., Kak, A.C.: PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001)
17. Li, S.Z., Fu, Q.D., Gu, L., Scholkopf, B., Cheng, Y.M., Zhang, H.J.: Kernel machine based learning for multi-view face detection and pose estimation. International Conference on Computer Vision 2, 674–679 (2001)
18. Li, S.Z., Lv, X.G., Hou, X.W., Peng, X.H., Cheng, Q.S.: Learning multi-view face subspaces and facial pose estimation using independent component analysis. IEEE Transactions on Image Processing 14(6), 705–712 (2005)
19. Hu, N., Huang, W.M., Ranganath, S.: Head pose estimation by non-linear embedding and mapping. In: International Conference on Image Processing, vol. 2, pp. 342–345 (2005)
20. Raytchev, B., Yoda, I., Sakaue, K.: Head pose estimation by nonlinear manifold learning. In: International Conference on Pattern Recognition, vol. 4, pp. 462–466 (2004)
21. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000)
22. Saul, L.K., Roweis, S.T.: Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4, 119–155 (2003)
23. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000)
24. Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face recognition by independent component analysis. IEEE Transactions on Neural Networks 13(6), 1450–1464 (2002)
25. Grimes, D.B., Rao, R.P.N.: Bilinear sparse coding for invariant vision. Neural Computation 17(1), 47–73 (2005)
26. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear analysis of image ensembles: TensorFaces. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 447–460. Springer, Heidelberg (2002)
27. Mørup, M., Hansen, L.K., Parnas, J., Arnfred, S.M.: Decomposing the time-frequency representation of EEG using non-negative matrix and multi-way factorization. Technical report (2006)
28. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
29. Gao, W., Cao, B., Shan, S., Zhou, D., Zhang, X., Zhao, D.: The CAS-PEAL large-scale Chinese face database and evaluation protocols. Technical Report No. JDL TR 04 FR 001, Joint Research & Development Laboratory, CAS (2004)
30. Ma, B.P., Zhang, W.C., Shan, S.G., Chen, X.L., Gao, W.: Robust head pose estimation using LGBP. In: Proceedings of the International Conference on Pattern Recognition (2), pp. 512–515 (2006)
Kernel Maximum a Posteriori Classification with Error Bound Analysis

Zenglin Xu, Kaizhu Huang, Jianke Zhu, Irwin King, and Michael R. Lyu

Dept. of Computer Science and Engineering, The Chinese Univ. of Hong Kong, Shatin, N.T., Hong Kong
{zlxu,kzhuang,jkzhu,king,lyu}@cse.cuhk.edu.hk

Abstract. Kernel methods have been widely used in data classification. Many kernel-based classifiers, such as Kernel Support Vector Machines (KSVM), assume that data can be separated by a hyperplane in the feature space; these methods do not consider the data distribution. This paper proposes a novel Kernel Maximum A Posteriori (KMAP) classification method, which adopts a Gaussian density assumption in the feature space and can be regarded as a more general classification method than other kernel-based classifiers such as Kernel Fisher Discriminant Analysis (KFDA). We also adopt robust methods for parameter estimation. In addition, the error bound analysis for KMAP indicates the effectiveness of the Gaussian density assumption in the feature space. Furthermore, KMAP achieves very promising results on eight UCI benchmark data sets against competitive methods.

1 Introduction

Recently, kernel methods have been regarded as state-of-the-art classification approaches [1]. The basic idea of kernel methods in supervised learning is to map data from an input space to a high-dimensional feature space in order to make the data more separable. Classical kernel-based classifiers include the Kernel Support Vector Machine (KSVM) [2], Kernel Fisher Discriminant Analysis (KFDA) [3], and the Kernel Minimax Probability Machine [4,5]. The rationale behind them is that linear discriminant functions in the feature space can represent complex separating surfaces when mapped back to the original input space. However, one drawback of KSVM is that it does not consider the data distribution and cannot directly output probabilities or confidences for classification. Therefore, it is hard to apply in systems that reason under uncertainty. On the other hand, in statistical pattern recognition, probability densities can be estimated from data; future examples are then assigned to the class with the Maximum A Posteriori probability (MAP) [6]. One typical probability density function is the Gaussian, which is easy to handle. However, the Gaussian assumption cannot be easily satisfied in the input space, where it is hard to deal with non-linearly separable problems.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 841–850, 2008. © Springer-Verlag Berlin Heidelberg 2008
To solve these problems, we propose a Kernel Maximum A Posteriori (KMAP) classification method under a Gaussianity assumption in the feature space. Different from KSVM, we make the Gaussian density assumption, which implies that data can be separated by more complex surfaces in the feature space. Generally, distributions other than the Gaussian can also be assumed in the feature space; however, under a distribution with a complex form it is hard to obtain a closed-form solution, and over-fitting becomes likely. Moreover, with the Gaussian assumption, a kernelized version of our model can be derived without knowing the explicit form of the mapping functions. In addition, to indicate the effectiveness of our assumption, we calculate a separability measure and the error bound for bi-category data sets. The error bound analysis shows that the Gaussian density assumption can be more easily satisfied in the feature space.

This paper is organized as follows. Section 2 derives the MAP decision rules in the feature space and analyzes their separability measures and upper error bounds. Section 3 presents the experiments against other classifiers. Section 4 reviews related work. Section 5 draws conclusions and lists possible future research directions.

2 Main Results

In this section, our MAP classification model is derived. Then, we adopt a special regularization to estimate the parameters, and the kernel trick is used to compute the model. Last, the separability measure and the error bound are calculated in the kernel-induced feature space.

2.1 Model Formulation

Under the Gaussian distribution assumption, the conditional density function for each class $C_i$ ($1 \le i \le m$) is

$$p(\Phi(x) \mid C_i) = \frac{1}{(2\pi)^{N/2} \, |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\Phi(x) - \mu_i)^T \Sigma_i^{-1} (\Phi(x) - \mu_i) \right), \qquad (1)$$

where $\Phi(x)$ is the image of $x$ in the feature space, $N$ is the dimension of the feature space ($N$ could be infinite), $\mu_i$ and $\Sigma_i$ are the mean and the covariance matrix of $C_i$, respectively, and $|\Sigma_i|$ is the determinant of the covariance matrix. According to the Bayesian theorem, the posterior probability of class $C_i$ is calculated by

$$P(C_i \mid x) = \frac{p(x \mid C_i) \, P(C_i)}{\sum_{j=1}^{m} p(x \mid C_j) \, P(C_j)}. \qquad (2)$$

Based on Eq. (2), the decision rule can be formulated as

$$x \in C_w \quad \text{if} \quad P(C_w \mid x) = \max_{1 \le j \le m} P(C_j \mid x). \qquad (3)$$
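A minimal sketch of the MAP rule in Eqs. (1)-(3), using plain 2-D Gaussians in the input space for readability; KMAP applies the same rule to $\Phi(x)$ in the kernel-induced feature space, and the toy data and function names below are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(3)
X0 = rng.normal([0.0, 0.0], 0.5, size=(100, 2))   # toy samples of class C_0
X1 = rng.normal([3.0, 3.0], 0.5, size=(100, 2))   # toy samples of class C_1

def fit_gaussian(X):
    return X.mean(axis=0), np.cov(X, rowvar=False)

params = [fit_gaussian(X0), fit_gaussian(X1)]
priors = [0.5, 0.5]

def map_class(x):
    # log posterior up to an additive constant:
    #   log P(C_i) - (1/2) [(x - mu_i)^T Sigma_i^{-1} (x - mu_i) + log|Sigma_i|]
    scores = []
    for (mu, Sigma), prior in zip(params, priors):
        d = x - mu
        maha = d @ np.linalg.solve(Sigma, d)
        scores.append(np.log(prior) - 0.5 * (maha + np.log(np.linalg.det(Sigma))))
    return int(np.argmax(scores))    # Eq. (3): assign the class with maximal posterior
```

Working in log space avoids underflow in the exponentials while leaving the argmax unchanged.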
This means that a test data point will be assigned to the class with the maximum of P (C
w |x), i.e., the MAP. Since the MAP is calculated in the kernel-induced feature space, the output model is named as the KMAP classification. KMAP can provide not only a class label but also the probability of a data point belonging to that class. This probability can be viewed as a confidence of classifying new data points and can be used in statistical systems that reason under uncertainty. If the confidence is lower than some specified threshold, the system can refuse to make an inference. However, many kernel learning methods including KSVM cannot output these probabilities. It can be further formulated as follows: g i
− μ i ) T Σ −1 i (Φ(x)
− μ i ) + log |Σ i |. (4) The intuitive meaning of the function is that a class is more likely assigned to an unlabeled data point, when the Mahalanobis distance from the data point to the class center is smaller. 2.2 Parameter Estimation In order to compute the Mahalanobis distance function, the mean vector and the covariance matrix for each class are required to be estimated. Typically, the mean vector (μ i ) and the within covariance matrix (Σ i ) are calculated by the maximum likelihood estimation. In the feature space, they are formulated as follows:
μ_i = (1/n_i) Σ_{j=1}^{n_i} Φ(x_j),   (5)

Σ_i = S_i = (1/n_i) Σ_{j=1}^{n_i} (Φ(x_j) − μ_i)(Φ(x_j) − μ_i)^T,   (6)

where n_i is the cardinality of the set of data points belonging to C_i.

Directly employing S_i as the covariance matrix will generate quadratic discriminant functions in the feature space. In this case, KMAP is denoted KMAP-M. However, the covariance estimation problem is clearly ill-posed, because the number of data points in each class is usually much smaller than the number of dimensions in the kernel-induced feature space. The treatment of this ill-posed problem is to introduce regularization. There are several kinds of regularization methods. One of them is to replace the individual within-class covariance matrices by their average, i.e., Σ_i = S = (1/m) Σ_{i=1}^{m} S_i + rI, where I is the identity matrix and r is a regularization coefficient. This method can substantially reduce the number of free parameters to be estimated. Moreover, it also reduces the discriminant function between two classes to a linear one. Therefore, a linear discriminant analysis method can be obtained.

Alternatively, we can estimate the covariance matrix by combining the above linear discriminant function with the quadratic one. Instead of estimating the
covariance matrix in the input space [7], we apply this method in the feature space. The formulation in the feature space is as follows:

Σ_i = (1 − η) Σ̃_i + η (trace(Σ̃_i) / n) I,   (7)

where Σ̃_i = (1 − θ) S_i + θ S.
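The shrinkage estimator of Eq. (7) can be sketched as follows; this is a minimal illustration with hypothetical names, where the matrix dimension plays the role of the denominator in the trace term.

```python
import numpy as np

def regularized_covariance(S_i, S_avg, theta, eta):
    """Two-level shrinkage of Eq. (7):
    Sigma_tilde = (1 - theta) * S_i + theta * S_avg
    Sigma_i     = (1 - eta) * Sigma_tilde + eta * trace(Sigma_tilde)/dim * I
    theta blends the class covariance with the averaged one; eta shrinks
    the result toward a multiple of the identity matrix.
    """
    Sigma_tilde = (1.0 - theta) * S_i + theta * S_avg
    dim = Sigma_tilde.shape[0]
    return ((1.0 - eta) * Sigma_tilde
            + eta * np.trace(Sigma_tilde) / dim * np.eye(dim))
```

Setting (theta, eta) to the corner values (0, 0), (1, 0), (0, 1), (1, 1) recovers the special cases discussed in Section 2.4.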
In the equations, θ (0 ≤ θ ≤ 1) is a coefficient linking the linear discriminant term and the quadratic discriminant one, and η (0 ≤ η ≤ 1) determines the shrinkage toward a multiple of the identity matrix. This approach is more flexible in adjusting the effect of the regularization. The corresponding KMAP is denoted KMAP-R.

2.3 Kernel Calculation

We derive methods to calculate the Mahalanobis distance (Eq. (4)) using the kernel trick, i.e., we only need to formulate the function in an inner-product form, regardless of the explicit mapping function. To do this, the spectral representation of the covariance matrix,

Σ_i = Σ_{j=1}^{N} Λ_ij Ω_ij Ω_ij^T,

where Λ_ij ∈ R is the j-th eigenvalue of Σ_i and Ω_ij ∈ R^N is the eigenvector relevant to Λ_ij, is utilized. However, the small eigenvalues will degrade the performance of the function overwhelmingly, because they are underestimated due to the small number of examples. In this paper, we only estimate the k largest eigenvalues and replace each remaining eigenvalue with a nonnegative number h_i. Thus Eq. (4) can be reformulated as follows:

g_i(Φ(x)) = (1/h_i) [ g_{1i}(Φ(x)) − g_{2i}(Φ(x)) ] + g_{3i}(Φ(x))
          = (1/h_i) ( Σ_{j=1}^{N} [Ω_ij^T (Φ(x) − μ_i)]^2 − Σ_{j=1}^{k} (1 − h_i/Λ_ij) [Ω_ij^T (Φ(x) − μ_i)]^2 ) + log( h_i^{N−k} Π_{j=1}^{k} Λ_ij ).   (8)

In the following, we show that g_{1i}(Φ(x)), g_{2i}(Φ(x)), and g_{3i}(Φ(x)) can all be written in a kernel form. To formulate these equations, we need to calculate the eigenvalues Λ_ij and eigenvectors Ω_ij. The eigenvectors lie in the space spanned by all the training samples, i.e., each eigenvector Ω_ij can be written as a linear combination of all the training samples:

Ω_ij = Σ_{l=1}^{n} γ_ij^{(l)} Φ(x_l) = U γ_ij,   (9)

where γ_ij = (γ_ij^{(1)}, γ_ij^{(2)}, . . . , γ_ij^{(n)})^T is an n-dimensional column vector and U = (Φ(x_1), . . . , Φ(x_n)).
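The coefficient vectors γ_ij and eigenvalues can be obtained from the kernel matrix alone. Below is a kernel-PCA-style sketch (our own construction, under the assumption that eigenpairs of the centered class kernel block recover the feature-space eigenpairs, as the paper states without proof); the normalization makes the implicit eigenvectors Ω_ij = U γ_ij unit-length.

```python
import numpy as np

def class_kernel_eigenpairs(K_i):
    """Eigenpairs (Lambda_ij, gamma_ij) from a class kernel block.

    K_i: (n_i x n_i) kernel matrix over the training points of class C_i.
    Center the block, eigendecompose, and rescale the coefficient vectors
    so that gamma^T Kc gamma = 1, i.e., the implicit feature-space
    eigenvector U gamma has unit norm.
    """
    n_i = K_i.shape[0]
    H = np.eye(n_i) - np.ones((n_i, n_i)) / n_i     # centering matrix
    Kc = H @ K_i @ H
    lam, V = np.linalg.eigh(Kc / n_i)               # ascending order
    lam, V = lam[::-1], V[:, ::-1]                  # largest first
    gammas = []
    for j in range(n_i):
        if lam[j] > 1e-12:                          # drop numerically-zero modes
            gammas.append(V[:, j] / np.sqrt(n_i * lam[j]))
    return lam, gammas
```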
It is easy to prove that γ_ij and Λ_ij are actually the eigenvector and eigenvalue of the covariance matrix of G^{(i)}, where G^{(i)} ∈ R^{n_i × N} is the i-th block of the kernel matrix G relevant to C_i. We omit the proof due to the limit of space. Accordingly, we can express g_{1i}(Φ(x)) in kernel form:

g_{1i}(Φ(x)) = Σ_{j=1}^{n} γ_ij^T U^T (Φ(x) − μ_i)(Φ(x) − μ_i)^T U γ_ij
            = Σ_{j=1}^{n} ( γ_ij^T ( K_x − (1/n_i) Σ_{l=1}^{n_i} K_{x_l} ) )^2
            = || K_x − (1/n_i) Σ_{l=1}^{n_i} K_{x_l} ||_2^2,   (10)

where K_x = (K(x_1, x), . . . , K(x_n, x))^T. In the same way, g_{2i}(Φ(x)) can be formulated as

g_{2i}(Φ(x)) = Σ_{j=1}^{k} (1 − h_i/Λ_ij) Ω_ij^T (Φ(x) − μ_i)(Φ(x) − μ_i)^T Ω_ij.   (11)

Substituting (9) into g_{2i}(Φ(x)), we have:

g_{2i}(Φ(x)) = Σ_{j=1}^{k} (1 − h_i/Λ_ij) γ_ij^T ( K_x − (1/n_i) Σ_{l=1}^{n_i} K_{x_l} ) ( K_x − (1/n_i) Σ_{l=1}^{n_i} K_{x_l} )^T γ_ij.   (12)
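Putting Eqs. (8), (10), and (12) together, the discriminant can be evaluated from kernel values alone. The sketch below assumes the eigenpairs have already been computed (e.g., as in the previous snippet); the function name and argument layout are ours.

```python
import numpy as np

def kmap_discriminant(K_x, K_block_i, lam, gammas, k, h_i, N):
    """Kernelized Mahalanobis-type discriminant, Eq. (8) via Eqs. (10)-(12).

    K_x       : vector (K(x_1, x), ..., K(x_n, x)) for a test point x
    K_block_i : rows of the kernel matrix for the n_i points of class C_i,
                shape (n_i, n); its column mean is (1/n_i) sum_l K_{x_l}
    lam       : leading eigenvalues Lambda_ij (length >= k)
    gammas    : coefficient vectors gamma_ij, each of length n
    """
    d = K_x - K_block_i.mean(axis=0)          # K_x - (1/n_i) sum_l K_{x_l}
    g1 = d @ d                                 # Eq. (10)
    g2 = sum((1.0 - h_i / lam[j]) * (gammas[j] @ d) ** 2
             for j in range(k))                # Eq. (12)
    # g3 = log( h_i^{N-k} * prod_{j<=k} Lambda_ij )
    g3 = (N - k) * np.log(h_i) + np.sum(np.log(lam[:k]))
    return (g1 - g2) / h_i + g3
```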
Now, the Mahalanobis distance function g_i(Φ(x)) in the feature space can finally be written in a kernel form, where N in g_{3i}(Φ(x)) is substituted by the cardinality of the data, n. The time complexity of KMAP is mainly due to the eigenvalue decomposition, which scales as O(n^3), as in KFDA.

2.4 Connection to Other Kernel Methods

In the following, we show the connection between KMAP and other kernel-based methods. In the regularization method based on Eq. (7), other kernel-based classification methods can be derived by varying the settings of θ and η. When (θ = 0, η = 0), the KMAP model represents a quadratic discriminant method in the kernel-induced feature space; when (θ = 1, η = 0), it represents a kernel discriminant method; and when (θ = 0, η = 1) or (θ = 1, η = 1), it represents the nearest mean classifier. Therefore, by varying θ and η, different models can be generated from different combinations of the quadratic discriminant, linear discriminant, and nearest mean methods.

We consider a special case of the regularization method when θ = 1 and η = 0. If both classes are assumed to have the same covariance structure for a binary class problem, i.e., Σ_i = (Σ_1 + Σ_2)/2, this leads to a linear discriminant function. Assuming all classes have the same class prior probabilities, g_i(Φ(x)) can be derived as

g_i(Φ(x)) = (Φ(x) − μ_i)^T ((Σ_1 + Σ_2)/2)^{−1} (Φ(x) − μ_i),  i = 1, 2.

Discarding the term common to both classes, we reformulate the above equation as g_i(Φ(x)) = w_i^T Φ(x) + b_i, where w_i = −4(Σ_1 + Σ_2)^{−1} μ_i and b_i = 2 μ_i^T (Σ_1 + Σ_2)^{−1} μ_i. The decision hyperplane is f(Φ(x)) = g_1(Φ(x)) − g_2(Φ(x)), i.e.,

f(Φ(x)) = (μ_1 − μ_2)^T (Σ_1 + Σ_2)^{−1} Φ(x) − (1/2)(μ_1 − μ_2)^T (Σ_1 + Σ_2)^{−1} (μ_1 + μ_2).   (13)
Eq. (13) is just the solution of KFDA [3]. Therefore, KFDA can be viewed as a special case of KMAP when all classes have the same covariance structure.

Remark. KMAP provides a rich class of kernel-based classification algorithms using different regularization methods. This makes KMAP a flexible framework for classification, adaptive to the data distribution.

2.5 Separability Measures and Error Bounds

To measure the separability of different classes of data in the feature space, the Kullback-Leibler divergence (a.k.a. K-L distance) between two Gaussians is adopted. The K-L divergence is defined as

d_KL[p_i(Φ(x)), p_j(Φ(x))] = ∫ p_i(Φ(x)) ln( p_i(Φ(x)) / p_j(Φ(x)) ) dΦ(x).   (14)

Since the K-L divergence is not symmetric, a two-way divergence is used to measure the distance between two distributions:

d_ij = d_KL[p_i(Φ(x)), p_j(Φ(x))] + d_KL[p_j(Φ(x)), p_i(Φ(x))].   (15)

Following [6], it can be proved that

d_ij = (1/2)(μ_i − μ_j)^T (Σ_i^{−1} + Σ_j^{−1})(μ_i − μ_j) + (1/2) trace( Σ_i^{−1} Σ_j + Σ_j^{−1} Σ_i − 2I ),   (16)

which can be calculated by using the trick in Section 2.3.

The Bayesian decision rule guarantees the lowest average error rate, as presented in the following:

P(correct) = Σ_{i=1}^{m} ∫_{R_i} p(Φ(x) | C_i) P(C_i) dΦ(x),   (17)

where R_i is the decision region of class C_i. An upper bound on the error rate can be obtained under the Gaussian density distribution function. Following [6], we have

P(error) ≤ P(C_1)^{1/2} P(C_2)^{1/2} exp(−q(0.5)),   (18)

where

q(0.5) = (1/8)(μ_2 − μ_1)^T ((Σ_1 + Σ_2)/2)^{−1} (μ_2 − μ_1) + (1/2) ln( |(Σ_1 + Σ_2)/2| / (|Σ_1|^{1/2} |Σ_2|^{1/2}) ).   (19)
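For two Gaussians with known parameters, the separability measure of Eq. (16) and the Bhattacharyya-type bound of Eqs. (18)-(19) can be computed directly; the sketch below works in a finite-dimensional space (function names are ours; in KMAP the quantities are obtained via the kernel trick).

```python
import numpy as np

def two_way_kl(mu_i, cov_i, mu_j, cov_j):
    """Symmetrized K-L divergence between two Gaussians, Eq. (16)."""
    dm = mu_i - mu_j
    Pi, Pj = np.linalg.inv(cov_i), np.linalg.inv(cov_j)
    I = np.eye(len(mu_i))
    return (0.5 * dm @ (Pi + Pj) @ dm
            + 0.5 * np.trace(Pi @ cov_j + Pj @ cov_i - 2 * I))

def bhattacharyya_bound(mu1, cov1, mu2, cov2, p1=0.5, p2=0.5):
    """Upper error bound of Eqs. (18)-(19) for two classes with priors p1, p2."""
    avg = 0.5 * (cov1 + cov2)
    dm = mu2 - mu1
    q = (dm @ np.linalg.inv(avg) @ dm / 8.0
         + 0.5 * np.log(np.linalg.det(avg)
                        / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2))))
    return np.sqrt(p1 * p2) * np.exp(-q)
```

Identical class distributions give zero divergence and the trivial bound 0.5; well-separated means drive the bound toward zero.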
Using the results in Section 2.3, the Bhattacharyya error bound can be easily calculated in the kernel-induced feature space.

3 Experiments

In this section, we report the experiments to evaluate the separability measure, the error bound, and the prediction performance of the proposed KMAP.

3.1 Synthetic Data

We compare the separability measure and error bounds on three synthetic data sets. The description of these data sets can be found in [8]. The data sets are named according to their characteristics, and they are plotted in Fig. 1. We map the data using the RBF kernel into a feature space where Gaussian distributions are approximately satisfied. We then calculate separability measures on all data sets according to Eq. (16). The separability values for Overlap, Bumpy, and Relevance in the original input space are 14.94, 5.16, and 22.18, respectively; the corresponding values in the feature space are 30.88, 5.87, and 3631. The results indicate that data become more separable after being mapped into the feature space, especially for the Relevance data set.

For data in the kernel-induced feature space, the error bounds are calculated according to Eq. (18). Figure 1 also plots the prediction rates and the upper error bounds for data in the input space and in the feature space, respectively. It can be observed that the error bounds are more valid in the feature space than in the input space.

3.2 Benchmark Data

Experimental Setup. In this experiment, KSVM, KFDA, Modified Quadratic Discriminant Analysis (MQDF) [9], and Kernel Fisher's Quadratic Discriminant Analysis (KFQDA) [10] are employed as the competing algorithms. We implement two variants of KMAP, i.e., KMAP-M and KMAP-R. The properties of the eight UCI benchmark data sets are described in Table 1. In all kernel methods, a Gaussian-RBF kernel is used. The parameter C of KSVM and the parameter γ in the RBF kernel are tuned by 10-fold cross validation. In KMAP, we select k pairs of eigenvalues and eigenvectors according to their contribution to the covariance matrix, i.e., the index set { j : λ_j / Σ_{q=1}^{n} λ_q ≥ α }; in MQDF, the range of k is relatively small and we select k by cross validation. PCA is used as the regularization method in KFQDA and the cumulative decay ratio is set to 99%; the regularization parameter r is set to 0.001 in KFDA.
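The eigenpair selection rule above (keep pairs whose contribution ratio reaches α) can be sketched in a few lines; this is an illustrative helper with a name of our choosing.

```python
import numpy as np

def select_k(eigvals, alpha):
    """Indices j with lambda_j / sum_q lambda_q >= alpha.

    eigvals: nonnegative eigenvalues (any order); the returned index set
    realizes the contribution-ratio rule used to choose k in KMAP.
    """
    ratios = eigvals / eigvals.sum()
    return np.flatnonzero(ratios >= alpha)
```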
(a) Overlap  (b) Bumpy  (c) Relevance  (d) Separability Comparison

Fig. 1. The data plots of Overlap, Bumpy and Relevance, and the comparison of data separability (prediction error and error bound, %) in the input space and the feature space (input_bound, input_rate, feature_bound, feature_rate).

Table 1. Data set information

Data Set   # Samples  # Features  # Classes
Iono           351        34          2
Breast         683         9          2
Twonorm       1000        21          2
Sonar          208        60          2
Pima           768         8          2
Iris           150         4          3
Wine           178        13          3
Segment        210        19          7

In both KMAP and MQDF, h takes the value of Λ_{k+1}. In KMAP-R, the extra parameters (θ, η) are tuned by cross-validation. All experimental results are obtained over 10 runs, and each run is executed with 10-fold cross validation for each data set.

Experimental Results. Table 2 reports the average prediction accuracy with standard errors on each data set for all algorithms. It can be observed that both variants of KMAP outperform MQDF, which is an MAP method in the input space. This also empirically validates that the separability among different classes of data becomes larger, and that the upper error bounds get tighter and more accurate, after data are mapped to the high-dimensional feature space. Moreover, the performance of KMAP is competitive with that of other kernel methods. In particular, KMAP-R achieves better prediction accuracy than all other methods on most of the data sets. The reason is that the regularization methods in KMAP favorably capture the prior distributions of
Table 2. The prediction results of KMAP and other methods

Data set     KSVM       MQDF       KFDA       KFQDA      KMAP-M     KMAP-R
Iono(%)      94.1±0.7   89.6±0.5   94.2±0.1   93.6±0.4   95.2±0.2   95.7±0.3
Breast(%)    96.5±0.4   96.5±0.1   96.4±0.1   96.5±0.1   96.5±0.1   97.5±0.1
Twonorm(%)   96.1±0.4   97.4±0.4   96.7±0.5   97.3±0.5   97.6±0.7   97.5±0.4
Sonar(%)     86.6±1.0   83.7±0.7   88.3±0.3   85.1±1.9   87.2±1.6   88.8±1.2
Pima(%)      77.9±0.7   73.1±0.4   71.0±0.5   74.1±0.5   75.4±0.7   75.5±0.4
Iris(%)      96.2±0.4   96.0±0.1   95.7±0.1   96.8±0.2   96.9±0.1   98.0±0.0
Wine(%)      98.8±0.1   99.2±1.3   99.1±0.1   96.9±0.7   99.3±0.1   99.3±0.6
Segment(%)   92.8±0.7   86.9±1.2   91.6±0.3   85.8±0.8   90.2±0.2   92.1±0.8
Average(%)   92.38      90.30      91.62      90.76      92.29      93.05

data, since the Gaussian assumption in the feature space can fit a very complex distribution in the input space.

4 Related Work

In statistical pattern recognition, the probability density function can first be estimated from data; future examples are then assigned to the class with the MAP. One typical example is the Quadratic Discriminant Function (QDF) [11], which is derived from the multivariate normal distribution and achieves the minimum mean error rate under a Gaussian distribution. In [9], a Modified Quadratic Discriminant Function (MQDF) less sensitive to estimation error is proposed. [7] improves the performance of QDF by covariance matrix interpolation. Unlike QDF, another type of classifier does not assume the probability density functions in advance, but is designed directly on data samples. An example is Fisher discriminant analysis (FDA), which maximizes the between-class covariance while minimizing the within-class variance. It can be derived as a Bayesian classifier under a Gaussian assumption on the data. [3] develops Kernel Fisher Discriminant Analysis (KFDA) by extending FDA to a non-linear space via the kernel trick.

To supplement the statistical justification of KFDA, [10] extends the maximum likelihood method and Bayes classification to their kernel generalization under a Gaussian Hilbert space assumption. The authors do not directly kernelize the quadratic forms in terms of kernel values. Instead, they use an explicit mapping function to map the data to a high-dimensional space; the kernel matrix is then used as the input data of FDA. The derived model is named Kernel Fisher's Quadratic Discriminant Analysis (KFQDA).

5 Conclusion and Future Work

In this paper, we present a novel kernel classifier named Kernel-based Maximum a Posteriori (KMAP), which implements a Gaussian distribution assumption in the kernel-induced feature space.
Compared with state-of-the-art classifiers, the advantages of KMAP include that the prior information of the distribution is incorporated and that it can output a probability or confidence in making a decision. Moreover, KMAP can be regarded as a more generalized classification method than other kernel-based methods such as KFDA. In addition, the error bound analysis illustrates that the Gaussian distribution assumption is more easily satisfied in the feature space than in the input space. More importantly, KMAP with proper regularization achieves very promising performance. We plan to incorporate the probability information into both the kernel function and the classifier in future work.

Acknowledgments. The work described in this paper is fully supported by two grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4205/04E and Project No. CUHK4235/04E).

References

1. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002)
2. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, Chichester (1998)
3. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.: Fisher discriminant analysis with kernels. In: Proceedings of the IEEE Neural Networks for Signal Processing Workshop, pp. 41–48 (1999)
4. Lanckriet, G.R.G., Ghaoui, L.E., Bhattacharyya, C., Jordan, M.I.: A robust minimax approach to classification. Journal of Machine Learning Research 3, 555–582 (2002)
5. Huang, K., Yang, H., King, I., Lyu, M.R., Chan, L.: Minimum error minimax probability machine. Journal of Machine Learning Research 5, 1253–1286 (2004)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience (2000)
7. Friedman, J.H.: Regularized discriminant analysis. Journal of the American Statistical Association 84(405), 165–175 (1989)
8. Centeno, T.P., Lawrence, N.D.: Optimising kernel parameters and regularisation coefficients for non-linear discriminant analysis. Journal of Machine Learning Research 7(2), 455–491 (2006)
9. Kimura, F., Takashina, K., Tsuruoka, S., Miyake, Y.: Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 9, 149–153 (1987)
10. Huang, S.Y., Hwang, C.R., Lin, M.H.: Kernel Fisher's discriminant analysis in Gaussian Reproducing Kernel Hilbert Space. Technical report, Academia Sinica, Taiwan, R.O.C. (2005)
11. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, San Diego (1990)

Comparison of Local Higher-Order Moment Kernel and Conventional Kernels in SVM for Texture Classification

Keisuke Kameyama

Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan
Keisuke.Kameyama@cs.tsukuba.ac.jp

Abstract. The use of the local higher-order moment kernel (LHOM kernel) in SVMs for texture classification was investigated by comparing it with SVMs using other conventional kernels. In the experiments, it became clear that SVMs with LHOM kernels achieve better trainability and give more stable responses to the texture classes than those with conventional kernels. Also, the number of support vectors was kept low, which indicates better class separability in the nonlinearly-mapped feature space.

Keywords: Support Vector Machine (SVM), Higher-Order Moment Spectra, Kernel Function, Texture Classification.

1 Introduction

Classification of image textures according to their local nature has various uses, among them scene understanding in robot vision, document processing, and diagnosis support in medical images. Generally, segmentation of textures relies on local features extracted by a local window. Multiple features are extracted from the windowed image, giving a feature vector of the local texture. The feature vector is further passed to a classifier, which maps it to a class label.

One of the common choices among local features is the local power spectrum of the signal. However, there are cases where second-order features are not sufficient. In such cases, higher-order features can be used to improve the classification ability [1]. However, when the orders of the features grow higher, exhaustive examination of the possible features, or an efficient search for an effective feature, becomes difficult due to the high dimensionality of the feature space. Thus, the use of moment and moment-spectrum features of higher order has been limited so far.

Recently, the use of high-dimensional feature spaces by so-called kernel-based methods has been actively examined. The Support Vector Machine (SVM) [2][3],

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 851–860, 2008. © Springer-Verlag Berlin Heidelberg 2008
which also benefits from this nature, has the potential to make high-dimensional signal features a realistic choice for supervised signal classification. In the literature, kernel functions corresponding to higher-order moments [4,5] and their spectra [6] have been investigated.

In this work, the effectiveness of using the local higher-order moment kernel (LHOM kernel) in SVMs for texture classification is examined by comparing it with SVMs using other conventional kernel functions. In the following, local higher-order moment (LHOM) features and SVMs are reviewed in Secs. 2 and 3, respectively. The definition of the LHOM kernel and its characteristics are given in Sec. 4. In Sec. 5, the setup and results of the texture classification experiments are presented, with discussion focused on the differences introduced by the kernel choice. Finally, Sec. 6 concludes the paper.

2 Local Moment and Local Moment Spectra

2.1 Feature Extraction by Local Power Spectrum

Let s(t) be a real-valued image signal defined on R^2. One of the most common ways to characterize a local signal is the local power spectrum. It can be obtained as the Fourier transform of the local second-order moment of signal s:

M(ω, x) = ∫ m_{s,w,2}(τ, x) exp(−j ω^T τ) dτ.   (1)

The integral is over R^2 where not specified. The local second-order moment of signal s is defined as

m_{s,w,2}(τ, x) = ∫ w(t) s(t + x) w(t + τ) s(t + x + τ) dt,   (2)

where w is the window function localizing the signal characterization. In this work, a Gaussian window

w(t, Σ) = (2π)^{−1} |Σ|^{−1/2} exp( −(1/2) t^T Σ^{−1} t )   (3)

is used. Here, Σ is a positive definite symmetric real 2 × 2 matrix that modifies the window shape, and the superscript T denotes the matrix transpose.

2.2 Local Higher-Order Moment Spectra (LHOMS)

Moment spectra of orders n higher than two, namely the bispectrum (n = 3), the trispectrum (n = 4), and so on, also contribute to characterizing the texture. The spectrum of order n reflects the correlation of n − 1 frequency components in complex values. Therefore it is possible, for example, to characterize the phase difference among signal components using the bispectrum, while this information is lost in the power spectrum.
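The local power spectrum of Eqs. (1)-(3) can be sketched for a discrete image as below. Rather than forming the local moment and transforming it, this hypothetical helper uses the equivalent route (the Wiener-Khinchin relation): window the signal with a Gaussian centered at x and take the squared magnitude of its Fourier transform. An isotropic window Σ = σ²I is assumed for simplicity.

```python
import numpy as np

def local_power_spectrum(s, x, sigma):
    """Local power spectrum of a 2-D array s around pixel position x.

    Multiplies s by a Gaussian window (isotropic case of Eq. (3)) centered
    at x, then returns |FFT|^2, which equals the Fourier transform of the
    windowed autocorrelation of Eq. (2).
    """
    rows, cols = np.indices(s.shape)
    d2 = (rows - x[0]) ** 2 + (cols - x[1]) ** 2
    w = np.exp(-0.5 * d2 / sigma ** 2) / (2 * np.pi * sigma ** 2)
    return np.abs(np.fft.fft2(w * s)) ** 2
```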
The n-th order local moment spectrum is the Fourier transform of the n-th order local moment. By defining Ω_{n−1} = [ω_1^T ω_2^T . . . ω_{n−1}^T]^T and T_{n−1} = [τ_1^T τ_2^T . . . τ_{n−1}^T]^T, we have

M_{s,w,n}(Ω_{n−1}, x) = ∫_{R^{2(n−1)}} m_{s,w,n}(T_{n−1}, x) exp(−j Ω_{n−1}^T T_{n−1}) dT_{n−1},   (4)

where

m_{s,w,n}(T_{n−1}, x) = ∫ w(t) s(t + x) Π_{k=1}^{n−1} { w(t + τ_k) s(t + x + τ_k) } dt.   (5)

The n-th order LHOMS, being a product of n − 1 frequency components and their sum-frequency component, makes the LHOMS an especially useful tool for characterizing the phase relations among the fundamental (ω_0) and harmonic (2ω_0, 3ω_0, . . .) frequency components.
3 Support Vector Machine and Kernel Functions

The Support Vector Machine (SVM) [2][3] is a supervised pattern recognition scheme with two significant features. First, the SVM realizes an optimal linear classifier (optimal hyperplane) in the feature space, theoretically grounded in Structural Risk Minimization (SRM). SVM learning achieves a linear classifier with minimum VC-dimension, thereby keeping the expected generalization error low. Second, the utilization of feature spaces of very high dimension is made possible by way of kernel functions.

The SVM takes an input vector x ∈ R^L and maps it by a predetermined feature extraction function to Φ(x) ∈ R^N. This feature vector is classified into one of two classes by a linear classifier

y = sgn(u(x)) = sgn(⟨w, Φ(x)⟩ + b),  y ∈ {−1, 1},   (6)

where w ∈ R^N and b ∈ R are the weight vector and the bias determining the placement of the discrimination hyperplane in the feature space, respectively. The parameters w and b are determined by supervised learning using a training set of ℓ input-output pairs {(x_i, y_i) ∈ R^L × {±1}}_{i=1}^{ℓ}.

Assuming that the training set mapped to the feature space, {Φ(x_i)}_{i=1}^{ℓ}, is linearly separable, the weight vector w is determined to maximize the margin between the two classes, by solving an optimization problem with inequality constraints: minimize (1/2)||w||^2 subject to {y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1}_{i=1}^{ℓ}. It can be solved by using the Lagrangian function L(w, b, α), minimizing it with respect to the variables w and b, and maximizing it with respect to the multipliers α = [α_1, . . . , α_ℓ].

When the training set is not linearly separable, the constraints can be relaxed to

minimize (1/2)||w||^2 + C Σ_{i=1}^{ℓ} ξ_i  subject to  {y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1 − ξ_i}_{i=1}^{ℓ},   (7)

utilizing so-called soft margins by introducing nonnegative slack variables {ξ_i}_{i=1}^{ℓ} and a regularization constant C > 0. In both cases, the function u for the classifier is obtained in the form

u(x) = Σ_{i=1}^{ℓ} y_i α_i ⟨Φ(x_i), Φ(x)⟩ + b.   (8)

Typically, the optimal solution of the original constrained problem resides where equalities hold for only a very small fraction of the inequality conditions. This leaves the multipliers for the other conditions at zero and makes the use of Eq. (8) practical even for a large ℓ. Training inputs with nonzero multipliers are called support vectors, hence the name SVM. The parameter b in Eq. (8) can be obtained from the (equality) constraints for one of the support vectors.

In both training and using the SVM, the feature extraction function Φ(x) is never explicitly evaluated. Instead, the inner product K(x, y) = ⟨Φ(x), Φ(y)⟩, called the kernel function, is always evaluated. Among the well-known conventional kernel functions are the polynomial kernel

K(x, y) = (⟨x, y⟩ + 1)^d,  (d = 1, 2, . . .)   (9)

and the radial-basis function (RBF) kernel

K(x, y) = exp(−||x − y||^2 / γ^2).  (γ ∈ R)   (10)

In some cases of feature extraction, evaluating ⟨Φ, Φ⟩ requires less computation than directly evaluating Φ. This nature enables the SVM to efficiently utilize feature spaces of very high dimension.

4 The LHOMS and LHOM Kernels

4.1 LHOMS Kernel Derivation

Here, a kernel function for the case where the LHOMS corresponds to the nonlinear feature map Φ is derived. This amounts to treating the LHOMS function as a feature vector of infinite dimension over all possible combinations of frequency vectors ω_1, . . . , ω_{n−1}.
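The two conventional kernels of Eqs. (9)-(10) are a few lines each; the sketch below uses hypothetical helper names and serves only to make the formulas concrete.

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """Polynomial kernel of Eq. (9): K(x, y) = (<x, y> + 1)^d."""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel of Eq. (10): K(x, y) = exp(-||x - y||^2 / gamma^2)."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return np.exp(-np.dot(diff, diff) / gamma ** 2)
```

Note the RBF kernel equals 1 when x = y and decays with the squared distance, so γ directly controls the effective neighborhood size.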
The inner product of the n-th order moment spectra of signals s(t) and v(t), localized at x and y respectively, is

K_{w,n}(s, v; x, y) = ⟨ M_{s,w,n}(ω_1, . . . , ω_{n−1}, x), M_{v,w,n}(ω_1, . . . , ω_{n−1}, y) ⟩
= ∫ · · · ∫ { S_w^*( Σ_{k=1}^{n−1} ω_k, x ) Π_{k=1}^{n−1} S_w(ω_k, x) } × { V_w^*( Σ_{k=1}^{n−1} ω_k, y ) Π_{k=1}^{n−1} V_w(ω_k, y) }^* dω_1 · · · dω_{n−1}
= (2π)^{2(n−1)} ∫ ( ∫ w(z) s(z + x) w(z + τ) v(z + y + τ) dz )^n dτ.   (11)

See [6] for the derivation. The function K_{w,n}(s, v; x, y) serves as the kernel function in the SVM for signal classification. Notably, the increase in computational cost due to the moment order can be avoided as far as the inner product is concerned.

In the literature, the inner product of the n-th order moments of two signals s and v, which omits the window w of Eq. (3), has been derived as

⟨ m_{s,n}, m_{v,n} ⟩ = ∫ ( ∫ s(z) v(z + τ) dz )^n dτ,   (12)

by an early work of McLaughlin and Raghu [7] for use in optical character recognition. Recently, it has been used in the context of modern kernel-based methods for signal and image recognition problems [4,5]. The non-localized and localized versions of this kernel are referred to as the HOM and LHOM kernels, respectively.

4.2 Equivalence of LHOMS and LHOM Kernels

The following clarifies an important relation between the LHOMS kernel and the LHOM kernel.

Theorem (Plancherel). Define the Fourier transform of a function f(x) ∈ L^2(R^N) as (F f)(ω) = ∫_{R^N} f(x) e^{−j⟨ω, x⟩} dx. Then ⟨F f, F g⟩ = (2π)^N ⟨f, g⟩ holds for functions f, g ∈ L^2(R^N) [8].

Corollary. Denoting the LHOMS and LHOM of an image s(t) as M_{s,w,n} and m_{s,w,n}, respectively,

⟨ M_{s,w,n}, M_{v,w,n} ⟩ = (2π)^{2(n−1)} ⟨ m_{s,w,n}, m_{v,w,n} ⟩.
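For discrete 1-D signals, the HOM kernel of Eq. (12) reduces to summing the n-th power of the cross-correlation over all lags. The sketch below is our own discrete analogue (it assumes circular shifts for simplicity, which the continuous formula does not specify).

```python
import numpy as np

def hom_kernel(s, v, n):
    """Discrete sketch of the n-th order HOM kernel, Eq. (12):
    <m_{s,n}, m_{v,n}> = sum_tau ( sum_z s(z) v(z + tau) )^n,
    using circular cross-correlation over all lags tau.
    """
    s, v = np.asarray(s, float), np.asarray(v, float)
    corr = np.array([np.dot(s, np.roll(v, -tau)) for tau in range(len(s))])
    return float(np.sum(corr ** n))
```

Since only this inner product is ever needed by the SVM, the cost does not blow up with the moment order n, which is the point of the derivation above.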