paper. In this method, 3D volumes are treated directly as third-order tensors, without unfolding them into 1D vectors. Additionally, the output of G3D-PCA is three matrices whose columns are the orthogonal bases in the three mode subspaces, respectively. Bases in the third-order tensor space can be constructed from these three matrices, and the maximal available number of tensor bases is not limited by the number of training samples. Owing to these properties, our method does not suffer from a huge computing cost; moreover, it generalizes well even when few samples are used in training.

Head Pose Estimation Based on Tensor Factorization

Wenlu Yang 1,2, Liqing Zhang 1,⋆, and Wenjun Zhu 1

1 Department of Computer Science and Engineering, 800 Dongchuan Rd., Shanghai Jiaotong University, Shanghai 200240, China
2 Department of Electronic Engineering, 1550 Pudong Rd., Shanghai Maritime University, Shanghai 200135, China
wenluyang@online.sh.cn, lqzhang@sjtu.edu.cn, wilsonzhu@sjtu.edu.cn

Abstract. This paper investigates the head pose estimation problem, considered as a front-end preprocessing step for improving multi-view human face recognition. We propose a computational model for perceiving head poses based on Non-negative Multi-way Factorization (NMWF). The model consists of three components: the tensor representation for multi-view faces, feature selection, and head pose perception. To find the facial representation basis, the NMWF algorithm is applied to the training data of facial images. The face tensor comprises three factors: facial images, poses, and people. The former two factors are used to construct the computational model for pose estimation. The discriminative measure for perceiving the head pose is defined by the similarity between the tensor basis and the representation of the test facial image, which is the projection of the face onto the subspace spanned by the basis "TensorFaces". Computer simulation results show that the proposed model achieves satisfactory accuracy in estimating head poses of facial images in the CAS-PEAL face database.

Keywords: Head pose estimation, Face recognition, Tensor decomposition, Non-negative multi-way factorization, TensorFaces.

1 Introduction

Humans possess a remarkable ability to recognize faces regardless of facial geometries, expressions, head poses, lighting conditions, distances, and ages. Modelling the functions of face recognition and identification remains a difficult open problem in the fields of computer vision and neuroscience. In particular, face recognition robust to head pose variation is still difficult in complex natural environments.
Therefore, head pose estimation is a very useful front-end processing step for multi-view human face recognition. Head pose variation covers three free rotation parameters: yaw (rotation around the neck), tilt (rotation up and down), and roll (rotation from left profile to right profile). In this paper, we focus on yaw rotation, which has many important applications.

⋆ Corresponding author.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 831–840, 2008. © Springer-Verlag Berlin Heidelberg 2008

Previous methods for head pose estimation from 2D images can be roughly divided into two categories: template-based approaches and appearance-based approaches. Template-based approaches rely on nearest-neighbor classification with texture templates [1][2], or are deduced from geometric information such as configurations of facial landmarks [3][5][6]. Appearance-based approaches generally regard head pose estimation as a parameter estimation problem, which can be handled using regression, multi-class classification, or nonlinear dimension reduction. Typical appearance-based approaches include Support Vector Machines (SVM) [7][8], a tree-structured AdaBoost pose classifier [9], soft margin AdaBoost [10], Neural Networks (NN) [11][12], Gabor Wavelet Networks [13], MDS [14], Principal Component Analysis (PCA) [15][16], Kernel PCA [17], Independent Component Analysis (ICA) [18][24], and linear or non-linear embedding and mapping [19][20] (ISOMAP [21], Local Linear Embedding (LLE) [22][23]).

In this paper, we propose a computational model for perceiving head poses from facial images using tensor decomposition. The rest of this paper is organized as follows. In Section 2, we propose the computational model for pose estimation and introduce the tensor decomposition algorithm, the Non-negative Multi-Way Factorization (NMWF). In Section 3, we describe the face tensor representation and the pose estimation method. In Section 4, the tensor bases for different head poses are obtained from training data, and computer simulations are provided to show the performance of the proposed method. Finally, discussions and conclusions are given in Section 5.

2 Tensor Representation and Learning Algorithm

In this section, we first review related linear models for tensor decomposition. We then propose a computational model for perceiving multiple views, or head poses, of facial images. Next, we introduce the Non-negative Multi-way Factorization (NMWF) algorithm for learning the representation basis functions of facial images, and we further define representation bases for different views of facial images.
Furthermore, we propose a method for estimating head poses from facial images.

2.1 Related Models for Tensor Decomposition

PCA has been a popular technique in facial image recognition, and its generalization, ICA, has also been employed in face recognition [24]. The PCA-based face recognition method Eigenfaces works well when the face images are well aligned; if facial images contain other factors, such as lighting, pose, and expression, the recognition performance of Eigenfaces degenerates dramatically. Tensor representation is able to formulate more than two factors in one framework; therefore, tensor models have richer representational capability than the commonly used matrix. Grimes and Rao [25] proposed a bilinear sparse model to learn styles and contents from natural images. When facial images contain more than two factors, such linear and bilinear algebraic models do not work well, and some researchers have proposed multilinear algebra to represent multiple factors of image ensembles. Vasilescu and Terzopoulos [26] applied the higher-order tensor approach to factorize facial images, resulting in the "TensorFaces" representation of facial images. Later, they proposed a multilinear
ICA model to learn the statistically independent components of multiple factors. Mørup et al. proposed the Non-negative Multi-way Factorization (NMWF) algorithm, a generalization of Lee and Seung's non-negative matrix factorization (NMF) algorithm [28], to decompose the time-frequency representation of EEG [27].

Exploiting the multiple-factor representation of the tensor model, we use the tensor factorization approach to find the tensor basis for facial images and further propose a computational model for perceiving head poses from facial images. The model consists of two parts, the training module and the perception module, as shown in Fig. 1. The training module finds the tensor representation using the non-negative multi-way algorithm; using the coefficients of the pose factor, we define the representation bases for different head poses.

Fig. 1. Model for learning the "TensorFaces" and estimating poses

The second part, the perception module, estimates the head pose of an input facial image. It is composed of the testing set, the basis functions, the pose coefficients, similarity analysis, and the output of the resulting pose. Given a facial image randomly selected from the testing set, the corresponding pose coefficients are obtained by projecting the image onto the subspace spanned by the basis functions. The similarity between these coefficients and the pose factors is then analyzed, and the maximal correlation coefficient yields the corresponding pose. The next section gives a brief introduction to the NMWF and the algorithm for pose estimation.

2.2 The NMWF Algorithm

In this section, we briefly introduce the tensor representation model and the Non-negative Multi-way Factorization (NMWF) algorithm. Generally speaking, a tensor is a generalized data representation of a 2-dimensional matrix. Given an n-way array $\mathcal{T}$, the objective of tensor factorization is to decompose $\mathcal{T}$ into the following N components
$$\mathcal{T} \approx \mathcal{T}^h = \sum_{j=1}^{N} U^{(1)}_j \otimes U^{(2)}_j \otimes \cdots \otimes U^{(n)}_j \qquad (1)$$

This factorization is important for many applications, since each rank-1 component represents a certain feature of the tensor data. Several questions need to be discussed for this factorization. The first is how many components are suitable for the decomposition; this is related to matrix rank estimation in the singular value decomposition of matrices. It is usually difficult to estimate the rank of a tensor, because tensor rank estimation is still an open NP-hard problem; refer to [4] for a tutorial on tensor rank estimation. The second issue is which criterion should be used for finding the tensor components. Commonly used criteria include the least-squares error, the relative entropy, and the more general alpha-divergence. In this paper, we use the relative entropy as the criterion. Given a nonnegative tensor $\mathcal{T}$, the relative entropy between $\mathcal{T}$ and its approximation $\mathcal{T}^h$ is defined as

$$\mathrm{Div}(\mathcal{T} \,\|\, \mathcal{T}^h) = \sum_{j_1, j_2, \cdots, j_N} \left( \mathcal{T}_{j_1, j_2, \cdots, j_N} \log \frac{\mathcal{T}_{j_1, j_2, \cdots, j_N}}{\mathcal{T}^h_{j_1, j_2, \cdots, j_N}} - \mathcal{T}_{j_1, j_2, \cdots, j_N} + \mathcal{T}^h_{j_1, j_2, \cdots, j_N} \right) \qquad (2)$$

Denote the inner product of two tensors as $\langle \mathcal{T}, \mathcal{R} \rangle = \sum_{j_1, j_2, \cdots, j_N} \mathcal{T}_{j_1, j_2, \cdots, j_N} \mathcal{R}_{j_1, j_2, \cdots, j_N}$. Then the relative entropy can be rewritten as

$$\mathrm{Div}(\mathcal{T} \,\|\, \mathcal{T}^h) = \langle \mathcal{T}, \log \mathcal{T} \rangle - \langle \mathcal{T}, \log \mathcal{T}^h \rangle - \langle \mathbf{1}, \mathcal{T} \rangle + \langle \mathbf{1}, \mathcal{T}^h \rangle. \qquad (3)$$

To exploit specific properties of the data, such as sparseness and smoothness, some constraints may be imposed as additional conditions of the optimization problem. Using gradient descent, we derive the multiplicative learning rule

$$U^{(p)}_{k_p, j} \Leftarrow U^{(p)}_{k_p, j} \, \frac{\sum_{k_i,\, i \neq p} \left( \mathcal{T}_{k_1, \cdots, k_N} / \mathcal{T}^h_{k_1, \cdots, k_N} \right) \prod_{i \neq p} U^{(i)}_{k_i, j}}{\sum_{k_i,\, i \neq p} \prod_{i \neq p} U^{(i)}_{k_i, j}} \qquad (4)$$

The algorithm is implemented in an alternating fashion, updating one set of parameters while keeping the others unchanged.

3 Face Tensor Representation and Head Pose Estimation

Facial images can be formulated in the framework of tensor representation, with each mode representing a different factor. In this paper, we take
three factors into consideration: facial image, head pose, and person. Thus the facial tensor discussed in this paper consists of three modes. For a data set of facial images with $I_3$ people, $I_2$ poses per person, and $I_1$-dimensional facial images, the tensor $\mathcal{T}_{I_1 I_2 I_3}$ denotes all facial images. In the decomposition of $\mathcal{T}_{I_1 I_2 I_3}$, the $U^{(1)}_j$, $j = 1, 2, \cdots, N$, form the set of basis functions for facial images, while $U^{(2)}_j$ and $U^{(3)}_j$ hold the pose and person coefficients, respectively, with respect to the basis function $U^{(1)}_j$. Using all basis functions $U^{(1)}_j$, $j = 1, 2, \cdots, N$, together with the $k$-th pose coefficients $U^{(2)}_j(k)$ ($k = 1, \cdots, I_2$) and the $l$-th person coefficients $U^{(3)}_j(l)$, the facial image $\mathcal{T}_{k,l}$ is reconstructed as

$$\mathcal{T}^h_{k,l} = \sum_{j=1}^{N} U^{(1)}_j \, U^{(2)}_j(k) \, U^{(3)}_j(l). \qquad (5)$$

Therefore, the parameters $U^{(2)}_j(k)$ and $U^{(3)}_j(l)$, $j = 1, 2, \cdots, N$, are regarded as the feature representation on the tensor basis of $\mathcal{T}_{I_1 I_2 I_3}$, namely $U^{(1)}_j$, $j = 1, 2, \cdots, N$.

To find the tensor representation of an input image, we use the least-squares error method. Given a new facial image $X$, we attempt to find the coefficients $S = (s_1, s_2, \cdots, s_N)^T$ of the tensor representation

$$X = \sum_{j=1}^{N} s_j \, U^{(1)}_j. \qquad (6)$$

The least-squares solution of the above equation is $S = A^{-1} B$, where $A = (A_{i,j}) = \langle U^{(1)}_i, U^{(1)}_j \rangle$ and $B = (B_j) = \langle U^{(1)}_j, X \rangle$.
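As a concrete illustration, the least-squares solution $S = A^{-1}B$ amounts to solving the normal equations. Below is a minimal NumPy sketch, where a random matrix merely stands in for the learned basis $U^{(1)}$; the sizes and names are illustrative, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 1080, 20                 # pixel dimension and number of basis images (toy sizes)
U1 = rng.random((D, N))         # columns play the role of the basis functions U^(1)_j
x = rng.random(D)               # a new facial image, flattened to a vector

# A_{i,j} = <U^(1)_i, U^(1)_j> and B_j = <U^(1)_j, x>
A = U1.T @ U1
B = U1.T @ x
S = np.linalg.solve(A, B)       # coefficients S = A^{-1} B

# S minimizes ||x - U1 S||; compare with NumPy's own least-squares solver
S_ref, *_ = np.linalg.lstsq(U1, x, rcond=None)
```

Solving the linear system is numerically preferable to forming $A^{-1}$ explicitly.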
3.1 Head Pose Estimation

After obtaining the decomposition of the tensor $\mathcal{D}_{I_1 I_2 I_3}$, we can use the second mode of the model to estimate the poses of faces randomly selected from the testing set.

Given a facial image $I = U_1 S$, where $U_1$ denotes the basis functions and $S$ the representation of the image on the subspace spanned by the bases, $S$ is obtained by computing $S = (U_1^T U_1)^{-1} U_1^T I$. The similarity between $S$ and $U_{2,k}$ ($k = 1, 2, \cdots, I_2$) is then calculated by the correlation measure

$$\mathrm{Corr}(S, U_{2,k}) = \frac{\langle S, U_{2,k} \rangle}{\sqrt{\langle S, S \rangle \, \langle U_{2,k}, U_{2,k} \rangle}}. \qquad (7)$$

For facial images with different head poses, the feature representations of $U_{2,k}$ in $S$ differ. Therefore, head poses can be easily estimated by maximizing the similarity between $S$ and $U_{2,k}$.
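This decision rule is easy to sketch. In the toy NumPy fragment below, a random matrix stands in for the learned pose coefficients $U_{2,k}$, and the representation $S$ is constructed near one pose column; the helper name is ours, not the paper's:

```python
import numpy as np

def estimate_pose(S, U2):
    """Pick the pose k that maximizes the correlation Corr(S, U_{2,k}) of Eq. (7)."""
    corr = [S @ U2[:, k] / np.sqrt((S @ S) * (U2[:, k] @ U2[:, k]))
            for k in range(U2.shape[1])]
    return int(np.argmax(corr))

rng = np.random.default_rng(1)
U2 = rng.random((20, 7))                 # hypothetical pose coefficients, 7 poses
S = U2[:, 3] + 0.01 * rng.random(20)     # a representation close to pose index 3
pose = estimate_pose(S, U2)              # should recover pose index 3
```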
.

4 Simulations and Results

In this section, we present simulation results to verify the performance of the computational model.

4.1 Face Database

The facial images are selected from the pose subset of the public CAS-PEAL-R1 face database [29], which contains images of 1040 subjects across 7 poses {±45°, ±30°, ±15°, 0°}, as shown in Fig. 2. In our experiments, we first crop the original facial images, guided by the eye positions, to patches of 300 × 250 pixels that include the whole face, and then resize them to 36 × 30 because of limited computational memory. The tensor $\mathcal{D}$ in Equation (1) is a 3-mode tensor over pixels, poses, and people. Because of limited computing resources, images of one hundred subjects across the seven poses are randomly selected, and the facial tensor $\mathcal{D}_{1080 \times 7 \times 100}$ is generated for finding the tensor basis functions. All other images in the pose subset are used as the testing set.

Fig. 2. Examples of facial images in the CAS-PEAL-R1 face database [29]

4.2 TensorFaces Representation

The tensor $\mathcal{D}$ is decomposed by the NMWF, yielding the basis functions, or "TensorFaces", shown in Fig. 3. Careful examination shows that the "TensorFaces" have favorable characteristics for representing the seven poses: when different facial images are projected onto the tensor basis, the resulting coefficients $S$ of the seven poses are clearly discriminative. Subsets of "TensorFaces" projected onto subspaces at different yaw angles are shown in Fig. 4.

Fig. 3. Basis functions decomposed by the NMWF from facial images across seven poses

Fig. 4. Subsets of "TensorFaces" in different poses. (Left) Yaw angle: -30°. (Right) Yaw angle: 0°.

4.3 Head Pose Estimation

Following the algorithm described in Section 3.1, we estimate the head poses of facial images randomly selected from the testing set; 500 facial images were used in the test stage. Sample pose features are shown in Fig. 5, and the accuracy of pose estimation is shown in Fig. 6.

Fig. 5. The left figure shows sample features of four poses (-45°, -15°, +15°, +45°) projected into a 2-dimensional space. The right figure shows sample features of seven poses (-45°, -30°, -15°, +00°, +15°, +30°, +45°) projected into a 3-dimensional space.

Fig. 6. Pose estimation results. (Left) The X-axis denotes test indexes, and the Y-axis pose estimation accuracy. (Right) The same results rearranged by pose; the X-axis denotes poses in the range {-45°, -30°, -15°, 0°, +15°, +30°, +45°}, and the Y-axis pose estimation accuracy per pose.

Compared with other methods, our method offers simplicity, effectiveness, and high accuracy. First, testing on the same CAS-PEAL face database, Ma [30] used a combination of an SVM classifier and LGBPH features for pose estimation and reported a best accuracy of 97.14%; using the same data, our worst accuracy is 98.6% and the average is 99.2%. Second, in preprocessing, Ma et al. applied geometric normalization, histogram equalization, and a facial mask, cropping the images to 64 × 64 pixels, whereas we simply cropped each image to include the head and resized it to 36 × 30 because of limited computing resources. Our preprocessing is therefore simpler and our results are better. Incidentally, if the facial images are not cropped, the resulting pose estimation accuracy is worse, about 90.4% on average. The training set, randomly selected from the face database, consists of 100 subjects with seven poses each; our supplementary simulations show that as the number of subjects in the training set increases, the accuracy of pose estimation improves further, approaching 100%.
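The training stage above (NMWF factorization of the nonnegative face tensor, Sect. 2.2) can be illustrated on synthetic data. The sketch below implements the alternating multiplicative KL-divergence updates of Eqs. (2)-(4) for a 3-way tensor with einsum; the toy sizes and implementation details are ours, not the authors':

```python
import numpy as np

rng = np.random.default_rng(2)
I1, I2, I3, R = 30, 7, 10, 5     # toy stand-ins for pixels x poses x people, rank R
T = np.einsum('ir,jr,kr->ijk', rng.random((I1, R)),
              rng.random((I2, R)), rng.random((I3, R)))   # synthetic nonnegative tensor

U = [rng.random((n, R)) + 0.1 for n in (I1, I2, I3)]      # factor matrices U^(1..3)
eps = 1e-12

def reconstruct(U):
    return np.einsum('ir,jr,kr->ijk', *U)

def kl_div(T, Th):               # relative entropy of Eq. (2)
    return np.sum(T * np.log((T + eps) / (Th + eps)) - T + Th)

before = kl_div(T, reconstruct(U))
for _ in range(50):
    for p in range(3):           # update one factor, keep the others fixed
        Th = reconstruct(U)
        others = [U[q] for q in range(3) if q != p]
        rest = 'ijk'.replace('ijk'[p], '')
        num = np.einsum(f'ijk,{rest[0]}r,{rest[1]}r->{"ijk"[p]}r',
                        T / (Th + eps), *others)           # numerator of Eq. (4)
        den = others[0].sum(0) * others[1].sum(0)          # denominator of Eq. (4)
        U[p] *= num / (den + eps)
after = kl_div(T, reconstruct(U))  # the divergence decreases over the iterations
```

The multiplicative form keeps all factors nonnegative, since each update multiplies the current entries by a nonnegative ratio.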
5 Discussions and Conclusions

In this paper, we proposed a computational model for head pose estimation based on tensor decomposition. Using the NMWF algorithm, facial images are decomposed into three factors: the "TensorFaces", poses, and people. These facial images are then projected onto the subspace spanned by the "TensorFaces", and the resulting representations can be used for pose estimation and, further, for face recognition. Taking the correlation between the representation of a facial image and the pose components as the similarity measure, the head poses of facial images in the public CAS-PEAL database are estimated. The simulation results show that the model is very efficient.

Acknowledgments. The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301) and the National High-Tech Research Program of China (Grant No. 2006AA01Z125). Portions of the research in this paper use the CAS-PEAL face database collected under the sponsorship of the Chinese National Hi-Tech Program and ISVISION Tech. Co. Ltd.

References

1. Bichsel, M., Pentland, A.: Automatic interpretation of human head movements. In: 13th International Joint Conference on Artificial Intelligence (IJCAI), Workshop on Looking At People, Chambery, France (1993)
2. McKenna, S., Gong, S.: Real time face pose estimation. Real-Time Imaging 4(5), 333–347 (1998)
3. Heinzmann, J., Zelinsky, A.: 3D facial pose and gaze point estimation using a robust real-time tracking paradigm. In: Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 142–147 (1998)
4. Bro, R.: PARAFAC. Tutorial and applications. Chemom. Intell. Lab. Syst. 38, 149–171 (1997)
5. Xu, M., Akatsuka, T.: Detecting head pose from stereo image sequence for active face recognition. In: Proceedings of the 3rd International Conference on Face & Gesture Recognition, pp. 82–87 (1998)
6. Hu, Y.X., Chen, L.B., Zhou, Y., Zhang, H.J.: Estimating face pose by facial asymmetry and geometry. In: Proceedings of Automatic Face and Gesture Recognition, pp. 651–656 (2004)
7. Huang, J., Shao, X., Wechsler, H.: Face pose discrimination using support vector machines (SVM). In: Proceedings of the Fourteenth International Conference on Pattern Recognition, vol. 1, pp. 154–156 (1998)
8. Moon, H., Miller, M.: Estimating facial pose from a sparse representation. In: International Conference on Image Processing, vol. 1, pp. 75–78 (2004)
9. Yang, Z., Ai, H., Okamoto, T.: Multi-view face pose classification by tree-structured classifier. In: International Conference on Image Processing, vol. 2, pp. 358–361 (2005)
10. Guo, Y., Poulton, G., Li, J., Hedley, M.: Soft margin AdaBoost for face pose classification. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 221–224 (2003)
11. Baluja, S., Sahami, M., Rowley, H.A.: Efficient face orientation discrimination. In: International Conference on Image Processing, vol. 1, pp. 589–592 (2004)
12. Voit, M., Nickel, K., Stiefelhagen, R.: Multi-view head pose estimation using neural networks. In: The 2nd Canadian Conference on Computer and Robot Vision, pp. 347–352 (2005)
13. Kruger, V., Sommer, G.: Gabor wavelet networks for object representation. Journal of the Optical Society of America 19(6), 1112–1119 (2002)
14. Chen, L.B., Zhang, L., Hu, Y.X., Li, M.J., Zhang, H.J.: Head pose estimation using Fisher manifold learning. In: Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pp. 203–207 (2003)
15. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 586–591 (1991)
16. Martinez, A.M., Kak, A.C.: PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001)
17. Li, S.Z., Fu, Q.D., Gu, L., Scholkopf, B., Cheng, Y.M., Zhang, H.J.: Kernel machine based learning for multi-view face detection and pose estimation. International Conference on Computer Vision 2, 674–679 (2001)
18. Li, S.Z., Lv, X.G., Hou, X.W., Peng, X.H., Cheng, Q.S.: Learning multi-view face subspaces and facial pose estimation using independent component analysis. IEEE Transactions on Image Processing 14(6), 705–712 (2005)
19. Hu, N., Huang, W.M., Ranganath, S.: Head pose estimation by non-linear embedding and mapping. In: International Conference on Image Processing, vol. 2, pp. 342–345 (2005)
20. Raytchev, B., Yoda, I., Sakaue, K.: Head pose estimation by nonlinear manifold learning. In: International Conference on Pattern Recognition, vol. 4, pp. 462–466 (2004)
21. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000)
22. Saul, L.K., Roweis, S.T.: Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4, 119–155 (2003)
23. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000)
24. Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face recognition by independent component analysis. IEEE Transactions on Neural Networks 13(6), 1450–1464 (2002)
25. Grimes, D.B., Rao, R.P.N.: Bilinear sparse coding for invariant vision. Neural Computation 17(1), 47–73 (2005)
26. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear analysis of image ensembles: TensorFaces. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 447–460. Springer, Heidelberg (2002)
27. Mørup, M., Hansen, L.K., Parnas, J., Arnfred, S.M.: Decomposing the time-frequency representation of EEG using non-negative matrix and multi-way factorization. Technical report (2006)
28. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
29. Gao, W., Cao, B., Shan, S., Zhou, D., Zhang, X., Zhao, D.: The CAS-PEAL large-scale Chinese face database and evaluation protocols. Technical Report No. JDL TR 04 FR 001, Joint Research & Development Laboratory, CAS (2004)
30. Ma, B.P., Zhang, W.C., Shan, S.G., Chen, X.L., Gao, W.: Robust head pose estimation using LGBP. In: Proceedings of the International Conference on Pattern Recognition (2), pp. 512–515 (2006)
Kernel Maximum a Posteriori Classification with Error Bound Analysis

Zenglin Xu, Kaizhu Huang, Jianke Zhu, Irwin King, and Michael R. Lyu

Dept. of Computer Science and Engineering, The Chinese Univ. of Hong Kong, Shatin, N.T., Hong Kong
{zlxu,kzhuang,jkzhu,king,lyu}@cse.cuhk.edu.hk

Abstract. Kernel methods have been widely used in data classification. Many kernel-based classifiers, such as Kernel Support Vector Machines (KSVM), assume that data can be separated by a hyperplane in the feature space; these methods do not consider the data distribution. This paper proposes a novel Kernel Maximum A Posteriori (KMAP) classification method, which adopts a Gaussian density assumption in the feature space and can be regarded as a more general classification method than other kernel-based classifiers such as Kernel Fisher Discriminant Analysis (KFDA). We also adopt robust methods for parameter estimation. In addition, the error bound analysis for KMAP indicates the effectiveness of the Gaussian density assumption in the feature space. Furthermore, KMAP achieves very promising results on eight UCI benchmark data sets against competitive methods.

1 Introduction

Recently, kernel methods have been regarded as state-of-the-art classification approaches [1]. The basic idea of kernel methods in supervised learning is to map data from an input space to a high-dimensional feature space in order to make the data more separable. Classical kernel-based classifiers include the Kernel Support Vector Machine (KSVM) [2], Kernel Fisher Discriminant Analysis (KFDA) [3], and the Kernel Minimax Probability Machine [4,5]. The rationale behind them is that linear discriminant functions in the feature space can represent complex separating surfaces when mapped back to the original input space. However, one drawback of KSVM is that it does not consider the data distribution and cannot directly output probabilities or confidences for classification. Therefore, it is hard to apply in systems that reason under uncertainty. On the other hand, in statistical pattern recognition, probability densities can be estimated from data; future examples are then assigned to the class with the Maximum A Posteriori probability (MAP) [6]. One typical probability density function is the Gaussian, which is easy to handle. However, the Gaussian assumption cannot be easily satisfied in the input space, where it is hard to deal with non-linearly separable problems.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 841–850, 2008. © Springer-Verlag Berlin Heidelberg 2008
To solve these problems, we propose a Kernel Maximum A Posteriori (KMAP) classification method under a Gaussianity assumption in the feature space. Different from KSVM, we make the Gaussian density assumption, which implies that data can be separated by more complex surfaces in the feature space. Generally, distributions other than the Gaussian can also be assumed in the feature space; however, under a distribution with a complex form it is hard to obtain a closed-form solution, and over-fitting becomes likely. Moreover, with the Gaussian assumption, a kernelized version of our model can be derived without knowing the explicit form of the mapping functions. In addition, to indicate the effectiveness of our assumption, we calculate a separability measure and the error bound for bi-category data sets. The error bound analysis shows that the Gaussian density assumption can be more easily satisfied in the feature space.

This paper is organized as follows. Section 2 derives the MAP decision rules in the feature space and analyzes their separability measures and upper error bounds. Section 3 presents the experiments against other classifiers. Section 4 reviews related work. Section 5 draws conclusions and lists possible future research directions.

2 Main Results

In this section, our MAP classification model is derived. Then, we adopt a special regularization to estimate the parameters, and the kernel trick is used to compute the model. Last, the separability measure and the error bound are calculated in the kernel-induced feature space.

2.1 Model Formulation

Under the Gaussian distribution assumption, the conditional density function for each class $C_i$ ($1 \le i \le m$) is

$$p(\Phi(x) \mid C_i) = \frac{1}{(2\pi)^{N/2} \, |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\Phi(x) - \mu_i)^T \Sigma_i^{-1} (\Phi(x) - \mu_i) \right), \qquad (1)$$

where $\Phi(x)$ is the image of $x$ in the feature space, $N$ is the dimension of the feature space ($N$ could be infinite), $\mu_i$ and $\Sigma_i$ are the mean and the covariance matrix of $C_i$, respectively, and $|\Sigma_i|$ is the determinant of the covariance matrix. According to the Bayesian theorem, the posterior probability of class $C_i$ is calculated by

$$P(C_i \mid x) = \frac{p(x \mid C_i) \, P(C_i)}{\sum_{j=1}^{m} p(x \mid C_j) \, P(C_j)}. \qquad (2)$$

Based on Eq. (2), the decision rule can be formulated as

$$x \in C_w \quad \text{if} \quad P(C_w \mid x) = \max_{1 \le j \le m} P(C_j \mid x). \qquad (3)$$
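A minimal sketch of the MAP rule in Eqs. (1)-(3), using plain 2-D Gaussians in the input space for readability; KMAP applies the same rule to $\Phi(x)$ in the kernel-induced feature space, and the toy data and function names below are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(3)
X0 = rng.normal([0.0, 0.0], 0.5, size=(100, 2))   # toy samples of class C_0
X1 = rng.normal([3.0, 3.0], 0.5, size=(100, 2))   # toy samples of class C_1

def fit_gaussian(X):
    return X.mean(axis=0), np.cov(X, rowvar=False)

params = [fit_gaussian(X0), fit_gaussian(X1)]
priors = [0.5, 0.5]

def map_class(x):
    # log posterior up to an additive constant:
    #   log P(C_i) - (1/2) [(x - mu_i)^T Sigma_i^{-1} (x - mu_i) + log|Sigma_i|]
    scores = []
    for (mu, Sigma), prior in zip(params, priors):
        d = x - mu
        maha = d @ np.linalg.solve(Sigma, d)
        scores.append(np.log(prior) - 0.5 * (maha + np.log(np.linalg.det(Sigma))))
    return int(np.argmax(scores))    # Eq. (3): assign the class with maximal posterior
```

Working in log space avoids underflow in the exponentials while leaving the argmax unchanged.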
This means that a test data point will be assigned to the class with the maximum of P (C
w |x), i.e., the MAP. Since the MAP is calculated in the kernel-induced feature space, the output model is named as the KMAP classification. KMAP can provide not only a class label but also the probability of a data point belonging to that class. This probability can be viewed as a confidence of classifying new data points and can be used in statistical systems that reason under uncertainty. If the confidence is lower than some specified threshold, the system can refuse to make an inference. However, many kernel learning methods including KSVM cannot output these probabilities. It can be further formulated as follows: g i
− μ i ) T Σ −1 i (Φ(x)
− μ i ) + log |Σ i |. (4) The intuitive meaning of the function is that a class is more likely assigned to an unlabeled data point, when the Mahalanobis distance from the data point to the class center is smaller. 2.2 Parameter Estimation In order to compute the Mahalanobis distance function, the mean vector and the covariance matrix for each class are required to be estimated. Typically, the mean vector (μ i ) and the within covariance matrix (Σ i ) are calculated by the maximum likelihood estimation. In the feature space, they are formulated as follows:
μ_i = (1/n_i) Σ_{j=1}^{n_i} Φ(x_j),   (5)

Σ_i = S_i = (1/n_i) Σ_{j=1}^{n_i} (Φ(x_j) − μ_i)(Φ(x_j) − μ_i)^T,   (6)

where n_i is the cardinality of the set of data points belonging to C_i.

Directly employing S_i as the covariance matrix will generate quadratic discriminant functions in the feature space. In this case, KMAP is denoted KMAP-M. However, the covariance estimation problem is clearly ill-posed, because the number of data points in each class is usually much smaller than the number of dimensions in the kernel-induced feature space. The treatment of this ill-posed problem is to introduce regularization. There are several kinds of regularization methods. One of them is to replace the individual within-class covariance matrices by their average, i.e., Σ_i = S = (1/m) Σ_{i=1}^{m} S_i + rI, where I is the identity matrix and r is a regularization coefficient. This method can substantially reduce the number of free parameters to be estimated. Moreover, it also reduces the discriminant function between two classes to a linear one. Therefore, a linear discriminant analysis method can be obtained.

Alternatively, we can estimate the covariance matrix by combining the above linear discriminant function with the quadratic one. Instead of estimating the
covariance matrix in the input space [7], we apply this method in the feature space. The formulation in the feature space is as follows:

Σ_i = (1 − η) Σ̃_i + η (trace(Σ̃_i) / n) I,   (7)

where Σ̃_i = (1 − θ) S_i + θ S.
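The shrinkage estimator of Eq. (7) can be sketched as follows; this is a minimal illustration with hypothetical names, where the matrix dimension plays the role of the denominator in the trace term.

```python
import numpy as np

def regularized_covariance(S_i, S_avg, theta, eta):
    """Two-level shrinkage of Eq. (7):
    Sigma_tilde = (1 - theta) * S_i + theta * S_avg
    Sigma_i     = (1 - eta) * Sigma_tilde + eta * trace(Sigma_tilde)/dim * I
    theta blends the class covariance with the averaged one; eta shrinks
    the result toward a multiple of the identity matrix.
    """
    Sigma_tilde = (1.0 - theta) * S_i + theta * S_avg
    dim = Sigma_tilde.shape[0]
    return ((1.0 - eta) * Sigma_tilde
            + eta * np.trace(Sigma_tilde) / dim * np.eye(dim))
```

Setting (theta, eta) to the corner values (0, 0), (1, 0), (0, 1), (1, 1) recovers the special cases discussed in Section 2.4.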
In the equations, θ (0 ≤ θ ≤ 1) is a coefficient linking the linear discriminant term and the quadratic discriminant one, and η (0 ≤ η ≤ 1) determines the shrinkage toward a multiple of the identity matrix. This approach is more flexible in adjusting the effect of the regularization. The corresponding KMAP is denoted KMAP-R.

2.3 Kernel Calculation

We derive methods to calculate the Mahalanobis distance (Eq. (4)) using the kernel trick, i.e., we only need to formulate the function in an inner-product form, regardless of the explicit mapping function. To do this, the spectral representation of the covariance matrix,

Σ_i = Σ_{j=1}^{N} Λ_ij Ω_ij Ω_ij^T,

where Λ_ij ∈ R is the j-th eigenvalue of Σ_i and Ω_ij ∈ R^N is the eigenvector relevant to Λ_ij, is utilized. However, the small eigenvalues will degrade the performance of the function overwhelmingly, because they are underestimated due to the small number of examples. In this paper, we only estimate the k largest eigenvalues and replace each remaining eigenvalue with a nonnegative number h_i. Thus Eq. (4) can be reformulated as follows:

g_i(Φ(x)) = (1/h_i) [ g_{1i}(Φ(x)) − g_{2i}(Φ(x)) ] + g_{3i}(Φ(x))
          = (1/h_i) ( Σ_{j=1}^{N} [Ω_ij^T (Φ(x) − μ_i)]^2 − Σ_{j=1}^{k} (1 − h_i/Λ_ij) [Ω_ij^T (Φ(x) − μ_i)]^2 ) + log( h_i^{N−k} Π_{j=1}^{k} Λ_ij ).   (8)

In the following, we show that g_{1i}(Φ(x)), g_{2i}(Φ(x)), and g_{3i}(Φ(x)) can all be written in a kernel form. To formulate these equations, we need to calculate the eigenvalues Λ_ij and eigenvectors Ω_ij. The eigenvectors lie in the space spanned by all the training samples, i.e., each eigenvector Ω_ij can be written as a linear combination of all the training samples:

Ω_ij = Σ_{l=1}^{n} γ_ij^{(l)} Φ(x_l) = U γ_ij,   (9)

where γ_ij = (γ_ij^{(1)}, γ_ij^{(2)}, . . . , γ_ij^{(n)})^T is an n-dimensional column vector and U = (Φ(x_1), . . . , Φ(x_n)).
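The coefficient vectors γ_ij and eigenvalues can be obtained from the kernel matrix alone. Below is a kernel-PCA-style sketch (our own construction, under the assumption that eigenpairs of the centered class kernel block recover the feature-space eigenpairs, as the paper states without proof); the normalization makes the implicit eigenvectors Ω_ij = U γ_ij unit-length.

```python
import numpy as np

def class_kernel_eigenpairs(K_i):
    """Eigenpairs (Lambda_ij, gamma_ij) from a class kernel block.

    K_i: (n_i x n_i) kernel matrix over the training points of class C_i.
    Center the block, eigendecompose, and rescale the coefficient vectors
    so that gamma^T Kc gamma = 1, i.e., the implicit feature-space
    eigenvector U gamma has unit norm.
    """
    n_i = K_i.shape[0]
    H = np.eye(n_i) - np.ones((n_i, n_i)) / n_i     # centering matrix
    Kc = H @ K_i @ H
    lam, V = np.linalg.eigh(Kc / n_i)               # ascending order
    lam, V = lam[::-1], V[:, ::-1]                  # largest first
    gammas = []
    for j in range(n_i):
        if lam[j] > 1e-12:                          # drop numerically-zero modes
            gammas.append(V[:, j] / np.sqrt(n_i * lam[j]))
    return lam, gammas
```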
It is easy to prove that γ_ij and Λ_ij are actually the eigenvector and eigenvalue of the covariance matrix of G^{(i)}, where G^{(i)} ∈ R^{n_i × N} is the i-th block of the kernel matrix G relevant to C_i. We omit the proof due to the limit of space. Accordingly, we can express g_{1i}(Φ(x)) in kernel form:

g_{1i}(Φ(x)) = Σ_{j=1}^{n} γ_ij^T U^T (Φ(x) − μ_i)(Φ(x) − μ_i)^T U γ_ij
            = Σ_{j=1}^{n} ( γ_ij^T ( K_x − (1/n_i) Σ_{l=1}^{n_i} K_{x_l} ) )^2
            = || K_x − (1/n_i) Σ_{l=1}^{n_i} K_{x_l} ||_2^2,   (10)

where K_x = (K(x_1, x), . . . , K(x_n, x))^T. In the same way, g_{2i}(Φ(x)) can be formulated as

g_{2i}(Φ(x)) = Σ_{j=1}^{k} (1 − h_i/Λ_ij) Ω_ij^T (Φ(x) − μ_i)(Φ(x) − μ_i)^T Ω_ij.   (11)

Substituting (9) into g_{2i}(Φ(x)), we have:

g_{2i}(Φ(x)) = Σ_{j=1}^{k} (1 − h_i/Λ_ij) γ_ij^T ( K_x − (1/n_i) Σ_{l=1}^{n_i} K_{x_l} ) ( K_x − (1/n_i) Σ_{l=1}^{n_i} K_{x_l} )^T γ_ij.   (12)
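Putting Eqs. (8), (10), and (12) together, the discriminant can be evaluated from kernel values alone. The sketch below assumes the eigenpairs have already been computed (e.g., as in the previous snippet); the function name and argument layout are ours.

```python
import numpy as np

def kmap_discriminant(K_x, K_block_i, lam, gammas, k, h_i, N):
    """Kernelized Mahalanobis-type discriminant, Eq. (8) via Eqs. (10)-(12).

    K_x       : vector (K(x_1, x), ..., K(x_n, x)) for a test point x
    K_block_i : rows of the kernel matrix for the n_i points of class C_i,
                shape (n_i, n); its column mean is (1/n_i) sum_l K_{x_l}
    lam       : leading eigenvalues Lambda_ij (length >= k)
    gammas    : coefficient vectors gamma_ij, each of length n
    """
    d = K_x - K_block_i.mean(axis=0)          # K_x - (1/n_i) sum_l K_{x_l}
    g1 = d @ d                                 # Eq. (10)
    g2 = sum((1.0 - h_i / lam[j]) * (gammas[j] @ d) ** 2
             for j in range(k))                # Eq. (12)
    # g3 = log( h_i^{N-k} * prod_{j<=k} Lambda_ij )
    g3 = (N - k) * np.log(h_i) + np.sum(np.log(lam[:k]))
    return (g1 - g2) / h_i + g3
```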
Now, the Mahalanobis distance function g_i(Φ(x)) in the feature space can finally be written in a kernel form, where N in g_{3i}(Φ(x)) is substituted by the cardinality of the data, n. The time complexity of KMAP is mainly due to the eigenvalue decomposition, which scales as O(n^3), as in KFDA.

2.4 Connection to Other Kernel Methods

In the following, we show the connection between KMAP and other kernel-based methods. In the regularization method based on Eq. (7), other kernel-based classification methods can be derived by varying the settings of θ and η. When (θ = 0, η = 0), the KMAP model represents a quadratic discriminant method in the kernel-induced feature space; when (θ = 1, η = 0), it represents a kernel discriminant method; and when (θ = 0, η = 1) or (θ = 1, η = 1), it represents the nearest mean classifier. Therefore, by varying θ and η, different models can be generated from different combinations of the quadratic discriminant, linear discriminant, and nearest mean methods.

We consider a special case of the regularization method when θ = 1 and η = 0. If both classes are assumed to have the same covariance structure for a binary class problem, i.e., Σ_i = (Σ_1 + Σ_2)/2, this leads to a linear discriminant function. Assuming all classes have the same class prior probabilities, g_i(Φ(x)) can be derived as

g_i(Φ(x)) = (Φ(x) − μ_i)^T ((Σ_1 + Σ_2)/2)^{−1} (Φ(x) − μ_i),  i = 1, 2.

Discarding the term common to both classes, we reformulate the above equation as g_i(Φ(x)) = w_i^T Φ(x) + b_i, where w_i = −4(Σ_1 + Σ_2)^{−1} μ_i and b_i = 2 μ_i^T (Σ_1 + Σ_2)^{−1} μ_i. The decision hyperplane is f(Φ(x)) = g_1(Φ(x)) − g_2(Φ(x)), i.e.,

f(Φ(x)) = (μ_1 − μ_2)^T (Σ_1 + Σ_2)^{−1} Φ(x) − (1/2)(μ_1 − μ_2)^T (Σ_1 + Σ_2)^{−1} (μ_1 + μ_2).   (13)
Eq. (13) is just the solution of KFDA [3]. Therefore, KFDA can be viewed as a special case of KMAP when all classes have the same covariance structure.

Remark. KMAP provides a rich class of kernel-based classification algorithms using different regularization methods. This makes KMAP a flexible framework for classification, adaptive to the data distribution.

2.5 Separability Measures and Error Bounds

To measure the separability of different classes of data in the feature space, the Kullback-Leibler divergence (a.k.a. K-L distance) between two Gaussians is adopted. The K-L divergence is defined as

d_KL[p_i(Φ(x)), p_j(Φ(x))] = ∫ p_i(Φ(x)) ln( p_i(Φ(x)) / p_j(Φ(x)) ) dΦ(x).   (14)

Since the K-L divergence is not symmetric, a two-way divergence is used to measure the distance between two distributions:

d_ij = d_KL[p_i(Φ(x)), p_j(Φ(x))] + d_KL[p_j(Φ(x)), p_i(Φ(x))].   (15)

Following [6], it can be proved that

d_ij = (1/2)(μ_i − μ_j)^T (Σ_i^{−1} + Σ_j^{−1})(μ_i − μ_j) + (1/2) trace( Σ_i^{−1} Σ_j + Σ_j^{−1} Σ_i − 2I ),   (16)

which can be calculated by using the trick in Section 2.3.

The Bayesian decision rule guarantees the lowest average error rate, as presented in the following:

P(correct) = Σ_{i=1}^{m} ∫_{R_i} p(Φ(x) | C_i) P(C_i) dΦ(x),   (17)

where R_i is the decision region of class C_i. An upper bound on the error rate can be obtained under the Gaussian density distribution function. Following [6], we have

P(error) ≤ P(C_1)^{1/2} P(C_2)^{1/2} exp(−q(0.5)),   (18)

where

q(0.5) = (1/8)(μ_2 − μ_1)^T ((Σ_1 + Σ_2)/2)^{−1} (μ_2 − μ_1) + (1/2) ln( |(Σ_1 + Σ_2)/2| / (|Σ_1|^{1/2} |Σ_2|^{1/2}) ).   (19)
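For two Gaussians with known parameters, the separability measure of Eq. (16) and the Bhattacharyya-type bound of Eqs. (18)-(19) can be computed directly; the sketch below works in a finite-dimensional space (function names are ours; in KMAP the quantities are obtained via the kernel trick).

```python
import numpy as np

def two_way_kl(mu_i, cov_i, mu_j, cov_j):
    """Symmetrized K-L divergence between two Gaussians, Eq. (16)."""
    dm = mu_i - mu_j
    Pi, Pj = np.linalg.inv(cov_i), np.linalg.inv(cov_j)
    I = np.eye(len(mu_i))
    return (0.5 * dm @ (Pi + Pj) @ dm
            + 0.5 * np.trace(Pi @ cov_j + Pj @ cov_i - 2 * I))

def bhattacharyya_bound(mu1, cov1, mu2, cov2, p1=0.5, p2=0.5):
    """Upper error bound of Eqs. (18)-(19) for two classes with priors p1, p2."""
    avg = 0.5 * (cov1 + cov2)
    dm = mu2 - mu1
    q = (dm @ np.linalg.inv(avg) @ dm / 8.0
         + 0.5 * np.log(np.linalg.det(avg)
                        / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2))))
    return np.sqrt(p1 * p2) * np.exp(-q)
```

Identical class distributions give zero divergence and the trivial bound 0.5; well-separated means drive the bound toward zero.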
Using the results in Section 2.3, the Bhattacharyya error bound can be easily calculated in the kernel-induced feature space.

3 Experiments

In this section, we report the experiments to evaluate the separability measure, the error bound, and the prediction performance of the proposed KMAP.

3.1 Synthetic Data

We compare the separability measure and error bounds on three synthetic data sets. The description of these data sets can be found in [8]. The data sets are named according to their characteristics, and they are plotted in Fig. 1. We map the data using the RBF kernel into a feature space where Gaussian distributions are approximately satisfied. We then calculate separability measures on all data sets according to Eq. (16). The separability values for Overlap, Bumpy, and Relevance in the original input space are 14.94, 5.16, and 22.18, respectively; the corresponding values in the feature space are 30.88, 5.87, and 3631. The results indicate that data become more separable after being mapped into the feature space, especially for the Relevance data set.

For data in the kernel-induced feature space, the error bounds are calculated according to Eq. (18). Figure 1 also plots the prediction rates and the upper error bounds for data in the input space and in the feature space, respectively. It can be observed that the error bounds are more valid in the feature space than in the input space.

3.2 Benchmark Data

Experimental Setup. In this experiment, KSVM, KFDA, Modified Quadratic Discriminant Analysis (MQDF) [9], and Kernel Fisher's Quadratic Discriminant Analysis (KFQDA) [10] are employed as the competing algorithms. We implement two variants of KMAP, i.e., KMAP-M and KMAP-R. The properties of the eight UCI benchmark data sets are described in Table 1. In all kernel methods, a Gaussian-RBF kernel is used. The parameter C of KSVM and the parameter γ in the RBF kernel are tuned by 10-fold cross validation. In KMAP, we select k pairs of eigenvalues and eigenvectors according to their contribution to the covariance matrix, i.e., the index set { j : λ_j / Σ_{q=1}^{n} λ_q ≥ α }; in MQDF, the range of k is relatively small and we select k by cross validation. PCA is used as the regularization method in KFQDA and the cumulative decay ratio is set to 99%; the regularization parameter r is set to 0.001 in KFDA.
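The eigenpair selection rule above (keep pairs whose contribution ratio reaches α) can be sketched in a few lines; this is an illustrative helper with a name of our choosing.

```python
import numpy as np

def select_k(eigvals, alpha):
    """Indices j with lambda_j / sum_q lambda_q >= alpha.

    eigvals: nonnegative eigenvalues (any order); the returned index set
    realizes the contribution-ratio rule used to choose k in KMAP.
    """
    ratios = eigvals / eigvals.sum()
    return np.flatnonzero(ratios >= alpha)
```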
(a) Overlap  (b) Bumpy  (c) Relevance  (d) Separability Comparison

Fig. 1. The data plots of Overlap, Bumpy and Relevance, and the comparison of data separability (prediction error and error bound, %) in the input space and the feature space (input_bound, input_rate, feature_bound, feature_rate).

Table 1. Data set information

Data Set   # Samples  # Features  # Classes
Iono           351        34          2
Breast         683         9          2
Twonorm       1000        21          2
Sonar          208        60          2
Pima           768         8          2
Iris           150         4          3
Wine           178        13          3
Segment        210        19          7

In both KMAP and MQDF, h takes the value of Λ_{k+1}. In KMAP-R, the extra parameters (θ, η) are tuned by cross-validation. All experimental results are obtained over 10 runs, and each run is executed with 10-fold cross validation for each data set.

Experimental Results. Table 2 reports the average prediction accuracy with standard errors on each data set for all algorithms. It can be observed that both variants of KMAP outperform MQDF, which is an MAP method in the input space. This also empirically validates that the separability among different classes of data becomes larger, and that the upper error bounds get tighter and more accurate, after data are mapped to the high-dimensional feature space. Moreover, the performance of KMAP is competitive with that of other kernel methods. In particular, KMAP-R achieves better prediction accuracy than all other methods on most of the data sets. The reason is that the regularization methods in KMAP favorably capture the prior distributions of
Table 2. The prediction results of KMAP and other methods

Data set     KSVM       MQDF       KFDA       KFQDA      KMAP-M     KMAP-R
Iono(%)      94.1±0.7   89.6±0.5   94.2±0.1   93.6±0.4   95.2±0.2   95.7±0.3
Breast(%)    96.5±0.4   96.5±0.1   96.4±0.1   96.5±0.1   96.5±0.1   97.5±0.1
Twonorm(%)   96.1±0.4   97.4±0.4   96.7±0.5   97.3±0.5   97.6±0.7   97.5±0.4
Sonar(%)     86.6±1.0   83.7±0.7   88.3±0.3   85.1±1.9   87.2±1.6   88.8±1.2
Pima(%)      77.9±0.7   73.1±0.4   71.0±0.5   74.1±0.5   75.4±0.7   75.5±0.4
Iris(%)      96.2±0.4   96.0±0.1   95.7±0.1   96.8±0.2   96.9±0.1   98.0±0.0
Wine(%)      98.8±0.1   99.2±1.3   99.1±0.1   96.9±0.7   99.3±0.1   99.3±0.6
Segment(%)   92.8±0.7   86.9±1.2   91.6±0.3   85.8±0.8   90.2±0.2   92.1±0.8
Average(%)   92.38      90.30      91.62      90.76      92.29      93.05

data, since the Gaussian assumption in the feature space can fit a very complex distribution in the input space.

4 Related Work

In statistical pattern recognition, the probability density function can first be estimated from data; future examples are then assigned to the class with the MAP. One typical example is the Quadratic Discriminant Function (QDF) [11], which is derived from the multivariate normal distribution and achieves the minimum mean error rate under a Gaussian distribution. In [9], a Modified Quadratic Discriminant Function (MQDF) less sensitive to estimation error is proposed. [7] improves the performance of QDF by covariance matrix interpolation. Unlike QDF, another type of classifier does not assume the probability density functions in advance, but is designed directly on data samples. An example is Fisher discriminant analysis (FDA), which maximizes the between-class covariance while minimizing the within-class variance. It can be derived as a Bayesian classifier under a Gaussian assumption on the data. [3] develops Kernel Fisher Discriminant Analysis (KFDA) by extending FDA to a non-linear space via the kernel trick.

To supplement the statistical justification of KFDA, [10] extends the maximum likelihood method and Bayes classification to their kernel generalization under a Gaussian Hilbert space assumption. The authors do not directly kernelize the quadratic forms in terms of kernel values. Instead, they use an explicit mapping function to map the data to a high-dimensional space; the kernel matrix is then used as the input data of FDA. The derived model is named Kernel Fisher's Quadratic Discriminant Analysis (KFQDA).

5 Conclusion and Future Work

In this paper, we present a novel kernel classifier named Kernel-based Maximum a Posteriori (KMAP), which implements a Gaussian distribution assumption in the kernel-induced feature space.
Compared with state-of-the-art classifiers, the advantages of KMAP include that the prior information of the distribution is incorporated and that it can output a probability or confidence in making a decision. Moreover, KMAP can be regarded as a more generalized classification method than other kernel-based methods such as KFDA. In addition, the error bound analysis illustrates that the Gaussian distribution assumption is more easily satisfied in the feature space than in the input space. More importantly, KMAP with proper regularization achieves very promising performance. We plan to incorporate the probability information into both the kernel function and the classifier in future work.

Acknowledgments. The work described in this paper is fully supported by two grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4205/04E and Project No. CUHK4235/04E).

References

1. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002)
2. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, Chichester (1998)
3. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.: Fisher discriminant analysis with kernels. In: Proceedings of the IEEE Neural Networks for Signal Processing Workshop, pp. 41–48 (1999)
4. Lanckriet, G.R.G., Ghaoui, L.E., Bhattacharyya, C., Jordan, M.I.: A robust minimax approach to classification. Journal of Machine Learning Research 3, 555–582 (2002)
5. Huang, K., Yang, H., King, I., Lyu, M.R., Chan, L.: Minimum error minimax probability machine. Journal of Machine Learning Research 5, 1253–1286 (2004)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience (2000)
7. Friedman, J.H.: Regularized discriminant analysis. Journal of the American Statistical Association 84(405), 165–175 (1989)
8. Centeno, T.P., Lawrence, N.D.: Optimising kernel parameters and regularisation coefficients for non-linear discriminant analysis. Journal of Machine Learning Research 7(2), 455–491 (2006)
9. Kimura, F., Takashina, K., Tsuruoka, S., Miyake, Y.: Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 9, 149–153 (1987)
10. Huang, S.Y., Hwang, C.R., Lin, M.H.: Kernel Fisher's discriminant analysis in Gaussian Reproducing Kernel Hilbert Space. Technical report, Academia Sinica, Taiwan, R.O.C. (2005)
11. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, San Diego (1990)

Comparison of Local Higher-Order Moment Kernel and Conventional Kernels in SVM for Texture Classification

Keisuke Kameyama

Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan
Keisuke.Kameyama@cs.tsukuba.ac.jp

Abstract. The use of the local higher-order moment kernel (LHOM kernel) in SVMs for texture classification was investigated by comparing it with SVMs using other conventional kernels. In the experiments, it became clear that SVMs with LHOM kernels achieve better trainability and give more stable responses to the texture classes than those with conventional kernels. Also, the number of support vectors was kept low, which indicates better class separability in the nonlinearly-mapped feature space.

Keywords: Support Vector Machine (SVM), Higher-Order Moment Spectra, Kernel Function, Texture Classification.

1 Introduction

Classification of image textures according to their local nature has various uses, among them scene understanding in robot vision, document processing, and diagnosis support in medical images. Generally, segmentation of textures relies on local features extracted by a local window. Multiple features are extracted from the windowed image, giving a feature vector of the local texture. The feature vector is further passed to a classifier, which maps it to a class label.

One of the common choices among local features is the local power spectrum of the signal. However, there are cases where second-order features are not sufficient. In such cases, higher-order features can be used to improve the classification ability [1]. However, when the orders of the features grow higher, exhaustive examination of the possible features, or an efficient search for an effective feature, becomes difficult due to the high dimensionality of the feature space. Thus, the use of moment and moment-spectrum features of higher order has been limited so far.

Recently, the use of high-dimensional feature spaces by so-called kernel-based methods has been actively examined. The Support Vector Machine (SVM) [2][3],

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 851–860, 2008. © Springer-Verlag Berlin Heidelberg 2008
which also benefits from this nature, has the potential to make high-dimensional signal features a realistic choice for supervised signal classification. In the literature, kernel functions corresponding to higher-order moments [4,5] and their spectra [6] have been investigated.

In this work, the effectiveness of using the local higher-order moment kernel (LHOM kernel) in SVMs for texture classification is examined by comparing it with SVMs using other conventional kernel functions. In the following, local higher-order moment (LHOM) features and SVMs are reviewed in Secs. 2 and 3, respectively. The definition of the LHOM kernel and its characteristics are given in Sec. 4. In Sec. 5, the setup and results of the texture classification experiments are presented, with discussion focused on the differences introduced by the kernel choice. Finally, Sec. 6 concludes the paper.

2 Local Moment and Local Moment Spectra

2.1 Feature Extraction by Local Power Spectrum

Let s(t) be a real-valued image signal defined on R^2. One of the most common ways to characterize a local signal is the local power spectrum. It can be obtained as the Fourier transform of the local second-order moment of signal s:

M(ω, x) = ∫ m_{s,w,2}(τ, x) exp(−j ω^T τ) dτ.   (1)

The integral is over R^2 where not specified. The local second-order moment of signal s is defined as

m_{s,w,2}(τ, x) = ∫ w(t) s(t + x) w(t + τ) s(t + x + τ) dt,   (2)

where w is the window function localizing the signal characterization. In this work, a Gaussian window

w(t, Σ) = (2π)^{−1} |Σ|^{−1/2} exp( −(1/2) t^T Σ^{−1} t )   (3)

is used. Here, Σ is a positive definite symmetric real 2 × 2 matrix that modifies the window shape, and the superscript T denotes the matrix transpose.

2.2 Local Higher-Order Moment Spectra (LHOMS)

Moment spectra of orders n higher than two, namely the bispectrum (n = 3), the trispectrum (n = 4), and so on, also contribute to characterizing the texture. The spectrum of order n reflects the correlation of n − 1 frequency components in complex values. Therefore it is possible, for example, to characterize the phase difference among signal components using the bispectrum, while this information is lost in the power spectrum.
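The local power spectrum of Eqs. (1)-(3) can be sketched for a discrete image as below. Rather than forming the local moment and transforming it, this hypothetical helper uses the equivalent route (the Wiener-Khinchin relation): window the signal with a Gaussian centered at x and take the squared magnitude of its Fourier transform. An isotropic window Σ = σ²I is assumed for simplicity.

```python
import numpy as np

def local_power_spectrum(s, x, sigma):
    """Local power spectrum of a 2-D array s around pixel position x.

    Multiplies s by a Gaussian window (isotropic case of Eq. (3)) centered
    at x, then returns |FFT|^2, which equals the Fourier transform of the
    windowed autocorrelation of Eq. (2).
    """
    rows, cols = np.indices(s.shape)
    d2 = (rows - x[0]) ** 2 + (cols - x[1]) ** 2
    w = np.exp(-0.5 * d2 / sigma ** 2) / (2 * np.pi * sigma ** 2)
    return np.abs(np.fft.fft2(w * s)) ** 2
```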
The n-th order local moment spectrum is the Fourier transform of the n-th order local moment. By defining Ω_{n−1} = [ω_1^T ω_2^T . . . ω_{n−1}^T]^T and T_{n−1} = [τ_1^T τ_2^T . . . τ_{n−1}^T]^T, we have

M_{s,w,n}(Ω_{n−1}, x) = ∫_{R^{2(n−1)}} m_{s,w,n}(T_{n−1}, x) exp(−j Ω_{n−1}^T T_{n−1}) dT_{n−1},   (4)

where

m_{s,w,n}(T_{n−1}, x) = ∫ w(t) s(t + x) Π_{k=1}^{n−1} { w(t + τ_k) s(t + x + τ_k) } dt.   (5)

The n-th order LHOMS, being a product of n − 1 frequency components and their sum-frequency component, makes the LHOMS an especially useful tool for characterizing the phase relations among the fundamental (ω_0) and harmonic (2ω_0, 3ω_0, . . .) frequency components.
3 Support Vector Machine and Kernel Functions

The Support Vector Machine (SVM) [2][3] is a supervised pattern recognition scheme with two significant features. First, the SVM realizes an optimal linear classifier (optimal hyperplane) in the feature space, theoretically grounded in Structural Risk Minimization (SRM). SVM learning achieves a linear classifier with minimum VC-dimension, thereby keeping the expected generalization error low. Second, the utilization of feature spaces of very high dimension is made possible by way of kernel functions.

The SVM takes an input vector x ∈ R^L and maps it by a predetermined feature extraction function to Φ(x) ∈ R^N. This feature vector is classified into one of two classes by a linear classifier

y = sgn(u(x)) = sgn(⟨w, Φ(x)⟩ + b),  y ∈ {−1, 1},   (6)

where w ∈ R^N and b ∈ R are the weight vector and the bias determining the placement of the discrimination hyperplane in the feature space, respectively. The parameters w and b are determined by supervised learning using a training set of ℓ input-output pairs {(x_i, y_i) ∈ R^L × {±1}}_{i=1}^{ℓ}.

Assuming that the training set mapped to the feature space, {Φ(x_i)}_{i=1}^{ℓ}, is linearly separable, the weight vector w is determined to maximize the margin between the two classes, by solving an optimization problem with inequality constraints: minimize (1/2)||w||^2 subject to {y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1}_{i=1}^{ℓ}. It can be solved by using the Lagrangian function L(w, b, α), minimizing it with respect to the variables w and b, and maximizing it with respect to the multipliers α = [α_1, . . . , α_ℓ].

When the training set is not linearly separable, the constraints can be relaxed to

minimize (1/2)||w||^2 + C Σ_{i=1}^{ℓ} ξ_i  subject to  {y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1 − ξ_i}_{i=1}^{ℓ},   (7)

utilizing so-called soft margins by introducing nonnegative slack variables {ξ_i}_{i=1}^{ℓ} and a regularization constant C > 0. In both cases, the function u for the classifier is obtained in the form

u(x) = Σ_{i=1}^{ℓ} y_i α_i ⟨Φ(x_i), Φ(x)⟩ + b.   (8)

Typically, the optimal solution of the original constrained problem resides where equalities hold for only a very small fraction of the inequality conditions. This leaves the multipliers for the other conditions at zero and makes the use of Eq. (8) practical even for a large ℓ. Training inputs with nonzero multipliers are called support vectors, hence the name SVM. The parameter b in Eq. (8) can be obtained from the (equality) constraints for one of the support vectors.

In both training and using the SVM, the feature extraction function Φ(x) is never explicitly evaluated. Instead, the inner product K(x, y) = ⟨Φ(x), Φ(y)⟩, called the kernel function, is always evaluated. Among the well-known conventional kernel functions are the polynomial kernel

K(x, y) = (⟨x, y⟩ + 1)^d,  (d = 1, 2, . . .)   (9)

and the radial-basis function (RBF) kernel

K(x, y) = exp(−||x − y||^2 / γ^2).  (γ ∈ R)   (10)

In some cases of feature extraction, evaluating ⟨Φ, Φ⟩ requires less computation than directly evaluating Φ. This nature enables the SVM to efficiently utilize feature spaces of very high dimension.

4 The LHOMS and LHOM Kernels

4.1 LHOMS Kernel Derivation

Here, a kernel function for the case where the LHOMS corresponds to the nonlinear feature map Φ is derived. This amounts to treating the LHOMS function as a feature vector of infinite dimension over all possible combinations of frequency vectors ω_1, . . . , ω_{n−1}.
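The two conventional kernels of Eqs. (9)-(10) are a few lines each; the sketch below uses hypothetical helper names and serves only to make the formulas concrete.

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """Polynomial kernel of Eq. (9): K(x, y) = (<x, y> + 1)^d."""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel of Eq. (10): K(x, y) = exp(-||x - y||^2 / gamma^2)."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return np.exp(-np.dot(diff, diff) / gamma ** 2)
```

Note the RBF kernel equals 1 when x = y and decays with the squared distance, so γ directly controls the effective neighborhood size.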
The inner product of the n-th order moment spectra of signals s(t) and v(t), localized at x and y respectively, is

K_{w,n}(s, v; x, y) = ⟨ M_{s,w,n}(ω_1, . . . , ω_{n−1}, x), M_{v,w,n}(ω_1, . . . , ω_{n−1}, y) ⟩
= ∫ · · · ∫ { S_w^*( Σ_{k=1}^{n−1} ω_k, x ) Π_{k=1}^{n−1} S_w(ω_k, x) } × { V_w^*( Σ_{k=1}^{n−1} ω_k, y ) Π_{k=1}^{n−1} V_w(ω_k, y) }^* dω_1 · · · dω_{n−1}
= (2π)^{2(n−1)} ∫ ( ∫ w(z) s(z + x) w(z + τ) v(z + y + τ) dz )^n dτ.   (11)

See [6] for the derivation. The function K_{w,n}(s, v; x, y) serves as the kernel function in the SVM for signal classification. Notably, the increase in computational cost due to the moment order can be avoided as far as the inner product is concerned.

In the literature, the inner product of the n-th order moments of two signals s and v, which omits the window w of Eq. (3), has been derived as

⟨ m_{s,n}, m_{v,n} ⟩ = ∫ ( ∫ s(z) v(z + τ) dz )^n dτ,   (12)

by an early work of McLaughlin and Raghu [7] for use in optical character recognition. Recently, it has been used in the context of modern kernel-based methods for signal and image recognition problems [4,5]. The non-localized and localized versions of this kernel are referred to as the HOM and LHOM kernels, respectively.

4.2 Equivalence of LHOMS and LHOM Kernels

The following clarifies an important relation between the LHOMS kernel and the LHOM kernel.

Theorem (Plancherel). Define the Fourier transform of a function f(x) ∈ L^2(R^N) as (F f)(ω) = ∫_{R^N} f(x) e^{−j⟨ω, x⟩} dx. Then ⟨F f, F g⟩ = (2π)^N ⟨f, g⟩ holds for functions f, g ∈ L^2(R^N) [8].

Corollary. Denoting the LHOMS and LHOM of an image s(t) as M_{s,w,n} and m_{s,w,n}, respectively,

⟨ M_{s,w,n}, M_{v,w,n} ⟩ = (2π)^{2(n−1)} ⟨ m_{s,w,n}, m_{v,w,n} ⟩.
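For discrete 1-D signals, the HOM kernel of Eq. (12) reduces to summing the n-th power of the cross-correlation over all lags. The sketch below is our own discrete analogue (it assumes circular shifts for simplicity, which the continuous formula does not specify).

```python
import numpy as np

def hom_kernel(s, v, n):
    """Discrete sketch of the n-th order HOM kernel, Eq. (12):
    <m_{s,n}, m_{v,n}> = sum_tau ( sum_z s(z) v(z + tau) )^n,
    using circular cross-correlation over all lags tau.
    """
    s, v = np.asarray(s, float), np.asarray(v, float)
    corr = np.array([np.dot(s, np.roll(v, -tau)) for tau in range(len(s))])
    return float(np.sum(corr ** n))
```

Since only this inner product is ever needed by the SVM, the cost does not blow up with the moment order n, which is the point of the derivation above.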