Lecture Notes in Computer Science
Fig. 1. The proposed data generation model

3 Factor Analysis Based on Data Generation Model

For given data x, if we keep the extra-class information and reduce the intra-class information as much as possible, we can expect better classification performance. In this respect, the proposed method is similar to the traditional LDA method. LDA finds a projection matrix that simultaneously maximizes the between-class scatter and minimizes the within-class scatter of the original data set. The proposed method, on the other hand, first estimates the intra-class information from the set of difference vectors between samples of the same class, and then excludes this intra-class information from the original data so as to keep the extra-class information. Therefore, compared to LDA, the proposed method does
550 M. Cho, D. Yoon, and H. Park

not need to compute an inverse of the within-class scatter matrix, and the number of basis vectors of the subspace does not depend on the number of classes. In addition, the subspaces of the proposed method are simple to obtain, and many variations of the method can be developed by exploiting various data generation models. In this section, we describe in detail how to obtain the subspace based on a simple linear generative model.

3.1 Intra-class Factor Loading

We first find the projection matrix Λ which represents the intra-class factor, instead of the intra-class factor loading matrix V, in order to obtain the intra-class information. For a given data set {x}, we calculate the difference vector δ between two data points from the same class, which can be written as

δ_kij = x_ki − x_kj = (W ξ_ki − W ξ_kj) + (V η_ki − V η_kj),   (4)

where x_ki and x_kj are data from a class C_k (k = 1, ..., K). Because x_ki and x_kj come from the same class, we can assume that the extra-class factor does not make much difference, and we ignore the first term W ξ_ki − W ξ_kj. We then obtain the approximate relationship

δ_kij ≈ V (η_ki − η_kj).   (5)

Based on this relationship, we try to find the factor loading matrix V. For the obtained data set

Δ = {δ_kij}, k = 1, ..., K; i = 1, ..., N; j = 1, ..., N,   (6)
where K is the number of classes and N is the number of data in each class, we apply PCA to obtain the principal components of Δ. The resulting matrix Λ of principal components of Δ spans the subspace which maximizes the variance of the intra-class factor η. The original data set X is projected onto this subspace to extract the intra-class factors Y_intra:

Y_intra = XΛ.   (7)

Note that Y_intra is a low-dimensional data set that contains the intra-class information of X. Using Y_intra, we reconstruct X_intra in the original dimension by

X_intra = Y_intra (Λ^T Λ)^{-1} Λ^T.   (8)

Note that X_intra keeps the intra-class information, which is not desirable for classification. To remove this undesirable information, we subtract X_intra from the original data set X. As a result, we get a new data set X̃:

X̃ = X − X_intra.   (9)
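As a concrete illustration, the intra-class removal of Eqs. (4)-(9) can be sketched in a few lines of NumPy. This is not the authors' code: the function name, the array layout (rows are samples, one array per class), and the use of SVD for PCA are our own assumptions; the function also returns Λ so the result can be inspected.

```python
import numpy as np

def intra_class_removal(classes, d_intra):
    """Sketch of Sect. 3.1: estimate the intra-class subspace from
    same-class difference vectors and subtract its contribution from
    the data. `classes` is a list of (N_k, D) arrays, one per class
    (an assumed layout). Returns (X_tilde, Lam)."""
    # Eqs. (4)-(6): all pairwise same-class differences x_ki - x_kj.
    deltas = []
    for Xk in classes:
        for i in range(len(Xk)):
            for j in range(len(Xk)):
                if i != j:  # zero self-differences carry no information
                    deltas.append(Xk[i] - Xk[j])
    Delta = np.asarray(deltas)

    # PCA on Delta: columns of Lam are the top principal directions.
    Dc = Delta - Delta.mean(axis=0)
    _, _, Vt = np.linalg.svd(Dc, full_matrices=False)
    Lam = Vt[:d_intra].T                      # shape (D, d_intra)

    X = np.vstack(classes)
    Y_intra = X @ Lam                         # Eq. (7)
    X_intra = Y_intra @ np.linalg.pinv(Lam)   # Eq. (8), least squares
    return X - X_intra, Lam                   # Eq. (9)
```

Since Λ has full column rank, the projection of the cleaned data back onto the intra-class subspace vanishes, which is exactly the intent of Eq. (9); for an orthonormal PCA basis the pseudo-inverse in Eq. (8) reduces to Λ^T.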
A Subspace Method Based on Data Generation Model 551
3.2 Extra-Class Factor Loading

Using the newly obtained data set X̃, we try to find the projection matrix Λ̃ which represents the extra-class factor, instead of the extra-class factor loading matrix W, so as to preserve the extra-class information as much as possible. To solve this problem, let us consider the data set X̃. Noting that a data point x_intra in the data set X_intra is a reconstruction from the intra-class factor, we can write the approximate relationship

x_intra ≈ V η.   (10)

By combining it with equation (3), a data point x̃ in the new data set X̃ can be rewritten as

x̃ = x − x_intra ≈ (W ξ + V η) − V η = W ξ.   (11)

From this, we can say that the new data set X̃ mainly carries the extra-class information, which we need to preserve as much as possible. Based on these considerations, we apply PCA to the new data set X̃ and obtain its principal component matrix Λ̃. By projecting X̃ onto these basis vectors,

X̃ Λ̃ = Z̃,   (12)

we obtain the data set Z̃, which has small intra-class variance and large extra-class variance. This data set is used for classification.

4 Experimental Results

To verify the proposed method, we performed comparison experiments on facial data sets against conventional methods: PCA, LDA, and the intra-person method. For classification, we apply the minimum distance method [10] with Euclidean distance. When finding the subspace for each method, we optimized the subspace dimension with respect to the classification rate on each data set.

4.1
Face Recognition

We first conducted a face recognition task on face images with different viewpoints, obtained from the FERET (Face Recognition Technology) database (http://www.itl.nist.gov/iad/humanid/feret/). Figure 2 shows some samples of the data set.

Fig. 2. Examples of the human face images with different viewpoints

We used 450 images of 50 subjects; each subject has 9 images of different poses spaced 15 degrees apart to the left and right. The left, right (±60 degrees), and frontal images are used for training, and the remaining 300 images are used for testing. The image size is 50 × 70, so the dimension of the raw input is 3500. For the LDA method, we first applied PCA to solve the small-sample-size problem and obtained 52-dimensional features; we then applied LDA and obtained 9-dimensional features. Similarly, for the proposed method, we first find an 83-dimensional subspace for the intra-class factor and an 8-dimensional subspace for the extra-class factor. In this case there are 50 classes, and the number of data in each class is very limited: 3 per class. The experimental results are shown in Table 1.

Table 1. Result on face image data

Method        Dimension  Classification Rate
PCA           117        97
LDA           9          99.66
Intra-Person  92         92.33
Proposed      8          100

The performance of the proposed method is perfect, and the other methods also give generally good results. Despite the 50 classes and the limited number of training data, the good results may be due to the intrinsically high between-class variation.

4.2 Pose Recognition

We conducted a pose recognition task with the same data set used in Section 4.1. Here the 450 images form 9 viewpoint classes from −60° to +60° at 15° intervals, and each class consists of 50 images from different persons. The 225 images composed of 25 images per class are used for training, and the remaining 225 images are used for testing. For the LDA method, we first applied PCA and obtained 167-dimensional features. We then applied LDA
and obtained 6-dimensional features. For the proposed method, we first find a 128-dimensional subspace for the intra-class factor and a 21-dimensional subspace for the extra-class factor. In this case there are 9 classes and 225 training data, composed of 25 data from each class. The results are shown in Table 2.

Table 2. Result on the human pose image data

Method        Dimension  Classification Rate
PCA           65
LDA           6          57.78
Intra-Person  51         38.67
Proposed      21         58.22

The performance is generally low, but the proposed method and LDA give much better performance than the PCA and intra-person methods. From the low performance, we can conjecture that the between-class variance is very small compared to the within-class variance. Nevertheless, the proposed method achieved the best performance.

4.3 Facial Expression Recognition

We also conducted a facial expression recognition task with a data set obtained from PICS (Psychological Image Collection at Stirling) (http://pics.psych.stir.ac.uk/). Figure 3 shows some facial expression image samples.

Fig. 3. Examples of the human facial expression images

We used 276 images of 69 persons; each person has 4 images with different expressions. The 80 images composed of 20 images per expression are used for training, and the remaining 196 images are used for testing. The image size is 80 × 90, so the dimension of the raw input is 7200. For the LDA method, we first applied PCA and obtained 59-dimensional features; we then applied LDA and obtained 3-dimensional features. For the proposed method, we first find a 48-dimensional subspace for the intra-class factor and a 14-dimensional subspace for the extra-class factor. In this case there are 4 classes and 20 training data in each class. The results are shown in Table 3.

Table 3. Result on facial expression image data

Method        Dimension  Classification Rate
PCA           65
LDA           3          65.31
Intra-Person  76         42.35
Proposed      14         66.33

Although the performances of all methods are generally low, the proposed method performed much better than the PCA and intra-person methods. As in the pose recognition task, it seems that the between-class variance is very small compared to the within-class variance. Nevertheless, the proposed method achieved the best performance.

5 Conclusions and Discussions

In this paper, we proposed a new subspace method based on a data generation model with class information, represented as intra-class factors and extra-class factors. By reducing the intra-class information in the original data and keeping the extra-class information using PCA, we obtain low-dimensional features that preserve essential information for the given classification problem. In experiments on various types of facial classification tasks, the proposed method showed better performance than conventional methods. As further study, it may be possible to find dimension-reduction methods more sophisticated than PCA that enlarge the extra-class information. Also, kernel methods could be applied to overcome the non-linearity problem.

Acknowledgements. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-311-D00807).

References

1. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(7), 711–720 (1997)
2. Alpaydin, E.: Machine Learning. MIT Press, Cambridge (2004)
3. Jaakkola, T., Haussler, D.: Exploiting Generative Models in Discriminative Classifiers. Advances in Neural Information Processing Systems, 487–493 (1998)
4. Fisher, R.A.: The Statistical Utilization of Multiple Measurements. Annals of Eugenics 8, 376–386 (1938)
5. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, London (1990)
6. Hinton, G.E., Zemel, R.S.: Autoencoders, Minimum Description Length and Helmholtz Free Energy. Advances in Neural Information Processing Systems 6, 3–10 (1994)
7. Hinton, G.E., Ghahramani, Z.: Generative Models for Discovering Sparse Distributed Representations. Philosophical Transactions of the Royal Society B 352, 1177–1190 (1997)
8. Lee, O., Park, H., Choi, S.: PCA vs. ICA for Face Recognition. In: The 2000 International Technical Conference on Circuits/Systems, Computers, and Communications, pp. 873–876 (2000)
9. Moghaddam, B., Jebara, T., Pentland, A.: Bayesian Modeling of Facial Similarity. Advances in Neural Information Processing Systems, 910–916 (1998)
10. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic Press, London (1979)
11. Martinez, A., Kak, A.: PCA versus LDA. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001)
12. Park, H., Cho, M.: Classification of Bio-data with Small Data Set Using Additive Factor Model and SVM. In: Hoffmann, A., Kang, B.-h., Richards, D., Tsumoto, S. (eds.) PKAW 2006. LNCS (LNAI), vol. 4303, pp. 770–779. Springer, Heidelberg (2006)
13. Chopra, S., Hadsell, R., LeCun, Y.: Learning a Similarity Metric Discriminatively, with Application to Face Verification. In: Proc. of the International Conference on Computer Vision and Pattern Recognition, pp. 539–546 (2005)
14. Ghahramani, Z.: Factorial Learning and the EM Algorithm. In: Advances in Neural Information Processing Systems, vol. 7, pp. 617–624 (1995)

Hierarchical Feature Extraction for Compact Representation and Classification of Datasets

Markus Schubert and Jens Kohlmorgen
Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany
{markus,jek}@first.fraunhofer.de
http://ida.first.fraunhofer.de

Abstract.
Feature extraction methods generally do not account for hierarchical structure in the data. For example, PCA and ICA provide transformations that depend solely on global properties of the overall dataset. We here present a general approach for extracting feature hierarchies from datasets and for using them for classification or clustering. A hierarchy of features extracted from a dataset thereby constitutes a compact representation of the set that, on the one hand, can be used to characterize and understand the data and, on the other hand, serves as a basis for classifying or clustering a collection of datasets. As a proof of concept, we demonstrate the feasibility of this approach with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording.

1 Introduction

The vast majority of feature extraction methods do not account for hierarchical structure in the data. For example, PCA [1] and ICA [2] provide transformations that depend solely on global properties of the overall data set. The ability to model the hierarchical structure of the data, however, might certainly help to characterize and understand the information contained in the data. For example, neural dynamics are often characterized by a hierarchical structure in space and time, where methods for hierarchical feature extraction might help to group and classify such data. A particular demand for these methods exists in EEG recordings, where slow dynamical components (sometimes interpreted as internal "state" changes) and the variability of features make data analysis difficult. Hierarchical feature extraction has so far mainly been related to 2-D pattern analysis. In these approaches, pioneered by Fukushima's work on the Neocognitron [3], the hierarchical structure is typically hard-wired a priori in the architecture, and the methods primarily apply to a 2-D grid structure.
There are, however, more recent approaches, like local PCA [4] or tree-dependent component analysis [5], that are promising steps towards structured feature extraction methods that also derive the structure from the data. While local PCA in [4] is not hierarchical and tree-dependent component analysis in [5] is restricted to the context of ICA, we here present a general approach for the extraction of feature hierarchies and

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 556–565, 2008. © Springer-Verlag Berlin Heidelberg 2008
their use for classification and clustering. We exemplify this by using PCA as the core feature extraction method.

In [6] and [7], hierarchies of two-dimensional PCA projections (using probabilistic PCA [8]) were proposed for the purpose of visualizing high-dimensional data. For obtaining the hierarchies, the selection of sub-clusters was performed either manually [6] or automatically by using a model selection criterion (AIC, MDL) [7], but in both cases based on two-dimensional projections. A 2-D projection of high-dimensional data, however, is often not sufficient to unravel the structure of the data, which may thus hamper both approaches, in particular if the sub-clusters get superimposed in the projection. In contrast, our method is based on hierarchical clustering in the original data space, where the structural information is unchanged and therefore undiminished. Also, the focus of this paper is not on visualizing the data itself, which is obviously limited to 2-D or 3-D projections, but rather on extracting the hierarchical structure of the data (which can be visualized by plotting trees) and on replacing the data by a compact hierarchical representation in terms of a tree of extracted features, which can be used for classification and clustering. The individual quantity to be classified or clustered in this context is a tree of features representing a set of data points. Note that classifying sets of points is a more general problem than the well-known problem of classifying individual data points. Other approaches to classifying sets of points can be found, e.g., in [9, 10], where the authors define a kernel on sets, which can then be used with standard kernel classifiers.

The paper is organized as follows. In section 2, we describe the hierarchical feature extraction method. In section 3, we show how feature hierarchies can be used for classification and clustering, and in section 4 we provide a proof of concept with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording. Section 5 concludes with a discussion.

2 Hierarchical Feature Extraction

We pursue a straightforward approach to hierarchical feature extraction that allows us to make any standard feature extraction method hierarchical: we perform hierarchical clustering of the data prior to feature extraction. The feature extraction method is then applied locally to each significant cluster in the hierarchy, resulting in a representation (or replacement) of the original dataset in terms of a tree of features.

2.1 Hierarchical Clustering

There are many known variants of hierarchical clustering algorithms (see, e.g., [11, 12]), which can be subdivided into divisive top-down procedures and agglomerative bottom-up procedures. More important than this procedural aspect, however, is the dissimilarity function that is used in most methods to quantify the dissimilarity between two clusters. This function is used as the criterion to
determine the clusters to be split (or merged) at each iteration of the top-down (or bottom-up) process. Thus, it is this function that determines the clustering result, and it implicitly encodes what a "good" cluster is. Common agglomerative procedures are single-linkage, complete-linkage, and average-linkage; they differ simply in that they use different dissimilarity functions [12]. We here use Ward's method [13], also called the minimum variance method, which is agglomerative and successively merges the pair of clusters that causes the smallest increase in the total sum of squared errors (SSE), where the error is defined as the Euclidean distance of a data point to its cluster mean. The increase in squared error caused by merging two clusters, D_i and D_j, is given by

d(D_i, D_j) = (n_i n_j) / (n_i + n_j) · ‖m_i − m_j‖²,   (1)

where n_i and n_j are the numbers of points in the two clusters, and m_i and m_j are the means of the points in the two clusters [12]. Ward's method can now simply be described as a standard agglomerative clustering procedure [11, 12] with the particular dissimilarity function d given in Eq. (1). We use Ward's criterion because it is based on a global fitness criterion (SSE) and because [11] reports that the method outperformed other hierarchical clustering methods in several comparative studies. Nevertheless, depending on the particular application, other criteria might be useful as well. The result of a hierarchical clustering procedure that successively splits or merges two clusters is a binary tree. At each hierarchy level, k = 1, ..., n, it de-