Lecture Notes in Computer Science
Fig. 1. The proposed data generation model

3 Factor Analysis Based on Data Generation Model

For given data x, if we keep the extra-class information and reduce the intra-class information as much as possible, we can expect better classification performance. In this respect, the proposed method is similar to the traditional LDA method. LDA finds a projection matrix that simultaneously maximizes the between-class scatter and minimizes the within-class scatter of the original data set. The proposed method, on the other hand, first estimates the intra-class information from the set of difference vectors between samples of the same class, and then excludes this intra-class information from the original data so as to keep the extra-class information. Therefore, compared to LDA, the proposed method does
550 M. Cho, D. Yoon, and H. Park

not need to compute an inverse of the within-class scatter matrix, and the number of basis vectors of the subspace does not depend on the number of classes. In addition, the subspaces of the proposed method are simple to obtain, and many variations of the method can be developed by exploiting various data generation models. In this section, we describe in detail how to obtain the subspace based on a simple linear generative model.

3.1 Intra-class Factor Loading

We first find the projection matrix Λ which represents the intra-class factor, instead of the intra-class factor loading matrix V, in order to obtain the intra-class information. For a given data set {x}, we calculate the difference vector δ between two data points from the same class, which can be written as

δ_kij = x_ki − x_kj = (W ξ_ki − W ξ_kj) + (V η_ki − V η_kj),   (4)

where x_ki and x_kj are data from a class C_k (k = 1, ..., K). Because x_ki and x_kj come from the same class, we can assume that the extra-class factor does not make much difference, and we ignore the first term W ξ_ki − W ξ_kj. We then obtain the approximate relationship

δ_kij ≈ V (η_ki − η_kj).   (5)

Based on this relationship, we try to find the factor loading matrix V. For the obtained data set

Δ = {δ_kij}, k = 1, ..., K; i = 1, ..., N; j = 1, ..., N,   (6)
where K is the number of classes and N is the number of data in each class, we apply PCA to obtain the principal components of Δ. The resulting matrix Λ of principal components of Δ spans the subspace which maximizes the variance of the intra-class factor η. The original data set X is projected onto this subspace to extract the intra-class factors Y_intra:

Y_intra = XΛ.   (7)

Note that Y_intra is a low-dimensional data set that contains the intra-class information of X. Using Y_intra, we reconstruct X_intra in the original dimension by

X_intra = Y_intra (Λ^T Λ)^{-1} Λ^T.   (8)

Note that X_intra keeps the intra-class information, which is not desirable for classification. To remove this undesirable information, we subtract X_intra from the original data set X. As a result, we get a new data set X̃:

X̃ = X − X_intra.   (9)
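As a concrete illustration, the intra-class removal of Eqs. (4)-(9) can be sketched in a few lines of NumPy. This is not the authors' code: the function name, the array layout (rows are samples, one array per class), and the use of SVD for PCA are our own assumptions; the function also returns Λ so the result can be inspected.

```python
import numpy as np

def intra_class_removal(classes, d_intra):
    """Sketch of Sect. 3.1: estimate the intra-class subspace from
    same-class difference vectors and subtract its contribution from
    the data. `classes` is a list of (N_k, D) arrays, one per class
    (an assumed layout). Returns (X_tilde, Lam)."""
    # Eqs. (4)-(6): all pairwise same-class differences x_ki - x_kj.
    deltas = []
    for Xk in classes:
        for i in range(len(Xk)):
            for j in range(len(Xk)):
                if i != j:  # zero self-differences carry no information
                    deltas.append(Xk[i] - Xk[j])
    Delta = np.asarray(deltas)

    # PCA on Delta: columns of Lam are the top principal directions.
    Dc = Delta - Delta.mean(axis=0)
    _, _, Vt = np.linalg.svd(Dc, full_matrices=False)
    Lam = Vt[:d_intra].T                      # shape (D, d_intra)

    X = np.vstack(classes)
    Y_intra = X @ Lam                         # Eq. (7)
    X_intra = Y_intra @ np.linalg.pinv(Lam)   # Eq. (8), least squares
    return X - X_intra, Lam                   # Eq. (9)
```

Since Λ has full column rank, the projection of the cleaned data back onto the intra-class subspace vanishes, which is exactly the intent of Eq. (9); for an orthonormal PCA basis the pseudo-inverse in Eq. (8) reduces to Λ^T.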
A Subspace Method Based on Data Generation Model 551
3.2 Extra-Class Factor Loading

Using the newly obtained data set X̃, we try to find the projection matrix Λ̃ which represents the extra-class factor, instead of the extra-class factor loading matrix W, so as to preserve the extra-class information as much as possible. To solve this problem, let us consider the data set X̃. Noting that a data point x_intra in the data set X_intra is a reconstruction from the intra-class factor, we can write the approximate relationship

x_intra ≈ V η.   (10)

By combining it with equation (3), a data point x̃ in the new data set X̃ can be rewritten as

x̃ = x − x_intra ≈ (W ξ + V η) − V η = W ξ.   (11)

From this, we can say that the new data set X̃ mainly carries the extra-class information, which we need to preserve as much as possible. Based on these considerations, we apply PCA to the new data set X̃ and obtain its principal component matrix Λ̃. By projecting X̃ onto these basis vectors,

X̃ Λ̃ = Z̃,   (12)

we obtain the data set Z̃, which has small intra-class variance and large extra-class variance. This data set is used for classification.

4 Experimental Results

To verify the proposed method, we performed comparison experiments on facial data sets against conventional methods: PCA, LDA, and the intra-person method. For classification, we apply the minimum distance method [10] with Euclidean distance. When finding the subspace for each method, we optimized the subspace dimension with respect to the classification rate on each data set.

4.1
Face Recognition

We first conducted a face recognition task on face images with different viewpoints, obtained from the FERET (Face Recognition Technology) database (http://www.itl.nist.gov/iad/humanid/feret/). Figure 2 shows some samples of the data set.

Fig. 2. Examples of the human face images with different viewpoints

We used 450 images of 50 subjects; each subject has 9 images of different poses spaced 15 degrees apart to the left and right. The left, right (±60 degrees), and frontal images are used for training, and the remaining 300 images are used for testing. The image size is 50 × 70, so the dimension of the raw input is 3500. For the LDA method, we first applied PCA to solve the small-sample-size problem and obtained 52-dimensional features; we then applied LDA and obtained 9-dimensional features. Similarly, for the proposed method, we first find an 83-dimensional subspace for the intra-class factor and an 8-dimensional subspace for the extra-class factor. In this case there are 50 classes, and the number of data in each class is very limited: 3 per class. The experimental results are shown in Table 1.

Table 1. Result on face image data

Method        Dimension  Classification Rate
PCA           117        97
LDA           9          99.66
Intra-Person  92         92.33
Proposed      8          100

The performance of the proposed method is perfect, and the other methods also give generally good results. Despite the 50 classes and the limited number of training data, the good results may be due to the intrinsically high between-class variation.

4.2 Pose Recognition

We conducted a pose recognition task with the same data set used in Section 4.1. Here the 450 images form 9 viewpoint classes from −60° to +60° at 15° intervals, and each class consists of 50 images from different persons. The 225 images composed of 25 images per class are used for training, and the remaining 225 images are used for testing. For the LDA method, we first applied PCA and obtained 167-dimensional features. We then applied LDA
and obtained 6-dimensional features. For the proposed method, we first find a 128-dimensional subspace for the intra-class factor and a 21-dimensional subspace for the extra-class factor. In this case there are 9 classes and 225 training data, composed of 25 data from each class. The results are shown in Table 2.

Table 2. Result on the human pose image data

Method        Dimension  Classification Rate
PCA           65
LDA           6          57.78
Intra-Person  51         38.67
Proposed      21         58.22

The performance is generally low, but the proposed method and LDA give much better performance than the PCA and intra-person methods. From the low performance, we can conjecture that the between-class variance is very small compared to the within-class variance. Nevertheless, the proposed method achieved the best performance.

4.3 Facial Expression Recognition

We also conducted a facial expression recognition task with a data set obtained from PICS (Psychological Image Collection at Stirling) (http://pics.psych.stir.ac.uk/). Figure 3 shows some facial expression image samples.

Fig. 3. Examples of the human facial expression images

We used 276 images of 69 persons; each person has 4 images with different expressions. The 80 images composed of 20 images per expression are used for training, and the remaining 196 images are used for testing. The image size is 80 × 90, so the dimension of the raw input is 7200. For the LDA method, we first applied PCA and obtained 59-dimensional features; we then applied LDA and obtained 3-dimensional features. For the proposed method, we first find a 48-dimensional subspace for the intra-class factor and a 14-dimensional subspace for the extra-class factor. In this case there are 4 classes and 20 training data in each class. The results are shown in Table 3.

Table 3. Result on facial expression image data

Method        Dimension  Classification Rate
PCA           65
LDA           3          65.31
Intra-Person  76         42.35
Proposed      14         66.33

Although the performances of all methods are generally low, the proposed method performed much better than the PCA and intra-person methods. As in the pose recognition task, it seems that the between-class variance is very small compared to the within-class variance. Nevertheless, the proposed method achieved the best performance.

5 Conclusions and Discussions

In this paper, we proposed a new subspace method based on a data generation model with class information, represented as intra-class factors and extra-class factors. By reducing the intra-class information in the original data and keeping the extra-class information using PCA, we obtain low-dimensional features that preserve essential information for the given classification problem. In experiments on various types of facial classification tasks, the proposed method showed better performance than conventional methods. As further study, it may be possible to find dimension-reduction methods more sophisticated than PCA that enlarge the extra-class information. Also, kernel methods could be applied to overcome the non-linearity problem.

Acknowledgements. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-311-D00807).

References

1. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(7), 711–720 (1997)
2. Alpaydin, E.: Machine Learning. MIT Press, Cambridge (2004)
3. Jaakkola, T., Haussler, D.: Exploiting Generative Models in Discriminative Classifiers. Advances in Neural Information Processing Systems, 487–493 (1998)
4. Fisher, R.A.: The Statistical Utilization of Multiple Measurements. Annals of Eugenics 8, 376–386 (1938)
5. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, London (1990)
6. Hinton, G.E., Zemel, R.S.: Autoencoders, Minimum Description Length and Helmholtz Free Energy. Advances in Neural Information Processing Systems 6, 3–10 (1994)
7. Hinton, G.E., Ghahramani, Z.: Generative Models for Discovering Sparse Distributed Representations. Philosophical Transactions of the Royal Society B 352, 1177–1190 (1997)
8. Lee, O., Park, H., Choi, S.: PCA vs. ICA for Face Recognition. In: The 2000 International Technical Conference on Circuits/Systems, Computers, and Communications, pp. 873–876 (2000)
9. Moghaddam, B., Jebara, T., Pentland, A.: Bayesian Modeling of Facial Similarity. Advances in Neural Information Processing Systems, 910–916 (1998)
10. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic Press, London (1979)
11. Martinez, A., Kak, A.: PCA versus LDA. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001)
12. Park, H., Cho, M.: Classification of Bio-data with Small Data Set Using Additive Factor Model and SVM. In: Hoffmann, A., Kang, B.-h., Richards, D., Tsumoto, S. (eds.) PKAW 2006. LNCS (LNAI), vol. 4303, pp. 770–779. Springer, Heidelberg (2006)
13. Chopra, S., Hadsell, R., LeCun, Y.: Learning a Similarity Metric Discriminatively, with Application to Face Verification. In: Proc. of the International Conference on Computer Vision and Pattern Recognition, pp. 539–546 (2005)
14. Ghahramani, Z.: Factorial Learning and the EM Algorithm. In: Advances in Neural Information Processing Systems, vol. 7, pp. 617–624 (1995)

Hierarchical Feature Extraction for Compact Representation and Classification of Datasets

Markus Schubert and Jens Kohlmorgen
Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany
{markus,jek}@first.fraunhofer.de
http://ida.first.fraunhofer.de

Abstract.
Feature extraction methods generally do not account for hierarchical structure in the data. For example, PCA and ICA provide transformations that depend solely on global properties of the overall dataset. We here present a general approach for extracting feature hierarchies from datasets and for using them for classification or clustering. A hierarchy of features extracted from a dataset thereby constitutes a compact representation of the set that, on the one hand, can be used to characterize and understand the data and, on the other hand, serves as a basis for classifying or clustering a collection of datasets. As a proof of concept, we demonstrate the feasibility of this approach with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording.

1 Introduction

The vast majority of feature extraction methods do not account for hierarchical structure in the data. For example, PCA [1] and ICA [2] provide transformations that depend solely on global properties of the overall data set. The ability to model the hierarchical structure of the data, however, might certainly help to characterize and understand the information contained in the data. For example, neural dynamics are often characterized by a hierarchical structure in space and time, where methods for hierarchical feature extraction might help to group and classify such data. A particular demand for these methods exists in EEG recordings, where slow dynamical components (sometimes interpreted as internal "state" changes) and the variability of features make data analysis difficult. Hierarchical feature extraction has so far mainly been related to 2-D pattern analysis. In these approaches, pioneered by Fukushima's work on the Neocognitron [3], the hierarchical structure is typically hard-wired a priori in the architecture, and the methods primarily apply to a 2-D grid structure.
There are, however, more recent approaches, like local PCA [4] or tree-dependent component analysis [5], that are promising steps towards structured feature extraction methods that also derive the structure from the data. While local PCA in [4] is not hierarchical and tree-dependent component analysis in [5] is restricted to the context of ICA, we here present a general approach for the extraction of feature hierarchies and

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 556–565, 2008. © Springer-Verlag Berlin Heidelberg 2008
their use for classification and clustering. We exemplify this by using PCA as the core feature extraction method.

In [6] and [7], hierarchies of two-dimensional PCA projections (using probabilistic PCA [8]) were proposed for the purpose of visualizing high-dimensional data. For obtaining the hierarchies, the selection of sub-clusters was performed either manually [6] or automatically by using a model selection criterion (AIC, MDL) [7], but in both cases based on two-dimensional projections. A 2-D projection of high-dimensional data, however, is often not sufficient to unravel the structure of the data, which may thus hamper both approaches, in particular if the sub-clusters get superimposed in the projection. In contrast, our method is based on hierarchical clustering in the original data space, where the structural information is unchanged and therefore undiminished. Also, the focus of this paper is not on visualizing the data itself, which is obviously limited to 2-D or 3-D projections, but rather on extracting the hierarchical structure of the data (which can be visualized by plotting trees) and on replacing the data by a compact hierarchical representation in terms of a tree of extracted features, which can be used for classification and clustering. The individual quantity to be classified or clustered in this context is a tree of features representing a set of data points. Note that classifying sets of points is a more general problem than the well-known problem of classifying individual data points. Other approaches to classifying sets of points can be found, e.g., in [9, 10], where the authors define a kernel on sets, which can then be used with standard kernel classifiers.

The paper is organized as follows. In section 2, we describe the hierarchical feature extraction method. In section 3, we show how feature hierarchies can be used for classification and clustering, and in section 4 we provide a proof of concept with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording. Section 5 concludes with a discussion.

2 Hierarchical Feature Extraction

We pursue a straightforward approach to hierarchical feature extraction that allows us to make any standard feature extraction method hierarchical: we perform hierarchical clustering of the data prior to feature extraction. The feature extraction method is then applied locally to each significant cluster in the hierarchy, resulting in a representation (or replacement) of the original dataset in terms of a tree of features.

2.1 Hierarchical Clustering

There are many known variants of hierarchical clustering algorithms (see, e.g., [11, 12]), which can be subdivided into divisive top-down procedures and agglomerative bottom-up procedures. More important than this procedural aspect, however, is the dissimilarity function that is used in most methods to quantify the dissimilarity between two clusters. This function is used as the criterion to
determine the clusters to be split (or merged) at each iteration of the top-down (or bottom-up) process. Thus, it is this function that determines the clustering result, and it implicitly encodes what a "good" cluster is. Common agglomerative procedures are single-linkage, complete-linkage, and average-linkage; they differ simply in that they use different dissimilarity functions [12]. We here use Ward's method [13], also called the minimum variance method, which is agglomerative and successively merges the pair of clusters that causes the smallest increase in the total sum of squared errors (SSE), where the error is defined as the Euclidean distance of a data point to its cluster mean. The increase in squared error caused by merging two clusters, D_i and D_j, is given by

d(D_i, D_j) = (n_i n_j) / (n_i + n_j) · ‖m_i − m_j‖²,   (1)

where n_i and n_j are the numbers of points in the two clusters, and m_i and m_j are the means of the points in the two clusters [12]. Ward's method can now simply be described as a standard agglomerative clustering procedure [11, 12] with the particular dissimilarity function d given in Eq. (1). We use Ward's criterion because it is based on a global fitness criterion (SSE) and because [11] reports that the method outperformed other hierarchical clustering methods in several comparative studies. Nevertheless, depending on the particular application, other criteria might be useful as well. The result of a hierarchical clustering procedure that successively splits or merges two clusters is a binary tree. At each hierarchy level, k = 1, ..., n, it de-