Dissimilarities and matching between symbolic objects prof. Donato Malerba


Download 445 b.
Sana21.07.2018
Hajmi445 b.


DISSIMILARITIES AND MATCHING BETWEEN SYMBOLIC OBJECTS

  • Prof. Donato Malerba

  • Department of Informatics,

  • University of Bari, Italy

  • malerba@di.uniba.it

  • ASSO School

  • Athens, Greece

  • October 6-8, 2003


COMPUTING DISSIMILARITIES: WHY?

  • Several data analysis techniques are based on quantifying a dissimilarity (or similarity) measure between multivariate data.

    • Clustering
    • Discriminant analysis
    • Visualization-based approaches
  • Symbolic objects are a kind of multivariate data.

  • Ex.: [colour={red, black}][weight  {60,70,80}][height  []1.50,1.60]

  • The dissimilarity measures presented here are among those investigated in the ASSO Project.



A case study



The construction of SO



TABLE OF BOOLEAN SYMBOLIC OBJECTS



COMPUTATION OF DISSIMILARITIES BETWEEN SYMBOLIC OBJECTS

  • Dissimilarity matrix



The MID property



The MID property



BOOLEAN SYMBOLIC OBJECTS (BSO’S)

  • A BSO is a conjunction of boolean elementary events:

  • [Y1=A1]  [Y2=A2]  ...  [Yp=Ap]

  • where each variable Yi takes values in Yi and Ai is a subset of Yi

  • Let a and b be two BSO’s:

  • a = [Y1=A1]  [Y2=A2]  ...  [Yp=Ap]

  • b = [Y1=B1]  [Y2=B2]  ...  [Yp=Bp]

  • where each variable Yj takes values in Yj and Aj and Bj are subsets of Yj. We are interested to compute the dissimilarity d(a,b).



CONSTRAINED BSO’S

  • Two types of dependencies between variables:

  • Hierarchical dependence (mother-daughter): A variable Yi may be inapplicable if another variable Yj takes its values in a subset Sj  Yj. This dependence is expressed as a rule:

  • if [Yj = Sj] then [Yi = NA]

  • Logical dependence: This case occurs, if a subset

  • Sj  Yj of a variable Yj is related to a subset Si  Yi of a variable Yi by a rule such as:

  • if [Yj = Sj] then [Yi = Si]



DISSIMILARITY AND SIMILARITY MEASURES

    • Dissimilarity Measure
    • d: EER such that d*a = d(a,a) d(a,b) = d(b,a) <a,bE
    • Similarity Measure
    • s: EE R such that s*a = s(a,a) s(a,b) = s(b,a) 0a,bE
    • Generally:
    • aE: d*a = d* and s*a= s* and specifically, d* = 0 while s*= 1
    • Dissimilarity measures can be transformed into similarity measures (and viceversa):
    • d=(s) ( s=-1(d) )
    • where:
    • (s) strictly decreasing function, and (1) = 0, (0) = 


DISSIMILARITY AND SIMILARITY MEASURES: PROPERTIES



DISSIMILARITY MEASURES BETWEEN BSO’S

  • Author(s) (Year)  Notation from the SODAS Package

  • Gowda & Diday (1991)  U_1

  • Ichino & Yaguchi (1994)  U_2, U_3, U_4

  • De Carvalho (1994)  SO_1, SO_2

  • De Carvalho (1996, 1998)  SO_3, SO_4, SO_5, C_1

  • U: only for unconstrained BSO’s

  • C: only for constrained BSO’s

  • SO: for both constrained and unconstrained BSO’s



GOWDA & DIDAY’S DISSIMILARITY MEASURE

  • Gowda & Diday’s dissimilarity measures for two BSO’s a and b:

  • U_1



GOWDA & DIDAY’S DISSIMILARITY MEASURE

  • Properties:

  • D(a, b) = 0  a = b (definiteness property),

  • No proof is reported for the triangle inequality property



ICHINO & YAGUCHI’S DISSIMILARITY MEASURES

  • Ichino & Yaguchi’s dissimilarity measures are based on the Cartesian operators join  and meet .

  • For continuous variables:

  • Aj Bj

  • Aj Bj

  • while for nominal variables:

  • Aj Bj = Aj Bj

  • Aj Bj = Aj Bj

  • Given a pair of subsets (Aj, Bj) of Yj the componentwise dissimilarity(Aj,Bj) is:

  • (Aj, Bj) =Aj BjAj Bj+ (2Aj BjAjBj)

  • where 0    0.5 and Ajis defined depending on variable types.



ICHINO & YAGUCHI’S DISSIMILARITY MEASURES

  • (Aj,Bj) are aggregated by an aggregation function such as the generalised Minkowski’s distance of order q:

  • U_2

  • Drawback: dependence on the chosen units of measurements.

  • Solution: normalization of the componentwise dissimilarity:

  • U_3

  • The weighted formulation guarantees that dq(a,b)[0,1].

  • U_4



DE CARVALHO’S DISSIMILARITY MEASURES

  • A straightforward extension of similarity measures for classical data matrices with nominal variables.

  • where (Vj) is either the cardinality of the set Vj (if Yj is a nominal variable) or the length of the interval Vj (if Yj is a continuous variable).



DE CARVALHO’S DISSIMILARITY MEASURES

  • Five different similarity measures si, i = 1, ..., 5, are defined:

  • The corresponding dissimilarities are di = 1  si.

  • The di are aggregated by an aggregation function AF such as the generalised Minkowski metric, thus obtaining:

  • SO_1



DE CARVALHO’S EXTENSION OF ICHINO & YAGUCHI’S DISSIMILARITY MEASURE

  • A different componentwise dissimilarity measure:

  • where  is defined as in Ichino & Yaguchi’s dissimilarity measure.

  • The aggregation function AF suggested by De Carvalho is:

  • SO_2



THE DESCRIPTION-POTENTIAL APPROACH

  • All dissimilarity measures considered so far are defined by two functions: a comparison function (componentwise measure) and an aggregation function.

  • A different approach is based on the concept of description potential (a) of a symbolic object a.

  • where (Vj) is either the cardinality of the set Vj (if Yj is a nominal variable) or the length of the interval Vj (if Yj is a continuous variable).



THE DESCRIPTION-POTENTIAL APPROACH

  • SO_3

  • SO_4

  • SO_5

  • The triangular inequality does not hold for SO_3 and SO_4, which are equivalent. SO_5 is a metric.



DESCRIPTION POTENTIAL FOR CONSTRAINED BSO’S

  • Given a BSO a and a logical dependence expressed by the rule:

  • if [Yj = Sj] then [Yi = Si]

  • the incoherent restriction a’ of a is defined as:

  • a’= [Y1=A1]  ...  [Yj-1=Aj-1]  [Yj=Aj Sj]  ...  [Yi-1=Ai-1]  [Yi=Ai (Yi\Si)]  ...  [Yp=Ap]

  • Then the description potential of a is:

  • A similar extension exists for hierarchical dependencies.



DISSIMILARITY MEASURES FOR CONSTRAINED BSO’S

  • The extended definition of description potential can be applied to the computation of the distances SO_3, SO_4 and SO_5.

  • De Carvalho proposed an extension of ’, so that SO_2 can also be applied to constrained BSO.

  • He also proposed an extension of , , , and  in order to take into account of constraints. Therefore, SO_1 can also be applied to constrained BSO.

  • Finally, C_1 is defined as follows:

  • where:

  • If all BSO’s are coherent, then the dissimilarity measures do not change.



DISSIMILARITY MEASURES FOR CONSTRAINED BSO’S

  • The extended definition of description potential can be applied to the computation of the distances SO_3, SO_4 and SO_5.

  • De Carvalho proposed an extension of ’, so that SO_2 can also be applied to constrained BSO:

  • where:



DISSIMILARITY MEASURES FOR CONSTRAINED BSO’S



DISSIMILARITY MEASURES FOR CONSTRAINED BSO’S



DISSIMILARITY MEASURES FOR CONSTRAINED BSO’S



DISSIMILARITY MEASURES FOR CONSTRAINED BSO’S



DISSIMILARITY MEASURES FOR CONSTRAINED BSO’S

  • The previous extension of , ,  in order to take into account of constraints, can be used in SO_1.

  • Finally, C_1 is defined as follows:

  • where:

  • If all BSO’s are coherent, then the dissimilarity measures do not change.



MATCHING

  • Matching is the process of comparing two or more structures to discover their similarities or differences.

  • Similarity judgements in the matching process are directional: They have a

  • referent, a, a prototype or the description of a class of objects

  • subject, b, a variant of the prototype or an instance of a class of objects.

  • Matching two structures is a common problem to many domains, like symbolic classification, pattern recognition, data mining and expert systems.



MATCHING BSO’S

  • Generally, a BSO represents a class description and plays the role of the referent in the matching process.

  • a: [color = {black, white}]  [height =[170, 200]]

  • describes a set of individuals either black or white, whose height is in the interval [170,200]. Such a set of individuals is called extension of the BSO. The extension is a subset of the universe  of individuals

  • Given two BSO’s a and b, the matching operators define whether b is the description of an individual in the extension of a.

  • In the ASSO software two matching operators for BSO’s have been defined.



CANONICAL MATCHING OPERATOR

  • The result of the canonical matching operator is either 0 (false) or 1 (true).

  • If E denotes the space of BSO’s described by a set of p variables Yi taking values in the corresponding domains Yi, then the matching operator is a function:

  • Match: E × E  {0, 1}

  • such that for any two BSO’s a, b  E:

  • a = [Y1=A1]  [Y2=A2]  ...  [Yp=Ap]

  • b = [Y1=B1]  [Y2=B2]  ...  [Yp=Bp]

  • it happens that:

    • Match(a,b) = 1 if BiAi for each i=1, 2, , p,
    • Match(a,b) = 0 otherwise.


CANONICAL MATCHING OPERATOR

  • Examples:

  • District1 = [profession={farmer, driver}]  [age=[24,34]]

  • Indiv1 = [profession=farmer]  [age=28]

  • Indiv2 = [profession=salesman]  [age=[27,28]]

    • Match(District1,Indiv1) = 1
    • Match(District1,Indiv2) = 0


CANONICAL MATCHING OPERATOR

  • The canonical matching function satisfies two out of three properties of a similarity measure:

    •  a, b  E: Match(a, b)  0
    •  a, b  E: Match(a, a)  Match(a, b)
  • while it does not satisfy the commutativity or simmetry property:

    •  a, b  E: Match(a, b) = Match(b, a)
  • because of the different role played by a and b.



FLEXIBLE MATCHING OPERATOR

  • The requirement BiAi for each i=1, 2, , p, might be too strict for real-world problems, because of the presence of noise in the description of the individuals of the universe.

  • Example:

  • District1 = [profession={farmer, driver}]  [age=[24,34]]

  • Indiv3 = [profession=farmer]  [age=23]

  • Match(District1,Indiv3) = 0

  • It is necessary to rely on a flexible definition of matching operator, which returns a number in [0,1] corresponding to the degree of match between two BSO’s, that is

  • flexible-matching: E × E  [0,1]



FLEXIBLE MATCHING OPERATOR

  • For any two BSO’s a and b,

  • i) flexible-matching(a,b)=1 if Match(a,b)=true,

  • ii) flexible-matching(a,b)[0,1) otherwise.

  • The result of the flexible matching can be interpreted as the probability of a matching b provided that a change is made in b.

  • Let Ea = {b' E | Match(a,b')=1} and P(b | b') be the conditional probability of observing b given that the original observation was b'. Then

  • that is flexible-matching(a,b) equals the maximum conditional probability over the space of BSO’s canonically matched by a.



FLEXIBLE MATCHING: AN APPLICATION

  • Credit card applications (Quinlan)

  • Fifteen variables whose names and values have been changed to meaningless symbols to protect the confidentiality of the data.

  • +

  • class variable: positive in case of approval of credit facilities, negative otherwise.

  • Training set: 490 cases

  • 6 rules generated by Quinlan’s system C4.5



FLEXIBLE MATCHING: AN APPLICATION

  • Such rules can be easily represented by means of Boolean symbolic objects.

  • Both matching operators can be considered in order to test the validity of the induced rules.



A new dissimilarity measure

  • Flexible matching is asymmetric. However it is possible to “symmetrize” it  New dissimilarity measure SO_6

  • It is computed as

  • d(a,b) =

  • = 1-(flexible_matching(a,b)+flexible_matching(b,a))/2







Defining dissimilarity measures for probabilistic symbolic objects

  • Steps:

  • Define coefficients measuring the divergence between two probability distributions



Defining dissimilarity measures for probabilistic symbolic objects

  • Steps:

  • Symmetrize the non symmetric coefficients

    • m(P,Q)= m(Q,P) + m(P,Q)
  • Aggregate the contribution of all variables to compute the dissimilarity between two symbolic objects

  • PSO Dissimilarity measures



Mixture SO

  • Some SO’s can be described by both non-modal and modal variables

  • They are neither BSO’s nor PSO’s

  • What dissimilarity measure, then?

  • In ASSO it has been proposed to combine the result of two dissimilarity measure, one for modal and the other for non-modal.

  • Combination can be either additive or multiplicative.

  • This possibility should be taken with great care!!!



REFERENCES

  • Esposito F., Malerba D., V. Tamma, H.-H. Bock. Classical resemblance measures. Chapter 8.1

  • Esposito F., Malerba D., V. Tamma. Dissimilarity measures for symbolic objects. Chapter 8.3

  • Esposito F., Malerba D., F.A. Lisi. Matching symbolic objects. Chapter 8.4

  • in H.-H. Bock, E. Diday (eds.): Analysis of Symbolic Data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Heidelberg, 2000.

  • D. Malerba, L. Sanarico, & V. Tamma (2000). A comparison of dissimilarity measures for Boolean symbolic data. In P. Brito, J. Costa, & D. Malerba (Eds.), Proc. of the ECML 2000 Workshop on “Dealing with Structured Data in Machine Learning and Statistics”, Barcelona.

  • D. Malerba, F. Esposito, V. Gioviale, & V. Tamma. Comparing Dissimilarity Measures in Symbolic Data Analysis. Pre-Proceedings of EKT-NTTS, vol. 1, pp. 473-481.



REFERENCES

  • D. Malerba, F. Esposito, M. Monopoli (2002). Estrazione e matching di oggetti simbolici da database relazionali. Atti del Decimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati SEBD’2002, 265-272.

  • D. Malerba, F. Esposito, & M. Monopoli (2002). Comparing dissimilarity measures for probabilistic symbolic objects. In A. Zanasi, C. A. Brebbia, N.F.F. Ebecken, P. Melli (Eds.) Data Mining III, Series Management Information Systems, Vol 6, 31-40, WIT Press, Southampton, UK.

  • E. Diday, F. Esposito (2003). An Introduction to Symbolic Data Analysis and the Sodas Software, Intelligent Data Analysis, 7, 6, (in press).

  • Other project reports



METHOD DISS

  • Dissimilarity measures between both BSO’s and PSO’s.

  • Input: Asso file of SO’s

  • Output for dissimilarities: Report + Asso file with dissimilarity matrix

  • Developer: Dipartimento di Informatica, University of Bari, Italy.



TWO USE CASE DIAGRAMS



PARAMETER SETUP

  • The user can select a subset of variables Yi on which the dissimilarity measure or the matching operator has to computed .



PARAMETER SETUP

  • The user can select a number of parameters.



OUTPUT SODAS FILE

  • The output ASSO file contains both the same input data and an additional dissimilarity matrix. The dissimilarity between the i-th and the j-th BSO is written in the cell (entry) (i, j) of the matrix.

  • Only the lower part of the dissimilarity matrix is reported in the file, since dissimilarities are symmetric.

  • abalone output file



OUTPUT REPORT FILE

  • The report file is organized as follows:

  • Output report file



Output

  • Visualization of the dissimilarity table



Output

  • Visualization of a line graph of dissimilarities



Output

  • Visualization of a scatterplot of Sammon’s nonlinear mapping into a bidimensional space



Download 445 b.

Do'stlaringiz bilan baham:




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©fayllar.org 2020
ma'muriyatiga murojaat qiling