Knowledge discovery & data mining Towards kd support Environments Fosca Giannotti and Dino Pedreschi

Download 0.62 Mb.
Hajmi0.62 Mb.

Knowledge discovery & data mining Towards KD Support Environments

  • Fosca Giannotti and
  • Dino Pedreschi
  • Pisa KDD Lab
  • CNUCE-CNR & Univ. Pisa
  • A tutorial @ EDBT2000

Module outline

  • Data analysis and KD Support Environments
  • Data mining technology trends
  • Towards data mining query languages
  • DATASIFT: a logic-based KDSE
  • Future research challenges

Vertical applications

  • We outlined three classes of vertical data analysis applications that can be tackled using KDD & DM techniques
    • Fraud detection
    • Market basket analysis
    • Customer segmentation

Why are these applications challenging?

  • Require manipulation and reasoning over knowledge and data at different abstraction levels
    • conceptual
      • semantic integration of domain knowledge, expert (business) rules and extracted knowledge
      • semantic integration of different analysis paradigms
    • logical/physical
      • interoperability with external components: DBMS’s, data mining tools, desktop tools
      • querying/mining optimization: loose vs. tight coupling between query language and specialized mining tools

Why are these applications challenging?

  • The associated KDD process needs to be carefully specified, tuned and controlled
  • Selection and
  • Preprocessing
  • Data Mining
  • Interpretation
  • and Evaluation
  • Data
  • Consolidation
  • Knowledge
  • p(x)=0.02
  • Warehouse
  • Data Sources
  • Patterns &
  • Models
  • Prepared Data
  • Consolidated
  • Data

Why are these applications challenging?

  • Still not properly supported by available KDD technology
  • what is offered: horizontal, customizable toolkits/suites of data mining primitives
  • what is needed: KD support environments for vertical applications

Datamining vs. traditional Sw development process

  • Traditional
  • Focus on knowledge transfer, design and coding
  • 30% - analysis and design
  • 70% - program design, coding and testing
  • Prototyping - expensive
  • Development process has few loops
  • Maintenance requires human analysis
  • Data mining
  • Focus on data selection, representation and search
  • 70% - data preparation
  • 30% - model generation and testing
  • Prototyping - cheap
  • Development process is inherently iterative
  • Maintenance requires re-learning model

From R. Agrawal’s invited lecture @ KDD’99

  • The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists.
  • Early Market
  • Mainstream Market
  • Chasm

Is data mining in the chasm?

  • Perceived to be sophisticated technology, usable only by specialists
  • Long, expensive projects
  • Stand-alone, loosely-coupled with data infrastructures
  • Difficult to infuse into existing mission-critical applications

Module outline

  • Data analysis and KD Support Environments
  • Data mining technology trends
    • from tools …
    • … to suites …
    • … to solutions
  • Towards data mining query languages
  • DATASIFT: a logic-based KDSE
  • Future research challenges

Generation 1: data mining tools

  • ~1980: first generation of DM systems
  • research-driven tools for single tasks, e.g.
    • build a decision tree - say C4.5
    • find clusters - say Autoclass (Cheeseman 88)
  • Difficult to use more than one tool on the same data – lots of data/metadata transformation
  • Intended user: a specialist, technically sophisticated.

Generation 2: data mining suites

  • ~1995: second generation of DM systems
  • toolkits for multiple tasks with support for data preparation and interoperability with DBMS, e.g.
    • SPSS Clementine
    • IBM Intelligent Miner
    • SAS Enterprise Miner
    • SFU DBMiner
  • Intended user: data analyst – suites require significant knowledge of statistics and databases

Growth of DM tools (source:

  • From G. Piatetsky-Shapiro. The data-mining industry coming of age. IEEE Intelligent Systems, Dec. 1999.

Generation 3: data mining solutions

  • Beginning end of 1990s
  • vertical data mining-based applications and solutions oriented to solving one specific business problem, e.g.
    • detecting credit card fraud
    • customer retention
  • Address entire KDD process, and push result into a front-end application
  • Intended user: business user – the interfaces hid the data mining complexity

Emerging short-term technology trends

  • Tighter interoperability by means of standards which facilitate the integration of data mining with other applications:
    • KDD process, e.g. the Cross-Industry Standard Process for Data Mining model (
    • representation of mining models: e.g., the PMML - predictive modeling markup language (
    • DB interoperability: the Microsoft OLE DB for data mining interface

Approaches in data mining suites

  • Database-oriented approach
    • IBM Intelligent Miner
  • OLAP-based mining
    • DBMiner - Jiawei Han’s group @ SFU
  • Machine learning
    • CART, ID3/C4.5/C5.0, Angoss Knowledge Studio
  • Statistical approaches
    • The SAS Institute Enterprise Miner.
  • Visualization approach:
    • SGI MineSet, VisDB (Keim et al. 94).

Other approaches in data mining suites

  • Neural network approach:
    • Cognos 4thoughts, NeuroRule (Lu et al.’95).
  • Deductive DB integration:
    • KnowlegeMiner (Shen et al.’96)
    • Datasift (Pisa KDD Lab - see refs).
  • Rough sets, fuzzy sets:
    • Datalogic/R, 49er
  • Multi-strategy mining:
    • INLEN, KDW+, Explora

SFU DBMiner: OLAP-centric mining

  • Warehouse Workplace
  • Active Object Elements
  • Active Object

IBM Intelligent Miner – DB-centric mining

  • Mining Base Container
  • Contents Container
  • Work Area

IBM – IM architecture

Angoss Knowledge Studio: ML-centric mining

  • Project Outline
  • Work Area
  • Additional Visualizations

KS project outline tool

  • (Limited) support to the KDD process

Support for data consolidation step

  • DBMiner
    • ODBC databases – SQL + SmartDrives
    • Single database – multiple tables
    • Consolidation of heterogeneous sources unsupported
  • Intelligent Miner
    • DB2 and text – SQL without SmartDrives
    • Multiple databases
    • Consolidation of heterogeneous sources supported
  • Knowledge Studio
    • ODBC databases and text
    • Single table
    • Consolidation of heterogeneous sources unsupported

Support for selection and preprocessing

  • DBMiner
    • SQL only
  • Intelligent Miner
    • SQL + standard and advanced statistical functionalities
  • Knowledge Studio
    • descriptive statistics

Support for data mining step

  • DBMiner
    • Association rules
    • Decision trees
    • Prediction
  • Intelligent Miner
    • Associations rules
    • Sequential patterns
    • Clustering
    • Classification
    • Prediction
    • Similar time series
  • Knowledge Studio
    • Decision trees
    • Clustering
    • Prediction

Support for interpretation and evaluation

  • Predefined interestingness measures
  • Emphasis on visualization
  • Limited export capability of analysis results
  • Gain charts for comparison of predictive models (KS and IM)
  • Limited model combination capabilities (KS)

Module outline

  • Data analysis and KD Support Environments
  • Data mining technology trends
    • from tools …
    • … to suites …
    • … to solutions
  • Towards data mining query languages
  • DATASIFT: a logic-based KDSE
  • Future research challenges

Data Mining Query Languages

  • A DMQL can provide the ability to support ad-hoc and interactive data mining
  • Hope: achieve the same effect that SQL had on relational databases.
  • Various proposals:
    • DMQL (Han et al 96)
    • mine operator (Meo et el 96)
    • M-SQL (Imielinski et al 99)
    • query flocks (Tsur et al 98)

MINE operator of (Meo et al 96)

References - DMQL

  • J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A Data Mining Query Language for Relational Databases. In Proc. 1996 SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pp. 27-33, Montreal, Canada, June 1996.
  • R. Meo, G. Psaila, S. Ceri. A New SQL-like Operator for Mining Association Rules. In Proc. VLDB96, 1996 Int. Conf. Very Large Data Bases, Bombay, India, pp. 122-133, Sept. 1996.
  • T. Imielinski and A. Virmani. MSQL: a query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.
  • S. Tsur, J. Ulman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov. Query flocks: a generalization of association rule mining. In Proc. 1998 ACM-SIGMOD, p. 1-12, 1998.

Module outline

  • Data analysis and KD Support Environments
  • Data mining technology trends
    • from tools …
    • … to suites …
    • … to solutions
  • Towards data mining query languages
  • DATASIFT: a logic-based KDSE
  • Future research challenges

DATASIFT - towards a logic-based KDSE

  • DATASIFT is LDL++ (Logic Data Language, MCC & UCLA) extended with mining primitives (decision trees & association rules)
  • LDL++ syntax: Prolog-like deductive rules
  • LDL++ semantics: SQL extended with recursion (and more)
  • Integration of deduction and induction
  • Employed to systematically develop the methodology for MBA and audit planning
  • See Pisa KDD Lab references

Our position

  • A suitable integration of
    • deductive reasoning (logic database languages)
    • inductive reasoning (association rules & decision trees)
  • provides a viable solution to high-level problems in knowledge-intensive data analysis applications

Our goal

  • Demonstrate how we support design and control of the overall KDD process and the incorporation of background knowledge
    • data preparation
    • knowledge extraction
    • post-processing and knowledge evaluation
    • business rules
    • autofocus datamining

With respect to other DMQL’s

  • extending logic query languages yields extra expressiveness, needed to bridge the gap between
    • data mining (e.g., association rule mining)
    • vertical applications (e.g., market basket analysis)

Architecture - client agent

  • User interface
  • Access to business rules and visualization of results through
    • web browser to control interaction
    • MS Excel objects (sheets and charts) to represent output of analysis (association rules)

Architecture - server agent

  • A query engine (mediator)
    • record previous analyses
    • Metadata/meta knowledge
    • interaction with other components
  • LDL++ server
    • extended with external calls to DBMSs and to …
  • Inductive modules
    • Apriori
    • classifiers (decision trees)
  • Coupling with DBMS using the Cache-mine approach
  • Performance comparable with SQL-based approaches on same mining queries (Giannotti at el 2000)

Deductive rules in LDL++

  • E.g.: select transactions involving milk
  • milk_basket(T,I)  basket(T,I),basket(T,milk).
  • Querying ?- milk_basket(T,I)
        • milk_basket(2,bread). milk_basket(3,bread).
        • milk_ basket(2,milk). milk_basket(3,orange).
        • milk_ basket(2,onions). milk_basket(3,milk).
        • milk_ basket(2,fish).
  • A small database of cash register transactions
  • basket(1,fish). basket(2,bread). basket(3,bread).
  • basket(1,bread). basket(2,milk). basket(3,orange).
        • basket(2,onions). basket(3,milk).
        • basket(2,fish).

Aggregates in LDL++

  • E.g.: count occurrences of pairs of distinct items in all transactions
  • pair(I1,I2,count) basket(T,I1),basket(T,I2),I1 I2.
  • A small database of cash register transactions
  • basket(1,fish). basket(2,bread). basket(3,bread).
  • basket(1,bread). basket(2,milk). basket(3,orange).
        • basket(2,onions). basket(3,milk).
        • basket(2,fish).
  • aggregate
  • Querying ?- pair(fish,bread,N)
        • pair(fish,bread,2) (i.e., N=2)
  • Aggregates are the logical interface between deductive and inductive environment.

Association rules in LDL++

  • E.g., compute one-to-one association rules with at least 40% support
  • rules(patterns<0.4,0,{I1,I2}>)basket(T,I1),basket(T,I2).
  • basket(1,fish). basket(2,bread). basket(3,bread).
  • basket(1,bread). basket(2,milk). basket(3,orange).
        • basket(2,onions). basket(3,milk).
        • basket(2,fish).
  • patterns
  • is the aggregate interfacing the computation of association rules
  • patterns, trans_set>

Association rules in LDL++

  • Result of the query ?- rules(X,Y,S,C)
      • rules({milk},{bread},0.66,1)
      • i.e. milk  bread [0.66,1]
      • rules({bread},{milk},0.66,0.66)
      • rules({fish},{bread},0.66,1)
      • rules({bread},{fish},0.66,0.66)
  • Same status for data and induced rules
  • basket(1,fish). basket(2,bread). basket(3,bread).
  • basket(1,bread). basket(2,milk). basket(3,orange).
        • basket(2,onions). basket(3,milk).
        • basket(2,fish).

Reasoning on item hierarchies

  • Which rules survive/decay up/down the item hierarchy?
  • rules_at_level(I,pattern) 
  • itemset_abstraction(I,Tid,Itemset).
  • preserved_rules(Left,Right)
  • rules_at_level(I,Left,Right,_,_),
  • rules_at_level(I+1,Left,Right,_,_).

Business rules: reasoning on promotions

  • Which rules are established by a promotion?
  • interval(before, -, 3/7/1998).
  • interval(promotion, 3/8/1998, 3/30/1998).
  • interval(after, 3/31/1998, +).
  • established_rules(Left, Right) 
      • not rules_partition(before, Left, Right, _, _),
      • rules_partition(promotion, Left, Right, _, _),
      • rules_partition(after, Left, Right, _, _).

Business rules: temporal reasoning

  • How does rule support change along time?

Decision tree construction in DATASIFT

  • construct training and test set using rules
  • training_set(P,Case_list)  ...
  • test_tuple(ID,F1,...,F20,Rec,Act_rec,CAR)
  •  ...
  • construct classifier using external call to C5.0
  • tree_rules(Tree_name,P,PF,MC,BO,Rule_list)  training_set(P,Case_list), tree_induction(Case_list,PF,MC,BO,Rule_list).
  • parameters
    • pruning factor PF
    • misclassification costs MC
    • boosting BO
  • external call
  • induced classifier

Putting decision trees at work

  • prediction of target variable
  • prediction(Tree_name,ID,CAR,Predicted_CAR)  tree_rules(Tree_name, _ ,_ , _ , Rule_list), test_subject(ID, F1, …, F20, _, _, CAR), classify(Rule_list ,[F1, …, F20], Predicted_CAR).
  • Model evaluation: actual recovery of a classifier (=sum recovery of tuples classified as positive)
  • actual_recovery(Tree_name,sum)  prediction(Tree_name, ID, _ , pos), test_subject(ID, F1, …, F20, _,Actual_Recovery, _).
  • aggregate

Combining decision trees

  • Model conjunction:
  • tree_conjunction(T1,T2,ID,CAR,pos)  prediction(T1, ID, CAR, pos), prediction(T2, ID, CAR, pos).
  • tree_conjunction (T1, T2, ID, CAR, neg)  test_subject(ID, F1, …, F20, _, _, CAR), ~ tree_conjunction(T1, T2, ID, CAR, pos).
  • More interesting combinations readily expressible:
    • e.g. meta learning (Chan and Stolfo 93)

We proposed ...

  • a KDD methodology for audit planning:
    • define an audit cost model
    • monitor training- and test-set construction
    • assess the quality of a classifier
    • tune classifier construction to specific policies
  • and its formalization in a prototype logic-based KDSE, supporting:
    • integration of deduction and induction
    • integration of domain and induced knowledge
    • separation of conceptual and implementation level

Module outline

  • Data analysis and KD Support Environments
  • Data mining technology trends
    • from tools …
    • … to suites …
    • … to solutions
  • Towards data mining query languages
  • DATASIFT: a logic-based KDSE
  • Future research challenges

A data mining research agenda

  • Integration with data warehouse and relational DB
  • Scalable, parallel/distributed and incremental mining
  • Data mining query language optimization
  • Multiple, integrated data mining methods
  • KDSE and methodological support for vertical appl.
  • Interactive, exploratory data mining environments
  • Mining on other forms of data:
    • spatio-temporal databases
    • text
    • multimedia
    • web

Scale up!

  • Scaling up existing algorithms (AI, ML, IR)
    • Association rules
    • Correlation rules
    • Causal relationship
    • Classification
    • Clustering
    • Bayesian networks

Background knowledge & constraints

  • Incorporating background knowledge and constraints into existing data mining techniques
  • Double benefit for DMQL: semantics and optimization!
    • traditional algorithms
    • need user-controlled focus in mining process
      • Association rules containing certain items
      • Sequential patterns containing certain patterns
      • Classification?

Vertical applications of data mining

  • More success stories needed!
  • Current data mining systems lack a thick semantic layer (similarly to the early relational database systems)
  • Verticalized data mining systems, e.g.
    • Market analysis systems
    • Fraud detection systems
  • Automated mining and interactive mining: how far are they?

Autofocus data mining

  • policy options, business rules
  • selection of data mining function
  • fine parameter tuning of mining function

DBMS coupling

  • Tight-coupling with DBMS
    • Most data mining algorithms are based on flat file data (i.e. loose-coupling with DBMS)
    • A set of standard data mining operators
    • (e.g. sampling operator)

Web mining – why?

  • No standards on the web, enormous blob of unstructured and heterogeneous info
  • Very dynamic
    • One new WWW server every 2 hours
    • 5 million documents in 1995
    • 320 million documents in 1998
  • Indices get obsolete very quickly
  • Better means needed for discovering resources and extracting knowledge

Web mining: challenges

  • Today`s search engines are plagued by problems
    • the abundance problem: 99% of info of no interest to 99% of people!
    • limited coverage of the Web
    • limited query interface based on keyword-oriented search
    • limited customization to individual users

Web mining

  • Web content mining
    • mining what Web search engines find
    • Web document classification (Chakrabarti et al 99)
    • warehousing a Meta-Web (Zaïane and Han 98)
    • intelligent query answering in Web search
  • Web usage mining
    • Web log mining: find access patterns and trends (Zaiane et al 98)
    • customized user tracking and adaptive sites (Perkowitz et al 97)
  • Web structure mining
    • discover authoritative pages: a page is important if important pages point to it (Chakrabarti et al 99, Kleinberg 98)

Warehousing a Meta-Web (Zaïane & Han 98)

  • Meta-Web: summarizes the contents and structure of the Web, which evolves with the Web
  • Layer0: the Web itself
  • Layer1: the lowest layer of the Meta-Web
    • an entry: a Web page summary, including class, time, URL, contents, keywords, popularity, weight, links, etc.
  • Layer2 and up: summary/classification/clustering
  • Meta-Web is warehoused and incrementally updated
  • Querying and mining is performed on or assisted by meta-Web
  • Is it feasible/sustainable? Is XML of any help?

Meta-Web from Jiawei Han’s panel talk @ SIGMOD99

  • Generalized Descriptions
  • More Generalized Descriptions
  • Layer0
  • Layer1
  • Layern
  • ...

Weblog mining

  • Web servers register a log entry for every single access.
  • A huge number of accesses (hits) are registered and collected in an ever-growing web log.
  • Why warehousing/mining web logs?
    • Enhance server performance by learning access patterns of general or particular users (guess what user will ask next and pre-cache!)
    • Improve system design of web applications
    • Identify potential prime advertisement locations
  • Greatest peril: the privacy pitfall
    • See e.g. (Markoff 99) the rise of the Little Brother.

Some web mining references

  • M. Perkowitz and O. Etzioni. Adaptive sites: Automatically learning from user access patterns. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.
  • J. Pitkow. In search of reliable usage data on the www. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.
  • T. Sullivan. Reading reader reaction : A proposal for inferential analysis of web server log files. In Proc. 3rd Conf. Human Factors & the Web, Denver, Colorado, June 1997.
  • O. R. Zaiane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. In Proc. Advances in Digital Libraries Conf. (ADL'98), pages 19-29, Santa Barbara, CA, April 1998.
  • O. R. Zaiane, and J. Han. Resource and knowledge discovery in global information systems: a preliminary design and experiment. In Proc. KDD’95, p.331-336, 1995.
  • O. R. Zaiane, and J. Han. WebML: querying the world-wide web for resources and knowledge. In Proc. Int. Workshop on Web informtion and Data management (WIDM98), p. 9-12, 1998.
  • S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, et al. Mining the web’s link structure. COMPUTER, 32:60-67, 1999.
  • S. Chakrabarti, B. E. Dom, P. Indik. Enhanced hypertext classification using hyperlinks. In Proc. 1998 ACM-SIGMOD, p. 307-318, 1999.
  • J. Kleinberg. Autohoritative sources in a hyperlinked environment. In Proc. ACM-SIAM Symp. on Discrete Algorithms, 1998.
  • J. Markoff. The Rise of Little Brother. Upside, Apr. 1999;

Pisa KDD Lab references

  • F. Giannotti and G. Manco. Making Knowledge Extraction and Reasoning Closer. In Proc. PAKDD'99, The Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Kyoto, 2000.
  • F. Giannotti and G. Manco. Querying Inductive Databases via Logic-Based User Defined Aggregates. In Proc. PKDD'99, The Third Europ. Conf. on Principles and Practice of Knowledge Discovery in Databases. Prague, Sept. 1999.
  • F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. Using Data Mining Techniques in Fiscal Fraud Detection. In Proc. DaWak'99, First Int. Conf. on Data Warehousing and Knowledge Discovery. Florence, Italy, Sept. 1999.
  • F. Bonchi , F. Giannotti, G. Mainetto, D. Pedreschi. A Classification-based Methodology for Planning Audit Strategies in Fraud Detection. In Proc. KDD-99, ACM-SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, San Diego (CA), August 1999.
  • F. Giannotti, G. Manco, D. Pedreschi and F. Turini. Experiences with a logic-based knowledge discovery support environment. In Proc. 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (SIGMOD'99 DMKD). Philadelphia, May 1999.
  • F. Giannotti, M. Nanni, G. Manco, D. Pedreschi and F. Turini. Integration of Deduction and Induction for Mining Supermarket Sales Data. In Proc. PADD'99, Practical Application of Data Discovery, Int. Conference, London, April 1999.
  • F. Giannotti, G. Manco, M. Nanni, D. Pedreschi. Nondeterministic, Nonmonotonic Logic Databases. IEEE Trans. on Knowledge and Data Engineering. 2000.
  • F. Giannotti, M. Nanni, G. Manco, D. Pedreschi and F. Turini. Using deduction for intelligent data analysis. Submitted, 2000.
  • P. Becuzzi, M. Coppola, S. Ruggieri and M. Vanneschi. Parallelisation of C4.5 as a particular divide and conquer computation. Proc.3rd Workshop on High Performance Data Mining, Springer-Verlag LNCS, 2000.

Download 0.62 Mb.

Do'stlaringiz bilan baham:

Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan © 2020
ma'muriyatiga murojaat qiling