Answering Approximate Queries Efficiently


Download 4.77 Mb.
Sana21.03.2020
Hajmi4.77 Mb.
#92008

Answering Approximate Queries Efficiently

  • Chen Li
  • Department of Computer Science
  • Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica

30,000-Foot View of Info Systems

  • Data Repository (RDBMS, Search Engines, etc.)
  • Query
  • Answers matching conditions

Example: a movie database

  • Star
  • Title
  • Year
  • Genre
  • Keanu Reeves
  • The Matrix
  • 1999
  • Sci-Fi
  • Samuel Jackson
  • Star Wars: Episode III - Revenge of the Sith
  • 2005
  • Sci-Fi
  • Schwarzenegger
  • The Terminator
  • 1984
  • Sci-Fi
  • Samuel Jackson
  • Goodfellas
  • 1990
  • Drama
  • Tom
  • Find movies starred Samuel Jackson

How about our governor: Schwarrzenger?

  • Star
  • Title
  • Year
  • Genre
  • Keanu Reeves
  • The Matrix
  • 1999
  • Sci-Fi
  • Samuel Jackson
  • Star Wars: Episode III - Revenge of the Sith
  • 2005
  • Sci-Fi
  • Schwarzenegger
  • The Terminator
  • 1984
  • Sci-Fi
  • Samuel Jackson
  • Goodfellas
  • 1990
  • Drama
  • The user doesn’t know the exact spelling!

Relaxing Conditions

  • Star
  • Title
  • Year
  • Genre
  • Keanu Reeves
  • The Matrix
  • 1999
  • Sci-Fi
  • Samuel Jackson
  • Star Wars: Episode III - Revenge of the Sith
  • 2005
  • Sci-Fi
  • Schwarzenegger
  • The Terminator
  • 1984
  • Sci-Fi
  • Samuel Jackson
  • Goodfellas
  • 1990
  • Drama
  • Find movies with a star “similar to” Schwarrzenger.

In general: Gap between Queries and Facts

  • Errors in the query
    • The user doesn’t remember a string exactly
    • The user unintentionally types a wrong string
  • Samuel Jackson
  • Schwarzenegger
  • Samuel Jackson
  • Keanu Reeves
  • Star
  • Samuel L. Jackson
  • Schwarzenegger
  • Samuel L. Jackson
  • Keanu Reeves
  • Star
  • Relation R
  • Relation S
  • Errors in the database:
    • Data often is not clean by itself
    • Especially true in data integration and cleansing

“Did you mean…?” features in Search Engines

What if we don’t want the user to change the query? Answering Queries Approximately

  • Data Repository (RDBMS, Search Engines, etc.)
  • Query
  • Answers matching conditions approximately

Technical Challenges

  • How to relax conditions?
    • Name: “Schwarzenegger” vs “Schwarrzenger”
    • Salary: “in [50K,60K]” vs “in [49K,63K]”
  • How to answer queries efficiently?
    • Index structures
    • Selectivity estimation
    • See our three recent VLDB papers

Rest of the talk

  • Selectivity estimation of fuzzy predicates
  • Our approach: SEPIA
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works

Queries with Fuzzy String Predicates

  • Stars: name similar to “Schwarrzenger”
  • Employees: SSN similar to “430-87-7294”
  • Customers: telephone number similar to “412-0964”
  • Similar to:
    • a domain-specific function
    • returns a similarity value between two strings
  • Examples:
    • Edit distance: ed(Schwarrzenger, Schwarzenegger)=2
    • Cosine similarity
    • Jaccard coefficient distance
    • Soundex
  • Database

Example Similarity Function: Edit Distance

  • A widely used metric to define string similarity
  • Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2
  • Example:
  • s1: Tom Hanks
  • s2: Ton Hank
  • ed(s1,s2) = 2

Selectivity of Fuzzy Predicates

  • star SIMILARTO ’Schwarrzenger’
  • Selectivity: # of records satisfying the predicate
  • Star
  • Title
  • Year
  • Genre
  • Keanu Reeves
  • The Matrix
  • 1999
  • Sci-Fi
  • Samuel Jackson
  • Star Wars: Episode III - Revenge of the Sith
  • 2005
  • Sci-Fi
  • Schwarzenegger
  • The Terminator
  • 1984
  • Sci-Fi
  • Samuel Jackson
  • Goodfellas
  • 1990
  • Drama

Selectivity Estimation: Problem Formulation

  • A bag of strings
  • Input: fuzzy string predicate P(q, δ)
  • star SIMILARTO ’Schwarrzenger’
  • Output: # of strings s that satisfy dist(s,q) <= δ

Why Selectivity Estimation?

  • SELECT *
  • FROM Movies
  • WHERE star SIMILARTO ’Schwarrzenger’
  • AND year BETWEEN [1980,1989];
  • Star
  • Title
  • Year
  • Genre
  • Keanu Reeves
  • The Matrix
  • 1999
  • Sci-Fi
  • Samuel Jackson
  • Star Wars: Episode III - Revenge of the Sith
  • 2005
  • Sci-Fi
  • Schwarzenegger
  • The Terminator
  • 1984
  • Sci-Fi
  • Samuel Jackson
  • Goodfellas
  • 1990
  • Drama
  • Movies
  • SELECT *
  • FROM Movies
  • WHERE star SIMILARTO ’Schwarrzenger’
  • AND year BETWEEN [1970,1971];
  • The optimizer needs to know the selectivity of a predicate to decide a good plan.

Using traditional histograms?

  • No “nice” order for strings
  • Lexicographical order?
    • Similar strings could be far from each other: Kammy/Cammy
    • Adjacent strings have different selectivities: Cathy/Catherine

Outline

  • Selectivity estimation of fuzzy predicates
  • Our approach: SEPIA
    • Overview
    • Proximity between strings
    • Estimation algorithm
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works

Our approach: SEPIA

  • Selectivity Estimation of Approximate Predicates
  • Intuition

Proximity between Strings

  • Edit Distance? Not discriminative enough

Edit Vector from s1 to s2

  • A vector <I, D, S>
    • I: # of insertions
    • D: # of deletions
    • S: # of substitutions
    • in a sequence of edit operations with their edit distance
    • Easily computable
    • Not symmetric
    • Not unique, but tend to be (ed <= 3  91% unique)

Why Edit Vector?

  • More discriminative

SEPIA histograms: Overview

Frequency table for each cluster

Global PPD Table

  • Proximity Pair Distribution table

SEPIA histograms: summary

Selectivity Estimation: ed(lukas, 2)

Selectivity Estimation for ed(q,d)

  • For each cluster Ci
  • For each v2 in frequency table of Ci
  • Use (v1,v2,d) to lookup PPD
  • Take the sum of these f * N
  • Pruning possible (triangle inequality)

Outline

  • Selectivity estimation of fuzzy predicates
  • Our approach: SEPIA
    • Overview
    • Proximity between strings
    • Estimation algorithm
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works

Clustering Strings

  • Two example algorithms
  • Lexicographic order based.
  • K-Medoids
    • Choose initial pivots
    • Assign strings to its closest pivot
    • Swap a pivot with another string
    • Reassign the strings

Number of Clusters

  • It affects:
  • Cluster quality
    • Similarity of strings within each cluster
  • Costs:
    • Space
    • Estimation time

Constructing Frequency Tables

  • For each cluster, group strings based on their edit vector from the pivot
  • Count the frequency for each group

Constructing PPD Table

  • Get enough samples of string triplets (q,p,s)
  • Propose a few heuristics
    • ALL_RAND
    • CLOSE_RAND
    • CLOSE_LEX
    • CLOSE_UNIQUE

Dynamic Maintenance: Frequency Table

  • Take insertion as an example

Dynamic Maintenance: PPD

Improving Estimation Accuracy

  • Reasons of estimate errors
    • Miss hits in PPD.
    • Inaccurate percentage entries in PPD.
  • Improvement: use sample fuzzy predicates to analyze their estimation errors

Relative-Error Model

Outline

  • Motivation: selectivity estimation of fuzzy predicates
  • Our approach: SEPIA
    • Overview
    • Proximity between strings
    • Estimation algorithm
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works

Data

  • Citeseer:
    • 71K author names
    • Length: [2,20], avg = 12
  • Movie records from UCI KDD repository:
    • 11K movie titles.
    • Length: [3,80], avg = 35
  • Introduced duplicates:
    • 10% of records
    • # of duplicates: [1,20], uniform
  • Final results:
    • Citeseer: 142K author names
    • UCI KDD: 23K movie titles

Setting

  • Test bed
    • PC: 2.4G P4, 1.2GB RAM, Windows XP
    • Visual C++ compiler
  • Query workload:
    • Strings from the data
    • String not in the data
    • Results similar
  • Quality measurements
    • Relative error: (fest – freal) / freal
    • Absolute relative error : |fest – freal | / freal

Clustering Algorithms

  • K-Metoids is better

Quartile distribution of relative errors

  • Data set 1. CLOSE_RAND; 1000 clusters

Number of Clusters

Effectiveness of Applying Relative-Error Model

Dynamic Maintenance

Other work 1: Relaxing SQL queries with Selections/Joins

  • SELECT * FROM Jobs J, Candidate C WHERE J.Salary <= 95 AND J.Zipcode = C.Zipcode AND C.WorkYear >= 5
  • Jobs
  • Candidates
  • JID
  • Company
  • Zipcode
  • Salary
  • CID
  • Zipcode
  • ExpSalary
  • WorkYear
  • r1
  • Broadcom
  • 92047
  • 80
  • s1
  • 93652
  • 120
  • 3
  • r2
  • Intel
  • 93652
  • 95
  • s2
  • 92612
  • 130
  • 6
  • r3
  • Microsoft
  • 82632
  • 120
  • s3
  • 82632
  • 100
  • 5
  • r4
  • IBM
  • 90391
  • 130
  • s4
  • 90391
  • 150
  • 1
  • ...
  • ...

Query Relaxation: Skyline!

Other work 2: Fuzzy predicates on attributes of mixed types

  • SELECT *
  • FROM Movies
  • WHERE star SIMILARTO ’Schwarrzenger’
  • AND |year – 1977| <= 3;
  • Star
  • Title
  • Year
  • Genre
  • Keanu Reeves
  • The Matrix
  • 1999
  • Sci-Fi
  • Samuel Jackson
  • Star Wars: Episode III - Revenge of the Sith
  • 2005
  • Sci-Fi
  • Schwarzenegger
  • The Terminator
  • 1984
  • Sci-Fi
  • Samuel Jackson
  • Goodfellas
  • 1990
  • Drama
  • Movies

Mixed-Typed Predicates

  • String attributes: edit distance
  • Numeric attributes: absolute numeric difference
  • SELECT *
  • FROM Movies
  • WHERE star SIMILARTO ’Schwarrzenger’
  • AND |year – 1977| <= 3;

MAT-tree: Intuition

  • Indexing on two attributes is more effective than two separate indexing structures
  • Numeric attribute: B-tree
  • String attribute: tree-based index structure?

MAT-tree: Overview

  • Tree-based indexing structure:
  • Compressing strings as a “compressed trie” that fits into a limited space
  • An edit distance between a string and compressed trie can be computed
  • Experiments show that MAT-tree is very efficient

Conclusion

  • It’s important to support answering approximate queries efficiently
  • Our results so far:
    • SEPIA: provides accurate selectivity estimation for fuzzy string predicates
    • Relaxing SQL queries with selections and joins
    • MAT-tree: indexing structure supporting fuzzy queries with mixed-types predicates


Download 4.77 Mb.

Do'stlaringiz bilan baham:




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©fayllar.org 2022
ma'muriyatiga murojaat qiling