Chen Li

Similarity join for large data sets

Techniques applicable to other domains, e.g.:

Formulation: set-similarity join

Hadoop-based solutions

Experiments

Word tokens:

Formulation of set-similarity join

 Hadoop-based solutions

Experiments

Large amounts of data

Data or processing does not fit in one machine

Assumptions:

Map: <23, (a,b,c)> → (a, 23), (b, 23), (c, 23)
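
The map step on this slide can be sketched in plain Python, without Hadoop. The function name `map_record` is hypothetical and not taken from the authors' code; it takes a record id and its tokens and emits one (token, id) pair per token.

```python
def map_record(record_id, tokens):
    """Emit one (token, record_id) pair for every token of the record."""
    return [(token, record_id) for token in tokens]


# The record from the slide: id 23 with tokens a, b, c.
pairs = map_record(23, ("a", "b", "c"))
print(pairs)
```

In a real MapReduce job the framework would then group these pairs by token, so that all record ids containing the same token reach the same reducer.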

Prefixes of similar sets should share tokens

Each set has 5 tokens

“Similar”: they share at least 4 tokens

Prefix length: 2
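
The numbers on this slide follow from the prefix-filtering rule: if two sets of size 5 must share at least 4 tokens, then once both are sorted by the same global token order, their prefixes of length 5 - 4 + 1 = 2 must share at least one token. A minimal sketch, assuming a made-up alphabetical token order and a hypothetical helper named `prefix`:

```python
def prefix(tokens, token_rank, overlap_threshold):
    """Return the first (size - threshold + 1) tokens under the global order."""
    ordered = sorted(tokens, key=lambda t: token_rank[t])
    length = max(0, len(ordered) - overlap_threshold + 1)
    return ordered[:length]


# Assumed global order for illustration: a < b < c < ... < h.
token_rank = {t: i for i, t in enumerate("abcdefgh")}

r1 = {"a", "b", "c", "d", "e"}
r2 = {"b", "c", "d", "e", "f"}  # shares 4 tokens with r1
r3 = {"c", "d", "e", "g", "h"}  # shares only 3 tokens with r1 and with r2

p1, p2, p3 = (prefix(r, token_rank, 4) for r in (r1, r2, r3))

print(p1, p2, p3)
print(bool(set(p1) & set(p2)))  # r1 and r2 become a candidate pair
print(bool(set(p1) & set(p3)))  # r1 and r3 are pruned without comparison
print(bool(set(p2) & set(p3)))  # candidate, but verification will reject it
```

The last case shows that a shared prefix token is a necessary condition only: r2 and r3 share a prefix token yet overlap on just 3 tokens, so candidate pairs still need an exact overlap check.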

Stage 1: Order tokens by frequency

Stage 2: Finding “similar” id pairs

Stage 3: id pairs → record pairs

Compute token frequencies
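
Stage 1 can be sketched as a word-count style job followed by a sort. The version below is a single-process approximation, not the authors' Hadoop code. The name `token_order` is hypothetical, and sorting rarest-first is an assumption: the slides say only "order tokens by frequency", and rarest-first is the usual choice for prefix filtering because rare prefix tokens produce fewer candidates.

```python
from collections import Counter


def token_order(records):
    """Map each token to its rank, rarest token first (assumed direction)."""
    # Count in how many records each token occurs.
    counts = Counter(
        token for tokens in records.values() for token in set(tokens)
    )
    # Break ties alphabetically so the order is deterministic.
    ordered = sorted(counts, key=lambda t: (counts[t], t))
    return {token: rank for rank, token in enumerate(ordered)}


records = {1: ["a", "b", "c"], 2: ["b", "c"], 3: ["c"]}
rank = token_order(records)
print(rank)
```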

Partition using prefixes
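
A minimal single-process sketch of Stage 2, assuming an overlap threshold as the similarity test and a precomputed token order. Each record is routed to one group per prefix token (the map side), and records that meet in a group are verified against each other (the reduce side). The function name `similar_id_pairs` and the sample data are illustrative only.

```python
from collections import defaultdict
from itertools import combinations


def similar_id_pairs(records, token_rank, overlap_threshold):
    """Return id pairs whose token sets share at least overlap_threshold tokens."""
    # "Map": send each record id to the group of every token in its prefix.
    groups = defaultdict(list)
    for rid, tokens in records.items():
        ordered = sorted(tokens, key=lambda t: token_rank[t])
        length = max(0, len(ordered) - overlap_threshold + 1)
        for token in ordered[:length]:
            groups[token].append(rid)

    # "Reduce": verify every pair that met in some group. A set removes
    # duplicates from pairs that share more than one prefix token.
    pairs = set()
    for rids in groups.values():
        for a, b in combinations(sorted(rids), 2):
            if len(set(records[a]) & set(records[b])) >= overlap_threshold:
                pairs.add((a, b))
    return sorted(pairs)


token_rank = {t: i for i, t in enumerate("abcdefgh")}
records = {
    1: {"a", "b", "c", "d", "e"},
    2: {"b", "c", "d", "e", "f"},
    3: {"c", "d", "e", "g", "h"},
}
print(similar_id_pairs(records, token_rank, 4))
```

Only records that share a prefix token are ever compared, which is what lets the work be partitioned across reducers.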

Bring records for each id in each pair

Join two half-filled records
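
The two steps of Stage 3 can be sketched as follows, under the assumption that the first step attaches the full record to each id of a pair (producing "half-filled" pairs) and the second step groups the halves by pair. The names `fill_halves` and `join_halves` and the sample records are hypothetical.

```python
from collections import defaultdict


def fill_halves(id_pairs, records):
    """Step 1: for each id in each pair, emit (pair, (id, record))."""
    halves = []
    for pair in id_pairs:
        for rid in pair:
            halves.append((pair, (rid, records[rid])))
    return halves


def join_halves(halves):
    """Step 2: group the halves by pair and combine them into record pairs."""
    grouped = defaultdict(dict)
    for pair, (rid, record) in halves:
        grouped[pair][rid] = record
    return [
        (grouped[pair][pair[0]], grouped[pair][pair[1]])
        for pair in sorted(grouped)
    ]


records = {1: "record one", 2: "record two", 3: "record three"}
id_pairs = [(1, 2)]
record_pairs = join_halves(fill_halves(id_pairs, records))
print(record_pairs)
```

Splitting the work this way means the earlier stage can pass around ids only, and full records are fetched just for the pairs that survived verification.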

Formulation of set-similarity join

Hadoop-based solutions

 Experiments

Hardware

Software

Datasets: publications (DBLP and CITESEERX)

Stage 2

Stage 2 has good speedup

Good scaleup

Other methods for the 3 stages

Case: R <> S

Dealing with limited memory

Set-similarity joins in Hadoop:

Three-stage approach using Hadoop

Experimental study

Chen Li @ UC Irvine

Source code available at:

http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/