Chen Li




  • Chen Li












  • Similarity join for large data sets

  • Techniques applicable to other domains, e.g.:



  • Formulation: set-similarity join

  • Hadoop-based solutions

  • Experiments

    • More results: see the SIGMOD 2010 paper




  • Word tokens:
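
The slide's own example did not survive extraction. As a minimal sketch of what "word tokens" could mean, the snippet below turns a string into the set of its distinct words. The function name `word_tokens` and the tokenization rule (lowercase, alphanumeric runs) are assumptions made for illustration, not taken from the slides.

```python
import re

def word_tokens(text):
    """Return the set of distinct lowercase word tokens in text (assumed rule)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# The sample string is the talk's own topic line.
tokens = word_tokens("Similarity join for large data sets")
print(sorted(tokens))
```

Representing each record as a set of tokens is what lets the later steps reason about how many tokens two records share.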





  • Formulation of set-similarity join

  • Hadoop-based solutions

  • Experiments



  • Large amounts of data

  • Data or processing does not fit on one machine

  • Assumptions:

    • Self join: R = S
    • Two similar sets share at least 1 token


  • Map: <23, (a,b,c)> → (a, 23), (b, 23), (c, 23)
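
The map step above can be sketched in plain Python, without Hadoop. The names `map_record` and `group_by_key` are hypothetical, `group_by_key` is only a stand-in for the framework's shuffle, and record 42 is made up to show two records meeting under a shared token.

```python
from collections import defaultdict

def map_record(record_id, tokens):
    """Map: emit one (token, record_id) pair per token of the record."""
    for token in tokens:
        yield token, record_id

def group_by_key(pairs):
    """Stand-in for the shuffle: collect all values emitted under each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Record 23 is the slide's example; record 42 is hypothetical.
records = {23: ("a", "b", "c"), 42: ("b", "c", "d")}
pairs = [kv for rid, toks in records.items() for kv in map_record(rid, toks)]
groups = group_by_key(pairs)
print(groups)
```

Because two similar sets share at least one token, every similar pair ends up together in at least one group, where a reducer could compare them.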



  • Prefixes of similar sets should share tokens



  • Each set has 5 tokens

  • “Similar”: they share at least 4 tokens

  • Prefix length: 2
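
The prefix length in this example follows from the usual prefix-filtering rule, size − required overlap + 1 = 5 − 4 + 1 = 2. The sketch below checks that rule by brute force: under one global token order, any two 5-token sets that share at least 4 tokens must share a token in their 2-token prefixes. The 7-token universe, its alphabetical order, and the function names are assumptions made for illustration.

```python
from itertools import combinations

def prefix_length(set_size, min_overlap):
    """Tokens to keep so that sets sharing >= min_overlap tokens share a prefix token."""
    return set_size - min_overlap + 1

def prefix(tokens, rank, min_overlap):
    """First tokens of the set under the global order given by rank."""
    ordered = sorted(tokens, key=rank.__getitem__)
    return set(ordered[:prefix_length(len(ordered), min_overlap)])

# Hypothetical global token order: a < b < ... < g.
rank = {t: i for i, t in enumerate("abcdefg")}

# Collect every pair of 5-token sets that is similar but has disjoint prefixes.
violations = [
    (x, y)
    for x, y in combinations(combinations("abcdefg", 5), 2)
    if len(set(x) & set(y)) >= 4
    and not (prefix(x, rank, 4) & prefix(y, rank, 4))
]
print(prefix_length(5, 4), len(violations))
```

An empty `violations` list means the prefixes alone are enough to find every candidate pair in this universe, so only prefix tokens need to be used as routing keys.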





  • Compute token frequencies
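
A minimal single-machine sketch of this step, assuming (as is common for prefix filtering, but not stated on the slide) that the frequencies are used to order tokens from rarest to most frequent. The function name `token_order` and the sample records are hypothetical.

```python
from collections import Counter

def token_order(records):
    """Count in how many records each token occurs; return tokens rarest first."""
    counts = Counter(tok for tokens in records.values() for tok in set(tokens))
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(counts, key=lambda tok: (counts[tok], tok))

# Hypothetical records: id -> tokens.
records = {1: ["a", "b", "c"], 2: ["b", "c", "d"], 3: ["c", "d"]}
order = token_order(records)
print(order)
```

Putting rare tokens first keeps prefixes selective, so fewer records are routed to each group.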



  • Partition using prefixes
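
As a rough single-machine sketch of this step: each record is routed to one group per prefix token, and candidate pairs are verified inside each group. The function name `similar_pairs`, the overlap-count similarity test, and the sample data are assumptions for illustration. A real job would run the routing in mappers and the verification in reducers.

```python
from collections import defaultdict
from itertools import combinations

def similar_pairs(records, order, min_overlap):
    """Return id pairs whose token sets share at least min_overlap tokens."""
    rank = {tok: i for i, tok in enumerate(order)}
    # "Map": route each record id to one group per prefix token.
    groups = defaultdict(list)
    for rid, tokens in records.items():
        ordered = sorted(set(tokens), key=rank.__getitem__)
        keep = max(0, len(ordered) - min_overlap + 1)
        for tok in ordered[:keep]:
            groups[tok].append(rid)
    # "Reduce": verify candidates in each group; the set drops duplicate pairs.
    result = set()
    for rids in groups.values():
        for r, s in combinations(sorted(rids), 2):
            if len(set(records[r]) & set(records[s])) >= min_overlap:
                result.add((r, s))
    return result

# Hypothetical records with 5 tokens each and a hypothetical global order.
records = {1: list("abcde"), 2: list("abcdf"), 3: list("cdefg")}
result = similar_pairs(records, order="abcdefg", min_overlap=4)
print(result)
```

Records 1 and 2 meet in the groups for `a` and `b` and pass verification. Record 3 shares no prefix token with either, so it is never compared.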



  • Bring records for each id in each pair



  • Join two half-filled records
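
The two record-join steps can be sketched together. The first attaches a full record to each id of each similar pair, producing "half-filled" results. The second groups the halves by pair and combines them. The function names and sample records are hypothetical, and the dictionary lookup stands in for what would be a join on record id in a distributed job.

```python
from collections import defaultdict

def fill_halves(pairs, records):
    """Phase 1: for each id in each pair, attach that id's full record."""
    for pair in pairs:
        for rid in pair:
            yield pair, (rid, records[rid])

def join_halves(halves):
    """Phase 2: group half-filled records by pair and combine the two halves."""
    by_pair = defaultdict(dict)
    for pair, (rid, record) in halves:
        by_pair[pair][rid] = record
    return {pair: (recs[pair[0]], recs[pair[1]]) for pair, recs in by_pair.items()}

# Hypothetical full records and hypothetical similar id pairs.
records = {1: "record one", 2: "record two", 3: "record three"}
joined = join_halves(fill_halves([(1, 2), (2, 3)], records))
print(joined)
```

Carrying only ids through the similarity computation and fetching full records at the end keeps the intermediate data small.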



  • Formulation of set-similarity join

  • Hadoop-based solutions

  • Experiments



  • Hardware

    • 10-node IBM x3650 cluster
    • Intel Xeon processor E5520 2.26 GHz with four cores
    • Four 300 GB hard disks
    • 12 GB RAM
  • Software

  • Datasets: publications (DBLP and CITESEERX)



  • Stage 2





  • Stage 2 has good speedup



  • Good scaleup



  • Other methods for the 3 stages

  • Case: R ≠ S (non-self join)

  • Dealing with limited memory



Set-similarity joins in Hadoop:



  • Chen Li @ UC Irvine

  • Source code available at:

  • http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/


