Chen Li
Similarity join for large data sets
Formulation: set-similarity join
Word tokens:
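As a point of reference for the formulation, here is a minimal sketch of a set-similarity join over word tokens. It assumes Jaccard similarity as the measure and whitespace tokenization; the outline names neither, so both are illustrative choices, and the function names are hypothetical. The nested loop compares every pair, which is exactly the cost that becomes impractical on large data sets.

```python
def word_tokens(text):
    """Split a string into a set of lowercase word tokens (assumed tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def naive_similarity_join(records, threshold):
    """Return (id1, id2, similarity) for every pair at or above the threshold.

    `records` maps a record id to its text. Every pair is compared, so the
    cost grows quadratically with the number of records.
    """
    ids = sorted(records)
    token_sets = {rid: word_tokens(records[rid]) for rid in ids}
    result = []
    for i, r in enumerate(ids):
        for s in ids[i + 1:]:
            sim = jaccard(token_sets[r], token_sets[s])
            if sim >= threshold:
                result.append((r, s, sim))
    return result
```

For example, "I will call back" and "I will call you soon" share 3 of 6 distinct tokens, so they join at any threshold up to 0.5.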
Formulation of set-similarity join
Large amounts of data
Map: <23, (a,b,c)> → (a, 23), (b, 23), (c, 23)
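The Map line above can be sketched in plain Python, without a Hadoop dependency. The names `map_record` and `group_by_key` are hypothetical; `group_by_key` only imitates the shuffle that a MapReduce framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_record(rid, tokens):
    """Emit one (token, record id) pair per token of the input record."""
    for token in tokens:
        yield token, rid


def group_by_key(pairs):
    """Imitate the shuffle step: collect all values emitted under each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)
```

Calling `list(map_record(23, ("a", "b", "c")))` yields the three pairs shown on the Map line.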
Prefixes of similar sets should share tokens
Each set has 5 tokens
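The prefix idea can be made concrete with a small calculation. This sketch assumes Jaccard similarity and a threshold of 0.8, neither of which is stated in the outline. Under those assumptions, two 5-token sets must share at least 4 tokens to be similar, so if both are sorted by the same global token order, their first 5 - 4 + 1 = 2 tokens must have at least one token in common. The function names are hypothetical.

```python
import math


def prefix_length(set_size, threshold):
    """Number of leading tokens that must be indexed for a set of this size.

    A set of size n needs at least ceil(threshold * n) tokens in common with
    any similar set, so a prefix of n - ceil(threshold * n) + 1 tokens suffices.
    The small epsilon guards against floating-point round-up.
    """
    min_overlap = math.ceil(threshold * set_size - 1e-9)
    return set_size - min_overlap + 1


def prefix(tokens, order, threshold):
    """Sort tokens by a global order (token -> rank) and keep the prefix."""
    ordered = sorted(set(tokens), key=lambda t: order[t])
    return ordered[:prefix_length(len(ordered), threshold)]
```

With an alphabetical order, the set {a, b, c, d, e} has the prefix [a, b] at threshold 0.8.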
Compute token frequencies
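Computing token frequencies is essentially a word count followed by a sort. The sketch below assumes that tokens are then ranked from rarest to most frequent, so that prefixes are built from rare tokens and generate fewer candidates; the outline does not spell out the ordering, so treat that as an assumption. `token_order` is a hypothetical name, and ties are broken alphabetically only to keep the result deterministic.

```python
from collections import Counter


def token_order(records):
    """Map each token to its rank, rarest token first.

    `records` maps a record id to an iterable of tokens. A token is counted
    once per record that contains it.
    """
    counts = Counter(tok for tokens in records.values() for tok in set(tokens))
    ranked = sorted(counts, key=lambda tok: (counts[tok], tok))
    return {tok: rank for rank, tok in enumerate(ranked)}
```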
Partition using prefixes
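Partitioning by prefixes can be sketched as follows: each record is routed to one group per prefix token, and only records that land in the same group are compared. This is a single-process illustration under the same assumptions as before (Jaccard similarity, a global token order supplied by the caller); all function names are hypothetical, and the two loops stand in for the map and reduce phases.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_tokens(tokens, order, threshold):
    """Sort tokens by the global order and keep only the prefix."""
    ordered = sorted(set(tokens), key=lambda t: order[t])
    n = len(ordered)
    keep = n - math.ceil(threshold * n - 1e-9) + 1
    return ordered[:keep]


def similar_id_pairs(records, order, threshold):
    """Return {(id1, id2): similarity} for all similar pairs of records."""
    # "Map": route each record id to one group per prefix token.
    groups = defaultdict(list)
    for rid, tokens in records.items():
        for tok in prefix_tokens(tokens, order, threshold):
            groups[tok].append(rid)

    # "Reduce": verify candidate pairs within each group. A pair can meet
    # in several groups, so results are keyed by pair to avoid duplicates.
    result = {}
    for rids in groups.values():
        for r, s in combinations(sorted(rids), 2):
            if (r, s) not in result:
                sim = jaccard(set(records[r]), set(records[s]))
                if sim >= threshold:
                    result[(r, s)] = sim
    return result
```

Records whose prefixes share no token never meet in any group, so they are never compared.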
Bring records for each id in each pair
Join two half filled records
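The last two steps can be read as a two-phase record join: first attach the full record to each id of each similar pair, producing "half-filled" results, then combine the two halves that belong to the same pair. The sketch below is one plausible reading of these slide titles, not a confirmed description of the authors' implementation; the function names and the "left"/"right" tags are hypothetical.

```python
from collections import defaultdict


def fill_halves(id_pairs, records):
    """Phase 1: for each id in each pair, emit the pair with that record attached.

    `id_pairs` maps (id1, id2) to a similarity; `records` maps id to record.
    Each output carries only one of the two records, hence "half filled".
    """
    for (r, s), sim in id_pairs.items():
        yield (r, s), ("left", records[r], sim)
        yield (r, s), ("right", records[s], sim)


def join_halves(halves):
    """Phase 2: group halves by pair and merge them into complete results."""
    grouped = defaultdict(dict)
    for pair, (side, record, sim) in halves:
        grouped[pair][side] = record
        grouped[pair]["sim"] = sim
    return [(g["left"], g["right"], g["sim"])
            for _, g in sorted(grouped.items())]
```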
Formulation of set-similarity join
Hardware
Stage 2
Stage 2 has good speedup
Good scaleup
Other methods for the 3 stages
Set-similarity joins in Hadoop:
Chen Li @ UC Irvine