- Generate a set of m random permutations π
- for each permutation π do
- compute π(T(d1)) and π(T(d2))
- check if min π(T(d1)) = min π(T(d2))
- end for
- if equality was observed in k cases, estimate the resemblance r(d1, d2) ≈ k/m (a code sketch follows this list)
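A minimal Python sketch of this estimator, assuming word q-grams as the shingle set T(d) and simulating each random permutation with a random affine hash of the shingle (a common practical stand-in for a true permutation). The names `shingles` and `estimate_resemblance` are illustrative, not from the source.

```python
import random

def shingles(text, q=4):
    """The set T(d) of word q-grams (shingles) of a document."""
    words = text.split()
    return {tuple(words[i:i + q]) for i in range(len(words) - q + 1)}

def estimate_resemblance(t1, t2, m=200, seed=0):
    """Estimate the Jaccard similarity of shingle sets t1 and t2 with m random permutations.

    Each permutation is simulated by a random affine hash; the fraction of
    permutations on which the two minima agree estimates |t1 & t2| / |t1 | t2|.
    Python's built-in hash() is randomized per process but stable within one run,
    which is all this comparison needs.
    """
    rng = random.Random(seed)
    mask = (1 << 32) - 1
    k = 0
    for _ in range(m):
        a = rng.randrange(1, mask, 2)            # random odd multiplier
        b = rng.randrange(mask)
        pi = lambda s: (a * hash(s) + b) & mask  # simulated permutation of the shingle space
        if min(map(pi, t1)) == min(map(pi, t2)):
            k += 1                               # minima agree under this permutation
    return k / m                                 # k equalities out of m permutations

d1 = "the quick brown fox jumps over the lazy dog near the bank of the river"
d2 = "the quick brown fox jumped over the lazy dog near the bank of the river"
print(estimate_resemblance(shingles(d1), shingles(d2)))
```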
Fast similarity search with random permutations
- for each random permutation π do
- create a file fπ
- for each document d do
- write out s = min π(T(d)) and d to fπ
- end for
- sort fπ using key s -- this results in contiguous blocks with fixed s containing all d associated with that s
- create a file gπ
- for each pair (d1, d2) within a run of fπ having a given s do
- write out a document-pair record (d1, d2) to gπ
- end for
- sort gπ on key (d1, d2)
- end for
- merge gπ for all π in (d1, d2) order, counting the number of entries for each pair; this count, out of the total number of permutations, estimates the pair's resemblance (see the sketch after this list)
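A compact in-memory sketch of the same sort-and-merge scheme, under the simplifying assumption that everything fits in memory: each file fπ becomes a sorted list of (s, d) records, each gπ becomes the pairs emitted from runs of equal s, and the final merge becomes a Counter over pairs. `candidate_pairs` and the affine-hash permutation are illustrative choices, not from the source.

```python
import random
from collections import Counter
from itertools import combinations

def candidate_pairs(shingle_sets, m=100, seed=0):
    """shingle_sets: dict mapping doc id -> set of shingles.
    Returns a Counter mapping (d1, d2) -> number of permutations on which their minima agreed."""
    rng = random.Random(seed)
    mask = (1 << 32) - 1
    counts = Counter()
    for _ in range(m):
        a, b = rng.randrange(1, mask, 2), rng.randrange(mask)
        # f_pi: one (s, d) record per document, s = minimum of the permuted shingle set
        f = sorted((min((a * hash(x) + b) & mask for x in t), d)
                   for d, t in shingle_sets.items() if t)
        # g_pi: all document pairs inside a run of records sharing the same s
        i = 0
        while i < len(f):
            j = i
            while j < len(f) and f[j][0] == f[i][0]:
                j += 1
            for d1, d2 in combinations(sorted(d for _, d in f[i:j]), 2):
                counts[(d1, d2)] += 1
            i = j
    return counts

# pairs whose count / m exceeds a threshold are the near-duplicate candidates
```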
Eliminating near-duplicates via shingling
- “Find-similar” algorithm reports all duplicate/near-duplicate pages
- Eliminating duplicates
- Eliminating near-duplicates
- Represent each document as a set T(d) of q-grams (shingles)
- Find the Jaccard similarity between T(d1) and T(d2): r(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
- Eliminate the pair (d1, d2) if its similarity is above a threshold (a brute-force sketch follows this list)
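For comparison, a brute-force sketch of the exact computation that the permutation trick above only approximates: build T(d) as word q-grams and compute the Jaccard similarity directly for every pair, which is feasible only for small collections. `qgrams`, `jaccard`, `near_duplicate_pairs`, and the threshold value are illustrative.

```python
from itertools import combinations

def qgrams(text, q=4):
    """T(d): the set of word q-grams (shingles) of a document."""
    words = text.split()
    return {tuple(words[i:i + q]) for i in range(len(words) - q + 1)}

def jaccard(t1, t2):
    """r(d1, d2) = |T(d1) & T(d2)| / |T(d1) | T(d2)|."""
    return len(t1 & t2) / len(t1 | t2) if (t1 or t2) else 1.0

def near_duplicate_pairs(docs, threshold=0.9, q=4):
    """docs: dict mapping doc id -> text. Returns pairs whose similarity exceeds the threshold."""
    t = {d: qgrams(text, q) for d, text in docs.items()}
    return [(d1, d2) for d1, d2 in combinations(sorted(t), 2)
            if jaccard(t[d1], t[d2]) > threshold]
```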
Similarity search and duplicate elimination on the graph structure of the web
- To improve quality of hyperlink-assisted ranking
- Detecting mirrored sites
- Approach 1 [Bottom-up Approach]
- Start process with textual duplicate detection
- cleaned URLs are listed and sorted to find duplicates/near-duplicates
- each set of equivalent URLs is assigned a unique token ID
- each page is stripped of all text, and represented as a sequence of outlink IDs
- Continue using the link-sequence representation
- until no further collapse of multiple URLs is possible (a sketch of the collapse loop follows this list)
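A minimal sketch of that collapse loop, under several simplifying assumptions: URLs are already cleaned/normalized, textual duplicate detection has already produced the initial groups of equivalent URLs, and exact equality of outlink-ID sequences stands in for near-duplicate matching on link sequences. All identifiers (`collapse_mirrors`, `duplicate_groups`, ...) are illustrative.

```python
def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the classes of a and b; return True if they were previously distinct."""
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return False
    parent[rb] = ra
    return True

def collapse_mirrors(outlinks, duplicate_groups):
    """outlinks: url -> list of outlink URLs (pages already stripped of text).
    duplicate_groups: lists of URLs found equivalent by textual duplicate detection.
    Repeatedly merges URLs whose pages have identical sequences of outlink class IDs,
    until no further collapse of multiple URLs is possible."""
    parent = {u: u for u in outlinks}
    for group in duplicate_groups:                 # seed the classes with textual duplicates
        for u in group:
            parent.setdefault(u, u)
        for u in group[1:]:
            union(parent, group[0], u)
    changed = True
    while changed:
        changed = False
        by_signature = {}
        for u, links in outlinks.items():          # link-sequence representation of each page
            sig = tuple(find(parent, v) for v in links if v in parent)
            if sig:
                by_signature.setdefault(sig, []).append(u)
        for urls in by_signature.values():         # identical signatures collapse into one class
            for u in urls[1:]:
                changed |= union(parent, urls[0], u)
    return {u: find(parent, u) for u in parent}    # final mapping: URL -> class representative
```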
- Approach 2 [Bottom-up Approach]
- identify single nodes which are near duplicates (using text-shingling)
- extend single-node mirrors to two-node mirrors
- continue on to larger and larger subgraphs which are likely mirrors of one another (see the sketch after this list)
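A small sketch of the first growth step in Approach 2, assuming text shingling has already produced the set of near-duplicate page pairs; `grow_mirrors` and its arguments are illustrative names.

```python
def grow_mirrors(near_dup_pairs, outlinks):
    """near_dup_pairs: iterable of (u, v) page pairs flagged as near-duplicates by text shingling.
    outlinks: url -> list of outlink URLs.
    Extends single-node mirrors to two-node mirrors: (u, v) is joined with (u2, v2)
    when u links to u2, v links to v2, and (u2, v2) is itself a near-duplicate pair."""
    pair_set = {frozenset(p) for p in near_dup_pairs}
    two_node = []
    for (u, v) in near_dup_pairs:
        for u2 in outlinks.get(u, []):
            for v2 in outlinks.get(v, []):
                if frozenset((u2, v2)) in pair_set:
                    two_node.append(((u, v), (u2, v2)))
    return two_node

# repeating this join on the resulting pairs-of-pairs grows larger and larger candidate mirror subgraphs
```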