Web search engines


Estimating Jaccard coefficient with random permutations


Download 352.5 Kb.
bet9/10
Sana03.11.2023
Hajmi352.5 Kb.
#1742628
1   2   3   4   5   6   7   8   9   10
Bog'liq
search

Estimating Jaccard coefficient with random permutations

  • Generate a set of m random permutations
  • for each do
  • compute and
  • check if
  • end for
  • if equality was observed in k cases, estimate.

Fast similarity search with random permutations

  • for each random permutation do
  • create a file
  • for each document d do
  • write out to
  • end for
  • sort using key s--this results in contiguous blocks with fixed s containing all associated
  • create a file
  • for each pair within a run of having a given s do
  • write out a document-pair record to g
  • end for
  • sort on key
  • end for
  • merge for all in order, counting the number of entries

Eliminating near-duplicates via shingling

  • “Find-similar” algorithm reports all duplicate/near-duplicate pages
  • Eliminating duplicates
  • Eliminating near-duplicates
    • Represent each document as a set T(d) of q-grams (shingles)
    • Find Jaccard similarity between and
    • Eliminate the pair from step 9 if it has similarity above a threshold

Detecting locally similar sub-graphs of the Web

  • Similarity search and duplicate elimination on the graph structure of the web
    • To improve quality of hyperlink-assisted ranking
  • Detecting mirrored sites
    • Approach 1 [Bottom-up Approach]
      • Start process with textual duplicate detection
          • cleaned URLs are listed and sorted to find duplicates/near-duplicates
          • each set of equivalent URLs is assigned a unique token ID
          • each page is stripped of all text, and represented as a sequence of outlink IDs
      • Continue using link sequence representation
      • Until no further collapse of multiple URLs are possible
    • Approach 2 [Bottom-up Approach]
      • identify single nodes which are near duplicates (using text-shingling)
      • extend single-node mirrors to two-node mirrors
      • continue on to larger and larger graphs which are likely mirrors of one another

Download 352.5 Kb.

Do'stlaringiz bilan baham:
1   2   3   4   5   6   7   8   9   10




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©fayllar.org 2024
ma'muriyatiga murojaat qiling