Web search engines


Date: 03.11.2023

Approximate string matching

  • Aggressive conflation: collapse variant spellings into the same token
      • E.g., Soundex, which encodes a term by its pronunciation rather than its exact spelling
      • Used with great success in indexing and searching last names in census and telephone directory data
  • Decompose terms into q-grams, i.e., sequences of q consecutive characters
      • Check for similarity between the q-gram sets of two terms
      • Looking up the inverted index becomes a two-stage affair:
        • A smaller index of q-grams is consulted to expand each query term into a set of slightly distorted query terms
        • These terms are then submitted to the regular index
      • Used by Google for spelling correction
      • Idea also adopted for eliminating near-duplicate pages
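
The two-stage q-gram lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `$` padding character, the choice of q = 2, and the 0.5 overlap threshold are all hypothetical, and a production spelling corrector would be considerably more elaborate.

```python
from collections import defaultdict

def qgrams(term, q=2):
    """Set of q-grams of a term, padded with '$' so word boundaries count."""
    padded = "$" * (q - 1) + term + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def build_qgram_index(vocabulary, q=2):
    """Stage-one index: maps each q-gram to the vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in qgrams(term, q):
            index[gram].add(term)
    return index

def expand(query_term, qgram_index, q=2, threshold=0.5):
    """Expand a query term into vocabulary terms with similar q-gram sets."""
    query_grams = qgrams(query_term, q)
    shared = defaultdict(int)  # candidate term -> number of shared q-grams
    for gram in query_grams:
        for term in qgram_index.get(gram, ()):
            shared[term] += 1
    matches = []
    for term, n_shared in shared.items():
        n_union = len(query_grams) + len(qgrams(term, q)) - n_shared
        if n_shared / n_union >= threshold:  # overlap ratio of the gram sets
            matches.append(term)
    return sorted(matches)

def search(query_term, qgram_index, inverted_index):
    """Stage two: submit every expanded term to the regular inverted index."""
    docs = set()
    for term in expand(query_term, qgram_index):
        docs |= inverted_index.get(term, set())
    return docs

# Toy data (hypothetical): term -> ids of documents containing it.
inverted_index = {"search": {1, 2}, "engine": {2}, "string": {3}}
qgram_index = build_qgram_index(inverted_index)

print(expand("serch", qgram_index))                  # ['search']
print(search("serch", qgram_index, inverted_index))  # {1, 2}
```

The misspelled query "serch" shares five of its six bigrams with "search", so it clears the threshold, while "string" shares only the leading `$s` and is rejected.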

Meta-search systems

  • Take the search engine to the document
    • Forward queries to many geographically distributed repositories
      • Each has its own search service
    • Consolidate their responses.
  • Advantages
    • Perform non-trivial query rewriting
      • Adapt a single user query to many search engines with different query syntax
    • Crawls of different engines overlap surprisingly little, so combining them improves coverage
  • Consolidating responses
    • Goes beyond just eliminating duplicates
    • Search services do not provide standardized ranks or scores that can be combined meaningfully
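
As a rough illustration of consolidating responses, the sketch below merges ranked URL lists from several engines. It uses reciprocal rank fusion, a common rank-only heuristic chosen here purely as an assumption; the slides do not prescribe a particular merging method. The URL normalization is deliberately crude, and the function names and the constant `k = 60` are illustrative choices.

```python
from collections import defaultdict

def normalize_url(url):
    """Crude URL normalization for duplicate detection (illustrative only)."""
    url = url.strip().lower()
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")

def fuse(result_lists, k=60):
    """Merge ranked lists using positions only, since raw scores from
    different engines are not comparable."""
    scores = defaultdict(float)
    for results in result_lists:
        seen = set()
        for rank, url in enumerate(results, start=1):
            key = normalize_url(url)
            if key in seen:  # count each page once per engine
                continue
            seen.add(key)
            scores[key] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))

# Hypothetical responses from two engines.
engine_a = ["http://a.example/", "http://b.example/x"]
engine_b = ["https://www.b.example/x", "http://c.example"]

print(fuse([engine_a, engine_b]))
# ['b.example/x', 'a.example', 'c.example']
```

The page returned by both engines rises to the top even though neither engine's raw score is used, which is the point of a rank-based merge.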

Similarity search

  • Cluster hypothesis
    • Documents similar to relevant documents are also likely to be relevant
  • Handling “find similar” queries
    • Replication or duplication of pages
    • Mirroring of sites

Document similarity

  • Jaccard coefficient of similarity between documents $d_1$ and $d_2$
  • $T(d)$ = set of tokens in document $d$
    • $J(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
    • Symmetric and reflexive, but not a metric
    • Forgives any number of occurrences and any permutation of the terms
  • $1 - J(d_1, d_2)$ is a metric
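
A minimal sketch of the Jaccard coefficient over token sets, and of the distance obtained by subtracting it from 1, might look like this. The tokenizer (lowercased alphanumeric runs) and the convention that two empty documents have similarity 1 are assumptions made for the example.

```python
import re

def tokens(text):
    """T(d): the set of lowercased alphanumeric tokens in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(d1, d2):
    """|T(d1) & T(d2)| / |T(d1) | T(d2)|, a similarity in [0, 1]."""
    t1, t2 = tokens(d1), tokens(d2)
    if not t1 and not t2:
        return 1.0  # assumed convention: two empty documents are identical
    return len(t1 & t2) / len(t1 | t2)

def jaccard_distance(d1, d2):
    """1 - Jaccard similarity, which satisfies the metric axioms."""
    return 1.0 - jaccard(d1, d2)

# Repeated terms and word order do not matter, because only sets are compared.
print(jaccard("the cat sat", "sat the cat cat"))  # 1.0
print(jaccard("a b c", "b c d"))                  # 0.5
print(jaccard_distance("a b", "c d"))             # 1.0
```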
