- Aggressive conflation mechanism to collapse variant spellings into the same token
- E.g., Soundex: takes phonetics and pronunciation details into account
- Used with great success in indexing and searching last names in census and telephone directory data
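As an illustration of this kind of phonetic conflation, here is a minimal sketch of the commonly described American Soundex variant. The slides do not say which variant is meant, so the coding table and the H/W rule below are assumptions, and `soundex` is a hypothetical helper name.

```python
# Letter-to-digit table for American Soundex (assumed variant).
# Vowels and H, W, Y carry no code.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex key for a name ("" if it has no letters)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    key = letters[0]                      # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")     # code of the previous significant letter
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code                   # emit a digit only when the code changes
        if ch not in "HW":
            # H and W do not separate equal codes; vowels (empty code) do.
            prev = code
    return (key + "000")[:4]              # pad or truncate to letter + 3 digits


# Variant spellings collapse to the same token.
print(soundex("Smith"), soundex("Smyth"))
print(soundex("Robert"), soundex("Rupert"))
```

Indexing the key instead of the raw surname means a query for one spelling also retrieves records filed under the other, at the cost of also merging names that merely sound alike.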
- Decompose terms into q-grams (sequences of q characters)
- Check for similarity between the q-gram sets of two terms
- Looking up the inverted index becomes a two-stage affair:
- Smaller index of q-grams consulted to expand each query term into a set of slightly distorted query terms
- These terms are submitted to the regular index
- Used by Google for spelling correction
- Idea also adopted for eliminating near-duplicate pages
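The two-stage lookup described above can be sketched as follows. This is a toy illustration and not how any production engine does it: the padding character `$`, the choice q = 3, the overlap threshold of 0.4, and all function names are assumptions made for the example.

```python
from collections import defaultdict


def qgrams(term, q=3):
    """Return the set of q-grams of a term, padded so word edges form grams too."""
    padded = "$" * (q - 1) + term.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_qgram_index(vocabulary, q=3):
    """Small index: q-gram -> vocabulary terms containing that gram."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in qgrams(term, q):
            index[gram].add(term)
    return index


def expand_term(term, qgram_index, q=3, threshold=0.4):
    """Stage 1: find vocabulary terms whose q-gram sets overlap enough with the query term."""
    query_grams = qgrams(term, q)
    shared = defaultdict(int)             # candidate term -> number of shared grams
    for gram in query_grams:
        for candidate in qgram_index.get(gram, ()):
            shared[candidate] += 1
    expanded = set()
    for candidate, common in shared.items():
        union = len(query_grams) + len(qgrams(candidate, q)) - common
        if common / union >= threshold:   # overlap ratio of the two gram sets
            expanded.add(candidate)
    return expanded


def search(term, qgram_index, inverted_index, q=3, threshold=0.4):
    """Stage 2: submit the expanded terms to the regular inverted index."""
    docs = set()
    for variant in expand_term(term, qgram_index, q, threshold):
        docs |= inverted_index.get(variant, set())
    return docs


# Toy inverted index: term -> set of document ids.
inverted_index = {"chakrabarti": {1, 2}, "chakraborty": {3}, "crawler": {4}}
qgram_index = build_qgram_index(inverted_index)
print(sorted(search("chakrabarty", qgram_index, inverted_index)))
```

The q-gram index only has to cover the vocabulary, not the documents, which is why it stays much smaller than the regular index.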
Meta-search systems
- Take the search engine to the document
- Forward queries to many geographically distributed repositories
- Each has its own search service
- Consolidate their responses.
- Advantages
- Perform non-trivial query rewriting
- Adapt a single user query to many search engines with different query syntaxes
- Surprisingly small overlap between crawls
- Consolidating responses
- Function goes beyond just eliminating duplicates
- Search services do not provide standard ranks which can be combined meaningfully
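Because the engines' scores cannot be compared directly, one common workaround is to merge results using rank positions only. The sketch below uses reciprocal rank fusion, a technique chosen here for illustration and not named in these notes. The constant `k = 60`, the function name, and the sample URLs are assumptions.

```python
from collections import defaultdict


def fuse_rankings(result_lists, k=60):
    """Merge ranked URL lists from several engines using reciprocal rank fusion.

    Only positions are used, so raw scores from different engines
    never need to be compared. Duplicates across engines are merged
    by summing their contributions.
    """
    scores = defaultdict(float)
    for results in result_lists:
        seen = set()
        for rank, url in enumerate(results, start=1):
            if url in seen:               # ignore repeats within one engine's list
                continue
            seen.add(url)
            scores[url] += 1.0 / (k + rank)
    # Higher fused score first; ties broken by URL for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical responses from two search services.
engine_a = ["http://a.example/1", "http://b.example/2", "http://c.example/3"]
engine_b = ["http://b.example/2", "http://d.example/4", "http://a.example/1"]
print(fuse_rankings([engine_a, engine_b]))
```

A page returned by several engines accumulates score from each list, so agreement between engines pushes it up even when no engine ranks it first.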
Similarity search
- Cluster hypothesis
- Documents similar to relevant documents are also likely to be relevant
- Handling “find similar” queries
- Replication or duplication of pages
- Mirroring of sites
Document similarity
- Jaccard coefficient of similarity between documents d1 and d2
- T(d) = set of tokens in document d
- r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
- Symmetric, reflexive, not a metric
- Forgives any number of occurrences and any permutations of the terms
- 1 - r'(d1, d2) is a metric
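A minimal sketch of the Jaccard coefficient over token sets, and of the distance obtained by subtracting it from 1. Whitespace tokenization, lowercasing, and the convention that two empty documents have similarity 1 are assumptions made for this example.

```python
def tokens(text):
    """T(d): the set of tokens in a document (whitespace split, lowercased)."""
    return set(text.lower().split())


def jaccard(d1, d2):
    """Jaccard coefficient |T(d1) & T(d2)| / |T(d1) | T(d2)|."""
    t1, t2 = tokens(d1), tokens(d2)
    if not t1 and not t2:
        return 1.0                        # assumed convention for two empty documents
    return len(t1 & t2) / len(t1 | t2)


def jaccard_distance(d1, d2):
    """1 minus the Jaccard coefficient; this form satisfies the triangle inequality."""
    return 1.0 - jaccard(d1, d2)


# Repeated terms and reordering do not change the similarity.
print(jaccard("the quick brown fox", "fox brown quick the the the"))
# Two of four distinct tokens are shared.
print(jaccard("a b c", "b c d"))
```

Because only set membership matters, two documents with the same vocabulary but very different term frequencies or word order are judged identical, which is the "forgiving" behavior noted above.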