Other issues - Spamming
- Adding popular query terms to a page unrelated to those terms
- E.g.: Adding “Hawaii vacation rental” to a page about “Internet gambling”
- Set back somewhat by hyperlink-based ranking
- Titles, headings, meta tags and anchor-text
- The TF-IDF framework treats all term occurrences the same, wherever they appear on the page
- Meta search engines:
- Assign extra weight to text occurring in tags and meta-tags
- Use anchor-text on pages u which link to v
- Anchor-text on u offers valuable editorial judgment about v as well.
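
The weighting idea above can be sketched as follows. This is a minimal illustration, not the scheme of any particular engine: the field names, the weight values in `FIELD_WEIGHTS`, and the helper `weighted_term_counts` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical field weights; the values are illustrative, not tuned.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "meta": 1.5, "anchor": 2.5, "body": 1.0}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def weighted_term_counts(fields, incoming_anchor_texts=()):
    """Return weighted term counts for a page v.

    fields: dict mapping a field name to the text found in that field on v.
    incoming_anchor_texts: anchor texts on pages u that link to v.
    """
    counts = Counter()
    for name, text in fields.items():
        weight = FIELD_WEIGHTS.get(name, 1.0)  # unknown fields count as body text
        for term in tokenize(text):
            counts[term] += weight
    # Anchor text lives on u, but it is credited to v.
    for anchor in incoming_anchor_texts:
        for term in tokenize(anchor):
            counts[term] += FIELD_WEIGHTS["anchor"]
    return counts


page_v = {"title": "Hawaii vacation rental", "body": "book a rental on the beach"}
anchors = ["cheap Hawaii rental"]
scores = weighted_term_counts(page_v, anchors)
```

Here "rental" collects weight from the title, the body, and the incoming anchor text, so it outscores a term such as "beach" that appears only in the body. These weighted counts could replace raw term frequencies in a TF-IDF score.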
Other issues (contd.) - Including phrases to rank complex queries
- Operators to specify word inclusions and exclusions
- With operators and phrases, queries and documents can no longer be treated as ordinary points in vector space
- Dictionary of phrases
- Could be cataloged manually
- Could be derived from the corpus itself using statistical techniques
- Two separate indices:
- one for single terms and another for phrases
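
A minimal sketch of the two-index arrangement, together with inclusion and exclusion operators, might look like this. The phrase dictionary `PHRASES`, the restriction to two-word phrases, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical phrase dictionary; in practice it would be cataloged manually
# or derived from the corpus.
PHRASES = {("vector", "space"), ("search", "engine")}


def build_indices(docs):
    """Build one inverted index for single terms and another for two-word phrases."""
    term_index = defaultdict(set)
    phrase_index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for tok in tokens:
            term_index[tok].add(doc_id)
        # Only adjacent pairs listed in the dictionary are indexed as phrases.
        for pair in zip(tokens, tokens[1:]):
            if pair in PHRASES:
                phrase_index[pair].add(doc_id)
    return term_index, phrase_index


def boolean_filter(term_index, all_ids, include=(), exclude=()):
    """Keep documents containing every `include` term and no `exclude` term."""
    result = set(all_ids)
    for term in include:
        result &= term_index.get(term, set())
    for term in exclude:
        result -= term_index.get(term, set())
    return result


docs = {
    1: "documents are points in vector space",
    2: "a search engine ranks documents",
    3: "space travel documents",
}
term_index, phrase_index = build_indices(docs)
```

With these indices, the phrase "vector space" matches only document 1, even though the single term "space" also occurs in document 3. Calling `boolean_filter(term_index, docs, include=["documents"], exclude=["space"])` keeps only document 2.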
Corpus derived phrase dictionary - Two terms $t_1$ and $t_2$
- Null hypothesis = occurrences of $t_1$ and $t_2$ are independent
- To the extent the pair violates the null hypothesis, it is likely to be a phrase
- Measure the violation with the likelihood ratio of the two hypotheses
- Pick phrases that violate the null hypothesis with high confidence
- Contingency table built from statistics
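
The procedure above can be sketched in a few lines. This sketch assumes that the statistics are counts over adjacent token pairs and that the test is the standard log-likelihood-ratio (G-squared) statistic for a 2x2 table. The names `contingency` and `neg2_log_lambda` and the toy token stream are hypothetical.

```python
import math


def contingency(tokens, t1, t2):
    """Count adjacent pairs (a, b) by whether a == t1 and whether b == t2."""
    k11 = k10 = k01 = k00 = 0
    for a, b in zip(tokens, tokens[1:]):
        if a == t1 and b == t2:
            k11 += 1
        elif a == t1:
            k10 += 1
        elif b == t2:
            k01 += 1
        else:
            k00 += 1
    return k11, k10, k01, k00


def neg2_log_lambda(k11, k10, k01, k00):
    """Return -2 log(lambda) for independence in a 2x2 table.

    Larger values mean a stronger violation of the null hypothesis.
    Empty cells contribute nothing (the usual 0 * log 0 = 0 convention).
    """
    n = k11 + k10 + k01 + k00
    row1, row0 = k11 + k10, k01 + k00
    col1, col0 = k11 + k01, k10 + k00
    cells = ((k11, row1, col1), (k10, row1, col0), (k01, row0, col1), (k00, row0, col0))
    total = 0.0
    for k, row, col in cells:
        if k > 0:
            # Observed count k versus expected count row * col / n under independence.
            total += k * math.log(k * n / (row * col))
    return 2.0 * total


tokens = "new york is big and new york is old and new ideas are big".split()
table = contingency(tokens, "new", "york")
score = neg2_log_lambda(*table)
```

A perfectly independent table such as `(1, 1, 1, 1)` scores exactly zero. A pair whose score exceeds a chosen chi-squared threshold (one degree of freedom) would be a candidate for the phrase dictionary.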
Corpus derived phrase dictionary - Hypotheses
- Null hypothesis
- Alternative hypothesis
- Likelihood ratio
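
One standard way to write these three items, assuming a multinomial model over the four cells of the contingency table for two terms $t_1$ and $t_2$, is sketched below. The symbols $k_{ij}$, $p_{ij}$, $\Pi_0$, $\Pi$, and $\lambda$ are notation assumed here. $k_{ij}$ is the observed count of the cell where $t_1$ is present ($i = 1$) or absent ($i = 0$) and $t_2$ is present ($j = 1$) or absent ($j = 0$).

$$
H(p; k) \;\propto\; p_{00}^{k_{00}} \, p_{01}^{k_{01}} \, p_{10}^{k_{10}} \, p_{11}^{k_{11}}
$$

Null hypothesis (independence), with only two free parameters $p_1$ and $p_2$:

$$
p_{11} = p_1 p_2, \quad p_{10} = p_1 (1 - p_2), \quad p_{01} = (1 - p_1)\, p_2, \quad p_{00} = (1 - p_1)(1 - p_2)
$$

Alternative hypothesis: the four cell probabilities are unconstrained apart from $p_{00} + p_{01} + p_{10} + p_{11} = 1$.

Likelihood ratio, where $\Pi_0$ is the set of parameter values allowed by the null hypothesis and $\Pi$ is the full parameter space:

$$
\lambda = \frac{\max_{p \in \Pi_0} H(p; k)}{\max_{p \in \Pi} H(p; k)}
$$

Under the null hypothesis, $-2 \log \lambda$ is asymptotically $\chi^2$-distributed with one degree of freedom, so a large value indicates a likely phrase.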
Approximate string matching - Non-uniformity of word spellings
- dialects of English
- transliteration from other languages
- Two ways to reduce this problem.
- Aggressive conflation mechanism to collapse variant spellings into the same token
- Decompose terms into a sequence of q-grams, i.e., sequences of q characters
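
The q-gram decomposition can be sketched as follows. Padding with `#` and scoring by Jaccard overlap are choices made for this illustration, and the function names are hypothetical.

```python
def qgrams(term, q=3, pad="#"):
    """Split a term into overlapping q-character substrings.

    The term is padded on both sides so that its first and last characters
    appear in as many q-grams as interior characters do.
    """
    padded = pad * (q - 1) + term.lower() + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def qgram_similarity(a, b, q=3):
    """Jaccard overlap between the q-gram sets of two terms (1.0 = identical sets)."""
    set_a, set_b = set(qgrams(a, q)), set(qgrams(b, q))
    union = set_a | set_b
    if not union:
        return 1.0  # both terms are empty
    return len(set_a & set_b) / len(union)


british = qgram_similarity("colour", "color")
unrelated = qgram_similarity("colour", "column")
```

Variant spellings such as "colour" and "color" share most of their q-grams, so they score higher than an unrelated pair such as "colour" and "column". An index keyed on q-grams can therefore retrieve near matches for a query term without an exact spelling match.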