Web search engines


Date: 03.11.2023

Stopwords

  • Function words and connectives
  • Appear in a large number of documents and are of little use in pinpointing documents
  • Indexing stopwords
    • Stopwords not indexed
      • For reducing index space and improving performance
    • Replace stopwords with a placeholder (to remember the offset)
  • Issues
    • Queries consisting only of stopwords are ruled out
    • Polysemous words that are stopwords in one sense but not in others
      • E.g.: can as a verb vs. can as a noun
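
The two indexing options above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the stopword list, the `#` placeholder, and the function names `drop_stopwords` and `mask_stopwords` are all assumptions made for the example.

```python
# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "to", "and", "or", "in", "is", "be", "not"}
PLACEHOLDER = "#"  # assumed marker that keeps the original token offsets


def drop_stopwords(tokens):
    """Option 1: do not index stopwords at all (smaller index, offsets shift)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def mask_stopwords(tokens):
    """Option 2: replace each stopword with a placeholder so offsets are kept."""
    return [PLACEHOLDER if t.lower() in STOPWORDS else t for t in tokens]


doc = "the history of the web".split()
print(drop_stopwords(doc))   # stopwords removed, positions collapse
print(mask_stopwords(doc))   # same length as the input, positions preserved

# A query made only of stopwords leaves nothing to look up.
print(drop_stopwords("to be or not to be".split()))
```

Keeping the placeholder lets phrase queries still check that "history" and "web" are three positions apart, at the cost of slightly longer position lists.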

Stemming

  • Conflating words to help match a query term with a morphological variant in the corpus.
  • Remove inflections that convey parts of speech, tense and number
  • E.g.: university and universal both reduce to the same stem as universe (univers under Porter's algorithm)
  • Techniques
    • morphological analysis (e.g., Porter's algorithm)
    • dictionary lookup (e.g., WordNet).
  • Stemming may increase recall but at the price of precision
    • Abbreviations, polysemy and names coined in the technical and commercial sectors
    • E.g.: stemming “ides” to “IDE”, “SOCKS” (a proxy protocol) to “sock”, or “gated” (a routing daemon) to “gate” can hurt precision
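
The recall/precision trade-off can be seen with a toy suffix stripper. The sketch below is not Porter's algorithm; it applies only a handful of plural rules, and the names `SUFFIX_RULES`, `naive_stem`, `guarded_stem` and the protected-term set are assumptions introduced for illustration.

```python
# Toy plural-stripping rules, tried in order (an assumption, far simpler than Porter).
SUFFIX_RULES = [("sses", "ss"), ("ies", "y"), ("ss", "ss"), ("s", "")]


def naive_stem(word):
    """Case-fold the word and apply the first matching suffix rule."""
    w = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            return w[: len(w) - len(suffix)] + replacement
    return w


def guarded_stem(word, protected):
    """Leave protected terms (abbreviations, product names) untouched."""
    return word if word in protected else naive_stem(word)


# Helps recall: morphological variants of a query term conflate.
print(naive_stem("queries"), naive_stem("query"))

# Hurts precision: the protocol name collapses onto an everyday noun.
print(naive_stem("SOCKS"), naive_stem("socks"))

# One possible mitigation: a hand-maintained list of protected terms.
print(guarded_stem("SOCKS", {"SOCKS"}))
```

A protected-term list is only one way to limit the damage; it has to be curated by hand and does not help with polysemy.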

Batch indexing and updates

  • Incremental indexing
    • Time-consuming due to random disk I/O
    • Leads to a high level of disk block fragmentation
  • Batch indexing relies on simple sort-merges
  • For a dynamic collection
    • A single document-level change may require updating hundreds to thousands of records
    • Solution: create an additional “stop-press” index
  • Maintaining indices over dynamic collections.
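
As a rough illustration of the sort-merge idea, the sketch below builds an inverted index in memory by emitting (t, d) postings, sorting them, and grouping runs of equal terms. It assumes whitespace tokenization and a collection small enough to sort in RAM; a real batch indexer would sort runs on disk and merge them. The function name `build_index` is hypothetical.

```python
from itertools import groupby


def build_index(docs):
    """Build term -> sorted list of document IDs via sort and merge.

    docs maps a document ID to its text (whitespace tokenization assumed).
    """
    # 1. Emit one (term, doc) posting per distinct term occurrence.
    postings = {(t, d) for d, text in docs.items() for t in text.lower().split()}
    # 2. Sort in (t, d) order so equal terms become contiguous runs.
    ordered = sorted(postings)
    # 3. Merge each run of equal terms into a single postings list.
    return {t: [d for _, d in run] for t, run in groupby(ordered, key=lambda p: p[0])}


docs = {1: "web search engines", 2: "search the web"}
index = build_index(docs)
print(index["web"], index["engines"])
```

Because every posting is written once in sorted order, the I/O pattern is sequential, which is the contrast with the random disk access of incremental updates.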

Stop-press index

  • Collection of documents in flux
    • Model document modification as deletion followed by insertion
    • Documents in flux are represented by signed records (d, t, s): document d, term t, sign s
    • The sign s specifies whether d has been deleted or inserted
  • Getting the final answer to a query
    • Main index returns a document set D0
    • Stop-press index returns two document sets
      • D+: documents matching the query that are not yet indexed in D0
      • D-: documents matching the query that have been removed from the collection since D0 was constructed
    • Final answer: (D0 ∪ D+) \ D-
  • When the stop-press index gets too large
    • Rebuild the main index
      • Signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records
    • The stop-press index can then be emptied
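
The query and rebuild steps above can be sketched as follows. This is an in-memory illustration under stated assumptions: the main index is a dict from term to a set of document IDs, the sign s is the string "+" or "-", and when the same (d, t) pair appears more than once the latest record wins (so that a modification, modeled as a deletion followed by an insertion, leaves the document in the result). The names `latest_signs`, `query`, and `rebuild` are hypothetical.

```python
def latest_signs(stop_press):
    """Map (t, d) -> last sign seen; later records override earlier ones."""
    latest = {}
    for d, t, s in stop_press:
        latest[(t, d)] = s
    return latest


def query(main_index, stop_press, term):
    """Answer a single-term query as (D0 | D+) - D-."""
    d0 = main_index.get(term, set())
    latest = latest_signs(stop_press)
    d_plus = {d for (t, d), s in latest.items() if t == term and s == "+"}
    d_minus = {d for (t, d), s in latest.items() if t == term and s == "-"}
    return (d0 | d_plus) - d_minus


def rebuild(main_index, stop_press):
    """Merge-purge the signed records into the master (t, d) postings."""
    merged = {t: set(ds) for t, ds in main_index.items()}
    # Visit records in (t, d) order, mirroring the sorted merge pass.
    for (t, d), s in sorted(latest_signs(stop_press).items()):
        if s == "+":
            merged.setdefault(t, set()).add(d)
        else:
            merged.get(t, set()).discard(d)
    merged = {t: ds for t, ds in merged.items() if ds}
    return merged, []  # new main index, emptied stop-press index


main = {"web": {1, 2}, "search": {2}}
stop_press = [
    (2, "web", "-"),     # document 2 no longer contains "web"
    (3, "web", "+"),     # new document 3 contains "web"
    (2, "search", "-"),  # document 2 modified: deletion ...
    (2, "search", "+"),  # ... followed by re-insertion
]
print(query(main, stop_press, "web"))
new_main, new_stop_press = rebuild(main, stop_press)
print(new_main)
```

After the rebuild, queries against the new main index with an empty stop-press index return the same answers as before, which is what allows the stop-press index to be discarded.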
