Web search engines


Index compression techniques


Download 352.5 Kb.
bet3/10
Sana03.11.2023
Hajmi352.5 Kb.
#1742628
1   2   3   4   5   6   7   8   9   10
Bog'liq
search

Index compression techniques

  • Compressing the index so that much of it can be held in memory
    • Required for high-performance IR installations (as with Web search engines),
  • Redundancy in index storage
    • Storage of document IDs.
  • Delta encoding

Encoding gaps

  • Small gap must cost far fewer bits than a document ID.
  • Binary encoding
    • Optimal when all symbols are equally likely
  • Unary code
    • optimal if probability of large gaps decays exponentially

Encoding gaps

  • Gamma code
  • Golomb codes
    • Further enhancement

Lossy compression mechanisms

  • Trading off space for time
  • collect documents into buckets
    • Construct inverted index from terms to bucket IDs
    • Document' IDs shrink to half their size.
  • Cost: time overheads
  • Solution: index documents in each bucket separately
    • E.g.: Glimpse (http://webglimpse.org/)

General dilemmas

Relevance ranking

  • Keyword queries
    • In natural language
    • Not precise, unlike SQL
      • Boolean decision for response unacceptable
    • Solution
      • Rate each document for how likely it is to satisfy the user's information need
      • Sort in decreasing order of the score
      • Present results in a ranked list.
  • No algorithmic way of ensuring that the ranking strategy always favors the information need
    • Query: only a part of the user's information need

Download 352.5 Kb.

Do'stlaringiz bilan baham:
1   2   3   4   5   6   7   8   9   10




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©fayllar.org 2024
ma'muriyatiga murojaat qiling