Web search engines

Date: 03.11.2023

Stopwords

Function words and connectives
Appear in a large number of documents and are of little use in pinpointing documents
Indexing stopwords

Stopwords not indexed

For reducing index space and improving performance

Replace stopwords with a placeholder (to remember the offset)

Issues

Queries containing only stopwords are ruled out
Polysemous words that are stopwords in one sense but not in others

E.g.: “can” as a verb vs. “can” as a noun
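
The placeholder idea and the two issues above can be sketched as follows. This is a minimal illustration, not a production indexer: the stopword list, the `PLACEHOLDER` marker and the function names are all assumptions made for the example.

```python
# Hypothetical stopword list; real systems use much longer ones.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "can"}
# Hypothetical marker: keeps the token's offset but is never indexed.
PLACEHOLDER = None


def tokenize(text):
    """Lowercase, split on whitespace, and swap stopwords for the placeholder."""
    return [PLACEHOLDER if w in STOPWORDS else w for w in text.lower().split()]


def build_index(docs):
    """Map each non-stopword term to a list of (doc_id, offset) postings."""
    index = {}
    for doc_id, text in docs.items():
        for offset, token in enumerate(tokenize(text)):
            if token is not PLACEHOLDER:
                index.setdefault(token, []).append((doc_id, offset))
    return index


def search(index, query):
    """Return ids of documents containing every non-stopword query term.

    Returns None when the query consists only of stopwords.
    """
    terms = [t for t in tokenize(query) if t is not PLACEHOLDER]
    if not terms:
        return None
    result = None
    for term in terms:
        ids = {doc_id for doc_id, _ in index.get(term, [])}
        result = ids if result is None else result & ids
    return result


docs = {1: "the history of the can", 2: "a history of search"}
index = build_index(docs)
```

Because stopwords still occupy a position, "history" keeps offset 1 in both documents even though the word before it is not indexed. The query "the can" returns `None`: it contains only stopwords, and the noun sense of "can" in document 1 is lost along with the verb sense.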

Stemming

Conflating words to help match a query term with a morphological variant in the corpus.
Remove inflections that convey parts of speech, tense and number
E.g.: “university” and “universal” both conflate to the same stem as “universe” (Porter's algorithm reduces all three to “univers”).
Techniques

morphological analysis (e.g., Porter's algorithm)
dictionary lookup (e.g., WordNet).

Stemming may increase recall but at the price of precision

Problem cases: abbreviations, polysemy and names coined in the technical and commercial sectors
E.g.: stemming “ides” to “IDE”, “SOCKS” to “sock”, or “gated” to “gate” may hurt precision.
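
The recall/precision trade-off can be illustrated with a small dictionary-lookup stemmer. The `LEMMAS` table below is a hypothetical stand-in for a real dictionary such as WordNet, and the documents are invented for the example.

```python
# Hypothetical lemma table; a real system would consult a dictionary.
LEMMAS = {"gates": "gate", "gated": "gate", "socks": "sock", "ides": "ide"}


def stem(word):
    """Dictionary-lookup stemming: map a word to its lemma if one is known."""
    w = word.lower()
    return LEMMAS.get(w, w)


def build_index(docs, normalize):
    """Map each normalized term to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(normalize(word), set()).add(doc_id)
    return index


def lookup(index, normalize, term):
    """Normalize the query term the same way the documents were normalized."""
    return index.get(normalize(term), set())


docs = {
    1: "garden gates",
    2: "gated community",
    3: "SOCKS proxy",
    4: "wool sock",
}
exact = build_index(docs, str.lower)   # case folding only
stemmed = build_index(docs, stem)      # case folding plus stemming
```

With case folding only, the query "gate" matches nothing; with stemming it matches documents 1 and 2, which is the recall gain. The query "sock" now also retrieves document 3, which is about the SOCKS protocol rather than clothing, which is the precision loss.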

Batch indexing and updates

Incremental indexing

Time-consuming due to random disk IO
High level of disk block fragmentation

Simple sort-merges.

To replace the indexed update of variable-length postings
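
A sort-merge build might look roughly like the sketch below. Each batch of documents is turned into a sorted run of (term, document) pairs, and the runs are then merged in a single sequential pass. The function names are assumptions, and the runs are kept in memory here only for brevity; in practice they would be files on disk read sequentially.

```python
import heapq
from itertools import groupby


def make_run(docs):
    """Return the sorted (term, doc_id) pairs of one batch: one sorted run."""
    pairs = {
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.lower().split()
    }
    return sorted(pairs)


def merge_runs(runs):
    """Merge sorted runs sequentially and group them into postings lists."""
    merged = heapq.merge(*runs)  # lazy k-way merge of already sorted inputs
    return {
        term: [doc_id for _, doc_id in group]
        for term, group in groupby(merged, key=lambda pair: pair[0])
    }


batch_a = {1: "web search engines", 2: "search index"}
batch_b = {3: "index merge", 4: "web index"}
index = merge_runs([make_run(batch_a), make_run(batch_b)])
```

No postings list is ever updated in place: every list is written once, in order, which is what avoids the random disk IO and fragmentation of incremental updates.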

For a dynamic collection

A single document-level change may require updating hundreds to thousands of records.
Solution: create an additional “stop-press” index.

Maintaining indices over dynamic collections.

Stop-press index

Collection of documents in flux

Model document modification as deletion followed by insertion
Documents in flux are represented by signed records (d, t, s), where “d” is the document and “t” a term it contains
“s” specifies whether “d” has been deleted or inserted.

Getting the final answer to a query

Main index returns a document set D0.
Stop-press index returns two document sets

D+: documents matching the query that are not yet indexed in D0
D-: documents matching the query that were removed from the collection since D0 was constructed.
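
One way to combine the three sets is sketched below, assuming the final answer is (D0 ∪ D+) \ D-. The `INSERT`/`DELETE` encoding of the sign, the sample data, and the rule that the latest record for a (document, term) pair wins are assumptions made for this example.

```python
INSERT, DELETE = +1, -1  # hypothetical encoding of the sign s

# Main index: term -> documents, as of the last rebuild.
main_index = {"search": {1, 2, 3}, "engine": {1, 3}}

# Stop-press records (d, t, s) in arrival order.
stop_press = [
    (2, "search", DELETE),  # document 2 was removed
    (4, "search", INSERT),  # document 4 is new
    (4, "engine", INSERT),
    (3, "engine", DELETE),  # document 3 was modified: delete the old version...
    (3, "search", DELETE),
    (3, "search", INSERT),  # ...then insert the new one, which lacks "engine"
]


def stop_press_sets(records, term):
    """Return (D+, D-) for one term; a later record overrides an earlier one."""
    latest = {}
    for d, t, s in records:
        if t == term:
            latest[d] = s
    d_plus = {d for d, s in latest.items() if s == INSERT}
    d_minus = {d for d, s in latest.items() if s == DELETE}
    return d_plus, d_minus


def query(term):
    """Answer a single-term query from the main and stop-press indices."""
    d0 = main_index.get(term, set())
    d_plus, d_minus = stop_press_sets(stop_press, term)
    return (d0 | d_plus) - d_minus
```

Here `query("search")` drops the deleted document 2 and picks up the new document 4, while `query("engine")` no longer returns document 3 because its new version does not contain the term.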

Stop-press index getting too large

Rebuild the main index

Signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records

Stop-press index can be emptied out.
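
The merge-purge step can be sketched as a single pass over two sorted lists. This is an illustration under assumptions rather than the definitive procedure: the sign encoding, the function name, and the rule that only the latest record per (t, d) pair survives are all choices made for the example.

```python
INSERT, DELETE = +1, -1  # hypothetical encoding of the sign s


def rebuild(master, stop_press):
    """Merge-purge stop-press records into the master (t, d) records.

    master: list of (t, d) tuples sorted in (t, d) order.
    stop_press: list of (d, t, s) tuples in arrival order.
    Returns a new sorted master list.
    """
    # Assumption: the latest record for a (t, d) pair is the one that counts.
    latest = {}
    for d, t, s in stop_press:
        latest[(t, d)] = s
    signed = sorted((t, d, s) for (t, d), s in latest.items())

    out = []
    i = j = 0
    while i < len(master) or j < len(signed):
        if j == len(signed) or (i < len(master) and master[i] < signed[j][:2]):
            out.append(master[i])  # master record untouched by the stop-press
            i += 1
        else:
            t, d, s = signed[j]
            if i < len(master) and master[i] == (t, d):
                i += 1  # the signed record supersedes the master record
            if s == INSERT:
                out.append((t, d))
            j += 1
    return out


master = [("engine", 1), ("engine", 3), ("search", 1), ("search", 2), ("search", 3)]
stop_press = [
    (2, "search", DELETE),
    (4, "search", INSERT),
    (4, "engine", INSERT),
    (3, "engine", DELETE),
    (3, "search", DELETE),
    (3, "search", INSERT),
]
new_master = rebuild(master, stop_press)
stop_press.clear()  # the stop-press index can now be emptied
```

Both inputs are read sequentially and the output is written in order, so the rebuild has the same disk-friendly access pattern as the original batch build.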
