- Generate a set of m random permutations π
- for each permutation π do
- compute π(T(d1)) and π(T(d2))
- check if min π(T(d1)) = min π(T(d2))
- end for
- if equality was observed in k cases, estimate the resemblance r(d1, d2) ≈ k/m (a code sketch follows this list)
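A minimal Python sketch of this estimator, assuming word q-grams as the shingle set T(d) and simulating each random permutation with a random affine hash of the shingle (a common practical stand-in for a true permutation). The names `shingles` and `estimate_resemblance` are illustrative, not from the source.

```python
import random

def shingles(text, q=4):
    """The set T(d) of word q-grams (shingles) of a document."""
    words = text.split()
    return {tuple(words[i:i + q]) for i in range(len(words) - q + 1)}

def estimate_resemblance(t1, t2, m=200, seed=0):
    """Estimate the Jaccard similarity of shingle sets t1 and t2 with m random permutations.

    Each permutation is simulated by a random affine hash; the fraction of
    permutations on which the two minima agree estimates |t1 & t2| / |t1 | t2|.
    Python's built-in hash() is randomized per process but stable within one run,
    which is all this comparison needs.
    """
    rng = random.Random(seed)
    mask = (1 << 32) - 1
    k = 0
    for _ in range(m):
        a = rng.randrange(1, mask, 2)            # random odd multiplier
        b = rng.randrange(mask)
        pi = lambda s: (a * hash(s) + b) & mask  # simulated permutation of the shingle space
        if min(map(pi, t1)) == min(map(pi, t2)):
            k += 1                               # minima agree under this permutation
    return k / m                                 # k equalities out of m permutations

d1 = "the quick brown fox jumps over the lazy dog near the bank of the river"
d2 = "the quick brown fox jumped over the lazy dog near the bank of the river"
print(estimate_resemblance(shingles(d1), shingles(d2)))
```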
Fast similarity search with random permutations
- for each random permutation π do
- create a file fπ
- for each document d do
- write out s = min π(T(d)) and d to fπ
- end for
- sort fπ using key s -- this results in contiguous blocks with fixed s containing all d associated with that s
- create a file gπ
- for each pair (d1, d2) within a run of fπ having a given s do
- write out a document-pair record (d1, d2) to gπ
- end for
- sort gπ on key (d1, d2)
- end for
- merge gπ for all π in (d1, d2) order, counting the number of entries for each pair; this count, out of the total number of permutations, estimates the pair's resemblance (see the sketch after this list)
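A compact in-memory sketch of the same sort-and-merge scheme, under the simplifying assumption that everything fits in memory: each file fπ becomes a sorted list of (s, d) records, each gπ becomes the pairs emitted from runs of equal s, and the final merge becomes a Counter over pairs. `candidate_pairs` and the affine-hash permutation are illustrative choices, not from the source.

```python
import random
from collections import Counter
from itertools import combinations

def candidate_pairs(shingle_sets, m=100, seed=0):
    """shingle_sets: dict mapping doc id -> set of shingles.
    Returns a Counter mapping (d1, d2) -> number of permutations on which their minima agreed."""
    rng = random.Random(seed)
    mask = (1 << 32) - 1
    counts = Counter()
    for _ in range(m):
        a, b = rng.randrange(1, mask, 2), rng.randrange(mask)
        # f_pi: one (s, d) record per document, s = minimum of the permuted shingle set
        f = sorted((min((a * hash(x) + b) & mask for x in t), d)
                   for d, t in shingle_sets.items() if t)
        # g_pi: all document pairs inside a run of records sharing the same s
        i = 0
        while i < len(f):
            j = i
            while j < len(f) and f[j][0] == f[i][0]:
                j += 1
            for d1, d2 in combinations(sorted(d for _, d in f[i:j]), 2):
                counts[(d1, d2)] += 1
            i = j
    return counts

# pairs whose count / m exceeds a threshold are the near-duplicate candidates
```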
Eliminating near-duplicates via shingling
- “Find-similar” algorithm reports all duplicate/near-duplicate pages
- Eliminating duplicates
- Eliminating near-duplicates
- Represent each document as a set T(d) of q-grams (shingles)
- Find the Jaccard similarity between T(d1) and T(d2): r(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
- Eliminate the pair (d1, d2) if its similarity is above a threshold (a brute-force sketch follows this list)
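For comparison, a brute-force sketch of the exact computation that the permutation trick above only approximates: build T(d) as word q-grams and compute the Jaccard similarity directly for every pair, which is feasible only for small collections. `qgrams`, `jaccard`, `near_duplicate_pairs`, and the threshold value are illustrative.

```python
from itertools import combinations

def qgrams(text, q=4):
    """T(d): the set of word q-grams (shingles) of a document."""
    words = text.split()
    return {tuple(words[i:i + q]) for i in range(len(words) - q + 1)}

def jaccard(t1, t2):
    """r(d1, d2) = |T(d1) & T(d2)| / |T(d1) | T(d2)|."""
    return len(t1 & t2) / len(t1 | t2) if (t1 or t2) else 1.0

def near_duplicate_pairs(docs, threshold=0.9, q=4):
    """docs: dict mapping doc id -> text. Returns pairs whose similarity exceeds the threshold."""
    t = {d: qgrams(text, q) for d, text in docs.items()}
    return [(d1, d2) for d1, d2 in combinations(sorted(t), 2)
            if jaccard(t[d1], t[d2]) > threshold]
```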
Similarity search and duplicate elimination on the graph structure of the web
- To improve quality of hyperlink-assisted ranking
- Detecting mirrored sites
- Approach 1 [Bottom-up Approach]
- Start process with textual duplicate detection
- cleaned URLs are listed and sorted to find duplicates/near-duplicates
- each set of equivalent URLs is assigned a unique token ID
- each page is stripped of all text, and represented as a sequence of outlink IDs
- Continue using the link-sequence representation
- until no further collapse of multiple URLs is possible (a sketch of the collapse loop follows this list)
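A minimal sketch of that collapse loop, under several simplifying assumptions: URLs are already cleaned/normalized, textual duplicate detection has already produced the initial groups of equivalent URLs, and exact equality of outlink-ID sequences stands in for near-duplicate matching on link sequences. All identifiers (`collapse_mirrors`, `duplicate_groups`, ...) are illustrative.

```python
def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the classes of a and b; return True if they were previously distinct."""
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return False
    parent[rb] = ra
    return True

def collapse_mirrors(outlinks, duplicate_groups):
    """outlinks: url -> list of outlink URLs (pages already stripped of text).
    duplicate_groups: lists of URLs found equivalent by textual duplicate detection.
    Repeatedly merges URLs whose pages have identical sequences of outlink class IDs,
    until no further collapse of multiple URLs is possible."""
    parent = {u: u for u in outlinks}
    for group in duplicate_groups:                 # seed the classes with textual duplicates
        for u in group:
            parent.setdefault(u, u)
        for u in group[1:]:
            union(parent, group[0], u)
    changed = True
    while changed:
        changed = False
        by_signature = {}
        for u, links in outlinks.items():          # link-sequence representation of each page
            sig = tuple(find(parent, v) for v in links if v in parent)
            if sig:
                by_signature.setdefault(sig, []).append(u)
        for urls in by_signature.values():         # identical signatures collapse into one class
            for u in urls[1:]:
                changed |= union(parent, urls[0], u)
    return {u: find(parent, u) for u in parent}    # final mapping: URL -> class representative
```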
- Approach 2 [Bottom-up Approach]
- identify single nodes which are near duplicates (using text-shingling)
- extend single-node mirrors to two-node mirrors
- continue on to larger and larger subgraphs which are likely mirrors of one another (see the sketch after this list)
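A small sketch of the first growth step in Approach 2, assuming text shingling has already produced the set of near-duplicate page pairs; `grow_mirrors` and its arguments are illustrative names.

```python
def grow_mirrors(near_dup_pairs, outlinks):
    """near_dup_pairs: iterable of (u, v) page pairs flagged as near-duplicates by text shingling.
    outlinks: url -> list of outlink URLs.
    Extends single-node mirrors to two-node mirrors: (u, v) is joined with (u2, v2)
    when u links to u2, v links to v2, and (u2, v2) is itself a near-duplicate pair."""
    pair_set = {frozenset(p) for p in near_dup_pairs}
    two_node = []
    for (u, v) in near_dup_pairs:
        for u2 in outlinks.get(u, []):
            for v2 in outlinks.get(v, []):
                if frozenset((u2, v2)) in pair_set:
                    two_node.append(((u, v), (u2, v2)))
    return two_node

# repeating this join on the resulting pairs-of-pairs grows larger and larger candidate mirror subgraphs
```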