Detecting mirrored sites (contd.) - Approach 3 [Step before fetching all pages]
- Uses regularity in URL strings to identify host-pairs which are mirrors
- Preprocessing
- Hosts are represented as sets of positional bigrams
- Convert host and path to all lowercase characters
- Let any punctuation or digit sequence be a token separator
- Tokenize the URL into a sequence of tokens (e.g., www6.infoseek.com gives www, infoseek, com)
- Eliminate stop terms such as htm, html, txt, main, index, home, bin, cgi
- Form positional bigrams from the token sequence
- Two hosts are said to be mirrors if
- A large fraction of paths are valid on both web sites
- These common paths link to pages that are near-duplicates
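
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the function names `tokenize` and `positional_bigrams` are hypothetical, the separator rule is approximated as "any run of characters other than a-z", and a positional bigram is assumed to be a pair of adjacent tokens tagged with the index of the first token.

```python
import re

# Stop terms listed in the preprocessing steps above.
STOP_TERMS = {"htm", "html", "txt", "main", "index", "home", "bin", "cgi"}


def tokenize(url_part):
    """Lowercase the string, split on punctuation/digit runs, drop stop terms."""
    # Assumption: after lowercasing, every run of non a-z characters
    # (punctuation, digits, slashes) acts as a single token separator.
    tokens = re.split(r"[^a-z]+", url_part.lower())
    # Splitting can leave empty strings at the edges; discard them too.
    return [t for t in tokens if t and t not in STOP_TERMS]


def positional_bigrams(tokens):
    """Return the set of (position, token, next_token) triples.

    Assumption: "positional bigram" is read here as two adjacent tokens
    together with the index of the first one in the token sequence.
    """
    return {(i, a, b) for i, (a, b) in enumerate(zip(tokens, tokens[1:]))}


if __name__ == "__main__":
    host_tokens = tokenize("www6.infoseek.com")
    print(host_tokens)
    print(sorted(positional_bigrams(host_tokens)))
```

Under these assumptions, two hosts whose URLs produce heavily overlapping bigram sets would be candidate mirrors; the path and near-duplicate conditions listed above would still need to be checked separately.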