Robots.txt example - No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":
    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:
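To make the rules concrete, here is a minimal sketch of how a crawler could check them with Python's standard urllib.robotparser. The host www.example.com and the user-agent name MyCrawler are placeholders; a live crawler would normally load robots.txt from the target site with set_url() and read() rather than parse().

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, parsed directly from a string so the
# sketch runs without network access.
rules = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# An arbitrary robot is barred from anything under /yoursite/temp/ ...
print(rp.can_fetch("MyCrawler", "http://www.example.com/yoursite/temp/page.html"))     # False
# ... but the robot called "searchengine" may fetch anything.
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))  # True
```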
Basic crawler operation (see the sketch after this list):
- Pick a URL from the frontier
- Fetch the document at the URL
- Parse the fetched document
  - Extract links from it to other docs (URLs)
- Check whether the URL's content has already been seen (duplicate-content detection)
- For each extracted URL
  - Ensure it passes URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
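The loop below is a minimal, single-threaded sketch of these steps, assuming a hypothetical .edu seed URL and a crude regex-based link extractor; a real crawler would use a proper HTML parser, respect politeness delays and robots.txt, and keep the frontier and "seen" sets in scalable data structures.

```python
from collections import deque
from hashlib import sha1
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

frontier = deque(["https://www.example.edu/"])  # hypothetical seed URL
seen_urls = set(frontier)                       # duplicate-URL elimination
seen_content = set()                            # duplicate-content detection

def passes_url_filter(url):
    # Example filter: only crawl .edu hosts (robots.txt checks omitted here).
    host = urlparse(url).hostname
    return bool(host) and host.endswith(".edu")

while frontier:
    url = frontier.popleft()                    # pick a URL from the frontier
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "replace")  # fetch
    except OSError:
        continue                                # unreachable or broken URL
    fingerprint = sha1(html.encode()).hexdigest()
    if fingerprint in seen_content:             # content already seen elsewhere?
        continue
    seen_content.add(fingerprint)
    # Parse the document and extract links to other docs (regex for brevity).
    for href in re.findall(r'href="([^"]+)"', html):
        link = urljoin(url, href)               # expand relative URLs
        if passes_url_filter(link) and link not in seen_urls:
            seen_urls.add(link)
            frontier.append(link)
```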
Basic crawl architecture
DNS (Domain Name System) - A lookup service on the internet
- Given a URL, retrieve the IP address of its host
- Service provided by a distributed set of servers – thus, lookup latencies can be high (even seconds)
- Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
- Solutions
- DNS caching
- Batch DNS resolver – collects requests and sends them out together (a simple version is sketched below)
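A small sketch of both ideas, with placeholder hostnames: an in-process DNS cache via functools.lru_cache, plus a thread pool that resolves many hostnames concurrently so the individual blocking lookups overlap (a rough stand-in for a batch resolver).

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=10_000)        # DNS caching: each hostname is resolved at most once
def resolve(hostname):
    try:
        return socket.gethostbyname(hostname)   # the blocking OS-level lookup
    except socket.gaierror:
        return None                              # resolution failed

hosts = ["en.wikipedia.org", "www.example.com", "www.example.org"]  # placeholders
with ThreadPoolExecutor(max_workers=20) as pool:
    # Many lookups are in flight at once, hiding the latency of any single one.
    for host, ip in zip(hosts, pool.map(resolve, hosts)):
        print(host, "->", ip)
```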
- When a fetched document is parsed, some of the extracted links are relative URLs
- E.g., http://en.wikipedia.org/wiki/Main_Page has a relative link to /wiki/Wikipedia:General_disclaimer which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
- During parsing, such relative URLs must be normalized (expanded) against the page's base URL, as in the sketch below
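This normalization is what urllib.parse.urljoin performs; a minimal sketch using the Wikipedia example above:

```python
from urllib.parse import urljoin

base = "http://en.wikipedia.org/wiki/Main_Page"
print(urljoin(base, "/wiki/Wikipedia:General_disclaimer"))
# -> http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
```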