Robots.txt example - No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":
    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:
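To make the rules concrete, here is a minimal sketch of how a crawler could check them with Python's standard urllib.robotparser. The host www.example.com and the user-agent name MyCrawler are placeholders; a live crawler would normally load robots.txt from the target site with set_url() and read() rather than parse().

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, parsed directly from a string so the
# sketch runs without network access.
rules = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# An arbitrary robot is barred from anything under /yoursite/temp/ ...
print(rp.can_fetch("MyCrawler", "http://www.example.com/yoursite/temp/page.html"))     # False
# ... but the robot called "searchengine" may fetch anything.
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))  # True
```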
Basic crawler operation (see the sketch after this list):
- Pick a URL from the frontier
- Fetch the document at the URL
- Parse the fetched document
  - Extract links from it to other docs (URLs)
- Check whether the URL's content has already been seen (duplicate-content detection)
- For each extracted URL
  - Ensure it passes URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
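The loop below is a minimal, single-threaded sketch of these steps, assuming a hypothetical .edu seed URL and a crude regex-based link extractor; a real crawler would use a proper HTML parser, respect politeness delays and robots.txt, and keep the frontier and "seen" sets in scalable data structures.

```python
from collections import deque
from hashlib import sha1
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

frontier = deque(["https://www.example.edu/"])  # hypothetical seed URL
seen_urls = set(frontier)                       # duplicate-URL elimination
seen_content = set()                            # duplicate-content detection

def passes_url_filter(url):
    # Example filter: only crawl .edu hosts (robots.txt checks omitted here).
    host = urlparse(url).hostname
    return bool(host) and host.endswith(".edu")

while frontier:
    url = frontier.popleft()                    # pick a URL from the frontier
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "replace")  # fetch
    except OSError:
        continue                                # unreachable or broken URL
    fingerprint = sha1(html.encode()).hexdigest()
    if fingerprint in seen_content:             # content already seen elsewhere?
        continue
    seen_content.add(fingerprint)
    # Parse the document and extract links to other docs (regex for brevity).
    for href in re.findall(r'href="([^"]+)"', html):
        link = urljoin(url, href)               # expand relative URLs
        if passes_url_filter(link) and link not in seen_urls:
            seen_urls.add(link)
            frontier.append(link)
```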
Basic crawl architecture
DNS (Domain Name System) - A lookup service on the internet
- Given a URL, retrieve the IP address of its host
- Service provided by a distributed set of servers – thus, lookup latencies can be high (even seconds)
- Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
- Solutions
- DNS caching
- Batch DNS resolver – collects requests and sends them out together (a simple version is sketched below)
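A small sketch of both ideas, with placeholder hostnames: an in-process DNS cache via functools.lru_cache, plus a thread pool that resolves many hostnames concurrently so the individual blocking lookups overlap (a rough stand-in for a batch resolver).

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=10_000)        # DNS caching: each hostname is resolved at most once
def resolve(hostname):
    try:
        return socket.gethostbyname(hostname)   # the blocking OS-level lookup
    except socket.gaierror:
        return None                              # resolution failed

hosts = ["en.wikipedia.org", "www.example.com", "www.example.org"]  # placeholders
with ThreadPoolExecutor(max_workers=20) as pool:
    # Many lookups are in flight at once, hiding the latency of any single one.
    for host, ip in zip(hosts, pool.map(resolve, hosts)):
        print(host, "->", ip)
```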
- When a fetched document is parsed, some of the extracted links are relative URLs
- E.g., http://en.wikipedia.org/wiki/Main_Page has a relative link to /wiki/Wikipedia:General_disclaimer which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
- During parsing, such relative URLs must be normalized (expanded) against the page's base URL, as in the sketch below
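This normalization is what urllib.parse.urljoin performs; a minimal sketch using the Wikipedia example above:

```python
from urllib.parse import urljoin

base = "http://en.wikipedia.org/wiki/Main_Page"
print(urljoin(base, "/wiki/Wikipedia:General_disclaimer"))
# -> http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
```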