Previous lecture recap


Date: 25.01.2023
lecture16-crawling

Content seen?

  • Sec. 20.2.1

Filters and robots.txt

  • Sec. 20.2.1

Duplicate URL elimination

  • For a non-continuous (one-shot) crawl, test to see if an extracted+filtered URL has already been passed to the frontier
  • For a continuous crawl – see details of frontier implementation
  • Sec. 20.2.1
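As a rough illustration of the one-shot case, the test can be sketched as a set-membership check made before a URL is handed to the frontier. The names `DupUrlEliminator`, `admit`, and `push_new_urls` are hypothetical, and an in-memory set stands in for whatever URL store a real crawler would use; this is a minimal sketch, not the lecture's implementation.

```python
from collections import deque


class DupUrlEliminator:
    """Remembers every URL that has already been passed to the frontier."""

    def __init__(self):
        # Assumption: the whole URL set fits in memory for a one-shot crawl.
        self._seen = set()

    def admit(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


def push_new_urls(urls, eliminator, frontier):
    """Append only previously unseen URLs to the frontier queue."""
    for url in urls:
        if eliminator.admit(url):
            frontier.append(url)


eliminator = DupUrlEliminator()
frontier = deque()
extracted = ["http://a.example/", "http://b.example/", "http://a.example/"]
push_new_urls(extracted, eliminator, frontier)
print(list(frontier))
```

The check assumes the URLs have already been extracted and filtered, as the slide states, so the same page always arrives as the same string.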

Distributing the crawler

  • Sec. 20.2.1

Communication between nodes

  • [Figure: architecture of one crawler node. Labeled components: WWW, DNS, Fetch, Parse, Content seen? (Doc FP's), URL filter (robots filters), Host splitter (to other nodes, from other nodes), Dup URL elim (URL set), URL Frontier]
  • Sec. 20.2.1
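One possible reading of the host splitter in the figure is a function that maps each URL's host to the node responsible for it, so that all URLs from one host end up at the same node. The sketch below assumes a simple hash-modulo assignment; the names `node_for_url` and `split_by_host`, the choice of SHA-1, and the `num_nodes` parameter are illustrative assumptions, not details taken from the slides.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url, num_nodes):
    """Map a URL to a node index based only on its host name."""
    # urlsplit(...).hostname is lowercased, so host case does not matter.
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # A stable hash (unlike the built-in hash()) gives the same answer
    # on every node and across restarts.
    return int.from_bytes(digest[:8], "big") % num_nodes


def split_by_host(urls, my_node, num_nodes):
    """Separate URLs this node keeps from URLs to send to other nodes."""
    local = []
    remote = {}  # node index -> list of URLs destined for that node
    for url in urls:
        node = node_for_url(url, num_nodes)
        if node == my_node:
            local.append(url)
        else:
            remote.setdefault(node, []).append(url)
    return local, remote
```

URLs in `local` would continue to duplicate URL elimination on this node, while each list in `remote` would be sent to its node, which receives them on its "from other nodes" input.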
