Common Problems
Web Crawler

If you can answer all of these questions, you understand the web crawler

  • Q: When designing a web crawler, what are the two key functional requirements?
  • Q: What non-functional requirements should we consider when designing a web crawler?
  • Q: Briefly describe the crawling workflow.
  • Q: Start by sketching the simplest version of the high level design.
  • Q: How do we design for fault tolerance and make sure we don't lose progress?
  • Q: What should we do if fetching a URL fails?
  • Q: What if we fail to fetch a URL? How should we respond, and how do we design retries?
  • Q: What happens if a crawler goes down?
  • Q: How do we scale to crawl 10B pages in 5 days? How many crawler worker instances do we need?
  • Q: How many parsing worker instances do we need?
  • Q: How can we optimize the DNS resolution step?
  • Q: How do we avoid crawling duplicate pages, e.g. different URLs with identical content?
  • Q: What are crawler traps, and how do we avoid them?

Q: When designing a web crawler, what are the two key functional requirements? A: 1) Crawl the web starting from a given set of seed URLs. 2) Extract text data from each web page and store the text for later processing.

Q: What non-functional requirements should we consider when designing a web crawler? A:

  1. Fault tolerance: handle errors gracefully and restart failed tasks without losing progress.
  2. Respect robots.txt.
  3. Efficiency: crawl the web within 5 days.
  4. Scalability to handle 10B pages.

Q: Briefly describe the crawling workflow. A:

  1. Take a seed URL from the frontier and resolve its IP via DNS.
  2. Fetch the HTML using that IP.
  3. Extract the text from the HTML.
  4. Store the text in the database.
  5. Parse additional URLs out of the text and add them to the frontier queue.
  6. Repeat steps 1-5 (a minimal sketch of this loop follows the list).
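A minimal single-process sketch of this loop, assuming the `requests` and `beautifulsoup4` libraries and an in-memory frontier; `store_text` is a hypothetical caller-supplied persistence function. A real system replaces the deque with a durable queue:

```python
import socket
from collections import deque
from urllib.parse import urlparse

import requests                # assumed HTTP client
from bs4 import BeautifulSoup  # assumed HTML parser

def crawl(seed_urls, store_text, max_pages=1000):
    """Toy crawl loop: frontier -> DNS -> fetch -> extract -> store -> enqueue."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    crawled = 0
    while frontier and crawled < max_pages:
        url = frontier.popleft()
        host = urlparse(url).hostname
        if host:
            socket.gethostbyname(host)             # 1. DNS resolution (requests also resolves internally)
        html = requests.get(url, timeout=10).text  # 2. fetch HTML
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(" ", strip=True)      # 3. extract text
        store_text(url, text)                      # 4. persist text (caller-provided)
        for link in soup.find_all("a", href=True): # 5. extract URLs, extend frontier
            href = link["href"]
            if href.startswith("http") and href not in seen:
                seen.add(href)
                frontier.append(href)
        crawled += 1                               # 6. repeat
```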

Q: Start by sketching the simplest version of the high level design. A: At its simplest: a frontier queue seeded with the start URLs, and a single crawler server that resolves DNS, fetches each page, extracts the text and links, stores the text, and pushes newly discovered URLs back onto the frontier queue.

Q: How do we design for fault tolerance and make sure we don't lose progress? A: See the next answers: split the crawler into separate crawler and parser workers connected by durable queues, persist intermediate results (raw HTML, parsed text) to blob storage and the metadata DB, and only acknowledge or delete a queue message once its output has been stored.

Q: What should we do if fetching a URL fails? A: Notice that the crawler server is doing a lot: hitting DNS, fetching the web page, extracting the text, and extracting URLs from the text to add to the frontier queue. While a single server can handle all of these tasks, that is not ideal from a fault tolerance perspective. So we split the crawler server into two parts, crawler workers and parser workers; in short, fetching and parsing become separate stages. The two stages run at different rates, can be scaled independently, and can evolve as requirements change.

  1. The crawler worker's job is to take a task from the frontier queue, hit DNS, fetch the web page, save the raw HTML to S3, and store the metadata in the database.
  2. The parser worker's job is to take a task from the parsing queue, fetch the S3 URL from the database, download the HTML from S3 using that URL, save the parsed text to S3, extract URLs from the text, and add them to the frontier queue.

We'll also add a metadata DB with a table for the URLs that have been fetched and processed. A sketch of the two worker loops follows.
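A minimal sketch of the split, assuming hypothetical interfaces: `frontier_q` and `parsing_q` expose `get()`/`put()`/`ack()`, `blob` is an S3-like store with `put()`/`get()`, and `meta_db` is the metadata table:

```python
import hashlib

def crawler_worker(frontier_q, parsing_q, blob, meta_db, fetch_html):
    """Fetch stage: frontier queue -> raw HTML in blob storage -> parsing queue."""
    while True:
        task = frontier_q.get()                      # e.g. {"url": ...}
        url = task["url"]
        html = fetch_html(url)                       # DNS + HTTP happen here
        key = "raw/" + hashlib.sha256(url.encode()).hexdigest()
        blob.put(key, html)                          # persist raw HTML first
        meta_db.upsert(url, raw_key=key, status="fetched")
        parsing_q.put({"url": url, "raw_key": key})  # hand off to parsers
        frontier_q.ack(task)                         # only ack once output is durable

def parser_worker(parsing_q, frontier_q, blob, meta_db, extract):
    """Parse stage: raw HTML -> text in blob storage -> new URLs on the frontier."""
    while True:
        task = parsing_q.get()
        html = blob.get(task["raw_key"])
        text, links = extract(html)                  # text plus outgoing URLs
        text_key = task["raw_key"].replace("raw/", "text/")
        blob.put(text_key, text)
        meta_db.upsert(task["url"], text_key=text_key, status="parsed")
        for link in links:
            frontier_q.put({"url": link})
        parsing_q.ack(task)
```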

Q: What if we fail to fetch a URL? How should we respond, and how do we design retries? A: A site may be temporarily unreachable, moved, slow, or briefly offline, so we need a retry mechanism. A) Delayed retry with an in-memory timer: the simplest approach is to wait a while using an in-memory timer and then retry, but the drawback is wasted time; the worker can't do anything else during the delay.

B) Kafka with manual exponential backoff.

Approach

Kafka does not support retries out of the box, but we could implement them ourselves. We could have a separate topic for failed URLs and a separate service that reads from this topic and retries the fetch with exponential backoff. In order to know how much to backoff, we could store the time that the next fetch should occur in the message itself. When a consumer reads a message, it will check the time and if it is in the future, it will wait until that time to retry the fetch.
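A rough sketch of that pattern with the kafka-python library; the broker address, topic name, backoff schedule, and retry cap are assumptions, and `fetch` is a caller-supplied function:

```python
import json
import time

from kafka import KafkaConsumer, KafkaProducer  # kafka-python

BROKERS = ["localhost:9092"]   # assumed broker address
RETRY_TOPIC = "failed-urls"    # hypothetical topic name
MAX_ATTEMPTS = 5               # assumed retry cap

producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode(),
)

def schedule_retry(url, attempt):
    """Re-publish a failed URL with the time of its next attempt stored in the message."""
    delay = min(30 * (2 ** attempt), 15 * 60)  # 30s, 60s, ... capped at 15 min
    producer.send(RETRY_TOPIC, {"url": url, "attempt": attempt, "next_at": time.time() + delay})

def retry_consumer(fetch):
    """Separate retry service: waits until `next_at`, then retries the fetch."""
    consumer = KafkaConsumer(
        RETRY_TOPIC,
        bootstrap_servers=BROKERS,
        value_deserializer=lambda b: json.loads(b.decode()),
        group_id="retry-workers",
    )
    for msg in consumer:
        task = msg.value
        time.sleep(max(0, task["next_at"] - time.time()))  # wait out the backoff
        try:
            fetch(task["url"])
        except Exception:
            if task["attempt"] + 1 < MAX_ATTEMPTS:
                schedule_retry(task["url"], task["attempt"] + 1)
            # else: give up and record the URL as unreachable
```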

Challenges

This can work, it's just complex to implement and maintain. Fortunately for us, there are services that do this for us out of the box.

C) Great Solution: SQS with Exponential Backoff

Approach

SQS supports retries with configurable exponential backoff out of the box -- convenient! No need to implement our own retry logic. Initially, messages that fail to process are retried once per the visibility timeout, with the default being 30 seconds. The visibility timeout increases exponentially after each retry attempt—30 seconds, 2 minutes, 5 minutes, and up to 15 minutes. This strategy helps to manage message processing more efficiently without overwhelming the system.

Challenges

We don't want to retry indefinitely. That would be silly.

To prevent excessive delays, it is common to cap the exponential backoff at a maximum value. After a certain number of failures, as determined by the ApproximateReceiveCount, the message is moved to a dead-letter queue (DLQ). At this stage, the message is considered unprocessable. For our purposes, we'll consider the site offline, and thus unscrapable, after 5 retries.
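As a sketch of how this might be wired up with boto3 (queue names, region, and the backoff formula are assumptions): the redrive policy moves a message to the DLQ after 5 failed receives, and the consumer can stretch the visibility timeout based on the message's ApproximateReceiveCount:

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # assumed region

# Dead-letter queue: messages that fail 5 times land here and the site is treated as unscrapable.
dlq_url = sqs.create_queue(QueueName="frontier-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

frontier_url = sqs.create_queue(
    QueueName="frontier",
    Attributes={
        "VisibilityTimeout": "30",  # base timeout before a failed message reappears
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
    },
)["QueueUrl"]

def backoff_on_failure(receipt_handle, receive_count):
    """Grow the visibility timeout exponentially with each failed receive, capped at 15 minutes."""
    timeout = min(30 * (4 ** (receive_count - 1)), 900)
    sqs.change_message_visibility(
        QueueUrl=frontier_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=timeout,
    )
```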

Q: What happens if a crawler goes down?

A: The answer is simple: we spin up a new one. We just need to make sure that in-flight URLs are not lost.

The good news is that a URL stays in the queue until we confirm it has been fetched and its HTML stored in blob storage. That way, if a crawler goes down, the URL will be picked up by another crawler and the process continues. The exact mechanism varies by technology; at a very high level, here is how two popular options, Kafka and SQS, handle this:

Apache Kafka:

Kafka retains messages in its log and does not delete them after they are read. A crawler tracks its progress via an offset, and the offset is not committed to Kafka until the URL has been successfully fetched and processed. If a crawler fails, the next crawler resumes from where the previous one left off, so no data is lost.

SQS:

With SQS, a message remains in the queue until it is explicitly deleted. A visibility timeout hides the message from other crawlers once it has been picked up. If the crawler fails before confirming successful processing, the message automatically becomes visible again after the timeout expires, allowing another crawler to try. Conversely, once the HTML is stored in blob storage, the crawler deletes the message from the queue, ensuring it is not processed again.

Of course, the same applies if a parsing worker goes down. The URL will remain in the queue until it is confirmed to have been processed and the text data is stored in blob storage.

Given SQS's built-in support for retries and exponential backoff and the ease with which visibility timeouts can be configured, we'll use SQS for our system.
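A minimal sketch of the crawler's consume loop against SQS (the queue URL is a placeholder and `fetch_and_store` is a hypothetical helper that writes the HTML to blob storage). The key point is that the message is deleted only after the HTML is durably stored, so a crash simply lets it reappear:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # assumed region
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/frontier"  # placeholder

def crawl_forever(fetch_and_store):
    """Delete a message only after its HTML is durably stored; crashes let the visibility timeout re-expose it."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        for msg in resp.get("Messages", []):
            url = msg["Body"]
            try:
                fetch_and_store(url)   # fetch page, write HTML to blob storage
            except Exception:
                continue               # leave the message in place; it becomes visible again later
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```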

Q: How do we scale to crawl 10B pages in 5 days? How many crawler worker instances do we need? A: To crawl 10 billion pages in 5 days, a single instance clearly isn't enough, so we estimate how many instances we need. The key observation is that this is an I/O-bound task, not a CPU-bound one, so we size the fleet by network bandwidth. On AWS, network bandwidth can reach 400 Gbps per instance; assuming an average page size of 1 MB:

400 Gbps = 400,000,000,000 bits/s ÷ 8 bits/byte ÷ 1,000,000 bytes/page = 50,000 pages/second

Sustaining 50,000 pages per second is not realistic, since we will never use the full bandwidth, so assume 20% utilization: 50,000 pages/second * 20% = 10,000 pages/second.

Our target is 10 billion pages, so a single instance would need:

10,000,000,000 pages / 10,000 pages/second / 86,400 seconds/day ≈ 11.5 days

So roughly 3 instances (11.5 days / 5 days ≈ 2.3, rounded up) are enough to crawl the 10B pages within 5 days.
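A quick back-of-the-envelope check of the arithmetic above (the page size and 20% utilization are the stated assumptions):

```python
import math

BANDWIDTH_BPS = 400e9   # 400 Gbps per instance
PAGE_BYTES    = 1e6     # assume ~1 MB per page
UTILIZATION   = 0.20    # assume we only sustain 20% of the link
TOTAL_PAGES   = 10e9
DEADLINE_DAYS = 5

pages_per_sec = BANDWIDTH_BPS / 8 / PAGE_BYTES * UTILIZATION  # -> 10,000 pages/s
days_single   = TOTAL_PAGES / pages_per_sec / 86_400          # -> ~11.6 days
instances     = math.ceil(days_single / DEADLINE_DAYS)        # -> 3

print(pages_per_sec, round(days_single, 1), instances)
```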

Q: How many parsing worker instances do we need? A: As parsing tasks are added to the parsing queue, we simply scale the parser workers up and down dynamically based on the number of pages in the further processing queue. This could be via Lambda functions, ECS tasks, or any other serverless technology.

The "Further Processing Queue" refers to a stage in a data processing pipeline where tasks or items that require additional processing are held before being processed further. In the context of your parser workers, the Further Processing Queue would likely contain references to pages or data blobs that have been downloaded by your crawlers but still need to be parsed. The parser workers would pick up items from this queue, process them (e.g., extract text data from HTML), and then store the results back in blob storage.

In a dynamically scalable system, the number of parser workers (or instances running the parsing tasks) could be adjusted automatically based on the number of items waiting in the Further Processing Queue. For instance, if the queue grows longer (indicating a backlog of pages to be processed), more parser workers could be spun up to handle the load. Conversely, if the queue is short, fewer workers might be needed, allowing the system to scale down and save resources.
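One way this scaling signal might be computed, assuming the parsing queue is an SQS queue; the queue URL, region, and per-worker throughput target are assumptions, and the returned number would be handed to whatever scheduler (ECS, Lambda concurrency, etc.) runs the workers:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # assumed region
PARSING_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/parsing"  # placeholder
PAGES_PER_WORKER = 1_000  # assumed backlog each worker can absorb

def desired_parser_workers(min_workers=1, max_workers=200):
    """Size the parser fleet from the current queue backlog."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=PARSING_QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]
    backlog = int(attrs["ApproximateNumberOfMessages"])
    wanted = -(-backlog // PAGES_PER_WORKER)  # ceiling division
    return max(min_workers, min(max_workers, wanted))
```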

Q: How can we optimize the DNS resolution step? A: First, when crawling a site, there are several potential reasons to use IP addresses rather than domain names:

  1. Avoid the DNS resolution bottleneck: in large-scale crawling, DNS resolution can become a bottleneck because every domain name has to be resolved to an IP address. Using IP addresses directly skips this step, reducing latency and increasing crawl speed.

  2. Bypass domain-based restrictions: some sites restrict or block requests aimed at particular domain names. Requesting by IP address may bypass those restrictions, although doing so is not always legal or ethical.

  3. Handle multiple domains: some large sites spread traffic or services across several domain names that may all point to the same server IPs. Using IP addresses can simplify and unify how we access them.

  4. Consistency and directness: going straight to an IP address avoids potential mismatches between a domain and its IP (e.g., out-of-sync DNS caches), ensuring requests always reach the intended server, which matters especially in a distributed crawl.

That said, using IP addresses directly has its own limitations:

  • Load balancing: many sites use DNS for load balancing, so the same domain may resolve to different IP addresses at different times. Going straight to a single IP can bypass that mechanism and get the crawler throttled or blocked.

  • Maintainability: domain names are easier to maintain, because IP addresses can change, especially in cloud environments.

  • Legal and ethical issues: in some cases, requesting by IP address directly may be seen as circumventing security measures, which carries legal and ethical risk.

Overall, whether to use IP addresses instead of domain names usually depends on the specific requirements and the crawling scenario.
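For point 1 specifically, a lightweight alternative to hard-coding IPs is to cache resolutions per host so each domain is resolved at most once per TTL; a minimal sketch (the TTL value is an assumption):

```python
import socket
import time

_dns_cache = {}        # hostname -> (ip, expiry_timestamp)
DNS_TTL_SECONDS = 300  # assumed cache TTL

def resolve(hostname):
    """Resolve a hostname, reusing a cached IP so repeated pages on the same
    host don't pay the DNS round trip again."""
    entry = _dns_cache.get(hostname)
    now = time.time()
    if entry and entry[1] > now:
        return entry[0]
    ip = socket.gethostbyname(hostname)
    _dns_cache[hostname] = (ip, now + DNS_TTL_SECONDS)
    return ip
```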

Q: How do we avoid crawling duplicate pages, e.g. different URLs with identical content? A: First, check the metadata DB to see whether a URL has already been crawled; if it has, skip it. That is the simplest optimization. Second, for the case where different URLs serve the same content, we can use:

  1. Hash the content and store it in the metadata DB with an index
  2. Bloom Filter [[什么是Bloom Filter]]

Great Solution: Hash and Store in Metadata DB w/ Index

Approach

We could hash the content of the page and store this hash in our URL table in the Metadata DB. When we fetch a new URL, we hash its content and compare it to the hashes in the Metadata DB. If we find a match, we skip the page. To make sure the lookup is fast, we build an index on the hash column in the Metadata DB, which lets us quickly check whether the new page's content hash already exists in the DB.
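A minimal sketch of this approach, using SQLite as a stand-in for the metadata DB (the table and column names are assumptions):

```python
import hashlib
import sqlite3  # stand-in for the metadata DB; any SQL store works the same way

conn = sqlite3.connect("metadata.db")
conn.execute("""CREATE TABLE IF NOT EXISTS urls (
                    url TEXT PRIMARY KEY,
                    content_hash TEXT)""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_content_hash ON urls(content_hash)")

def is_duplicate(url, html):
    """Hash the page content and check the indexed hash column; record it if new."""
    digest = hashlib.sha256(html.encode()).hexdigest()
    row = conn.execute(
        "SELECT 1 FROM urls WHERE content_hash = ? LIMIT 1", (digest,)
    ).fetchone()
    if row:
        return True   # same content already seen under another URL
    conn.execute("INSERT OR REPLACE INTO urls (url, content_hash) VALUES (?, ?)", (url, digest))
    conn.commit()
    return False
```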

Challenges

One might worry that the index will become large and slow down writes, but that is overly pessimistic. Modern databases are quite efficient at handling indexes, even large ones. While it's true that maintaining an index incurs overhead, this overhead is generally well optimized in modern systems, so it's safe to overlook this concern.

Great Solution: Bloom Filter

Approach

Another possible approach is to use a Bloom filter. A Bloom filter is a probabilistic data structure that allows us to test whether an element is a member of a set. It can tell us definitively if an element is not in the set, but it can only tell us with some probability if an element is in the set. This is perfect for our use case. We can use a Bloom filter to store the hashes of the content of the pages we have already crawled. When we fetch a new URL, we hash the content and check the Bloom filter. If the hash is not in the Bloom filter, we know we haven't crawled this page before. If the hash is in the Bloom filter, we know we have crawled this page before and can skip it.

From a technology perspective, we can use Redis to store the Bloom filter. Redis (with the RedisBloom module, bundled in Redis Stack) provides a Bloom filter data structure we can use for this purpose.
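A sketch of how this might look with redis-py against a Redis server that has RedisBloom loaded; the filter key, capacity, and error rate are assumptions:

```python
import hashlib
import redis  # requires a Redis server with the RedisBloom module (e.g. Redis Stack)

r = redis.Redis(host="localhost", port=6379)
FILTER_KEY = "crawled_content"

# Reserve the filter once; capacity and error rate are illustrative and would be
# sized for the real 10B-page target.
if not r.exists(FILTER_KEY):
    r.execute_command("BF.RESERVE", FILTER_KEY, 0.001, 1_000_000_000)

def seen_before(html):
    """Check and record a page's content hash; false positives are possible, false negatives are not."""
    digest = hashlib.sha256(html.encode()).hexdigest()
    if r.execute_command("BF.EXISTS", FILTER_KEY, digest):
        return True  # probably crawled already (small chance of a false positive)
    r.execute_command("BF.ADD", FILTER_KEY, digest)
    return False
```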

Challenges

The main challenge with a Bloom filter is that it can give false positives. This means that it might tell us that we have crawled a page when we actually haven't. We could argue that this is an acceptable trade-off for the performance benefits and can configure our bloom filter to reduce the probability of false positives by increasing the size of the filter and the number of hash functions used.

Q: What are crawler traps, and how do we avoid them?

A: Crawler traps are pages that are designed to keep crawlers on the site indefinitely. They can be created by having a page that links to itself many times or by having a page that links to many other pages on the site. If we're not careful, we could end up crawling the same site over and over again and never finish.

Fortunately, the solution is pretty straightforward: we can enforce a maximum depth for our crawlers. We add a depth field to our URL table in the Metadata DB and increment it each time we follow a link. If the depth exceeds a certain threshold, we stop crawling that page. This prevents us from getting stuck in a crawler trap.
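A minimal sketch of the depth check, with hypothetical metadata accessors (`get_depth`/`set_depth`) standing in for the real Metadata DB calls and an assumed threshold:

```python
MAX_DEPTH = 20  # assumed threshold; tune to the sites being crawled

def enqueue_links(frontier_q, meta_db, parent_url, links):
    """Propagate depth from the parent URL and drop anything beyond MAX_DEPTH."""
    parent_depth = meta_db.get_depth(parent_url) or 0  # hypothetical metadata accessor
    if parent_depth + 1 > MAX_DEPTH:
        return  # likely a trap; stop following this chain
    for link in links:
        meta_db.set_depth(link, parent_depth + 1)
        frontier_q.put({"url": link, "depth": parent_depth + 1})
```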