# Introduction

What atextcrawler does (see the sketch after this list):

  - Start from a seed list (whitelist and blacklist) of website base URLs
  - Loop over sites, selecting them by applying criteria to the content of each site's start page
  - Crawl each selected site, i.e. loop over the site's resources
  - Extract plaintext content from each resource (HTML parsing is optimized for HTML5); discard non-text content, but handle feeds and sitemaps
  - Extract internal and external links; external links contribute to the site list
  - Keep track of sites and resources in a PostgreSQL database
  - Store the plaintext content of resources in an Elasticsearch index
  - Also store vector embeddings of the plaintexts in Elasticsearch, computed by a TensorFlow model server running a multilingual language model
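
The pipeline can be sketched in a few lines of code. The sketch below is purely illustrative: all function names and stub bodies are assumptions made for demonstration, not atextcrawler's actual API.

```python
# Conceptual sketch of the pipeline above. Every name here is a hypothetical
# placeholder (with trivial stub bodies), not the actual atextcrawler API.
import asyncio


async def fetch(url: str) -> str:
    return "<html><body>Some page text.</body></html>"  # stub: real code uses an HTTP client


def is_relevant(start_page_html: str) -> bool:
    return True  # stub: criteria applied to the start page's content


def extract_plaintext(html: str) -> str:
    return "Some page text."  # stub: HTML5-aware plaintext extraction


async def process_site(base_url: str) -> None:
    start_page = await fetch(base_url)
    if not is_relevant(start_page):  # only relevant sites are crawled
        return
    for path in ("/", "/about"):  # stub: paths come from links, feeds and sitemaps
        html = await fetch(base_url + path)
        text = extract_plaintext(html)
        # real code: track site/resource in PostgreSQL, index the text and its
        # embedding (via the TensorFlow model server) in Elasticsearch
        print(path, text)


asyncio.run(process_site("https://example.org"))
```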

## Architecture

atextcrawler runs as a single Python process; concurrency is achieved with asyncio wherever possible (which is almost everywhere).
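
A minimal sketch of this process layout, assuming one coroutine that handles the site queue and several crawler worker coroutines (all names are illustrative; only the config key crawl.workers mentioned below is real):

```python
# Illustrative sketch of the single-process asyncio layout; coroutine names are
# assumptions, only the idea (one queue handler, N crawler workers) is from the docs.
import asyncio


async def handle_site_queue() -> None:
    while True:
        # dequeue base URLs from table site_queue, insert/update rows in table site
        await asyncio.sleep(60)  # placeholder for the crawl.site_delay pause


async def crawler_worker(worker_id: int) -> None:
    while True:
        # check out a due site, crawl it, check it back in
        await asyncio.sleep(1)  # placeholder for the actual crawl


async def main(workers: int = 2) -> None:  # number taken from config crawl.workers
    await asyncio.gather(
        handle_site_queue(),
        *(crawler_worker(i) for i in range(workers)),
    )


if __name__ == "__main__":
    asyncio.run(main())  # runs until interrupted, like the real daemon
```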

  1. There is a queue of websites, see database table site_queue. The queue is fed a) on first startup with seeds, b) manually and c) from crawls which find external links. When the queue is handled, new sites are stored to table site. New sites are always updated; existing sites only if their last update was more than crawl.site_revisit_delay seconds in the past. After the queue has been handled there is a delay (crawl.site_delay seconds) before it is handled again.
  2. Updating a site means: the start page is fetched and criteria are applied to its content to determine whether the site is relevant. (It is assumed that (non-)relevance is obvious from the start page already.) If the site is relevant, more information is fetched (e.g. sitemaps).
  3. There is a configurable number of crawler workers (config crawl.workers) which concurrently crawl sites, one at a time per worker. (During the crawl the site is marked as locked using crawl_active=true.) Each worker picks a relevant site which has not been crawled for a certain time ("checkout"), crawls it, and finally marks it as crawled (crawl_active=false, "checkin") and schedules the next crawl. Each crawl (with begin time, end time and the number of (new) resources found) is stored in table crawl.
  4. Crawls are either full crawls (in which all paths reachable through links from the start page are fetched) or feed crawls (in which only paths listed in a feed of the site are fetched). The respective (minimum) intervals at which these crawls happen are full_crawl_interval and feed_crawl_interval. Feed crawls can happen more frequently (e.g. daily); see the interval sketch after this list.
  5. When a path is fetched it can result in a MetaResource (feed or sitemap) or a TextResource (redirects are followed and irrelevant content is ignored). A TextResource obtained from one path can be very similar to a resource obtained from another path; in this case no new resource is created, but both paths are linked to the same resource (see tables site_path and resource, and the deduplication sketch after this list).
  6. If a MetaResource is fetched and it is a sitemap, its paths are added to table site_path. If it is a feed, the feed is stored in table site_feed and its paths are added to table site_path.
  7. Links between sites are stored in table site_link.
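
To make the scheduling in step 4 concrete, here is a small sketch of how the decision between a full crawl and a feed crawl could look. The function and its arguments are hypothetical; only the settings full_crawl_interval and feed_crawl_interval are taken from the description above.

```python
# Sketch of the full-vs-feed crawl decision (step 4). Function and argument names
# are illustrative; only full_crawl_interval and feed_crawl_interval are real settings.
from datetime import datetime, timedelta


def due_crawl_type(
    last_full_crawl: datetime,
    last_feed_crawl: datetime,
    full_crawl_interval: timedelta,
    feed_crawl_interval: timedelta,
    now: datetime,
) -> str | None:
    if now - last_full_crawl >= full_crawl_interval:
        return "full"  # fetch all paths reachable through links from the start page
    if now - last_feed_crawl >= feed_crawl_interval:
        return "feed"  # fetch only paths listed in a feed of the site
    return None  # no crawl due yet


now = datetime.now()
print(due_crawl_type(now - timedelta(days=10), now - timedelta(hours=2),
                     timedelta(days=7), timedelta(days=1), now))  # -> full
```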
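
The deduplication in step 5 can be illustrated as follows. The in-memory dicts merely stand in for the tables resource and site_path, and the similarity measure is an assumption chosen for the example, not necessarily the one atextcrawler uses.

```python
# Sketch of step 5: two paths whose plaintext is nearly identical end up linked
# to the same resource. The dicts stand in for tables resource and site_path;
# the similarity measure is an illustrative assumption.
from difflib import SequenceMatcher

resources: dict[int, str] = {}   # resource id -> plaintext   (table resource)
site_paths: dict[str, int] = {}  # site path   -> resource id (table site_path)


def register(path: str, text: str) -> int:
    for rid, existing in resources.items():
        if SequenceMatcher(None, existing, text).ratio() > 0.95:
            site_paths[path] = rid  # very similar: link the path to the existing resource
            return rid
    rid = len(resources) + 1
    resources[rid] = text  # otherwise create a new resource
    site_paths[path] = rid
    return rid


register("/news/article-1", "Hello world, this is some page text.")
register("/print/article-1", "Hello world, this is some page text!")
print(site_paths)  # both paths point to resource 1
```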

## Site annotations

Database table site_annotation can hold any number of annotations for a base_url. While crawling, these annotations are taken into account: blacklisting or whitelisting has precedence over the function site_filter (in plugin filter_site), as sketched below.
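
A minimal sketch of that precedence rule, with hypothetical annotation values and a trivial stand-in for the plugin's site_filter function:

```python
# Sketch of the precedence rule: a black- or whitelisting annotation wins over the
# site_filter plugin function. The annotation values and the stub body are assumptions.

def site_filter(base_url: str) -> bool:  # stand-in for site_filter in plugin filter_site
    return "example" in base_url


def site_is_relevant(annotation: str | None, base_url: str) -> bool:
    if annotation == "blacklist":
        return False  # annotation has precedence: never crawl
    if annotation == "whitelist":
        return True   # annotation has precedence: always crawl
    return site_filter(base_url)  # otherwise the plugin decides


print(site_is_relevant("blacklist", "https://example.org"))  # False
print(site_is_relevant(None, "https://example.org"))         # True
```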

Annotations cannot be managed from within atextcrawler; this requires another application, usually atextsearch.

Each annotation requires the base_url of the annotated site; if a site with this base_url exists in the site table, the annotation should also be associated with the site's id (column site_id).

## Limitations

  - atextcrawler is not optimized for speed; it is meant to run as a background task on a server with limited resources (or even an SBC, like a Raspberry Pi, with attached storage)
  - atextcrawler only indexes text, not other resources such as images