atextcrawler is an asynchronous web crawler that indexes text for literal and semantic search.

Its client-side counterpart is atextsearch.

atextcrawler crawls and indexes selected websites. It starts from a few seed sites and follows their external links. Criteria defined in plugin code determine which linked sites (and which of their resources) are (recursively) added to the pool.
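The plugin-based selection described above can be pictured as a filter hook that decides whether a linked site enters the pool. The sketch below is purely illustrative: the function name, signature, and keyword list are assumptions, not the actual atextcrawler plugin API.

```python
# Hypothetical site-filter plugin; the real atextcrawler plugin
# interface may differ. The criteria here decide whether a site
# discovered via an external link is added to the crawl pool.
RELEVANT_KEYWORDS = {"permaculture", "commons", "cooperative"}


def site_filter(url: str, title: str, description: str) -> bool:
    """Return True if the linked site should be added to the pool."""
    text = f"{title} {description}".lower()
    return any(keyword in text for keyword in RELEVANT_KEYWORDS)
```

Sites accepted by such a filter would then have their own resources crawled, and their external links examined recursively.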

atextcrawler is written in Python. It runs a configurable number of async workers concurrently in a single process, uses TensorFlow to embed paragraph-sized text chunks with a (multi-)language model, stores metadata in PostgreSQL, and stores the texts themselves in Elasticsearch.
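The single-process worker pool mentioned above follows a common asyncio pattern: a fixed number of coroutine workers consume crawl jobs from a shared queue. This is a minimal sketch of that pattern with illustrative names, not the actual atextcrawler implementation.

```python
import asyncio


async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    """Consume URLs from the queue until cancelled."""
    while True:
        url = await queue.get()
        # In a real crawler: fetch, parse, embed, and store the page here.
        results.append((name, url))
        queue.task_done()


async def crawl(urls, num_workers: int = 3) -> list:
    """Process all URLs with a configurable number of concurrent workers."""
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    for url in urls:
        queue.put_nowait(url)
    workers = [
        asyncio.create_task(worker(f"w{i}", queue, results))
        for i in range(num_workers)
    ]
    await queue.join()  # wait until every queued job has been processed
    for w in workers:
        w.cancel()
    return results


# Example: asyncio.run(crawl(["https://example.org/a", "https://example.org/b"]))
```

Because all workers share one event loop, concurrency comes from overlapping I/O waits rather than threads or processes, which matches the one-process design described above.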