
atextcrawler is an asynchronous web crawler that indexes text for literal and semantic search.

Its client-side counterpart is atextsearch.

atextcrawler crawls and indexes selected websites. It starts from a few seed sites and follows their external links. Criteria defined in plugin code determine which linked sites (and which of their resources) are (recursively) added to the pool.
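
A site-selection plugin might look roughly like the sketch below. The hook names (`site_filter`, `resource_filter`), their signatures, and the URL patterns are illustrative assumptions, not atextcrawler's actual plugin API:

```python
# Hypothetical plugin sketch; the real hook names and signatures in
# atextcrawler's plugin code may differ.
import re

# Illustrative allow-list of site URL patterns (not taken from the project).
ACCEPT_PATTERNS = [
    re.compile(r'https?://([a-z0-9-]+\.)*example\.org/'),
]


def site_filter(base_url: str) -> bool:
    """Decide whether a linked site is added to the crawl pool."""
    return any(pattern.match(base_url) for pattern in ACCEPT_PATTERNS)


def resource_filter(base_url: str, path: str) -> bool:
    """Decide whether a resource of an accepted site is fetched and indexed."""
    return not path.startswith('/tag/')  # e.g. skip tag listing pages
```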

atextcrawler is written in Python. It runs a configurable number of async workers concurrently in a single process, uses TensorFlow to embed paragraph-sized text chunks with a multilingual model, and stores metadata in PostgreSQL and texts in Elasticsearch.
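
A minimal sketch of that worker architecture is shown below. The job queue, the `resource` table, the `text` index, and the Universal Sentence Encoder model are illustrative assumptions, not necessarily what atextcrawler actually uses:

```python
# Minimal sketch of the worker pipeline under the assumptions stated above;
# names, queue contents, and the storage schema are illustrative only.
import asyncio

import asyncpg                                   # PostgreSQL: resource metadata
import tensorflow_hub as hub                     # multilingual text embeddings
import tensorflow_text  # noqa: F401  -- registers ops required by the model
from elasticsearch import AsyncElasticsearch     # Elasticsearch: text storage

# A multilingual sentence-embedding model (an assumption, not necessarily
# the model atextcrawler uses).
model = hub.load('https://tfhub.dev/google/universal-sentence-encoder-multilingual/3')


async def worker(queue: asyncio.Queue, pg: asyncpg.Pool, es: AsyncElasticsearch):
    """Take one resource per job, embed its paragraphs, store the results."""
    while True:
        url, paragraphs = await queue.get()
        embeddings = model(paragraphs).numpy()   # one vector per paragraph
        async with pg.acquire() as conn:         # metadata -> PostgreSQL
            await conn.execute(
                'INSERT INTO resource (url) VALUES ($1) ON CONFLICT DO NOTHING',
                url,
            )
        for text, vector in zip(paragraphs, embeddings):  # texts -> Elasticsearch
            await es.index(
                index='text',
                document={'url': url, 'text': text, 'embedding': vector.tolist()},
            )
        queue.task_done()


async def main(n_workers: int = 4):
    """Run a configurable number of async workers concurrently in one process."""
    queue: asyncio.Queue = asyncio.Queue()
    pg = await asyncpg.create_pool(database='atextcrawler')
    es = AsyncElasticsearch('http://localhost:9200')
    workers = [asyncio.create_task(worker(queue, pg, es)) for _ in range(n_workers)]
    await queue.join()               # wait until all queued resources are processed
    for w in workers:
        w.cancel()
    await es.close()
```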