# Installation

Installation has only been tested on Debian bullseye (amd64). The instructions below target this system; please adapt them to other environments.

## System packages

```sh
apt install pandoc tidy python3-systemd protobuf-compiler libprotobuf-dev
```

The protobuf packages are required to build the Python package gcld3 (see below).
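
Before building the Python dependencies, you can sanity-check that the protobuf compiler is on the PATH (a quick check, not part of the original steps):

```sh
protoc --version   # prints the installed libprotoc version
```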

## PostgreSQL database

We need access to a PostgreSQL database, either installed locally or reachable over TCP/IP. Create a new database:

```sh
createdb -E UTF8 --lc-collate=C --lc-ctype=C -T template0 -O atextcrawler atextcrawler
```
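
Note that `createdb` expects the owner role `atextcrawler` to exist already. If it does not, a minimal sketch for creating it (assuming you can act as the `postgres` superuser on the database host):

```sh
# run on the database host; --pwprompt asks for the new role's password
sudo -u postgres createuser --pwprompt atextcrawler
```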

## Elasticsearch

We need access to an Elasticsearch instance (over TCP/IP).

Note: TLS is not yet supported, so install this service locally.

See the elasticsearch howto.
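
Once the instance is running, a quick reachability check (assuming the default port 9200 on localhost; adjust host and port to your setup):

```sh
# a healthy node answers with a JSON document describing itself
curl -s http://localhost:9200/
```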

## TensorFlow model server

We need access to a TensorFlow model server (over TCP/IP). It should serve `universal_sentence_encoder_multilingual` or a similar language model.

Note: TLS is not yet supported, so install this service locally.

See the tensorflow howto.
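
To verify that the model is actually being served, you can query TensorFlow Serving's model status endpoint (assuming the default REST port 8501 and the model name above; adjust to your deployment):

```sh
# reports the served model versions; state should be AVAILABLE
curl -s http://localhost:8501/v1/models/universal_sentence_encoder_multilingual
```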

## Set up a virtualenv and install atextcrawler

```sh
apt install python3-pip
adduser --home /srv/atextcrawler --disabled-password --gecos "" atextcrawler
su - atextcrawler
cat >>.bashrc <<EOF
export PYTHONPATH=\$HOME/repo/src
EOF
pip3 install --user pipenv
cat >>.profile <<EOF
export PYTHONPATH=\$HOME/repo/src
export PATH=\$HOME/.local/bin:\$PATH
\$HOME/.local/bin/pipenv shell
EOF
exit
su - atextcrawler
git clone https://gitea.multiname.org/a-text/atextcrawler.git repo
cd repo
pipenv sync
pipenv install --site-packages  # give the venv access to the apt-installed python3-systemd
pre-commit install
```
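
Since gcld3 is compiled against the system protobuf libraries installed earlier, it is worth confirming that it imports cleanly; this assumes gcld3 ends up in the virtualenv via `pipenv sync`:

```sh
pipenv run python -c "import gcld3; print('gcld3 ok')"
```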

Note: One of the dependencies, the Python package tldextract, caches data in this directory:

```
$HOME/.cache/python-tldextract/
```

## Configure atextcrawler

As user atextcrawler, execute:

```sh
mkdir $HOME/.config
cp -r $HOME/repo/doc/source/config_template $HOME/.config/atextcrawler
```

Edit `$HOME/.config/atextcrawler/main.yaml`.

If you want to override a plugin, copy it to the plugins directory and edit it, e.g.

```sh
cp /srv/atextcrawler/repo/src/atextcrawler/plugin_defaults/filter_site.py $HOME/.config/atextcrawler/plugins
```

Optionally edit `$HOME/.config/atextcrawler/initial_data/seed_urls.list`.
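
The format of the seed list is not documented here; the sketch below assumes a plain-text file with one URL per line (the comment syntax is hypothetical), so check the file shipped with the template before editing:

```
# hypothetical entries; verify the actual format in the shipped seed_urls.list
https://example.org/
https://example.com/blog/
```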

Check (and print) the instance configuration:

```sh
python -m atextcrawler.config
```

## Test run

To see if it works, run atextcrawler from the command line:

```sh
python -m atextcrawler
```

You can stop it with Ctrl-C; stopping may take a few seconds or even minutes.

## Install systemd service

To make the service persistent, create a systemd unit file `/etc/systemd/system/atextcrawler.service` with this content:

```ini
[Unit]
Description=atextcrawler web crawler
Documentation=https://gitea.multiname.org/a-text/atextcrawler
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=atextcrawler
Group=atextcrawler
WorkingDirectory=/srv/atextcrawler/repo
Environment=PYTHONPATH=/srv/atextcrawler/repo/src
ExecStart=/srv/atextcrawler/.local/bin/pipenv run python -m atextcrawler
TimeoutStartSec=30
# SIGINT triggers the same graceful shutdown as Ctrl-C; allow it time to finish
ExecStop=/bin/kill -INT $MAINPID
TimeoutStopSec=180
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then reload systemd and enable and start the service:

```sh
systemctl daemon-reload
systemctl enable atextcrawler
systemctl start atextcrawler
```
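
To check that the service came up and to follow its output, the standard systemd tooling applies (a verification step added here for convenience):

```sh
systemctl status atextcrawler
journalctl -u atextcrawler -f   # follow the log output; stop with Ctrl-C
```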