# Installation
Installation has only been tested on Debian bullseye (amd64).
The instructions below assume this system;
please adapt them to other environments.
## System packages
```
apt install pandoc tidy python3-systemd openjdk-17-jre-headless
apt install protobuf-compiler libprotobuf-dev build-essential libpython3-dev
```
Java is needed for Tika.
The second line is required for the Python package gcld3 (see below).
## PostgreSQL database
We need access to a PostgreSQL database. Install PostgreSQL or provide connectivity to a PostgreSQL database over TCP/IP. Create the owner role (if it does not exist yet) and a new database:
```
createuser atextcrawler  # as a PostgreSQL superuser, e.g.: sudo -u postgres createuser atextcrawler
createdb -E UTF8 --lc-collate=C --lc-ctype=C -T template0 -O atextcrawler atextcrawler
```
## Elasticsearch
We need access to an Elasticsearch instance (over TCP/IP).
Note: TLS is not yet supported, so install this service locally.
See the [elasticsearch howto](elasticsearch.md).

Create an API key (using the password of the `elastic` user):
```
http --auth elastic:******************* -j POST http://127.0.0.1:9200/_security/api_key name=atext role_descriptors:='{"atext": {"cluster": [], "index": [{"names": ["atext_*"], "privileges": ["all"]}]}}'
```
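The response is JSON containing `id` and `api_key` fields. Clients that authenticate with an `Authorization: ApiKey ...` header need the base64 encoding of `id:api_key`. A minimal sketch (the `id` and `api_key` values below are hypothetical placeholders, not real credentials):

```python
import base64

# Hypothetical values; substitute the "id" and "api_key" fields
# from the JSON response of the request above.
api_key_id = "abc123"
api_key_secret = "s3cr3t"

# Elasticsearch expects base64("id:api_key") as the ApiKey credential.
credential = base64.b64encode(
    f"{api_key_id}:{api_key_secret}".encode("utf-8")
).decode("ascii")

print(f"Authorization: ApiKey {credential}")
```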
2021-11-29 09:16:31 +00:00
## TensorFlow model server
We need access to a TensorFlow model server (over TCP/IP).
It should serve `universal_sentence_encoder_multilingual`
or a similar language model.
Note: TLS is not yet supported, so install this service locally.
See [tensorflow howto](tensorflow_model_server.md).
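Once the server is up, a quick reachability check is TensorFlow Serving's REST predict endpoint. This sketch only constructs the request; the default REST port 8501 and the model name are assumptions, so adapt them to your setup:

```python
import json

# Assumptions: default TensorFlow Serving REST port (8501) and the
# model name from above; adjust both to match your installation.
host = "127.0.0.1"
model = "universal_sentence_encoder_multilingual"
url = f"http://{host}:8501/v1/models/{model}:predict"

# The REST predict API accepts a JSON body with an "inputs" field.
payload = json.dumps({"inputs": ["Hello, world!"]})

print(url)
print(payload)
```

You could then send the request with httpie, e.g. `http -j POST $url inputs:='["Hello, world!"]'`, and expect a JSON response containing embedding vectors.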
## Setup virtualenv and install atextcrawler
```
apt install python3-pip
adduser --home /srv/atextcrawler --disabled-password --gecos "" atextcrawler
su - atextcrawler
cat >>.bashrc <<EOF
export PYTHONPATH=\$HOME/repo/src
EOF
pip3 install --user pipenv
mkdir repo
cat >>.profile <<EOF
PYTHONPATH=\$HOME/repo/src
PATH=\$HOME/.local/bin:\$PATH
cd repo
\$HOME/.local/bin/pipenv shell
EOF
exit
su - atextcrawler
rm Pipfile
git clone https://gitea.multiname.org/a-text/atextcrawler.git $HOME/repo
virtualenv --system-site-packages `pipenv --venv` # for systemd
pipenv sync
```
Note: One of the dependencies, the Python package `tldextract`,
uses this directory for caching:
```
$HOME/.cache/python-tldextract/
```
## Configure atextcrawler
As user `atextcrawler`, execute:
```
mkdir -p $HOME/.config
cp -r $HOME/repo/doc/source/config_template $HOME/.config/atextcrawler
```
Edit `$HOME/.config/atextcrawler/main.yaml`.
If you want to override a plugin, copy it to the plugins directory
and edit it, e.g.
```
cp $HOME/repo/doc/source/config_template/plugins/filter_site.py $HOME/.config/atextcrawler/plugins
```
Optionally edit `$HOME/.config/atextcrawler/initial_data/seed_urls.list`.
Check (and print) the instance configuration:
```
python -m atextcrawler.config
```
## Test run
To see if it works, run `atextcrawler` from the command line:
```
python -m atextcrawler
```
You can follow the log with:
```
journalctl -ef SYSLOG_IDENTIFIER=atextcrawler
```
You can stop with `Ctrl-C`; stopping may take a few seconds or even minutes.
## Install systemd service
To make the service persistent, create a systemd unit file
`/etc/systemd/system/atextcrawler.service` with this content:
```
[Unit]
Description=atextcrawler web crawler
Documentation=https://gitea.multiname.org/a-text/atextcrawler
Requires=network.target elasticsearch.service tensorflow.service
After=network-online.target elasticsearch.service tensorflow.service

[Service]
Type=simple
User=atextcrawler
Group=atextcrawler
WorkingDirectory=/srv/atextcrawler/repo
Environment=PYTHONPATH=/srv/atextcrawler/repo/src
ExecStart=/srv/atextcrawler/.local/bin/pipenv run python -m atextcrawler
TimeoutStartSec=30
ExecStop=/bin/kill -INT $MAINPID
TimeoutStopSec=300
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
Then enable and start the service:
```
systemctl daemon-reload
systemctl enable atextcrawler
systemctl start atextcrawler
```
Then follow the log with:
```
journalctl -efu atextcrawler
```