## Related work

* [collection of crawlers](https://github.com/adbar/awesome-crawler)
* [collection of webscrapers](https://github.com/adbar/awesome-web-scraper)

### crawlers

* [acrawler](https://acrawler.readthedocs.io/en/latest/)
* [trafilatura](https://trafilatura.readthedocs.io/en/latest/index.html)
  * [repo](https://github.com/adbar/trafilatura)
  * [intro](https://adrien.barbaresi.eu/blog/trafilatura-main-text-content-python.html)
* [aiohttp_spider](https://github.com/niklak/aiohttp_spider/)
* [scrapy](https://docs.scrapy.org/en/latest/)
* [heritrix3](https://github.com/internetarchive/heritrix3/)
* [YaCy](https://yacy.net/)
* [searchmysite](https://searchmysite.net/)
* [spiderling](http://corpus.tools/raw-attachment/wiki/Downloads/spiderling-src-0.84.tar.xz)
* https://github.com/riteshnaik/Crawling-and-Deduplication-of-Polar-Datasets-Using-Nutch-and-Tika
* [edge search engine](https://memex.marginalia.nu/projects/edge/about.gmi)

#### general

* [elastic enterprise search](https://www.elastic.co/blog/building-a-scalable-easy-to-use-web-crawler-for-elastic-enterprise-search)

### sitemap parsers

* [ultimate-sitemap-parser](https://github.com/mediacloud/ultimate-sitemap-parser)

### url handling

* [courlan](https://pypi.org/project/courlan/)

### language detection

* [overview](https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language)
* [guess_language-spirit](https://pypi.org/project/guess_language-spirit/)
* [guess_language](https://pypi.org/project/guess-language/)
* [cld3](https://github.com/google/cld3)

### text extraction

* [JusText](http://corpus.tools/wiki/Justext_changelog) [demo](https://nlp.fi.muni.cz/projects/justext/)

### deduplication

* [PostgreSQL extension smlar](https://github.com/jirutka/smlar)
  * [use smlar](https://medium.datadriveninvestor.com/the-smlar-plug-in-for-effective-retrieval-of-massive-volumes-of-simhash-data-e429c19da1a3)
* remove paragraphs in which more than 50% of the word 7-tuples have been encountered previously

### Extract more meta tags

* https://github.com/shareaholic/shareaholic-api-docs/blob/master/shareaholic_meta_tags.md
* https://support.shareaholic.com/hc/en-us/articles/115003085186

### Date parsing dependent on language

* https://en.wikipedia.org/wiki/Date_format_by_country
* https://en.wikipedia.org/wiki/Common_Locale_Data_Repository
* https://pypi.org/project/dateparser/
* https://github.com/ovalhub/pyicu
* https://github.com/night-crawler/cldr-language-helpers
* https://stackoverflow.com/questions/19927654/using-dateutil-parser-to-parse-a-date-in-another-language

ICU

* https://unicode-org.github.io/icu/userguide/format_parse/datetime/examples.html#parse
* https://gist.github.com/dpk/8325992
* https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html
* https://unicode-org.github.io/icu/userguide/
* https://unicode-org.github.io/icu-docs/#/icu4c/
* https://github.com/ovalhub/pyicu/blob/master/samples/break.py
* https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table
* https://www.unicode.org/reports/tr35/tr35-dates.html#months_days_quarters_eras
* https://unicode-org.github.io/icu/userguide/format_parse/datetime/#formatting-dates-and-times-overview
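
The paragraph-level rule noted under deduplication (drop a paragraph when more than 50% of its word 7-tuples were seen before) could be sketched roughly as follows. This is only an illustration, not code from any of the linked projects: the names `word_shingles` and `dedup_paragraphs`, the lowercase whitespace tokenization, and the choice to keep paragraphs shorter than 7 words are all assumptions.

```python
from typing import Iterable, List, Optional, Set, Tuple

Shingle = Tuple[str, ...]


def word_shingles(text: str, n: int = 7) -> Set[Shingle]:
    """Return the set of word n-tuples in `text` (assumed tokenization:
    lowercase, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def dedup_paragraphs(
    paragraphs: Iterable[str],
    seen: Optional[Set[Shingle]] = None,
    n: int = 7,
    threshold: float = 0.5,
) -> List[str]:
    """Keep paragraphs unless more than `threshold` of their word n-tuples
    are already in `seen`. `seen` is updated in place, so it can be shared
    across documents."""
    if seen is None:
        seen = set()
    kept = []
    for paragraph in paragraphs:
        shingles = word_shingles(paragraph, n)
        # Paragraphs with fewer than n words have no n-tuples;
        # keeping them is an assumption of this sketch.
        is_duplicate = bool(shingles) and (
            len(shingles & seen) / len(shingles) > threshold
        )
        # Remember the n-tuples either way, so later near-copies match too.
        seen |= shingles
        if not is_duplicate:
            kept.append(paragraph)
    return kept
```

Passing the same `seen` set to successive calls would extend the check across pages of a crawl. For large crawls the in-memory set would presumably have to be replaced by hashed tuples in a database or a probabilistic structure.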