atextcrawler/doc/source/devel/related_work.md

crawlers

general

sitemap parsers

url handling

language detection

text extraction

deduplication

Extract more meta tags

Date parsing dependent on language

ICU