Heritrix
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.
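As a rough illustration of what respecting robots.txt and crawling at an adaptive pace can mean in practice, the sketch below uses Python's standard `urllib.robotparser`. It is not Heritrix code: the robots.txt content, the `example-bot` user agent, the `polite_delay` helper, and its parameter values are all assumptions made up for this example.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally instead of fetched live.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(last_fetch_seconds: float,
                 crawl_delay: Optional[float],
                 delay_factor: float = 5.0,
                 min_delay: float = 1.0,
                 max_delay: float = 30.0) -> float:
    """Return seconds to wait before the next request to the same host.

    Illustrative policy (an assumption, not Heritrix's actual settings):
    wait a multiple of the last fetch duration, never less than the
    site's Crawl-delay or min_delay, and never more than max_delay.
    """
    adaptive = last_fetch_seconds * delay_factor
    floor = max(min_delay, crawl_delay or 0.0)
    return min(max(adaptive, floor), max_delay)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in ("http://example.com/public/page",
                "http://example.com/private/page"):
        allowed = parser.can_fetch("example-bot", url)
        print(url, "allowed" if allowed else "blocked")
    print("wait:", polite_delay(0.4, parser.crawl_delay("example-bot")))
```

Scaling the wait with the previous response time means a slow or busy server is automatically contacted less often, which is the general idea behind an adaptive pace.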
It provides a web interface for operator control and monitoring of crawls, and stores content in the ARC or ISO-standard WARC aggregate/transcript format.
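To give a feel for the WARC format mentioned above, here is a minimal sketch that serializes a single uncompressed WARC/1.0 record using only the Python standard library. The `warc_record` function is a hypothetical helper written for this example, not part of Heritrix or any WARC library, and real archives typically add further headers and per-record gzip compression.

```python
import uuid
from datetime import datetime, timezone


def warc_record(record_type: str, target_uri: str,
                payload: bytes, content_type: str) -> bytes:
    """Serialize one uncompressed WARC/1.0 record (hypothetical helper)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",                                    # version line
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",             # length of the block only
    ]
    # Header lines end with CRLF; a blank line separates header and block,
    # and two CRLFs terminate the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    record = warc_record("resource", "http://example.com/",
                         b"hello", "text/plain")
    print(record.decode("utf-8"))
```

A WARC file is an aggregate of such records concatenated one after another, which is why each record carries its own length and identifier.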
http://crawler.archive.org/
License:
Tech:
Tags: