Nutch - Highly extensible, highly scalable Web crawler
Nutch is open-source web-search software. It builds on Lucene Java and adds web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats. Its main features include:
- Parallel and distributed fetching, parsing, and indexing
- Plugin support
- Ontology
- Clustering
- Distributed filesystem (via Hadoop)
- Link-graph database
- NTLM authentication
- MapReduce
- Many supported formats: plain text, HTML, XML, ZIP, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, JavaScript, RSS, RTF, MP3 (ID3 tags)
http://nutch.apache.org/