Tika - A content analysis toolkit
Apache Tika is a toolkit for detecting document types and extracting metadata and structured text content from documents, using existing parser libraries. It extracts text from the following file formats:
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
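To make the "detecting" step more concrete, here is a minimal sketch of signature-based type detection: it compares the first bytes of a document against a few well-known file signatures. This is only an illustration of the general idea, not Tika's implementation. The class name MagicSniffer, its signature table, and the media type strings it returns are assumptions made for this example, and it needs nothing beyond the Java standard library.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Tika): guesses a media type from leading bytes. */
final class MagicSniffer {

    /** One known file signature and the media type this sketch reports for it. */
    private static final class Signature {
        final byte[] magic;
        final String type;

        Signature(byte[] magic, String type) {
            this.magic = magic;
            this.type = type;
        }
    }

    // A tiny, illustrative table; a real detector knows far more formats
    // and also looks at file names and container contents.
    private static final Signature[] SIGNATURES = {
        new Signature("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"),
        new Signature("{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf"),
        new Signature(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png"),
        new Signature(new byte[] {'P', 'K', 0x03, 0x04}, "application/zip"),
        new Signature(new byte[] {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE},
                "application/java-vm"),
    };

    /** Returns the first matching media type, or a generic fallback. */
    static String detect(byte[] data) {
        for (Signature s : SIGNATURES) {
            if (startsWith(data, s.magic)) {
                return s.type;
            }
        }
        return "application/octet-stream";
    }

    /** True when data begins with every byte of prefix, in order. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints "application/pdf"
    }
}
```

With the Tika library on the classpath, the facade class org.apache.tika.Tika covers both steps directly: detect(...) returns the detected media type as a string and parseToString(...) returns the extracted text.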
http://tika.apache.org/