Ticket #832 (new Improvement)
Enable Daisy to do text extraction on OOXML and RTF documents
| Reported by: | Matthias Bauer <matthias.bauer.drs@…> | Owned by: | somebody |
|---|---|---|---|
| Priority: | Major | Milestone: | 2.5 |
| Component: | Repository - querying and indexing | Version: | 2.5 |
| Keywords: | | Cc: | |
Description
Not being able to do full-text search on OOXML documents is becoming a problem for us, especially with an increasing number of MS Office 2007 and 2010 clients. So I decided to do some work on that (see the attached patch).
First I used the OOXML-related APIs of Apache POI (v3.6) to create additional text extractors for .docx and .pptx files. I also extended the existing XLS extractor to handle .xlsx files. (Included for the sake of completeness.)
Later on I came across Apache Tika (http://tika.apache.org/), a Java-based text-extraction toolkit. It can extract text and metadata from a variety of file formats, including MS Office, OOXML, OpenOffice and PDF. So I have written a Tika-based text extractor as well. It can extract text from all file formats Daisy currently supports. In addition, it can handle OOXML, RTF and ePUB content, compressed archives (ZIP, TAR, ...), and image and audio file metadata.
It has a few minor issues, though:
- Text extraction on RTF files: Tika currently inserts spurious whitespace characters around every non-US-ASCII character. The text extractor tries to catch this with some regexp-based post-processing. That works quite well in most cases, but it is not perfect.
- Tika claims to be able to extract data from JAR and class files. This would require a more recent version of asm (v3.1 or so). Unfortunately, the current jBPM workflow engine does not work with that version of asm, and with the old asm version text extraction on class files does not work. (Not much of an issue for us, though.)
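To illustrate the kind of regexp-based post-processing mentioned for RTF files, here is a minimal sketch. It is not the code from the attached patch: the class name `RtfWhitespaceCleaner`, the method `collapse`, and the assumption that exactly one space is inserted on each side of a non-ASCII letter are all hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the patch.
final class RtfWhitespaceCleaner {

    // Assumption: the spurious output looks like "letter, space, non-ASCII letter,
    // space, letter". The lookbehind/lookahead require a letter on both sides,
    // so only the two spaces and the non-ASCII letter itself are consumed.
    private static final Pattern FAKE_SPACES =
            Pattern.compile("(?<=\\p{L}) ([\\p{L}&&[^\\x00-\\x7F]]) (?=\\p{L})");

    private RtfWhitespaceCleaner() {
    }

    // Removes the single spaces surrounding a non-ASCII letter inside a word.
    static String collapse(String extracted) {
        return FAKE_SPACES.matcher(extracted).replaceAll("$1");
    }
}
```

With this sketch, `RtfWhitespaceCleaner.collapse("M \u00fc ller")` yields `"M\u00fcller"`, while pure ASCII text is left alone. A heuristic like this cannot tell spurious spaces from real ones: a genuine one-letter word such as the French "à" between two other words would be joined to its neighbours, and non-ASCII letters at the start or end of a word are not handled at all. That is one way such post-processing can work in most cases without being perfect.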
Attachments
Change History
Changed 3 years ago by Matthias Bauer <matthias.bauer.drs@…>
- Attachment daisy-2.5-TikaTextExtraction.patch added
Changed 9 months ago by matthias.bauer.drs@…
- Attachment tika-1.1-for-daisy-2.5-dev.patch added
Improved patch for the Tika text extractor, using Tika 1.1. The patch is against the current trunk.