Ticket #832 (new Improvement)
Enable Daisy to do text-extraction on OOXML and RTF documents
|Reported by:||Matthias Bauer <matthias.bauer.drs@…>||Owned by:||somebody|
|Component:||Repository - querying and indexing||Version:||2.5|
Not being able to do full-text search on OOXML documents is becoming a problem for us - especially with an increasing number of MS Office '07 and '10 clients. So, I decided to do some work on that (see attached patch).
First, I used the OOXML-related APIs of Apache POI (v. 3.6) to create additional text-extractors for .docx and .pptx files. I also extended the existing XLS extractor to cope with .xlsx files as well. (Included for the sake of completeness...)
Later on I stumbled upon Apache Tika (http://tika.apache.org/). That's a Java-based text-extraction toolkit. It is capable of extracting text and meta-data from a variety of file formats, including MS Office, OOXML, OpenOffice and PDF. So, I've written a Tika-based text-extractor as well. It is capable of doing text-extraction on all file formats Daisy currently supports. In addition, it can handle OOXML, RTF and ePUB content, compressed archives (ZIP, TAR,...), and image and audio file meta-data.
There are a few minor issues with it, though:
- The first is text-extraction on RTF files: Tika currently inserts spurious white-space characters around every non-US-ASCII character. The text-extractor tries to catch that with some regexp-based post-processing. That works quite well in most cases, but it's not perfect.
- Tika claims to be able to extract data from JAR and class files. This would require the use of a more recent version of asm (v3.1 or so). Unfortunately, the current jBPM workflow engine won't work with this version of asm. With the old asm version, text-extraction on class files isn't working. (Not much of an issue for us, though.)
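To illustrate the kind of regexp-based post-processing mentioned in the RTF point above, here is a minimal sketch. It is not the code from the attached patch. The class name `RtfWhitespaceCleanup` and the heuristic itself are assumptions made for illustration: a single space directly next to a non-US-ASCII character is treated as spurious and removed, while a run of two or more spaces is taken to contain a real word boundary and is collapsed to one space.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical clean-up step for text extracted from RTF files.
 * Assumption: every non-US-ASCII character arrives wrapped in one
 * spurious space on each side, so a genuine word boundary next to
 * such a character shows up as two or more spaces.
 */
public class RtfWhitespaceCleanup {

    // A lone space right before a non-ASCII character (and not preceded by a space),
    // or a lone space right after a non-ASCII character (and not followed by a space).
    private static final Pattern SPURIOUS_SPACE = Pattern.compile(
            "(?<! ) (?=[^\\x00-\\x7F])|(?<=[^\\x00-\\x7F]) (?! )");

    // Two or more consecutive spaces.
    private static final Pattern SPACE_RUN = Pattern.compile(" {2,}");

    public static String clean(String extracted) {
        // Step 1: drop the lone spaces that hug non-ASCII characters.
        String withoutSpurious = SPURIOUS_SPACE.matcher(extracted).replaceAll("");
        // Step 2: collapse the remaining runs of spaces to a single space.
        return SPACE_RUN.matcher(withoutSpurious).replaceAll(" ");
    }

    public static void main(String[] args) {
        // \u00FC is u-umlaut; the input simulates the garbled extractor output.
        System.out.println(clean("M \u00FC nchen"));
    }
}
```

Like the post-processing described above, this heuristic is not perfect: it cannot distinguish a genuine single space next to a non-ASCII character from a spurious one, and two adjacent non-ASCII characters that each carry their own padding would look like a word boundary.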