id,summary,reporter,owner,description,type,status,priority,milestone,component,version,resolution,keywords,cc
832,Enable Daisy to do text extraction on OOXML and RTF documents,Matthias Bauer <matthias.bauer.drs@…>,somebody,"Not being able to do full-text search on OOXML documents is becoming a problem for us, especially with an increasing number of MS Office 2007 and 2010 clients. So I decided to do some work on that (see attached patch).

First, I used the OOXML-related APIs of Apache POI (v3.6) to create additional text extractors for .docx and .pptx files. I also extended the existing XLS extractor to handle .xlsx files. (Included for the sake of completeness.)

Later on, I came across Apache Tika ([http://tika.apache.org/]), a Java-based text-extraction toolkit. It can extract text and metadata from a variety of file formats, including MS Office, OOXML, OpenOffice and PDF. So I've written a Tika-based text extractor as well. It can extract text from all file formats Daisy currently supports. In addition, it can handle OOXML, RTF and ePUB content, compressed archives (ZIP, TAR, ...), and image and audio file metadata.

There are a few minor issues with it, though:

 * Text extraction on RTF files: Tika currently introduces spurious whitespace characters around every non-US-ASCII character. The text extractor tries to catch that with some regexp-based post-processing. That works quite well in most cases, but it's not perfect.

 * Text extraction on JAR and class files: Tika claims to be able to extract data from these, but that would require a more recent version of ASM (v3.1 or so). Unfortunately, the current jBPM workflow engine won't work with that version of ASM, and with the old ASM version, text extraction on class files doesn't work. (Not much of an issue for us, though.)",Improvement,new,Major,2.5,Repository - querying and indexing,2.5,,,
