Ticket #832 (new Improvement)

Opened 3 years ago

Last modified 9 months ago

Enable Daisy to do text extraction on OOXML and RTF documents

Reported by: Matthias Bauer <matthias.bauer.drs@…> Owned by: somebody
Priority: Major Milestone: 2.5
Component: Repository - querying and indexing Version: 2.5
Keywords: Cc:

Description

Not being able to do full-text search on OOXML documents is becoming a problem for us, especially with an increasing number of MS Office 2007 and 2010 clients. So I decided to do some work on that (see attached patch).

First I used the OOXML-related APIs of Apache POI (v. 3.6) to create additional text extractors for .docx and .pptx files. I also extended the existing XLS extractor to cope with .xlsx files. (Included for the sake of completeness...)

Later on I stumbled upon Apache Tika (http://tika.apache.org/), a Java-based text-extraction toolkit. It can extract text and metadata from a variety of file formats, including MS Office, OOXML, OpenOffice and PDF. So I've written a Tika-based text extractor as well. It can do text extraction on all file formats Daisy currently supports. In addition, it can handle OOXML, RTF and ePUB content, compressed archives (ZIP, TAR,...), and image and audio file metadata.

There are a few minor issues with it, though:

  • The first is text extraction on RTF files: Tika currently introduces fake white-space characters around every non-US-ASCII character. The text extractor tries to catch that with some regexp-based post-processing. That works quite well in most cases, but it's not perfect.
  • Tika claims to be able to extract data from JAR and class files. This would require a more recent version of asm (v3.1 or so). Unfortunately, the current jBPM workflow engine won't work with that version of asm, and with the old asm version, text extraction on class files doesn't work. (Not much of an issue for us, though.)
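
To illustrate the kind of regexp-based post-processing mentioned in the first point, here is a minimal sketch. It is not the code from the attached patch: the class name RtfWhitespaceCleanup, the method clean and the pattern itself are assumptions made for illustration only. It assumes the fake white-space shows up as a single space on each side of a non-ASCII letter inside a word, and it only collapses those spaces when there is a letter on both outer sides.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the attached patch.
class RtfWhitespaceCleanup {

    // Matches: (letter) SPACE (one non-ASCII letter) SPACE (letter).
    // The lookbehind/lookahead keep the surrounding letters out of the match,
    // so only " x " is replaced by "x".
    private static final Pattern FAKE_SPACES =
            Pattern.compile("(?<=\\p{L}) ([\\p{L}&&[^\\p{ASCII}]]) (?=\\p{L})");

    // Removes the assumed fake spaces around non-ASCII letters inside words.
    static String clean(String extracted) {
        return FAKE_SPACES.matcher(extracted).replaceAll("$1");
    }
}
```

With this sketch, clean("M ü nchen") returns "München", while plain ASCII text and non-letter characters such as a dash between two words are left alone. It also shows why such a heuristic can't be perfect: a non-ASCII letter at the start or end of a word is not repaired, and a genuine one-letter non-ASCII word between two other words would be wrongly merged with its neighbours.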

Attachments

daisy-2.5-TikaTextExtraction.patch (88.4 KB) - added by Matthias Bauer <matthias.bauer.drs@…> 3 years ago.
tika-1.1-for-daisy-2.5-dev.patch (42.5 KB) - added by matthias.bauer.drs@… 9 months ago.
Improved patch for the Tika text extractor, using Tika 1.1. The patch is against current trunk.

Change History

Changed 3 years ago by Matthias Bauer <matthias.bauer.drs@…>

Changed 9 months ago by matthias.bauer.drs@…

Improved patch for the Tika text extractor, using Tika 1.1. The patch is against current trunk.

comment:1 Changed 9 months ago by matthias.bauer.drs@…

I've improved the old patch. It now uses Tika 1.1. The two minor issues mentioned in the description above should be solved now.
