Apache Tika - Custom version

For a knowledge mining project, our team wanted a consistent representation of embedded images in Apache Tika's XHTML output.

For example, the embedded image links were missing for PowerPoint but present for PDF.

In a nutshell, this version tries to harmonize how embedded images appear in the XHTML output. Once it has stabilized, we plan to propose our changes to the Apache Tika community.

Feature set of this custom version

XHTML Tags

  • PDF : page tags contain the page id. <div class="page" id="2">
  • PPTX : added a slide div with the slide id and title (when available). <div class="slide" id="1"> <div class="slide" id="2" title="Oil Production">
  • PPT/PPTX : the slide-notes div is renamed to slide-notes-content for consistency. <div class="slide-notes-content">
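
With page and slide containers marked this way, a consumer can list them with a standard HTML parser. The following is a minimal sketch using Python's html.parser; the ContainerLister class and the sample string are illustrative only and are not part of this custom version.

```python
from html.parser import HTMLParser


class ContainerLister(HTMLParser):
    """Collects (class, id, title) for every page or slide div."""

    def __init__(self):
        super().__init__()
        self.containers = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attributes = dict(attrs)
        if attributes.get("class") in ("page", "slide"):
            self.containers.append(
                (attributes["class"], attributes.get("id"), attributes.get("title"))
            )


# Sample built from the tag examples listed above.
sample = (
    '<div class="slide" id="1"></div>'
    '<div class="slide" id="2" title="Oil Production"></div>'
    '<div class="page" id="2"></div>'
)
lister = ContainerLister()
lister.feed(sample)
print(lister.containers)
```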

Embedded Images XHTML tags

  • Numbering uses 5 digits, left-padded with zeros.
  • Consistent representation in the XHTML for both Office and PDF documents.
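
The numbering scheme can be sketched in a few lines of Python. The function name embedded_image_name is hypothetical, and how this version handles more than 99999 images is not stated here; the format below would simply widen.

```python
def embedded_image_name(index: int, extension: str) -> str:
    """Build an image file name with a 5-digit, zero-padded index."""
    return f"image{index:05d}.{extension}"


print(embedded_image_name(0, "jpg"))  # image00000.jpg
print(embedded_image_name(5, "wmf"))  # image00005.wmf
```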

For PDF - Original

<img src="embedded:image0.jpg" alt="image0.jpg"/>

For PDF - Our version

<img src="image00000.jpg" alt="image00000.jpg" class="embedded" id="00000" 
    contenttype="image/jpeg" witdh="1491" height="2109" size="775380"/>

For PowerPoint - Original

<div class="embedded" id="slide6_rId3" />

For PowerPoint - Our version

<img class="embedded" id="slide5_rId4" contenttype="image/x-wmf" src="image00005.wmf" alt="image5.wmf" 
title="Picture 4" witdh="120" height="172" size="7092"/>

We know that some of these img attributes are not HTML compliant. Still, the output above comes close to providing an HTML preview of any document.

Benefits: we can scan for large images or for specific image types, filtering by content type, size in bytes, or dimensions.
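
As an illustration of that kind of scan, here is a minimal sketch that filters embedded images by byte size and content type. The EmbeddedImageScanner class and the large_images helper are hypothetical names; the sample reuses the two img tags shown above, copied verbatim.

```python
from html.parser import HTMLParser
from typing import List, Optional


class EmbeddedImageScanner(HTMLParser):
    """Collects the attributes of every <img class="embedded"> element."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes self-closing <img .../> tags through here.
        attributes = dict(attrs)
        if tag == "img" and attributes.get("class") == "embedded":
            self.images.append(attributes)


def large_images(
    xhtml: str, min_bytes: int, content_type: Optional[str] = None
) -> List[str]:
    """Return the src of embedded images of at least min_bytes bytes."""
    scanner = EmbeddedImageScanner()
    scanner.feed(xhtml)
    return [
        image["src"]
        for image in scanner.images
        if int(image.get("size", 0)) >= min_bytes
        and (content_type is None or image.get("contenttype") == content_type)
    ]


# Attribute names and values copied verbatim from the examples above.
sample = """
<img src="image00000.jpg" alt="image00000.jpg" class="embedded" id="00000"
    contenttype="image/jpeg" witdh="1491" height="2109" size="775380"/>
<img class="embedded" id="slide5_rId4" contenttype="image/x-wmf" src="image00005.wmf" alt="image5.wmf"
title="Picture 4" witdh="120" height="172" size="7092"/>
"""

print(large_images(sample, min_bytes=100_000))
print(large_images(sample, min_bytes=0, content_type="image/x-wmf"))
```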

New PDF parser configuration options

The new PDF parser configuration options all relate to image extraction, so they take effect only when calling the unpack endpoint. They also require the extractInlineImages option to be set to true.

  • allPagesAsImages : converts every PDF page to an image.
  • singlePagePDFAsImage : converts a single-page PDF to an image.
  • stripedImagesHandling : converts a PDF page to an image based on its number of content streams. Some PDF writers tend to stripe an image across multiple content streams (an array).
  • stripedImagesThreshold : minimum number of content streams required to convert the page to an image.
  • graphicsToImage : a page containing graphics objects may be better represented as an image.
  • graphicsToImageThreshold : minimum number of graphics objects required to convert the page to an image.

To use these features, add the corresponding headers, each prefixed with X-Tika-PDF.
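
One way to read how these options could combine is sketched below. This is an interpretation, not the parser's actual code: the order of the checks, the use of "greater than or equal" for the thresholds, and the default threshold values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PdfImageOptions:
    # Field names mirror the option names; default values are placeholders,
    # not the defaults of this custom version.
    allPagesAsImages: bool = False
    singlePagePDFAsImage: bool = False
    stripedImagesHandling: bool = False
    stripedImagesThreshold: int = 5
    graphicsToImage: bool = False
    graphicsToImageThreshold: int = 1000


def render_page_as_image(
    options: PdfImageOptions,
    page_count: int,
    content_streams: int,
    graphics_objects: int,
) -> bool:
    """Decide whether a page would be converted to an image (assumed logic)."""
    if options.allPagesAsImages:
        return True
    if options.singlePagePDFAsImage and page_count == 1:
        return True
    if (
        options.stripedImagesHandling
        and content_streams >= options.stripedImagesThreshold
    ):
        return True
    if (
        options.graphicsToImage
        and graphics_objects >= options.graphicsToImageThreshold
    ):
        return True
    return False


striped = PdfImageOptions(stripedImagesHandling=True, stripedImagesThreshold=3)
print(render_page_as_image(striped, page_count=10, content_streams=3, graphics_objects=0))
```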

--header "X-Tika-PDFextractInlineImages:true" --header "X-Tika-PDFsinglePagePDFAsImage:true"
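
The same headers can be built programmatically. The sketch below assumes this version keeps the stock tika-server conventions (a PUT to /unpack, with localhost:9998 as an example address); the helper names pdf_config_headers and build_unpack_request are hypothetical. It only constructs the request and does not send it.

```python
import urllib.request

HEADER_PREFIX = "X-Tika-PDF"


def pdf_config_headers(**options):
    """Map PDF parser option names to X-Tika-PDF prefixed headers."""
    headers = {}
    for name, value in options.items():
        # Booleans are sent as lowercase "true"/"false", as in the flags above.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        headers[HEADER_PREFIX + name] = text
    return headers


def build_unpack_request(server_url, pdf_bytes, **options):
    """Build (but do not send) a PUT request to the unpack endpoint."""
    return urllib.request.Request(
        server_url.rstrip("/") + "/unpack",  # assumed endpoint path
        data=pdf_bytes,
        headers=pdf_config_headers(**options),
        method="PUT",
    )


headers = pdf_config_headers(extractInlineImages=True, singlePagePDFAsImage=True)
print(headers)

request = build_unpack_request(
    "http://localhost:9998",  # example address, adjust to your server
    b"%PDF-1.4",              # placeholder bytes standing in for a real PDF
    extractInlineImages=True,
    singlePagePDFAsImage=True,
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; the response would be the archive described in the next section.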

Azure Blob Storage support for unpacking (tika-server)

The unpack feature produces an archive response which you can expand and process. For documents containing many high-resolution images, unpack can hit limits such as out-of-memory (OOM) errors.

To avoid those limits and to support cloud storage and large documents, I implemented an azure-unpack resource that writes each embedded resource directly into an Azure Storage container and directory.

Benefits: no client-side archive expansion, reduced network bandwidth, support for documents with many high-resolution images, and more.

See tika-server for more details on how to use that feature.

DISCLAIMER

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.