PUTHURR

OCRLAYOUT (Python)

Provides the ability to get more meaningful output from common OCR responses by manipulating bounding boxes.

Current OCR engine responses focus on text recall. Ocrlayout tries to go a step further by re-ordering the lines of text so that the output approaches the order in which a human would read them.
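To make the idea of re-ordering by bounding box concrete, here is a minimal sketch in Python. It is not the ocrlayout implementation or its API: the `Line` class, the `reading_order` function, and the `row_tolerance` parameter are hypothetical names introduced only for illustration, and the heuristic (group lines whose vertical centers are close, then sort each group left to right) is a simplified assumption about how such a re-ordering could work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical OCR line: text plus an axis-aligned bounding box."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def reading_order(lines: List[Line], row_tolerance: float = 0.5) -> List[Line]:
    """Group lines into rows by vertical center, then sort each row left to right.

    row_tolerance is an assumed tuning knob: two lines share a row when their
    vertical centers differ by at most row_tolerance * the taller line's height.
    """
    rows: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: l.y):
        center = line.y + line.h / 2
        if rows:
            ref = rows[-1][0]  # first line of the current row is the reference
            ref_center = ref.y + ref.h / 2
            if abs(center - ref_center) <= row_tolerance * max(ref.h, line.h):
                rows[-1].append(line)
                continue
        rows.append([line])

    ordered: List[Line] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda l: l.x))
    return ordered


if __name__ == "__main__":
    # An engine may return lines in detection order rather than reading order.
    detected = [
        Line("world", x=120, y=8, w=60, h=20),
        Line("Second row", x=10, y=50, w=100, h=20),
        Line("Hello", x=10, y=10, w=50, h=20),
    ]
    print(" ".join(l.text for l in reading_order(detected)))
    # prints "Hello world Second row"
```

A real layout engine has to handle much more than this sketch does (multiple columns, rotated text, tables), which is the kind of problem ocrlayout aims to address.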

Problem Statement

When processing images that contain a lot of textual information, it becomes important to assemble the generated text into meaningful blocks. While OCR engine responses are fine for recall, they do not necessarily produce meaningful output text for further text processing.

More meaningful output for what?

  • Text Analytics: apply text analytics such as key phrase or entity extraction with more confidence in the outcome.
  • Accessibility: any infographic comes alive, going beyond what alt text alone can offer.
  • Read Aloud: it becomes easier to build solutions that read an image aloud, enriching the verbal narrative of visual information.
  • Machine Translation: get more accurate MT output because more context is retained.
  • Sentence/Paragraph Classification: for scanned images such as contracts, more meaningful textual output lets you classify content at a granular level, for example by risk, personal clauses, or conditions.

Documentation Repository

OCRLAYOUT Recipes (Python, .NET Core)

This repository demonstrates the value of ocrlayout through real-world examples.

Repository

CUSTOM APACHE TIKA (Java)

For a knowledge mining project, our team was looking for a consistent representation of embedded images in Apache Tika XHTML output.

For example, embedded image links were missing for PowerPoint files but present for PDF files.

In a nutshell, this version tries to harmonize how embedded images appear in the XHTML output. Once it has stabilized, we plan to propose our changes to the Apache Tika community.

More details here

GLOBAL DISCLAIMER

ALL DESCRIBED SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.