Using Tesseract OCR to extract scanned invoice data in a Java application

Andrzej Karczyński

Software Architect

  • August 4, 2015

Optical character recognition (OCR) can be used to extract textual data from images, such as scanned documents. Generally, it works as follows:

  • Pre-process the image data, for example: convert to grayscale, smooth, de-skew, filter.
  • Detect lines, words and characters.
  • Produce a ranked list of candidate characters based on a trained data set.
  • Post-process the recognized characters: choose the best candidates based on the confidence from the previous step and on language data. Language data includes a dictionary, grammar rules, etc.
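
To make the first step more concrete, here is a minimal sketch of one pre-processing operation, grayscale conversion, using only the Java standard library. The class and method names are hypothetical, and Tesseract performs its own pre-processing internally, so this only illustrates the idea rather than a required step.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

class Preprocess {
    /** Returns a grayscale copy of the source image; the source is left unchanged. */
    static BufferedImage toGrayscale(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing into a TYPE_BYTE_GRAY image lets Java 2D do the color conversion.
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    public static void main(String[] args) {
        // A blank stand-in for a scanned page.
        BufferedImage scan = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        BufferedImage gray = toGrayscale(scan);
        System.out.println(gray.getWidth() + "x" + gray.getHeight());
    }
}
```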

There are a couple of open source OCR engines. The most popular is Tesseract-OCR. Its main advantage is high character recognition accuracy, and it also ships with prepared trained data sets for 39 languages. You could train the OCR engine yourself, but that is a rather difficult task.

Use case

Companies often want their paper invoices digitized. This comes with some challenges:

  • Invoice documents are not standardized.
  • Processing a whole document can easily confuse OCR engines.
  • Documents can have a non-trivial layout and can contain tables and other non-text elements (like branding).

Fortunately, we can achieve better results by making this process partially manual:

  • The user is presented with an image of the scanned invoice.
  • The user selects a document fragment and chooses a target field. The target field can be one of: invoice number, invoice description, total price.
  • The selected fragment is converted to an image and sent to the server.
  • The server uses tesseract-ocr to process the image fragment and sends the text data back to the client.
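
The article does not say where the fragment is cut out of the page. Assuming it happens in Java, the third step could be sketched roughly as follows, with the user's selection passed in as a rectangle in page pixel coordinates. `FragmentCropper` and `cropToPng` are hypothetical names.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

class FragmentCropper {
    /** Cuts the selected region out of the scanned page and encodes it as PNG bytes. */
    static byte[] cropToPng(BufferedImage page, Rectangle selection) {
        // Clamp the selection to the page so getSubimage does not throw on overshoot.
        Rectangle bounded = selection.intersection(
                new Rectangle(0, 0, page.getWidth(), page.getHeight()));
        if (bounded.isEmpty()) {
            throw new IllegalArgumentException("Selection lies outside the page");
        }
        BufferedImage fragment = page.getSubimage(
                bounded.x, bounded.y, bounded.width, bounded.height);
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(fragment, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A blank stand-in for a scanned page and a user selection.
        BufferedImage page = new BufferedImage(800, 1100, BufferedImage.TYPE_INT_RGB);
        byte[] png = cropToPng(page, new Rectangle(500, 40, 250, 30));
        System.out.println("Fragment encoded as " + png.length + " PNG bytes");
    }
}
```

The resulting bytes can then be sent to the server in whatever way the application already uses for uploads.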

Choosing a target field has one more advantage: we can further tune the OCR engine based on the type of data to be extracted.

How to use Tesseract OCR from Java?

Tesseract-OCR is written in C++. Fortunately, Java bindings are also available.

To use tesseract-ocr in a Maven project, import the following dependencies:

You can choose from several prepackaged native libraries. For example, on Windows:

Put the chosen trained data set in the tessdata directory.

Here is an example of how to use tesseract-ocr in Java:

The above code initializes Tesseract with pol.traineddata, processes the image located at the given file path, and returns the result.
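
As a rough, binding-free stand-in for the same flow, the sketch below swaps in a different technique: it starts the `tesseract` command-line tool through `ProcessBuilder` instead of calling the Java bindings. It assumes the tool is installed and on the PATH, that the installed build accepts `stdout` as the output base, and that pol.traineddata is in its tessdata directory. `TesseractCli` and its method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class TesseractCli {
    /** Builds the command line: tesseract IMAGE stdout -l LANGUAGE. */
    static List<String> buildCommand(Path image, String language) {
        List<String> command = new ArrayList<>();
        command.add("tesseract");
        command.add(image.toString());
        command.add("stdout");  // write the recognized text to standard output
        command.add("-l");
        command.add(language);  // for example "pol" selects pol.traineddata
        return command;
    }

    /** Runs Tesseract on the image and returns the recognized text. */
    static String recognize(Path image, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, language))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        ByteArrayOutputStream text = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.write(buffer, 0, n);
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return new String(text.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            // Needs a local Tesseract installation.
            System.out.println(recognize(Paths.get(args[0]), "pol"));
        } else {
            // Without an argument, only show the command that would be run.
            System.out.println(String.join(" ", buildCommand(Paths.get("invoice.png"), "pol")));
        }
    }
}
```

Starting a process per image is slower than keeping an initialized engine in memory, which is one reason to prefer the bindings on a busy server.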

Better accuracy with a whitelist of characters

Tesseract is great at recognizing text, but it sometimes gets confused when you want to extract numbers or special identifiers (like invoice numbers).

We can tune Tesseract to better recognize characters based on context by using a character whitelist.

For example, for numbers we can use:

and for invoice number:

The letter “I” is removed because it is easily confused with the symbol “/”, which is much more common in invoice numbers.
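
The exact whitelists are not reproduced here, so the following sketch builds two plausible ones. The parameter name `tessedit_char_whitelist` is a real Tesseract setting; the character sets themselves and the class name `Whitelists` are assumptions made for illustration.

```java
class Whitelists {
    /** Tesseract parameter that restricts which characters the engine may output. */
    static final String PARAMETER = "tessedit_char_whitelist";

    /** Assumed whitelist for numeric fields: digits plus decimal separators. */
    static final String NUMBERS = "0123456789,.";

    /** Assumed whitelist for invoice numbers: digits, "/", "-" and A-Z without "I". */
    static final String INVOICE_NUMBER = buildInvoiceNumberWhitelist();

    private static String buildInvoiceNumberWhitelist() {
        StringBuilder allowed = new StringBuilder("0123456789/-");
        for (char c = 'A'; c <= 'Z'; c++) {
            if (c != 'I') {  // "I" is left out so that "/" is not misread as a letter
                allowed.append(c);
            }
        }
        return allowed.toString();
    }

    public static void main(String[] args) {
        System.out.println(PARAMETER + " = " + NUMBERS);
        System.out.println(PARAMETER + " = " + INVOICE_NUMBER);
    }
}
```

Whichever string matches the chosen target field would be set on the engine under the `PARAMETER` name before recognition, for example through Tesseract's `SetVariable` call.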

Turn off dictionaries

Tesseract-OCR post-processing corrects recognized characters based on language data. For this reason, the character “/” is often recognized as “1” in invoice numbers: no dictionary words contain a slash, and strings of digits rarely contain special characters.

Fortunately, we can turn off dictionary post-processing in the Tesseract initializer:
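
As a hedged sketch of what such a configuration might contain: `load_system_dawg` and `load_freq_dawg` are real Tesseract parameters that control loading of the main and frequent-word dictionaries, and they are read when the engine is initialized. The class below only collects the settings; `DictionaryConfig` is a hypothetical name, and rendering the settings as `-c name=value` command-line arguments assumes a Tesseract build whose CLI forwards them to initialization.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DictionaryConfig {
    /** Settings that ask Tesseract not to load its word dictionaries. */
    static Map<String, String> withoutDictionaries() {
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("load_system_dawg", "F");  // main dictionary
        variables.put("load_freq_dawg", "F");    // frequent-words dictionary
        return variables;
    }

    /** Renders the settings as "-c name=value" pairs for the tesseract command line. */
    static List<String> toCliArguments(Map<String, String> variables) {
        List<String> arguments = new ArrayList<>();
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            arguments.add("-c");
            arguments.add(entry.getKey() + "=" + entry.getValue());
        }
        return arguments;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", toCliArguments(withoutDictionaries())));
    }
}
```

With the Java bindings, the same names and values would instead be passed to the initialization call that accepts lists of variable names and values.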

Multi-threading

Since version 3.01, you can use several instances of Tesseract in a multi-threaded environment, but an instance must not be shared between threads: each thread works with its own instance. Because of that, it is best to use an object pool in a multi-threaded environment.
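
A minimal sketch of such a pool, built on a blocking queue from the Java standard library, might look like this. `EnginePool` is a hypothetical name, and the engine type is left generic so that a real Tesseract instance can be plugged in; the demo uses a plain placeholder object instead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

class EnginePool<E> {
    private final BlockingQueue<E> idle;

    /** Creates all engines up front; each one is handed to one thread at a time. */
    EnginePool(int size, Supplier<E> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Borrows an engine, runs the task, and always returns the engine to the pool. */
    <R> R withEngine(Function<E, R> task) {
        E engine;
        try {
            engine = idle.take();  // blocks while all engines are busy
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for an engine", e);
        }
        try {
            return task.apply(engine);
        } finally {
            idle.add(engine);
        }
    }

    /** Number of engines that are currently not borrowed. */
    int available() {
        return idle.size();
    }

    public static void main(String[] args) {
        // Object stands in for an initialized OCR engine.
        EnginePool<Object> pool = new EnginePool<>(2, Object::new);
        String result = pool.withEngine(engine -> "recognized text placeholder");
        System.out.println(result + ", idle engines: " + pool.available());
    }
}
```

Sizing the pool to the number of worker threads keeps callers from blocking, at the cost of holding that many initialized engines in memory.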

Summary

Tesseract-OCR is a great tool for character recognition. In this post I showed how to use it from Java and how to successfully extract data from invoice documents.
