OCR can be used to extract textual data from images, such as scanned documents. Generally it works as follows:
There are a couple of open source OCR engines. The most popular is Tesseract-OCR. Its main advantages are high character recognition accuracy and ready-made trained data sets for 39 languages. You could train the OCR engine yourself, but that is a rather difficult task.
It is often the case that companies want their paper invoices digitized. This comes with some challenges:
Fortunately we can achieve better results by making this process partially manual:
Choosing the target field has one more advantage: we can further tune the OCR engine based on the type of data to be extracted.
Tesseract-OCR is written in C++. Fortunately, there are also Java bindings.
To use Tesseract-OCR in a Maven project, import the following dependencies:
    <dependency>
        <groupId>org.bytedeco.javacpp-presets</groupId>
        <artifactId>tesseract</artifactId>
        <version>3.03-rc1-0.11</version>
    </dependency>
    <dependency>
        <groupId>org.bytedeco.javacpp-presets</groupId>
        <artifactId>leptonica</artifactId>
        <version>1.72-1.0</version>
    </dependency>
You can choose from several prepackaged native libraries. For example, on Windows:
    <dependency>
        <groupId>org.bytedeco.javacpp-presets</groupId>
        <artifactId>tesseract</artifactId>
        <version>3.03-rc1-0.11</version>
        <classifier>windows-x86_64</classifier>
    </dependency>
    <dependency>
        <groupId>org.bytedeco.javacpp-presets</groupId>
        <artifactId>leptonica</artifactId>
        <version>1.72-1.0</version>
        <classifier>windows-x86_64</classifier>
    </dependency>
Put the chosen trained data set in the tessdata directory.
Here is an example of how to use Tesseract-OCR in Java:
    import java.io.UnsupportedEncodingException;
    import org.bytedeco.javacpp.BytePointer;
    import org.bytedeco.javacpp.lept;
    import org.bytedeco.javacpp.lept.PIX;
    import org.bytedeco.javacpp.tesseract.TessBaseAPI;

    public String process(String file) {
        TessBaseAPI api = new TessBaseAPI();
        // "." is the data path, "pol" selects pol.traineddata
        if (api.Init(".", "pol") != 0) {
            throw new RuntimeException("Could not initialize tesseract.");
        }
        PIX image = null;
        BytePointer outText = null;
        try {
            // Load the image with Leptonica and hand it to Tesseract
            image = lept.pixRead(file);
            api.SetImage(image);
            outText = api.GetUTF8Text();
            String string = outText.getString("UTF-8");
            if (string != null) {
                string = string.trim();
            }
            return string;
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("charset", e);
        } finally {
            // Release native resources
            if (outText != null) {
                outText.deallocate();
            }
            if (image != null) {
                lept.pixDestroy(image);
            }
            if (api != null) {
                api.End();
            }
        }
    }
The code above initializes Tesseract with pol.traineddata, processes the image located at the given file path, and returns the recognized text.
Tesseract is great at recognizing text, but it sometimes gets confused when you want to extract numbers or special identifiers (like invoice numbers).
We can tune Tesseract to better recognize characters based on context, using a character whitelist.
For example, for numbers we can use:
    api.SetVariable("tessedit_char_whitelist", "0123456789,");
and for invoice numbers:
    api.SetVariable("tessedit_char_whitelist", "0123456789,/ABCDEFGHJKLMNPQRSTUVWXY");
The letter “I” is removed, because the symbol “/” is much more common in invoice numbers.
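To tie the whitelist to the type of field being read, the two whitelists above can be kept in one lookup. This is only a sketch: the `FieldType` enum and the `Whitelists` class are hypothetical names invented for illustration, and the empty whitelist for free text assumes that an empty value means "no restriction".

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical field types; these names are illustrative, not part of Tesseract.
enum FieldType { NUMBER, INVOICE_NUMBER, FREE_TEXT }

final class Whitelists {
    private static final Map<FieldType, String> BY_TYPE = new EnumMap<>(FieldType.class);

    static {
        // The same whitelists as in the snippets above.
        BY_TYPE.put(FieldType.NUMBER, "0123456789,");
        BY_TYPE.put(FieldType.INVOICE_NUMBER, "0123456789,/ABCDEFGHJKLMNPQRSTUVWXY");
        // Assumption: an empty whitelist leaves recognition unrestricted.
        BY_TYPE.put(FieldType.FREE_TEXT, "");
    }

    // Returns the whitelist to use for the given kind of field.
    static String forField(FieldType type) {
        return BY_TYPE.get(type);
    }

    public static void main(String[] args) {
        for (FieldType type : FieldType.values()) {
            System.out.println(type + " -> '" + forField(type) + "'");
        }
    }
}
```

The result would then be passed on as `api.SetVariable("tessedit_char_whitelist", Whitelists.forField(FieldType.INVOICE_NUMBER))` before calling `SetImage`.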
Tesseract-OCR post-processing corrects recognized characters based on language data. For this reason, the character “/” in invoice numbers is often recognized as “1”: no dictionary words contain a slash, and strings of digits rarely contain special characters.
Fortunately, we can turn off dictionary post-processing when initializing Tesseract:
    // Requires java.nio.ByteBuffer plus StringGenericVector and STRING
    // from org.bytedeco.javacpp.tesseract.
    // Names of the dictionary (DAWG) parameters to switch off
    StringGenericVector pars = new StringGenericVector();
    api = new TessBaseAPI();
    pars.addPut(new STRING("load_system_dawg"));
    pars.addPut(new STRING("load_freq_dawg"));
    pars.addPut(new STRING("load_punc_dawg"));
    pars.addPut(new STRING("load_number_dawg"));
    pars.addPut(new STRING("load_unambig_dawg"));
    pars.addPut(new STRING("load_bigram_dawg"));
    pars.addPut(new STRING("load_fixed_length_dawgs"));
    // One "0" per parameter above, in the same order
    StringGenericVector parsValues = new StringGenericVector();
    parsValues.addPut(new STRING("0"));
    parsValues.addPut(new STRING("0"));
    parsValues.addPut(new STRING("0"));
    parsValues.addPut(new STRING("0"));
    parsValues.addPut(new STRING("0"));
    parsValues.addPut(new STRING("0"));
    parsValues.addPut(new STRING("0"));
    if (api.Init(".", "pol", 0, (ByteBuffer)null, 0, pars, parsValues, false) != 0) {
        throw new RuntimeException("Could not initialize tesseract.");
    }
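The names and values passed to `Init` are matched by position, so the two vectors have to stay the same length and in the same order. One way to make that harder to get wrong is to keep the settings in a single ordered map and split it just before the call. The sketch below is plain Java and does not touch Tesseract; `InitParams` is a hypothetical helper, not part of the bindings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that keeps parameter names and values index-aligned.
final class InitParams {
    final List<String> names = new ArrayList<>();
    final List<String> values = new ArrayList<>();

    // Splits an ordered map into two lists with matching positions.
    static InitParams from(Map<String, String> params) {
        InitParams result = new InitParams();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            result.names.add(entry.getKey());
            result.values.add(entry.getValue());
        }
        return result;
    }

    // The dictionary switches from the snippet above, all set to "0".
    static Map<String, String> dictionariesOff() {
        String[] dawgs = {
            "load_system_dawg", "load_freq_dawg", "load_punc_dawg",
            "load_number_dawg", "load_unambig_dawg", "load_bigram_dawg",
            "load_fixed_length_dawgs"
        };
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        for (String name : dawgs) {
            params.put(name, "0");
        }
        return params;
    }

    public static void main(String[] args) {
        InitParams p = from(dictionariesOff());
        for (int i = 0; i < p.names.size(); i++) {
            System.out.println(p.names.get(i) + " = " + p.values.get(i));
        }
    }
}
```

Each `names.get(i)` and `values.get(i)` pair would then be added to `pars` and `parsValues` with `addPut(new STRING(...))`, exactly as in the snippet above.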
From version 3.01, you can use several instances of Tesseract in a multi-threaded environment. Each instance can be used by only one thread at a time, so it is best to use an object pool in a multi-threaded environment.
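A minimal sketch of such a pool, built on a blocking queue from the standard library. It is generic so that it can be shown without the native bindings; `EnginePool` and its method names are made up for this example, and in a real application `T` would be an initialized `TessBaseAPI`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical fixed-size pool: each pooled instance is handed to one thread at a time.
final class EnginePool<T> {
    private final BlockingQueue<T> idle;

    EnginePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // create all instances up front
        }
    }

    // Borrows an instance, runs the task with it, and always gives it back.
    <R> R with(Function<T, R> task) {
        T engine;
        try {
            engine = idle.take(); // blocks until an instance is free
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for an engine", e);
        }
        try {
            return task.apply(engine);
        } finally {
            idle.add(engine); // there is always room: we just took one out
        }
    }

    public static void main(String[] args) {
        // StringBuilder stands in for the OCR engine in this demo.
        EnginePool<StringBuilder> pool = new EnginePool<>(2, StringBuilder::new);
        System.out.println(pool.with(sb -> sb.append("ok").toString()));
    }
}
```

The pool size caps how many recognitions run at once, which also bounds native memory use. This sketch never calls `End()` on the pooled instances; a real pool would need a shutdown step for that.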
Tesseract-OCR is a great tool for character recognition. In this post I showed how to use it from Java and how to successfully extract data from invoice documents.