Optical character recognition (OCR) is the process of extracting text from an image, such as a scanned document. It is not an easy problem, and it is a field of research in pattern recognition, artificial intelligence and computer vision.
Generally, an OCR engine works as follows:
- Pre-process the image data, for example: convert to gray scale, smooth, de-skew, filter.
- Detect lines, words and characters.
- Produce a ranked list of candidate characters based on a trained data set.
- Post-process the recognized characters: choose the best candidates based on the confidence from the previous step and on language data (a dictionary, grammar rules, etc.).
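To make the first step more concrete, here is a minimal sketch of two of the pre-processing operations listed above (gray-scale conversion and smoothing) in plain Java. The class and method names are hypothetical, and an OCR engine such as Tesseract does its own pre-processing internally, so treat this as an illustration of the idea rather than what the engine actually runs.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

/** Hypothetical helper for the pre-processing step; not part of any OCR library. */
final class Preprocess {

    /** Converts an RGB image to 8-bit gray scale using the ITU-R BT.601 luma weights. */
    static BufferedImage toGray(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage gray = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster out = gray.getRaster();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                out.setSample(x, y, 0, (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b));
            }
        }
        return gray;
    }

    /**
     * Smooths a single-band gray-scale image with a 3x3 box filter.
     * Pixels on the border reuse their nearest neighbour inside the image.
     */
    static BufferedImage smooth(BufferedImage gray) {
        int w = gray.getWidth();
        int h = gray.getHeight();
        WritableRaster in = gray.getRaster();
        BufferedImage result = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster out = result.getRaster();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int sum = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int nx = Math.min(w - 1, Math.max(0, x + dx));
                        int ny = Math.min(h - 1, Math.max(0, y + dy));
                        sum += in.getSample(nx, ny, 0);
                    }
                }
                out.setSample(x, y, 0, sum / 9); // integer average of the 9 samples
            }
        }
        return result;
    }
}
```

A typical call would be `Preprocess.smooth(Preprocess.toGray(scan))`, where `scan` is a `BufferedImage` loaded with `ImageIO.read`.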
There are a couple of open source OCR engines. The most popular is Tesseract-OCR. Its main advantages are high character-recognition accuracy and ready-made trained data sets for 39 languages. You could train the OCR engine yourself, but that is a rather difficult task.
Use case
Companies often want their paper invoices digitized. This comes with some challenges:
- Invoice documents are not standardized.
- Processing a whole document at once can easily confuse OCR engines.
- Documents can have a non-trivial layout and can contain tables and other non-text elements (such as branding).
Fortunately, we can achieve better results by making this process partially manual:
- The user is presented with an image of the scanned invoice.
- The user selects a document fragment and chooses a target field: invoice number, invoice description or total price.
- The selected fragment is converted to an image and sent to the server.
- The server uses tesseract-ocr to process the image fragment and sends the text back to the client.
Choosing a target field has one more advantage: we can further tune the OCR engine based on the type of data to be extracted.
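The fragment-selection step can be sketched in Java as follows. This assumes the selection arrives as a pixel rectangle and that cropping is done with the standard `java.awt` and `javax.imageio` classes; in a browser-based client the crop would happen in the browser instead. `FragmentCropper` and `cropToPng` are hypothetical names used only for illustration.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** Hypothetical helper that cuts the user's selection out of the scanned page. */
final class FragmentCropper {

    /** Returns the selected region of the page encoded as PNG bytes. */
    static byte[] cropToPng(BufferedImage page, Rectangle selection) {
        // Clip the selection to the page so getSubimage never goes out of bounds.
        Rectangle bounds = new Rectangle(0, 0, page.getWidth(), page.getHeight());
        Rectangle clipped = selection.intersection(bounds);
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("Selection lies outside the page");
        }
        BufferedImage fragment =
                page.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            if (!ImageIO.write(fragment, "png", out)) {
                throw new IllegalStateException("No PNG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The resulting bytes can be written to a temporary file and handed to the OCR engine, or sent over the network together with the chosen target field.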
How to use tesseract ocr from Java?
Tesseract-ocr is written in C++. Fortunately, there are also Java bindings.
To use tesseract-ocr in a Maven project, add the following dependencies:
<dependency>
    <groupId>org.bytedeco.javacpp-presets</groupId>
    <artifactId>tesseract</artifactId>
    <version>3.03-rc1-0.11</version>
</dependency>
<dependency>
    <groupId>org.bytedeco.javacpp-presets</groupId>
    <artifactId>leptonica</artifactId>
    <version>1.72-1.0</version>
</dependency>
You can choose from several prepackaged native libraries. For example, on Windows:
<dependency>
    <groupId>org.bytedeco.javacpp-presets</groupId>
    <artifactId>tesseract</artifactId>
    <version>3.03-rc1-0.11</version>
    <classifier>windows-x86_64</classifier>
</dependency>
<dependency>
    <groupId>org.bytedeco.javacpp-presets</groupId>
    <artifactId>leptonica</artifactId>
    <version>1.72-1.0</version>
    <classifier>windows-x86_64</classifier>
</dependency>
Put the chosen trained data set in the tessdata directory.
Here is an example of how to use tesseract-ocr from Java:
import java.io.UnsupportedEncodingException;
import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.javacpp.lept;
import org.bytedeco.javacpp.lept.PIX;
import org.bytedeco.javacpp.tesseract.TessBaseAPI;

public String process(String file) {
    TessBaseAPI api = new TessBaseAPI();
    // Expects pol.traineddata in the tessdata directory under the current directory.
    if (api.Init(".", "pol") != 0) {
        throw new RuntimeException("Could not initialize tesseract.");
    }
    PIX image = null;
    BytePointer outText = null;
    try {
        // Load the image with Leptonica and run recognition on it.
        image = lept.pixRead(file);
        api.SetImage(image);
        outText = api.GetUTF8Text();
        String string = outText.getString("UTF-8");
        if (string != null) {
            string = string.trim();
        }
        return string;
    } catch (UnsupportedEncodingException e) {
        throw new RuntimeException("charset", e);
    } finally {
        // Native resources are not garbage collected, so release them explicitly.
        if (outText != null) {
            outText.deallocate();
        }
        if (image != null) {
            lept.pixDestroy(image);
        }
        if (api != null) {
            api.End();
        }
    }
}
The code above initializes Tesseract with pol.traineddata, processes the image at the given file path and returns the recognized text.
Better accuracy with a whitelist of characters
Tesseract is great at recognizing text, but it sometimes gets confused when you want to extract numbers or special identifiers (like invoice numbers).
We can tune Tesseract to better recognize characters based on context by using a character whitelist.
For example, for numbers we can use:
api.SetVariable("tessedit_char_whitelist", "0123456789,");
and for invoice numbers:
api.SetVariable("tessedit_char_whitelist", "0123456789,/ABCDEFGHJKLMNPQRSTUVWXY");
The letter “I” is left out of the whitelist because it is easily confused with the “/” symbol, which is much more common in invoice numbers.
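Since the user already chooses a target field, the whitelist can be picked from that choice. Below is a minimal sketch of such a mapping. The enum, its constant names and the `accepts` sanity check are hypothetical; only the two whitelist strings come from the examples above, and leaving the description field without a whitelist is an assumption.

```java
import java.util.Optional;

/** Hypothetical mapping from the target fields of the use case to character whitelists. */
enum TargetField {
    INVOICE_NUMBER("0123456789,/ABCDEFGHJKLMNPQRSTUVWXY"),
    TOTAL_PRICE("0123456789,"),
    INVOICE_DESCRIPTION(null); // assumption: free text, so no whitelist

    private final String whitelist;

    TargetField(String whitelist) {
        this.whitelist = whitelist;
    }

    /** The whitelist to pass to Tesseract, or empty if the field is unrestricted. */
    Optional<String> whitelist() {
        return Optional.ofNullable(whitelist);
    }

    /** Sanity check on OCR output: true if every character is allowed for this field. */
    boolean accepts(String text) {
        if (whitelist == null) {
            return true;
        }
        return text.chars().allMatch(c -> whitelist.indexOf(c) >= 0);
    }
}

// Usage with the TessBaseAPI instance from the earlier example:
//   field.whitelist().ifPresent(w -> api.SetVariable("tessedit_char_whitelist", w));
```

Keeping the whitelists in one place makes it easy to add a new field type later without touching the recognition code.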
Turn off dictionaries
Tesseract-OCR post-processing picks the recognized characters based on language data. For this reason the character “/” is often recognized as “1” in invoice numbers: no dictionary words contain a slash, and strings of digits rarely contain special characters.
Fortunately, we can turn off dictionary post-processing when initializing Tesseract:
// Dictionary (DAWG) parameters to switch off during initialization.
String[] dawgs = {
    "load_system_dawg", "load_freq_dawg", "load_punc_dawg", "load_number_dawg",
    "load_unambig_dawg", "load_bigram_dawg", "load_fixed_length_dawgs"
};
StringGenericVector pars = new StringGenericVector();
StringGenericVector parsValues = new StringGenericVector();
for (String dawg : dawgs) {
    pars.addPut(new STRING(dawg));
    parsValues.addPut(new STRING("0")); // "0" disables the dictionary
}
TessBaseAPI api = new TessBaseAPI();
// Arguments: data path, language, engine mode (0 = OEM_TESSERACT_ONLY), config files,
// number of config files, parameter names, parameter values, set_only_non_debug_params.
if (api.Init(".", "pol",
        0, (ByteBuffer) null, 0, pars, parsValues, false) != 0) {
    throw new RuntimeException("Could not initialize tesseract.");
}
Multi-threading
Since version 3.01 you can use several instances of Tesseract in a multi-threaded environment, but each instance may be used by only one thread at a time. Because of that, it is best to use an object pool in a multi-threaded environment.
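A minimal sketch of such a pool is shown below. It is generic, so nothing in it is Tesseract-specific; `EnginePool` and `withEngine` are hypothetical names, and a production setup might prefer an existing pooling library. The idea is that a fixed number of engines is created up front, and each task borrows one exclusively and returns it when done.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool: each pooled object is used by one thread at a time. */
final class EnginePool<T> {

    private final BlockingQueue<T> idle;

    /** Creates all pooled objects eagerly using the given factory. */
    EnginePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Borrows an engine, runs the task with it and always returns it to the pool. */
    <R> R withEngine(Function<T, R> task) {
        T engine;
        try {
            engine = idle.take(); // blocks while all engines are in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for an engine", e);
        }
        try {
            return task.apply(engine);
        } finally {
            idle.add(engine);
        }
    }
}
```

With the bindings from this post, the factory would create and initialize a `TessBaseAPI`, and the task would call `SetImage` and `GetUTF8Text` on the borrowed instance. The pool size bounds both memory use and the number of recognitions running in parallel.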
Summary
Tesseract-OCR is a great tool for character recognition. In this post I showed how to use it from Java and how to extract data from invoice documents successfully.