How to prepare training files for Tesseract OCR and improve character recognition?

23 June 2016, Bogusław Zaręba

Over the last few years, optical character recognition (OCR) has become very popular. Various OCR engines can help you with the process, but if you want to build your own OCR application you should consider Tesseract. It is a very powerful tool and it is completely free (licensed under the Apache License, Version 2.0). The main advantage of tesseract-ocr is its high character recognition accuracy. Unfortunately, it is poorly documented, so you need to put in quite some effort to make use of all its features.

Tesseract is very good at recognizing multiple languages and fonts. It can be used as a command-line program or as a library embedded in a custom application. We used it to develop an application that automatically reads data from ID cards. It worked well and we did not spend much time on development, but we had problems with the recognition of specific characters (it confused W with H, and O with 0 (zero)). So we had to train Tesseract to read this font properly.

Looking for a way to do this, I came across a couple of articles that suggested using third-party GUI applications, but I ran into many problems customizing them and still did not reach my goal. Luckily, I found this great article by Cédric Verstraeten, which helped me do it the old-fashioned command-line way. Unfortunately, it is a little outdated and leaves out some details. In this article I will try to explain the process step by step.

What do we need before we begin?

First, you need to install tesseract-ocr (this tutorial is based on version 3.02). Do not forget to add the installation directory to your system path (the installer may not do it). You also need these applications:

  • Cygwin – if you are using Windows (or you can rewrite the scripts from this article to Windows Batch)
  • Qt-box-editor – the only GUI program you are going to need. We use it to fix the boxes generated by Tesseract and to make sure we feed the right data into it.
    • for Windows (I used version 1.08; for some reason the newer versions are not packaged with all the required libraries, which makes the installation more difficult)
    • or for Unix (sources)

Let’s move on

First, you must prepare the data you want to feed into Tesseract. You need one or more files that together contain at least one occurrence (preferably more) of each glyph of your font. I decided that, to achieve the best accuracy, I should train Tesseract with images preprocessed in exactly the same way as they would be in the final application. In my case the font was OCR-B, the font used on ID cards in Poland. One of my files looked like this:

[Image: pol.ocrb.exp0]

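The input files must be named according to the Tesseract convention:

[lang].[fontname].exp[num].[extension]

For example, if you had 3 .png files with English text in Arial font, their names would be:

eng.arial.exp0.png
eng.arial.exp1.png
eng.arial.exp2.png

Or in my case (14 .tif files, Polish, OCR-B):

pol.ocrb.exp0.tif
pol.ocrb.exp1.tif
...
pol.ocrb.exp13.tif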
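Before moving on, it can help to confirm that the whole set is present under the expected names. The following is only a minimal sketch: the helper names expected_names and check_inputs are made up for illustration, and numbering from 0 is assumed because it matches the pol.ocrb.exp0 example.

```shell
# Print the file names expected for a training set, one per line.
# Usage: expected_names LANG FONT COUNT EXT   (hypothetical helper)
expected_names() {
    lang=$1; font=$2; count=$3; ext=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$lang.$font.exp$i.$ext"
        i=$((i + 1))
    done
}

# Report every expected file that is missing from the current directory.
# Usage: check_inputs LANG FONT COUNT EXT   (hypothetical helper)
check_inputs() {
    for f in $(expected_names "$@"); do
        [ -f "$f" ] || echo "missing: $f"
    done
}
```

For the OCR-B set this would be called as check_inputs pol ocrb 14 tif; no output means every file was found.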

Once they are all gathered in one place and named correctly, we need to generate box files for them. These files tell Tesseract where each glyph is located. Just open a bash console (Cygwin on Windows) and launch the script:
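A minimal sketch of such a script, assuming the files follow the naming convention and are numbered from 0. The helper name make_boxes is illustrative only; the tesseract arguments are the input image, the output base name, and the two config files.

```shell
# Generate a .box file for every training image in the set.
# Usage: make_boxes LANG FONT COUNT EXT   (hypothetical helper)
make_boxes() {
    lang=$1; font=$2; count=$3; ext=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        base="$lang.$font.exp$i"
        # input image, output base name, then the config files
        tesseract "$base.$ext" "$base" batch.nochop makebox || return 1
        i=$((i + 1))
    done
}
```

For the OCR-B set the call would be make_boxes pol ocrb 14 tif, which should leave pol.ocrb.exp0.box and its siblings next to the images.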

The first two parameters of the command are the input and output file names (remember to change them accordingly). They are followed by the config files (“batch.nochop” and “makebox”), which tell Tesseract what to do. You can find them all in $TESSERACT_INSTALATION_DIR/tessdata/configs/ and $TESSERACT_INSTALATION_DIR/tessdata/tessconfigs/ (here you can find the list of parameters you can use in the config files). In this case, we are using two of them:

  • makebox – tells Tesseract to (only) generate box files
  • batch.nochop – tells Tesseract not to use its fancy algorithms for segmenting the picture. If your files contain letters in a grid, you should use it, but otherwise you may want to remove it from the command.

Now it’s time for some manual work

Open each file (the image file, not the *.box file you generated) with qt-box-editor and correct any mistakes Tesseract made (if it made none, you probably don’t have to train it 🙂 ).

[Screenshot: qt-box-editor]

Time to train Tesseract to recognize letters properly

Now we are going to generate a *.traineddata file, which can later be loaded into Tesseract so that it recognizes characters the way we want.

There is one more important thing to remember before you go further: if you are using Windows, make sure all the files you are using have UNIX-style line endings! If you are editing them manually, you can do this in Notepad++ under Edit -> EOL Conversion.
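The same conversion can be done from the bash console. This is a minimal sketch with a made-up helper name, to_unix_eol; it simply drops every carriage return, so it should only be run on text files (box files, font_properties, the script itself), never on the images.

```shell
# Rewrite each given text file with UNIX (LF) line endings.
# Removes all carriage-return characters, so do not use it on binary files.
# Usage: to_unix_eol FILE...   (hypothetical helper)
to_unix_eol() {
    for f in "$@"; do
        tr -d '\r' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

A typical call would be to_unix_eol font_properties pol.ocrb.exp0.box.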

This is the script I used. Do not run it yet; read it carefully first. You will need to customize it to meet your needs.

The “wrap function” is nothing special. Just a handy method that repeats a string a given number of times with a different number inside it (run wrap 10 “prefix” “suffix” if you are not sure what it does). The most important part of the script begins after that.
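For reference, a minimal sketch of such a helper. The zero-based numbering and the space-separated output are assumptions chosen to match file names like pol.ocrb.exp0; the function in the original script may differ in detail.

```shell
N=14   # number of training images (exp0 .. exp13 in the OCR-B example)

# Print PREFIX<i>SUFFIX for i = 0 .. COUNT-1 on one line, separated by spaces.
# Usage: wrap COUNT PREFIX SUFFIX
wrap() {
    count=$1; prefix=$2; suffix=$3
    i=0
    out=""
    while [ "$i" -lt "$count" ]; do
        # ${out:+ } adds a separating space only when out is not empty
        out="$out${out:+ }$prefix$i$suffix"
        i=$((i + 1))
    done
    echo "$out"
}
```

With this definition, wrap 3 "pol.ocrb.exp" ".box" prints the three box file names for exp0, exp1 and exp2.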

Remove the old output

We need to remove all the files generated by the previous run if we run the script again. This is important because Tesseract sometimes behaves oddly when the output files are already there (is it a bug or a feature?). Remember to change the “pol” part to “eng” or whatever language you are using, here and in every other occurrence you find. The same applies to “ocrb”.
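A minimal sketch of this cleanup step. The list of files is an assumption based on what the later steps of this article produce, and the helper name clean_outputs is made up; adjust both to your setup.

```shell
# Remove the outputs of a previous training run for the given language.
# Images and box files are left untouched.
# Usage: clean_outputs LANG   (hypothetical helper)
clean_outputs() {
    lang=$1
    # intermediate files written by the training tools
    rm -f ./*.tr unicharset inttemp normproto pffmtable shapetable
    # language-prefixed files and the final traineddata
    for name in unicharset inttemp normproto pffmtable shapetable traineddata; do
        rm -f "$lang.$name"
    done
}
```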

Training files

Now it’s time to take the box and image files and combine them into training (*.tr) files.

This time we’re using only the “box.train” config, which tells Tesseract to generate *.tr files.
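A minimal sketch of this step, assuming each image has a corrected box file with the same base name sitting next to it. The helper name make_training_files is illustrative only.

```shell
# Produce a .tr file for every image/box pair in the set.
# Usage: make_training_files LANG FONT COUNT EXT   (hypothetical helper)
make_training_files() {
    lang=$1; font=$2; count=$3; ext=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        base="$lang.$font.exp$i"
        # box.train switches Tesseract into training mode for this image
        tesseract "$base.$ext" "$base" box.train || return 1
        i=$((i + 1))
    done
}
```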

Unicharset

Then we are going to extract the charset from the box files (the command creates a “unicharset” file).

We use our “wrap” function to do it for all the files at once, no matter how many of them we have (just set the $N variable to the right value).
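A minimal sketch of that call. unicharset_extractor is the tool shipped with Tesseract; the wrap helper is repeated here in compact form so the snippet stands alone, and its zero-based numbering is an assumption.

```shell
N=14   # number of training images; set this to match your own set

# Compact stand-in for the wrap helper: prints PREFIX<i>SUFFIX for i = 0 .. COUNT-1.
wrap() {
    i=0
    while [ "$i" -lt "$1" ]; do
        printf '%s%s%s ' "$2" "$i" "$3"
        i=$((i + 1))
    done
}

# Build the "unicharset" file from all box files at once.
extract_unicharset() {
    # The unquoted $(...) is intentional: each box file becomes its own argument.
    unicharset_extractor $(wrap "$N" "pol.ocrb.exp" ".box")
}
```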

Font properties

Next we need to create a font_properties file.

The syntax is as follows:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

Each flag is either 0 or 1. Do not forget to set the values according to the properties of your font.
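A minimal sketch that writes such a file from the script. The helper name write_font_properties is made up, and the example values below are my assumption for OCR-B (monospaced, without serifs); check them against your own font.

```shell
# Write a one-line font_properties file in the current directory.
# Argument order: font name, then the italic, bold, fixed, serif and fraktur flags (0 or 1).
# Usage: write_font_properties FONT ITALIC BOLD FIXED SERIF FRAKTUR   (hypothetical helper)
write_font_properties() {
    echo "$1 $2 $3 $4 $5 $6" > font_properties
}
```

For the OCR-B example the call would be write_font_properties ocrb 0 0 1 0 0. The font name has to match the one used in the training file names.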

The final training

It is time for what everyone has been waiting for:
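A minimal sketch of this step, following the tool sequence from the Tesseract 3.02 training documentation (shapeclustering, mftraining, cntraining). It assumes the *.tr, unicharset and font_properties files are all in the current directory; the helper name run_training is illustrative, and the original script may have invoked the tools differently.

```shell
# Run the Tesseract 3.02 training tools over all .tr files in the current directory.
# Usage: run_training LANG   (hypothetical helper)
run_training() {
    lang=$1
    # writes shapetable
    shapeclustering -F font_properties -U unicharset ./*.tr || return 1
    # writes inttemp, pffmtable and LANG.unicharset
    mftraining -F font_properties -U unicharset -O "$lang.unicharset" ./*.tr || return 1
    # writes normproto
    cntraining ./*.tr
}
```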

Rename the files

Now we have to add the language prefix to the generated files, so that they can be nicely consumed in the last step. This part of the script is not very sophisticated:
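A minimal sketch of the renaming, assuming the training tools left inttemp, normproto, pffmtable and shapetable in the current directory and that the unicharset was already written with its prefix. The helper name add_lang_prefix is made up.

```shell
# Give the generated training files the language prefix, e.g. inttemp -> pol.inttemp.
# Usage: add_lang_prefix LANG   (hypothetical helper)
add_lang_prefix() {
    lang=$1
    for name in inttemp normproto pffmtable shapetable; do
        mv "$name" "$lang.$name" || return 1
    done
}
```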

Combine it all into a traineddata file

And now the last step: take all the files with the pol. (or other) prefix and combine them into pol.traineddata:
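A minimal sketch of this step using combine_tessdata, the tool shipped with Tesseract. The helper name build_traineddata is illustrative only.

```shell
# Combine all LANG.* training files into LANG.traineddata.
# Usage: build_traineddata LANG   (hypothetical helper)
build_traineddata() {
    lang=$1
    # The trailing dot matters: combine_tessdata takes a file-name prefix, not a file.
    combine_tessdata "$lang."
}
```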

Once the file is ready, you can copy it to $TESSERACT_INSTALATION_DIR/tessdata/ and use it from the command line or wherever else you need it (for example, in a new application that uses Tesseract as a library).
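A minimal sketch of installing the file and trying it from the command line. It assumes $TESSERACT_INSTALATION_DIR is set to your installation directory; the helper name install_and_run and the image name in the usage note are placeholders.

```shell
# Copy LANG.traineddata into tessdata and run OCR on one image with it.
# Usage: install_and_run LANG IMAGE OUTPUT_BASE   (hypothetical helper)
install_and_run() {
    lang=$1; image=$2; outbase=$3
    cp "$lang.traineddata" "$TESSERACT_INSTALATION_DIR/tessdata/" || return 1
    # -l selects the language, i.e. which .traineddata file Tesseract loads
    tesseract "$image" "$outbase" -l "$lang"
}
```

For example, install_and_run pol card.tif card would write the recognized text to card.txt.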

Happy OCR-ing!

