rOpenSci | Tesseract Update: Options and Languages

Tesseract Update: Options and Languages

A few weeks ago we announced the first release of the tesseract package: a high quality OCR engine in R. We have now released an update with extra features.

Installing Training Data

As explained in the first post, the tesseract system is powered by language specific training data. By default only English training data is installed. Version 1.3 adds utilities to make it easier to install additional training data.

# Download French training data
tesseract_download("fra")

Note that this function is not needed on Linux. Here you should install training data via your system package manager instead. For example on Debian/Ubuntu:

sudo apt-get install tesseract-ocr-fra

And on Fedora/CentOS you use:

sudo yum install tesseract-langpack-fra

Use tesseract_info() to see which training data are currently installed.

OCR Engine Parameters

Tesseract supports many parameters to fine tune the OCR engine. For example you can limit the possible characters that can be recognized.

engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
ocr("image.png", engine = engine)

In the example above, Tesseract will only consider numeric characters. If you know in advance the data is numeric (for example an accounting spreadsheet) such options can tremendously improve the accuracy.

Magick Images

Tesseract now automatically recognizes images from the awesome magick package (our R wrapper to ImageMagick). This can be useful to preprocess images before feeding to tesseract.

library(magick)
library(tesseract)
image <- image_read("http://jeroen.github.io/files/dog_hq.png")
image <- image_crop(image, "1700x100+50+150")
cat(ocr(image))

We plan to more integration between Magick and Tesseract in future versions.