This tutorial will describe how to convert an image to text on CentOS using Tesseract.
Tesseract is an OCR (Optical Character Recognition) engine that was originally developed at HP between 1985 and 1994 and released as open source in 2005. It is considered one of the most accurate freely available OCR engines.
Unfortunately, Tesseract on Linux is primarily tested on Ubuntu, and there are no Tesseract 3.xx RPMs available for CentOS. This tutorial therefore uses Tesseract 2.04. (You can compile 3.xx manually; if you are interested, please leave a comment and I will upload a how-to for this.)
Tesseract 2.04 can only read text from .tif and .bmp files. However, ImageMagick can convert almost any image format to .bmp or .tif, so this tutorial installs ImageMagick as well.
Additional yum repository required: RPMForge (see "How to enable RPMForge on CentOS").
Install the required packages for Tesseract and ImageMagick:
yum install tesseract tesseract-en ImageMagick
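Before moving on, it can be worth confirming that both tools actually ended up on your PATH. The following is only a minimal sketch; the function name missing_tools is hypothetical and not part of Tesseract or ImageMagick.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every given command that is
# not found on the PATH, one per line. Prints nothing if all are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example: tesseract is the OCR engine, convert ships with ImageMagick.
# missing_tools tesseract convert
```

If the example call prints nothing, both commands were found and the rest of the tutorial should work as written.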
1. Download an image that you want to extract the text from. In this example, we’ll use CentOS Blog’s own logo:
cd /tmp
wget http://www.centosblog.com/wp-content/themes/techblog/images/logo.png
2. Convert the image to a format that Tesseract can read (.tif or .bmp) using ImageMagick:
convert logo.png logo.tif
3. Use Tesseract to convert the image to text. Tesseract needs the base name of a file to write its output to:
tesseract logo.tif output
4. Check the result (note that Tesseract automatically appends .txt to the output file name):
# cat output.txt
@ CentOS Blog
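Because Tesseract always writes to a file, steps 3 and 4 can be wrapped in a small function if you would rather see the text directly on the terminal. This is a rough sketch, not a feature of Tesseract itself: the function name ocr_to_stdout and the /tmp/ocr.XXXXXX template are made up for this example, and it assumes mktemp is available on your system.

```shell
#!/bin/sh
# Hypothetical helper: run Tesseract on an image and print the
# recognised text to standard output instead of leaving a file behind.
ocr_to_stdout() {
    # mktemp creates an empty placeholder file; its name is used as the
    # output base, so Tesseract writes the text to "$base.txt".
    base=$(mktemp /tmp/ocr.XXXXXX) || return 1
    tesseract "$1" "$base" >/dev/null 2>&1 && cat "$base.txt"
    status=$?
    # Remove both the placeholder and the generated text file.
    rm -f "$base" "$base.txt"
    return $status
}

# Example:
# ocr_to_stdout logo.tif
```

The function returns a non-zero status if Tesseract fails or produces no output file, which makes it easier to use inside larger scripts.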
That’s a pretty good result! Note, however, that OCR engines are not always 100% accurate. Accuracy depends on a number of variables, such as colour, noise (i.e. non-text elements and graphics), language and font.
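Since colour and image quality affect the result, one thing you can experiment with is cleaning up the image with ImageMagick before handing it to Tesseract. The sketch below converts the image to greyscale and doubles its size while converting it to .tif. The function name ocr_preprocess is hypothetical, and whether these particular settings help is an assumption that depends on your image, so treat them as a starting point only.

```shell
#!/bin/sh
# Hypothetical helper: convert an image to a greyscale, enlarged copy
# before OCR. The 200% factor is an arbitrary starting value, not one
# recommended by Tesseract or ImageMagick.
ocr_preprocess() {
    convert "$1" -colorspace Gray -resize 200% "$2"
}

# Example:
# ocr_preprocess logo.png logo-clean.tif
# tesseract logo-clean.tif output
```

Compare output.txt with and without the preprocessing step to see whether it makes a difference for your images.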
I’ve created a script that takes an image URL and an output file name. It retrieves the image, converts it to a format that Tesseract can read, and then converts the image to a text file: