How to convert an image to text on CentOS Linux

This tutorial will describe how to convert an image to text on CentOS using Tesseract.

Tesseract is an OCR (Optical Character Recognition) engine that was originally developed at HP between 1985 and 1994 and released as open source in 2005. It is considered to be one of the most accurate freely available OCR engines.

Unfortunately, Tesseract on Linux is primarily tested on Ubuntu, and there are no Tesseract 3.xx RPMs available for CentOS. This tutorial therefore uses Tesseract 2.04. (You can compile 3.xx manually; if you are interested, please leave a comment and I will upload a how-to for this.)

Tesseract 2.04 can only read text from .tif and .bmp files. ImageMagick can convert almost any image format to .bmp or .tif, so this tutorial installs ImageMagick as well.

Prerequisites

Additional yum repository: RPMForge (How to enable RPMForge on CentOS)

A. Installation

Install the required packages for Tesseract and ImageMagick:

yum install tesseract tesseract-en ImageMagick
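Before moving on, it can help to confirm that the commands used in the rest of this tutorial are actually on your PATH. The following is a minimal sketch; the function name missing_tools is my own and not part of any package:

```shell
# Print the name of each listed command that cannot be found on PATH.
# missing_tools is a hypothetical helper name used only for this example.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Report anything this tutorial needs that is still missing.
missing=$(missing_tools tesseract convert wget)
if [ -n "$missing" ]; then
    echo "Not installed or not on PATH:" $missing
else
    echo "tesseract, convert and wget are all available"
fi
```

If anything is reported as missing, re-check that the RPMForge repository is enabled and run the yum command again.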

B. Testing

1. Download an image that you want to extract the text from. In this example, we’ll use CentOS Blog’s own logo:

cd /tmp
wget http://www.centosblog.com/wp-content/themes/techblog/images/logo.png

2. Convert the image to a format that Tesseract can read (.tif or .bmp) using ImageMagick:

convert logo.png logo.tif
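If you have several images to convert, the same convert call can be wrapped in a small loop. This is only a sketch; tif_name and to_tif are names I made up for the example, and it assumes ImageMagick's convert is on your PATH:

```shell
# Derive the .tif file name for an image, e.g. logo.png -> logo.tif
# (tif_name and to_tif are hypothetical helper names.)
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Convert every image passed as an argument to a .tif next to the original.
# Stops at the first image that ImageMagick fails to convert.
to_tif() {
    for img in "$@"; do
        convert "$img" "$(tif_name "$img")" || return 1
    done
}
```

For example, to_tif logo.png scan.jpg would write logo.tif and scan.tif to the same directory as the originals.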

3. Use Tesseract to convert the image to text. Tesseract takes the image file, followed by a base name for the file it will write the output to:

tesseract logo.tif output

4. Check the result (note that Tesseract automatically appends .txt to the output name):

# cat output.txt
@ CentOS Blog

That’s a pretty good result! However, please note that OCR engines are not always 100% accurate. Accuracy depends on a number of variables, such as colour, noise (i.e. non-text elements such as graphics), language and font.
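Since colour and image size are among those variables, one thing worth trying when the result is poor is to clean the image up with ImageMagick before running Tesseract. The sketch below converts to greyscale and doubles the image size. Both settings are assumptions to experiment with rather than tuned values, and preprocess_for_ocr is a name I made up for the example:

```shell
# Write a greyscale, 2x upscaled copy of an image for OCR.
# Usage: preprocess_for_ocr INPUT_IMAGE OUTPUT_TIF
# The 200% factor is a starting point to experiment with, not a recommendation.
preprocess_for_ocr() {
    convert "$1" -colorspace Gray -resize 200% "$2"
}
```

You could then run preprocess_for_ocr logo.png logo-clean.tif followed by tesseract logo-clean.tif output, and compare the text with what the unprocessed image gave you.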

I’ve created a script that takes an image URL and an output file. It retrieves the image, converts it to a format that Tesseract can read, and then converts the image to a text file:

https://github.com/centosblog/image2text/blob/master/image2text.sh
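To give a rough idea of what chaining the three steps looks like, here is a short sketch. It is not the linked script, only my own illustration of the same steps from this tutorial; the function name url_to_text is hypothetical, and it assumes wget, convert and tesseract are on your PATH:

```shell
# Hypothetical sketch (not the linked image2text.sh): fetch an image URL,
# convert it to .tif, and let Tesseract write the text to OUTPUT_BASENAME.txt.
# Usage: url_to_text IMAGE_URL OUTPUT_BASENAME
url_to_text() (
    if [ "$#" -ne 2 ]; then
        echo "usage: url_to_text IMAGE_URL OUTPUT_BASENAME" >&2
        exit 2
    fi
    # Work in a temporary directory so downloads do not clutter /tmp.
    workdir=$(mktemp -d) || exit 1
    wget -q -O "$workdir/image" "$1" &&
        convert "$workdir/image" "$workdir/image.tif" &&
        tesseract "$workdir/image.tif" "$2" >/dev/null 2>&1
    status=$?
    rm -rf "$workdir"
    exit "$status"
)
```

For example, url_to_text http://www.centosblog.com/wp-content/themes/techblog/images/logo.png /tmp/output should leave the recognised text in /tmp/output.txt, because Tesseract appends .txt to the name it is given.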

About Author: Curtis K

Hi! My name is Curtis, and I am the creator of CentOS Blog. Please feel free to comment any suggestions, feedback or questions on my posts!

  • http://michaels-musings.com/ Michael

    Seems like Tesseract is no longer in RPMForge????

    [root@localhost ~]# yum install tesseract tesseract-en
    {snip}
    rpmforge | 1.9 kB 00:00
    rpmforge/primary_db | 2.6 MB 01:54
    {snip}
    Setting up Install Process
    No package tesseract available.
    No package tesseract-en available.
    Error: Nothing to do

    • emb3dd3d

      Yes tesseract 2.04 appears to be available in the CentOS5 repo..

      Excluding Packages from CentOS-5 – Addons
      Finished
      Excluding Packages from CentOS-5 – Base
      Finished
      Excluding Packages from CentOS-5 – Extras
      Finished
      Excluding Packages from CentOS-5 – Updates

      Finished
      Installed Packages
      Name : tesseract
      Arch : x86_64
      Version : 2.04
      Release : 1.el5.rf
      Size : 2.8 M
      Repo : installed
      Summary : Raw OCR Engine
      URL : http://code.google.com/p/tesseract-ocr/

  • emb3dd3d

    Greetings.. if you still monitor this site, I am interested in the compile procedures for v3.x for CentOS 5.x. I am going to attempt it from instructions found at another location, but it helps to have a couple of references. I see there are RPMs for CentOS 6.x, now if I could force upgrades to the dev environment :|

    Thanx in advance.