Sunday, August 2, 2009

OCR: Converting Images to Text with MODI

Joe Schmoe from Kokomo has a scanned image of a 300-page contract. Joe wishes he could search this file for certain rates and terms, but it's an image, not a text file. OCR might be just what the doctor ordered.

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text. --Wikipedia

One such OCR solution that you may already have available to you is Microsoft Office Document Imaging (MODI), part of the Microsoft Office suite. Let's look at how you can use Ruby and the MODI API to automate the conversion of a scanned document into text.

Installing MODI


MODI might not have been installed when you installed Microsoft Office, so your first step may be to install it from the Office install disks. If it is installed, you will probably find a "Microsoft Office Document Imaging" icon in the Windows Start menu, under Programs, in the "Microsoft Office Tools" folder. If it's not there, open the Add or Remove Programs control panel, select your Microsoft Office installation, and choose the option to add features. Then follow the necessary steps, which may vary depending on your version of Windows and Office.

Accessing the MODI API

To begin with, we'll use the win32ole module to create a new instance of the MODI.Document object:


require 'win32ole'
doc = WIN32OLE.new('MODI.Document')

Loading the Image

The next step is to call the document object's Create() method, passing it the name of the .TIF file to load:

doc.Create('C:\images\image1.tif')

NOTE: MODI only reads TIFF files (and its own MDI format). If your image is in another format (.JPG or .PNG, for example), you can use an image editor (such as Paint.NET or Photoshop) or a code library (such as RMagick) to convert it to TIFF format.
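
Since Create() expects a TIFF, a quick look at the file extension can catch the wrong format before MODI does. Here's a minimal sketch; tiff_file? is a hypothetical helper of my own, not part of MODI or win32ole, and it checks the extension only, not the file contents:

```ruby
# Hypothetical helper: true when the file name ends in .tif or .tiff.
# It looks at the extension only and does not inspect the file itself.
def tiff_file?(path)
  %w[.tif .tiff].include?(File.extname(path).downcase)
end

path = 'C:\images\image1.tif'
if tiff_file?(path)
  puts "#{path} looks like a TIFF; ready for doc.Create"
else
  puts "#{path} needs converting to TIFF first"
end
```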

Performing the OCR

The OCR() method performs the optical character recognition on the document. The method can be called without parameters...

doc.OCR()

...or with any of three optional parameters:

doc.OCR( { 'LangId' => 9,
           'OCROrientImage' => true,
           'OCRStraightenImage' => true } )

LangId: An integer representing the language of the document. English = 9, French = 12, German = 7, Italian = 16, Spanish = 10. This value defaults to the user's regional settings.

OCROrientImage: This boolean value specifies whether the OCR engine attempts to determine the orientation (portrait versus landscape) of the page. The value defaults to true.

OCRStraightenImage: This boolean value specifies whether the OCR engine attempts to "deskew" the image to correct minor misalignments. The value defaults to true.

You may find that tweaking these parameters from their default values produces better results, depending on the individual image(s) you're working with.
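
If you switch languages often, it may be handier to build that hash from a language name than to remember the numbers. The sketch below uses only the five IDs listed above; MODI_LANG_IDS and ocr_options are hypothetical names of my own, not part of the MODI API:

```ruby
# Language IDs as listed in this article.
MODI_LANG_IDS = {
  english: 9, french: 12, german: 7, italian: 16, spanish: 10
}.freeze

# Hypothetical helper: builds the named-argument hash for doc.OCR.
# Both boolean options default to true, matching MODI's own defaults.
def ocr_options(language:, orient: true, straighten: true)
  lang_id = MODI_LANG_IDS.fetch(language) do
    raise ArgumentError, "unknown language: #{language.inspect}"
  end
  { 'LangId' => lang_id,
    'OCROrientImage' => orient,
    'OCRStraightenImage' => straighten }
end

options = ocr_options(language: :french, straighten: false)
# With a loaded MODI document, you would then call: doc.OCR(options)
```

Raising on an unknown language is a design choice; you could just as well leave out 'LangId' and let MODI use its default.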

Getting the Text

Naturally, you'll want to get your hands on the text produced by the OCR process. Each page of the Document is represented by an Image object. Each Image object contains a Layout object, and that Layout object's Text property holds the text for that image/page. So the hierarchy looks like this:

Document
  => Images
    => Image
      => Layout
        => Text

To collect the entire text, simply iterate over the Document.Images collection and grab the Layout.Text values. For example:

File.open('my_text.txt', 'w') do |f|
  # Write each page's text, padded with blank lines
  for image in doc.Images
    f.puts("\n" + image.Layout.Text + "\n")
  end
end
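
If you'd rather have the text in a string than in a file, the same walk over Images can be wrapped in a method. This is a sketch only: extract_text is a hypothetical helper, and the Fake* structs are stand-ins so the example runs without MODI installed. It sticks to each, which WIN32OLE collections support, rather than assuming map is available:

```ruby
# Hypothetical helper: returns the text of every page as one string,
# with pages separated by a blank line.
def extract_text(doc)
  pages = []
  doc.Images.each { |image| pages << image.Layout.Text }
  pages.join("\n\n")
end

# Stand-ins for the MODI objects, for illustration only.
FakeLayout = Struct.new(:Text)
FakeImage  = Struct.new(:Layout)
FakeDoc    = Struct.new(:Images)

fake = FakeDoc.new([FakeImage.new(FakeLayout.new('page one')),
                    FakeImage.new(FakeLayout.new('page two'))])
puts extract_text(fake)
```

With a real document, you would pass in the doc object after calling doc.OCR.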

Text, But Not Formatting

No OCR process can guarantee 100% accuracy, but I've found that MODI does a pretty good job recognizing text. Results will vary, of course, depending on the quality of the TIFF image. Note, however, that it cannot preserve formatting of tabular data. So while the text in a series of columns may be produced with a high degree of accuracy, that text will probably be produced with one value per line. So...

apple orange pear

...comes out as...

apple
orange
pear
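
If you know how many columns the original table had, you may be able to stitch the rows back together yourself. The sketch below assumes the values come out in row order and that no cell was dropped or split by the OCR, neither of which is guaranteed; rebuild_rows is a hypothetical helper of my own:

```ruby
# Hypothetical helper: rebuilds rows from one-value-per-line OCR output.
# Assumes the column count is known and values arrive in row order.
def rebuild_rows(text, columns)
  values = text.lines.map(&:strip).reject(&:empty?)
  values.each_slice(columns).map { |row| row.join("\t") }
end

ocr_text = "apple\norange\npear\nkiwi\nplum\nfig\n"
puts rebuild_rows(ocr_text, 3)
```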

Paragraphs of text have, in my experience, been produced with the proper line feeds. Play around with it and see if it meets your needs.

That concludes our show for today. Thanks for tuning in!

6 comments:

Anonymous said...

As a newbie Ruby developer, your code samples are soo much fun to implement. I just keep flooring myself with how easy Ruby is. Thanks! -Vicki

Matt Sidesinger said...

If you get the following error:

C:\>ruby test.rb
test.rb:9:in `method_missing': OCR (WIN32OLERuntimeError)
OLE error code:C6C81091 in Unknown
OCR running error
HRESULT error code:0x80020009
Exception occurred. from test.rb:9

Try the following (from http://support.microsoft.com/kb/918215/en-us)

1. Click Start, click Run, type sysdm.cpl, and then click OK.
2. On the Advanced tab, under Performance, click Settings.
3. On the Data Execution Prevention tab, use one of the following procedures:
4. Click Turn on DEP for essential Windows programs and services only to select the OptIn policy.
5. Click Turn on DEP for all programs and services except those I select to select the OptOut policy, and then click Add to add the programs that you do not want to use the DEP feature.
6. Click OK two times.

I added MSPVIEW.EXE and MSPSCAN.EXE to the exception list. You will have to restart your computer.

I found this in a forum, courtesy of "Jeff."

--Matt Sidesinger

Mariovsky said...

is there any possibility to use it on Office 2007?
I couldn't find it.

Thanks.

Anonymous said...

Office 2007 supports it but in 2010 the MODI component is deprecated, not sure why, there is some info on wikipedia.

Bruno Pio said...

I get the following error (windows 7, office 2007):

C:\>ruby test.rb
test.rb:9:in `method_missing': OCR (WIN32OLERuntimeError)
OLE error code:C6C81091 in Unknown
OCR running error
HRESULT error code:0x80020009
Exception occurred. from test.rb:9

Try the following (from http://support.microsoft.com/kb/918215/en-us)

1. Click Start, click Run, type sysdm.cpl, and then click OK.
2. On the Advanced tab, under Performance, click Settings.
3. On the Data Execution Prevention tab, use one of the following procedures:
4. Click Turn on DEP for essential Windows programs and services only to select the OptIn policy.
5. Click Turn on DEP for all programs and services except those I select to select the OptOut policy, and then click Add to add the programs that you do not want to use the DEP feature.
6. Click OK two times.

I added MSPVIEW.EXE and MSPSCAN.EXE to the exception list. I restart my computer, but I still have the error...

Anonymous said...

Just use Try and Catch, I tried everything and only this solved the problem