Configuring Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is a method of converting images of text into a character-based format that can be used in computer-based processing and analysis. Voyager's OCR functionality processes image-based text in index records from PDF, TIF, PNG, BMP, JPG and GIF files.

This article describes how to implement a script that runs OCR during the last step of the Indexing Pipeline.  

Prerequisites

The following modules and external components are required to run this script. Install the specific versions and builds listed here, following these instructions, for PDF and image OCR to work as expected.

Python 2.7.8 for Windows 32-bit

Python 2.7.8 should have been installed with ArcGIS Desktop (10.3.1 or earlier). Under the assumption that Voyager is co-installed with ArcGIS Desktop, these scripts have been designed to work with this version of Python.

  • To confirm your version and architecture, run python from your command prompt. The startup banner should report Python 2.7.8 and a 32-bit build.

  • We recommend that you add your 32-bit Python path to your System PATH environment variable, and that you also set it as the (initial) value of your System PYTHONPATH environment variable.
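
As a cross-check on the interpreter that your PATH resolves to, a short script can report the version and architecture directly. This is a minimal sketch using only the standard library; the function name describe_interpreter is illustrative and is not part of the Voyager scripts.

```python
import platform
import struct
import sys


def describe_interpreter():
    """Return (version string, pointer width in bits) for the running Python."""
    version = platform.python_version()   # for example "2.7.8"
    bits = struct.calcsize("P") * 8        # 32 on a 32-bit build, 64 on a 64-bit build
    return version, bits


version, bits = describe_interpreter()
print("Python %s, %d-bit, at %s" % (version, bits, sys.executable))
if (version, bits) != ("2.7.8", 32):
    print("Warning: this article assumes Python 2.7.8 32-bit")
```

Run it with the same python command that the pipeline will use, so that the reported path matches the interpreter you configured on PATH.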

Python Image Library (PIL) 1.1.7 for Windows Python 2.7 32-bit

  • Download the PIL installer from here

  • Double-click the executable to install into the Python 2.7 location (above); all installation defaults are acceptable

Tesseract OCR

  • Download the Tesseract OCR libraries from here

  • Double-click the executable to run the installer; all installation defaults are acceptable

  • On some machines (for example, Windows Server 2012 R2), you will need to add the Tesseract install folder to your PATH System variable and create a TESSDATA_PREFIX System variable set to the location of your Tesseract-OCR install

PyTesser (Python Bindings for Tesseract OCR)

  • Download the PyTesser libraries from here

  • Unpack the contents of this file to a folder called pytesser_v0.0.1 and copy this folder to <pythoninstall>\Lib\site-packages

  • Add the full path to this folder to your PYTHONPATH

GhostScript

  • Download GhostScript from here

  • Double-click the executable to run the installer; all installation defaults are acceptable

ImageMagick

  • Download the ImageMagick installers from here

  • Double-click the executable to run the installer; all installation defaults are acceptable

  • Set the MAGICK_HOME System environment variable to the full path of the ImageMagick install folder.

  • NOTE: Some issues were encountered on minimal builds of Windows Server 2012 R2 that did not include legacy (pre-2013) 32-bit (x86) Visual C++ Redistributable packages. Ensure that these packages appear in the list of installed programs.

  • All versions of the Visual C++ Redistributable packages can be downloaded from Microsoft’s website.

PythonMagick (Python Bindings for ImageMagick)

  • The easiest way to install PythonMagick is from a WHL (wheel) file using pip. If pip is not installed (it is not installed by default with Python 2.7.8), install it by downloading get-pip.py from here and running the command > python get-pip.py

  • Download the PythonMagick .WHL from here.

  • Install PythonMagick by running the command > pip install PythonMagick-0.9.10-cp27-none-win32.whl 

PyPDF2

  • Install PyPDF2 with pip … > <python>\Scripts\pip.exe install pypdf2
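
Once the Python libraries are installed, a quick import check can confirm that the interpreter finds them. The sketch below is illustrative only: the helper find_missing_modules is hypothetical, and the import names in REQUIRED_MODULES are assumptions about how each package exposes itself (for example, PIL 1.1.7 as a top-level Image module and PyTesser as pytesser), so adjust them to match your installation.

```python
import importlib

# Assumed import names for the packages in this article; verify against your install.
REQUIRED_MODULES = ["Image", "pytesser", "PythonMagick", "PyPDF2"]


def find_missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Could not import: " + ", ".join(missing))
else:
    print("All required modules were imported")
```

A module that is reported missing usually points to a PYTHONPATH entry that was not added, or to a package installed into a different Python than the one being run.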

C:\Temp\

  • Make sure the directory C:\Temp\ exists. OCR on PDFs writes its work-in-progress files to this directory
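
The environment settings described above can also be checked from Python before running a test. This is a rough sketch, not part of the Voyager package: check_environment is a hypothetical helper, and it treats TESSDATA_PREFIX as required even though the steps above call for it only on some machines.

```python
import os


def check_environment(env, temp_dir):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for var in ("TESSDATA_PREFIX", "MAGICK_HOME"):
        value = env.get(var)
        if not value:
            problems.append("%s is not set" % var)
        elif not os.path.isdir(value):
            problems.append("%s points to a missing folder: %s" % (var, value))
    if not os.path.isdir(temp_dir):
        problems.append("working directory does not exist: %s" % temp_dir)
    return problems


# Check the current process environment and the C:\Temp working directory.
for problem in check_environment(os.environ, "C:\\Temp"):
    print(problem)
```

Passing the environment in as an argument keeps the check easy to try against a plain dictionary before pointing it at os.environ.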

Testing the Script

After you complete the steps above, all of the software prerequisites should be installed. Independent of a Voyager install, you can test that the components are in place and working as expected by running the script test_ocr_last_step.py against each of the ocrtest.pdf and ocrtest.png files (contact Voyager Support and Professional Services for information about these files and the Python step).

  • Whether you run the script from the command line or from an IDE such as PyScripter, the text extracted from the images is printed to the console.

  • PDF from command-line:

  • PNG from PyScripter:

Installation & Basic Use

  • Copy the script <package>\Scripts\ocr_last_step.py to your <voyager>\app\py\pipeline\steps\ folder

  • Define a location that contains PDFs and/or images with text that is human-readable but not machine-readable (that is, text you cannot select, copy and paste on your PC)

  • On the Location’s Pipeline settings tab, uncheck Use Default Pipeline Configuration. Select and Add ocr_last_step as the last custom Python step in the pipeline, then click Save

  • From the main Location List, select this entry, Clear its current index (if it has already been built), and Rebuild its index entries

  • Monitor the process through the Discovery > Status page until it is complete, and then confirm that the PDFs have been processed with OCR in your Voyager Search results

  • Leverage the OCR text in your queries

Advanced Configuration

The following arguments are available when processing images with OCR.  Note that at least one argument must be included in the Pipeline step.

  • rotationDirection (default -1 [counter-clockwise])
    Specifies the direction that Voyager rotates pages in the document if OCR at the initial / previous rotation is unsuccessful

  • rotationIncrement (default 90 [degrees])
    Specifies how far Voyager rotates pages in the document if OCR at the initial / previous rotation is unsuccessful

  • maxPagesToProcess (default 6)
    Specifies the maximum number of pages Voyager processes per document. This avoids resource-intensive OCR on every page of a long document when the early pages already contain enough text

  • tokenizeResults (default True)
    Removes duplicate terms from the resultant text. This makes the OCR result searchable but not necessarily readable. To retain the document’s full prose, set this to False

  • ocrScoreThreshold (default 10)
    Voyager scores OCR results against a list of terms as a rough gauge of quality: the score is the number of terms in the list that appear in the OCR results. The threshold sets how many terms OCR must find. If the score is at or above the threshold, Voyager moves on to the next page or document; if the score is below the threshold, Voyager rotates the page and tries again

  • ocrTermFile (default ‘OCR.Dictionary.txt’)
    By default, OCR results are scored against a list of common words. This parameter allows naming an external list of terms to use for scoring the success of OCR results more specifically. The file is expected to exist alongside the OCR PY script in <voyager_install>\app\py\pipeline\steps\
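
To make the interplay of these arguments concrete, the sketch below models the behaviour described above in plain Python. It is an illustration under stated assumptions, not the code in ocr_last_step.py: the recognize callable is a hypothetical stand-in for the actual OCR call, the stopping rule of one full turn is assumed, and keeping the best-scoring text when no rotation reaches the threshold is also an assumption.

```python
def score_text(text, terms):
    """Count how many dictionary terms occur among the words of the OCR text."""
    words = set(text.lower().split())
    return sum(1 for term in terms if term.lower() in words)


def tokenize(text):
    """Drop repeated terms, keeping the first occurrence of each in order."""
    seen = set()
    unique = []
    for word in text.split():
        key = word.lower()
        if key not in seen:
            seen.add(key)
            unique.append(word)
    return " ".join(unique)


def ocr_document(pages, recognize, terms, rotationDirection=-1, rotationIncrement=90,
                 maxPagesToProcess=6, tokenizeResults=True, ocrScoreThreshold=10):
    """Run the hypothetical recognize(page, angle) over the first pages of a document."""
    # Assumption: stop after one full turn if no rotation reaches the threshold.
    steps = 360 // abs(rotationIncrement) if rotationIncrement else 1
    results = []
    for page in pages[:maxPagesToProcess]:
        best_text, best_score = "", -1
        for step in range(max(1, steps)):
            angle = step * rotationDirection * rotationIncrement
            text = recognize(page, angle)
            score = score_text(text, terms)
            if score > best_score:
                best_text, best_score = text, score
            if score >= ocrScoreThreshold:
                break  # good enough: move on to the next page
        results.append(best_text)
    combined = " ".join(results)
    return tokenize(combined) if tokenizeResults else combined


def fake_recognize(page, angle):
    # Hypothetical OCR stand-in: the page only reads correctly at -90 degrees.
    return page if angle == -90 else "zzz"


terms = ["the", "map", "of"]
print(ocr_document(["the map of the city"], fake_recognize, terms, ocrScoreThreshold=3))
# prints: the map of city
```

In this example the first attempt at 0 degrees scores 0, so the page is rotated once counter-clockwise by 90 degrees; the second attempt matches all three terms and meets the threshold, and tokenizing then removes the repeated "the".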