A friend pointed me to an open source project called OCRopus because I am currently working on a project related to OCR. Commercial OCR solutions ain’t cheap, and you can really burn a hole in your pocket trying to get a good one. It’s not the price of the hardware or the software that is high, but the amount of work needed to make sure the output is correct.
With most OCR solutions, it takes a vast amount of time to train the software to identify characters correctly. Artificial Intelligence can help but not now, not today, not yet.
OCRopus does not recognize the characters itself; it relies on Tesseract as its OCR engine. What OCRopus provides is layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. Sounds really good, doesn’t it?
Most of the project is developed and tested on Ubuntu, but if your platform has binutils and build tools you’re good to go. I believe it is also possible to build it using Microsoft Visual Studio on Windows and, of course, MinGW. I went for the easiest option, Cygwin, since I only had 2 hours to spare and it was already on my system.
I first installed the library header files (libpng-devel, libtiff-devel, libjpeg-devel) and the build tools (gcc, make, g++, autoconf), and then built Tesseract with the usual ./configure && make && make install method. Building OCRopus requires Perforce Jam. Jam is actually Just Another Make. I find it a little funny that I had to build Jam using make. Oh well. OCRopus itself is built with ./configure && jam && jam install and it went pretty well.
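Before starting a build like this, it can save time to confirm that the tools are actually on the PATH. Here is a minimal Python sketch of such a check. The tool names come from the list above; the function name missing_tools and the idea of checking up front are my own additions, not part of the OCRopus or Tesseract build.

```python
import shutil

# Tool names taken from the build steps in this post.
# "jam" is only needed for the OCRopus step, not for Tesseract.
BUILD_TOOLS = ["gcc", "g++", "make", "autoconf", "jam"]


def missing_tools(tools=BUILD_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without
    touching the real PATH (hypothetical helper, for illustration).
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools: " + ", ".join(missing))
    else:
        print("All build tools found, ready for ./configure")
```

This only checks that the executables can be found. It says nothing about whether the header packages (libpng-devel and friends) are installed, so ./configure can still fail on those.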
Before running them, don’t forget to download the language files for your target language, otherwise you will get an error like this: Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
I ran my tests with the standard Lua scripts that come with OCRopus (located in /usr/local/share/ocropus/scripts/) using the command ocroscript.exe rec-tess input_image > output.html
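If you want to repeat that command over several images, a small wrapper helps. The sketch below is an assumption-heavy illustration: the binary name, the script name rec-tess and the tessdata path are copied from this post, while the helper names (build_command, language_file, recognize) are made up for the example. It checks for the language file first so the failure is easier to read than the unicharset error.

```python
import os
import subprocess

# Path taken from the error message quoted in this post; adjust for your install.
TESSDATA_DIR = "/usr/local/share/tessdata"


def build_command(image_path, script="rec-tess", binary="ocroscript.exe"):
    """Build the argument list for: ocroscript.exe rec-tess input_image"""
    return [binary, script, image_path]


def language_file(lang="eng", tessdata_dir=TESSDATA_DIR):
    """Path of the unicharset file Tesseract looks for (hypothetical helper)."""
    return f"{tessdata_dir}/{lang}.unicharset"


def recognize(image_path, output_path, lang="eng"):
    """Run the OCR script and write its standard output to `output_path`."""
    if not os.path.exists(language_file(lang)):
        raise FileNotFoundError("language file not found: " + language_file(lang))
    # Equivalent to the shell redirect "> output.html".
    with open(output_path, "wb") as out:
        subprocess.run(build_command(image_path), stdout=out, check=True)
```

Calling recognize("input_image.jpg", "output.html") should then behave like the one-liner above, assuming ocroscript.exe is on the PATH.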
I created a 10-line Word document with different fonts and printed it to a PDF. Using Adobe Photoshop, I saved it as a JPG image. Then I gradually resized the image down to the smallest size that still gave me some output.
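I did the resizing by hand in Photoshop, but the idea of shrinking step by step can be sketched in a few lines of Python. This is only an illustration: the function name shrink_schedule, the scaling factor and the minimum width are all assumptions, not the sizes I actually used. It just computes the list of sizes to try; the resizing itself would still be done with whatever image tool you prefer.

```python
def shrink_schedule(width, height, factor=0.9, min_width=100):
    """Return (width, height) pairs, each `factor` times the previous one.

    The list starts at the original size and stops before the width
    drops below `min_width`. All parameters are illustrative defaults.
    """
    if not 0 < factor < 1:
        raise ValueError("factor must be between 0 and 1")
    sizes = []
    w, h = float(width), float(height)
    while round(w) >= min_width:
        sizes.append((round(w), round(h)))
        w *= factor
        h *= factor
    return sizes


if __name__ == "__main__":
    # Halve a 1000x800 image until it would be narrower than 100 pixels.
    for size in shrink_schedule(1000, 800, factor=0.5):
        print(size)
```

Each size in the list would be one test image to feed to the OCR command, and the last one that still produces usable output is the limit.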
To see the tests and results, click on Continue Reading.