OCRopus and Tesseract

A friend pointed me to an open source project called OCRopus because I am currently working on a project related to OCR. Commercial OCR solutions aren't cheap, and you can really burn a hole in your pocket trying to get a good one. It's neither the hardware nor the software that costs the most, but the amount of work needed to make sure the output is correct.

Most OCR solutions need a vast amount of time to train the software to correctly identify characters. Artificial Intelligence can help but not now, not today, not yet.

OCRopus does not recognize the characters itself; it relies on Tesseract, the OCR engine, for that. What OCRopus provides is layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. Sounds really good, doesn't it?

Most of the project is tested and developed on Ubuntu, but if your platform has binutils and build tools you're good to go. I believe it is also possible to build using Microsoft Visual Studio on Windows and, of course, MinGW. I went for the easiest option since I only had 2 hours to spare and I already had Cygwin on my system.

I first installed the library header files (libpng-devel, libtiff-devel, libjpeg-devel) and build tools (gcc, make, g++, autoconf), and then built Tesseract with the usual ./configure && make && make install method. Building OCRopus requires Perforce Jam. Jam is actually Just Another Make. I find it a little funny that I have to build Jam using make. Oh well. OCRopus is built with ./configure && jam && jam install, and it went pretty well.

To run them, don't forget to download the language files for your target language; otherwise it will complain: Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
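A quick check before running anything can save a confusing failure. This is only a sketch: the function name `missing_language_files` is made up for illustration, and the only file it looks for by default is the one named in the error message above. The other files in a language pack are not listed here, so pass them in through `suffixes` if you know them.

```python
import os


def missing_language_files(tessdata_dir, lang="eng", suffixes=("unicharset",)):
    """Return the expected language data files that are not present.

    Only the file from the error message (<lang>.unicharset) is checked
    by default; the rest of a language pack is an assumption left to
    the caller.
    """
    missing = []
    for suffix in suffixes:
        path = os.path.join(tessdata_dir, f"{lang}.{suffix}")
        if not os.path.isfile(path):
            missing.append(path)
    return missing


if __name__ == "__main__":
    # Directory taken from the error message; adjust for your install prefix.
    for path in missing_language_files("/usr/local/share/tessdata"):
        print("missing:", path)
```

An empty result only means the checked files exist, not that the language data is complete or valid.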

I ran my tests with the standard Lua scripts that came with OCRopus (located in /usr/local/share/ocropus/scripts/) with the command ocroscript.exe rec-tess input_image > output.html
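If you want to repeat the run over several images, the same command can be wrapped in a few lines of Python. This is a rough sketch and not how I ran the tests: `ocr_command` and `run_ocr` are hypothetical helper names, and it assumes `ocroscript` is on your PATH (on Cygwin the `.exe` suffix is optional).

```python
import subprocess
import sys


def ocr_command(image_path, script="rec-tess", binary="ocroscript"):
    """Build the argument list for: ocroscript rec-tess input_image"""
    return [binary, script, image_path]


def run_ocr(image_path, output_path, binary="ocroscript"):
    """Run the recognizer and write its standard output to output_path.

    Raises FileNotFoundError if the binary cannot be found.
    """
    with open(output_path, "wb") as out:
        result = subprocess.run(ocr_command(image_path, binary=binary), stdout=out)
    return result.returncode


if __name__ == "__main__":
    # Usage: python run_ocr.py page1.jpg page2.jpg ...
    for image in sys.argv[1:]:
        code = run_ocr(image, image + ".html")
        print(image, "exit code", code)
```

Redirecting stdout to a file mirrors the `> output.html` part of the shell command.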

I created a 10-line Word document with different fonts and printed it to a PDF. Using Adobe Photoshop I saved it as a JPG image. Then I gradually resized the image down to the smallest size that still produced some output.


Test at 100%

Ocropus 100%

The Output

The quick brown fox jumps over the lazy dog — Times New Roman
The quick brown fox jumps over the lazy dog · Calibri
The quick brown fox jumps over the lazy dog – Arial
05132 quirk hrumn tux jumps uber the [agp hug — QBUJ Qingltsh ?IE2xt
{PRIZE THE QUICK Bll0WN FOX JU MPS (WEB THE L1\ZYl)0G – STENCIL
The quick brown Eox jumps over the lazy dog – Cooper Black
The quick brown fox jumps over· the lazy dog – Consolas
The quick brown fox jumps over the lazy dog — Andale Mono
The quick brown fox jumps over the lazy dog — Bookman Old Style
The quick brown fox jumps over the lazy dog — Lucida Bright

Resized 50%

Ocropus 100% * 50%

The Output

The quick brown ibx jumps over the lazy dog » Times New Roman
The quick brown foxlumps over the lazy dog – Calibri
The quick brown fox jumps over the lazy dog – Arial
Chr quul: brulnu fax yumus nun Ihr law Dug — OID Qiuglrsln Crxl
{HG ‘l‘l{lE QUICK lIlIOWN FOX .lUllll’S 0VlElI ‘l‘l{lE IAZY D0!} – STIENCII,
The quick brown fox jumps aver the lazy dag — Camper Black
The quick brown fox jumps over the lazy dog – Consolas
The quick brown fox jumps over the lazy dog – Andale Mono
The quick brown foxjumps over the lazy dog e Bookman Old Style
The quick brown fox jumps over the lazy dog – Luciclu Bright

Resized 50% more

Ocropus 100% * 50% * 50%

The output is empty for the image above, so I had to undo that resize and scale the image to 75% instead.

Undone, Resized 75% more

Ocropus 100% * 50% * 75%

The Output

Thu qluuk brrmn luynuupx mer nlm lng dug ‘l umu Nui Rumrm
The qmck brown foxlumps over the lazy dog — Calrbrl
The quick brown fox lumps over the lazy dug 7 Anal
(Cllr nurrl: lrruluu lm lumvs unrr tllr law hm Olh @rrq|rsl1€utjil’C
‘I’lIIZ (IUICK IHIDWN l*0X JUDIPS l)\’lEIl’I`Ill£I.1\ZYIil)G- S`l’llNl)lI.
The quick brvvrm bx jump: over the lazy dog · Cooper Black
The quick brown Fox jumps over the lazy dog — Consulas
The qui ck brown fox ]umps over the lazy dug — Andale Mono
The quick brown fax jumps uver the lazy dog — Bcokmzm Old Sty le
Thu quick brrmn fox lumps ru ur thc lazy drug — Lucida Bright

Conclusion

From the tests of raw, untrained OCRopus + Tesseract we can see that the output depends heavily on the size, thickness, and style of the text. It failed to recognize the ornate Old English Text MT and Stencil from the start. To my surprise, towards the end it also failed to recognize Times New Roman.

And the winners here are clearly the console, a.k.a. fixed-width, fonts Consolas and Andale Mono, although towards the end of the test we can see recognition degrade for them too.
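To put a number on "how wrong" a line is rather than eyeballing it, one common measure is the character error rate: the edit distance between the expected sentence and the recognized one, divided by the length of the expected sentence. The sketch below was not part of my original tests. `edit_distance` and `char_error_rate` are names I picked for illustration, and the sample strings are copied from the outputs above with the font label stripped.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, recognized):
    """Edit distance normalized by the reference length."""
    return edit_distance(reference, recognized) / len(reference)


REFERENCE = "The quick brown fox jumps over the lazy dog"

# Labels follow the section headings above.
SAMPLES = {
    "Cooper Black, 100%": "The quick brown Eox jumps over the lazy dog",
    "Calibri, 100% * 50%": "The quick brown foxlumps over the lazy dog",
    "Arial, 100% * 50% * 75%": "The quick brown fox lumps over the lazy dug",
    "Times New Roman, 100% * 50% * 75%": "Thu qluuk brrmn luynuupx mer nlm lng dug",
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(f"{label}: {char_error_rate(REFERENCE, text):.1%}")
```

Running the same calculation for every font at every size would turn the impressions above into a table that is easier to compare.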

The smallest resize, the one that produced no output at all, also supports the theory about text size: the text is simply too small.
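Since each resize was applied on top of the previous one, it helps to spell out the effective scale of each test image. The following is just the arithmetic from the section headings written out in code; `cumulative_scale` is a made-up helper name.

```python
from fractions import Fraction


def cumulative_scale(steps):
    """Multiply successive resize percentages into one linear scale factor."""
    scale = Fraction(1)
    for percent in steps:
        scale *= Fraction(percent, 100)
    return scale


# Resize chains as given in the section headings.
RUNS = {
    "Test at 100%": [100],
    "Resized 50%": [100, 50],
    "Resized 50% more (no output)": [100, 50, 50],
    "Undone, resized 75%": [100, 50, 75],
}

if __name__ == "__main__":
    for label, steps in RUNS.items():
        scale = cumulative_scale(steps)
        # The pixel count shrinks with the square of the linear scale.
        print(f"{label}: {float(scale) * 100:g}% linear, "
              f"{float(scale ** 2) * 100:g}% of the pixels")
```

So the image that produced nothing was at a quarter of the original width and height, and the last image with usable output was at three eighths.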

OCR experts will not agree 100% with my theories, and they are right. These images contain rasterized, anti-aliased fonts. In a true OCR environment the image would have been converted to a binary (black-and-white) image, meaning that font smoothing ceases to exist. Have you ever seen a fax transmission? It's like that.
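To illustrate what "converted to binary" means, here is a minimal sketch of fixed-threshold binarization on a grayscale image stored as a plain list of rows. It is an illustration only, not what OCRopus or any scanner does internally. The `binarize` function and the threshold of 128 are my assumptions, and real pipelines often choose the threshold adaptively instead.

```python
def binarize(gray, threshold=128):
    """Map every grayscale pixel (0-255) to pure black (0) or white (255)."""
    return [[0 if pixel < threshold else 255 for pixel in row] for row in gray]


# One row across an anti-aliased glyph edge: white paper fading into black ink.
ANTI_ALIASED_EDGE = [[255, 200, 120, 40, 0]]

if __name__ == "__main__":
    # The in-between grays introduced by font smoothing disappear.
    print(binarize(ANTI_ALIASED_EDGE))
```

After thresholding there are no intermediate grays left, which is why the edges look jagged like a fax.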

Having said that, I did some tests with binary images and the software was able to recognize text as small as 6pt. Unfortunately, due to the classified nature of the data, I am unable to publish them here.

Do share your experience with open source OCR, or any OCR software for that matter. I hope you enjoyed reading.


5 thoughts on “OCRopus and Tesseract”

  1. The compiled folder is 31M after being tarred and gzipped. I am thinking of a good way to send it to you. Do you need jam as well?

    One important thing you may want to note is that I compiled it in cygwin.

  2. Hi ady, I tried to compile under Windows without success. Can you send me the compiled version by email? 31M is not a problem for my mail. Thanks in advance.

  3. It surprises me that size has such impact on performance.
    I would have imagined this component to be caught at an early stage and corrected for.

    But then I don’t know much in that field…
    Thanks for sharing your test though.

    Nicolas
