July, 2008 | Ady's Blog

OCRopus and Tesseract

A friend pointed me to an open source project called OCRopus because I am currently working on a project related to OCR. Commercial OCR solutions ain’t cheap and you can really burn a hole in your pocket trying to get a good one. It’s not the price of the hardware or the software that is high, but the amount of work needed to make sure the output is correct.

Most OCR solutions need a vast amount of time to train the software to correctly identify characters. Artificial Intelligence can help but not now, not today, not yet.

OCRopus does not recognize the characters itself; it relies on Tesseract as its OCR engine. What OCRopus provides is layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. Sounds really good, doesn’t it?

Most of the project is developed and tested on Ubuntu, but if your platform has binutils and build tools you’re good to go. I believe it is also possible to build with Microsoft Visual Studio on Windows and of course MinGW. I went for the easiest option since I only had 2 hours to spare and I already had Cygwin on my system.

I first installed the library header files (libpng-devel, libtiff-devel, libjpeg-devel) and build tools (gcc, make, g++, autoconf), and then built Tesseract with the normal ./configure && make && make install method. Building OCRopus requires Perforce Jam. Jam is actually Just Another Make, so I find it a little funny that I had to build Jam using make. Oh well. OCRopus is built with ./configure && jam && jam install and it went pretty well.

To run them, don’t forget to download the language files for your target language, otherwise it will complain: Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset

I ran my tests with the standard Lua scripts that come with OCRopus (located in /usr/local/share/ocropus/scripts/) using the command ocroscript.exe rec-tess input_image > output.html

I created a 10-line Word document with different fonts and printed it to a PDF. Using Adobe Photoshop I saved it as a JPG image. Then I gradually shrank the image to the smallest size that still produced some output.

To see the tests and results, click on Continue Reading.


WordPress 2.6 (Tyner)

WordPress 2.6 has been released. Go get WordPress 2.6 now! Or you can read the development blog for the complete details.

MMU For Sale

Yeah, you heard it right. Multimedia University (MMU) is up for sale, and it’s valued at between RM800 million and RM900 million. The plan is to raise money for the high-speed broadband (HSBB) project, which will cost TM RM1 billion to RM1.5 billion per year.

Here’s an article from Business Times Online.

The article includes this line: “While management remains optimistic on the disposal of staff loans, the group is struggling to find a buyer for the university,” the duo noted. So it’s now confirmed that MMU is for sale!

I was alerted to this news while reading Amanz.

The Most Expensive “Baju Melayu” I Ever Had

Or frankly the title should be “The Most Expensive Baju Melayu I Will Have”, because I sent the fabric to the tailor at Wisma Yakin near Jalan Tuanku Abdul Rahman just a couple of hours ago. The thing is, I am unsure whether I will get a more expensive one later in life.

For those who are wondering what it is, see the Wikipedia page on Baju Melayu.

I am not getting this one specifically for the coming Eid ul-Fitr, but for my niece’s wedding ceremony in August. And since it is pricey I will also wear it for Eid.

The fabric cost RM200 and the tailor charges RM120. That may be normal for some people, but since all of my previous Baju Melayu were tailored by my mom using ordinary fabrics, this is the most expensive one for me.

The reason I didn’t go for ready-made ones is that they might not fit well, and the ones I found are either too cheap (equals bad quality) or too expensive. The turquoise Baju Melayu will be ready during the last week of July, just in time for the wedding on August 8th.

MasterCard/Visa Promotion Fraud Attempt #2

Back in January I wrote about an attempt by a caller using a private number to squeeze my credit card numbers out of me.

On July 4th, I received a call from 016 336 8916 but since I have my phone on private mode they were not able to reach me. However a couple of minutes later they sent me a text message: “Hello mr/mrs Ady Romantika. I’m Ros from Visa/master card voucher department. Because of your loyalty to us, you are entitled for complimentary vouchers. Please come to our office with your spouse to collect your vouchers. Unit 515, Level 5, Block E, Phileo Damansara 1, No. 9, Jalan 16/11, Off Jalan Damansara, 46350 Petaling Jaya Selangor. Please come anytime from Monday to Sunday between 3pm to 8pm. Please also allow us at least 45mins of your time. Thank you and see you soon.”

It was a long text message indeed. I replied that they are scammers, and they were bold enough to reply to me. Their messages have now been sent to the Royal Malaysian Police and the media via email. I am unsure whether any action will be taken to investigate, but I shall wait and see.

As I mentioned in my previous post, there is no logical chance that MasterCard and Visa are running a promotion together. I am pretty sure that if I showed up they would ask for my credit cards and take the chance to copy the numbers, expiry dates, and CVV/CV2 numbers. Then they would be free to use my credit cards online.

For the less cautious this might be a trap they might easily fall into. Beware!

Compressing WordPress Output

While toying around with the NextGen code so that I could activate my custom image mirror, I looked at the output in Firebug. I noticed that my HTML output was not compressed (the response had no gzip Content-Encoding header).

Some Apache servers already have a compression module enabled (previously mod_gzip, a 3rd party module for Apache 1, and now mod_deflate, which is built into Apache 2).

But what if you don’t have access to the Apache configuration, such as in a shared hosting environment?

I have the answer for PHP. I always include this line in the bootstrap code of the applications I build using Zend Framework:

ob_start("ob_gzhandler");

And the output will be gzipped prior to sending it to the browser. The result? Faster transfer to users.
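To make the idea concrete outside PHP, here is a minimal Python sketch of what an output handler like ob_gzhandler does in principle: compress the body only when the client's Accept-Encoding header allows gzip. The function name maybe_gzip and the sample page are hypothetical, and the header parsing is deliberately simplified (it ignores q-values), so treat it as an illustration and not as PHP's actual implementation.

```python
import gzip


def maybe_gzip(body: bytes, accept_encoding: str):
    """Return (headers, payload), gzipping only if the client accepts gzip.

    Hypothetical helper for illustration; q-values in the header are ignored.
    """
    # Split "gzip, deflate;q=0.5" into bare, lowercase encoding names.
    accepted = {
        part.split(";")[0].strip().lower()
        for part in accept_encoding.split(",")
    }
    if "gzip" in accepted:
        payload = gzip.compress(body)
        return {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}, payload
    # Client did not advertise gzip support: send the body untouched.
    return {}, body


if __name__ == "__main__":
    # Made-up, repetitive HTML standing in for a blog front page.
    page = ("<div class='post'><h2>Title</h2><p>Some text</p></div>\n" * 200).encode()
    headers, payload = maybe_gzip(page, "gzip, deflate")
    print(headers, len(page), "->", len(payload), "bytes")
```

Repetitive markup compresses very well, which is why HTML pages tend to benefit so much from this one-line change.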

For WordPress you can put the line in index.php:

<?php
ob_start("ob_gzhandler"); /* Short and sweet */
define('WP_USE_THEMES', true);
require('./wp-blog-header.php');
?>

Easy, isn’t it? Here are Firebug screenshots, before and after. Notice that I managed to cut the size of my front page by 1/5?

[As the screenshots are too wide please click on Continue Reading to see them]

PHP Framework Benchmark

In April I wrote about Eclipse PDT, Zend Framework, PHPUnit.

AVNet Labs have run a comprehensive benchmark of popular PHP frameworks.

It looks like they are also using Zend Framework for their development. I’ll stay with Zend as well, because I believe in vendor-product compatibility. I will not ask for support from Adobe if I have a problem with Microsoft Visual Studio, so it’s the same concept here.

Zend is The PHP Company.

Thanks to Rizal for the heads up.

Chitika Oh Chitika

On 10 April 2007 I applied for Chitika eMiniMalls just to try my luck, even though my number of visitors was much lower back then. I received a reply which I totally understood and accepted:

In an effort to bring value to our publishers, we carefully consider each submission. During our review process we have determined that Chitika | eMiniMalls might not be a good match for your website.

On 27 June 2008 I received an email from Chitika:

Hi Ady,

Great news! There have been a lot of changes over at Chitika recently – so although our ads were not a good match for your website in the past, we believe that we may be a much better fit for you now. Why? Because the Chitika network now serves Premium ads for ALL types of site content like: Finance, Health, Travel, Family, & more. (Previously we focused mainly on product-related websites.)

We now offer a LOT more than eMiniMalls too – our new Chitika|Premium ads target your search traffic, and are showing extremely high CTRs and eCPMs for our publishers. So if you have a good amount of US search traffic, Chitika|Premium will be a great fit for your site!

Re-open your Chitika application here, to get started. (You will be able to edit your information such as website, email, and PayPal info before you submit) or head over to the “Chitika | Premium” page for more information.

So I re-opened my application, and received this reply:

Hello,

The email address you used doesn’t match the domain that you submitted, and it also does not match the email that was used to register the domain, so we cannot tell if you actually own this website.

If you do own this site, please re-open your application using the link below and supply an email address from the domain, or the email address that was used to register the domain.

If you cannot do this, then please re-open your application and tell us why in the “Comments” field. Thanks. Looking forward to your comments.

This is because I used my Gmail address for the registration. If I enter my email address for this blog’s domain I immediately get “Invalid Email ID!”. I explained this in the comment field of the registration form, but I guess nobody paid attention to it. And yet the reply itself says “If you cannot do this, then please re-open your application and tell us why in the “Comments” field. Thanks. Looking forward to your comments.” WTH?

Not that Chitika is bad or anything, but as an Internet user I expect an established site to keep its list of TLDs up to date. I had the same mismatched email problem with Nuffnang, but they accepted my application without any trouble.