
CAPTCHAs Being Used to Help Digitize Books with Poor OCR Accuracy

CAPTCHAs are those distorted letters that you have to type in during some internet transactions to verify that you're actually a human.

I recently learned that some CAPTCHAs are being used to help digitize old printed material by asking users to decipher scanned words from books that optical character recognition (OCR) software failed to read. That is very cool.

Science Magazine reports that:

Whereas standard CAPTCHAs display images of random characters rendered by a computer, reCAPTCHA [from Google] displays words taken from scanned texts. The solutions entered by humans are used to improve the digitization process. To increase efficiency and security, only the words that automated OCR programs cannot recognize are sent to humans.

This illustration from the Science article helps demonstrate how it works:
[Image: recaptcha.jpg]
The article explains:

In this example, the word "morning" was unrecognizable by OCR. reCAPTCHA isolated the word, distorted it using random transformations including adding a line through it, and then presented it as a challenge to a user.

Because the original word ("morning") was not recognized by OCR, another word for which the answer was known ("overlooks") was also presented to determine if the user entered the correct answer.
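
For readers who like to see the logic spelled out, here is a rough Python sketch of the two-word idea described above. It is only an illustration, not reCAPTCHA's actual code: the function names and the `VOTES_NEEDED` threshold are made up for this example, and the article excerpts quoted here do not say how many matching answers are required.

```python
from collections import Counter

# Hypothetical threshold: how many matching human answers are required
# before a reading of the unknown word is accepted. This number is an
# assumption for illustration, not a figure from the article.
VOTES_NEEDED = 2


def check_answer(control_word, control_guess, unknown_guess, votes):
    """Verify the user on the known (control) word. If they pass,
    count their reading of the unknown word as one vote."""
    passed = control_guess.strip().lower() == control_word.lower()
    if passed:
        votes[unknown_guess.strip().lower()] += 1
    return passed


def accepted_transcription(votes):
    """Return the most common reading once it has enough votes, else None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= VOTES_NEEDED else None


votes = Counter()
check_answer("overlooks", "overlooks", "morning", votes)   # passes, one vote
check_answer("overlooks", "overlocks", "mourning", votes)  # fails, ignored
check_answer("overlooks", "Overlooks", "morning", votes)   # passes, second vote
print(accepted_transcription(votes))  # prints "morning"
```

The point of the sketch is that the known word ("overlooks") does the human-verification work, while answers for the unknown word ("morning") are only trusted when they come from users who got the known word right.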

For more information, see the reCAPTCHA page and the Science Magazine article.