Stopping Spam and Deciphering Text Too

October 2nd, 2007

In a great example of tapping a resource that had gone unnoticed before, researchers at Carnegie Mellon University are using CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) as a means of deciphering old texts that standard computer-based text recognition can’t figure out. CMU is one of the many organizations working to digitize the libraries of the world by scanning text from books and manuscripts. The process relies on OCR, or Optical Character Recognition, to convert the scanned images into text data, which can be indexed, searched, cataloged and, if necessary, translated. None of that would be possible with a simple image of the scan, which would also take up considerably more room to store.

But there’s one major issue: some old texts, especially ones that are in less than perfect condition or use non-standard fonts, are very difficult for computers to read. If the computer hits a word, or group of words, that is too degraded or distorted for it to figure out, there remains only one reliable option: send the text to a human to read. This is obviously going to throw a big kink into projects that digitize thousands of volumes and millions of pages of text.
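To make the hand-off concrete, here is a minimal sketch of how a digitization pipeline might decide which words go to a human. It assumes the OCR step reports a confidence score between 0 and 1 for each word, which the article does not state; the function name split_by_confidence and the 0.8 threshold are hypothetical and chosen only for illustration.

```python
def split_by_confidence(ocr_words, threshold=0.8):
    """Separate OCR output into accepted words and words needing a human.

    ocr_words is a list of (text, confidence) pairs. The confidence score
    is an assumption of this sketch, not something the article describes.
    """
    accepted = []      # (position, text) pairs the OCR was sure about
    needs_human = []   # positions of words to show to a person
    for position, (text, confidence) in enumerate(ocr_words):
        if confidence >= threshold:
            accepted.append((position, text))
        else:
            needs_human.append(position)
    return accepted, needs_human


accepted, needs_human = split_by_confidence(
    [("the", 0.99), ("qu1ck", 0.41), ("fox", 0.93)]
)
print(accepted)
print(needs_human)
```

Only the positions in needs_human would ever reach a person, so the human workload depends on how many words fall below the threshold rather than on the size of the book.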

But researchers have come up with a very clever idea. Many websites block spam-bots and automated scripts by requiring users to type in text that is displayed in an image and often distorted to prevent automated recognition. We’ve all seen them, but now the technology can be used to do something else useful. By showing users images of the words the OCR software could not read, these tests let users contribute to digitization projects simply by typing what they see.

There’s one obvious question that comes to mind: how do we know that the entered text is an accurate reading of the scanned text? That has already been thought of as well. Quoting the article:

“If a person types the correct answer to the one we already know, we have confidence that they will give the correct answer to the other,” says Luis von Ahn, a Professor at CMU.

“We send the same unknown words to two different people, and if they both provide the same answer then effectively we can be sure that it is correct.”
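The two checks in those quotes can be sketched in a few lines of Python. This is only a toy model built from the description above, not the researchers' actual system: the class name WordVerifier, the word ids, and the case-insensitive comparison are all assumptions made for illustration. Each user gets one word whose answer is already known and one unknown word; the unknown answer counts only if the known word was typed correctly, and it is accepted once two different users agree.

```python
from collections import defaultdict


class WordVerifier:
    """Toy model of the control-word plus agreement check (hypothetical)."""

    def __init__(self, required_agreement=2):
        self.required_agreement = required_agreement
        # unknown word id -> {normalized answer -> set of user ids}
        self._votes = defaultdict(lambda: defaultdict(set))
        self.accepted = {}  # unknown word id -> agreed transcription

    @staticmethod
    def _normalize(text):
        # Assumption: ignore case and surrounding whitespace when comparing.
        return text.strip().lower()

    def submit(self, user_id, control_answer, control_truth,
               unknown_id, unknown_answer):
        """Record one response; return True if the control word was right."""
        if self._normalize(control_answer) != self._normalize(control_truth):
            return False  # failed the known word, so discard the other answer
        if unknown_id in self.accepted:
            return True   # already settled; nothing more to record
        voters = self._votes[unknown_id][self._normalize(unknown_answer)]
        voters.add(user_id)  # a set, so one user cannot vote twice
        if len(voters) >= self.required_agreement:
            self.accepted[unknown_id] = self._normalize(unknown_answer)
        return True


verifier = WordVerifier()
verifier.submit("alice", "morning", "morning", "w1", "upon")
verifier.submit("carol", "Morning", "morning", "w1", "Upon")
print(verifier.accepted)
```

Storing voters as a set is what enforces the "two different people" condition: the same user submitting the same answer twice still counts as a single vote.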

CAPTCHA can be unpopular with users: it is often an annoyance, and distorted text can be difficult for humans to read as well. But now that it’s contributing to historical preservation and the storage of human knowledge, it may be a bit less annoying. Still, there are a hundred million more books waiting to be digitized, and at the rate things are going that could take hundreds of years. It’s unlikely that this technology alone will speed up the process enough to digitize those millions of books any time soon, but it certainly can help.


This entry was posted on Tuesday, October 2nd, 2007 at 4:50 pm and is filed under Good Science, History.