1.5 million books in your pocket

Thursday, February 05, 2009 at 8:56 AM

One of the great things about an iPhone or Android phone is being able to play Pac-Man while stuck in line at the post office. Sometimes, though, we yearn for something more than playing games or watching videos.

What if you could also access literature's greatest works, such as Emma and The Jungle Book, right from your phone? Or some of the more obscure gems, such as Mark Twain's hilarious travelogue, Roughing It? Today we are excited to announce the launch of a mobile version of Google Book Search, opening up over 1.5 million public domain books in the US (and over half a million outside the US) for you to browse while buying your postage.

While these books were already available on Google Book Search, the new mobile editions are optimized for reading on a small screen. To try it out and start reading, open the web browser on your iPhone or Android phone and go to http://books.google.com/m.

There's an interesting backstory about the work involved in preparing so many books for mobile devices. If you use Google Book Search, you'll notice that our previews are composed of page images made by digitizing physical copies of books. These page images work well when viewed on a computer, but prove unwieldy on a phone's small screen.

Our solution for making these books accessible is to extract the text from the page images so it can flow in your mobile browser just like any other web page. This extraction process is known as Optical Character Recognition (or OCR for short). The following example demonstrates the difference between page images and the extracted text:

=> "Because I made a blunder, my dear Watson— which is, I am afraid, a more common occurrence than anyone would think who only knew me through your memoirs. ...
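To illustrate why extracted text suits a small screen, here is a minimal Python sketch of reflowing. It is not the Book Search implementation; the function name `reflow`, the idea that the OCR step returns one string per printed line, and the rule that a trailing hyphen always marks a split word are all assumptions made for the example.

```python
import textwrap

def reflow(ocr_lines, width):
    """Join printed lines into one paragraph, then re-wrap it to `width` columns.

    `ocr_lines` is assumed to be a list of strings, one per line on the page.
    """
    text = ""
    for line in ocr_lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # Simplification: treat a trailing hyphen as a word split across lines.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    # textwrap.fill breaks the paragraph at word boundaries for the target width.
    return textwrap.fill(text, width=width)

page_lines = ["Because I made a blun-", "der, my dear Watson"]
print(reflow(page_lines, width=20))
```

The same paragraph can be re-wrapped for any screen width, which a fixed page image cannot do. A real system would need a smarter rule for hyphens, since some words are legitimately hyphenated.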

The extraction of text from page images is a difficult engineering task. Smudges on the physical books' pages, fancy fonts, old fonts, torn pages, etc. can all lead to errors in the extracted text. The example below shows the page image from the original manuscript for Alice's Adventures Under Ground. In this extreme case, the extracted text is riddled with errors:

=> "lV~e.il!" .ÍAoHyU- AUte. U brstty/affc. su.it a. f o.tl as ~tk¿* , I s&O.IL .éfiiíjz tiotkun-) of-ttmlr1¿*y ¿i^n. sta¿rs ! Jfo» ura.ve ...
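One rough way to spot output like this automatically is to measure how many tokens look like ordinary words. The sketch below is only an illustration under that assumption, not a description of how Book Search handles OCR errors; the function name `wordlike_ratio` and the English-only letter pattern are hypothetical choices.

```python
import re

# A token counts as "word-like" if it is plain letters, optionally with one apostrophe.
WORD = re.compile(r"^[A-Za-z]+(?:'[A-Za-z]+)?$")

def wordlike_ratio(text):
    """Return the fraction of whitespace-separated tokens that look like words."""
    # Strip common surrounding punctuation before testing each token.
    tokens = [t.strip(".,;:!?\"()") for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    good = sum(1 for t in tokens if WORD.match(t))
    return good / len(tokens)

clean = "Because I made a blunder, my dear Watson"
garbled = "lV~e.il! AUte. U brstty/affc. su.it a. f o.tl"
print(wordlike_ratio(clean), wordlike_ratio(garbled))
```

A low ratio suggests the page may read better as the original image than as extracted text. The cutoff would need tuning, and a dictionary check would be more reliable than a letter pattern.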

Imperfect OCR is only the first challenge on the way to our ultimate goal of moving from collections of page images to books based on extracted text. Our computer algorithms also have to automatically determine the structure of the book (what the headers and footers are, where images are placed, whether text is verse or prose, and so forth). Getting this right allows us to render the book in a way that follows the format of the original book.
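As one small example of structure detection, running headers can often be guessed from repetition: a first line that recurs on many pages, apart from its page number, is probably not body text. The Python sketch below assumes each page arrives as a list of text lines; the names `find_running_headers` and `strip_headers` and the 50% threshold are hypothetical, and this is not the algorithm Book Search uses.

```python
import re
from collections import Counter

def normalize(line):
    """Lowercase the line and mask digit runs so 'TITLE 12' and 'TITLE 13' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())

def find_running_headers(pages, min_fraction=0.5):
    """Return normalized first lines that appear on at least `min_fraction` of pages."""
    counts = Counter(normalize(page[0]) for page in pages if page)
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= threshold}

def strip_headers(pages, min_fraction=0.5):
    """Drop the first line of each page when it matches a detected running header."""
    headers = find_running_headers(pages, min_fraction)
    return [
        page[1:] if page and normalize(page[0]) in headers else page
        for page in pages
    ]

pages = [
    ["ROUGHING IT 12", "text a"],
    ["ROUGHING IT 13", "text b"],
    ["CHAPTER II", "text c"],
]
print(strip_headers(pages))
```

Footers could be handled the same way by looking at the last line of each page. Telling verse from prose or locating images needs different signals, such as line lengths and layout information.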

The technical challenges are daunting, but we'll continue to make enhancements to our OCR and book structure extraction technologies. With this launch, we believe that we've taken an important step toward more universal access to books.

To try it out, point your mobile browser to http://books.google.com/m and begin reading. Oh, and if you do bump into some rough patches where the text seems, well, weird, you can just tap on the text to see the original page image for that section of text.

Happy mobile reading!

