Thursday, August 05, 2010 at 8:26 AM
Posted by Leonid Taycher, software engineer
When you are part of a company that is trying to digitize all the books in the world, the first question you often get is: “Just how many books are out there?”
Well, it all depends on what exactly you mean by a “book.” We’re not going to count what library scientists call “works,” those elusive “distinct intellectual or artistic creations.” It makes sense to consider all editions of “Hamlet” separately, as we would like to distinguish between -- and scan -- books containing, for example, different forewords and commentaries.
One definition of a book we find helpful inside Google when handling book metadata is a “tome,” an idealized bound volume. A tome can have millions of copies (e.g. a particular edition of “Angels and Demons” by Dan Brown) or can exist in just one or two copies (such as an obscure master’s thesis languishing in a university library). This is a convenient definition to work with, but it has drawbacks. For example, we count hardcover and paperback books produced from the same text twice, but treat several pamphlets bound together by a library as a single book.
Our definition is very close to what ISBNs (International Standard Book Numbers) are supposed to represent, so why can’t we just count those? First, ISBNs (and their SBN precursors) have been around only since the mid-1960s, and were not widely adopted until the early-to-mid seventies. They also remain a mostly Western phenomenon. So most books printed earlier, and those not intended for commercial distribution or printed in other regions of the world, have never been assigned an ISBN.
The other reason we can’t rely on ISBNs alone is that ever since they became an accepted standard, they have been used in non-standard ways. They have sometimes been assigned to multiple books: we’ve seen anywhere from two to 1,500 books assigned the same ISBN. They are also often assigned to things other than books. Even though they are intended to represent “books and book-like products,” unique ISBNs have been assigned to anything from CDs to bookmarks to t-shirts.
What about other well-known identifiers, for example those assigned by the Library of Congress (Library of Congress Control Numbers) or OCLC (WorldCat accession numbers)? Rather than identifying books, these identify records that describe bibliographic entities. For example, the bibliographic record for Lecture Notes in Mathematics (a monographic series with thousands of volumes) is assigned a single OCLC number. This makes sense when organizing library catalogs, but does not help us to count individual volumes. This practice also causes duplication: a particular book can be assigned one number when cataloged as part of a series or a set and another when cataloged alone. The duplication is further exacerbated by the difficulty of aggregating multiple library catalogs that use different cataloging rules. For example, a single Italian edition of “Angels and Demons” has been assigned no fewer than 5 OCLC numbers.
So what does Google do? We collect metadata from many providers (more than 150 and counting) that include libraries, WorldCat, national union catalogs and commercial providers. At the moment we have close to a billion unique raw records. We then further analyze these records to reduce the level of duplication within each provider, bringing us down to close to 600 million records.
Does this mean that there are 600 million unique books in the world? Hardly. There is still a lot of duplication within a single provider (e.g. libraries holding multiple distinct copies of a book) and among providers -- for example, we have 96 records from 46 providers for “Programming Perl, 3rd Edition”. Twice every week we group all those records into “tome” clusters, taking into account nearly all attributes of each record.
When evaluating record similarity, not all attributes are created equal. For example, when two records contain the same ISBN this is a very strong (but not absolute) signal that they describe the same book, but if they contain different ISBNs, then they definitely describe different books. We trust OCLC and LCCN number similarity slightly less, both because of the inconsistencies noted above and because these numbers do not have checksums, so catalogers have a tendency to mistype them.
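To illustrate why a checksum matters here, the following is a minimal sketch of the standard ISBN-10 and ISBN-13 check-digit rules. It is not our production code; the function names are made up for this example, and the sample numbers are commonly used textbook examples rather than anything from our data.

```python
DIGITS = "0123456789"


def is_valid_isbn10(isbn: str) -> bool:
    """Check an ISBN-10: weighted sum (weights 10..1) must be divisible by 11."""
    chars = [c for c in isbn.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" stands for 10 and is only legal as the check digit
        elif char in DIGITS:
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    """Check an ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    chars = [c for c in isbn if c not in "- "]
    if len(chars) != 13 or not all(c in DIGITS for c in chars):
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(chars))
    return total % 10 == 0


if __name__ == "__main__":
    print(is_valid_isbn10("0-306-40615-2"))      # well-formed
    print(is_valid_isbn10("0-306-40651-2"))      # two digits transposed
    print(is_valid_isbn13("978-0-306-40615-7"))  # well-formed
```

A single mistyped digit or a swap of two adjacent digits changes the ISBN-10 weighted sum, so such typos can be rejected before any matching happens. Identifiers without a check digit offer no such safety net.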
We put even less trust in the “free-form” attributes such as titles, author names and publisher names. For example, are “Lecture Notes in Computer Science, Volume 1234” and “Proceedings of the 4th international symposium on Logical Foundations of Computer Science” the same book? They are indeed, but there’s no way for a computer to know that from titles alone. We have to deal with these differences between cataloging practices all the time.
We tend to rely on publisher names, as they are cataloged, even less. While publishers are very protective of their names, catalogers are much less so. Consider two records for “At the Mountains of Madness and Other Tales of Terror” by H.P. Lovecraft, published in 1971. One claims that the book it describes has been published by Ballantine Books, another that the publisher is Beagle Books. Is this one book or two? This is a mystery, since Beagle Books is not a known publisher. Only looking at the actual cover of the book will clear this up. The book is published by Ballantine as part of “A Beagle Horror Collection”, which appears to have been mistakenly cataloged as a publisher name by a harried librarian. We also use publication years, volume numbers, and other information.
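The general idea of weighting attributes differently and then grouping records can be sketched as follows. This is a toy illustration only, not the algorithm we actually run: the weights, the threshold, the field names and the identifier values in the sample records are all invented for the example, and the pairwise comparison would not scale to hundreds of millions of records.

```python
from itertools import combinations
from typing import Dict, List

# Hypothetical weights: identifiers count for more than free-form fields.
WEIGHTS = {"isbn": 1.0, "oclc": 0.8, "lccn": 0.8,
           "title": 0.4, "year": 0.2, "publisher": 0.2}
THRESHOLD = 1.0  # assumed cut-off for "same tome"


def similarity(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Sum the weights of the fields on which two records agree."""
    isbn_a, isbn_b = a.get("isbn"), b.get("isbn")
    if isbn_a and isbn_b and isbn_a != isbn_b:
        return 0.0  # different ISBNs are treated as different books
    score = 0.0
    for field, weight in WEIGHTS.items():
        value_a, value_b = a.get(field), b.get(field)
        if value_a and value_b and value_a.strip().lower() == value_b.strip().lower():
            score += weight
    return score


def cluster(records: List[Dict[str, str]]) -> List[List[int]]:
    """Group record indices with union-find; similar pairs share a cluster."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= THRESHOLD:
            parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Sample records; the "oclc" and "isbn" values are placeholders, not real numbers.
EXAMPLE_RECORDS = [
    {"title": "At the Mountains of Madness and Other Tales of Terror",
     "publisher": "Ballantine Books", "year": "1971", "oclc": "placeholder-1"},
    {"title": "At the Mountains of Madness and Other Tales of Terror",
     "publisher": "Beagle Books", "year": "1971", "oclc": "placeholder-1"},
    {"title": "Programming Perl, 3rd Edition", "isbn": "placeholder-2"},
]

if __name__ == "__main__":
    print(cluster(EXAMPLE_RECORDS))
```

In this sketch the two Lovecraft records end up in one cluster despite the conflicting publisher names, because the low-weight publisher field is outvoted by the fields that agree. Note that plain union-find merges transitively, so a real system would need extra care to keep records with conflicting ISBNs from being joined through a third record.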
So after all is said and done, how many clusters does our algorithm come up with? The answer changes every time the computation is performed, as we accumulate more data and fine-tune the algorithm. The current number is around 210 million.
Is that the final number of books in the world? Not quite. We still have to exclude non-books such as microforms (8 million), audio recordings (4.5 million), videos (2 million), maps (another 2 million), t-shirts with ISBNs (about one thousand), turkey probes (1, added to a library catalog as an April Fools’ joke), and other items for which we receive catalog entries.
Counting only things that are printed and bound, we arrive at about 146 million. This is our best answer today. It will change as we get more data and become more adept at interpreting what we already have.
Our handling of serials is still imperfect. Serials cataloging practices vary widely across institutions. The volume descriptions are free-form and are often entered as an afterthought. For example, “volume 325, number 6”, “no. 325 sec. 6”, and “V325NO6” all describe the same bound volume. The same can be said for the vast holdings of government documents in US libraries. At the moment we estimate that we know of 16 million bound serial and government document volumes. This number is likely to rise as our disambiguating algorithms become smarter.
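As a rough illustration of the volume-description problem, one naive approach is to ignore the wording entirely and compare only the numbers, in order. The helper below is a hypothetical sketch, not what we run; it would be fooled by descriptions that also contain years or page ranges.

```python
import re
from typing import Tuple


def volume_key(description: str) -> Tuple[int, ...]:
    """Reduce a free-form volume description to its numbers, in order."""
    return tuple(int(number) for number in re.findall(r"\d+", description))


if __name__ == "__main__":
    for text in ("volume 325, number 6", "no. 325 sec. 6", "V325NO6"):
        print(text, "->", volume_key(text))
```

All three descriptions from the example above reduce to the same key, (325, 6), so they could be compared directly. The hard part in practice is everything this sketch ignores, such as deciding which number is the volume and which is the issue.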
After we exclude serials, we can finally count all the books in the world. There are 129,864,880 of them. At least until Sunday.