Go To Hellman: Google Book Search


Friday, March 5, 2010

Business Idea Number 3: Gluejar Book Search

A few years ago, I was invited to give a talk about the future of libraries at a library staff retreat. After the talk, the speakers were given a special tour of the library, which had recently undergone renovation. I was struck by the loneliness of the stacks. So many books, so much knowledge, so little usage.

As OCLC's Lorcan Dempsey has recently observed, the lawsuit over Google Book Search and its proposed settlement has highlighted the limitations on libraries' ownership of their book collections. There are many things that libraries would like to do with their books that they are prevented from doing by copyright law. The possibility that the Google Books service will enable libraries to reanimate their lonely book collections is the reason that libraries have, for the most part, been sympathetic to Google's digitization program.

One session at last week's Code4Lib conference sharpened my awareness of how libraries are struggling to achieve this reanimation on their own. There were 3 different presentations, from Stanford, NC State (3.65 MB ppt), and University of Wisconsin, Oshkosh, on "virtual bookshelves". The virtual bookshelf tries to enliven the presentation of an electronic library catalog by reproducing part of the experience of browsing a physical library; sometimes the book you really need is sitting there next to the book you're looking for. It's an idea based on a sound user-interface design principle: try to present information in ways that look familiar to the user.

The virtual bookshelf is not a new idea. Google has even been awarded a patent on virtual bookshelves; see the commentary here and here. Given that Naomi Dushay (who presented the Stanford work) wrote about Virtual Bookshelves in 2004, it appears unlikely that the Google patent (filed in 2006) will apply broadly at all.

While the virtual bookshelf is a sensible and practical incremental improvement on the library catalog interface, it's also backward looking. People looking for information today want to search inside the books, not just "browse the stacks". But libraries don't have the ability (today) to search inside the books that they think they own.

Google Books could enable libraries to do just that. Google is spending huge sums of money to digitize books in libraries and make them searchable. When they got sued for doing this, the library community looked forward to having questions surrounding the fair use of digitized books settled in court. For example, while it's pretty clear that using digitization to create a full-text index of a book would be allowed as fair use, the display of "snippets" (as done by Google) may or may not be held to be a fair use of the page scans. When a settlement of the lawsuit was announced, much of the library community was disappointed that these fair-use questions would not be settled.

Google Books already allows users to set up book collections of their own and search them. The results come with snippets (see pictures), but if the settlement is approved, Google's ability to show snippets with vastly reduced infringement liability would leave it with a dominant position in libraries because of its ability to search inside huge numbers of books. If the settlement is not approved, Google's dominance would be similar, except that a copyright decision could shut down Google Books at some time in the distant and irrelevant future.

Some aspects of the settlement create holes in Google's index. As part of the settlement, rights holders can exclude their works from Google's index. Google's publisher partner program allows publishers to create these holes today. For example, even if you add Tolkien's "The Two Towers" to your Google library, Google won't let you search inside it. Only limited research uses can be made of the digitized works; as the Open Book Alliance's Peter Brantley has argued, it's very hard to tell what sort of innovations might arise from the availability of large numbers of digitized texts as data; the same goes for indices of these works.

Many other works have been excluded from the settlement. Works published only outside the US, Canada, UK and Australia, as well as works published in the US, but not registered with the copyright office, are not covered by the settlement. Works other than books, such as newspapers, magazines, and other periodicals are also excluded.

For these reasons and others, I've begun talking to people about "Gluejar Book Search". Gluejar Book Search would be a business focused on collecting, aggregating and redistributing full-text indices of copyrighted material. To comply with copyright law, it would focus on indices that can be distributed without infringing copyright, and would help provide libraries and publishers with tools to produce copyright-safe index documents.

I've frequently encountered the assertion that digitizing all the books in libraries is prohibitively expensive, and that only Google (or possibly the government) could have the financial resources to do it. For example, Ivy Anderson reports an estimate by the California Digital Library that digitization of the 15 million books in the libraries of the University of California would take half a billion dollars and one and a half centuries. There are two countervailing arguments. First, the cost of book digitization software and equipment has rapidly fallen, and will continue to fall. Last year, I wrote about Dan Reetz's DIY book scanner, but even commercial devices capable of both image acquisition and OCR are currently available for as little as $1,400. I described how it could cost as little as $10,000,000 to put scanners in 10,000 libraries to enable scanning of 5,000,000 books per year.
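To make the comparison concrete, here is a rough back-of-the-envelope sketch that uses only the figures quoted above. The variable names are illustrative, and the calculation assumes the quoted totals can simply be divided evenly; the text does not say whether labor or other costs are included in either estimate.

```python
# Figures quoted in the paragraph above (the variable names are illustrative).
cdl_cost_usd = 500_000_000      # California Digital Library estimate
cdl_books = 15_000_000          # books in the University of California libraries
cdl_years = 150                 # "one and a half centuries"

distributed_cost_usd = 10_000_000   # scanners in many libraries
libraries = 10_000
books_per_year = 5_000_000

# Derived quantities: simple division, assuming costs and work spread evenly.
cdl_cost_per_book = cdl_cost_usd / cdl_books
cdl_books_per_year = cdl_books / cdl_years
budget_per_library = distributed_cost_usd / libraries
books_per_library_per_year = books_per_year / libraries
years_for_uc_collection = cdl_books / books_per_year

print(f"CDL estimate: ${cdl_cost_per_book:.2f} per book, "
      f"{cdl_books_per_year:,.0f} books per year")
print(f"Distributed scanning: ${budget_per_library:,.0f} per library, "
      f"{books_per_library_per_year:,.0f} books per library per year")
print(f"Years to scan 15 million books at that rate: {years_for_uc_collection:.1f}")
```

The distributed figure works out to $1,000 per library, which is below the $1,400 commercial device mentioned above, so it presumably relies on cheaper equipment such as the DIY scanner. At 500 books per library per year, each library would need to scan only about two books per working day.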

The other factor that could drastically lower the cost of producing digital full-text indices of all types of copyrighted materials is the drastically lower technical demands of an indexing system compared to that of an archival imager. Archival imagers produce huge scanned image files because of the need for high resolution in an archival image. The resulting demands on storage hardware are significant and expensive. In contrast, an index file can be quite small; the laptop I'm typing on could store indices for 3,000,000 books; I estimate that full-text indices of all the world's books would today require at most ten commercially available hard drives.

Gluejar Book Search would be fueled by two main revenue streams. The first stream would come from customized search services to enable library patrons to search inside the library's books. The second stream would come from providing aggregated feeds of index files to mass-market and specialized search providers: Google's competitors, and book retailers such as Amazon and its competitors. Google may even want to acquire index files for works it has been asked to remove from its own index, such as the Tolkien book mentioned above.

A possible third revenue stream would come from partnerships with rightsholders willing to permit page or snippet display in exchange for link traffic. If a Book Rights Registry comes into existence, it's possible that many business models could be arranged without prohibitive transaction costs.

Part of the revenue from Gluejar Book Search could be returned to libraries, publishers and other institutions that have contributed index files to the aggregation. Libraries could choose to use these funds to fund further digitization; alternatively, they may prefer to contribute to an Open-Access index.

The success of Gluejar Book Search would depend to a significant extent on its ability to reach critical mass. If it could index 80% of a library's book collection, it would deliver significant value to the library. (That statement is based purely on conjecture; email me or leave a comment if you agree or disagree!) Critical mass might be rapidly attained by working closely with publishers and by partnering with low-cost digitization providers and existing content aggregators to obtain indices for the most widely held books. Once critical mass is obtained, the "long tail" could be addressed by encouraging the participation of large numbers of libraries around the world.

A Gluejar Book Search business would require a significant but not huge raise of capital, if for no other reason than to address litigation risk. Although I believe the legal position of building copyright-safe book indices is secure, there are bound to be litigious rightsholders with a poor grasp of fair use under copyright. The other big risks involve Google. Google might well develop services that greatly undercut Gluejar Book Search's revenue streams. Finally, the "copyright-safe" approach might be completely undermined if courts in many countries were to rule decisively for an expansive view of fair use.

If you want to know more about Gluejar, read this post. I have been exploring many possibilities about "what to do next", and I've written about other ideas, as well. As always, I'm interested in feedback of all kinds. Over the next few months, I hope to develop this and other ideas in more depth, so stay tuned.

Friday, February 19, 2010

Notes from the Google Books Fairness Hearing

The Fairness Hearing was even more interesting than I expected; every time a speaker started droning on about something we'd all heard ten times before, Judge Chin would interrupt with a snippy or pointed comment. Judge Chin definitely runs a no-nonsense courtroom.

ResourceShelf has a nice round up of the news reporting from the fairness hearing; the best summaries are from Norman Oder at Library Journal: Part One and Part Two.

Here are some of my observations.

How Many Books?

In Dan Clancy's declaration (PDF, 149 KB) in support of the settlement, there are some interesting numbers (which actually come from Google's Jon Orwant).

Google pays approximately $2.5 million per year to license metadata from 21 commercial databases of information about books.

Google has gathered 3.27 billion records about Books, and analyzed them to identify more than 174 million unique works.

These numbers seemed to cause a great deal of confusion at the hearing. Several speakers opposed to the settlement combined this number with the information from the Declaration of Tiffaney Allen, Settlement Administrator for Rust Consulting, (PDF, 2.1 MB) that

As of February 8, 2010, Rust Consulting has received 1,846 completed hard copy claim forms, and 42,604 claim forms were completed using the settlement website. The total number of Books claimed by those 44,450 claimants is 1,125,339. [...]

Of the 1,107,620 Books claimed online, 619,531 are classified as out-of-print (not Commercially Available) and 488,089 are classified as in-print (Commercially Available).

Some objectors subtracted 1 million claimed books from 174 million unique works to get the eye-opening number of 173 million unclaimed works supposedly being exploited by Google. This is silly math, and the use of silly math is a good indicator of speakers not doing their homework.

It's known that one of the bibliographic databases licensed by Google is OCLC's Worldcat; it's probably not a coincidence that Worldcat currently contains 174,618,797 bibliographic records. There's a big difference between a bibliographic record and a book subject to the settlement. Later in the day, Daralyn Durie, an attorney representing Google, tried to clarify what the numbers meant. (updated February 22 with text from the transcript)

174 million is NOT the number of books in the settlement.
Google estimates that there are 42 million different books in US libraries.
20% of these are in the public domain.
About half of those left are written in foreign languages.
Of the 42 million, less than 10 million of these works are affected by the settlement in any way.
Of these, about 5 million are out-of-print books implicated by the settlement.

These numbers are in line with reality. Michael Cairns, a veteran of the book data supply chain business, has published his own estimates of the number of orphan works which more or less square with these numbers.

So what are the other 160 million works? They're duplicates (different editions of the same work), works that aren't books, and works published in countries excluded from the agreement and not registered with the US copyright office.

Update, February 20: Jon Orwant was kind enough to send me some clarifications.

The only correction I'd make is that it actually *is* a coincidence that OCLC cites 174M records and we cite 174M books.

One thing to add to your "silly math" bit is that the 174M number also includes public domain books (hence not part of the settlement), and (this is the part that everyone messes up, and was ambiguous in Dan's declaration) 174M is a count of *manifestations*, not *works*. Hamlet is one work but hundreds of manifestations. The actual number of works is closer to 120M, but I haven't checked our most recent analysis.

Phrase of the Day: "Identical Factual Predicate"

It became clear at the hearing that Judge Chin's decision would turn on a determination of whether the settlement and the complaint it is meant to resolve have "identical factual predicates." I'll do my best to explain why.

A significant hurdle that the parties (i.e., Google, the Authors, and the Publishers) have to overcome is that the settlement is truly innovative and forward looking, and seeks to bind absent class members to business models that would not otherwise be allowed under copyright law. In their brief justifying the use of a class action, the parties cite a 1986 Supreme Court decision nicknamed "Firefighters", Local Number 93, Int’l Assoc. of Firefighters v. City of Cleveland. In this case, in which the petitioner tried to overturn a consent decree designed to redress past racial discrimination using ongoing obligations, the Court clarified that a judicial decree may go beyond the bounds of an original complaint.

In their filings, objectors countered with the “identical factual predicate” doctrine. This doctrine arises from a case known as "Super Spuds" in which it was held that a class action settlement could not go beyond the complaint of the original lawsuit. Judge Chin seemed interested in the apparent conflict and even asked Amazon's lawyer, famed copyright attorney David Nimmer, for his views on how to reconcile the precedents.

Nonetheless, attorneys from both sides wanted to argue whether the settlement satisfied the "identical factual predicate" test. Michael Boni, attorney for the Authors Guild, appeared to be digging himself deep into a hole when Judge Chin asked him "Isn't it true that this case started out about snippets?" Boni argued that the case was really about the fears that publishers had about the scanning that Google was doing, and who knew what else? I thought to myself that publishers seem to fear much about the future of their industry, and following Boni's line of reasoning, the settlement could have included air rights because authors and publishers feared that the sky was falling.

Daralyn Durie's subsequent argument went a long way to recovering the ground lost by Boni. Of all the hot-shot lawyers making arguments at the hearing, Durie was by far the most impressive. She persuasively argued that since the original complaint included Google's distribution of scan files to the libraries that contributed books for scanning, the settlement's provisions for selling access to scan files indeed constituted an identical factual predicate.

Judge Chin's eventual decision will turn on his evaluation of the "factual predicates".

What, Exactly, is Copyright's "Head"?

By the end of the hearing, I was sick and tired of hearing the phrase "turning copyright on its head". Even Bruce Keller, attorney for the Publishers' Association, was eager to use the phrase in its negative form. Have you ever tried repeating a word over and over again, so that its sound becomes grotesquely detached from its meaning? That's my feeling about the copyright-head phrase. It's meant to express that copyright usually means that copying requires the rightsholder's permission, whereas the settlement would allow Google to make copies unless the rightsholder refuses permission.

On repetition, I began to ask myself: What part of copyright is the head? Are there brains in copyright? Is copyright blind? Does copyright have legs? Is there an invisible hand of copyright? When you eviscerate copyright, do copyright intestines spill out onto the floor?

Judge Chin Wants to Fix It

I got the impression that Judge Chin would like to approve a settlement. At least twice he asked objectors how they would "fix" the settlement to remove their objections. He asked EFF's Cindy Cohn how to fix the privacy problems she called attention to, and he sounded unhappy when EPIC's Marc Rotenberg told him that privacy problems with the settlement couldn't be cured. He asked Irene Pakuscher (representing the Federal Republic of Germany) if the settlement could be fixed to satisfy Germany's concerns about treaty compliance and effective representation. He also wanted to explore with more than one questioner Hadrian Katz' suggestion that all problems would go away if the settlement shifted from being opt-out to being opt-in.

State Laws Aren't Relevant

In an article last year, I suggested that Judge Chin might be tempted to use state unclaimed property laws as an alternate way to unravel the Orphan Works mess. Looks like I was wrong: he expressed open skepticism at the argument of Norman Marden, representing the Commonwealth of Pennsylvania, that the settlement should be rejected because of incompatibility with state laws.

Blind People had the Best View

The National Federation of the Blind made sure to have a very visible presence at the hearing to emphasize the benefits of the settlement for the reading disabled. It worked- photographs of blind people made the New York Times.

Spectators for the hearing filled two courtrooms. For the morning, I was in the overflow room, which featured a video screen too small for the room and a distorted sound system. The view of the courtroom was fixed, and omitted any view of Judge Chin. Ironically, the seats closest to the video screen were filled with people who couldn't see it. Let's hope that's not emblematic of the case.

Thursday, February 18, 2010

Settlement Lawyers Say Real Authors Don't Advocate Fair Use

Today, February 18, 2010, in the US District Court, Southern District of New York, Judge Denny Chin will hear arguments for and against approval of an agreement to settle the lawsuit against Google by a class of book rightsholders formed by the Association of American Publishers and the Authors Guild. The unlikely alliance of publishers, authors, and Google will try to push through a settlement that would provide increased access to millions of books that Google has scanned and digitized in cooperation with libraries.

You can read about the pros and cons, the benefits and controversy of the settlement on a variety of blogs, websites and news outlets, but if you want to read one paragraph (with footnote) from the thousands of pages filed with the court that embodies all the issues, contradictions and complexities of the Google Books Settlement, here it is:

Some object to the entire ASA because it does not ensure that scientific or academic works are freely accessible under “Open Access” principles. They have claimed that if those works remain unclaimed, then they should be freely made available for use. These arguments run counter to the economic interests of members of the Class.¹⁴⁶ That the reading public may wish to have free access to scientific and other academic works covered by the ASA, or that some academic authors may not want to exploit their works through the Revenue Models, should not supersede the economic interests of members of the Class.

¹⁴⁶ That the interests motivating these objections runs contrary to the interests of the Class is best illustrated by their preference that Google should prevail on the merits of this litigation. See, e.g., D.I. 336 at 2-3 (“we believe . . . that scanning books to index them and make snippets available is likely and should be considered fair use”).

This comes from the Supplemental Memorandum Responding to Specific Objections filed by lawyers for the plaintiffs in the case. This 187 page document, available from the Public Index (PDF 856 KB) presents legal arguments countering objections to the agreement filed with the court. Just in case you've not had a chance to follow all the issues surrounding the case, I'll try to explain some of this crankiness.

In this excerpt, the "some" who object in "D.I. 336" (PDF, 287 KB) to the entire ASA (Amended Settlement Agreement) is Pamela Samuelson, Professor of Law at the University of California. Samuelson writes on behalf of a long list of academic authors, who believe that many absent rightsholders would want their books to be made as freely available as possible, and object to Google's exclusive monetization of those works.

I can speak to this belief from personal experience. My wife's father was a history professor, and wrote a small number of scholarly monographs published by university presses. These monographs, representing a significant part of his life's work, are unavailable to many scholars in his field. If he were still alive, we are sure that he would have wanted his books to be digitized and made freely available. I've advised the family that the Google settlement would allow these works to become much more available, something that would be difficult to achieve without the settlement because we have no documentation of the relevant publication contracts. Nonetheless, my father-in-law's interests would have closely aligned with those of the academics represented by Samuelson, in favor of free access, and siding with Google on the fair use arguments.

A large fraction of book authors write them for reasons other than to profit from book sales, and only a very small number of authors are able to make a living publishing books. In addition to academic authors, who publish to advance their careers, there are authors who publish to advance a political or social agenda, or as a means of personal expression. It seems bizarre to me that the legal representatives of the entire class of authors should just dismiss these motivations as running counter to the "economic interests of members of the Class".

Since the lawsuit is configured as a Class Action, the central issue that Judge Chin must consider is whether all authors and publishers are properly represented by attorneys for the class, and whether the settlement deals fairly with them. The provisions of the settlement are unusually broad, so Judge Chin will need to give detailed scrutiny to the provisions of the settlement which impact some class members differently from others.

It seems to me that footnote 146 argues too much. In attacking Samuelson and the academics she represents for siding with Google on the fair use issue, the footnote undermines the plaintiffs' core argument that Boni & Zack LLC and Debevoise & Plimpton LLP, the authors of the Memorandum, are fairly representing their interests in the lawsuit.

At the fairness hearing, I don't expect to hear any new arguments or experience any legal drama (although I expect vitriolic verbal grenades from Lynn Chu). I'll mostly be looking for signs of interest, impatience, or annoyance from Judge Chin.

Thursday, February 4, 2010

Copyright-Safe Full-Text Indexing of Books

As the February 18 hearing on the revised Google Books Settlement Agreement draws near, I think it's timely to explore some issues surrounding full-text indexing of books. It's important to realize that when Google began its program of scanning books in libraries, it chose to do so in a way that entered the gray zone of fair use. Google continues to maintain that its scanning activities are perfectly legal, and fair use advocates welcomed the Publishers' and Authors' lawsuit because it had the potential to clarify ambiguities around fair use. No matter where the court decided to draw the line, both fair use and rightsholder control would be able to extend into the zone of current uncertainty.

Overlooked in the controversy is the fact that Google could have chosen a safer course in its effort to make full-text indices of books. In this article, I'll argue that it's possible to make full-text indices of books in a way that steers well clear of copyright infringement. But first, I should note that playing it safe would not have been a good plan for Google. By pushing fair use to its limits, Google assured itself a favorable competitive position. In a lawsuit, Google could have lost on 90% of the fair use they were claiming and would still have ended up 10% ahead of where a safe course would have taken them. Google is large enough that even a 10% victory in court would have paid off in the long run. As it is, Google chose to settle the lawsuit under terms that put them in a better position than they would have occupied by playing it safe, and potential competitors don't gain the benefits of a fair-use precedent.

I make two assumptions about copyright in devising a copyright-safe indexing method:

You can't infringe the copyright to a work if you don't copy the work.
If you can't reconstruct a work from its index, then distributing copies of the index doesn't infringe on the work's copyright.

Just in case these assumptions are weak, my fall-back position is that indexing is clearly a fair use under US copyright law.

First, the fall-back assumption: full-text indexing is allowed as fair use under US copyright law. Indices are allowed as "transformative uses". Judge Robert Patterson's decision (pdf, 195K) in the "Harry Potter Lexicon" case gives an excellent background of this jurisprudence and concludes:

The purpose of the Lexicon’s use of the Harry Potter series is transformative. Presumably, Rowling created the Harry Potter series for the expressive purpose of telling an entertaining and thought provoking story centered on the character Harry Potter and set in a magical world. The Lexicon, on the other hand, uses material from the series for the practical purpose of making information about the intricate world of Harry Potter readily accessible to readers in a reference guide. To fulfill this function, the Lexicon identifies more than 2,400 elements from the Harry Potter world, extracts and synthesizes fictional facts related to each element from all seven novels, and presents that information in a format that allows readers to access it quickly as they make their way through the series. Because it serves these reference purposes, rather than the entertainment or aesthetic purposes of the original works, the Lexicon’s use is transformative and does not supplant the objects of the Harry Potter works.

The author of the Lexicon lost his case not because his indexing was not allowed, but rather because he copied too much of J. K. Rowling's creative expression in doing so.

Second, you have to copy to infringe copyright. A more accurate statement is this: You have to either make a copy or a derivative work to infringe copyright. The second piece of this can be a bit more confusing, because "derivative work" has a specific meaning in copyright law. A translation into another language is an example of a derivative work. Indices are not derivative works. The law considers indices to be more akin to metadata. I might need access to a book to count the number of figures it contains, but a report of the number of figures in a book and what page they're on is in no way a derivative work. The copyright act defines a derivative work as

a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted.

If you make copies by scanning, as Google is doing, you must also establish that your use is allowed as fair use. If you make neither a copy nor a derivative work, you never need to reach the fair use provision at all.

The last assumption gets more technical. The simplest form of a word index is a sorted list of words with pointers to the occurrence of the word within the text. So an index of that last sentence might look like this:

a    5,9
form    3
index    7
is    8
list    11
occurrence    18
of    4,12,19
pointers    15
simplest    2
sorted    10
text    24
the    1,17,20,23
to    16
with    14
within    22
word    6,21
words    13

It doesn't take a computer science degree to see that it's easy to reconstruct the sentence from this index. For that reason this form of index is equivalent to a copy. If you remove the position pointers, however, the index loses enough information that the sentence cannot be reconstructed. So if we take the words on a page of text and sort the words in each sentence, then sort the word-sorted sentences, we get an index of a page that can't be used to reconstruct text, but can be used to build a useful full-text index of a book.
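A minimal sketch of the two kinds of index described above may help. The function names (`tokenize`, `positional_index`, `reconstruct`, `page_index`) and the naive sentence splitting on punctuation are assumptions made for illustration, not a description of any existing indexing software. The first index keeps word positions and can be inverted back into the text; the second throws positions away by sorting words within each sentence and then sorting the sentences.

```python
import re
from collections import defaultdict


def tokenize(sentence):
    """Lowercase the text and return its words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def positional_index(sentence):
    """Map each word to the 1-based positions at which it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(tokenize(sentence), start=1):
        index[word].append(position)
    return dict(sorted(index.items()))


def reconstruct(index):
    """Rebuild the word sequence from a positional index."""
    by_position = {pos: word
                   for word, positions in index.items()
                   for pos in positions}
    return " ".join(by_position[pos] for pos in sorted(by_position))


def page_index(page_text):
    """Position-free index: sort the words in each sentence, then sort the sentences."""
    sentences = re.split(r"[.!?]+", page_text)  # naive splitter, for illustration only
    word_sorted = [sorted(tokenize(s)) for s in sentences if tokenize(s)]
    return sorted(word_sorted)


SENTENCE = ("The simplest form of a word index is a sorted list of words "
            "with pointers to the occurrence of the word within the text.")

index = positional_index(SENTENCE)
print(index["of"])          # [4, 12, 19]
print(reconstruct(index))   # the original sentence, lowercased and unpunctuated

# Two pages with different word and sentence order produce the same index,
# so the index alone cannot say which page it came from.
print(page_index("Dog bites man. Hello there.") ==
      page_index("There hello. Man bites dog."))   # True
```

The position-free index still answers the question a search engine needs answered (does this page contain these words, and do they occur in the same sentence?) while discarding the word order that carries the author's expression.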

The trickiest step of completely copyright-safe indexing is producing the page index from a book without producing intermediate copies of the pages. In a conventional scanning process, a digital image of a page is stored to disk and the copy is passed to OCR software. Indexing software then works on the OCR text. A scanning process that was fastidious about copyright, however, could scan lines of text word by word and never acquire an image large enough to be subject to copyright.

US courts have considered the loading of a copyrightable work into a computer's RAM storage to constitute copying, but scanning sufficient to produce an index can in principle be done without requiring that to occur. (For an excellent law review article on the RAM-copying situation, read Jonathan Band and Jeny Marcinko's article in Stanford Technology Law Review.) Also, even sentences of more than a few words can be considered copyrightable works, as I discussed in an article from November.

Another possible way to avoid copying is to build a black-box indexer. A closer look at the RAM-copying precedent, MAI SYSTEMS v. PEAK COMPUTER, suggests that a non-copying scanning indexer can be built even if page images exist somewhere in RAM. In that case, the court reasoned that the software copy could be viewed via terminal readouts, system logs, and that sort of thing. If a closed-box indexing system were built so that page images resident in RAM could never be "perceived, reproduced, or otherwise communicated", then there is a fair chance that a court would find that copying was not occurring.

I'm a technologist, not a lawyer. I would welcome comment and criticism from experts of all stripes on this analysis. For example, I've not considered international aspects at all. There are many technical aspects of copyright-safe indexing that would need to be sorted out, but doing so could open the way to countless transformative uses of all the books in the world.

Monday, January 18, 2010

Google Exposes Book Metadata Privates at ALA Forum

At the hospital, nudity is no big deal. Doctors and nurses see bodies all the time, including ones that look like yours, and ones that look a lot worse. You get a gown, but its coverage is more psychological than physical!

Today, Google made an unprecedented display of its book metadata private parts, but the audience was a group of metadata doctors and nurses, and believe me, they've seen MUCH worse. Kurt Groetsch, a Collections Specialist in the Google Books Project presented details of how Google processes book metadata from libraries, publishers, and others to the Association for Library Collections and Technical Services Forum during the American Library Association's Midwinter Meeting.

The Forum, entitled "Mix and Match: Mashups of Bibliographic Data", began with a presentation from OCLC's Renée Register, who described how book metadata gets created and flows through the supply chain. Her blob diagram conveyed the complexity of data flow, and she bemoaned the fact that library data was largely walled off from publisher data by incompatible formats and cataloging practice. OCLC is working to connect these data silos.

Next came friend-of-the-blog Karen Coyle, who's been a consultant (or "bibliographic informant") to the Open Library project. She described the violent collision of library metadata with internet database programmers. Coyle's role in the project is not to provide direction, but to help the programmers decode arcane library-only syntax such as "ill. (some col)". The one instance where she tried to provide direction turned out to be something of a mistake. She insisted that, to allow proper sorting, the incoming data stream should try to keep track of the end of leading articles in title strings. So for example, "The Hobbit" should be stored as "(The )Hobbit". This proved to be very cumbersome. Eventually the team tried to figure out when alphabetical sorting was really required, and the answer turned out to be "never".
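To make the leading-article idea concrete, here is a minimal sketch of how a title stored as "(The )Hobbit" could yield both a display form and a sort key. The parenthesis markup is taken from the example above; the function names and the regular expression are my own illustration, not Open Library's code.

```python
import re

# Markup from the anecdote: a leading article wrapped in parentheses at the
# start of the stored title, e.g. "(The )Hobbit". The pattern and the helper
# names below are hypothetical, written only for illustration.
_LEADING_ARTICLE = re.compile(r"^\((?P<article>[^)]*)\)(?P<rest>.*)$")


def display_title(stored: str) -> str:
    """Return the title as a reader would see it."""
    match = _LEADING_ARTICLE.match(stored)
    if match is None:
        return stored
    return match.group("article") + match.group("rest")


def sort_key(stored: str) -> str:
    """Return a case-insensitive key that skips the marked article."""
    match = _LEADING_ARTICLE.match(stored)
    rest = match.group("rest") if match else stored
    return rest.casefold()


titles = ["(The )Hobbit", "Anathem", "(A )Tract of Time"]
print([display_title(t) for t in sorted(titles, key=sort_key)])
```

Every record passing through the system has to carry and respect this markup, which gives some feel for why the team found it cumbersome for a feature that turned out not to be needed.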

Open Library does not use data records at all; instead, every piece of data is typed with a URI. This architecture aligns with W3C web standards for the semantic web, and allows much more flexible searching and data mining than would be possible with a MARC record.

Finally, Groetsch reported on Google's metadata processing. They have over 100 bibliographic data sources, including libraries, publishers, retailers and aggregators of review and jacket covers. The library data includes MARC records, anonymized circulation data and authority files. The publisher and retailer data is mostly ONIX formatted XML data. They have amassed over 800 million bibliographic records containing over a trillion fields of data.

Incoming records are parsed into simple data structures which looked similar to Open Library's, but without the URI-ness. These structures are then transformed in various ways for Google's use. The raw metadata structures are stored in an SQL-like database for easy querying.

Groetsch then talked about the nitty-gritty details of data. For example, the listing of an author on a MARC record can only be used as an "indication" of the author's name, because MARC gives weak indications of the contributor role. ONIX is much better in this respect. Similarly, "identifiers" such as ISBN, OCLC number, LCCN, and library barcode number are used as key strings but are only identity indicators with varying strengths. One ISBN with a Chinese publisher prefix was found on records for over 24,000 different books; ISBN reuse is not at all uncommon. One librarian had mentioned to Groetsch that in her country, ISBNs are pasted onto a book to give it a greater appearance of legitimacy.
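One way to picture an identifier being only a weak "identity indicator" is to count how many distinct titles share it. The sketch below is my own illustration and not a description of Google's pipeline; the record layout (pairs of ISBN and title) and the placeholder identifiers are assumptions.

```python
from collections import defaultdict


def titles_per_isbn(records):
    """Count distinct (normalized) titles seen for each ISBN.

    `records` is an iterable of (isbn, title) pairs, a hypothetical
    layout chosen only for this sketch.
    """
    seen = defaultdict(set)
    for isbn, title in records:
        seen[isbn].add(title.strip().casefold())
    return {isbn: len(titles) for isbn, titles in seen.items()}


def weak_identifiers(records, max_titles=1):
    """Return ISBNs attached to more than `max_titles` distinct titles."""
    counts = titles_per_isbn(records)
    return sorted(isbn for isbn, n in counts.items() if n > max_titles)


# Placeholder data, not real ISBNs or real catalog records.
sample = [
    ("isbn-placeholder-1", "Gulliver's Travels"),
    ("isbn-placeholder-1", "gulliver's travels "),
    ("isbn-placeholder-2", "A Tract of Time"),
    ("isbn-placeholder-2", "Harry's War"),
    ("isbn-placeholder-2", "A Sporting Chance"),
]
print(weak_identifiers(sample))
```

An ISBN that turns up on thousands of different books would stand out immediately in a count like this, and could then be given little weight when matching records.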

Echoing comments from Coyle, Groetsch spoke with pride of the progress the Google Books metadata team has made in capturing series and group data. Such information is typically recorded in mushy text fields with inconsistent syntax, even in records from the same library.

The most difficult problem faced by the Google Books team is garbage data. Last year, Google came under harsh criticism for the quality of its metadata, most notably from Geoffrey Nunberg. (I wrote an article about the controversy.) The most hilarious errors came from garbage records. For example, certain ONIX records describing Gulliver's Travels carried an author description of the wrong Jonathan Swift. Most of these errors come from garbage records, and when one of these is found, almost always, the same problems can be found in other metadata sources. Google would like to find a way to get corrected records back into the library data ecosystem so that they don't have to fix them again, but there have been issues with data licensing agreements that still need to be worked out. Articles like Nunberg's have been quite helpful to the Google team. Every indication is that Google is in the metadata slog for the long term.

One questioner asked the panel what the library community should be doing to prevent "metadata trainwrecks" from happening in the future. Groetsch said without hesitation "Move away from MARC". There was nodding and murmuring in the audience (the librarian equivalent of an uproar). He elaborated that the worst parts of MARC records were the free text data, and normalization of data would be beneficial wherever possible.

One of the Google engineers working on record parsing, Leonid Taycher, added that the first thing he had had to learn about MARC records was that the "Machine Readable" part of the MARC acronym was a lie. (MARC stands for MAchine Readable Cataloging.) The audience was amused.

The last question from the audience was about the future role of libraries in production of metadata. Given the resources being brought to bear on the book metadata by OCLC, Google and others, should libraries be doing cataloguing at all? Karen Coyle's answer was that libraries should concentrate their attention on the rare and unique material in their collections- without their work, these materials would continue to be almost completely invisible.

Sunday, January 3, 2010

2020: Fewer Libraries, More Locations

In the middle of the block I live on, there are two fire hydrants right next to each other. The reason for this is that half the block is in the town of Glen Ridge, New Jersey, and half the block is in Montclair, New Jersey. In the past there were incidents when fire trucks from the two towns rushed to the scene of a fire alarm, only to get into lengthy discussions about which town had responsibility. That doesn't happen any more. In 1991, Glen Ridge closed its fire department and contracted with Montclair to supply fire protection services.

Will the same sort of consolidation happen with libraries? I think it will.

In my "Ten Predictions for the Next Ten Years" article, my first prediction was that the number of public libraries in 2020 would be half of what it is today. I also predicted that the number of public library locations would increase by 50%. I got plenty of feedback on Twitter that these predictions needed some explanation. Roy Kenagy thought that my prediction couldn't possibly apply to Iowa, where "new [libraries] sprout like weeds and people tend to them as their own".

There were two considerations, book digitization and the shift to e-books, that led me to these predictions, and neither is peculiar to New Jersey. I admit, though, that New Jersey's high taxes and density of services affected my estimate of the magnitude of coming changes.

Over the next ten years, book digitization will completely change the way most people use libraries. Instead of browsing the stacks or searching a catalog, people will increasingly make use of full text indexes and digitized resources to find books. This already happens with Google Books. They will then try to obtain the physical book in the library, or alternatively, use an e-book reader. Public libraries will need to adapt their physical plants to accommodate this changed usage pattern. Stacks will become more warehouse-like; public spaces will have fewer books and more coffee. Patrons will demand larger collections, but will accept less physical access to print. Home delivery of library materials will become much more common.

At the same time, libraries will struggle to adapt to the e-book economy. The most likely outcome will be a shift to licensed resources. Publishers will discover the benefits of putting much larger numbers of titles into e-book subscription packages such as those currently offered by Overdrive, Netlibrary, Ebrary, and others. When these packages can be used on patrons' Kindles and other e-readers, libraries will need to have them.

All of these trends will put pressure on libraries to work together on shared services, and ultimately to merge. Larger libraries will be more effective at delivering both print books and e-books, and patrons will care less about where the print books are stored when they're not being lent. Smaller libraries will find it difficult to support the technical and operational expertise needed to run the public library of 2020.

While the shift to digital media will cause library organizations to become larger and fewer through mergers, it will also allow branches to be effective at smaller sizes. Without the need to store a critical mass of books, tiny, storefront branches will become more practical and cost efficient. Guys in vans carrying books will become more important. When people go to their local branch, they'll be able to use the free Google Books terminal (libraries are to get one free for every building) or other computers, check out some books, then have a coffee and socialize for an hour or so until the van makes its hourly delivery. Or they'll do their shopping rounds and come back to pick up the bag of books waiting for them. Establishing branches in shopping areas is not only a smart thing for libraries to do, it's also very cost-efficient.

In my own town, it seems that almost every year there's talk of closing the branch to save money. If you look at it, you can see why- the building is massive and has to be very expensive to operate. Eventually it will be shuttered and sold, but a storefront branch down the block could deliver the same services and cost much less to run. Does it make sense for the town high school to run its own library? Not really, but that could be another branch. We'll have fewer libraries, but more locations.

While consolidation and mergers will reduce the number of libraries, it can't be ignored that public library budgets are being slashed, and some libraries are being closed for purely financial reasons. Part of this is that the perceived value of libraries is less than it used to be. Many critical information services that used to be available only through libraries are now readily available through the internet.

There's also the possibility that public library services could be outsourced. My town's library gets annual funding of $3.8 million, roughly $100/resident, paid through our property taxes. If, in 2020, people are mostly reading books on e-readers, how much will they be willing to spend on a library? Will people prefer to spend the $100 on a commercial e-book subscription? I'm not sure, but I'm guessing that some towns will go the outsourcing route.

As I've said before, I'm an optimist about the ability of libraries to adapt to changes in media. If you look carefully at my picture of the Montclair Public Library van guy, you'll notice that what he's collected from this drop box is a newspaper, some VCR tapes and a whole bunch of DVD's. The print books were in another box, and there was not a single e-book to be carried to the main library. Makes you wonder...

Update 1/8: Some follow-up here.

Saturday, November 14, 2009

The Book Rights Registry Unclaimed Works Fiduciary: Powerful Regent or Powerless Figurehead?

In college, I did physics problem sets with a study group that called themselves the "Fish Heads" after a song frequently played on the radio by Dr. Demento. We would start work after dinner on the night before the problem set was due, and we'd work till we were done, which was seldom before midnight and more usually like 3 or 4 AM.

I thought of the Fish Heads late last night while racing through the newly revised settlement agreement of the Google Book Search lawsuit. The parties to the lawsuit had already asked for, and received, a four-day extension, and you just knew they were going to stretch out their work to meet the midnight deadline with not much room to spare. Sure enough, at 11:45 PM EST came word that the revised agreement had been filed. A few minutes after midnight, I was racing through the document to find out what the changes were, tweeting along the way. James Grimmelmann and Ken Crews were doing the same thing in our different ways. It was really nerdy. Danny Sullivan was reporting on the Conference call with Dan Clancey, Paul Aiken and Richard Sarnoff.

Here's your basic reading list for Google Book Search Settlement Agreement 2.0:

Start with the New York Times summary (Brad Stone and Miguel Helft)
Then read Danny Sullivan's report on the Conference call.
Having gotten the big picture, read James Grimmelmann's instant analysis of the revised agreement.
Then graze through the coverage overview at Gary Price's Resource Shelf.

Having slept on it and having had some time to think it through, I have a bunch of questions, and they mostly focus on the one demon that has not been exorcised from the agreement, orphan works.

The revised agreement attempts to address the peculiar situation of orphan works by introducing a new entity, the Unclaimed Works Fiduciary (UWF) which, as part of the Book Rights Registry, is to act as a spokesman for the rightsholders of the unclaimed works. The key question for your problem set is this: is this new regime a powerful Regency over Orphandom, or is it a powerless Figurehead masking a Google Autocracy of Zombies?

Here is how the revised agreement defines the UWF:

Unclaimed Works Fiduciary. The Charter will provide that the Registry’s power to act with respect to the exploitation of unclaimed Books and Inserts under the Amended Settlement will be delegated to an independent fiduciary (the “Unclaimed Works Fiduciary”) as set forth in [other sections of the Agreement] and otherwise as the Board of Directors of the Registry deems appropriate. The Unclaimed Works Fiduciary will be a person or entity that is not a published book author or book publisher (or an officer, director or employee of a book publisher). The Unclaimed Works Fiduciary (and any successor) will be chosen by a supermajority vote of the Board of Directors of the Registry and will be subject to Court approval.

The section about the Registry Charter provides that

in the case of unclaimed Books and Inserts, the Unclaimed Works Fiduciary may license to third parties the Copyright Interests of Rightsholders of unclaimed Books and Inserts to the extent permitted by law.

James Grimmelmann calls that last sentence "words of equivocation". The reason is that he and other commentators think there is almost nothing that the law, absent an act of Congress, would allow the UWF to license to a third party. The rule of "Nemo dat" should apply- you can't give something away that isn't yours to give.

The Open Book Alliance goes even further. In a post somehow released earlier than the revised agreement, it calls the revised agreement a "sleight of hand" meant to distract people from Google's monopoly grab, its usurpation of Congress, its shredding of contracts, its destruction of libraries, ~~its bioterror weapons stockpile and its threatening the sanctity of marriage~~.

Michael Healy, who has been named Executive Director of the Book Rights Registry, which would be the home of the UWF, seems to have a different perspective. In a post on the Publishing Point website, Healy notes

The Registry will now include a Court-approved fiduciary who will represent rightsholders of unclaimed books, act to protect their interests, and license their works to third parties, to the extent permitted by law.
The new version of the settlement removes the “most favored nation” clause contained in the previous version. The Registry will now be able to license unclaimed works to other parties without ever extending the same terms to Google.

"Extent permitted by law" is a hard phrase to argue with. How could a settlement go any further? Grimmelmann's theory is that the phrase is meant to be an enticement to Congress to pass a narrow law aimed at neutralizing Google's exclusive access to orphan works exploitation.

A closer look at the UWF suggests that its other powers may be less constrained. Here's what it will be able to do, as enumerated by the revised agreement:

UWF may direct Google to change the classification of a Book to a Display Book ~~or to a No Display Book or to include in, or exclude any or all Unclaimed Works from, one or more of the Display Uses (note added- see comments)~~.
UWF may allow Google to
- alter the text of a Book or Insert when displayed to users;
- add hyperlinks to any content within a page of a Book or facilitate the sharing of Book Annotations
and may exclude from Advertising Uses one or more unclaimed Books if Google displays animated, audio or video advertisements in conjunction with those Books.
UWF may approve the use of additional or different Pricing Bins for unclaimed Books
UWF may:
- dispute Google’s categorization of a Book as Fiction
- allow Google to offer to users copy/paste, print or Book Annotation functionalities as part of Preview Uses; allow Google to conduct tests to determine if another Preview Use category increases sales and revenues of such Books
- adjust the Preview Use setting for a particular Book in exceptional circumstances for good cause shown.
UWF may authorize Google to make special offers of Books available through Consumer Purchases at reduced prices from the List Price.
the Unclaimed Works Fiduciary and Google may agree to one or more of the following additional Revenue Models for unclaimed works:
- Print on Demand (“POD”) - This service would permit purchasers to obtain a print copy of a non- Commercially Available Book distributed by third parties. A Book’s availability through such POD program would not, in and of itself, result in the Book being classified as Commercially Available.
- File Download. This service would permit purchasers of Consumer Purchase for a Book to download a copy of such Book in an appropriate file format such as PDF, EPUB or other format for use on electronic book reading devices, mobile phones, portable media players and other electronic devices (“File Download”).
- Consumer Subscription Models – This service would permit the purchase of individual access to the Institutional Subscription Database or to a designated subset thereof (“Consumer Subscription”).
UWF may license to third parties the Copyright Interests of Rightsholders of unclaimed Books and Inserts to the extent permitted by law. (discussed above.)
allow the Registry to use up to twenty-five percent (25%) of Unclaimed Funds earned in any one year that have remained unclaimed for at least five (5) years for the purpose of attempting to locate the Rightsholders of unclaimed Books.
UWF can challenge the classification of its Book or a group of its Books as In-Print or as Out-of-Print

All in all, it seems to me that the most significant power of the UWF is not the theoretical power to deal with third parties, but rather the power to control the display status of unclaimed works. (note added- see comments).

Under what circumstances might the UWF turn off display uses? Since the UWF is subject to the approval of the court, the court could, in principle, direct UWF to manage the unclaimed works to minimize antitrust issues. If that happened, Google's monopoly would not go much further than a release of liability for uses that might be considered fair use. Or, the UWF could use its leverage to force Google to open its unclaimed works scans to competitors.

On the other hand, the UWF, being selected by the Registry Board, and being dependent on the Registry for support, would have built-in incentives to enable revenue generating use by Google, not to mention its responsibilities to the orphan rights-holders.

In the end, whether the Unclaimed Works Fiduciary becomes a powerful Regent or a powerless Figurehead depends to a great extent on the Court's willingness to wield power. Good Luck, Denny Chin!

Ask a fish head anything you want to
they won't answer they can't talk.

Wednesday, November 11, 2009

The Uniqueness of Sentences and J. K. Rowling's (Non)Infringement of Tanya Tucker

Have you ever heard someone say something unusual and wonder to yourself if anyone in the history of humanity had ever said that before, ever? It happens a lot more than you might think.

In the discussion of my article on copyright salami, I suggested that copyright based on content as short as a sentence would not be very robust. I had reasoned that if the sentences were short enough, there would be a high probability that the same sentence had already appeared in a copyrighted work, or even in a work that was in the public domain. I imagined building huge databases of sentences that had already been used so as to clear them for reuse.

I decided to do some testing first. I chose a page at random (p. 447) from my (print) copy of J. K. Rowling's Harry Potter and the Deathly Hallows. I extracted the sentences, and put each sentence into Google and into Google Book Search. The results surprised me.

My first test sentence was

"Get - off - her!" Ron shouted.

With only 5 words, none of them uncommon, I expected to get a few close matches. The book search produced zero hits, and no results at all close. The general Google search was more interesting. Of the 7 hits, all of them exact matches, the top two appear to be properly attributed fair use quotations from the book. Two other hits were to complete, unauthorized copies of the book. One of these, on SlideShare, offers this disclaimer:

"hey here i got this book in pdf format .. am i violating anything after .. uploading this stuff over here ... just let me know .. if any issue come in existence, will remove it

Although the item has had 34,000 views, the pdf itself appears to have been removed from SlideShare. The pdf posted by a Filipino web designer on his web site, though, is still available (and has been since August) and is of quite good quality.

The oddest hits are to a site which masquerades as a "game ranking" portal site.

RPGRank is a real-time online game ranking system which provide a best MMORPG ranking portal for both players and games of all genre with the exclusive news, press release, review, preview, interview, trailer and vedio. RPGRank strive to provide all gamers things that they never experienced before by newest game beta keys, live-event, and online tournamentsa with attractive giveaways from games.

It appears that this site generates pages of random text for the benefit of search engines by extracting sentences from books and feeding the sentences to Google in a random order. This site has convinced Google to index "about 318,000" pages of its meaningless "content", and offers to sell "background" advertising space on the site at $1200 per month.

The last hit appears to be to a site which is presenting a Vietnamese translation of the book alongside the complete English text. Although I can't read Vietnamese, I doubt very much that it is authorized use. Vietnam joined the Berne convention only 5 years ago, so this is certainly an illegal infringement.

Of the 26 sentences on page 447, I could find only three that had been used in places that Google knows about. The first, "Leave him alone, leave him alone!" is a line from a Tanya Tucker song. The second, "Harry's stomach turned over.", has been used in James Edward Amesbury's "bloody but weakly conceived thriller", A Sporting Chance and in D. Edwards Bradley's Harry's War.

The third, "Harry did not answer immediately." is firmly in the public domain, having done duty as a complete sentence in Smith Hempstone's A Tract of Time, as a fragment in Frances Elizabeth G. Carey-Brock's 1867 My father's Hand: and Other Stories, and in Adam Williams' 2007 gripping adventure of modern China, The Dragon's Tail.

Three sentences comprising bits of dialog: "Been Stung", "And your first name?", and "Vernon Dudley", turned up numerous matches to fragments of sentences in Google. It was also amusing to see matches for the sentence "What happened to you, ugly?" This phrase matched two people-search sites which specialize in feeding Google pages with text like "What happened to Joe Smith?" Apparently there is someone who uses the screen name "you_ugly", and the people search engines just leapt to the wrong conclusions!

Most of the sentences on page 447 appear to be purely original to J. K. Rowling. Was she lucky, or were the odds stacked in her favor? Word frequencies for English have been measured, so we can easily generate a simplistic estimate of sentence occurrence rate. Ignoring the proper name "Ron", the words "Get", "off", "her" and "shout" have occurrence frequencies of 0.22%, 0.046%, 0.22%, and 0.0055%, respectively. Multiplying these occurrence rates gives us a weighted occurrence probability of this combination of 1 per 8 trillion. If you had the entire population of earth speaking random four-word English sentences they might come up with this combination in a day or two. Add "Ron" into the mix, and they might take the greater part of a year to generate the sentence J. K. Rowling wrote.
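For anyone who wants to check the arithmetic, here is a small sketch of that simplistic estimate. It treats the four words as independent draws and ignores word order, which is exactly the simplification made above; the frequencies are the ones quoted in the text, and the names are mine.

```python
from math import prod

# Word frequencies quoted in the text, converted from percent to fractions.
WORD_FREQUENCY = {
    "get": 0.22 / 100,
    "off": 0.046 / 100,
    "her": 0.22 / 100,
    "shout": 0.0055 / 100,
}


def combination_probability(words, frequency=WORD_FREQUENCY):
    """Multiply per-word frequencies, treating each word as independent."""
    return prod(frequency[word] for word in words)


p = combination_probability(["get", "off", "her", "shout"])
print(f"probability: {p:.3e}")
print(f"about 1 in {1 / p:.2e}")
```

The product comes to a little over 1.2e-13, or about 1 in 8 trillion.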

For context, it's interesting to guess at the total number of sentences that humanity has written or spoken. It's estimated that 100 billion humans have lived so far. If those humans spent 16 hours a day for an average of 65 years generating 3 sentences per minute, we'd be up to about 7 million trillion sentences. The real number is probably a factor of 100 to a thousand less (half of us are men, after all!). This estimate roughly agrees with estimates of others that all the words ever spoken could be archived using 10 exabytes of storage.
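The upper bound is just a product of the numbers in that paragraph, so it is easy to reproduce. The sketch below uses 365-day years (my assumption) and my own variable names; multiplied out directly, it comes to about 6.8 × 10^18, the same order of magnitude as the figure in the text.

```python
# Inputs taken from the paragraph above; 365-day years are an assumption.
HUMANS_EVER = 100_000_000_000
HOURS_PER_DAY = 16
SENTENCES_PER_MINUTE = 3
YEARS_PER_LIFE = 65
DAYS_PER_YEAR = 365

sentences_per_day = HOURS_PER_DAY * 60 * SENTENCES_PER_MINUTE
sentences_per_life = sentences_per_day * DAYS_PER_YEAR * YEARS_PER_LIFE
total_sentences = HUMANS_EVER * sentences_per_life

print(f"{sentences_per_day:,} sentences per person per day")
print(f"{total_sentences:.2e} sentences in total (upper bound)")
```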

Ten exabytes is not as much storage as it used to be. The Internet Archive currently has 0.003 exabytes; although Google is quite secretive about its hardware deployment, it seems likely that their current storage capacity is in excess of 10 exabytes. Yesterday, Google announced a pricing plan where they'll rent you 0.000016 exabytes for $4096 per year. I'll do the math for you. If you want to store everything anyone has ever said, Google will rent you the space for only $2.5 billion per year!
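Here is that math spelled out, as a sketch that assumes decimal units (so 0.000016 exabytes is 16 terabytes) and that the price scales linearly with no volume discount:

```python
# Figures from the paragraph above, in whole terabytes to avoid rounding.
# Decimal units are assumed: 1 exabyte = 1,000,000 terabytes.
TERABYTES_PER_EXABYTE = 1_000_000
ARCHIVE_EXABYTES = 10
PLAN_TERABYTES = 16           # 0.000016 exabytes
PLAN_PRICE_PER_YEAR = 4096    # US dollars

plans_needed = ARCHIVE_EXABYTES * TERABYTES_PER_EXABYTE // PLAN_TERABYTES
annual_cost = plans_needed * PLAN_PRICE_PER_YEAR

print(f"{plans_needed:,} plans at ${PLAN_PRICE_PER_YEAR:,} each")
print(f"${annual_cost:,} per year")
```

That is 625,000 plans, or a bit over $2.5 billion a year.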

Given that Google will soon have digitized a large fraction of the world's books, there are a few things we can learn from this exercise.

It will soon be very easy for Google to detect unauthorized copies of books in its index, and presumably to remove them. The benefit to publishers of doing this would hugely outweigh any damages they're suffering from the Google Books digitization program. Why have publishers overlooked getting this to happen as part of the agreement settling their lawsuit?
It will not be difficult for Google to accurately de-duplicate the Google Books index.
J.K. Rowling's hesitancy to release her books in ebook format is really, really stupid.

Before you get distracted with something useful, do this: pick about 5 random words, make a sentence from them, and become the first human ever to say that sentence. Depending on what you do next, you may also be the last!

Friday, October 23, 2009

Copyless Crowdscanning: How to Legally Index the World's Books

Here's how I know that I have engineering in my DNA. Whenever I hear something labeled as impossible, impractical or unlawful, I can't restrain myself from trying to think of ways around the physical, logistical and legal constraints that supposedly imply impossibility. "That", "is" and "impossible" are fighting words to an engineer. And that's why I've admired the proposed Google Books Settlement. By way of a spectacular feat of legal engineering, it has suggested a way to do the seemingly impossible- to build a database of all the world's books- in the face of the tremendous obstacle posed by an extremely messy legal situation.

But despite my admiration for the "engineering" involved in the settlement, there have always been some things I didn't like about it. And despite all that's been written about it, and the many aspects that people have objected to, I've never seen anyone voice my particular misgivings, perhaps because of their peculiar engineer's orientation.

The settlement uses a legal innovation to accomplish its goals. I don't like that (the "legal" part, not the "innovation" part). Many people have objected to the particular innovation that is used, arguing that this precedent could lead to a reign of tyranny and/or other cataclysm, but I've not seen any objection to the use of legal apparatus in the first place. I've often made the disclaimer here that I Am Not A Lawyer, but I've generally downplayed my ingrained bias for using technology rather than law to solve the world's problems.
The settlement seems to be based on a presumption that Google's database of all the world's books cannot be built without making copies. I don't like to assume things are impossible. I should also note that several of the arguments opposing the Google Books Settlement rely on exactly the same presumption!

As the months have dragged on and the postponements pile up, I'm thinking that my first objection is starting to make more and more sense. After thinking it over for over 6 months I'm starting to think that my second objection is also valid. The rest of this post describes how it might be possible to build a full-text database of all the world's books without doing any copyright-infringing copying. I'll call this scheme "Copyless Crowdscanning".

What got me started on this line of thought were some simple cost calculations I presented in my article on Dan Reetz' DIY book scanner. It made me realize that the idea of having hundreds of thousands of people scanning their books with cheap scanners was not out of the realm of possibility. The main barrier to assembling a database of all the world's books will no longer be the scanning, but rather the laws governing copyright. So my focus is on how to do crowdscanning so that copyrights are not infringed; the easiest way to do that is to not make any copies.

Here are the assumptions I start with. As I've been learning about copyright, I've learned that there will always be a copyright lawyer somewhere willing to contest any common-sense assumption about copyright, so it's important to start somewhere. First, I'm assuming that scanning a small number of pages of a book (suppose that number is 1% of the book) for the purpose of indexing those pages is not a violation of copyright, as long as I don't redistribute the scans and destroy them after I finish my indexing. The indices are things I should be able to keep and redistribute.

Second, I'm assuming that it is not a violation of copyright to redistribute single sentences from a book. So, for example, publishing the following sentence:

The punishment lay in knowing that you were putting all of that effort into letting a kind of intellectual poison infiltrate your brain down to its very roots.

is not a violation of Neal Stephenson's copyright to the book Anathem. A corollary of that is that if I shuffle the order of all the sentences in a book, I can redistribute that jumble without violating copyright.

Finally, I'm assuming that scanning and distributing the title page of a book and its verso cannot be a violation of copyright; such distribution would be necessary in many cases just to convey statements of fact and as such are not subject to copyright. I recognize that artwork on these pages may need excision.

Let's suppose that we had a large number of people participating in our database building project. Suppose for example, that 100,000 people participated. Each person would scan a small fraction of each book they owned, along with its title pages. The title pages would be submitted to a book identity server, which would return a book identifier. The rest of the page scans would be processed by software, and the scans would then be destroyed. The software would digitize the scans, then chop the pages into individual sentences. An index of the pages would be generated and submitted to an "index aggregation" service. The sentences would be shuffled and submitted to a "sentence serving" service.

After many people have made partial scans and submitted partial indices to the index aggregator, a complete index would emerge that can be used just as Google Book Search is used. The complete sentences would be provided by the sentence server to provide the context of the result sets.
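To make the division of labor concrete, here is a minimal sketch of the two services, assuming each partial index is a mapping from words to sets of book identifiers. The class names `IndexAggregator` and `SentenceServer` and their methods are hypothetical, and a real system would need ranking, stemming, and persistent storage.

```python
from collections import defaultdict


class IndexAggregator:
    """Merges partial word indices submitted by many participants."""

    def __init__(self):
        self.index = defaultdict(set)

    def submit(self, partial_index):
        for word, book_ids in partial_index.items():
            self.index[word] |= set(book_ids)

    def search(self, query):
        """Return the book ids whose indexed pages contain every query word."""
        words = query.lower().split()
        if not words:
            return set()
        result = set(self.index.get(words[0], set()))
        for word in words[1:]:
            result &= self.index.get(word, set())
        return result


class SentenceServer:
    """Stores shuffled sentences per book and serves them as result context."""

    def __init__(self):
        self.sentences = defaultdict(list)

    def submit(self, book_id, sentences):
        self.sentences[book_id].extend(sentences)

    def context(self, book_id, query):
        """Return stored sentences from one book that contain every query word."""
        words = query.lower().split()
        return [
            s for s in self.sentences.get(book_id, [])
            if all(w in s.lower() for w in words)
        ]
```

A search would first ask the aggregator which books match, then ask the sentence server for matching sentences from those books. Neither service holds page order, so neither can rebuild a page from what it stores.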

Note that neither the index aggregator nor the sentence server would be able to reconstitute a book or even the pages from a book. It seems to me that it should be possible to add some encrypted information and send the keys to yet another party so as to allow reconstitution of the pages in authorized circumstances, such as for use by people with disabilities. If you can't use the information to reconstitute the book, then it seems to me that no copy exists and no copyrights have been infringed.

If my assumptions are incorrect, then I should expect that Harper-Collins will soon be suing me for copyright infringement. I'll be sure to let you know. If they are correct, but there's some theory that would expose any of the crowdscanning participants to liability, then perhaps someone who Really-Is-A-Lawyer could elaborate in the comments. I recognize that copyless crowdscanning wouldn't be applicable without modification to things like art books, artwork in books, poetry collections, sheet music, periodicals, and reference works, but it would be a start. And it would make some engineers happy.

Update: Several people (including real lawyers) have commented to me that crowdscanning would not help much as an infringement defense if the result of the entire system had the effect of making the entire text available. I just want to emphasize that I think a system can be engineered so as to enable indexing while preventing text reconstruction and avoiding the use of copies.

Tuesday, October 13, 2009

The Revolution Will Be Digitized (By Cheap Book Scanners)

It's always a good sign when you meet a literary character at a conference. Last June, I wrote about meeting a Bilbo Baggins at the Semantic Technology Conference; on Friday I met a character out of a Neal Stephenson novel at D is for Digitize.

D is for Digitize was a small conference organized by James Grimmelmann of NYU Law School. It brought together legal luminaries with people from publishing, business, academia, advocacy, technology, and the press. It had been organized to coincide with the scheduled Fairness Hearing for the Google Book Search Settlement. As it turned out, the Fairness Hearing was postponed, to be replaced by a brief "status conference". The effect of the postponement on the conference was beneficial- with the Google settlement officially on the shelf, the participants were able to have real discussions on the future of book digitization without getting too bogged down in legal argument.

That future was brought into very clear focus by the two digital cameras in Daniel Reetz's do-it-yourself book scanner. Reetz's presentation and demonstration blew away everyone in the room. Like Stephenson's Waterhouse characters in Cryptonomicon and the Baroque Cycle, Reetz is a tinkerer and a liberator of information. He spent some time in Russia and became accustomed to the conveniences of digital books in a society that doesn't pay much attention to copyright laws. On his return home to North Dakota, he was shocked at the high price of textbooks and the low price of digital cameras. He resolved to build himself a book scanner and went dumpster diving for materials, then posted instructions online for how to make the scanner.

In May, he was awarded the Grand Prize (a laser cutter) in the Epilog Challenge, a competition sponsored by the manufacturer of a laser cutter to promote "open design" manufacturing. The laser cutter has enabled Reetz to refine his scanner design to use precision-cut plywood. His first third-generation scanner, which folds up neatly for portability, was finished just in time for him to bring to the conference. (He had fun getting it through airport security!)

Compared to robotic scanners such as the one manufactured by Kirtas, the DIY Book Scanner is strikingly simple. It is built with rubber bands, drawer sliders, white LEDs and two commercial off-the-shelf digital cameras. Some Russian friends of Reetz's have figured out how to hook into the camera's firmware so that scan acquisition can be triggered by pressing a single button. Open source software is used to do image management and post-processing. An operator turns the pages, and average throughput is about a thousand pages per hour. The total cost of the scanner parts is under $300, including cameras. Reetz has posted more pictures of his new scanner here.

Reetz is not the only one building cheap scanners based on his design. A small but vital community is growing around the open-source design. Although book publishers might unthinkingly assume that this group is primarily interested in book piracy, they would be wrong. Several people just want to read books they've purchased in print on their iPhones or Kindles. An engineering student in Arizona is reading-disabled and must digitize to be able to read his textbooks. One Indonesian man built a scanner with donated cameras because his town's property records had been damaged in a flood. More than one book aficionado has turned to scanning in response to a too-many-books spousal ultimatum.

For other perspectives on Reetz's presentation, see Harry Lewis' post at Blown to Bits and Robin Sloan's post at The Millions.

In my article on the impact of the Americans with Disabilities Act on selling non-accessible books, I speculated that as the cost of digitizing books drops, society's expectations for the bookselling industry would change. Now that I've come face-to-face with a cheap book digitizer, I realize that much will be transformed. For example, let's assume that an effective book digitizer can be built and deployed for $500. (Even if DIY turns out not to be the way this happens, commercial manufacturers such as ATIZ are likely to be able to meet similar price points.) Then the cost of putting a book scanner in 20,000 libraries would be $10,000,000. If these libraries digitized an average of even one book per day, they could digitize 10,000,000 books in two years. Since 10 books per day should be well within the capabilities of an inexpensive digitizer, the libraries should have no technical difficulties with digitizing 4 million books per month.
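The arithmetic behind these figures can be checked with a few lines of Python. The dollar amount, library count, and per-day rates come from the paragraph above; the 20 working days per month is an added assumption, chosen because it reproduces the 4 million figure (a 30-day month would give 6 million).

```python
SCANNER_COST_USD = 500
LIBRARIES = 20_000
TARGET_BOOKS = 10_000_000
WORKING_DAYS_PER_MONTH = 20  # assumption, not stated in the post

# Cost of one scanner in every library.
total_cost = SCANNER_COST_USD * LIBRARIES

# Days needed to reach the target at one book per library per day.
days_at_one_book = TARGET_BOOKS / (LIBRARIES * 1)
years_at_one_book = days_at_one_book / 365

# Monthly output at ten books per library per working day.
books_per_month_at_ten = LIBRARIES * 10 * WORKING_DAYS_PER_MONTH

print(f"Total scanner cost: ${total_cost:,}")
print(f"Days to {TARGET_BOOKS:,} books at 1/day: {days_at_one_book:.0f}")
print(f"Years at 1/day: {years_at_one_book:.2f}")
print(f"Books per month at 10/day: {books_per_month_at_ten:,}")
```

At one book per library per day, the 10,000,000-book target takes 500 days, which fits comfortably inside the two-year estimate.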

If libraries acquired the capability of digitizing millions of books per month, then Google's erstwhile monopoly on digitized out-of-print books could evaporate quickly in an appropriate legal environment. Rightsholders who have been angry at Google for working with libraries on digitization should think ahead to a future in which their works can be ripped, mixed, and burned by cheap book digitizers in millions of homes and offices. The world will be different.

In Stephenson's Cryptonomicon, Randy Waterhouse develops a data haven in a Pacific island country to evade crude laws governing cryptography. I hope that Daniel Reetz doesn't have to retreat to a digitization haven country to be able to bring the sensible benefits of book digitization to people who need it.

