Showing posts with label library automation. Show all posts

Friday, June 8, 2018

The Vast Potential for Blockchain in Libraries

There is absolutely no use for "blockchain technology" in libraries. NONE. Zip. Nada. Fuhgettaboutit. Folks who say otherwise are either dishonest, misinformed, or misleadingly defining "blockchain technology" as all the wonderful uses of digital signatures, cryptographic hashes, peer-to-peer networks, zero-knowledge proofs, hash chains and Merkle trees. I'm willing to forgive members of this third category of crypto-huckster because libraries really do need to learn about all those technologies and put them to good use. Call it NotChain, and I'm all for it.

It's not that blockchain for libraries couldn't work, it's that blockchain for libraries would be evil. Let me explain.

All the good attributes ascribed to magical "blockchain technology" are available in "git", a program used by software developers for distributed version control. The folks at GitHub realized that many problems would benefit from some workflow tools layered on top of git, and they're now being acquired for several billion dollars by Microsoft, which is run by folks who know a lot about that digital crypto stuff.
A Merkle tree. (from Wikipedia)

Believe it or not, blockchains and git repos are both based on Merkle trees, which use cryptographic hashes to indelibly tie one information packet (a block or a commit) to a preceding information packet. The packets are thus arranged in a tree. The difference between the two is how they achieve consensus (how they prune the tree).
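
To make the "indelibly tie" idea concrete, here is a minimal sketch in Python of packets linked by hash. The packet layout and the names `packet_id`, `make_packet` and `verify_chain` are illustrative assumptions; neither git nor any particular blockchain stores its data in exactly this form.

```python
import hashlib
import json


def packet_id(body):
    """Hash a packet's contents, which include its parent's id."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_packet(payload, parent=None):
    """Create a packet that points at its parent by hash (hypothetical layout)."""
    body = {"payload": payload, "parent": parent["id"] if parent else None}
    return {"id": packet_id(body), **body}


def verify_chain(packets):
    """Check that every id matches its contents and every parent link holds."""
    previous_id = None
    for packet in packets:
        body = {"payload": packet["payload"], "parent": packet["parent"]}
        if packet["parent"] != previous_id or packet["id"] != packet_id(body):
            return False
        previous_id = packet["id"]
    return True


first = make_packet("initial catalog record")
second = make_packet("corrected author name", parent=first)
history = [first, second]
print(verify_chain(history))  # True

# Altering an earlier packet means its contents no longer match its recorded id;
# recomputing the id instead would break the parent link stored in `second`.
tampered = dict(first, payload="forged catalog record")
print(verify_chain([tampered, second]))  # False
```

Either way the tampering is detectable, which is the property blockchains and git repos share.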

Blockchains strive to grow a single branch (thus, the tree becomes a chain). They reach consensus by adding packets according to the computing power of nodes that want to add a packet (proof of work) or to the wealth of nodes that want to add a packet (proof of stake). So if you have a problem where you want a single trunk (a ledger) whose control is allocated by wealth or power, blockchain may be an applicable solution.

Git repos take a different approach to consensus; git makes it easy to make a new branch (or fork) and it makes it easy to merge branches back together. It leaves the decision of whether to branch or merge mostly up to humans. So if you have a problem where you need to reach consensus (or disagreement) about information by the usual (imperfect) ways of humans, git repos are possibly the Merkle trees you need.

I think library technology should not be enabling consensus on the basis of wealth or power rather than thought and discussion. That would be evil.

Notes:
1. Here are some good articles about git and blockchain:


2. Why is "blockchain" getting all the hype, instead of "Merkle trees" or "git"? I can think of three reasons:

  1. "git" is a funny name.
  2. "Merkle" is a funny name.
  3. Everyone loves Lego blocks!
3. I wrote an article about what the library/archives/publishing world can learn from bitcoin. It's still good.

Monday, February 9, 2015

"Passwords are stored in plain text."

Many states have "open records" laws which mandate public disclosure of business proposals submitted to state agencies. When a state library or university requests proposals for library systems or databases, the vendor responses can be obtained and reviewed. When I was in the library software business, it was routine to use these laws to do "competitor intelligence". These disclosures can often reveal inner workings of proprietary vendor software that bear on information privacy and security.

Consider, for example, this request for "eResources for Minitex". Minitex is a "publicly supported network of academic, public, state government, and special libraries working cooperatively to improve library service for their users in Minnesota, North Dakota and South Dakota" and it negotiates database licenses for libraries throughout the three states.

Question number 172 in this Request for Proposals (RFP) was: "Password storage. Indicate how passwords are stored (e.g., plain text, hash, salted hash, etc.)."

To provide context for this question, you need to know just a little bit of security and cryptography.

I'll admit to having written code 15 years ago that saved passwords as plain text. This is a dangerous thing to do, because if someone were to get unauthorized access to the computer where the passwords were stored, they would have a big list of passwords. Since people tend to use the same password on multiple systems, the breached password list could be used, not only to gain access to the service that leaked the password file, but also to other services, which might include banks, stores and other sites of potential interest to thieves.

As a result, web developers are now strongly admonished never to save passwords as plain text. Doing so in a new system should be considered negligent, and could easily result in liability for the developer if the system security is breached. Unfortunately, many businesses would rather risk paying lawyers a lot of money to defend themselves should something go wrong than bite the bullet and pay some engineers a little money now to patch up the older systems.

To prevent the disclosure of passwords, the current standard practice is to "salt and hash" them.

A cryptographic hash function mixes up a password so that the password cannot be reconstructed. So, for example, the hash of 'my_password' is 'a865a7e0ddbf35fa6f6a232e0893bea4'. When a user enters their password, the hash of the password is recalculated and compared to the saved hash to determine whether the password is correct.
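
The store-then-compare flow can be sketched in a few lines of Python. MD5 is used here only because it is the algorithm behind the examples in this post, not because it is a good choice, and the function names are made up for illustration.

```python
import hashlib
import hmac


def hash_password(password):
    """Return the hex MD5 digest of a password (illustration only)."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def check_password(attempt, stored_hash):
    """Re-hash the attempt and compare it with the stored hash in constant time."""
    return hmac.compare_digest(hash_password(attempt), stored_hash)


# Only this digest is saved; the password itself is never written down.
stored = hash_password("my_password")

print(len(stored))                             # 32 (hex characters)
print(check_password("my_password", stored))   # True
print(check_password("wrong_guess", stored))   # False
```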

As a result of this strategy, the password can't be recovered. But it can be reset, and the fact that no one can recover the password eliminates a whole bunch of "social engineering" attacks on the security of the service.

Given a LOT of computer power, there are brute force attacks on the hash, but the easiest attack is to compute the hashes for the most common passwords. In a large file of passwords, you should be able to find some accounts that are breachable, even with the hashing. And so a "salt" is added to the password before the hash is applied. In the example above, a hash would be computed for 'SOME_CLEVER_SALTmy_password'. Which, of course, is '52b71cb6d37342afa3dd5b4cc9ab4846'.

To attack the salted password file, you'd need to know that salt. And since every application uses a different salt, each file of salted passwords is completely different. A successful attack on one hashed password file won't compromise any of the others.
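
Here is a minimal sketch of salting in Python. It goes a step beyond the single application-wide salt described above by generating a random salt for each user, and it swaps MD5 for the standard library's `hashlib.pbkdf2_hmac`; the 16-byte salt, the iteration count and the function names are assumptions chosen for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, chosen only for this sketch


def new_record(password):
    """Create a (salt, digest) pair with a fresh random salt for one user."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def check_password(attempt, salt, digest):
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", attempt.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)


alice = new_record("my_password")
bob = new_record("my_password")

# Same password, different salts: the stored digests differ, so a
# precomputed table of common-password hashes matches neither of them.
print(alice[1] != bob[1])
print(check_password("my_password", *alice))   # True
print(check_password("password123", *alice))   # False
```

The salt is stored next to the digest. It does not need to be secret to do its job, which is to make each stored hash unique.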

Another standard practice for user-facing password management is to never send passwords unencrypted. The best way to do this is to use HTTPS, since web browser software alerts the user that their information is secure. Otherwise, any server between the user and the destination server (there might be 20-40 of these for typical web traffic) could read and store the user's password.

The Minitex RFP covers reference databases. For this reason, only a small subset of services offered to libraries are covered here. The authentication for these sorts of systems typically doesn't depend on the user creating a password; user accounts are used to save the results of a search, or to provide customization features. A Minitex patron can use many of the offered databases without providing any sort of password.

So here are the verbatim responses received for the Minitex RFP:

LearningExpress, LLC
Response: "All passwords are stored using a salted hash. The salt is randomly generated and unique for each user."
My comment: This is a correct answer. However, the LearningExpress login sends passwords in the clear over HTTP.

OCLC
Response: "Passwords are md5 hashed."
My comment: MD5 is the hash algorithm I used in my examples above. It's not considered very secure (see comments). OCLC Firstsearch does not force HTTPS and can send login passwords in the clear.

Credo
Response: "N/A"
My comment: This just means that no passwords are used in the service.

Infogroup Library Division
Response: "Passwords are currently stored as plain text. This may change once we develop the customization for users within ReferenceUSA. Currently the only passwords we use are for libraries to access usage stats."
My comment: The user customization now available for ReferenceUSA appears at first glance to be done correctly.

EBSCO Information Services
Response: "EBSCOhost passwords in EBSCOadmin are stored in plain text."
My comment: I should note that EBSCOadmin is not an end-user-facing system. So if the EBSCO systems were compromised, only library administrator credentials would be exposed.

Encyclopaedia Britannica, Inc.
Response: "Passwords are stored as plain text."
My comment: I wonder if EB has an article on network security?

ProQuest
Response: "We store all passwords as plain text."
My comment: The ProQuest service available through my library creates passwords over HTTP but uses some client-side encryption. I have not evaluated the security of this encryption.

Scholastic Library Publishing, Inc.
Response: "Passwords are not stored. FreedomFlix offers a digital locker feature and is the only digital product that requires a login and password. The user creates the login and password. Scholastic Library Publishing, Inc does not have access to this information."
My comment: The "FreedomFlix" service not only sends user passwords unencrypted over HTTP, it sends them in a GET query string. This means that not only can anyone see the user passwords in transit, but log files will capture and save them for long-term perusal. Third-party sites will be sent the password in referrer headers. When used on a shared computer, subsequent users will easily see the passwords. "Scholastic Library Publishing" may not have access to user passwords, but everyone else will have them.

Cengage Learning
Response: "Passwords are stored in plain text."
My comment: Like FreedomFlix, the Gale Infotrac service from Cengage sends user passwords in the clear in a GET query string. But it asks the user to enter their library barcode in the password field, so users probably wouldn't be exposing their personal passwords.
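
To see why a GET query string is the worse of the two options mentioned in the FreedomFlix and Gale Infotrac comments, here is a small Python sketch that builds both kinds of request without sending anything. The endpoint and credentials are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and credentials, for illustration only.
LOGIN_URL = "http://example.com/login"
credentials = {"user": "patron", "password": "hunter2"}

# GET: the credentials become part of the URL itself, so they end up in
# server logs, browser history and referrer headers.
get_request = Request(LOGIN_URL + "?" + urlencode(credentials))

# POST: the credentials travel in the request body instead of the URL.
post_request = Request(LOGIN_URL, data=urlencode(credentials).encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```

Moving the password into a POST body keeps it out of URLs, but over plain HTTP it is still readable in transit; only HTTPS addresses that.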

So, to sum up, adoption of up-to-date security practices is far from complete in the world of library databases. I hope that the laggards have improved since the submission date of this RFP (roughly a year ago) or at least have plans in place to get with the program. I would welcome comments to this post that provide updates. Libraries themselves deserve a lot of the blame, because for the most part the vendors that serve them respond to their requirements and priorities.

I think libraries issuing RFPs for new systems and databases should include specific questions about security and privacy practices, and make sure that contracts properly assign liability for data breaches with the answers to these questions in mind.

Note: This post is based on information shared by concerned librarians on the LITA Patron Privacy Technologies Interest Group list. Join if you care about this.

Wednesday, September 24, 2014

Emergency! Governor Christie Could Turn NJ Library Websites Into Law-Breakers

Nate Hoffelder over at The Digital Reader highlighted the passage of a new "Reader Privacy Act" passed by the New Jersey State Legislature. If signed by Governor Chris Christie it would take effect immediately. It was sponsored by my state senator, Nia Gill.

In light of my writing about privacy on library websites, this poorly drafted bill, though well intentioned, would turn my library's website into a law-breaker, subject to a $500 civil fine for every user. (It would also require us to make some minor changes at Unglue.it.)
  1. It defines "personal information" as "(1) any information that identifies, relates to, describes, or is associated with a particular user's use of a book service; (2) a unique identifier or Internet Protocol address, when that identifier or address is used to identify, relate to, describe, or be associated with a particular user, as related to the user’s use of a book service, or book, in whole or in partial form; (3) any information that relates to, or is capable of being associated with, a particular book service user’s access to a book service."
  2. “Provider” means any commercial entity offering a book service to the public.
  3. A provider shall only disclose the personal information of a book service user [...] to a person or private entity pursuant to a court order in a pending action brought by [...] by the person or private entity.
  4. Any book service user aggrieved by a violation of this act may recover, in a civil action, $500 per violation and the costs of the action together with reasonable attorneys’ fees.
My library, Montclair Public Library, uses a web catalog run by Polaris, a division of Innovative Interfaces, a private entity, for BCCLS, a consortium serving northern New Jersey. Whenever I browse a catalog entry in this catalog, a cookie is set by AddThis (and probably other companies) identifying me and the web page I'm looking at. In other words, personal information as defined by the act is sent to a private entity, without a court order.

And so every user of the catalog could sue Innovative for $500 each, plus legal fees.

The only out is "if the user has given his or her informed consent to the specific disclosure for the specific purpose." Having a terms of use and a privacy policy is usually not sufficient to achieve "informed consent".

Existing library privacy laws in NJ have reasonable exceptions for "proper operations of the library". This law does not have a similar exemption.

I urge Governor Christie to veto the bill and send it back to the legislature for improvements that take account of the realities of library websites and make it easier for internet bookstores and libraries to operate legally in the Garden State.

You can contact Gov. Christie's office using this form.

Update: Just talked to one of Nia Gill's staff; they're looking into it. Also updated to include the 2nd set of amendments.

Update 2: A close reading of the California law on which the NJ statute was based reveals that poor wording in section 4 is the source of the problem. In the California law, it's clear that it pertains only to the situation where a private entity is seeking discovery in a legal action, not when the private entity is somehow involved in providing the service.

Where the NJ law reads
A provider shall only disclose the personal information of a book service user to a government entity, other than a law enforcement entity, or to a person or private entity pursuant to a court order in a pending action brought by the government entity or by the person or private entity.  
it's meant to read
In a pending action brought by the government entity other than a law enforcement entity, or by a person or by a private entity, a provider shall only disclose the personal information of a book service user to such entity or person pursuant to a court order.
Update 3 Nov 22: Governor Christie has conditionally vetoed the bill.

Monday, September 15, 2014

Analysis of Privacy Leakage on a Library Catalog Webpage

My post last month about privacy on library websites, and the surrounding discussion on the Code4Lib list, prompted me to do a focused investigation, which I presented at last week's Code4Lib-NYC meeting.

I looked at a single web page from the NYPL online catalog. I used Chrome developer tools to trace all the requests my browser made in the process of building that page. The catalog page in question is for The Communist Manifesto. It's here: http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto .

You can imagine how reading this work might have been of interest to government investigators during the early fifties when Sen. Joe McCarthy was at the peak of his power. Note that, following good search-engine-optimization practice, the URL embeds the title of the resource being looked at.

I chose the NYPL catalog as my example, not because it's better or worse than any other library catalog with respect to privacy, but because it's exemplary. The people building it are awesome, and the results are top-notch. I happen to know the organization is working on making privacy improvements. Please don't take my investigation to be a criticism of NYPL. But it was Code4Lib-NYC, after all.

As an example of how far ahead of the curve the NYPL catalog is, note that the webpage offers links to free downloads at Project Gutenberg. The Communist Manifesto is in the public domain, so any library catalog that tells you that no ebook is available is lying. The majority of library catalogs today lie about this.

So here are the results.

In building the Communist Manifesto catalog page, my browser contacts 11 different hosts from 8 different companies.
  • nypl.secure.bibliocommons.com
  • cdn.bibliocommons.com
  • api.bookish.com
  • contentcafe2.btol.com
  • www.google-analytics.com
  • www.googletagmanager.com
  • cdn.foxycart.com
  • idreambooks.com
  • ws.sharethis.com
  • wd-edge.sharethis.com
  • b.scorecardresearch.com
Each of these hosts is informed of the address of the web page that generates the request. They are told, essentially, "this user is looking at our Communist Manifesto page". Some of the hosts need this information to deliver the services they contribute. Others get the same information via the "referer" header generated as part of the HTTP protocol. If the catalog were served with the more secure protocol "HTTPS", the referer header would not be sent to any host contacted over plain HTTP.
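
The tally of hosts can be reproduced with a short Python sketch. The host list is copied from above; in practice it would come from the browser's developer tools. Grouping hosts by their last two labels is a naive assumption that happens to work for these names.

```python
from collections import defaultdict

# The eleven hosts listed above.
hosts = [
    "nypl.secure.bibliocommons.com",
    "cdn.bibliocommons.com",
    "api.bookish.com",
    "contentcafe2.btol.com",
    "www.google-analytics.com",
    "www.googletagmanager.com",
    "cdn.foxycart.com",
    "idreambooks.com",
    "ws.sharethis.com",
    "wd-edge.sharethis.com",
    "b.scorecardresearch.com",
]


def site_of(host):
    """Naively treat the last two labels of a hostname as its site."""
    return ".".join(host.split(".")[-2:])


by_site = defaultdict(list)
for host in hosts:
    by_site[site_of(host)].append(host)

print(len(hosts))    # 11
print(len(by_site))  # 9
```

Two of the nine domains, google-analytics.com and googletagmanager.com, belong to Google, which is how 11 hosts come down to 8 companies.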

The first of these is Bibliocommons. I've written about Bibliocommons before. They host the NYPL catalog "in the cloud". I'm not particularly concerned about Bibliocommons with respect to privacy, because they contract directly with NYPL, and I'm pretty sure that contracts are in place that bind Bibliocommons to the privacy policies in place at NYPL. But since HTTP is used rather than HTTPS, every host between me and the bibliocommons server can see and capture the URL of the web page I'm looking at. At the moment, I'm using the wifi in a Paris cafe, so the hosts that can see that are in the proxad.net, aas6453.net, level3.net, firehost.com and other domains. I don't know what they do with my browsing history.

I've previously written about the NYPL's use of the Bookish recommendation engine.  The BTOL.com link is for Baker&Taylor's "Content Cafe" service that provides book covers for library catalogs. I'm guessing (but don't know for sure) that these offerings have privacy policies that are aware of the privacy expectations of library users.

Yes, Google is one of the companies that NYPL tells about my web browsing. I'm pretty sure that Google knows who I am. A careful look at the Google Analytics privacy policy suggests that they can't share my browsing history outside Google. Unless required to by law.

Foxycart is not a company I was familiar with. They provide the shopping cart technology that lets me buy a book from the NYPL website and benefit them with part of the proceeds. I've been in favor of enabling such commerce on library sites because libraries need to do it to participate fully in the modern reading ecosystem. But it's still controversial in the library world.

Foxycart's privacy policy, like all privacy policies ever written, takes your privacy very seriously. Some excerpts:
When you visit this website, some information, such as the site that referred you to us, your IP and email address, and navigational and purchase information, may be collected automatically as part of the site’s operation. This information is used to generate user profiles and to personalize the web site to your particular interests. 
The information collected online is stored indefinitely and is used for various purposes. 
Cookies offer you many conveniences. They allow FoxyCart.com LLC, and certain third party content providers, to recognize information, and so can determine what content is best suited to your needs.  
We also reserve the right to disclose your personal information if required to do so by law, or in the good faith belief that such action is reasonably necessary to comply with legal process, respond to claims, or protect the rights, property or safety of our company, employees, customers or the public.

Here I need to explain about cookies. When a website gives you a cookie, it acquires the ability to track you across all the websites that company serves. This can be a great convenience for you. When you fill out a credit card form with your name and address, FoxyCart can remember it for you so you don't have to type it in again when you come back to order something else. You might find that creepy if the last order you placed was on a porn site. But while NYPL hasn't told FoxyCart anything that could identify you personally, your interaction with FoxyCart is such that you may well choose to identify yourself. And all that information is stored forever. And FoxyCart can pass that information to all the Sen. Joe McCarthys of 2020. As well as certain 3rd party content providers. FoxyCart probably doesn't give away your information today, but will they even be around in 2020?
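
The mechanics of cookie tracking can be sketched as a toy Python model. The class, the site names and the cookie format are all hypothetical; this is not how FoxyCart or any other company mentioned here is implemented.

```python
import uuid


class ThirdPartyService:
    """Toy model of a service embedded on many unrelated websites."""

    def __init__(self):
        self.visits = {}  # cookie id -> list of referring pages

    def handle_request(self, referer, cookie=None):
        """Log the referring page under the visitor's cookie, issuing one if needed."""
        if cookie is None:
            cookie = uuid.uuid4().hex
        self.visits.setdefault(cookie, []).append(referer)
        return cookie


service = ThirdPartyService()

# First page load: the browser has no cookie yet, so the service hands one out.
cookie = service.handle_request("http://library.example/item/communist_manifesto")

# Later, on an unrelated site that embeds the same service, the browser
# sends the cookie back and the two visits are tied to one profile.
service.handle_request("http://shop.example/checkout", cookie=cookie)

print(service.visits[cookie])
```

The service never needs a name to build the profile; a name gets attached the first time the visitor types one into a form the service can see.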

IdreamBooks syndicates book reviews. I don't know anything about them, and their homepage doesn't seem to have a privacy policy.

ScorecardResearch "conducts research by collecting Internet web browsing data and then uses that data to help show how people use the Internet, what they like about it, and what they don’t." They probably know whether I like ScorecardResearch. Their cookie is set by the ShareThis software.

ShareThis was one of the companies I mentioned in my last post. ShareThis provides social sharing buttons for the NYPL catalog. They also take your privacy very seriously. Some more excerpts:
In addition to the sharing service offered directly to users, the technology we use to assist with user sharing also allows us to gather information from publisher Web sites that include our ShareThis Sharing Icon or use our advertising technology, and enables ShareThis and our partner publishers and advertisers to use the value of the shared content and other information gathered through our technology to facilitate the delivery of relevant, targeted advertising (the ShareThis Services). 
we also receive certain non-personally identifiable information (e.g., demographic information such as zip code) from our advertisers, ad network and publisher partners, and we may combine this information with what we have collected. We also collect information from third-party Web sites with whom you have registered, like social networks, that those third parties make publicly available. 
While using the ShareThis Services, We may place third party advertisers’ and publishers’ cookies and pixels on their behalf regarding Usage Information. 
We are not responsible for the information practices of these third parties and the cookies placed by ShareThis on behalf of those third parties.
So ShareThis turns out to be in the business of advertising. They use your browsing behavior over thousands of websites to help advertisers target advertising and content to you. That scene in Minority Report where Tom Cruise gets personalized ads on the billboards he walks by? That's what ShareThis is helping to make happen today, and the NYPL website is helping them.
Ad Mall from Minority Report
They do this by cookie-sharing. In addition to setting a sharethis.com cookie, they set cookies for other companies, so they also get to know what you're reading. And when they do this, they enable other companies to connect your browsing behavior at NYPL with information you've provided to social networks. The result is that it's possible for a company selling Karl Marx merch to target ads to you based on browsing the Communist Manifesto catalog page.

But it's not like ShareThis is completely promiscuous. Their privacy agreement limits their cookie sharing to an exclusive group of advertising companies. Here's the beginning of the list:
  • 33across.png
  • accuen.png
  • Adap.png
  • adaramedia.com
  • adblade.com
  • addthis.com
  • adroll.com
  • aggregateknowledge.com
  • appnexus.com
  • atlassolutions.com
  • AudienceScience.com
That's just the A's.

In 1972, Zoia Horn, a librarian at Bucknell University, was jailed for almost three weeks for refusing to testify at the trial of the Harrisburg 7 concerning the library usage of one of the defendants. That was a long time ago. No longer is there a need to put librarians in jail.



Thursday, March 27, 2014

The Asterisk behind NYPL and Bookish

Joe Regal grew up in a family that moved around. Granger, Indiana.  Lewiston, New York. Towanda, Pennsylvania. In every town there was a library, which young Joe would seek out as a haven of virtual stability. Regal remembers that in Fairfield, Connecticut, he picked up Breakfast of Champions, because the cover looked like a cereal box. He opened it and was thrilled to discover, right there on page 5, the "drawing" of an anus/asterisk. And the text of Kurt Vonnegut's novel was even more subversive than the drawing.
"Vonnegut was one of those writers who made me feel less alone.  He also made me understand that it was OK to break the rules, because often the rules were insane.  That message - captured even in the asterisk/anus drawing, though of course more deeply, richly, and powerfully in the actual writing! - meant so much to me at 13, it's hard to convey or even fully remember the totality of it.  The freedom, the sense that you could explore without fear of punishment or retribution - that's a lot of what the library meant to me as a kid.  It's easy for us to forget as adults that a book can literally save your life. Or even on a more prosaic level, if there was literally no cost to taking out a book, I could take out anything without worrying whether it was right for me. I could browse, read a bit, take it out, get bored, return it."
As an adult, Joe Regal translated his passion for books to a successful career as a literary agent. He believed so deeply in Audrey Niffenegger's The Time Traveler's Wife that he ignored countless rejections until he found a publisher for it. ("I do not publish science-fiction." was the complete text of one rejection.)

As an agent, Regal could see first-hand what ebooks and Amazon were doing to the ability of authors, publishers and bookstores to sustain their livelihoods. He thought about what a seller of ebooks could and should be. There should be space for curation and community. Authors should be able to connect with readers. As he talked with others about his ideas, the concept of a new kind of website for ebooks began to take shape. (I got to know Regal and his family around this time.)

A few years later, Zola Books is a reality. Initially funded by friends of Regal (including Niffenegger), Zola has recently closed a $5.1 million seed round. The round includes a variety of authors and prominent individual investors led by Charles Dolan, founder of Cablevision and HBO. Even considering the funding, Zola's ambition is breathtaking. They've built a commerce platform like BN.com, a social platform like GoodReads, an HTML epub reader with proprietary DRM (not yet launched), and partner curation tools that are (stretching a bit) sort of a TripAdvisor for books. Not to mention a solid catalog of ebooks.

A recommendation engine has been a big space on the Zola development roadmap from the beginning. It's not easy technology, so when the recommendation engine built by Bookish became available (along with the Bookish website) at a fraction of its development cost, Zola, newly funded and in a hurry, snapped it up at a bargain-basement price.

The Bookish recommendation engine uses "finger-prints" of books in its algorithm. In other words, it works more like Pandora than like Netflix. The fingerprints are not just metadata and are not just text analysis, but use elements of both along with human-powered analysis.

Recommendations for Breakfast of Champions
On Monday, New York Public Library announced that it had integrated the Bookish-powered recommendation engine into their NYPL BiblioCommons-powered web catalog, fulfilling Regal's dream of being able to give back to the libraries he loved growing up, opening up unexpected books like Breakfast of Champions to new generations of readers.  The recommendations are live on the NYPL website, so you can decide for yourself if the recommendations are good or not. I found them to be intriguing, at least.

Apparently NYPL has been looking to add a recommendation feature to its website for a few years. They tracked potential partners along with Bookish to determine the best option, and had the benefit of seeing some advance demos before "Bookish Recommends" launched online. NYPL was impressed by Bookish's "big data back-end" and that it was not driven by sales; the number of titles it covered at the outset was impressive. NYPL will be assessing performance over the first year to ensure that the recommendations are valuable to readers.

According to Patrick Kennedy, Co-founder and President at BiblioCommons,
"The background to this story is the interest a number of libraries have shared with us in broadening their role as a source of book recommendations in their communities.  The initiative will allow for better visibility and sharing of librarian recommendations and reviews, the integration of other third-party recommendations databases such as LibraryThing and NoveList.  Our goal is provide a neutral platform that allows libraries to integrate the sources of their choice.  In all cases the integration API is made available by the third parties to BiblioCommons with the understanding that any library on the BiblioCommons platform may license the content."
Zola is hoping to make the Bookish API widely available to libraries and is considering a variety of licensing models. As Kennedy points out, there are recommendation services already available to libraries. The LibraryThing service (marketed by Bowker), is based on activities in the LibraryThing social network and is incredibly deep; the NoveList service from EBSCO takes a more traditional reader's advisory approach. The Bookish recommendation engine may not be based on sales the way Amazon's is, but if it doesn't help Zola sell ebooks, it will die. Can the mission of a library be advanced by using a tool whose ultimate purpose is to sell books? Or does it depend on the sort of bookseller behind the tool?

This conflict is probably why booksellers and libraries haven't been sharing as much book information infrastructure as you might expect. A library has different goals for a recommendation system than does a bookseller. Libraries need to steer users toward books in their collection that are less used, while booksellers need to present the user with the books they are most likely to buy. Which might ALWAYS be 50 Shades or Hunger Games.

But bookselling and libraries are both changing rapidly. With the big-box bookstore dying before their eyes, publishers are scrambling to find ways to continue putting books in front of readers. One possibility is that libraries will respond to this need and evolve a closer connection to commerce, and that booksellers will figure out how to tighten their connections to communities and their libraries. The alternative is that libraries and ebookstores grow apart to serve very different populations and needs – Amazon Prime and library subprime, if you will.

My guess is that libraries sharing infrastructure with booksellers will become the norm rather than the exception it is now. Monday's announcement by NYPL and Zola is more than just a website usability widget, it's about a vision of what libraries and booksellers can become. Zola has sent a love letter to the library world.

Notes

  1. Bookish.com started out as a joint venture of Penguin, Hachette, and Simon & Schuster. Bookish spent a vast amount of money developing the site.
  2. Competition between LibraryThing and Bookish might well lead to some changes. Bookish uses some content from LibraryThing, such as reviews, on its website. When Bookish launched, LibraryThing founder Tim Spalding wrote:
    "Besides reviews, Bookish has access to some other LibraryThing data, including edition disambiguation and recommendations. A glance at their recommendations, however, will show you that they're not using them "cold," but as some sort of factor."
  3. I wrote about BiblioCommons when they came out of stealth a few years ago. They've won the business of some very high profile public libraries, NYPL and Seattle Public Library included. They have the big benefit of starting from scratch with current web technology, and as a result have been innovating quickly.
  4. I took a look at how the integration was done. The Bookish API is straightforward REST and JSON with access keys. ISBN-based queries such as

    http://api.bookish.com/recapi/api/v1/recommendations?maxItems=15&token=<token>&apiKey=<key>&isbn13s=9780670024902

    return JSON like:
    [{
      "basic": {
        "isbn13": "9780671742515",
        "bookUrl": "http://www.bookish.com/books/long-dark-tea-time-of-the-soul-douglas-adams-9780671742515/<token>",
        "imageUrl": "http://images.bookish.com/covers/m/9780671742515.jpg",
        "title": "Long Dark Tea-Time of the Soul",
        "subtitle": "",
        "authors": ["Douglas Adams"]
      }
    }]


    The library-side integration done by BiblioCommons is ajaxy and JavaScript-based; a script in the browser calls the API, pulls out the ISBNs and sends them back to BiblioCommons, which checks for the recommended ISBNs in the catalog. A list of holdings is sent back to the browser for rendering. It looks like BiblioCommons itself does not call the Bookish API, which could lend itself to easier integration with other recommender APIs.
  5. Another interesting recommender system in the library world is bX from ExLibris. It's a usage based system focused on article links, rather than books. Currently, bX will return book recommendations based on articles, but doesn't provide recommendations based on books.
  6. Don't confuse Bookish.com, the company acquired by Zola Books, with booki.sh, the company acquired by Overdrive.
  7. Not that we haven't had this problem at unglue.it, but why does NYPL list Robert Egan as the author of the ebook version of Breakfast of Champions? (Update: Answer from Amy Geduldig at NYPL- "The catalog entry here refers to the play Breakfast of Champions by Robert Egan, which is based on the novel by Vonnegut, but in and of itself is a different work, which is why Egan is listed as the author. ")
  8. All the book links in this post point at the NYPL BiblioCommons catalog so you can try out Bookish Recommends for yourself.
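
The flow described in note 4 can be sketched in a few lines. This is only an illustration, assuming the response shape shown in the sample JSON above; the function names and the `holdings` set are hypothetical stand-ins, not anything taken from the Bookish or BiblioCommons code.

```python
import json

# Sample response in the shape shown in note 4 (values copied from the post).
SAMPLE_RESPONSE = """
[{
  "basic": {
    "isbn13": "9780671742515",
    "title": "Long Dark Tea-Time of the Soul",
    "subtitle": "",
    "authors": ["Douglas Adams"]
  }
}]
"""


def recommended_isbns(response_text):
    """Pull the isbn13 of every recommendation out of the JSON response."""
    items = json.loads(response_text)
    return [
        item["basic"]["isbn13"]
        for item in items
        if "isbn13" in item.get("basic", {})
    ]


def held_recommendations(response_text, holdings):
    """Keep only recommended ISBNs found in a (hypothetical) set of catalog holdings."""
    return [isbn for isbn in recommended_isbns(response_text) if isbn in holdings]


if __name__ == "__main__":
    # Assumed holdings set; a real catalog lookup would replace this.
    print(held_recommendations(SAMPLE_RESPONSE, {"9780671742515"}))
```

Because the catalog side only ever sees a list of ISBNs, the same filter would work with any recommender that can return ISBNs.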

Wednesday, July 27, 2011

Liking Library Data

If you had told me ten years ago that teenagers would be spending free time "curating their social graphs", I would have looked at you kinda funny. Of course, ten years ago, they were learning about metadata from Pokemon cards, so maybe I should have seen it coming.

Social networking websites have made us all aware of the value of modeling aspects of our daily lives in graph databases, even if we don't realize that's what we're doing. Since the "semantic web" is predicated on the idea that ALL knowledge can be usefully represented as a giant, global graph, it's perhaps not so surprising that the most familiar, and most widely implemented application of semantic web technologies has been Facebook's "Like" button.

When you click a Like button, an arc is added to Facebook's representation of your social graph. The arc links a node that represents you and another node that represents the thing you liked. As you interact with your social graph via Facebook, the added Like arc may introduce new interactions.

Google must think this is really important. They want you to start clicking "+1" buttons, which presumably will help them deliver better search. (You can try following me+, but I'm not sure what I'll do with it.)

The technology that Facebook has favored for building new objects to put in the social graph is derived from RDFa, which adds structured data into ordinary web pages. It's quite similar to "microdata", a competing technology that was recently endorsed by Google, Microsoft, and Yahoo. Facebook's vocabulary for the things it's interested in is called Open Graph Protocol (OGP), which could be considered a competitor for Schema.org.

My previous post described how a library might use microdata to help users of search engines find things in the library. While I think that eventually this will be a necessity for every library offering digital services, there are a bunch of caveats that limit the short-term utility of doing so. Some of these were neatly described in a post by Ed Chamberlain:
  • the library website needs to implement a site-map that search engines' crawlers can use to find all the items in the library's catalog
  • the library's catalog needs to be efficient enough to not be burdened by the crawlers. Many library catalog systems are disgracefully inefficient.
  • the library's catalog needs to support persistent URLs. (Most systems do this, but it was only ten years ago that I caused Harvard's catalog to crash by trying to get it to persist links. Sorry.)
But the clincher is that web search engines are still suspicious of metadata. Spammers are constantly trying to deceive search engines. So search engines have white-lists, and unless your website is on the white-list, the search engines won't trust your structured metadata. The data might be of great use to a specialized crawler designed to aggregate metadata from libraries, but there's a chicken and egg problem: these crawlers won't be built before libraries start publishing their data.

Facebook's OGP may have more immediate benefits. Libraries are inextricably linked to their communities; what is a community if not a web of relationships? Libraries are uniquely positioned to insert books into real world social networks. A phrase I heard at ALA was "Libraries are about connections, not collections".

Libraries don't need to implement OGP to put a like button on a web page, but without OGP Facebook would understand the "Like" to be about the web page, rather than about the book or other library item.

To show what OGP might look like on a library catalog page, I'll use the same example I used in my post on "spoonfeeding library data to search engines":
<html> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</body>
</html>

Open Graph Protocol wants the web page to be the digital surrogate for the thing to be inserted into the social graph, and so it wants to see metadata about the thing in the web page's meta tags. Most library catalog systems already put metadata in metatags, so this part shouldn't be horribly impossible.
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:og="http://ogp.me/ns#"
      xmlns:fb="http://www.facebook.com/2008/fbml"> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
<meta property="og:title" content="Avatar - Mysteries of Septagram #2"/>
<meta property="og:type" content="book"/>
<meta property="og:isbn" content="9780340930762"/>
<meta property="og:url"   
      content="http://library.example.edu/isbn/9780340930762"/>
<meta property="og:image" 
      content="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"/>
<meta property="og:site_name" content="Example Library"/>
<meta property="fb:admins" content="USER_ID"/>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"/>
</body>
</html>

The first thing that OGP does is to call out xml namespaces- one for xhtml, a second for Open Graph Protocol, and a third for some specific-to-Facebook properties. A brief look at OGP reveals that it's even more bare bones than schema.org; you can't even express the fact that "Paul Bryers" is the author of "Avatar".

This is less of an issue than you might imagine, because OGP uses a syntax that's a subset of RDFa, so you can add namespaces and structured data to your heart's desire, though Facebook will probably ignore it.
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:og="http://ogp.me/ns#"
      xmlns:fb="http://www.facebook.com/2008/fbml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/"> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
<meta property="og:title" 
      content="Avatar - Mysteries of Septagram #2"/>
<meta property="og:type" 
      content="book"/>
<meta property="og:isbn" 
      content="9780340930762"/>
<meta property="og:url"   
      content="http://library.example.edu/isbn/9780340930762"/>
<meta property="og:image" 
      content="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"/>
<meta property="og:site_name" 
      content="Example Library"/>
<meta property="fb:app_id" 
      content="183518461711560"/>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span rel="dc:creator">Author: 
    <span typeof="foaf:Person" 
        property="foaf:name">Paul Bryers
    </span> (born 1945)
 </span>
 <span rel="dc:subject">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"/>
</body>
</html>

The next step is to add the actual like button by embedding a javascript from Facebook:
<div id="fb-root"></div>
<script src="http://connect.facebook.net/en_US/all.js#appId=183518461711560&amp;xfbml=1"></script>
<fb:like href="http://library.example.edu/isbn/9780340930762/" 
       send="false" width="450" show_faces="false" font=""></fb:like>

The "og:url" property tells Facebook the "canonical" URL for this page, that is, the URL that Facebook should scrape the metadata from.
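
To get a feel for what a scraper sees when it reads such a page, here is a minimal sketch that collects the `og:` and `fb:` meta properties using Python's standard `html.parser` module. It is not Facebook's crawler, only an illustration; the class and function names are made up for this example, and the sample page is a trimmed copy of the markup above.

```python
from html.parser import HTMLParser

# Trimmed copy of the example catalog page from this post.
PAGE = """
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#">
<head>
<title>Avatar (Mysteries of Septagram, #2)</title>
<meta property="og:title" content="Avatar - Mysteries of Septagram #2"/>
<meta property="og:type" content="book"/>
<meta property="og:isbn" content="9780340930762"/>
<meta property="og:url"
      content="http://library.example.edu/isbn/9780340930762"/>
</head>
<body></body>
</html>
"""


class OpenGraphParser(HTMLParser):
    """Collects meta tags whose property starts with og: or fb: (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta .../> tags are routed here by the base class as well.
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith(("og:", "fb:")) and "content" in attr:
            self.properties[prop] = attr["content"]


def extract_open_graph(html_text):
    """Return a dict mapping each og:/fb: property to its content value."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    parser.close()
    return parser.properties


if __name__ == "__main__":
    for name, value in extract_open_graph(PAGE).items():
        print(name, "=", value)
```

Running something like this against a catalog page is a quick way to check that the meta tags say what you intended before a real crawler visits.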

Now here's a big problem. Once you put the like button javascript on a web page, Facebook can track all the users that visit that page. This goes against the traditional privacy expectations that users have of libraries. In some jurisdictions, it may even be against the law for a public library to allow a third party to track users in this way. I expect it shouldn't be hard to modify the implementation so that the script is executed only if the user clicks the "Like" button, but I've not been able to find a case where anyone has done this.

It seems to me that injecting library resources into social networks is important. The libraries and the social networks that figure out how to do that will enrich our communities and the great global graph that is humanity.

Thursday, June 30, 2011

3M's eBook Cloud Library Didn't Come Out of Nowhere!

When the Douglas County Libraries in Colorado installed self check-in stations a while ago, they realized that they had an opportunity to restructure their space. The circulation desk that dominated the main entrance was no longer needed. It seemed obvious to Library Director Jamie LaRue what to put in its place. Libraries need to greet their visitors with displays of books available for immediate checkout. 80% of Douglas County's adult circulation is generated by visual displays of books, so the best way to entice visitors to read is to show them great books to read.

When Douglas County began investigating how to put ebooks into county residents' computers, they wanted to do something similar. A user looking for ebooks should be greeted with a virtual bookshelf of books waiting to be checked out. LaRue was not satisfied with the offering of industry leader Overdrive because he couldn't do such a simple thing.

Public libraries that offer ebooks are frequently faced with problems posed by the strong demand for ebooks. Their users are frequently disappointed that the ebooks they want are always checked out. Overdrive has not yet implemented a programming interface that would allow library catalogs to check on an ebook's availability before showing it to a user, so the process of finding an available ebook can involve a lot of tedious clicks.

To address these needs, Overdrive has announced the "Overdrive WIN" service, which will address better integration with library automation software along with a host of other improvements and service innovations.

I spoke with a number of library automation vendors at this past weekend's American Library Association meeting in New Orleans. eBook integration is high on their customers' wish lists, but I couldn't find any that could tell me when they would be implementing better Overdrive integration, though many of them were in "discussions".

A new vendor worth mentioning was Toronto-based BiblioCommons, whose EC2-cloud-based OPAC service has been implemented by Seattle Public Library and is in beta with New York Public Library. I'd been hearing about BiblioCommons for long enough that I'd had my doubts as to their reality. At ALA, they demoed a clean, modern web interface with plenty of social features- go take a look at Seattle Public. Given NYPL's status as a prominent Overdrive customer and BiblioCommons' actively developing codebase, I had hoped to see some preview glimpses of Overdrive WIN in BiblioCommons, but had no such luck.

Back in Douglas County, Jamie LaRue wasn't satisfied with the available options, so around the end of 2010, he had his team approach their auto-check-in vendor, 3M, to see if they could do something about ebooks. As luck would have it, they could. And they did.

Although 3M's entrance into the library ebook platform business came as a complete surprise to many in libraries and publishing, it seems obvious in retrospect. 3M's RFID tag, self-checkout/checkin, and detection businesses were already integrated with library automation systems, so much of the code needed to integrate to library systems was already written. 3M licensed ebook reader and DRM systems from Adobe, and in the space of six months, with the advice and help of customers such as Douglas County, was able to assemble a strong set of services it is branding as the "3M Cloud Library". These include reader software for iOS and Android, as well as spiffy "3M Discovery Terminals", electronic kiosks "with an intuitive touch-based interface". (pictured) 3M is even going to sell "white-label" eReader devices with software tweaked to meet the needs of libraries that want to lend devices.

While 3M is arguably breaking new ground in integration of ebooks with library systems, 3M is far behind Overdrive in the area of publisher relations, which can't just be switched on in a mere 6 months. Overdrive has announced expansions of its offerings in the school and academic markets. Meanwhile, 3M is going in publishers' back doors as it helps the State of Kansas withdraw from an awkwardly drafted Overdrive contract, which Kansas says allows them to move purchased content from Overdrive to other platforms. It's in publishers' interests to have a library ebook channel that competes with Overdrive, but they do SO like to be asked permission first.

For his part, LaRue just wants to be able to tailor his library service to the needs of his community. "I want to provide a quality, integrated experience with a local focus" is what he told me. That doesn't seem to be asking so much.

Update 6/30/11: At The Digital Reader, Nate Hoffelder reported in May that a lot of 3M's reading platform was sourced from txtr, a German start-up they'd invested in. I wasn't able to confirm this at ALA, but have since done so. The Adobe DRM implementation, reading software, apps, presentation interfaces all originated in txtr. I'm also told by multiple sources that 3M has been talking to publishers since at least December 2010.

Tuesday, February 8, 2011

Toys and Tools vs. the Enterprise at Code4Lib

In 1991, the world's top researchers into hypertext met in a hotel in San Antonio, Texas. One poster presented there was entitled "An Architecture for Wide Area Hypertext", by a guy from CERN. Nine years later, I attended the same meeting in the same hotel. Attendees who had also been at the earlier meeting told me that the uniform reaction to the poster had been what I'd describe now as "meh". It was too simple, not enough expressive power. There was nothing new, nothing interesting. Who could possibly care about the physicist's stupid little toy hypertext system?

That physicist is now a Knight Commander of the Order of the British Empire, and Sir Tim's little toy system is today's World Wide Web. Here are a few of the enterprise-ready systems that the conference organizers of HT 1991 thought were more important than "the web":
  • Industrial Strength Hypermedia: Requirements for a Large Engineering Enterprise
  • Implementing Hypertext Database Relationships through Aggregations and Exceptions
  • Applications Navigator: Using Hypertext to Support Effective Scientific Information Exchange
You get the idea. If you make a list of the most important technologies for libraries and publishing today, your list will include a lot of things in addition to the web that started out as toys, and were derided as such for years after they became important building blocks. Linux, Unix, MySQL, Perl, Ruby, the Apache web server, and so on. Even Google started out with Lego blocks as key components. The list of software technologies that began as enterprise-class applications is smaller and less loved.

This morning at the Code4Lib Conference in Bloomington, Indiana, Brad Wheeler gave a welcoming talk. He's a Vice President for Information Technology and Chief Information Officer at Indiana University and Chairman and Co-Founder of the Kuali Foundation. He emphasized how important it is for libraries to submit to "volitional interdependence for macro solutions". I think he was trying to say that libraries need to pool their software development efforts and stop focusing on their local needs and peculiarities. But there was one thing he said that I strongly disagreed with. He said that libraries should move beyond building toys.

Wheeler's remark was in the context of a story about his impression of a Digital Library Federation (DLF) meeting two years ago. He thought the projects being reported there were too small and too focused on individual libraries. DLF has fresh new leadership in the person of Rachel Frick, and a new structure as part of CLIR, but I'm not sure that Wheeler's criticism of "the old DLF" is justified.

Libraries need to build more toys, especially in this time of tectonic changes in the ways that users interact with the information of the world. By toys, I mean simple experiments that do interesting things, or tools that focus on solving specific problems. The first day of Code4Lib was replete with examples of toy projects. Karen Coombs from OCLC showed 10 different toys, the most interesting a mashup of geocoded library subject headings with Google Maps. Josh Bishoff from the University of Illinois showed how the mess of links on his library's homepage could be distilled down to a nifty mobile webapp, complete with local bus schedules. Demian Katz showed how VuFind, a tool that has already made a transition from toy to essential tool, is being pushed to be even more flexible. Jay Luker, from ADS, did the same with Blacklight and Solr.

For me the highlight was Scott Hanrath's report on Anthologize, a WordPress plugin designed to turn blogs into ebooks. The initial work on Anthologize was done using a "one week one tool" process, and his account of how 12 strangers banded together to produce a working product in one week was truly inspiring. They even included 4 user interviews in developing their user experience, an achievement that proved to be very difficult to pull off because of the tight, parallelized development schedule.

The non-toy approach to software development was also on display. Tim McGeary of Lehigh University reported on the Kuali OLE project, which has attracted $5 million in funding from Mellon Foundation and others. OLE has an impressive, state-of-the-art, three-tiered, modular, buzzword-compliant software architecture and specification set. But after a year of work involving multiple committees, coding only started last week. (Update 2/9/2011: Kuali's funding for writing code has only been in place for 6 months) It all looks very good, but I've yet to be convinced that the end result will turn out to be what the market needs.

Development of things like OLE is expensive. Georgia PINES spent about a million dollars in successfully developing Evergreen, perhaps the most recent analog to OLE. The advantage to developing toys is that failure is not expensive. By spending so much on OLE, Kuali is putting a lot of eggs in one basket, and if their extensive committee work has failed to correctly predict what the market requirements in 5 years will be, they won't get another shot at it. In contrast, developing toys lets you try a lot of things out. Most will die, but the few of them that manage to solve sticky problems will get picked up and will grow into essential infrastructure.

Saturday, January 15, 2011

Why ProQuest Bought ebrary

Take a look at the New York Times homepage. Then take a look at CNN.com or MSNBC. How do you tell which website belongs to a newspaper and which ones belong to a television network? All of them have video. All of them have text. All of them have blogs and forums. As media moves onto the internet, the boundaries between old media genres begin to blur, and new forms emerge, optimized for the purposes they're being used for.

Just as delivery of news is being transformed by the Internet, the needs of students, researchers, and scholars are driving a similar boundary-blurring transformation in libraries. It's also driving a transformation in the companies that serve the library industry.

Marty Kahn, President of ProQuest, used the Times-CNN analogy to explain to me why his company had acquired ebrary, a leader in providing ebooks to academic, corporate, and other libraries. It no longer makes sense for a company to specialize in only journal articles, databases, or eBooks if it wants to be able to provide coherent and evolving solutions.

A look at ProQuest's existing product suite bears that out. With full-text journal databases, newspapers, dissertations, historical archives and government documents (including the CIS division recently acquired from LexisNexis) ProQuest was already able to integrate an impressive array of content. The Summon service from ProQuest's Serials Solutions unit, which centrally indexes a library's content, has experienced rapid growth, with sales at 200 institutions already. Still, the most common questions that Summon staff were fielding at ALA Midwinter surrounded the integration of ebooks into Summon. With the acquisition of ebrary, ProQuest can now answer that question authoritatively for at least one ebook vendor. (See my previous article focusing on Overdrive.)

Somehow, the topic of EBSCO and their recent acquisition of NetLibrary hardly came up in my talk with Kahn.  We spent a lot more time discussing Google. Between Google Search, Google Scholar and Google Books, Google also has the potential to present a comprehensive information solution for libraries. I often hear librarians expressing the sentiment that they need help from companies like ProQuest to present credible alternatives to Google and free sources available on the internet.

One thing Summon and other library search solutions have lacked is the ability to search the full text of the books in a library's collection. Put next to Google Books' full text plus metadata search, the metadata based search offered by a traditional library catalog can seem rather limited to most users. ebrary will bring with it a huge library of full-text book content for search within Summon.

ebrary was founded by high school friends Christopher Warnock and Kevin Sayar. Libraries were the focus from the very start. Warnock had left a job at Adobe Systems and was working on a project for Stanford University when Stanford University Librarian Mike Keller told him that in order to get paid, he had to incorporate. Warnock called up his friend Sayar, then an attorney at the legendary Silicon Valley law firm of Wilson Sonsini Goodrich & Rosati, and asked if he wanted to act on their high school dreams of starting a company together. The project at Stanford led to the conception of ebrary's initial service for libraries. (I've often heard the misconception that ebrary is somehow an Adobe funded spin-off, because of Warnock's father's role as a Founder of Adobe. In fact, Adobe and the elder Warnock had no role in starting ebrary.)

Warnock has always been passionate about ebrary's mission. "If every library acquired information digitally, all the world's information would be free to everybody", he told me. He is genuinely excited about what ebrary will be able to do as part of ProQuest. "Being part of ProQuest will allow us to realize our dreams".

Those dreams include the creation of a vast digital library with all kinds of content. ProQuest has "billions" of PDF documents, according to Warnock; ebrary's PDF indexing and search technologies are considered to be unsurpassed anywhere. Although ProQuest is not known for ebook distribution, there's not much difference between a book and a dissertation, if you think about it. ProQuest distributes 70,000 of those every year.

ebrary has been an innovator in business models as well as in technology. ebrary's initial model was to make ebooks available for free viewing; rights-holders were compensated using a micro-transaction model where subscribers were charged every time they did things such as print pages. Based on customer feedback, they shifted to a model where most content is available for use with one flat subscription fee. This year, they've begun to implement a patron-driven acquisition model.

Looking forward, Sayar will be running the ebrary business unit; Warnock will move to ProQuest to work on strategy. Given the ambitious vision outlined by Kahn, he has his work cut out for him.

The ebrary content platform has definitely gained some ardent advocates in libraries. I heard one librarian say "not only do we love ebrary, but our students love ebrary. They really do." At the end of the day, when we ask ourselves how libraries will respond to the dizzying changes in both information and economic landscapes and worry about what will happen, isn't love all we really need?

Sunday, January 9, 2011

Bridging the eBook-Library System Divide

Despite what you might have read on the blogs, libraries show no signs of imminent ebook-induced death. The latest data from Overdrive, the dominant provider of eBooks to public libraries, shows staggering growth. Digital checkouts doubled in 2010 to 15 million, looking at Overdrive alone. Based on the buzz at this weekend's American Library Association Midwinter Meeting, Overdrive should blow those numbers away in 2011. It seems that almost every librarian I've talked to here has decided to "take the plunge" into eBooks in a big way in 2011.

The ebook companies focused on academic libraries are experiencing the same growth- Ebook Library told me that for the past year their monthly sales have been double those of the prior year. The biggest plunge was taken by ProQuest, which announced their acquisition of ebook provider ebrary. (I’ll have a separate story on that later.)

To some extent, most libraries have been only sampling the ebook water, and despite noted usability issues and e-reader device fragmentation, patrons seem to want more and more and libraries are responding to patron demand. But not everyone is happy. One librarian told me, after a few beers, that “Overdrive sucks!” and then went on to use language unsuitable for a family-oriented blog.

As far as I can tell, there are two issues around Overdrive that are troubling libraries. One derives from the DRM system from Adobe that Overdrive uses. Adobe’s system is pretty much the only option for libraries and booksellers other than Amazon and Apple; Overdrive has no choice but to use this system in order to work with reader devices and software from Barnes&Noble, Sony and Kobo. The Internet Archive’s Brewster Kahle, in a panel on Saturday morning, slammed the Adobe system, even though it’s used by the Archive's OpenLibrary. In OpenLibrary's experience, users were able to complete a lending transaction in only 43% of their attempts. Overdrive is working to improve the smoothness of these transactions, and is introducing new support methods to make the process easier.

The second issue was discussed by library system vendor executives at Friday’s RMG President’s Panel. According to Polaris Library Systems President Bill Schickling, many of his customers are worried that their libraries will be marginalized by ebook providers like Overdrive. Although Overdrive offers extensive customization options for their ebook lending interface, libraries are still upset that patrons have to use separate interfaces for books and ebooks, one provided by Overdrive and the other provided by their ILS vendor. Libraries often think of the library system as their primary "brand extension" on the internet.

It seems a bit odd that this should be an issue. For years, libraries have lived with databases and electronic journals delivered from separate systems. But books are different. Libraries want ebooks and books to live side by side. It makes little sense to force a user who wants to read a Stieg Larsson novel to check in two places to see print and digital availability.

Overdrive is working overtime to address this second issue, it seems. Overdrive's CEO, Steve Potash, told me that his company is working on opening a set of APIs (application programming interfaces) that will allow system vendors, libraries and other developers to more deeply integrate Overdrive's ebook lending systems into other interfaces. Overdrive has needed these interfaces internally to build reading apps for Android, iPod and iPhone. Overdrive hopes to have an iPad-optimized reading app in Apple's iTunes store by the end of first quarter 2011, and will be working with selected development partners to work out many of the details. Potash hopes Overdrive will be able to unveil the APIs this summer at the ALA meeting in New Orleans.

The Overdrive APIs and the usability improvement they lead to should come as welcome news to libraries and library patrons everywhere. Library system vendors and developers in libraries will have a lot of work to do over the coming year.

And library patrons will be reading a lot of ebooks.

Wednesday, April 7, 2010

The Library IS the Machine

When librarians catalog a book, they do their best to describe a thing they have in their hands. The profession has been cataloging for a long time, and it tends to think that it's reduced the process to a science. When library catalogs became digital in the 1970's, the descriptions moved off of paper cards and into structured database records using a data format called MARC. That stands for MAchine Readable Cataloging, and as one Google engineer recently complained, "the MAchine Readable part of the name is a lie". The problem that Google's machines are having with these records is that the descriptions have always been meant for humans to read, not for computers to parse and understand.

Cataloging librarians are not stupid, and they've been working since the very beginning of digital cataloging to make their descriptions more useful to computers. They've introduced "name authority files" to bring uniformity to things like subject headings and author and publisher names. Unicode has brought uniformity to the encoding of non-roman characters and diacritics. XML has replaced some of the ancient delimiters and message length encoding. And perhaps most importantly, for a long time they've been embedding identifiers in the catalog records. Despite all this, library catalog records are still not as computer-friendly as they should be.

The move towards identifiers is worth special note. The use of identifiers in libraries dates to the first industrialization of libraries that took place in the 19th century. The classification systems of Melvil Dewey, Charles Ammi Cutter and the Library of Congress were all efforts to make library catalogs more friendly to machines. Except the machines weren't digital computers; the machines were the libraries themselves. From the shelves to the circulation slips, libraries were giant, human-powered information storage and retrieval machines. The classification codes are sophisticated identifier systems upon which the entire access system was based. So maybe MARC isn't a lie after all!

The rest of the world took a while to catch up on the use of identifiers. The US began issuing social security numbers in 1936, but it wasn't until the adoption of ISBN in 1966 and ISSN in 1971 that the entire publishing industry began to use identifiers to more efficiently manage its sales, delivery and tracking of products.

The same properties that made identifiers useful in physical libraries make them essential for digital databases. Identifiers serve as keys that allow records in one table to be precisely sorted and matched against records in other tables. Well-designed identifier systems provide assurances of uniqueness: there may be many people with the same name as me, but I'm the only one with my social security number.
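Here is a minimal sketch of an identifier doing the work of a key. The two tables, their field names and the `match_on_identifier` helper are all invented for illustration; a real system would do this with a database join.

```python
# Hypothetical data: a catalog table and a holdings table that share ISBN keys.
catalog = [
    {"isbn": "9780306406157", "title": "Example Title A"},
    {"isbn": "9781861972712", "title": "Example Title B"},
]
holdings = [
    {"isbn": "9781861972712", "branch": "Main", "copies": 2},
    {"isbn": "9780306406157", "branch": "East", "copies": 1},
    {"isbn": "9780000000002", "branch": "Main", "copies": 1},  # no catalog record
]


def match_on_identifier(left, right, key):
    """Match records in `right` against records in `left` on a shared identifier."""
    index = {row[key]: row for row in left}  # identifier -> record lookup
    matched, unmatched = [], []
    for row in right:
        partner = index.get(row[key])
        if partner is None:
            unmatched.append(row)  # identifier is unknown on the other side
        else:
            matched.append({**partner, **row})  # merge the two records
    return matched, unmatched


matched, unmatched = match_on_identifier(catalog, holdings, "isbn")
```

Because the lookup is an exact match on the identifier, no fuzzy comparison of titles or names is needed; the cost is that a record with a missing or mistyped identifier lands in `unmatched`.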

Nowadays, it sometimes seems that almost any problem in the information industries is being solved by the introduction of a new identifier. Building on the success of ISBN and ISSN, there are efforts to identify works (ISTC), authors (ORCID, ISNI), musical notations (ISMN), organizations (SAN), recordings (ISRC), audio-visual works (ISAN), trade items (UPC) and many other entities of interest. We live in an age of identifiers.

The apotheosis of identifiers has been achieved in the Linked Data movement. The first rule of Linked Data is to give everything (subjects, objects, and properties) its own URI (Uniform Resource Identifier). By putting EVERYTHING in one global space of identifiers, it is expected that myriad types of knowledge and information can be made available in uniform and efficient ways over the internet, to be reused, recombined, and reimagined.
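To make that concrete, here is a rough sketch of statements whose subjects, properties and (where possible) objects are all URIs. The `http://example.org/` URIs and the `subjects_with` helper are made up for illustration; the two property URIs come from the Dublin Core terms vocabulary. Plain tuples stand in for a real RDF library.

```python
DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace
EX = "http://example.org/"             # hypothetical namespace for our own things

# Each statement is a (subject, property, object) triple.
triples = {
    (EX + "book/1", DCTERMS + "title", "An Example Book"),
    (EX + "book/1", DCTERMS + "creator", EX + "person/7"),
    (EX + "book/2", DCTERMS + "title", "Another Example Book"),
    (EX + "book/2", DCTERMS + "creator", EX + "person/7"),
}


def subjects_with(prop, obj):
    """Return every subject that has the given property pointing at the given object."""
    return sorted(s for (s, p, o) in triples if p == prop and o == obj)


# Both books point at the same person URI, so they are found together
# without comparing any name strings.
books_by_person_7 = subjects_with(DCTERMS + "creator", EX + "person/7")
```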

What's often glossed over during the adoption of identifiers is their fundamental pragmatism. The association between any identifier and the real-world object it purports to identify is a thinly veneered but extremely useful social fiction which doesn't approach mathematical perfection. Even very good identifier systems can fail as much as 1% of the time, and automated systems that fail to recognize and accommodate the possibility of identifier failure exhibit brittleness and become subject to failure themselves. Still, 99% of perfect works perfectly fine for a lot of things.
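One way a system can accommodate identifier failure is sketched below, under some assumptions of mine: the identifier is an ISBN-13, the check-digit test catches a mistyped number, and a normalized title serves as the fallback description. The `resolve` helper and the record layout are hypothetical. A check digit only catches some errors; a well-formed ISBN attached to the wrong book sails right through.

```python
def normalize_isbn(raw):
    """Strip the hyphens and spaces that people put in ISBNs."""
    return raw.replace("-", "").replace(" ", "")


def isbn13_is_valid(raw):
    """Return True if the string passes the ISBN-13 check-digit test."""
    isbn = normalize_isbn(raw)
    if len(isbn) != 13 or not (isbn.isascii() and isbn.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ...; the weighted sum must be a multiple of 10.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(isbn))
    return total % 10 == 0


def resolve(record, index):
    """Look a record up by ISBN; fall back to its description if the identifier fails."""
    isbn = normalize_isbn(record.get("isbn", ""))
    if isbn13_is_valid(isbn) and isbn in index:
        return index[isbn], "identifier"
    # Hypothetical fallback: match on a normalized title instead of giving up.
    wanted = record.get("title", "").strip().lower()
    for candidate in index.values():
        if wanted and candidate["title"].strip().lower() == wanted:
            return candidate, "description"
    return None, "unmatched"


index = {"9780306406157": {"title": "Example Title A"}}
good = resolve({"isbn": "978-0-306-40615-7", "title": "whatever"}, index)
typo = resolve({"isbn": "9780306406158", "title": " example title a "}, index)
```

Here `good` is found through the identifier, while `typo`, whose last digit is wrong, is still found through its description rather than silently dropped.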

A decade ago, the world of libraries and the publishers that supply them embarked on an effort to link together the citations in journal articles and the bibliographic databases essential to libraries with the cited articles in e-journals and full text databases. Two complementary paths were pursued. One effort, OpenURL, sent bibliographic descriptions inside hyperlinks, and relied on intelligent agents in libraries to provide users with institution-specific and relevant links. The other, CrossRef, built identifiers for journal articles into a link redirection system. Together, OpenURL and CrossRef built on the strengths of the description and identification approaches and do a reasonably good job serving a wide range of users, including those in libraries.
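The two approaches look roughly like this in practice. This is a sketch, not anyone's production code: the resolver address, the citation and the DOI are made up, and the helper names are mine. The parameter names follow the OpenURL 1.0 (Z39.88-2004) key/encoded-value format for journals, and the DOI link uses the public `doi.org` redirector.

```python
from urllib.parse import urlencode, quote, parse_qs, urlsplit

# Hypothetical resolver address; each library runs its own link resolver.
RESOLVER_BASE = "https://resolver.example.edu/openurl"


def openurl_for_article(citation, doi=None):
    """Description approach: pack the citation's fields into a hyperlink."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
    }
    for field in ("atitle", "jtitle", "aulast", "date", "volume", "spage"):
        if citation.get(field):
            params["rft." + field] = citation[field]
    if doi:
        params["rft_id"] = "info:doi/" + doi  # an identifier can ride along too
    return RESOLVER_BASE + "?" + urlencode(params)


def doi_link(doi):
    """Identifier approach: hand the DOI to a redirection service."""
    return "https://doi.org/" + quote(doi, safe="/")


# A made-up citation and DOI, purely for illustration.
example_citation = {
    "atitle": "An Example Article",
    "jtitle": "Journal of Examples",
    "aulast": "Doe",
    "date": "2009",
    "volume": "12",
    "spage": "345",
}
example_doi = "10.1234/example.5678"

link = openurl_for_article(example_citation, doi=example_doi)
query = parse_qs(urlsplit(link).query)  # what the library's resolver would see
```

The OpenURL carries enough description for a library's resolver to decide where to send its own users, while the DOI link always goes wherever the publisher has registered.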

Now, however, the slow but sure development of semantic web technologies and deployment of Linked Data has spurred both CrossRef's Geoff Bilder and the OCLC's Jeff Young (OCLC runs the OpenURL Maintenance Agency) to examine whether CrossRef and OpenURL need to make changes to take advantage of wider efforts. In another post, I'll look at this question more closely, but for now, I'd like to comment on what we've learned in the process of building article linking systems for libraries.

1. Successful linking requires both identification and description. The use of CrossRef by itself did not have the flexibility that libraries needed; CrossRef addressed this by making its bibliographic descriptions available to OpenURL systems. Similarly, the OpenURL's ability to embed CrossRef identifiers (DOIs) inside hyperlinks has made OpenURL linking much more accurate and effective.

2. Successful linking is as much about knowing which links to hide as about link discovery. Link discovery and link computation turn out not to be so hard. Keeping track of what is and isn't available to a user is much harder.

3. Bad data is everywhere. If a publisher asks authors for citations, 10% of the submitted citations will be wrong. If a librarian is given a book to catalog, 10% of the records produced will start out with some sort of transcription error. If a publisher or library is asked to submit metadata to a repository, 10% of the submitted data will have errors. It's only by imposing the discipline of checking, validating and correcting data at every stage that the system manages to perform acceptably.
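The second lesson, knowing which links to hide, can be sketched as a filter. Everything here is an assumption for illustration: the provider names, the idea that a subscription is just a range of years, and the `visible_links` helper. Real knowledge bases track far messier entitlements.

```python
def visible_links(candidates, subscriptions, year):
    """Keep only the links that the user's institution can actually open."""
    shown = []
    for link in candidates:
        coverage = subscriptions.get(link["provider"])
        if coverage is None:
            continue  # no subscription with this provider: hide the link
        first_year, last_year = coverage
        if first_year <= year <= last_year:
            shown.append(link)  # article's year falls inside the coverage
    return shown


# Hypothetical candidate links for one article published in 2001.
candidates = [
    {"provider": "PublisherSite", "url": "https://publisher.example.com/a/1"},
    {"provider": "Aggregator", "url": "https://aggregator.example.com/a/1"},
    {"provider": "OtherHost", "url": "https://other.example.com/a/1"},
]
# Hypothetical subscriptions: provider -> (first year covered, last year covered).
subscriptions = {
    "PublisherSite": (1997, 2010),
    "Aggregator": (2005, 2010),
}

shown = visible_links(candidates, subscriptions, 2001)
```

Finding the three candidate links is the easy part; the `subscriptions` table is the part that has to be maintained, per institution, as deals change.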

Linking real world objects together doesn't happen by magic. It's a lot of work, and no amount of RDF, SPARQL, or URI fairy dust can change that. The magic of people and institutions working together, especially when facilitated by appropriate semantic technologies, can make things easier.


Tuesday, March 30, 2010

$25 eBook Reader Application Scenarios

My sixth grader goes to school with a 14 pound backpack. A few years ago, Consumer Reports weighed backpacks at three New York schools and found that sixth graders had the heaviest backpacks, averaging over 18 pounds. A lot of that weight is textbooks, and there's a lot of concern that kids are hurting themselves by carrying around so much stuff.

The Kindle 2 weighs only 9 ounces; shoppers will take home 22 oz. iPads starting this Saturday. How long will it be before schools start issuing ebook readers instead of textbooks?

The big issue, of course, is cost. In my last post, I compared ebook readers to digital watches and other consumer electronics products that saw dramatic price reductions in the years following their introduction. It is inevitable that ebook reader prices will also come down to a point where they can find new applications such as textbooks for school children.

Another possible application is libraries. I've written several times about the difficulties ebooks pose for libraries, but I've not discussed a scenario that's becoming increasingly popular: libraries loaning ebook readers to patrons.

Most libraries that have tried ebook reader lending have found the programs to be popular with patrons. Typically a number of Kindles are loaded with a set of ebooks; sometimes all the Kindles have the same collection; sometimes different books are loaded onto different Kindles and somehow the library has to track which Kindles have which books. Patrons have to be instructed not to use the library Kindle to buy extra books. Unfortunately libraries don't have the budgets they would need to scale these programs.

So far, though, there's not been an ebook reader or reader loading system designed with library lending in mind. Imagine that the readers have dropped to $25 apiece. At that price, it would make sense to issue library reader devices (with a deposit) instead of library cards. If the library circulation system were designed specifically for use with dedicated reader devices, a patron could have access to a universe of books while in the library building; there would likely be a limit on the number that could be taken home. The reader device and circulation system would be designed so as to allay the legitimate concerns that publishers have with ebook distribution by libraries.

Another possibility is that content could be locked onto cheap reader devices. Imagine going to Target ten years from now, and instead of seeing stacks of the latest After Twilight Saga hardcover at the checkout, imagine seeing stacks of ebook readers preloaded with all ten novels in the Twilight and After Twilight series. Locking the content onto the reader device would enable all the reuse and resale that's possible with print books today: the buyer could lend the reader to friends, sell it to a used book shop, or just keep it on a "book"-shelf in its attractive cover.

Each of these scenarios supposes that ebook readers will evolve to become increasingly inexpensive single-function devices like the Kindle, and that they will diverge from general-purpose media consumption devices like the iPad. A device designed specifically for reading will deliver a better reading experience at a lower price than one designed to support 3D video and gaming.

If you disagree, consider this question: How much reading would my sixth grader be doing if all his textbooks were issued on a gaming machine?