<?xml version="1.0" encoding="utf-8" ?>

<rss version="2.0" 
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:admin="http://webns.net/mvcb/"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
   xmlns:wfw="http://wellformedweb.org/CommentAPI/"
   xmlns:content="http://purl.org/rss/1.0/modules/content/"
   >
<channel>
    
    <title>Coffee|Code : Dan Scott - Structured data</title>
    <link>https://coffeecode.net:443/</link>
    <description>Caffeinated Librarian Geek</description>
    <dc:language>en</dc:language>
    <generator>Serendipity 1.6.2 - http://www.s9y.org/</generator>
    
    

<item>
    <title>Library and Archives Canada: Planning for a new union catalogue</title>
    <link>https://coffeecode.net:443/archives/302-Library-and-Archives-Canada-Planning-for-a-new-union-catalogue.html</link>
            <category>Coding</category>
            <category>Structured data</category>
    
    <comments>https://coffeecode.net:443/archives/302-Library-and-Archives-Canada-Planning-for-a-new-union-catalogue.html#comments</comments>
    <wfw:comment>https://coffeecode.net:443/wfwcomment.php?cid=302</wfw:comment>

    <slash:comments>1</slash:comments>
    <wfw:commentRss>https://coffeecode.net:443/rss.php?version=2.0&amp;type=comments&amp;cid=302</wfw:commentRss>
    

    <author>dan@coffeecode.net (Dan Scott)</author>
    <content:encoded>
    &lt;p&gt;&lt;strong&gt;Update 2015-03-03&lt;/strong&gt;: Clarified (in the Privacy section) that only NRCan runs Evergreen.&lt;/p&gt;
&lt;p property=&quot;description&quot;&gt;I attended a meeting with Library and Archives Canada today in my role as an &lt;a href=&quot;http://accessola.org&quot;&gt;Ontario Library Association&lt;/a&gt; board member to discuss the plans around a new Canadian union catalogue based on OCLC&#039;s hosted services. Following are some of the thoughts I prepared in advance of the meeting, based on the relatively limited materials to which I had access. (I will update this post once those materials have been shared openly; they include rough implementation timelines, perhaps the most interesting being that the replacement system is not expected to be in production until August 2016.) Let me say at the outset that there were no solid answers on potential costs to participating libraries, other than that LAC is striving to keep the costs as low as possible.&lt;/p&gt;

&lt;h3&gt;Basic question: What form does LAC envision the solution taking?&lt;/h3&gt;
&lt;p&gt;Will it be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&quot;Library and Archives Canada begins adding records and holdings to WorldCat&quot; as listed for many other countries in http://www.oclc.org/worldcat/catalog/national/timeline.en.html;&lt;/li&gt;
&lt;li&gt;Or a separate, standalone but openly searchable WorldCat Local catalogue that Canadians can use like the &lt;a href=&quot;http://adamnet.worldcat.org&quot;&gt;Dutch&lt;/a&gt; or &lt;a href=&quot;http://fablibraries.worldcat.org/&quot;&gt;United Kingdom&lt;/a&gt; union catalogues (which lack significant functionality that standard WorldCat possesses, like the integrated schema.org discovery markup)?&lt;/li&gt;
&lt;li&gt;Or a separate, standalone but closed catalogue like the &lt;a href=&quot;https://www.oclc.org/nl-NL/ggc.html&quot;&gt;Dutch union catalogue GGC&lt;/a&gt; and the &lt;a href=&quot;http://www.oclc.org/en-UK/unityuk.html&quot;&gt;Combined Regions UnityUK&lt;/a&gt; that require a subscription to access?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The answer was &quot;yes, we will be adding records and holdings to WorldCat, and yes, you will be able to search a WorldCat Local instance for both LAC-specific holdings and AMICUS as a whole&quot; - but they&#039;re still working out the exact details. Later we determined that it will actually be WorldCat Discovery--essentially a rewrite of WorldCat Local--which assuaged some of my concerns about the current examples we can see of other OCLC-based union catalogues.&lt;/p&gt;

&lt;h3&gt;Privacy of Canadian citizens&lt;/h3&gt;
&lt;p&gt;
The &quot;Canadian office and data centre locations&quot; requirement does not mean that usage data is exempt from Patriot Act concerns. Specifically, OCLC is an American company and thus the USA Patriot Act &quot;allows US authorities to obtain records from any US-linked company operating in Canada&quot; (per a &lt;a href=&quot;https://cippic.ca/en/US_access_to_Canadian_data&quot;&gt;2004 brief submitted to the BC Privacy Commissioner by CIPPIC&lt;/a&gt;). Canadians should not be subject to this invasion of their privacy by the agents of another nation simply to use their own national union catalogue.
&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The response:&lt;/em&gt; The Justice, Agriculture, and NRCan agencies use US-hosted library systems (the latter running the open-source Evergreen, hosted by Equinox). However, one of the other participants from a federal agency reported that they had been trying to upgrade to Sierra from their Millennium instance but have been stalled for two years because the policy that allowed them to go live with US-hosted Millennium is no longer permitted.&lt;/p&gt;
&lt;p&gt;LAC claimed that, due to NAFTA, they are not allowed to insist that data be held in Canada unless it is for national security reasons. They noted that any usage data collected wouldn&#039;t be the same volume of patron data that would be seen in public libraries. They did point out that the Netherlands sends anonymized data to OCLC, but that costs money and impacts response time. Apparently, according to the OCLC web site, they claim not to have had a request under the Patriot Act.&lt;/p&gt;
&lt;h3&gt;Privacy of Canadian citizens, part 2&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;I didn&#039;t get the chance to bring this up during the call...&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;
LAC noted in their background that modern systems have links to social media, and apparently want this as part of a new AMICUS. This would also open up potential privacy leaks; see &lt;a href=&quot;http://go-to-hellman.blogspot.ca/2014/12/stop-making-web-surveillance-bugs-by.html&quot;&gt;Eric Hellman on this topic&lt;/a&gt;, for example; it is also an area of interest for the recently launched &lt;a href=&quot;http://www.ala.org/lita/about/igs/public/lit-Pp&quot;&gt;ALA Patron Privacy Technologies Interest Group&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;Open data&lt;/h3&gt;
&lt;p&gt;Opening up access to data is part of the federal government&#039;s stated mission.  
&lt;a href=&quot;http://open.canada.ca/en/content/canadas-action-plan-open-government-2014-16&quot;&gt;Canada&#039;s Action Plan on Open Government 2014-16&lt;/a&gt; says &quot;Open Government Foundation - Open By Default&quot; is a keystone of its plan; &quot;Eligible data and information will be released in standardized, open formats, free of charge, and without restrictions on reuse&quot; under the Open Government Licence - Canada 2.0. I therefore asserted:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A relaunched National Union Catalogue should support open data, per the federal initiative, from launch.&lt;/li&gt;
&lt;li&gt;The open data should include bibliographic, authority, and holdings records. Guy Berthiaume&#039;s reply to &lt;a href=&quot;http://www.cla.ca/Content/NavigationMenu/CLAatWork/Advocacy/LAC_Response_CLA_letter_OCLC_access_26aug2014.pdf&quot;&gt;CLA&lt;/a&gt; and &lt;a href=&quot;http://capalibrarians.org/wp/wp-content/uploads/2014/09/LAC-response-September-62014.pdf&quot;&gt;CAPAL&lt;/a&gt; that libraries can use the Z39.50 protocol to try to access records from individual libraries&#039; Z39.50 servers ignores one of the primary purposes of a union catalogue, which is to avoid that time-consuming search across the various Z39.50 servers of the institutions that contributed their data to the union catalogue in the first place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The response:&lt;/em&gt; The ACAN requirements document indicated a requirement that the data be made available under an ODC-BY license (matching OCLC&#039;s general WorldCat license); and LAC needs to get the data back to support their federated search tool.&lt;/p&gt;
&lt;p&gt;I asked if they had checked to see if ODC-BY and Open Government License - Canada 2.0 licenses are compatible; they responded that that was something they would need to look into.
Happily, the &lt;a href=&quot;http://clipol.org/licences/69?tab=licence_compatibility&quot;&gt;CLIPol tool&lt;/a&gt; indicates that the ODC-BY 1.0 and Open Government License - Canada 2.0 licenses are mostly compatible.&lt;/p&gt;

&lt;h3&gt;Contemporary features: are we achieving the stated goals?&lt;/h3&gt;
&lt;p&gt;
The backgrounder benefits/objectives section stated: &quot;In the current AMICUS-based context, the NUC has not kept pace with new technological functions, capabilities, and client needs.  Contemporary features such as a user-oriented display and navigation, user customization, links to social media, and linked open data output were not available when AMICUS was implemented in the 1990s.&quot;
&lt;/p&gt;
&lt;h4&gt;Canadian resource visibility&lt;/h4&gt;
&lt;p&gt;
To preserve and promote our unique national culture, we want Canadian library resources to be as visible as possible on the web. This is generally accomplished by publishing a sitemap (a list of the web pages for a given web site, along with when each page was last updated) and allowing search engines like Google, Bing, and Yahoo to crawl those web pages and index their data.
&lt;/p&gt;
&lt;p&gt;
To maximize the visibility of Canadian library resources on the open web, we need our union catalogue to generate a sitemap that points to only the actual records with holdings for Canadian libraries, not just WorldCat.org in general. For example, &lt;a href=&quot;http://adamnet.worldcat.org/robots.txt&quot;&gt;http://adamnet.worldcat.org/robots.txt&lt;/a&gt; simply points to the generic http://www.worldcat.org/libraries/sitemap_index.xml, not a specific sitemap for the Dutch union catalogue.
&lt;/p&gt;
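&lt;p&gt;As a rough sketch (the record URL here is purely hypothetical), a sitemap scoped to Canadian holdings is just a list of entries like this:&lt;/p&gt;
&lt;pre&gt;
&amp;lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&amp;gt;
&amp;lt;urlset xmlns=&quot;http://www.sitemaps.org/schemas/sitemap/0.9&quot;&amp;gt;
  &amp;lt;url&amp;gt;
    &amp;lt;loc&amp;gt;https://nuc.example.ca/record/123&amp;lt;/loc&amp;gt;
    &amp;lt;lastmod&amp;gt;2015-03-02&amp;lt;/lastmod&amp;gt;
  &amp;lt;/url&amp;gt;
&amp;lt;/urlset&amp;gt;
&lt;/pre&gt;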
&lt;p&gt;Our union catalogue should publish schema.org metadata to improve the discoverability of our resources in search engines (which initiated the schema.org standard for that purpose). WorldCat includes schema.org metadata, but WorldCat Local instances do not.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The response:&lt;/em&gt; There was some confusion about schema.org, and they asked if I didn&#039;t think that OCLC&#039;s syndication program was sufficient for enabling web discoverability. I replied in the negative.&lt;/p&gt;
&lt;h4&gt;Standards support (MARC21, RDA, ISO etc.)&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;I didn&#039;t get a chance to raise these questions.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What standards, exactly, are meant by this?&lt;/p&gt;
&lt;p&gt;&quot;Technical requirements including volumetrics and W3C compliance&quot; is also very broad and vague. With respect to &quot;W3C compliance&quot;, the &lt;a href=&quot;http://www.w3.org/standards/&quot;&gt;W3C Standards&lt;/a&gt; page is just the entry point to a large number of standards.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Presumably there will be WCAG compliance for accessibility - but to what extent?&lt;/li&gt;
&lt;li&gt;Both the adamnet and fablibraries instances landing pages state that their canonical URL is www.worldcat.org, which effectively hides them from search engines.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Mobile support&lt;/h4&gt;
&lt;p&gt;The &lt;a href=&quot;http://www.w3.org/standards/&quot;&gt;W3C Standards&lt;/a&gt; page mentions mobile friendliness as part of its standards.&lt;/p&gt;
&lt;p&gt;
WorldCat.org itself is not mobile friendly. It uses a separate website with different URLs to serve up mobile web pages, and does not automatically detect mobile browsers; the onus is on the user to find the &lt;a href=&quot;http://www.worldcatmobile.org/&quot;&gt;&quot;WorldCat Mobile&quot; page&lt;/a&gt;, which has been in a &quot;Beta&quot; state since 2009. That &quot;beta&quot; status contravenes the stated requirement that the AMICUS replacement service &lt;em&gt;not&lt;/em&gt; be an alpha or beta (unless you choose to ignore the massive adoption of mobile devices for searching and browsing purposes), and the beta mobile experience lacks functionality compared to the desktop version.
&lt;/p&gt;
&lt;p&gt;
The adamnet and fablibraries WorldCat Local instances don&#039;t advertise the mobile option, which is slightly different from the standard WorldCat Mobile version (for example, it offers record detail pages), but the navigation between desktop and mobile is sub-par. If you have bookmarked a page on the desktop, then open that bookmark on your synchronized browser on a mobile device, you can only get the desktop view.
&lt;/p&gt;

&lt;h4&gt;Linked open data&lt;/h4&gt;
&lt;p&gt;Linked open data around records, holdings, and participating libraries has arguably been a standard since the W3C Library Linked Data working group issued its &lt;a href=&quot;http://www.w3.org/2005/Incubator/lld/XGR-lld-20111025/&quot;&gt;final report in 2011&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt; 
&lt;li&gt;Data--including library holdings--should be available both as bulk downloads and as linked open data&lt;/li&gt;
&lt;li&gt;Records need to be linked to libraries and holdings. For humans, that missing link in WorldCat is supplied by a JavaScript lookup based on geographic location info that the human supplies. This prevents other automated services from aggregating the data and creating new services based on it (including entirely Canadian-built and hosted services which would then protect Canadians from USA Patriot Act concerns).&lt;/li&gt;
&lt;li&gt;MARC records should be one of the directly downloadable formats via the web. Currently download options are limited to experimental &amp;amp; incomplete ntriple, turtle, JSON-LD, and RDF-XML formats.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Application programming interface (API)&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;I didn&#039;t get the chance to bring this up during the call...&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;OCLC offers the xID API in a very limited fashion to non-members, and it is one of the few ways to match ISBN, LCCN, and OCLC numbers. LAC should ensure that Canadian libraries have access to some similarly efficient means of finding matching records without having to become full OCLC Cataloguing members.&lt;/p&gt;

&lt;h4&gt;Updating the NUC&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;I didn&#039;t get the chance to bring this up during the call...&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In an ideal world, the NUC would adopt the standard web indexing practice of checking sitemaps (for those libraries that produce them) on a regular (daily or weekly) basis and adding/replacing any new/modified records &amp;amp; holdings from the contributing libraries accordingly, rather than requiring libraries to upload their own records &amp;amp; holdings on an irregular basis.&lt;/p&gt;
    </content:encoded>

    <pubDate>Mon, 02 Mar 2015 22:46:47 -0500</pubDate>
    <guid isPermaLink="false">https://coffeecode.net:443/archives/302-guid.html</guid>
    <category>coding</category>
<category>structured data</category>

</item>
<item>
    <title>Putting the &quot;Web&quot; back into Semantic Web in Libraries 2014</title>
    <link>https://coffeecode.net:443/archives/296-Putting-the-Web-back-into-Semantic-Web-in-Libraries-2014.html</link>
            <category>Coding</category>
            <category>Evergreen</category>
            <category>Libraries</category>
            <category>Structured data</category>
    
    <comments>https://coffeecode.net:443/archives/296-Putting-the-Web-back-into-Semantic-Web-in-Libraries-2014.html#comments</comments>
    <wfw:comment>https://coffeecode.net:443/wfwcomment.php?cid=296</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>https://coffeecode.net:443/rss.php?version=2.0&amp;type=comments&amp;cid=296</wfw:commentRss>
    

    <author>dan@coffeecode.net (Dan Scott)</author>
    <content:encoded>
    &lt;p&gt;I was honoured to lead a workshop and speak at this year&#039;s edition of
&lt;a href=&quot;http://swib.org/swib14&quot;&gt;Semantic Web in Bibliotheken (SWIB)&lt;/a&gt; in Bonn, Germany. It was an amazing
experience; there were so many rich projects being described with obvious
dividends for the users of libraries; once again, the European library
community fills me with hope for the future success of the semantic web.
&lt;/p&gt;

&lt;p&gt;
The subject of my talk &quot;Cataloguing for the open web with RDFa and schema.org&quot;
(&lt;a href=&quot;https://coffeecode.net/swib14/talk&quot;&gt;slides&lt;/a&gt; and &lt;a
href=&quot;http://www.scivee.tv/node/63282&quot;&gt;video recording&lt;/a&gt; - &lt;em&gt;gulp&lt;/em&gt;)
pivoted while I was preparing materials for the workshop. I was searching
library catalogues around Bonn looking for a catalogue with persistent URIs
that I could use for an example. To my surprise, catalogue after catalogue used
session-based URLs; it took me quite some time before I was able to find ULB,
which hosted a VuFind front end for its catalogue. Even then, the
&lt;code&gt;robots.txt&lt;/code&gt; restricted crawling by any user agent. This reminded me
rather depressingly of my findings from current &quot;discovery layers&quot;, which
entirely restrict crawling and therefore put libraries into a black hole on the
web.
&lt;/p&gt;

&lt;p&gt;
These findings in the wild are so antithetical to the basic principles of
enabling discovery of web resources that, in a conference about the semantic
web, I opted to spend over half of my talk making the argument that libraries
need to pay attention to the old-fashioned web of documents first and foremost.
The basic building blocks that I advocated were, in priority order:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persistent URIs, on which everything else is built&lt;/li&gt;
&lt;li&gt;Sitemaps, to facilitate discovery of your resources&lt;/li&gt;
&lt;li&gt;A robots.txt file to filter portions of your website that should not be
    crawled (for example, search results pages)&lt;/li&gt;
&lt;li&gt;RDFa, microdata, or JSON-LD only after you&#039;ve sorted out the first three&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Only after setting that foundation did I feel comfortable launching into my
rationale for RDFa and schema.org as a tool for enabling discovery on the web:
a mapping of the access points that cataloguers create to the world of HTML
and aggregators. The key point for SWIB was that RDFa and schema.org can enable
full RDF expressions in HTML; that is, we can, should, and must go beyond
surfacing structured data to surfacing linked data through
&lt;code&gt;@resource&lt;/code&gt; attributes and 
&lt;a href=&quot;http://schema.org/sameAs&quot;&gt;schema:sameAs&lt;/a&gt; properties.
&lt;/p&gt;


&lt;blockquote&gt;
The Semantic Web is an extension of the current web in which information is
given well-defined meaning, better enabling computers and people to work in
cooperation. &lt;cite&gt;Tim Berners-Lee, Scientific American, 2001&lt;/cite&gt;
&lt;/blockquote&gt;

&lt;p&gt;
I also argued that using RDFa to enrich the document web was, in fact, truer to
Berners-Lee&#039;s 2001 definition of the semantic web, and that we should focus on
enriching the document web so that both humans and machines can benefit before
investing in building an entirely separate and disconnected semantic web.
&lt;/p&gt;

&lt;p&gt;
I was worried that my talk would not be well received; that it would be
considered obvious, or scolding, or just plain off-topic. But to my relief
I received a great deal of positive feedback. And on the next day, both Eric Miller
and Richard Wallis gave talks on a similar, but more refined, theme:
that libraries need to do a much, much better job of enabling their resources
to be found on the web--not by people who already use our catalogues, but by
people who are &lt;em&gt;not&lt;/em&gt; library users today.
&lt;/p&gt;

&lt;p&gt;
There were also some requests for clarification, which I&#039;ll try to address
generally here (for the benefit of anyone who wasn&#039;t able to talk with me, or
who might watch the livestream in the future).
&lt;/p&gt;

&lt;h4&gt;&quot;When you said anything could be described in schema.org, did you mean we should throw out MARC and BIBFRAME and EAD?&quot;&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;tldr:&lt;/em&gt; I intended &lt;strong&gt;and&lt;/strong&gt;, not &lt;strong&gt;instead of&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;
The first question I was asked was whether there was anything that I had not
been able to describe in schema.org, to which I answered &quot;No&quot;--especially given
the work that the W3C SchemaBibEx group had done to ensure that some of the core
bibliographic requirements were added to the vocabulary. It was not as
coherent or full a response as I would have liked to have made; I blame the
livestream camera &lt;img src=&quot;https://coffeecode.net:443/templates/default/img/emoticons/smile.png&quot; alt=&quot;:-)&quot; style=&quot;display: inline; vertical-align: bottom;&quot; class=&quot;emoticon&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;
But combined with a part of the presentation where I countered a myth about
schema.org being a very coarse vocabulary by pointing out that it actually
contained 600 classes and over 800 properties, a number of the attendees
interpreted one of the takeaways of my talk as suggesting that libraries should
adopt schema.org as &lt;em&gt;the&lt;/em&gt; descriptive vocabulary, and that MARC,
BIBFRAME, EAD, RAD, RDA, and other approaches for describing library resources
were no longer necessary.
&lt;/p&gt;

&lt;p&gt;
This is not at all what I&#039;m advocating! To expand on my response, you
&lt;em&gt;can&lt;/em&gt; describe anything in schema.org, but you might lose significant
amounts of richness in your description. For example, short stories and poems
would best be described in schema.org as a &lt;a href=&quot;http://schema.org/CreativeWork&quot;&gt;CreativeWork&lt;/a&gt;.
You would have to look at the associated description or keyword properties to
be able to figure out the form of the work.
&lt;/p&gt;

&lt;p&gt;
What I was advocating was that you should map your rich bibliographic
description into corresponding schema.org classes and properties in RDFa at the
time you generate the HTML representation of that resource and its associated
entities. So your poem might be represented as a &lt;a
href=&quot;http://schema.org/CreativeWork&quot;&gt;CreativeWork&lt;/a&gt;, with a
&lt;a href=&quot;http://schema.org/name&quot;&gt;name&lt;/a&gt;,
&lt;a href=&quot;http://schema.org/author&quot;&gt;author&lt;/a&gt;,
&lt;a href=&quot;http://schema.org/description&quot;&gt;description&lt;/a&gt;,
&lt;a href=&quot;http://schema.org/keywords&quot;&gt;keywords&lt;/a&gt;, and
&lt;a href=&quot;http://schema.org/about&quot;&gt;about&lt;/a&gt; values and relationships. Ideally,
the &lt;code&gt;author&lt;/code&gt; will include at least one link (either via 
&lt;a href=&quot;http://schema.org/sameAs&quot;&gt;sameAs&lt;/a&gt;,
&lt;a href=&quot;http://schema.org/url&quot;&gt;url&lt;/a&gt;, or &lt;code&gt;@resource&lt;/code&gt;) to an
entity on the web; and you could do the same with &lt;code&gt;about&lt;/code&gt; if you are
using a controlled vocabulary.
&lt;/p&gt;
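&lt;p&gt;A minimal RDFa sketch of that mapping might look like the following (the title, author, and authority URI are all hypothetical):&lt;/p&gt;
&lt;pre&gt;
&amp;lt;div vocab=&quot;http://schema.org/&quot; typeof=&quot;CreativeWork&quot;&amp;gt;
  &amp;lt;h1 property=&quot;name&quot;&amp;gt;An Example Poem&amp;lt;/h1&amp;gt;
  &amp;lt;a property=&quot;author&quot; typeof=&quot;Person&quot;
     href=&quot;http://example.org/authority/jane-doe&quot;&amp;gt;Jane Doe&amp;lt;/a&amp;gt;
  &amp;lt;span property=&quot;keywords&quot;&amp;gt;poetry&amp;lt;/span&amp;gt;
  &amp;lt;p property=&quot;description&quot;&amp;gt;A short poem about examples.&amp;lt;/p&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/pre&gt;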

&lt;p&gt;
If you take that approach, then you can serve up schema.org descriptions of works
in HTML that most web-oriented clients will understand (such as search engines)
and provide basic access points such as name / author / keywords, while
retaining and maintaining the full richness of the underlying bibliographic
description--and potentially providing access to that, too, as part of the
embedded RDFa, via content negotiation, or &lt;code&gt;&amp;lt;link rel=&quot;&quot;&amp;gt;&lt;/code&gt;,
for clients that can interpret richer formats.
&lt;/p&gt;

&lt;h4&gt;&quot;What makes you think Google will want to surface library holdings in search results?&quot;&lt;/h4&gt;

&lt;p&gt;
There is a perception that Google and other search engines just want to sell
ads, or their own products (such as Google Books). While Google certainly does
want to sell ads and products, they also want to be the most useful tool for
satisfying users&#039; information needs--possibly so they can learn more about those
users and put more effective ads in front of them--but nonetheless, the
motivation is there.
&lt;/p&gt;

&lt;p&gt;
By marking up your resources with the Product / Offer portion of schema.org,
you are able to provide search engines with availability information in the
same way that Best Buy, AbeBooks, and other online retailers do (as Evergreen,
Koha, and VuFind already do). That makes it much easier for the search engines
to use everything they may know about their users, such as their current
location, their institutional affiliations, their typical commuting patterns,
their reading and research preferences... to provide a link to a library&#039;s
electronic or print copy of a given resource in a knowledge graph box as one of
the possible ways of satisfying that person&#039;s information needs.
&lt;/p&gt;
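&lt;p&gt;As a sketch of that Product / Offer pattern in RDFa (the title, library name, and availability value are illustrative only):&lt;/p&gt;
&lt;pre&gt;
&amp;lt;div vocab=&quot;http://schema.org/&quot; typeof=&quot;Product&quot;&amp;gt;
  &amp;lt;span property=&quot;name&quot;&amp;gt;An Example Title&amp;lt;/span&amp;gt;
  &amp;lt;div property=&quot;offers&quot; typeof=&quot;Offer&quot;&amp;gt;
    &amp;lt;link property=&quot;availability&quot; href=&quot;http://schema.org/InStock&quot;/&amp;gt;
    &amp;lt;span property=&quot;seller&quot; typeof=&quot;Library&quot;&amp;gt;Example Public Library&amp;lt;/span&amp;gt;
  &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/pre&gt;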

&lt;p&gt;
We don&#039;t see it happening with libraries running Evergreen, Koha, and VuFind
yet, most likely because the open source library systems don&#039;t have enough
penetration to make it worth a search engine&#039;s effort to add that to their
set of possible sources. However, if we as an industry make a concerted effort
to implement this as a standard part of crawlable catalogue or discovery record
detail pages, then it wouldn&#039;t surprise me in the least to see such suggestions
start to appear. The best proof that we have that Google, at least, is
interested in supporting discovery of library resources is the continued
investment in Google Scholar.
&lt;/p&gt;

&lt;p&gt;
And as I argued during my talk, even if the search engines never add direct
links to library resources from search results or knowledge graph sidebars,
having a reasonably simple standard like the GoodRelations product / offer
pattern for resource availability enables new web-based approaches for building
applications. One example could be a fulfillment system that uses sitemaps to
intelligently crawl all of its participating libraries, normalizes the
item request to a work URI, and checks availability by parsing the offers at the
corresponding URIs.
&lt;/p&gt; 
    </content:encoded>

    <pubDate>Thu, 04 Dec 2014 16:15:15 -0500</pubDate>
    <guid isPermaLink="false">https://coffeecode.net:443/archives/296-guid.html</guid>
    <category>coding</category>
<category>evergreen</category>
<category>libraries</category>
<category>structured data</category>

</item>
<item>
    <title>How discovery layers have closed off access to library resources, and other tales of schema.org from LITA Forum 2014</title>
    <link>https://coffeecode.net:443/archives/294-How-discovery-layers-have-closed-off-access-to-library-resources,-and-other-tales-of-schema.org-from-LITA-Forum-2014.html</link>
            <category>Coding</category>
            <category>Evergreen</category>
            <category>Structured data</category>
    
    <comments>https://coffeecode.net:443/archives/294-How-discovery-layers-have-closed-off-access-to-library-resources,-and-other-tales-of-schema.org-from-LITA-Forum-2014.html#comments</comments>
    <wfw:comment>https://coffeecode.net:443/wfwcomment.php?cid=294</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>https://coffeecode.net:443/rss.php?version=2.0&amp;type=comments&amp;cid=294</wfw:commentRss>
    

    <author>dan@coffeecode.net (Dan Scott)</author>
    <content:encoded>
    &lt;p property=&quot;description&quot;&gt;
At the LITA Forum yesterday, I accused (&lt;a href=&quot;http://stuff.coffeecode.net/2014/lita_forum&quot;&gt;presentation&lt;/a&gt;) most discovery layers of not solving the discoverability problems of libraries, but instead exacerbating them by launching us headlong into a closed, unlinkable world. Coincidentally, Lorcan Dempsey&#039;s opening keynote contained a subtle criticism of discovery layers. I wasn&#039;t that subtle.&lt;/p&gt;
&lt;p&gt;
Here&#039;s why I believe commercial discovery layers are not &quot;of the web&quot;: check out their &lt;a href=&quot;http://robotstxt.org&quot;&gt;&lt;code&gt;robots.txt&lt;/code&gt;&lt;/a&gt; files. If you&#039;re not familiar with robots.txt files, these are what search engines and other well-behaved automated crawlers of web resources use to determine whether they are allowed to visit and index the content of pages on a site. Here&#039;s what the &lt;code&gt;robots.txt&lt;/code&gt; files look like for a few of the best-known discovery layers:
&lt;/p&gt;
&lt;pre&gt;
User-Agent: *
Disallow: /
&lt;/pre&gt;
&lt;p&gt;
That effectively says &quot;Go away, machines; your kind isn&#039;t wanted in these parts.&quot; And that, in turn, closes off access to your library&#039;s resources to search engines and other aggregators of content, and is completely counter to the overarching desire to evolve to a linked open data world.
&lt;/p&gt;
&lt;p&gt;
During the question period, Marshall Breeding challenged my assertion as being unfair to what are meant to be merely indexes of library content. I responded that most libraries have replaced their catalogues with discovery layers, closing off open access to what have traditionally been their core resources, and he rather quickly acquiesced that that was indeed a problem.
&lt;/p&gt;
&lt;p&gt;
(By the way, a possible solution might be to simply offer two different URL patterns, something like &lt;code&gt;/library/*&lt;/code&gt; for library-owned resources to which access should be granted, and &lt;code&gt;/licensed/*&lt;/code&gt; for resources to which open access to the metadata is problematic due to licensing issues, and which robots can therefore be restricted from accessing.)
&lt;/p&gt;
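&lt;p&gt;Under that hypothetical URL scheme, the &lt;code&gt;robots.txt&lt;/code&gt; becomes trivial:&lt;/p&gt;
&lt;pre&gt;
# /library/* pages remain crawlable by default
User-Agent: *
Disallow: /licensed/
&lt;/pre&gt;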
&lt;p&gt;
Compared to commercial discovery layers on my very handwavy usability vs. discoverability plot, general search engines rank pretty high on both axes; they&#039;re the ready-at-hand tool in browser address bars. And they grok schema.org, so if we can improve our discoverability by publishing schema.org data, maybe we get a discoverability win for our users.
&lt;/p&gt;
&lt;p&gt;
But even if we don&#039;t (SEO is a black art at best, and maybe the general search engines won&#039;t find the right mix of signals that makes them decide to boost the relevancy of our resources for specific users in specific locations at specific times) we get access to that structured data across systems in an extremely reusable way. With sitemaps, we can build our own specialized search engines (Solr or ElasticSearch or Google Custom Search Engine or whatever) that represent specific use cases. Our more sophisticated users can piece together data to, for example, build dynamic lists of collections, using a common, well-documented vocabulary and tools rather than having to dip into the arcane world of library standards (Z39.50 and MARC21).
&lt;/p&gt;
&lt;p&gt;
So why not iterate our way towards the linked open data future by building on what we already have now?
As &lt;a href=&quot;http://kcoyle.blogspot.ca/2014/10/schemaorg-where-it-works.html&quot;&gt;Karen Coyle wrote&lt;/a&gt; in a much more elegant fashion, the transition looks roughly like:
&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;Stored data -&gt; transform/template -&gt; human readable HTML page&lt;/li&gt;
    &lt;li&gt;Stored data -&gt; transform/template (tweaked) -&gt; machine &amp;amp; human readable HTML page&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
That is, by simply tweaking the same mechanism you already use to generate a human readable HTML page from the data you have stored in a database or flat files or what have you, you can embed machine readable structured data as well.
&lt;/p&gt;
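&lt;p&gt;
As a hypothetical illustration of that tweak (the title and author are invented), the same template output can gain RDFa attributes without changing anything the human reader sees:
&lt;/p&gt;

```html
&lt;!-- before: human readable only --&gt;
&lt;h1&gt;A Book Title&lt;/h1&gt;
&lt;p&gt;by Jane Author&lt;/p&gt;

&lt;!-- after: the same page, now also machine readable --&gt;
&lt;div vocab=&quot;http://schema.org/&quot; typeof=&quot;Book&quot;&gt;
  &lt;h1 property=&quot;name&quot;&gt;A Book Title&lt;/h1&gt;
  &lt;p&gt;by &lt;span property=&quot;author&quot; typeof=&quot;Person&quot;&gt;
    &lt;span property=&quot;name&quot;&gt;Jane Author&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
```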
&lt;p&gt;
That is, in fact, exactly the approach I took with Evergreen, VuFind, and Koha. And they now expose structured data and generate sitemaps out of the box using the same old MARC21 data. Evergreen even exposes information about libraries (locations, contact information, hours of operation) so that you can connect its holdings to specific locations.
&lt;/p&gt;
&lt;p&gt;
And what about all of our resources outside of the catalogue? Research guides, fonds descriptions, institutional repositories, publications... I&#039;ve been lucky enough to be working with Camilla McKay and Karen Coyle on applying the same process to the Bryn Mawr Classical Review. At this stage, we&#039;re exposing basic entities (&lt;a href=&quot;http://schema.org/Review&quot;&gt;Reviews&lt;/a&gt; and &lt;a href=&quot;http://schema.org/Person&quot;&gt;People&lt;/a&gt;) largely as literals, but we&#039;re laying the groundwork for future iterations where we link them up to external entities. And all of this is built on a Tcl + SGML infrastructure.
&lt;/p&gt;
&lt;p&gt;
So why schema.org? It has the advantage of being a de-facto generalized vocabulary that can be understood and parsed across many different domains, from car dealerships to streaming audio services to libraries, and it can be relatively simply embedded into existing HTML as long as you can modify the templating layer of your system.
&lt;/p&gt;
&lt;p&gt;
And schema.org offers much more than just static structured data; schema.org Actions are surfacing in applications like Gmail as a way of providing directly actionable links--and there&#039;s no reason we shouldn&#039;t embrace that approach to expose &quot;SearchAction&quot;, &quot;ReadAction&quot;, &quot;WatchAction&quot;, &quot;ListenAction&quot;, &quot;ViewAction&quot;--and &quot;OrderAction&quot; (Request), &quot;BorrowAction&quot; (Borrow or Renew), &quot;Place on Reserve&quot;, and other common actions as a standardized API that exists well beyond libraries (see Hydra for a developing approach to this problem).
&lt;/p&gt;
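&lt;p&gt;
For instance, a &quot;SearchAction&quot; advertising a catalogue search endpoint can be expressed in JSON-LD along these lines (the site URL and query template are invented):
&lt;/p&gt;

```json
{
  &quot;@context&quot;: &quot;http://schema.org&quot;,
  &quot;@type&quot;: &quot;WebSite&quot;,
  &quot;url&quot;: &quot;https://catalogue.example.org/&quot;,
  &quot;potentialAction&quot;: {
    &quot;@type&quot;: &quot;SearchAction&quot;,
    &quot;target&quot;: &quot;https://catalogue.example.org/search?q={query}&quot;,
    &quot;query-input&quot;: &quot;required name=query&quot;
  }
}
```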
&lt;p&gt;
I want to thank Richard Wallis for inviting me to co-present with him; it was a great experience, and I really enjoy meeting and sharing with others who are putting linked data theory into practice.
&lt;/p&gt; 
    </content:encoded>

    <pubDate>Sat, 08 Nov 2014 11:41:30 -0500</pubDate>
    <guid isPermaLink="false">https://coffeecode.net:443/archives/294-guid.html</guid>
    <category>coding</category>
<category>evergreen</category>
<category>structured data</category>

</item>
<item>
    <title>DCMI 2014: schema.org holdings in open source library systems</title>
    <link>https://coffeecode.net:443/archives/293-DCMI-2014-schema.org-holdings-in-open-source-library-systems.html</link>
            <category>Coding</category>
            <category>Evergreen</category>
            <category>Libraries</category>
            <category>Structured data</category>
    
    <comments>https://coffeecode.net:443/archives/293-DCMI-2014-schema.org-holdings-in-open-source-library-systems.html#comments</comments>
    <wfw:comment>https://coffeecode.net:443/wfwcomment.php?cid=293</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>https://coffeecode.net:443/rss.php?version=2.0&amp;type=comments&amp;cid=293</wfw:commentRss>
    

    <author>dan@coffeecode.net (Dan Scott)</author>
    <content:encoded>
    &lt;p&gt;My slides from DCMI 2014: &lt;a
href=&quot;http://stuff.coffeecode.net/2014/dcmi_schemabibex/#/&quot;&gt;schema.org in the
wild: open source libraries++&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;
Last week I was at the &lt;a
href=&quot;http://dcevents.dublincore.org/IntConf/dc-2014&quot;&gt;Dublin Core Metadata
Initiative 2014 conference&lt;/a&gt;, where Richard Wallis, Charles MacCathie Nevile
and I were slated to present on schema.org and the work of the W3C Schema.org
Bibliographic Extension Community Group (#schemabibex). As a first-timer at
DCMI, I wasn&#039;t sure what kind of an audience to expect: there is a
peer-reviewed papers track, and a series of sessions on a truly intimidating
topic (RDF Application Profiles), but on the other hand our own topic was
fairly basic. As it turned out, there was an invigoratingly mixed set of
backgrounds present, and Eric Miller&#039;s opening keynote, which gave an oral
history of the origins of DCMI and a look towards the future challenges for the
organization, reassured me that I wasn&#039;t going to be out of my depth.
&lt;/p&gt;
&lt;p&gt;
Special kudos to Eric for his analogy of the Web to a credit card, which offers
both human-readable and machine-readable data. A nice, clean image!
&lt;/p&gt;
&lt;p&gt;
Richard, Charles and I opted to structure our 1.5 hour session as a series of
short talks followed by a long period of discussion. However, as often happens,
that plan broke down in the excitement of speaking to a room that drew so many
attendees that we had to jam in extra chairs. I cut my own
materials back to illustrating how one of my primary contributions to the
#schemabibex effort--representing library holdings using schema.org&#039;s
GoodRelations-based Product/Offer model--had been implemented in free software
library systems, including Evergreen, Koha, and VuFind. I walked the audience from a basic
bibliographic record (represented as a &lt;a
href=&quot;http://schema.org/Product&quot;&gt;Product&lt;/a&gt;), through to the associated
borrowable items (represented as &lt;a href=&quot;http://schema.org/Offer&quot;&gt;Offers&lt;/a&gt;
with a price of $0.00, call numbers as &lt;a
href=&quot;http://schema.org/sku&quot;&gt;SKUs&lt;/a&gt;, and barcodes as &lt;a
href=&quot;http://schema.org/serialNumber&quot;&gt;serialNumbers&lt;/a&gt;), that were offered by
a specific &lt;a href=&quot;http://schema.org/Library&quot;&gt;Library&lt;/a&gt; with its own set of
operating hours, address, and contact information... all published out of the
box as RDFa in modern Evergreen systems.
&lt;/p&gt;
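&lt;p&gt;
A condensed sketch of that holdings pattern follows; the call number, barcode, and library name are invented, but the types and properties are the ones described above:
&lt;/p&gt;

```html
&lt;div vocab=&quot;http://schema.org/&quot; typeof=&quot;Product&quot;&gt;
  &lt;span property=&quot;name&quot;&gt;An Example Title&lt;/span&gt;
  &lt;div property=&quot;offers&quot; typeof=&quot;Offer&quot;&gt;
    &lt;meta property=&quot;price&quot; content=&quot;0.00&quot;&gt;
    Call number: &lt;span property=&quot;sku&quot;&gt;QA76.9 .E94&lt;/span&gt;
    Barcode: &lt;span property=&quot;serialNumber&quot;&gt;31234000123456&lt;/span&gt;
    &lt;div property=&quot;seller&quot; typeof=&quot;Library&quot;&gt;
      &lt;span property=&quot;name&quot;&gt;Example Branch Library&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;
```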
&lt;p&gt;
I did stray a little to posit that the use case for schema.org is not and should
not be limited to &quot;search engine optimization&quot;, but that this very simple level
of structured data could fairly easily form the basis of an API. In the rather
limited discussion that we were able to hold at the end of the session (and
encroaching on break time), Charles counselled that libraries shouldn&#039;t really
bother with dumbing down their beautiful metadata simply to publish
schema.org... while I countered that the pursuit of publishing beautiful
metadata in the past has generally led librarians to publish no metadata at
all, and that schema.org was a great first step towards building a web of
cultural heritage metadata meant for machine consumption.
&lt;/p&gt;
&lt;p&gt;
I wish I could have stayed longer at DCMI, but it was Thanksgiving in Canada
and there were families to visit and feast with--not to mention children to
help take care of--so I had to depart after just a day and a half. I&#039;m
encouraged by the steps the organization is taking to renew itself, and I hope
to be able to participate again in the future.
&lt;/p&gt; 
    </content:encoded>

    <pubDate>Mon, 13 Oct 2014 21:07:13 -0400</pubDate>
    <guid isPermaLink="false">https://coffeecode.net:443/archives/293-guid.html</guid>
    <category>coding</category>
<category>evergreen</category>
<category>libraries</category>
<category>structured data</category>

</item>
<item>
    <title>My small contribution to schema.org this week</title>
    <link>https://coffeecode.net:443/archives/292-My-small-contribution-to-schema.org-this-week.html</link>
            <category>Coding</category>
            <category>Structured data</category>
    
    <comments>https://coffeecode.net:443/archives/292-My-small-contribution-to-schema.org-this-week.html#comments</comments>
    <wfw:comment>https://coffeecode.net:443/wfwcomment.php?cid=292</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>https://coffeecode.net:443/rss.php?version=2.0&amp;type=comments&amp;cid=292</wfw:commentRss>
    

    <author>dan@coffeecode.net (Dan Scott)</author>
    <content:encoded>
    &lt;p&gt;
&lt;a href=&quot;http://schema.org/docs/releases.html#v1.91&quot;&gt;Version 1.91&lt;/a&gt; of the http://schema.org vocabulary was released a few days ago, and I once again had a small part to play in it.
&lt;/p&gt;
&lt;p property=&quot;description&quot;&gt;
With the addition of the &lt;a href=&quot;http://schema.org/workExample&quot;&gt;workExample&lt;/a&gt; and &lt;a href=&quot;http://schema.org/exampleOfWork&quot;&gt;exampleOfWork&lt;/a&gt; properties, we (Richard Wallis, Dan Brickley, and I) realized that usage examples were desperately needed to clarify how these new CreativeWork properties should be applied. I had developed one for the &lt;a href=&quot;http://blog.schema.org/2014/09/schemaorg-support-for-bibliographic_2.html&quot;&gt;blog post&lt;/a&gt; that accompanied the launch of those properties, but the question was: where should those examples live in the official schema.org docs? CreativeWork has so many children, and the properties are so broadly applicable, that they could have been added to dozens of type pages.
&lt;/p&gt;
&lt;p&gt;
It turns out that an until-now unused feature of the schema.org infrastructure is that examples &lt;em&gt;can&lt;/em&gt; live on property pages; even Dan Brickley didn&#039;t think this was working. However, a quick test in my sandbox showed that it &lt;em&gt;was&lt;/em&gt; in perfect working order, so we could locate the examples on their most relevant documentation pages... Huzzah!
&lt;/p&gt;
&lt;p&gt;
I was then able to put together a nice, juicy example showing relationships between a Tolkien novel (&lt;em&gt;The Fellowship of the Ring&lt;/em&gt;), subsequent editions of that novel published by different companies in different locations at different times, and movies based on that novel. From this librarian&#039;s perspective, it&#039;s pretty cool to be able to do this; it&#039;s a realization of a desire to express relationships that, in most library systems, are hard or impossible to accurately specify. (Should be interesting to try and get this expressed in Evergreen and Koha...)
&lt;/p&gt;
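&lt;p&gt;
The shape of that example, roughly, in JSON-LD (the edition details here are illustrative rather than a copy of the published example):
&lt;/p&gt;

```json
{
  &quot;@context&quot;: &quot;http://schema.org&quot;,
  &quot;@type&quot;: &quot;Movie&quot;,
  &quot;name&quot;: &quot;The Lord of the Rings: The Fellowship of the Ring&quot;,
  &quot;exampleOfWork&quot;: {
    &quot;@type&quot;: &quot;Book&quot;,
    &quot;name&quot;: &quot;The Fellowship of the Ring&quot;,
    &quot;author&quot;: { &quot;@type&quot;: &quot;Person&quot;, &quot;name&quot;: &quot;J.R.R. Tolkien&quot; },
    &quot;workExample&quot;: {
      &quot;@type&quot;: &quot;Book&quot;,
      &quot;bookEdition&quot;: &quot;2nd edition&quot;,
      &quot;datePublished&quot;: &quot;1965&quot;
    }
  }
}
```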
&lt;p&gt;
In an ensuing conversation on public-vocabs about the appropriateness of this approach to work relationships, I was pleased to hear Jeff Young &lt;a href=&quot;http://lists.w3.org/Archives/Public/public-vocabs/2014Sep/0045.html&quot;&gt;say&lt;/a&gt; &quot;+1 for using exampleOfWork / workExample as many times as necessary to move vaguely up or down the bibliographic abstraction layers.&quot;... To me, that&#039;s a solid endorsement of this pragmatic approach to what is inherently messy bibliographic stuff.
&lt;/p&gt;
&lt;p&gt;
Kudos to Richard for having championed these properties in the first place; sometimes we&#039;re a little slow to catch on!
&lt;/p&gt; 
    </content:encoded>

    <pubDate>Sat, 13 Sep 2014 03:27:13 -0400</pubDate>
    <guid isPermaLink="false">https://coffeecode.net:443/archives/292-guid.html</guid>
    <category>coding</category>
<category>structured data</category>

</item>

</channel>
</rss>