Hierarchy of identifiers/locators (was: Consider finding a different name for "canonical locator") #28

iherman · Apr 13, 2016

It seems that this term is misleading in understanding what is going on (part of the feedback at the WWW2016 presentation).

bjdmeest · Apr 14, 2016

In what kind of direction should we be looking? Is it 'canonical',
'locator', or the combination that is misleading? Would, e.g., 'PWP reference URI' be a better fit, or a lot worse?

iherman · Apr 14, 2016

I think that the term 'canonical' is too strong. @dret, is this the correct interpretation of what you said?

Maybe saying, although it is a mouthful, 'state agnostic locator', or something like that, would be better

dret · Apr 14, 2016

On 2016-04-14 11:16, Ivan Herman wrote:

I think that the term 'canonical' is too strong. @dret, is this the
correct interpretation of what you said?

maybe. i think you first need an "identity model", saying clearly which
resources you intend to make identifiable. after that, you start
designing your identification model.

if you look at books: an ISBN does not identify a book, but a class of
books, what's often called "the work". i would imagine that for very
many scenarios, beyond that you also have to make a specific instance of
the book/work identifiable (something that real-world books don't
usually do, unless they're limited editions and numbered).

then you also might want to make this work available at different
addresses, such as the copy on my work computer and the one on my laptop
and the one on my tablet, all of which are "copies of my book copy". as
you can see, scenarios quickly get rather complex.

i think you first need a better model of the "levels of identity" you
want to support, and then you can start building something representing
them. the current model seems a bit simplistic, i have to admit.

dret · Apr 14, 2016

thinking about this a little more, it seems to me that you probably don't want to overcomplicate the model with a huge number of identification layers. but maybe one distinction is crucial:

one identifier is assigned by the publisher. it represents whatever identity model the publisher is following when making the work available. it can be something like ISBN, but it also could be unique identifiers if the publisher prefers to do that. this policy is opaque.
another identifier is assigned by the user. it represents the identity of the work they have somehow acquired from a publisher. this identification scheme is opaque as well, but it is different from the former in the sense that users are free to assign and change this as they see fit, whereas the publication identifier should remain unchanged.

does this make any sense? it would at least solve the problem of how to identify my copy of a book across a variety of online/offline locations where i might be using it.
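
A minimal sketch of this two-identifier distinction, in Python. The class name `OwnedCopy` and the example identifier values are illustrative assumptions, not part of any PWP draft; the only point is that the publisher-assigned identifier is read-only while the user-assigned one can be changed freely.

```python
class OwnedCopy:
    """One user's copy of a publication (hypothetical model)."""

    def __init__(self, publication_id: str, user_id: str) -> None:
        self._publication_id = publication_id  # assigned by the publisher, opaque
        self.user_id = user_id                 # assigned by the user, opaque

    @property
    def publication_id(self) -> str:
        # No setter is defined, so assigning to it raises AttributeError.
        return self._publication_id


copy = OwnedCopy(publication_id="example-publisher-id-123", user_id="my-library/0042")

# The user may relabel their copy as they see fit.
copy.user_id = "shelf-3/0042"

# The publication identifier cannot be reassigned.
try:
    copy.publication_id = "something-else"
    publisher_id_changed = True
except AttributeError:
    publisher_id_changed = False
```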

dret · Apr 14, 2016

btw, i am just realizing that my comments are terribly out of scope for this issue. my apologies.

iherman · Apr 17, 2016

> btw, i am just realizing that my comments are terribly out of scope for this issue. my apologies.

Actually, it is not. Having a crisp notion of what these terms mean is closely related to how it is named… ie, no apologies required! Keep it coming…

Thanks for your interest!

iherman · Apr 17, 2016

Hi Erik, sorry for the late reply (just on my way back from the conference)

On 14 Apr 2016, at 11:58, Erik Wilde wrote:

thinking about this a little more, it seems to me that you probably don't want to overcomplicate the model with a huge number of identification layers. but maybe one distinction is crucial:

one identifier is assigned by the publisher. it represents whatever identity model the publisher is following when making the work available. it can be something like ISBN, but it also could be unique identifiers if the publisher prefers to do that. this policy is opaque.
another identifier is assigned by the user. it represents the identity of the work they have somehow acquired by a publisher. this identification scheme is opaque as well, but it is different from the former in the sense that users are free to assign and change this as they see fit, whereas the publication identifier should remain unchanged.

does this make any sense? it would at least solve the problem of how to identify my copy of a book across a variety of online/offline location where i might be using it.

Yes, I would think that what we had in mind is very close to what you describe (except maybe the unwise use of the terms, as noted in the issue). We say:

A PWP has an identifier. This is assigned by the author/publisher and is meant to uniquely identify the work, the edition, or whatever the publisher chooses to identify. A PWP spec would be silent as to how this identifier is assigned, under what authority, etc. I am not even sure we would require it to be a URI (ie, it can simply be an ISBN). The only thing we require is that each copy of the PWP (ie, when I make a copy on my local machine) MUST keep this identifier unchanged; it is immutable.
A PWP has what we have hitherto called a canonical locator; maybe the right term should be a state independent locator. It is a URI (probably we should say an HTTP(S) URI) and is assigned by the user, as you say, because it locates the specific copy of the publication that a publisher, a reseller, or a particular person has. If you make a copy of the work on your local server, that locator will be reassigned, because it should point at your particular copy. It is state independent because you may have, side by side, the same publication packed in ZIP, in tar.gz, and also unfolded into a Web site, essentially; all three 'states', as we refer to them, have the same state independent locator.
Finally, each state has, of course, a state dependent locator which identifies separately the .tar.gz and the .zip.

I think what I described is essentially the same as what you did, right?
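
The three items above could be modelled roughly as follows. This is only a sketch under stated assumptions: the `PWP` class, the `STATE_SUFFIXES` table, and the idea that a state dependent locator is derived by appending a suffix to the state independent locator are all invented for illustration; the thread does not say how the locators relate syntactically.

```python
from dataclasses import dataclass

# Assumed suffix per "state"; the thread only names zip, tar.gz and an unpacked site.
STATE_SUFFIXES = {"zip": ".zip", "tar.gz": ".tar.gz", "unpacked": "/"}


@dataclass(frozen=True)
class PWP:
    identifier: str  # assigned by the author/publisher; identical in every copy
    locator: str     # state independent locator of this particular copy

    def state_locator(self, state: str) -> str:
        """State dependent locator for one packaging of this copy."""
        return self.locator + STATE_SUFFIXES[state]

    def copied_to(self, new_locator: str) -> "PWP":
        """A copy keeps the identifier but gets its own locator."""
        return PWP(identifier=self.identifier, locator=new_locator)


original = PWP("example-publisher-id-123", "https://publisher.example/pwp/book1")
mine = original.copied_to("https://me.example/library/book1")
```

Here `mine.identifier` equals `original.identifier`, while `mine.state_locator("zip")` and `mine.state_locator("tar.gz")` are two different URLs sharing one state independent locator.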

dret · Apr 17, 2016

On 2016-04-17 17:34, Ivan Herman wrote:

another identifier is assigned by the user. it represents the
identity of the work they have somehow acquired by a publisher. this
identification scheme is opaque as well, but it is different from the
former in the sense that users are free to assign and change this as
they see fit, whereas the publication identifier should remain
unchanged. does this make any sense? it would at least solve the problem
of how to identify my copy of a book across a variety of online/offline
location where i might be using it.
I think what I described is essentially the same as what you did, right?

nope, because your user-assigned identifier is also a locator. which
means that a user cannot track identity across locations, such as having
the same book on their work and home computers and on their tablet.
that's why i was asking for that to be a pure identifier as well, which
is not used for locating. your PWP canonical id/locator then is a level
down because it needs to resolve, and would resolve to possibly
different copies of one user's book copy (ha!).

the book "work" is the ISBN as you and i agree.
the book copy's identifier is something that a book seller might
assign, or something that i assign when managing my library. maybe best
think of it as identifiers as assigned by libraries to their holdings.
each thing they own gets one; it's their inventory (they don't do
inventory by ISBN, of course).
the PWP identifier is a level below that because the copies are digital and thus
can be copied (this is where it gets confusing, so good terminology is
important). i can have a book in my "library", and then use it in
various places online and/or offline. i still need to be able to
consolidate these into one logical entity, so that i can build software
that can collect all my annotations for one of my books. i don't care
whether i made those on the copy i have on my university server, or on
my home server, or in some offline scenario.

i think you exclude a lot of use cases when you cannot deal with this
(admittedly complicated) layering of identities. i think you are simply
missing the "middle" level in your current model.
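
To make the "middle" level concrete, here is a small Python sketch of collecting annotations made at different locations under one copy identifier. The registry `COPY_ID_BY_LOCATOR`, the URLs, and the identifier values are hypothetical; how such a mapping would actually be stored or discovered is not addressed in this thread.

```python
from collections import defaultdict

# Hypothetical registry: where a copy lives -> the identifier of the user's book copy.
COPY_ID_BY_LOCATOR = {
    "https://university.example/~me/book1": "my-library/0042",
    "https://home.example/books/book1": "my-library/0042",
    "https://home.example/books/book2": "my-library/0043",
}

# Annotations recorded against whatever locator was in use at the time.
annotations = [
    ("https://university.example/~me/book1", "note on chapter 1"),
    ("https://home.example/books/book1", "note on chapter 2"),
    ("https://home.example/books/book2", "note on another book"),
]


def group_by_copy(annotations, copy_id_by_locator):
    """Group annotation texts by the copy identifier behind each locator."""
    grouped = defaultdict(list)
    for locator, note in annotations:
        grouped[copy_id_by_locator[locator]].append(note)
    return dict(grouped)


grouped = group_by_copy(annotations, COPY_ID_BY_LOCATOR)
```

Without the copy identifier, the first two annotations would look like they belong to two unrelated resources, since their locators differ.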

cheers,

dret.

iherman · Apr 19, 2016

Hi @dret,

Just trying to be sure I understand. What you propose is that a PWP would have three layers of identifiers/locators. Without sticking to the FRBR model too much, the way I would put it is that there is a hierarchy of identifiers/locators:

1. An identifier for the "work", typically an ISBN
2. An identifier for a "manifestation", e.g., the library's own catalogue number for the PWP they own and lend (in whatever state, eg, gzip or zip)
3. A state independent locator of my particular copy
4. State specific locators of my particular copy (ie, a locator to my copy in zip, and a separate locator to the unpacked version on my server)

(1) is immutable, and typically assigned by the publisher. (2) is usually immutable, in the sense that it is assigned by some authorized parties (library, reseller) but not changed by the end user. (3) and (4) are changed whenever a new copy is made by the end user. (1) and (2) MAY be non-URI identifiers (although for many applications I believe it may be reasonable to expect that (2) is also an HTTP URI), (3) and (4) are URI-s.

Thinking a bit further about what this requires of the mechanism we have described in the current draft, it probably means that:

The PWP manifest MUST include all four identifiers/locators (actually, (4) is a set of locators).
A PWP processor for a specific copy MUST be able to convert, say, an annotation expressed using (2) into one using (4) for the purpose of, eg, displaying the content.
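
As a rough illustration of these two requirements, the Python sketch below holds all four levels in one manifest object and maps an annotation target expressed with (2) onto a state specific locator (4). The `Manifest` fields, the `resolve` function, and the assumption that a target is "manifestation identifier plus a path inside the publication" are invented for this example; the draft does not define such an API, and only the unpacked state is handled.

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    work_id: str                    # (1) e.g. assigned by the publisher
    manifestation_id: str           # (2) e.g. a library catalogue number
    state_independent_locator: str  # (3) URI of this particular copy
    state_locators: dict = field(default_factory=dict)  # (4) state name -> URI


def resolve(manifest: Manifest, target_id: str, path: str, state: str = "unpacked") -> str:
    """Turn (manifestation id, path) into a URL within one state of this copy."""
    if target_id != manifest.manifestation_id:
        raise ValueError(f"{target_id!r} does not identify this publication")
    base = manifest.state_locators[state]
    return base.rstrip("/") + "/" + path.lstrip("/")


manifest = Manifest(
    work_id="example-work-id",
    manifestation_id="library.example:cat-7731",
    state_independent_locator="https://me.example/book1",
    state_locators={
        "zip": "https://me.example/book1.zip",
        "unpacked": "https://me.example/book1/",
    },
)

url = resolve(manifest, "library.example:cat-7731", "chapter1.html")
```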

First of all, is this what you mean?

I think I could live with this, I see the rationale, but I am a little bit concerned about the extra complication. @bjdmeest, @rdeltour, @lrosenthol, any opinion?

iherman · Apr 19, 2016

I actually wonder whether the model cannot be made a bit more general, and less restrictive. I think that, in practice, it may not be clear when to use which identifier in the model described in the previous comment. What about saying instead:

PWP has the following identifiers/locators:

1. A set of identifiers (with possibly a label assigned to each identifier to describe what it is for and who is the authority to assign them)
2. A state independent locator (can we call this a 'canonical locator'?)
3. State specific locators of a particular copy (ie, a locator to my copy in zip, and a separate locator to the unpacked version on my server)

The rule being that the entries in (1) SHOULD be changed only by the respective authorities, and they MUST be part of the PWP manifest.

This is intentionally fuzzy about the details of the identifiers, mainly regarding access control, but maybe that is a reasonable approach for a technical specification that is not supposed to control policy. Maybe a final specification would have to assign some more metadata to each identifier, but we can leave that for now.
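
One possible reading of this more general model, sketched in Python. The names `LabeledIdentifier` and `IdentifierSet`, the labels, and the authority strings are all hypothetical. Note also that the sketch enforces the rule strictly by raising an error, whereas the proposal above only says SHOULD; a real processor might merely warn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledIdentifier:
    value: str      # the identifier itself, opaque
    label: str      # what the identifier is for
    authority: str  # who is entitled to assign or change it


class IdentifierSet:
    """The set of identifiers carried in a manifest (illustrative only)."""

    def __init__(self, identifiers):
        self._by_label = {i.label: i for i in identifiers}

    def get(self, label: str) -> str:
        return self._by_label[label].value

    def update(self, label: str, new_value: str, requested_by: str) -> None:
        current = self._by_label[label]
        if requested_by != current.authority:
            raise PermissionError(f"{requested_by} is not the authority for {label!r}")
        self._by_label[label] = LabeledIdentifier(new_value, label, current.authority)


ids = IdentifierSet([
    LabeledIdentifier("example-work-id", "work", "publisher.example"),
    LabeledIdentifier("cat-7731", "catalogue", "library.example"),
])

# The library may renumber its own catalogue entry...
ids.update("catalogue", "cat-9001", requested_by="library.example")

# ...but may not touch the publisher's identifier.
try:
    ids.update("work", "other-id", requested_by="library.example")
    work_id_changed = True
except PermissionError:
    work_id_changed = False
```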

WDYT?

dret · Apr 19, 2016

On 2016-04-19 07:08, Ivan Herman wrote:

An identifier for the "work", typically an ISBN

An identifier for a "manifestation", e.g., the library's own
catalogue number for the PWP they own and lend (in whatever state,
eg, gzip or zip)

A state independent locator of my particular copy

State specific locators of my particular copy (ie, a locator to my
copy in zip, and a separate locator to the unpacked version on my
server)
First of all, is this what you mean?

yes, this now has all the levels i was talking about. a different
identity model from the one you currently have.

I /think/ I could live with this, I see the rationale, but I am a little
bit concerned of the extra complication. @bjdmeest, @rdeltour,
@lrosenthol, any opinion?

well, it's complicated and maybe a bad idea. it was just a very rough
first idea of what i thought was minimally necessary to make those
scenarios work where users want to manage their own annotations in a
consistent way, and thus need some way to identify their personal copies
of the resources they are annotating.

i am sure there are other ways to do this. this may be too complicated
and too brittle. it was simply what i was thinking about when discussing
a concrete use case (which actually has even more complications in it).

dauwhe · Apr 19, 2016

Just a quick note that the ISBN doesn't make a very good work identifier, as it's a product identifier. If two items have the same ISBN, they're likely to be the same work (although significant textual differences are common). But if two items have different ISBNs, all you can say is that they are different products. It's possible they're the same work, or even the same manifestation of a work.

dret · Apr 19, 2016

On 2016-04-19 15:03, Dave Cramer wrote:

Just a quick note that the ISBN doesn't make a very good work
identifier, as it's a product identifier. If two items have the same
ISBN, they're likely to be the same work (although significant textual
differences are common). But if two items have different ISBNs, all you
can say is that they are different products. It's possible they're the
same work, or even the same manifestation of a work.

sure. good example is hardcover and paperback, which have different ISBN
numbers. the process of how work identifiers are assigned should be
completely opaque. and it's up to the publisher how to do that, and
which scheme to use. i think URI would be a good starting point, as
there's this: https://tools.ietf.org/html/rfc3187
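
For readers unfamiliar with the `urn:isbn:` form that RFC 3187 describes, a small Python sketch of turning an ISBN into such a URN follows. The function names are made up for this example, and validating the ISBN-13 check digit (weights 1 and 3 alternating, sum divisible by 10) before building the URN is an assumption added here, not something the thread or the RFC link above prescribes.

```python
def is_valid_isbn13(isbn: str) -> bool:
    """Check the ISBN-13 check digit; hyphens and spaces are ignored."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def isbn_urn(isbn: str) -> str:
    """Build a URN in the ISBN namespace from a valid ISBN-13."""
    if not is_valid_isbn13(isbn):
        raise ValueError(f"not a valid ISBN-13: {isbn!r}")
    return "urn:isbn:" + isbn.replace(" ", "")


work_uri = isbn_urn("978-0-306-40615-7")
```

As noted above, such a URI still identifies a product rather than a work, so hardcover and paperback editions would get different URNs.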

iherman changed the title from Consider finding a different name for "canonical locator" to Hierarchy of identifiers/locators (was: Consider finding a different name for "canonical locator") Apr 19, 2016

w3c/dpub-pwp-loc
