Task Forces/identifiers

From Digital Publishing Interest Group

Task force on Identifiers

  • Leader(s): Bill Kasdorf, Apex

Task force meeting minutes

  • 2015

Members (Please add your name, organization)

  • Ivan Herman, W3C
  • Bill Kasdorf, Apex
  • Phil Madans, Hachette
  • Julie Morris, BISG
  • Liam Quin, W3C
  • Ayla Stein, UIUC
  • Thierry Michel, W3C
  • Tzviya Siegman, Wiley
  • Laura Dawson, ProQuest

Goals

  • Establish principles and mechanism for identifiers

Background

[Contributed by Tzviya Siegman, Wiley]

There are many methods for identifying objects and their sub-parts. Some methods allow for identifying or referring to sub-sections that do not stand alone. For example, a user may wish to refer to a specific point in a video when citing a Work.

The next step is to assess the criteria for what makes a good identifier.

What do identifiers of fragments of content need to accomplish?

  • They are conceptually and architecturally simple.
  • They are built with an eye towards the future. What is likely to remain constant over time? Can we imagine the same approach working 100 years from now? (A good question is: What has remained relatively stable in the past?)
  • They are able to be implemented now -- they don’t depend on new technologies or standards yet to come.
  • They acknowledge that because of the inherent instability of web content, anchors can never be known for certain, and we may take different actions based on how confident we are. (The web and content are unstable, and the identifiers need to stand up to that instability.)
  • They outline how we fail gracefully when we cannot reattach anchors.
  • New anchors can be determined solely from information available on the document to which they refer.
  • They are not expensive to compute on a per instance basis.
  • They function well for the diversity of content and formats, including the lowest common denominator problem: In other words, some pages are highly structured (and that structure can be traversed to efficiently rediscover things that have moved), but some pages are not. Content can be text, graphic, media, or in some cases, more than one format.
  • Queries to see whether annotations exist for a given anchor would not require excessive bandwidth or complex multipart negotiations with remote services. (Simple, short / fast queries are ok.) (Note: this is a specific use case for annotations.)

(This list was taken from https://github.com/hypothesis/h/wiki/Robust-anchors and adapted.)

What are the options?

This list of options for Fragment Identifiers is not comprehensive. Options such as Purple Numbers (http://eekim.com/software/purple/purple.html) and the New York Times Emphasis Project (http://open.blogs.nytimes.com/2011/01/11/emphasis-update-and-source/?_r=0) exist but are either too similar to other examples or not robust enough.

XPointer

XPointer really has two parts: a framework that identifies a scheme, and a set of schemes. For example, the XPointer

  xpath( /book/chapter[3]/p[contains(., "profit")] )

would point to all paragraphs in the third chapter that mention "profit".
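
As a rough illustration of what that pointer selects, the sketch below evaluates the same condition with Python's standard library. The sample document and the helper name paragraphs_mentioning are assumptions made for this example. xml.etree.ElementTree implements only a subset of XPath and has no contains(), so the text test is done in plain Python rather than by an XPointer processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
BOOK = """<book>
  <chapter><p>Intro.</p></chapter>
  <chapter><p>Costs rose.</p></chapter>
  <chapter>
    <p>Our profit grew.</p>
    <p>Staff numbers were flat.</p>
    <p>Gross <b>profit</b> doubled.</p>
  </chapter>
</book>"""


def paragraphs_mentioning(root, word):
    """Emulate /book/chapter[3]/p[contains(., word)], starting at the <book> root."""
    # chapter[3] selects the third <chapter> child (positions are 1-based).
    candidates = root.findall("./chapter[3]/p")
    # contains(., word) tests the whole string value, nested elements included.
    return [p for p in candidates if word in "".join(p.itertext())]


root = ET.fromstring(BOOK)
matches = paragraphs_mentioning(root, "profit")
print(["".join(p.itertext()) for p in matches])
```

Both the first and the last paragraph of the third chapter match; the second does not mention the word.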

There's also a "tumbler" XPointer scheme that uses numbers in the manner of CFI.

There's a registry for adding additional XPointer schemes.

XPath

http://www.w3.org/TR/xpath-30/

XPath is normally used to point into documents (primarily HTML or XML; XPath is defined over a tree-based model of a document).

XPath supports pointing to elements, attributes and text, as well as to processing instructions and comments; the language is XML namespace-aware.

XPath 1.0 (1999) is implemented today in Web browsers as part of the Open Web Platform.

All versions of XPath support the concept of matching a set of nodes (e.g. elements) and then winnowing that set down using filters. For example, /html/body/div/table/tr/td might match hundreds of td elements across every table in a document, but /html/body/div[@id = 'overview']/table[@class = 'summary']/tr[last()]/td would match only the td elements in the last row of a summary table in the overview section.
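
The winnowing effect can be tried out with a small sketch. The sample page below is an assumption invented for the example, and the paths are written relative to the html root element because that is how Python's xml.etree.ElementTree (which supports only a subset of XPath 1.0) evaluates them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page, invented for this illustration.
PAGE = """<html><body>
  <div id="intro"><table class="summary"><tr><td>x</td></tr></table></div>
  <div id="overview">
    <table class="detail"><tr><td>y</td></tr></table>
    <table class="summary">
      <tr><td>a</td><td>b</td></tr>
      <tr><td>c</td><td>d</td></tr>
    </table>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Unfiltered path: every td in every table directly under a div.
broad = root.findall("./body/div/table/tr/td")

# Filtered path: only the last row of the summary table in the overview div.
narrow = root.findall(
    "./body/div[@id='overview']/table[@class='summary']/tr[last()]/td"
)

print(len(broad), [td.text for td in narrow])
```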

EPUB CFI

http://www.idpf.org/epub/linking/cfi/epub-cfi.html

EPUB CFI (EPUB Canonical Fragment Identifier) defines a method for referencing arbitrary content within an EPUB file.

Sample ID:

book.epub#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)

Steps are denoted by the forward slash character (/) and are used to traverse XML content. The last step in a CFI path represents a location within a document, either structural (XML element), textual (character data), or aural-visual (image, audio, or video media). The example identifier above starts with step /6, which selects the spine: child elements are numbered with even steps, so the spine, always the third child element of the package, is step 6. The next step, /4, is again an even step and selects the spine entry for chapter 1. The exclamation point that follows indicates indirection: the path continues inside the item to which that spine entry points. The numbers after the exclamation point traverse the DOM of that item in the same way. A trailing colon and number, as in :10, gives a character offset, so it is possible to refer to specific words or letters. To reach the spine, it goes like this:

<package> – ignored since this is the starting point
/1 – even though there is only whitespace (or no whitespace) this is the first inter-element text node
<metadata> /2 – referencing inside metadata would start a new step
/3 – text node between the metadata and manifest
<manifest> /4
/5 – text node
<spine> /6 – starting step for cfis – to reference inside you now start a new step
   /1
   <itemref> /2 – the cover or some front matter
   /3
   <itemref> /4 – reference to chapter 1
   ...
</spine>
/7 
(Thanks, Matt G!)
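
The even-step numbering walked through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the package document is a simplified, namespace-free stand-in, the function names parse_steps and resolve_element_steps are hypothetical, and only the package part of the CFI (before the exclamation point) is resolved. Escaped characters in ID assertions, odd (text) steps and character offsets are not handled.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified package document (no namespaces), invented for this sketch.
PACKAGE = """<package>
  <metadata/>
  <manifest/>
  <spine>
    <itemref idref="cover"/>
    <itemref idref="chap01" id="chap01ref"/>
  </spine>
</package>"""

# One step: a slash, a number, and an optional [id] assertion.
STEP = re.compile(r"/(\d+)(?:\[([^\]]*)\])?")


def parse_steps(path):
    """Split a path such as '/6/4[chap01ref]' into (number, assertion) pairs."""
    return [(int(n), ident or None) for n, ident in STEP.findall(path)]


def resolve_element_steps(root, steps):
    """Follow even-numbered steps: step 2k selects the k-th child element."""
    node = root
    for number, ident in steps:
        if number % 2:
            raise ValueError("odd steps address text between elements")
        node = list(node)[number // 2 - 1]
        if ident is not None and node.get("id") != ident:
            raise ValueError(f"ID assertion {ident!r} does not match")
    return node


cfi = "epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)"
package_part = cfi[len("epubcfi("):-1].split("!")[0]  # "/6/4[chap01ref]"
steps = parse_steps(package_part)
target = resolve_element_steps(ET.fromstring(PACKAGE), steps)
print(steps, target.get("idref"))
```

Step 6 lands on the spine (third child element) and step 4 on its second entry, whose id matches the assertion in brackets.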

The level of detail allows for graceful failure. If the content is changed, the DOM usually remains structured in approximately the same manner, so the identifier will still target a point very close to the location of the original content by moving up the nodes until a target is found. Because it is based on structure, this kind of identifier can be used for any structured content; the more nodes, the better. At this point, this method is available only for EPUB. There are no commercially available tools to generate CFIs.

W3C’s Media Fragment

http://www.w3.org/TR/media-frags/

A fragment specification that identifies portions of a media resource, such as a video or audio file, along various dimensions (temporal, spatial). The specification also provides a processing model for handling these fragments over HTTP.

Sample IDs:

http://www.example.com/example.ogv#track=audio&t=10,20

identifies a specific time interval (from second 10 to second 20) on the audio track of a video.

http://www.example.com/example.ogv#xywh=160,120,320,240

identifies a 320x240 pixel box whose top-left corner is at x=160, y=120 in the video.
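
To show how little machinery the name=value syntax needs, here is a minimal parsing sketch. The function name parse_media_fragment and the shape of the returned dictionary are assumptions made for this example. Only the plain forms used in the two sample IDs are handled (seconds for t, pixels for xywh), not the full grammar of the specification.

```python
from urllib.parse import parse_qs, urlsplit


def parse_media_fragment(url):
    """Return the dimensions carried in the fragment of a media URL."""
    # If a name is repeated, this sketch simply keeps its last value.
    dims = {k: v[-1] for k, v in parse_qs(urlsplit(url).fragment).items()}
    result = {}
    if "track" in dims:
        result["track"] = dims["track"]
    if "t" in dims:
        # Only the plain "start,end" form in seconds is handled here.
        start, _, end = dims["t"].partition(",")
        result["t"] = (float(start or 0), float(end) if end else None)
    if "xywh" in dims:
        x, y, w, h = (int(n) for n in dims["xywh"].split(","))
        result["xywh"] = {"x": x, "y": y, "w": w, "h": h}
    return result


print(parse_media_fragment("http://www.example.com/example.ogv#track=audio&t=10,20"))
print(parse_media_fragment("http://www.example.com/example.ogv#xywh=160,120,320,240"))
```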

W3C Packaging for the Web Fragment Identifiers

http://w3ctag.github.io/packaging-on-the-web/#fragment-identifiers

Sample IDs:

http://example.org/downloads/editor.pack#url=/root.html;fragment=colophon

is the same as

http://example.org/root.html#colophon

http://example.org/downloads/editor.pack#url=/root.svg;fragment=svgView(viewBox(0,200,1000,1000))

points to a specific location within an SVG.

This is an identifier scheme to support the proposed Packaging for the Web format. One identifier can be represented in more than one way. Content-type can be included in the URL, which means that the identifier itself supports content negotiation, for example allowing the user to choose between the SVG and PNG versions of an image. It is also possible to identify any location or sub-location within a package. However, the look of the URL can change, which might be a little confusing. Further, the URLs look somewhat different from what most of us are used to seeing. The Packaging for the Web specification is not yet stable or supported.
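
The equivalence between the package form and the direct form of the sample IDs can be sketched as a small rewriting function. This is an illustration inferred only from the samples above, not the algorithm of the draft specification; the function name unpack_package_url is hypothetical, and parameters other than url and fragment are ignored.

```python
from urllib.parse import urljoin


def unpack_package_url(url):
    """Rewrite 'pack#url=...;fragment=...' into the URL of the packaged resource."""
    package, _, fragment = url.partition("#")
    # The fragment is treated as a ';'-separated list of name=value pairs.
    params = dict(part.split("=", 1) for part in fragment.split(";") if part)
    # Resolve the packaged path against the location of the package itself.
    target = urljoin(package, params["url"])
    if "fragment" in params:
        target += "#" + params["fragment"]
    return target


print(unpack_package_url(
    "http://example.org/downloads/editor.pack#url=/root.html;fragment=colophon"
))
```

The nested fragment is passed through untouched, which is what lets a media-specific fragment such as svgView(...) be reused inside the package-level one.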

Open Annotations Fragment Selector

http://www.w3.org/TR/2014/WD-annotation-model-20141211/#fragment-selector

Example in JSON-LD:

{
  "@id": "http://example.org/anno1",
  "@type": "oa:Annotation",
  "body": {"@id": "http://example.org/body1"},
  "target": {
    "@id": "http://example.org/sptarget1",
    "@type": "oa:SpecificResource",
    "source": "http://example.org/target1",
    "selector": {
      "@id": "http://example.org/selector1",
      "@type": "oa:TextPositionSelector",
      "start": 4,
      "end": 7
    }
  }
}

The Web Annotations Working Group is adopting the Open Annotations data model to create a functional model for annotations on the Web. It combines RDF and hash notation to create an identifier that carries more information.

oa:FragmentSelector is an RDF class (designating a resource which describes the segment of interest in a representation, through the use of the fragment identifier component of a URI).

  • Fragment – Uses # notation, and the spec is very flexible about which fragments are allowed
  • Range – There are a few options for defining range: e.g., text position, data position. This is considered somewhat brittle, but it is very important to the world of annotations, where it is crucial to recognize both the beginning and the end of the annotated portion of content. Using State in conjunction with a range is recommended.
  • Text Quote – Used to identify a text within an element by providing:
    • A copy of the text which is being selected
    • prefix - A snippet of text that occurs immediately before the text which is being selected.
    • suffix - The snippet of text that occurs immediately after the text which is being selected.
  • Area - Used for non-rectangular content (SVG)
  • State – This is recommended for use in conjunction with brittle text position. For example, does the annotation apply to the published state? Archived state? Manuscript state? This can be associated with timestamp or request header.
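
The difference in robustness between a text position and a text quote can be seen in a short sketch. The function names and the sample sentence are assumptions made for this example, and the position is taken to be zero-based with the end excluded. The quote lookup is a plain substring search for prefix + text + suffix, with no fuzzy matching.

```python
def select_by_position(text, start, end):
    """Text position: characters from start up to, but not including, end."""
    return text[start:end]


def select_by_quote(text, exact, prefix="", suffix=""):
    """Text quote: find exact between prefix and suffix; return (start, end) or None."""
    at = text.find(prefix + exact + suffix)
    if at < 0:
        return None  # the quote can no longer be reattached
    start = at + len(prefix)
    return (start, start + len(exact))


text = "The cat sat on the mat. The cat ran."
print(select_by_position(text, 4, 7))

# After an edit at the front, the stored position points at the wrong
# characters, while the quote with its context is still found.
edited = "Oh! " + text
print(select_by_position(edited, 4, 7))
span = select_by_quote(edited, "cat", prefix="The ", suffix=" ran")
print(edited[span[0]:span[1]])
```

The prefix and suffix also disambiguate: the sentence contains "cat" twice, and only the second occurrence is followed by " ran".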

This scheme supports multiple targets and addresses the issue of range, both of which are requirements for several use cases, not annotations exclusively. This is the only (?) scheme that expressly addresses the concept of range, which will be of value when handling publishing concepts, such as indexes. This specification also associates the concept of "state" with an identifier. This is clearly useful with regard to annotations, but may not be required for all fragment identifiers. It incorporates a framework for different fragment selector algorithms. Its breadth and lack of specificity could be helpful or harmful. It is also still under development, but there are early implementers.

HathiTrust Research Center's page identifiers

The full specification is at https://www.ideals.illinois.edu/handle/2142/73147

The important point is that the identifier scheme must allow the specification of page references, even if that concept may be considered as inadequate for digital content. See Bill's mail and the related thread.

Proposed Strategy

[Contributed by Ivan Herman, W3C]

We should distinguish two separate categories here:

  • Purely media-type-specific fragments (that includes xpointer, the media fragments as mentioned by Thierry, xpath, svgview, etc.)
  • Package-level fragments like CFI and/or the Packaging Fragments (let's refer to this as PFrag for now)

In an EPUB-WEB approach we should not deal with the first category at all. Those are specified by other groups, registered by IETF, etc; the DPUB community should be a user of those just as they are users of the specific media. In the future, the variety of media that can be added to a portable document will just increase and will be open ended; we should be 'clients' of that evolution.

CFI and PFrag have a different concern: the question is how to find a specific document within a package and then, within that document, a finer way of identifying an anchor.

What we should specify, as a set of requirements, is what an EPUB-WEB Fragment (say EWFrag) has to fulfill. Here is a tentative list, based on CFI and PFrag:

  1. EWFrag should have a clear way of identifying a document *within* the media
  2. EWFrag should have a way to follow "paths" of references through several 'hops'
  3. EWFrag should have a way to reuse externally defined fragment id specifications for specific media types
  4. EWFrag should have clear (and simple) conceptual equivalents to URI-s with fragment ID-s when the document is directly accessed on the Web
  5. EWFrag should be based, as far as possible, on technologies widely deployed on the Web (and hence in Web browsers)

For CFI:

  • (1) is fulfilled (for EPUB) starting from the package file
  • (2) is fulfilled through the usage of the '!' character, though the definition seems to rely on XHTML and SVG elements only, ie, is not really extensible
  • (3) is not fulfilled, as far as I can see; instead, it uses its own identification down to the character level in a document
  • (4) is not fulfilled, it uses its own identification
  • (5) is fulfilled today but may not work tomorrow: it is deeply rooted in XML both for the package file and the target documents; if some packages are defined in other formats (eg, JSON) then this may break down; it may not even work with HTML5 (does the '/' approach, with its differentiation between element and text children, work the same way?)

For PFrag:

  • (1) is fulfilled, using the list of headers within the package
  • (2) is not fulfilled, it can only go one step (from the package down to a document within the package)
  • (3) is fulfilled; in fact PFrag is concerned only with the identification of a document within the package and is oblivious to the rest
  • (4) is sort of fulfilled (per documentation), but is a bit convoluted
  • (5) is fulfilled; it relies on, essentially, HTTP headers, which are part of the basics on the Web

There may be other requirements (Human readability? Ease of generation?) and some of the requirements above may not really be important (eg, how important is (2)?). But these are the kinds of requirements that we should formulate.

Requirements

[Working Draft for discussion at DPUB F2F 26 May 2015]

The following is provided as a catch-all resource of potential requirements for Fragment Identifiers for consideration and discussion. Except for the first group from Ivan, these are not proposed requirement statements; they are simply a collection of thought-triggers to help ensure that as many ideas are considered as possible. WG and IG members are encouraged to add any thoughts to the list below in advance of the F2F.

Contributed by Ivan Herman, W3C (see above):

What we should specify, as a set of requirements, is what an EPUB-WEB Fragment (say EWFrag) has to fulfill. Here is a tentative list, based on CFI and PFrag:

  1. EWFrag should have a clear way of identifying a document *within* the media
  2. EWFrag should have a way to follow "paths" of references through several 'hops'
  3. EWFrag should have a way to reuse externally defined fragment id specifications for specific media types
  4. EWFrag should have clear (and simple) conceptual equivalents to URI-s with fragment ID-s when the document is directly accessed on the Web
  5. EWFrag should be based, as far as possible, on technologies widely deployed on the Web (and hence in Web browsers)

Contributed by Tzviya Siegman, Wiley (see above):

  1. They are conceptually and architecturally simple.
  2. They are built with an eye towards the future. What is likely to remain constant over time? Can we imagine the same approach working 100 years from now? (A good question is: What has remained relatively stable in the past?)
  3. They are able to be implemented now -- they don’t depend on new technologies or standards yet to come.
  4. They acknowledge that because of the inherent instability of web content, anchors can never be known for certain, and we may take different actions based on how confident we are. (The web and content are unstable, and the identifiers need to stand up to that instability.)
  5. They outline how we fail gracefully when we cannot reattach anchors.
  6. New anchors can be determined solely from information available on the document to which they refer.
  7. They are not expensive to compute on a per instance basis.
  8. They function well for the diversity of content and formats, including the lowest common denominator problem: In other words, some pages are highly structured (and that structure can be traversed to efficiently rediscover things that have moved), but some pages are not. Content can be text, graphic, media, or in some cases, more than one format.
  9. Queries to see whether annotations exist for a given anchor would not require excessive bandwidth or complex multipart negotiations with remote services. (Simple, short / fast queries are ok.) (Note: this is a specific use case for annotations.)

(This list was taken from https://github.com/hypothesis/h/wiki/Robust-anchors and adapted.)

Requirements from the Proposal for Persistent and Unique Entity Identifiers for page-level content within the Hathi Trust Research Center:

  1. Persistent Citability [work when the cited resource position changes with relation to other resources]
  2. Point-in-Time Citability [stable over time; reflects state of resource at a given time]
  3. Reproducibility [citations can be shared]
  4. Supporting "non-consumptive" usage [relevant to HTRC; analog for DPUB might be "Does not alter identified resource"]
  5. Improved Granularity [note that HTRC is only looking for page-level citation; included here for completeness]
  6. Expanded Workset Membership [translation for DPUB might be "easy to implement and use"]
  7. Supporting Graph Representations [DPUB: is there a requirement wrt RDF, JSON, etc.?]

Some high-level plain-English statements for discussion

  • EWFrag should enable addressing arbitrary locations within a textual document, not just nodes based on its markup.
  • EWFrag should enable discontinuous fragments within a document to be associated and addressed as a group. [Very likely this just means perhaps needing to identify a collection of fragment identifiers with another fragment identifier; just added for discussion.]

Here are some requirements statements that express several of the above in a more plain-English form, for consideration; may be more appropriate for a document aimed at the general public than a Note aimed at W3C groups:

  • Identify precise locations within a content document or resource that may be, but are not necessarily, nodes in its markup
  • Identify fragments based on two such locations marking the starting and ending location of the fragment

[Note: I think it's useful to clearly distinguish a "location" (a single point) from a "fragment" (bounded by two such points). When you point to a location, people tend to think "well the markup defines what that fragment is, e.g. a <section>," but that is way too text-and-HTML-markup centric imo.--BK]

  • Apply to any media type in a publication, not just text documents
  • Enable the use of existing media-specific fragment identifiers
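
One possible way to keep the distinction between a location and a fragment explicit is a small data model, sketched below. All class and field names are hypothetical, and the selector strings (including the char= form) are placeholders for whatever media-specific syntax is eventually chosen; the sketch treats them as opaque so that existing media-specific fragment identifiers can be reused.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Location:
    """A single point in a resource."""
    resource: str  # URL or path of the content document or media file
    selector: str  # media-specific pointer, kept opaque (e.g. "t=10")


@dataclass(frozen=True)
class Fragment:
    """A span bounded by a starting and an ending location."""
    start: Location
    end: Location

    def __post_init__(self):
        if self.start.resource != self.end.resource:
            raise ValueError("start and end must lie in the same resource")


@dataclass(frozen=True)
class FragmentGroup:
    """Discontinuous fragments of one document, addressed together."""
    members: Tuple[Fragment, ...]


clip = Fragment(Location("video.mp4", "t=10"), Location("video.mp4", "t=20"))
first = Fragment(Location("ch1.html", "char=120"), Location("ch1.html", "char=180"))
second = Fragment(Location("ch1.html", "char=400"), Location("ch1.html", "char=450"))
group = FragmentGroup((first, second))
print(clip.start.selector, len(group.members))
```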