[locators] High-level thoughts

Hi all,

Based on the discussion yesterday, I have been musing, and drafted my
thoughts below.
It is insanely long, sorry for that. The short version is that I make the
following statements:

* A PWP locator can be absolute or relative.
* The relative locator allows linking to resources once you know where the
PWP is located
  * and can be derived using the PWP manifest
* The absolute locator consists of the relative locator and the PWP locator.
* The PWP locator is always in a certain state (e.g., locally unpacked, or
hosted packed, or ...)
* However, all instantiations of the PWP link back to the state-less,
abstract PWP, via its Canonical URL
* and that Canonical URL needs to point to at least one instantiation of a
PWP.
* Thus, a PWP can be referenced using its specific instantiation, or via
its Canonical URL.

All of these statements are open to debate of course :).

Also: @Romain: could you give an update to the current state of the use
cases, and how we can help you?

Greetings,
Ben

## States - scope

As per the current state of the PWP WD,
we scope this work specifically to the case where a PWP can have different
states (packed/unpacked, protocol/file),
but otherwise, the contents of the PWP are exactly the same across those
states.

Locating content between PWPs that have different contents (e.g., in
another language, or an earlier version)
is currently out of scope.
Things such as the FRBR model are out of scope as well,
as they are more about identifiers than about locators.

Also, by locators, I mean locators of (entire) PWPs and/or individual
resources inside the PWP.
For more fine-grained locations (e.g., the second paragraph of document X),
other efforts are going on, e.g., in the annotation working group.

## Remark: Absolute vs Relative

As far as I see it, it is possible to have relative and absolute locators,
where relative locators will mostly (exclusively?) be used inside the PWP,
and absolute locators might be used for internal links,
but probably mostly for external sources linking to the PWP.

As such, I think of a locator as having two parts:
[PWP locator]*[resource locator]

In the case of a relative locator, the [PWP locator] is missing,
and needs to be derived from context.
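
To make the two-part idea a bit more concrete, here is a rough sketch in Python that treats the [PWP locator] as the context and resolves a relative [resource locator] against it. The function name `resolve_locator` and all URLs and paths are made up for illustration; this only shows ordinary URL resolution, not a proposed PWP mechanism.

```python
from urllib.parse import urljoin


def resolve_locator(pwp_locator: str, resource_locator: str) -> str:
    """Combine a PWP locator (the context) with a resource locator."""
    # Treat the PWP locator as a container: with a trailing slash, the
    # resource path is appended instead of replacing the last path segment.
    base = pwp_locator if pwp_locator.endswith("/") else pwp_locator + "/"
    # An already absolute resource locator is returned unchanged by urljoin.
    return urljoin(base, resource_locator)


# The same relative locator, resolved in two hypothetical contexts.
relative = "chapters/section2.html"
print(resolve_locator("https://example.com/books/pwp1", relative))
print(resolve_locator("file:///usr/home/ben/pwp1", relative))
```

In this sketch the relative locator stays the same when the PWP moves; only the context it is resolved against changes.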

### Internal links

Inside the PWP
> i.e., inside the 'container' that holds all contents of the PWP,
> for a packed PWP, this is straightforward, i.e., inside the package,
> for an unpacked PWP,
> I mean inside the subfolder, whether it is file or protocol state

`<p>See <a href="[resource locator]">Section 2</a> for more info.</p>`

Q1: Is this locator the same when

* Q1a. (section 2 is in the same file)
* Q1b. section 2 is a different file, but within the same PWP
* Q1c. the PWP is opened protocol/unpacked
* Q1d. the PWP is opened file/packed
* Q1e. the PWP is opened protocol/packed
* Q1f. the PWP is opened file/unpacked
* Q1g. the PWP is opened in a different protocol (e.g., via http or https
or ftp)
* Q1h. the PWP is moved/copied protocol-wise (e.g., from example.com to
books.org)
* Q1i. the PWP is moved/copied file-wise (e.g., from /usr/home/ben/ to
/user/home/bjdmeest/)
* Q1j. the PWP is packed vs unpacked

### External links

From an (online) website / (offline) paper / ...

`<p>John et al. describe an <a href="[PWP locator][resource locator]">interesting algorithm</a> for this problem.</p>`

Q2: Is this locator the same when

* Q2a. The referring document is actually inside the PWP
* Q2b. The referred PWP is accessed protocol/unpacked
* Q2c. The referred PWP is accessed file/packed
* Q2d. The referred PWP is accessed protocol/packed
* Q2e. The referred PWP is accessed file/unpacked
* Q2f. The referred PWP is accessed in a different protocol (e.g., via http
or https or ftp)
* Q2g. the referred PWP is moved/copied protocol-wise (e.g., from
example.com to books.org)
* Q2h. the referred PWP is moved/copied file-wise (e.g., from
/usr/home/ben/ to /user/home/bjdmeest/)
* Q2i. the referred PWP is packed vs unpacked

## Idea

Personally, I see this as two different problems, i.e.,
the [PWP locator] depends on the protocol the PWP is in,
whereas the [resource locator] depends on how the packed vs unpacked
PWP should be accessed.
To me, the [resource locator] is more technical, i.e., once you have the
PWP,
you can (probably via the manifest) access and link to the individual
resources.
Given the discussion yesterday, I see the following high-level model, to
solve the [PWP locator]:

1. Most importantly, a PWP consists of a Canonical URL and some resources.
2. The identifiers of a PWP are, e.g., ISBN numbers, but could coincide
with this Canonical URL
3. The Canonical URL is the reference to the abstract PWP, whereas
different State URLs refer to specific instantiations of that PWP
4. The Canonical URL does not need to be on the same online place as the
actual PWP (cf. DOI)
5. The State URLs could be, e.g., the packed version on the publisher's
website, the unpacked version on the publisher's website,
or the URL of the local copy of the downloaded PWP

When referencing a publication, the user can reference the Canonical URL or
the state URL.
When referencing the state URL, the Canonical URL could be found, as it is
part of the PWP.
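
A rough sketch of this model in Python, just to show the shape of the idea. The class `PWP`, the helper `canonical_for`, and all URLs are hypothetical; in particular, the list of known PWPs stands in for "the Canonical URL is part of the PWP", since how a PWP actually carries its Canonical URL is not specified yet.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PWP:
    """Abstract PWP: one Canonical URL plus its known instantiations."""
    canonical_url: str
    state_urls: List[str] = field(default_factory=list)

    def has_content(self) -> bool:
        # Without at least one State URL, only the abstract PWP exists.
        return len(self.state_urls) > 0


def canonical_for(reference: str, known: List[PWP]) -> Optional[str]:
    """Map a reference (Canonical or State URL) back to the Canonical URL."""
    for pwp in known:
        if reference == pwp.canonical_url or reference in pwp.state_urls:
            return pwp.canonical_url
    return None  # the reference does not belong to any known PWP


# Hypothetical example: one abstract PWP with three instantiations.
book = PWP(
    canonical_url="https://doi.example/10.1234/pwp1",
    state_urls=[
        "https://publisher.example/pwp1-packed",  # packed, protocol
        "https://publisher.example/pwp1/",        # unpacked, protocol
        "file:///usr/home/ben/pwp1/",             # unpacked, local copy
    ],
)
print(canonical_for("file:///usr/home/ben/pwp1/", [book]))
```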

### (technical) TODOs

Systems need to be in place to make sure the Canonical URL can refer to at
least one state URL,
as otherwise only the abstract PWP exists, but no real content.

It should be specified how a PWP refers to its Canonical URL.

It should be specified how to access and link to specific resources in a
PWP, via some kind of manifest.
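
As a thought experiment only, the manifest could be as simple as a mapping from resource names to paths inside the PWP. The structure of `MANIFEST`, its keys, and the function `relative_locator` below are assumptions made for this sketch, not anything from the current draft.

```python
import posixpath

# Hypothetical manifest: resource names mapped to paths inside the PWP.
MANIFEST = {
    "canonical": "https://doi.example/10.1234/pwp1",
    "resources": {
        "intro": "index.html",
        "section2": "chapters/section2.html",
        "figure1": "images/figure1.png",
    },
}


def relative_locator(manifest: dict, from_id: str, to_id: str) -> str:
    """Derive the relative locator from one resource to another."""
    resources = manifest["resources"]
    # Links are relative to the directory of the referring resource.
    from_dir = posixpath.dirname(resources[from_id]) or "."
    return posixpath.relpath(resources[to_id], start=from_dir)


print(relative_locator(MANIFEST, "intro", "section2"))
print(relative_locator(MANIFEST, "section2", "figure1"))
```

The point of the sketch is that the relative locator is derived purely from the manifest, without knowing where the PWP currently lives.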

### Fun things

Fun thing #1: the most minimal website can already be a PWP, namely:
the Canonical URL is also a State URL to the unpacked protocol version of
the PWP.

Fun thing #2: a user can remix the local PWP as much as they like -- e.g.,
stripping out all the videos to create a 'slim' PWP and republishing it --
the remixed PWP could still refer to the 'official' PWP via its Canonical
URL,
and the publisher still keeps authority on 'correct' PWPs,
as the Canonical URL does not need to refer to the remixed PWP, but only to
the authorized PWPs.
Add in checksums etc., and any user can verify whether a received PWP is
the same as the published PWP.
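
The checksum idea could look roughly like this in Python. I assume here that the PWP is available as packed bytes, that SHA-256 is the digest, and that the publisher exposes the digest somewhere reachable from the Canonical URL; none of that is decided, and the function names are made up.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the (packed) PWP bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_published(received: bytes, published_digest: str) -> bool:
    """Check a received PWP against the digest of the published PWP."""
    return sha256_hex(received) == published_digest.lower()


# Stand-in bytes; a real check would read the packed PWP from disk.
official = b"official packed PWP"
slim_remix = b"remixed PWP without videos"
published_digest = sha256_hex(official)  # assumed to come from the publisher

print(matches_published(official, published_digest))
print(matches_published(slim_remix, published_digest))
```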

### Bad things

Bad thing #1: there is an insane amount of pressure on the Canonical URL.
If this URL dies, then all instantiations of the PWP are disconnected.

Ben De Meester
Researcher Semantic Web
Ghent University - iMinds - Data Science Lab | Faculty of Engineering and
Architecture | Department of Electronics and Information Systems
Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
t: +32 9 331 49 59 | e: ben.demeester@ugent.be | URL:
http://users.ugent.be/~bjdmeest/

Received on Thursday, 28 January 2016 10:12:20 UTC