W3C

DPub Archival Task Force

03 Mar 2016

Agenda

See also: IRC log

Attendees

Present
Heather_Flanagan, Leonard_Rosenthol, Tim_Cole, Ayla_Stein, Bill_Kasdorf
Regrets
Tzviya_Siegman, Deborah_Kaplan
Chair
Tim Cole
Scribe
Heather_Flanagan


<TimCole> Agenda: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Mar/0009.html

<TimCole> scribenick: HeatherF

<TimCole> minutes from 18 Feb: https://www.w3.org/2016/02/18-dpub-arch-minutes.html

TimCole: Approval of the minutes?
... minutes approved, no discussion

Administrivia & Announcements

<TimCole> From HeatherF: http://www.dpconline.org/advice/preservationhandbook

<lrosenth> http://www.archives.gov/digitization/strategy.html

TimCole: Will miss a few of our scheduled meetings; Ayla Stein will co-lead TF and chair calls when needed.

Outreach Updates?

TimCole: Any updates on the outreach effort?

Bill_Kasdorf: haven't had a chance to reach out yet

TimCole: haven't had a chance to reach out yet

lrosenth: haven't had a chance to reach out yet

We are bad, bad people

<astein> Neither have I

TimCole: this will go on the agenda next time. And we mean it!

Use Cases

<TimCole> http://w3c.github.io/dpub-pwp-arch/Archival-UCR.html#LOCKSS

TimCole: This is the beginnings of a use case posted this morning.
... This doc is meant to be the place to put use cases, talk about requirements for those use cases that are relevant to the PWP vision

<TimCole> more about LOCKSS: http://www.lockss.org/about/how-it-works/

TimCole: This use case talks about LOCKSS in particular, see link about how LOCKSS works.
... LOCKSS goes and spiders/scrapes publisher websites that they have permission to archive. Then, when someone has permission to access that content, LOCKSS acts as a proxy cache.
... New versions will be posted and LOCKSS will update, as per usual proxy behavior.

Bill: For LOCKSS, are they accessing the library, or are they accessing the publisher?

TimCole: They are accessing what the library is subscribing to.

<astein> I didn't see anyone on the queue

TimCole: People coming through the library servers would see what was available in the proxy cache, and LOCKSS would siphon off copies for the archives.
... CLOCKSS works more directly with the publishers and makes that copying more routine.
... Some interesting issues come up: when the publisher content goes away and we move to a pure archive/preservation point, LOCKSS serves up the cached copy as long as the Accept headers match what LOCKSS has cached
... The browser goes to a new version of the content and says "I only accept HTML5"; LOCKSS only has "HTML4"; at which point, LOCKSS will try to migrate on the fly to HTML5 or it will respond with a "406" error
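The negotiation described above can be sketched roughly as follows. This is a minimal illustration, not LOCKSS code: the function `serve_from_archive`, the in-memory `cache` and `migrators` tables, and the media type strings are all assumptions made for the example, and q-values in the Accept header are ignored for brevity.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical cache: URL -> (media type, body). Not LOCKSS's actual data model.
Cache = Dict[str, Tuple[str, bytes]]
# Hypothetical migrators: (cached type, wanted type) -> conversion function.
Migrators = Dict[Tuple[str, str], Callable[[bytes], bytes]]
Response = Tuple[int, Optional[str], bytes]


def parse_accept(header: str) -> List[str]:
    """Return the media types in an Accept header, dropping parameters such as q-values."""
    return [part.split(";")[0].strip() for part in header.split(",") if part.strip()]


def serve_from_archive(url: str, accept: str, cache: Cache, migrators: Migrators) -> Response:
    """Return (status, media_type, body) for a request against the archived copy."""
    if url not in cache:
        return 404, None, b""
    cached_type, body = cache[url]
    accepted = parse_accept(accept)
    # Serve the cached copy unchanged when the client accepts its media type.
    if cached_type in accepted or "*/*" in accepted:
        return 200, cached_type, body
    # Otherwise try to migrate on the fly to the first type we can produce.
    for wanted in accepted:
        convert = migrators.get((cached_type, wanted))
        if convert is not None:
            return 200, wanted, convert(body)
    # Nothing acceptable can be produced: 406 Not Acceptable.
    return 406, None, b""


# Placeholder media types standing in for the "old" and "new" formats.
cache = {"http://example.org/article": ("text/x-old-format", b"old markup")}
migrators = {("text/x-old-format", "text/x-new-format"): lambda b: b.upper()}
status, media_type, body = serve_from_archive(
    "http://example.org/article", "text/x-new-format", cache, migrators
)
```

With an empty `migrators` table the same request would fall through to the 406 branch, which is the second outcome mentioned in the discussion.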

Bill: Will it also say "but we can give you a different format?"

TimCole: not sure

lrosenth: that seems a sensible model. Are we saying that's a model we like, or just saying that this is what it does, these are the facts?

<TimCole> PWP: http://w3c.github.io/dpub-pwp/#arch

TimCole: Right now, we are just collecting the use cases. We should be considering whether the facts in the use cases cause concern.
... Right now, the PWP talks about Service Workers, which provide a local way to serve resources.

lrosenth: there are no requirements that the PWP must use Service Workers. That said, it has no bearing on this, because we're talking about a server operation, where a server is doing the caching, not the client
... unclear as to what the issue is here; if we publish in PWP format, then LOCKSS will store that. If we request it as a PWP, then it will be served as a PWP. If I request in another format, the server may convert it on the fly.
... the reverse is also true: if I post content in PDF, and instead of asking for that as PDF I ask it as a PWP, the server may convert it on the fly.

TimCole: Probably correct. Just wondering whether there is something that the client will do with the PWP that may somehow result in the copy that the LOCKSS server cached not having everything that the client needs

Bill: When we look at the Portico model, a very contrasting model, it will be easier to see, given how PWP is being spec'd, how compatible it is with these two important dark archiving schemes.
... Is there anything we're doing that creates a problem? Is there anything we're doing that can optimize what they're doing?

lrosenth: Right. And for LOCKSS, the answer is "we don't need to care"
... the packaged versus the unpackaged differentiation becomes interesting. If I publish the PWP unpackaged, and there is no provision in the server to provide a packaged version, then LOCKSS is going to have to do a huge amount of work to archive the PWP

Bill: That is still better than where they are now. If they are now going to a website, what other resources are there that are essential to that content? e.g., fonts, media, etc.
... PWP is going to make that easier, even in unpackaged form. The PWP must unambiguously get you those resources.

TimCole: The problem is that LOCKSS right now works on the assumption that whatever it's grabbing from a web browser is all it needs. It doesn't know anything about the formats in any great detail. It does not take advantage of a manifest to build a package.

Bill: OK. Then perhaps the PWP would enable LOCKSS to do a better job than it can currently do to archive the publication in a more complete and correct way.
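One way to picture the gain discussed here: if a manifest lists every resource a publication needs, an archive can check its crawl against that list. A minimal sketch, assuming a manifest is simply a list of URLs; that shape, the function `missing_resources`, and the example URLs are hypothetical and not taken from the PWP draft or from LOCKSS.

```python
from typing import Iterable, List, Set


def missing_resources(manifest: Iterable[str], crawled: Set[str]) -> List[str]:
    """Return the manifest entries the crawl did not capture, in manifest order."""
    return [url for url in manifest if url not in crawled]


# Hypothetical manifest for one publication and the set of URLs a crawl captured.
manifest = [
    "https://publisher.example/book/index.html",
    "https://publisher.example/book/style.css",
    "https://publisher.example/book/fonts/body.woff",
]
crawled = {
    "https://publisher.example/book/index.html",
    "https://publisher.example/book/style.css",
}
gaps = missing_resources(manifest, crawled)
```

Without a manifest there is nothing to compare the crawl against, so a gap like the font file here would go unnoticed.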

lrosenth: It will still require LOCKSS to do more work, which is fine from our perspective.

TimCole: that's a concern this use case needs to highlight.

<TimCole> http://www.dlib.org/dlib/january05/rosenthal/01rosenthal.html

<TimCole> http://blog.dshr.org/2013/02/rothenberg-still-wrong.html

TimCole: This is a debate that covers both sides of the LOCKSS model.
... Why it doesn't work, why it does.
... This will be interesting to look at as we discuss the CLOCKSS model as well.

Bill_Kasdorf: Will also want to talk to Craig Van Dyke, who now works at CLOCKSS

TimCole: so, potential question about packaged versus unpackaged. Will do more refinement on the LOCKSS use case, highlighting what we see as issues, and what we see as non-issues.
... Also planning on a use case based on Portico.
... As part of their normalization process, they would develop a package from a manifest, so they could maintain both the content and appearance over time.

lrosenth: Unlike EPUB, PWP does not require that all things in a package or all items referenced in a manifest are from the same site or are self contained. PWP allows for external references.
... Someone taking the material and creating a package may not have the rights to the externally referenced material.
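The distinction lrosenth draws can be illustrated by sorting a publication's references into those on the publication's own origin and those hosted elsewhere. This is a sketch under assumptions: the function `split_by_origin`, the idea that references arrive as a flat list, and the example URLs are all hypothetical, and "same origin" is approximated here as same scheme and host.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urljoin, urlsplit


def split_by_origin(publication_url: str, references: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Resolve each reference against the publication URL and split into (internal, external)."""
    base = urlsplit(publication_url)
    origin = (base.scheme, base.netloc)
    internal: List[str] = []
    external: List[str] = []
    for ref in references:
        absolute = urljoin(publication_url, ref)  # handles relative references
        parts = urlsplit(absolute)
        if (parts.scheme, parts.netloc) == origin:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


internal, external = split_by_origin(
    "https://publisher.example/book/index.html",
    ["chapter1.html", "/css/site.css", "https://fonts.example/serif.woff"],
)
```

Anything that lands in `external` is material a packager would need separate rights to copy, or could only link to.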

TimCole: a very good point. How you deal with normalization is an important issue.

Bill: The fundamental strategy with Portico was to normalize so that they may support format migrations, keeping the format up to date and uniform. However, it can't migrate those external resources, it can only link to them to the extent the links are stable (which they may not be)

TimCole: Also raises the question that comes up with other formats: when the fonts are not available for whatever reason, what are the fallbacks?

lrosenth: This is why PDF/A exists. We may want to say whether there is a PWP/A. You can do all these things with PWP, but if you want one that is archive-ready, you will need to do these additional things (as requirements)

TimCole: There may be a need for that, even if we don't have to define it fully in the docs we are working with.

lrosenth: We can just come back to the main group and say "we think this is important to do" and leave it at that.

TimCole: For next call, will have a Portico use case.

Upcoming Agendas

<scribe> ACTION: Tim to come up with a use case for Portico that illustrates some of the features of portico as a model [recorded in http://www.w3.org/2016/03/03-dpub-arch-minutes.html#action01]

TimCole: if you don't have a GitHub account to get the use cases updated, send Tim the details

<astein> ack {HeatherF}

HeatherF: Will not be available for scheduled call on 17th.

Bill: Also will not be available for scheduled call on 17th.

<astein> 24th works for me

TimCole: propose moving the next call to the 24th of March; Tim will propose to the list

Bill: are we looking to interview and report back on the outreach contacts, or bring them in for calls?

TimCole: If the contacts can meet with us at one of our times, invite them to come in at the half hour (reserving the second half of the call for the discussion). If they can't, we will need to do the interview.

<astein> works for me

TimCole: they are welcome to come in earlier, of course
... We have talked a few times about PDF/A; would lrosenth be able to walk us through some of the important lessons from that process?

lrosenth: yes, can do. Have a presentation that has been given in the past that may be useful and relevant. What PDF/A is and is not.

TimCole: We will put that on the agenda for the next call
... What else needs to be done before our next call? Perhaps start thinking about what other write ups we might need?
... What started this group was the need for some text under the Archival and library services in the PWP white paper. We can talk in that text about the need for a PWP/A approach.
... We also have a glossary we may want to expand on. Are there any other products we want to expand on?

*crickets*

<astein> \o/

scribe: Then with that, we can adjourn! Talk to you in (probably) three weeks

Summary of Action Items

[NEW] ACTION: Tim to come up with a use case for Portico that illustrates some of the features of portico as a model [recorded in http://www.w3.org/2016/03/03-dpub-arch-minutes.html#action01]
 

Summary of Resolutions

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.144 (CVS log)
$Date: 2016/03/04 17:15:47 $