W3C

DPub Archival Task Force

07 Apr 2016

Agenda

See also: IRC log

Attendees

Present
Tim Cole, Ayla Stein, Nicholas Taylor, Tzviya Siegman
Regrets
Bill Kasdorf, Leonard Rosenthol, Markus Gylling, Heather Flanagan
Chairs
Ayla Stein, Tim Cole
Scribe
Tim Cole


<ayla_stein> Agenda: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Apr/0020.html

Approve Minutes https://www.w3.org/2016/03/24-dpub-arch-minutes.html

ayla_stein: objections?
... hearing no objections, approved.

Identify Archival use cases relevant to PWP and assign write-up of each to TF member(s)

ayla_stein: a use case for LOCKSS

ntay: format migration may not be in scope for use case

ayla_stein: could you expand on why it is not in scope?

TimC: It seems to me that PWP may make migration on access
... as I understood it, LOCKSS format migration is driven by use
... and when people use a PWP that is unpacked on a server, do they necessarily get an archivable package?

ntay: format migration on access may not be current
... LOCKSS access system has not needed this yet.
... risk model of LOCKSS is primarily concerned with the bits.
... don't see PDF, GIF, ASCII as being of concern for client-side rendering obsolescence.
... if GIF could no longer be rendered by mainstream Web browsers
... then LOCKSS might put a mechanism in place so that when a client requests an image but doesn't accept GIF, LOCKSS could migrate it to PNG
... But LOCKSS doesn't see this as a use case.
... my understanding is that PWP is not a file in the same sense
... it's more a framework and manifest
... Could still put together a CLOCKSS / LOCKSS use case
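
The on-access migration ntay describes (serve PNG when a client does not accept GIF) is essentially HTTP content negotiation. The following is a minimal sketch of that decision only, not of how LOCKSS works; ntay notes above that LOCKSS does not currently treat this as a use case. The `MIGRATIONS` table and the function names are hypothetical, and the actual image conversion is left out.

```python
# Hypothetical table of stored formats and the format each could be migrated to.
MIGRATIONS = {"image/gif": "image/png"}


def parse_accept(header):
    """Parse an HTTP Accept header value into (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        ranges.append((fields[0].lower(), q))
    return ranges


def accepts(header, media_type):
    """Return True if media_type is acceptable; the most specific range wins."""
    if not header.strip():
        return True  # no Accept preference stated: anything is acceptable
    main_type = media_type.split("/")[0]
    best = None  # (specificity, q) of the most specific matching range
    for media_range, q in parse_accept(header):
        if media_range == media_type:
            specificity = 2
        elif media_range == main_type + "/*":
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if best is None or specificity > best[0]:
            best = (specificity, q)
    return best is not None and best[1] > 0


def choose_served_type(stored_type, accept_header):
    """Pick the stored type if acceptable, else a migration target, else None."""
    if accepts(accept_header, stored_type):
        return stored_type
    target = MIGRATIONS.get(stored_type)
    if target is not None and accepts(accept_header, target):
        return target
    return None
```

For example, `choose_served_type("image/gif", "image/png, image/webp")` selects the migration target, while a client that sends `image/*` still receives the stored GIF unchanged.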

TimC: Difference between CLOCKSS and LOCKSS?

ntay: same underlying technology

TimC: ntay, can you walk us through the acquisition process?

ntay: 2 mechanisms by which we retrieve content
... 1. Web harvest: collect content as it would be presented to the user
... plug-ins provide extra intelligence to the crawler so it can parse various units that contain multiple publications (e.g., an issue)
... helps it figure out what units it needs to make archival package(s)
... decision about what to package per publisher largely being figured out by LOCKSS

TimC: so the manifest of PWP might simplify this.

ntay: yes

TimC: assumes that what is needed for archiving is the same as what is needed for portability

ntay: 2nd method is more back end process
... publishers are making content available on the back end for archiving services
... more typically the source files, e.g., includes pdf but also XML, etc., but may not have all of the presentation (CSS) files
... things are neatly organized in a tree.

TimC: How would you phrase some of these as user stories...

tzviya: the back end approach is not necessarily relevant for PWP

ntay: in CLOCKSS model, all content archived is dark until there is a trigger event (publisher goes out of business, natural disaster knocks out servers, etc.)
... if we harvested, the user will see what they are used to seeing.
... backend acquisition makes it quality of access experience
... to the extent that we can make use cases generic, not necessarily tied to an archiving service
... there may be specific examples from David's Blog post that talk about problems in the absence of a manifest
... so this may help us

TimC: use case: an archival service wants to harvest (spider) a PWP, and expects to find in the manifest what it needs to make sure it gets all the right pieces.

ntay: yes

<ayla_stein> yes!
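
The harvest use case TimC states above amounts to comparing what a crawler fetched against what the manifest lists. A minimal sketch follows, assuming a purely hypothetical JSON manifest with a `resources` array of `href` entries; no PWP manifest format had been defined at the time of this call, so the structure and the name `missing_resources` are illustrative only.

```python
import json
from urllib.parse import urljoin


def missing_resources(manifest_json, base_url, harvested_urls):
    """Return manifest resources (as absolute URLs) that were not harvested.

    The manifest layout ({"resources": [{"href": ...}, ...]}) is an assumption
    made for illustration, not a defined PWP format.
    """
    manifest = json.loads(manifest_json)
    # Resolve each listed href against the publication's base URL.
    expected = {urljoin(base_url, entry["href"]) for entry in manifest["resources"]}
    return sorted(expected - set(harvested_urls))


example_manifest = json.dumps({
    "resources": [
        {"href": "chapter1.html"},
        {"href": "style.css"},
        {"href": "images/cover.png"},
    ]
})

harvested = [
    "https://example.org/book/chapter1.html",
    "https://example.org/book/images/cover.png",
]

# Lists the stylesheet, which the crawl in this example did not collect.
print(missing_resources(example_manifest, "https://example.org/book/", harvested))
```

An empty result would tell the archiving service that the crawl collected every piece the manifest names, which is the check that currently relies on per-publisher plug-ins.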

ntay: another use case is versioning: if one part of a PWP gets updated, how is that update handled by the archiving service?

tzviya: also a revisioning use case
... e.g., an update for a mis-spelled word in chapter 3
... vs. a new version of chapter 2.
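
One way an archiving service could notice that part of a PWP has been updated is to compare content digests between two harvests. The sketch below assumes each harvest is available as a mapping from resource path to bytes (a hypothetical representation). It only detects that a resource changed; it cannot tell a corrected typo from a new version, which is the distinction tzviya raises.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a resource's bytes."""
    return hashlib.sha256(data).hexdigest()


def diff_snapshots(old, new):
    """Compare two harvests given as {path: bytes} mappings.

    Returns sorted lists of paths that were added, removed, or changed.
    """
    old_digests = {path: digest(data) for path, data in old.items()}
    new_digests = {path: digest(data) for path, data in new.items()}
    common = old_digests.keys() & new_digests.keys()
    return {
        "added": sorted(new_digests.keys() - old_digests.keys()),
        "removed": sorted(old_digests.keys() - new_digests.keys()),
        "changed": sorted(p for p in common if old_digests[p] != new_digests[p]),
    }


first_harvest = {
    "chapter2.html": b"<p>Chapter two</p>",
    "chapter3.html": b"<p>A mispelled word</p>",
}
second_harvest = {
    "chapter2.html": b"<p>Chapter two</p>",
    "chapter3.html": b"<p>A misspelled word</p>",
    "errata.html": b"<p>Fixed a typo in chapter 3</p>",
}

print(diff_snapshots(first_harvest, second_harvest))
```

Whether a changed resource should produce a new archived version or replace the old one is a policy question the digests alone do not answer.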

ayla_stein: keeping track of errata and retractions

tzviya: we can start these as issues in GitHub
... re errata, these might be done as annotations
... we do have to consider what to do with errata, revisions, versions, etc.

ayla_stein: removal retractions (publisher just removes the item)
... what does the archive service do?

ntay: would be surprised if retraction resulted in deletion from an archive

tzviya: we give retractions their own DOI, separate from the original article's DOI
... shows how people use DOIs for different purposes

ayla_stein: Medusa digital repository at Illinois does include digital monographs
... so what does that archive need to facilitate archiving
... archivist needs more of a sense of what makes a document valid -- i.e., health check
... sounds like he needs some sort of archivist validation

TimC: what does validation mean? how does validity change over time?

tzviya: there is an e-pub check system for validating

ayla_stein: it does sound like he wants some way to read the publication and know how to validate it
... might not have to be an external tool

tzviya: ePub has a validator, but has not come up yet for PWP

ntay: so what does ePub check do?

tzviya: checks HTML, structure, etc.

<tzviya> epubcheck https://github.com/IDPF/epubcheck

ntay: been focused on how PWP will help verify completeness
... not clear whether you could easily check appearance and/or browser compatibility

ayla_stein: not clear that responsive design is of concern yet to the Library / Archive space

ntay: Responsive Design is a best practice for Web Archiving

<ntay> https://library.stanford.edu/projects/web-archiving/archivability

TimC: the basic use case: an archiving service wants to validate a PWP as being adequate for archiving.
... Ayla will write something up.

ayla_stein: archivist will be worried about the range of content that can be included in a PWP, since these technologies change over time
... my understanding that PWP

TimC: if PWP is wide open about what it includes
... does that mean that some PWPs may not be archivable?

tzviya: is this the same issue as comes up when we talk about Archiving the Web?

ayla_stein: Leonard's discussion about PDF/A experience may help also

ntay: my understanding of PDF/A: the ability to embed arbitrary content means you end up with binary blobs
... as archivists we deal with not having control all the time
... so while some formats are easier to archive than others, it isn't that there's a non-archivable format

TimC: use case for making assessment of risk (from archiving perspective)

Summary of Action Items

Summary of Resolutions

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.144 (CVS log)
$Date: 2016/04/11 09:47:28 $