DPub IG Archival Task Force -- 21 Apr 2016

<TimCole> Meeting: DPub IG Archival Task Force

<TimCole> Agenda: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Apr/0074.html

<HeatherF> scribenick: heatherf

Agenda Review

TimCole: any additions to the Agenda?
... Hearing none, we'll proceed with the agenda as posted.

Approval of minutes from 7 April

Minutes - https://www.w3.org/2016/04/07-dpub-arch-minutes.html

Minutes Approval

TimCole: any concerns or questions about the minutes?
... Hearing none, we'll consider the minutes of the previous call approved.

Minutes approved.

PDF/A

PDF/A Standards Wiki: http://pdf.editme.com/pdfa

lrosenth: that wiki hasn't been used in a while; not as up to date as it could be, but it will work for now

<lrosenth> Presentation: http://drops.pdfsages.com/1et0B

lrosenth: will send out a set of more current links later
... the presentation focuses on things that the working group did and did not do during the process of developing PDF/A
... PDF/A is a series of ISO standards: 19005 in three parts.
... Initial history: in 2002-2003 the US federal gov't ran into an issue. They had adopted PDF for a variety of use cases, but found that not all PDFs are created equal.
... PDF had not been designed with validation in mind, for example. The gov't needed a way to define a subset of PDF that was optimized for their use cases.
... They brought Adobe and others on board, opening this from fed gov't use cases to a generally more reliable PDF (now, PDF/A)
... There are three parts to the standard: Part One = original standard, based on the then-current version (1.4) of the PDF specification.

<scribe> ... New versions of PDF were developed after that (post 1.4), so the PDF/A effort then came to align with PDF as an ISO standard (post 1.7)

lrosenth: PDF/A part three introduced one change.
... The biggest debate between one and two was whether to allow attachments in a PDF/A file, and if so, what types.

Bill: when you say attached to it, are you talking about a completely separate doc, or something embedded?

lrosenth: yes, embedded file.
... PDF/A one specifically forbids embedded files. The group then found there were many real-world use cases for embedding (e.g., the archiving world uses collections, which would have benefited from that feature)
... Part two allows embedding PDF/A within other PDF/A. But there is still controversy around whether to allow inclusion of non-PDF/A files, esp. in the archival community.
... In part three, you can embed anything. That's the biggest difference between two and three.
... The archival community still has many negative things to say about arbitrary blobs within a file, but in other use cases, this is necessary.

Tim: When we talk about embedding, I think of embedding fonts. Where does that fit?

lrosenth: What's common to all of the parts is that all data must be self-contained. All the fonts, images, text, color profiles, must be included. There can be no external references to content.

lrosenth: There is no security, encryption, or DRM. This was agreed to from day one. The problem with security is that over time, you can lose passwords and render the content inaccessible.
... PDF, because it was not designed for validation, allowed you to do the same thing but in different ways. PDF/A restricts that to allow only one way.
... Last big thing, consistent across all parts, is that nothing may change the content out from under the user without their explicit permission (e.g., JavaScript cannot hide or change things as it might with forms)

Tim: That answers the question, raises several things to think about

lrosenth: In PDF/A there are three levels of conformance: b, u, and a. The reason for the three levels of conformance is that there are so many use cases.
... b = basic. Exists primarily for paper-to-PDF scenarios. If you have a document that is full of images, that is acceptable in a PDF/A. If you have no text there, it doesn't make sense to require things like Unicode or accessibility tags.
... u = if there is text, Unicode is required.
... a = all or accessible; you have to conform to every single thing in the standard, and the additions support accessibility
... a led to PDF/UA, which focused entirely on accessibility. Things like alt-text on images, defined word breaks, semantic tagging.
... When we think about PWP and archiving, all of the things talked about have been about a file format. How to create a format where the file itself is a long-term format.
... The PDF/A standard also incorporates a set of conforming reader requirements. The standard defines how the reader must interact with the file, to make sure things render as expected.
... Thirteen years ago, not every PDF viewer used embedded fonts. So the standard had to specify that if embedded fonts are present, the reader has to use them rather than substituting its own.
... Similarly, color has to be rendered in a device independent manner. Red must stay red.
... There are other things around forms, digital signatures.
... Digital signatures in PDF/A led to a collaboration between the US and other national standards bodies, creating a new set of international standards called PAdES

Tim: Regarding device independence, how far does PDF/A take that? Does PDF/A allow for a different view per device (e.g., mobile vs laptops)?

lrosenth: PDF/A has no awareness of device. Where color comes in: when you take the PDF to a different device, the color will follow the rules of color management.
... We don't have to worry about color management in PWP because the open web platform does not do color management.
... Also important is the metadata and marginalia (the stuff you write in the margins). For archivists, the marginalia is sometimes more important than the text.
... Because this is an archival format, things like comments and annotations, and a rich and extensible metadata capability that can cover the whole doc or parts of it, were important.
... You can do document metadata or individual object metadata. In the context of packaging the PWP, that's something we'll need to consider.
... The one extension that had to be made to PDF and PDF/A around metadata comes from the fact that we're talking about something that is self-contained and understandable in the long-term future. So, if you use custom metadata, then you need to include the schema for that metadata in the metadata itself.

Tim: that sounds very important.

lrosenth: PDF and PDF/A continue to evolve. Current areas of focus include things we didn't understand.
... Multimedia is allowed in PDF but isn't allowed in PDF/A because we didn't understand the archival needs for audio and video. This is still an area of development in the archival industry itself.
... We have come to understand 3-D, so there is an effort to incorporate 3-D elements into a PDF/A. This is somewhat relevant to the PWP.
... Archiving the presentation and the data of a form is easy, but what we don't have a good grasp on is how to archive the business rules.
... Business rules are done entirely in scripting, but that's not a great solution for archival documents if you want people to see the same thing far into the future.

<ayla_stein> What about GIS, shape, or other geospatial file formats?

<ayla_stein> Does PDF/A allow those?

lrosenth: Still figuring out how to preserve that business data.
... PDF 1.7 does not support geospatial data, but it is part of the upcoming 2.0 standard. PDF/A is also looking at it and everything else that's being proposed in 2.0.

Archival Use Cases for PWP

Tim: We have mentioned a few things that we need to think about for PWP. Maybe there are more things we need to talk about. This TF is meant to communicate some of these issues to the larger group through simple, granular use cases.
... One comes to mind: an archival service wishes to augment the metadata of a PWP that is important for their service. Or maybe the PWP publisher wishes to do that. What are the requirements on making sure that metadata can be read in the future?

<ayla_stein> sure

<TimCole> scribenick: ayla_stein

Tim: Heather, since you're becoming the expert on use cases, can you talk about possible use cases here...

<HeatherF> https://www.w3.org/dpub/IG/wiki/UseCase_Template

<TimCole> http://w3c.github.io/dpub-pwp-arch/Archival-UCR.html

HeatherF: on the dpub wiki there's a template of what should be included in a use case

Tim: we tried to create a page in github.
... other than title we tried to include description

HeatherF: the two most important aspects are description and requirements
... as for how some of the PDF/A lessons learned can be applied to PWP use cases, I need to stare at the minutes for a while and see how use cases could come out of this

Tim: Can we talk informally about possible use cases?

HeatherF: you mentioned a possible use case about archivists' needs for metadata

Tim: Does that make sense Leonard?

lrosenth: when we started this group there were a couple of goals
... what are the general things that are base requirements for the PWP for archiving
... the other idea is that maybe there's an archival profile for PWP. The interesting question, if we want to go down that path, is what the difference is
... for the metadata, if we need to include the schemas, is that part of the archival profile or the PWP?

Tim: yes, we need to think about this

HeatherF: Why don't we think about it this way. A magazine publisher or trade publisher doesn't need or care about the archival requirements.
... we need to make sure that there isn't a barrier for them to pick this up

Tim: this could be at the component level...

Bill_Kasdorf: Another thing from Leonard's presentation is what we mean by a profile is a subset.
... this seems to be an important issue. The archival profile should be about PWP in every aspect. PWP should be flexible. Archival profile should be more strict

Tim: That's something we should bring back to the larger group to think about
... archival profile needs to still be readable by a PWP reader

*missed what Bill said

Tim: we talked a bit about this with Nicholas
... what kind of browser, email (I missed this)
... the other thing that Leonard mentioned is what kind of script...
... I was just saying I'm not sure what scripts such as JavaScript will mean for PWP

HeatherF: unfortunately you're right, javascript and dynamic scripts have become quite ubiquitous. Which is annoying
... I think having dynamic scripts in any kind of long term storage is a bad idea but I don't know how we're going to get around that

lrosenth: can we take advantage of the fact that we have both file format rules and business rules, and have a rule that the script should be included but not executed...

Tim: we're coming up on the end of the hour and I think we have the fodder for a number of use cases
... some of you are used to editing on github, some are not. In some way or another I'd like everyone to help flesh out the use cases

<TimCole> http://w3c.github.io/dpub-pwp-arch/Archival-UCR.html

Tim: the link address is viewable by everyone...
... can we get people to try and do that either in the github document or via email if they have an idea for a use case
... let's try to do that within the next two weeks so we can talk about it in the next call
... how will that work with assembling use cases?

HeatherF: It's kind of ongoing...we'll be talking about it at the F2F at the end of the meeting

Tim: try to think about use case titles and descriptions at least. Heather, can you look at the use cases at the link I posted and see if they need to be reworked?
... Leonard and Bill can you think about use cases?

Leonard and Bill: Yup

lrosenth: this isn't a huge priority but a few calls ago we talked about getting some of the people involved in archiving. We did bring in the people from CLOCKSS and LOCKSS.
... I met Nicholas Taylor's counterpart at the KB (the Dutch National Library), who would love to participate in one of our calls

Tim: Can you share his contact info with Ayla and me?
... we're still waiting on a couple of contacts that I'm responsible for from the Internet Archive and Los Alamos National Laboratory

*thanks Bill!

Tim: anything else?
... okay let's save some time. We'll have the next call in two weeks where we'll talk about use case ideas that people have come up with or talk about what's already up there
... sound good?

*end
