<TimCole> Meeting: DPub IG Archival Task Force
<TimCole> Agenda: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Apr/0074.html
<HeatherF> scribenick: heatherf
TimCole: any additions to the Agenda?
... Hearing none, we'll proceed with the agenda as posted.
Minutes - https://www.w3.org/2016/04/07-dpub-arch-minutes.html
TimCole: any concerns or questions about the minutes?
... Hearing none, we'll consider the minutes of the previous call approved.
Minutes approved.
PDF/A Standards Wiki: http://pdf.editme.com/pdfa
lrosenth: that wiki hasn't been used in a while; not as up to date as it could be, but it will work for now
<lrosenth> Presentation: http://drops.pdfsages.com/1et0B
lrosenth: will send out a set of more current links later
... the presentation focuses on things that the working group
did and did not do during the process of developing PDF/A
... PDF/A is a series of ISO standards: ISO 19005, in three
parts.
... Initial history: in 2002-2003 the US federal gov't ran into
the issue that they had adopted PDF in a variety of use cases,
but found that not all PDFs are created equal.
... PDF had not been designed with validation in mind, for
example. The gov't needed a way to define a subset of PDF that
was optimized for their use cases.
... They brought Adobe and others on board, opening this from
fed gov't use cases to a generally more reliable PDF (now,
PDF/A)
... There are three parts to the standard: Part One = original
standard, based on the then-current version (1.4) of the PDF
standard.
... New versions of PDF were developed after that (post 1.4),
so the PDF/A effort then came to align with PDF as an ISO
standard (post 1.7)
... PDF/A part three introduced one change.
... The biggest debate between one and two was whether to allow
attachments in a PDF/A file, and if so, what types.
Bill: when you say attached to it, are you talking about a completely separate doc, or something embedded?
lrosenth: yes, an embedded file.
... PDF/A part one specifically forbids embedded files. We then
found there were many real-world use cases for allowing
embedding (e.g., the archiving world uses collections, which
would have benefited from that feature)
... In part two, we allowed for embedding PDF/A within other
PDF/A. But there is still controversy around whether to allow
inclusion of non-PDF/A files, esp. in the archival community.
... In part three, you can embed anything. That's the biggest
difference between two and three.
... The archival community still has many negative things to
say about arbitrary blobs within a file, but in other use
cases, this is necessary.
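[Illustrative sketch, not discussed on the call: the embedding rules described above could be summarized in a few lines of Python. The names `PdfAPart` and `attachment_allowed` are hypothetical, and the logic encodes only what was said in the presentation, not the text of ISO 19005.]

```python
from enum import Enum


class PdfAPart(Enum):
    """The three parts of PDF/A as described in the presentation."""
    ONE = 1
    TWO = 2
    THREE = 3


def attachment_allowed(part: PdfAPart, attachment_is_pdfa: bool) -> bool:
    """Return True if a file may be embedded under the given PDF/A part."""
    if part is PdfAPart.ONE:
        # Part one forbids embedded files altogether.
        return False
    if part is PdfAPart.TWO:
        # Part two allows embedding only when the attachment is itself PDF/A.
        return attachment_is_pdfa
    # Part three allows any embedded file.
    return True
```

[A PWP archival profile could face the same choice between "nothing", "only conforming content", and "anything" when deciding what a package may carry.]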
Tim: When we talk about embedding, I think of embedding fonts. Where does that fit?
lrosenth: What's common to all of the parts is that all data must be self-contained. All the fonts, images, text, color profiles, must be included. There can be no external references to content.
... There is no security, encryption, or DRM. This was agreed
to from day one. The problem with security is that over time,
you can lose passwords and render the content inaccessible.
... PDF, because it was not designed for validation, allowed
you to do the same thing but in different ways. PDF/A restricts
that to allow only one way.
... Last big thing, consistent across all parts, is that
nothing may change the content out from under the user without
their explicit permission (e.g., JavaScript cannot hide or
change things as it might with forms)
Tim: That answers the question, raises several things to think about
lrosenth: In PDF/A there are three levels of conformance, a b
and c. The reason for the three levels of conformance is that
there are so many use cases.
... b = basic. Exists primarily for paper-to-PDF scenarios. If
you have a document that is full of images, that is acceptable
in a PDF/A. If you have no text there, it doesn't make sense to
require things like Unicode or accessibility tags.
correction: three levels of conformance: b, u, a
... u = if there is text, Unicode is required.
... a = all or accessible; you have to conform to every single
thing in the standard, and the additions support
accessibility
... a led to PDF/UA, which focused entirely on accessibility.
Things like alt-text on images, defined word breaks, semantic
tagging.
... When we think about PWP and archiving, all of the things
talked about have been about a file format. How to create a
format where the file itself is a long-term format.
... The PDF/A standard incorporates another set of requirements
involving the conforming reader requirements. The standard
defines how the reader must interact with the file, to make
sure things render as expected.
... Thirteen years ago, not every PDF viewer used embedded
fonts. So, we had to specify in the standard that if embedded
fonts were present, the viewer had to use them rather than
substituting its own.
... Similarly, color has to be rendered in a device independent
manner. Red must stay red.
... There are other things around forms, digital
signatures.
... Digital signatures in PDF/A led to a collaboration between
the US and other national standards bodies, creating a new set
of international standards called PAdES
Tim: Regarding device independence, how far does PDF/A take that? Does PDF/A allow for a different view per device (e.g., mobile vs laptops)?
lrosenth: PDF/A has no awareness of device. Where color comes
in, when you take the PDF to a different device, the color will
follow the rules of color management.
... We don't have to worry about color management in PWP
because the open web platform does not do color
management.
... Also important is the metadata and marginalia (the stuff
you write in the margins). For archivists, the marginalia is
sometimes more important than the text.
... The idea that we have an archival format meant that things
like comments and annotating, and a rich and extensible
metadata capability that can cover the whole doc or parts, was
important.
... You can do document metadata or individual object metadata.
In the context of packaging the PWP, that's something we'll
need to consider.
... The one extension that had to be made to PDF and PDF/A
around metadata is that we're talking about something that is
self-contained and understandable in the long term future. So,
if you use custom metadata, then you need to include the schema
for that metadata in the metadata itself.
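[Illustrative sketch, not discussed on the call: the "custom metadata must carry its own schema" rule could be checked roughly as below. The function name, the data model (properties keyed by namespace and name), and the `http://example.org/` namespace are hypothetical. Treating Dublin Core as the only predefined namespace is a simplifying assumption made for the example.]

```python
# Assumption: namespaces a reader is expected to understand without an
# embedded schema. Only Dublin Core is listed here, for illustration.
PREDEFINED_NAMESPACES = frozenset({"http://purl.org/dc/elements/1.1/"})


def undescribed_namespaces(properties, embedded_schemas,
                           predefined=PREDEFINED_NAMESPACES):
    """Return the custom namespaces used without an embedded schema.

    properties: dict mapping (namespace, name) tuples to values.
    embedded_schemas: iterable of namespaces whose schema description
        is included in the metadata itself.
    """
    used = {namespace for namespace, _name in properties}
    return sorted(used - set(predefined) - set(embedded_schemas))


metadata = {
    ("http://purl.org/dc/elements/1.1/", "title"): "Annual report",
    ("http://example.org/ns/archive/", "shelfmark"): "A-17",
}

# The custom namespace is flagged until its schema is embedded.
missing = undescribed_namespaces(metadata, embedded_schemas=[])
fixed = undescribed_namespaces(
    metadata, embedded_schemas=["http://example.org/ns/archive/"])
```

[An empty result means every custom property is self-describing, which is the property an archive would want to verify before ingest.]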
Tim: that sounds very important.
lrosenth: PDF and PDF/A continue
to evolve. Current areas of focus include things we didn't
understand.
... Multimedia is allowed in PDF but isn't allowed in PDF/A
because we didn't understand the archival needs for audio and
video. This is still an area of development in the archival
industry itself.
... We have come to understand 3-D, so there is an effort to
incorporate 3-D elements into a PDF/A. This is somewhat
relevant to the PWP.
... Archiving the presentation and the data of a form is easy,
but what we don't have a good grasp on is how to archive the
business rules.
... Business rules are done entirely in scripting, but that's
not a great solution for archival documents if you want people
to see the same thing far into the future.
<ayla_stein> What about GIS, shape, or other geospatial file formats?
<ayla_stein> Does PDF/A allow those?
lrosenth: Still figuring out how
to preserve that business data.
... PDF 1.7 does not support geospatial data, but it is part of
the upcoming 2.0 standard. PDF/A is also looking at it and
everything else that's being proposed in 2.0.
Tim: We have mentioned a few
things that we need to think about for PWP. Maybe there are
more things we need to talk about. This TF is meant to
communicate some of these issues to the larger group through
simple, granular use cases.
... One comes to mind: an archival service wishes to augment a
PWP with metadata that is important for their service. Or
maybe the PWP publisher wishes to do that. What are the
requirements for making sure that metadata can be read in the
future?
<ayla_stein> sure
<TimCole> scribenick: ayla_stein
Tim: Heather, since you're becoming the expert on use cases, can you talk about possible use cases here...
<HeatherF> https://www.w3.org/dpub/IG/wiki/UseCase_Template
<TimCole> http://w3c.github.io/dpub-pwp-arch/Archival-UCR.html
HeatherF: on the dpub wiki there's a template of what should be included in a use case
Tim: we tried to create a page in GitHub.
... other than title we tried to include description
HeatherF: the two most important aspects are description and
requirements
... as for how some of the PDF/A lessons learned can be applied
to PWP use cases, I need to stare at the minutes for a while
and see how use cases could come out of this
Tim: Can we talk informally about possible use cases?
HeatherF: you mentioned a possible use case about archivists' needs for metadata
Tim: Does that make sense Leonard?
lrosenth: when we started this group there were a couple of
goals
... what are the general things that are base requirements for
the PWP for archiving
... the other idea is that maybe there's an archival profile
for PWP. The interesting thing could be if we want to go down
that path, is what the difference is
... for the metadata, if we need to include the schemas, is
that part of the archival profile or the PWP?
Tim: yes we need to think about this
HeatherF: Why don't we think
about it this way. A magazine publisher or trade publisher
doesn't need or care about the archival requirements.
... we need to make sure that there isn't a barrier for them to
pick this up
Tim: this could be at the component level...
Bill_Kasdorf: Another thing from Leonard's presentation is that
what we mean by a profile is a subset.
... this seems to be an important issue. The archival profile
should be about PWP in every aspect. PWP should be flexible.
Archival profile should be more strict
Tim: That's something we should
bring back to the larger group to think about
... archival profile needs to still be readable by a PWP
reader
*missed what Bill said
Tim: we talked a bit about this
with Nicholas
... what kind of browser, email (I missed this)
... the other thing that Leonard mentioned is what kind of
script...
... I was just saying I'm not sure what scripts such as
JavaScript will mean for PWP
HeatherF: unfortunately you're
right, javascript and dynamic scripts have become quite
ubiquitous. Which is annoying
... I think having dynamic scripts in any kind of long term
storage is a bad idea but I don't know how we're going to get
around that
lrosenth: can we take advantage of the fact that we have both file format rules and business rules, and have rules where the script should be included but not executed...
Tim: we're coming up on the end
of the hour and I think we have the fodder for a number of use
cases
... some of you are used to editing on GitHub, some are not. In
some way or another I'd like everyone to help flesh out the
use cases
<TimCole> http://w3c.github.io/dpub-pwp-arch/Archival-UCR.html
Tim: the link address is viewable
by everyone...
... can we get people to try and do that either in the github
document or via email if they have an idea for a use case
... let's try to do that within the next two weeks so we can
talk about it in the next call
... how will that work with assembling use cases?
HeatherF: It's kind of ongoing... we'll be talking about it at the F2F at the end of the meeting
Tim: try to think about use case
title and descriptions at least. Heather, can you look at the
url at the link I posted and see if they need to be re-worked,
etc
... Leonard and Bill can you think about use cases?
Leonard and Bill: Yup
lrosenth: this isn't a huge
priority but a few calls ago we talked about getting some of
the people involved in archiving. We did bring in the people
from CLOCKSS and LOCKSS.
... I met the counterpart of Nicholas Taylor at the KB (the
Dutch National Library), who would love to participate in one
of our calls
Tim: Can you share his contact info with Ayla and me?
... we're still waiting on a couple of contacts that I'm
responsible for from the Internet Archive and Los Alamos
National Laboratory
*thanks Bill!
Tim: anything else?
... okay let's save some time. We'll have the next call in two
weeks where we'll talk about use case ideas that people have
come up with or talk about what's already up there
... sound good?
*end