16:49:37 RRSAgent has joined #dpub-arch
16:49:37 logging to http://www.w3.org/2016/04/21-dpub-arch-irc
16:50:01 Zakim has joined #dpub-arch
16:50:32 Meeting: DPub IG Archival Task Force
16:51:00 Chair: Tim_Cole
16:51:30 Agenda: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Apr/0074.html
16:55:57 RRSAgent has joined #dpub-arch
16:55:57 logging to http://www.w3.org/2016/04/21-dpub-arch-irc
16:57:12 RRSAgent has joined #dpub-arch
16:57:12 logging to http://www.w3.org/2016/04/21-dpub-arch-irc
16:57:25 Meeting: DPub IG Archival Task Force
16:57:59 zakim, this will be dpub-arch
16:57:59 ok, TimCole
16:58:11 ayla_stein has joined #dpub-arch
16:58:12 HeatherF has joined #dpub-arch
16:58:14 Agenda: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Apr/0074.html
16:58:25 present+ Tim_Cole
16:58:37 present+ Heather_Flanagan
16:58:54 present+ ayla_stein
16:59:21 Regrets: Nicholas Taylor
16:59:37 regrets+ Nicholas_Taylor
17:00:03 Bill_Kasdorf has joined #dpub-arch
17:00:07 yay!
17:00:07 scribenic: heatherf
17:00:17 present+ Bill_Kasdorf
17:00:24 scribenick: heatherf
17:00:24 present+ Leonard_Rosenthol
17:01:26 topic: Approval of minutes from 7 April
17:01:29 Minutes - tps://www.w3.org/2016/04/07-dpub-arch-minutes.html
17:01:37 Topic: Agenda Review
17:01:38 Minutes - https://www.w3.org/2016/04/07-dpub-arch-minutes.html
17:02:56 Topic: Minutes Approval
17:03:24 TimCole: any concerns or questions about the minutes?
17:03:43 Minutes approved.
17:03:52 Topic: PDF/A
17:04:07 PDF/A Standards Wiki: http://pdf.editme.com/pdfa
17:04:32 lrosenth: that wiki hasn't been used in a while; not as up to date as it could be, but it will work for now
17:04:41 Presentation: http://drops.pdfsages.com/1et0B
17:04:46 ... will send out a set of more current links later
17:05:34 ... the presentation focuses on things that the working group did and did not do during the process of developing PDF/A
17:05:55 ... PDF/A is a series of ISO standards: 19005 in three parts.
17:06:31 ...
Initial history: in 2002-2003 the US federal gov't ran into the issue that they had adopted PDF in a variety of use cases, but found that not all PDFs are created equal.
17:07:01 ... PDF had not been designed with validation in mind, for example. The gov't needed a way to define a subset of PDF that was optimized for their use cases.
17:07:26 ... They brought Adobe and others on board, opening this from fed gov't use cases to a generally more reliable PDF (now, PDF/A)
17:07:53 ... There are three parts to the standard: Part One = original standard, based on the then-current version (1.4) of the PDF standard.
17:08:40 ... New versions of PDF developed after that (post 1.4), so then the effort of PDF/A came to align with PDF as an ISO standard (post 1.7)
17:09:07 ... PDF/A part three introduced one change.
17:09:24 ... The biggest debate between one and two was whether to allow attachments in a PDF/A file, and if so, what types.
17:10:00 Bill: when you say attached to it, are you talking about a completely separate doc, or something embedded?
17:10:09 lrosenth: yes, embedded file.
17:11:01 ... PDF/A one specifically forbids embedded files. Then found there were many real-world use cases for allowing embedding (e.g., archiving world uses collections which would have benefited from that feature)
17:11:48 ... In part two, allowed for embedding PDF/A within other PDF/A. But there is still controversy around inclusion of non-PDF/A files, esp. in the archival community.
17:12:25 ... In part three, you can embed anything. That's the biggest difference between two and three.
17:13:16 ... The archival community still has many negative things to say about arbitrary blobs within a file, but in other use cases, this is necessary.
17:14:02 q?
17:14:18 ack ti
17:14:50 Tim: When we talk about embedding, I think of embedding thoughts. Where does that fit?
17:15:19 s/thoughts/fonts
17:15:26 lrosenth: What's common to all of the parts is that all data must be self-contained.
All the fonts, images, text, color profiles, must be included. There can be no external references to content.
17:15:50 (Thanks, Tim - misheard the word)
17:16:51 ... There is no security, encryption, or DRM. This was agreed to from day one. The problem with security is that over time, you can lose passwords and render the content inaccessible.
17:17:36 ... PDF, because it was not designed for validation, allowed you to do the same thing in different ways. PDF/A restricts that to allow only one way.
17:18:37 ... Last big thing, consistent across all parts, is that nothing may change the content out from under the user without their explicit permission (e.g., JavaScript cannot hide or change things as it might with forms)
17:19:18 Tim: That answers the question, raises several things to think about
17:20:07 lrosenth: In PDF/A there are three levels of conformance, a b and c. The reason for the three levels of conformance is that there are so many use cases.
17:20:56 ... b = basic. Exists primarily for paper-to-PDF scenarios. If you have a document that is full of images, that is acceptable in a PDF/A. If you have no text there, it doesn't make sense to require things like Unicode or accessibility tags.
17:21:56 correction: three levels of conformance: b, u, a
17:22:07 ... u = if there is text, Unicode is required.
17:23:14 ... a = all or accessible; you have to conform to every single thing in the standard, and the additions support accessibility
17:23:54 ... a led to PDF/UA, which focused entirely on accessibility. Things like alt-text on images, defined word breaks, semantic tagging.
17:24:33 ... When we think about PWP and archiving, all of the things talked about have been about a file format. How to create a format where the file itself is a long-term format.
17:25:01 ... The PDF/A standard incorporates another set of requirements: the conforming reader requirements.
The standard defines how the reader must interact with the file, to make sure things render as expected.
17:25:43 ... Thirteen years ago, not every PDF viewer used embedded fonts. So, had to specify in the standard that if embedded fonts were present, the reader had to use them rather than substituting its own.
17:26:08 ... Similarly, color has to be rendered in a device independent manner. Red must stay red.
17:26:35 ... There are other things around forms, digital signatures.
17:27:54 ... Digital Signatures in PDF/A led to a collaboration between US and other national standards bodies, creating a new set of international standards called PADASS
17:27:57 q+
17:28:06 ack ti
17:28:41 Tim: Regarding device independence, how far does PDF/A take that? Does PDF/A allow for a different view per device (e.g., mobile vs laptops)?
17:29:41 lrosenth: PDF/A has no awareness of device. Where color comes in, when you take the PDF to a different device, the color will follow the rules of color management.
17:30:04 ... We don't have to worry about color management in PWP because the open web platform does not do color management.
17:30:50 ... Also important is the metadata and marginalia (the stuff you write in the margins). For archivists, the marginalia is sometimes more important than the text.
17:31:26 ... The idea that we have an archival format meant that things like comments and annotating, and a rich and extensible metadata capability that can cover the whole doc or parts, were important.
17:31:48 ... You can do document metadata or individual object metadata. In the context of packaging the PWP, that's something we'll need to consider.
17:32:35 ... The one extension that had to be made to PDF and PDF/A around metadata is that we're talking about something that is self-contained and understandable in the long term future. So, if you use custom metadata, then you need to include the schema for that metadata in the metadata itself.
17:33:14 Tim: that sounds very important.
17:33:44 lrosenth: PDF and PDF/A continue to evolve. Current areas of focus include things we didn't understand.
17:34:20 ... Multimedia is allowed in PDF but isn't allowed in PDF/A because we didn't understand the archival needs for audio and video. This is still an area of development in the archival industry itself.
17:35:00 ... We have come to understand 3-D, so there is an effort to incorporate 3-D elements into a PDF/A. This is somewhat relevant to the PWP.
17:35:22 ... Archiving the presentation and the data of a form is easy, but what we don't have a good grasp on is how to archive the business rules.
17:35:47 ... Business rules are done entirely in scripting, but that's not a great solution for archival documents if you want people to see the same thing far into the future.
17:35:54 What about GIS, shape, or other geospatial file formats?
17:35:59 Does PDF/A allow those?
17:36:03 ... Still figuring out how to preserve that business data.
17:37:15 lrosenth: PDF 1.7 does not support geospatial data, but it is part of the upcoming 2.0 standard. PDF/A is also looking at it and everything else that's being proposed in 2.0.
17:38:14 s/PADASS/PAdES
17:38:36 Tim: We have mentioned a few things that we need to think about for PWP. Maybe there are more things we need to talk about. This TF is meant to communicate some of these issues to the larger group through simple, granular use cases.
17:39:16 ... One comes to mind: an archival service wishes to augment the metadata of a PWP that is important for their service. Or maybe the PWP publisher wishes to do that. What are the requirements on making sure that metadata can be read in the future?
17:39:40 sure
17:39:59 scribenick: ayla_stein
17:40:28 ... Heather, since you're becoming the expert on use cases, can you talk about possible use cases here...
17:40:30 https://www.w3.org/dpub/IG/wiki/UseCase_Template
17:40:48 http://w3c.github.io/dpub-pwp-arch/Archival-UCR.html
17:40:55 HeatherF: on the dpub wiki there's a template of what should be included in a use case
17:41:10 Tim: we tried to create a page in github.
17:41:31 ... other than title we tried to include description
17:41:47 HeatherF: the two most important aspects are description and requirements
17:42:38 ... for how some of the PDF/A lessons learned can be applied to PWP use cases, I need to stare at the minutes for a while and see how use cases could come out of this
17:42:59 Tim: Can we talk informally about possible use cases?
17:43:14 HealtherF: you mentioned a possible use case about archivists' needs for metadata
17:43:24 *HeatherF
17:43:42 Tim: Does that make sense, Leonard?
17:43:59 lrosenth: when we started this group there were a couple of goals
17:44:22 ... what are the general things that are base requirements for the PWP for archiving
17:44:55 ... the other idea is that maybe there's an archival profile for PWP. The interesting thing, if we want to go down that path, is what the difference is
17:45:29 ... for the metadata, if we need to include the schemas, is that part of the archival profile or the PWP?
17:45:36 Tim: yes we need to think about this
17:46:08 HeatherF: Why don't we think about it this way. A magazine publisher or trade publisher doesn't need or care about the archival requirements.
17:46:26 ... we need to make sure that there isn't a barrier for them to pick this up
17:46:38 Tim: this could be at the component level...
17:47:07 Bill_Kasdorf: Another thing from Leonard's presentation is that what we mean by a profile is a subset.
17:47:39 ... this seems to be an important issue. The archival profile should be about PWP in every aspect. PWP should be flexible.
Archival profile should be more strict
17:47:59 Tim: That's something we should bring back to the larger group to think about
17:48:26 ... archival profile needs to still be readable by a PWP reader
17:48:37 *missed what Bill said
17:48:49 Tim: we talked a bit about this with Nicholas
17:49:22 ... what kind of browser, email (I missed this)
17:49:36 ... the other thing that Leonard mentioned is what kind of script...
17:50:47 Tim: I was just saying I'm not sure what scripts such as javascript will mean for PWP
17:51:13 HeatherF: unfortunately you're right, javascript and dynamic scripts have become quite ubiquitous. Which is annoying
17:51:35 ... I think having dynamic scripts in any kind of long term storage is a bad idea but I don't know how we're going to get around that
17:52:32 lrosenth: can we take advantage of the fact that we have both file format rules and business rules, and have rules where the script should be included but not executed...
17:53:04 Tim: we're coming up on the end of the hour and I think we have the fodder for a number of use cases
17:53:33 ... some of you are used to editing on github, some are not. In some way or another I'd like everyone to help flesh out the use cases
17:53:45 http://w3c.github.io/dpub-pwp-arch/Archival-UCR.html
17:53:54 ... the link address is viewable by everyone...
17:54:18 ... can we get people to try and do that either in the github document or via email if they have an idea for a use case
17:54:37 ... we'll try to do that within the next two weeks so we can talk about it in the next call
17:54:49 ... how will that work with assembling use cases?
17:55:09 HeatherF: It's kind of ongoing...we'll be talking about it at the F2F at the end of the meeting
17:55:50 Tim: try to think about use case title and descriptions at least. Heather, can you look at the url at the link I posted and see if they need to be re-worked, etc
17:56:05 Tim: Leonard and Bill can you think about use cases?
17:56:10 Leonard and Bill: Yup
17:56:50 lrosenth: this isn't a huge priority but a few calls ago we talked about getting some of the people involved in archiving. We did bring in the people from CLOCKSS and LOCKSS.
17:57:12 ... I met the KNB version of Nicholas Taylor who would love to participate in one of our calls
17:57:27 Tim: Can you share his contact info with Ayla and I?
17:57:49 s/KNB/KB, the Dutch National Library
17:58:10 ... we're still waiting on a couple of contacts that I'm responsible for from the Internet Archive and Los Alamos National Laboratory
17:58:15 *thanks Bill!
17:58:21 Tim: anything else?
17:58:51 ... okay let's save some time. We'll have the next call in two weeks where we'll talk about use case ideas that people have come up with or talk about what's already up there
17:58:55 ... sound good?
17:59:06 *end
17:59:39 rrsagent, make log public
17:59:49 rrsagent, draft minutes
17:59:49 I have made the request to generate http://www.w3.org/2016/04/21-dpub-arch-minutes.html TimCole
18:00:12 rrsagent, make log public
18:00:54 zakim, bye
18:00:54 leaving. As of this point the attendees have been Tim_Cole, Heather_Flanagan, ayla_stein, Bill_Kasdorf, Leonard_Rosenthol
18:00:54 Zakim has left #dpub-arch
18:01:19 rrsagent, bye
18:01:19 I see no action items