18:00:52 RRSAgent has joined #dpub-arch
18:00:52 logging to http://www.w3.org/2016/02/18-dpub-arch-irc
18:01:10 dkaplan3 has joined #dpub-arch
18:01:14 liam has joined #dpub-arch
18:01:18 Present+ Deborah_Kaplan
18:01:26 Present+ Liam_Quin
18:01:29 present+ Bill_Kasdorf
18:01:30 astein has joined #dpub-arch
18:01:38 Zakim has joined #dpub-arch
18:01:53 rrsagent, set log public
18:02:05 Meeting: DPUB Archival TF
18:03:09 Agenda: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Feb/0104.html
18:04:41 scribenick: astein
18:04:46 astein is giving scribing a try
18:05:18 TimCole: am I forgetting anything about the scribing process that I should tell Ayla?
18:05:35 lrosenth has joined #dpub-arch
18:06:00 dkaplan3: you say the person's name, what they say, and '...' if they continue
18:06:09 https://dev.w3.org/cvsweb/~checkout~/2002/scribe/scribedoc.htm?content-type=text/html#Quick_Start_Guide
18:06:23 dkaplan: type '??' if you don't know what they say
18:06:29 Present+ TimCole
18:06:45 Present+ astein
18:06:51 present+ Leonard
18:07:12 Topic: Prior Minutes
18:07:16 TimCole: first up, minutes
18:07:25 https://www.w3.org/2016/02/04-dpub-arch-minutes.html
18:07:41 TimCole: which are located at the link posted in IRC
18:08:19 TimCole: Leonard and Deborah had an email chat about the minutes
18:08:28 dkaplan: the minutes were correct, there was a misreading
18:08:48 the minutes were correct
18:08:57 TimCole: minutes were accepted
18:09:08 TimCole: Let's talk about meeting time
18:09:55 TimCole: ...which is how we ended up with today's time, but there weren't any times that were good for everyone.
This time is kind of hard
18:10:17 someone: why do you think this is Europe unfriendly, early evening seems the perfect time
18:10:41 no preference
18:10:42 Leonard actually, not someone
18:10:50 q+
18:11:37 dkaplan: small preference for Tuesday
18:11:43 astein: preference for Thursday
18:11:53 TimCole: 1st and 3rd Thursdays of each month through May
18:12:01 TimCole: Let's see if we can get done by then
18:12:25 Proposed Resolution: TF will meet 1st and 3rd Thursdays at 1 PM Eastern (US) through end of May
18:12:26 TimCole: typing something in here...
18:12:55 TimCole: Resolution: TF will meet 1st and 3rd Thursdays at 1 PM Eastern (US) through end of May
18:13:07 Resolution: TF will meet 1st and 3rd Thursdays at 1 PM Eastern (US) through end of May
18:13:42 Topic: Portico
18:14:02 TimCole: Let me talk about my conversation with a couple of people from Portico
18:14:19 Leonard: What is Portico?
18:14:36 http://www.portico.org/digital-preservation/
18:15:21 Bill_Kasdorf: Portico is one of the biggest dark archives in scholarly publishing...so if a publisher goes out of business, that's a triggering event, which will provide access to all the libraries who subscribed to it so they'll never lose access to it
18:15:44 http://www.ithaka.org/
18:15:48 Bill_Kasdorf: in the scholarly publishing world, the other big one is LOCKS/CLOCKS
18:16:01 *LOCKSS/CLOCKSS
18:16:21 TimCole: gave link to parent organization, Ithaka, which also runs JSTOR
18:16:55 TimCole: Skyped with Amy and Sheila about what we're doing and what they're doing that we might want to know about/keep track of
18:17:04 http://www.portico.org/digital-preservation/wp-content/uploads/2013/08/Porticopublishersbrochure.pdf
18:17:09 q+ after tim is done with this summary
18:17:23 http://www.portico.org/digital-preservation/services
18:17:25 TimCole: http://www.portico.org/digital-preservation/wp-content/uploads/2013/08/Porticopublishersbrochure.pdf for a brochure of what they do
18:17:25 q+ to hold until after tim is
done with this summary
18:17:34 oh I am not on the queue
18:17:54 q?
18:17:57 thanks!
18:17:59 use q- to remove yourself
18:18:00 ack astein
18:18:50 TimCole: As Bill said, they get originals from the publishers. What they get from the publishers varies quite a bit. Typically they get master files, what is actually published. Sometimes they get renditions
18:19:18 They sometimes get publications in two and a half formats, xml, etc...
18:19:48 TimCole: sometimes they get just a zip with a bunch of folders, with one folder containing XML, one folder containing PDF, etc.
18:20:24 TimCole: for each publisher, Portico creates a profile, so they can normalize what they get
18:20:55 TimCole: They normalize against standards like JATS, which was created by the National Library of Medicine. They're starting to work with EPUB
18:21:33 TimCole: as Bill said, they're a dark archive...they do try to get a PDF, or XML transformed into HTML
18:22:19 TimCole: They try to render what they're given. They just started looking into EPUB. They're hoping that what they get in EPUB they won't have to normalize. They're very interested in what the EPUB group is talking about...
18:22:55 TimCole: The other thing they struggle with a little bit is metadata. Those of us who have been involved with EPUB...they have similar discussions
18:23:48 TimCole: They extract information from their archive and create simple Dublin Core metadata...this could allow a very simple discovery layer if the publisher goes away. They do try to have an HTML display of that metadata
18:24:21 TimCole: graphics are an interesting thing they deal with. They usually get them in several different resolutions, including thumbnails and high res, which they save
18:25:04 TimCole: ...they do run into some issues with older PDFs breaking.
...they're trying to make sure that they pay attention to PDF so they can migrate to newer versions of PDF or PDF/A
18:25:19 TimCole: as of right now they don't automatically transform everything into PDF/A
18:25:22 ...
18:25:32 JHOVE
18:25:46 JHOVE - blech :(. (poorly implemented and unsupported)
18:25:49 TimCole: They're doing some work with JHOVE which is a service that identifies file formats
18:25:49 http://jhove.openpreservation.org/
18:26:22 TimCole: They're doing some work with interoperability of file formats, content, and metadata that publishers use
18:26:34 TimCole: they wish that publishers would try to use RDF more
18:26:52 TimCole: I learned a lot of things, including a few that surprised me,???
18:26:53 q?
18:26:59 ack dkaplan
18:26:59 dkaplan, you wanted to hold until after tim is done with this summary
18:27:20 dkaplan: this is a good place for me to jump in that I wanted to say and everything about what you said they did
18:27:31 dkaplan - I know…doesn’t make it better
18:27:55 dkaplan: there's another thing that's vital. Ultimately, the job of the archivist is that you're going to give them some stuff and they're going to figure out what to do with it
18:28:03 q+
18:28:14 dkaplan: ...we want to make certain assumptions about fixity, workflows, PREMIS....
18:28:37 dkaplan: we want to say that PWPs can be described in a certain way, but they'll take what we give them
18:29:11 dkaplan: it would be great if we could say, if we put a bunch of metadata in the manifest, what could it be that you could extract, etc
18:29:59 dkaplan: these places are taking disparate and undescribed datasets...they're taking everything...JHOVE looks at file formats and says 'ARG this file format isn't going to be supported soon, you should do something about it'
18:30:26 dkaplan: ...what kinds of things would you extract from our manifest if we could put stuff in the manifest?
18:31:14 ack bill
18:31:27 TimCole: ...Yes they'll take content, but sometimes they'll do bit-level preservation as long as there is software to use, so you can get it back. Other times, they very actively transform file formats, etc. Prefer file formats that are easy to read
18:31:40 q+
18:31:51 Bill_Kasdorf: +1 to everything Tim and Deborah said. ....
18:32:26 Bill_Kasdorf: Is PWP a format that publishers could easily provide to these archivists or is it something that archivists could transform
18:33:18 Bill_Kasdorf: Portico's strategy was always to normalize things so they have a master format. They focused on scholarly journals that pretty much all used the same format. Used to be ??? and is now JATS. Books are bits?
18:33:47 Bill_Kasdorf: they started out with the big publishers, Springer, Elsevier, who all have great workflows
18:34:16 Bill_Kasdorf: but as they started to deal with smaller publishers, books, different kinds of content, this whole normalization plan starts to shake
18:34:45 Bill_Kasdorf: I was really interested to see that they said they're interested in EPUB and PWP
18:35:34 Bill_Kasdorf: ...ideally it would be great to see PWP be something that both the providers and the recipients of archival content could agree on...that would take enormous tension out of the process
18:36:14 TimCole: they wouldn't reduce their reliance on JATS, they'd take a PWP or EPUB publication as an additional format that they could archive
18:36:55 TimCole: But I don't think they were suggesting that they would stop receiving content that would be normalized into JATS...it depends on what they get
18:37:22 q?
18:37:22 TimCole: They don't have any content in EPUB yet. It might be something they can use internally
18:37:31 ack lros
18:37:54 leonard: Bill said something...'what are we trying to achieve in this group'
18:38:10 lrosenth: I sent out that info about PDF/A...
18:38:33 q+
18:38:45 lrosenth: there are still concepts that we do know about...we could talk about best practices for creating an archival document that will withstand the test of time
18:39:08 lrosenth: we look at the open web platform...
18:40:01 TimCole: Marcus and someone had an idea to bring up to the bigger group...???
18:40:10 lrosenth: what is metadata and what is it not?
18:40:17 ack dka
18:40:20 lrosenth: identifying use cases perfect
18:40:47 dkaplan: I like the idea of one of our deliverables being best practices, considering PWP won't be done yet
18:40:58 dkaplan: I think use cases is another good deliverable
18:41:41 dkaplan: with outreach to as many organizations as possible, ask them 'in an ideal world, what are the things that they would want to extract from the PWP. What would they want to extract from PWP/A'
18:42:07 q+
18:42:19 dkaplan: for those not in the library world, PREMIS is a data dictionary that is used to describe events on an object and who performed the action
18:42:47 dkaplan: maybe we would decide that a PWP/A would have some sort of space where you could put something like PREMIS or maybe not...
18:43:22 dkaplan: that would be decided by talking to these different organizations. Which of these are more related to PDF/A and which aren't?
18:43:22 q+
18:43:23 :P
18:43:42 ack lros
18:43:56 dkaplan: we could make a recommendation for the minimum that every manifest should have in a PWP
18:44:27 lrosenth: the idea of an audit trail is actually described in PWP since the beginning for the same reasons. Even though it was specified it was never used
18:44:38 ack bill
18:45:08 Bill_Kasdorf: this has been an excellent discussion. 2 fundamental concerns for people who are archiving are: versioning and migration
18:45:32 Bill_Kasdorf: one of the biggest problems is: this works now, how can I make sure it will work tomorrow?
18:45:57 Bill_Kasdorf: some sort of info about if you do it this way you will be able to get to it in the future.
18:46:13 Topic: Use Cases
18:46:24 ha!
18:46:55 q+
18:47:04 ack dka
18:47:08 TimCole: UseCases - how do we get started on our use case document - wiki, github, etc?
18:47:32 dkaplan: I would recommend either wiki or github because they're a little bit easier than email for multiple authors
18:47:33 github?
18:47:42 +0
18:47:44 0 (no preference - either is fine)
18:47:47 +1
18:47:47 0
18:47:48 0
18:47:55 dkaplan: I think we should use whichever one has the least amount of 'not this' votes
18:47:58 +1 github
18:48:18 TimCole: I should talk to Ivan to make sure I set up the github page correctly
18:49:04 *debate between github and wiki*
18:49:23 TimCole: advantage to github is that you can create issues
18:49:43 Bill_Kasdorf: I just used the wrong voting indicator!
18:49:51 TimCole: we'll try github to see how it works
18:50:30 TimCole: we're talking about use cases that are driving recommendations or best practices for preserving PWP documents
18:50:43 TimCole: Does that definition work?
18:50:50 Topic: Who's doing what over the next 2 weeks
18:51:24 TimCole: we have our task force page, we mentioned on it that we need to reach out to LOCKSS/CLOCKSS, Portico, NISO
18:51:40 TimCole: Who is going to do what? I volunteered to contact NISO and Portico
18:51:48 TimCole: ???
18:52:20 TimCole: do initial outreach, either an email or phone call, and see if they can join one of our calls
18:52:42 lrosenth: what about NARA or Library of Congress?
18:52:54 lrosenth: I'll be happy to reach out to them
18:53:13 Bill_Kasdorf: I could offer similar contacts at the British Library and the KB (Dutch National Library)
18:53:51 dkaplan: I'm trying to reach out through extended contacts to the National Library of Australia but I don't actually have contacts
18:53:58 Bill_Kasdorf: I could probably help with it
18:54:23 Bill_Kasdorf: DPLA or Europeana, actually I'll retract that
18:55:00 dkaplan: that being said, the DPLA is right in my backyard. They're really easy to reach out to.
I can reach out to Mark Matienzo
18:55:10 dkaplan: he's worked on more than DPLA
18:55:25 Bill_Kasdorf: Boston Public has a big project with special collections
18:55:53 TimCole: do we have anyone interested in talking to LOCKSS/CLOCKSS?
18:56:19 Bill_Kasdorf: I don't have a lot of bandwidth but I could get Vicki from LOCKSS/CLOCKS
18:56:20 S
18:56:43 TimCole: could introduce Ayla to CLOCKSS/LOCKSS
18:57:01 TimCole: Ayla and Tim will contact Chris Prom and Bill Ingram at UIUC
18:57:26 TimCole: add your name and who you're talking to on the wiki. Reach out to contacts for our next call
18:57:38 TimCole: maybe we can talk a little more about timeline as well
18:58:23 lrosenth: can contact the GPO
18:58:43 dkaplan: I can talk to a lot of people who are very active in government depository libraries
18:59:44 TimCole: we're going to stay on Thursdays at 1PM EST
18:59:57 Bill_Kasdorf: will you send a calendar invite?
19:00:00 TimCole: yes
19:00:26 rssagent, draft minutes
19:00:47 thanks for doing that, I wasn't sure if I should
19:00:55 I have to run to another meeting!
19:02:06 RRSAgent, draft minutes
19:02:06 I have made the request to generate http://www.w3.org/2016/02/18-dpub-arch-minutes.html TimCole
19:02:50 RRSAgent, set log public
21:08:45 Zakim has left #dpub-arch
21:12:12 liam has left #dpub-arch