W3C

DPUB Archival Task Force

18 Feb 2016

Agenda

See also: IRC log

Attendees

Present
Deborah_Kaplan, Liam_Quin, Bill_Kasdorf, Tim_Cole, Ayla_Stein, Leonard_Rosenthol
Regrets
Tzviya_Siegman, Heather_Flanagan
Chair
Tim_Cole
Scribe
Ayla_Stein

<TimCole> scribenick: astein

TimCole: am I forgetting anything about scribing process that I should tell Ayla?

dkaplan: you say the person's name, what they say, and '...' if they continue

<TimCole> https://dev.w3.org/cvsweb/~checkout~/2002/scribe/scribedoc.htm?content-type=text/html#Quick_Start_Guide

dkaplan: type '??' if you don't know what they say

Prior Minutes

TimCole: first up, minutes

<TimCole> https://www.w3.org/2016/02/04-dpub-arch-minutes.html

TimCole: which are located at the link posted in IRC
... Leonard and Deborah had an email chat about the minutes

dkaplan: the minutes were correct, there was a misreading

TimCole: minutes were accepted
... Let's talk about future meeting times

TimCole: The Doodle poll is how we ended up with today's time, but there weren't any times that were good for everyone. This time might be kind of hard for European participants.

lrosenth: why do you think this is Europe-unfriendly? Early evening seems like the perfect time

<lrosenth> no preference

dkaplan: small preference for Tuesday

astein: preference for Thursday

TimCole: 1st and 3rd Thursdays of each month through May?

TimCole: Let's see if we can get done by then

<TimCole> Proposed Resolution: TF will meet 1st and 3rd Thursdays at 1 PM Eastern (US) through end of May

RESOLUTION: TF will meet 1st and 3rd Thursdays at 1 PM Eastern (US) through end of May

Portico

TimCole: Let me talk about my conversation with a couple people from Portico

lrosenth: What is Portico?

<TimCole> http://www.portico.org/digital-preservation/

Bill_Kasdorf: Portico is one of the biggest dark archives in scholarly publishing. If a publisher goes out of business, that's a triggering event, which opens up access for all the libraries that subscribed to the content, so they never lose access to it

<TimCole> http://www.ithaka.org/

Bill_Kasdorf: in the scholarly publishing world, the other big one is LOCKSS/CLOCKSS

TimCole: gave link to parent organization, Ithaka, which also runs JSTOR
... Skyped with Amy and Sheila about what we're doing and what they're doing that we might want to know about/keep track of

<TimCole> http://www.portico.org/digital-preservation/wp-content/uploads/2013/08/Porticopublishersbrochure.pdf

<TimCole> http://www.portico.org/digital-preservation/services

TimCole: see http://www.portico.org/digital-preservation/wp-content/uploads/2013/08/Porticopublishersbrochure.pdf for a brochure describing what they do

TimCole: As Bill said, they get originals from the publishers. What they get from the publishers varies quite a bit. Often they get master files, the files behind what is actually published. Sometimes they get renditions as well as or instead of master files.
... On average, Portico gets each publication in two and a half formats (XML, etc.)

TimCole: sometimes they get just a zip for each item, sometimes zips for an entire issue with a bunch of folders, with one folder containing XML, one folder containing PDF, etc.
... for each publisher, Portico creates a profile, so they can normalize what they get
... They normalize against standards like JATS and BITS, which were created by the National Library of Medicine. They're starting to work with EPUB
... as Bill said, they're a dark archive...they do try to get a PDF, or XML transformed into HTML
... They try to render what they're given in case they ever need to. They just started looking into EPUB. They're hoping that what they get in EPUB they won't have to normalize. They're very interested in what the EPUB and DPub groups are talking about with regard to PWP...
... The other thing they struggle with a little bit is metadata. Those of us who have been involved with DPub IG...we have had similar discussions
... They extract information from their archive and create simple Dublin Core metadata. This could allow a very simple discovery layer if the publisher goes away. They do try to have an HTML display of that metadata
... graphics are an interesting thing they deal with. They often get them in multiple different resolutions, including thumbnails and high-res versions, which they save
... they do run into some issues with older PDFs breaking. They're trying to make sure that they pay attention to PDF so they can migrate to newer versions of PDF or PDF/A

TimCole: as of right now they don't automatically transform everything into PDF/A

TimCole: They're doing some work with JHOVE, which is a service that identifies file formats

<dkaplan3> JHOVE

<lrosenth> JOVE - blech :(. (poorly implemented and unsupported)

<dkaplan3> http://jhove.openpreservation.org/

TimCole: They're doing some work with interoperability of file formats, content, and metadata that publishers use
... they wish that publishers would try to use RDF more
... No big surprises, but the conversation was useful and I learned a lot more about the details.

<Zakim> dkaplan, you wanted to hold until after tim is done with this summary

dkaplan: this is a good place for me to jump in. I wanted to say yes to everything you said about what they do

<lrosenth> dkaplan - I know…doesn’t make it better

dkaplan: there's another thing that's vital. Ultimately, the job of the archivist is that you're going to give them some stuff and they're going to figure out what to do with it
... we want to make certain assumptions about fixity, workflows, PREMIS...

dkaplan: we want to say that PWPs can be described in a certain way, but they'll take what we give them
... it would be great if we could say, 'if we put a bunch of metadata in the manifest, what would you be able to extract?', etc.
... these places are taking disparate and undescribed datasets. They're taking everything. JHOVE looks at file formats and says 'ARG, this file format isn't going to be supported soon, you should do something about it'
... what kinds of things would you extract from our manifest if we could put stuff in the manifest?

TimCole: Yes, they'll take content, but sometimes they'll do bit-level preservation as long as there is software to use, so you can get it back. Other times, they very actively transform file formats, etc. They prefer file formats that are easy to read and not dependent on specialized software.

Bill_Kasdorf: +1 to everything Tim and Deborah said.
... Is PWP a format that publishers could easily provide to these archivists or is it something that archivists could transform?
... Portico's strategy was always to normalize things so they have a master format. They focused on scholarly journals that pretty much all used the same format. It used to be NLM and is now JATS. Books are BITS.
... they started out with the big publishers, Springer, Elsevier, who all have great workflows
... but as they started to deal with smaller publishers, books, different kinds of content, this whole normalization plan starts to shake
... I was really interested to see that they said they're interested in EPUB and PWP

Bill_Kasdorf: ideally it would be great to see PWP be something that both the providers and the recipients of archival content could agree on. That would take enormous tension out of the process

TimCole: they wouldn't reduce their reliance on JATS, they'd take a PWP or EPUB publication as an additional format that they could archive
... But I don't think they were suggesting that they would stop receiving content that would be normalized into JATS...it depends on what they get
... They don't have any content in EPUB yet. They are just experimenting now, but they think it might be a format they could maybe archive internally without having to transform to BITS or something else. Might need to normalize a little.

lrosenth: Bill said something... what are we trying to achieve in this group?

lrosenth: I sent out that info about PDF/A...
... there are still concepts that we do know about...we could talk about best practices for creating an archival document that will withstand the test of time
... we look at the open web platform...

TimCole: From previous call -- Markus and Tzviya had the idea that this group should bring to the IG use cases relevant to the development of PWP docs and other IG work.

lrosenth: what is metadata and what is it not?
... identifying use cases is perfect

dkaplan: I like the idea of one of our deliverables being best practices, to the extent we can, considering PWP won't be done yet
... I think use cases is another good deliverable
... with outreach to as many organizations as possible, ask them, 'in an ideal world, what would you want to extract from the PWP? What would you want to extract from a PWP/A?'
... for those not in the library world, PREMIS is a data dictionary that is used to describe events on an object and who performed the action
... maybe we would decide that a PWP/A would have some sort of space where you could put something like PREMIS, or maybe not...
... that would be decided by talking to these different organizations. Which of these are more related to PDF/A and which aren't?

dkaplan: we could make a recommendation for the minimum [with regard to preservation/archiving] that every manifest should have in a PWP

lrosenth: the idea of an audit trail being described in PDF/A has been there since the beginning, for the same reasons. Even though it was specified, it was never used

Bill_Kasdorf: this has been an excellent discussion. Two fundamental concerns for people who are archiving are versioning and migration
... one of the biggest problems is: this works now, but how can I make sure it will work tomorrow?
... some sort of info saying that if you do it this way, you will be able to get to it in the future.

Use Cases

TimCole: Use cases: how do we get started on our use case document? Wiki, github, etc.?

dkaplan: I would recommend either wiki or github because they're a little bit easier than email for multiple authors

<TimCole> github?

<dkaplan3> +0

<lrosenth> 0 (no preference - either is fine)

<TimCole> +1

<liam> 0

<Bill_Kasdorf> -1

dkaplan: I think we should use whichever one has the fewest 'not this' votes

+1 github

TimCole: I should talk to Ivan to make sure we can set up a github page for Use Cases correctly

*debate between github or wiki*

TimCole: if we can, one advantage to github might be that you can create and track individual issues in the branch without having to mess up the main use case page

TimCole: we'll try github to see how it works
... we're talking about use cases that are driving recommendations or best practices for preserving PWP documents
... Does that definition work?

Who's doing what over the next 2 weeks

TimCole: we have our task force page, where we mentioned that we need to reach out to LOCKSS/CLOCKSS, Portico, NISO
... Who is going to do what? I volunteered to contact NISO and Portico
... and Internet Archive
... The goal is to do initial outreach, either an email or phone call, and see if it would be worthwhile (and if they'd be willing) for them to join one of our calls

lrosenth: what about NARA or Library of Congress?
... I'll be happy to reach out to them

Bill_Kasdorf: I could offer similar contacts at the British Library and the KB (Dutch National Library)

dkaplan: I'm trying to reach out through extended contacts to the National Library of Australia but I don't actually have contacts

Bill_Kasdorf: I could probably help with it
... DPLA or Europeana... actually, I'll retract that

dkaplan: that being said, the DPLA is right in my backyard. They're really easy to reach out to. I can reach out to Mark Matienzo
... he's worked on more than DPLA

Bill_Kasdorf: Boston Public has a big project with special collections

TimCole: do we have anyone interested in talking to LOCKSS/CLOCKSS

Bill_Kasdorf: I don't have a lot of bandwidth but I could get Vicki from LOCKSS/CLOCKSS

TimCole: could introduce Ayla to CLOCKSS/LOCKSS
... Ayla and Tim will contact Chris Prom and Bill Ingram at UIUC
... add the name of whoever you're talking to to the TF wiki page. Reach out to at least some contacts before our next call and we will discuss.
... maybe we can talk a little more about timeline for TF as well

lrosenth: I can contact the GPO

dkaplan: I can talk to a lot of people who are very active in government depository libraries

TimCole: we're going to stay on Thursdays at 1 PM Eastern for our calls.

Bill_Kasdorf: will you send a calendar invite?

TimCole: yes

Summary of Action Items

Summary of Resolutions

  1. TF will meet 1st and 3rd Thursdays at 1 PM Eastern (US) through end of May
[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.144 (CVS log)
$Date: 2016/02/19 10:29:34 $