See also: IRC log
<TimCole> scribenick: astein
TimCole: am I forgetting anything about the scribing process that I should tell Ayla?
dkaplan3: you say the person's name, what they say, and '...' if they continue
dkaplan: type '??' if you don't know what they say
TimCole: first up, minutes
<TimCole> https://www.w3.org/2016/02/04-dpub-arch-minutes.html
TimCole: which are located at the link posted in IRC
... Leonard and Deborah had an email chat about the minutes
dkaplan: the minutes were correct, there was a misreading
TimCole: minutes were accepted
... Let's talk about future meeting times
TimCole: The Doodle poll is how we ended up with today's time, but there weren't any times that were good for everyone. This time might be kind of hard for European participants.
lrosenth: why do you think this is Europe-unfriendly? Early evening seems like the perfect time
<lrosenth> no preference
dkaplan: small preference for Tuesday
astein: preference for Thursday
TimCole: 1st and 3rd Thursdays of each month through May?
TimCole: Let's see if we can get done by then
<TimCole> Proposed Resolution: TF will meet 1st and 3rd Thursdays at 1 PM Eastern (US) through end of May
RESOLUTION: TF will meet 1st and 3rd Thursdays at 1 PM Eastern (US) through end of May
TimCole: Let me talk about my conversation with a couple people from Portico
Leonard: What is Portico?
<TimCole> http://www.portico.org/digital-preservation/
Bill_Kasdorf: Portico is one of the biggest dark archives in scholarly publishing... so if a publisher goes out of business, that's a triggering event, which opens up access for all the libraries that subscribed to the content, so they'll never lose access to it
<TimCole> http://www.ithaka.org/
Bill_Kasdorf: in the scholarly publishing world, the other big one is LOCKSS/CLOCKSS
TimCole: gave link to parent organization, Ithaka, which also runs JSTOR
... Skyped with Amy and Sheila about what we're doing and what they're doing that we might want to know about/keep track of
<TimCole> http://www.portico.org/digital-preservation/wp-content/uploads/2013/08/Porticopublishersbrochure.pdf
<TimCole> http://www.portico.org/digital-preservation/services
TimCole: see http://www.portico.org/digital-preservation/wp-content/uploads/2013/08/Porticopublishersbrochure.pdf for a brochure of what they do
TimCole: As Bill said, they get originals from the publishers. What they get from the publishers varies quite a bit. Often they get master files, the files behind what is actually published. Sometimes they get renditions as well as or instead of master files.
... On average Portico gets each publication in two and a half formats (XML, etc.)
TimCole: sometimes they get just a zip for each item, sometimes zips for an entire issue with a bunch of folders, with one folder containing XML, one folder containing PDF, etc.
... for each publisher, Portico creates a profile, so they can normalize what they get
... They normalize against standards like JATS and BITS, which were created by the National Library of Medicine. They're starting to work with EPUB
... as Bill said, they're a dark archive... they do try to get a PDF, or XML transformed into HTML
... They try to render what they're given in case they ever need to. They just started looking into EPUB. They're hoping that what they get in EPUB they won't have to normalize. They're very interested in what the EPUB and DPub groups are talking about with regard to PWP...
... The other thing they struggle with a little bit is metadata. Those of us who have been involved with DPub IG... we have had similar discussions
... They extract information from their archive and create simple Dublin Core metadata... this could allow a very simple discovery layer if the publisher goes away. They do try to have an HTML display of that metadata
... graphics are an interesting thing they deal with. They often get them in multiple different resolutions, including thumbnails and high res, which they save
... they do run into some issues with older PDFs breaking... they're trying to make sure that they pay attention to PDF so they can migrate to newer versions of PDF or PDF/A
TimCole: as of right now they don't automatically transform everything into PDF/A
TimCole: They're doing some work with JHOVE, which is a service that identifies file formats
<dkaplan3> JHOVE
<lrosenth> JOVE - blech :(. (poorly implemented and unsupported)
<dkaplan3> http://jhove.openpreservation.org/
TimCole: They're doing some work with interoperability of file formats, content, and metadata that publishers use
... they wish that publishers would try to use RDF more
... No big surprises, but the conversation was useful and I learned a lot more about the details.
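[Illustration, not from the call: a minimal sketch of what deriving a simple Dublin Core record from an archived item might look like. The function name, the shape of the `item` dictionary, and every sample value are hypothetical; nothing here reflects Portico's actual pipeline. Only the Dublin Core element namespace is real.]

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical choice of elements for a "very simple discovery layer".
DC_FIELDS = ("title", "creator", "publisher", "date", "format", "identifier")


def to_simple_dublin_core(item):
    """Build a simple Dublin Core record from a dict of extracted fields.

    `item` maps an element name to a string or a list of strings.
    Fields that are missing are skipped.
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for field in DC_FIELDS:
        values = item.get(field, [])
        if isinstance(values, str):
            values = [values]
        for value in values:
            element = ET.SubElement(record, f"{{{DC_NS}}}{field}")
            element.text = value
    return record


# Hypothetical archived item; all values are placeholders.
item = {
    "title": "An Example Article",
    "creator": ["A. Author", "B. Author"],
    "publisher": "Example Press",
    "date": "2016",
    "format": ["application/xml", "application/pdf"],
}

xml_text = ET.tostring(to_simple_dublin_core(item), encoding="unicode")
print(xml_text)
```

[A flat record like this can be indexed or rendered as HTML without the publisher's original systems, which is the property described above. Repeated elements such as `dc:format` cover the case where one item arrives in several formats.]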
<Zakim> dkaplan, you wanted to hold until after tim is done with this summary
dkaplan: this is a good place for me to jump in with what I wanted to say, and yes to everything you said about what they do
<lrosenth> dkaplan - I know…doesn’t make it better
dkaplan: there's another thing that's vital. Ultimately, the job of the archivist is that you're going to give them some stuff and they're going to figure out what to do with it
... we want to make certain assumptions about fixity, workflows, PREMIS...
... we want to say that PWPs can be described in a certain way, but they'll take what we give them
... it would be great if we could say, if we put a bunch of metadata in the manifest, what could it be that you could extract, etc.
... these places are taking disparate and undescribed datasets... they're taking everything... JHOVE looks at file formats and says 'ARG this file format isn't going to be supported soon, you should do something about it'
... what kinds of things would you extract from our manifest if we could put stuff in the manifest?
TimCole: Yes, they'll take content, but sometimes they'll do bit-level preservation as long as there is software to use, so you can get it back. Other times, they very actively transform file formats, etc. They prefer file formats that are easy to read and not dependent on specialized software.
Bill_Kasdorf: +1 to everything Tim and Deborah said.
... Is PWP a format that publishers could easily provide to these archivists or is it something that archivists could transform
... Portico's strategy was always to normalize things so they have a master format. They focused on scholarly journals that pretty much all used the same format. Used to be NLM and is now JATS. Books are BITS.
... they started out with the big publishers, Springer, Elsevier, who all have great workflows
... but as they started to deal with smaller publishers, books, different kinds of content, this whole normalization plan starts to shake
... I was really interested to see that they said they're interested in EPUB and PWP
... ideally it would be great to see PWP be something that both the providers and the recipients of archival content could agree on... that would take enormous tension out of the process
TimCole: they wouldn't reduce their reliance on JATS, they'd take a PWP or EPUB publication as an additional format that they could archive
... But I don't think they were suggesting that they would stop receiving content that would be normalized into JATS... it depends on what they get
... They don't have any content in EPUB yet. They are just experimenting now, but they think it might be a format they could maybe archive internally without having to transform to BITS or something else. Might need to normalize a little.
Leonard: Bill said something... 'what are we trying to achieve in this group'
lrosenth: I sent out that info about PDF/A...
... there are still concepts that we do know about... we could talk about best practices for creating an archival document that will withstand the test of time
... we look at the open web platform...
TimCole: From previous call -- Markus and Tzviya had the idea that this group should bring to the IG use cases relevant to the development of PWP docs and other IG work.
lrosenth: what is metadata and what is it not?
... identifying use cases is perfect
dkaplan: I like the idea of one of our deliverables being best practices, to the extent we can, considering PWP won't be done yet
... I think use cases is another good deliverable
... with outreach to as many organizations as possible, ask them 'in an ideal world, what are the things that they would want to extract from the PWP? What would they want to extract from PWP/A?'
... for those not in the library world, PREMIS is a data dictionary that is used to describe events on an object and who performed the action
... maybe we would decide that a PWP/A would have some sort of space where you could put something like PREMIS, or maybe not...
... that would be decided by talking to these different organizations. Which of these are more related to PDF/A and which aren't?
dkaplan: we could make a recommendation for the minimum [with regard to preservation/archiving] that every manifest should have in a PWP
lrosenth: the idea of an audit trail actually being described in PDF/A was there from the beginning for the same reasons. Even though it was specified, it was never used
Bill_Kasdorf: this has been an excellent discussion. Two fundamental concerns for people who are archiving are versioning and migration
... one of the biggest problems is: this works now, how can I make sure it will work tomorrow?
... some sort of info saying that if you do it this way you will be able to get to it in the future.
TimCole: Use cases - how do we get started on our use case document - wiki, github, etc?
dkaplan: I would recommend either wiki or github because they're a little bit easier than email for multiple authors
<TimCole> github?
<dkaplan3> +0
<lrosenth> 0 (no preference - either is fine)
<TimCole> +1
<liam> 0
<Bill_Kasdorf> -1
dkaplan: I think we should use whichever one has the fewest 'not this' votes
+1 github
TimCole: I should talk to Ivan to make sure we can set up a github page for Use Cases correctly
*debate between github or wiki*
TimCole: if we can, one advantage to github might be that you can create and track individual issues in the branch without having to mess up the main use case page
TimCole: we'll try github to see how it works
... we're talking about use cases that are driving recommendations or best practices for preserving PWP documents
... Does that definition work?
TimCole: we have our task force page, we mentioned on it that we need to reach out to LOCKSS/CLOCKSS, Portico, NISO
... Who is going to do what? I volunteered to contact NISO and Portico
... and Internet Archive
... The goal is to do initial outreach, either an email or phone call, and see if it would be worthwhile (and if they'd be willing) for them to join one of our calls
lrosenth: what about NARA or Library of Congress?
... I'll be happy to reach out to them
Bill_Kasdorf: I could offer similar contacts at the British Library and the KB (Dutch National Library)
dkaplan: I'm trying to reach out through extended contacts to the National Library of Australia but I don't actually have contacts
Bill_Kasdorf: I could probably help with it
... DPLA or Europeana, actually I'll retract that
dkaplan: that being said, the DPLA is right in my backyard. They're really easy to reach out to. I can reach out to Mark Matienzo
... he's worked on more than DPLA
Bill_Kasdorf: Boston Public has a big project with special collections
TimCole: do we have anyone interested in talking to LOCKSS/CLOCKSS
Bill_Kasdorf: I don't have a lot of bandwidth but I could get Vicki from LOCKSS/CLOCKSS
TimCole: could introduce Ayla to CLOCKSS/LOCKSS
... Ayla and Tim will contact Chris Prom and Bill Ingram at UIUC
... add the name of who you're talking to to the TF wiki page. Reach out to at least some contacts before our next call and we will discuss.
... maybe we can talk a little more about the timeline for the TF as well
lrosenth: I can contact the GPO
dkaplan: I can talk to a lot of people who are very active in government depository libraries
TimCole: we're going to stay on Thursdays at 1 PM Eastern (US) for our calls.
Bill_Kasdorf: will you send a calendar invite?
TimCole: yes