See also: IRC log
<scribe> scribenick: TimCole
TimCole: Any concerns about the
minutes?
... hearing none, they are approved.
Bill_Kasdorf: Do we need to give Craig and Nicholas an introduction?
Craig: we've read the documents, but would appreciate a brief intro
Bill_Kasdorf: First, DPub is a
W3C Interest Group
... IGs do not publish recommendations but they inform the W3C
about issues and coordinate with Working Groups that make
Recommendations
... In the context of that work an initiative was undertaken to
draft a vision of a Web-based format that is independent of
whether a document is online or offline
... This led to the Portable Web Publication document
... when online all the components of a PWP document are
available online
... but whether online or offline to the user it's the same
document.
... in the course of this discussion the issue of archiving
came up
... we want to make sure that PWP is useful and can be
archived.
Craig: I am the executive director of CLOCKSS, after 37 years in publishing.
Nicholas: Web archiving service
manager at Stanford
... mostly work on Web archiving generally, but have been
working a lot on LOCKSS
Craig: CLOCKSS is a free-standing
org made up of publishers and libraries
... publishers are billed both annually and by the article or
book archived.
... CLOCKSS builds on the LOCKSS protocol
... CLOCKSS adds the 'controlled' aspect -- LOCKSS networks
typically have more copies...
... libraries have a traditional role of archives, but in
digital world they are not the holders of the digital
copies
... so libraries wanted trusted 3rd parties to whom publishers
could provide content
... 3rd parties (like CLOCKSS) act as dark archives for
safekeeping in case the resources become unavailable on the web
(e.g., the publisher goes out of business)
... CLOCKSS for example has 20,000+ journals, so if a journal
goes away, CLOCKSS can provide archival access
... scholars are highly dependent on the literature, so access
to it is crucial
... CLOCKSS helps ensure that access
... CLOCKSS harvests or crawls - publisher agrees that LOCKSS
can harvest
... harvested content is then put in the 12 nodes that CLOCKSS
has spread out across the globe
... 2nd method is to accept files from the publisher (or
retrieved from the publisher)
... Regardless of whether delivered by publisher or crawled,
the norm is to simply archive the files, not to normalize
... we do, however, perform quality assurance on the materials
ingested.
... LOCKSS allows nodes to confirm that all nodes have the same
content
... a voting system among the nodes is used to validate that
the data is correct
... this checking that all copies match is constantly ongoing,
and content is repaired as needed
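The polling-and-repair idea Craig describes might be sketched roughly as follows. This is a toy illustration, not the LOCKSS protocol itself: real polls use nonced hashing over a network of peers, and all names and data here are invented.

```python
import hashlib
from collections import Counter

def digest(content: bytes) -> str:
    """Fingerprint of a stored copy (real LOCKSS polls use nonced hashes)."""
    return hashlib.sha256(content).hexdigest()

def poll_and_repair(nodes: dict) -> dict:
    """Toy LOCKSS-style poll over one archival unit.

    `nodes` maps node name -> the bytes that node holds.
    The digest held by the majority of nodes wins; any node
    holding a divergent copy is repaired from a winning node.
    """
    votes = Counter(digest(c) for c in nodes.values())
    winner, _ = votes.most_common(1)[0]
    good_copy = next(c for c in nodes.values() if digest(c) == winner)
    return {name: (c if digest(c) == winner else good_copy)
            for name, c in nodes.items()}

# Three nodes; node-c has suffered bit rot.
nodes = {"node-a": b"article v1",
         "node-b": b"article v1",
         "node-c": b"artlcle v1"}
repaired = poll_and_repair(nodes)
```

After the poll, all three copies agree again; in the real system this comparison runs continuously across the 12 CLOCKSS nodes.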
Bill_Kasdorf: question about
nature of the content in CLOCKSS
... when you sign on a publisher, is it all content of that
publisher?
... is it mostly scientific literature?
Craig: broader than just
science
... anything scholars access from publishers
... in theory all journal and book publications, but in
practice books may not be added right away
... databases, other content types not always in scope
Nicholas: drilling down on how
CLOCKSS acquires content by harvest
... one challenge is that publishers use different
platforms
... we have to account for differences in how content and
related files are served
... we create 'plugins' for each publisher
... one thing that might be useful about PWP would be more
consistent presentation across publisher platforms
... this might make it easier to acquire content
... but one concern would be that PWP might be a parallel
version - which is the canonical version? Do they stay in sync?
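The per-publisher "plugin" approach Nicholas describes could look something like the sketch below. The publisher names, URL patterns, and file lists are invented examples; the point is only that each plugin encapsulates how one platform serves content, so the harvester knows which links to fetch.

```python
# Hypothetical per-publisher plugin registry for a harvester.
# Every name and URL pattern here is an invented example.
PLUGINS = {
    "examplepress": {
        "article_url": "https://examplepress.example/journal/{issn}/{vol}",
        "extra_files": ["article.pdf", "article.xml"],
    },
    "demopub": {
        "article_url": "https://demopub.example/content/{issn}/v{vol}",
        "extra_files": ["fulltext.pdf"],
    },
}

def harvest_targets(publisher: str, issn: str, vol: int) -> list:
    """Return the URLs a crawl of one journal volume would fetch."""
    plugin = PLUGINS[publisher]
    base = plugin["article_url"].format(issn=issn, vol=vol)
    return [base] + [f"{base}/{f}" for f in plugin["extra_files"]]

urls = harvest_targets("demopub", "1234-5678", 2)
```

A more consistent PWP presentation across platforms could shrink this registry to a single generic plugin.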
lrosenth: Our thinking right now
- we recognize that for any given publication there is a
canonical version
... our strategy is that there will be a locator associated
with any 'copy' that refers back to canonical copy and/or
breadcrumbs through versions that you have to follow
... so for example if annotations are added you may have a new
version and so a new locator, but you can still go back to the
original
... There is no requirement that a PWP is served as a
package
... for example a publication talking about the Mona Lisa, and
so the most correct version of the pub references the Mona Lisa
at the Louvre
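The locator/"breadcrumb" idea lrosenth outlines could be sketched as below: each derived version carries a pointer to the version it came from, so any copy can be walked back to the canonical one. The `derived_from` field name and the URN-style locators are assumptions for illustration, not part of the PWP draft.

```python
# Hypothetical version graph: locator -> metadata about that version.
# The canonical version is the one with no predecessor.
versions = {
    "urn:pwp:v1": {"derived_from": None},          # canonical copy
    "urn:pwp:v2": {"derived_from": "urn:pwp:v1"},  # copy with annotations
    "urn:pwp:v3": {"derived_from": "urn:pwp:v2"},  # further revision
}

def canonical(locator: str) -> str:
    """Follow the breadcrumb trail until no predecessor remains."""
    while versions[locator]["derived_from"] is not None:
        locator = versions[locator]["derived_from"]
    return locator
```

So even a twice-derived copy still resolves back to the original publication.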
Nicholas: the manifest sounds like the Signposting being discussed in the Web Archiving Community more generally
lrosenth: yes, sounds
relevant
... the manifest documents everything, every part that is
necessary to 'present' / 'consume' the publication
... this would include what a machine might need
... if you process the manifest, you have the set of elements
needed for the publication
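The idea that an archive could derive its fetch list from the manifest, rather than from per-platform heuristics, might look like this minimal sketch. The manifest shape below (a `canonical` base URL plus a `resources` array) is invented for illustration; the PWP draft does not prescribe this format.

```python
import json

# Invented example of a manifest enumerating every resource the
# publication needs, including machine-oriented parts like fonts.
manifest_json = """
{
  "canonical": "https://publisher.example/pub/",
  "resources": [
    {"href": "index.html", "type": "text/html"},
    {"href": "chapter1.html", "type": "text/html"},
    {"href": "style.css", "type": "text/css"},
    {"href": "fonts/serif.woff2", "type": "font/woff2"}
  ]
}
"""

def fetch_list(manifest: dict) -> list:
    """Resolve every declared resource against the canonical base URL."""
    base = manifest["canonical"]
    return [base + r["href"] for r in manifest["resources"]]

urls = fetch_list(json.loads(manifest_json))
```

An archive consuming such a manifest would know that the font file is required, which addresses the font question raised later in the discussion.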
Bill_Kasdorf: If there are elements that the publisher wants to 'protect', the publisher can count on CLOCKSS not to release any of them until a trigger event occurs
Craig: Yes
Bill_Kasdorf: So there is another
level of abstraction, potentially
... a font might be an example
... some of these fonts may require licenses
... so the PWP may name a fallback
lrosenth: Yes, that seems right.
Important to look at.
... may define which PWPs are truly archival
Nicholas: Seems like spec has a
lot of potential
... as it stands now we have this two-pronged archival
approach
... PWP may bring these together a little more
... the manifest idea or signposting or some level of semantic
annotation that projects the publisher's perspective of the
publication would be very useful
... right now we have a set of heuristics that we need to keep
revisiting
... a little difficult to know what it might look like in
practice
lrosenth: as i understand LOCKSS,
you make no requirements on the content
... so if a trigger event occurs, you release what you have
Craig: we're not making any kind of
legal warranty
... but our expectation is to deliver content in a consumable
way
... we're not focused on user experience, but rather on
content
lrosenth: Bill was talking about
fonts
... in case you have a document that references a font, but the
font is not archived
... you don't overtly address the font issue?
Nicholas: Once the content is
archived, we try to make sure that all the required content is
archived
... not sure if we are going after fonts
lrosenth: has a lot of implications for what we are trying to do
Bill_Kasdorf: as an XML guy, ideally content is Unicode, so you should be able to consume the content, albeit without the proper glyph
lrosenth: But since they don't normalize, they can't ensure the content is Unicode
TimCole: the manifest may make it easier to make sure you get everything you need?
Nicholas: yes it can be difficult
to know for each platform exactly which links should be
collected (e.g., fonts vs. publisher home page)
... is this related to Packaging on the Web (W3C)?
lrosenth: that initiative is broader and separate, but as we start talking about what our packages look like, we will look at that work
Bill_Kasdorf: would Nicholas have time to join the DPub IG, since Stanford is already a member of W3C?
Nicholas: have already joined...