Task Forces/Metadata/Carol Meyer Interview
Carol Meyer [Bill] is head of Business Development and Marketing for CrossRef and immediate Past President of the Society for Scholarly Publishing.
Carol is in an ideal position to comment on the issue of metadata because CrossRef receives an enormous amount of metadata from publishers, and Carol’s job involves working directly with those publishers. And as I’ve mentioned in many of these interviews, it is really thanks to CrossRef that metadata is perceived as basically a “solved problem” in the context of scholarly and STM publishing.
Carol started by pointing out that one reason for CrossRef’s success and its near universality in the scholarly publishing realm is that it started collecting metadata for a very specific purpose: obtaining just the specific bibliographic metadata required to enable reference linking.
[This was done initially just for journals, and is now literally a given in journal publishing; a journal article is considered invisible if it doesn’t have a CrossRef DOI and thus isn’t registered in CrossRef. They subsequently added many other publication types—books, conference proceedings, etc.—although one of the things that has made the adoption on the book side slower and less “a given” than on the journal side is that whereas journal content is virtually always online, book content hardly ever is (metadata may be available online but the books themselves are almost always still offline print or ebook products—though scholarly books are some of the most likely of any book content to be online, along with reference).]
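That “specific small set” of bibliographic metadata is genuinely compact. As a hypothetical sketch in the spirit of CrossRef’s journal-article deposit format (element names and values here are illustrative, not copied from the current schema; 10.5555 is a commonly used example prefix):

```xml
<journal_article publication_type="full_text">
  <!-- Article title, as cited in references -->
  <titles>
    <title>An Example Article</title>
  </titles>
  <!-- First author; additional contributors repeat this element -->
  <contributors>
    <person_name sequence="first" contributor_role="author">
      <given_name>Jane</given_name>
      <surname>Doe</surname>
    </person_name>
  </contributors>
  <!-- Online publication date -->
  <publication_date media_type="online">
    <year>2014</year>
  </publication_date>
  <!-- The DOI being registered, and the URL it should resolve to -->
  <doi_data>
    <doi>10.5555/example.article.1</doi>
    <resource>https://publisher.example.com/articles/1</resource>
  </doi_data>
</journal_article>
```

A deposit of roughly this shape is all that is needed to make a reference to this article linkable — which is why it was practical for publishers (or their vendors) to supply it.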
Carol pointed out that this initial simplicity has turned out to be both an advantage and a limitation. The plus side is that it made it quite straightforward for publishers to be able to supply the metadata CrossRef needed to make reference linking work. (Though this is not without its problems; see below.) On the other hand, the very ubiquity of CrossRef caused two other not-necessarily-valid perceptions to form:
--That CrossRef could do pretty much anything with metadata. Not so: they can only do what the metadata they get enables them to do. So, for example, publishers would like CrossRef to provide email addresses, but CrossRef has not been collecting email addresses because they weren’t needed for reference linking, and of course now they have a gazillion records that are lacking that information. Extremely non-trivial problem for them to address.
--That having a CrossRef DOI confers some sort of legitimacy on a journal article. Not so: it just means the article has been properly registered; it says nothing about the quality of the article or the research. But many authors rush to get DOIs purely because without them their article lacks credibility. And sometimes authors and publishers simply “make up” DOIs that are not in fact registered at all. Of course these don’t “work,” but they create headaches and friction in the system.
One key observation Carol made is that “there are standards and there are practicalities—and they diverge.” For example, the ideal solution for metadata implementation is probably RDFa, the Semantic Web, etc., but this is very hard for the average publisher to do. Giving CrossRef a specific small set of metadata is one thing [and often their prepress or conversion or hosting vendor does that for them anyway], but a true Semantic Web implementation is way beyond the capabilities of all but the largest and most technically savvy publishers.
She said that “CrossRef even has a problem interoperating with other registration agencies.” [This even though CrossRef and those other registration agencies are in fact very technically expert.] And this is even more a problem for publishers, especially small or medium size publishers that don’t have the technical expertise, because the tools that are available require programming or at least “a programming mindset,” they don’t “just work.”
The big publishers get all this stuff, of course; but the small publishers “are at a real disadvantage” because they still need to work with services that need metadata—e.g. Amazon.
There is also a “legacy data problem,” where much of the metadata is “locked up in PDF,” which is a real struggle to deal with.
She also pointed out that for books, there is a big frontlist/backlist issue. (Not as much the case for journals, though it depends on the discipline and market demands.) Again, it requires a publisher “of a certain size” to be able to deal with this well.
CrossRef’s philosophy has always been “Do the simplest thing that makes it work.” This works for the purpose something was built for, but it gives you legacy issues and doesn’t work for all applications (e.g., the email example mentioned above). Plus you get “nonstandard records” and “compliance issues.”
She also pointed out that for any important development, “when there’s a business need, that’s when it happens.”
Another big issue in scholarly publishing these days is DATA. There is a lot of pressure for the data sets on which research is based to get published, along with the articles and books based on research on that data. We are “hitting a tipping point,” and it is a very complex issue—she characterized it as a “huge unsolved problem.”
An important new initiative at CrossRef is FundRef, which is an interesting example of how a de facto standard can come about. There is a need in scholarly publishing to document the funders of research, and there is a great “community of interest” around it. But no standard for funding bodies existed, “so CrossRef made one, and it has become a de facto standard.” It is an example of their “simplest possible solution” strategy. It involves a taxonomy of funder names, and then associates the funders with the papers. There is pressure to make it more complex, but for now this is what was needed: it could be implemented quickly, and it works. She said “at the end of the day it’s all XML tags and identifiers. Everybody using FundRef uses the same set of tags, and it starts becoming a standard.”
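A FundRef assertion really is just tags and identifiers. The fragment below is an illustrative sketch loosely modeled on CrossRef’s public funding-data deposit format; the exact element names and the funder identifier shown are assumptions for illustration, not verbatim schema:

```xml
<fr:program xmlns:fr="http://www.crossref.org/fundref.xsd" name="fundref">
  <!-- One fundgroup per funding source acknowledged in the paper -->
  <fr:assertion name="fundgroup">
    <!-- Funder name, drawn from the shared funder-name taxonomy -->
    <fr:assertion name="funder_name">
      Example Science Foundation
      <!-- Canonical identifier for that funder in the registry (hypothetical) -->
      <fr:assertion name="funder_identifier">https://doi.org/10.13039/500100000001</fr:assertion>
    </fr:assertion>
    <!-- Award/grant number as assigned by the funder (hypothetical) -->
    <fr:assertion name="award_number">ESF-2014-12345</fr:assertion>
  </fr:assertion>
</fr:program>
```

Because every depositor uses the same small set of tags and the same funder taxonomy, the funding data becomes comparable across publishers — which is exactly how the de facto standard took hold.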
She pointed out that NISO also very quickly developed an Open Access indicator at the article level, involving a “license-ref” URL and a “free-to-read” date [the embargo date] (see http://www.niso.org/workrooms/oami/), so “CrossRef is using that and it works.” Publishers were getting a lot of criticism for not complying with OA requirements, but that was because there were no systems to deal with it. Something had to be done quickly.
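A sketch of how such an indicator rides along in the article metadata — element names follow the general shape of the NISO/CrossRef approach but are illustrative, not an exact copy of either schema:

```xml
<ai:program xmlns:ai="http://www.crossref.org/AccessIndicators.xsd" name="AccessIndicators">
  <!-- The date the article becomes free to read, i.e. the end of the embargo -->
  <ai:free_to_read start_date="2015-06-01"/>
  <!-- URL of the license that applies to the version of record from that date -->
  <ai:license_ref applies_to="vor" start_date="2015-06-01">https://creativecommons.org/licenses/by/4.0/</ai:license_ref>
</ai:program>
```

Two tags — a date and a license URL — are enough for downstream systems to check OA compliance mechanically, which is why it could be deployed quickly.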
She also mentioned that although CrossRef metadata schemas are all initially based on Dublin Core, they are WAY more complex and rich—they just use Dublin Core as a framework. In fact their CrossMark standard is not based on Dublin Core at all. [CrossMark is the service that enables a link to be embedded in an article that returns, to the user, the information about whether the version they are using is in fact the latest version of the article, and points them to a later version if appropriate—very cool.]
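For contrast, a bare Dublin Core record—the simple framework CrossRef starts from before layering on richer, domain-specific elements—looks roughly like this (values are illustrative):

```xml
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- The fifteen DC elements are generic and optional; a record may use any subset -->
  <dc:title>An Example Article</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:date>2014-05-01</dc:date>
  <dc:identifier>https://doi.org/10.5555/example.article.1</dc:identifier>
  <dc:type>Text</dc:type>
</oai_dc:dc>
```

Dublin Core’s elements are deliberately generic (title, creator, date, identifier…), which is why CrossRef can use it as a starting framework while adding much richer, publishing-specific structure on top.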