Use Case Editing reports on new academic documents

From Library Linked Data
Revision as of 21:04, 26 November 2010 by Abartov (Talk | contribs)


Name

Editing reports on new academic documents (Ernad)

Owner

Thomas Krichel, krichel@openlib.org

Background and Current Practice

Goals

  1. Editing Reports on New Academic Documents (Ernad) is essentially a piece of software implementing a protocol called the Altai paper. Ernad was designed to help the editors of a subject category decide whether or not documents fit into that category. A running instance of the ernad software powers the NEP: New Economics Papers service of the RePEc digital library.
  2. We don't need linked data technology, but if an agreed set of linked data technologies is available, it can help with two issues:
  • First, a specification more general than the Altai protocol can be written for software that generalizes the process of a heavily dynamic learning classification system. (REUSE-SCHEMAS, REUSE-VOCABS)
  • Second, we can make a standardized representation of the output of the services so that it is easier to import it into related information services. (PUBLISH)

Target Audience

The potential audience for this use case is anybody who has a source of information that pertains to long-lived themes (as opposed to news stories), that issues new documents over time, and that uses editors to filter the documents in a binary fashion. A typical example from the academic world would be an overlay journal: an editor either admits a paper into the overlay journal or rejects it.

The ernad software was designed with large volumes of incoming data in mind. It uses machine learning to sort incoming documents, from the ones that most closely resemble previously accepted documents to the ones that least resemble them.

Without such a system, our volunteers could not handle the classification of hundreds of documents.
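The sorting idea can be illustrated with a minimal sketch. It is not the learning method ernad actually uses, which this page does not specify. It assumes a plain bag-of-words profile built from previously accepted documents and ranks incoming documents by cosine similarity to that profile. The function names and the sample data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_for_report(accepted_texts, incoming):
    """Sort incoming (doc_id, text) pairs, most similar to accepted first.

    Assumption: similarity to the summed term counts of accepted
    documents stands in for whatever model a real system would learn.
    """
    profile = Counter()
    for text in accepted_texts:
        profile.update(tokenize(text))
    scored = [
        (cosine(Counter(tokenize(text)), profile), doc_id)
        for doc_id, text in incoming
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Hypothetical sample data for one subject report.
accepted = [
    "monetary policy and inflation targeting",
    "central bank interest rate policy",
]
incoming = [
    ("d1", "labor market wage dynamics"),
    ("d2", "inflation and central bank policy"),
    ("d3", "monetary theory of wage inflation"),
]
ranking = rank_for_report(accepted, incoming)
```

An editor working through `ranking` from the top sees the likeliest candidates first, which is what makes a large incoming set manageable for volunteers.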

Use Case Scenario

There are two types of users: the general editor and the subject editors. The general editor examines the whole set of new additions to the document stock. This set of new additions has been generated automatically by software. The general editor tries to filter out old documents. Documents are old either because they are new descriptions of old documents that have recently been added, or because they are variants of documents that have already passed through the system. This is a relatively difficult task, as no software has yet been developed to help with it. Once the general editor has finished the filtering, the remaining documents form a new issue of a report that contains all documents. This is the "allport" issue. The allport issue is then processed by machine learning for every subject that the system supports.

The second type of users are the subject editors. They log in to the system for a specific report. They find the machine-sorted version of the allport issue, customized for the report they are logged in for. They then filter the allport issue for the subject, which involves a binary decision for each document. Finally, they can sort the resulting subject report issue to put the documents they judge most interesting first.
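The two editing stages can be sketched as a small data model. This is an illustration only, not ernad's implementation: the class names, function names, report name and date below are all hypothetical, and the editors' decisions are passed in as plain sets and lists of document identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    title: str


@dataclass
class Issue:
    """One dated issue of a report; documents are kept in display order."""
    report: str
    date: str
    documents: list = field(default_factory=list)


def build_allport(new_additions, old_ids, date):
    """General editor step: drop documents judged old, keep the rest."""
    kept = [doc for doc in new_additions if doc.doc_id not in old_ids]
    return Issue(report="allport", date=date, documents=kept)


def build_subject_issue(allport, report, accepted_ids, order=None):
    """Subject editor step: binary accept or reject, then optional ordering."""
    kept = [doc for doc in allport.documents if doc.doc_id in accepted_ids]
    if order is not None:
        # Documents missing from the editor's ordering go to the end.
        position = {doc_id: i for i, doc_id in enumerate(order)}
        kept.sort(key=lambda doc: position.get(doc.doc_id, len(position)))
    return Issue(report=report, date=allport.date, documents=kept)


# Hypothetical run: four new additions, one of them judged old.
additions = [Document(i, i.upper()) for i in ("a", "b", "c", "d")]
allport = build_allport(additions, old_ids={"b"}, date="2010-11-26")
subject = build_subject_issue(
    allport, "example-subject", accepted_ids={"a", "d"}, order=["d", "a"]
)
```

Keeping the allport issue as its own object reflects the scenario above: every subject report issue is derived from the same filtered set, so the general editor's work is done once per issue.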

Application of linked data for the given use case

Linked data could be used in two ways. First, it can be used to document the outcome of the selection process in a standardized fashion. The outcome is like a virtual serial, in which documents are classified by time and by subject. Second, linked data can help with building more generalized systems, so that setting up a new ernad instance becomes a matter of editing a few configuration files.
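As a rough sketch of the first use, one report issue could be written out as N-Triples. This page names no vocabulary, so the choice of Dublin Core terms (`isPartOf`, `issued`, `hasPart`) is an assumption made for illustration, and the `http://example.org/` URIs, the report name and the function name are hypothetical.

```python
DCTERMS = "http://purl.org/dc/terms/"
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"
BASE = "http://example.org/ernad/"  # hypothetical namespace


def issue_to_ntriples(report, date, doc_uris):
    """Describe one report issue as N-Triples lines.

    Assumption: an issue is modelled as a part of its report (the
    "virtual serial") that is dated and has the selected documents
    as its parts, in the order given.
    """
    issue = f"<{BASE}{report}/{date}>"
    lines = [
        f"{issue} <{DCTERMS}isPartOf> <{BASE}{report}> .",
        f'{issue} <{DCTERMS}issued> "{date}"^^<{XSD_DATE}> .',
    ]
    for uri in doc_uris:
        lines.append(f"{issue} <{DCTERMS}hasPart> <{uri}> .")
    return "\n".join(lines)


triples = issue_to_ntriples(
    "example-subject",
    "2010-11-26",
    ["http://example.org/doc/1", "http://example.org/doc/2"],
)
print(triples)
```

The issue date carries the classification in time and the report URI carries the classification in subject, so a consuming service can rebuild both dimensions from the triples alone. Plain triples do not record the editor's ordering of documents; a real design would need an explicit position property or an RDF list for that.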


Existing Work (optional)

NEP: New Economics Papers has been implementing this use case since 1998. Over 100,000 classification decisions have been made since that time. 87 subject reports are being edited.


Related Vocabularies (optional)

Problems and Limitations (optional)

There is no good free source of metadata about academic articles. One can compile a dataset from multiple sources, but then there is the issue of duplication. The problem of avoiding duplicate documents at the general editing stage is not yet solved. However, plagiarism detection algorithms may help here.
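One technique of the kind used in text-overlap detection is to compare sets of word shingles with the Jaccard coefficient. The sketch below is only an illustration of that idea applied to document descriptions, not a solution adopted by ernad. The function names, the shingle size and the threshold of 0.6 are arbitrary assumptions.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams of a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def likely_duplicates(new_docs, stock, threshold=0.6):
    """Return (new_id, stock_id, score) for pairs at or above the threshold."""
    stock_shingles = {doc_id: shingles(text) for doc_id, text in stock.items()}
    hits = []
    for new_id, text in new_docs.items():
        new_set = shingles(text)
        for stock_id, old_set in stock_shingles.items():
            score = jaccard(new_set, old_set)
            if score >= threshold:
                hits.append((new_id, stock_id, score))
    return hits


# Hypothetical titles: n1 repeats a stock entry with different punctuation.
stock = {"s1": "Inflation targeting in small open economies: a survey"}
new_docs = {
    "n1": "Inflation targeting in small open economies: a survey.",
    "n2": "Wage dynamics in European labor markets",
}
hits = likely_duplicates(new_docs, stock)
```

This compares every new document against the whole stock, which is quadratic in the worst case. For a large document stock, a real system would need an index or a hashing scheme such as MinHash, and the flagged pairs would still go to the general editor for the final decision.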


Related Use Cases and Unanticipated Uses (optional)

References (optional)