LD4LT Annotation Workshop Zaragoza 2021

From Linked Data for Language Technology Community Group

LD4LT Workshop on Linguistic Annotations on the Web

Minutes / Agenda / Brainstorming document

Detailed description

The workshop focuses on challenging issues of web annotation and aims to share and advance the state of the art of linguistic annotation on the web, involving cross-disciplinary and cross-linguistic audiences. It provides background information on major community standards, their benefits and shortcomings, with the specific aim of contributing to and consolidating an ongoing discussion within the W3C Community Group Linked Data for Language Technology (LD4LT) on developing a consolidated LOD vocabulary for linguistic annotations for applications across language technology, empirical linguistics, computational lexicography, digital humanities, etc.

The numerous vocabularies that exist for this purpose are neither interoperable with each other, nor do they cover all relevant use cases. Since 2019, LD4LT has therefore been working towards the harmonization and extension of existing standards for creating, publishing, sharing, accessing and processing linguistic annotations on the web. The goals are to (a) provide a survey of standards, challenges and requirements, (b) work towards a W3C community report that provides either best practices for annotations or extensions to existing standards, and, ultimately, (c) inform subsequent standardization efforts.

We aim to provide a general introduction to the topic, to consolidate this discussion, and to discuss directions, goals, and concrete strategies.

List of topics

  • interoperability
  • linguistic annotation
  • linked data
  • web standards
  • knowledge graphs
  • natural language processing
  • digital methods in linguistics
  • digital humanities

Past events

As an LD4LT community meeting, this workshop is the first of its kind. However, the organizers have been involved in organizing numerous workshops, summer schools and conferences on the topic, including:

  • Seven international workshops on Linked Data in Linguistics (LDL-2012, 2013, 2014, 2015, 2016, 2018, 2020): 40-90 participants each
  • Two Summer Datathons on Linguistic Linked Open Data (SD-LLOD 2017, SD-LLOD 2019): 40-50 participants
  • Language Resources and Linked Data tutorial (EKAW-2014)
  • Two conferences on Language, Data and Knowledge (LDK-2017, LDK-2019): 100-120 participants

This meeting builds on this experience, but is dedicated to a more narrowly defined aspect. The LD4LT community group currently has more than 100 members, but for an in-person meeting, we adopt a conservative estimate of the expected number of participants (see below).

Workshop

Format

The workshop will be conducted as a hybrid half-day workshop in the morning session of the W3C Day at LDK-2021 (http://2021.ldk-conf.org/post-conference-w3c-day/). On-site participation is possible for attendees of the LDK main conference; external participants (and virtual attendees of LDK) are welcome as well. For virtual participation, we use Zoom. Attendance is free of charge, but participants have to register at the LDK Registration Page.

The workshop serves as a general assembly of the W3C Community Group Linked Data for Language Technology and is organized in conjunction with the COST Action Nexus Linguarum. Accordingly, the format is less that of a classical workshop with formal paper submission and more that of an informal discussion round with invited presentations. It will be a mixture of presentations, use cases and discussion, with a focus on both providing the necessary background for harmonizing linguistic annotations on the web and discussing the prospective directions, subtasks, possible milestones and use cases of such an enterprise.

Schedule

We plan a four-hour event, 09:00 - 13:00 CET, with a 30-minute coffee break, 10:30 - 11:00 CET.

CET            Speaker                                Topic
09:00          Christian Chiarcos                     Welcome
09:10 - 10:30                                         Background: Linguistic Annotation on the Web
09:10          Christian Chiarcos                     W3C Standard Web Annotation
09:25          Milan Dojchinovski                     NLP Interchange Format
09:40          Fahad Khan                             Text Encoding Initiative
09:55          Thierry Declerck                       ISO TC37 standards
10:10          Joel Kalvesmaki                        Text Fragids
10:30 - 11:00                                         Coffee break
11:00 - 11:40                                         Discussion
11:00          all                                    Q&A: What is missing? What is unclear? Where are problems?
11:30          Christian Chiarcos                     Summary of LD4LT Discussions on Linguistic Annotation
11:40 - 12:30                                         Use Cases, Experiences, Extensions
11:40          Francesco Mambrini                     Linking Latin
11:50          NN                                     Distributed Text Services
12:00          Maxim Ionov                            Interlinear Glossed Text
12:10          Giedre Valunaite Oleskevicienė         Discourse Research: A Case Study on Attitudinal Multiword Discourse Markers
12:20          Christian Fäth                         Transforming Language Resources
12:30 - 12:50                                         Brainstorming
12:30          all                                    What to do next?
12:50 - 13:00  Christian Chiarcos & Thierry Declerck  Closing Remarks

As for the brainstorming session, please feel free to contribute questions and ideas, and to bring in your perspectives already during the background and use case sessions: Minutes / Agenda / Brainstorming document


Who should join

We expect a mixed audience from LD4LT, from Nexus Linguarum, and from outside these networks. Anyone willing to present their state-of-the-art research or to participate in the discussions is welcome to join. The workshop aims at cross-fertilization both across the Nexus Linguarum domains and with research from outside the project, with a view to joining the project if there is interest. Depending on the Covid-19 situation, we expect 15-20 on-site participants and about 50 online participants if the workshop has to be organized in a mixed mode.

Organization

The workshop is a joint activity of the W3C Community Group LD4LT and the COST Action Nexus Linguarum. It will provide a mixture of background presentations, descriptions of use cases and requirements, and open discussion.

Organization committee

  • Christian Chiarcos, Applied Computational Linguistics, Goethe Universität Frankfurt, Germany
  • Thierry Declerck, DFKI Saarbrücken, Germany
  • Milan Dojchinovski, InfAI/DBpedia Association, Germany / CTU in Prague, Czech Republic
  • Fahad Khan, Istituto di Linguistica Computazionale ‘A. Zampolli’, CNR, Pisa, Italy
  • Giedre Valunaite Oleskeviciene, Institute of Humanities, Mykolas Romeris University, Vilnius, Lithuania

We would also like to thank Bridget Almas (The Alpheios Project, Ltd., Niskayuna, NY, USA) for contributing to the preparation of the workshop.
