LD4LT Workshop on Linguistic Annotations on the Web

Minutes / Agenda / Brainstorming document

Detailed description

The workshop focuses on challenging issues of web annotation and aims to share and advance the state of the art in linguistic annotation on the web, involving cross-disciplinary and cross-linguistic audiences. It provides background information on major community standards, their benefits and shortcomings, with the specific aim of contributing to and consolidating an ongoing discussion within the W3C Community Group Linked Data for Language Technology (LD4LT) on developing a consolidated LOD vocabulary for linguistic annotations, for applications across language technology, empirical linguistics, computational lexicography, digital humanities, etc.

The numerous existing vocabularies for this purpose are neither interoperable with each other, nor do they cover all relevant use cases. Since 2019, LD4LT has therefore been working towards the harmonization and extension of existing standards for creating, publishing, sharing, accessing and processing linguistic annotations on the web. The goals are to (a) provide a survey of standards, challenges and requirements, (b) work towards a W3C community report that provides either best practices for annotations or extensions to existing standards, and, ultimately, (c) inform subsequent standardization efforts.

We aim to provide a general introduction to the topic, to consolidate this discussion, and to discuss directions, goals, and concrete strategies.

List of topics

interoperability
linguistic annotation
linked data
web standards
knowledge graphs
natural language processing
digital methods in linguistics
digital humanities

Past events

As an LD4LT community meeting, this workshop is the first of its kind. However, the organizers have been involved in organizing numerous workshops, summer schools and conferences on the topic, including:

Seven international workshops on Linked Data in Linguistics (LDL-2012, 2013, 2014, 2015, 2016, 2018, 2020): 40-90 participants each
Two Summer Datathons on Linguistic Linked Open Data (SD-LLOD 2017, SD-LLOD 2019): 40-50 participants
Language Resources and Linked Data tutorial (EKAW-2014)
Two conferences on Language, Data and Knowledge (LDK-2017, LDK-2019): 100-120 participants

This meeting builds on this experience but is dedicated to a more narrowly defined aspect. The LD4LT community group currently has more than 100 members, but for an in-person meeting, we adopt a conservative estimate of the expected number of participants (see below).

Workshop

Format

The workshop will be conducted as a hybrid half-day workshop in the morning session of the [W3C Day at LDK-2021|http://2021.ldk-conf.org/post-conference-w3c-day/]. On-site participation is possible for attendees of the LDK main conference; external participants (and virtual attendees of LDK) are welcome as well. For virtual participation, we use Zoom. Attendance is free of charge, but participants have to register at the LDK Registration Page.

The workshop is both a general assembly of the W3C community group Linked Data for Language Technology and an event organized in conjunction with the COST Action Nexus Linguarum. Accordingly, the format is less that of a classical workshop with formal paper submissions and more that of an informal discussion round with invited presentations. It will be a mixture of presentations, use cases and discussion, with a focus both on providing the necessary background for harmonizing linguistic annotations on the web and on discussing the prospective directions, subtasks and possible milestones of such an enterprise, as well as use cases.

Schedule

We plan a four-hour event, 09:00 - 13:00 CET, with a 30-minute coffee break, 10:30 - 11:00 CET.

CET	Speaker	Topic
09:00	Christian Chiarcos	Welcome
09:10 - 10:30		Background: Linguistic Annotation on the Web
09:10	Christian Chiarcos	W3C Standard Web Annotation
09:25	Milan Dojchinovski	NLP Interchange Format
09:40	Fahad Khan	Text Encoding Initiative
09:55	Thierry Declerck	ISO TC37 standards
10:10	Joel Kalvesmaki	Text Fragids

CET	Speaker	Topic
11:00 - 11:40		Discussion
11:00	all	QA: What is missing? What is unclear? Where are problems?
11:30	Christian Chiarcos	Summary of LD4LT Discussions on Linguistic Annotation
11:40 - 12:30		Use Cases, Experiences, Extensions
11:40	Francesco Mambrini	Linking Latin
11:50	NN	Distributed Text Services
12:00	Maxim Ionov	Interlinear Glossed Text
12:10	Giedre Valunaite Oleskevicienė	Discourse Research: A Case Study on Attitudinal Multiword Discourse Markers
12:20	Christian Fäth	Transforming Language Resources
12:30 - 12:50		Brainstorming
12:30	all	What to do next?
12:50 - 13:00	Christian Chiarcos & Thierry Declerck	Closing Remarks

For the brainstorming session, please feel free to contribute questions and ideas, and to bring in your perspectives already during the background and use case sessions: Minutes / Agenda / Brainstorming document

Who should join

We expect a mixed audience from LD4LT, from Nexus Linguarum, and from outside these networks. Anyone willing to present their state-of-the-art research or to participate in the discussions is welcome to join: the workshop aims at cross-fertilization both across the working domains of Nexus Linguarum and with research from outside the project, with a view to joining the project if there is interest. Depending on the Covid-19 situation, we expect 15-20 on-site participants and about 50 online participants if the workshop has to be organized in a mixed mode.

Organization

The workshop is a joint activity of the W3C community group LD4LT and the COST Action Nexus Linguarum. It will provide a mixture of background presentations, descriptions of use cases and requirements, and open discussion.

Organization committee

Christian Chiarcos, Applied Computational Linguistics, Goethe Universität Frankfurt, Germany
Thierry Declerck, DFKI Saarbrücken, Germany
Milan Dojchinovski, InfAI/DBpedia Association, Germany / CTU in Prague, Czech Republic
Fahad Khan, Istituto di Linguistica Computazionale ‘A. Zampolli’, CNR, Pisa, Italy
Giedre Valunaite Oleskeviciene, Institute of Humanities, Mykolas Romeris University, Vilnius, Lithuania

We would also like to thank Bridget Almas (The Alpheios Project, Ltd., Niskayuna, NY, USA) for contributing to the preparation of the workshop.
