LD4LT Annotation Workshop Zaragoza 2021
LD4LT Workshop on Linguistic Annotations on the Web
Minutes / Agenda / Brainstorming document
The workshop focuses on challenging issues of web annotation and aims to share and advance the state of the art of linguistic annotation on the web for cross-disciplinary and cross-linguistic audiences. It provides background information on major community standards, their benefits and shortcomings, with the specific aim of contributing to and consolidating an ongoing discussion within the W3C Community Group Linked Data for Language Technology (LD4LT) on developing a consolidated LOD vocabulary for linguistic annotations for applications across language technology, empirical linguistics, computational lexicography, digital humanities, etc.
The numerous existing vocabularies for this purpose are neither interoperable with each other, nor do they cover all relevant use cases. Since 2019, LD4LT has therefore been working towards the harmonization and extension of existing standards for creating, publishing, sharing, accessing and processing linguistic annotations on the web. The goals are (a) to provide a survey of standards, challenges and requirements, (b) to work towards a W3C community report that provides either best practices for annotations or extensions to existing standards, and, ultimately, (c) to inform subsequent standardization efforts.
We aim to provide a general introduction to the topic, to consolidate this discussion, and to discuss directions, goals, and concrete strategies.
List of topics
- linguistic annotation
- linked data
- web standards
- knowledge graphs
- natural language processing
- digital methods in linguistics
- Digital Humanities
As an LD4LT community meeting, this workshop is the first of its kind. However, the organizers have been involved in organizing numerous workshops, summer schools and conferences on the topic, including:
- Seven international workshops on Linked Data in Linguistics (LDL-2013, 2014, 2015, 2016, 2018, 2020): 40-90 participants each
- Two Summer Datathons on Linguistic Linked Open Data (SD-LLOD 2017, SD-LLOD 2019): 40-50 participants
- Language Resources and Linked Data tutorial (EKAW-2014)
- Two conferences on Language, Data and Knowledge (LDK-2017, LDK-2019): 100-120 participants
This meeting builds on this experience but is dedicated to a more narrowly defined aspect. The LD4LT community group currently has more than 100 members, but for an in-person meeting, we adopt a conservative estimate of the expected number of participants (see below).
The workshop will be conducted as a hybrid half-day workshop in the morning session of the [W3C Day at LDK-2021|http://2021.ldk-conf.org/post-conference-w3c-day/]. On-site participation is possible for attendees of the LDK main conference; external participants (and virtual attendees of LDK) are welcome. For virtual participation, we use Zoom. Attendance is free of charge, but participants have to register at the LDK Registration Page.
The workshop is both a general assembly of the W3C Community Group Linked Data for Language Technology and an event organized in conjunction with the COST Action Nexus Linguarum. In that sense, the format is less a classical workshop with formal paper submission than an informal discussion round with invited presentations. It will be a mixture of presentations, use cases and discussion, with a focus both on providing the necessary background for harmonizing linguistic annotations on the web and on discussing the prospective directions, subtasks and possible milestones of such an enterprise.
We plan a four-hour event, 09:00 - 13:00 CET, with a 30-minute coffee break, 10:30 - 11:00 CET.
|09:10 - 10:30||Background: Linguistic Annotation on the Web|
|09:10||Christian Chiarcos||W3C Standard Web Annotation|
|09:25||Milan Dojchinovski||NLP Interchange Format|
|09:40||Fahad Khan||Text Encoding Initiative|
|09:55||Thierry Declerck||ISO TC37 standards|
|10:10||Joel Kalvesmaki||Text Fragids|
|11:00||all||QA: What is missing? What is unclear? Where are problems?|
|11:30||Christian Chiarcos||Summary of LD4LT Discussions on Linguistic Annotation|
|11:40 - 12:30||Use Cases, Experiences, Extensions|
|11:40||Francesco Mambrini||Linking Latin|
|11:50||NN||Distributed Text Services|
|12:00||Maxim Ionov||Interlinear Glossed Text|
|12:10||Giedre Valunaite Oleskevicienė||Discourse Research: A Case Study on Attitudinal Multiword Discourse Markers|
|12:20||Christian Fäth||Transforming Language Resources|
|12:30||all||What to do next?|
|12:50-13:00||Christian Chiarcos & Thierry Declerck||Closing Remarks|
As for the brainstorming session, please feel free to contribute questions and ideas, and to bring in your perspectives already during the background and use case sessions: Minutes / Agenda / Brainstorming document
Who should join
We expect a mixed audience from LD4LT, from Nexus Linguarum and from outside these networks. Anyone willing to present their state-of-the-art research or to participate in the discussions is welcome to join. The workshop aims at cross-fertilization both across the domains of the Nexus Linguarum project and with research from outside the project, with a view to joining the project if there is interest. Depending on the Covid-19 situation, we expect 15-20 on-site participants and about 50 online participants if the workshop has to be organized in a mixed mode.
The workshop is a joint activity of the W3C Community Group LD4LT and the COST Action Nexus Linguarum. It will provide a mixture of background presentations, descriptions of use cases and requirements, and open discussion.
- Christian Chiarcos, Applied Computational Linguistics, Goethe Universität Frankfurt, Germany
- Thierry Declerck, DFKI Saarbrücken, Germany
- Milan Dojchinovski, InfAI/DBpedia Association, Germany / CTU in Prague, Czech Republic
- Fahad Khan, Istituto di Linguistica Computazionale ‘A. Zampolli’, CNR, Pisa, Italy
- Giedre Valunaite Oleskeviciene, Institute of Humanities, Mykolas Romeris University, Vilnius, Lithuania
We would also like to thank Bridget Almas (The Alpheios Project, Ltd., Niskayuna, NY USA) for contributing to the preparation of the workshop.