W3C

MultilingualWeb workshop, Dublin

Day 1: Linked Open Data – 11 Jun 2012

Agenda

See also: IRC log

This is the raw scribe log for the sessions on day one of the MultilingualWeb workshop in Dublin. The log has not undergone careful post-editing and may contain errors or omissions; it should be read with that in mind. It represents the scribes' best effort to capture, in real time, the gist of the talks and the discussions that followed. IRC is used not only to capture notes on the talks; it can also be followed in real time by remote participants or participants with accessibility needs, and people following IRC can add their own contributions to the flow of text.

Chair
DaveLewis
Scribes
Felix Sasaki, Arle Lommel, Jirka Kosek, Tadej Štajner, Declan Groves, Moritz Hellwig



Felix Sasaki: session to start 9 a.m.

<omstefanov> David Lewis opens conference !

Welcome

Scribe: Felix Sasaki

Introduction by Vincent Wade

Vincent Wade: welcome to Dublin and TCD, delighted to host this workshop

.. mlw and linking of data across mlw is key to expansion of web

.. in CNGL, we are looking into a value chain from creation to delivery

.. how mlw content can be integrated

.. technology on language and multimedia content, personalization

.. etc. need to be brought together

.. happy to see so many CNGL partners here, collaborating nationally

.. Science Foundation Ireland has invested a lot into CNGL, with DERI focusing on the SW

.. in FP7 and collaborations across the world, we more and more have these roadmap meetings

.. we are looking into similar problems, so we need to find roadmaps to work together

.. multilinguality is a key part of this

<omstefanov> maximizing impact of our efforts <- key final point of Prof. Wade's talk! very important !

Introduction by Richard Ishida

Richard Ishida: 5th MLW workshop, very happy to see it taking place here

.. we run the MLW project with the help of the EC,

.. idea was to bring people from different disciplines together, so that they talk, it worked very well

.. during this workshop we will be more focused, but later we will go back to the general MLW workshop type again

.. 12 years ago Yves Savourel and I started talking about internationalization and localization of schemas, that led to ITS standard; great to see where we came so far

.. thanks a lot and have a good meeting

Introduction by Dave Lewis

Dave Lewis: multilingualism and web content are crucial for many businesses:

.. media providers, richer video and audio, content providers like microsoft

<omstefanov> David Lewis: success of mlw workshops: getting people together who might not otherwise have met and worked together.

.. CMS providers, browsers, ...

.. all need to be aware increasingly of multilingual content

.. aim to get people into the room from industry and academia, and from different parts of the topic

.. language technology, localization, web people, with w3c as a core place where people meet

.. and where we advance things to standards

.. another key player from a European perspective is the EC

<omstefanov> David Lewis: core of multilingual issues is W3C ... to advance standards. Other core player is the EC which provides ongoing support.

.. coordination activities have an important role

.. important both for research and infrastructure support

.. now FP7 is ending, looking forward to Horizon 2020

.. has many opportunities to bring things together

<omstefanov> EC's Framework 7 will be followed by Horizon 2020 (not next Framework)

.. Today please think mostly about bridge building: between various disciplines, industry and research

.. and esp. the two themes: MLW (HTML5, going into language services industry) and linked (open) data

.. at the end of the day we have a couple of questions to lay out a roadmap, so keep these questions in mind

Introduction by Kimmo Rossi

Kimmo: project officer of the series of workshops - MLW project and MLW-LT project

.. very grateful to Richard for bringing this community to where we are,

.. a few words about our internal re-organization

.. "my" dg will no be called "dg connect"

.. in three weeks three units will be merged, the "data value chain" unit

.. these units were LT, data, and PSI (previously E 1,2,4)

.. LT portfolio will continue to exist, but in a bigger context

.. E.4 does not have projects; so the new G3 unit will handle the E1 and E2 projects

.. new activity of our unit: we will handle policy and legislative issues of public data

.. so-called PSI "open public governmental data"

.. also, our unit will handle two infrastructures of the Connecting Europe Facility

Kimmo: LT community needs to see how they can leverage linked data, SW, big data ...

.. these are just keywords, here are just some threads:

.. extracting meaning from text, converting unstructured data into structured data

.. hard task, but we can move forward by bringing the LT and data communities together

.. a lot of useful work on terminology, ontologies, taxonomies, nomenclatures etc.

.. very happy that this workshop opens with the colorful speakers related to that area

.. a few more words on CEF - is about infrastructure

.. CEF includes early designs for 8 infrastructures, e.g. Europeana, multilingual access etc.

.. this is not research, but building systems

<omstefanov> about 78 meur will be in 3 calls to be published in July 2012

<omstefanov> Obj. 4.1 (27meur), 4.2 (31 meur) and 4.3 (20 meur).

Kimmo: objectives - content analytics, and LT, scalable data analytics, SME initiatives on analytics

<omstefanov> 4.1 Content analytics and lang tech

<omstefanov> 4.2 scalable data analytics

<omstefanov> 4.3 SME initiative on analytics

.. our (LT) role is about extracting meaning from large amounts of language-based information

Kimmo: I'm here the whole day, if you have questions please let me know

<omstefanov> Kimmo Rossi only here today. Invites everyone to come to see him today if interested / have ideas

Dave Lewis: now short self introduction

.. very interesting mix of people here

.. very happy to have also XLIFF TC people here, who will have a meeting here later in the week

.. so we have good expertise from OASIS on the localization side too

<fsasaki> dave checking - who is from industry, research, standardization

<fsasaki> then - who is on the LT or SW research side

<fsasaki> dave: have done a good job in bringing the people here

<fsasaki> arle: look at IRC - this is where we make the meeting minutes, but also to gather comments

<fsasaki> arle: crowdsourcing content creation - go and write your ideas on the board, people leading the workshop sessions will make use of that

Setting the Stage

Presentation from David Orban

David Orban: exponential trends:

<omstefanov> David Orban (dotSUB): challenge is understanding the power of exponential trends

.. in the initial part of the exponential function, people say easily "this is just noise"

.. famous example is human genome project

David: I'm a geek, loving to observe the nature of machines

<omstefanov> naysayers have an easy time, initially. It takes time to reach a point of visibility. 1% often takes "most" of the timeline. Then the exponential curve gets the rest done more quickly

<omstefanov> We've gone from mainframes thru several levels of devices, to reach the final generation of human-oriented devices.

<omstefanov> The latest, called mobile phones/devices number in the hundreds of millions

<omstefanov> the next generation of devices will outnumber people

<omstefanov> the next...

David: communication among "things" in the internet, talking to humans in emergency or other situations

.. these devices are going to dominate the future computing universe

<omstefanov> in this "next age" the age of automonomour machines, will communicate more with each other than with humans

David: autonomous devices already exist, they communicate with each other and us

<omstefanov> eg. of autonomous devices: iRobot - autonomous vacuum cleaner

.. what are the communication signifiers that enable us to operate a vacuum cleaner or a mobile phone?

.. what are the decisions of autonomous cars

.. these are fundamental challenges

.. all developments are very chaotic - standards settings and policy making are essential for this

<omstefanov> in terms of policy making decisions are very chaotic ... industry wants to thrust ahead.

.. consumers jump on board - it is important to balance advantages of new technologies with potential pitfalls

<omstefanov> users want the devices, without thinking too much about consequences

<omstefanov> google glass as e.g. of augmented reality

.. hard for policy makers to keep up

<omstefanov> de facto industry standards will be faster to develop than those that standards bodies develop.

.. technology developments - augmented reality interfaces, google glasses etc.

<omstefanov> former may influence latter

<omstefanov> "Code is Law" Lawrence Lessig.

David: need of human societies to interact in a positive way - not in a relationship of winners and losers

<omstefanov> ... from CODE and other laws of cyberspace (http://code-is-law.org)

.. semantic web can accelerate understanding of this

.. many have seen results that google exposes - creating wikipedia like pages from accumulated search results

.. bring human component of wikipedia like large scale cooperation together with semantic web processing

.. we are creating a very powerful interoperable hybrid system

<omstefanov> Human-computer interoperability is coming ! Wikipedia-like pages created using semantic web tools from Google-like data

.. we are creating the premises of new types of social interactions that can be abstracted to a political level

<omstefanov> DavidO: recommends the "Proactionary Principle"

<omstefanov> Developing a balanced approach to decision making.

Presentation from Peter Schmitz

Peter Schmitz: building a new architecture in the cellar project

.. deal also with metadata standardization, format standardization

.. esp. in legal domain there are many XML based structures

.. re-use policy of the EC

.. purpose is to increase efficiency in EC

.. currently developing an open data license

.. publications office: publisher of EU

.. EC, european parliament, other institutions

.. publishing in 23 languages

.. many public online services - EUR-Lex, EU Bookshop, public procurement, R&D on CORDIS

.. position of publication office on re-use and SW / linked data:

.. we are part of EC, so we are part of the execution of this initiative

.. we have a re-use policy led by DG INFSO / the future DG CONNECT

.. in about autumn first version of european data portal will be online

.. EU level: standardization participation esp. in the legal domain

.. topics / ideas of re-use of language resources in NLP domain

.. our contributions: a multilingual thesaurus like EuroVoc, multilingual controlled vocabularies and taxonomies ("common authority tables"), linked multilingual XHTML content (official journal, case law)

.. all these will be provided for re-use in new dissemination architecture
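EuroVoc-style thesauri and authority tables are typically modelled in SKOS, where one concept URI carries a preferred label per language. A minimal sketch using Python's rdflib (an assumption; the concept URI and labels below are invented for illustration):

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import SKOS

g = Graph()
# Hypothetical EuroVoc-style concept: one URI, one preferred label per language.
concept = URIRef("http://example.org/eurovoc/concept/0042")

g.add((concept, SKOS.prefLabel, Literal("waste management", lang="en")))
g.add((concept, SKOS.prefLabel, Literal("gestion des déchets", lang="fr")))
g.add((concept, SKOS.prefLabel, Literal("Abfallbewirtschaftung", lang="de")))

print(g.serialize(format="turtle"))
```

Because the concept, not the label, is the identifier, all 23 language versions of a document can point at the same URI.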

.. legal content started in SGML, now XML, being converted into XHTML

.. content delivery infrastructure for linked open multilingual data

.. will provide storage, dissemination, content, provision of persistent URIs

.. prefix: http://publications.europa.eu

.. support and encourage data providers to provide RDF

.. visualisation tools based on RDF

.. encourage colleagues to provide their data in RDF too

Peter: for open data portal: possibility to contribute ideas, data cataloging ...

Peter: crowd-sourced annotation and adaptation of LOD: a question for us

.. we have annotation of official content, but there might be use cases for crowd-sourced annotations

.. but need to define quality support in this

.. provenance tracking, history and storage is important

.. LOD and authenticity - will there be an organization for LOD?

.. how to implement this concept, how to approve it?

.. for thesaurus we have a release mgmt, for controlled vocab we trace histories

.. existing codes will remain in vocabulary + time spans

.. further application domains for MLW - LOD in eGov, health:

.. we provide stable and persistent URIs for data

.. would like to discuss: what about authorized relationships

.. e.g. a journal is published in 23 languages - what is the authority relation here?

.. about LOD from the public side - provide a European Legislation Identifier (ELI)

.. part of standardization of PSI

.. example: http://eurlex.europa.eu/eli/dir/2008/98

.. will make it possible to align legislation across Europe
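The value of ELI is that the identifier pattern itself is predictable across publishers. A toy sketch (the /eli/{type}/{year}/{number} template is inferred from the single example above, so treat it as an assumption rather than the official scheme):

```python
def eli_uri(doc_type: str, year: int, number: int) -> str:
    """Build an ELI-style identifier, assuming the /eli/{type}/{year}/{number}
    pattern suggested by the example above."""
    return f"http://eurlex.europa.eu/eli/{doc_type}/{year}/{number}"

# Directive 2008/98, the example given above:
print(eli_uri("dir", 2008, 98))  # http://eurlex.europa.eu/eli/dir/2008/98
```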

Peter: LOD and its role in MLW-LT metadata

Peter: integration of LOD through MLW-LT metadata

.. references from web content items, e.g. entries in multilingual thesauri or authority tables etc.

.. enrichment of web content with HQ information, to improve MT, localization workflows etc.

Presentation from Europeana (Juliane Stiller + Marlies Olensky)

Juliane Stiller: working on europeana 2 - multilingual access of europeana content and europeana data layer

.. europeana facts: launched 2008, cultural heritage information system

.. data from archives, audio visual archives, libraries

.. build a digital library as a single access point, today access to about 23 mill objects

.. map of Europe - main content is coming from a few European countries

.. e.g. if metadata comes from France, metadata is in french, but object might be from a different language

.. how does multilingual access and search work on Europeana? involves interface,

.. search (query translation and document translation)

.. result presentation (enable users to assess relevance of results)

.. and browsing, important for cultural heritage domain

.. people need to be able to "find the unknown"

.. Europeana: static interface is translated into 26 different languages

.. query translation prototype developed for 10 European languages

.. a document can be translated after being found, via MS translation API

.. now work on semantic data layer

.. multilingual alignment of controlled vocabularies etc.

Marlies Olensky: EDM comprises cross-domain metadata - library, archive, museum

.. EDM is an umbrella for different domains and levels of granularity

.. europeana is a cross-domain framework, using SKOS, CIDOC-CRM etc.

.. can use then specific parts e.g. from museum domain

.. basic distinction: "provided item" vs. its digital representation, plus metadata record

.. allow for multiple records of one object

.. composition of objects, important for library and archives domain

.. can re-present contextual resources

.. a metadata format that can be specialized

.. EDM case studies, see http://pro.europeana.eu/web/guest/case-studies-edm

.. LOD pilot with 2.4 mill objects, contributions from Spain, Norway, Austria, Sweden, Belgium

.. places or items are mapped to places

.. multilinguality and EDM:

.. semantic data is multilingual, see data cloud developed in europeana connect project

.. different vocabularies are aligned with each other

.. multilingual vocabularies are not the only way to allow for multilingual search results

.. we also align monolingual results by a pivot vocabulary

.. language tags play a role in the EDM too - labels in different languages

.. now example how Europeana portal deals with multilinguality

Juliane: example search for "cheval" - result list, facets, filtered by language

.. metadata is in czech

.. cheval is not in the metadata - the result was found because the metadata fields were enriched with different language versions, including a thesaurus with the term "cheval"

.. with multilingual enrichment of vocabularies you can enhance multilingual search
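A toy sketch of that mechanism in Python (data and field names are invented, not Europeana's actual implementation): at ingest time, records are enriched with all language versions of the thesaurus concepts they match, so a French query can hit a Czech record.

```python
# Toy multilingual thesaurus: one concept with labels in several languages.
thesaurus = {
    "concept:horse": {"en": "horse", "fr": "cheval", "cs": "kůň"},
}

# A Czech metadata record, enriched at ingest time with every label of
# the thesaurus concept it was mapped to.
record = {
    "title": "Kůň v krajině",   # hypothetical Czech title
    "enriched_terms": ["horse", "cheval", "kůň"],
}

def matches(query: str, rec: dict) -> bool:
    # The French query "cheval" hits the Czech record only because the
    # enrichment step added the French label to the searchable terms.
    return query.lower() in (t.lower() for t in rec["enriched_terms"])

print(matches("cheval", record))  # True
```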

.. we are now working on how to present this to the user

.. summary: europeana is very multilingual

.. multilingual "metadata + object + user"

.. hard to retrieve objects but also to present to a multilingual audience

"setting the stage" - QA

Dave Lewis: question on DavidO - easy multilinguality on the web - will that be a source of a big change?

David Orban: yes

.. MT has been very active, statistical approaches created a new generation of MT

.. now statistical MT has reached maximum potential

.. now there is an opportunity to apply further techniques

.. not only for translation of a piece of content, but also for acquisition of content

.. e.g. for training

.. additional techniques to differentiate between speakers, understand what not to try to transcribe

Paul: talking about translation and multilinguality, in sense of terms / multilingual alignment / multilingual resources

.. what is the role of e.g. LR (language resources)? what role does NLP play in Europeana?

.. e.g. beyond terms: lexical resources, morphological information

Juliane: very important, in Europeana connect we built language resources

.. we were looking for resources to implement cross-lingual search

Nicoletta: we should discuss very carefully the role of LR with respect to MLW, SW

.. there are so many dimensions one should touch

.. at the moment (in MLW-LT) we discuss metadata - "content" in Europeana is another level that touches lexicon etc.

.. there is also big data in our field

.. so there are so many dimensions - the role of LR in big data environment needs to be discussed, including policy issues

Paul: policy issues in terms of standardization in the EU context are important too

.. how do we deal with standardization, making linked data (content, lexicon / linguistic) "official"

Kimmo: very important that we don't kill emerging new activity by standardizing

.. previously things have been killed by standardization

.. our practical approach in MLW-LT and other projects has been: we impose standardization by example, not by conditions and rules

.. has the disadvantage that it is slightly chaotic, but would still speak in favour of involving people in doing something

.. e.g. EU publications office should be the lead in standardizing their work

.. that might become a part of a standard or not, and need to link this to standards work

Phil Archer: talked to EDM people about EDM becoming part of W3C standardization; also have a lot of questions for Peter, will take that offline

Thierry: also important to see if we need other standards like LMF for lexicons etc.

<RyanHeart> Question: what is the one take-away from this session?

Arle: question for davidO - how to build the bridge from current efforts we saw so far to your vision?

David Orban: competition will drive adoption of models

.. this is just a start of a conversation - Europeana is a wonderful initiative

.. would love to see it evolve toward opportunities that are commercial as well

Alex Lik: why not use regulation as a vehicle for industries to implement things

Ioannis Iakovidis: who will pay for it?

.. things need to be market driven

.. if there is no carrot nothing will happen

Olaf-Michael Stefanov: to all the speakers - what is the role of the crowdsource for metadata gathering, metadata definition etc.

Juliane: doing that in Europeana to some extent - looking into user logs etc.

<fsasaki> now break

<fsasaki> before break about "qero" project (link to be provided later)

Linking Resources

Scribe: Arle Lommel

Dave Lewis: This section is about linking data.

Presentation from Sebastian Hellmann

Sebastian Hellmann: researcher at Uni Leipzig. Will be talking about linked data for NLP and Web annotation. This is a broad topic. I will point to projects as an overview.

.. Motivational slide: lots of walled gardens. This is the way it was before RDF. There are many beautiful gardens, but you can't go between them. I want to talk about turning walled gardens into networks of parks.

.. How do we leverage linked data for NLP? They cover many domains. The data is crowdsourced. This is the background.

.. RDF is about semantic interoperability.

.. Third factor is making the output of NLP available on the web.

.. Slide shows huge number of Linked Open Data repositories. Currently linguistic data is one part under cross-domain.

.. Linguistic Linked Open Data Cloud: Links many areas. How do you fund this? Difficult to fund any one institution. There is a time horizon on funding: may lead to death of projects.
... Funding for cloud remains difficult

.. DBPedia includes eight interlinked language versions. Individual language data is available.

.. Wiktionary2RDF: Communities create wrappers (made by domain experts). Converted to Lemon via Mediator. Anyone can join the community: http://dbpedia.org/Wiktionary

.. Web Technologies for integrating NLP tools and approaches. Once you are immersed in a technology, you don't see other solutions and start trying to apply it where it doesn't apply. There are cases where RDF makes sense, but others where relational databases make sense. Learn *when* to use it.

.. My solution: RDF allows linking between walled gardens. It has certain properties other data models do not provide.

<omstefanov> Says RDF is the way to link up different gardens, but not why

.. Advantages: URIs available, formal documentation (like UML), easy-to-understand structure, many tools (e.g., LOD2 Stack), indexing and querying allow big picture.

.. NLP Interchange Format (NIF) aims at interoperability between NLP, language resources, and annotations.

.. First released September 2011. Open project. Growing with feedback.

.. NIF allows interlinking between various tools (slide shows structure and tools).

.. No current standard mechanism to connect WWW, Giant Global Graph (GGG), and NLP. There is no way to combine the three.

.. Want to allow annotation by various tools. Also human annotation (links, free text, correction of NLP annotations)

.. But all this does not work together. It has the walled garden problem still. Semantic Web is supposed to fix this, but a lot of work remains.

.. Showed example of how to make it work.
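In that spirit, a minimal sketch of the NIF idea with rdflib (the namespace and property names below are illustrative placeholders, not the official NIF vocabulary): a substring of a web document gets its own URI via an RFC 5147-style offset fragment, and any tool's annotations attach to that URI as ordinary triples.

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Placeholder namespace standing in for the NIF vocabulary (assumption).
NIF = Namespace("http://example.org/nif#")

g = Graph()

# The substring "Dublin" at characters 0-6 of a document is addressed with
# an RFC 5147-style fragment identifier, making the string itself a resource.
doc = "http://example.org/doc.txt"
mention = URIRef(doc + "#char=0,6")

g.add((mention, NIF.anchorOf, Literal("Dublin")))
# Any NLP tool can attach its output to the same addressable string,
# e.g. an entity link into the LOD cloud:
g.add((mention, NIF.entityLink, URIRef("http://dbpedia.org/resource/Dublin")))

print(g.serialize(format="turtle"))
```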

.. Feel free to join.

Presentation from Dominic Jones

Dominic Jones: Want to start with an info graphic to show the world by nationality. Want to add to it traditional print media and also user-generated content (on electronic devices). These types of content are very different. What we produce is somewhere in the middle.

.. Issue is the challenge of how to localize this stuff.

.. Compare Flickr, Reddit, etc. Raises issues: provenance, access control (linked *open* data vs. linked data—this may be a blocking issue)

.. Architecture based on CMS Lion, uses XLIFF messaging between various components. What we add is an RDF model of translation, provenance, CNGL service and content models.

.. These models represent data we deal with.

.. Book of Kells is here at Trinity, written on calf skin, in a big glass case. You can't access it yourself, relies on gatekeepers. Compare it to the iPad, where the consumer becomes the producer.

.. CMS-LION emphasizes user-generated content. Compare to "telephone" game: our system lets us know who changes what in the translation, what happens in QA, what is consensus, etc.

.. Will show an example of a tweet. We break it into the content model, and into a job. Provenance is critical in working with things. We use a lightweight version of the Open Provenance Model.

.. We have Artifacts, Processes, and Agents. Able to map process diagram to these things. This process allows us to enrich the translation model with information on who did things, and how.
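A rough sketch of how that enrichment might look as triples (the namespaces and resource names are hypothetical, though used / wasGeneratedBy / wasControlledBy are the edge names the Open Provenance Model itself defines):

```python
from rdflib import Graph, Namespace

# Hypothetical vocabulary echoing OPM's three node types:
# Artifacts, Processes, and Agents.
OPM = Namespace("http://example.org/opm#")
EX = Namespace("http://example.org/job42/")   # one translation job

g = Graph()
# The translated segment (artifact) was generated by a post-editing
# process, which consumed the raw MT output and was controlled by a
# human translator (agent).
g.add((EX.translatedSegment, OPM.wasGeneratedBy, EX.postEdit))
g.add((EX.postEdit, OPM.used, EX.mtOutput))
g.add((EX.postEdit, OPM.wasControlledBy, EX.translator))

print(g.serialize(format="turtle"))
```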

.. We are integrating CMS LION with Panacea. Focus on post-edits to retrain MT systems. Also tying with LSP (VistaTEC).

.. Will use CMS-LION as a test-bed for ITS 2.0 and tie it in with Solas (Limerick test bed for workflow orchestration).
... Now we are at the intersection of Multilingual Semantic Web, Language Resources, and Localization. This is MLW LOD.

Presentation by Jose Emilio Labra Gayo

Jose Labra: My work is in multilingual LOD. We translated product schemes and procurement vocabulary in EC projects.

.. We have HTML, human readable, but how do we move from that to machine-readable that is intrinsically multilingual.

.. There is data and there is *multilingual* data. We need to account for human readable information (e.g. "professor" vs. "catedrático"). Moving from this to machine-readable is a challenge.

.. Want to talk about best practices impacted by multilinguality. Have 8 best practices.
... 1. Design a good URI scheme. Cool URIs don't change, identify things, are human-readable.

<PhilA> Note to self - Gov Linked data WG Best Practices on Linked Data needs to include section on multilingualism - ref. Juan's talk

.. e.g. dbpedia.org/resource/Spain is good.

.. I'm not sure if internationalized URIs are good or not. Can create problems with phishing, limited support, and human readability across languages.

.. 2. Model resources, not labels. URIs should map to contents, not to particular labels. We don't want to map to different language labels. Use universal pointer with RDF labels to language specific versions. But can cause problems with thesauri. SKOS uses URI-identifiable labels.
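A minimal sketch of this practice with rdflib (illustrative; DBpedia's actual data is richer): one URI identifies the resource, and language-tagged rdfs:label literals carry the per-language names.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDFS

g = Graph()
spain = URIRef("http://dbpedia.org/resource/Spain")  # one URI for the resource

# Names are attached as language-tagged literals, not baked into the URI.
g.add((spain, RDFS.label, Literal("Spain", lang="en")))
g.add((spain, RDFS.label, Literal("España", lang="es")))
g.add((spain, RDFS.label, Literal("Espagne", lang="fr")))

print(g.serialize(format="turtle"))
```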

.. Question: What happens if we want to use localized URIs. Perhaps using language identifiers in the URI is good, but I don't know.

.. 3. Use human-readable information. Machine information can also be human readable.

.. Question is how to balance between human readable and RDF world.

.. 4. Use labels for all the entities you model, not just concepts, not just main entities. Displaying labels is easier if you don't have to make multiple requests.

.. Problems: Selecting the proper label. Only 38% of non-information resources have labels. Also, avoid camel case or similar notations in labels. "UniversityOfOviedo" is a bad label.

.. 5. Use multilingual literals. IETF language tags let you select the right tag. But multilingual literals can create problems. The right technology can deliver less than ideal results. E.g., SPARQL works with labels; what happens when you use a language-bound query (e.g., for "Professor" without a language tag)? Need to create a default label with no language tag.

.. This is currently unused (only 4.78% of info-resources use a language tag, and only 0.7% use more than one.)
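The pitfall can be shown in a few lines of rdflib (the resource URI is hypothetical): in SPARQL, the plain literal "Professor" and the tagged literal "Professor"@en are distinct terms, so a tag-free default label is what makes naive queries work.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDFS

g = Graph()
prof = URIRef("http://example.org/role/Professor")  # hypothetical resource
g.add((prof, RDFS.label, Literal("Professor", lang="en")))
g.add((prof, RDFS.label, Literal("Catedrático", lang="es")))

# A query for the untagged literal does NOT match "Professor"@en.
q = 'SELECT ?s WHERE { ?s rdfs:label "Professor" . }'
print(len(list(g.query(q, initNs={"rdfs": RDFS}))))  # 0

# Adding a default, tag-free label makes the plain-literal query work.
g.add((prof, RDFS.label, Literal("Professor")))
print(len(list(g.query(q, initNs={"rdfs": RDFS}))))  # 1
```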

.. We need to balance between RDF and XML and be aware of consequences of mixing. This is a challenge.

.. 6. Use content negotiation. Use Accept-Language. Without it we end up returning too much data. Allows you to get labels in the language you want.
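From the client side the negotiation is just an HTTP header. A sketch with Python's requests library (the endpoint is a placeholder; whether a server honours Accept-Language for RDF responses depends on the publisher):

```python
import requests

# Placeholder linked-data resource; a cooperating server would return
# only the labels matching the requested language.
url = "http://example.org/resource/Spain"

resp = requests.get(
    url,
    headers={
        "Accept": "text/turtle",    # negotiate an RDF serialization
        "Accept-Language": "es",    # ask for Spanish labels
    },
)
print(resp.status_code)
print(resp.text)
```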

.. 7. Include labels without a language tag. This makes it easier for SPARQL queries. Need to know what the default language is. Is there a way to declare the primary language of an RDF data set?

.. 8. Use multilingual vocabularies. Claimed that they should include descriptions in more than one language, but most do not. Also what to do when not localized?

.. Raises issues of when categories don't map precisely across languages.

.. Some other issues: Unicode support. Microdata doesn't allow language declarations. Internationalization not covered in RDF.
... The Linked Data Platform (LDP) WG offers new challenges.

"Linking Data" QA

Question: José, you recommended using literals without language tags at all. But what happens when the literal can mean different things in different languages? E.g., Gift in English is very different from Gift in German (where it means "poison").

José Labra: These are difficult issues. There are practices to model lexicons and separate concepts from labels.

Sebastian Hellmann: The URIs are not human readable if you do not use IRIs. But then you have to use % encoding, which is impossible to read. In DBpedia we use IRIs. We think libraries should support IRIs.

Maxime Lefrançois: I have a question for Dominic. Do you use the W3C Provenance concepts?

Dominic Jones: The Open Provenance Model predates the W3C work and gave rise to it, but we chose it as an off-the-shelf solution.

Thierry Declerck: Question about terminology (did not catch it)

Tadej Štajner: For Sebastian. I've been following the NLP to RDF work. Is there any work on encoding this in an inline format directly in a document. Some of our use cases require this rather than storing them separately.

Sebastian: It might be possible, but it is difficult in general. It is hard for any annotation format. Maybe easier with RDFa. We'll have to discuss more.

Pedro Diez: Maybe we need to make a distinction between the kinds of data we are trying to link. We need a map without ambiguity to link linguistic data and general lexicons. Right now it is difficult to link to concepts with different literals across languages.

scribe: Regarding this, maybe we need to make distinctions between different kinds of data: brands, names, telephone numbers, words. Most of the work we can reuse is lexical databases. They represent hard work.

Maxime Lefrançois: In the Linked Open Data Cloud, is there work for linguistics? What are the links between ???

Sebastian: It is a draft right now, but for the Linguistic Linked Open Data Cloud, it is more of a vision.
... We hope the original LOD Cloud will get more interactive over time.

Linked Open Data and the Lexicon session chaired by Arle Lommel

Scribe: Jirka Kosek

Bringing Terminology to Linked Data through TBX by Alan Melby

Alan: one of objectives of workshop is to bridge gap between traditional terminology and LOD/LD efforts

.. TBX/RDF

.. TBX is TernBase eXchange standard

.. not single language, family of languages called dialects

.. TBX/RDF is isomorphic mapping of TBX to RDF

.. alows loseless bidirectional conversion

.. why to have TBX/RDF?

.. there is a lot of knowledge in TBX, it should be part of LD

.. LD can benefit from term disambiguation

.. access to well-established knowledge engineering community
... provides concept-based information for translation

<Nathan_Rasmussen> Terminology industry uses TBX for interchange, but

<Nathan_Rasmussen> it is not well suited to an online, open style of data exchange like RDF is

<Nathan_Rasmussen> thus this project is for terminologists to benefit from LD as well as the other way around.

Alan: URIs for data categories are at www.isocat.org
... TBX/RDF uses XML + RDFa 1.1

Presentation by Ioannis Iakovidis (Managing Director, Interverbum Technology)

Ioannis: Our company develops TermWeb - terminology management system

Ioannis describes what tool can do and how complex terminology workflow could be

.. terminology is everywhere

.. challenges - integration with another tools (no standard API)
... term identification and tagging
... some standard/convention is highly desirable

Extending the Use of Web-Based Terminology Services by Tatiana Gornostay

Tatiana is from Tilde, a company providing localization, translation, ... mainly for LV, LT, ET and RU

Tatiana: they have to deal with terminology every day

.. efficient communication requires terminology

.. terminology is bridging language and semantic technologies

.. eurotermbank.eu - 2nd largest term. database in Europe
... describes Accurat & TTC tools/projects

.. TaaS - Terminology as a service

.. terminology can enhance automation in LOD
... terminology helps in automating work with multi/cross-lingual metadata

The Need for Lexicalization of Linked Data by John McCrae

John McCrae: PYTHIA - ontology based question answering system

.. proper rdfs:label is only on about 2% of content

.. labels are very ambiguous

.. created a lexicon model relative to ontologies
... built on ISO 24613 and SKOS
... further development under W3C Ontolex CG
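A sketch of the lexicon-ontology split that this model makes (the namespace and the canonicalForm / writtenRep / sense / reference property names follow the published lemon model as far as I recall; treat the snippet as illustrative):

```python
from rdflib import Graph, Literal, Namespace, URIRef

LEMON = Namespace("http://lemon-model.net/lemon#")  # lemon namespace, to the best of my knowledge
EX = Namespace("http://example.org/lexicon/")

g = Graph()
entry = EX.cat_en  # a lexical entry for the English word "cat"

# The written form lives in the lexicon...
g.add((entry, LEMON.canonicalForm, EX.cat_en_form))
g.add((EX.cat_en_form, LEMON.writtenRep, Literal("cat", lang="en")))

# ...while the meaning is a pointer into an ontology, keeping the
# ambiguous label separate from the unambiguous concept.
g.add((entry, LEMON.sense, EX.cat_en_sense))
g.add((EX.cat_en_sense, LEMON.reference, URIRef("http://dbpedia.org/resource/Cat")))

print(g.serialize(format="turtle"))
```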

Cool URIs Are Human Readable by Phil Archer

Phil Archer: ISA - Interoperability Solutions for European Public Administrations

.. Phil describes what interoperability means in terms of terminology

.. unique identifier of term is very important for interop

.. domain names used in URI should be neutral

.. providing equivalent localized URIs is friendly for users but adds more work for publishers and needs more power for processing

Q&A

Pedro: Déjà vu - we had a similar problem 18 years ago.

Tatiana: IMHO natural language can't be formalized. ...

Kimmo: The EU Commission has tried to do this for years. Formalization of concepts has limits.

Felix: 18 years ago there was no Web.

Arle: Hopefully we are at an inflection point now and we will see rapid progress.

Kerstin: Some terms are translated only in some languages.

Alexander: What to translate and what not to can be held in metadata.

Tadej: RDF is unable to express that something is a translation.

Identifying Users and Use Cases - Matching Data to Users

Scribe: Tadej Štajner

DaveL: After an interesting discussion, it's time to draw out concrete use cases

Thierry: Can you give us short descriptions of the use cases for the technologies and standards described today? What are the business cases, how can language LOD improve business?

PeterS: There are several in the legal domain and public services.

.. From the transparency side, equivalence of languages is important.
... For the open data, we have to come back to this

Thierry: Is there a concrete requirement for one of those use cases?

PeterS: There were interactions, a lot of feedback comes from institutions that use the technology, as well as from the EU member states, but more on the formal side. There is less informal feedback from the community.

Juliane: For Europeana, we are gathering feedback not just from librarians in the community, but also from the wider public via the Europeana Connect initiative.

.. In terms of multilingual uses, we did access log analyses and we indeed have multilingual users, but the majority are monolingual users who need to be able to use it.

Thierry: How was the access implemented? SPARQL?

Juliane: For now, the interface is via downloading a dump, the rest is under implementation. A multilingual interface is difficult to implement.

.. With regard to users, there are several groups in terms of requirements: some come from education, professionals and general public.

fsasaki: Are users from the localization/internationalization area a new use case for you?

Juliane: We are aiming to have cross-lingual descriptions for our resources also for other use cases.

PeterS: We are at the end of the chain, the data gets generated upstream.

Sebastian Sklarß: We work with the German government on e-government projects. We started by using Drupal, and generally open source CMSes are a big trend in this space.
... the LOD, inferences, semantic technologies still need a lot of acceptance management on the e-government side.

philr: We are very interested in using RDF for integrating various info sources. We want to provide a full picture of the data, currently we do this via synchronizing the disparate sources, which is complicated.

philr: We see opportunity in using RDF for this kind of data integration purposes.

Pedro: We work in localisation. We have 2 scenarios where we need this metadata.

.. 1) real-time processing of metadata for normal translation workflows

.. 2) real-time translation, which happens synchronously on page visit. Here, the client needs to be able to influence the process of serving RT translated content.

TatianaG: At Tilde, terminology management is an invaluable source for automatic cross-lingual annotation.

AlanMelby: What is the status of expressing terminology in ITS1.0?

Yves_: There is a termInfo* family of attributes which express that.

AlanMelby: What about looking at meanings and not just strings of characters - ambiguous queries in information retrieval can be solved via terminology disambiguation.

.. A semi-automatic term concept suggestion could solve this, a human can quickly scan and link documents to existing terminology databases.

.. One of the sticking points is the ontologies: how to name certain subject fields?
... You can get a long way with the current state of the terminology resources.

fsasaki: Are there any things that customers look for? Any workflows that need to be supported?

AlexanderLik: We don't operate our solutions on the web, but internally. If the new standard supports DITA-based XML, it's easy for us to adapt. A new format would be prohibitive for us.

Pedro: We have clients with public and private areas of information. If you can't keep personal information in the sources, you also can't keep these things in translation memories.
... These things could be better controlled by using a tag set for that.

Thierry: Are there any other aspects about security in Linked Data?

Pedro: If the clients need special security measures, the server needs to be located at the client premises, so that is also an operational issue for us.

paulb: There are several people at DERI working on access control on linked data. There is demand for this functionality from commercial entities.

Arle: Pedro's concern is legitimate: not all Linked Data is Linked Open Data. This is an important aspect for many users.

PhilA: LD != LOD. W3C is aware of that; last week at the Data Forum a paper presented an approach that had an ACL flag for every triple.
... a lot of solutions are coming.

fsasaki: The LOD working groups are also gathering requirements on that aspect.

<fsasaki> see http://www.w3.org/2012/ldp/charter for more info

??: When converting XML data structures into more redundant RDF data structures, clients need arguments to take that approach.

Thierry: What is the benefit of LOD as opposed to plain XML? Demonstrating business value is important.

DesOates: In Adobe, we already have a "Linked Closed Data" corpus of terms. It's expensive to maintain and manage. Our problem is not the openness, but actual applicability.

.. Developers will never look up termbases for UX or documentation. The Linked part is irrelevant if we can't get it into the software source.

Building Bridges

Kimmo: bridge building is often talked about but not very easy to do

Kimmo: we are often overwhelmed by the multitude of information. META-NET (in particular META-SHARE) is a European initiative to gather related language tools, resources under one access point

META-NET project site: http://www.meta-net.eu/

META-NET Strategic Research Agenda and Linked Open Data presented by Georg Rehm

Presentation from Georg Rehm - META-NET and the Strategic Research Agenda + LOD

Georg: conference June 20/21st in Brussels META-Forum

.. what META-NET is trying to do is provide each language community with resources in order to overcome language borders, the last remaining border
... focused on under-resourced languages of Europe

Georg: 3 lines of ACTION: META-VISION, META-SHARE, META-RESEARCH

.. ELRA/ELDA are part of the consortium and have already committed to use the META-SHARE approach

.. META-NET grew from T4ME, CESAR, METANET4U, META-NORD projects

.. Strategic Research Agenda - document to mobilise researchers, users and providers of LT for collaboration & community building

.. require appropriate programme, appropriate actors and the appropriate support for an effective SRA

.. 3 vision groups: Translation and Localisation; Media and Information Services; Interactive Systems

.. produced 30 whitepapers discussing language resources and identifying gaps in terms of resources, funding etc.

.. Research Priority Themes are "Translation Cloud", "Social Intelligence" and "Socially Aware Interactive Assistants"

.. Translation Cloud is not restricted to Machine Translation, includes human translation/translators, LSPs...

.. Social Intelligence is concerned with multilingual text mining, discussion platforms at a European level...

.. Socially Aware Interactive Assistants = "super-Siri", focus on Speech Technology

.. the Data Challenge will be one of the challenges for Horizon 2020. It is concerned with the data value chain, open data, big data, linked data

.. Under the translation cloud, access to data (such as linked data) can be used to help improve language technologies including machine translation
... Social intelligence can help provide novel methods to access data, to clean data sets...

<omstefanov> Where

Georg: We need to consider data security issues and how this affects data distribution/sharing

<omstefanov> ... can I find out more about the Socially Aware interactive applications? Specifically speech recognition? Can't find anything useful on either http://www.meta-net.eu/forum/the-strategic-research-agenda-for-multilingual-europe/ or http://www.lt-world.org/

Georg: we need to publish various language resources (terminology, TMs, wordnets etc) as linked (open) data

.. META is an open strategic alliance with 600+ members with the goal of supporting the Strategic Research Agenda: http://www.meta-net.eu/join

<fsasaki_> omstefanov, more info is at http://www.meta-net.eu/sra - soon you'll have the SRA linked with more info about that topic and others from here

Building Bridges QA

Kimmo: Horizon 2020 is still only a proposal, nothing is set in stone. The topics may change and the Data Challenge is not guaranteed to be a challenge and additionally there is no guarantee that language will be included if it is a challenge.

DLewis: What is the timeline for these decisions to be made?

Kimmo: By the end of next year. It's a political process

NickCampbell: Is the Latvian person really going to teleirc someone in Portugal?

Georg: It is a vision - looking beyond what is currently possible

RSchaler: Siri isn't great, but it works. What is the real revolutionary output of what you're presenting?

Georg: The capabilities are very limited. However, you can't really engage in a dialogue with the system. We want to extend it to something that is following you across devices and has a profile of your preferences etc.

Q: We have a limited number of languages. The EC/META-NET could invest in making the speech tools support some of the lesser-supported languages or help finance projects that would make speech tech. broaden in scope for more languages and more accents

Georg: Many of these tech. research gaps are identified in the language white papers.

<RyanHeart> So far, the language industry functions based on markets?

PhilArcher: in the w3c office in New Delhi it came across that people preferred to "speak" to the web - based on the proposed work you can really transform countries

<RyanHeart> Would companies have invented a keyboard if users could have talked to devices?

DLewis: We have to always be aware of the question re. what is truly innovative/novel about the work.

<RyanHeart> Is 'vision' about extending current technologies to other languages, platforms, environments?

Q: Clearly adding current capabilities to new languages is a huge issue. Regarding the current capabilities of the technology, are there any theoretical breakthroughs that are needed in order to accomplish the vision for 2020 or is just a matter of applying existing technologies better?

Georg: Of course there are many breakthroughs needed. One of the main components of the SRA is a roadmap document that builds timelines of what needs to be done when, with priorities identifying what problems need to be solved along the way
... we have to be realistic and credible at the same time

DesOates: the key is to join the dots between the technologies. The round-tripping between the technologies still needs to be done

We need to ensure that the link between language resources and language technologies is included in the SRA

Kimmo: You can have a critical discussion about this at METAFORUM

.. We should not let the fact that Google/Apple have already developed the technologies discourage further research/development - trying them out, it is clear that the problems (e.g. interactive assistant, subtitling) are not completely solved.

<RyanHeart> The point is not to say: Apple or Google have solved the problem (they might or might not have), but whether we can manage to come up with revolutionary new ideas for technology solutions, rather than working on extending existing ones.

Kimmo: session concluded

<mhellwig> scribe: mhellwig

Action plan discussion led by Dave Lewis

Scribe: Moritz Hellwig

DaveL: What changes in thinking and planning do we need to advance the MultilingualWeb

.. we have two communities (language resources and language technologies) and there are synergies between them - they share conferences and projects

.. a lot of the activity relies on research which causes people to worry about sustainability

.. on the other side we have the private sector, which may be driving activities

.. besides multilingual or multinational organisations, localisation industry which has a long tail of companies

.. MultilingualWeb-LT is interesting, because it brings the communities together

.. the question is: how do we address the requirements of the communities?
... and how do we gather information from all the communities

Q: Where is the data to be linked? Maybe we need an inventory of the most important data that need to be linked. Who is caring enough to pay for it?

??: If you talk about sustainability, what is the motivation for people to host the LOD. All participants in the LOD cloud have different incentives and that's why it works.

scribe: the publisher provides the data and then everybody can take it, use it. As long as the incentive is there LOD will continue

dF: so far no real use case has emerged. LSPs serve the publisher, but the publisher doesn't need the data link. Another problem: "here is data, now link it" is the wrong approach. It is not sustainable to link things after they were created

??: the motivations for LOD are very diverse. one motivation to open databases could be that there are many silos in information technologies that do not integrate. So a motivation is to integrate and create interoperability.

scribe: people want to open data, data that is multilingual

fsasaki: luxembourg workshop: what are the important parts of the infrastructure work to make the vision presented work? Best practices would help.
... a project that creates best practices and uses them will be helpful

SebastianSklarss: the Food and Agriculture Organization - starting in 1980 - converted their data to make it linkable. They provide the infrastructure for linking up; they feel they have the mandate to host the data. From there you can start linking up

??: data warehousing exists separately from the language industry; the data is structured to extract meaning from the information.

scribe: (referring to the LOD cloud) it looks like a swarm; one breakthrough to find is ...

??: In the Open Linked Data WG it was discussed that it's not really applicable to linguistics. Persons have the same level of granularity, but this is not the case here. Secondly, privacy issues are problematic; linguistic data is natural speech

Pedro: the internet has three problems: first, how to retrieve and maintain large amounts of data. Second, how do you make it accessible. Third, data maintenance and retrieval should be cheaper.
... We have Web 2.0, now 3.0. We cannot do 4.0 without LD

??: Emergency cases can use LD across languages, for example a building is on fire and the cleaning lady doesn't speak English

Arle: What do we need from Localisation service providers. We need the involvement of various communities, we want to bring them together.

??: it's not the localisation service providers' / tool creators' fault if prices go up. They have to be aware that the industries are not willing to pay more for fragmentation changes

DaveL: What is the motivation is a good question. We can publish stuff, but making the links is the important point. Maintaining links could be expensive. We have to make LD useful. Data warehouses are a good example, because they found a way of making it profitable.

dF: data hygiene is expensive. Back to the LSP: they serve publishers but there is a disconnect between the two.

fsasaki: data needs to be transformed. Bring it to an XML format, then to an HTML format. You run into issues. The key issue of LOD in the diagram (DaveL presentation): is it only an interoperability layer or is it a layer to organise the workflows?

dF: use cases are very important. The incentives are there, they are signs of sustainability. Often there is no consensus in the industries or no implementation commitment. Emphasis should be on real use cases.

??: two motivations for LOD movement: government accountability and responsibility. And for that you need links.

scribe: The public sector uses LOD. There is the UK crime map; on a heat map you can see crime. The police are the biggest users; they save money by accessing their own data.
... publishing data is expensive, but cheaper than the alternative

Thierry: you need to publish documents in the public sector, but they are neither compatible nor comparable. Linking data is best.

.. It's cheaper to have only one terminology database where others can link to. In the medical domain they are pushing that. We should talk to those communities to see how it's working for them

SebastianSklarss: micropayment is missing for LOD. If it is made for machines, we need a mechanism to make machines pay for it.

dF: micropayment is feasible.

.. LOD should be paid for by the government. It's a kind of utility, paid for by taxes. We need to influence the policy makers. If they don't see it, then we must go for market sustainability.

DaveL: two parts - gain public sector support with very clear use cases for the public sector. On the other hand, adding value (e.g. data warehousing) for the private sector

Pedro: the amount of LOD is so huge that you cannot control it by one body or organisation

DaveL: What we need to look for is use cases.
... Are there people in the room interested at looking at these issues?

fsasaki: would like to add another use case. Processes that data goes through.
... define provenance and the state of the translation. It's already done, but expensive. Maybe we can make it cheaper

<fsasaki> fsasaki: see also the linked open data working group - co-chaired by IBM with one motivation: making software development processes easier to organize, more sustainable etc.

Nicoletta: small use case: can we link what we have in META-SHARE with the linguistic data? Otherwise we risk running in different directions.

.. One of the problems in linking the cloud is the double meaning of "data": there's the data itself, and the language resources we use to work on the other data.

DaveL: [ends session; thanks participants]

.. Let's talk tonight and in the next couple of days in more detail. We should use the opportunity to move things forward. The use case focus is really important.

Kimmo: closing remarks

.. important comments on government involvement. We do not need to standardise more. I say no first to test you.

.. we need to know: What do you want us to standardise? What has consent?

.. LOD is a buzzword that is used in a rather sloppy way. We have obsessed over LOD, but it's true that LD is equally linked.

.. Business models - e.g. micropayment - are an important point. We need payment mechanisms for small amounts per transaction

.. a sort of application store in a data environment

.. "What data should be linked and why" is a good question and we should test ourselves. If we don't know maybe then it's not a good idea

.. The importance of the event is discussions. And the documentation will be available (scribes, presentations, ...).

.. We have linked the constituents, but it may take several years before things get moving
... in the future we'd like to co-locate this event

DaveL: [closes today's event; thanks participants]
... Videos, presentations, transcripts will be available

<fsasaki> cfp for localization world event is here http://www.localizationworld.com/lwseattle2012/feisgiltt/

DaveL: [thanks to organisers of event]

<fsasaki> meeting adjourned

[End of Minutes]