MultilingualWeb Workshop, Madrid 2014 - Day 2

Note: The following notes were made during the Workshop or produced from recordings after the Workshop was over. They may contain errors and interested parties should study the video recordings to confirm details.

Agenda: Agenda
Chair: Arle Lommel (DFKI) and Felix Sasaki (DFKI, W3C)
Scribes: Felix Sasaki, Roberto Navigli, John McCrae, and others

Topics
Users

Machines Part II

Scribe: fsasaki

Chair: Hans Uszkoreit

Seth Grimes, "Sentiment, opinion, and emotion on the multilingual Web"

seth: consulting company in Washington DC, see details on my web site
... sentiment analysis, innovation - innovation is often coordinated from community efforts like multilingual web
... overview of my presentation: sentiment topic from view of "big" data
... 4 types of data: machine data, some through interactions with humans,
... profile (e.g. individual, demographic, ...)
... and media: text, audio, images, video
... and two super types: facts, feelings
... about feelings: sentiment analysis = computational study of opinions, sentiments, emotions, expressed in text (not only)
... again from "Bing Liu, NLP handbook": definition of sentiment value as a quintuple
... questions for business: what are people saying? what is "hot topic"?
... has opinion about xyz evolved?
... example from the Wall Street Journal of sentiment analysis
... levels of sentiment analysis: corpus data space, document, statement, ...
... +/- polarity is too simple
... for sentiment + emotion
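
The quintuple mentioned above can be sketched as a small data structure. In Bing Liu's definition an opinion is (entity, aspect, sentiment, holder, time). The class name `Opinion`, the field names and the sample values below are illustrative assumptions and are not taken from the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """One opinion as a quintuple (field names are illustrative)."""
    entity: str     # what the opinion is about, e.g. a product
    aspect: str     # the feature of the entity being judged
    sentiment: str  # a label; not restricted to +/- polarity
    holder: str     # who expresses the opinion
    time: str       # when it was expressed (ISO date string)


def sentiment_counts(opinions, entity):
    """Count sentiment labels for one entity across all its aspects."""
    return Counter(o.sentiment for o in opinions if o.entity == entity)


# Hypothetical sample data, for illustration only.
SAMPLE = [
    Opinion("phone-x", "battery", "negative", "user1", "2014-05-01"),
    Opinion("phone-x", "screen", "positive", "user2", "2014-05-02"),
    Opinion("phone-x", "screen", "delighted", "user3", "2014-05-03"),
    Opinion("phone-y", "price", "positive", "user1", "2014-05-03"),
]
```

Keeping the sentiment as a free label rather than a sign leaves room for emotion categories, in line with the remark that +/- polarity is too simple.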

<lupe> sentiment analysis is full of jargon and "noise"

seth: prediction of feelings and intent. example: distinguish between predictions, feelings, wishes

<lupe> proposal of turning the quintuple model into a seven parameter model

seth: sentiment analysis beyond text: e.g. facial image recognition for analysing emotions, or audio streams
... sentiment does not translate 1:1 between languages
... W3C work on sentiment recommendation: emotion markup language spec, not (yet) applied in industry

Asunción Gómez-Pérez, "The LIDER Project"

asun: how linked data is used for language technology - example:
... person is searching with term "red"
... in various data sources the person can find information about "red"
... in some sources she can find also translations of the term
... or synonyms
... if a person is looking for linguistic information, she finds several pieces of information in several places
... so information is complementary but not connected
... so idea of LIDER is how to connect such resources
... heterogeneity of linguistic resources
... there is ecosystem of resources, high quality, curated, including clear IPR rights
... problem is: trying to use language resources is hard because they are in several places
... there are several formats, APIs to access them
... idea of LIDER: linked data would allow transforming resources into one format to allow uniform access
... transformation into RDF of language resource content + language resource metadata
... instead of having 5,6,7 independent resources after transformation we have just one unified resource
... linked data cloud: has data in many domains
... linguistic resources could be just one domain in the linked data cloud
... one issue: how to establish relation between domain knowledge and linguistic knowledge in the cloud?
... linguistic linked data cloud: set of LOD, linguistic domain
... resources in RDF and interconnected with other LD resources
... requirements on resources, coming up via discussions with META-SHARE people: keep track of license, provenance, use of resources
... such requirements came up via discussions with META-SHARE + ELRA
... uniform access to language resources
... representation via RDF, query via SPARQL
... idea of linguistic linked licensed data
... representation of lexica in RDF: using lemon
... for provenance the W3C provenance ontology
... for licenses ODRL
... for corpora: NIF
... what is LIDER about?
... about resources (see above) and how these can be used for NLP tasks relevant for content analytics
... linguistic data represented as RDF - not in this project but in a few years this may help such tasks
... activities in LIDER: technical. Analysing what extensions are needed for doing content analytics
... for that we need vocabularies
... for metadata and the resources themselves
... then guidelines for helping transformation into RDF
... then: identify NLP services that may use the resources
... also a reference architecture (not implementation) of working with the resources
... the work will be done in the LD4LT group http://www.w3.org/community/ld4lt/
... started in LIDER internally, now done in the public + open LD4LT forum
... idea is now to come up with a vocabulary describing the linguistic resources
... vocabularies, roadmap, guidelines, reference architecture will seed the LD4LT group
... and then will be discussed in the LD4LT group
... there are other W3C groups related to the topic
... BPMLOD, started after Rome MLW event http://www.w3.org/community/bpmlod/
... and ontolex working on lexicon topic http://www.w3.org/community/ontolex/
... currently LD4LT group has 62 participants, 1/2 of them outside of LIDER
... please join the group to contribute to the discussion
... and join the challenge!
... busy agenda of activities
... ask for input - please fill in the questionnaire today to win a nice prize :)

Martin Brümmer, Mariano Rico & Marco Fossati, "DBpedia: Glue for all Wikipedias and a Use Case for Multilingualism"

martin: dbpedia = extracting knowledge from wikipedia
... mapping dbpedia data into linked data
... result = very large multilingual knowledge base that provides a common structure over different language versions of wikipedia
... organisation of dbpedia
... organised in chapters
... they are maintaining their specific version
... supported by the newly founded dbpedia association
... lod cloud picture. dbpedia is in the centre
... chapters are relatively independent. Today's speakers in this presentation are from three different chapters
... example industry use case from ULI (Unicode Localization Interoperability technical committee)
... we are compiling abbrev. from dbpedia
... the abbrev. could be confused with sentence boundaries or other text segmentation items
... we do this since ULI is using abbrev. as exceptions in the rule based text segmentation tools
... they need more data to do that
... we do this with dbpedia by extracting the abbrev. and modeling them as lemon dictionary entries
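
The use of abbreviations as exceptions in rule-based segmentation can be sketched as follows. This is a minimal sketch and not ULI's or DBpedia's actual implementation: the abbreviation list, the single break rule and the function name are assumptions for illustration.

```python
import re

# Hypothetical exception list; in the use case above it would be
# compiled from DBpedia rather than written by hand.
ABBREVIATIONS = {"dr.", "prof.", "e.g.", "etc."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Break after ., ! or ? followed by whitespace, unless the token
    ending at that punctuation mark is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        end = match.start() + 1  # position just after the punctuation
        last_token = text[start:end].split()[-1].lower()
        if last_token in abbreviations:
            continue  # exception: do not treat this as a boundary
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With the exception list, "Dr. Smith arrived. He spoke." yields two sentences; with an empty list the rule wrongly breaks after "Dr.", which is the confusion described above.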

Marco: talking about italy chapter
... industry use case with italian startup
... we leverage dbpedia to build gazetteers of real world entities
... gazetteer = both language and domain specific
... example of music domain. storing music related entities allows to do queries like: give me all musicians from area xyz
... example open government http://opencoesione.gov.it
... contains entities like companies, organisations
... dbpedia helps with enriching + interlinking
... improves browsing capabilities
... digital library area example: florence digital library
... and last scenario: data driven journalism
... journalists query dbpedia to tell story, build infographics, do fact checking etc.
... now different topic: event that just has ended: Italian dbpedia mapping sprint
... mapping italian data to dbpedia ontology
... italian data is language + culture specific
... dbpedia aims at representing the whole world
... the dbpedia harmonizes data from different chapters
... we did a hackathon on this

Mariano: now spanish dbpedia
... in wikipedia + dbpedia lots of languages
... we encourage you to have your own chapter
...
... spanish dbpedia has both spanish + other language information
... location of queries
... from spain but also many other places
... 70% from spanish language browsers
... but also english and other languages
... SPARQL queries made:
... there are some special days with a lot of queries
... "IP monsters" that can put the system down
... lessons learnt: you have to take care of "IP monsters"
... and noise generators

Jorge Gracia & Jose Emilio Labra, "Best Practices for Multilingual Linked Open Data: a Community Effort"

jorge: web of data is more and more multilingual
... but so far still many data sets are monolingual
... e.g. language tags are still underused
... many design decisions need to be made concerning multilingualism: select right vocabularies, generate RDF, publish data, ...
... example bilingual dictionary - what to do to move this to linked data?
... or: existing ontology: how to localise this in my own language?
... our group wants to give answers to some design issues
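
To illustrate why language tags matter when consuming multilingual data, here is a minimal sketch of picking a label by language with a fallback. The `(text, language tag)` pairs, the function name `pick_label` and the sample labels are assumptions for illustration; in RDF the same information would be carried by language-tagged literals.

```python
def pick_label(labels, preferred, fallback="en"):
    """Return the first label whose language tag matches a preference.

    labels    -- iterable of (text, language_tag) pairs
    preferred -- language tags in order of preference
    fallback  -- tag tried last; None is returned if nothing matches
    """
    # Language tags are compared case-insensitively; the first label
    # seen for each tag wins.
    by_lang = {}
    for text, lang in labels:
        by_lang.setdefault(lang.lower(), text)
    for lang in list(preferred) + [fallback]:
        if lang.lower() in by_lang:
            return by_lang[lang.lower()]
    return None


# Hypothetical labels for one resource.
LABELS = [("red", "en"), ("rojo", "es"), ("rot", "de")]
```

Without language tags on the data, a consumer has no reliable way to make this choice, which is one reason untagged data sets stay effectively monolingual.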

about http://www.w3.org/community/bpmlod/

jorge: chaired by jose, John, myself
... relation to other groups
... the BPMLOD started to identify use cases, but this is now done by LD4LT
... the ontolex group: we take their lexicon / ontology spec and will produce BP for using lemon
... BPMLOD is also data on the web
... so we can take ideas from the W3C "data on the web" best practices group and input to them aspects of multilingual data on the web
... activities so far: topic classification, use cases, patterns, best practices & guidelines
... currently we are discussing patterns for creating guidelines
jose: topics for BP
... use cases: localisation workflow, lexicalisation of RDF data sets
... localisation of ontologies
... etc.
... now working with patterns
... sometimes difficult to establish difference between patterns and best practices

jose discussing various patterns for naming

jose: people in this workshop could provide many nice use cases, please think of contributing!

Hans Uszkoreit, "Quality Machine Translation for the 21st Century"

hans: presenting on behalf of Josef van Genabith
... about single digital market
... a big topic in Europe
... ecommerce in Europe still has barriers
... many European companies support languages that are not in the list of 24 official languages
... there are still language boundaries
... EU commits to multilingualism
... but even they cannot master the challenge - not everything can be translated
... in the commercial world it is even much more fragmented
... uneven distribution of language technology support for European languages
... quality of MT technology in particular: same issue
... example german: bad as a target language
... works reasonably as a source language
... in addition to general MT problem there is quality problem
... some things cannot be done by either rule-based or statistical systems
... there is MT for lots of languages without post editing
... example news papers: sometimes only 6% can be used without any post editing
... take any journal of computational linguistics: gap between a few and the under resourced languages is widening
... difficult languages: morphology rich languages
... free word order languages
... then there is lack of training and resources
... qtlaunchpad project
... aim was: what should be done next in MT?
... pay attention to language pairs
... sometimes it is specific to linguistic characteristics of languages
... sometimes it depends just on resources
... then there is what projects are funded:
... a long time progress on gist translation was funded
... and content translated into English
... but we are mostly concerned with translations from English + high quality
... working with different algorithms + more features: tweaking them is rather alchemy than science
... e.g. "let's see if hierarchical works for this language pair"
... so now more work on this in a systematic way
... so now more work together with translators themselves
... that has started with people present here and many other companies + organisations
... our new project proposals reflect that
... about quality: MT is measured with e.g. BLEU score but in translation industry that does not fit
... so we are doing work about a quality model that unifies both worlds, see work presented by Arle on MQM (Multidimensional Quality Metrics)
... the QT21 proposal work plan reflects these areas
... and also aims at building a bridge between research and application areas
... team of QT21: many research centres, organisations like GALA / FIT / TAUS, companies, DGT (associated)
... research is prepared for solving the problems
... this community around QT21 also is ready to combine efforts with the data + multilingual web represented here
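
The BLEU score mentioned above compares n-grams of the MT output with a reference translation. The following is a simplified sentence-level sketch (single reference, unigrams and bigrams, no smoothing), not the full metric as used in evaluation campaigns; the function names are illustrative assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalise candidates that are shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

A score of this kind reports how much of the output overlaps with a reference but not which kinds of errors occurred, which may be one reason it fits poorly with quality assessment as practised in the translation industry.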

Machines Part II Q&A

dave: how to get e.g. MT researchers to use linked data source?

hans: that is one of the crucial things, like to hear other thoughts as well
... example: a student did research on WMT data
... evaluation results for MT
... all of that data is still in text files
... to e.g. query things like: how does language xyz with MT approach ... perform? cannot be done without looking into the text file

asun: for people working on MT
... if you provide use cases for us to analyse
... we could look into that: what are the tasks in MT that could benefit from linguistic resources in linked format?
... we need also cross-education
... the LD people need to learn about MT
... and the other way round
... we need some kind of common actions

hans: agree
... and there is area of community driven efforts
... e.g. the R&D area of wikipedia
... that area needs to come together
... bootstrapping of such community efforts with linked data + MT

tomas: one should try to address multilingual issues in the general context of the web
... there is the w3c "data on the web best practices" group
... one should look into that too

marco: we see dbpedia as the best approximation of human knowledge
... e.g. schema.org has a different purpose

kerstinSteffen: interesting to see that MT has different criteria
... could this have an effect on ecommerce for certain languages?

hans: spanish will become one of major languages because of growing importance of latin america
... but see also korean, chinese, ... lots of unsolved problems

peterSchmitz: have you considered ADMS for describing linguistic resources?
... you talked about dcat - also thinking about adms?

jorge: we are exploring + analysing the different metadata vocabularies

<fsasaki_> jorge: we are currently discussing what vocabularies to take into account, will look into that

<fsasaki_> hans: wrapping up session

<fsasaki_> .. we did not have time to discuss everything

<fsasaki_> .. we sometimes get imperfect things: wrong translation, imperfect semantic tagging, imperfect sentiment analysis

<fsasaki_> .. we need to have a bootstrapping approach, tagging things as uncertain, and allowing technologies to make things more certain

Users

Scribe: lupe

Chair: Thierry Declerck

Pedro Diez-Orzas, "Multilingual Web: Affordable for SMEs and small Organizations?"

<lupe> most Language Service Providers (LSP) are SME

<lupe> Global and regional markets are multilingual

<lupe> The number of languages a SME can translate is important but there are other issues

<lupe> for instance, web content treatment

<lupe> in which formats?

<lupe> how users and localizers relate?

<lupe> SME represent more than 90% in different stats, and they give jobs to more than 80% people

<lupe> there is a great difficulty to find well prepared people (translators) for the technological environments

<lupe> several challenges: adapting cost to market? and others

<lupe> SMEs need win-win business

<lupe> SMEs need more people creating content management

Don Hollander, "Universal Access: Barrier or Excuse?"

<TomasCarrasco> No standard for Multilingual Web Sites Community Group (MWS) - http://www.w3.org/community/mws

<lupe> technologies accepted, browsers, search engines, email, ecommerce, social media, mobile devices...

Dennis Tan, "Internationalized Domain Names: Challenges and Opportunities, 2014"

<lupe> what is the state of IDN today?

<lupe> 36 IDNs ccTLDs

<lupe> in 2014, over 100 new gTLDs

<lupe> starting Japanese TLDs

<lupe> as well as Chinese

<lupe> Idn usage, and how we can move usability?

<lupe> 70% of the .com websites are in English

<lupe> but 50% of the population live in non-English speaking countries

<lupe> As for Content Language Analysis there is a higher percentage of IDNs as redirect links (see graphics)

<lupe> so let's go into a multilingual experience

<lupe> what is perception: reality

<lupe> IDNs in applications, mobile applications have poor support

<lupe> email and search engines also have poor support for IDNs

<lupe> searchers expect to have it in local language, not aware of IDNs

<lupe> call to ACTION: from registries to registrars, ..

<TomasCarrasco> Also Internationalized Resource Identifiers (IRIs) - http://tools.ietf.org/html/rfc3987
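
For applications that handle IDNs, the conversion between the Unicode form and the ASCII-compatible (Punycode) form can be sketched with Python's built-in `idna` codec. This is a minimal sketch: the codec implements the older IDNA 2003 rules, and the helper names and the sample domain are illustrative assumptions.

```python
def to_ascii(domain):
    """Convert a Unicode domain name to its ASCII-compatible form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain):
    """Convert an ASCII-compatible domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")
```

For example, `to_ascii("bücher.example")` gives `xn--bcher-kva.example`, and `to_unicode` reverses it. An application with poor IDN support typically shows or stores only the `xn--` form.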

Georg Rehm, "Digital Language Extinction as a Challenge for the Multilingual Web"

<lupe> Kornai's division of lang in extinction

<lupe> from digitally thriving langs to moribund/dead langs

<lupe> In the scope on Metanet, Metavision, Metashare and Metaresearch

<lupe> Different langs papers have been produced (31 languages)

<lupe> Cross lingual comparison: MT text analytics, Speech recognition, and resources

<lupe> but there are great differences between EU languages and others

<lupe> at least 21 European langs in danger of extinction

<lupe> Some questions were posed in parliamentary sessions

<lupe> Different community bodies were invited to participate in the study EFNIL, NPLD, Council of Europe

<lupe> to fill the gaps of the langs missing

<lupe> See the table provided with the comparison of languages, technologies and resources

<lupe> Document produced: Strategic research agenda for multilingual Europe 2020

<lupe> some priority themes: translingual clouds, tools and technologies to overcome language barriers

<lupe> see META-NET SRA

<lupe> H2020-ICT-17 "Cracking the language barrier" already mentioned in the workshop

<lupe> MT considered an obligatory component but not ripe yet for production

<lupe> CRACKER : mentioned yesterday, see slides

Tatiana Gornostay, "Towards an open Framework of E-services for Translation Business, Training, and Research by the example of Terminology Services"

<lupe> Towards an open FW of E-services for Translation Business..

<lupe> Some considerations on Open Framework of e-services for translation:

<lupe> open standards, reusable & flexible GUIs

<lupe> H2H

<lupe> Translation important in Content management process

<lupe> this framework has a cross sector view: from business to academia, freelance..

<lupe> Another step in the FW is terminology

<lupe> it's really at the heart of successful communication

<lupe> but consistency should be ensured

<lupe> this can be reached with cross-disciplinary standards

<lupe> and recommendations

<lupe> how can terminology as a service be considered?

<lupe> in searching, identifying, extracting, visualizing, publishing,...

<lupe> see termunity.com

<lupe> TAAS: terminology as a service: is free, provides cloud services for terminology work

<lupe> it works with different langs: Latvian, Polish, Russian, English...

<lupe> TAAS accepts 14 formats: pdf, doc,...

<lupe> recall and precision can be measured

<lupe> many external DBs are used in the lookup

<lupe> See the demo at https://demo.taas-project.eu/projects

<lupe> TBX, TSV and CSV can be the export formats
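
Precision and recall for term extraction, as mentioned above, can be computed against a manually validated term list. A minimal sketch, assuming both the extracted terms and the gold-standard terms are available as sets of strings; the function name and any sample terms are hypothetical and not part of TaaS.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted terms against a gold list."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)  # terms found in both sets
    # Precision: share of extracted terms that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: share of gold terms that were extracted.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

If a tool extracts four candidate terms, two of which appear in a gold list of three terms, precision is 2/4 and recall is 2/3.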

Users Q&A

<lupe> How can one deal with funny letters in the IDN?