W3C

MultilingualWeb Workshop, Madrid 2014 - Day 1

07 May 2014


Note: The following notes were made during the Workshop or produced from recordings after the Workshop was over. They may contain errors and interested parties should study the video recordings to confirm details.


Agenda
Chair
Arle Lommel (DFKI)
Scribes
Felix Sasaki, Roberto Navigli, John McCrae, and others

Contents

Welcome and Keynote

Introduction to Workshop

Scribe: Felix Sasaki

Introduction to the workshop from Félix Pérez Martínez, Director of the Escuela Técnica Superior de Ingenieros de Telecomunicación de la UPM (ETSIT UPM)

Now an introduction from Victor Robles Forcada, Director of the Escuela Técnica Superior de Ingenieros Informáticos de la UPM (ETSIINF UPM)

Now an introduction from Arle Lommel, DFKI

Alolita Sharma "Keynote: Multilingual User Generated Content at Wikipedia scale"

Alolita asking who is using and editing wikipedia - many hands go up

Alolita: wikipedia is a resource for the multilingual web
... I want to talk about the scale of multilingual content in wikipedia
... more than 30 mill. articles
... covering 287 languages
... represents diversity on mlw, a trendsetter in contributing content on the Web
... 4.5 mill. articles in English, the largest language on the web today - we would like all other languages to have as much content
... German, Dutch, French, Swedish also have many articles
... happy about the energy in the EU about multilingualism
... if you look at the rest of 287 languages - anything from 100 000 articles to 1 mill articles in 37 languages
... and 73 languages with less than 100 000 articles
... so numbers are dramatically different
... incubation / inception of new languages - but they are so small compared to large languages
... so 287 languages is a nice number, but the demographics of contributions in the content community are imbalanced
... how do we change this?
... this is what the MLW community is looking to change, and we do too
... we have 500 mill. users using wikipedia every month
... out of that, 21 mill. pages being read
... mobile: six months ago it was 4 mill - now rising dramatically
... mobile web is an important trend also how user contributions come in
... via tablets + mobile platforms
... from traditional web browser on desktop
... Europe is on top for consumption of wikipedia content
... asia pacific (China, Japan, Korea, ...) also many users, also US / Canada
... these are the three largest geographies consuming wikipedia content
... in north america: line is steady
... as people get broadband access
... that changes the dynamics of content consumption + production
... take china: we are blocked and there are local competitors
... we can be accessed e.g. from hong kong
... but we'd like to get the access in China too

<TomasCarrasco> Poster - Big Multilingual Linked Data (BigMu) - http://dragoman.org/bigmu

Alolita: in Tokyo on the train: people have at least 4 devices simultaneously!
... so maybe small number of users, but many access points to content at the same time
... so mobile is growing heavily
... this is the platform of consumption where you need language technology + standard support
... some numbers on wikipedia articles
... separating by language, article numbers, ...
... take example of catalan: small language but very passionate community, contributing web content
... many non latin languages come into top 25 of wikipedia
... stats.wikimedia.org gives you details on these and other statistics
... early adopters of user generated content are mostly Europeans
... next generation: first time broad band / mobile devices, did not have desktop access but mobile directly
... example: growth in Indic languages, or Arabic content, e.g. Farsi, then CJK languages
... long tail languages, e.g. native american languages or also catalan
... example newari, language in nepal
... it is 2nd largest wikipedia in India after hindi
... so passion of communities is a driver for user content creation on the web
... important that the information on wikipedia is free, otherwise uptake would not happen
... sharing information from users + organizations, so top down + bottom up
... needs a critical mass
... the "tablet as a platform" for content consumption for the next years in asia and other regions
... don't have any browser yet "speaking" all languages
... and other challenges for web in mobile
... user experience often is 2nd class for non-latin languages for several browsers + devices
... lack of high quality content and reference data
... this is highly related to what we are discussing at this workshop
... we want high value content to be created, but how should that happen
... and make available language data, dictionaries etc.?
... what is wikipedia doing in this space?
... traditionally we have handled wikitext as XML, now we move to HTML
... we look at language selection mechanisms, web fonts

<chaals> [Big up to Wikipedia making Web fonts for scripts with few fonts (or none!). It is a really good thing to do…]

Alolita: these are important for consumption
... input methods
... grammar, plural and gender for internationalization in javascript / php
... content translation integrated in wikipedia
... using machine translation, translation memories, dictionaries, glossaries, wikidata
... content translation is for us a new very interesting topic that I'd like to discuss esp. at this event
... a difficult area since there are not enough tools to support languages beyond the top 5
... we need to have good machine translation, TMs, dictionaries, glossaries etc.
... example: suggested translation to help generation of new content
... today content contribution on wikipedia is a very difficult process
... if somebody is doing a translation there is no unified set of tools
... even to take content across European languages, and the problem gets harder with other languages
... where are we heading: generate rich high-quality content
... deliver multilingual user experience
... be mobile and everywhere
... commoditize language software and make it openly available
... and keep the web open and free
... we need to collaborate more - all of you are thinking about multilingual web components
... we cannot build all these tools unless you collaborate
... language technology is not only google, facebook, universities etc. - we need collaboration among such stakeholders
... in your ecosystem: try to make content available
... you have to have seed language applications to enable content contributions
... we have several contributions to this event, see the current slide, so you will hear more about this in other contributions

Keynote Q&A

Hans Uszkoreit: bootstrapping for multilingual content creation is important
... see e.g. thai wikipedia creation via machine translation
... setting up workflows for this bootstrapping including the edits in the original language etc.: is there anything of that sort? do you have the money to do that?

Alolita: we are looking into bringing edits in one language into other languages
... it is very complex, as with your example of the Thai wikipedia
... you want to have that cycle of content being merged back in
... we are trying to keep the problem simple
... first we want to tackle the problem of content creation
... before tackling the "bringing edits back" problem
... another example "translators without borders"
... they get content created but they are unable to sync that back into wikipedia
... we'd love to get that back into languages that are now poorly covered

Hans Uszkoreit: would it be possible to see what has been translated by MT, e.g.?

Alolita: yes, we are planning to provide such kind of information

Thierry: question on wiktionary data

Alolita: just discussing how to use wiktionary data in the translation process
... currently there is no API to work with wiktionary data in wikipedia
... we need to work to make wiktionary data machine-consumable

David Chan: about wikipedia + wiktionary:
... english wikipedia contains a lot of content in other languages + scripts
... both for English wikipedia + wiktionary
... so en.wiktionary is not English only content

Alolita: we have wikidata project
... in which we are trying to create structured data

Seth Grimes: people working on MT + parallel corpora
... there are many wikipedia pages that were translated without correction
... so eating garbage from one MT system to create another one
... can you quantify the number of MT pages in wikipedia?

Alolita: we prefer not to have MT pages at all
... the community will revert pages if such pages are avail.

Seth Grimes: mechanisms to detect MT translated content?

Alolita: working on that
... e.g. looking at number of edits by users
... this algorithm is not in production yet

Roberto: the vision of the number of resources wikimedia foundation is working on?
... e.g. wiktionary, wikidata, ...
... wiktionary is a big project with duplication of efforts but there is no clear structure
... so if you link to an entry you link to the entry as a whole not the sense
... so what is the general vision of the wikimedia foundation and how wikipedia can play a role?

Alolita: there are many projects and these are user generated too
... see e.g. wikimedia commons on images
... which has a major impact on images on wikipedia multimedia
... there is wiktionary for dictionaries
... then e.g. wikisource, wikiversity, ...
... if you go to wikimedia foundation you will see the whole list
... 10 projects running in parallel
... all created by users
... wikidata is a very crucial project for us
... so far users have created content with templates
... and use wikitext
... this has organically grown without much structure + efficiency

<TomasCarrasco> Working on CAT tools - Multilingual Electronic Dossier (MED) - https://joinup.ec.europa.eu/site/med

Alolita: wikidata is our first attempt to take the user generated content and generate structured machine consumable content
... wikidata is in its infancy
... a lot of work has to be done
... see e.g. the infoboxes as an example
... that provides info like "this is a person" etc.
... that is starting to get generated from structured content
... this is happening step by step
... we are planning to take other parts of a page later
... that is the beginning of supporting that infrastructure
... we are collaborating with other organizations all the time
... it cannot be done by us - we need others, so please let's join

David Chan: there is a chance for you to provide information
... we have no work yet to move rdfa data into wikidata
... or no provenance information to identify "who has translated s.t.", but we are working on that

Alolita: we also have grants that can support such work
... but we do work on a collaboration model

Dan Tufis: about post editing - do you consider it is possible to have multiple versions of the same source document edited differently?
... do you think about trying to rank various versions of post edited versions?

Alolita: today we don't rank post edited versions
... we do look at the quality of the language of the article
... the formula is simple: the editors form a community of interest
... metrics of what is good are self-governed
... that has worked well for the top 10 languages
... in smaller languages the story is different
... because the model is driven by interest and editors
... that is the case e.g. by some English articles
... and becomes a bigger issue in the smaller languages
... that is just user behavior
... we are looking into developing tools to detect e.g. robots doing edits

arle: thanks a lot for the presentation and q/a, now coffee break

Developers

Scribe: Dave Lewis

Chair: Philipp Cimiano

Pau Giner, David Chan & Santhosh Thottingal, "Best Practices on Design of Translation"

Santhosh describing technology components for translation

e.g. content translation mediawiki extensions

pau: avoid repetitive steps, e.g. find equivalent links via wikidata
... we can provide initial automatic translation and keep track how this is modified by the user
... if you just add automatic translations we can give you a warning
... described node.js MediaWiki components for MT, TM, dictionary, link adaptation
... experimenting with free MT, including open source e.g. Moses
... front end supports handling quality of MT content, marking, reviewing
... more detail on media wiki wikipage on Content Translation

<chaals> Content Translation page
... universal language selector already available - URL on slides
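
[Not from the talk - an illustrative sketch of the "equivalent links via wikidata" idea above: look up the Wikidata item for a source-wiki title and return the sitelink for the target wiki. The wbgetentities parameters follow the public Wikidata API as understood here; verify them against https://www.wikidata.org/w/api.php.]

# Sketch only: adapt a wiki link from one language edition to another via
# Wikidata sitelinks. Parameter names follow the public Wikidata API as
# understood here; verify against https://www.wikidata.org/w/api.php.
import json
import urllib.parse
import urllib.request

def adapt_link(title, source_wiki="enwiki", target_wiki="eswiki"):
    """Return the target-wiki title equivalent to `title`, or None."""
    params = urllib.parse.urlencode({
        "action": "wbgetentities",
        "sites": source_wiki,
        "titles": title,
        "props": "sitelinks",
        "format": "json",
    })
    url = "https://www.wikidata.org/w/api.php?" + params
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    for entity in data.get("entities", {}).values():
        sitelink = entity.get("sitelinks", {}).get(target_wiki)
        if sitelink:
            return sitelink["title"]
    return None

# e.g. adapt_link("Dublin", "enwiki", "cawiki") should give the Catalan title.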

Feiyu Xu, "Always Correct Translation for Mobile Conversational Communications"

... translation for mobile users

... needs language knowledge and context
... Yocoy will shortly release a mobile application with 'always correct' translation of millions of phrases for travellers

... face to face communications in situation-based dialogues

... currently 5 languages, all directions

<TomasCarrasco> http://en.wikipedia.org/wiki/Hungarian_Phrase_Book

... uses conversation-based templates

... spoken input

... can include images

... minimal/distant supervised learning based on monolingual input

... avoids the need for parallel text for training

Richard Ishida, "New Internationalization Developments at the World Wide Web Consortium"

... some issues:

... Justification in CSS for non-English scripts

... e.g. arabic justification, where you stretch words rather than spaces between words

... using the special 'tatweel' Unicode character

... but the rules are complex, with many dependencies, which makes the tatweel not ideal

... But also there is lack of consensus on word vs space stretching that need to be resolved.

... Inter-letter spacing, aka 'tracking'

... e.g. in Thai where there are spaces within unicode characters

... require recoding of characters.

... e.g. Indic letter spacing - grapheme clustering. Again complex with font dependencies.

... What is the i18n activity doing about this?

... Japanese layout requirements is 'flagship'

... working on Korean layout

... starting on Chinese, perhaps including musical notations

... interest in other script including mongolian

... Also, predefined counter styles, around 120 styles to paste into CSS

... looking for input, both general review of layout specs and typographic expertise

... input into digital publishing and its cultural representative

Charles McCathie Nevile, "Multilingual Aspects of Schema.org"

... Yandex has MT for 42 languages

... but here as sponsor of schema.org

... Schema.org meant for general developers vs. linked data

... meta-data for search engines - an attempt in the 90s failed due to spam

... still not much better at detecting spam.

... can use microdata, RDFa, JSON-LD

... happy to add other syntaxes

... a flat vocab

... tech discussion in public, but owned by the 4 search engines

... significant uptake of 10% on public content

... needs improvement, including multilingual

... Issues:

... capitalisation in property and entity names,

... difficult to translate

... Other areas:

... sports is a big global topic

... e.g. info for cricket very different to soccer

... developers bring their sporting upbringing in designing schema

... Another is 'actions', as things you can do, e.g. on a web site

... difficult to even discuss terminology sometimes

... Another is microdata

... adoption was somewhat political, and spec was easier to translate for engineers
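
[Not from the talk - an illustrative schema.org item in JSON-LD, one of the three syntaxes mentioned above (microdata, RDFa, JSON-LD). The values are invented; "inLanguage" is the schema.org property for declaring the content's language.]

# Illustrative only: a schema.org description as JSON-LD. Values are invented;
# "inLanguage" is the schema.org property for the content's language.
import json

item = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Taller MultilingualWeb en Madrid",
  "inLanguage": "es",
  "author": {"@type": "Person", "name": "Nombre Ejemplo"}
}
""")
print(item["@type"], item["inLanguage"])  # Article es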

Developers QA session

<daveL> Hans, DFKI, asks: we have discussed the impact of Schema.org on the multilingual web

<daveL> ... if we want one schema in many languages, Schema.org is a good candidate

<daveL> ... would Yandex support this in W3C?

<daveL> Charles: schema is now a more community effort, but still controlled by the four owners

<daveL> ... but have been looking to grow this on W3C model

<daveL> ... and want to engage broader public involvement

<daveL> ... But don't aim for schema.org to be the 'one' vocabulary

<daveL> ... and would encourage use of linked data vocabs

<daveL> ... in summary: yes

<daveL> Charles asks: I have contributed to articles in wikipedia that exist in more than one language.

<daveL> ... can we use more than one source language article as sources in new language

<daveL> David Chan: current workflow is from one source to new target

<daveL> ... but would like to address such more complicated workflows

<daveL> Santhosh: new article is main use case, and synchronisation is key

<daveL> Joachim, Lionbridge: Will there be some linkage in source between sentences and linkages between languages in mediawiki?

<daveL> ... and to Richard: what level of collaboration is there between Unicode and W3C?

<daveL> Santhosh: to capture parallel text at sentence level is the question.

<daveL> ... currently show paragraph by paragraph, but do segment it for translation.

<daveL> ... so will try for this sentence level alignment and publish the resulting parallel data

<daveL> David: will also have sub-sentence alignment in cases of post-editing of machine translation

<daveL> Richard: do W3C and Unicode collaborate?

<daveL> ... Yes!

<daveL> ... several key people in both unicode forum and W3C and exchange spec to review

<daveL> Alex O'Connor, CNGL: meaning of a label is highly context dependent

<daveL> ... which is magnified in translation.

<daveL> ... So how do we specify context of meta-data schema?

<daveL> Charles: our approach is look where it works and fix it where it doesn't

<daveL> ... take as much context as possible, and look at what was successful

<daveL> ... for example, test by offering the option to try other search engines, to assess the effectiveness of our own results.

<daveL> ... Am a historian by training, and older translations don't stand up so well in modern context

<daveL> ... we don't have a magic solution, but massive data analytics is powerful

<daveL> Feiyu: no one solution; need to track context as much as possible.

<daveL> Philipp: session closed, thank you.

<daveL> Lunch Break

Creators

Scribe: fsasaki, evamen, DL, Lupe

Chair: Paul Buitelaar

Fernando Servan, "UN Food and Agriculture Organization"

... many monolingual sections
... basically English
... English is the language for sharing information internally
... in most cases English is the dominant language
... our members said we should provide gist translation to help users understand the content
... stakeholders are users, also government representatives, senior management, translation services, IT service, web service
... selection of solutions: 10 companies invited, 3 replied, 2 qualified
... 3 translators used per language, 800 sentences
... results exactly the same as per BLEU analysis

fernando describing the multilingual web site and the translation workflow

fernando: site has options for evaluation of results

fernando: feedback - sentence level in browser, page level in browser, user survey
... lessons learned: if pages are not designed for rendering other languages the evaluation is hard

fernando: e.g. text in Russian needs more space than text in Arabic or English
... so we may have problems putting together the translation in Russian
... Arabic has not only technological issues but also the issue that linguists don't agree on the proper translation
... since there are so many variants for arabic
... so there is complexity depending on the language, three languages are just a start
... need to find balance: when is translation widget used and when not?
... next step will be to add functionality to the official FAO web site

<evamen> perfect timing... more at web@fao.org

Marion Shaw, "Multilingual Web Considerations for Multiple Devices and Channels – When should you think about Multilingual Content?"

marion: how many people still hand us a website written in English, having just realized that there are many speakers of other languages to cover?
... how many languages should we look at? wikipedia 287, EC 24, UN 6, microsoft 59, ...
... 2.4 bill people use internet

<evamen> Wikipedia 287 languages and facebook 80... why the EC has 24 and the European Parliament 23?

<dF> Just FYI, Google has 60 [60 is the new 40 :-)]

marion: content language vs. user language
... content is mostly English, but only 1/4 of users speak English
... you need to decide who you want to go to when deciding the target languages
... translation vs. localisation
... if translation is done without taking localisation into account problems occur

<lupe> The localization problem is a challenge for SMT!

+1 to lupe

marion: need to translate with local markets + cultures in mind

<evamen> +1 localization is a challenge for everybody in this room... I guess

marion: terminology is very important and people tend to think about it last

<chaals1> See it in action and try it out at labs.fao.org

marion: after solving such localisation problems, how to keep pace with new developments?
... the workshop here is in a telecommunication school with lots of old devices exhibited - you realize how fast the development is
... also in the web, e.g. the change from Internet Explorer as the main browser to a diverse landscape
... we need to see: how to keep pace with this?
... also attention span: less than a goldfish's
... if people don't get the info in 3 seconds or less they go somewhere else

<lupe> one size does not fit all devices

marion: examples - site seen on laptop vs. on mobile phone, bad branding appearance on desktop or mobile device
... omni channel delivery - tell the system what the devices are and it decides what information to give to the device
... next step: development of devices will explode
... we have to try to squash content in different size + formats
... who is responsible for that: us, device manager, ...
... couple of considerations: content, how many languages, what (mobile) devices to cover


<lupe> here again different backgrounds can help to solve those linguistic and technological problems

marion: is defining web standards enough or do we need to talk to device manufacturers saying: changing your devices puts a huge burden on others?

Celia Rico, "Post-editing Practices and the Multilingual Web: Sealing Gaps in Best Practices and Standards"

celia: definition postediting:
... correction of MT generated output
... a bit of context for the presentation:
... I coordinated a business oriented research project at Linguaserve
... how to implement post editing at a language service provider
... we also evaluated the use of metadata provided by ITS ("Internationalization Tag Set")
... questions that arose: is there a real benefit of using standardized metadata in post editing?
... do sentences become less understandable via metadata, etc.
... post editing as a multilingual web enabler
... common sense advisory data: market per region 2013
... LSP is growing immensely
... factors that localisation vendors take into account: quality, use of translation system, translation memory, glossaries
... bigger volumes, faster turn around, cutting prices - we need to automate
... do we need a person to check if everything is fine?
... I say: yes. Call this post production / post editing
... four scenarios: no post editing at all. For internal content, browsing, gisting, ...
... then rapid post editing: urgent text (only serious errors fixed)
... partial post editing: minimum changes
... full post editing: complete revision
... what is in ITS 2.0 for post editing?

<lupe> ITS 2.0 adds value to content!

<evamen> Yes it does +1 ITS adds value to content!!

celia: ITS 2.0 adds value to content
... in the project we had a comprehensive list of information post editors should take into account
... a long "wish list" for post editors
... now taking guidelines + language dependent rules and see how to apply ITS 2.0 metadata ("data categories")
... in the Linguaserve project (EDITA) we looked at: how can we use an ITS2 data category for rule activation
... example "Translate" data category: can activate blocking for translation
... or "localization note": can e.g. provide sentiment information
... or "MT confidence": a post editor may say "above a certain threshold I will not touch the text"
... conclusion: 4 main aspects to take into account
... post editing is an enabler of multilingual content production, and ITS2 can help to facilitate clean + simple post editing
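
[Sketch only, not from the talk: one way to route machine-translated segments to the four post-editing scenarios above, driven by an MT confidence score such as the ITS 2.0 "MT Confidence" data category can carry. The thresholds are invented for illustration.]

# Sketch: route MT output to one of the four post-editing scenarios above,
# using an MT confidence score (e.g. carried as ITS 2.0 "MT Confidence").
# Thresholds are invented for illustration only.
def postediting_level(mt_confidence, published_content=True):
    if not published_content:
        return "no post-editing (internal content, browsing, gisting)"
    if mt_confidence >= 0.9:
        return "rapid post-editing (only serious errors fixed)"
    if mt_confidence >= 0.7:
        return "partial post-editing (minimum changes)"
    return "full post-editing (complete revision)"

for score in (0.95, 0.80, 0.50):
    print(score, "->", postediting_level(score))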

Gemma Miralles, "SEGITTUR"

Gemma: segittur is attached to ministry of industry, energy + tourism in Spain
... we work with regional areas to sell better destinations
... three major web sites: spain.info etc.
... budget: 200.000 Euro per year for translation
... translate into 25 languages, openCMS as CMS system
... 80% of web traffic comes from search engines
... spain.info is official site for tourism information
... three types of content: directories, promotional content, description
... site is translated into 18 languages, but fully translated only into 5 languages

spainisculture.com for culture promotion

gemma: points to improve: translations sometimes don't sound natural, menu meanings are incorrect, content sounds old, priorities for translation are not well defined, the UK version does not have the right words for a good position in search engines
... goals to achieve - well-planned content strategy
... easy to navigate
... readable
... everyday english or other language (plain language)

<lupe> Menus imply cultural gaps difficult to fill in translation

gemma: findable
... keeping brand / tone, be credible
... research every country where we want to sell
... take into account all search engines not only google, e.g. yandex for russia, baidu for china, ...

[chaals - is baidu interested in schema.org?]

gemma: geotagging for search engines
... take into account: language used, local links, local domains + hosting for selling in a given country

<chaals> [+1 "avoid complex sentence structure" …]

gemma: now new version for mobile - last year 12% of visitors
... different type of content: short, simple paragraphs, avoid complicated sentences and sub headings
... some quotes for finishing: much of SEO is common sense - coming up with useful content and services that have the words people search for (Matt Cutts, Google)

<chaals> [But don't just avoid complex language for mobile. Using clear language is not the same as dumbing down content. It is the same as writing *well*.]

gemma: speak the user's language, e.g. what they use for queries (source missed)

Alexander O Connor, "Marking Up Our Virtue: Multilingual Standards and Practices in the Digital Humanities"

<lupe> Quotation from Anne Wierzbicka 2011 about the division between the sciences and social science

<lupe> and the humanities

alex: is it digital humanities if I emailed it as a PDF?
... no - we say: there is a humanities enquiry that could not be done manually
... it is about capturing, analysis, storage, dissemination
... current key topic areas: text analysis, literary analysis, archives & repositories, data mining / text mining, visualisation
... digital artefacts: manuscripts, books, films, diaries, archeological finds, secondary sources

<lupe> digital artifacts range from manuscripts, books, films, diaries...

http://en.wikipedia.org/wiki/Roberto_Busa - father of digital humanities

scribe: what is a document? image of the document, transcript, normalised text, associated metadata of entities etc. mentioned in the text?

alex: library vs. archive is different
... library stores for individual retrieval
... archives are collections - can be enormous
... none of which are open for e.g. 1914
... there is a lot of content that will *never* be digitised
... in cendari archive: look into archive research guide
... assumption is: you will not solve your humanist problem on the web
... there are many hidden items
... idea is to unify these hidden resources that are not on the web
... notion of entities
... DBPedia notability of context or crowd-sourcing
... metadata items like VIAF, EDM, BIBframe provide references to identify entities
... archaic languages / language forms - how to automatically identify them in highly regular content
... 15th century people did not have spell checkers - how to identify entities automatically
... relevant standards: OAI-PMH for harvesting metadata

<evamen> Alex reviews some standards from the GLAM world... nice to see them here... but sort of booooooring

alex: MODS for describing object metadata
... TEI for encoding text

see http://www.loc.gov/standards/mods/ , http://www.tei-c.org/index.xml

alex: XML based encoding has gone on for a long time, need to take such legacy content into standardized form
... research landscape is huge, if you want to learn more please talk to me!

Creators QA session

<evamen> does anybody see the URL of the prezi presentation from Alex?

evamen, see here: http://oconnoat.github.io/MLWebDH/#/title

[scribe missed question and answer, if others can help please do]

<evamen> thanks... I will Tweet it too :-) (I need new glasses... or none :-(

:)

gemma: we don't have the problem you mentioned, we use both names to avoid that

joel from adobe

joel: what is a computer humanist

alex: interesting - what does a computer humanist look like
... and what can a computer do that a human can't
... being faster is not enough

chaals: teaching scientific scientists why they are more or less scientific than they think

alex: in pure computer science efforts:
... e.g. we use all dbpedia and do xyz,
... but that is not sufficient, one needs to take many non technical aspects into account

arle: an example: a scholar presented image collection
... people said: "that is not how we do things in this field!"
... so cross-research work is hard

alex: the underlying assumptions when doing computer + humanities work need to be analyzed carefully

question to gemma - what languages are covered?

gemma: main countries for us are UK, Germany, France, now starting with Russia and later China
... for others: we have sites for e.g. Japan, Poland etc., but focus is above countries

chaals: gemma and fernando talked about English + Arabic
... and that these languages have very significant regional variation in the places they are used
... how does the UN handle the variable language question?

fernando: applies to English and e.g. Spanish
... you speak "UN English": UK spelling, very standardized
... same for Spanish
... in Latin America the vocabulary is richer than in Spain
... same for Arabic
... in China there is a lot of discipline how things are done

olaf-stefanov: UN starts with modified UK English
... everything else I agree with what Fernando said about other languages
... sometimes there is no solution - e.g. a word may be different in classical Arabic versus modern Arabic
... a challenge for translation but also interpretation

Coffee break

Localizers

Scribe: rnavigli

Chair: Phil Ritchie

Jan Nelson, "The Multilingual App Toolkit Version 3.0: How to create a Service Provider for any Translation Service Source as a key Extensibility Feature"

Jan: I will start with a blue screen
... I am a computer engineer... I will talk about the Multilingual App Toolkit v3.0
... the goal: discuss the role of XLIFF in localization from a developer perspective and show an overview of the model
... you should work with more and more languages to capture more market
... With more languages, your % of market share increases, but slowly...
... 10 languages: 78% coverage
... Microsoft windows covers 108 languages...
... I have to think multilingual in my own market, for instance in San Francisco
... +45% speak a non-English mother language
... it's culturally diverse and there are opportunities for developers who cross the language barrier
... the developers have to care about translations and localization
... about OASIS XLIFF now
... XLIFF allows you to define extensible XML vocabularies and promotes the adoption of a specification for the interchange of localizable software and document objects and metadata
... from OASIS, a non-profit consortium
... Microsoft is a sponsor on the TC working on the XLIFF 2.0 std
... Let's see a demo of the Multilingual App Toolkit
... this is visual studio 2013... let's look at the resources... we have AppResources.resx
... as I enable the Multilingual App Toolkit you can see that now I have an xlf file; it's created by default and takes control of the resx files
... I will add another language
... go to "Translation Languages"
... search for "German"
... I will now have AppResources.de-DE.xlf under the resources folder
... now I build the solution
... I now open this German xlf file
... we have an editor from the toolkit
... we have a bit of workflow, we see whether the content is translatable or not
... let's consider the word "add", click on "suggest"
... I get a number of translation suggestions
... I can see from which provider each translation comes from
... and we have three of these services
... now I can translate the whole resource
... let's see this small phone app that I created with the language portal service API
... I can pick the language
... English (US)
... as source and the target language (German)
... I can see the terminology coming from the terminology portal
... if you're an app developer it doesn't take too many clicks to get the translation into German
... let's make it a bit more difficult now
... look at the translation provider list in this XML file
... you can use any translation provider you like
... file is "Translation Manager"
... there's opportunities to describe confidence
... I usually pay for quality assurance
... e.g. translations from TMs
... let's look at some resources... Microsoft Translator APIs, we demonstrated an app using maps
... Multilingual App Toolkit is free
... MAT 3rd party Translation Provider sample is an example using the TAUS APIs
... and the OASIS XLIFF 2.0 std specifications under development
... thanks
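
[Not part of the talk - a minimal sketch of the kind of XLIFF file the toolkit round-trips from .resx resources, plus a small reader. Element and attribute names follow a reading of the XLIFF 2.0 core (srcLang/trgLang, file/unit/segment/source/target); verify them against the OASIS specification.]

# Minimal XLIFF 2.0 sketch and reader. Element and attribute names follow a
# reading of the XLIFF 2.0 core; verify against the OASIS specification.
import xml.etree.ElementTree as ET

XLIFF = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0"
       srcLang="en-US" trgLang="de-DE">
  <file id="AppResources">
    <unit id="AddButton">
      <segment>
        <source>Add</source>
        <target>Hinzufügen</target>
      </segment>
    </unit>
  </file>
</xliff>"""

NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}
root = ET.fromstring(XLIFF)
for unit in root.iterfind(".//x:unit", NS):
    source = unit.findtext(".//x:source", namespaces=NS)
    target = unit.findtext(".//x:target", namespaces=NS)
    print(unit.get("id"), source, "->", target)  # AddButton Add -> Hinzufügen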

Joachim Schurig, "The Difference made by Standards oriented Processes"

Joachim: I'm from Lionbridge
... working at the core translation memory software at Lionbridge
... I'm more on new LT research
... now
... also working on translation standards
... How do we produce today at Lionbridge?
... you saw from Jan how to request translations
... we have a suite developed in-house, we have a portal, a workflow tool, a translation tool, with its own editor(s)
... we can plug in MT
... we use a number of standards in this workflow
... we use XLIFF 1.2, TMX
... the workflow system transports about 1/3 of our production these days
... the rest is handled half manually half automatically
... the TM workspace module, we have more than 15 billion words
... lots of benefits of XLIFF 1.2 internally
... a case study is: Visual studio (Orcas), the first large project using XLIFF
... what promise of interoperability of XLIFF 1.2?
... it's a fairly good std, we participated in its creation, specs were not specific enough, and there were legal diversions...
... interoperability was just a promise for 1.2
... the benefits however still let it fly
... .2 billion words in XLIFF 1.2 per year!
... this means the largest translator provider provides 95% in XLIFF
... less problems at delivering projects
... standardization v1.0 represents the traditional translation process
... we could limit acceptance level for new clients from several thousand US$ to $100 thanks to reduced transaction cost
... in the traditional workflow, we have TMX, SRX, TBX, XLIFF
... these stds which accompany today's workflow represent old fashioned workflow
... now, look at CMS systems
... what happens is we are more and more forced to get content from these systems to translate and get it back into the systems
... it's no more document-based
... we have to get access to internal XML data
... it requires a lot of manual adaptation and configuration to parse the XML files
... what is important during the translation of natural language is the segmentation into sentences
... we lose a segmentation which we apply when we read and filter the XML files when we deliver the files back to the CMS systems
... this means we don't know how to associate to an existing source
... we are working to make this process semi-automatic
... better XML markup would help
... a major help for multilinguality and CMS systems
... we need to avoid losing the segmentation at the sentence level
... the EU as a marketplace creates a transaction cost
... it's a disadvantage to not have a solution to our problems
... what about a translation memory external to the data?
... CMS should contain lookup data which we could reuse and interfaces to these data
... many content providers, but no standardized way to get electronic data through automatic means
... much more success with XLIFF 2.0

Rob Zomerdijk, "Content Relevancy starts with understanding your international Audience"

Rob: I'm speaker #13.....
... let's talk about understanding the consumers
... useful for your localization efforts
... I'm from SDL
... we built our customers on social conversations
... conversations form a huge haystack; you need to find the needle in the haystack
... how to find the needle, light the haystack and then apply the magnet
... how to learn about consumers and the products they are interested in
... how are customers talking about headphones?
... look at this blogpost, it says: "I normally sleep on my stomach", he has some money (not flying economy)
... the guy is using headphones but not listening to music
... this guy is from the US (talking about baseball)
... you have to understand the customer journey
... it starts with people aware of certain brands, searching for products, they buy and start using them
... if they have problems they might call the customer center
... there's a mass volume of data in conversations between customers from several social media
... in many languages
... you have to scrutinize the conversation... you need to score your performance on a scale 1-100
... to know where were the problems for a given product
... for instance there were connection problems, but evaluation and shopping was good
... the scoring can be applied also in other countries, but maybe you need to focus on other areas, for example you might have issues in the localization process
... and we have the EU, so a lot of languages
... but the EU is not popular! 57% of its population trusted it in 2007, in August 2013 only 31%!
... but the EU also has good things
... roaming is a good example
... what we did is we scored the performance of the EU and the trust of citizens in all its countries against consumers and citizens talking about roaming in the EU from Nov 2013 to Mar 2014
... conversations are usually scoring low, below par, but about roaming they are generally happy
... why is that happening?
... look at the mail online English newspaper: "the end of mobile roaming charges", with people posting their opinions
... let's look at some of them
... you can see: "the EU protects the public"
... and other positive comments
... also talking about roaming free
... the EU should use this kind of commentary in the communication, it can drive more trust
... but people are also negative: "non-travelers are going to pay the bill"...
... in the communication of the EC you need to take into account that people also have this opinion and you should use this in your communication to citizens
... another example is Skype, part of Microsoft nowadays
... translated into 111 languages!
... Skype has to understand how consumers are using their products
... in some countries Skype is more popular, in others less
... Skype has a traditional market research
... we said Skype shouldn't use a traditional market research route
... they should use the online audience conversing about their experience with their product
... Skype now is using the methodology and technology I showed before to prioritize markets
... on which markets should they focus first
... and which markets to focus next
... traditional market research would have taken much longer, while within two months we identified the relevant markets
... so SDL social intelligence is how to better spend your resources to understand what is going on with customers and citizens.

David Filip, Dave Lewis and Arle Lommel, "Quality Models, Linked Data and XLIFF: Standardization Efforts for a Multilingual and Localized Web"

DaveL: we have a multiedit presentation
... a brief update on consensus building in std activities
... a little update of our multilingual web meetings and where to go from here
... DaveF: XLIFF 2.0 is more than a promise
... Lionbridge is not the first one to use it
... Microsoft is a huge success case study with 1.2

DaveF: 1.2 is kind of old (become std in 2008)
... XLIFF 2.0 has improved over 1.2, a lot of talking within the community
... OASIS announced it as a standard candidate
... in 70 days we should have it as the new standard
... please do come to the 5th XLIFF Symposium, at LocWorld Dublin 2014, June 3-4 2014.
... content analytics and localization will be a focus
... XLIFF 2.0 is modular to allow for rapid release, but it's not backward compatible
... 1.2 had the problem of being too big
... 2.0 has 20% of its features, the rest is available via modules
... 2.1 is fairly specific by now
... based on the collaboration between ITS 2.0 and XLIFF 2.0
... advanced validation support is a plan for 2.1
... fsasaki will report on this in Dublin, Wed June 4
... we will discuss XLIFF 2.x roadmap
... with requirements gathering and yearly release schedule
... it's good that Microsoft is onboard

Arle talking about multidimensional quality metrics (MQM)

Arle: QTLaunchPad, an EU-funded project on assessing quality
... we don't agree on how to assess MT quality
... BLEU has problems
... you increase the BLEU score, but you do not see substantial improvement for human consumption of the automatic translations
... you take garbage and make it slightly less stinky, that's the metaphor
... MT evaluation methods require reference translations
... but these evaluation methods cannot be used for production purposes, because you have to translate
... human quality assessment takes too much time
... and it's not principled
... people just don't agree
... we don't even know what we mean by quality
... we define quality in a new way
... a quality translation demonstrates required accuracy and fluency for the audience and purposes, and complies with specifications and takes into account end-user needs.
... why not use a single, shared metric, no matter what it is?
... first: which one? there are so many, and they don't agree on what constitutes quality
... the only thing they agree on is terminology...
... the solution we propose: at the moment it looks insane
... there are many categories, that are the union of the factors people check when assessing a translation
... we have identified an MQM core, with a shared vocabulary of terms you can use: accuracy, verity, fluency
... with further specifications
... our approach: we don't assume, we need specifications
... specifications are based on 12 parameters, ISO standard TS 11669
... http://www.ttt.org/specs
... often people don't make these specifications explicit
... e.g. a client gave us a video to be localized; the YouTube automatic translation was garbage, but the problem was that the domain was very specific and this was not specified upfront
... you don't use all of MQM (or its core), but just part of it, the part you need
... see a couple of examples we did
... MQM for MT diagnostics in a research setting
... another example is on SAE J2450
... you can represent almost any metric and see how they are similar or different
... you can use open-source and online tools, demo at http://www.translate5.net
... http://scorecard2.gevterm.net

<lupe> demo: http://www.translate5.net

Arle: now, integration with XLIFF and ITS
... current work is on MQM namespace
... you lose some detail, because there's more in MQM
... further developments in QT21 project and CRACKER
... we are looking for feedback http://www.qt21.eu/mqm-definition
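
[Illustrative sketch, not the official MQM scoring model: issues tagged with an MQM-style category and severity are weighted and normalised by the size of the assessed sample. The categories, weights and formula here are invented; the real definition is at http://www.qt21.eu/mqm-definition.]

# MQM-style scoring sketch: weight tagged issues by severity and normalise by
# word count. Weights and formula are invented for illustration; see
# http://www.qt21.eu/mqm-definition for the actual MQM definition.
from collections import Counter

SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}  # assumed weights

def mqm_like_score(issues, word_count):
    """issues: iterable of (category, severity) pairs, e.g. ("accuracy", "major")."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in issues)
    per_category = Counter(category for category, _ in issues)
    return max(0.0, 1.0 - penalty / word_count), per_category

score, breakdown = mqm_like_score(
    [("accuracy", "major"), ("fluency", "minor"), ("fluency", "minor")],
    word_count=250,
)
print(round(score, 3), dict(breakdown))  # 0.972 {'accuracy': 1, 'fluency': 2}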

DaveL: let me now talk about ITS 2.0
... there are several metadata tags, and it's specifically about processes related to internationalization and localization
... you want to say which text you want to be translated, annotate individual portions of text, or apply rules to the whole corpus
... and we want to do that in a std way, with an agreed syntax
... the specification of ITS 2.0: we finished the W3C recommendation with several categories, last year
... there's a broad range of data categories
... some from ITS 1.0, how you annotate content to feed into the translation process
... I18n categories, pointers to external resources, preserving space, etc.
... we also have categories for integrating LT, specifically MT and Named Entity Recognition
... e.g. MT confidence scores
... we also provided additional categories for localization quality issues, provenance (who or what translated some piece of text?)
... let's look at an example of MT confidence scores
... this can be expressed with the new ITS
... you can specify which engine provided the translation and with which confidence
... another example is text analysis
... you can identify a particular word as recognized, e.g., from a NER engine
... and specify a confidence score again

<fsasaki> [FYI: via ITS2 text analysis to access multiling linked data sources - without knowing details about them http://googleknowledge.github.io/qlabel/]

DaveL: a third example is: localization quality issues, where you can specify the type of quality issue, a comment, the level of severity, etc. [a markup sketch of these three examples appears after this talk's notes]
... ITS 2.0 in summary, we did it in 16 months!
... was very intensive working process, with co-funding from EU
... over a thousand successful conformance tests
... to wrap up, we can see that MQM, ITS 2.0 and XLIFF are an example of great collaboration
... also opening the door moving beyond XML, e.g. have a mapping from ITS to an RDF ontology
... I've been working at CNGL on a new project called Falcon where we're looking at tracking the provenance of localization processes
... and also on the LIDER project, come next afternoon, where we are looking at more use cases beyond I18n and localization
... plus the 4th of June, the workshop colocated with LocWorld
... thanks!
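
[A sketch of the three ITS 2.0 examples mentioned above (MT confidence, text analysis, localization quality issue) as local markup in HTML5, not taken from the talk. Attribute names follow a reading of the ITS 2.0 recommendation's its-* HTML serialization and the values are invented; verify against the W3C spec.]

# Sketch of ITS 2.0 local markup in HTML5 for the three examples above.
# Attribute names follow a reading of the ITS 2.0 recommendation (its-*
# serialization); values are invented. Verify against the W3C spec.
from html.parser import HTMLParser

SNIPPET = """
<p its-annotators-ref="mt-confidence|https://example.org/SomeMTEngine"
   its-mt-confidence="0.89">Dublin is the capital of Ireland.</p>
<p><span its-ta-ident-ref="http://dbpedia.org/resource/Dublin"
         its-ta-confidence="0.96">Dublin</span> hosts the XLIFF Symposium.</p>
<p><span its-loc-quality-issue-type="terminology"
         its-loc-quality-issue-severity="50"
         its-loc-quality-issue-comment="glossary says: capital city">capital</span>
of Ireland.</p>
"""

class ItsCollector(HTMLParser):
    def handle_starttag(self, tag, attrs):
        its = {name: value for name, value in attrs if name.startswith("its-")}
        if its:
            print(tag, its)

ItsCollector().feed(SNIPPET)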

Localizers session QA

Tomas: bilingual data, there could be a file for each pair of languages; XML is nice, but it's not the most appropriate format for tabular data

DaveF: we have to go bilingual
... with version 2.0 we can work with a default language forexample
... you can use a reference language to sucessfully deal with multiple pairs of languages

Joachim: you cannot be multilingual, otherwise you cannot handle it from a service aspect
... I understand your problem
... but I think XLIFF is an old paradigm format, and what you want and need is dynamic on-demand online access to data
... Jan: I agree that having a bilingual format works really well for what we do today
... I liked that XLIFF 2.0 and ITS 2.0 care about multilingualism; I think it is going to happen soon

Pedro: in the CMS one file per content, or one file with all the content... sometimes things that appear simple can be much more difficult

Felix: there was a lot of conversation about translation workflows, translation quality, etc.
... LIDER works on multilingual resources that could be used to improve and address these issues
... could these resources be used for improving translation?

DaveL: multilingual resources are an asset
... the real issues in my mind are quality - the resource quality, for ex. multilingual translation memories
... how variable is the quality of the different translations you get out of that
... provenance is another issue
... other metadata, the definition of terms, how accurate are they
... when we talk about sharing of resources, we need to know who owns them, where they come from, etc.

Arle: much of this research is academic; there's an education process... LIDER can do some of the education

Joachim: about tighter integration into CMS systems, why formats, which standards?
... you could make a commercial model out of this
... but currently we are used to seal our technologies

Jan: developers, when looking at adding languages, having to manage the resource shift is quite difficult
... if we can provide quality assurance, moving from files to services - there is a different thing going on here

Machines Part I

Scribe: jmccrae

Chair:Dan Tufis

Roberto Navigli, "Babelfying the Mulitlingual Web: state-of-the-art disambiguation and entity linking of web pages in 50 languages"

<jmccrae> web content is available in many languages and is domain specific

<jmccrae> information extraction could be performed by multilingual text understanding... but this is hard

<jmccrae> approach is based on BabelNet... a semantic network of 50 languages

<jmccrae> available at http://babelnet.org

<jmccrae> this week 2.5 version released

<jmccrae> 21+ million definitions, 67 million word senses

<jmccrae> BabelNet can be a multilingual inventory of concepts and named entities

<lupe> Babelnet is great! + 1, Roberto

<fsasaki> +1 too!

<jmccrae> this can be used for word sense disambiguation (WSD)

<jmccrae> goal is to select most appropriate sense for word

<jmccrae> example: "Thomas and Mario are strikers playing for Munich"

<jmccrae> BabelNet can combine entity linking (NER) and word sense disambiguation

<jmccrae> online at http://babelfy.org

<jmccrae> give text as input and it associates words with babel senses

<jmccrae> Step 1: Calculate semantic signatures, i.e., concepts near the word

<jmccrae> Step 2: Find all possible meanings of the word in context

<jmccrae> Step 3: Place entities into a graph and look for connections

<jmccrae> Step 4: Remove the least connected sections of the graph

<jmccrae> Step 5: Select the most reliable meanings

<jmccrae> Results show that this is state-of-the-art

<jmccrae> for both WSD and NER

<jmccrae> example using Babelfy shows identified words

<jmccrae> clicking on words shows meanings

<jmccrae> example shows the system working for Italian as well

<jmccrae> one feature to be added soon is language-agnostic entity disambiguation

<jmccrae> for entities not in Wikipedia or other resources in that language

<jmccrae> use other language (English) to perform disambiguation

<jmccrae> slight performance hit associated with this

<jmccrae> work funded by the ERC and the LIDER project
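
[Not Babelfy's actual code - a toy illustration of the graph-based steps described above: candidate senses become nodes, semantic relatedness becomes edge weights, weakly connected candidates are pruned, and the best-connected sense per word is selected. All candidates and weights below are invented.]

# Toy sketch of the graph-based disambiguation idea above (not Babelfy's
# implementation): candidate senses are nodes, relatedness scores are edge
# weights, and the best-connected candidate per word wins. Values invented.
candidates = {
    "striker": ["striker_football", "striker_tool"],
    "Munich": ["FC_Bayern_Munich", "Munich_city"],
    "Mario": ["Mario_Gomez", "Mario_video_game"],
}
relatedness = {  # stand-in for semantic-signature overlap between senses
    ("striker_football", "FC_Bayern_Munich"): 0.9,
    ("striker_football", "Mario_Gomez"): 0.8,
    ("FC_Bayern_Munich", "Mario_Gomez"): 0.9,
    ("Munich_city", "Mario_video_game"): 0.1,
}

def connectivity(sense):
    return sum(weight for pair, weight in relatedness.items() if sense in pair)

for word, senses in candidates.items():
    print(word, "->", max(senses, key=connectivity))
# striker -> striker_football, Munich -> FC_Bayern_Munich, Mario -> Mario_Gomez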

Victor Rodriguez Doncel, "Towards high quality, industry-ready Linguistic Linked Licensed Data"

<jmccrae> focus on licensing aspects for linguistic resources (LRs)

<fsasaki> scribe: jmccrae

work derives from LIDER project (see talk tomorrow 9am)

to participate in LIDER discussions join W3C Community group at http://www.w3.org/community/ld4lt

linked data is not a format but a method of publishing data as a linked web of documents

linked data enables web of machines

open data has more visibility and re-use

proprietary resources have protection and enable certain business models

this division can be overcome by licensing

currently, most resources in the LOD cloud are not well licensed

many non-open or unspecified licenses

unlicensed works cannot be used, as there is a risk of being sued

non-commercial licenses can be a license compatibility issue


Minutes formatted by David Booth's scribe.perl version 1.138 (CVS log)
$Date: 2014/07/07 09:49:43 $