W3C

MultilingualWeb Workshop, Rome, Day 1

12 Mar 2013


Note: The following notes were made during the Workshop or produced from recordings after the Workshop was over. They may contain errors and interested parties should study the video recordings to confirm details.


Agenda
http://www.multilingualweb.eu/documents/rome-workshop/rome-program
Raw IRC log
http://www.w3.org/international/multilingualweb/rome/IRC/12-mlwrome-irc.txt (incomplete)
Chair
Arle Lommel (DFKI)
Scribes
Felix Sasaki, Karl Fritsche

Contents

Welcome and Keynote

Welcome and Workshop Intro

Scribe: Felix Sasaki

Daniel Gustafson on behalf of FAO introducing the conference

Arle showing an example of internationalization needed

Arle: language and culture are not easy issues to solve
... we will learn about things from other communities, which we normally won't see
... this workshop series shows what actions across communities are needed
... the mlw workshop series drives the development of a community who takes up the challenge of the mlw
... we also act as a catalyst for future projects that take up the challenge of the mlw
... e.g. European projects that work on the topics discussed here
... we want to improve the use of standards and best practices
... and want to improve support of multilingual features in browser agents
... we have seen real engagement in this area to tackle the issues

Arle going through admin issues and the program

Mark Davis & Vladimir Weinstein, “Keynote: Innovations in Internationalization at Google”

Scribe: Felix Sasaki

mark: we will focus on some of the technologies at google and how we make our products more multilingual
... at google we have to deal with core localization - here we will talk about work which is above the core
... google is about search for text
... we take synonyms into account, but more recently we want to take entities into account
... "entity i18n"
... we do this e.g. to look at wikipedia to find out what entities look like
... english wikipedia is huge, so we did cross connections of wikipedia to find more about entities
... part of this is names, e.g. personal names, google+pages with more free form names
... and also URLs, which present their own problems
... some problems are related to security
... many characters are look alikes
... this creates opportunities for spoofing
... normalization of names: when are two names the same?
... that includes handling of inflection, how people want their names represented
... this involves issues related to semantics, encoding, formatting etc.
... recently we worked on plurals and gender
... plural and gender are tricky features to deal with
... currently we have patterns for numbers written as digits
... used in messages, units, contact numbers, etc.
... now handing over to Vladimir who will take us through these areas
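
A minimal sketch of the look-alike problem Mark mentions, not Google's method. It only flags labels whose letters come from more than one script, and it assumes that the first word of a character's Unicode name (e.g. "LATIN", "CYRILLIC") is a good enough script guess. The function names `scripts_in` and `looks_spoofable` are hypothetical.

```python
import unicodedata


def scripts_in(label: str) -> set[str]:
    """Return a rough set of scripts used by the letters in `label`.

    Assumption: the first word of the Unicode character name
    approximates the script; this is not a real script-property lookup.
    """
    found = set()
    for ch in label:
        if ch.isalpha():
            found.add(unicodedata.name(ch).split()[0])
    return found


def looks_spoofable(label: str) -> bool:
    """Flag labels whose letters come from more than one script."""
    return len(scripts_in(label)) > 1


genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
print(looks_spoofable(genuine), looks_spoofable(spoofed))
```

Mixed-script detection catches only one class of spoofing; whole-script look-alikes and the name normalization questions raised above need more than this.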

Vladimir: we have a nice way to represent e.g. gender and number information across languages
... we want translators to handle this properly
... we built a tool that would show the localization specialist what he needs to see

Vladimir: engineers write forms in English
... in our code we have ways to specify how gender and plural should be treated
... examples on how this works for various languages
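
The plural handling described here can be sketched roughly as follows. This is not Google's message format; it assumes simplified plural rules for non-negative integers in English and Russian, and the message catalog and the names `plural_category` and `format_count` are hypothetical. Gender would need a second selector of the same kind.

```python
def plural_category(lang: str, n: int) -> str:
    """Simplified plural category for non-negative integers only."""
    if lang == "en":
        return "one" if n == 1 else "other"
    if lang == "ru":
        if n % 10 == 1 and n % 100 != 11:
            return "one"
        if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
            return "few"
        return "many"
    return "other"


# Hypothetical catalog: one pattern per plural category and language.
MESSAGES = {
    "en": {"one": "{n} file", "other": "{n} files"},
    "ru": {"one": "{n} файл", "few": "{n} файла", "many": "{n} файлов"},
}


def format_count(lang: str, n: int) -> str:
    """Pick the pattern matching the plural category and fill in the number."""
    return MESSAGES[lang][plural_category(lang, n)].format(n=n)


for n in (1, 2, 5, 11, 21):
    print(format_count("en", n), "|", format_count("ru", n))
```

The engineer writes one English form, while the translator may have to supply three or more forms, which is why the translator needs to see every category the target language requires.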

Vladimir: now phone number example
... operating system that runs on the phone
... for each phone number we want it to be unique
... that is, there is a country code, area code, digits to dial the number
... people in different countries don't want to think about +xx, they just want to write things as they like
... so we have an open source phone number library
... see http://code.google.com/p/libphonenumber/
... it handles parsing, formatting, canonicalization
... getting types and examples
... finding numbers in text
... canonicalization is a hard problem
... google contact book has a flag that tells you the country of the number
... it allows you to fix the number if it's wrong, and it relies on above library
... geolocation of number is an issue
... in some places you cannot solve it
... in the US there is a physical, territorial designation
... in Europe that is different
... so the problem can't be solved in general
... now about addresses
... e.g. for sending a check
... see library at http://code.google.com/p/libaddressinput/
... this is also open source
... allows for validation of regions, layout and basic validation
... to e.g. give a street address that is actually meaningful
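
A toy sketch of the phone number canonicalization described above. It does not use libphonenumber and does not reflect its API; the `REGIONS` table and the `canonicalize` function are hypothetical and cover three regions only, just to show why a default region is needed when the user omits the "+xx" part.

```python
import re

# Hypothetical, tiny metadata table: region -> (country code, national trunk prefix).
REGIONS = {
    "US": ("1", ""),
    "DE": ("49", "0"),
    "IT": ("39", ""),  # Italian numbers keep their leading 0 when dialed internationally
}


def canonicalize(raw: str, default_region: str) -> str:
    """Turn a user-typed number into +<country code><national number>."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        # Already international: keep the digits as typed.
        return "+" + digits
    country_code, trunk = REGIONS[default_region]
    if trunk and digits.startswith(trunk):
        digits = digits[len(trunk):]
    return "+" + country_code + digits


print(canonicalize("030 123456", "DE"))
print(canonicalize("(650) 253-0000", "US"))
print(canonicalize("+39 06 1234 5678", "DE"))
```

A real library also validates lengths and number types per region, which this sketch leaves out entirely.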

Mark: "getting language settings wrong" - an issue for users and also people inside Google
... was a hard problem to fix
... worse even for enterprises, since the language is often set by some administrator
... we created "universal language settings"
... this allows people to set the language across products
... sounds simple but rollout across products is hard
... allows for setting more than one language
... did some analysis of gmail - a fair number of users speak more than one language
... fallback is needed if preferred language is not available
... a use case is to serve better content in search scenarios
... with language settings the outcome is better than trying to guess the user's language
... "60 language initiative" at google
... we did a 40 language initiative
... we showed google internally how support for more languages helps to get more satisfied customers
... that was an incentive inside google to have more language support
... now we are rolling out 60 languages in many products
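
The idea of several preferred languages with fallback can be sketched as below. This is an illustration under stated assumptions, not Google's implementation: tags are truncated subtag by subtag, which is a simplification of the lookup scheme used with BCP 47 tags, and the names `fallback_chain` and `pick_language` are hypothetical.

```python
def fallback_chain(tag: str) -> list[str]:
    """Return the tag followed by its progressively shorter prefixes."""
    parts = tag.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def pick_language(preferred: list[str], available: list[str], default: str) -> str:
    """Walk the user's ordered preferences and return the first available match."""
    available_lower = {a.lower(): a for a in available}
    for tag in preferred:
        for candidate in fallback_chain(tag.lower()):
            if candidate in available_lower:
                return available_lower[candidate]
    return default


print(fallback_chain("zh-Hant-TW"))
print(pick_language(["de-CH", "fr"], ["de", "en"], "en"))
```

Because the preference list comes from an explicit setting, no guessing from request headers or location is involved.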

Vladimir: important for people that they can interact with their device the way they want
... speech-text library important for that; our library for that now has support for 42 languages including accents / dialects in 46 countries
... text input - dedicated team is developing input methods for many languages
... input methods on android, native input methods to store on your device
... also cloud input methods
... librar(ies) for input http://www.google.com/inputtools/
... with dictionaries and word frequency data we are trying to guess what people want to type
... which is helpful for many users
... another team at google creating fonts for all unicode scripts

see http://code.google.com/p/noto/

noto = no tofu

Vladimir: dealing with so many fonts and font data
... needs tools for reading and writing basic font tables, special bitmap glyphs
... allowing us to serve smaller subset of fonts
... we have a sfntly font library for that

http://code.google.com/p/sfntly/

Vladimir: now google translate
... now support for 65+ languages
... if you use a chrome browser the browser can detect the content of the website and will offer translation
... we also allow you to submit feedback to make the translation better
... we are allowing people to access content on the web
... we do not shoot for specialized engines
... now google localization infrastructure
... see http://translate.google.com/toolkit
... used now for all google localization
... important to have everything available under google control, so that changes can be rolled out everywhere easily
... the toolkit can also be used for outside users
... you can use translation memories, glossaries etc. for yourself
... world of localization is not well standardized
... we created ARB format
... for web applications
... we created an easy to use JSON format to use at runtime to skin applications at runtime
... we also allow people to produce localized content
... example of youtube caption translation
... allows you to take the captions from your video and translate them yourself or ask your friends to translate them
... next iteration even allows you to buy captions from a vendor
... general idea is that people can access translation of their content on many different levels
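
A rough sketch of what a light JSON resource format used at runtime might look like. It assumes the convention of storing metadata about a resource under the same key prefixed with "@"; the resource names, attribute values and the helpers `render` and `translatable_keys` are hypothetical and not taken from the talk.

```python
import json

# Hypothetical resource bundle in a JSON format in the spirit of ARB.
BUNDLE_DE = json.loads("""
{
  "@@locale": "de",
  "greeting": "Hallo {name}!",
  "@greeting": {
    "type": "text",
    "description": "Greeting shown on the start page"
  }
}
""")


def translatable_keys(bundle: dict) -> list[str]:
    """Resource keys are those not prefixed with '@' (metadata)."""
    return [key for key in bundle if not key.startswith("@")]


def render(bundle: dict, key: str, **values: str) -> str:
    """Look up a message at runtime and fill in its placeholders."""
    return bundle[key].format(**values)


print(translatable_keys(BUNDLE_DE))
print(render(BUNDLE_DE, "greeting", name="Anna"))
```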

Keynote Q&A

Chaals: question on ARB format
... w3c standardized various XML formats related to that
... google didn't follow that path and chose the json way
... why did google take that path?

Mark: the goal for the format is to have something light
... that can be mapped to xml, but something simple

Tomas: do you follow language identification means?

Mark: we use bcp 47
... we do not follow language accept header in HTTP
... since that has never really been used in a consistent way

Tomas: sending in the header "this is my language preference" - how about that?

Mark: we cannot rely on people having set the header right
... then people also do not depend on which machine they are working on
... things are handled just by the google (account) settings

Gavin Brelstaff: are you using timed text standards? (= caption translation standard)

Vladimir: we support one standard here, not sure which one
... different standards for captions have different ways to convert from one to another

Christian Lieske: you use quite a bit of natural language processing in your tools
... e.g. google translation in 65+ languages
... in some areas: are you still working with rule-based, that is, non-statistical methods?

Mark: google translate uses masses of bi-lingual data
... that doesn't work with languages that need ordering
... so there are now pre-ordering steps
... and syntactic analysis more and more being used
... advantage of the data approach is that you can easily accommodate new language pairs
... just with new data
... the rule based approach was very labor intensive for us
... google translate does not map into the most commercial languages
... but the languages that have most of the data

Tomas: do you tackle the data that you use in MT?
... if you have the choice between a smaller amount with good quality and large "dirty" data sets?

Mark: we take into account large data sets in standard format, but we also want to translate tweets
... also need to make sure we do not use data for training that has been MT translated already
... or example of date representation: sometimes auto-generated with CLDR, that then influences training

Richard: maybe explain what CLDR is
... and explain what the mobile internet tool HarfBuzz does

Mark: CLDR is a project for gathering localization data
... e.g. formatting of dates, times, numbers, collation and sorting rules, currency formatting
... contact numbers
... HarfBuzz is an open source project to handle complex scripts
... it is gradually developing to support more and more scripts in the world

Developers

Jan Anders Nelson & Jörg Schütz, “Going Global with Mobile App Development: Enabling the Connected Enterprise”

Scribe: Felix Sasaki

Jan: windows 8 phone toolkit
... visual studio IDE, provides pseudo languages for "in house" testing
... XLIFF support
... integration with MS translator service
... via internet
... windows phone 8 - you create a new project, binding to resources, then testing in various languages
... you test in many languages, apply resource changes across languages
... re-starting testing, ... takes a lot of work to ship in one language
... demo today in the toolkit: how we are trying to make the process easier
... will show how to add a new language, how to export to xliff, store the data in sky drive etc.
... so high level overview: how you can create a windows 8 app in various languages
... in the enterprise it is important: "bring your own device" scenario
... in the past that didn't work in the enterprise, due to security, protocols etc.
... this tool supports that scenario, but also goes beyond the enterprise

Jan demoing the toolkit

Jan: resources file, an XLIFF file
... a specific locale for pseudo localization
... now adding french as a new language
... there is a notation for a translator service
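
Pseudo localization, as mentioned for the toolkit, can be sketched as follows. This is not the toolkit's algorithm; it assumes a simple scheme (accented look-alike letters, roughly 30% padding, brackets to reveal truncation) and that placeholders are written in curly braces. The name `pseudo_localize` is hypothetical.

```python
import re
import unicodedata

# Map plain letters to accented look-alikes; NFC keeps each one a single code point.
ACCENTED = str.maketrans(
    "aeiouAEIOUcnyCNY",
    unicodedata.normalize("NFC", "àéîöûÀÉÎÖÛçñýÇÑÝ"),
)
PLACEHOLDER = re.compile(r"(\{[^}]*\})")


def pseudo_localize(text: str, expansion: float = 0.3) -> str:
    """Accent the letters, pad for length, and bracket the string."""
    parts = PLACEHOLDER.split(text)
    # With a capturing group, odd-indexed parts are the placeholders: keep them intact.
    converted = [p if i % 2 else p.translate(ACCENTED) for i, p in enumerate(parts)]
    padding = "~" * max(1, round(len(text) * expansion))
    return "[" + "".join(converted) + padding + "]"


print(pseudo_localize("Save {count} files"))
```

Strings that still show up unaccented in the running app were not taken from resources, and a missing closing bracket points to truncation in the layout.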

the editor allows you to generate XLIFF (XML Localization Interchange file format)

Jan continuing demo - now loading resource file

Jan: in windows 8 you have language preference settings - you can add several ones and a fallback too
... xliff support important, for working with many translator services
... we continue to work actively in the XLIFF TC to assure that there will be interoperability

Gavin Brelstaff & Francesca Chessa, “Multilingual Mark-Up of Text-Audio Synchronization at a Word-by-Word Level: How HTML5 May Assist In-Browser Solutions”

Scribe: Felix Sasaki

Gavin: example of a movie - french movie with english subtitles
... example of closed captions in HTML5 - just put track, use timed text markup
... can be e.g. vtt, or srt standards
... it gives you line-by-line translation
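
A minimal sketch of producing a WebVTT caption file, of the kind an HTML5 track element can point to, from a list of cues. The cue data and the helper names `timestamp` and `to_webvtt` are hypothetical; only the basic file layout (the "WEBVTT" header, then timing lines with "-->" followed by the cue text) is relied upon.

```python
def timestamp(seconds: float) -> str:
    """Format seconds as hh:mm:ss.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(cues: list[tuple[float, float, str]]) -> str:
    """Serialize (start, end, text) cues, one caption line per cue."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{timestamp(start)} --> {timestamp(end)}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines)


# Hypothetical line-by-line translation of a short clip.
print(to_webvtt([(0.0, 2.5, "Good evening."), (2.5, 6.0, "Where are you going?")]))
```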

<lmatteis> does youtube support this?

Gavin: supported by youtube

example of english subtitles for italian clip on wikimedia commons

Gavin: difficult to type all numbers of timed text
... we will show later how that can be easier
... in pisa mlw workshop we gave a presentation
... of alignment of bilingual text
... example works on the desktop
... using javascript, html, jquery, and we can do a go through highlighting of the text
... semantics can spread on the whole part of the page, not always easy to extract
... somebody has put some timed text markup in
... we can say "we are here in the text"
... we don't make a video - we work on text

Gavin: a bit like karaoke
... you can hear it in the original language and follow in your language

demo of alignment across languages - left side italian, right side english, the highlighting of words moves with the audio

Gavin: we have a way to mark up semantic correspondence - green is direct correspondence

Gavin: and there are other color codes for other types of equivalence
... HTML under the hood: audio element, no controls
... there are start and end + time identifiers
... markup is reflecting "semantic segmentation"
... there is also an archived format, can be saved by server, an XML format specified by the Text Encoding Initiative (TEI)
... we do that for historical reasons - compared with JSON it might seem archaic, but academics use it a lot
... the hard task is to deal with overlapping hierarchies
... it would be useful if some kind of timed text standard would address the overlap issue

demo continuing

Gavin: visual interface to add cue points
... why are we doing this?
... aim to activate poetic memory
... complementary to external memory produced by europeana - we try to get the memory inside heads of people

Gábor Hojtsy - Multilingual Challenges from a Tool Developer's Perspective

Scribe: Felix Sasaki

Gabor: have a decade of experience on support of multiple languages in drupal
... drupal is mostly used as a CMS platform, examples of several famous sites

Gábor: first gettext .po format
... we use it since it fulfills our needs

Gábor: location information, plural format, message texts etc.
... we only use this as a transportation format
... very simple, small
... we don't need to deal with gender issues

Gábor: that's rarely an issue for drupal web sites

Gábor: at drupal.org we have 20 000 modules hosted
... you need to have incentives for people to participate in open source efforts

Gábor: we found that there is a lot of overlap between drupal projects
... when you use projects together you can share translations
... we built a sub model for building translations for drupal
... we found that encouraging micro contributions is helpful
... people can submit one translation each

Gábor: we built a diffing tool etc. and people get fame by contributing more

Gábor: configuration as a problematic area
... configuration in drupal can be edited by the user

gabor: we use YAML for configuration
... we ship these with drupal itself
... people create their own configuration - people should ship the stuff as part of the software
... breaking it down to small pieces was very beneficial
... that is a kind of dual system that we needed to support
... getting acceptance for two different translation models is difficult - it will freak out developers
... you need to have one clean API
... we go with the 2nd model to identify pieces in the content
... for workflow support we have manual support for XLIFF
... we have vendors that build tools - XLIFF tool, translation management (tmgmt) tool that has support for ITS and is demoed at the showcases

(see more about ITS + tmgmt at http://drupal.org/sandbox/kfritsche/1908598 )

Reinhard Schäler, “Enabling the Global Conversation in Communities”

Scribe: Felix Sasaki

Reinhard: a confession - I'm not a developer - but "every saint has a past, and every sinner has a future" :)
... rosetta foundation is about empowering language communities
... two-three years ago we found that this is also about a business model
... you can reach 90% of the customers by just localizing in 50 languages
... that's not 90% of population, but 90% of customers
... story of bloomberg storyful
... a startup. It tells you, by looking at tweets, what affects stock prices
... access to information and knowledge is crucial for money
... "social localization": non market localization
... localization for world peace - all the good things you can do
... with this you will reach 70% of citizens - not customers, but citizens
... many organizations active in this space
... so situation here: no business case, but lots of activity
... problem is how to connect content with volunteers that want to help?
... the answer we got was: why not use what we have got already?
... all the stuff we have works well in a commercial setting
... but often it is closed and working in silos
... with "70% of population" scenario anything can happen
... what we need is something open, configurable, standards based
... that is where XLIFF and w3c standards come in
... SOLAS is a localization architecture
... I will talk about productive part of SOLAS
... david filip and dave lewis have a demo here to show the interoperability in SOLAS productivity
... focus is on impact rather than commercial value
... we are about to announce an open source project about solas
... we have just signed a related licensing agreement with Univ. of limerick
... we will open source it and announce that next week at the gala conference
... we see solas as the tool that will connect massive amount of content with massive number of people
... not only to produce wealth
... the project will run in trommons - translation commons

various solas screenshots demonstrating the functionality

Reinhard: example of translating wikipedia article
... easy to organize with solas
... you can have tasks or sub tasks that volunteer translators can take up
... matching interest of translators with specific tasks is the key
... that's why it is called solas match
... you find the preferred task that fits you
... we need to find bankers to support this

Q&A for Developers

ioannis: many different platforms
... for translation: google, microsoft, solas, ...
... drupal
... is there the danger to have too many platforms? Can this be united

Gavin: the standards based web is the platform of the future

gabor: agree that this is a problem
... we had a problem in drupal too - if we build it for ourselves it needs to be maintained, build the features
... if we cannot sustain interest it is a problem
... a different solution is to have a backend that we re-use and just have our interface on top of it

reinhard: you need to integrate, standardize and strive for interop

ioannis: is this happening?

reinhard: in solas

Gavin: 5 years ago the browser wouldn't have given me the technology to do what I demoed

Jan: you heard XLIFF, CLDR, Unicode, ...
... there are many standards we can work on
... think of the network stack
... this is similar: as we work on the multilingual stack
... we have to be careful on what becomes part of the standardization
... see e.g. the ITS shown at the workshop

olaf-michael: competition is important
... as long as we have serious competition, standards are helpful
... if we don't have competition, things will not develop as fast as we'd like

mark: at reinhard and gabor - how do you manage your community
... if you get 10 different translations for the same thing for example?

reinhard: we are trying to manage as little as possible
... and to trust
... that implies some monitoring and intervention
... we don't have entry barriers
... people can take part and can contribute
... there are good examples in the open source community
... that this works
... there are big questions around quality of translations
... but there are no off-the-shelf answers to this
... I think twitter is trying to work with 500 000 volunteers
... there is no silver bullet solution
... you trust your community and work with it

mark: an example - using a volunteer translation
... you can take the wikipedia approach to take the latest translation
... my question about manage is: how do you choose the latest translation?

reinhard: we have proofreaders - we trust them more than editors

gabor: every language community in drupal has their own subsite
... they manage permissions
... we found out that they have very different mgmt styles
... I found that in the open source world you need to have incentives to start work
... you need to have tasks that make people happy
... the micro contribution approach is helpful with that
... just read a book about incentives for people, how to make them happy doing that - very useful

xyz: memorizing of music - at Gavin: do you have experience with that?

Gavin: there are cantador, singers that listen to each other while singing
... that active listening while doing helps with memorizing

xyz: students in Germany have more and more problems to memorize what we did previously, so we need ways to help here

chaals: many people talk about XLIFF - the web world uses json
... but tools use XLIFF etc. internally
... how do we manage the transition from XLIFF to json etc.
... how do we know when standards change?

Gavin: you have to be clear what is a declaration and what is an object
... a declaration is self-evident
... in json you can't declare
... we are using gettext, yaml, xliff, tools for ITS
... we use different tools for different problems
... we have a community trained and people using them
... if we don't see a benefit to move to new approaches we won't move

jan: agree
... we support many platforms, e.g. html5 windows web apps
... how do we manage to export to something else than XLIFF
... when we look at services it becomes arbitrary - can be xliff, json etc.
... so keeping the eye on the ball of standardization becomes crucial
... we have to continue those conversations

reinhard: XLIFF still has a long way to go
... sometimes standards can remove competition
... it is good to have competition in some areas
... XLIFF is trying to solve a problem that is not related to competition
... more uptake of XLIFF will help everybody
... it reminds me of character encoding discussion - now everybody is using unicode
... such development allows to concentrate on more interesting problems

Creators

Román Díez González & Pedro L. Díez-Orzas, “ITS2.0 Implementation Experience in HTML5 with the ‘Spanish Tax Agency’ Site”

Scribes: Karl Fritsche and Felix Sasaki

Roman: from spanish tax agency
... we are partners in the use case

Roman: why are we here - we are contributing with linguaserve to the MLW-LT project

pedro: client loses control over translation process in the tooling scenario
... the ITS2 metadata allows the client to regain control

roman: having control as the client - that is what ITS2 is good for, for us
... we use various metadata items ("data categories"): translate, domain, localization note, localization quality issue, mt confidence, provenance

pedro: page in various languages
... ITS2 metadata specifies what can be translated or not
... we have also localization quality issue created by the post editor
... we also need to develop related best practices about using ITS2 e.g. by post editors
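
A sketch of how "what can be translated or not" might be read from an HTML5 page. It assumes only the HTML5 `translate` attribute with inheritance to child elements; it ignores the other data categories and any global rules, and it is not the tooling used in the use case. The class name `TranslatableText`, the function `translatable_segments` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Elements without an end tag must not change the inheritance stack.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class TranslatableText(HTMLParser):
    """Collect text that is not inside an element marked translate="no"."""

    def __init__(self) -> None:
        super().__init__()
        self.flags = [True]  # inherited translate state; default is translatable
        self.segments: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        value = dict(attrs).get("translate")
        if value == "no":
            self.flags.append(False)
        elif value == "yes":
            self.flags.append(True)
        else:
            self.flags.append(self.flags[-1])  # inherit from the parent element

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.flags) > 1:
            self.flags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.flags[-1]:
            self.segments.append(text)


def translatable_segments(html: str) -> list[str]:
    parser = TranslatableText()
    parser.feed(html)
    parser.close()
    return parser.segments


page = '<p>File your return with <span translate="no">Agencia Tributaria</span> online.</p>'
print(translatable_segments(page))
```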

roman: shifting to HTML5 various steps
... shallow HTML5: obsolete attributes
... second, automatic annotation
... third, manual annotation facilities
... example of domain name annotation - tagging was done by scripts
... provide an editor for manual annotation

pedro: last 30 seconds:
... next step is to finish the use case
... it is complete and functional
... exploring best practices is a critical topic, once the standard its2 itself is finished
... there are other metadata items like "readiness" that are not part of ITS2 but which can be extensions to ITS2
... then methodologies for post editing
... and specific tools for dealing with the metadata are needed

Hans-Ulrich von Freyberg - Standardization for the Multilingual Web: A Driver of Business Opportunities

Scribes: Felix Sasaki & Karl Fritsche

hans: MLW-LT is not only for geeks, but also for accountants
... about cocomore: communication and IT, largest drupal dev teams in Germany and spain
... why do we engage in MLW-LT? we want to lay the foundation of a real integration of a CMS in the localization chain
... our role in the MLW-LT project: contributing to the ITS2 standard
... and enhancing drupal to work with ITS rules
... also creating a use case to demonstrate that ITS2 creates business benefits
... not only for localization but business at large
... example client vdma - industrial association in Germany of exporting companies
... export business means multilingual challenge
... today vdma has to handle 9 European languages
... vdma has to publish online and offline for 60 sub sectors = domains that have to be covered
... they have a central product database that also has to be multilingual
... all has to be managed
... we said to vdma that ITS2 / mlw-lt metadata can help the vdma business
... implementations in drupal for ITS2:
... its rules to be used in drupal
... a wysiwyg editor for applying ITS rules
... and we implemented a translation mgmt tool
... it allows to view and edit the metadata without a cms system
... now more about the use case - annotation editor screen shot
... it allows to deal with items that have been defined in ITS 2.0
... export of XML file to linguaserve
... here how the XML arrives at linguaserve
... in the format you have XML ITS 2.0 metadata
... after translation the whole information returns to Cocomore, in the CMS for review
... two more interfaces in the CMS
... translation process overview
... interface of language mgmt tool
... tool to view and edit the metadata, the jquery plugin
... options of saving time
... translation process steps that have to be done from the client point of view
... from the client and LSP point of view we analysed the process of an LSP
... e.g. receiving, storing data etc.
... processing annotation information etc.
... in all process steps a lot of time saving
... conclusion: standardization combined with automatic annotation and round tripping was 80%
... we also reduced the time line

Brian Teeman, “Joomla: Building Multilingual Web Sites with Joomla! the Leading Open Source CMS”

Scribes: Karl Fritsche & Felix Sasaki

Brian: Joomla is a tool to help build and manage web sites. Largest of the open source CMSs, community managed.
... Since day 1 (2005) “language” was important for Joomla: goal is to work with all languages, wherever possible.
... Community supplies all translations: teams or individuals, depending on the language.
... Even installation and running of Joomla works in “your” language.
... en-gb is default language for Joomla, but not required by any part of Joomla. Most people want to run web site in their language.
... Several languages: tag translator content within Joomla.
... Often not all content needs to be translated. Joomla allows you to specify what (not) to translate.
... Joomla 3 is mobile ready: responsive out of the box for all devices.
... (Example of web site done with Joomla containing several languages’ content (English, Russian, ...)).
... Joomla also allows for extensions’ translation. opentranslators.org helps to provide Joomla extension translations: 81 languages, more than 1000 translators.
... If you don’t like the translation provided by Joomla, language manager allows you to override translation.
... Via translation mgmt system user can specify what should be translated for which target language.

Vivien Petras, “The Europeana Use Case - Multilingual and Semantic Interoperability in Cultural Heritage Information Systems”

Scribes: Karl Fritsche & Felix Sasaki

Vivien: Europeana is the European digital cultural heritage library; we want every digital heritage object in Europe within Europeana
... many different languages, English isn’t even under the top 5
... users use their own language
... many collections from Germany, therefore many meta tags in German
... even if there are many pictures, most people only look at items with meta tags in their native language
... search is not multilingual aware, same results for all languages
... music and images have no language, therefore most objects are in the "Multilingual" language
... because most users only look at their native languages, most of the items maybe never get clicked
... users want to search in their languages
... we could use semantic enrichment to improve multilingual search
... but this led to other problems
... new enrichment plan to link to contextual vocabularies from providers
... Europeana is now Open Data, we provide a sparql endpoint and RDF download

Inna Nickel, Daniel Naber & Christian Lieske, “Tool-Supported Linguistic Quality in Web-Related Multilanguage Contexts”

Scribes: Karl Fritsche & Felix Sasaki

Christian: linguistic quality is highly scenario dependent and never the same
... requirements are very different in style, voice and terminology
... integration into open office
... example: the single word “Link” - what is the meaning of this?
... NLP (natural language processing) is about voice control and machine translation and linguistic quality check
... tooling to support high quality translation is needed

Inna: open source tool for SLQ (source language quality) – LanguageTool. Supports ITS localization quality types
... LanguageTool can be used standalone, embedded in java or could be used through OKAPI over HTTP
... Firefox plugin to do quality report directly in the browser
... in a project we implemented Russian languagetool rule set from enterprise scenario
... limitations: processing possibility of languagetool and complexity of enterprise data
... languagetool has the ability to check if a homepage is simple to read, which is important for supporting accessibility

Creators Q&A

Scribe: Felix Sasaki

Localizers

Bryan Schnabel, “Making the Multilingual Web Work: When Open Standards Meet CMS”

Scribe: Felix Sasaki

bryan: multilingual web needs multilingual content
... I focus on drupal CMS and XLIFF extensions to that
... drupal has an out of the box solution - translation core module
... "translator logs into drupal"
... they click "translate" tab
... choose the language of translation
... add a new translation etc.
... a unique name .. boom!
... pros of out of the box solution: easy to use
... cons: translator can't leverage translation memory
... needs access to drupal cms
... and he can do harm
... so idea is to leverage drupal with XLIFF
... in the xliff scenario, the roles are like this:
... drupal admin selects the node types to translate
... he exports to xliff
... saves it to harddrive
... some times post processing is needed
... to avoid change of text <target state="final"> is set
... XLIFF is then sent to LSP
... so the translator doesn't get drupal nodes, but XLIFF
... translator can use TM, their translation tool etc.
... I got what I wanted:
... translation is like I wanted
... some strings are not part of the xliff model
... e.g. UI strings
... the drupal admin can export these as PO files
... and there is a workflow PO > XLIFF , translation > PO , then into drupal again
... advantage of approach:
... better for translator
... now about CCMS Component content management system
... CCMS uses topic based authoring
... DITA lets you have topics, maps, images etc.
... we could talk about thousands or millions of files
... now workflow in CCMS
... the DITA-aware system knows about the translation related properties of content
... coming back to the two scenarios: in the first scenario, we put everything into a ZIP file and the LSP has to deal with this
... the disadvantage is that they have to deal with millions of topics to translate
... now augmenting the process by using XLIFF
... see the XLIFF dita open toolkit plugin
... integration with trisoft / visual studio
... now with new workflow: 1 xliff file, not millions of DITA files sent to LSP
... advantage: the LSP doesn't have to know DITA
... disadvantage: it is complex (scribe missed)
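
The idea of sending one XLIFF file instead of many DITA files can be illustrated with a small sketch. It is not the DITA Open Toolkit plugin mentioned in the talk: the function bundle_topics, the topic names and the assumption that translatable strings have already been extracted from each topic are all hypothetical. It relies only on the fact that an XLIFF 1.2 document may contain several <file> elements, one per original document.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)


def bundle_topics(topics, source_lang="en", target_lang="de"):
    """Build one XLIFF document with a <file> element per topic.

    topics maps a topic file name to its translatable strings, which are
    assumed to have been extracted from the DITA source beforehand.
    """
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    for name, segments in topics.items():
        file_el = ET.SubElement(root, f"{{{XLIFF_NS}}}file", {
            "original": name,
            "source-language": source_lang,
            "target-language": target_lang,
            "datatype": "xml",
        })
        body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
        for index, text in enumerate(segments, start=1):
            # Unit ids only need to be unique within their <file>.
            unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": str(index)})
            ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical topics standing in for a much larger DITA map.
TOPICS = {
    "install.dita": ["Installing the product", "Run the installer."],
    "faq.dita": ["Frequently asked questions"],
}
bundle = bundle_topics(TOPICS)
print(bundle)
```

The original attribute on each <file> keeps track of which topic a string came from, so translated text can be written back to the right DITA file after the LSP returns the bundle.
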

Sinclair Morgan, “How do you publish one thousand web pages, in 12 languages, at a high quality, 50% quicker than you can today?”

Scribe: Felix Sasaki

Sinclair: want to emphasize how machine translation can help to manage high volume of content in a high quality translation process
... three scenarios - conventional translation from scratch, with TM, and with MT + post editing
... scenario 1): 2500 words per day per translator. Total costs = 86.000 Euro
... scenario 2): 42.500 Euro
... assuming 50% leverage
... scenario 3): trained MT + PE: many variables are important
... e.g. MT output quality, language direction, quality of the source content, the training content, translation environment etc.
... snapshot of recent term projects:
... very dependent on projects and content
... average improvement of 43%
... so that's 30.000 Euro
... in scenario 3) we use both MT and TM
... we have a team of post editors, not translators
... so we had in 1) 40 days, in 2) 30 days, in 3) 14 days
... and the above cost savings
... what is needed for this performance?
... you need MT system with high quality
... a set of baseline languages
... you need a large amount of data to build trained engines
... you need a system that is easy to administer and secure
... MT adaptation is important for specific applications
... you need to be able to integrate MT system into your environment
... comparison between conventional translation versus baseline vs. trained mt systems
... the trained mt system scores much better than a baseline system
... productivity increase is important
... integration with translation environment
... easy to use interface, TM and automated workflow, support functions (terminology, spell checker, qa tools, reporting)
... efficient integration means: retain all benefits that we have in an MT environment, and add the savings from MT

bryan: example of baseline + trained MT engine - quality of trained MT engine is clearly better
... human resources: need MT developers, MT linguists, post-editors
... post-editing is a professional skill

Sinclair: how to guarantee high quality: use the same qa checks as for human translation
... include a linguistic review of the post-editing work
... use the same qa standard as conventional translations
... review gains of process
... summary: need a language technology infrastructure platform
... including - productivity, quality, automated workflow
... SMT, baselines, fast and cost efficient training
... easy to deploy MT
... and underpinning that with human resources
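
The cost and turnaround figures recorded for the three scenarios can be compared with a few lines of arithmetic. This is only a sketch using the totals as noted by the scribe (86.000, 42.500 and 30.000 Euro; 40, 30 and 14 days); the scenario labels and the helper reduction_pct are illustrative, and the "average improvement of 43%" quoted in the talk is a separate measure that is not derived here.

```python
# Totals as recorded in the notes; the labels are illustrative.
COST_EUR = {
    "1) from scratch": 86_000,
    "2) with TM": 42_500,
    "3) trained MT + PE": 30_000,
}
DAYS = {
    "1) from scratch": 40,
    "2) with TM": 30,
    "3) trained MT + PE": 14,
}
BASELINE = "1) from scratch"


def reduction_pct(baseline, value):
    """Percentage reduction of value relative to baseline, one decimal place."""
    return round((1 - value / baseline) * 100, 1)


cost_reduction = {
    name: reduction_pct(COST_EUR[BASELINE], cost) for name, cost in COST_EUR.items()
}
time_reduction = {
    name: reduction_pct(DAYS[BASELINE], days) for name, days in DAYS.items()
}

for name in COST_EUR:
    print(f"{name}: cost -{cost_reduction[name]}%, time -{time_reduction[name]}%")
```

On these figures, scenario 3 costs and takes roughly 65% less than translating from scratch, while scenario 2 roughly halves the cost but shortens the schedule by only a quarter.
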

Charles McCathie Nevile, “Localization in a Big Company: The Hard Bits”

Scribe: Felix Sasaki

chaals: introducing yandex - major russian search engine
... providing all kinds of services running in russian
... the company's DNA comes from language technology tools
... the company was started based on russian morphology analysis
... they applied these to search
... originally this was just for russian
... our search results page has 1-2 results for each page, for a particular user in a given time + place
... we try and give you local results, but we try not to do too much of that
... we try not to personalize too much, people don't like that
... linguistic processing of Russian: gender issues, case issues
... some services are localized, focus still on russia

example of yandex homepage

home pages for Russia and Kazakhstan

chaals: BEM (block-element-modifier)
... open source library that lets front-end developers put together a new page
... yandex uses it for users settings, e.g. flags per language
... we are using flags and iso language codes
... different content in different languages, different design aesthetics
... every company has an internal bias
... language technology is yandex DNA
... we are building on top of what has been developed so far
... standards are important, we recently joined w3c
... our developers speak and read and write .. in russian
... that is something the company has to deal with
... thank you for your attention

Hans Uszkoreit, “Quality Translation: Addressing the Next Barrier to Multilingual Communication on the Internet”

Scribe: Felix Sasaki

hans: other speakers are from business and talking about success stories
... in business if you are dissatisfied too long you are out of business
... in research if you are satisfied too long you are out of business
... MT success stories: free online MT systems, in-house online MT systems

hans: there are MT success stories
... closely related languages work well, but others and specialized languages don't work
... MT translation research has concentrated on high volumes
... needed for inbound translation etc.
... there is a lack of translation quality for outbound translation
... let's take a new approach:
... separation in good enough, almost good enough and not usable for outbound

hans: good can be 5-75%, then 15-65, then 5-75
... increases in bleu score: currently it is gained in the "red" part, that is "not usable"
... the new current approach tries to recognize truly good high quality estimation
... in the middle is computer-assisted translation - how to move the "almost good" into the "good" area
... many projects now help to move to the left, e.g. with regard to post-editing
... we tried to push the topic with various instruments
... one is META-NET a network of excellence that created a vision process in line with the preparation of the EU horizon 2020 program
... we focused on 31 languages in Europe
... see http://www.meta-net.eu/whitepapers
... then there is an infrastructure for sharing resources, see http://www.meta-net.eu/meta-share/
... and there is a strategic research agenda, see http://www.meta-net.eu/sra
... the SRA describes the needs of the industry, the predictions, mega-trends, research priorities
... strategic considerations:
... we said we would concentrate on some areas which have a high chance of being successful for Europe
... three research topics for the SRA
... translingual cloud, social intelligence, socially aware interactive assistants
... all topics are highly interconnected
... a lot of technology is used by several groups at the same time, e.g. a text parser
... about the translingual cloud: can be a method of generic and special-purpose checking
... automatic translation, language checking, post-editing etc.
... systematic concentration on quality barriers
... ingredients: semantic translation paradigm
... exploitation of strong monolingual natural language analysis and generation
... and modular combination of specialized analysis
... european service platform: in addition to the strategic research agenda, there is a proposal of a platform for services not restricted to translation
... you can hook into services that are not yet multilingual
... finally about a project that is a pilot to prepare something: QTLaunchPad
... assemble data and tools
... create shared quality metrics
... has been demoed and is now being finished
... then extending existing platforms for sharing
... consortium comprises DFKI, CNGL DCU, ILSP athena and Univ. of Sheffield
... as a subcontractor GALA
... QTLP planning panel:
... has many names in the European translation industry
... important bit is semantics based translation
... already in 1949 that was a vision by Warren Weaver
... now in statistical MT, it is clear that we have to go into semantics deeper
... this can be a talk of people working in semantic web
... if you look into the work on semantic web
... there is material that can be used for semantics-based MT

Localizers Q&A

Scribe: Felix Sasaki

questions from Paula Shannon, missed by scribe

Sinclair: cost calculation is still an issue
... if MT is provided as a service there is no additional cost

hans: my answer to my question
... in spirit, what I proposed is close to "knowledge based translation"
... but it is different: the approach to semantics has changed completely
... in the past the idea was that people sit together and build enhanced semantic models
... but now we are getting to applications also based on crowdsourced applications
... now people find huge collections that can be used as interlingua
... so it will come, but in a totally different dress

various questions from tomas

bryan: the zip file approach - we never did it that way

dan: specialists looking for language universals found only 6-7 universals
... this is not enough for an MT system

hans: a very good point
... highly specialized systems in speech and MT
... the systems do not perform better than the generic one
... but that doesn't mean that this has to stay
... it only shows that we are doing things wrong
... this shows just that we don't get various models right
... if you talk to developers of big systems
... right now, because of huge amount of data
... the generic system is still ahead
... but I would bet quite a bit of money that the specialized systems will be better
... that is why we leave the space for many companies
... that google and bing will never enter

chaals: not so sure
... yandex has no reason to go away from any area of translation
... and not to augment our tools with tweaks that will make an improvement

hans: this is not a technical question
... of course a generic system can do what the specialized system does
... but it is a question of whether the big players, when it comes to semantics, will have all the knowledge of every subfield
... and if they will share it
... if it is possible in this more semantic age to have more modeling that is not shared, then there is a possibility that the business of the smaller companies is successful

Gavin: missed question

Minutes formatted by David Booth's scribe.perl version 1.137 (CVS log)
$Date: 2013-05-08 09:57:00 $ Results corrected by Arle Lommel and Nieves Sande