See also: IRC log
This is the raw scribe log for the sessions on day one of the MultilingualWeb workshop in Luxembourg. The log has not undergone careful post-editing and may contain errors or omissions. It should be read with that in mind. It constitutes the best efforts of the scribes to capture the gist of the talks and discussions that followed, in real time. IRC is used not only to capture notes on the talks, but can be followed in real time by remote participants, or participants with accessibility problems. People following IRC can also add contributions to the flow of text themselves.
scribe: Felix
Richard introduces the project and the workshop
Piet: Europe has 24 official languages
.. we still have difficulties today
.. it is not surprising that people have problems understanding each other
.. the web helped a lot in moving communication between people forward
.. in the mlw project it is fundamental to bring the experts of the world together to make things easier
.. all websites should be multilingual
.. it is still difficult to combine 100-200 languages in a website
.. it should be easy to have access in any language for users
.. it should be easier for the linguists to transfer information between languages
.. using the linguists' advanced tools
.. important problem, you should find the right (technological) solutions to make life easier
.. after 60 years of computer development I'm surprised how technology makes life complicated
.. integration and ease of use of tools is still far away
.. this is today - multilingualism on the web and in our life is a complex problem
.. if you want to share enthusiasm and passion, it is important to have identification
.. I saw a logo of a project - be aware of the importance of that, so that people identify you
.. please use the right symbol in your work
.. this is the 4th workshop, a milestone in the project
.. please come up with some conclusions to make our life easier
.. to achieve that you need standards and interoperability
.. there is no policy session here - policy is important, without policy support nothing will happen
.. your "best solution" will not be used in the world
.. I'm sure our colleagues from DG INFSO are aware of this and Kimmo Rossi will work hard on this
.. I wish you good work and I look forward to seeing results that make the lives of users, linguists and technicians easier
Kimmo: 4000 people working in Luxembourg in translation
.. glad to be here - most of them work close to this building (the Jean Monnet building)
.. DG INFSO projects behind this event: MultilingualWeb and MultilingualWeb-LT http://www.w3.org/International/multilingualweb/lt/
.. we are in a re-organization process, name of departments might change
.. Richard mentioned that we are funding two projects behind this event
.. new project MultilingualWeb-LT - this workshop is the wrap-up of the "heritage" project and the start of MultilingualWeb-LT
.. the follow-up project will take the message about the gaps and the challenges to build practical reference implementations that mean something to the industry
.. it is very focused on machine translation, content management and localization
.. "LT" of course stands for "language technologies"
.. we had planned to combine this workshop with a showcase of European projects, but that will be separate
.. first at LREC in Istanbul in May with an exhibition of European projects
.. and META-FORUM event organized by META
.. join the alliance of META to demonstrate a push for language technology
.. it gives you visibility and new businesses
.. the META-FORUM event in Brussels (20-21) will feature an exhibition of LT projects in our portfolio
.. about future opportunities: the "Connecting Europe facility" (2014-20)
.. it is the most concrete opportunity to demonstrate what LT can deliver
.. CEF consists of several parts (roads, energy grids) and digital service infrastructures
.. these infrastructures contain "multilingual access to online services"
.. that is our part in CEF
.. I will suggest a breakout session tomorrow about what that part of CEF should contain
.. idea is to have "language services available everywhere"
.. idea is not to take things away from the industry, but provide a platform to share and trade, for industry, public sector and citizens
.. and aim is to make the web truly multilingual
.. if you have further questions I can tell you more in the breaks or during the breakout session
Ivan: For some people semantic web is a "knowledge management system", with big ontologies
.. other people don't care about ontologies, they think about large amounts of data
.. others think about enhancing search
.. others about integrating data
.. so people do this and that. Example from a Chinese university
.. incredible wooden structure, beautiful but complicated
.. people described the knowledge how the structure was put together
.. and they created beautiful videos showing that - this is knowledge management at its best
.. next example: medical application from the US
.. takes a lot of data. Aim is to personalize the data, combine the data, extract knowledge etc.
.. BBC has pages on music and musicians, example of "Eric Clapton" page
.. the BBC does not create the facts themselves - they have a system that aggregates the data from other providers, so again a very different application
.. another example: IMDb - gives reviews on movies
.. in the source they have added additional structured data (microdata)
.. that will be used by Google in search. The Google search result shows a 4-star assessment of the movie, taken from the site during crawling
.. this is the current state of semantic web: we have many application areas, see above
.. the general idea behind all this is: there is a lot of data on the web
.. more and more applications rely on the existence of the data
.. we do not want data silos
.. imagine a web that had documents but without links between them
.. real value of the web is not pages on the web, but links between pages
.. example of three different interfaces related to neuro biological issues
.. they have three different interfaces, and the databases that need to be combined are hard-wired
.. via the web, we can achieve linkage between such data silos, so that the data is a kind of unity
.. semantic web is a set of technologies with the real goal to build a web of data
.. on a longer term, we want to see the whole web as a huge global data base
.. as a long term goal
.. what is happening at w3c today?
.. in that area
.. RDF is the data format for semantic web
.. SPARQL is used to query RDF data, like SQL for relational data bases
.. SPARQL is about graph patterns in the semantic web "graph"
.. SPARQL has been a standard for some time. Now we are working on new features
.. describing new features in SPARQL
.. SPARQL has already been a unifying point between semantic web applications
.. with SPARQL 1.1 this becomes more complex, but also more powerful
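[Illustrative sketch, not from the talk: a minimal SPARQL query showing the kind of graph-pattern matching described above; the FOAF vocabulary is real, but the query itself is invented for this note.]

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    # find the names of all resources typed as foaf:Person in the queried graph
    SELECT ?name
    WHERE {
      ?person a foaf:Person ;
              foaf:name ?name .
    }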
.. exporting of RDF is another topic - one approach is "direct mapping"
.. is good for a general conversion
.. but we need another step to have the graph that we really want
.. it is a layer on top of the direct mapping, to give additional rules for creating the RDF graph
.. to create what your application needs. The additional step is expressed by R2RML
.. both the direct mapping and the R2RML approach are currently being implemented ("candidate recommendation" phase)
.. should be finalized (a w3c "recommendation") by this summer
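[Illustrative sketch, not from the talk: a minimal R2RML mapping in Turtle, showing the kind of additional rules layered on top of the direct mapping; the table, column and vocabulary names are invented.]

    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix ex: <http://example.org/vocab#> .

    # map each row of the EMP table to a resource of class ex:Employee
    <#EmployeeMap>
        rr:logicalTable [ rr:tableName "EMP" ] ;
        rr:subjectMap [
            rr:template "http://example.org/employee/{EMPNO}" ;
            rr:class ex:Employee
        ] ;
        rr:predicateObjectMap [
            rr:predicate ex:name ;
            rr:objectMap [ rr:column "ENAME" ]
        ] .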
.. next topic: adding data to HTML pages
.. data per page is not much compared to data bases, but still there is a lot of data
.. that is very valuable for search engines or other applications
.. two approaches: microdata and RDFa
.. both very similar. RDF can be extracted by both
.. microdata has been optimized for "one vocabulary at a time", doesn't have data types
.. RDFa provides the full power of RDF, at the price of more complexity
.. RDFa Lite is on the same level of complexity as microdata
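[Illustrative sketch, not from the talk: the same simple statement expressed in microdata and in RDFa Lite, using the schema.org vocabulary; the film title is invented.]

    <!-- microdata -->
    <div itemscope itemtype="http://schema.org/Movie">
      <span itemprop="name">Some Film</span>
    </div>

    <!-- RDFa Lite -->
    <div vocab="http://schema.org/" typeof="Movie">
      <span property="name">Some Film</span>
    </div>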
.. next topic: RDF working group
.. RDF itself is the basis of all semantic web technologies
.. it's like links from one page to the other. The only difference is that in RDF the links have a name, and there is additional infrastructure to make use of that
.. RDF is being cleaned up in RDF 1.1, no big changes
.. the turtle serialization is being standardized
.. and other features are being added, only a few
.. work on RDF 1.1 began a year ago
.. last working group is called provenance
.. goal is to add metadata to data on the web like: how was the data created
.. revision structure, revision history
.. for this you want one vocabulary - that is the goal of the provenance group
.. there needs to be a balance between something simple and useable, and something more complete
.. that is the balance that the group is working on
.. now coming to linked open data cloud
.. there are a lot of data sets out there
.. LOD diagram is nice but a bit misleading
.. there is an additional diagram showing interlinkage more clearly - there are still many links missing
.. major challenges of Semantic web are: scale of the data
.. interlinkage
.. ability to read and write data ("SPARQL Update")
.. currently discussing "linked data platform WG"
.. to work on HTTP infrastructure to modify linked open data
.. other challenges: data quality, ...
.. other challenges: role of reasoning with the amount of data
.. highly distributed data
.. huge amount of data in a few vocabularies
.. how to do inferencing in this kind of setup is not easy
.. major challenge is really interlinked data on the web
.. semantic web is trying to help
.. about multilingual web
.. what can be the relationship between multilingual web and semantic web
.. I have the impression that semantic web can give powerful technologies to categorize knowledge
.. that can be created in different languages
.. linked data also gives a source of information that you can use
.. e.g. analyze a blog, fetch semantic web data to use for that analysis
.. not always for translation, but also for language specific technologies
.. semantic web has a very simple way to represent languages
.. we need more complex ways
.. English is used for all vocabularies
.. with the current infrastructure it is hard to reason across languages
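[Illustrative sketch, not from the talk: the "very simple way" mentioned above is a plain language tag on an RDF literal, as in this invented Turtle snippet.]

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    <http://example.org/place/Luxembourg>
        rdfs:label "Luxembourg"@en ,
                   "Luxemburg"@de ,
                   "Lëtzebuerg"@lb .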
.. we also have a cultural issue - we find vocabularies that are badly designed in terms of localization
.. a need for improvement
.. looking forward for the discussion to learn what you see
Charles McCathieNevile: how does semantic web relate to private data
Ivan: big question mark - how to combine access control with semantic web
.. currently you have sometimes semantic web applications behind firewalls
.. but that's no solution
Charles: what do we know about how people use semantic web?
Ivan: we know a little bit more. We had a workshop last december about how linked open data was used in the enterprise
.. one message was: there is lots of data here
.. but there is a need for low-level APIs for access
.. that is wanted e.g. by large companies
Jan: we put a lot of efforts into languages in windows 8
.. 109 languages in total
.. example: 35 million customers in the US who speak Spanish at home
.. so languages present a huge opportunity
.. windows store: helps to deliver apps in more than 200 markets
.. with developer support for localization
.. Metro-style apps technology stack
.. lots of programming languages supported
.. c++, html5, etc.
.. multilingual app toolkit
.. its purpose is to help manage translation
.. has a pseudo language engine for localization testing
.. now demo of the toolkit
.. showing a weather app
.. in the app preference language is now set to German
.. rebuilding the app, it shows up in various languages, with pseudo translation including Bing machine translation services
.. XLIFF files are being created on the fly to support translation
.. in a separate editor translations are handled, including marking non-translatable text
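[Illustrative sketch, not from the demo: a minimal XLIFF 1.2 fragment of the kind such a toolkit might generate; the file name, resource IDs and strings are invented.]

    <?xml version="1.0" encoding="utf-8"?>
    <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
      <file original="Resources.resx" source-language="en-US" target-language="de-DE" datatype="xml">
        <body>
          <trans-unit id="greeting">
            <source>Hello</source>
            <target state="translated">Hallo</target>
          </trans-unit>
          <!-- a string marked as non-translatable -->
          <trans-unit id="appName" translate="no">
            <source>Contoso Weather</source>
          </trans-unit>
        </body>
      </file>
    </xliff>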
Tony describing the creation of layout via XSL like the creation of the world - this is just great - sorry, I can't scribe this
Tony: XSL 1.1 has a large section on internationalization
.. XSL has always been good on i18n: writing modes for multiple scripts
.. properties are defined in terms of "start and end", not "left, right, ..."
.. XSL-FO has the concept of different baselines of text
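[Illustrative sketch, not from the talk: how the direction-relative "start"/"end" properties mentioned above combine with a writing mode in XSL-FO; the values are invented.]

    <!-- in a right-to-left writing mode, "start" resolves to the right edge and "end" to the left -->
    <fo:block-container xmlns:fo="http://www.w3.org/1999/XSL/Format" writing-mode="rl-tb">
      <fo:block text-align="start" padding-start="6pt">...</fo:block>
    </fo:block-container>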
.. in XSL 2.0 we want to do a lot more of internationalization
.. in 2008 W3C had a Japanese layout taskforce
.. experts working to define Japanese layout
.. taking a Japanese standard as a basis
.. and the expertise of Japanese layout experts
.. the "Japanese layout document" is useful for implementing these features
.. ruby applies to Japanese and Chinese (bopomofo etc.)
.. there is a lot of information about Japanese thanks to the layout taskforce
.. most translated document: "Universal declaration of human rights"
.. often used to compare quality of layout in various languages
.. these days we can cover a lot of languages just with web browsers
.. UDHR is also avail. in Unicode, see http://unicode.org/udhr/index.html
.. last year I worked on formatting Khmer
.. I used UDHR as an example, there were many issues in the Khmer layout
.. so there is a need to learn more about local needs related to layout
.. the Japanese layout taskforce is very useful
.. the requirements document is used by XSL, CSS, other groups
.. should W3C make more taskforces? That requires more funding and effort
... easier with the W3C badge, easier to justify
.. or should there be a multilingual layout community group?
.. easy to set up, see http://www.w3.org/community/
.. contributor agreement makes it easy to use the outcome
Richard: presenting key issues related to multilingual topics currently being worked on in HTML5
.. describing the i18n working groups in W3C: i18n core, MultilingualWeb-LT group
.. internationalization interest group, other mailing lists etc.
.. please participate and contribute, we need your support and input
.. example of bidi in embedded text, visualization wrong because of missing directionality information
.. new "bdi" tag to create proper visualization
.. next topic: ruby
.. additional information e.g. about pronunciation of pictographic (Japanese) characters
.. Japanese layout document - currently producing a 2nd version of that
.. gives a lot of detail - would love to have this for Korean, Chinese, Arabic, Indic scripts
.. if you want to participate or know people who want to participate, please let us know
.. ruby in HTML5: there is no "rb" tag, you can put several annotations in ruby element
.. some problems, e.g. you want to highlight the ruby text itself: doesn't work because there is no specific element to select
.. you can use a "span" element, but that has issues too
.. we are working on these questions currently, looking for advice
.. working also with implementors on moving this forward
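[Illustrative sketch, not from the talk: HTML5 ruby markup as discussed - the base text is a bare child of the ruby element (no rb tag), with rp providing fallback parentheses for browsers without ruby support.]

    <ruby>漢字<rp>（</rp><rt>かんじ</rt><rp>）</rp></ruby>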
.. now Jirka about the "translate" flag
Jirka: localization and translation have a lot of issues
.. the "translate" flag helps with this. In many documents you have parts that should not be translated
.. if you use automated translation it would be helpful to have additional metadata that will help - it identifies parts not to be translated
.. also helpful for human translation and translation workflow in general
.. "translate" attribute proposal started a year ago at a multilingualweb workshop, but now it's added to HTML5
.. online machine translation services support this already, e.g. Bing Translate and Google
.. it is also supported by content formats like DITA and DocBook
.. in the MLW-LT working group, we will work on better integration of this into HTML5 and other metadata
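[Illustrative sketch, not from the talk: the HTML5 translate attribute protecting an invented product name from (machine) translation.]

    <p>Click <span translate="no">SubmitNow</span> to send the form.</p>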
Christian Lieske: has the "translate" flag been considered for inclusion in CSS?
Jirka: don't think so
.. in CSS there are just plain strings
.. there is no markup to convey additional metadata
.. if you need to localize CSS, I propose a pre-processing step
Richard: CSS is for presentation
.. it is not the content
.. for bidi, for example, you could do it in CSS
.. but we strongly recommend that you don't
.. because the bidi information is part of the document
.. so I would propose to see CSS just as the presentation layer
Ivan Herman: ruby, bidi and translation
.. these are features that non-XML formats also want to have
.. like JSON, RDF etc.
Felix Sasaki: MLW-LT group will work on bringing some of the features into other formats, we should talk about how to add that into Semantic Web
Jirka: for JSON you can have HTML inside it that contains the "translate" flag and other markup
.. I would hope people rather produce XML, which makes it easier to carry that kind of metadata
Davide Sanseverino: question about the "translate" flag
.. currently we create rules for several elements, not only one - what to do about this?
Jirka: the ITS 1.0 specification has a mechanism to create such rules. It is not in HTML5, but you can combine both HTML5 "translate"
.. and use a processing chain with rules
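[Illustrative sketch, not from the discussion: an ITS 1.0 global rule of the kind Jirka refers to, selecting several elements at once; the element names are invented.]

    <its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
      <!-- mark all code and partNumber elements as not translatable -->
      <its:translateRule selector="//code | //partNumber" translate="no"/>
    </its:rules>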
Richard: "translate" attribute is an interoperable solution
.. Bing Translate and Google Translate recognize it
.. there are other solutions, but that are not standard
for more info about "translate", see http://rishida.net/blog/?p=831
Unknown person: from the University of Karlsruhe
.. we had a scenario to annotate fine grained localization information
.. how do you deal with this?
Jan: you saw examples for Windows 8
.. it is up to translators to deal with what they want to represent
.. we support them
Richard: if you want to translate something like Luxembourg French there is a way of labeling it
Axel Hecht: we talked a lot about the translate flag in the past, happy that it was standardized
.. sometimes people are asking for specific translations,
.. have you asked about having more values for translate to specify that?
Jirka: in full ITS there is support to specify things like that
.. as part of the MLW-LT project, we are planning to have a mechanism that supports RDFa, microdata or other mechanisms to include that in HTML5 and other areas
<Jirka> For Axel - support for terms in ITS: http://www.w3.org/TR/its/#terminology
Felix Sasaki: call for feedback about features of MLW-LT, please give us your feedback and let's put implementations into the centre
scribe: Jirka
Jan Nelson is introducing speakers
Brian: introducing the Joomla CMS
... community project, no company behind it
... Joomla supports 57 languages
... Joomla provides 3 options for translating websites
... 1 - machine translation using widgets from Google,
Microsoft
... quality is not guaranteed, not indexed
... 2 - parallel translation using plugins, everything has to
be translated
... translations are indexed
... the question is whether we should just translate or provide
local content
... 3 - sites within site, translate content only when
appropriate
... the key in Joomla is categorise, add and show
... for each language a different menu can be provided
Loïc: tensions between relying on standards and using new technologies
... showing ugly XML
... translation handled by two plans
... plan A - more automated, developed in 9 months, 6x more efficient than plan B
... plan B - more manual process developed in 3 months
... for interoperability, all processes have to be updated to support Unicode
... maybe also to support XLIFF
Gerard: the goal of Wikimedia is to allow all humans to share the same knowledge, thus localization and translation are very important
... Wikipedia is now in 283 languages, with requests for 129 more
... problems with fonts for scripts
... solved by using webfonts
... there are no good free fonts for all scripts; Wikimedia is supporting development of some fonts
... missing input methods for some languages
... using ISO-639-3, Unicode and CLDR
... using TM and MT
... all localizers and translators are volunteers
... l10n is more expensive than development
... we support more languages than CLDR
... 6000 languages are still not supported
... languages not supported in CLDR are not supported in
applications (text editors, browsers)
... looking for a solution
xyz: How do you support users who are looking for content which is there, but who don't know the language it is in?
Gerard: currently only the current language is being searched
... ongoing project for searching in several languages at one time
Richard Ishida: Do you use BCP 47 or ISO-639-3
Gerard: there is no difference between
language and locale sometimes
... BCP 47 is used when ISO-639-3 is not sufficient
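[Illustrative note, not from the discussion: BCP 47 tags build on ISO 639 language codes and can add script and region subtags, e.g.]

    sr-Latn-RS   (Serbian, Latin script, Serbia)
    zh-Hant      (Chinese, Traditional script)
    nan          (Min Nan Chinese, an ISO 639-3 code valid as a BCP 47 primary subtag)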
Tomas Carrasco Benitez: MediaWiki, Joomla, ... they solve similar problems, but the solutions are different; there is no standard.
Gerard: we want to use standards, please help us to improve CLDR
Tomas: we are lacking standard for
multilingual websites
... each system uses different approach for translating
content
Jan: this is purpose of MLW and MLW-LT
Reinhard Schäler: How do you motivate localizers to work for free?
Gerard: Tools are not prepared for some languages
Brian: Joomla is completely community driven,
people wanted to build web sites in their languages
... we make it easy to supply translation for additional
languages
Axel Hecht: in Mozilla each localization team has different motivations
xyz: are styles switched when translation is done on the fly?
Brian: yes, in Joomla
<chaals> scribe: chaals
Spyridon: new mt service at EC
.. work started in October 2010
... we already have a system built around open-source software for a lot of languages, in use since last July
... I want to explain what we need in standards to make this
work better
... focus on openness and flexibility, and ensuring technological independence
... (repeating what people have said, a bit)
[slide - service architecture]
scribe: We have users, and we want to connect data. We have organised the project in 3 action lines - the MT engines, working closely with the data part
... Data part focused on preparation to improve output
quality.
... Our users are the Commission, and services funded by the
Commission (e.g. TED - tender documentation)
... For MT we started with Moses, because it is an EC-funded
open source system, and started using it and collecting
feedback.
... We want to use more data, more MT technologies where Moses
isn't the best so we want to be able to swap it out
... handle post-editing, ...
... My focus is the data.
[slide - Multilingual Web = Multilingual Content]
scribe: An author, different translators who
each have their own working methods, a publisher.
... A different publisher might not work in the same way, so
the content needs to adapt.
... Publisher needs to be prepared to receive the different
languages
[slide - Language Applications]
scribe: We want to give data to the web, and
get it from the web.
... Getting data from one website is easy. But adding a second
source meant having to rewrite the systems, and if a site
changes there is more work to do. And so on for each
website.
... Where there is no standard to follow, this is normal.
[slide - Giving our data to the Web]
scribe: ... We want a system that takes data
from databases, and makes it possible to automatically publish
in multiple languages.
... There should be continuity in what users get.
... We have had to make our own approach, and then we need to
stick to it.
[slide - Conclusion]
scribe: We need to be able to get multilingual information from more sources, and publish it to the Web.
... Need to allow free flow of information between applications
without losing a lot of time on adapting data.
... We expect MLW-LT to show a feasible approach, and
demonstrate the benefits of this.
... We are trying to be active (echoing richard's "tell us what
you need"). And we are ready to change.
... We have our internal systems, which we are ready to abandon
for a broader standard if there is one.
... So we are major users prepared to test, and to actively
contribute in development.
MH: The difference between Slovenia and
Slovakia: there is love in sLOVEnia
... Seeing my name written Matjŧ got me involved in
localisation
[slide - Existing Approach]
... We localise a lot of stuff at Mozilla. Usually we extract text, give the strings to localisers, and then post them back to the Web
[shows a website in english, how you translate the string, and what it looks like afterward]
... Problem: localisers don't see the context, and they don't see the available space.
... What can we do?
... In HTML5 we have contenteditable, which makes it possible
to just change text on a website - e.g. translating things you
see.
http://pontoon-dev.mozillalabs.com -> a development project to work with this.
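[Illustrative sketch, not from the demo: the HTML5 contenteditable attribute that in-page translation tools of this kind can rely on; the sentence is invented.]

    <p contenteditable="true">Translate this sentence directly in the page.</p>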
[live demo based on flaky versions of everything...]
... add a line of code to a site, then in
the pontoon side you can give the URL and start
translating.
... A UI at the bottom to manage the translation, and then you
select some text, and edit it to change the language.
... It's all cool.
... Except...
... How do you translate metadata like <title> or error messages in javascript?
... We have an advanced mode that shows all the strings you
have.
DL: MLW-LT follows from the multilingual web
project. Get involved...
... There is a new W3C working group
[slide - MultilingualWeb-LT]
... How do we make it easier to integrate
content going through translation?
... Already getting uptake from people beyond the
project.
... Started with a lot of representation from localisation
industry, we could do with more input from Content Management
and Users ...
... Key to the process is not just specification, but actual
implementation.
[slide - Approach]
... Heritage owes a lot to the ITS
specification
... it is nice that it is small, but we could add some more
useful information using this.
... What are the useful things to add? There are different
things different people will want
... Looking at HTML5 compatibility, and things like metadata in
CMS content for the 'deep web'.
... Don't want to invent new stuff where we can use things that
already exist.
[slide - Candidate Stakeholders]
... Main message: we need to look at the
whole stretch from production to consumption.
... There are lots of players, and different ways of building
the workflows.
... We want to find real requirements - problems that people
actually have
[slide - Scope of Use Cases]
[slide - Source Content Processing]
[slide - Localisation Quality Assurance]
... different approaches possible, and we need to think about e.g. what simple authors are doing, and how to work with people who have strong systems that need to integrate with e.g. XLIFF
[slide - CMS-L10N integration via RDF and XLIFF]
... Exploring ways of working with formal systems for tracking the process
[slide - Leverage Target Quality Metadata]
... There are some things that flow through the process, some things that are important for particular steps.
[slide - Rich Metadata for translation]
[slide - Next Steps]
... We're working in public, and we hope
to get involvement as well as being transparent about what we
are doing.
... Will hold a workshop in Dublin 11-12 June, getting close to
finalising requirements
... And then there are more things to work on beyond the scope
of this project - multimedia, javascript, etc
Reinhard Schäler: We wanted to be able to share translations and let communities rate and review them.
MH: We were thinking of this, taking inspiration from Universal Subtitles that allows people to help provide video subtitles. Nothing to show yet though
Des Oates: In architecture of MTU you had what looks like an API between various MT engines. We're looking at something similar in Adobe. Are you going to make those interfaces public, and are you interested in standardising the approaches?
SD: We're taking solutions supported by our
institutional IT department. We're developing on the basis of
commercial systems, building it to allow implementing rules for
different types of request.
... if you have multiple MT engines for a given language, you
call one or another based on e.g. the domain. But it is purely
internal.
... This is something that is available, that has been
customised for each client. I don't see interest in making the
custom configuration standard.
Lloyd: What kind of effort do you have in source quality in machine translation?
SD: We are aware of the importance of quality.
We have no way to impose rules on the sources.
... many users are drafting things that are not in their native
languages. We have editing units to help, we are considering
using authoring support, but in practical terms this looks
extremely expensive to provide.
... we're very early in this process.
DL: In MLW-LT the question hasn't come up yet. I think it is an interesting use case.
??: Could you expand on the policy for open source?
SD: Interesting question. There is a change in
policy since December - now Commission documents are by default
made available for everyone, unless there is a clear
justification for restricting access.
... There is a new open data initiative starting in line with
this trend.
Anonymous: MH, does the system give translation memory, how are translations reported back and integrated online, and can it be linked to other automatic translation services?
MH: Right now it uses translation memory from
our own localisation work.
... Linking to other machine translation services is possible -
we switched already to the Microsoft service (although we only
have that one at the moment, it is easy to switch)
... Integrating to the services. Pontoon can detect every text
node, and you translate a page, or using getText to do
localisation.
... so we create hooks for getText and use them to create
metafiles.
Tomas Carrasco Benitez: https://addons.opera.com/addons/extensions/details/swaplang -> extension that identifies pages which point to alternative languages, so users can select them.
[It's open source - feel free to adapt, improve or port it]
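[Illustrative sketch, not from the discussion: the standard markup by which a page points to alternative-language versions, which a tool like this can detect; the URLs are invented.]

    <link rel="alternate" hreflang="fr" href="http://example.org/fr/accueil" />
    <link rel="alternate" hreflang="de" href="http://example.org/de/startseite" />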
??: Do you have community participation?
MH: Facebook has a similar tool. And I hear
stories that there are fights in communities about whose
translation should win.
... we don't use pontoon with live sites yet. We could limit
access, etc.
... but we want everyone to participate. Need to consider how
to handle this.
???: Yes, this happens. At the end of the day we have to decide on who we accept - choose an authority, and then try to merge differences.
Felix: MLW-LT is on a very tight schedule.
... please tell us soon what you do and need and fill in the questionnaire at https://www.w3.org/2002/09/wbs/1/mlw-lt-requirements/.
<scribe> scribeNick: RyanHeart
Peter: Mission: From the EU to the public.
... Production of publications and preparation of publications in all EU languages.
... Different types of publications: Official and non-official.
... Official journal: 866 issues, 22/23 languages with > 1m pages.
... Consolidation of EU law is another area of work.
... Different online services are also provided: EUR-Lex (law), bookshop, etc.
... The idea behind the CELLAR project is to create one single repository for all metadata.
... Peter illustrates the structure of the CELLAR project with the target architecture consisting of a portal, index and search, content and metadata, post production and production layers.
... Peter highlights the dual nature of the repository in CELLAR, covering both content and metadata.
... The system has passed its development stage, according to Peter, and is now deployed.
... Another common portal is being developed, outlines Peter, to provide a better and easier-to-use interface to CELLAR.
... The CELLAR project uses a common data model, an ontology based on FRBR model.
... Peter explains that the CELLAR project uses RDF and taxonomies represented in SKOS.
<fsasaki> FRBR = Functional Requirements for Bibliographic Records
... Coded metadata supports the delivery of multi-lingual content, explains Peter,...
… which is also used to index the content.
... Interoperability is achieved by adopting standards as much as possible, such as METS (Metadata Encoding and Transmission Standard), Dublin Core, FRBR, Linked Open Data (LOD) and SPARQL, according to Peter.
... At the same time, the EC also contributes to the development and definition of standards, says Peter...
… including around core metadata (to enable global reach), using common authority tables (to harmonize metadata), and driving an exchange protocol for EU legislative procedures.
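[Illustrative sketch, not from the talk: a SKOS entry with multilingual labels, of the kind the common authority tables mentioned above could expose; the URI, notation and labels are invented.]

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    <http://example.org/authority/country/LUX>
        a skos:Concept ;
        skos:notation "LUX" ;
        skos:prefLabel "Luxembourg"@en ,
                       "Luxemburg"@de ,
                       "Luxembourg"@fr .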
... The European Legislative Identifier (ELI) is under preparation, says Peter.
Paul: It's about accessing business information across languages
… SAP is a partner in the project building a business analysis tool based on the DERI approach
… showing an example of how the system, called Monnet, is working.
… Ontologies cannot be directly translated; Paul describes how a lexicon is used for translation.
… The research objectives of Monnet are around the development and use of multilingual ontologies and the exploitation of domain semantics to improve MT.
… the financial use case for the Monnet project is 'Harmonizing Business Registration across Europe' using XBRL and xEBR.
… the methods used for domain training of term translation include hybrid methods, including domain lexicon generation from wikipedia & domain parallel corpora, LDA topic modeling with features mixed-in from the ontology etc.
… another use case is that of public services in The Netherlands, presenting different requirements and complex semantics.
… GELATO (Generation of LAnguage and Text from Ontologies) is one of the methodologies used.
… Ontology Lexicalisation is one of the central topics in the Monnet project.
… there are a number of different use cases in this area in Ontology Localisation, Ontology-based information extraction etc.
… The project is working with the W3C Ontology-Lexicon Community Group and has proposed its own 'Monnet' format.
Tadej: Translating proper names is a big problem for statistical MT systems, one that cannot be solved by the HTML5 translate attribute.
... Depending on the source and target languages, there are different rules for the translation of proper names.
... One solution for this problem is to
check whether a translation for an entity already exists.
... The information presented in a document is checked against
a knowledge base and disambiguated.
... The knowledge base contains labels and entities.
... This requires a good coverage of entities in the knowledge
base (kb) and works better in more widely used languages.
... A solution for languages without a wide coverage would be
to use a kb that is in a different language from that of the
document.
... There are a number of different ranking features that could
be used, including popularity and context similarity.
... For example, if Kashmir was used close to Led Zeppelin, it
would be obvious that the song rather than the country was
referred to.
... Cross-lingual gathering of candidate entities only works
for proper names and only if they are not translated to local
languages.
... Context similarity works in a vector space, treating the
distinct worlds as dimensions. This does not work across
languages.
... The solution is to not compute similarity but to map
texts.
... This can be achieved by training on parallel corpora with
Canonical Correlation Analysis (CCA) techniques. This has been
implemented for EuroParl.
... Future work proposed includes that of the FP7 project XLike
and the standardization work in the W3C Multilingual Web - LT
Working Group.
... The annotations can be used in HTML and are transparent for
normal CMS operations and web browser rendering.
... I am now going to do a demo of RDFa Lite, enrycher.ijs.si
Ivan Herman: Question about CELLAR project. You create a silo, but do you produce links to other data sets, such as government data?
Peter: You are right. We are aware of this and would, indeed, be interested in linking up with other similar public data repositories.
Joerg Schütz: Peter, is there any established interaction with DG Translation, as you share a lot of architectural and data management issues.
Peter: What is your organization? Ah,
Bioloom.
... DG Translation is one of our customers, in a sense.
?: A question for Paul. Domain lexicon generation from Wikipedia - how did you do it?
Paul: we looked at the terms to be translated and extracted them. Then went to the domain-specific Wikipedia entries and to other languages and retrieved the translations.
Olaf-Michael Stefanov: A question for Tadej. In relation to name disambiguation- what have you done in relation to cities that exist in different countries, such as Vienna or Wien?
Tadej: We look at the context.
... Therefore, Vienna in the USA would not be confused with
Wien in Austria.
Christian Lieske: A question for Paul and Tadej - you first identify language-neutral entities; then you do not use MT, but what do you use?
Paul: we actually do MT.
Tadej: There are people approaching the same
problem using MT, and it works reasonably well.
... But my point is that we do not have to use MT, that we can
use a cheaper approach and achieve very similar results.
Felix: Let me thank again all the speakers. Please be back at 16:30 for our next session.
<Arle> scribe: Arle
Annette: Will discuss web communication and
its importance for citizens
... We are lucky to live in Democratic societies, but we should
not take it for granted. Many do not enjoy freedom.
<RyanHeart> Great to hear about 'citizens' rather than 'customers'.
... Choosing leaders is not enough.
Citizens need to participate. The internet provides a way for
citizens to interact with leaders. For the EU the importance of
good web communication cannot be overestimated.
... But how can we communicate with citizens if we don't speak
their language?
... Fortunately for the EC, we have specialized web translation
service in the DGT that helps with communication and assists in
redesigning websites to assist citizens.
... We don't just translate, but also localize the whole
message with the target country in mind.
... Our team has small, autonomous teams for each language. The
lines between planners and translators are short to increase
participation.
<RyanHeart> A 'human' translator speaking. A first, after close to two years :)
... [Dutch translator who is speaking is
not on list of speakers.] Human translators are
underrepresented in this discussion. [Asks for show of hands
about different audience profiles.]
... Want to discuss what we do as translators. We try to get
people to consider multilingual needs from the start, to keep
it in the back of the mind at all times.
... That's why we fight to keep content short and simple, think
about consequences in other language versions.
... Keeping things short and simple in the Commission can be
difficult.
... We face the challenge of matching formats with our tools.
We lag a bit, but our web masters keep wanting to add new
tools.
... Tools are improving, but it is often a challenge for
translators to know what to translate and what not to
translate.
... There is a steep learning curve.
... [Back to Annette]
... Since we cannot translate everything, we have to choose
priorities carefully. We focus on top-level pages and
navigation.
... For specialist/niche pages, MT may do, but for information
going to a large audience, multilingual and user friendly in
the local style are required.
... The bigger the audience the higher the profile.
... We need to understand how citizens use the web and social
media to help make the best decisions.
... Quality assurance is our goal. We have to check closely.
This requires close collaboration with web teams. QA work is
time-consuming and expensive, but hard to quantify.
... [Back to Dutch translator]
... Now I want to share some examples of what we do. We have
huge volumes of legislation, but you will not read laws in
EUR-Lex, so we have a portal with short, concise information
that covers practical needs for citizens.
... We try to put out as much national information from
authorities as possible to make this a one-stop shopping site
for information where citizens can find it all.
... This is tricky: 27 languages from 23 countries. If there
are too many languages on a page, you can't use it. What would
happen if you found Maltese when you need another language?
Some human intervention is essential.
... Another example: website on legislation that allows you to
propose citizens’ initiatives: if you get 1,000,000 signatures,
the EU is obliged to propose a law.
... We cannot use MT for this since it could invalidate
efforts.
... Last example is the Commission home page: we try to
translate as much as possible.
... We do not just translate: we localize. For example, if a
Portuguese museum wins an award, we might not translate it for
a Dutch user, but instead put some local content.
... [Back to Annette]
... We deploy our multilingual expertise in service of
citizens. We are translators, but first and foremost
communicators in service of citizens.
Murhaf: I work for Apple on localization and did studies in Dublin.
... I will talk about the importance of right-to-left (RTL) languages and best practice.
... To start with, I want to talk about a friend who wanted to
do software business in the Middle East. [Shows examples of
promotions that don't work because cartoon shows solution
messing things up]
... The whole flow, right-to-left, means that the whole screen
flow needs to be reversed.
... [Shows screen shot with UI mirroring from OS X Lion]
... To make a site compatible for Hebrew
and Arabic, everything must be adjusted
... Everything must be right-aligned.
... You need directionality support for text. [Shows example in
Roman characters of RTL, LTR, and bidirectional text]
... The Unicode bidirectional algorithm (UBA) can handle
display. Text can be entered the same way (first character
first), but is displayed properly.
... The algorithm reorders the characters in the way the user
would expect based on the language.
... It has a set of rules to try to change the order from the
input string to what the user expects. [Shows example of
reordering rules]
... UBA does a good job in most cases. But there are a few
cases where it does not. E.g., if the paragraph direction is
not detected correctly based on first character; if strings
with different directionality are nested in difficult ways; if
strings contain numbers, names, etc.; strings that are
ambiguous for humans as well.
... If we can improve the difficult cases, it would be a great
goal.
... [Shows example in which “Apple” is the first word in an
Arabic string, which sets Left-to-Right as direction, but it
should be RTL.]
... [Example in which “Yahoo!” is separated from the ! because
of ordering; also one in which file extension is in the wrong
area.]
... [Shows example in which parentheses end up in the wrong
place]
... Right now you can use extra markup, tags, Unicode control
characters to force behavior, but this is manual action and
based on experience.
... The problem with manually adding them is that the
translator may not know what to do and they are not easy to use
since they require knowledge about the UBA. They are invisible,
which means they may be lost, breaking the string.
... Sometimes there is no way to check until runtime.
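[Illustrative note, not from the talk: the manual controls referred to above; the snippets are invented.]

    <span dir="rtl">…</span>     <!-- markup: set the base direction -->
    <bdo dir="ltr">…</bdo>       <!-- markup: override the bidi algorithm -->
    &lrm;    <!-- character: U+200E LEFT-TO-RIGHT MARK (invisible) -->
    &rlm;    <!-- character: U+200F RIGHT-TO-LEFT MARK (invisible) -->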
... UBA needs to be improved based on studying cases where
problems occur. We should find patterns and then parse strings
to improve behavior.
... Numbers are difficult. People think they do not change, so
they may hard code them, but Hindi numbers are used in Arabic,
for instance.
... Best practices include: site must support RTL, avoid
composed strings, avoid weak and neutral characters that cause
UBA problems, don't enforce direction, support localized
numbers, support multiple locales.
... In Tunisia they use Western numbers. Other places use others.
Nicoletta: I will speak on the Multilingual
Language Library.
... It is at the heart of the Multilingual Web
... The motto is “Let’s build it together!” Community
involvement is critical.
... We want to make more use of the trend for sharing. Part of META-Share for resources. It is a big step, but not enough. We need to move to collaborative resources.
... Interoperability gains priority in
this scenario.
... NLP is data intensive. Annotation is at the core of
training, testing, etc. But our community efforts are scattered
and dispersed, with insufficient opportunity to exploit
results.
... We want MANY (parallel?) tests/data for MANY languages. We
want to support all possible types of processing and annotation
we may be able to produce as language technology people.
... For example, annotation about time, space, etc.
... It is a step toward making our discipline more like mature
sciences.
... Those disciplines have thousands of people working together
on the same experiences. We aren't there yet, but to be mature,
we must be able to do this if we are to make a step
forward.
... Accumulation of massive amounts of multidimensional data is
the key to foster advancement in our knowledge about language
and its mechanisms.
... We do not want isolated resources. They need to be bound
together and their relationships examined.
... We want to create an infrastructure for a large language
repository to accumulate all knowledge about a language and
encourage analysis of interrelations.
... We cannot currently share this knowledge.
... The challenges are not technical or at the design level.
They are at the organizational level, in community
involvement.
... We are starting with the LREC Repository that hosts a
number of parallel/compatible resources in as many languages as
possible, focusing on multiple modalities (speech, text,
etc.)
... This will be contributed to META-SHARE.
... Authors are invited to process data in languages they
know.
... They are invited to focus on different sorts of
processing/tagging that they know.
... Processed data is shared back with the project.
... We currently offer data in 64 languages. English has the
most, followed by Spanish and Catalan.
... There are many missing languages.
... [Shows table of annotation types.]
... [Shows table to tools used for annotation]
[Shows table of standard formats. Heavy use of TIMEX3 for temporal data markup.]
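[Illustrative sketch, not from the slides: a minimal TIMEX3 (TimeML) annotation of the kind listed; the sentence and values are invented.]

    The workshop took place on <TIMEX3 tid="t1" type="DATE" value="2012-03-15">15 March 2012</TIMEX3>.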
... All data will be available
publicly.
... This is our first experiment. We hope it will set the
ground for a large language library.
... It will help us build all knowledge and let us build on
each other’s achievements. It requires a change of
mentality.
... We need to focus on collaborative mindset.
... Interoperability issues are a problem since we do not
require conformance to any standard.
... Please contribute at http://languagelibrary.eu
Fernando: I will start with context about the
Food and Agriculture Organization of the United Nations.
... We have over 190 member countries.
... Focus on aspects of agriculture, food standards, animal
diseases, etc.
... 5 regional offices. Work in a number of languages.
... See www.FAO.org as portal. But we are working in Facebook,
Twitter, etc. now.
... [Shows table of users by browser language]
... English is dominant (53%) but other languages are
growing.
... Our issues with language call for use of MT. We produce
governing bodies’ statutory documents, food standards, news and
campaigns, technical information, internal communication.
... We need to make this available in all languages, but our
budget is small.
... We use human translation for governing bodies’ documents,
and normative documents. We may use MT + post-editing for the
other groups, but we want to get to the point where we do not
need human intervention.
... We have been testing MT (Moses). We want to reuse legacy
translations. We want to integrate TM and MT and use our
knowledge and experience to improve the production of
multilingual content for the web.
... We want to improve the efficiency of the translation
process.
... Not all content can be translated by humans in all
languages; we need to accelerate the process, particularly for
legal documents.
... We started with allowing users to send queries to the
engine and provide translated responses. By monitoring the
requests, we would get a better view of what content is
demanded.
... This knowledge would help us focus our resources.
... Started with Spanish (for expertise) and Arabic (critical
demand)
... [Shows slide of architecture]
... We used TBX, TMX, etc. to use standard formats.
... However, after trying this, we found out that the best
format to fit SMT is plain text (.txt) aligned in a certain
way.
... We were moving from rich formats to non-rich formats.
... The engine requires that the text be cleaned up from
markup. It reduces the information in the available
translations for use by the engine.
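[Illustrative sketch, not from the talk: the line-aligned plain-text form described - sentence n of the source file corresponds to sentence n of the target file; the sentences are invented.]

    corpus.en:
      The committee adopted the report.
      The meeting is closed.

    corpus.es:
      El comité aprobó el informe.
      Se levanta la sesión.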
... Some issues we have found are: (1) there is little
information about MOSES for mere mortals; (2) best practices
are not documented in the UN network of practitioners.
... We have shared experience in JIAMCATT.
... We found that there are common problems.
... Other issues: there are standards for each part of the
process, but they do not integrate with each other, raising
interoperability problems. They do not work well
together.
... For us, the translate attribute is useful, but what do we
do when we have to convert to plain text?
... directionality is a problem, as are numbers, acronyms.
(E.g., in Arabic, acronyms are not used.)
... For our texts, English is the source language, but it is
written by those for whom English is not a native language.
Thus the source is “UN English” but the translations are in
native languages. It can create quality problems.
... We are watching the MultilingualWeb-LT project and hope it will help us bring more content to more languages.
Gerard Meijssen: For Nicoletta, is the information in the repository freely licensed?
Nicoletta: It is not a repository of
translations, but of language resources.
... MT results could be one resource that could be
contributed.
Gerard: The data is in the LREC repository, but it is available under a free license where you can do anything with it?
Nicoletta: They are available for everyone, but if you process the data and voluntarily contribute your processing back, you have to make it available. You can set licensing that ensures availability.
Tomas Carrasco: MOSES for Mere Mortals is from a member of our team. Keep your data, but use open formats. Legal issues can be difficult, but instead we should focus on agreeing on formats so we can share as needed. Sharing data is not enough.
Nicoletta: Let me clarify. We provide data and
we ask users to process the data (add annotations). It is all
through the META-SHARE platform. The reason is we want to see
the results and analyze what we get. We do not ask for a
specific format at this point because that is a top-down
approach but we want to see what the community does on its
own.
... We know that best practices, standards will emerge. It's a
different approach.
Daniel Garcia: For Annette, are you involved with translation of social media?
Annette: No.
Dan Tufis: The LREC initiative is great, but have you considered the issue of the quality of the data you are getting? I assume the collection should be reused, but if you don't know the quality, there is not much use of it.
Nicoletta: That is part of the experiment. We need to analyze the data for quality so we can understand the issues that will rise on a bigger scale.
... One possible way it may go is that when you have many layers of annotation, if there are many groups you can look at the issue in many ways.
??: For Fernando. How do your users cope with MT quality? Are metadata from databases (e.g., descriptions, keywords, etc.) translated to provide accessibility even for non-translated materials so that users can know about the availability of data.
Fernando: We use MT only internally for the
time being. The results do not go beyond our intranet. Quality
is an issue, and because people are used to human translation,
we don't want to expose ourselves to risk until we know the
results.
... For document production, we translate titles, etc. Much
uses controlled vocabulary. We use controlled syntax for
URLs.
... We use only English metadata at present in the CMS.
Jörg Schütz: Does SKOS play a role in your efforts?
Fernando: We use terminology database, other resources.
Jörg: For Annette, you mentioned the notion of a default language. How do you decide what it is?
... Is the fall-back always English?
??: Generally yes.
Jörg: That matches my experience.
[Applause for speakers]
Richard: Provides information on the reception at Parc Bellevue. If you want to take the bus, take the #18 or #12 and go to the Hamilius stop. Go further down the road in the same direction. Take the second on the right to Avenue Marie-Thérèse. The room is the Salle Marie-Thérèse.