MultilingualWeb Workshop, Riga 2016

Note: The following notes were made during the Workshop or produced from recordings after the Workshop was over. They may contain errors and interested parties should study the video recordings to confirm details.

Welcome & Keynote

Scribe: Felix Sasaki

Chair: Arle Lommel, Richard Ishida, Felix Sasaki

Paige Williams (Keynote) "People-First: Multilingualism in a Single Digital World"

Paige: technology enables language, as we know from the last two days. Example of Rwanda: there are three main languages: Kinyarwanda, French and English. When technology is not offered in Kinyarwanda, working with it has to be done in other languages. Here are examples of the Kinyarwanda interface pack: people are delighted to be able to work with their local language! We want to understand: what role does language play?

Paige: With local language access a lot of new businesses are born. What role does language play to enable local people to participate in a global economy? Sometimes language support can be a question of life and death, see e.g. the Translators without Borders session from yesterday.

Demo about using Skype for cross-lingual real time communication

Paige: relevance is an important barrier to access. The more relevant content is (including language), the more users will consume it. Mobile is driving internet adoption. Currently only 30% have connected to the internet, although 90% are in reach of an internet signal. One reason is that there is not enough relevant content for them. To achieve this we need to have at least 92 languages supported. With travel scenarios: which languages do we want to consume in written or spoken communication? Borders are increasingly irrelevant. How many of you have used your mobile phone for internet access here in Latvia, e.g. to find directions, to read email etc.? It seems all of you have. I assume it did not always work as you expected. We have a lot to do.

Paige: multilingual is the new monolingual. Location is not a key distinguishing feature for language. See the example of Canada. There is a growing population of Chinese. See also New York and San Francisco, with very diverse backgrounds (Hong Kong, mainland China etc.). Multilingualism means a culture of mutual respect. Here is an analysis of tweets in New York City per language. There is English followed by Spanish, then there is Russian, Portuguese, Turkish etc. Look deeper at a study, e.g. at JFK airport the tweets per language change. A user commented that Chinese is missing on the map. The reason is that they used not Twitter but a different social app.

Paige: I have worked with microsoft.com for the first 10 of its 21 years. In 1994 I was responsible for getting many different languages rendering with many different character sets. In '97 we added a country dropdown list. It was amazing to see the first Unicode database for registration systems. It feels very similar today with the multilingual digital single market.

Paige: Today there is a different environment. Short amounts of time, short publication cycles, interconnected services on the Web change our thoughts on localization. Consumers are deciding what to consume. We have to see how to deliver a localized customer experience.

Paige: How to enable opportunities? Not enough to offer a single locale or time zone, that may change during travel. Language is dynamic. We seek to understand: which markets, which languages? You need to consider: architecture, local regulations, markets, contexts of markets, and how to use language in a customer experience.

Paige: standards simplify the process. They help to be more effective. Example of the local language program (to assure a consistent experience across languages). Language portal: millions of localized strings, multilingual app toolkit etc. We train the rest of the company via the global readiness program. There are a lot of tools, guidance about symbols, gestures and many other items, solutions for developers (internally and externally). Readiness is about enabling experience for the right market in the right language.

Paige: Key is to understand a perspective that is both global and local. We must re-define what local means. Example of festival in Seattle Bellevue, with more than 80 languages. We must embrace diversity. See example of Latvia, a highly multilingual community.

Paige: Removing linguistic borders has never been more personal.

Example of sign language translator that helps hearing and deaf people to communicate.

Paige: We seek to communicate no matter where you are and what language you speak.

Example of American photographer living in China, using Skype real-time translation to communicate.

Paige: we seek to bridge digital divide. Even doing a dinner reservation can be a challenge. We put people into the center of our work.

Demo of Microsoft Cortana, used by Chinese user with multiple languages

Paige: Today's interconnected world means meeting people where they are and connecting with them. It is about acknowledging people, their language, customs etc., and about enabling people to connect with each other on their own terms. Thank you for your attention!

Keynote Q&A

Andrzej Zydroń: thinking of Babelfish in the books from Douglas Adams. Babelfish has caused many wars in the history of creation.

Mārcis Pinnis: question about speech technologies. For big languages you have worked for decades already. What about smaller languages? Are you thinking about opening technologies up so that others can contribute? E.g. I know that Windows does not offer integrated speech recognizers for several languages. We have tried to work on speech synthesizers, that is manageable. For speech recognizers it is harder.

Paige: I don't have a roadmap about new developments. We are not done, we continue to embrace more languages as we go.

Olaf-Michael Stefanov: to make multilingualism work we need to work together. Would Microsoft open up for the languages that it does not cover, e.g. via APIs, so that local support could be inserted?

Paige: No company will ever be able to handle so many languages first party. From that you have to think of language enablement strategies.

Mārcis Pinnis: Are you thinking in that direction? Example of a spell checker: it was difficult to integrate it into the Microsoft solution.

Paige: In that case it was rather an operational issue to take a 3rd party checker. I have been at Microsoft for a long time. The company is changing rapidly. Here I cannot give you something specific. But keep watching us, we will do new things in the space of language.

Han-Teng Liao: I like the idea of people first. Sometimes the interface is the issue. See e.g. Wikipedia, here it is language first, you don't have country represented. In the future, can we imagine a Canadian French search engine?

Paige: Great point. Technology enables language, language enables technology. The more you can enable experiences through a multilingual approach. Jan may want to add to this

Jan Nelson: working in the area of Windows, OS group. For years, locale or country has been the strategy, and language has been tied to that. Today we are talking about multilingual opportunities in every market, see e.g. the New York example. Last spring in San Francisco I talked to people; in some areas people speak English only as a second language. Currently the locale concept is in a lot of code. With Windows 10 we see much faster iterations of releases, so you will see more of this soon.

Unknown speaker: how has the rise of mobile changed Microsoft's approach towards multilingualism?

Paige: it changes everything. You rely on mobile for communication, navigation etc.

Unknown speaker: My language is Slovenian. What Microsoft Office does becomes the de facto standard in terms of language support. Microsoft should take into account: for local communities standardization is important.

Paige: Agree, and not only standards but quality and accuracy of local terms.

Arle on Skype scenario in China: you need to have high bandwidth. How will this evolve in the future to enable communication with no internet connection?

Paige: currently localization moves towards borderless support, based on internet connectivity. You clearly need to address the issue you mention.

Unknown speaker: Side effect of multilingualism: people who master several languages avoid brain diseases. They concentrate better. Do you have a vision of how people will change?

Paige: technology will not replace language learning, it will make communication across languages easier and to better understand other cultures. This will only enrich us in brain development and cultural awareness.

Developers and Creators session

Scribe: Arle Lommel, Felix Sasaki

Chair: Tatjana Gornostaja

Han-Teng Liao, "A Call to Build and Implement a Common Translation and Country-Language Selector Repository"

Han-Teng Liao: I am not a creator or developer, rather an internet researcher. But I benefit a lot from discussions around Web standards, without them I cannot do research. So I am here as a user of standards to make a bug report: what is wrong with country and language selectors.

Han-Teng Liao: example of selectors. Order and name are quite important. Web site about flight search, with another selector. Order is different. Related to information science: how to order information. If the index is not good it is hard to find things. It becomes more difficult the more items you have: if a web site supports 100 languages it becomes hard to find the right things. Example of hotel web site, again different. Example of European justice web site. Order in such examples is kind of arbitrary.

Han-Teng Liao: These issues matter because of user experience. If autodetection of the proper language interface fails, the user has to choose. The more choices you have, the higher the cognitive costs are, esp. if the order is arbitrary. In European Commission guides a language selection mechanism is required. Example of Chinese Wikipedia. They are trying to develop an interface that satisfies Hong Kong, mainland China etc.

Han-Teng Liao: Example of EC web site. There are some issues in the code. Question is: what is the right order? Should it be in a kind of alphabetic way? That depends on the literacy you have: we are used to alphabetic order, but that does not apply for other languages. Really frustrating for Chinese users. We have consistent collations e.g. for databases. Maybe the EU can lead the world to have a consistent collation, starting with names of countries and languages. Autocomplete on codes and names is also helpful. Should the user know the codes? I think it is not difficult to remember the codes for your language.

Han-Teng Liao: Different geographic categories depending on the agency that issues the categories. Check out web site on user experience and language. Please consider using language codes rather than flags for languages, they are better standardized. Conclusion: we need to have better lists for languages and ways for users to deploy them. One should promote information literacy to interact with language and country codes.
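Editorial sketch (not shown in the talk): one way to read the suggestion of a consistent order plus autocomplete on codes and names. The language list, the choice to order by code, and the helper names by_code and autocomplete are all assumptions made for illustration; a production selector would more likely rely on CLDR data and locale-aware collation.

```python
import unicodedata

# Hypothetical sample data: (language code, name in the language itself).
LANGUAGES = [
    ("lv", "Latviešu"),
    ("en", "English"),
    ("zh", "中文"),
    ("fr", "Français"),
    ("de", "Deutsch"),
    ("el", "Ελληνικά"),
]


def by_code(entries):
    """Order entries by language code, which is script-independent."""
    return sorted(entries, key=lambda entry: entry[0])


def _fold(text):
    """Lower-case and strip diacritics so 'fra' matches 'Français'."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def autocomplete(prefix, entries):
    """Return entries whose code or name starts with the typed prefix."""
    needle = _fold(prefix)
    return [
        entry
        for entry in by_code(entries)
        if entry[0].startswith(needle) or _fold(entry[1]).startswith(needle)
    ]


if __name__ == "__main__":
    print([code for code, _ in by_code(LANGUAGES)])
    print(autocomplete("fra", LANGUAGES))
```

Ordering by code sidesteps the question of which alphabet to sort by, at the cost of requiring users to recognize their code, which is the trade-off raised in the talk.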

Rodolfo Maslias and Roberto Navigli, "Metasearch in a Big Multilingual Terminological Data Space"

Roberto: presentation about a project proposal involving many partners from EU administration, research and industry. One objective is to create a common ontology to interconnect different terminologies. As a pivot / starting point we are using BabelNet to have a common ground and enrich it with semantic information coming from many data sources. Includes links to multimedia and many other links.

Roberto: BabelNet is part of the linguistic linked data (LLD) cloud. It integrates world knowledge resources like DBpedia with linguistic resources like WordNets. So it already acts as a bridge in the LLD cloud. We are now providing automatic linking between term bases, starting with IATE. Initially IATE was created on the basis of legacy databases. Later it was continuously enriched by translators. Now it has millions of validated terms in many languages. Because of the various stages of IATE creation, IATE can benefit from semantic linking. We provide a linking that is language agnostic. It also does not rely on the structure of IATE.

Roberto: We wanted to achieve a perfect match in IATE for BabelNet linking. This is not always easy. We started to apply Babelfy, an entity disambiguation and linking system developed at Univ. Sapienza. This already takes into account multilinguality and language agnostic feature, applying bags of words in all languages.

Roberto: one example. pomodoro di serra (tomato grown under glass) is not in BabelNet. Babelfy finds relevant concepts that could be matched. You don't have an exact match, but you have an idea of what concept to link to.

Rodolfo: we want to create a big metasearch engine that allows retrieval of semantic connections independent of language. In the EU parliament we created a tool with glossary links. Retrieving semantic connections is one of our goals. Objective is to add information in any format - this is always a big problem in metasearch.

Rodolfo: In the EU we have an internal tool called Quest. With this project we want to create a kind of open Quest tool. The general benefit: cross-language search helps to access information in unstructured big data. You enter a query in one EU language and get results in the others.

Rodolfo: We often have the problem of finding terminology to understand certain content. We made an experiment going the opposite way: we search for a service in a multilingual space. You can e.g. find services in a single space in the EU, e.g. if you look for a job or in the health area. Say you put in a job description in Italian and another in Greek. Then you can find a job as a waiter in Greek. Conclusion: our project wants to create a multilingual metasearch engine that can help to query unstructured multilingual big data. That is the only way to disambiguate big data sets.

Fernando Serván, "Moving from a Multilingual Publications Repository to eBook Collections in the United Nations"

Fernando: FAO is an international organization. The web presence of FAO has lots of needs in terms of language coverage.

Fernando: The website of FAO is not totally multilingual. Most of the content is in six major languages: English, French, Spanish, Arabic, Chinese and Russian.

Fernando: FAO has about 12,000 online publications in PDF and HTML. There are over 7 million downloads per month. About 450 pubs are produced in English per year, 150 in French, 30 in Arabic and Russian, plus translations. For eBooks (mobi/EPUB): about 25/year, with >10K downloads per year.

Fernando: Despite low numbers of downloads, FAO is moving towards eBooks. This change has many consequences. Currently eBook content is hard to find. We need to move to Amazon.com or similar places where people look for eBooks. But these venues have their own policies and practices, so positioning FAO is a challenge.

Fernando: talking about the EPUB format: it is a zip file with HTML5 and CSS. Text is fluid, not set in columns. Fonts are a particular problem since we cannot redistribute fonts. This is not a problem in PDF, but a problem in zip files. So we have to encrypt fonts.
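Editorial sketch (not from the talk) of the "zip file with HTML5 and CSS" point, using only the Python standard library. The EPUB container rules require the first zip entry to be an uncompressed file named mimetype containing application/epub+zip. The file names under OEBPS/ and the function name are assumptions, and the archive is deliberately incomplete: a valid EPUB also needs META-INF/container.xml and a package document, which are omitted here.

```python
import io
import zipfile


def build_minimal_epub_container(xhtml, css):
    """Return the bytes of a zip archive laid out like the start of an EPUB.

    Deliberately incomplete: META-INF/container.xml and the package
    document are left out, so validators will reject the result.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry must come first and must not be compressed.
        archive.writestr(
            "mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED
        )
        # Hypothetical content paths; real books choose their own layout.
        archive.writestr(
            "OEBPS/chapter1.xhtml", xhtml, compress_type=zipfile.ZIP_DEFLATED
        )
        archive.writestr(
            "OEBPS/style.css", css, compress_type=zipfile.ZIP_DEFLATED
        )
    return buffer.getvalue()


if __name__ == "__main__":
    data = build_minimal_epub_container(
        "<html><body><p>Hello</p></body></html>",
        "p { margin: 0; }",
    )
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        print(archive.namelist())
```

Because the text lives in plain XHTML files with a stylesheet, it reflows on the reading device, which is why fixed column layouts from PDF do not carry over.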

Fernando: Metadata can be embedded in EPUBs. Returning to the topic of distribution, each channel is different. Files validated for one site might not be valid for another. File size is sometimes an issue. Each vendor validates formatting differently. For Kindle, only half of FAO’s languages are supported: no Chinese, Arabic, Russian. This limits our distribution.

Fernando: We have a need for best practices. We use post-conversion (PDF->EPUB), but there is little guidance on how to do this. Currently, the worlds of publishing and IT tend not to talk to each other. This is different compared to Web and IT.

Fernando: tags for <tables> tend to create problems when we use tables to align text (e.g., acronym lists). Many tables and other items need to be converted to JPEGs.

Fernando: EPUB-Web is a proposed approach to combine the ways we produce for the Web and how we produce for publication. We have difficulties with EPUB validators. The messages and errors are cryptic. This makes fixing a trial and error process.

Fernando: About next steps: we need to optimize the conversion process. We need partnerships to move forward. Mobile phones will be the distribution platform for much of the world. These are challenges we face.

Juliane Stiller, "Evaluating Multilingual Features in Europeana: Deriving Best Practices for Digital Cultural Heritage"

Juliane: About Europeana: 2300 institutions contribute, there are over 42 M objects (text, images, video, sound). Europeana is highly multilingual. The Europeana search offers automatic query translation. To foster better query results, metadata for objects is automatically enriched. Via enrichment, links from the metadata entries to DBpedia are generated. The search index then pulls in the variants in other languages.

Juliane: In Europeana certain types of entities are target of enrichment: Time, concepts, GeoNames, agents. Sometimes automatic entity recognition fails. This is a problem since the metadata records are created by humans. If an automatic named entity processing and link resolution leads to incorrect results, it leads to loss of trust, irrelevant results, and a bad user experience.

Juliane: We need to evaluate features and come up with best practices for dealing with automatic metadata enrichment. E.g., there is a case where Poland was incorrectly associated with pollen, due to a match of the Basque word Polen (pollen) against Dutch Polen (Poland).

Juliane: There is a relationship between queries, enrichment, and the actual objects. In a study we considered top queries and reviewed top responses to see how well enrichments worked. Concepts were successful 78% of the time (22% error rate). About 7% of enrichment mistakes had an impact (i.e., they were the reason for an incorrect retrieval). Changing queries might result in better results.

Juliane: We did not consider user satisfaction, since we do not know what users think of these bad results. The second technology we are looking at to improve multilingual access is query translation. The user can choose up to six languages for query translation. Queries are sent to Wikipedia API, translated, and a new query is created. Europeana then returns a mixed language result list.

Juliane: We compared automatic translations against manual translations as a baseline. 51% of queries were translated correctly. 30% were untranslated (probably due to strangeness of cultural heritage queries). 19% were incorrect. Some incorrect translations were based on disambiguation pages in Wikipedia. Evaluations help to identify problems and target efforts. One important challenge is to incorporate the best evaluation methodologies.

Developers and Creators Q&A

Peter Schmitz: For Han-Teng. We are publishing our authority lists online, they are available as XML and SKOS. Translations for EU member states can be seen as official. Behind this there is protocol order, an agreement between member states that we have to respect.

Richard Ishida: For Han-Teng. We have guidelines at the W3C on the topic of selection lists. You did not talk about how to identify that a selection list is a list for countries and regions. Example: you have a list of languages on a Persian page; even if the list starts with English you may not be able to read it.

Olaf-Michael Stefanov: There is no solution to the European protocol and the UN protocol having different orders e.g. for country names. An open-source protocol would be the only solution to this problem.

Kirill Soloviev: I am co-founder of a startup for the translation industry. About Rodolfo's and Roberto's presentation about a massive multilingual aligned terminology database. A question: Microsoft has a huge terminology database, so why not link that to IATE and other resources?

Rodolfo: on termcord.eu we use Microsoft database links, we have it in an internal portal. It is accessible EU internally. The Microsoft database is a major source for IT terminology. This is why we want to do metasearch, because it is impossible that a database like IATE has all terminologies we have. If a general query gives you no result, the metasearch should give more specific results. In metasearch we should have three categories: 1) institutional (covered e.g. in JIAMCATT’s terminology group), 2) academic databases, and 3) industry databases (e.g., Daimler, VW for cars). Many are very reliable, so we want to connect to them, but we cannot incorporate them directly.

Roberto: We don't need to create a catch-all repository, but rather we need to interconnect existing repositories. It is also important to properly exploit interconnections in and across languages. The more connections you have, the more you can work in any language. In the medium term we can provide automatic tools for anyone to use to enrich their own resources: you input your glossary and it is linked to the linguistic linked open data cloud.

David Filip: I wonder about the transfer from PDF to EPUB. This seems inefficient, since PDF is not an authoring format. Did you consider looking into source files like InDesign or Word, to transfer from them to EPUB? PDFs are not designed to keep logic, but rather to provide presentation. The other question in general: is JPEG really enforced in EPUB? Wouldn't PNG or SVG be a better choice? Are they allowed?

Fernando: I totally agree with you about PDF. Using InDesign makes it easier to create an EPUB. But production is decentralized and we don't have control from the repository over the workflow. PDF is easier to send for per-page conversion than to try to control in-house. About the second question: We use PNGs and JPEGs. But the conversion from PDF brings in JPEGs.

Han-Teng: there are databases with country names, e.g. Unicode CLDR, and the UN has official names. Question is: should the EU use this? Think e.g. about immigration, you have to serve e.g. Spanish-speaking people from South America. There is a need here.

Localizers

Scribe: Arle Lommel, Felix Sasaki

Chair: Fernando Serván

Leonid Glazychev, "Standardizing Quality Assessment for the Multilingual Web"

Leonid: talking about ASTM standard proposal about translation quality. Why a standard for this topic? Standards are essential for the whole cycle of multilingual content production. There is nothing on the market of standards for quality assurance. Approach of the standard is based on a hybrid methodology, combining holistic and atomistic factors. Holistic quality assessment is important. Quality assurance can be complete and accurate.

Leonid: there is the quality triangle. It is composed of adequacy, holistic factors, and readability. Atomistic quality can be measured based on all existing quality metrics.

Leonid: Challenges of application of linguistic quality standards are time limitations, or the need to use e.g. crowd sourcing.

Leonid: to overcome these issues, we work with a simplified methodology. We are gaining objectivity and accuracy through statistics. In that way we have a simplified quality metrics framework. The quality assurance process is described in a clear and brief manner. We freeze translated content and assess atomistic quality.

Jan Nelson, "XLIFF 2.0 and Microsoft’s Multilingual App Toolkit"

Jan: talking about the MAT, I am one of the co-creators. It is a free plugin that goes into Visual Studio. It has a common localization workflow built in, you can use it for C++, C# etc. The developer does not have to think about this. We support XLIFF in the MAT. We worked on the TC for XLIFF 2.0. We had been on the TC for XLIFF 1.0. At that time Microsoft was a different company. We were still not ready to address localization standards. Microsoft had grown rapidly and there were a lot of different silos in the company. We are a very different company today.

Jan: we have integration of Microsoft Translator in the MAT. At any level you have a full license to the translator. That is not true outside the MAT; inside the toolkit the license is unlimited. We have an extensibility model to access further APIs, including the TAUS API.

Jan: we provide pseudo translation. We have a translation editor that allows you to work with individuals that don't have Visual Studio but who have linguistic capabilities. When you start MAT, you can add translation languages from the .NET language list. You can generate translations and test them. So for developers it does not take many clicks until you have a localized application on your desktop.
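Editorial sketch of what pseudo translation generally means; this is a generic illustration, not the MAT's actual algorithm, which the talk did not describe. The bracket markers, the 30% padding ratio, the {placeholder} convention and the function name are all assumptions.

```python
# Map plain vowels to accented ones so untranslated strings stand out.
ACCENTED = str.maketrans("aeiouAEIOU", "áéíóúÁÉÍÓÚ")


def pseudo_translate(text, expansion=0.3):
    """Return a fake 'translation' that exercises the UI.

    Accents reveal encoding problems, padding simulates longer target
    text, and brackets expose truncated or concatenated strings.
    Anything inside {...} is treated as a placeholder and left intact.
    """
    out = []
    in_placeholder = False
    for ch in text:
        if ch == "{":
            in_placeholder = True
        out.append(ch if in_placeholder else ch.translate(ACCENTED))
        if ch == "}":
            in_placeholder = False
    padding = "~" * max(1, round(len(text) * expansion))
    return "[" + "".join(out) + padding + "]"


if __name__ == "__main__":
    print(pseudo_translate("Save {name}"))
```

Running an application with such strings lets developers spot hard-coded text and layout overflow before any real translation exists.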

demo of the toolkit

Jan: talking about standards: at Microsoft we are supporting about 150 standards organisations. We are active in more than 400 working groups. XLIFF is a case that shows our commitment to standards related to language.

Showcasing usage of machine translation for generating translations in the toolkit, stored as an XLIFF file. The file can be stored as XLIFF 1.2 and 2.0.

Jan: quality is an important aspect of translation also for app localization. Depending on the application area of the app, quality is more or less relevant.

Jan: there was a discussion before about freezing resources. This does not work for developers. So in MAT we have a state engine that helps to track changes and make choices.

Jan: XLIFF object model work is ongoing. It is not final yet and will go into the toolkit asap.

Jan: MAT has been a huge success. Example success story: with MAT, in a real scenario, an app was localized for a German company that wanted to go to China. They had never done this before. With MAT it took them three weeks until shipping the product in China.

David Filip and Loïc Dufresne de Virel, "Developing a Standards-Based Localization Service Bus at Intel"

David: Loïc is co-author of the presentation. Topic is work in progress, a partnership between Intel and CNGL - Adapt. Key terms of project: standards based architecture for localization and internationalization service provision at Intel. Other key terms are data models, architecture etc. Key standards are CMIS 1.1, ITS 2.0 and XLIFF 2.0. Microsoft helped a lot in pushing the XLIFF 2.0 standard.

David: approach of project is based on data model and architecture. Both are extensible, to be vendor agnostic and future proof. Need for multilingualism is exploding. In 2008 people said it was ok to translate into 8 languages. Now it is reality to translate into 40 languages. Projects like Wikimedia are in 1000 languages, and this is still a fraction. It is not a linear growth since you talk about pairs of languages. English is not the only source, you will need to look into more directions. This is why you need to be future proof.
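Editorial worked example for the non-linear growth remark: with n languages and any language allowed as source, there are n * (n - 1) directed source-to-target pairs, versus n - 1 when a single language (e.g. English) is the only source. The function names are illustrative; the language counts are the ones mentioned in the talk.

```python
def directed_pairs(n_languages):
    """Source-to-target pairs when every language can be a source."""
    return n_languages * (n_languages - 1)


def single_source_pairs(n_languages):
    """Pairs when one fixed language is the only source."""
    return n_languages - 1


if __name__ == "__main__":
    for n in (8, 40, 1000):
        print(n, single_source_pairs(n), directed_pairs(n))
```

Going from 8 to 40 languages multiplies the language count by 5 but the number of directed pairs by almost 28 (56 to 1560).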

David: key standards for internationalization and localization are ITS and XLIFF. The data model core is a metadata rich message format that is based on these two standards. You have a standardized canonical token that travels through the architecture. The workflow is not just localization but the whole content life cycle, including many content producers and their tools and localization service providers. Several standards are important here too, but we concentrate on XLIFF 2.0, ITS 2.0 and CMIS 1.1. ITS 2.0 provides an important bridge. On the CMS side, no matter how many languages you have, it is always monolingual content. You need to pair it. Here comes XLIFF. It is a bi-text format.

David: ITS has 19 data categories. You can start with one data category and then add more. XLIFF 2 has a small core. So modularity of standards is critical here. You can start with CMIS capability that supports only basic features of the content. You can start with XLIFF core module and add many more. E.g. size and restriction module is very powerful. It allows you to formulate size and restriction constraints in any way, like have not more than 18 Unicode code points for a given field. But it can also be much more complex and extensible.
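Editorial sketch of the "not more than 18 Unicode code points" example. It only illustrates the check itself; it does not reproduce the XLIFF 2.0 module's attribute syntax or its profile mechanism, and the function names are assumptions. In Python, len() on a str counts code points, which is not the same as UTF-8 bytes or user-perceived characters.

```python
import unicodedata


def count_code_points(text):
    """Python strings are sequences of code points, so len() suffices."""
    return len(text)


def fits_restriction(text, max_code_points=18):
    """True if the text respects a code-point-based size limit."""
    return count_code_points(text) <= max_code_points


if __name__ == "__main__":
    label = "Saglabāt"  # Latvian for "Save"
    print(count_code_points(label), len(label.encode("utf-8")))
    # The same visible text can use more code points when decomposed.
    decomposed = unicodedata.normalize("NFD", label)
    print(count_code_points(decomposed))
    print(fits_restriction(label), fits_restriction("a" * 19))
```

The difference between code points, bytes and normalization forms is one reason a restriction mechanism needs to state explicitly what unit it counts.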

David: diagram on interconnecting developers. There are abstractions everywhere. Start with GitHub but not directly. Any developer needs to have access. Get machine translation, test your build with MT etc. Conceptual diagram of solution: in the middle there is an enterprise service bus. It does only routing, everything else is done via abstraction, informed by the CMIS standard. It allows abstracting the repository capabilities of any reasonably structured CMS. There is an abstraction for any number of source control solutions. About GitHub: we are connecting via an API, which is not going to change for any other source control solution.

David: more discussions will be done in Berlin during the 6th FEISGILTT event in June.

Machines

Scribe: Kevin Koidl, Felix Sasaki

Chair: Feiyu Xu

Asunción Gómez-Pérez and Philipp Cimiano, "LIDER: Building Free, Interlinked, and Interoperable Language Resources"

Asun: summary of current state in LIDER, achievements, and how LIDER has contributed to free linguistic interlinked resources - linguistic linked data (LLD).

Asun: LIDER uses three W3C community groups: LD4LT, BPMLOD and Ontolex. We also have gathering of use cases and requirements, reference architecture etc. All discussions are done in the three W3C groups.

Asun: how LLD evolved during the lifetime of LIDER. Now there are many terminologies, thesauri, knowledge bases etc., also presented here in Riga. LIDER is also providing best practices and guidelines (BPMLOD) for LLD, including guidelines for conversion or guidelines. These are summarized in LIDER reference cards. They have been discussed in the BPMLOD W3C group. In the context of the LD4LT group, we built a metadata model based on the META-SHARE model. It has been expanded with other information. One requirement is backward compatibility with the original model. On top of the metadata we have created the LingHub tool: an aggregator allowing to deploy metadata from different repositories.

Philipp: LIDER's roadmap focuses on understanding current and future needs of the community; to understand these needs a number of roadmapping workshops have been executed and reported. Our roadmapping document includes results from over 100 relevant stakeholders in this field. The horizon for our roadmap is along three lines: business line, public sector, and workflows or value chains involving LLD. We make predictions around three timelines: 2, 5 and 10 years. Please look at the roadmap and give feedback.

Philipp: A reference architecture has been developed within LIDER which supports guidelines and standardization. Based on the reference architecture, reference cards were published and presented in Riga which include specific guidelines.

Philipp: the benefit of LLD is not only publishing of datasets but linking, and how the process can be made automatic, see the presentation from Roberto and Rodolfo this morning. LingHub is our reference implementation of the discovery layer: how to find data sets. There are two layers that LIDER has not spelled out: independent benchmarking services including certification.

Philipp: the architecture is supposed to be an open architecture. This is very important so that the architecture can work for everybody.

Dave Lewis and Andrzej Zydroń, "FALCON: Building the Localization Web"

Dave: FALCON - application of linguistic linked data principles in localization. We are testing how far we can go with LLD in a localization tool chain. We call this the localization web: applying linked data principles to localization, i.e. interlinking between resources, standardized APIs for query, and extensible metadata. We want to see how to add more value to resources, e.g. by making use of metadata that gets generated during the localization process. We explore how we can use technology to improve terminology extraction and machine translation training. This fits very well with the discussions in the Connecting Europe Facility session next door.

Dave: We are looking at resources like machine translation training corpora or terminology resources. Some of these are implicitly connected, and we are looking into making the connections more explicit, e.g. using BabelNet via the Babelfy service. Definitions and pictures are helpful for translators.

Dave: We are finding helpful information on multiword units. Some of them are in EuroVoc terminology, but others are not. We annotate terminology with validated terms, e.g. from EuroVoc. We force the multiword units into machine translation training. During post-editing, the post-editor will look into phrases. We capture the post-editing outcome in the term base and reflect that as linked data. We annotate who came up with the links, and the approval state.
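
Illustration (not part of the talk): the annotation described above, a term link plus who created it and its approval state, could be serialised roughly as follows. This is a minimal sketch, not FALCON's actual data model. The `EX` namespace, the `approvalState` property and all identifiers are hypothetical; only `skos:exactMatch` and `prov:wasAttributedTo` are existing W3C vocabulary terms. As a simplification, the attribution is attached to the term resource rather than to the link itself.

```python
PROV = "http://www.w3.org/ns/prov#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
EX = "http://example.org/falcon-sketch/"  # hypothetical namespace for this sketch


def term_link_triples(term_id, target_uri, editor_id, approved):
    """Return N-Triples lines for one term link with provenance and approval state."""
    term = f"<{EX}term/{term_id}>"
    state = "approved" if approved else "pending"
    return [
        # The link from the captured term to an external concept.
        f"{term} <{SKOS}exactMatch> <{target_uri}> .",
        # Who came up with the link (hypothetical editor URI).
        f"{term} <{PROV}wasAttributedTo> <{EX}editor/{editor_id}> .",
        # Approval state, using a hypothetical property.
        f'{term} <{EX}approvalState> "{state}" .',
    ]


triples = term_link_triples("t1", "http://example.org/concept/42", "ed7", True)
for line in triples:
    print(line)
```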

Dave: Linked Data usage in FALCON is based on various W3C standards. Also, following the guidelines from LIDER helps a lot.

Andrzej: with natural language processing workflows, you have people creating resources that are used by language technology, and things are improved in cycles. Given a machine translation engine like Moses, we ask: can we create an optimal route for a translation? We have seen very positive outcomes.

Andrzej: we use various algorithms to understand which words and phrases relate during MT training. We are now using BabelNet and it helps a lot to improve things. The next step is to have an integrated workflow. When we have a segment we can enforce terminology. We can feed back to the SMT engine everything that comes along. All documents have a life cycle. In FALCON, we re-train, revise and analyse items and provide a lexical-conceptual life cycle in addition to translation.

Andrzej: in FALCON, in terms of system integration, we have better in-context post-editing, feeding term suggestions from post-editing to term management, feeding MT training, using BabelNet, publishing interlinks of parallel texts, and capturing information on how long each post-editing step takes.

Ilan Kernerman, "Semi-Automatic Generation of Multilingual Glossaries"

Ilan: Great to be here. I am a newcomer to this community. Language technology should not be left to programmers; that is kind of my role here. We are a dictionary company. Some people say that there is no use for dictionaries as a product anymore. But our content is very valuable. I will talk about how we are using the content to create new content.

Ilan: Example of an English learner's dictionary. We have many language versions, creating multilingual dictionaries. What we need to go through linguistically: creating micro and macro structure, creating headwords, revisions, many linguistic teams etc. The workflow implies creating tooling for ourselves. The data is in XML format, and we are moving to RDF. We have tools for QA and statistics. We see the two factors (human and technology) working together more and more.

Ilan: on language learning: we started creating semi-bilingual dictionaries. We put them together and got a multilingual dictionary. We produced a multilayer multilingual approach. This morning Paige Williams said multilingual is the new monolingual. This is true. We started to interconnect things internally even before going out. This is the vision: every item in the lexica can be a starting point for every user. That is an aim; we are not there yet.

Ilan: Example of our test site. You have a learner's dictionary with, for each sense of a word, a brief translation etc. Example of a dictionary published by a leading Latvian publisher. In this example we have over 40 languages. We have this monster with very precise translations for each meaning of the headword. Now we are exploring how we can recycle items and do semi-automatic development of multilingual glossaries. First we need to extract items. There are e.g. translations with markers like commas etc. Then we turn translations into headwords, add parts of speech, and add links back to the original entries.
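
Illustration (not part of the talk): the extraction steps above can be sketched as a small inversion of sense-level entries. This is a minimal sketch under stated assumptions, not the speaker's actual tooling: the sample entries are invented, translations are assumed to be separated by commas only, and the part of speech is simply copied from the source entry.

```python
from collections import defaultdict

# Hypothetical sample data: (English headword, sense number, part of speech, raw translation field).
ENTRIES = [
    ("bank", 1, "noun", "Bank, Geldinstitut"),
    ("bank", 2, "noun", "Ufer"),
    ("bench", 1, "noun", "Bank"),
]


def build_reverse_index(entries):
    """Turn translations into headwords that link back to the original senses."""
    index = defaultdict(list)
    for headword, sense, pos, raw in entries:
        # Step 1: extract individual translations, splitting on comma markers.
        for translation in (part.strip() for part in raw.split(",")):
            if not translation:
                continue
            # Steps 2-3: the translation becomes a headword that carries a part
            # of speech and a link back to the source headword and sense.
            index[translation].append({"pos": pos, "source": headword, "sense": sense})
    return dict(index)


reverse_index = build_reverse_index(ENTRIES)
for new_headword in sorted(reverse_index):
    print(new_headword, reverse_index[new_headword])
```

Because each link points to a specific sense rather than to the whole word, an editor reviewing the entry for "Bank" sees two separate candidates, which is where the manual part of the semi-automatic workflow comes in.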

Ilan: Google Translate also has translations in various languages. AFAIK this is on a word-by-word level. We are working on a word-by-sense level: since we had a full English entry divided into different meanings, this becomes possible. We are developing different tables of senses, translations etc. Then there is the "semi" part of the workflow. The editor receives the data: the programmers have switched the headwords into translations and the other way round. The editor then does manual quality assurance. Showing what this will look like for the editor in a special-purpose tool. After editing there is a transformation into HTML.

Ilan: Example of German to English index. This is useful for production purposes going from your native language to target language. We then can go back to automatic process and receive this data. Showing HTML representation of XML data. You see how one goes back to English.

Machines Q&A

Olaf-Michael Stefanov: How do you solve mismatches (disambiguation) between translations?

Ilan: There is no answer, but there is a focus on that problem. The thinking is that the results carry inaccuracy, but less than in other systems such as Google, because we are based on the translation of the specific sense of the word. We are still learning.

Felix Sasaki: Question to the whole panel. What role can linguistic data play in big data tasks? This challenge cuts across different projects, like LIDER. How do you envision this being used in big data?

Philipp: We have seen good examples of the potential use of these technologies here in Riga. For example, the interlinking of different terminologies to aid integrating different data sets; also, we had a talk by the publishing department of the EU, and they have a problem integrating all the metadata that is available in order to provide something like a uniform search. These were good examples.

Feiyu Xu: our group in Berlin is about text analytics. We have two big data projects, applying information extraction technologies to big data. We are working together with Roberto, trying to link linguistic knowledge with knowledge graphs. It is important to go a level further, to the relation level. For this we need more linguistic analytics. Via various projects we provide the sar-graph to represent relations. This will help the big data analytics approach.

Asun: Another example is the annotation of drugs, which would use these technologies in the medical domain.

Feiyu Xu: Question for Ilan: Do you plan to integrate BabelNet into the K Dictionaries?

Ilan: Complex question. BabelNet is amazing; however, how can we assure the quality? The next step seems to be to go into more focused terminologies rather than going more general. Babelfy, for example, is making it easier, and with RDF and linked data it is even easier. The hope is to also connect to BabelNet.

Lightning talks

Scribe: Kevin Koidl

Chair: David Filip

Arle Lommel, "Designing Purpose-Specific Quality Metrics for the Web"

Arle: One of the frequent questions about this topic is: Why is there not one translation quality metric?

Arle: There are many kinds of text and demands for quality levels. When we look into assessment methods we find holistic evaluation, analytic evaluation, automatic issue detection, and quality estimation, which is an emerging field in which computers look at a text and its translation and come up with a good idea of what the problems in the translation could be.

Arle: We need a way to state the requirements for a translation job. That is why specifications are important; a way to say what you do not want is also needed.

Arle: Clarifying lines of responsibility is another important factor, and the main standard, ASTM F2575-2014, addresses this.

Arle: MQM is more than a catalogue: we have guidelines on how you can select the issues you want to check, and you can also check methods. Work is currently under way to unify MQM with TAUS LQA.

Felix Sasaki representing the FREME project, "Language and data processing as first-class citizen on the Web"

Felix: FREME is an H2020 project that started in February 2015. Four business partners are leading four business cases about digital content and linked data. The main challenge is to bring technology from the data and language areas into real business tooling, and the outreach aims to show how digital content and big data sources can be used for monetisation in the multilingual data value chain.

Felix: From a high-level perspective, the main challenges are to create and offer solutions to various businesses based on four distinct business cases. Digital publishing, translation and localisation, agriculture, and personalisation are domains where content loses value and so needs additional services.

Felix: The main aim of FREME is to provide a set of interfaces, graphical and software interfaces, for language and data technologies. Its design is driven by the four business cases, which have the goal of making language and data technology a first-class citizen on the web.

Felix: One of the solutions of FREME is to create new job profiles. The overall architecture in FREME assumes that there are lots of big data content items which can gain value when linked to data sources, so here we have a close relation to the LIDER project.

Felix: We have gathered requirements internally; we will gather feedback from our business case partners about implementations, and there is potential for standardisation. We may organise workshops on this topic.

Felix Sasaki on behalf of Phil Ritchie, "Ocelot: An Agile XLIFF Editor"

Felix: Ocelot will be used in the FREME framework. It is an editor for localisation quality assurance.

Felix: It uses XLIFF 1.2 and ITS 2.0 and helps to capture and filter kinds of metadata: provenance data, localisation quality related data, or machine translation confidence.

Felix: Since the first version, several improvements have been made, supporting ITS 2.0 and connections to translation memory and to the open source Okapi framework.

Felix: Ocelot uses FREME e-services, which are integrated into Ocelot. Ocelot will get semantic with FREME. It will get translation services or will be improved with additional information.

Ben Koeleman, "Swarm Translation"

Ben: Europe is a place with a variety of languages. People are learning at least one other language. Swarm translation generates a community of language know-how, creates a community of translation, and creates language content very fast.

Ben: the best practice: harry-auf-deutsch.de was the first successful swarm translation

Ben: 50 different versions were created, 25000 members participated. First translation done in 36 hours.

Lightning talks Q&A

Philipp Cimiano: I have a question about swarm translation: it seems that your proposal relies on people wanting to translate. But what if you need a translation that nobody wants to do?

Ben Koeleman: This is about a community, and they work for themselves, not for clients.

Dave Lewis: Felix had this idea to explore the demand for common APIs or web platforms. Do you see any priority areas?

Felix Sasaki: One priority is that many people know about data technology but do not know how to deploy these applications, so for FREME we have the objective of giving access to data sources without having to be a data expert.

Felix Sasaki: Also, in the scenarios of the first months of the project we looked at the data sources, and sometimes there is a question whether we should make the effort of converting the source into linked data or whether we should implement a mapping layer. Then the user won't have to make their own linked data copy.

Serge Gladkoff: We tried to use Ocelot but it did not work out because of Capital replacement with features. Do you have any plans about that?

Felix Sasaki: In FREME we will make it easier for users to use the interfaces of these technologies. We won't provide another CAT tool, but we will make it easier for other developers to integrate new CAT tools.

Users

Scribe: Kevin Koidl

Chair: Olaf-Michael Stefanov

Delyth Prys, "Best Practices for Sharing Language Technology Resources in Minority Language Environments"

Delyth: I will start by explaining my background: the Language Technology Unit at Bangor University is a self-funded research unit. We do not get funding from the university. Software licensing and attracted brands give us our employment. We were established in 1993 as a terminology centre, but we expanded into machine translation and speech technology for Welsh.

Delyth: There are 500,000 speakers of Welsh. We have always worked with Welsh in a multilingual context. The problem is the lack of tools for language development in Welsh. We recently released a project: the Welsh National Terminology Portal, funded by the Welsh Government. We arrived at this project because of the lack of accessible tools for developing LT for Welsh, and because the expense of developing tools for non-commercial languages was a challenge.

Delyth: Internal tools may not be made available to others. There is also a lack of knowledge about language technology, esp. for smaller languages. The solution was the national portal that we established. This required an audit of tools from the Language Technology Unit (LTU). The Government selected the most important ones for free (both cost and license, e.g., BSD) distribution.

Delyth: The solution is a tool that is free of charge, and also free in terms of licensing, which allows the language to be reused. The portal needed to be easy to find and needed to address training (tutorials, ideas for use…) too, since some people are not well trained in the use of these resources. The latest development is that coding clubs for children in Welsh have emerged.

Delyth: The Government selected eight resources (out of nine proposed) for initial release. First of all, social web corpora provide insight for businesses; MT and an aligner (for preparing parallel texts) were important. Also a text-to-speech API service, which was really important. Resources to support MOSES, which were crucial. We provide tools to support creation of users’ own engines based on existing TMs. A PoS tagger and language detection API, and the last one, the Vocab bilingual dictionaries widget.

Delyth: Lessons on coding were translated into Welsh and used text-to-speech. Kids could do Turing tests in Welsh. We used the Raspberry Pi Foundation to let children program a robot. We want schools to take the same approach.

Delyth: Also Vocab, another resource, is a widget which enriches online texts with links to dictionary information. It incorporates a lemmatizer. This is important for learners due to lenition (change of initial consonants as part of grammar). It connects to the national terminology portal if more information is needed.

Delyth: It was launched in March, but it is really taking off, and it helps by embedding vocabulary help in monolingual sites. We do as much as possible as widgets and APIs. We were surprised by the number of inquiries from public organizations wanting to use our tools but not having the skills.

Thibault Grouas, "Building a Multilingual Website with No Translation Resources"

Thibault: At the Ministry of Culture, the Delegation for the French language and the languages of France made a multilingual website without translation resources. We are responsible for all the languages spoken in France. There are about 75 languages historically spoken in France (“languages of France”).

Thibault: We promote multilingualism with a team dedicated to digital tools. In this team we try to support technologies for languages, which seem to be very important for language policies today.

Thibault: Linked Open Data is a new area. There was an agreement between the Ministry of Culture, INRIA and Wikimédia France in 2012. INRIA provides expertise in linked data; the Ministry of Culture on culture; and Wikimédia France provides the largest body of cultural content. In late 2015 cultural institutions will be incorporated.

Thibault: DBpédia.fr was one of the first projects. It has been available via SPARQL since 2013. We made experiments on translating HDA-Lab (3000 culture-related languages) with six languages; the second was Museosphere in nine languages. There was a competition in 2014 to help develop cultural projects based on linked data, with ~30 proposals. The winner for linguistics was CROROS, which provided a 33-language browsable database of paintings. JOCONDE contains 300,000 images of art.

Thibault: We wanted to allow full navigation in 14 languages (the ones in DBpédia.fr), but only half a page of interface was needed. We won a “Data Intelligence Award” three months after launch. The budget was quite small, in the tens of thousands of euros. So it is an easy model to do and reproduce.

Thibault: [Discussion of technical diagram of approach]

Thibault: Demo with many languages. The demo showed the results for “sculpture” and how Wikipedia data enriches the results. Arabic does bi-di reversal. You can add your own keywords to files. These are propagated across languages. All navigation is available in all languages. A back-end tool makes proposals to the admin team. Each link is human validated. Each link is qualified (e.g., full or partial matches). 90,000 keywords were linked in six person months. DBpédia.fr covered 95% of the Joconde keywords.

Dennis Tan, "Towards an End-to-End Multilingual Web"

Dennis: End-to-end multilingual web. We talk about websites for users, but I want to go further and talk about multilingual domain names: making domain names multilingual too, to have domains that reflect localization and go beyond English or even Latin characters. The Internet is fundamentally multilingual. By now about 80 languages are supported by Facebook, Twitter, and Microsoft. This reflects the diversity of the internet population.

Dennis: Almost half of Web users now come from China. International character domain names are growing, but they remain a tiny fraction: 2% of domain names are IDNs; 98% remain ASCII. Last year UNESCO reported on IDNs and said that they are good predictors of content language.

Dennis: Despite that, applications still do not support IDNs. Browser support remains limited. There are a few mobile browsers that do not treat IDNs like ASCII. Facebook and Twitter, for example, will not allow me to register using an email address in Chinese characters. Among the top ten market leaders, only Gmail currently fully supports internationalized email addresses (some others have partial support, for sending only). This leads us to talk about universal acceptance.

Dennis: In 1985, at the beginning of time, there were IP addresses and domain names, and TLDs had three characters or fewer. They were hard-coded at that point. In 2001 we saw more (e.g., .info), but this broke the expectation of TLDs with three characters or fewer. In 2010 we saw internationalized TLDs, no longer only ASCII, e.g., .рф (Russian Federation). Since 2013 there are many, many more.

Dennis: As an example, I found a domain-name validator and tried riga.global (.global is a new TLD, recently delegated), and the tool told me it was invalid. Clearly, using a fixed list of TLDs does not work as well as using a dynamic list that is constantly being updated, such as the public suffix list.
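
Illustration (not part of the talk): the failure described above can be reproduced with a validator built around a set of known TLDs. This is a minimal sketch; both TLD sets are invented for the example, and a real tool would refresh its list from a maintained source (for instance the IANA root zone list or the public suffix list) rather than embedding it in code.

```python
def tld_of(domain):
    """Return the last label of a domain name, lower-cased."""
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()


def make_validator(known_tlds):
    """Build a validator that accepts only names whose TLD is in the given set."""
    tlds = {t.lower() for t in known_tlds}

    def is_valid(domain):
        # Require at least two labels and a TLD that the list knows about.
        return "." in domain.strip(".") and tld_of(domain) in tlds

    return is_valid


# Hypothetical hard-coded list, frozen before the new TLDs were delegated.
stale = make_validator({"com", "org", "net", "info", "lv"})
# Hypothetical refreshed list that includes a new TLD and an IDN TLD in ASCII form.
fresh = make_validator({"com", "org", "net", "info", "lv", "global", "xn--p1ai"})

print(stale("riga.global"))  # rejected by the stale list
print(fresh("riga.global"))  # accepted once the list is up to date
```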

Dennis: We are in a vicious circle. The few users who use IDNs often find poor execution, so customers move away from them, and that leads to poor demand. Growth in Internet users will come from the Middle East, Asia, and Africa. About icann.org/universalacceptance: it is trying to deal with the issue in the long run (ten years).

Dennis: Confusable characters are ones that look alike. This is a problematic issue. They can be used by malicious users to spoof domain names. Mixing scripts can be used to do this. There are standards and guidelines to prevent this. Browsers might show Punycode. Browsers might only show non-Latin characters if the user's language settings match the script of the IDN. There are guidelines on what is allowed in an IDN.
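
Illustration (not part of the talk): a rough way to flag mixed-script labels and to obtain the Punycode form using only the Python standard library. This is a minimal sketch: taking the first word of the Unicode character name is only an approximation of the script property, and the built-in `idna` codec follows the older IDNA 2003 rules, so production code would rely on proper script data and an up-to-date IDNA implementation. It does not use any registry's API.

```python
import unicodedata


def scripts_in(label):
    """Approximate the set of scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # The first word of the name is e.g. LATIN, CYRILLIC, GREEK, CJK.
            scripts.add(unicodedata.name(ch).split()[0])
    return scripts


def is_mixed_script(label):
    """True if letters from more than one script appear in the label."""
    return len(scripts_in(label)) > 1


def to_ascii(label):
    """ASCII (Punycode) form of a label, via Python's built-in IDNA codec."""
    return label.encode("idna").decode("ascii")


spoof = "p\u0430ypal"  # the second letter is CYRILLIC SMALL LETTER A, not Latin "a"
print(sorted(scripts_in(spoof)), is_mixed_script(spoof))
print(to_ascii("\u0440\u0444"))  # the Cyrillic TLD label mentioned in the talk
```

Showing the ASCII form for labels that fail such a check is one way a browser can make a spoofed name visibly different from the name it imitates.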

Dennis: Verisign has an open, public API to handle conversion and script analysis. Processing is in JSON format. Showed a small example that identifies scripts. It can be used to identify problems.

Users Q&A

Richard Ishida: Question for Delyth. I understand writing systems in my job. For Welsh, do you have trouble with keyboards? And how did coding clubs happen?

Delyth: There is a Welsh UK keyboard which is easy to change in Windows.

Delyth: But we still have problems because some of the Welsh digraphs, like ll and dd, are single letters, so sometimes hyphenation is an issue.

Delyth: The coding clubs came from the UK level because kids weren't learning to code.

Delyth: It's happening in the UK school system.

Felix Sasaki: For Dennis. You said it is an end-to-end task you have to fulfill. You already reached out to stakeholders in past MLW workshops (e.g., in Rome). Where have you been successful (in IT ecosystem terms, not companies) in involving the right stakeholders?

Dennis: When I came to Verisign three years ago, it clicked with me. I am from Peru, but live in the U.S. and have Chinese and Japanese ancestry. So I was exposed to many languages. I wanted to know why I couldn't use my languages on the internet.

Dennis: I've made some progress. It's all about reaching the right people. This is about making the right connections, keeping the conversations alive and looking for funding. In recent months we've started to find people.

Dennis: I am recruiting. I want every volunteer I can find. I know users want this, but they are turned off by bad experiences.

Richard Ishida: Following on from Dennis's comment: part of the hope is unblocking this with large corporations. Google has started the ball rolling. I hope Microsoft will follow.

Olaf-Michael Stefanov: For Delyth. You talked about installing MOSES on regular PCs. This is tough, so if you've got something good, are you helping others doing that?

Delyth: That's a question for my technical people, but I don't know.

Felix: For Thibault. It looks like you've got great multilingual data sets related to cultural heritage. How can they be bridged to commercial applications? There are companies interested (e.g., in hospitality and travel) in enriching data with cultural info. Have you looked into such deployment?

Thibault: The software is available with a free license under LGPL. The links are published and available. Anyone can use it. Validation is difficult. Sometimes there is a conceptual mismatch.

MultilingualWeb Workshop, Riga 2015

29 April 2015

Contents