See also: IRC log
This is the raw scribe log for the sessions on day one of the MultilingualWeb workshop in Luxembourg. The log has not undergone careful post-editing and may contain errors or omissions. It should be read with that in mind. It constitutes the best efforts of the scribes to capture the gist of the talks and discussions that followed, in real time. IRC is used not only to capture notes on the talks, but can be followed in real time by remote participants, or participants with accessibility problems. People following IRC can also add contributions to the flow of text themselves.
scribe: Felix
Richard introduces the project and the workshop
Piet: Europe has 24 official languages
.. we still have difficulties today
.. it is not surprising that people have problems understanding each other
.. the web helped a lot in moving communication between people forward
.. in the mlw project it is fundamental to bring the experts of the world together to make things easier
.. all websites should be multilingual
.. it is still difficult to combine 100-200 languages in a website
.. it should be easy to have access in any language for users
.. it should be easier for the linguists to transfer information between languages
.. using the linguists' advanced tools
.. important problem, you should find the right (technological) solutions to make life easier
.. after 60 years of computer development I'm surprised how technology makes life complicated
.. integration and ease of use of tools is still far away
.. this is today - multilingualism on the web and in our life is a complex problem
.. if you want to share enthusiasm and passion, it is important to have identification
.. I saw a logo of a project - be aware of the importance of that, so that people identify you
.. please use the right symbol in your work
.. this is the 4th workshop, a milestone in the project
.. please come up with some conclusions to make our life easier
.. to achieve that you need standards and interoperability
.. there is no policy session here - policy is important, without policy support nothing will happen
.. your "best solution" will not be used in the world
.. I'm sure our colleagues from DG INFSO are aware of this and Kimmo Rossi will work hard on this
.. I wish you good work and I look forward to seeing results that make the lives of users, linguists and technicians easier
Kimmo: 4000 people working in Luxembourg in translation
.. glad to be here - most of them work close to this building (the Jean Monnet building)
.. DG INFSO projects behind this event: MultilingualWeb and MultilingualWeb-LT http://www.w3.org/International/multilingualweb/lt/
.. we are in a re-organization process, name of departments might change
.. Richard mentioned that we are funding two projects behind this event
.. new project MultilingualWeb-LT - this workshop is the wrap-up of the "heritage" project and the start of MultilingualWeb-LT
.. the follow-up project will take the message about the gaps and the challenges to build practical reference implementations that mean something to the industry
.. it is very focused on machine translation, content management and localization
.. "LT" of course stands for "language technologies"
.. we had planned to combine this workshop with a showcase of European projects, but that will be separate
.. first at LREC in Istanbul in May with an exhibition of European projects
.. and META-FORUM event organized by META
.. join the alliance of META to demonstrate a push for language technology
.. it gives you visibility and new businesses
.. the META-FORUM event in Brussels (20-21) will feature an exhibition of LT projects in our portfolio
.. about future opportunities: the "Connecting Europe facility" (2014-20)
.. it is the most concrete opportunity to demonstrate what LT can deliver
.. CEF consists of several parts (roads, energy grids) and digital service infrastructures
.. these infrastructures contain "multilingual access to online services"
.. that is our part in CEF
.. I will suggest a breakout session tomorrow about what that part of CEF should contain
.. idea is to have "language services available everywhere"
.. idea is not to take things away from the industry, but provide a platform to share and trade, for industry, public sector and citizens
.. and aim is to make the web truly multilingual
.. if you have further questions I can tell you more in the breaks or during the breakout session
Ivan: For some people semantic web is a "knowledge management system", with big ontologies
.. other people don't care about ontologies, they think about large amounts of data
.. others think about enhancing search
.. others about integrating data
.. so people do this and that. Example from a Chinese university
.. incredible wooden structure, beautiful but complicated
.. people described the knowledge how the structure was put together
.. and they created beautiful videos showing that - this is knowledge management at its best
.. next example: medical application from the US
.. takes a lot of data. Aim is to personalize the data, combine the data, extract knowledge etc.
.. BBC has pages on music and musicians, example of "Eric Clapton" page
.. the BBC does not create the facts themselves - they have a system that aggregates the data from other providers, so again a very different application
.. another example: IMDb - gives reviews on movies
.. in the source they have added additional structured data (microdata)
.. that will be used by Google in search. The Google search result shows a 4-star assessment of the movie, taken from the site during crawling
.. this is the current state of semantic web: we have many application areas, see above
.. the general idea behind all this is: there is a lot of data on the web
.. more and more applications rely on the existence of the data
.. we do not want data silos
.. imagine a web that had documents but without links between them
.. real value of the web is not pages on the web, but links between pages
.. example of three different interfaces related to neuro biological issues
.. they have three different interfaces, and the databases that need to be combined are hard-wired
.. via the web, we can achieve linkage between such data silos, so that the data is a kind of unity
.. semantic web is a set of technologies with the real goal to build a web of data
.. on a longer term, we want to see the whole web as a huge global data base
.. as a long term goal
.. what is happening at w3c today?
.. in that area
.. RDF is the data format for semantic web
.. SPARQL is used to query RDF data, like SQL for relational data bases
.. SPARQL is about graph patterns in the semantic web "graph"
.. SPARQL has been a standard for some time. Now we are working on new features
.. describing new features in SPARQL
.. SPARQL has already been a unifying point between semantic web applications
.. with SPARQL 1.1 this becomes more complex, but also more powerful
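[Illustrative sketch, not from the talk: a minimal SPARQL query showing the kind of graph-pattern matching described above; the FOAF vocabulary is real, but the query itself is invented for this note.]

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    # find the names of all resources typed as foaf:Person in the queried graph
    SELECT ?name
    WHERE {
      ?person a foaf:Person ;
              foaf:name ?name .
    }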
.. exporting of RDF is another topic - one approach is "direct mapping"
.. is good for a general conversion
.. but we need another step to have the graph that we really want
.. it is a layer on top of the direct mapping, to give additional rules for creating the RDF graph
.. to create what your application needs. The additional step is expressed by R2RML
.. both the direct mapping and the R2RML approach are currently being implemented ("candidate recommendation" phase)
.. should be finalized (a w3c "recommendation") by this summer
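[Illustrative sketch, not from the talk: a minimal R2RML mapping in Turtle, showing the kind of additional rules layered on top of the direct mapping; the table, column and vocabulary names are invented.]

    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix ex: <http://example.org/vocab#> .

    # map each row of the EMP table to a resource of class ex:Employee
    <#EmployeeMap>
        rr:logicalTable [ rr:tableName "EMP" ] ;
        rr:subjectMap [
            rr:template "http://example.org/employee/{EMPNO}" ;
            rr:class ex:Employee
        ] ;
        rr:predicateObjectMap [
            rr:predicate ex:name ;
            rr:objectMap [ rr:column "ENAME" ]
        ] .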
.. next topic: adding data to HTML pages
.. data per page is not much compared to data bases, but still there is a lot of data
.. that is very valuable for search engines or other applications
.. two approaches: microdata and RDFa
.. both very similar. RDF can be extracted by both
.. microdata has been optimized for "one vocabulary at a time", doesn't have data types
.. RDFa provides the full power of RDF, at the price of more complexity
.. RDFa Lite is on the same level of complexity as microdata
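[Illustrative sketch, not from the talk: the same simple statement expressed in microdata and in RDFa Lite, using the schema.org vocabulary; the film title is invented.]

    <!-- microdata -->
    <div itemscope itemtype="http://schema.org/Movie">
      <span itemprop="name">Some Film</span>
    </div>

    <!-- RDFa Lite -->
    <div vocab="http://schema.org/" typeof="Movie">
      <span property="name">Some Film</span>
    </div>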
.. next topic: RDF working group
.. RDF itself is the basis of all semantic web technologies
.. it's like links from one page to the other. The only difference is that in RDF the links have a name, and there is additional infrastructure to make use of that
.. RDF is being cleaned up in RDF 1.1, no big changes
.. the turtle serialization is being standardized
.. and other features are being added, only a few
.. work on RDF 1.1 began a year ago
.. last working group is called provenance
.. goal is to add metadata to data on the web like: how was the data created
.. revision structure, revision history
.. for this you want one vocabulary - that is the goal of the provenance group
.. there needs to be a balance between something simple and useable, and something more complete
.. that is the balance that the group is working on
.. now coming to linked open data cloud
.. there are a lot of data sets out there
.. LOD diagram is nice but a bit misleading
.. there is an additional diagram showing interlinkage more clearly - there are still many links missing
.. major challenges of Semantic web are: scale of the data
.. interlinkage
.. ability to read and write data ("SPARQL Update")
.. currently discussing "linked data platform WG"
.. to work on HTTP infrastructure to modify linked open data
.. other challenges: data quality, ...
.. other challenges: role of reasoning with the amount of data
.. highly distributed data
.. huge amount of data in a few vocabularies
.. how to do inferencing in this kind of setup is not easy
.. major challenge is really interlinked data on the web
.. semantic web is trying to help
.. about multilingual web
.. what can be the relationship between multilingual web and semantic web
.. I have the impression that semantic web can give powerful technologies to categorize knowledge
.. that can be created in different languages
.. linked data also gives a source of information that you can use
.. e.g. analyze a blog, fetch semantic web data to use for that analysis
.. not always for translation, but also for language specific technologies
.. semantic web has a very simple way to represent languages
.. we need more complex ways
.. English is used for all vocabularies
.. with the current infrastructure it is hard to reason across languages
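[Illustrative sketch, not from the talk: the "very simple way" mentioned above is a plain language tag on an RDF literal, as in this invented Turtle snippet.]

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    <http://example.org/place/Luxembourg>
        rdfs:label "Luxembourg"@en ,
                   "Luxemburg"@de ,
                   "Lëtzebuerg"@lb .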
.. we also have a cultural issue - we find vocabularies that are badly designed in terms of localization
.. a need for improvement
.. looking forward for the discussion to learn what you see
Charles McCathieNevile: how does semantic web relate to private data
Ivan: big question mark - how to combine access control with semantic web
.. currently you have sometimes semantic web applications behind firewalls
.. but that's no solution
Charles: what do we know about how people use semantic web?
Ivan: we know a little bit more. We had a workshop last december about how linked open data was used in the enterprise
.. one message was: there is lots of data here
.. but there is a need for low-level APIs for access
.. that is wanted e.g. by large companies
Jan: we put a lot of efforts into languages in windows 8
.. 109 languages in total
.. example: 35 million customers in the US who speak Spanish at home
.. so languages present a huge opportunity
.. windows store: helps to deliver apps in more than 200 markets
.. with developer support for localization
.. Metro-style apps technology stack
.. lots of programming languages supported
.. c++, html5, etc.
.. multilingual app toolkit
.. its purpose is to help manage translation
.. has a pseudo language engine for localization testing
.. now demo of the toolkit
.. showing a weather app
.. in the app preference language is now set to German
.. rebuilding the app, it shows up in various languages, with pseudo translation including Bing machine translation services
.. XLIFF files are being created on the fly to support translation
.. in a separate editor translations are handled, including marking non-translatable text
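[Illustrative sketch, not from the demo: a minimal XLIFF 1.2 fragment of the kind such a toolkit might generate; the file name, resource IDs and strings are invented.]

    <?xml version="1.0" encoding="utf-8"?>
    <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
      <file original="Resources.resx" source-language="en-US" target-language="de-DE" datatype="xml">
        <body>
          <trans-unit id="greeting">
            <source>Hello</source>
            <target state="translated">Hallo</target>
          </trans-unit>
          <!-- a string marked as non-translatable -->
          <trans-unit id="appName" translate="no">
            <source>Contoso Weather</source>
          </trans-unit>
        </body>
      </file>
    </xliff>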
Tony describing the creation of layout via XSL like the creation of the world - this is just great - sorry, I can't scribe this
Tony: XSL 1.1 has a large section on internationalization
.. XSL has always been good on i18n: writing modes for multiple scripts
.. properties are defined in terms of "start and end", not "left, right, ..."
.. XSL-FO has the concept of different baselines of text
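[Illustrative sketch, not from the talk: how the direction-relative "start"/"end" properties mentioned above combine with a writing mode in XSL-FO; the values are invented.]

    <!-- in a right-to-left writing mode, "start" resolves to the right edge and "end" to the left -->
    <fo:block-container xmlns:fo="http://www.w3.org/1999/XSL/Format" writing-mode="rl-tb">
      <fo:block text-align="start" padding-start="6pt">...</fo:block>
    </fo:block-container>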
.. in XSL 2.0 we want to do a lot more of internationalization
.. in 2008 W3C had a Japanese layout taskforce
.. experts working to define Japanese layout
.. taking a Japanese standard as a basis
.. and the expertise of Japanese layout experts
.. the "Japanese layout document" is useful for implementing these features
.. ruby applies to Japanese and Chinese (bopomofo etc.)
.. there is a lot of information about Japanese thanks to the layout taskforce
.. most translated document: "Universal declaration of human rights"
.. often used to compare quality of layout in various languages
.. these days we can cover a lot of languages just with web browsers
.. UDHR is also avail. in Unicode, see http://unicode.org/udhr/index.html
.. last year I worked on formatting Khmer
.. I used UDHR as an example, there were many issues in the Khmer layout
.. so there is a need to learn more about local needs related to layout
.. the Japanese layout taskforce is very useful
.. the requirements document is used by XSL, CSS, other groups
.. should W3C make more taskforces? That requires more funding and effort
... easier with the W3C badge, easier to justify
.. or should there be a multilingual layout community group?
.. easy to set up, see http://www.w3.org/community/
.. contributor agreement makes it easy to use the outcome
Richard: presenting key issues related to multilingual topics currently being worked on in HTML5
.. describing the i18n working groups in W3C: i18n core, MultilingualWeb-LT group
.. internationalization interest group, other mailing lists etc.
.. please participate and contribute, we need your support and input
.. example of bidi in embedded text, visualization wrong because of missing directionality information
.. new "bdi" tag to create proper visualization
.. next topic: ruby
.. additional information e.g. about pronunciation of pictographic (Japanese) characters
.. Japanese layout document - currently producing a 2nd version of that
.. gives a lot of detail - would love to have this for Korean, Chinese, Arabic, Indic scripts
.. if you want to participate or know people who want to participate, please let us know
.. ruby in HTML5: there is no "rb" tag, you can put several annotations in ruby element
.. some problems, e.g. you want to highlight the ruby text itself: doesn't work because there is no specific element to select
.. you can use a "span" element, but that has issues too
.. we are working on these questions currently, looking for advice
.. working also with implementors on moving this forward
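[Illustrative sketch, not from the talk: HTML5 ruby markup as discussed - the base text is a bare child of the ruby element (no rb tag), with rp providing fallback parentheses for browsers without ruby support.]

    <ruby>漢字<rp>（</rp><rt>かんじ</rt><rp>）</rp></ruby>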
.. now Jirka about the "translate" flag
Jirka: localization and translation have a lot of issues
.. the "translate" flag helps with this. In many documents you have parts that should not be translated
.. if you use automated translation it would be helpful to have additional metadata that will help - it identifies parts not to be translated
.. also helpful for human translation and translation workflow in general
.. "translate" attribute proposal started a year ago at a multilingualweb workshop, but now it's added to HTML5
.. online machine translation services support this already, e.g. Bing Translate and Google
.. it is also supported by content formats like DITA and DocBook
.. in the MLW-LT working group, we will work on better integration of this into HTML5 and other metadata
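[Illustrative sketch, not from the talk: the HTML5 translate attribute protecting an invented product name from (machine) translation.]

    <p>Click <span translate="no">SubmitNow</span> to send the form.</p>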
Christian Lieske: has the "translate" flag been considered for inclusion in CSS?
Jirka: don't think so
.. in CSS there are just plain strings
.. there is no markup to convey additional metadata
.. if you need to localize CSS, I propose a pre-processing step
Richard: CSS is for presentation
.. it is not the content
.. for bidi, for example, you could do it in CSS
.. but we strongly recommend that you don't
.. because the bidi information is part of the document
.. so I would propose to see CSS just as the presentation layer
Ivan Herman: ruby, bidi and translation
.. these are features that non-XML formats also want to have
.. like JSON, RDF etc.
Felix Sasaki: MLW-LT group will work on bringing some of the features into other formats, we should talk about how to add that into Semantic Web
Jirka: for JSON you can have HTML inside it that contains the "translate" flag and other markup
.. I would hope people rather produce XML, which makes it easier to carry that kind of metadata
Davide Sanseverino: question about the "translate" flag
.. currently we create rules for several elements, not only one - what to do about this?
Jirka: the ITS 1.0 specification has a mechanism to create such rules. It is not in HTML5, but you can combine both HTML5 "translate"
.. and use a processing chain with rules
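[Illustrative sketch, not from the discussion: an ITS 1.0 global rule of the kind Jirka refers to, selecting several elements at once; the element names are invented.]

    <its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
      <!-- mark all code and partNumber elements as not translatable -->
      <its:translateRule selector="//code | //partNumber" translate="no"/>
    </its:rules>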
Richard: "translate" attribute is an interoperable solution
.. Bing Translate and Google Translate recognize it
.. there are other solutions, but that are not standard
for more info about "translate", see http://rishida.net/blog/?p=831
Unknown person: from the University of Karlsruhe
.. we had a scenario to annotate fine grained localization information
.. how do you deal with this?
Jan: you saw examples for Windows 8
.. it is up to translators to deal with what they want to represent
.. we support them
Richard: if you want to translate something like Luxembourg French there is a way of labeling it
Axel Hecht: we talked a lot about the translate flag in the past, happy that it was standardized
.. sometimes people are asking for specific translations,
.. have you asked about having more values for translate to specify that?
Jirka: in full ITS there is support to specify things like that
.. as part of the MLW-LT project, we are planning to have a mechanism that supports RDFa, microdata or other mechanisms to include that in HTML5 and other areas
<Jirka> For Axel - support for terms in ITS: http://www.w3.org/TR/its/#terminology
Felix Sasaki: call for feedback about features of MLW-LT, please give us your feedback and let's put implementations into the centre
scribe: Jirka
Jan Nelson is introducing speakers
Brian: introducing the Joomla CMS
... community project, no company behind it
... Joomla supports 57 languages
... Joomla provides 3 options for translating websites
... 1 - machine translation using widgets from Google,
Microsoft
... quality is not guaranteed, not indexed
... 2 - parallel translation using plugins, everything has to
be translated
... translations are indexed
... the question is whether we should just translate or provide
local content
... 3 - sites within site, translate content only when
appropriate
... the key in Joomla is categorise, add and show
... for each language a different menu can be provided
Loïc: tensions between relying on standards and using new technologies
... showing ugly XML
... translation handled by two plans
... plan A - more automated, developed in 9 months, 6x more efficient than plan B
... plan B - more manual process developed in 3 months
... for interoperability, all processes have to be updated to support Unicode
... maybe also to support XLIFF
Gerard: the goal of Wikimedia is to allow all humans to share the same knowledge, thus localization and translation are very important
... Wikipedia is now in 283 languages, with requests for 129 more
... problems with fonts for scripts
... solved by using webfonts
... there are no good free fonts for all scripts; Wikimedia is supporting development of some fonts
... missing input methods for some languages
... using ISO-639-3, Unicode and CLDR
... using TM and MT
... all localizers and translators are volunteers
... l10n is more expensive than development
... we support more languages than CLDR
... 6000 languages are still not supported
... languages not supported in CLDR are not supported in
applications (text editors, browsers)
... looking for a solution
xyz: How do you support users who are looking for content which is there, but who don't know the language it is in?
Gerard: currently only the current language is being searched
... ongoing project for searching in several languages at one time
Richard Ishida: Do you use BCP 47 or ISO-639-3
Gerard: there is no difference between
language and locale sometimes
... BCP 47 is used when ISO-639-3 is not sufficient
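[Illustrative note, not from the discussion: BCP 47 tags build on ISO 639 language codes and can add script and region subtags, e.g.]

    sr-Latn-RS   (Serbian, Latin script, Serbia)
    zh-Hant      (Chinese, Traditional script)
    nan          (Min Nan Chinese, an ISO 639-3 code valid as a BCP 47 primary subtag)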
Tomas Carrasco Benitez: MediaWiki, Joomla, ... they solve similar problems, but the solutions are different; there is no standard.
Gerard: we want to use standards, please help us to improve CLDR
Tomas: we are lacking standard for
multilingual websites
... each system uses different approach for translating
content
Jan: this is purpose of MLW and MLW-LT
Reinhard Schäler: How do you motivate localizers to work for free?
Gerard: Tools are not prepared for some languages
Brian: Joomla is completely community driven,
people wanted to build web sites in their languages
... we make it easy to supply translation for additional
languages
Axel Hecht: in Mozilla each localization team has different motivations
xyz: are styles switched when translation is done on the fly?
Brian: yes, in Joomla
<chaals> scribe: chaals
Spyridon: new mt service at EC
.. work started in October 2010
... we already have a system built around open-source software for a lot of languages, in use since last July
... I want to explain what we need in standards to make this
work better
... focus on openness and flexibility, and ensuring technological independence
... (repeating what people have said, a bit)
[slide - service architecture]
scribe: We have users, and we want to connect data. We have organised the project in 3 action lines - the MT engines, working closely with the data part
... Data part focused on preparation to improve output
quality.
... Our users are the Commission, and services funded by the
Commission (e.g. TED - tender documentation)
... For MT we started with Moses, because it is an EC-funded
open source system, and started using it and collecting
feedback.
... We want to use more data, more MT technologies where Moses
isn't the best so we want to be able to swap it out
... handle post-editing, ...
... My focus is the data.
[slide - Multilingual Web = Multilingual Content]
scribe: An author, different translators who
each have their own working methods, a publisher.
... A different publisher might not work in the same way, so
the content needs to adapt.
... Publisher needs to be prepared to receive the different
languages
[slide - Language Applications]
scribe: We want to give data to the web, and
get it from the web.
... Getting data from one website is easy. But adding a second
source meant having to rewrite the systems, and if a site
changes there is more work to do. And so on for each
website.
... Where there is no standard to follow, this is normal.
[slide - Giving our data to the Web]
scribe: ... We want a system that takes data
from databases, and makes it possible to automatically publish
in multiple languages.
... There should be continuity in what users get.
... We have had to make our own approach, and then we need to
stick to it.
[slide - Conclusion]
scribe: We need to be able to get multilingual information from more sources, and publish it to the Web.
... Need to allow free flow of information between applications
without losing a lot of time on adapting data.
... We expect MLW-LT to show a feasible approach, and
demonstrate the benefits of this.
... We are trying to be active (echoing richard's "tell us what
you need"). And we are ready to change.
... We have our internal systems, which we are ready to abandon
for a broader standard if there is one.
... So we are major users prepared to test, and to actively
contribute in development.
MH: The difference between Slovenia and
Slovakia: there is love in sLOVEnia
... Seeing my name written Matjŧ got me involved in
localisation
[slide - Existing Approach]
... We localise a lot of stuff at Mozilla. Usually we extract text, give the strings to localisers, and then post them back to the Web
[shows a website in english, how you translate the string, and what it looks like afterward]
... Problem: localisers don't see the context, and they don't see the available space.
... What can we do?
... In HTML5 we have contenteditable, which makes it possible
to just change text on a website - e.g. translating things you
see.
http://pontoon-dev.mozillalabs.com -> a development project to work with this.
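[Illustrative sketch, not from the demo: the HTML5 contenteditable attribute that in-page translation tools of this kind can rely on; the sentence is invented.]

    <p contenteditable="true">Translate this sentence directly in the page.</p>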
[live demo based on flaky versions of everything...]
... add a line of code to a site, then in
the pontoon side you can give the URL and start
translating.
... A UI at the bottom to manage the translation, and then you
select some text, and edit it to change the language.
... It's all cool.
... Except...
... How do you translate metadata like <title> or error messages in javascript?
... We have an advanced mode that shows all the strings you
have.
DL: MLW-LT follows from the multilingual web
project. Get involved...
... There is a new W3C working group
[slide - MultilingualWeb-LT]
... How do we make it easier to integrate
content going through translation?
... Already getting uptake from people beyond the
project.
... Started with a lot of representation from localisation
industry, we could do with more input from Content Management
and Users ...
... Key to the process is not just specification, but actual
implementation.
[slide - Approach]
... Heritage owes a lot to the ITS
specification
... it is nice that it is small, but we could add some more
useful information using this.
... What are the useful things to add? There are different
things different people will want
... Looking at HTML5 compatibility, and things like metadata in
CMS content for the 'deep web'.
... Don't want to invent new stuff where we can use things that
already exist.
[slide - Candidate Stakeholders]
... Main message: we need to look at the
whole stretch from production to consumption.
... There are lots of players, and different ways of building
the workflows.
... We want to find real requirements - problems that people
actually have
[slide - Scope of Use Cases]
[slide - Source Content Processing]
[slide - Localisation Quality Assurance]
... different approaches possible, and we need to think about e.g. what simple authors are doing, and how to work with people who have strong systems that need to integrate with e.g. XLIFF
[slide - CMS-L10N integration via RDF and XLIFF]
... Exploring ways of working with formal systems for tracking the process
[slide - Leverage Target Quality Metadata]
... There are some things that flow through the process, some things that are important for particular steps.
[slide - Rich Metadata for translation]
[slide - Next Steps]
... We're working in public, and we hope
to get involvement as well as being transparent about what we
are doing.
... Will hold a workshop in Dublin 11-12 June, getting close to
finalising requirements
... And then there are more things to work on beyond the scope
of this project - multimedia, javascript, etc
Reinhard Schäler: We wanted to be able to share translations and let communities rate and review them.
MH: We were thinking of this, taking inspiration from Universal Subtitles that allows people to help provide video subtitles. Nothing to show yet though
Des Oates: In architecture of MTU you had what looks like an API between various MT engines. We're looking at something similar in Adobe. Are you going to make those interfaces public, and are you interested in standardising the approaches?
SD: We're taking solutions supported by our
institutional IT department. We're developing on the basis of
commercial systems, building it to allow implementing rules for
different types of request.
... if you have multiple MT engines for a given language, you
call one or another based on e.g. the domain. But it is purely
internal.
... This is something that is available, that has been
customised for each client. I don't see interest in making the
custom configuration standard.
Lloyd: What kind of effort do you have in source quality in machine translation?
SD: We are aware of the importance of quality.
We have no way to impose rules on the sources.
... many users are drafting things that are not in their native
languages. We have editing units to help, we are considering
using authoring support, but in practical terms this looks
extremely expensive to provide.
... we're very early in this process.
DL: In MLW-LT the question hasn't come up yet. I think it is an interesting use case.
??: Could you expand on the policy for open source?
SD: Interesting question. There is a change in
policy since December - now Commission documents are by default
made available for everyone, unless there is a clear
justification for restricting access.
... There is a new open data initiative starting in line with
this trend.
Anonymous: MH, does the system give translation memory, how are translations reported back and integrated online, and can it be linked to other automatic translation services?
MH: Right now it uses translation memory from
our own localisation work.
... Linking to other machine translation services is possible -
we switched already to the Microsoft service (although we only
have that one at the moment, it is easy to switch)
... Integrating to the services. Pontoon can detect every text
node, and you translate a page, or using getText to do
localisation.
... so we create hooks for getText and use them to create
metafiles.
Tomas Carrasco Benitez: https://addons.opera.com/addons/extensions/details/swaplang -> extension that identifies pages which point to alternative languages, so users can select them.
[It's open source - feel free to adapt, improve or port it]
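[Illustrative sketch, not from the discussion: the standard markup by which a page points to alternative-language versions, which a tool like this can detect; the URLs are invented.]

    <link rel="alternate" hreflang="fr" href="http://example.org/fr/accueil" />
    <link rel="alternate" hreflang="de" href="http://example.org/de/startseite" />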
??: Do you have community participation?
MH: Facebook has a similar tool. And I hear
stories that there are fights in communities about whose
translation should win.
... we don't use pontoon with live sites yet. We could limit
access, etc.
... but we want everyone to participate. Need to consider how
to handle this.
???: Yes, this happens. At the end of the day we have to decide on who we accept - choose an authority, and then try to merge differences.
Felix: MLW-LT is on a very tight schedule.
... please tell us soon what you do and need and fill in the questionnaire at https://www.w3.org/2002/09/wbs/1/mlw-lt-requirements/.
<scribe> scribeNick: RyanHeart
Peter: Mission: From the EU to the public.
... Production of publications and preparation of publications in all EU languages.
... Different types of publications: Official and non-official.
... Official journal: 866 issues, 22/23 languages with > 1m pages.
... Consolidation of EU law is another area of work.
... Different online services are also provided: EUR-Lex (law), bookshop, etc.
... The idea behind the CELLAR project is to create one single repository for all metadata.
... Peter illustrates the structure of the CELLAR project with the target architecture consisting of a portal, index and search, content and metadata, post production and production layers.
... Peter highlights the dual nature of the repository in CELLAR, covering both content and metadata.
... The system has passed its development stage, according to Peter, and is now deployed.
... Another common portal is being developed, outlines Peter, to provide a better and easier-to-use interface to CELLAR.
... The CELLAR project uses a common data model, an ontology based on FRBR model.
... Peter explains that the CELLAR project uses RDF and taxonomies represented in SKOS.
<fsasaki> FRBR = Functional Requirements for Bibliographic Records
... Coded metadata supports the delivery of multi-lingual content, explains Peter,...
… which is also used to index the content.
... Interoperability is achieved by adopting standards as much as possible, such as METS (Metadata Encoding and Transmission Standard), Dublin Core, FRBR, Linked Open Data (LOD) and SPARQL, according to Peter.
... At the same time, the EC also contributes to the development and definition of standards, says Peter...
… including around core metadata (to enable global reach), using common authority tables (to harmonize metadata), and driving an exchange protocol for EU legislative procedures.
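[Illustrative sketch, not from the talk: a SKOS entry with multilingual labels, of the kind the common authority tables mentioned above could expose; the URI, notation and labels are invented.]

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    <http://example.org/authority/country/LUX>
        a skos:Concept ;
        skos:notation "LUX" ;
        skos:prefLabel "Luxembourg"@en ,
                       "Luxemburg"@de ,
                       "Luxembourg"@fr .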
... The European Legislative Identifier (ELI) is under preparation, says Peter.
Paul: It's about accessing business information across languages
… SAP is a partner in the project building a business analysis tool based on the DERI approach
… showing an example of how the system, called Monnet, is working.
… Ontologies cannot be directly translated; Paul describes how a lexicon is used for translation.
… The research objectives of Monnet are around the development and use of multilingual ontologies and the exploitation of domain semantics to improve MT.
… the financial use case for the Monnet project is 'Harmonizing Business Registration across Europe' using XBRL and xEBR.
… the methods used for domain training of term translation include hybrid methods, including domain lexicon generation from wikipedia & domain parallel corpora, LDA topic modeling with features mixed-in from the ontology etc.
… another use case is that of public services in The Netherlands, presenting different requirements and complex semantics.
… GELATO (Generation of LAnguage and Text from Ontologies) is one of the methodologies used.
… Ontology Lexicalisation is one of the central topics in the Monnet project.
… there are a number of different use cases in this area in Ontology Localisation, Ontology-based information extraction etc.
… The project is working with the W3C Ontology-Lexicon Community Group and has proposed its own 'Monnet' format.
Tadej: Translating proper names is a big problem for statistical MT systems, one that cannot be solved by the HTML5 translate attribute.
... Depending on the source and target languages, there are different rules for the translation of proper names.
... One solution for this problem is to
check whether a translation for an entity already exists.
... The information presented in a document is checked against
a knowledge base and disambiguated.
... The knowledge base contains labels and entities.
... This requires a good coverage of entities in the knowledge
base (kb) and works better in more widely used languages.
... A solution for languages without a wide coverage would be
to use a kb that is in a different language from that of the
document.
... There are a number of different ranking features that could
be used, including popularity and context similarity.
... For example, if Kashmir was used close to Led Zeppelin, it
would be obvious that the song rather than the country was
referred to.
... Cross-lingual gathering of candidate entities only works
for proper names and only if they are not translated to local
languages.
... Context similarity works in a vector space, treating the
distinct worlds as dimensions. This does not work across
languages.
... The solution is to not compute similarity but to map
texts.
... This can be achieved by training on parallel corpora with
Canonical Correlation Analysis (CCA) techniques. This has been
implemented for EuroParl.
... Future work proposed includes that of the FP7 project XLike
and the standardization work in the W3C Multilingual Web - LT
Working Group.
... The annotations can be used in HTML and are transparent for
normal CMS operations and web browser rendering.
... I am now going to do a demo of RDFa Lite, enrycher.ijs.si
Ivan Herman: Question about CELLAR project. You create a silo, but do you produce links to other data sets, such as government data?
Peter: You are right. We are aware of this and would, indeed, be interested in linking up with other similar public data repositories.
Joerg Schütz: Peter, is there any established interaction with DG Translation, as you share a lot of architectural and data management issues.
Peter: What is your organization? Ah,
Bioloom.
... DG Translation is one of our customers, in a sense.
?: A question for Paul. Domain lexicon generation from Wikipedia - how did you do it?
Paul: we looked at the terms to be translated and extracted them. Then went to the domain-specific Wikipedia entries and to other languages and retrieved the translations.
Olaf-Michael Stefanov: A question for Tadej. In relation to name disambiguation- what have you done in relation to cities that exist in different countries, such as Vienna or Wien?
Tadej: We look at the context.
... Therefore, Vienna in the USA would not be confused with
Wien in Austria.
Christian Lieske: A question for Paul and Tadej - you first identify language-neutral entities; then you do not use MT, but what do you use?
Paul: we actually do MT.
Tadej: There are people approaching the same
problem using MT, and it works reasonably well.
... But my point is that we do not have to use MT, that we can
use a cheaper approach and achieve very similar results.
Felix: Let me thank again all the speakers. Please be back at 16:30 for our next session.
<Arle> scribe: Arle
Annette: Will discuss web communication and
its importance for citizens
... We are lucky to live in Democratic societies, but we should
not take it for granted. Many do not enjoy freedom.
<RyanHeart> Great to hear about 'citizens' rather than 'customers'.
... Choosing leaders is not enough.
Citizens need to participate. The internet provides a way for
citizens to interact with leaders. For the EU the importance of
good web communication cannot be overestimated.
... But how can we communicate with citizens if we don't speak
their language?
... Fortunately for the EC, we have specialized web translation
service in the DGT that helps with communication and assists in
redesigning websites to assist citizens.
... We don't just translate, but also localize the whole
message with the target country in mind.
... Our team has small, autonomous teams for each language. The
lines between planners and translators are short to increase
participation.
<RyanHeart> A 'human' translator speaking. A first, after close to two years :)
... [Dutch translator who is speaking is
not on list of speakers.] Human translators are
underrepresented in this discussion. [Asks for show of hands
about different audience profiles.]
... Want to discuss what we do as translators. We try to get
people to consider multilingual needs from the start, to keep
it in the back of the mind at all times.
... That's why we fight to keep content short and simple, think
about consequences in other language versions.
... Keeping things short and simple in the Commission can be
difficult.
... We face the challenge of matching formats with our tools.
We lag a bit, but our web masters keep wanting to add new
tools.
... Tools are improving, but it is often a challenge for
translators to know what to translate and what not to
translate.
... There is a steep learning curve.
... [Back to Annette]
... Since we cannot translate everything, we have to choose
priorities carefully. We focus on top-level pages and
navigation.
... For specialist/niche pages, MT may do, but for information
going to a large audience, multilingual and user friendly in
the local style are required.
... The bigger the audience the higher the profile.
... We need to understand how citizens use the web and social
media to help make the best decisions.
... Quality assurance is our goal. We have to check closely.
This requires close collaboration with web teams. QA work is
time-consuming and expensive, but hard to quantify.
... [Back to Dutch translator]
... Now I want to share some examples of what we do. We have
huge volumes of legislation, but you will not read laws in
EUR-Lex, so we have a portal with short, concise information
that covers practical needs for citizens.
... We try to put out as much national information from
authorities as possible to make this a one-stop shopping site
for information where citizens can find it all.
... This is tricky: 27 languages from 23 countries. If there
are too many languages on a page, you can't use it. What would
happen if you found Maltese when you need another language?
Some human intervention is essential.
... Another example: website on legislation that allows you to
propose citizens’ initiatives: if you get 1,000,000 signatures,
the EU is obliged to propose a law.
... We cannot use MT for this since it could invalidate
efforts.
... Last example is the Commission home page: we try to
translate as much as possible.
... We do not just translate: we localize. For example, if a
Portuguese museum wins an award, we might not translate it for
a Dutch user, but instead put some local content.
... [Back to Annette]
... We deploy our multilingual expertise in service of
citizens. We are translators, but first and foremost
communicators in service of citizens.
Murhaf: I work for Apple on localization and did studies in Dublin.
... I will talk about the importance of right-to-left (RTL) languages and best practice.
... To start with, I want to talk about a friend who wanted to
do software business in the Middle East. [Shows examples of
promotions that don't work because cartoon shows solution
messing things up]
... The whole flow, right-to-left, means that the whole screen
flow needs to be reversed.
... [Shows screen shot with UI mirroring from OS X Lion]
... To make a site compatible for Hebrew
and Arabic, everything must be adjusted
... Everything must be right-aligned.
... You need directionality support for text. [Shows example in
Roman characters of RTL, LTR, and bidirectional text]
... The Unicode bidirectional algorithm (UBA) can handle
display. Text can be entered the same way (first character
first), but is displayed properly.
... The algorithm reorders the characters in the way the user
would expect based on the language.
... It has a set of rules to try to change the order from the
input string to what the user expects. [Shows example of
reordering rules]
... UBA does a good job in most cases. But there are a few
cases where it does not. E.g., if the paragraph direction is
not detected correctly based on first character; if strings
with different directionality are nested in difficult ways; if
strings contain numbers, names, etc.; strings that are
ambiguous for humans as well.
... If we can improve the difficult cases, it would be a great
goal.
... [Shows example in which “Apple” is the first word in an
Arabic string, which sets Left-to-Right as direction, but it
should be RTL.]
... [Example in which “Yahoo!” is separated from the ! because
of ordering; also one in which file extension is in the wrong
area.]
... [Shows example in which parentheses end up in the wrong
place]
... Right now you can use extra markup, tags, Unicode control
characters to force behavior, but this is manual action and
based on experience.
... The problem with manually adding them is that the
translator may not know what to do and they are not easy to use
since they require knowledge about the UBA. They are invisible,
which means they may be lost, breaking the string.
... Sometimes there is no way to check until runtime.
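[Illustrative note, not from the talk: the manual controls referred to above; the snippets are invented.]

    <span dir="rtl">…</span>     <!-- markup: set the base direction -->
    <bdo dir="ltr">…</bdo>       <!-- markup: override the bidi algorithm -->
    &lrm;    <!-- character: U+200E LEFT-TO-RIGHT MARK (invisible) -->
    &rlm;    <!-- character: U+200F RIGHT-TO-LEFT MARK (invisible) -->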
... UBA needs to be improved based on studying cases where
problems occur. We should find patterns and then parse strings
to improve behavior.
... Numbers are difficult. People think they do not change, so
they may hard code them, but Hindi numbers are used in Arabic,
for instance.
... Best practices include: site must support RTL, avoid
composed strings, avoid weak and neutral characters that cause
UBA problems, don't enforce direction, support localized
numbers, support multiple locales.
... In Tunisia they use Western numbers. Other places use others.
Nicoletta: I will speak on the Multilingual
Language Library.
... It is at the heart of the Multilingual Web
... The motto is “Let’s build it together!” Community
involvement is critical.
... We want to make more use of the trend for sharing. Part of META-Share for resources. It is a big step, but not enough. We need to move to collaborative resources.
... Interoperability gains priority in
this scenario.
... NLP is data intensive. Annotation is at the core of
training, testing, etc. But our community efforts are scattered
and dispersed, with insufficient opportunity to exploit
results.
... We want MANY (parallel?) tests/data for MANY languages. We
want to support all possible types of processing and annotation
we may be able to produce as language technology people.
... For example, annotation about time, space, etc.
... It is a step toward making our discipline more like mature
sciences.
... Those disciplines have thousands of people working together
on the same experiences. We aren't there yet, but to be mature,
we must be able to do this if we are to make a step
forward.
... Accumulation of massive amounts of multidimensional data is
the key to foster advancement in our knowledge about language
and its mechanisms.
... We do not want isolated resources. They need to be bound
together and their relationships examined.
... We want to create an infrastructure for a large language
repository to accumulate all knowledge about a language and
encourage analysis of interrelations.
... We cannot currently share this knowledge.
... The challenges are not technical or at the design level.
They are at the organizational level, in community
involvement.
... We are starting with the LREC Repository that hosts a
number of parallel/compatible resources in as many languages as
possible, focusing on multiple modalities (speech, text,
etc.)
... This will be contributed to META-SHARE.
... Authors are invited to process data in languages they
know.
... They are invited to focus on different sorts of
processing/tagging that they know.
... Processed data is shared back with the project.
... We currently offer data in 64 languages. English has the
most, followed by Spanish and Catalan.
... There are many missing languages.
... [Shows table of annotation types.]
... [Shows table to tools used for annotation]
[Shows table of standard formats. Heavy use of TIMEX3 for temporal data markup.]
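[Illustrative sketch, not from the slides: a minimal TIMEX3 (TimeML) annotation of the kind listed; the sentence and values are invented.]

    The workshop took place on <TIMEX3 tid="t1" type="DATE" value="2012-03-15">15 March 2012</TIMEX3>.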
... All data will be available
publicly.
... This is our first experiment. We hope it will set the
ground for a large language library.
... It will help us build all knowledge and let us build on
each other’s achievements. It requires a change of
mentality.
... We need to focus on collaborative mindset.
... Interoperability issues are a problem since we do not
require conformance to any standard.
... Please contribute at http://languagelibrary.eu
Fernando: I will start with context about the
Food and Agriculture Organization of the United Nations.
... We have over 190 member countries.
... Focus on aspects of agriculture, food standards, animal
diseases, etc.
... 5 regional offices. Work in a number of languages.
... See www.FAO.org as portal. But we are working in Facebook,
Twitter, etc. now.
... [Shows table of users by browser language]
... English is dominant (53%) but other languages are
growing.
... Our issues with language call for use of MT. We produce
governing bodies’ statutory documents, food standards, news and
campaigns, technical information, internal communication.
... We need to make this available in all languages, but our
budget is small.
... We use human translation for governing bodies’ documents,
and normative documents. We may use MT + post-editing for the
other groups, but we want to get to the point where we do not
need human intervention.
... We have been testing MT (Moses). We want to reuse legacy
translations. We want to integrate TM and MT and use our
knowledge and experience to improve the production of
multilingual content for the web.
... We want to improve the efficiency of the translation
process.
... Not all content can be translated by humans in all
languages; we need to accelerate the process, particularly for
legal documents.
... We started with allowing users to send queries to the
engine and provide translated responses. By monitoring the
requests, we would get a better view of what content is
demanded.
... This knowledge would help us focus our resources.
... Started with Spanish (for expertise) and Arabic (critical
demand)
... [Shows slide of architecture]
... We used TBX, TMX, etc. to use standard formats.
... However, after trying this, we found out that the best
format to fit SMT is plain text (.txt) aligned in a certain
way.
... We were moving from rich formats to non-rich formats.
... The engine requires that the text be cleaned up from
markup. It reduces the information in the available
translations for use by the engine.
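[Illustrative sketch, not from the talk: the line-aligned plain-text form described - sentence n of the source file corresponds to sentence n of the target file; the sentences are invented.]

    corpus.en:
      The committee adopted the report.
      The meeting is closed.

    corpus.es:
      El comité aprobó el informe.
      Se levanta la sesión.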
... Some issues we have found are: (1) there is little
information about MOSES for mere mortals; (2) best practices
are not documented in the UN network of practitioners.
... We have shared experience in JIAMCATT.
... We found that there are common problems.
... Other issues: there are standards for each part of the
process, but they do not integrate with each other, raising
interoperability problems. They do not work well
together.
... For us, the translate attribute is useful, but what do we
do when we have to convert to plain text?
... directionality is a problem, as are numbers, acronyms.
(E.g., in Arabic, acronyms are not used.)
... For our texts, English is the source language, but it is
written by those for whom English is not a native language.
Thus the source is “UN English” but the translations are in
native languages. It can create quality problems.
... We are watching the MultilingualWeb-LT project and hope it will help us bring more content to more languages.
Gerard Meijssen: For Nicoletta, is the information in the repository freely licensed?
Nicoletta: It is not a repository of
translations, but of language resources.
... MT results could be one resource that could be
contributed.
Gerard: The data is in the LREC repository, but it is available under a free license where you can do anything with it?
Nicoletta: They are available for everyone, but if you process the data and voluntarily contribute your processing back, you have to make it available. You can set licensing that ensures availability.
Tomas Carrasco: MOSES for Mere Mortals is from a member of our team. Keep your data, but use open formats. Legal issues can be difficult, but instead we should focus on agreeing on formats so we can share as needed. Sharing data is not enough.
Nicoletta: Let me clarify. We provide data and
we ask users to process the data (add annotations). It is all
through the META-SHARE platform. The reason is we want to see
the results and analyze what we get. We do not ask for a
specific format at this point because that is a top-down
approach but we want to see what the community does on its
own.
... We know that best practices, standards will emerge. It's a
different approach.
Daniel Garcia: For Annette, are you involved with translation of social media?
Annette: No.
Dan Tufis: The LREC initiative is great, but have you considered the issue of the quality of the data you are getting? I assume the collection should be reused, but if you don't know the quality, there is not much use of it.
Nicoletta: That is part of the experiment. We need to analyze the data for quality so we can understand the issues that will rise on a bigger scale.
... One possible way it may go is that when you have many layers of annotation, if there are many groups you can look at the issue in many ways.
??: For Fernando. How do your users cope with MT quality? Are metadata from databases (e.g., descriptions, keywords, etc.) translated to provide accessibility even for non-translated materials so that users can know about the availability of data.
Fernando: We use MT only internally for the
time being. The results do not go beyond our intranet. Quality
is an issue, and because people are used to human translation,
we don't want to expose ourselves to risk until we know the
results.
... For document production, we translate titles, etc. Much
uses controlled vocabulary. We use controlled syntax for
URLs.
... We use only English metadata at present in the CMS.
Jörg Schütz: Does SKOS play a role in your efforts?
Fernando: We use terminology database, other resources.
Jörg: For Annette, you mentioned the notion of a default language. How do you decide what it is?
... Is the fall-back always English?
??: Generally yes.
Jörg: That matches my experience.
[Applause for speakers]
Richard: Provides information on the reception at Parc Bellevue. If you want to take the bus, take the #18 or #12 and go to the Hamilius stop. Go further down the road in the same direction. Take the second on the right to Avenue Marie-Thérèse. The room is the Salle Marie-Thérèse.