MLW Pisa Workshop, day 2

05 Apr 2011


See also: IRC log


fsasaki, r12a, tadej, charles

This is the raw scribe log for the sessions on day two of the MultilingualWeb workshop in Pisa. The log has not undergone careful post-editing and may contain errors or omissions. It should be read with that in mind. It constitutes the best efforts of the scribes to capture the gist of the talks and discussions that followed, in real time. IRC is used not only to capture notes on the talks, but can be followed in real time by remote participants, or participants with accessibility problems. People following IRC can also add contributions to the flow of text themselves.

See also the log for the first day.



Dave Lewis, "Semantic Model for end-to-end multilingual web content processing"

Dave: Presenting on CNGL research
... multilingual IR, real-time social media translation etc. are all part of the aim to support the global customer
... web services - benefits for localisation like "pay as you use" models, easy deployment, ....
... industry survey shows barriers for adoption of technology
... web services interoperability - profiling needs to be done very carefully

<tadej> yes

<tadej> Dave: proposing employing semantic web technology to the MT use case

dave: semantic web may help to solve the problems we are looking at
... sw is a good mechanism to leverage other things
... tools are maturing
... we are interested in a small part of the sw stack, that is RDF
... RDF is a triple language, everything gets a URI and can be referenced; RDF Schema provides some basic modeling methods

Dave compares RDF to relational databases

dave: RDF provides classes, properties, ...
... including multiple inheritance, allows combinations in an interesting way
... the semantic web does not necessarily require standardization; people just create a vocabulary
... if it is taken up, good - a "survival of the fittest" approach
... existing data can be annotated with RDF - e.g. for Web services there is SAWSDL
... developed a seed taxonomy for next generation localisation (NGL) content
... working with many researchers in CNGL to see whether the taxonomy fits their needs, otherwise it is changed
... have a model refinement cycle for this
... fine-grained roundtrips involving customer, content developer, LSP, translators
... looking into doing this with RDF
... "linked open data" - not focusing so much on reasoning, but to see how to publish data you have
... triple stores are becoming robust, starting to scale
... important vocabulary from LOD: open provenance vocabulary
... helpful for author, segment and source QA
... next steps:
... revise semantic model, semantic sandpit, content markup via RDFa, not standardising semantics, testing semantic technology
... access control, etc.
... real power of SW is its extensibility
... semantic annotations can help to improve interoperability
... provenance linked data can help for roundtripping
... will gather a lot of quality metadata about the content we are localising
... that might be helpful for training statistical MT
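The RDF triple model Dave describes can be sketched minimally in plain Python (hypothetical URIs and vocabulary terms for illustration; a real system would use a triple store and a vocabulary like the CNGL taxonomy):

```python
# Minimal illustration of the RDF triple model: every resource gets a URI,
# and statements are (subject, predicate, object) triples.
# The "ex:" and "ngl:" names below are hypothetical.

triples = [
    ("ex:job42", "rdf:type", "ngl:LocalisationJob"),
    ("ex:job42", "ngl:sourceLanguage", "en"),
    ("ex:job42", "ngl:targetLanguage", "lv"),
    ("ex:job42", "ngl:translatedBy", "ex:translator7"),
]

def match(s=None, p=None, o=None):
    """Return triples matching the given pattern (None = wildcard)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# All statements about job42:
print(match(s="ex:job42"))
```

The pattern-matching query is the essence of what SPARQL does over a real triple store.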

Alexandra Weisgerber, "Developing multilingual Web services in agile software teams"

alexandra: introducing swinng project, part of the software cluster
... central principle: emergence
... emergent software: enables combination of components and services for digital comparison
... components can come from ERP, BPM, BPI, the Web, ...
... agility to better account for reducing waste, empowering the team and the employee, ...
... challenges: find a balance for right amount of documentation
... had experiences with writing larger user concepts or user concepts on the white board

alexandra: actions and research areas: include a technical writer in at most 2 Scrum teams
... want to set up controlling to measure software quality and time to market
... difficult task, software quality is hard to measure

Andrejs Vasiljevs, "Bridging technological gap between smaller and larger languages"

Andrejs: talking about challenges for smaller languages
... tools should be provided to help to bridge language barriers esp. for these languages
... unesco is working on a code of ethics, including the demand to represent all linguistic groups in cyberspace
... alvin toffler: "survival of smaller languages depends on outcome of MT versus proliferation of larger languages"
... Tilde is doing both language technology and localization services
... we can see real needs of users and test new approaches
... MT at tilde: first rule-based, switching to data-driven methods in 2008, heavy participation in EU R&D
... about MT development
... not only research, but bring results in tools we provide
... MT, dictionaries widely used in the country
... work with MS research to improve MT engine for our language
... problem of data-driven MT: translation quality is low for under-resourced languages
... other challenge is customization: mass-market, online MT-systems are general
... performance is poor for specific domains
... open source tools like GIZA++ or Moses are hard to use for the ordinary user, too complex
... strategies to help: see "LetsMT!" project
... building a platform to gather public and user-provided MT training data
... increasing quality, scope and language coverage for MT
... area is "machine translation for the multilingual web"
... user survey about IPR of text resources
... there is some willingness to share data
... another project "Accurat"
... non-parallel bi- or multilingual text resources
... e.g. multilingual news feeds
... wikipedia articles, multilingual web sites, ...
... these show scale of comparability
... we calculate the comparability
... develop comparability metrics
... develop methods for automatic acquisition of parallel texts
... consortium has both research institutions and SMEs
... tagging MT-translated texts would be very helpful
... to be able to distinguish MT-translated texts from human-translated text
... common interfaces for MT engines would facilitate interoperability
... standardization / BP are needed
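The comparability metrics Andrejs mentions can be illustrated with a very simple monolingual baseline (an assumption for illustration only; the actual Accurat metrics are more sophisticated and cross-lingual): cosine similarity over bag-of-words vectors.

```python
# A naive comparability baseline for two documents (illustration only --
# real comparable-corpora metrics work cross-lingually, e.g. over
# translated or lemmatized features).
from collections import Counter
from math import sqrt

def comparability(doc_a, doc_b):
    """Cosine similarity of bag-of-words vectors, in [0, 1]."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(comparability("the cat sat on the mat", "the cat lay on the mat"))
```

Scoring document pairs this way lets a pipeline rank candidates before attempting the more expensive extraction of parallel fragments.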

Boštjan Pajntar, "Collecting aligned textual corpora from the Hidden Web"

Boštjan: about collecting aligned textual corpora from the hidden web

Boštjan: aligned parallel corpus: a text alongside its translation(s)
... usage: translation memory, training MT systems, many NLP scenarios
... looked at standards, decided to go for TMX
... XLIFF is in my list in the last bullet point, in brackets
... so XLIFF needs more marketing & development
... getting data: non-english professional web sites
... huge amount of translated text
... in general quality translations

Boštjan: problems:

Boštjan: translation memory is hard to get
... data should have high precision
... no standard fully supports automatic harnessing or cleaning of data
... proposed solution: crawl from the web
... > database > list of HTML candidates > list of text candidates > parallel corpora
... see http://kameleon.ijs.si/t4me for more info
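The pipeline on the slide (crawl > database > HTML candidates > text candidates > parallel corpora) can be sketched as a chain of filters. The heuristics below are placeholder assumptions, not the actual kameleon implementation:

```python
# Sketch of the harvesting pipeline: candidate HTML page pairs are found,
# then filtered down to plausible parallel texts. Both filters below are
# hypothetical heuristics for illustration.

def html_candidates(pages):
    # pair pages whose URLs differ only in the language code (a common heuristic)
    return [(a, b) for a in pages for b in pages
            if a["url"] != b["url"]
            and a["url"].replace("/en/", "/sl/") == b["url"]]

def text_candidates(pairs):
    # keep pairs whose extracted texts are of comparable length
    return [(a, b) for a, b in pairs
            if 0.5 < len(a["text"]) / max(len(b["text"]), 1) < 2.0]

pages = [
    {"url": "http://example.org/en/about", "text": "About our company."},
    {"url": "http://example.org/sl/about", "text": "O našem podjetju."},
]
corpus = text_candidates(html_candidates(pages))
print(corpus)
```

Each stage narrows the candidate set, which is why high precision at the filtering steps matters for the quality of the resulting corpus.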

Boštjan: we used TMX - is it the right choice?

Boštjan: source language must be defined
... no need for me to do that, I just have parallel texts for machine consumption
... would need an optional parameter to define the source for each segment
... when you develop a standard, think also about "machines" as users, not only people
... future work: optimization in the areas of two phrase crawling, character encoding, enhanced candidates extraction
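A minimal TMX skeleton can be produced with Python's standard library alone. Note, as a hedged aside: TMX 1.4b does define an optional srclang attribute on the tu element that overrides the header value, which is close to the per-segment source marking asked for here; treat this fragment as a sketch, not a validated TMX file.

```python
# Build a minimal TMX document with one translation unit.
# The srclang on <tu> marks the source language for that unit,
# overriding the header-level srclang.
import xml.etree.ElementTree as ET

tmx = ET.Element("tmx", version="1.4")
ET.SubElement(tmx, "header", {
    "srclang": "en", "adminlang": "en", "segtype": "sentence",
    "datatype": "plaintext", "o-tmf": "none",
    "creationtool": "crawler", "creationtoolversion": "0.1",
})
body = ET.SubElement(tmx, "body")
tu = ET.SubElement(body, "tu", srclang="sl")  # per-unit source override
for lang, text in [("sl", "Dober dan."), ("en", "Good day.")]:
    tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
    ET.SubElement(tuv, "seg").text = text

print(ET.tostring(tmx, encoding="unicode"))
```

Generating the file programmatically like this is exactly the "machines as users" scenario: no human ever edits the XML directly.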

<luke> To answer Bostjan's question about "how many errors are acceptable", the answer (frustratingly for him, I'm sure) is "it depends": Is the text a guide for system administrators or the company homepage? Also: what are the type of errors (people can usually understand text with some grammatical errors, but if the key nouns/verbs are incorrect, it could be confusing/embaressing).

Boštjan: web service for translation memory distribution and filtering (Web 2.0 style)

Gavin Brelstaff, "Interactive alignment of Parallel Texts – a cross-browser experience"

gavin: interactive alignment of parallel texts
... world wide web: need to think both globally and locally, e.g. in terms of minority languages
... "a seed-bed for poetic expression, beyond mere communication"
... cultural context is important, see R. Jakobson
... there is an osmosis between minority languages and global languages
... everybody becomes a 2nd language speaker
... parallel text alignment <> to communicate semantics
... we have standards-based markup, web delivery cross-browser, non-verbal interactivity ...
... statistical MT will not translate poetry in the next 20-50 years
... we developed a parallel text alignment web interface

demo of interactive text alignment

scribe: standards that have been used for the demo: TEI (XML-based) structure
... presented as XHTML, with CSS, JavaScript
... semantics is not RDF, but the TEI structure

gavin: beauty of Unicode - one can put multilingual information directly into the content
... pros and cons: we can interact directly with semantics
... the W3C Range API does not work consistently in browsers
... TEI P5 must be subsetted
... CSS selection helps, with jQuery
... some browser issues, does not work everywhere

Machines session Q&A

question about semantic web for MT training

dave: have thought about that
... e.g. linking to terminology data bases
... looking into lexical markup, there was a presentation at the last mlw workshop about this
... hot topic in MT; linguistically informed MT

discussion about legal issues with gathering corpora via the Web - is it legal at all?

Boštjan: lawyers will work on finding that out

Alexandra: all languages in our project need to be finished
... depending on the language it is difficult or easier

christian: funny to see the same questions, I had the remark on IP too, let's see where this goes

christian lieske: everyone mentioned that categorisation of what we find on the web would help with machine analysis

<fsasaki> .. not a question, but a remark: all of you mentioned that categorization of what we find on the Web would be helpful for reliable machine analysis

scribe: some communities have a detailed approach to this
... look at last year's w3c day in berlin and you'll see how work on digital libraries may fit well with machine translation

<fsasaki> (above is presentation from Günther Neher)

??: often pages with the same url that are translated are not exactly the same structure

<fsasaki> (see, in German, http://www.xinnovations.de/downloads-2010.html?file=tl_files/xinnovations.2010/Download/W3C-Tag/Prof.%20Dr.%20Guenther%20Neher.pdf)

bostjan: we have done little testing so far - about 7000 translations - and it worked well
... our preliminary experiments show that it still works very well, even if there isn't the same content on both sides of parallel text

andrejs: see the FP7 project that is looking how to extract comparable corpora

s pemberton: i'm impressed by willingness to translate poetry - i'm performing in an opera and it took me a while to understand some allusions and references (gives examples)

scribe: i'm amazed that you hope ever to do this

gavinB: our approach is to find the interface - to see how far machines can go
... it is possible to do a translation based on the bare bones - even humans can get things wrong...

jorgS: if you have conceptual mismatches, how do you resolve them?

gavinB: this is where the human translator accepts that they need to go away and study it - in our system we mark it up in red
... the translation will never be exact

jorgS: for dave, what do you think of the next generation of content generation based on RDF?

dave: there's still a gap between computational linguists and semantic web folks - there are people looking at how to apply these things, and there are proposals out there
... we're looking at how to integrate those approaches into what we do

jorgB: i'm looking forward to multilingual text generation

lukeS: i was intrigued by gavin's presentation
... seems the best you can do wrt translation is to come up with a separate poem that has the same feel
... but this may be a useful tool for understanding the original material better
... there may be implications for other translation approaches

christianL: i understand the remarks about translating poems with machines - but to me Gavin's talk was about an annotation mechanism based on standards
... there is a need for this approach, and gavin's presentation was inspirational
... more and more accurate annotations are needed, but there are other aspects to translation and gavin's presentation pointed to many useful aspects of this


Paula Shannon, "Social Media is Global. Now What?"

paula: introducing how social media is changing localization
... showing video on social media
... video emphasizing rapid growth and scale of various SNs, describing the relationship of new generation towards social media
... video focusing on effect of social media on advertising, enabling higher ROI for marketing
... introducing the term "socialnomics"

<Steven> One mistake in the video - it conflated Internet and Web, so the time to 50M users was for the web, not for the internet

paula: describing the notion of reputation control via media - the talk will be about showing how this does not hold in presence of social media
... analogy with toddlers as example of parents not being in control
... in social media, the user is in the middle of the system and his worldview actually defines his experience
... emphasizing other social networks than facebook, e.g. hi5, orkut - a reason for their success was the fact that they were localized
... talking about surveys on social media and lionbridge involvement - how people are using social media multilingually
... companies using social media: a quarter of companies are using all 4 platforms - europe and especially asia businesses are growing much
... faster than u.s. companies, likely due to legal issues
... twitter is increasingly popular, fastest growth
... 60% of tweets are non-english, but twitter is localized into only 7 languages
... companies engage in hyper-local strategies, twitter account-per-region
... twitter brought new metric: TPS - tweets per second

paula: smartphones becoming the relevant computing platform
... why are companies engaging? because SM allows them to really interact with the users

paula: strategies of social media: 1) single centralized controlled SM outlet

paula: 2) decentralized local pages - more effective, but users have more control
... it is still a huge opportunity - example: coca-cola has 250 people who are tasked with buying keywords
... important assertions: it is happening quickly; it's huge and growing; instantly available content has more value than quality content
... the real-time aspect also affects localization processes - when localizing a message, the process might take too much time
... real-time multilingual communication does not leave space for pre- and post- editing, leaving a lot of human intervention out
... last assertion: machine translation is being increasingly more relevant for SM outlets

Maarten de Rijke, "Emotions, experiences and the social media"

maarten: intro - academics are not concerned with standards per se, but trying to get things done
... talk will be about standards supporting intelligent information access of content
... in social media, people still do the same things, but online instead of offline
... presenting concrete project of a political mashup
... gather political social media content, debates, analyze and semantify it. political scientists are interested in tracking topic ownership
... traditionally, this research was conducted via classic clipping, now via social media.
... however, data gathered this way is increasingly multilingual
... another project, CoSyne, about cross-completing wikipedia pages using different language articles on the same topic
... third example: The Mood of the Web - Livejournal has mood annotated blogs, serving as a stream of mood-annotated data
... when following mood patterns across time, you can try to interpret them, for instance "shocked", "tired"
... what would explain a huge spike in "shocked" in 2008? by combining livejournal streams with news and counting word usage statistics, it turns out that it was the death of actor heath ledger.
... showing a time series on stress measurements, showing a spike at the end of the year - that sort of analyses require a lot of technology for text processing and information extraction
... introducing Fietstas, a multilingual en/nl text processing engine as infrastructure for what was presented
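The word-statistics idea behind "why the spike in 'shocked'?" can be sketched as comparing word frequencies in the spike window against a background window and surfacing the unusually frequent words. Toy, hypothetical data below; this is not the actual mood-analysis pipeline.

```python
# Compare word frequencies in a "spike" window against a background window
# and rank words by how over-represented they are. Hypothetical data.
from collections import Counter

background = "weather work traffic weather lunch work".split()
spike = "heath ledger actor death heath ledger shocked".split()

bg, sp = Counter(background), Counter(spike)

def surprise(word):
    # relative-frequency ratio, with add-one smoothing on the background
    # so unseen background words don't divide by zero
    return (sp[word] / len(spike)) / ((bg[word] + 1) / (len(background) + 1))

ranked = sorted(sp, key=surprise, reverse=True)
print(ranked[:2])
```

The same contrastive counting idea scales up with proper smoothing and much larger windows, which is where the heavy text-processing infrastructure comes in.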

Gustavo Lucardi, "Nascent Best Practices of Multilingual SEO"

gustavo: comparing the SEO process with preparing a gourmet meal
... posing the question, "what are the right ingredients for multilingual SEO?"
... high search engine positioning is very important, holding potential for high revenue
... introducing terms: SEO, MSEO, SMO, Social SEO as different strategies in the field
... an important distinction is that whereas in SEO traffic comes from search engines, in SMO traffic comes from social media
... for example, 500 tweets have more effect than 500 incoming links
... however, SEO still has higher ROI than SMO
... an important concept in SEO is the long tail effect in certain business models
... just translating keywords does bring traffic, but has low conversion rates
... for effective multilingual, international SEO, he recommends the W3C Language Standards as basic rules
... SEO can be multilingual, international or geographical; these are not mutually exclusive.
... what did we learn doing it:
... 1) focus on the long tail and niche market
... 2) conversions, not traffic
... 3) things change, iterate
... showing examples - a legal company campaign was successful once they used correct glossary translations
... healthcare insurance campaign was better once they regionalized their content
... hotel chain: 12 languages, necessary to cover all

Chiara Pacella, "Controlled and uncontrolled environments in social networking websites and linguistic rules for multilingual websites"

chiara: the talk will be about control of content, and the implications as we get more complex multimedia in the mix of controlled vs. uncontrolled content
... controlled environment - the user does not have influence, the content is relatively static
... in a controlled environment, the developers work with sentence strings, which are then combined
... in an uncontrolled component, the content is very dynamic, developers have limited control - they combine it with the controlled component before outputting
... even in a single sentence, there may be a combination of controlled and uncontrolled strings
... in the translator's view, the content is treated as token variables
... explaining their approach to i18n: handling languages with gender, number, declensions, etc.
... different languages may have different needs than the source language
... they solve that by "dynamic string explosion", which enables a translator to have multiple translations for the same source string depending on the linguistic context
... in romanian, the translator must specify gender, but in finnish and russian, it is even more complicated
... an important aspect and the point of this talk is that facebook users are the translators
... considering machine translation, but haven't implemented it yet
... french was translated in 24 hours, released in three weeks; now supporting 67 languages, many released without professional review
... review process:
... 1) translating the glossary of individual terms
... 2) translating the content
... 3) professional supervision and checking
... the tool supports both inline and bulk translation, for in-context and out-of-context translation
... why use community translation: 1) users are domain experts 2) speed 3) reach
... why do users translate: personal satisfaction and pride, leaderboard of translation statistics
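The "dynamic string explosion" Chiara describes can be sketched roughly: one source string explodes into several translation slots, keyed by the linguistic context (here just the subject's grammatical gender). Hypothetical data and rough Russian forms for illustration only; this is not Facebook's actual system.

```python
# One source string, several context-keyed translations; tokens are
# substituted at render time. Illustrative data only.

translations = {
    "{name} changed the photo.": {
        ("ru", "masculine"): "{name} изменил фотографию.",
        ("ru", "feminine"):  "{name} изменила фотографию.",
        ("en", "any"):       "{name} changed the photo.",
    },
}

def render(source, lang, gender, **tokens):
    table = translations[source]
    # fall back to a context-free variant when no exploded form exists
    template = table.get((lang, gender)) or table.get((lang, "any"))
    return template.format(**tokens)

print(render("{name} changed the photo.", "ru", "feminine", name="Мария"))
```

The key point is that the translator, not the developer, decides how many exploded variants a given target language needs.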

Ian Truscott, "Customizing the multilingual customer experience – deliver targeted online information based on geography, user preferences, channel and visitor demographics"

ian: SDL is an international company, and itself also faces the multilingual problem
... big themes: social media and different devices, how information is shaping opinions, relevant content is often in the user's language
... reiterates the point that buyers are sensitive to the language of the content when buying
... while around 50% of tweets are english, it is diminishing
... connecting with visitors: be relevant, listen, understand, engage
... this requires monitoring solutions
... understanding: finding common interests across languages, demographics and geographies
... it turns out that the common interests are key
... content should be relevant, and better relevance via localisation is reflected in better effectiveness of communication
... presenting the journey of the customer engagement, from research of products to buying and customer support
... for the customer's journey, there's a lot of content with which the user engages that needs to be appropriate
... if people are coming to the website, they are trying to get stuff done, so 'user engagement' may be an obstacle
... users' expectations have changed
... they expect content in their own language

Users session Q&A

DavidGrunwald: to paula - you haven't discussed whether you have the tools in place to harness social media?

paula: they don't crowdsource, they crowd manage - using input from users of various levels of skills, split the work into tasks and monitor that

DavidGrunwald: you are not letting the crowd control the message, as you claimed in your talk

paula: the content that I am referring to is not always in public or social media

ian: with social media, you can translate and listen, but you have to be cautious with translating and speaking (with automatic tools)
... agree with crowdsourcing, but it needs to be a love brand, for which people want to write

LukeS: on exploding translations in facebook - there is an open source project that unicode consortium supports that handles a subset of the language morphology problem

DanTufiş: to maarten - what theories is your work relying on

maarten: machine translation as core technologies, political science as application

DanTufiş: points out Osgood's work on subjectivity with using wordnet to extract sentiment

Steven: points out that in paula's presentation, it was not the internet that took 4 years to 50 million, but the WWW


Jaap van der Meer, "Perspectives on interoperability and open translation platforms"

Jaap: interoperability questionnaire
... and interest in standards in particular
... quotes some of the statements regarding costs
... where is the friction? mostly TM followed by terminology
... reasons to support: freedom of tool choice
... biggest barriers: lack of compliance, lack of maturity, etc.
... a sort of resistance against interoperability, such as market drop-down
... different perspectives of believers
... realists' point of view, such as "accept market forces", "show business advantage", "resistance to tools", etc.
... and now the pragmatists: "they have hope..." ;-)
... future outlook (5 years!)
... content increase, multimedia, mobile, more cross-lingual challenges, ...
... brief SWOT analysis (see other TAUS publications too)
... information pyramid representing content disruption
... apply pyramid to SWOT graphic
... business model attributes: old vs. new
... e.g. TM is core vs. data is core; one- vs. multi-directional; word based pricing vs. SaaS; GMS vs. MT embedded
... enterprises in 5 years will need a language strategy
... last slide: interoperability agenda
... more changes in the next 5 years than in the past 25 years

Fernando Servan, "From multilingual documents to multilingual websites: challenges for international organizations with a global mandate"

Fernando: [talks about the challenges of multilinguality for international organizations]
... gives the context of the food and agriculture organization of the UN
... 6 languages (en, fr, es, arabic, chinese, ru); approx. 12 m words/year
... English has the largest share of doc lang.
... websites in 6 lang. and regional relevance content in 3 lang.
... challenges for doc. and web content: tech., prof. profiles, workflow, "consumer" languages
... additional challenges are: rules and regulations, re-use of translations, TM/MT integration
... no analysis or lessons learned available currently
... envision the employment of CMS, CAT-tools, extend prof. profiles, optimize workflows
... under discussion: employment of open source software, cloud services, etc.
... funding could be based on current SME call of the EC

Stelios Piperidis, "On the way to sharing Language Resources: principles, challenges, solutions"

Stelios: [talks about language resources sharing initiative in the context of MetaNet]
... introduces the objectives and structure of Meta-Net, focus will be on Meta-Share
... emphasizes the key challenge of data and how it relates to LT research and development
... another important point in the initial discussions was standards
... observations: making data employable is costly
... Meta-Share shall be an open infrastructure that enables interoperability on various layers
... it is also built on existing projects and initiatives already in this broad field
... as an umbrella organization which shall also include national efforts
... the main idea of the Meta-Share architecture is distribution based on a "meta schema" model
... users/consumers will have the possibility to search, browse and download resources
... fully supports open source developments including appropriate maintenance
... Meta-Share governance is given by members and associate members; legal issues are under cc

Policy session Q&A

Chaals: That word count is going down does not mean translation workload decreases. Can you speculate on the implications?

Jaap: Identification of different rating criteria; human interference; word count is unmanageable; more demand
... for MT but with different pricing models

Fernando: New challenges through users; relying on help from different sites

Stelios: Subtitling has a different approach based on intellectual capabilities needed; time of media content
... multiplied by a certain factor

Chaals: Who owns the data question?

Steven: In the Netherlands all films are subtitled... quotes a translator: "we are paid by the word". What would the integration of MT mean?

Stelios: Translation based on a "master file", i.e. the translation pricing model applies.

Reinhard: Subtitling for free, i.e. by volunteers?

Chaals: Students' translations, shipped to India; there are several models...

Stefanov: Some points need to be highlighted: PEs, interpretation vs. translation, different multimedia presentations, quality control will change, etc.

Chaals: You mean librarians?

Stefanov: Not really... the picture is changing.

<chaals> [modern librarians learn to manage digital multimedia collections, and don't have to have their hair in a bun anymore. I am often surprised that they are not present at all at conferences like this - it seems we're missing out on expertise that seems highly relevant]

Christian: MT in subtitling already existing, e.g. in Scandinavia. Question on whether there are policies aimed at reducing translation costs by limiting use of multimedia?

Chaals: I have seen such rules but there are a lot of options.

END of Session

Minutes formatted by David Booth's scribe.perl version 1.135 (CVS log)
$Date: 2011/04/16 05:15:16 $