W3C

MultilingualWeb Workshop, Rome, Day 2

13 Mar 2013


Note: The following notes were made during the Workshop or produced from recordings after the Workshop was over. They may contain errors and interested parties should study the video recordings to confirm details.


Agenda
http://www.multilingualweb.eu/documents/rome-workshop/rome-program
Raw IRC log
http://www.w3.org/international/multilingualweb/rome/IRC/13-mlwrome-irc.txt (incomplete)
Chair
Arle Lommel (DFKI)
Scribes
Dominic Jones, Phil Ritchie, Dave Lewis


Machines

José Emilio Labra Gayo, “Multilingual Linked Open Data Patterns”

Scribe: Dominic Jones

Jose: best practices for MLOD given at last workshop. One pattern is a solution to a problem. Good to have catalog of patterns for selection. Common vocabs.
... best solutions for Multilingual Linked Open Data
... each pattern has description, example, discussion.
... Patterns cover naming, dereferencing, long descriptions, linking and reuse.

Jose: 20 patterns, for community to add to and adapt
... example: person is an Armenian and a professor at the University of León. Person has birthplace, position and worksAt
... 1st select a uri scheme. URI is human readable ASCII characters
... another pattern is opaque URIs where local names are not human readable. These are independent from natural language implementation
... These are hard to handle by developers
... So descriptive URIs, opaque URIs and full IRIs
... internationalized local names. Domain name is ASCII chars but local name is in local chars
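The naming patterns above can be illustrated with a small sketch. It is not taken from the talk or from the pattern catalog: the namespace `http://example.org/resource/` and the helper names `descriptive_uri`, `opaque_uri`, `full_iri` and `iri_to_uri` are hypothetical, chosen only to show how the three kinds of identifier differ.

```python
from urllib.parse import quote

BASE = "http://example.org/resource/"  # hypothetical namespace, not from the talk


def descriptive_uri(ascii_name: str) -> str:
    """Descriptive URI: human-readable local name restricted to ASCII."""
    if not ascii_name.isascii():
        raise ValueError("this sketch only accepts ASCII names for descriptive URIs")
    return BASE + quote(ascii_name.replace(" ", "_"), safe="_")


def opaque_uri(serial: int) -> str:
    """Opaque URI: the local name carries no natural-language meaning."""
    return f"{BASE}r{serial:06d}"


def full_iri(local_name: str) -> str:
    """Internationalized local name: ASCII domain, local name in the local script."""
    return BASE + local_name.replace(" ", "_")


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters for consumers that only accept URIs."""
    return quote(iri, safe=":/#?&=_")


print(descriptive_uri("University of Leon"))
print(opaque_uri(42))
print(full_iri("Հայաստան"))
print(iri_to_uri(full_iri("Հայաստան")))
```

The opaque form is generated from a serial number rather than from a label, so it stays stable when labels change; the price is that a developer cannot guess what it denotes without dereferencing it.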

Dave Lewis: See http://www.weso.es/app/webroot/MLODPatterns/

Jose: another pattern is to include language tag in the URI
... Dereference: return labels based on language code of the user
... semantic equiv of data needs to be identified.
... Labeling - label everything including using multilingual labels. ML labels have a problem when querying looks for mono-lingual labels.
... solutions = labels with no lang tag
... with this which language is the default? Longer descriptions are difficult to handle, better to have finer grained descriptions to separate out labels.
... for longer descriptions there is the possibility of structured literals.
... linking same concepts in different languages which are identified as being the same. However contradictions exist. Link linguistic meta-data exists, 1st class lang annotations.
... reuse: vocabs are generally mono-lingual. Multilingual vocabs are more difficult to maintain
... can create new localized vocabs
... future work - session on best practices for ML LOD, opportunity to improve catalog / add to / remove from catalog
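The dereferencing and labeling patterns (return labels based on the user's language, with an untagged label as a possible default) could be sketched as follows. This is an illustration only: the `pick_label` helper and the sample labels are assumptions, not part of the catalog.

```python
from typing import List, Optional, Tuple

# One resource's labels as (text, language tag) pairs; None means "no language tag".
Label = Tuple[str, Optional[str]]


def pick_label(labels: List[Label], preferred: List[str]) -> Optional[str]:
    """Return the first label matching the user's language preferences.

    Falls back to a label without a language tag, then to None.
    """
    for lang in preferred:
        for text, tag in labels:
            if tag is not None and tag.lower() == lang.lower():
                return text
    for text, tag in labels:
        if tag is None:
            return text
    return None


labels = [
    ("University of León", "en"),
    ("Universidad de León", "es"),
    ("ULE", None),  # untagged label acting as the default
]
print(pick_label(labels, ["fr", "es"]))
print(pick_label(labels, ["de"]))
```

The fallback makes the open question from the talk concrete: when no tagged label matches, the untagged label is returned, but nothing in the data says which language it is in.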

Asunción Gómez-Pérez, “Multilingualism in Linked Data”

Asuncion: all ML concepts should be addressed in LD generation
... model is simple, everything is in rdfs
... subject, property, value. Unique identifiers, URIs are used. Subjects are represented by URIs.
... using equiv links to link data sets
... lots of info sources in different languages. RDF generation and linked data allows for graphical representation of ML LOD sets
... currently looking at million literals data set
... numbers of literals with language tags has increased from 2011 to 2012
... still mostly in English. Data in other languages are similar. Most data is in English as not many countries are providing LD in languages other than English
... in the LD cloud, ML queries are achieved through 6 stages.
... 1) specification, how to model data sets. 2) Translate labels of ontology into other languages, align vocabs of other languages. Reuse / align existing vocabs. 3) RDF generation use richer models for applications
... 4) link generation - how to discover cross lingual links - how to represent cross-lingual links - how to store and reuse links.
... concepts are tagged in language-based ontology, these ontologies are linked, cross-lingual links. Properties describe medicine
... ontology in german and spanish, translate german into spanish and check for alignments or use cross-lingual-ontology matching across both
... 5) publication - links can be discovered at run time or offline, some storage method is needed for links already discovered.
... 6) Exploitation how to adapt semantic query to linguistic and cultural background of a user. Also how should results of semantic query be adapted.
... For ML LOD many services need to exist from generation through to consumption - ML LOD should be provided through service translation but now we should start including lang features in the generation of data

Peter Schmitz, “Public Linked Open Data - the Publications Office's Contribution to the Semantic Web”

Peter: Large repository for public linked open data
... publications office of EU is a publisher of EU institutions, legislations and non-legislation documents. Whole process of document management. Finally moving from paper to electronic model and from publisher to data provider
... shift from paper to electronic makes the electronic version of EU journal legally binding.
... Multilingualism is core, 23 languages used. Every EU member state requires publication in their own language. For example 2600 pages per document * 23 langs
... ML supports all member states equally therefore ML public websites must exist. For Law, procurement, CORDIS and general publication bookshop.
... Four systems for the ML semantic web
... CELLAR, EU Data Portal, Eurovoc, MDR
... 1) CELLAR is currently in production, not yet public, being loaded / populated, some key concepts - repos is defined by common data model (ontology). Semantic model is built up by these components. Loading is standardized, 30TB of data
... in repos content is stored in top level, meta-data is linked to this. Distribution side and SPARQL end point.
... 700M triples in the store. Mainly PDF, XML and XHTML.
... accessible through RESTFul API or through SPARQL endpoint
... 2) EU Data Portal. Single point of access to all structured data for linking and reuse of commercial and non-commercial data

Peter: RDF based interface for upload of meta-data
... 3) EUROVOC available in SKOS/RDF or XML format.

4) Meta-data Registry (MDR): concepts which have been validated are published through CELLAR, controlled vocabs etc.

5) For english all the languages of the EU are presented, translations are discussed between all units in the EU and therefore official translation (by member states) exist

scribe: European Legislation Identifier (ELI) follows W3C RDF / XML to provide data in standardized way.

Gordon Dunsire, “Multilingual Issues in the Representation of International Bibliographic Standards for the Semantic Web”

Gordon: IFLA body which maintains global standards for library and biblio environment.
... Separate to IFLA are ISBD and UNIMARC; all three relate to library / biblio standards,
... all three are used internationally.
... IFLA has own namespace for standards. Supports conversion from library linked data without loss of information.
... IFLA has 7 languages. Standards generally written in English and then translated into the 7 languages.
... ML website launched in spanish, partial doc in spanish of what exists in spanish already.
... Open Metadata Registry is used to store classes and URIs for each maintainer. These are opaque so as to avoid lang bias when used in RDF
... ISBD elements - problems occurred when namespace was translated. Translation into spanish became guidelines for doing future translations. Contains much info on the problems / issues of translations.
... Problems 1) scope, what is translated first and what is most useful. (developers - element definitions, labels) (Users - what they see labels of concepts in value vocabs).
... 2) Style: Verbal phrasing, CamelCasing etc
... hasAuthor, hasTitle do not translate perfectly into other languages. CamelCasing looks bad in other languages; what's OK in one language may not work in another language.

3) Disambig methods for creating labels may vary between languages.

4) Language Inflection

Partial translations only preferred label translated, have to track status of translation through a number of stages, schedules and status tracking are required.
... MulDiCat for authoritative translations of IFLA standards, available in open meta-data repository as well. More than 26 langs represented.

Thierry Declerck & Max Silberztein, “Language Technology Tools for Supporting the Multilingual Web”

Thierry: on the web there are ML pages, dictionaries, tools. Not every document is available in every language. When I access the web in German or French I don't often get docs in other languages. Mono-lingual search
... semantic resources are already available on the web. We have ML web, pages, resources but we want the Sem Web to run in combination with lang tech so we can annotate text
... GICS - classIDs, Labels, these labels use non-standard formats etc.
... towards ML linguistic Semantic Web so labels can be encoded in RDF using Lemon model - also want to mention Linguistic Linked Open Data.
... annotate text available in multiple languages 1) take all labels, analyze, combine in semantic repos using Lemon and apply to running annotated text. Can also be stored in queryable tool.
... in one ontology you display suggestion for ML labels encoded in ontology
... NooJ can be used to test NLP analysis of labels, difference is the way natural language can be expressed
... need to harmonise and modify a label for NLP. Terminological expansion of labels provide taxonomies for preferred labels. From 1 label 5 labels can be generated annotated using LEMON and exported
... triggering of ellipsis resolution to cross-lingual labels in other languages. Labels are expanded based on property of another language.
... From this we discovered semantic annotation of web documents in many languages.
... Text from Spanish stock market: two similar taxonomies generate two annotations, both labels point to the same concept but are textually different.
... labels can be displayed in many other languages and allows for annotations in higher level languages.
... needs to make sure these are compliant in terms of standardisation.

Question And Answer

lmatteis: What's the reason behind having 'opaque' URIs, and translating RDF predicates? They are merely identifiers, and as long as 'label' and 'definitions' have been properly translated, I see no reason of further complicating RDF vocabularies with multiple translations.

Gordon: Several reasons for opaque URIs. 1) A non-opaque URI must be based on something, so if the label changes the URI can't change, which is more confusing. 2) The favoring of any language over another is not good practice. 3) When translating property and class labels we're using opaque URIs

Ivan: Linked Data community doesn't know "anything" you guys are doing. Until the larger LD community is aware of your work I don't see anything changing. For devs to take ML LD into account they need to be aware of your work

Asuncion: Ontology-Lexical WG is being proposed to be used for representing. Big countries investing in LD are english speaking and are not immediately interested in ML LD.

Asuncion: From SW perspective we need a road-map to push these ML issues. White Paper for community addressing these issues.

Ivan: W3C working groups are not suited for this. For example schema.org represents vocabs that are used, we can't ignore them. Need to try to get the authors of schema.org to think about ML data
... labeling and documentation in ML form would be a huge step

Jose: I agree with your point, hence catalog of patterns has been produced.
... need to educate, hence best practices for ML LOD

Asuncion: Trying to analyze how languages are used and how these linguistic choices are applied to data sets

chaals: annotating other people's vocabs is socially difficult. Opaque URIs avoid having a language bias? No, the bias exists in the model, opaque URIs hide this from the top level view. We should be publishing annotations on other people's vocabs that are broken

jose: in the case of annotating and translating the label that you want.
... labelsforall.info similar to prefix.cc for label and translation recording.

Q: Different communities. I second Ivan’s views. In terms of ML-LOD cloud, when someone asks where is ML-ism? A URI is a resource that can be in many languages. Dimensions: Peter S talks of TB of LOD. Many people talk in terms of one record. In ML-LOD 1) concept

Richard Ishida: Aim of workshop is not just for talks but to get people together networking to move things forward.

Christian Lieske: 2 things. 1) How far does the work you are driving / continuing affect the content authors, user cataloging, etc.? 2) How far is the reviewing activity considered a general reviewer's toolkit?

Peter: no direct connections to author services, everything is translated, we're just proofreading. In being efficient we work with coded data and cataloging.

Thierry: our work has implications for labels and taxonomies. In terms of impact, it is important that we provide recommendations to change terminology to make it more applicable.

q: relation between work of the speakers and repositories like free-base?

Feiyu: instances of Freebase can be used as a kind of interlingua, which can be really useful for ML-LOD

Users

Scribe: Phil Ritchie

Pat Kane, “Internationalized Domain Names: Challenges and Opportunities”

Pat Kane: focus on end users
... Users want to use their own scripts
... growth in Asia Pacific driving non-English domain names and urls
... 1m+ international domain names registered in first six months
... 50% cjk ideographs
... Armenian scripts under-served
... major browsers handle IDNs quite well
... email addresses used a lot as identifiers in logins
... What's hindering domain registrations? greater user awareness, registrar's
... better mobile browser support, management tools
... results in a lack of trust (intent for a user to register)
... users want full idn support
... lack of ubiquity an issue
... idn's are second class domains, users are suspicious of them
... not comfortable with idn.ascii
... SME's in China are more open to idn.idn
... 5 key insights: more utility needed, initial resistance to adoption, translation preferences,
... moderate interest in registration and registrar channel expectations.
... Chinese want idn.idn not idn.ascii
... In India respondents do not visit idn.idn
... In Japan comfortable with ascii.ascii
... Korea more passionate about idn.idn
... Need multi-disciplinary groups to push adoption
... Key roles: Registries, Registrar's, content creators, application developers, Governments and businesses
... and standards organisations
... circle of dependency: adoption -- ubiquity
... change ecosystem to enhance user experience
... ubiquity drives trust
... ubiquity means not just desktop but also mobile
... mobile applications are much less capable of handling IDN's
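The idn.idn / idn.ascii / ascii.ascii distinction used in the survey results can be made concrete with a short sketch. It uses Python's built-in `idna` codec (which implements the older IDNA 2003 rules) to show the ASCII form that applications exchange behind the scenes; the helper names `to_ascii`, `to_unicode` and `shape` are hypothetical and the domains are examples only.

```python
def to_ascii(domain: str) -> str:
    """Convert an internationalized domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (Punycode) domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


def shape(domain: str) -> str:
    """Classify a domain as idn.idn, idn.ascii, ascii.idn or ascii.ascii.

    Only the last two labels (second-level name and top-level domain) are inspected.
    """
    name, tld = domain.split(".")[-2:]

    def kind(label: str) -> str:
        return "ascii" if label.isascii() else "idn"

    return f"{kind(name)}.{kind(tld)}"


print(to_ascii("bücher.example"))   # xn--bcher-kva.example
print(shape("bücher.example"))      # idn.ascii
print(shape("例え.テスト"))           # idn.idn
print(shape("example.com"))         # ascii.ascii
```

An application that does not perform the `to_unicode` step shows users the `xn--` form, which is one way the trust problems described above can arise.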

Richard Ishida, “What’s in a Name?”

Richard Ishida: concerned about data and data formats: specifically people's names
... web sites usually ask for "first" and "last" name
... Use "given" and "family" name
... names are more complicated than we generally think
... applications want to parse names and do things with them - e.g. in salutations, search and sorting
... Björk's "surname" is actually her father's name
... "bin" == son of
... Mao Ze Dong - Ze == generational name
... How you would address him depends on a lot of things
... typically he would use a western name to make things easier for western people
... multiple family names: given name plus two family names
... father's name first, mother's name second - varies by country
... Variant word forms indicating gender
... how names are inherited varies
... nicknames used often to help
... written forms can be ambiguous
... many asian names can be transcribed identically
... Recommendation: ask people how they would like to be addressed
... this topic needs a lot more work
... need an authoritative guidance on the problems of handling names

Sebastian Hellmann & Sören Auer, “The LOD2 Stack and the NLP2RDF Project”

Sebastian: LOD == Linked Open Data cloud
... http://lod-cloud.net data sets published on the net
... free, open and open licensed

sebastian going through the lod2 stack
... now about NIF format
... linguistic LOD cloud
... in NIF use fragment identifiers to address primary data
... can query NIF components as a web service
... OLiA: Vocabulary Module - mapping of over 50 Tagsets
... NIF 2.0 plans - links to ITS 2.0, Lemon ontology, XPath uri scheme
... NIF will be free and open
... looking for contributors
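The idea of using fragment identifiers to address primary data can be sketched with character offsets. The `#char=begin,end` form below is an assumption modeled on offset-based text fragments; the minutes do not record which scheme was presented, and the helper names and document URI are hypothetical.

```python
from typing import Tuple


def offset_fragment(doc_uri: str, text: str, substring: str) -> str:
    """Build a URI identifying the first occurrence of substring by character offsets."""
    begin = text.index(substring)  # raises ValueError if the substring is absent
    end = begin + len(substring)
    return f"{doc_uri}#char={begin},{end}"


def parse_fragment(uri: str) -> Tuple[int, int]:
    """Extract (begin, end) offsets from a URI built by offset_fragment."""
    _, _, fragment = uri.partition("#char=")
    begin, end = fragment.split(",")
    return int(begin), int(end)


def resolve(uri: str, text: str) -> str:
    """Return the span of text that the fragment identifier points at."""
    begin, end = parse_fragment(uri)
    return text[begin:end]


text = "Linked Open Data cloud"
uri = offset_fragment("http://example.org/doc.txt", text, "Open Data")
print(uri)
print(resolve(uri, text))
```

Because the identifier is an ordinary URI, annotations from different tools can refer to the same span of the same document without copying the text.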

Fernando Serván, “Reorganizing Information in a Multilingual Website: Issues and Challenges”

Fernando: FAO has presence in 82 country offices
... uses 6 official UN languages
... FAO users' language is primarily English, followed by Spanish and French
... currently reorganizing content to focus on decentralization and partnerships
... need to accommodate locally generated content
... Issues faced: a lot of unstructured content, web content, language versions do not match, no localized uri's, low reuse of content
... lack of mono- and multilingual ontologies to drive navigation
... do have a stable geopolitical ontology
... need to make best use of existing content, identify normative, use CMS-independent content, use MT (for Arabic and Chinese), better (intended) understand users
... want to utilize standards and best practices: XLIFF, RDF, ITS 2.0, learn from translation workflows, get social - on-demand translation
... allow users to vote for pages that should be translated
... have a set of short term and more longer term goals
... want to prioritize for Chinese, Russian, etc.

Paula Shannon, “The Globalization Penalty”

Paula: McKinsey - "Strong multinationals seem less healthy..."
... Local firms in emerging markets succeed where Multinationals fail
... Marketing defined as a key function
... global means complexity which means cost
... balance central vs local
... The Consumer Decision Journey
... written about in the Harvard Business Review
... many people in the digital age already know what they want to buy before they go to purchase it.
... consumers in the digital age trust social marketing
... push branding is irrelevant in the digital age
... so how do you form your pre-purchase opinion? Search
... changing the rules: The Global Customer Lifecycle
... 71% decide based on in-language search and peer recommendations
... 3 Biggest Problems: Traffic, Conversions, Management
... when search is bad: "Can't find... won't buy"
... Web Localization Maturity Model
... SEO localization generates 15-40% more traffic
... increase search rankings and traffic
... be in the top 3 search results by benchmarking against competitors
... SEO optimized translations is an iterative process
... look at baseline, keyword research, translation, QA, repeat
... percolate keywords throughout content
... analytics and reporting against competitors
... it's not just Google. 6/10 popular social networks in China.
... Yandex expanding out of Russia
... pace of change is accelerating
... global companies need to be hyper-local.
... utilize local search term experts
... Long Tail of search terms
... Methods for executing multi-lingual Pay Per Click: MT, Local Offices, Human Translation, Localization and Optimisation
... hosting 1.5 million pages for clients

Users Q&A

Reinhard Schaler: existing notion of "give up the illusion of control".
... What is stopping the localization industry from handing over to the users?

Paula Shannon: Localization is the step-child. No clear ROI. Localization has been a cost center thus focused on efficiency and cost reduction

Des Oates: Some companies are making steps to user empowerment
... Adobe has ceded control of certain products to users

Paula Shannon: text enrichment can help

Fernando Serván: monitor demand, traffic analytics
... demand to drive translation

Charles: when you prioritize translation on demand, how do you decide?
... how do you balance those things?
... your goal is to serve existing users

Fernando: it is tricky, I agree but we are trying to understand users better
... time will tell

Richard Ishida: In India most people are used to ascii.ascii. Yet small percentage of users speak English
... should the market for IDNs be bigger?

Pat Kane: the biggest challenge is the number of languages and scripts

???: Globalization vs Localization vs Multiculturalism
... we ignore the multicultural component
... Fernando mentioned using MT for Arabic and Chinese
... these are difficult MT languages, do you have specific reasons for using MT for these languages?

Fernando: we know Spanish and French are easier
... it is difficult to find the volume of translators for Chinese and Arabic
... SDL: it's important to consider user feedback

Open Space Sessions

Scribe: Dave Lewis

Des Oates: International domain names - chaired by Pat Kane
... best practice in multilingual linked open data - chaired by Dom Jones
... translation quality - chaired by Arle Lommel
... from floor
... interest in translation quality on postediting as well as human translation

Des: the ball is now over to the audience to propose other topics
... aim is not just to discuss but to propose action plans to deliver upon later
... some personal finding from workshop to date
... input from content creators, advances using ITS2.0 from cocomore, also with joomla and use with xliff in drupal and dita
... also real world use cases with spanish tax office
... the other big topic is multilingual search and ML SEO, including insights from Paula
... a big issue for adobe related to keyword and term management

Multilingual Linked Open Data

Working doc at: http://goo.gl/Th2VA

Ivan Herman, “Towards Multilingual Data on the Web?”

Ivan: Community needs more deployment, use cases, data and linked data
... Underlying tech needs to be seen as stable, so people are not waiting for next big thing, so W3C is not planning anything more to stack
... W3C won’t standardise things, but will rely on community groups, e.g. Open Annotation CG
... W3C want to extend this and perhaps host vocabs with stable URIs, with registry of meta-data. Not a value measure, just cataloguing meta-data and governance and version control of vocab – will include localisation quality
... Need better validation, RDF is not well suited to this, so need structural validation (schema-like) and quality validation – but how to validate a multilingual vocab – a question for this vocab
... Disconnect between LOD and non-LOD, e.g. CSV, text files etc. Web site developers use data, not linked data
... Reference London Open Data on the Web workshop

Gordon Dunsire, “Multilingual bibliographic standards in RDF: the IFLA experience”

Gordon: IFLA – similar to earlier presentation in plenary session
... Have a few trillion triples potentially, a large high quality collection
... Some translated, some not, and partial translations
... Have a ML dictionary for authoritative publication of pub categories, but not very accessible to end users or web developers

Jose E. Labra, “Patterns for Multilingual LOD: an overview”

Jose: Practical ML LOD guidelines
... Naming guidelines: extrapolated from mono lingual guidelines
... Opaque URI raises some controversy

Charles McCathie Nevile, “Web Standards and Yandex”

Chaals: Tools need to be good
... Opaque URI not helpful
... Yandex is in schema.org, but only looks at microdata rather than RDFa because it was easier, but now might be regretting this and RDFa might be better. But because of this, in Russia microdata is more common than RDFa.
... At Yandex, people tend to add label/meta-data in English, but it was a better process to do it in Russian – on the whole you perform better in your own language.

Roberto Navigli, "BabelNet: a multilingual encyclopedic dictionary as LOD"

Roberto: BabelNet – wide ML semantic network, with encyclopedic and lexicographic from Wikipedia and Wordnet
... http://babelnet.org/
... 6 languages covered, moving to 40+
... 3 million synsets
... Planning to integrate babelnet into linguistic linked open data cloud
... Contribution to LOD: Make available in lemon, real large ML LOD example

Haofen Wang, “The state of the art of Chinese LOD development”

Haofen: apex labs, china– data and knowledge management labs
... http://zhishi.me
... Chinese LOD (CLOD) – 8 million instances, 1 billion triples, Chinese wikipedia, baike.com and baidu encyclopedia site
... Issues: need to use IRI, but limited by use of older browsers
... Naming resources, Wikipedia uses traditional Chinese rather than simplified Chinese
... Integrating with e-commerce sites 360buy, taobao and soc net weibo and dianping to motivate more open LOD data streams
... Align with schema.org

Group report

Jose:
... Notes taken on google docs.
... Naming: descriptive vs opaque, depends on the use case, useful for both
... Labeling: should always have language tag
... Interlinks: sameAs and see also may not be useful in all cases, lexical/linguistic resource interlinking not always the same as conceptual interlinking
... Will start community group on best practice in ML LOD
... Richard Ishida notes this is easy to set up and join

Translation Quality Group Report

Arle: Need to decouple production method from end use
... Source quality is an issue, not always the translators fault
... Quality is dependent on the step in workflow where it is used
... Expectation need to be clear
... Are existing metrics actually valid and reproducible? Some are academic and not useful for production
... Need some process metrics to track these. Ethnographic studies of posteditors.
... Additional factors, see slides.
... So what does multilingual web do to help, three points:
... (1) Context, audience and use – common methods for HT, MT and PR MT – but need to be broadened out beyond QTLaunchpad (see workshop tomorrow)
... (2) Don’t reinvent the wheel, harmonise parallel work
... (3) An ongoing effort needed perhaps centered on MLW community at W3C

International Domain Names Group Report

Pat: It is an ecosystem problem, not just for W3C but other bodies, other voices needed also.
... Perhaps a W3C working group could be a starting point

Standards Group Report

David: CMS-L10n roundtrip, term management, and harmonisation efforts
... Seemed to cover many issues in existing ITS2.0-XLIFF mapping, e.g. terminology (usage and forbidden)
... But do need some standardized API, connectors and brokers
... In terminology need a message broker. Especially in interactive scenario with multiple terminology systems in real time.

Names Group Report

Juan Pane: Focus on 3 use cases:
... 1) Recognition, e.g. named entity recognition and resolution, focussed on person names, for MT, for search and also segmentation (over line boundaries)
... 2) Display: sorting names in lists. Contextual usage – formal, familial, full (postal), autocompletion, abbreviation (e.g. in paper author list), text to speech
... 3) Capturing names: transliteration, speech to text, input form – size, order labels
... Problems listed
... Propose perhaps define an ontology of names

Crowdsourcing/Non-Market Strategies Group Report

Reinhard: Discussed different scenarios, e.g. for commercial and for non-market/non-profit, people motivation and associated support systems
... Practical implementations: environments need to be easy to use, looking at Easyling, FAO and SOLAS
... Too few to set up a bigger group, but invite others to participate.

Open Space Q&A

Christian Lieske: why do we need an additional terminology related standard, can’t we reuse existing LOD mechanisms?

David Filip: agrees that linked data can help but need specific support for terminology. This area also suffers from many poorly adopted standards so a new one might make sense.

Christian Lieske: good to bring LOD, terminology communities together

David Filip: agrees, standards harmonisation is key to going forward. But also there is a gap in the API level

Felix: looks for more standardisation people and localisation companies in the ML LOD best practice group.

Des: supports this call
... asks about CMIS

David Filip: yes, work

Pedro: we are going to GALA, so if there is a clear message we will

Ioannis: also visiting a industrial term working

Dave: let’s open group now, so we have a concrete URL to point people at

Christian Lieske: how will names discussion advance?

Richard: no specific plans

David Filip: asks about locatives in names

Richard: not addressed this yet, as there were more immediate use cases, nor inflection, or other context

Dave: asks if I18n interest group is a good place for this

Richard: community group can be more focussed, IG may reach more people but interest can be more focussed

Felix: agree we need to think about where best to place this. But in all cases we need a hero to drive it forward – the ML-LOD BP seems to have two to three

Richard: +1 need committed driving person

Des: wrap up there, thanks everyone

Closing Remarks

Arle: Slides will be available soon, by next week. Linked from programme page
... There is streaming video available already provided by FAO, and better quality lectures available from Video Lectures
... Report will be produced soon, based on scribes
... Thanks to sponsors, Verisign who support workshop and dinner, and QTLaunchpad, who will having, FAO for local support, DFKI and Nieves, and Felix. Thanks to EC and their sponsorships, W3C for logistical home, programme committee, organizing group, speakers, chairs and scribe (especially Felix).
... Funding for conference series comes from MLW-LT which finished end of year. Waiting to hear on further funding from EU projects, but if anyone has further opportunities for funding or is willing to host future events please talk to Felix or Arle.


Minutes formatted by David Booth's scribe.perl version 1.137 (CVS log)
$Date: 2013/05/08 09:57:00 $ Results corrected by Arle Lommel and Nieves Sande