W3C

Multilingual Web Workshop

27 Oct 2010

Agenda

See also: IRC log

Attendees

Present
Regrets
Chair
Richard
Scribe
various, fsasaki, chaals

This is the raw scribe log for the sessions on day two of the MultilingualWeb workshop in Madrid. The log has not undergone careful post-editing and may contain errors or omissions. It should be read with that in mind. It constitutes the best efforts of the scribes to capture the gist of the talks and the discussions that followed, in real time. IRC was used not only to capture notes on the talks; it can also be followed in real time by remote participants or participants with accessibility needs, and people following IRC can add their own contributions to the flow of text.

See also the log for the first day.

Contents


<fsasaki> scribe: various

presentation from Felix Sasaki

<joerg> start scribing the Machines session - intro by Dan

<joerg> first speaker is felix - talking about LT and in particular interoperability of technologies

<joerg> introduces applications concerning summarization, MT and text mining and shows what is needed in terms of resources

<joerg> identifies different types of language resources

<joerg> distinguishes between linguistic approaches and statistical approaches

<joerg> machines need 3 types of data: input, resources and workflow

<joerg> shows the types of gaps that exist in this scenario: metadata, process, purpose

<joerg> these gaps are exemplified with an MT application

<joerg> purpose specifically concerns the identification of metadata, process flows and the employed resources

<joerg> the identification must be facilitated across applications with a common understanding

<joerg> therefore different communities have to join in and share the information that has to be employed in the descriptive part of the identification

<joerg> a solution that can provide a machine-readable information foundation is provided by the semantic web

<joerg> a microformat or RDFa example gives some insight into how the semantic web contributes to closing the introduced gaps

<joerg> the point to remember is that RDF is a means to provide a common information space for machines
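
[Scribe note: a minimal sketch, not from the talk, of the kind of machine-readable statements an RDFa-annotated page yields and why that gives different tools a common information space; the document URI and properties below are assumptions for illustration.]

```python
# Sketch only: the triples an RDFa/microformat annotation might boil down to,
# built here directly with rdflib. URIs and properties are illustrative.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
doc = URIRef("http://example.org/report")                  # hypothetical page
g.add((doc, DCTERMS.language, Literal("es")))              # "this page is in Spanish"
g.add((doc, DCTERMS.title, Literal("Informe anual", lang="es")))

# Any consumer (an MT engine, a summarizer, a crawler) can query the same graph:
for _, _, lang in g.triples((None, DCTERMS.language, None)):
    print("document language:", lang)
```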

<joerg> the talk closes with some ideas on joint projects

<joerg> ... and specifically on how META-NET is already working in this direction

<joerg> discussion points: language description frameworks and the complexity of RDF for browser developers

Nicoletta Calzolari

<joerg> she extends the notion of language resources to also include tools

<joerg> a new paradigm is needed in the LR world to accommodate the continuous evolution of the multilingual web

<joerg> including satisfying new demands and requirements that account for dynamic needs

<joerg> right now a web of LRs is being built up, driven by standardization efforts

<joerg> for further evolution, distributed services are also needed, including effective collaboration between the different communities

<joerg> a very important and sensitive issue concerns politics and policies to support these changes

<joerg> several EU funded projects have taken up this new R&D direction, and national initiatives are joining in to build stronger communities

<joerg> critical is to ensure a web-based access together with a global vision and cooperation

<joerg> examples are projects such as CLARIN, FLaReNet and META-NET

<joerg> interoperability between resources and tools is key for the overall success as well as more openness through sharing efforts

<joerg> question: many infrastructures - what about the interoperability right now? META-NET should/must solve this issue

Thierry Declerck

<joerg> third speaker is Thierry, talking about "lemon", an ontology-lexicon model for LRs

<joerg> lemon is part of the EU funded project Monnet and collaborates with standardization bodies

<joerg> lemon contributes to the multilingual web by providing linguistically enriched data and information

<joerg> the industrial use case in Monnet is the financial world, in particular the field of reporting

<joerg> standards used from the industrial side are XBRL, IFRS and GAAP

<joerg> the encodings in these standards are related to semantic web standards to build a bridge between financial data and linguistic data

<joerg> the approach is exemplified by an online term translation application

<joerg> talk closes with an architectural overview of the Monnet components

<joerg> standards used on the language side are among others LMF, LIR and SKOS
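
[Scribe note: a minimal sketch of the bridge described above, my own illustration rather than the lemon model itself: a financial concept, as it might come from an XBRL/IFRS taxonomy, is given multilingual lexical information via SKOS labels in RDF; the namespace and concept name are made up.]

```python
# Sketch only: linking a (hypothetical) financial concept to multilingual labels.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

FIN = Namespace("http://example.org/ifrs/")     # illustrative namespace
g = Graph()

concept = FIN.OperatingProfit
g.add((concept, SKOS.prefLabel, Literal("operating profit", lang="en")))
g.add((concept, SKOS.prefLabel, Literal("Betriebsergebnis", lang="de")))
g.add((concept, SKOS.prefLabel, Literal("resultado de explotación", lang="es")))

# A term-translation service can then pick the label for the user's language:
for label in g.objects(concept, SKOS.prefLabel):
    print(label.language, "->", label)
```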

<joerg> strong link to META-NET will be established soon...

Jose Gonzales

<joerg> next talk by Jose Gonzales of DAEDALUS with similar subjects but with a strong market view

<joerg> LRs of DAEDALUS date back to the 1990s, when no resources for Spanish were available

<joerg> initial focus was on spell checking

<joerg> which was needed mainly by the media market

<joerg> these developments had an important influence on all future developments such as search and retrieval, and even ontology development

<joerg> multilingual developments followed and were based on the continuous experience in the field

<joerg> this is exemplified by a multilingual information extraction application

<joerg> followed by an example that integrates speech (DALI)

<joerg> current applications include sentiment analysis and opinion mining (in Japanese)

<joerg> an EU funded project (favius) takes into account user-generated content and machine translation

<joerg> some online examples close the talk with an outlook on linking ontologies with lexical tools

<joerg> question: islands of LRs - could they be made available to the public? There are ISO initiatives underway to support this direction, also in terms of structuring the resources.

<joerg> An open issue is still the representation format

presentation from Jörg Schütz

<fsasaki> jörg: what is business intelligence?

<fsasaki> .. traditional BI is very complex

<fsasaki> .. four main steps: requirements, app design, development, delivery - outcome of data analysis

<fsasaki> scribe: fsasaki

Jörg: many people involved, too slow, too expensive
... emerging BI: new dynamic data resources, new online algorithms, new paradigms like agility
... relation to mlw: mainly browser based interfaces in emerging BI

<scribe> .. new model allows for more iterations, is more cost effective

Jörg: there is a need for multilingual BI
... interoperability between BI applications and language tools, resources, etc.
... that is, two different ecosystems need to be linked
... normally they use their own standards
... e.g. in BI, there is XMLA, BPMN, UML, Six Sigma, Unicode
... in mlw, you have ITS, XLIFF, MLF, TMX, TBX, SRX, GMX, Unicode
... in between there are protocols for communication
... shared serialization (XML)
... coupling should be done in a round-trip version
... how do communities interact right now?
... currently there is trust only in your own standard, and fear of more complexity
... lack of reference implementations
... in summary - missing: a common mindset for change, exchange between communities, joint reference implementations (e.g. supported by funding), self-adapting and self-learning technologies
... join the "interoperability" discussion at http://interoperability-now.org

talk by Piek Vossen

piek: presenting kyoto project
... text mining across different languages
... the text mining platform we use extracts very generic relations in text
... which you can tune to your needs
... provocative statement: why translate text if you can get knowledge out of it in a language-neutral form?
... evolution of the Web: from 1.0, 2.0 (social Web), 3.0
... in 3.0, if machines can "understand", applications can be built relying on the Web
... question is: how can we connect the different versions of the Web?
... we need: interoperable representation of the structure of language
... representation of formal conceptual knowledge
... kyoto project: for each language a linguistic processor
... a series of programs with basic processing (finding tokens, words, important structures, main verb / topic of a sentence)
... output is a uniform annotation format (kyoto format)
... that is uniform across all languages

<scribe> .. new languages can easily be plugged in, since the basic processing can easily be developed

piek: based on the uniform format, we can do word sense disambiguation, named entity recognition
... disambiguation relying on wordnet(s)
... all resources are linked to each other and to a general ontology
... that is a kind of attempt to create the global wordnet grid
... vocabularies are represented as wordnet LMF representation
... they map to central ontologies like SUMO, DOLCE and a DOLCE extension developed in the project
... the point is: WSD, NER etc. are used with the same program for all languages
... since they are working on the kyoto annotation format
... fact mining also works on that format, and creates RDF
... after that you need a "language renderer" which creates output for humans
... about kyoto annotation format: based on layered annotation format
... several layers of annotations. Each layer points to a different layer
... with the data in RDF, we can create e.g. a semantic search application
... across languages
... project and linked open data cloud: lexical resources in many languages
... summary: do not focus only on machine translation, but also on conceptual anchoring of the meaning in a shared representation
... also important: what do we do with the output of this?
... currently we have 4000 documents and generate 4 gigabytes of triples
... not trivial how to represent this to machines and users
... for humans we need to have "language renderers" for potential users
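
[Scribe note: a minimal sketch of the layered-annotation idea Piek described, using made-up identifiers rather than the actual KYOTO annotation format: each layer only points to ids in a lower layer, so the same disambiguation, NER and fact-mining code can run for any language and emit language-neutral, RDF-like output.]

```python
# Sketch only: layers as plain dicts; the schema and ids are illustrative.
tokens = {"w1": "The", "w2": "river", "w3": "is", "w4": "polluted"}   # text layer
terms = {                                                             # term layer
    "t1": {"span": ["w2"], "sense": "wn:river.n.01"},
    "t2": {"span": ["w3", "w4"], "sense": "wn:pollute.v.01"},
}
facts = [                                                             # fact layer
    {"predicate": "env:pollutedState", "theme": "t1", "evidence": ["t2"]},
]

# Fact mining can then emit language-neutral, RDF-like statements:
for fact in facts:
    surface = " ".join(tokens[w] for w in terms[fact["theme"]]["span"])
    print(f'_:f a {fact["predicate"]} ; hasTheme "{surface}" .')
```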

scribe missed question

piek: you can accumulate information from many languages

pedro: so translators are still necessary?

piek: he can make use of that information

talk by Christian Lieske

christian: talk is about work by many people in this room, like Richard Ishida, Jirka Kosek, Felix Sasaki
... ITS in your source content can help you with many things which e.g. Piek described
... example: imagine that you receive this document to translate it, spell check it etc.

(slide contains document in many languages)

scribe: questions are: what language is the document in, what are defined terms, footnotes etc.?
... ITS lets you specify information like that in terms of "data categories" - about Translatability, Localization Note, Terminology, Directionality, Ruby, Language Information, Elements within Text
... with ITS, we can express annotations in a CSS like manner
... with a local approach (like CSS "class" or "style" attributes)
... and a global approach, using XPath expressions
... usage scenarios for ITS: already there are libraries to extract translatable text or other things
... additional scenario about conversion, see a roundtripping for ITS > XLIFF > XML source http://fabday.fh-potsdam.de/~sasaki/its/
... with ITS annotated documents, you can have a processing chain to get XLIFF
... but you can also imagine having an "I18N / L10N preprocessor" in a user agent
... assume a user says "I want this in my local language"
... a processor could create an XLIFF version
... that is sent to machine translation / translation memory etc.
... the receiving application knows what to do, if the workflow is clear
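
[Scribe note: a minimal sketch, my own example rather than Christian's demo, of the two ITS mechanisms mentioned above: a local its:translate="no" attribute and a global rule whose XPath selector marks whole classes of elements; ITS inheritance to child elements is ignored to keep the sketch short.]

```python
from lxml import etree

ITS_NS = "http://www.w3.org/2005/11/its"

doc = etree.fromstring("""
<doc xmlns:its="http://www.w3.org/2005/11/its">
  <title>Installation guide</title>
  <code its:translate="no">sudo make install</code>
  <p>Run the command above, then restart the <term>daemon</term>.</p>
</doc>""")

# As the rule would appear in a global its:rules element:
#   <its:translateRule selector="//term" translate="no"/>
global_selector = "//term"

skip = set(doc.xpath(global_selector))
skip.update(doc.xpath("//*[@its:translate='no']", namespaces={"its": ITS_NS}))

for element in doc.iter():
    if element not in skip and element.text and element.text.strip():
        print("extract for translation:", element.text.strip())
```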

Q&A for Machines session

david: ITS - using data categories or using them in RDF, is there a suggested mapping?

felix: no, but that could be a part of a project to develop that

swaran: to Piek - would your application work for Indian languages?

piek: every language can work with the system, it is language independent. The preprocessing is pretty easily developed for many languages

xyz: why do you need to provide the information that a certain element name is in a certain language
... if I throw it in an English browser, it can't do anything with it

richard: the names are tokens - your system would need to understand what they "mean"
... you don't have the situation that you put it into a web browser and it has to do something with it

peter: you need the information at the top of the file saying "here is the schema"
... a developer could decide to create variable names in cyrillic
... but the keywords in e.g. C programming language need to be just the keywords
... I can write e.g. HTML in any language, but it is still HTML

jirka: if you need to know information about mapping, it can be made available by the "renaming" language standard

piek: we have a way to match language specific concepts to a general level
... conceptualization of things using language, that we can handle

xyz: you could combine both projects (ITS and kyoto) so that translation of such words is not done

richard: piek - you have a way of dealing with misalignments/gaps of terminology for a given concept in ontologies, is it a standard way?

piek: we are using a wordnet we are developing in the project, an extension of LMF
... we have formal mappings of concepts into relations
... we now have to find out how to combine the mapping relations
... lexInfo is another standardization proposal which needs to be aligned

christian: a question to people who talk about natural language processing
... all the things we heard were about analysis
... is there research where you ask people to start from the concepts
... in a language neutral manner
... so, do not start writing text, but start writing concepts?

piek: even once ontologies become very big, they are not big enough to express all the concepts you find in a language
... it makes more sense to approach lexicon and ontology development at the same time

thierry: extracting ontologies from texts is one approach
... but sometimes you have experts developing domain ontologies
... example of radiology - domain experts did not make any commitments about matching concepts and terms
... that would be the other way around, from ontologies to text
... this morning we talked not so much about translation
... but there is a need for a (human) translator, for the generation phase

pedro: NLP work that is more about generation is rule-based machine translation

thierry: the monnet project relies on the language independent format
... for some languages it is hard to get the right expression, see piek's example
... there are three projects working on this

nicoletta: both for lexica and ontologies, we are losing information
... language is more complex than lexica and ontologies
... we need to add another element - the real text
... that gives the real complexity

discussion about terminology translation

thierry: TBX (terminology standard) is encoding the information

abc: what do you link to the linked data cloud? the different wordnets, or the domain (general) description?

piek: both

end of session

<chaals> scribe: chaals

Users session

Facebook Translations and the Social Web

Ghassan: We have a monthly hackathon at Facebook. An engineer made a video to show some things that take place in real time on Facebook.

[plays video]

(looks to me like the Opera globe - showing users doing things in different places, set to classical music)

Ghassan: I like the one that shows interactions between different places, which shows how global things are
... This is not about crowd-sourcing, it's about localising the social web.
... took me close to a year before I started thinking consciously about what is unique about what we are doing
... why is social media so different.
... Three years ago when I went to interview at facebook I talked to engineers for a while. Then in comes a guy my daughter's age... the CTO.
... He said "I don't want l10n or i18n to slow down product development. is this possible?"
... I got the job... :) So I had to start thinking carefully about my 20+ years in the field.
... I didn't want to go in and start implementing things the way I had always done. I wanted to think about a new start and different ways to do things.
... people talked about crowd sourcing, so I thought "why not".
... I won't talk about the movie or the patent submission for crowd-sourcing.
... The company was founded in 2004, went from being Harvard only to academic institutions in general and in 2007 opened to the public and became a platform for other developers to create applications.
... we launched crowd-sourcing and opened translation into spanish.
... At that time we notified only 20,000 people that it could be translated.
... In 2008 we launched FBConnect. In July we opened translation to any application, and in 2009 to any site that uses FBConnect (a million+ sites).

[mission statement]

Ghassan: How well is the world represented?
... July we announced 500 million users.

[shows graph. shortly after translation there is a clear change in the growth rate]

Ghassan: each time we translate into a language, there is a huge growth in users. more than 90% of users do things in their local language.
... gone from 75% US to 25% US.
... About 50% use a translated UI (plus 5% who use en-GB)
... 500k people have contributed to translation

[Hmm. about 1 in 1000]

Ghassan: We still have a lot of stuff not translated.
... we have priorities for what needs to be translated.
... Challenge: extending translation and capability for l10n in facebook.
... Who are users? They are supposed to be 14+. We know there are a lot of 7-year-olds who pretend to be 14 and become users.
... So you can't make assumptions about the demographic. You have to allow people to select their language in a simple way.
... Works well for us.
... On main pages we provide a handful of common languages rather than the full list, to make it easier to select likely choices.
... Within a week of allowing people to choose their language easily, we tripled adoption of localisation.
... If you use e.g. English, in e.g. France, when we release french/basque/etc translations we announce it.
... We moved to more aggressive switching - we almost force you to switch if you're in a region and not using a local language. After that, switching increased by about 500%
... There are about 8000 strings in the site for notifications.
... Problems like the fact that some languages (Arabic, Russian, Polish, etc.) have more than just singular and plural (Arabic has a dual; Russian has singular, 2-4, and plural forms ...)
... Imagine doing complicated strings and localising them. it looks ugly...

[Examples of constructing phrases using automated rules]

scribe: We created 'dynamic string explosion'.
... We have code that fetches information and can select different translations.
... Microsoft for a long time had closed ecosystems for translation, but it was important. Slowly they opened up. We wanted to keep it open.
... In our community we have a rating system - you get rated up and down by the community, until you get to the level where your translations are approved automatically.
... Even then, you might have good translations, but that don't fit in context.
... Of 75 languages, about 20 are done professionally and about 50 are community only.
... We have done a lot of surveys of quality. Result: no difference between professional and community translation.
... Quality is based on what users want - not what the marketing director (EMEA, or wherever) thinks, until we change marketing director.
... Is this the perfect solution? No, but it's pretty good.
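
[Scribe note: a minimal sketch, not Facebook's actual code, of the plural problem described above: a notification string must pick a different translated form per count, and the set of forms differs per language; the Russian rule below is simplified and the message catalogue is invented.]

```python
def plural_category(lang, n):
    """Return a plural category for integer n (simplified rules)."""
    if lang == "ru":
        if n % 10 == 1 and n % 100 != 11:
            return "one"
        if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
            return "few"
        return "many"
    return "one" if n == 1 else "other"      # English-style default

translations = {                              # hypothetical message catalogue
    ("en", "one"):   "{n} friend commented on your photo",
    ("en", "other"): "{n} friends commented on your photo",
    ("ru", "one"):   "{n} друг прокомментировал ваше фото",
    ("ru", "few"):   "{n} друга прокомментировали ваше фото",
    ("ru", "many"):  "{n} друзей прокомментировали ваше фото",
}

def notify(lang, n):
    return translations[(lang, plural_category(lang, n))].format(n=n)

print(notify("ru", 2))   # picks the "few" form
print(notify("ru", 5))   # picks the "many" form
```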

Ghassan: a quick view of translation app.
... An engineer turned a switch, and by the time I woke up, it was 75% translated. By the end of the day it was done. A lot of rubbish, but it was all in french.

[screenshots of the translation interface]

(Hmm. Looks a lot like Opera's interface. I guess there aren't so many innovative UI solutions in this space)

Ghassan: You can translate inline, or in bulk.

[reads slides]

Ghassan: Got a question - have we seen highly rated translations that are rubbish? Yes, lots.
... We don't pay anyone to translate, but we have a leader board for community ratings.
... Once a translation has been approved, it is no longer available for voting or rating.
... This is really simple. And yet I have seen applications use it and end up with total rubbish.

Google community translation

Denis: will try to represent community perspective with regard to user generated content.
... I manage translation to african languages, and am based in Nairobi
... In particular, Sub-Saharan Africa - North Africa works with the Middle East team.
... On current count Africa is 5% of Internet usage, but 14% of population (note the error margin for dividing Africa).
... We used to use the number 2% until recently. So usage is growing fast.
... Barriers include prices, and in particular the cost of bandwidth getting to the continent in the first place.
... Google is interested in content, and there is an issue of having relevant content.
... We won't solve all problems alone, we are working with other stakeholders in the area.
... A key issue is that the colonial languages are used for education. South Africa took a big stand on local languages, but implementation is still patchy.
... Most exciting is the opportunity to bootstrap written form of languages, because now people can afford to do it e.g. by updating their status.
... The problem of content: (showing Wikipedia info, because we don't share Google information)
... A snapshot of articles - Amharic and Swahili (top written languages in Africa) compared to arabic, russian, chinese, english
... There is a tipping point in the growth of Wikipedia vs the growth of users around the end of 2004 (as a proxy figure to suggest the growth of user-generated content).
... We use a feedback loop of human translation to improve machine translation.
... We started by translating in silos. Community translation is an attempt to produce high-quality translation for the top 100 languages, working with University students. We provide a party, they do the work for us.
... Disambiguating in similar languages is very difficult.
... We prioritise against internet penetration, whether something is a trendy language, whether there is content available in the language.
... Never underestimate the value of certificates and t-shirts in getting people to do work.

[shows growth in localised search in Africa. Growing faster than total search growth...]

Denis: Wikipedia is source of locally relevant information. So it can seed translation, and that can be used to seed more content.

[time is up. Stop talking!]

Denis: There is a lack of data to start from, which is needed to bootstrap machine translation.
... How do we standardise stuff for highly related languages?

[Picture: people we worked with at Universities in South Africa and Senegal]

Loic Martinez - Localization and Web Accessibility

http://sidar.org -> Fundación Sidar

Loic: From today's talk, looks like everything is almost solved ;) But there are a couple more things to think about.
... People with disabilities / functional diversity have the same rights everyone else does. This is important to the web because the web is more and more important to us.
... This has all been approved by various countries and signed as treaties. But it seems countries have forgotten that they are agreeing to *do* things. Maybe we should remind them.
... who knows Web Content Accessibility Guidelines?

[about 1/3 ?]

Loic: Accessibility is not one-one mapped to disability. It's important for people in various situations.
... People with old technology, or new advanced technology that doesn't work well with things common on the web, as well.
... Sidar is the Seminario Iberoamericano on Disability and Accessibility en la Red (on the Net)
... So we work in spanish in latin america as well as spain. And also in portuguese.
... Various WCAG guidelines specifically touch the field of localisation.
... Not so much talk today about non-text content. But we need alternatives, whether text, signed, adaptive to multiple devices, etc.
... for non-text content. And these things have to be localised (including sign languages)
... In Spain we have only two different sign languages, in South America there are a handful.
... Important point: people who learn sign languages often have difficulty reading

[sign languages are grammatically and structurally different from most written languages - about as much in common as Chinese and Russian]

scribe: There are also issues with sound and how well people can hear it to distinguish the words etc.

[Mapping quick tips on accessibility to quick tips for internationalisation]

Loic: There are some gaps, like localising text alternatives, that aren't in the daily vocabulary of localisers.

<r12a> http://www.w3.org/International/quicktips/

Loic: We shouldn't need a business case for accessibility. In the words of William Loughborough, "Accessibility is a right not a privilege". On the other hand, in reality we do need to show business cases to get industry to do accessibility, just like for localisation.

Challenges of Multilingual Web in India

<crisvaldes> people in audiovisual translation have worked on accessibility and we have started talking about the web

Swaran: From the government of India. We have a big issue - there are 22 official languages. A small group in a department in government is trying to do all of this.
... Internet is increasing, but penetration about 8%, almost all in cities and in english.
... Wireless devices are increasing too. So is usage of e-government, between government groups, Gov-business, gov-citizen

[screenshots of state government sites in local languages and in english]

Swaran: There is a mixed status and mixed levels of service available
... some things are done by federal government, others by state governments.
... Unique ID project to number each Indian person has brought Unisys into collaboration with government.
... Will require multilingual systems, that can be used by everyone (so accessibility and usability in multiple languages are also important)
... So far we are focusing only on the 22 constitutionally recognised languages (of the 122 languages and ~2400 dialects).
... Hindi, the "national language" isn't spoken all over India. Nor is anything else.
... States can choose their languages, and some states have multiple languages.
... Whenever we do anything, we have to deal with a large complexity in languages.
... We started funding projects in 1991, and we have learned that we need to deal with consortia, including people from different regions to ensure we cover different languages. We're going to start dealing with machine translations among indian languages, and will make it available over the next year or so.
... We need these to enable people in different languages to access content.
... Hindi-Punjabi example: High quality, because they are close relatives.
... We need to be able to do multilingual searching, and presenting results through translation.
... At least to allow a rough idea if a site is worth translating seriously.
... We are working on Optical Character Recognition of Indian languages.
... We are converting content to digital unicode-encoded formats.
... Speech systems are under development and will start to become available soon in a few languages based on [open-source] festival text-to-speech system.
... Next step is to work on phonetic engine.
... In May we opened W3C Office, and are going to move this work forward. We think the language work has to happen in order to build a platform where business can see the business case.
... On W3C side we are looking at issues like encoding, input, display, accessibility, mobile web, etc.
... Proud to say we have all the constitutional languages completely encoded in Unicode and it is an official standard in the country.
... There are various issues of how things actually work in browsers.
... Unicode grapheme clustering might not be covered for all indian languages. We need to get the language experts to look at this and check that they are correct. We are in the process of preparing styling manuals to show what needs to be done. Then the browser developers need to come forward and help us - we are working with mozilla and Opera now...
... There are many rendering issues, some from the OS and some from the browser. We need reference implementations to help show what browsers need to do.
... Working on some standards for speech output, national standards to incorporate accessibility standards.

[time up]

scribe: no standardisation in mobile input methods, problems for other things in mobile

[question marks in multiple indian languages/scripts]

Questions

<sven_noben> Hi, thank you for your interesting speeches. I am Sven Noben and sign language user.

Q: We've done surveys on why volunteers get involved. Our experience is they are motivated to do good. None of the t-shirts, ratings, etc matter, they want feedback from their peers and collaboration.

scribe: I think we are looking at the trees - big corporates - and not the people who are involved in non-commercial web activities, which is a huge amount of the web.

Denis: There are people around who know the brand, and want to contribute. But the large concern is whether things are sustainable. Based on our work on Swahili wikipedia challenge, the answer is no.
... We did lots of good in six weeks, but it took 5 years to get to that stage.
... It is difficult and costly to work online.
... the do-good approach wears off after a while.

Swaran: I think it is difficult to get community participation. There is an NGO behind most community work, and behind them is generally funding, e.g. a company.
... In the end, somebody generally has to pay. Without money, not so much gets done.
... A lot of languages are not yet connected to the web, and don't yet know the value, so a lot of awareness needs to be generated

Q: Concerned with the fact that you give google worldwide license to content.

scribe: so Google is taking free license material from wikipedia and producing it through a process that puts it under google licensing terms.
... A bit like taking America... you're swapping a few t-shirts for the culture of the world, with a bunch of kids who don't really understand the legal implications.

Denis: The spirit of Wikipedia is for people to contribute content to make it accessible. Use of Google toolkit to enable this is within that spirit. It makes it easier for people to contribute.
... in local languages.
... Translator toolkit is human post-editing. Google machine translation includes making it available to the entire world.

[discussion about the terms and conditions, and who owns the content in the end]

Axel: Thanks for the congratulations about our Indian performance. Most of the real credit goes to the RedHat team in India.
... It would be interesting to compare Facebook usage to African internet usage - which is probably more mobile.

[see http://www.opera.com/smw for some information on mobile web usage, with highlights on Africa every so often]

Ghassan: Good question, don't know the answer in numbers.

Denis: Internet stats people have started publishing facebook users as a common metric. Don't think it gives insight in regards to mobile questions.

Sven: Hi, thank you for your interesting speeches. I am Sven Noben and sign language user.
... I am founder of the signfuse company ( http://signfuse.com ) where we try to make digital information available in sign languages and provide complete user interfaces in sign languages as well. I think it is important to consider sign languages too when talking about the multilingual web. They are as natural as all the other languages brought to attention in this workshop, and are used as a first language by both deaf and hearing people, amounting to over 100 million people worldwide. My questions thus focus on the language itself, rather than from an accessibility point of view.

scribe: Q1: May I know your opinion on this, and ask you why sign languages are often neglected when thinking about the multilingual web?
... Q2: Do Facebook and Google see opportunities for sign language content / interface?
... What are eventually barriers to take this opportunity?

Ghassan: Translation is done by user groups. If you are interested as a community we can open it up - we have opened up for Cherokee, which only has 1000 speakers. We are opening it to another native american group with 50 speakers.

Peter: In terms of localisation of content. It's in strings. We're talking about sign languages, which often have no written form at all.
... where there are, they are not encoded in unicode.
... So question to Sven: what is the expectation of supporting signed languages in the multilingual web
... If a signed language has a transcription system, those are potential candidates for Unicode, but right now they are not there.

<sven_noben> Support it like any other language. With support of video

Swaran: We have to explore, we are facing this in India too

Chaals: Agree with Sven, you need video (which is a barrier, but a vanishing one. You can get decent communication with fairly low bandwidth).

Q: What do you do about things like small European languages, such as Gallego.

Ghassan: For small languages we aren't really thinking about marketing. There isn't much business requirement to support a small language, but it is philosophically sensible

[CMN: I guess it also makes sense because it is technically not much of a stretch]

Denis: We support spanish official languages...

Ghassan: We do too. The Basque community were motivated and provided us with a lot of material.


Q: Can you comment about standardising identifiers for African languages?

Denis: In my first six months at Google I tried to prioritise languages to focus on. There are different classification schemes - some people say there are more languages, others say they are all the same.

[CMN thinks about the case of Serbian/Croatian/Serbo-croat/etc...]

Denis: We need to deal with the realities on the ground. There are varieties of Swahili, and at a practical level we assign locale information to clarify. The other issue is data on usage - that is absolutely lacking, which makes it very hard to make the right decisions.
... Governments' first priority isn't to tell us how many people are literate in a given language, the tech community has to figure that out for us.

Axel: We get contributions in languages, and we don't know what the decision we are making really is...

[/me remembers Claudio Segovia's work on trying to collect the clearly different south american native languages that are all lumped together in official tagging schemes, although some of them have substantial populations of users]

Denis: there are real issues here with slang and languages that are actually used in practice. If they are common enough, we should learn about them. Should they be languages, or variants? It isn't clear.

Richard: Next session will be a video-cast from US, so we have to be seated on time - there is no flexitime.

Chiara: thanks to speakers.

Software for the world - Mark Davis

Richard: Mark has been president of Unicode consortium since it began.

Mark: I cannot hear, so communicate by text.
... I am going to talk about the latest developments in Unicode

[shows slide of unicode taking over, repeat from someone yesterday]

Mark: Huge growth in Unicode, ascii and latin-1 have plummeted.
... Unicode is now about 50%. But there are regional differences - in japan it is 40% but rising, in China 50%
... sample selection includes a lot of pages that people don't look at much.
... Unicode 6 just came out. There are about 70 properties associated with characters in the database.
... One of the biggest problems we have is when people hardcode for particular characters. Having properties lets programmers write language-independent code
... There are characters like the rupee symbol, but also a lot of "emoji" - useful or funny symbols that are commonly used on mobiles in Japan for messaging.
... We can now (since May) have domain names that are *entirely* internationalised.
... There are problems with deployment, because there are differences between the first version of international domain names (IDNA2003) and the new version. Browsers need to match user expectations too - and in the new version upper case doesn't match lower case, although people expect it to do so since it did in the past.
... There are still a lot of old browsers out there, which don't deal with modern standards.
... There are also issues with characters that are used, but not permitted by the standard.
... UTS46 - between Unicode and browser makers, created a standard to figure out how to do stuff in practice.
... CLDR == Common Locale Data Repository, a dataset maintained by the Unicode consortium to help programmers be locale-independent
... CLDR is very widely used, so getting improvements in the data there means improvements in lots of software people use
... Products generally translate from the XML format we produce to something optimised for the product.
... Anatomy of a language tag, which can be very big so people have to allow for that.
... The script tag is important for some languages (Chinese, Uzbek, ...) and there are other things that might be useful information.
... French has, for example, common names for the days of the week and the months, so all French locales would use them. But for e.g. fr_CA everything that applies to French applies, plus it adds different currency information...
... This is what happens in the locale source information we publish. An implementation might choose to rewrite that, e.g. Posix explicitly fills out all the information for each locale, instead of relying on inheritance.
... There are different sets of exemplar characters. Main set is likely to be used for things like automatic language detection, but there are other letters used in practice.
... The index head letters are for things like defining alphabetical order, to use in a contacts address book or for sorting. These will be in the new version of CLDR for the first time.
... Flexible formats allow for different use cases, like presenting things on mobile phones in many different combinations
... Time Zone formatting is quite complicated, and people use different things. So CLDR provides various different ones
... Unit formatting is interesting because different languages have different approaches to number - czech needs 3 different forms (1, 2-4 or 5+)
... for a language like Arabic there are 6 different forms we need.
... Currencies have different things they use.
... List formats differ. Identifying letters and word spaces is tricky, especially in languages that don't use spaces.

[when I first learned spanish, "ll" was considered a single letter, and german used ß. That has changed]
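
[Scribe note: a minimal sketch of the locale-inheritance idea Mark described (fr_CA stores only what differs from fr); the data values here are made up and not CLDR's actual files.]

```python
# Sketch only: a lookup that walks the fallback chain fr_CA -> fr -> root.
LOCALE_DATA = {
    "root":  {"decimal_sep": "."},
    "fr":    {"decimal_sep": ",",
              "months": ["janvier", "février", "mars"],   # truncated for brevity
              "currency": "EUR"},
    "fr_CA": {"currency": "CAD"},            # everything else inherited from fr
}

def lookup(locale, key):
    chain = [locale]
    if "_" in locale:
        chain.append(locale.split("_")[0])   # fr_CA falls back to fr
    chain.append("root")
    for loc in chain:
        if key in LOCALE_DATA.get(loc, {}):
            return LOCALE_DATA[loc][key]
    raise KeyError(key)

print(lookup("fr_CA", "months")[0])   # "janvier" - inherited from fr
print(lookup("fr_CA", "currency"))    # "CAD"     - overridden in fr_CA
```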

Mark: Transliteration is important to translate e.g. place names into different scripts so people can use them.
... In CLDR 9 you can do things like say 'please sort cyrillic characters before latin (or after)'...
... place names are also noted, because they are different in different languages
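
[Scribe note: relating back to the IDNA discussion above - a minimal sketch; Python's built-in "idna" codec implements the older IDNA2003 mapping, whose nameprep step lower-cases input, which is why upper and lower case used to match.]

```python
# Sketch only: IDNA2003 behaviour via Python's stdlib codec.
print("Bücher.example".encode("idna"))   # b'xn--bcher-kva.example'
print("bücher.example".encode("idna"))   # same bytes: case folded away
# Under IDNA2008 the uppercase form is not a valid label as-is, which is the
# mismatch with user expectations mentioned in the talk; UTS #46 defines a
# compatibility mapping intended to bridge the two.
```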

Q: How does Unicode deal with sign languages?

Mark: If there are standards for the way sign languages are represented in symbols, we would welcome that. We are engaged in a process of doing a lot more work for symbols, too.
... Sorry, I have to cut off now. Thanks for your attention

Q: Thanks, it has been great. Where do you see W3C action in this area going? Has anything changed about that in the last 2 days.

Richard: We have some more workshops coming up. I haven't had time to gather my thoughts, so I don't have an answer yet. Sorry.

Q: Will all presentations be available?

A: yes.

Chaals: Anyone want to comment on the internationalisation issues for dead languages that are historically important?

Peter: There are a number of forms already in Unicode for historical reasons - Sumerian Cuneiform, characters used in manuscripts that have since fallen out of use.

Swaran: Grantha? (south indian language) is an example

Peter: Other issues - input methods are needed. Do we need locale information to identify different hieroglyphic sets?

Comment: Looking at historical texts, there are transformation rules that change over time.
... maybe it is necessary to add something to standards to allow for this?

Peter: Unicode is a standard for representation of characters. text representation is the primary goal. Details of presentation are left to fonts...
... although there are scenarios where it is important to indicate some stuff. Unicode can provide some generic mechanism, e.g. zero-width joiner, to indicate that a ligature is *requested*. Is that something that is needed, or is it better to have some markup at a document level (like font...)
... There are also things like variant selectors to identify which version of a character should be used in a particular place name.

Close

Richard: I hope you enjoyed the workshop. We were excited and nervous about the diversity and mixing people who don't normally meet each other.
... Please fill in the feedback form.
... and leave them on the table.
... A huge vote of thanks to the Universidad Politecnica de Madrid, and to the sponsors of the workshop.
... Especially thanks to Luis and Encarna for making this real. Thanks to the speakers, for giving great talks and finishing on time.
... thanks to the scribes. Next workshop we will start by introducing IRC so you can follow what happens. The logs will be on the web.
... Slides will also be up. Links from http://www.multilingualweb.eu
... (speakers please send slides if you haven't already). We also have video of most speakers.
... We've done the first workshop. We need to think more about how to share the information we are gathering. We have a twitter stream - @multilingweb and a facebook page, and other things can be helpful.
... We have a mailing list people can join, and use it to discuss.

[repeated ads for http://www.multilingualweb.eu/]

scribe: Thanks for coming. Next workshop will be in Pisa, and we expect the dates to be 15/16 March (to be confirmed shortly). Hope to see you there.
... Have a good trip home.

Summary of Action Items

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.135 (CVS log)
$Date: 2010/10/30 16:46:36 $