07:25:19 RRSAgent has joined #mlw
07:25:19 logging to http://www.w3.org/2010/10/27-mlw-irc
07:25:36 meeting: Multilingual Web Workshop
07:25:40 chair: Richard
07:25:45 scribe: various
07:26:44 agenda: http://www.w3.org/International/multilingualweb/madrid/program
07:27:02 topic: presentation from Felix Sasaki
07:33:44 joerg has joined #mlw
07:34:41 start scribing the machines session - intro by Dan
07:36:57 first speaker is Felix - talking about LT (language technology) and in particular interoperability of technologies
07:40:54 introduces applications concerning summarization, MT and text mining and shows what is needed in terms of resources
07:41:34 identifies different types of language resources
07:42:16 distinguishes between linguistic approaches and statistical approaches
07:43:31 machines need 3 types of data: input, resources and workflow
07:45:36 shows the types of gaps that exist in this scenario: metadata, process, purpose
07:46:58 these gaps are exemplified with an MT application
07:51:25 purpose specifically concerns the identification of metadata, process flows and the employed resources
07:53:03 the identification must be facilitated across applications with a common understanding
07:54:07 therefore different communities have to join in and share the information that has to be employed in the descriptive part of the identification
07:57:04 a solution that can provide a machine-readable information foundation is provided by the semantic web
07:58:38 a microformat or RDFa example gives some insights into how the semantic web contributes to closing the introduced gaps
08:00:08 the point to remember is that RDF is a means to provide a common information space for machines
08:01:07 the talk closes with some ideas on joint projects
08:02:20 ... and specifically on how META-NET is already working in this direction
08:06:37 labra has joined #mlw
08:07:41 discussion points: language description frameworks and the complexity of RDF for browser developers
08:08:19 second talk by Nicoletta
08:08:58 she extends the notion of language resources to also include tools
08:10:29 a new paradigm is needed in the LR world to accommodate the continuous evolution of the multilingual web
08:11:53 including the satisfaction of new demands and requirements which shall account for the dynamic needs
08:14:05 right now a web of LRs is being built up, driven by standardization efforts
08:15:10 for the further evolution distributed services are also needed, including effective collaborations between the different communities
08:16:58 a very important and sensitive issue concerns politics and policies to support these changes
08:18:38 several EU funded projects have taken up this new R&D direction, and national initiatives are joining in to build stronger communities
08:20:31 critical is to ensure web-based access together with a global vision and cooperation
08:21:10 examples are projects such as CLARIN, FLaReNet and META-NET
08:22:42 interoperability between resources and tools is key for the overall success, as well as more openness through sharing efforts
08:27:55 question: many infrastructures - what about the interoperability right now? META-NET should/must solve this issue
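To illustrate the point above - that RDF can act as a common, machine-readable information space for describing language resources - here is a minimal sketch using the Python rdflib library (assumed available). The vocabulary URIs and property names are invented for illustration; they are not the ones shown in Felix's talk.

```python
# Minimal sketch: shared RDF metadata about a language resource.
# Assumes rdflib is installed; the EX vocabulary is hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/lt#")   # hypothetical vocabulary
g = Graph()

corpus = URIRef("http://example.org/resources/news-corpus-es")
g.add((corpus, EX.resourceType, Literal("corpus")))
g.add((corpus, EX.language, Literal("es")))
g.add((corpus, EX.intendedUse, Literal("statistical machine translation")))

# Any tool that understands the shared vocabulary can now read the same metadata.
print(g.serialize(format="turtle"))
```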
08:29:06 third speaker is Thierry, talking about "lemon", an ontology-lexicon model for LRs
08:30:41 lemon is part of the EU funded project Monnet and collaborates with standardization bodies
08:32:34 lemon contributes to the multilingual web by providing linguistically enriched data and information
08:34:55 the industrial use case in Monnet is the financial world, in particular the field of reporting
08:36:17 standards used from the industrial side are XBRL, IFRS and GAAP
08:39:12 the encodings in these standards are related to semantic web standards to build a bridge between financial data and linguistic data
08:40:24 the approach is exemplified by an online term translation application
08:43:00 talk closes with an architectural overview of the Monnet components
08:45:04 standards used on the language side are, among others, LMF, LIR and SKOS
08:46:28 strong link to META-NET will be established soon...
08:49:19 next talk by Jose Gonzales of DAEDALUS with similar subjects but with a strong market view
08:51:11 LRs of DAEDALUS date back to 1990, when no resources for Spanish were available
08:51:44 initial focus was on spell checking
08:52:33 which was needed mainly by the media market
08:54:32 these developments had an important influence on all future developments such as search and retrieval, and even ontology development
08:56:07 multilingual developments followed and were based on the continuous experience in the field
08:57:08 and are exemplified by a multilingual information extraction application
08:58:00 followed by an example that integrates speech (DALI)
08:59:31 current applications include sentiment analysis and opinion mining (in Japanese)
09:00:35 an EU funded project (favius) takes into account user-generated content and machine translation
09:03:51 some online examples close the talk with an outlook on linking ontologies with lexical tools
09:07:58 question: islands of LRs - could they be made available to the public? There are ISO initiatives underway to support this direction, also in terms of structuring of the resources.
09:08:22 An open issue is still the representation format
09:08:41 END SCRIBING machines session
09:36:11 fsasaki has joined #mlw
09:36:50 topic: presentation from Jörg Schütz
09:39:05 jörg: what is business intelligence?
09:40:04 .. traditional BI is very complex
09:41:07 .. four main steps: requirements, app design, development, delivery - outcome of data analysis
09:41:09 labra has joined #mlw
09:41:15 scribe: fsasaki
09:42:11 .. many people involved, too slow, too expensive
09:43:19 .. emerging BI: new dynamic data resources, new online algorithms, new paradigms like agility
09:43:56 .. relation to mlw: mainly browser based interfaces in emerging BI
09:46:12 .. new model allows for more iterations, is more cost effective
09:46:24 .. there is a need for multilingual BI
09:47:19 .. interoperability between BI applications and language tools, resources, etc.
09:47:54 .. that is, two different ecosystems need to be linked
09:48:02 .. normally they use their own standards
09:48:17 .. e.g. in BI, there is XMLA, BPMN, UML, Six Sigma, Unicode
09:48:32 .. in mlw, you have ITS, XLIFF, MLF, TMX, TBX, SRX, GMX, Unicode
09:48:45 .. in between there are protocols for communication
09:48:51 .. shared serialization (XML)
09:49:12 .. coupling should be done in a round-trip version
09:51:17 .. how do communities interact right now?
09:51:36 .. currently trust only in your own standard, fear of more complexity
09:51:45 .. lack of reference implementations
09:52:59 .. in summary - missing: a common mindset for change, exchange between communities, joint reference implementations (e.g. supported by funding), self-adapting and self-learning technologies
09:53:44 .. join the "interoperability" discussion at http://interoperability-now.org
09:54:28 topic: talk by Piek Vossen
09:55:26 piek: presenting the kyoto project
09:55:34 .. text mining across different languages
09:55:55 .. the text mining platform we use is very generic - it extracts relations in text
09:56:03 .. which you can tune to your needs
09:56:35 .. provocative statement: why translate text if you can get knowledge out of it in a language neutral form?
09:57:50 .. evolution of the Web: from 1.0, 2.0 (social Web), 3.0
09:58:27 .. in 3.0, if machines can "understand", they can build applications relying on the Web
09:58:42 .. question is: how can we connect the different versions of the Web?
09:59:00 .. we need: interoperable representation of the structure of language
09:59:15 .. representation of formal conceptual knowledge
09:59:27 .. kyoto project: for each language a linguistic processor
09:59:53 .. a series of programs with basic processing (finding tokens, words, important structures, main verb / topic of a sentence)
10:00:08 .. output is a uniform annotation format (kyoto format)
10:00:19 .. that is uniform across all languages
10:00:37 .. new languages can easily be plugged in, since the basic processing can easily be developed
10:01:09 .. based on the uniform format, we can do word sense disambiguation, named entity recognition
10:01:31 .. disambiguation relying on wordnet(s)
10:01:43 .. all resources are linked to each other and to a general ontology
10:01:53 .. that is a kind of attempt to create the global wordnet grid
10:02:04 .. vocabularies are represented in the wordnet LMF representation
10:02:19 .. they map to central ontologies like SUMO, DOLCE and a DOLCE extension developed in the project
10:03:12 .. the point is: WSD, NER etc. are used with the same program for all languages
10:03:21 .. since they are working on the kyoto annotation format
10:03:38 .. fact mining also works on that format, and creates RDF
10:04:09 .. after that you need a "language renderer" which creates output for humans
10:04:21 .. about the kyoto annotation format: it is a layered annotation format
10:04:40 .. several layers of annotations. Each layer points to a different layer
10:08:45 .. with the data in RDF, we can create e.g. a semantic search application
10:08:48 .. across languages
10:10:46 .. project and linked open data cloud: lexical resources in many languages
10:11:56 .. summary: do not focus only on machine translation, but also on conceptual anchoring of the meaning in a shared representation
10:12:11 .. also important: what do we do with the output of this?
10:12:25 .. currently we have 4000 documents and generate 4 gigabytes of triples
10:12:41 .. not trivial how to represent this to machines and users
10:13:19 .. for humans we need to have "language renderers" for potential users
10:14:41 scribe missed question
10:15:01 piek: you can accumulate information from many languages
10:15:09 pedro: so translators are still necessary?
10:15:32 piek: the translator can make use of that information
10:15:38 topic: talk by Christian Lieske
10:17:44 claudio has joined #mlw
10:18:15 .. talk is about work by many people in this room, like Richard Ishida, Jirka Kosek, Felix Sasaki
10:18:48 .. ITS in your source content can help you with many of the things that e.g. Piek described
10:19:16 .. example: imagine that you receive this document to translate it, spell check it etc.
10:19:25 (slide contains document in many languages)
10:19:57 .. questions are: what language is the document in, what are defined terms, footnotes etc.?
10:22:11 .. ITS lets you specify information like that in terms of "data categories" - Translatability, Localization Note, Terminology, Directionality, Ruby, Language Information, Elements Within Text
10:29:31 r12a has joined #mlw
10:29:44 .. with ITS, we can express annotations in a CSS-like manner
10:30:02 .. with a local approach (like CSS "class" or "style" attributes)
10:30:16 .. and a global approach, using XPath expressions
10:31:30 .. usage scenarios for ITS: there are already libraries to extract translatable text or other things
10:32:12 .. additional scenario about conversion, see a roundtripping for ITS > XLIFF > XML source http://fabday.fh-potsdam.de/~sasaki/its/
10:32:30 .. with ITS annotated documents, you can have a processing chain to get XLIFF
10:32:46 .. but you can also imagine having an "I18N / L10N preprocessor" in a user agent
10:32:59 .. assume a user says "I want this in my local language"
10:33:11 .. a processor could create an XLIFF version
10:33:24 .. that is sent to machine translation / translation memory etc.
10:33:56 .. the receiving application knows what to do, if the workflow is clear
10:35:23 david: ITS - using data categories or using them in RDF, is there a suggested mapping?
10:35:40 felix: no, but that could be part of a project to develop that
10:36:07 swaran: to Piek - would your application work for Indian languages?
10:36:45 piek: every language can work with the system, it is language independent. The preprocessing is pretty easily developed for many languages
10:37:28 xyz: why do you need to provide the information that a certain element name is in a certain language
10:37:55 .. if I throw it in an English browser, it can't do anything with it
10:38:11 richard: the names are tokens - your system would need to understand what they "mean"
10:38:38 .. you don't have the situation that you put it into a web browser and it has to do something with it
10:39:00 peter: you need the information at the top of the file saying "here is the schema"
10:39:15 .. a developer could decide to create variable names in cyrillic
10:39:42 .. but the keywords in e.g. the C programming language need to be just the keywords
10:40:00 .. I can write e.g. HTML in any language, but it is still HTML
10:43:40 jirka: if you need to know information about mapping, it can be made available by the "renaming" language standard
10:46:01 piek: we have a way to match language specific concepts to a general level
10:46:18 .. conceptualization of things using language, that we can handle
10:47:20 xyz: you could combine both projects (ITS and kyoto) so that translation of such words is not done
10:47:43 richard: piek - you have a way of dealing with ontologies, is it a standard way?
10:47:59 piek: we are using a wordnet we are developing in the project, an extension of LMF
10:48:11 .. we have formal mappings of concepts into relations
10:48:26 .. we have to find out how to combine the mapping relations now
10:48:42 .. lexInfo is another standardization proposal which needs to be aligned
10:49:01 christian: a question to people who talk about natural language processing
10:49:15 .. all the things we heard were about analysis
10:49:29 .. is there research where you ask people to start from the concepts
10:49:35 .. in a language neutral manner
10:49:56 .. so, do not start writing text, but start writing concepts?
10:50:20 piek: once the ontologies become very big, they are still not big enough to express all concepts you find in a language
10:51:07 .. it makes more sense to approach lexicon and ontology development at the same time
10:51:23 thierry: extracting ontologies from texts is one approach
10:51:34 .. but sometimes you have experts developing domain ontologies
10:52:03 .. example of radiology - domain experts did not make any commitments about matching concepts and terms
10:52:20 .. that would be the other way around, from ontologies to text
10:52:49 .. this morning we talked not so much about translation
10:53:07 .. but there is a need for a (human) translator, for the generation phase
10:53:21 pedro: NLP which works more on generation is machine translation in the rule based approach
10:54:17 thierry: the monnet project relies on the language independent format
10:54:40 .. for some languages it is hard to get the right expression, see piek's example
10:54:55 .. there are three projects working on this
10:55:08 nicoletta: both for lexica and ontologies, we are losing information
10:55:21 .. language is more complex than lexica and ontologies
10:55:33 .. we need to add another element - the real text
10:55:42 .. that gives the real complexity
10:59:41 discussion about terminology translation
10:59:57 thierry: TBX (terminology standard) is encoding the information
11:01:09 abc: what do you link to the linked data cloud? the different wordnets, or the domain (general) description?
11:01:13 piek: both
11:03:31 end of session
11:58:55 Sven2 has joined #mlw
12:02:14 joerg has joined #mlw
12:04:15 Jirka has joined #mlw
12:05:44 chaals has joined #mlw
12:05:52 scribe: chaals
12:05:58 Topic: Users session
12:07:19 Topic: Facebook Translations and the Social Web
12:07:21 Ghassan: We have a monthly hackathon at facebook. An engineer did a video to show some things that take place in real time on facebook.
12:07:34 [plays video]
12:08:09 (looks to me like the Opera globe - showing users doing things in different places to classical music)
12:08:34 Ghassan: I like the one that shows interactions between different places, which shows how global things are
12:09:38 labra has joined #mlw
12:10:23 crisvaldes has joined #mlw
12:10:52 Ghassan: This is not about crowd-sourcing, it's about localising the social web.
12:11:14 ... took me close to a year before I started thinking consciously about what is unique about what we are doing
12:11:23 ... why is social media so different.
12:11:42 fsasaki has joined #mlw
12:11:55 ... Three years ago when I went to interview at facebook I talked to engineers for a while. Then in comes a guy my daughter's age... the CTO.
12:12:14 ... He said "I don't want l10n or i18n to slow down product development. Is this possible?"
12:12:32 ... I got the job... :) So I had to start thinking carefully about my 20+ years in the field.
12:13:35 ... I didn't want to go in and start implementing things the way I had always done. I wanted to think about a new start and different ways to do things.
12:13:38 fsasaki_ has joined #mlw
12:13:48 ... people talked about crowd sourcing, so I thought "why not".
12:13:59 ... I won't talk about the movie or the patent submission for crowd-sourcing.
12:14:30 ... The company was founded in 2004, went from being Harvard only to academic institutions in general, and in 2007 opened to the public and became a platform for other developers to create applications.
12:14:41 ... we launched crowd-sourcing and opened translation into spanish.
12:14:59 ... At that time we notified only 20000 people that it could be translated.
12:15:34 ... 2008 launched FBConnect. July we opened translation to any application, 2009 to any site that uses fbconnect. (A million+ sites)
12:15:43 [mission statement]
12:16:03 Ghassan: How well is the world represented?
12:16:18 ... July we announced 500 million users.
12:16:35 [shows graph. shortly after translation there is a clear change in the growth rate]
12:17:13 Ghassan: each time we translate into a language, there is a huge growth in users. more than 90% of users do things in their local language.
12:17:45 ... gone from 75% US to 25% US.
12:18:00 ... About 50% use a translated UI (plus 5% who use en-GB)
12:18:09 ... 500k people have contributed to translation
12:18:25 [Hmm. about 1 in 1000]
12:18:41 Ghassan: We still have a lot of stuff not translated.
12:19:02 ... we have priorities for what needs to be translated.
12:19:27 ... Challenge: extending translation and capability for l10n in facebook.
12:20:10 ... Who are users? They are supposed to be 14+. We know there are a lot of 7-year-olds who pretend to be 14 and become users.
12:20:31 ... So you can't make assumptions about the demographic. You have to allow people to select their language in a simple way.
12:20:44 ... Works well for us.
12:21:32 ... On main pages we provide a handful of common languages rather than the full list, to make it easier to select likely choices.
12:22:00 ... Within a week of allowing people to choose their language easily, we tripled adoption of localisation.
12:22:46 ... If you use e.g. English, in e.g. France, when we release french/basque/etc translations we announce it.
12:23:35 ... We moved to more aggressive switching - we almost force you to switch if you're in a region and not using a local language. After that, switching increased by about 500%
12:23:55 ... There are about 8000 strings in the site for notifications.
12:25:29 ... Problems like the fact that some languages (arabic, russian, polish, etc) have more than just singular and plural (arabic has a dual, russian has singular, 2-4, plural, ...)
12:27:03 ... Imagine doing complicated strings and localising them. it looks ugly...
12:27:42 [Examples of constructing phrases using automated rules]
12:28:02 ... We created 'dynamic string explosion'.
12:28:33 ... We have code that fetches information and can select different translations.
12:30:35 ... Microsoft for a long time had closed ecosystems for translation, but it was important. slowly they opened up. We wanted to keep it open.
12:31:07 ... In our community we have a rating system - you get rated up and down by the community, until you get to the level where your translations are approved automatically.
12:31:25 ... Even then, you might have good translations that don't fit in context.
12:31:45 ... Of 75 languages, about 20 are done professionally and about 50 are community only.
12:32:12 ... We have done a lot of surveys of quality. Result: no difference between professional and community translation.
12:32:41 ... Quality is based on what users want - not what the marketing director (EMEA, or wherever) thinks, until we change marketing director.
12:32:51 ... Is this the perfect solution? No, but it's pretty good.
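Ghassan's point about plural handling (the Arabic dual, Russian's extra forms) is easy to see in code. Below is a rough, simplified sketch of CLDR-style plural category selection - not Facebook's actual "dynamic string explosion" code - with an invented message table for Russian.

```python
# Rough sketch: per-language plural categories drive message selection.
# The rules are simplified (integers only); the message table is invented.

def plural_category(lang, n):
    if lang == "ru":            # Russian: one / few / many (simplified)
        if n % 10 == 1 and n % 100 != 11:
            return "one"
        if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
            return "few"
        return "many"
    if lang == "ar":            # Arabic: zero / one / two / few / many / other (simplified)
        if n == 0: return "zero"
        if n == 1: return "one"
        if n == 2: return "two"
        if 3 <= n % 100 <= 10: return "few"
        if 11 <= n % 100 <= 99: return "many"
        return "other"
    return "one" if n == 1 else "other"   # English-like default

MESSAGES = {                              # hypothetical translations
    ("ru", "one"):  "{n} фотография",
    ("ru", "few"):  "{n} фотографии",
    ("ru", "many"): "{n} фотографий",
}

for n in (1, 3, 5, 21):
    print(MESSAGES[("ru", plural_category("ru", n))].format(n=n))
```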
12:33:19 Ghassan: a quick view of the translation app.
12:33:48 ... An engineer turned a switch, and by the time I woke up, it was 75% translated. By the end of the day it was done. A lot of rubbish, but it was all in french.
12:34:05 [screenshots of the translation interface]
12:34:35 (Hmm. Looks a lot like Opera's interface. I guess there aren't so many innovative UI solutions in this space)
12:34:54 Ghassan: You can translate inline, or in bulk.
12:35:48 [reads slides]
12:36:07 Ghassan: Got a question - have we seen highly rated translations that are rubbish? Yes, lots.
12:36:22 ... We don't pay anyone to translate, but we have a leader board for community ratings.
12:36:41 ... Once a translation has been approved, it is no longer available for voting or rating.
12:37:06 ... This is really simple. And yet I have seen applications use it and end up with total rubbish.
12:38:21 Topic: Google community translation
12:39:29 Denis: will try to represent the community perspective with regard to user generated content.
12:39:43 ... I manage translation to african languages, and am based in Nairobi
12:40:22 ... In particular, Sub-Saharan Africa - North Africa works with the Middle East team.
12:41:34 ... On current count Africa is 5% of Internet usage, but 14% of population (note the error margin for dividing Africa).
12:41:48 ... We used to use the number 2% until recently. So usage is growing fast.
12:42:06 ... Barriers include prices, and in particular the cost of bandwidth getting to the continent in the first place.
12:42:31 ... Google is interested in content, and there is an issue of having relevant content.
12:43:01 ... We won't solve all problems alone, we are working with other stakeholders in the area.
12:45:03 ... It is key that the colonial languages are used for education. South Africa took a big stand on local languages, but implementation is still patchy.
12:45:34 ... Most exciting is the opportunity to bootstrap the written form of languages, because now people can afford to do it e.g. by updating their status.
12:45:57 ... The problem of content: (Showing Wikipedia info, because we don't share google information)
12:46:49 ... A snapshot of articles - Amharic and Swahili (top written languages in Africa) compared to arabic, russian, chinese, english
12:47:07 claudio has joined #mlw
12:49:50 ... Growth of wikipedia vs growth of users: there is a tipping point in the growth of wikipedia about end of 2004. (As a proxy figure to suggest growth of user-generated content.)
12:51:12 ... We use a feedback loop of human translation to improve machine translation.
12:53:29 ... We started by translating in silos. Community translation is an attempt to produce high-quality translation for the top 100 languages, working with University students. We provide a party, they do the work for us.
12:54:14 ... Disambiguating in similar languages is very difficult.
12:54:50 ... We prioritise against internet penetration, whether something is a trendy language, whether there is content available in the language.
12:55:15 ... Never underestimate the value of certificates and t-shirts in getting people to do work.
12:55:42 [shows growth in localised search in Africa. Growing faster than total search growth...]
12:56:46 Denis: Wikipedia is a source of locally relevant information. So it can seed translation, and that can be used to seed more content.
12:58:02 [time is up. Stop talking!]
12:58:33 Denis: There is a lack of data to start from, which is needed to bootstrap machine translation.
12:58:43 ... How do we standardise stuff for highly related languages?
12:59:08 [Picture: people we worked with at Universities in South Africa and Senegal]
12:59:52 Topic: Loic Martinez - Localization and Web Accessibility
13:00:02 http://sidar.org -> Fundación Sidar
13:00:42 Loic: From today's talk, it looks like everything is almost solved ;) But there are a couple more things to think about.
13:02:16 ... People with disabilities / functional diversity have the same rights everyone else does. This is important to the web because the web is more and more important to us.
13:02:46 ... This has all been approved by various countries and signed as treaties. But it seems countries have forgotten that they are agreeing to *do* things. Maybe we should remind them.
13:02:50 glazou has joined #mlw
13:03:19 ... who knows the Web Content Accessibility Guidelines?
13:03:25 [about 1/3 ?]
13:04:21 Hexenhammer has joined #mlw
13:04:26 Loic: Accessibility is not one-one mapped to disability. It's important for people in various situations.
13:05:02 ... People with old technology, or new advanced technology that doesn't work well with things common on the web, as well.
13:06:03 ... Sidar is the Seminario IberoAmericano for Disability and Accessibility en la Red (on the Net)
13:06:19 ... So we work in spanish in latin america as well as spain. And also in portuguese.
13:07:08 ... Various WCAG guidelines specifically touch the field of localisation.
13:09:01 ... Not so much talk today about non-text content. But we need alternatives, whether text, signed, adaptive to multiple devices, etc.
13:09:19 ... for non-text content. And these things have to be localised (including sign languages)
13:09:56 ... In Spain we have only two different sign languages, in South America there are a handful.
13:10:20 ... Important point: people who learn sign languages often have difficulty reading
13:10:44 [sign languages are grammatically and structurally different from most written languages - about as much in common as chinese and russian]
13:10:55 r12a has joined #mlw
13:11:14 ... There are also issues with sound and how well people can hear it to distinguish the words etc.
13:13:57 [Mapping quick tips on accessibility to quick tips for internationalisation]
13:14:45 Loic: There are some gaps, like localising text alternatives, that aren't in the daily vocabulary of localisers.
13:15:08 http://www.w3.org/International/quicktips/
13:15:42 ... We shouldn't need a business case for accessibility. In the words of William Loughborough, "Accessibility is a right not a privilege". On the other hand, in reality we do need to show business cases to get industry to do accessibility, just like for localisation.
13:16:07 Topic: Challenges of Multilingual Web in India
13:16:19 people in audiovisual translation have worked on accessibility and we have started talking about the web
13:16:50 Swaran: From the government of India. We have a big issue - there are 22 official languages. A small group in a department in government is trying to do all of this.
13:17:06 ... Internet is increasing, but penetration is about 8%, almost all in cities and in english.
13:17:43 ... Wireless devices are increasing too. So is usage of e-government, between government groups, gov-business, gov-citizen
13:18:06 [screenshots of state government sites in local languages and in english]
13:18:28 Swaran: There is a mixed status and mixed levels of service available
13:18:46 ... some things are done by the federal government, others by state governments.
13:19:21 ... Unique ID project to number each Indian person has brought Unisys into collaboration with government.
13:20:02 ... Will require multilingual systems that can be used by everyone (so accessibility and usability in multiple languages are also important)
13:20:54 ... So far we are focusing only on the 22 constitutionally recognised languages (of the 122 languages and ~2400 dialects).
13:21:16 ... Hindi, the "national language", isn't spoken all over India. Nor is anything else.
13:21:39 ... States can choose their languages, and some states have multiple languages.
13:22:37 ... Whenever we do anything, we have to deal with a large complexity in languages.
13:23:44 ... We started funding projects in 1991, and we have learned that we need to deal with consortia, including people from different regions, to ensure we cover different languages. We're going to start dealing with machine translation among indian languages, and will make it available over the next year or so.
13:24:06 ... We need these to enable people in different languages to access content.
13:24:28 ... Hindi-Punjabi example: High quality, because they are close relatives.
13:24:51 ... We need to be able to do multilingual searching, and presenting results through translation.
13:25:21 ... At least to allow a rough idea of whether a site is worth translating seriously.
13:25:40 ... We are working on Optical Character Recognition of Indian languages.
13:25:54 ... We are converting content to digital unicode-encoded formats.
13:26:54 ... Speech systems are under development and will start to become available soon in a few languages based on the [open-source] festival text-to-speech system.
13:27:14 ... Next step is to work on a phonetic engine.
13:27:55 ... In May we opened a W3C Office, and are going to move this work forward. We think the language work has to happen in order to build a platform where business can see the business case.
13:28:16 ... On the W3C side we are looking at issues like encoding, input, display, accessibility, mobile web, etc.
13:28:44 ... Proud to say we have all the constitutional languages completely encoded in Unicode and it is an official standard in the country.
13:29:29 ... There are various issues of how things actually work in browsers.
13:31:13 ... Unicode grapheme clustering might not be covered for all indian languages. We need to get the language experts to look at this and check that they are correct. We are in the process of preparing styling manuals to show what needs to be done. Then the browser developers need to come forward and help us - we are working with mozilla and Opera now...
13:32:00 ... There are many rendering issues, some from the OS and some from the browser. We need reference implementations to help show what browsers need to do.
13:32:54 ... Working on some standards for speech output, national standards to incorporate accessibility standards.
13:34:26 [time up]
13:34:45 ... no standardisation in mobile input methods, problems for other things in mobile
13:35:11 [question marks in multiple indian languages/scripts]
13:35:22 Topic: Questions
13:36:04 Hi, thank you for your interesting speeches. I am Sven Noben, a sign language user.
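Swaran's remark about grapheme clustering is the observation that one user-perceived character in an Indic script often spans several code points, so "character" counts and cursor movement need cluster-aware handling. A tiny sketch, assuming the third-party Python `regex` module (whose `\X` pattern matches grapheme clusters); the sample word is just an illustration:

```python
# Code points vs. user-perceived characters in Devanagari.
# Assumes the third-party "regex" module is installed.
import regex

word = "क्षि"   # KA + VIRAMA + SSA + vowel sign I: 4 code points
print(len(word))                        # 4 code points
print(len(regex.findall(r"\X", word)))  # fewer grapheme clusters (user-perceived characters)
```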
13:37:19 Q: We've done surveys on why volunteers get involved. Our experience is they are motivated to do good. None of the t-shirts, ratings, etc matter - they want feedback from their peers and collaboration.
13:38:47 ... I think we are looking at the trees - big corporates - and not the people who are involved in non-commercial web activities, which is a huge amount of the web.
13:39:52 Denis: There are people around who know the brand, and want to contribute. But the large concern is whether things are sustainable. Based on our work on the Swahili wikipedia challenge, the answer is no.
13:40:10 ... We did lots of good in six weeks, but it took 5 years to get to that stage.
13:40:17 ... It is difficult and costly to work online.
13:40:28 ... the do-good approach wears off after a while.
13:41:55 Swaran: I think it is difficult to get community participation. There is an NGO behind most community work, and behind them is generally funding, e.g. a company.
13:42:13 ... In the end, somebody generally has to pay. Without money, not so much gets done.
13:43:14 ... A lot of languages are not yet connected to the web, and don't yet know the value, so a lot of awareness needs to be generated
13:43:46 Q: Concerned with the fact that you give google a worldwide license to content.
13:44:30 ... so Google is taking free license material from wikipedia and producing it through a process that puts it under google licensing terms.
13:45:10 ... A bit like taking America... you're swapping a few t-shirts for the culture of the world, with a bunch of kids who don't really understand the legal implications.
13:46:28 Denis: The spirit of Wikipedia is for people to contribute content to make it accessible. Use of the Google toolkit to enable this is within that spirit. It makes it easier for people to contribute.
13:46:33 ... in local languages.
13:47:18 ... Translator toolkit is human post-editing. Google machine translation includes making it available to the entire world.
13:47:31 [discussion about the terms and conditions, and who owns the content in the end]
13:48:34 Axel: Thanks for the congratulations about our Indian performance. Most of the real credit goes to the RedHat team in India.
13:49:16 ... How interesting would it be to compare facebook usage to African internet usage - which is probably more mobile.
13:49:41 [see http://www.opera.com/smw for some information on mobile web usage, with highlights on Africa every so often]
13:50:00 Ghassan: Good question, don't know the answer in numbers.
13:50:54 Denis: Internet stats people have started publishing facebook users as a common metric. Don't think it gives insight in regards to mobile questions.
13:52:43 Sven: Hi, thank you for your interesting speeches. I am Sven Noben, a sign language user.
13:52:43 ... I am founder of the signfuse company ( http://signfuse.com ) where we try to make digital information available in sign languages and provide complete user interfaces in sign languages as well. I think it is important to consider sign languages too when talking about the multilingual web. They are as natural as all the other languages brought to attention in this workshop, and are used as a first language by both deaf and hearing people,
13:52:43 ... amounting to over 100 million people worldwide. My questions thus focus on the language itself, rather than from an accessibility point of view.
13:52:44 ... Q1: May I know your opinion on this, and ask you why sign languages are often neglected when thinking about the multilingual web?
13:52:47 ... Q2: Do Facebook and Google see opportunities for sign language content / interfaces?
13:52:49 ... What, if any, are the barriers to taking this opportunity?
13:53:35 Ghassan: Translation is done by user groups. If you are interested as a community we can open it up - we have opened it up for Cherokee, which only has 1000 speakers. We are opening it to another native american group with 50 speakers.
13:54:10 Peter: In terms of localisation of content - it's in strings. We're talking about sign languages, which often have no written form at all.
13:54:23 ... where there are written forms, they are not encoded in unicode.
13:54:47 ... So question to Sven: what is the expectation of supporting signed languages in the multilingual web
13:55:20 ... If a signed language has a transcription system, those are potential candidates for unicode, but right now they are not there.
13:55:34 Support it like any other language, with support for video
13:56:40 Swaran: We have to explore, we are facing this in India too
13:57:11 Chaals: Agree with Sven, you need video (which is a barrier, but a vanishing one. You can get decent communication with fairly low bandwidth).
13:57:31 Q: What do you do about things like small European languages, such as Gallego.
13:58:14 Ghassan: For small languages we aren't really thinking about marketing. There isn't much business requirement to support a small language, but it is philosophically sensible
13:58:32 [CMN: I guess it also makes sense because it is technically not much of a stretch]
13:58:49 Denis: We support spanish official languages...
13:59:14 Ghassan: We do too. The basque community were motivated and provided us with a lot of material.
13:59:56 Q: Can you comment about standardising identifiers for African languages?
14:00:43 Denis: In my first six months at google I tried to prioritise languages to focus on. There are different classification schemes - some people say there are more languages, others say they are all the same.
14:01:23 [CMN thinks about the case of Serbian/Croatian/Serbo-croat/etc...]
14:02:21 Denis: We need to deal with the realities on the ground. There are varieties of Swahili, and at a practical level we assign locale information to clarify. The other issue is data on usage - that is absolutely lacking, which makes it very hard to make the right decisions.
14:02:55 ... Governments' first priority isn't to tell us how many people are literate in a given language - the tech community has to figure that out for us.
14:03:36 Axel: We get contributions in languages, and we don't know what the decision we are making really is...
14:04:34 [/me remembers Claudio Segovia's work on trying to collect the clearly different south american native languages that are all lumped together in official tagging schemes, although some of them have substantial populations of users]
14:05:13 Denis: there are real issues here with slang and languages that are actually used in practice. If they are common enough, we should learn about them. Should they be languages, or variants? It isn't clear.
14:06:02 Richard: Next session will be a video-cast from the US, so we have to be seated on time - there is no flexitime.
14:06:11 Chiara: thanks to speakers.
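Denis's point about assigning locale information to keep closely related varieties apart (and Mark's "anatomy of a language tag" in the next session) comes down to BCP 47 subtags: language, optional script, optional region. A deliberately naive sketch, illustration only, not a conformant parser; the example tags are invented:

```python
# Very naive BCP 47 splitter: language[-Script][-REGION] only, ignoring variants/extensions.

def parse_tag(tag):
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for p in parts[1:]:
        if len(p) == 4 and p.isalpha():
            result["script"] = p.title()          # e.g. Latn, Cyrl
        elif (len(p) == 2 and p.isalpha()) or (len(p) == 3 and p.isdigit()):
            result["region"] = p.upper()          # e.g. TZ, CD, 419
    return result

for tag in ("sw-TZ", "sw-CD", "sr-Latn-RS", "uz-Cyrl"):   # hypothetical examples
    print(tag, "->", parse_tag(tag))
```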
14:09:08 tadej has joined #mlw
14:35:18 chaals has joined #mlw
14:36:20 fsasaki has joined #mlw
14:37:31 Topic: Software for the world - Mark Davis
14:37:45 Richard: Mark has been president of the Unicode consortium since it began.
14:38:04 Mark: I cannot hear, so communicate by text.
14:38:14 ... I am going to talk about the latest developments in Unicode
14:39:36 [shows slide of unicode taking over, repeat from someone yesterday]
14:39:59 Mark: Huge growth in Unicode; ascii and latin-1 have plummeted.
14:40:23 Sven2 has joined #mlw
14:40:27 ... Unicode is now about 50%. But there are regional differences - in japan it is 40% but rising, in China 50%
14:40:32 rrsagent, draft minutes
14:40:32 I have made the request to generate http://www.w3.org/2010/10/27-mlw-minutes.html chaals
14:41:02 ... sample selection includes a lot of pages that people don't look at much.
14:41:34 ... Unicode 6 just came out. There are about 70 properties associated with characters in the database.
14:42:03 ... One of the biggest problems we have is when people hardcode for particular characters. Having properties lets programmers make language-independent code
14:42:55 ... There are characters like the rupee symbol, but also a lot of "emoji" - useful or funny symbols that are commonly used on mobiles in japan for messaging.
14:43:57 ... We can now (since May) have domain names that are *entirely* internationalised.
14:45:20 ... There are problems with deployment, because there are differences between the first version of international domain names (IDNA2003) and the new version. Browsers need to match user expectations too - and in the new version upper case doesn't match lower case, although people expect it to do so since it did in the past.
14:45:37 ... There are still a lot of old browsers out there, which don't deal with modern standards.
14:46:02 ... There are also issues with characters that are used, but not permitted by the standard.
14:46:32 ... UTS46 - between Unicode and browser makers, we created a standard to figure out how to do stuff in practice.
14:47:20 ... CLDR == Common Locale Data Repository, a dataset maintained by the Unicode consortium to help programmers be locale-independent
14:48:29 ... CLDR is very widely used, so getting improvements in the data there means improvements in lots of software people use
14:48:53 ... Products generally translate from the XML format we produce to something optimised for the product.
14:49:53 ... Anatomy of a language tag - which can be very big, so people have to allow for that.
14:52:56 ... The script tag is important for some languages (chinese, uzbek, ...) and there are other things that might be useful information.
14:54:53 ... French has, for example, common names for days of the week, or months, so all french locales would use them. But for e.g. fr_CA everything that applies to french applies, but it adds different currency information...
14:55:37 ... This is what happens in the locale source information we publish. An implementation might choose to rewrite that, e.g. Posix explicitly fills out all the information for each locale, instead of relying on inheritance.
14:56:34 ... There are different sets of exemplar characters. The main set is likely to be used for things like automatic language detection, but there are other letters used in practice.
14:57:11 ... The index head letters are for things like defining alphabetical order, to use in a contacts address book or for sorting. These will be in the new version of CLDR for the first time.
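Mark's IDNA2003-vs-IDNA2008 deployment point can be seen directly in Python: the standard library's "idna" codec implements the older IDNA2003-style mapping (which case-folds labels), while the third-party `idna` package implements IDNA2008 (which does not). A small sketch; the domain is a made-up example:

```python
# IDNA2003-style vs IDNA2008 handling of an upper-case label.
label = "Bücher.example"

# Standard-library codec: nameprep case-folds, so the upper-case B is accepted.
print(label.encode("idna"))    # b'xn--bcher-kva.example'

# Third-party "idna" package (assumed installed): IDNA2008 rules.
# import idna
# idna.encode(label)           # raises IDNAError - IDNA2008 has no case mapping,
#                              # matching the "upper case doesn't match lower case" point above
```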
15:00:27 Mark: Flexible formats allow for different use cases, like presenting things on mobile phones in many different combinations
15:01:11 ... Time zone formatting is quite complicated, and people use different things. So CLDR provides various different ones
15:01:45 ... Unit formatting is interesting because different languages have different approaches to number - czech needs 3 different forms (1, 2-4 or 5+)
15:02:24 ... for a language like Arabic there are 6 different forms we need.
15:02:54 ... Currencies have different things they use.
15:03:49 Mark: List formats differ. Identifying letters and word spaces is tricky, especially in languages that don't use spaces.
15:04:15 [when I first learned spanish, "ll" was considered a single letter, and german used ß. That has changed]
15:04:56 Mark: Transliteration is important to translate e.g. place names into different scripts so people can use them.
15:05:39 ... In CLDR 9 you can do things like say 'please sort cyrillic characters before latin (or after)'...
15:06:53 ... place names are also noted, because they are different in different languages
15:07:26 Q: How does Unicode deal with sign languages?
15:07:58 Mark: If there are standards for the way sign languages are represented in symbols, we would welcome that. We are engaged in a process of doing a lot more work for symbols, too.
15:08:29 ... Sorry, I have to cut off now. Thanks for your attention
15:08:54 Richard: Time for the survivors' party (which we didn't announce so that people who left didn't get invited :) )
15:09:32 ... we have 20 minutes, and I am going to wrap up to use the last ten, so there are ten minutes for questions.
15:10:53 Q: Thanks, it has been great. Where do you see W3C action in this area going? Has anything changed about that in the last 2 days?
15:11:25 Richard: We have some more workshops coming up. I haven't had time to gather my thoughts, so I don't have an answer yet. Sorry.
15:11:47 Q: Will all presentations be available?
15:13:04 A: yes.
15:13:47 Chaals: Anyone want to comment on the internationalisation issues for dead languages that are historically important?
15:14:23 Peter: There are a number of forms already in Unicode for historical reasons - Sumerian Cuneiform, characters used in manuscripts that have since fallen out of use.
15:14:40 Swaran: Grantha? (south indian language) is an example
15:15:13 Peter: Other issues - input methods are needed. Do we need locale information to identify different hieroglyphic sets?
15:15:43 Comment: Looking at historical texts, there are transformation rules that change over time.
15:16:01 ... maybe it is necessary to add something to standards to allow for this?
15:16:32 Peter: Unicode is a standard for representation of characters. Text representation is the primary goal. Details of presentation are left to fonts...
15:17:36 ... although there are scenarios where it is important to indicate some stuff. Unicode can provide some generic mechanism, e.g. the zero-width joiner, to indicate that a ligature is *requested*. Is that something that is needed, or is it better to have some markup at a document level (like font...)
15:19:22 ... There are also things like variant selectors to identify which version of a character should be used in a particular place name.
15:20:11 Richard: I hope you enjoyed the workshop. We were excited and nervous about the diversity and mixing people who don't normally meet each other.
15:20:30 Sven2 has joined #mlw
15:20:31 ... Please fill in the feedback forms.
15:21:15 ... and leave them on the table.
15:21:39 ... A huge vote of thanks to the Universidad Politecnica de Madrid, and to the sponsors of the workshop.
15:23:01 ... Especially thanks to Luis and Encarna for making this real. Thanks to the speakers, for giving great talks and finishing on time.
15:23:33 ... thanks to the scribes. Next workshop we will start by introducing IRC so you can follow what happens. The logs will be on the web.
15:23:59 ... Slides will also be up. Links from http://www.multilingualweb.eu
15:24:16 Sven2 has joined #mlw
15:24:20 ... (speakers please send slides if you haven't already). We also have video of most speakers.
15:24:30 rrsagent, draft minutes
15:24:30 I have made the request to generate http://www.w3.org/2010/10/27-mlw-minutes.html chaals
15:25:11 ... We've done the first workshop. We need to think more about how to share the information we are gathering. We have a twitter stream - @multilingweb - and a facebook page, and other things can be helpful.
15:25:25 ... We have a mailing list people can join, and use it to discuss.
15:26:07 [repeated ads for http://www.multilingualweb.eu/]
15:26:24 ... Thanks for coming. Next workshop will be in Pisa, probably mid march.
15:26:30 ... Have a good trip home.
15:26:58 rrsagent, draft minutes
15:26:58 I have made the request to generate http://www.w3.org/2010/10/27-mlw-minutes.html chaals
15:33:13 glazou has joined #mlw