See also: IRC log
This is the raw scribe log for the sessions on day two of the MultilingualWeb workshop in Madrid. The log has not undergone careful post-editing and may contain errors or omissions. It should be read with that in mind. It constitutes the best efforts of the scribes to capture the gist of the talks and discussions that followed, in real time. IRC was used not only to capture notes on the talks, but can be followed in real time by remote participants, or participants with accessibility problems. People following IRC can also add contributions to the flow of text themselves.
See also the log for the first day.
<fsasaki> scribe: various
<joerg> start scribing the machines session - intro by Dan
<joerg> first speaker is felix - talking about LT and in particular interoperability of technologies
<joerg> introduces applications concerning summarization, MT and text mining and shows what is needed in terms of resources
<joerg> identifies different types of language resources
<joerg> distinguishes between linguistic approaches and statistical approaches
<joerg> machines need 3 types of data: input, resources and workflow
<joerg> show the types of gaps that exist in this scenario: metadata, process, purpose
<joerg> these gaps are exemplified with an MT application
<joerg> purpose specifically concerns the identification of metadata, process flows and the employed resources
<joerg> the identification must be facilitated across applications with a common understanding
<joerg> therefore different communities have to join in and share the information that has to be employed in the descriptive part of the identification
<joerg> a solution that can provide a machine-readable information foundation is provided by the semantic web
<joerg> a microformat or RDFa example gives some insight into how the semantic web contributes to closing the introduced gaps
<joerg> the point to remember is that RDF is a means to provide a common information space for machines
<joerg> the talk closes with some ideas on joint projects
<joerg> ... and specifically on how META-NET is already working in this direction
<joerg> discussion points: language description frameworks and the complexity of RDF for browser developers
<joerg> she extends the notion of language resources to also include tools
<joerg> a new paradigm is needed in the LR world to accommodate the continuous evolution of the multilingual web
<joerg> including the satisfaction of new demands and requirements which shall account for the dynamic needs
<joerg> right now a web of LRs is built up and is driven by standardization efforts
<joerg> for the further evolution distributed services are also needed including effective collaborations between the different communities
<joerg> a very important and sensitive issue concerns politics and policies to support these changes
<joerg> several EU funded projects have taken up this new R&D direction, and national initiatives are joining in to build stronger communities
<joerg> critical is to ensure a web-based access together with a global vision and cooperation
<joerg> examples are projects such as CLARIN, FLaReNet and META-NET
<joerg> interoperability between resources and tools is key for the overall success as well as more openness through sharing efforts
<joerg> question: there are many infrastructures; what about interoperability right now? META-NET should/must solve this issue
<joerg> third speaker is Thierry talking about "lemon", an ontology-lexicon model for LRs
<joerg> lemon is part of the EU funded project Monnet and collaborates with standardization bodies
<joerg> lemon contributes to the multilingual web by providing linguistically enriched data and information
<joerg> the industrial use case in Monnet is the financial world, in particular the field of reporting
<joerg> standards used from the industrial side are XBRL, IFRS and GAAP
<joerg> the encodings in these standards are related to semantic web standards to build a bridge between financial data and linguistic data
<joerg> the approach is exemplified by an online term translation application
<joerg> talk closes with an architectural overview of the Monnet components
<joerg> standards used on the language side are among others LMF, LIR and SKOS
<joerg> strong link to META-NET will be established soon...
<joerg> next talk by Jose Gonzales of DAEDALUS with similar subjects but with a strong market view
<joerg> LRs of DAEDALUS date back to the 1990 when no resources for Spanish were available
<joerg> initial focus was on spell checking
<joerg> which was needed mainly by the media market
<joerg> these developments had an important influence on all future developments such as search and retrieval, and even ontology development
<joerg> multilingual developments followed and were based on the continuous experience in the field
<joerg> and is exemplified by a multilingual information extraction application
<joerg> followed by an example that integrates speech (DALI)
<joerg> current applications include sentiment analysis and opinion mining (in Japanese)
<joerg> an EU funded project (favius) takes into account user-generated content and machine translation
<joerg> some online examples close the talk with an outlook on linking ontologies with lexical tools
<joerg> question: islands of LRs - could they be made available to the public? There are ISO initiatives underway to support this direction also in terms of structuring of the resources.
<joerg> An open issue is still the representation format
<fsasaki> jörg: what is business intelligence?
<fsasaki> .. traditional BI is very complex
<fsasaki> .. four main steps: requirements, app design, development, delivery - outcome of data analysis
<fsasaki> scribe: fsasaki
Jörg: many people
involved, too slow, too expensive
... emerging BI: new dynamic data resources, new online
algorithms, new paradigms like agility
... relation to mlw: mainly browser based interfaces in
emerging BI
<scribe> .. new model allows for more iterations, is more cost effective
Jörg: there is a need
for multilingual BI
... interoperability between BI applications and language
tools, resources, etc.
... that is, two different ecosystems need to be linked
... normally they use their own standards
... e.g. in BI, there is XMLA, BPMN, UML, Six Sigma,
Unicode
... in mlw, you have ITS, XLIFF, MLF, TMX, TBX, SRX, GMX,
Unicode
... in between there are protocols for communication
... shared serialization (XML)
... coupling should be done in a round-trip version
... how do communities interact right now?
... currently trust only in your own standard, fear of more
complexity
... lack of reference implementations
... in summary - missing: a common mindset for change, exchange
between communities, joint reference implementations (e.g.
supported by funding), self-adapting and self-learning
technologies
... join the "interoperability" discussion at http://interoperability-now.org
piek: presenting kyoto
project
... text mining across different languages
... the text mining platform we use extracts very generic relations in
text
... which you can tune to your needs
... provocative statement: why translate text if you can get
knowledge out of it in a language neutral form?
... evolution of the Web: from 1.0, 2.0 (social Web), 3.0
... in 3.0, if a machine can "understand", they can build
applications relying on the Web
... question is: how can we connect the different versions of
the Web?
... we need: interoperable representation of the structure of
language
... representation of formal conceptual knowledge
... kyoto project: for each language a linguistic
processor
... a series of programs with basic processing (finding tokens,
words, important structures, main verb / topic of a
sentence)
... output is a uniform annotation format (kyoto format)
... that is uniform across all languages
<scribe> .. new languages can easily be plugged in, since the basic processing can easily be developed
piek: based on the
uniform format, we can do word sense disambiguation, named
entity recognition
... disambiguation relying on wordnet(s)
... all resources are linked to each other and on a general
ontology
... that is a kind of attempt to create the global wordnet
grid
... vocabularies are represented as wordnet LMF
representation
... they map to central ontologies like SUMO, DOLCE and a DOLCE
extension developed in the project
... the point is: WSD, NER etc. are used with the same program
for all languages
... since they are working on the kyoto annotation format
... fact mining also works on that format, and creates
RDF
... after that you need a "language renderer" which creates
output for humans
... about kyoto annotation format: based on layered annotation
format
... several layers of annotations. Each layer points to a
different layer
... with the data in RDF, we can create e.g. a semantic search
application
... across languages
... project and linked open data cloud: lexical resources in
many languages
... summary: do not focus only on machine translation, but also
on conceptual anchoring of the meaning in a shared
representation
... also important: what do we do with the output of
this?
... currently we have 4000 documents and generate 4 gigabytes
of triples
... not trivial how to represent this to machines and
users
... for humans we need to have "language renderers" for
potential users
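[editor's sketch] The layered annotation idea described in the talk (several layers, each pointing to a different layer) can be illustrated roughly as follows. This is not the actual kyoto annotation format: the class names, field names and the "example:PolarBear" identifier are all hypothetical, chosen only to show how a concept layer can point to a term layer, which in turn points to a token layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Lowest layer: surface tokens of the text."""
    id: str
    text: str


@dataclass(frozen=True)
class Term:
    """Middle layer: points to one or more tokens."""
    id: str
    token_ids: tuple
    lemma: str


@dataclass(frozen=True)
class Concept:
    """Top layer: points to a term and to a shared, language-neutral id."""
    id: str
    term_id: str
    reference: str  # hypothetical wordnet/ontology identifier


def surface_text(concept, terms, tokens):
    """Follow the layer pointers from a concept back to the original words."""
    term = terms[concept.term_id]
    return " ".join(tokens[t].text for t in term.token_ids)


tokens = {t.id: t for t in [Token("w1", "polar"), Token("w2", "bears")]}
terms = {"t1": Term("t1", ("w1", "w2"), "polar bear")}
concept = Concept("c1", "t1", "example:PolarBear")

print(surface_text(concept, terms, tokens))  # polar bears
```

Because only the token and term layers depend on the language, later steps that read the concept layer can stay the same for every language, which is the point made about WSD and NER above.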
scribe missed question
piek: you can accumulate information from many languages
pedro: so translators are still necessary?
piek: he can make use of that information
christian: talk is about
work by many people in this room, like Richard Ishida, Jirka
Kosek, Felix Sasaki
... ITS in your source content can help you with many things
which e.g. Piek described
... example: imagine that you receive this document to
translate it, spell check it etc.
(slide contains document in many languages)
scribe: questions are: what
language is the document in, what are defined terms, footnotes
etc.?
... ITS lets you specify information like that in terms of
"data categories" - about Translatability, Localization Note,
Terminology, Directionality, Ruby, Language information,
Elements within Text
... with ITS, we can express annotations in a CSS like
manner
... with a local approach (like CSS "class" or "style"
attributes)
... and a global approach, using XPath expressions
... usage scenarios for ITS: already there are libraries to
extract translatable text or other things
... additional scenario about conversion, see a roundtripping
for ITS > XLIFF > XML source http://fabday.fh-potsdam.de/~sasaki/its/
... with ITS annotated documents, you can have a processing
chain to get XLIFF
... but you can also imagine to have in a user agent an "I18N /
L10N preprocessor"
... assume a user says "I want this in my local language"
... a processor could create an XLIFF version
... that is sent to machine translation / translation memory
etc.
... the receiving application knows what to do, if the workflow
is clear
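[editor's sketch] The local and global approaches mentioned in the talk, and the "extract translatable text" scenario, can be illustrated with a small example. It assumes the ITS 1.0 namespace and the Translate data category; the sample document and the helper functions are made up for illustration, and the code is a simplified reading of ITS, not a conforming ITS processor (for instance, Python's ElementTree supports only a subset of XPath, so the "//code" selector is rewritten as a relative path).

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"  # ITS 1.0 namespace

# Hypothetical sample document: one global rule, one local attribute.
DOC = f"""
<doc xmlns:its="{ITS_NS}">
  <its:rules version="1.0">
    <its:translateRule selector="//code" translate="no"/>
  </its:rules>
  <p>Press <code>Ctrl-C</code> to copy.</p>
  <p its:translate="no">MultilingualWeb</p>
  <p>Thank you.</p>
</doc>
"""


def untranslatable(root):
    """Collect elements marked translate="no", globally or locally."""
    marked = set()
    # Global approach: a rule selects nodes through a selector.
    for rule in root.iter(f"{{{ITS_NS}}}translateRule"):
        if rule.get("translate") == "no":
            # ElementTree needs a relative path, so "//x" becomes ".//x".
            marked.update(root.findall("." + rule.get("selector")))
    # Local approach: an its:translate attribute on the element itself.
    for el in root.iter():
        if el.get(f"{{{ITS_NS}}}translate") == "no":
            marked.add(el)
    return marked


def translatable_text(root):
    """Return text chunks that are not excluded from translation."""
    skip = untranslatable(root)
    out = []

    def walk(el, blocked):
        # ITS markup itself and anything inside a marked element is skipped.
        blocked = blocked or el in skip or el.tag.startswith(f"{{{ITS_NS}}}")
        if not blocked and el.text and el.text.strip():
            out.append(el.text.strip())
        for child in el:
            walk(child, blocked)
            # Tail text belongs to the parent, so the parent's state applies.
            if not blocked and child.tail and child.tail.strip():
                out.append(child.tail.strip())

    walk(root, False)
    return out


root = ET.fromstring(DOC)
print(translatable_text(root))  # ['Press', 'to copy.', 'Thank you.']
```

A real processing chain would pass these chunks on to something like an XLIFF generator; here they are only printed.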
david: ITS - for using the data categories in RDF, is there a suggested mapping?
felix: no, but that could be a part of a project to develop that
swaran: to Piek - would your application work for Indian languages?
piek: every language can work with the system, it is language independent. The preprocessing is pretty easily developed for many languages
xyz: why do you need to provide the
information that a certain element name is in a certain
language
... if I throw it in an English browser, it can't do anything
with it
richard: the names are tokens -
your system would need to understand what they "mean"
... you don't have the situation that you put it into a web
browser and it has to do something with it
peter: you need the information
at the top of the file saying "here is the schema"
... a developer could decide to create variable names in
cyrillic
... but the keywords in e.g. C programming language need to be
just the keywords
... I can write e.g. HTML in any language, but it is still
HTML
jirka: if you need to know information about mapping, it can be made available by the "renaming" language standard
piek: we have a way to match
language specific concepts to a general level
... conceptualization of things using language, that we can
handle
xyz: you could combine both projects (ITS and kyoto) so that translation of such words is not done
richard: piek - you have a way of dealing with misalignments/gaps of terminology for a given concept in ontologies, is it a standard way?
piek: we are using a wordnet we
are developing in the project, an extension of LMF
... we have formal mappings of concepts into relations
... we now have to find out how to combine the mapping
relations
... lexInfo is another standardization proposal which needs to
be aligned
christian: a question to people
who talk about natural language processing
... all the things we heard were about analysis
... is there research where you ask people to start from the
concepts
... in a language neutral manner
... so, do not start writing text, but start writing
concepts?
piek: once the ontologies become
very big, they are not big enough to express all concepts you
find in a language
... it makes more sense to approach lexicon and ontology
development at the same time
thierry: extracting ontologies
from texts is one approach
... but sometimes you have experts developing domain
ontologies
... example of radiology - domain experts did not make any
commitments about matching concepts and terms
... that would be the other way around, from ontologies to
text
... this morning we talked not so much about translation
... but there is a need for a (human) translator, for the
generation phase
pedro: NLP which works more about generation is machine translation in rule based approach
thierry: the monnet project
relies on the language independent format
... for some languages it is hard to get the right expression,
see piek's example
... there are three projects working on this
nicoletta: both for lexica and
ontologies, we are losing information
... language is more complex than lexica and ontologies
... we need to add another element - the real text
... that gives the real complexity
discussion about terminology translation
thierry: TBX (terminology standard) is encoding the information
abc: what do you link to the linked data cloud? the different wordnets, or the domain (general) description?
piek: both
end of session
<chaals> scribe: chaals
Ghassan: We have a monthly hackathon on facebook. An engineer did a video to show some things that take place in real time on facebook.
[plays video]
(looks to me like Opera globe - showing users doing things in different places to classical music)
Ghassan: I like the one that
shows interactions between different places, which shows how
global things are
... This is not about crowd-sourcing, it's about localising the
social web.
... took me close to a year before I started thinking
consciously about what is unique about what we are doing
... why is social media so different.
... Three years ago when I went to interview at facebook I
talked to engineers for a while. Then in comes a guy my
daughter's age... the CTO.
... He said "I don't want l10n or i18n to slow down product
development. is this possible?"
... I got the job... :) So I had to start thinking carefully
about my 20+ years in the field.
... I didn't want to go in and start implementing things the
way I had always done. I wanted to think about a new start and
different ways to do things.
... people talked about crowd sourcing, so I thought "why
not".
... I won't talk about the movie or the patent submission for
crowd-sourcing.
... The company was founded in 2004, went from being Harvard
only to academic institutions in general and in 2007 opened to
the public and became a platform for other developers to create
applications.
... we launched crowd-sourcing and opened translation into
spanish.
... At that time we notified only 20000 people, that it could
be translated.
... 2008 launched FBConnect. July we opened translation to any
application, 2009 to any site that uses fbconnect. (A million+
sites)
[mission statement]
Ghassan: How well is world
represented?
... July we announced 500 million users.
[shows graph. shortly after translation there is a clear change in the growth rate]
Ghassan: each time we translate
into a language, there is a huge growth in users. more than 90%
of users do things in their local language.
... gone from 75% US to 25% US.
... About 50% use a translated UI (plus 5% who use en-GB)
... 500k people have contributed to translation
[Hmm. about 1 in 1000]
Ghassan: We still have a lot of
stuff not translated.
... we have priorities for what needs to be translated.
... Challenge: extending translation and capability for l10n in
facebook.
... Who are users? They are supposed to be 14+. We know there
are a lot of 7-year-olds who pretend to be 14 and become
users.
... So you can't make assumptions about the demographic. You
have to allow people to select their language in a simple
way.
... Works well for us.
... On main pages we provide a handful of common languages
rather than the full list, to make it easier to select likely
choices.
... Within a week of allowing people to choose their language
easily, we tripled adoption of localisation.
... If you use e.g. English, in e.g. France, when we release
french/basque/etc translations we announce it.
... We moved to more aggressive switching - we almost force you
to switch if you're in a region and not using a local language.
After that, switching increased by about 500%
... There are about 8000 strings in the site for
notifications.
... Problems like the fact that some languages (arabic,
russian, polish, etc) have more than just singular and plural
(arabic has dual, russian has sing, 2-4, pl. ...)
... Imagine doing complicated strings and localising them. it
looks ugly...
[Examples of constructing phrases using automated rules]
scribe: We created 'dynamic
string explosion'.
... We have code that fetches information and can select
different translations.
... Microsoft for a long time had closed ecosystems for
translation, but it was important. slowly they opened up. We
wanted to keep it open.
... In our community we have a rating system - you get rated up
and down by the community, until you get to the level where
your translations are approved automatically.
... Even then, you might have good translations, but that don't
fit in context.
... Of 75 languages, about 20 are done professionally and about
50 are community only.
... We have done a lot of surveys of quality. Result, no
difference between professional and community
translation.
... Quality is based on what users want - not what the
marketing director (EMEA, or wherever) thinks, until we change
marketing director.
... Is this the perfect solution? No, but it's pretty good.
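[editor's sketch] The plural problem raised in this talk (languages with more than singular and plural, e.g. Russian with singular, 2-4 and plural forms) is why a single translated string per message is not enough. The following is a minimal illustration of selecting a translation variant by number. It is not Facebook's "dynamic string explosion" code; the function names and the template table are hypothetical, and the rule is a simplified integer-only version of the Russian plural categories.

```python
def plural_category_ru(n: int) -> str:
    """Simplified plural category for Russian (non-negative integers only)."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"   # 1, 21, 31, ... but not 11
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"   # 2-4, 22-24, ... but not 12-14
    return "many"      # everything else, including 0 and 5-20


# Hypothetical table: one translation variant per plural category.
COMMENT_TEMPLATES_RU = {
    "one": "{n} комментарий",
    "few": "{n} комментария",
    "many": "{n} комментариев",
}


def render_comment_count(n: int) -> str:
    """Pick the variant that matches the number and fill it in."""
    return COMMENT_TEMPLATES_RU[plural_category_ru(n)].format(n=n)


for n in (1, 3, 11, 21):
    print(render_comment_count(n))
```

Translators then supply one variant per category instead of a single string, and the code picks the right one at display time.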
Ghassan: a quick view of
translation app.
... An engineer turned a switch, and by the time I woke up, it
was 75% translated. By the end of the day it was done. A lot
of rubbish, but it was all in french.
[screenshots of the translation interface]
(Hmm. Looks a lot like Opera's interface. I guess there aren't so many innovative UI solutions in this space)
Ghassan: You can translate inline, or in bulk.
[reads slides]
Ghassan: Got a question - have we
seen highly rated translations that are rubbish? Yes,
lots.
... We don't pay anyone to translate, but we have a leader
board for community ratings.
... Once a translation has been approved, it is no longer
available for voting or rating.
... This is really simple. And yet I have seen applications use
it and end up with total rubbish.
Denis: will try to represent
community perspective with regard to user generated
content.
... I manage translation to african languages, and am based in
Nairobi
... In particular, Sub-Saharan Africa - North Africa works with
the Middle East team.
... On current count Africa is 5% of Internet usage, but 14% of
population (note the error margin for dividing Africa).
... We used to use the number 2% until recently. So usage is
growing fast.
... Barriers include prices, and in particular cost of
bandwidth getting to the continent in the first place.
... Google is interested in content, and there is an issue of
having relevant content.
... We won't solve all problems alone, we are working with
other stakeholders in the area.
... It is key that the colonial languages are used for
education. South Africa took a big stand on local languages,
but implementation is still patchy.
... Most exciting is the opportunity to bootstrap written form
of languages, because now people can afford to do it e.g. by
updating their status.
... The problem of content: (Showing Wikipedia info, because we
don't share google information)
... A snapshot of articles - Amharic and Swahili (top written
languages in Africa) compared to arabic, russian, chinese,
english
... Tipping point in growth of wikipedia vs growth of users:
there is a tipping point in growth of wikipedia about end of
2004. (As a proxy figure to suggest growth of user-generated
content.)
... We use a feedback loop of human translation to improve
machine translation.
... We started by translating in silos. Community translation
is an attempt to produce high-quality translation for the top
100 languages, working with University students. We provide a
party, they do the work for us.
... Disambiguating in similar languages is very difficult.
... We prioritise against internet penetration, whether
something is a trendy language, whether there is content
available in the language.
... Never underestimate the value of certificates and t-shirts
in getting people to do work.
[shows growth in localised search in Africa. Growing faster than total search growth...]
Denis: Wikipedia is source of locally relevant information. So it can seed translation, and that can be used to seed more content.
[time is up. Stop talking!]
Denis: There is a lack of data to
start from, which is needed to bootstrap machine
translation.
... How do we standardise stuff for highly related
languages?
[Picture: people we worked with at Universities in South Africa and Senegal]
http://sidar.org -> Fundación Sidar
Loic: From today's talk, looks
like everything is almost solved ;) But there are a couple more
things to think about.
... People with disabilities / functional diversity have the
same rights everyone else does. This is important to the web
because the web is more and more important to us.
... This has all been approved by various countries and signed
as treaties. But it seems countries have forgotten that they
are agreeing to *do* things. Maybe we should remind them.
... who knows Web Content Accessibility Guidelines?
[about 1/3 ?]
Loic: Accessibility is not
one-one mapped to disability. It's important for people in
various situations.
... People with old technology, or new advanced technology that
doesn't work well with things common on the web, as well.
... Sidar is Seminario IberoAmericano for Disability and
Accessibility en la Red (on the Net)
... So we work in spanish in latin america as well as spain.
And also in portuguese.
... Various WCAG guidelines specifically touch the field of
localisation.
... Not so much talk today about non-text content. But we need
alternatives, whether text, signed, adaptive to multiple
devices, etc.
... for non-text content. And these things have to be localised
(including sign languages)
... In Spain we have only two different sign languages, in South
America there are a handful.
... Important point, people who learn sign languages often have
difficulty reading
[sign languages are grammatically and structurally different from most written languages - about as much in common as chinese and russian]
scribe: There are also issues with sound and how well people can hear it to distinguish the words etc.
[Mapping quick tips on accessibility to quick tips for internationalisation]
Loic: There are some gaps, like localising text alternatives, that aren't in the daily vocabulary of localisers.
<r12a> http://www.w3.org/International/quicktips/
Loic: We shouldn't need a business case for accessibility. In the words of William Loughborough "Accessibility is a right not a privilege". On the other hand in reality we do need to show business cases to get industry to do accessibility, just like for localisation.
<crisvaldes> people in audiovisual translation have worked on accessibility and we have started talking about the web
Swaran: From government of India.
We have a big issue - there are 22 official languages. A small
group in a department in government is trying to do all of
this.
... Internet is increasing, but penetration about 8%, almost
all in cities and in english.
... Wireless devices are increasing too. So is usage of
e-government, between government groups, Gov-business,
gov-citizen
[screenshots of state government sites in local languages and in english]
Swaran: There is a mixed status
and mixed levels of service available
... some things are done by federal government, others by state
governments.
... Unique ID project to number each Indian person has brought
Unisys into collaboration with government.
... Will require multilingual systems, that can be used by
everyone (so accessibility and usability in multiple languages
are also important)
... So far we are focusing only on the 22 constitutionally
recognised languages (of the 122 languages and ~2400
dialects).
... Hindi, the "national language" isn't spoken all over India.
Nor is anything else.
... States can choose their languages, and some states have
multiple languages.
... Whenever we do anything, we have to deal with a large
complexity in languages.
... We started funding projects in 1991, and we have learned
that we need to deal with consortia, including people from
different regions to ensure we cover different languages. We're
going to start dealing with machine translations among indian
languages, and will make it available over the next year or
so.
... We need these to enable people in different languages to
access content.
... Hindi-Punjabi example: High quality, because they are close
relatives.
... We need to be able to do multilingual searching, and
presenting results through translation.
... At least to allow a rough idea if a site is worth
translating seriously.
... We are working on Optical Character Recognition of Indian
languages.
... We are converting content to digital unicode-encoded
formats.
... Speech systems are under development and will start to
become available soon in a few languages based on [open-source]
festival text-to-speech system.
... Next step is to work on phonetic engine.
... In May we opened W3C Office, and are going to move this
work forward. We think the language work has to happen in order
to build a platform where business can see the business
case.
... On W3C side we are looking at issues like encoding, input,
display, accessibility, mobile web, etc.
... Proud to say we have all the constitutional languages
completely encoded in Unicode and it is an official standard in
the country.
... There are various issues of how things actually work in
browsers.
... Unicode grapheme clustering might not be covered for all
indian languages. We need to get the language experts to look
at this and check that they are correct. We are in the process
of preparing styling manuals to show what needs to be done.
Then the browser developers need to come forward and help us -
we are working with mozilla and Opera now...
... There are many rendering issues, some from the OS and some
from the browser. We need reference implementations to help
show what browsers need to do.
... Working on some standards for speech output, national
standards to incorporate accessibility standards.
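[editor's sketch] The grapheme clustering concern raised above can be made concrete with a small example: what a reader would treat as one written syllable in Devanagari is stored as several Unicode code points, so code that counts or cuts by code point can split it. The sample string is chosen by the editor for illustration; Python's standard library does not segment grapheme clusters, so this only shows the code points and what naive slicing does.

```python
import unicodedata

# One written syllable ("kshi"), stored as four code points.
syllable = "\u0915\u094d\u0937\u093f"

# List each code point with its Unicode name.
code_points = [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in syllable]
for line in code_points:
    print(line)

# Python counts code points, not what a reader perceives as one unit.
print(len(syllable))

# Naive slicing cuts the syllable after the virama, leaving a fragment.
fragment = syllable[:2]
print([unicodedata.name(ch) for ch in fragment])
```

Whether a browser or editor keeps such a sequence together when selecting, deleting or breaking lines depends on the segmentation rules it implements, which is where review by language experts comes in.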
[time up]
scribe: no standardisation in mobile input methods, problems for other things in mobile
[question marks in multiple indian languages/scripts]
<sven_noben> Hi, thank you for your interesting speeches. I am Sven Noben and sign language user.
Q: We've done surveys on why volunteers get involved. Our experience is they are motivated to do good. None of the t-shirts, ratings, etc matter, they want feedback from their peers and collaboration.
scribe: I think we are looking at the trees - big corporates - and not the people who are involved in non-commercial web activities, which is a huge amount of the web.
Denis: There are people around
who know the brand, and want to contribute. But the large
concern is whether things are sustainable. Based on our work on
Swahili wikipedia challenge, the answer is no.
... We did lots of good in six weeks, but it took 5 years to
get to that stage.
... It is difficult and costly to work online.
... the do-good approach wears off after a while.
Swaran: I think it is difficult
to get community participation. There is an NGO behind most
community work, and behind them is generally funding, e.g. a
company.
... In the end, somebody generally has to pay. Without money,
not so much gets done.
... A lot of languages are not yet connected to the web, and
don't yet know the value, so a lot of awareness needs to be
generated
Q: Concerned with the fact that you give google worldwide license to content.
scribe: so Google is taking free
license material from wikipedia and producing it through a
process that puts it under google licensing terms.
... A bit like taking America... you're swapping a few t-shirts
for the culture of the world, with a bunch of kids who don't
really understand the legal implications.
Denis: The spirit of Wikipedia is
for people to contribute content to make it accessible. Use of
Google toolkit to enable this is within that spirit. It makes
it easier for people to contribute.
... in local languages.
... Translator toolkit is human post-editing. Google machine
translation includes making it available to the entire
world.
[discussion about the terms and conditions, and who owns the content in the end]
Axel: Thanks for congratulations
about our Indian performance. Most of the real credit goes to
the RedHat team in India.
... How interesting would it be to compare facebook usage
to African internet usage - which is probably more mobile.
[see http://www.opera.com/smw for some information on mobile web usage, with highlights on Africa every so often]
Ghassan: Good question, don't know the answer in numbers.
Denis: Internet stats people have started publishing facebook users as a common metric. Don't think it gives insight in regards to mobile questions.
Sven: Hi, thank you for your
interesting speeches. I am Sven Noben and sign language
user.
... I am founder of the signfuse company ( http://signfuse.com ) where we do try
to make digital information available in sign languages and
provide complete user interfaces in sign languages as well. I
think it is important to consider sign languages too when
talking about the multilingual web. They are as natural as all
the other languages brought to attention in this workshop, and
are used as a first language by both deaf and hearing
people,
... amounting to over 100 million people worldwide. My questions thus focus on the language itself, rather than on an accessibility point of view.
scribe: Q1: May I know your
opinion on this, and ask you why sign languages are often
neglected when thinking about the multilingual web?
... Q2: Do Facebook and Google see opportunities for sign
language content / interface?
... What are the possible barriers to taking this opportunity?
Ghassan: Translation is done by user groups. If you are interested as a community we can open it up - we have opened up for Cherokee, which only has 1000 speakers. We are opening it to another native american group with 50 speakers.
Peter: In terms of localisation
of content. It's in strings. We're talking about sign
languages, which often have no written form at all.
... where there are, they are not encoded in unicode.
... So question to Sven: what is the expectation of supporting
signed languages in the multilingual web?
... If a signed language has a transcription system, those are
potential candidates for unicode, but right now they are not
there.
<sven_noben> Support it like any other language. With support of video
Swaran: We have to explore, we are facing this in India too
Chaals: Agree with Sven, you need video (which is a barrier, but a vanishing one. You can get decent communication with fairly low bandwidth).
Q: What do you do about things like small European languages, such as Gallego?
Ghassan: For small languages we aren't really thinking about marketing. There isn't much business requirement to support a small language, but it is philosophically sensible.
[CMN: I guess it also makes sense because it is technically not much of a stretch]
Denis: We support Spanish official languages...
Ghassan: We do too. The Basque community were motivated and provided us with a lot of material.
Q: Can you comment about standardising identifiers for African languages?
Denis: In my first six months at Google I tried to prioritise languages to focus on. There are different classification schemes - some people say there are more languages, others say they are all the same.
[CMN thinks about the case of Serbian/Croatian/Serbo-croat/etc...]
Denis: We need to deal with the
realities on the ground. There are varieties of Swahili, and at
a practical level we assign locale information to clarify. The
other issue is data on usage - that is absolutely lacking,
which makes it very hard to make the right decisions.
... Governments' first priority isn't to tell us how many
people are literate in a given language, the tech community has
to figure that out for us.
Axel: We get contributions in languages, and we don't know what the decision we are making really is...
[/me remembers Claudio Segovia's work on trying to collect the clearly different south american native languages that are all lumped together in official tagging schemes, although some of them have substantial populations of users]
Denis: There are real issues here with slang and languages that are actually used in practice. If they are common enough, should we learn about them? Should they be languages, or variants? It isn't clear.
Richard: Next session will be a video-cast from US, so we have to be seated on time - there is no flexitime.
Chiara: thanks to speakers.
Richard: Mark has been president of Unicode consortium since it began.
Mark: I cannot hear, so
communicate by text.
... I am going to talk about latest developments in Unicode
[shows slide of Unicode taking over, repeat from someone yesterday]
Mark: Huge growth in Unicode,
ASCII and Latin-1 have plummeted.
... Unicode is now about 50%. But there are regional
differences - in Japan it is 40% but rising, in China 50%
... sample selection includes a lot of pages that people don't
look at much.
... Unicode 6 just came out. There are about 70 properties
associated with characters in the database.
... One of the biggest problems we have is when people hardcode
for particular characters. Having properties lets programmers
make language-independent code
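A minimal sketch of the point about properties, using Python's standard `unicodedata` module (the choice of Python and the helper names `is_letter` and `hardcoded_is_letter` are assumptions made for illustration, not something shown in the talk). It contrasts a hardcoded character range with a check based on the General_Category property.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True if the character's General_Category is one of the letter classes (Lu, Ll, Lo, ...)."""
    return unicodedata.category(ch).startswith("L")


def hardcoded_is_letter(ch: str) -> bool:
    """The brittle approach: only recognises unaccented Latin letters a-z."""
    return "a" <= ch.lower() <= "z"


# Compare both checks on Latin, Cyrillic, Devanagari and Han characters,
# plus a digit and a punctuation mark.
for ch in ["a", "ñ", "ж", "क", "字", "7", "!"]:
    print(ch, unicodedata.category(ch), is_letter(ch), hardcoded_is_letter(ch))
```

The property-based check needs no per-language special cases, whereas the hardcoded range silently rejects every letter outside basic Latin.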
... There are characters like the rupee symbol, but also a lot
of "emoji" - useful or funny symbols that are commonly used on
mobiles in Japan for messaging.
... We can now (since May) have domain names that are
*entirely* internationalised.
... There are problems with deployment, because there are
differences between the first version of internationalised domain
names (IDNA2003) and the new version. Browsers need to match
user expectations too - and in the new version upper case
doesn't match lower case, although people expect it to do so
since it did in the past.
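As a rough illustration of the case-matching behaviour people came to expect, Python's built-in `idna` codec implements the older IDNA2003 rules, which fold upper case to lower case before conversion. The domain `bücher.example` and the helper name `to_ascii_idna2003` are illustrative assumptions; the newer version of the rules is not implemented in the standard library and is not shown here.

```python
def to_ascii_idna2003(domain: str) -> bytes:
    """Convert a domain name to its ASCII form using the IDNA2003 codec in the standard library."""
    return domain.encode("idna")


upper = to_ascii_idna2003("BÜCHER.example")
lower = to_ascii_idna2003("bücher.example")

# Under IDNA2003 both spellings map to the same ASCII name.
print(upper)
print(lower)
print(upper == lower)
```

Under the newer rules this case mapping is not part of the protocol itself, which is the kind of gap between standard and user expectation described above.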
... There are still a lot of old browsers out there, which
don't deal with modern standards.
... There are also issues with characters that are used, but
not permitted by the standard.
... UTS46 - Unicode and browser makers together created a
standard to work out how to handle this in practice.
... CLDR == Common Locale Data Repository, a dataset maintained
by the Unicode consortium to help programmers be
locale-independent
... CLDR is very widely used, so getting improvements in the
data there means improvements in lots of software people
use
... Products generally translate from the XML format we produce
to something optimised for the product.
... Anatomy of a language tag, which can be very big so people
have to allow for that.
... Script tag is important for some languages (Chinese, Uzbek,
...) and there are other things that might be useful
information.
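To make the anatomy of a language tag concrete, here is a deliberately simplified sketch that splits a tag into language, script and region by the shape of each subtag. The function name `split_language_tag` is hypothetical, and the sketch assumes well-formed input: anything it does not recognise (variants, extensions, extended language subtags) is simply collected under `rest` rather than validated.

```python
def split_language_tag(tag: str) -> dict:
    """Split a language tag into language, script, region and remaining subtags (simplified)."""
    subtags = tag.replace("_", "-").split("-")
    result = {"language": subtags[0].lower(), "script": None, "region": None, "rest": []}
    for sub in subtags[1:]:
        is_alpha = sub.isascii() and sub.isalpha()
        nothing_after = not result["rest"]
        if (nothing_after and result["script"] is None and result["region"] is None
                and len(sub) == 4 and is_alpha):
            # Four letters directly after the language: a script subtag such as Hant or Cyrl.
            result["script"] = sub.title()
        elif (nothing_after and result["region"] is None
                and ((len(sub) == 2 and is_alpha) or (len(sub) == 3 and sub.isdigit()))):
            # Two letters or three digits: a region subtag such as TW or 419.
            result["region"] = sub.upper()
        else:
            # Everything else is kept as-is; tags can get long, so allow for that.
            result["rest"].append(sub)
    return result


for tag in ["fr", "zh-Hant-TW", "uz-Cyrl", "es-419", "sl-rozaj-biske"]:
    print(tag, split_language_tag(tag))
```

A production implementation would validate subtags against the registry instead of guessing from length alone.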
... French has, for example, common names for days of the week,
or months, so all French locales would use them. But for e.g.
fr_CA everything that applies to French applies, but it adds
different currency information...
... This is what happens in the locale source information we
publish. An implementation might choose to rewrite that, e.g.
POSIX explicitly fills out all the information for each locale,
instead of relying on inheritance.
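The inheritance idea can be sketched as a lookup that walks from the most specific locale up to a root. The data in `LOCALE_DATA`, the key names, and the functions `fallback_chain`, `lookup` and `flatten` are all hypothetical and chosen only to illustrate the mechanism; they are not copied from the CLDR files.

```python
# Illustrative data only: each locale stores just what differs from its parent.
LOCALE_DATA = {
    "root": {"monday": "Mon", "cad_symbol": "CA$"},
    "fr": {"monday": "lundi", "cad_symbol": "$CA"},
    "fr_CA": {"cad_symbol": "$"},
}


def fallback_chain(locale: str) -> list:
    """Return the locale followed by its parents, ending at root (fr_CA, fr, root)."""
    chain = []
    while locale:
        chain.append(locale)
        locale = locale.rpartition("_")[0]
    chain.append("root")
    return chain


def lookup(locale: str, key: str) -> str:
    """Resolve a key by inheritance: the first locale in the chain that defines it wins."""
    for loc in fallback_chain(locale):
        data = LOCALE_DATA.get(loc, {})
        if key in data:
            return data[key]
    raise KeyError(key)


def flatten(locale: str) -> dict:
    """Fill out every value explicitly, as an implementation without inheritance would."""
    resolved = {}
    for loc in reversed(fallback_chain(locale)):
        resolved.update(LOCALE_DATA.get(loc, {}))
    return resolved


print(lookup("fr_CA", "monday"))
print(lookup("fr_CA", "cad_symbol"))
print(flatten("fr_CA"))
```

`lookup` models the inherited source form, while `flatten` models an implementation that writes out all the information for each locale.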
... There are different sets of exemplar characters. Main set
is likely to be used for things like automatic language
detection, but there are other letters used in practice.
... The index head letters are for things like defining
alphabetical order, to use in a contacts address book or for
sorting. These will be in the new version of CLDR for the first
time.
... Flexible formats allow for different use cases, like
presenting things on mobile phones in many different
combinations
... Time Zone formatting is quite complicated, and people use
different things. So CLDR provides various different ones
... Unit formatting is interesting because different languages
have different approaches to number - Czech needs 3 different
forms (1, 2-4 or 5+)
... for a language like Arabic there are 6 different forms we
need.
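A small sketch of how such plural categories might be selected for whole numbers. The function names and the `CZECH_DAY_FORMS` table are assumptions made for illustration; the rules are restricted to non-negative integers and leave out the handling of fractions that a complete rule set would need.

```python
def czech_category(n: int) -> str:
    """Plural category for a non-negative integer in Czech: 1, 2-4, or 5 and above."""
    if n == 1:
        return "one"
    if 2 <= n <= 4:
        return "few"
    return "other"


def arabic_category(n: int) -> str:
    """Plural category for a non-negative integer in Arabic (six categories)."""
    if n == 0:
        return "zero"
    if n == 1:
        return "one"
    if n == 2:
        return "two"
    if 3 <= n % 100 <= 10:
        return "few"
    if 11 <= n % 100 <= 99:
        return "many"
    return "other"


# Hypothetical message table: the Czech word for "day" in each category.
CZECH_DAY_FORMS = {"one": "den", "few": "dny", "other": "dní"}


def format_days_czech(n: int) -> str:
    """Pick the form of the unit that matches the number."""
    return f"{n} {CZECH_DAY_FORMS[czech_category(n)]}"


for n in [1, 3, 5]:
    print(format_days_czech(n))
for n in [0, 1, 2, 7, 11, 100, 103]:
    print(n, arabic_category(n))
```

The message table is keyed by category rather than by number, so translators supply one string per category and the selection logic stays per language.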
... Currencies have different things they use.
... List formats differ. Identifying letters and word spaces is
tricky, especially in languages that don't use spaces.
[when I first learned Spanish, "ll" was considered a single letter, and German used ß. That has changed]
Mark: Transliteration is
important to translate e.g. place names into different scripts
so people can use them.
... In CLDR 9 you can do things like say 'please sort Cyrillic
characters before Latin (or after)'...
... place names are also noted, because they are different in
different languages
Q: How does Unicode deal with sign languages?
Mark: If there are standards for
the way sign languages are represented in symbols, we would
welcome that. We are engaged in a process of doing a lot more
work for symbols, too.
... Sorry, I have to cut off now. Thanks for your attention
Q: Thanks, it has been great. Where do you see W3C action in this area going? Has anything changed about that in the last 2 days?
Richard: We have some more workshops coming up. I haven't had time to gather my thoughts, so I don't have an answer yet. Sorry.
Q: Will all presentations be available?
A: yes.
Chaals: Anyone want to comment on the internationalisation issues for dead languages that are historically important?
Peter: There are a number of forms already in Unicode for historical reasons - Sumerian Cuneiform, characters used in manuscripts that have since fallen out of use.
Swaran: Grantha? (South Indian language) is an example
Peter: Other issues - input methods are needed. Do we need locale information to identify different hieroglyphic sets?
Comment: Looking at historical
texts, there are transformation rules that change over
time.
... maybe it is necessary to add something to standards to allow
for this?
Peter: Unicode is a standard for
representation of characters. Text representation is the
primary goal. Details of presentation are left to
fonts...
... although there are scenarios where it is important to
indicate some stuff. Unicode can provide some generic
mechanism, e.g. zero-width joiner, to indicate that a ligature
is *requested*. Is that something that is needed, or is it
better to have some markup at a document level (like
font...)?
... There are also things like variation selectors to identify
which version of a character should be used in a particular
place name.
Richard: I hope you enjoyed the
workshop. We were excited and nervous about the diversity and
mixing people who don't normally meet each other.
... Please fill in the feedback forms.
... and leave them on the table.
... A huge vote of thanks to the Universidad Politécnica de
Madrid, and to the sponsors of the workshop.
... Especially thanks to Luis and Encarna for making this real.
Thanks to the speakers, for giving great talks and finishing on
time.
... thanks to the scribes. Next workshop we will start by
introducing IRC so you can follow what happens. The logs will
be on the web.
... Slides will also be up. Links from http://www.multilingualweb.eu
... (speakers please send slides if you haven't already). We
also have video of most speakers.
... We've done the first workshop. We need to think more about
how to share the information we are gathering. We have a Twitter
stream - @multilingweb and a Facebook page, and other things
can be helpful.
... We have a mailing list people can join, and use it to
discuss.
[repeated ads for http://www.multilingualweb.eu/]
scribe: Thanks for coming. Next
workshop will be in Pisa, and we expect the dates to be 15/16 March (to be confirmed shortly). Hope to see you there.
... Have a good trip home.