W3C Internationalization (I18n) Activity: Making the World Wide Web truly world wide!

Contributors

If you own a blog with a focus on internationalization, and want to be added or removed from this aggregator, please get in touch with Richard Ishida at ishida@w3.org.

All times are UTC.

Powered by: Planet

Planet Web I18n

The Planet Web I18n aggregates posts from various blogs that talk about Web internationalization (i18n). While it is hosted by the W3C Internationalization Activity, the content of the individual entries represents only the opinions of their respective authors and does not reflect the position of the Internationalization Activity.

January 29, 2015

ishida>>blog » i18n

Bopomofo on the Web

Three bopomofo letters with tone mark.

Light tone mark in annotation.

A key issue for handling of bopomofo (zhùyīn fúhào) is the placement of tone marks. When bopomofo text runs vertically (either on its own, or as a phonetic annotation), some smarts are needed to display tone marks in the right place. This may also be required (though with different rules) for bopomofo when used horizontally for phonetic annotations (ie. above a base character), but not in all such cases. However, when bopomofo is written horizontally in any other situation (ie. when not written above a base character), the tone mark typically follows the last bopomofo letter in the syllable, with no special handling.
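These three cases can be captured in a tiny decision helper. This is purely an illustrative sketch of the rules stated above; the function name and return labels are my own, not Unicode or CSS terminology:

```python
def tone_mark_placement(vertical: bool, annotation: bool) -> str:
    """Pick a tone-mark strategy for a bopomofo syllable.

    vertical:   the bopomofo runs vertically (standalone or as ruby)
    annotation: the bopomofo annotates a base character (ruby)
    """
    if vertical:
        # Vertical bopomofo needs smarts to place the tone mark
        # in the right position beside the letters.
        return "positioned"
    if annotation:
        # Horizontal ruby above a base character: special rules
        # may apply, though not in all cases.
        return "positioned-or-inline"
    # Plain horizontal text: the tone mark simply follows the
    # last bopomofo letter in the syllable.
    return "inline"
```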

From time to time questions are raised on W3C mailing lists about how to implement phonetic annotations in bopomofo. Participants in these discussions need a good understanding of the various complexities of bopomofo rendering.

To help with that, I just uploaded a new Web page Bopomofo on the Web. The aim is to provide background information, and carry useful ideas from one discussion to the next. I also add some personal thoughts on implementation alternatives, given current data.

I intend to update the page from time to time, as new information becomes available.

by r12a at 29 January 2015 12:07 PM

January 20, 2015

Wikimedia Foundation

File:Content Translation Screencast (English).webm

Video: How to translate a Wikipedia article in 3 minutes with Content Translation. This video can also be viewed on YouTube (4:10). Screencast by Pau Giner, licensed under CC BY-SA 4.0

Wikimedia Foundation’s Language Engineering team is happy to announce the first version of Content Translation on Wikipedia for 8 languages: Catalan, Danish, Esperanto, Indonesian, Malay, Norwegian (Bokmål), Portuguese and Spanish. Content Translation, available as a beta feature, provides a quick way to create new articles by translating from an existing article into another language. It is also well suited for new editors looking to familiarize themselves with the editing workflow. Our aim is to build a tool that leverages the power of our multicultural global community to further Wikimedia’s mission of creating a world where every single human being can share in the sum of all knowledge.

Design

During early 2014, when the design ideas for Content Translation were being conceptualized, we came across an interesting study by Scott A. Hale of the University of Oxford, on the influences and editing patterns of multilingual editors on Wikipedia. Combined with feedback from editors we interacted with, the data presented in the study guided our initial choices, both in terms of features and languages. We were fortunate to have met the researcher in person at Wikimania 2014, so we could learn more about his findings and references.

The tool was designed for multilingual editors as our main target users. Several important patterns emerged from a month-long user study, including:

  • Multilingual editors are relatively more active on smaller Wikipedias. Often, editors from smaller Wikipedias would also edit a relatively large Wikipedia such as English or German;
  • Multilingual editors often edited the same articles in their primary and non-primary languages.

These and other factors listed in the study impact the transfer of content between different language versions of Wikipedia; they increase content parity between versions — and decrease ‘self-focus’ bias in individual editions.

Languages

When selecting languages for the tool’s introduction, we were guided by several factors, including signs of relatively high multilingualism amongst the primary editors. The availability of high quality machine-translated content was an additional consideration, to fully explore the usability of the core editing workflow designed for the tool. Based on these considerations, Catalan Wikipedia, a very actively edited project of medium size, was a logical choice. Subsequent language selections were made by studying possible overlap trends between language users — and the probability of editors benefiting from those overlaps when creating new articles. Availability of machine translation to speed up the process and community requests were important considerations.

How it works

The article Abel Martín in the Spanish Wikipedia doesn’t have a version in Portuguese, so a red link to Portuguese is shown.
Content Translation red interlanguage link screenshot by Amire80, licensed under CC BY-SA 4.0

Content Translation combines a rich text translation interface with tools targeted for editing — and machine translation support for most language pairs. It integrates different tools to automate repetitive steps during translation: it provides an initial automatic translation while keeping the original text format, links, references, and categories. To do so, the tool relies on the inter-language connections from Wikidata, html-to-wikitext conversion from Parsoid, and machine translation support from Apertium. This saves time for editors and allows them to focus on creating quality content.

Although basic text formatting is supported, the purpose of the tool is to create an initial version of the content that each community can keep improving with their usual editing tools. Content Translation is not intended to keep the information in sync across multiple language versions, but to provide a quick way to reuse the effort already made by the community when creating an article from scratch in a different language.

The tool can be accessed in different ways. There is a persistent access point at your contributions page, but access to the tool is also provided in situations where you may want to translate the content you are just reading. For instance, a red link in the interlanguage link area (see image).
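The red-link entry point boils down to a gap check over an article's interlanguage links. A minimal sketch, assuming the sitelink data has already been fetched from Wikidata (the function and the data shape are illustrative, not the actual MediaWiki code):

```python
def missing_languages(sitelinks, candidate_wikis):
    """Return the wikis from candidate_wikis that have no article yet.

    sitelinks: mapping like {"eswiki": "Abel Martín", ...} — the shape
    Wikidata uses to connect the language editions of one article.
    """
    return [wiki for wiki in candidate_wikis if wiki not in sitelinks]

# The Spanish article exists but the Portuguese one does not, so a
# red link to ptwiki would be offered as a translation entry point.
links = {"eswiki": "Abel Martín"}
print(missing_languages(links, ["eswiki", "ptwiki"]))  # ['ptwiki']
```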

Next steps

Next steps for the tool’s future development include adding support for more – eventually all – languages, managing lists of articles to translate, and adding features for more streamlined translation.

In coming weeks, we will closely monitor feedback from users and interact with them to guide our future development. Please read the release announcement for more details about the features and instructions on using the tool. Thank you!

Amir Aharoni, Pau Giner, Runa Bhattacharjee, Language Engineering, Wikimedia Foundation

by fflorin2015 at 20 January 2015 06:56 PM

January 18, 2015

ishida>>blog » i18n

Bengali picker & character & script notes updated

Version 16 of the Bengali character picker is now available.

Other than a small rearrangement of the selection table, and the significant standard features that version 16 brings, this version adds the following:

  • three new buttons for automatic transcription between Latin and Bengali. You can use these buttons to transcribe to and from Latin transcriptions using the ISO 15919 or Radice approaches.
  • hinting to help identify similar characters.
  • the ability to select the base character for the display of combining characters in the selection table.

For more information about the picker, see the notes at the bottom of the picker page.

In addition, I made a number of additions and changes to Bengali script notes (an overview of the Bengali script), and Bengali character notes (an annotated list of characters in the Bengali script).

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

by r12a at 18 January 2015 08:10 AM

January 15, 2015

Global By Design

Global Gateway Fail: Yandex

Yandex is Russia’s leading search engine and, following in Google’s footsteps, is eager to take over much of Russia’s Internet, which naturally includes the web browser. Yandex is also in the process of expanding its reach beyond Russia. But when I visited the web browser download web page I couldn’t help but notice a few problems with the global […]

The post Global Gateway Fail: Yandex appeared first on Global by Design.

by John Yunker at 15 January 2015 01:25 AM

January 13, 2015

ishida>>blog » i18n

Initial letter styling in CSS

[image: enlarged initial letter in Tibetan text]

The CSS WG needs advice on initial letter styling in non-Latin scripts, ie. enlarged letters or syllables at the start of a paragraph like those shown in the picture. Most of the current content of the recently published Working Draft, CSS Inline Layout Module Level 3 is about styling of initial letters, but the editors need to ensure that they have covered the needs of users of non-Latin scripts.

The spec currently describes drop, sunken and raised initial characters, and allows you to manipulate them using the initial-letter and the initial-letter-align properties. You can apply those properties to text selected by ::first-letter, or to the first child of a block (such as a span).

The editors are looking for

any examples of drop initials in non-western scripts, especially Arabic and Indic scripts.

I have scanned some examples from newspapers (so, not high quality print).

In the section about initial-letter-align the spec says:

Input from those knowledgeable about non-Western typographic traditions would be very helpful in describing the appropriate alignments. More values may be required for this property.

Do you have detailed information about initial letter styling in a non-Latin script that you can contribute? If so, please write to www-style@w3.org (how to subscribe).

by r12a at 13 January 2015 12:13 PM

January 10, 2015

Wikimedia Foundation

"Cx-new-languages" by Runabhattacharjee, under CC-Zero


The new Content Translation tool’s language selector will make it easier to translate Wikipedia articles.
Content Translation tool screenshot by Runabhattacharjee, licensed under CC-0

In early December 2014, the Wikimedia Foundation’s Language Engineering team announced the release of the third version of our Content Translation tool, which aims to make it easier to translate Wikipedia articles. Since then, our focus has been to take the tool to the next step and make it more widely available. Encouraged by the feedback we have received in the last 6 months, we are now happy to announce that the tool will soon be available in 8 Wikipedias as a beta feature. Users of Catalan, Danish, Esperanto, Indonesian, Malay, Norwegian (Bokmål), Portuguese, and Spanish Wikipedias will be able to use Content Translation from mid-January 2015. The tool will also be enabled on the Norwegian (Nynorsk) and Swedish Wikipedias, but only to facilitate their use as sources for Norwegian (Bokmål) and Danish respectively.

Users of Catalan, Spanish and Portuguese wikis have already previewed the tool on the Wikimedia beta servers and it was a natural choice to add these three languages in our first set for deployment. The remaining five languages were chosen based on user survey results and community requests. These languages are also available on the Wikimedia beta servers where Content Translation has been hosted since July 2014.

Currently, the Language Engineering team is completing the final phases for enabling Content Translation as a beta feature. After deployment, users will be able to translate Wikipedia articles into the language of their choice (restricted to the above mentioned eight languages) from appropriate source languages available for that language. For most of these languages, machine translation between the source and target language pairs will be made available through Apertium. English will be enabled as a source language for all languages, but without machine translation support except for English to Esperanto, where machine translations from English have been found to be satisfactory.

We will make further announcements as we close in on the deployment date. It’s possible that the beta feature may become available on the wikis for testing, before the announcements are out. Prior to that, the Language Engineering team will also host an IRC ‘office hours’ discussion on January 14th at 1600 UTC on #wikimedia-office.

Meanwhile, we welcome users to try out Content Translation and to bring to our attention any issues or suggestions. You can also help us prepare Content Translation to support more languages by filling in the language evaluation survey.

Runa Bhattacharjee, Language Engineering, Wikimedia Foundation

by fflorin2015 at 10 January 2015 10:39 PM

January 06, 2015

ishida>>blog » i18n

The Combining Character Conundrum

I’m struggling to show combining characters on a page in a consistent way across browsers.

For example, while laying out my pickers, I want users to be able to click on a representation of a character to add it to the output field. In the past I resorted to pictures of the characters, but now that webfonts are available, I want to replace those with font glyphs. (That makes for much smaller and more flexible pages.)

Take the Bengali picker that I’m currently working on. I’d like to end up with something like this:

[image: the desired rendering]

I put a no-break space before each combining character, to give it some width, and because that’s what the Unicode Standard recommends (p60, Exhibiting Nonspacing Marks in Isolation). The result is close to what I was looking for in Chrome and Safari except that you can see a gap for the nbsp to the left.
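The NBSP approach can be scripted: Unicode general categories tell you which characters need a stand-in base. A minimal sketch (the helper name is mine, not from the picker code):

```python
import unicodedata

NBSP = "\u00A0"  # no-break space, the base the Unicode Standard suggests

def display_form(ch: str) -> str:
    """Return a picker cell's display string: marks (general category
    Mn, Mc or Me) get a no-break space prepended as a stand-in base so
    they have something to render on; other characters pass through."""
    if unicodedata.category(ch).startswith("M"):
        return NBSP + ch
    return ch

# U+09C7 BENGALI VOWEL SIGN E is a (spacing) mark, so it gets a base;
# the independent vowel U+0985 BENGALI LETTER A does not.
assert display_form("\u09C7") == NBSP + "\u09C7"
assert display_form("\u0985") == "\u0985"
```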

[image: rendering in Chrome and Safari, showing the gap left by the no-break space]

But in IE and Firefox I get this:

[image: rendering in IE and Firefox]

This is especially problematic since it messes up the overall layout, but in some cases it also causes text to overlap.

I tried using a dotted circle Unicode character, instead of the no-break space. On Firefox this looked ok, but on Chrome it resulted in two dotted circles per combining character.

I considered using a consonant as the base character. It would work ok, but it would possibly widen the overall space needed (not ideal) and would make it harder to spot a combining character by shape. I tried putting a span around the base character to grey it out, but the various browsers reacted differently to the span. Vowel signs that appear on both sides of the base character no longer worked – the vowel sign appeared after the base character instead. In other cases, the grey of the base character was inherited by the whole grapheme, regardless of the fact that the combining character was outside the span. (Here are some examples ে and ো.)

In the end, I settled for no preceding base character at all. The combining character was the first thing in the table cell or span that surrounded it. This gave the desired result for the font I had been using, albeit that I needed to tweak the occasional character with padding to move it slightly to the right.

On the other hand, this was not to be a complete solution either. Whereas most of the fonts I planned to use produce the dotted circle in these conditions, one of my favourites (SolaimanLipi) doesn’t produce it. This leads to significant problems, since many combining characters appear far to the left, and in some cases it is not possible to click on them, in others you have to locate a blank space somewhere to the right and click on that. Not at all satisfactory.

[image: rendering with SolaimanLipi, with no dotted circles]

I couldn’t find a better way to solve the problem, however, and since there were several Bengali fonts to choose from that did produce dotted circles, I settled for that as the best of a bad lot.

However, I then turned my attention to other pickers and tried the same solution. I found that only one of the many Thai fonts I tried for the Thai picker produced the dotted circles. So the approach here would have to be different. For Khmer, the main Windows font (Daunpenh) produced dotted circles only for some of the combining characters in Internet Explorer. And on Chrome, a sequence of two combining characters, one after the other, produced two dotted circles…

I suspect that I’ll need to choose an approach for each picker based on what fonts are available, and perhaps provide an option to insert or remove base characters before combining characters when someone wants to use a different font.

It would be nice to standardise behaviour here, and to do so in a way that involves the no-break space, as described in the Unicode Standard, or some other base character such as – why not? – the dotted circle itself. I assume that the fix for this would have to be handled by the browser, since there are already many font cats out of the bag.

Does anyone have an alternate solution? I thought I heard someone at the last Unicode conference mention some way of controlling the behaviour of dotted circles via some script or font setting…?

Update: See Marc Durdin’s blog for more on this topic, and his experiences while trying to design on-screen keyboards for Lao and other scripts.

by r12a at 06 January 2015 05:28 PM

January 05, 2015

Global By Design

Domain name registrations surpass 280 million

According to Verisign’s domain name industry brief, TLD registrations hit 280 million in the second quarter of 2014. Note that the .TK ccTLD is technically a country code but is marketed as a generic TLD, and quite successfully it seems. So Germany is effectively the leading ccTLD. […]

The post Domain name registrations surpass 280 million appeared first on Global by Design.

by John Yunker at 05 January 2015 07:15 PM

ishida>>blog » i18n

Khmer character picker v16

I have uploaded a new version of the Khmer character picker.

The new version uses characters instead of images for the selection table, making it faster to load and more flexible. If you prefer, you can still access the previous version.

Other than a small rearrangement of the default selection table to accommodate fonts rather than images, and the significant standard features that version 16 brings, there are no additional changes in this version.

For more information about the picker, see the notes at the bottom of the picker page.

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

by r12a at 05 January 2015 10:12 AM

Devanagari, Gurmukhi & Uighur pickers available

I have updated the Devanagari picker, the Gurmukhi picker and the Uighur picker to version 16.

You may have spotted a previous, unannounced, version of the Devanagari and Uighur pickers on the site, but essentially these versions should be treated as new. The Gurmukhi picker has been updated from a very old version.

In addition to the standard features that version 16 of the character pickers brings, things to note include the addition of hints for all pickers, and automated transcription from Devanagari to ISO 15919, and vice versa for the Devanagari picker.

For more information about the pickers, see the notes at the bottom of the relevant picker page.

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

by r12a at 05 January 2015 09:45 AM

January 04, 2015

ishida>>blog » i18n

More picker changes: Version 16

A couple of posts ago I mentioned that I had updated the Thai picker to version 16. I have now updated a few more. For ease of reference, I will list here the main changes between version 16 pickers and previous versions back to version 12.

  • Fonts rather than graphics. The main selection table in version 12 used images to represent characters. These have now gone, in favour of fonts. Most pickers include a web font download to ensure that you will see the characters. This reduces the size and download time significantly when you open a picker. Other source code changes have reduced the size of the files even further, so that the main file is typically only a small fraction of the size it was in version 14.

    It is also now possible, in version 16, to change the font of the main selection table and the font size.

  • UI. The whole look and feel of the user interface has changed from version 14 onwards, and includes useful links and explanations off the top of the normal work space.

    In particular, the vertical menu, introduced in version 14, has been adjusted so that input features can be turned on and off independently, and new panels appear alongside the others, rather than toggling the view from one mode to another. So, for example, you can have hints and shape-based selectors turned on at the same time. When something is switched on, its label in the menu turns orange, and the full text of the option is followed by a check mark.

  • Transcription panels. Some pickers had one or more transcription views in versions below 16. These enable you to construct some non-Latin text when working from a Latin transcription. In version 16 these alternate views are converted to panels that can be displayed at the same time as other information. They can be shown or hidden from the vertical menu. When there is ambiguity as to which characters to use, a pop up displays alternatives. Click on one to insert it into the output. There is also a panel containing non-ASCII Latin characters, which can be used when typing Latin transcriptions directly into the main output area. This panel is now hidden by default, but can be easily shown from the vertical menu.

  • Automated transcription. Version 16 pickers carry forward, and in some cases add, automated transcription converters. In some cases these are intended to generate only an approximation to the needed transcription, in order to speed up the transcription process. In other cases, they are complete. (See the notes for the picker to tell which is which.) Where there is ambiguity about how to transcribe a sequence of characters, the interface offers you a choice from alternatives. Just click on the character you want and it will replace all the options proposed. In some cases, particularly South-East Asian scripts, the text you want to transcribe has to be split into syllables first, using spaces and/or hyphens. Where this is necessary, a condense button is provided to quickly strip out the separators after the transcription is done.

  • Layout. The default layout of the main selection table has usually been improved, to make it easier to locate characters. Rarely used, deprecated, etc., characters appear below the main table, rather than to the right.

  • Hints. Very early versions of the pickers used to automatically highlight similar and easily confusable characters when you hovered over a character in the main selection table. This feature is being reintroduced as standard for version 16 pickers. It can be turned on or off from the vertical menu. This is very helpful for people who don’t know the script well.

  • Shape-based selection. In previous versions the shape-based view replaced the default view. In version 16 the shape selectors appear below the main selection table and highlight the characters in that table. This arrangement has several advantages.

  • Applying actions to ranges of text. When clicking on the Codepoints and Escapes buttons, it is possible to apply the action to a highlighted range of characters, rather than all the characters in the output area. It is also possible to transcribe only highlighted text, when using one of the automated transcription features.

  • Phoneme bank. When composing text from a Latin transcription in previous versions you had to make choices about phonetics. Those choices were stored on the UI to speed up generation of phonetic transcriptions in addition to the native text, but this feature somewhat complicated the development and use of the transcription feature. It has been dropped in version 16. Hopefully, the transcription panels and automated transcription features will be useful enough in future.

  • Font grid. The font grid view was removed in version 16. It is of little value when the characters are already displayed using fonts.

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

by r12a at 04 January 2015 12:53 PM

January 01, 2015

Global By Design

Most popular posts of 2014

Looking over traffic logs, here are some of the most-visited blog posts of 2014:

  • Wikipedia and the Internet language chasm
  • The top 25 global websites from the 2014 Web Globalization Report Card
  • The worst global websites of the 2014 Web Globalization Report Card
  • Gmail to be first major platform to support non-Latin email addresses
  • Google […]

The post Most popular posts of 2014 appeared first on Global by Design.

by John Yunker at 01 January 2015 04:12 PM

December 26, 2014

ishida>>blog » i18n

Language Subtag Lookup tool updated

This update to the Language Subtag Lookup tool brings back the Check function that had been out of action since last January. The code had to be pretty much completely rewritten to migrate it from the original PHP. In the process, I added support for extension and private use tags, and added several more checks. I also made various changes to the way the results are displayed.

Give it a try with this rather complicated, but valid language tag: zh-cmn-latn-CN-pinyin-fonipa-u-co-phonebk-x-mytag-yourtag

Or try this rather badly conceived language tag, to see some error messages: mena-fr-latn-fonipa-biske-x-mylongtag-x-shorter

The IANA database information is up-to-date. The tool currently supports the IANA Subtag registry of 2014-12-17. It reports subtags for 8,081 languages, 228 extlangs, 174 scripts, 301 regions, 68 variants, and 26 grandfathered subtags.
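For readers unfamiliar with what such a tool checks, a tag like the valid example above breaks into typed subtags. A rough sketch using only the shape rules from BCP 47 (RFC 5646); a real validator, like this tool, must also look every subtag up in the IANA registry:

```python
import re

def split_subtags(tag: str):
    """Roughly classify the subtags of a BCP 47 language tag using
    length/position heuristics from RFC 5646's syntax. Illustrative
    only: no registry lookup, no well-formedness enforcement."""
    out, mode = [], "start"
    for p in tag.lower().split("-"):
        if mode == "private":
            out.append((p, "privateuse"))
        elif mode == "extension":
            if len(p) == 1:  # next singleton ends this extension
                mode = "private" if p == "x" else "extension"
                out.append((p, "singleton"))
            else:
                out.append((p, "extension"))
        elif len(p) == 1:
            mode = "private" if p == "x" else "extension"
            out.append((p, "singleton"))
        elif not out and re.fullmatch(r"[a-z]{2,3}", p):
            out.append((p, "language"))
        elif out and out[-1][1] in ("language", "extlang") and re.fullmatch(r"[a-z]{3}", p):
            out.append((p, "extlang"))
        elif re.fullmatch(r"[a-z]{4}", p):
            out.append((p, "script"))
        elif re.fullmatch(r"[a-z]{2}|[0-9]{3}", p):
            out.append((p, "region"))
        else:
            out.append((p, "variant"))
    return out

for subtag, kind in split_subtags(
        "zh-cmn-Latn-CN-pinyin-fonipa-u-co-phonebk-x-mytag-yourtag"):
    print(subtag, kind)
```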

by r12a at 26 December 2014 08:11 AM

December 21, 2014

ishida>>blog » i18n

Thai character picker v16

I have uploaded another new version of the Thai character picker.

Sorry this follows so quickly on the heels of version 15, but as soon as I uploaded v15 several ideas on how to improve it popped into my head. This is the result. I will hopefully bring all the pickers, one by one, up to the new version 16 format. If you prefer, you can still access version 12.

The main changes include:

  • UI. Adjustment of the vertical menu, so that input features can be turned on and off independently, and new panels appear with the others, rather than toggling from one to another. So, for example, you can have hints and shape-based selectors turned on at the same time. When something is switched on, its label in the menu turns orange, and the full text of the option is followed by a check mark.
  • Transcription panels. Panels have been added to enable you to construct some Thai text when working from a Latin transcription. This brings the transcription inputs of version 12 into version 16, but in a more compact and simpler way, one that gives you continued access to the standard table for special characters.

    There are currently options to transcribe from ISO 11940-2 (although there are some gaps in that), or from the transcription used by Benjawan Poomsan Becker in her book, Thai for Beginners. These are both transcriptions based on phonetic renderings of the Thai, so there is often ambiguity about how to transcribe a particular Latin letter into Thai. When such an ambiguity occurs, the interface offers you a choice via a small pop-up. Just click on the character you want and it will be inserted into the main output area.

    The transcription panels are useful because you can add a whole vowel at a time, rather than picking the individual vowel signs that compose it. An issue arises, however, when the vowel signs that make up a given vowel contain one that appears to the left of the syllable initial consonant(s). This is easily solved by highlighting the syllable in question and clicking on the reorder button. The vowel sign in question will then appear as the first item in the highlighted text.

    There is also a panel containing non-ASCII Latin characters, which can be used when typing Latin transcriptions directly into the main output area. (This was available in v15 too, but has been made into a panel like the others, which can be hidden when not needed.)

  • Tones for automatic IPA transcriptions. The automatic transcription to IPA now adds tone marks. These are usually correct, but, as with other aspects of the transcription, it doesn’t take into account the odd idiosyncrasy in Thai spelling, so you should always check that the output is correct. (Note that there is still an issue for some of the ambiguous transcription cases, mostly involving RA.)
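The reorder step described in the transcription bullet above can be sketched as a small function: Thai text is stored with the pre-posed vowels (U+0E40–U+0E44) before their consonant, so a syllable assembled with all vowel signs after the consonant needs its left-side vowel moved to the front. The function name and the assembled input order are my assumptions, not the picker's actual code:

```python
# The five Thai vowel signs that render to the LEFT of the consonant:
# SARA E, SARA AE, SARA O, SARA AI MAIMUAN, SARA AI MAIMALAI.
PREPOSED = set("\u0E40\u0E41\u0E42\u0E43\u0E44")

def reorder_syllable(syllable: str) -> str:
    """Move any pre-posed vowel sign to the start of the syllable,
    matching the stored (visual) character order Thai uses."""
    pre = [c for c in syllable if c in PREPOSED]
    rest = [c for c in syllable if c not in PREPOSED]
    return "".join(pre + rest)

# A syllable assembled consonant-first (MO MA + SARA AE) reorders so
# that SARA AE comes first, as it is stored in Thai text.
assert reorder_syllable("\u0E21\u0E41") == "\u0E41\u0E21"
```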

For more information about the picker, see the notes at the bottom of the picker page.

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

by r12a at 21 December 2014 09:18 AM

December 18, 2014

Global By Design

Have you registered your Cuba domain?

Cuba has long had its own country code: .CU. But most companies didn’t view this domain as a priority. Until now. But be aware that this domain isn’t cheap. I’ve seen prices ranging from $800 to $1,100, so only larger companies will see this as an impulse buy. But if you can get it, I think […]

The post Have you registered your Cuba domain? appeared first on Global by Design.

by John Yunker at 18 December 2014 05:27 PM

ishida>>blog » i18n

Thai character picker v15

I have uploaded a new version of the Thai character picker.

The new version uses characters instead of images for the selection table, making it faster to load and more flexible, and dispenses with the transcription view. If you prefer, you can still access the previous version.

Other changes include:

  • Significant rearrangement of the default selection table. The new arrangement makes it easy to choose the right characters if you have a Latin transcription to hand, which allows the removal of the previous transcription view, at the same time as speeding up that type of picking.
  • Addition of latin prompts to help locate letters (standard with v15).
  • Automatic transcription from Thai into ISO 11940-1, ISO 11940-2 and IPA. Note that for the last two there are some corner cases where the results are not quite correct, due to the ambiguity of the script, and note also that you need to show syllable boundaries with spaces before transcribing. (There’s a way to remove those spaces quickly afterwards.) See below for more information.
  • Hints! When switched on, if you mouse over a character, other similar characters, or characters incorporating the shape you moused over, are highlighted. This is particularly useful for people who don’t know the script well and may miss small differences, but it’s also sometimes useful for finding a character if you first see something similar.
  • It also comes with the new v15 features that are standard, such as shape-based picking without losing context, range-selectable codepoint information, a rehabilitated escapes button, the ability to change the font of the table and the line-height of the output, and the ability to turn off autofocus on mobile devices to stop the keyboard jumping up all the time, etc.
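The hints behaviour described in the list above boils down to a similarity lookup. Here is a minimal sketch of the idea; the two look-alike groups are hypothetical examples chosen for illustration, not the picker’s actual data:

```python
# Sketch of the "hints" idea: map each character to visually similar
# characters, so that mousing over one can highlight the others.
# The groups below are illustrative examples, not the picker's real data.
SIMILAR = {}

def add_group(chars):
    # Register a set of mutually similar characters.
    for ch in chars:
        SIMILAR.setdefault(ch, set()).update(c for c in chars if c != ch)

add_group("ขช")  # Thai KHO KHAI and CHO CHANG look alike
add_group("ดต")  # Thai DO DEK and TO TAO look alike

def hints_for(ch):
    # Characters to highlight when the user mouses over `ch`.
    return sorted(SIMILAR.get(ch, set()))
```

A real picker would hold larger groups and could also index component shapes, but the lookup itself stays this simple.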

For more information about the picker, see the notes at the bottom of the picker page.

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

More about the transcriptions: There are three buttons that allow you to convert from Thai text to Latin transcriptions. If you highlight part of the text, only that part will be transcribed.

The toISO-1 button produces an ISO 11940-1 transliteration that latinises the Thai characters without changing their order. The result doesn’t normally tell you how to pronounce the Thai text, but it can be converted back to Thai, since each Thai character is represented by a unique sequence in Latin. This transcription should produce fully conformant output. There is no need to identify syllable boundaries first.
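The reversibility property is easy to picture: because each Thai character maps to a unique Latin sequence, the table can simply be inverted. The tiny mapping below is a hand-picked illustration of the principle, not the real ISO 11940-1 data:

```python
# Sketch of a reversible, character-by-character transliteration in the
# spirit of ISO 11940-1. The mapping is a tiny illustrative subset.
TO_LATIN = {
    "ก": "k",  # THAI CHARACTER KO KAI
    "า": "ā",  # THAI CHARACTER SARA AA
    "น": "n",  # THAI CHARACTER NO NU
}
FROM_LATIN = {latin: thai for thai, latin in TO_LATIN.items()}

def thai_to_latin(text):
    # Characters outside the mapping pass through unchanged.
    return "".join(TO_LATIN.get(ch, ch) for ch in text)

def latin_to_thai(text):
    return "".join(FROM_LATIN.get(ch, ch) for ch in text)
```

Because every Latin value is unique, `latin_to_thai(thai_to_latin(s))` round-trips losslessly for any text covered by the table.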

The toISO-2 and toIPA buttons produce output that is intended to approximately reflect actual pronunciation. It will work fine most of the time, but there are occasional ambiguities and idiosyncrasies in Thai which will cause the converter to render certain less common syllables incorrectly. It also doesn’t automatically add accent marks to the phonetic version (though that may be added later). So the output of these buttons should be treated as something that gets you 90% of the way. NOTE: Before using these two buttons you need to add spaces or hyphens between each syllable of the Thai text. Syllable boundaries are important for correct interpretation of the text, and they are not detected automatically.

The condense button removes the spaces from the highlighted range (or the whole output area, if nothing is highlighted).
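The condense behaviour can be sketched in a few lines. This is only an illustration, assuming the highlight is passed in as start/end offsets:

```python
# Sketch of "condense": strip the syllable-separating spaces from a
# selected span, or from the whole text if nothing is selected.
def condense(text, sel_start=None, sel_end=None):
    if sel_start is None:
        # No highlight: condense the whole output area.
        return text.replace(" ", "")
    # Only the highlighted span is condensed; surrounding text is untouched.
    middle = text[sel_start:sel_end].replace(" ", "")
    return text[:sel_start] + middle + text[sel_end:]
```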

Note: For the toISO-2 transcription I use a macron over long vowels. This is non-standard.

by r12a at 18 December 2014 02:35 PM

W3C I18n Activity highlights

W3C MultilingualWeb Workshop Announced: 29 April 2015, Riga, Latvia

W3C announced today the 8th MultilingualWeb workshop in a series of events exploring the mechanisms and processes needed to ensure that the World Wide Web lives up to its potential around the world and across barriers of language and culture.

This workshop will be held 29 April 2015 in Riga, Latvia, and is made possible by the generous support of the LIDER project. The workshop is part of the Riga Summit 2015 on the Multilingual Digital Single Market (27-29 April).

Anyone may attend all sessions at no charge and the W3C welcomes participation by both speakers and non-speaking attendees. Early registration is encouraged due to limited space.

Building on the success of seven highly regarded previous workshops, this workshop will emphasize new technology developments that lead to new opportunities for the Multilingual Web. The workshop brings together participants interested in the best practices and standards needed to help content creators, localizers, language tools developers, and others meet the challenges of the multilingual Web. It provides further opportunities for networking across communities. We are particularly interested in speakers who can demonstrate novel solutions for reaching out to a global, multilingual audience.

See the Call for Participation and register online.

by Richard Ishida at 18 December 2014 10:34 AM

December 16, 2014

W3C I18n Activity highlights

First Public Working Draft of Indic Layout Requirements published

The W3C Internationalization Working Group has published a First Public Working Draft of Indic Layout Requirements on behalf of the Indic Layout Task Force, part of the W3C Internationalization Interest Group.

This document describes the basic requirements for Indic script layout and text support on the Web and in eBooks. These requirements provide information for Web technologies such as CSS, HTML and SVG about how to support users of Indic scripts. The current document focuses on Devanagari, but there are plans to widen the scope to encompass additional Indian scripts as time goes on.

Publication as a First Public Working Draft signals the beginning of the process, rather than an end point. We are now looking for comments on the document. Please send any comments you have to public-i18n-indic@w3.org. The archive is public, but you need to subscribe to post to it.

by Richard Ishida at 16 December 2014 05:10 PM

December 10, 2014

Wikimedia Foundation

guillaumepaumier

Exciting new features are now available in the third version of the Content Translation tool. Development of the new version was recently completed and the newly added features can be used in Wikimedia’s beta environment. To use it, you first need to enable the Content Translation beta-feature in the wiki, then go to the Special Page to select the article to translate. This change in behavior was done in preparation for the activation of Content Translation as a beta-feature on a few selected Wikipedias in early 2015.

The Content Translation user dashboard

Highlights

Two important features have been included in this phase of development work: a user dashboard, and saving & continuing of unfinished translations.

Users can currently use these two features to monitor only their own work. The dashboard (see image) will display all the published and unpublished articles created by the user. Unpublished articles are translations that the user has not published to the user namespace of the wiki. These articles can be opened from the dashboard and users can continue to translate them. The dashboard is presently in a very early stage of development, and enhancements will be made to enrich the features.

Additionally, the selector for source and target languages and articles has been redesigned. Published articles with an excessive amount of unedited machine-translated content are now included in a category so that they can be easily identified.

Languages currently available with Apertium’s machine translation support are Catalan, Portuguese and Spanish. Users of other languages can also explore the tool after they have enabled the beta-feature. Please remember that this wiki is hosted on Wikimedia’s beta servers and you will need to create a separate account.

Upcoming plans and participation

Development work is currently going on for the fourth version of this tool. During this phase, we will focus our attention on making the translation interface stable and preparing the tool for deployment as a beta-feature on several Wikipedias.

Since the first release in July 2014, we have been guided by the helpful feedback we have continuously received from early users. We look forward to wider participation and more feedback as the tool progresses with new features and is enabled for new languages. Please let us know your views about Content Translation on the Project talk page, or by signing up for user testing sessions. You can also participate in the language quality evaluation survey to help us identify new languages that can be served through the tool.

Runa Bhattacharjee, Wikimedia Foundation, Language Engineering team

by Guillaume Paumier at 10 December 2014 11:24 PM

December 06, 2014

ishida>>blog » i18n

Tibetan character picker v15

I have uploaded a new version of the Tibetan character picker.

The new version dispenses with the images for the selection table. If you don’t have a suitable font to display the new version of the picker, you can still access the previous version, which uses images.

Other changes include:

  • Significant rearrangement of the default table, with many less common symbols moved into a location that you need to click on to reveal. This declutters the selection table.
  • Addition of latin prompts to help locate letters (standard with v15).
  • Hints. (When switched on, if you mouse over a character, other similar characters, or characters incorporating the shape you moused over, are highlighted. Particularly useful for people who don’t know the script well and may miss small differences, but also sometimes useful for finding a character if you first see something similar.)
  • A new Wylie button that converts Tibetan text into an extended Wylie Latin transcription. There are still some uncommon characters that don’t work, but it should cover most normal needs. I used diacritics over lowercase letters rather than uppercase letters, except for the fixed form characters. I also didn’t provide conversions for many of the symbols – they will appear without change in the transcription. See the notes on the page for more information.
  • The Codepoints button, which produces a list of characters in the output box, now has a new feature. If you have highlighted some text in the output box, you will only see a list of the highlighted characters. If there are no highlights, the contents of the whole output box are listed.
  • Don’t forget, if you are using the picker on an iPad or mobile device, to set Autofocus to Off before tapping on characters. This stops the device keypad popping up every time you select a character. (This is also standard for v15.)
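The Codepoints feature in the list above amounts to listing the code point and name of each character in the chosen range. A minimal sketch of that behaviour using Python’s standard unicodedata module (the actual picker is a web page, so this is only an illustration of the logic):

```python
import unicodedata

def codepoint_list(text, highlight=None):
    # highlight is an optional (start, end) slice into the output text;
    # with no highlight, the whole output is listed.
    chars = text[highlight[0]:highlight[1]] if highlight else text
    return [
        "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in chars
    ]
```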

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

by r12a at 06 December 2014 10:39 PM

December 02, 2014

W3C I18n Activity highlights

Final report for Chinese Layout Requirements Workshop available

The final report for the Workshop on Chinese Language Text Layout Requirements, which was held on September 11, 2014, at Beihang University, is now available. See also the Chinese version of this report.

The report contains links to slides.

The workshop gave a strong message of support for W3C Beihang and CESI to cooperate and lead the work on the Chinese Layout Requirement Document. In addition to Simplified and Traditional Chinese, there was also strong interest from representatives of the Mongolian, Tibetan and Uighur script communities to participate in the work. The closing session of the workshop proposed a number of steps to continue the efforts.

The W3C staff is driving the process of setting up this task force and reaching out to a wide range of interested stakeholders. This consultation will seek to clarify the mission for the task force, the target topics and industry priorities, and opportunities for liaisons with other related standards development organizations.

by Richard Ishida at 02 December 2014 05:01 PM

November 26, 2014

Global By Design

WordPress now at 70 languages, and counting

This blog has been hosted on WordPress since 2002. Since then, WordPress has grown into one of the dominant publishing platforms on the Internet. And one of the most multilingual as well, with strong support for 53 locales and limited support for an additional 20 or so locales. Languages supported include Russian, Arabic, Hebrew, Icelandic, […]

The post WordPress now at 70 languages, and counting appeared first on Global by Design.

by John Yunker at 26 November 2014 03:32 PM

November 17, 2014

ishida>>blog » i18n

Picker changes

If you use my Unicode character pickers, you may have noticed some changes recently. I’ve moved several pickers on to version 14. Most of the noticeable changes are in the location and styling of elements on the UI – the features remain pretty much unchanged.

Pages have acquired a header at the top (which is typically hidden), that provides links to related pages, and integrates the style into that of the rest of the site. What you don’t see is a large effort to tidy the code base and style sheets.

So far, I have changed the following: Arabic block, Armenian, Balinese, Bengali, Khmer, IPA, Lao, Mongolian, Myanmar, and Tibetan.

I will convert more as and when I get time.

However, in parallel, I have already made a start on version 15, which is a significant rewrite. Gone are the graphics, to be replaced by characters and webfonts. This makes a huge improvement to the loading time of the page. I’m also hoping to introduce more automated transcription methods, and simpler shape matching approaches.

Some of the pickers I already upgraded to version 14 have mechanisms for transcription and shape-based identification that took a huge effort to create, and will take a substantial effort to upgrade to version 15. So they may stay as they are for a while. However, pickers that are easier to convert, as well as new pickers, will move to the new format.

Actually, I already made a start with Gurmukhi v15, which yanks that picker out of the stone-age and into the future. There’s also a new picker for the Uighur language that uses v15 technology. I’ll write separate blogs about those.

 

[By the way, if you are viewing the pickers on a mobile device such as an iPad, don't forget to turn Autofocus off (click on 'more controls' to find the switch). This will stop the onscreen keyboard popping up, annoyingly, each time you try to tap on a character.]

by r12a at 17 November 2014 10:51 PM

November 14, 2014

Wikimedia Foundation

guillaumepaumier

Many readers of this blog know about the Content Translation initiative. This project, developed by the Language Engineering team of the Wikimedia Foundation, brings together machine translation and rich text editing to provide a quick method to create Wikipedia articles by translating them from another language.

Content Translation uses Apertium as its machine translation back-end. Apertium is a freely licensed open source project and was our first choice for this stage of development. The first version of Content Translation focused on the Spanish-Catalan language pair, and one of the reasons for this choice was the maturity of Apertium’s machine translation for those languages.

However, with growing needs to support more language pairs in the newer versions of Content Translation, it became essential that the machine translation continue to be reliable, and that the back-end be stable and up-to-date. To ensure this stability, we needed to use the latest updates released by the Apertium upstream project maintainers, and we needed to use Apertium as a separate service. Prior to this set-up, the Apertium service was being provided from within the Content Translation server (cxserver).

The Content Translation tool is currently hosted on Wikimedia’s beta servers. To set up the independent Apertium service, it was important to use the latest released stable packages from Apertium, but they were not available for the current versions of Ubuntu and Debian. This became a significant blocker, because use of third party package repositories is not recommended for Wikimedia’s server environments.

After discussion with Wikimedia’s Operations team and Apertium project maintainers, it was decided that the Apertium packages would be built for the Wikimedia repository. In addition to the Apertium base packages, individual packages for supporting the language pairs and other service packages were built, tested and included in the Wikimedia repository. Alexandros Kosiaris (from the Wikimedia Operations team) reviewed and merged these packages and the patches for their inclusion in the repository. The Apertium service was then puppetized for easy configuration and management on the Wikimedia beta cluster.

Meanwhile, to make Apertium more accessible for Ubuntu and Debian users, Kartik Mistry (from the Wikimedia Language Engineering team) also started working closely with the Apertium project maintainers, to make sure that the Debian packages were up-to-date in the main repository. Going forward, once the updated packages are included in Ubuntu’s next Long Term Support (LTS) version, we plan to remove these packages from the internal Wikimedia repository.

The Content Translation tool has since been updated and now supports Catalan, Portuguese and Spanish machine translation, using the updated Apertium service through cxserver. We hope our users will benefit from the faster and more reliable translation experience.

We would like to thank Tino Didriksen, Francis Tyers and Kevin Brubeck Unhammer from the Apertium project, and Alexandros Kosiaris and Antoine Musso from the Wikimedia Operations and Release Engineering teams respectively, for their continued support and guidance.

Runa Bhattacharjee, and Kartik Mistry, Wikimedia Language Engineering team

by Guillaume Paumier at 14 November 2014 06:41 PM

November 13, 2014

Global By Design

The Four Seasons improves its global gateway

I was pleased to see the Four Seasons embrace the globe icon for its global gateway. It is well positioned in the upper right-hand corner. The Four Seasons website ranked 145th out of the 150 websites scored in the 2014 Web Globalization Report Card. I predict its ranking will improve in the 2015 edition!  

The post The Four Seasons improves its global gateway appeared first on Global by Design.

by John Yunker at 13 November 2014 10:07 PM

November 10, 2014

Global By Design

Amazon pluralizes Singles Day

Leave it to Amazon to turn Single Day plural. And why not. If we can extend Black Friday to Cyber Monday, why not extend Singles day an extra day? Here’s a screen grab of the Amazon China home page (note that the sale begins on 11/10): Nike is sticking with one day, for now. Here’s a Singles […]

The post Amazon pluralizes Singles Day appeared first on Global by Design.

by John Yunker at 10 November 2014 04:42 PM

November 06, 2014

Global By Design

The biggest ecommerce day in November? It’s not Black Friday.

In China, November 11th is known as Singles Day and it has quickly become the world’s biggest day for ecommerce. Tmall, the massive ecommerce website owned by Alibaba, is already promoting this day: Tmall hosts a great number of Western brands that are also eager to capitalize on this day, like Clinique: Xiaomi, China’s leading […]

The post The biggest ecommerce day in November? It’s not Black Friday. appeared first on Global by Design.

by John Yunker at 06 November 2014 03:19 AM

November 04, 2014

Wikimedia Foundation

guillaumepaumier

CLDR, the Common Locale Data Repository project from the Unicode Consortium, provides translated locale-specific information such as language names, country names, currencies, date/time formats, etc. that can be used in various applications. This library, used across several platforms, is particularly useful in maintaining parity of locale information in internationalized applications. In MediaWiki, the CLDR extension provides localized data and functions that can be used by developers.

The CLDR project constantly updates and maintains this database and publishes it twice a year. The information is periodically reviewed through a submission and vetting process. Individual participants and organisations can contribute during this process to improve and add to the CLDR data. The most recent version of CLDR was released in September 2014.

An important part of the CLDR data is the set of rules that determine how plurals are handled within the grammar of a language. In CLDR versions 25 and 26, plural rules for several languages were altered. These changes have now been incorporated into MediaWiki, which had still been using rules from CLDR version 24.

The affected languages are: Russian (ru), Abkhaz (ab), Avaric (av), Bashkir (ba), Buryat (bxr), Chechen (ce), Crimean Tatar (crh-cyrl), Chuvash (cv), Ingush (inh), Komi-Permyak (koi), Karachay-Balkar (krc), Komi (kv), Lak (lbe), Lezghian (lez), Eastern Mari (mhr), Western Mari (mrj), Yakut (sah), Tatar (tt), Tatar-Cyrillic (tt-cyrl), Tuvinian (tyv), Udmurt (udm), Kalmyk (xal), Prussian (prg), Tagalog (tl), Manx (gv), Mirandese (mwl), Portuguese (pt), Brazilian Portuguese (pt-br), Uyghur (ug), Lower Sorbian (dsb), Upper Sorbian (hsb), Asturian (ast) and Western Frisian (fy).
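As an illustration of what such plural rules look like, here is a sketch of the CLDR cardinal plural rules for Russian, restricted to integer values (in CLDR, fractional values select the "other" category). Real code should evaluate the CLDR rule data via a library rather than hand-coding it like this:

```python
# CLDR cardinal plural categories for Russian integers (one/few/many).
def russian_plural_category(i):
    if i % 10 == 1 and i % 100 != 11:
        return "one"   # 1, 21, 31, ... but not 11, 111
    if 2 <= i % 10 <= 4 and not 12 <= i % 100 <= 14:
        return "few"   # 2-4, 22-24, ... but not 12-14
    return "many"      # 0, 5-20, 25-30, ...
```

This is why a rule change forces translators back to their messages: when the categories shift, the set of plural forms a message needs can change.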

This change will have very little impact on our users. Translators, however, will have to review the user interface messages that have already been changed to include the updated plural forms. An announcement with the details of the change has been made; it also includes instructions for updating the translations for the languages mentioned above.

The CLDR MediaWiki extension, which provides convenient abstractions for getting country names, language names, etc., has also been upgraded to use CLDR 26. The Universal Language Selector and CLDRPluralRuleParser libraries have been upgraded to use the latest data as well.

The Wikimedia Foundation is a participating organisation in the CLDR project. Learn more about how you can be part of this effort.

Further reading about CLDR and its use in Wikimedia internationalization projects:

  1. http://laxstrom.name/blag/2014/01/05/mediawiki-i18n-explained-plural/
  2. http://thottingal.in/blog/2014/05/24/parsing-cldr-plural-rules-in-javascript/

Runa Bhattacharjee, Outreach and QA coordinator, Language Engineering, Wikimedia Foundation

by Guillaume Paumier at 04 November 2014 05:17 PM

The Content Translation tool can be used to translate articles more easily (here from Spanish to Portuguese). It provides features such as link cards, category adaptation (in development), and a warning to the editor when the text is coming exclusively from machine translation.

A few months back, the Language Engineering team of the Wikimedia Foundation announced the availability of the first version of the Content Translation tool, with machine translation support from Spanish to Catalan. The response from the Catalan Wikipedia editors was overwhelming and nearly 200 articles have already been created using the tool.

We have now enabled support for translating across Spanish, Portuguese and Catalan using Apertium as the machine translation back-end system. This extends our Spanish-to-Catalan initial launch.

The Content Translation tool is particularly useful for multilingual editors who can create new articles from corresponding articles in another language. The tool features a minimal rich-text editor with translation tools like dictionaries and machine translation support.


Development for the second version was completed on September 30, 2014. Due to technical difficulties in the deployment environment, availability of the updated version of the tool was delayed. As a result, the current deployment also includes some of the planned features from the next release, which is scheduled to be complete on November 18, 2014.

Highlights from this version

Some of the features included in this version originated from feedback received from the community, either during usability testing sessions, or as comments and suggestions from our initial users. Editors from the Catalan Wikipedia provided constant feedback after the first release of the tool and also during the recent roundtable.

Highlights:

  1. Automatic adaptation of categories.
  2. Text formatting with a simple toolbar in the Chrome browser. In Firefox, this support is limited to keyboard shortcuts (Ctrl-B for bold, Ctrl-I for italics).
  3. Bi-directional machine translation support for Spanish and Portuguese.
  4. Machine translation support from Catalan to Spanish.
  5. Paragraph alignment improvements to better match original and translated sentences.
  6. More accurate detection of machine translation suggestions used without further corrections, with warnings shown to the user.
  7. Redesigned top bar and progress bar.
  8. Numerous bug fixes.

How to Use

To use the tool, users can visit http://en.wikipedia.beta.wmflabs.org/wiki/Special:ContentTranslation and make the following selections:

  • source language – the language of the article to translate from. Can be Catalan, Spanish or Portuguese.
  • target language – the language of the article you will be translating into. Can be Catalan, Spanish or Portuguese.
  • article name – the title of the article to translate.

Users can also continue using the tool from the earlier available instance at http://es.wikipedia.beta.wmflabs.org/wiki/Especial:ContentTranslation

After translation, users can publish the translation in their own namespace on the same wiki and can choose to copy the page contents to the real Wikipedia for the target language. Please visit this link for more instructions on how to create and publish a new article.

Feedback and Participation

In the next few weeks, we will be reaching out to the editors from the Catalan, Spanish and Portuguese Wikipedia communities to gather feedback and also work closely to resolve any issues.

Please let us know about your feedback through the project talk page. You can also volunteer for our testing sessions.

Runa Bhattacharjee, Wikimedia Foundation, Language Engineering team

by Guillaume Paumier at 04 November 2014 04:59 AM

October 29, 2014

Global By Design

Is your global gateway stuck in the basement?

When you welcome visitors into your home, you probably don’t usher them directly to the basement. Yet when it comes to websites, this is exactly how many companies treat visitors from around the world. That is, they expect visitors to scroll down to the footer (basement) of their websites in order to find the global gateway. Now I want […]

The post Is your global gateway stuck in the basement? appeared first on Global by Design.

by John Yunker at 29 October 2014 11:43 PM


Contact: Richard Ishida (ishida@w3.org).