W3C   W3C Internationalization (I18n) Activity: Making the World Wide Web truly world wide!

Contributors

If you own a blog with a focus on internationalization, and want to be added or removed from this aggregator, please get in touch with Richard Ishida at ishida@w3.org.

All times are UTC.

Planet Web I18n

The Planet Web I18n aggregates posts from various blogs that talk about Web internationalization (i18n). While it is hosted by the W3C Internationalization Activity, the content of the individual entries represents only the opinion of their respective authors and does not reflect the position of the Internationalization Activity.

April 08, 2015

Wikimedia Foundation

The Content Translation tool makes it easier to create new Wikipedia articles from other languages. You can now start translations from your Contributions link, where you can find articles missing in your language. Screenshot by Runa Bhattacharjee, freely licensed under CC0 1.0


Since it was first introduced three months ago, the Content Translation tool has been used to write more than 850 new articles on 22 Wikipedias. This tool was developed by Wikimedia Foundation’s Language Engineering team to help multilingual users quickly create new Wikipedia articles by translating them from other languages. It includes an editing interface and translation tools that make it easy to adapt wiki-specific syntax, links, references, and categories. For a few languages, machine translation support via Apertium is also available.

Content Translation (aka CX) was first announced on January 20, 2015, as a beta feature on 8 Wikipedias: Catalan, Danish, Esperanto, Indonesian, Malay, Norwegian (Bokmål), Portuguese, and Spanish. Since then, Content Translation has been added gradually to more Wikipedias – mostly at the request of their communities. As a result, the tool is now available as a beta feature on 22 Wikipedias. Logged-in users can enable the tool as a preference on those sites, where they can translate articles from any of the available source languages (including English) into these 22 languages.

Here is what we have learned by observing how Content Translation was used by over 260 editors in the last three months.

Translators

Number of users who enabled this beta feature over time on Catalan Wikipedia. Graph by Runa Bhattacharjee, CC0 1.0

To date, nearly 1,000 users have manually enabled the Content Translation tool — and more than 260 have used it to translate a new article. Most translators are from the Catalan and Spanish Wikipedias, where the tool was first released as a beta feature.

Articles

Articles published using Content Translation. Graph by Runa Bhattacharjee, CC0 1.0

Articles created with the Content Translation tool cover a wide range of topics, such as fashion designers, Fields Medal winners, lunar seas and Asturian beaches. Translations can be in two states: published or in progress. Published articles appear on Wikipedia like any other new article and are improved collaboratively; these articles also include a tag indicating that they were created using Content Translation. In-progress translations are unpublished and appear on the individual dashboard of the translator working on them. Translations are saved automatically, and users can continue working on them at any time. When multiple users attempt to translate or publish the same article in the same language, they receive a warning. To avoid accidental overwrites, the other translators can publish their translations under their user pages — and make separate improvements to the main article. More than 875 new articles have been created since Content Translation was made available — 500 of them on the Catalan Wikipedia alone.

Challenges

When we first planned to release Content Translation, we decided to monitor how well the tool was being adopted — and whether it was indeed useful as a complement to the workflow editors use to create a new article. The development team also agreed to respond quickly to all queries and bug reports. Complex bugs and other feature fixes were planned into the development cycles. But finding the right solution for the publishing target proved to be a major challenge, from user experience to analytics. Originally, we did not support publishing into the main namespace of any Wikipedia: users had to publish their translations under their user pages first and then move them to the main namespace. However, this caused delays, confusion and sometimes conflicts when the articles were eventually moved for publication. In some cases, we also noticed that articles had not been counted correctly after publication. To avoid these issues, that original configuration was changed for all supported sites. A new translation is now published like any other new article, and if an article already exists or is created while the translation is in progress, the user is shown a warning.

New features

Considering the largely favorable response from our first users, we have now started to release the tool to more Wikipedias. New requests are promptly handled and scheduled, after language-specific checks to make sure that proposed changes will work for all sites. However, usage patterns have varied across the 22 Wikipedias. While some of the causes are outside of our control (like the total number of active editors), we plan to make several enhancements to make Content Translation easily discoverable by more users, at different points of the editing and reading workflows. For instance, when users are about to create a new article from scratch, a message gives them the option to start with a translation instead. Users can also see suggestions in the interlanguage link section for languages that they can translate an article into. And last but not least, the Contributions section now provides a link to start a new translation and find articles missing in your language (see image at the top of this post).

In coming months, we will continue to introduce new features and make Content Translation more reliable for our users. See the complete list of Wikipedias where Content Translation is currently available as a beta feature. We hope you will try it out as well, to create more content.

Runa Bhattacharjee, Language Engineering, Wikimedia Foundation

by Andrew Sherman at 08 April 2015 05:10 PM

April 07, 2015

Global By Design

Starbucks: The best global retail website

For the 2015 Web Globalization Report Card, we studied 13 retail websites: Best Buy, Costco, GameStop, Gap, H&M, IKEA, McDonald’s, Staples, Starbucks, Toys R Us, UNIQLO, Walmart and Zara. Out of those websites, Starbucks emerged as number one. Here is a screen shot from the German site: McDonald’s leads the category in languages supported, with 39 (in […]

by John Yunker at 07 April 2015 05:30 PM

April 06, 2015

Wikimedia Foundation

The Content Translation tool has made it a lot easier for Catalan Wikimedians to convert articles to and from different languages. Photo by Flamenc, freely licensed under CC BY-SA 3.0

Catalan Wikimedians are a very enthusiastic wiki community. In relation to the whole movement, we are mid-sized but one of the most active in terms of editors per millions of speakers.

Surprisingly, Catalan, our mother language, was banned for more than 40 years. Thankfully, editors like to use wikis for digital language activism. With Wikipedia (Viquipèdia, in Catalan) we founded a digital space where we can freely spread our language without real-life restrictions (governments, markets).

Almost 99% of Catalan speakers are bilingual and also speak Spanish. This means that content translation from Spanish Wikipedia happens frequently on our project. Some translate by hand, others use commercial platforms like Google Translate or freely licensed translation engines like Apertium. Some users even create their own translation bots, like the AmicalBot or EVA, which our community loves and uses often.

A few months ago, we heard news of the upcoming Wikimedia’s ContentTranslation tool, and we’re really happy to find that the very first language tests were planned between Spanish and Catalan. Our community responded to this news with great enthusiasm and we have been testing the tool for months now. The development team has kindly listened to our comments and demands, while implementing many of our shared recommendations.

At a personal level, I found the tool really helpful. It is easy to use and understand, and it greatly facilitates our work. I can now translate a 20-line article in less than 5 minutes, saving lots of time. Before, the worst part of translating articles was spending extra time translating reference templates and some of the wikicode. We understand the tool is not perfect yet, but nothing is perfect in a wiki environment: it is continuously being improved.

One of our community’s biggest challenges is updating different language wikis. We have good content about Catalan culture in the Catalan language, but we are not that good at exporting this content to other wikis. I personally hope that this tool can help us with both tasks.

I recommend that you try the ContentTranslation tool with an open mind and spend some time with it. Translate a few articles and if you find any bugs, please report them. When we say Wikipedia is a global project, we mean that it is multilingual, and this tool really helps us reach our shared vision of a world where every single human being can freely share in the sum of all knowledge.

Alex Hinojo, Amical Wikimedia community member

by Andrew Sherman at 06 April 2015 10:08 PM

Global By Design

Adobe points to external localized tutorials

Adobe provides French, German and Japanese tutorials for Photoshop Elements. But what about other languages? Until the funding comes along for additional translation, Adobe directs users to tutorials created in Spanish, Polish, Dutch and Russian. Simple and smart. I don’t know why more software companies don’t do this. PS: Adobe ranked #9 overall in this year’s Web […]

by John Yunker at 06 April 2015 05:20 PM

March 25, 2015

W3C I18n Activity highlights

Program published for W3C MultilingualWeb Workshop in Riga, 29 April

See the program. The keynote speaker will be Paige Williams, Director of Global Readiness, Trustworthy Computing, Microsoft. She is followed by a strong line-up in sessions entitled Developers and Creators, Localizers, Machines, and Users, including speakers from Microsoft, the European Parliament, the UN FAO, Intel, Verisign, and many more. The workshop is made possible by the generous support of the LIDER project.

Participation in the event is free. Please register via the Riga Summit for the Multilingual Digital Single Market site.

The MultilingualWeb workshops, funded by the European Commission and coordinated by the W3C, look at best practices and standards related to all aspects of creating, localizing and deploying the multilingual Web. The workshops are successful because they attract a wide range of participants, from fields such as localization, language technology, browser development, content authoring and tool development, etc., to create a holistic view of the interoperability needs of the multilingual Web.

We look forward to seeing you in Riga!

by Richard Ishida at 25 March 2015 08:28 AM

March 24, 2015

Global By Design

BMW & Chevrolet: The Best Global Automotive Websites

For the 2015 Web Globalization Report Card, we studied 14 automotive manufacturers and one supplier (Michelin): Audi, BMW, Chevrolet, Ford, Goodyear, Honda, Hyundai, Land Rover, Lexus, Mercedes, Michelin, Mini, Nissan, Toyota and Volkswagen. Out of those 15 websites, BMW and Chevrolet emerged in a numerical tie for number one. BMW and Chevrolet both support an impressive 41 languages, in addition to […]

by John Yunker at 24 March 2015 04:25 PM

March 16, 2015

Global By Design

Why you should be using geolocation for global navigation

In the 2015 Web Globalization Report Card, slightly more than half of the websites studied use geolocation specifically to improve global navigation. This is up significantly from just a few years ago. Geolocation is the process of identifying the IP address of a user’s computer or smartphone and responding with localized content or websites. Companies that […]
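The process described in the excerpt can be sketched in a few lines. Assuming a geo-IP lookup has already resolved the visitor's country code (the lookup itself, the table and the site paths below are hypothetical), the navigation step reduces to a mapping plus a fallback:

```javascript
// Hypothetical mapping from ISO country codes to localized site paths.
const localeByCountry = {
  DE: "/de-de/",
  FR: "/fr-fr/",
  JP: "/ja-jp/",
};

// Suggest a localized home page for a visitor. Best practice is to
// offer the localized site (e.g. via a banner), not force a redirect.
function suggestLocalizedHome(countryCode) {
  return localeByCountry[countryCode] || "/global-gateway/";
}

console.log(suggestLocalizedHome("DE")); // prints /de-de/
```

In practice the country code would come from a geo-IP database or a CDN header, and unknown countries fall through to a global gateway page.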

by John Yunker at 16 March 2015 11:12 AM

March 14, 2015

Global By Design

Armenia gets an IDN: հայ

This is not exactly breaking news, but Armenia now has an IDN: հայ Here it is in my fast-evolving IDN map: This means that 34 countries now have delegated IDNs.  

by John Yunker at 14 March 2015 08:59 PM

March 11, 2015

W3C I18n Activity highlights

Unicode 8.0 Beta Review

The Unicode® Consortium announced the start of the beta review for Unicode 8.0.0, which is scheduled for release in June, 2015. All beta feedback must be submitted by April 27, 2015.

Unicode 8.0.0 comprises several changes which require careful migration in implementations, including the conversion of Cherokee to a bicameral script, a different encoding model for New Tai Lue, and additional character repertoire. Implementers need to change code and check assumptions regarding case mappings, New Tai Lue syllables, Han character ranges, and confusables. Character additions in Unicode 8.0.0 include emoji symbol modifiers for implementing skin tone diversity, other emoji symbols, a large collection of CJK unified ideographs, a new currency sign for the Georgian lari, and six new scripts. For more information on emoji in Unicode 8.0.0, see the associated draft Unicode Emoji report.
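At the code-point level, the skin-tone mechanism is simple: one of the five new modifier characters (U+1F3FB to U+1F3FF) immediately follows a base emoji, and a supporting renderer displays a single modified glyph. A small JavaScript illustration (any language with Unicode string literals would do):

```javascript
// U+1F44B WAVING HAND SIGN followed by U+1F3FD EMOJI MODIFIER
// FITZPATRICK TYPE-4 forms one skin-tone-modified emoji.
const base = "\u{1F44B}";
const tone = "\u{1F3FD}";
const waved = base + tone;

// Both characters are outside the BMP, so each takes two UTF-16 code units.
console.log([...waved].length); // prints 2 (code points)
console.log(waved.length);      // prints 4 (UTF-16 code units)
```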

Please review the documentation, adjust code, test the data files, and report errors and other issues to the Unicode Consortium by April 27, 2015. Feedback instructions are on the beta page.

See more information about testing the 8.0.0 beta. See the current draft summary of Unicode 8.0.0.

by Richard Ishida at 11 March 2015 12:29 PM

March 04, 2015

Global By Design

Google to the Internet: Go mobile or watch your sales rank fall

Four years ago, for the Web Globalization Report Card, I began noting (and rewarding) those websites that supported mobile devices. Even then one could easily see the virtual grounds shifting in favor of mobile devices. But at the time, only about 20% of the websites studied supported mobile devices. In this year’s Report Card, the majority of websites are […]

by John Yunker at 04 March 2015 03:14 PM

February 26, 2015

W3C I18n Activity highlights

Speaker deadline for Riga MultilingualWeb Workshop is Sunday, 8 March

We would like to remind you that the deadline for speaker proposals for the 8th MultilingualWeb Workshop (April 29, 2015, Riga, Latvia) is on Sunday, March 8, at 23:59 UTC.

Featuring a keynote by Paige Williams (Director of Global Readiness, Trustworthy Computing at Microsoft) and sessions for various audiences (Web developers, content creators, localisers, users, and multilingual language processing), this workshop will focus on the advances and challenges faced in making the Web truly multilingual. It provides an outstanding and influential forum for thought leaders to share their ideas and gain critical feedback.

While the organizers have already received many excellent submissions, there is still time to make a proposal, and we encourage interested parties to do so by the deadline. With roughly 150 attendees anticipated for the Workshop from a wide variety of profiles, we are certain to have a large and diverse audience that can provide constructive and useful feedback, with stimulating discussion about all of the presentations.

The workshop is made possible by the generous support of the LIDER project and will be part of the Riga Summit 2015 on the Multilingual Digital Single Market. We are organizing the workshop as part of the Riga Summit to strengthen the broader European community. Depending on the number of submissions to the MultilingualWeb workshop, we may suggest moving some presentations to other days of the summit. For these reasons, we highly recommend attending the whole Riga Summit! See the line-up of speakers already confirmed for the various events during the summit.

For more information and to register a presentation proposal, please visit the Riga Workshop Call for Participation. For registration as a regular participant of the MultilingualWeb workshop or other events at the Riga Summit, please register at the Riga Summit 2015 site.

by Richard Ishida at 26 February 2015 11:30 AM

February 20, 2015

Global By Design

Web localization in the Year of the Sheep

I enjoy watching how Western companies localize their websites and products to capitalize on Chinese New Year — the Year of the Sheep (or Goat). Like this gift card from Starbucks China: And this hero image on the Microsoft China home page: And Nike has put together a color-appropriate assortment of products: Happy New Year!

by John Yunker at 20 February 2015 03:26 AM

February 18, 2015

Global By Design

LinkedIn adds Arabic

Nice to see that LinkedIn has added support for Arabic: This raises LinkedIn’s language total to 24 languages, including English. As a point of comparison, Facebook supports more than 70 languages.    

by John Yunker at 18 February 2015 05:20 PM

February 10, 2015

Global By Design

The top 25 global websites from the 2015 Web Globalization Report Card

I’m pleased to announce the publication of The 2015 Web Globalization Report Card. Here are the top-scoring websites from the report: For regular readers of this blog, you’ll notice that Google is once again ranked number one. The fact is, no other company on this list invests in web and software globalization like Google. While […]

by John Yunker at 10 February 2015 08:08 PM

February 05, 2015

W3C I18n Activity highlights

Paige Williams (Microsoft) to keynote at 8th Multilingual Web Workshop (April 29, 2015, Riga)

We are pleased to announce that Paige Williams, Director of Global Readiness, Trustworthy Computing at Microsoft, will deliver the keynote at the 8th Multilingual Web Workshop, “Data, content and services for the Multilingual Web,” in Riga, Latvia (29 April 2015).

Paige spent 10 years managing the internationalization of Microsoft.com before joining the Trustworthy Computing organization in 2005. In TwC, Paige oversees compliance with company policy for geographic, country-region and cultural requirements, establishing a new center of excellence for market and world readiness, globalization/localizability, and language programs, tools, resources and external community forums to reach markets across the world with the right local experience.

The Multilingual Web Workshop series brings together participants interested in the best practices, new technologies, and standards needed to help content creators, localizers, language tools developers, and others address the new opportunities and challenges of the multilingual Web. It will provide for networking across communities and building connections.

Registration for the Workshop is free, and early registration is recommended since space at the Workshop is limited.

The workshop will be part of the Riga Summit 2015 on the Multilingual Digital Single Market. We are organizing the workshop as part of the Riga Summit to strengthen the broader European community. Depending on the number of submissions to the MultilingualWeb workshop, we may suggest moving some presentations to other days of the summit. For these reasons, we highly recommend attending the whole Riga Summit!

There is still opportunity for individuals to submit proposals to speak at the workshop. Ideal proposals will highlight emerging challenges or novel solutions for reaching out to a global, multilingual audience. The deadline for speaker proposals is March 8, but early submission is strongly encouraged. See the Call for Participation for more details.

This workshop is made possible by the generous support of the LIDER project.

by Richard Ishida at 05 February 2015 11:30 AM

February 03, 2015

W3C I18n Activity highlights

Counter Styles: two documents published

The Cascading Style Sheets (CSS) Working Group has published a Candidate Recommendation of CSS Counter Styles Level 3. It adds new built-in counter styles to those defined in CSS 2.1, but, more importantly, it also allows authors to define custom styles for list markers, numbered headings and other types of generated content.

At the same time, the Internationalization Working Group has updated their Working Draft of Predefined Counter Styles, which provides custom rules for over a hundred counter styles in use around the world. It serves both as a ready-to-use set of styles to copy into your own style sheets, and also as a set of worked examples.
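For a feel of the syntax, here is a minimal custom style in the spirit of the Predefined Counter Styles draft (the style name and class are invented for this example; the @counter-style rule follows the Candidate Recommendation's grammar):

```css
/* A numeric counter style using the Devanagari digits 0-9. */
@counter-style devanagari-example {
  system: numeric;
  symbols: "०" "१" "२" "३" "४" "५" "६" "७" "८" "९";
}

/* Apply it to a list's markers. */
ol.hindi {
  list-style-type: devanagari-example;
}
```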

by Richard Ishida at 03 February 2015 06:18 PM

January 29, 2015

ishida>>blog » i18n

Bopomofo on the Web

Three bopomofo letters with tone mark.

Light tone mark in annotation.

A key issue for handling of bopomofo (zhùyīn fúhào) is the placement of tone marks. When bopomofo text runs vertically (either on its own, or as a phonetic annotation), some smarts are needed to display tone marks in the right place. This may also be required (though with different rules) for bopomofo when used horizontally for phonetic annotations (ie. above a base character), but not in all such cases. However, when bopomofo is written horizontally in any other situation (ie. when not written above a base character), the tone mark typically follows the last bopomofo letter in the syllable, with no special handling.
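For reference, the markup side of such a phonetic annotation is plain HTML ruby; the hard part discussed above (where the tone mark goes) is left to the renderer. A minimal illustrative fragment:

```html
<!-- 漢 (hàn) annotated with bopomofo; in horizontal annotation text the
     tone mark ˋ simply follows the last bopomofo letter. -->
<ruby>漢<rt>ㄏㄢˋ</rt></ruby>
```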

From time to time questions are raised on W3C mailing lists about how to implement phonetic annotations in bopomofo. Participants in these discussions need a good understanding of the various complexities of bopomofo rendering.

To help with that, I just uploaded a new Web page Bopomofo on the Web. The aim is to provide background information, and carry useful ideas from one discussion to the next. I also add some personal thoughts on implementation alternatives, given current data.

I intend to update the page from time to time, as new information becomes available.

by r12a at 29 January 2015 12:07 PM

January 20, 2015

Wikimedia Foundation


Video: How to translate a Wikipedia article in 3 minutes with Content Translation. This video can also be viewed on YouTube (4:10). Screencast by Pau Giner, licensed under CC BY-SA 4.0

Wikimedia Foundation’s Language Engineering team is happy to announce the first version of Content Translation on Wikipedia for 8 languages: Catalan, Danish, Esperanto, Indonesian, Malay, Norwegian (Bokmål), Portuguese and Spanish. Content Translation, available as a beta feature, provides a quick way to create new articles by translating from an existing article into another language. It is also well suited for new editors looking to familiarize themselves with the editing workflow. Our aim is to build a tool that leverages the power of our multicultural global community to further Wikimedia’s mission of creating a world where every single human being can share in the sum of all knowledge.

Design

During early 2014, when the design ideas for Content Translation were being conceptualized, we came across an interesting study by Scott A. Hale of the University of Oxford, on the influences and editing patterns of multilingual editors on Wikipedia. Combined with feedback from editors we interacted with, the data presented in the study guided our initial choices, both in terms of features and languages. We were fortunate to have met the researcher in person at Wikimania 2014, so we could learn more about his findings and references.

The tool was designed for multilingual editors as our main target users. Several important patterns emerged from a month-long user study, including:

  • Multilingual editors are relatively more active in Wikipedias of smaller size. Often, editors from smaller Wikipedias would also edit a relatively large Wikipedia like English or German;
  • Multilingual editors often edited the same articles in their primary and non-primary languages.

These and other factors listed in the study impact the transfer of content between different language versions of Wikipedia; they increase content parity between versions — and decrease ‘self-focus’ bias in individual editions.

Languages

When selecting languages for the tool’s introduction, we were guided by several factors, including signs of relatively high multilingualism amongst the primary editors. The availability of high quality machine-translated content was an additional consideration, to fully explore the usability of the core editing workflow designed for the tool. Based on these considerations, Catalan Wikipedia, a very actively edited project of medium size, was a logical choice. Subsequent language selections were made by studying possible overlap trends between language users — and the probability of editors benefiting from those overlaps when creating new articles. Availability of machine translation to speed up the process and community requests were important considerations.

How it works

The article Abel Martín in the Spanish Wikipedia doesn’t have a version in Portuguese, so a red link to Portuguese is shown.
Content Translation red interlanguage link screenshot by Amire80, licensed under CC BY-SA 4.0

Content Translation combines a rich text translation interface with tools targeted for editing — and machine translation support for most language pairs. It integrates different tools to automate repetitive steps during translation: it provides an initial automatic translation while keeping the original text format, links, references, and categories. To do so, the tool relies on the inter-language connections from Wikidata, html-to-wikitext conversion from Parsoid, and machine translation support from Apertium. This saves time for editors and allows them to focus on creating quality content.
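As an illustration of the first of those pieces, a client can ask the MediaWiki API for an article's interlanguage links. The sketch below is not the tool's actual code (the helper name and example titles are invented), but action=query with prop=langlinks is the real API shape:

```javascript
// Build a MediaWiki API URL asking for an article's interlanguage
// links in one target language (lllang filters the langlinks result).
function langlinksUrl(wiki, title, targetLang) {
  const params = new URLSearchParams({
    action: "query",
    prop: "langlinks",
    titles: title,
    lllang: targetLang,
    format: "json",
  });
  return `https://${wiki}/w/api.php?${params}`;
}

// E.g. the Spanish article "Abel Martín" looked up for Portuguese:
console.log(langlinksUrl("es.wikipedia.org", "Abel Martín", "pt"));
```

An empty langlinks result for the target language is what surfaces as a red interlanguage link, the cue that a translation is missing.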

Although basic text formatting is supported, the purpose of the tool is to create an initial version of the content that each community can keep improving with their usual editing tools. Content Translation is not intended to keep the information in sync across multiple language versions, but to provide a quick way to reuse the effort already made by the community when creating an article from scratch in a different language.

The tool can be accessed in different ways. There is a persistent access point at your contributions page, but access to the tool is also provided in situations where you may want to translate the content you are just reading. For instance, a red link in the interlanguage link area (see image).

Next steps

Next steps for the tool’s future development include adding support for more – eventually all – languages, managing lists of articles to translate, and adding features for more streamlined translation.

In coming weeks, we will closely monitor feedback from users and interact with them to guide our future development. Please read the release announcement for more details about the features and instructions on using the tool. Thank you!

Amir Aharoni, Pau Giner, Runa Bhattacharjee, Language Engineering, Wikimedia Foundation

by fflorin2015 at 20 January 2015 06:56 PM

January 18, 2015

ishida>>blog » i18n

Bengali picker & character & script notes updated

Version 16 of the Bengali character picker is now available.

Other than a small rearrangement of the selection table, and the significant standard features that version 16 brings, this version adds the following:

  • three new buttons for automatic transcription between Latin and Bengali. You can use these buttons to transcribe to and from Latin transcriptions using the ISO 15919 or Radice approaches.
  • hinting to help identify similar characters.
  • the ability to select the base character for the display of combining characters in the selection table.

For more information about the picker, see the notes at the bottom of the picker page.

In addition, I made a number of additions and changes to Bengali script notes (an overview of the Bengali script), and Bengali character notes (an annotated list of characters in the Bengali script).

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

by r12a at 18 January 2015 08:10 AM

January 15, 2015

Global By Design

Global Gateway Fail: Yandex

Yandex is Russia’s leading search engine and, following in Google’s footsteps, is eager to take over much of Russia’s Internet, which naturally includes the web browser. Yandex is also in the process of expanding its reach beyond Russia. But when I visited the web browser download page I couldn’t help but notice a few problems with the global […]

by John Yunker at 15 January 2015 01:25 AM

January 13, 2015

ishida>>blog » i18n

Initial letter styling in CSS

The CSS WG needs advice on initial letter styling in non-Latin scripts, ie. enlarged letters or syllables at the start of a paragraph like those shown in the picture. Most of the current content of the recently published Working Draft, CSS Inline Layout Module Level 3 is about styling of initial letters, but the editors need to ensure that they have covered the needs of users of non-Latin scripts.

The spec currently describes drop, sunken and raised initial characters, and allows you to manipulate them using the initial-letter and the initial-letter-align properties. You can apply those properties to text selected by ::first-letter, or to the first child of a block (such as a span).
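A sketch of the draft syntax (the selectors and values are just an example; browser support at the time was limited, so treat this as illustrative of the Working Draft rather than production-ready):

```css
/* A drop cap: with one value, the glyph spans three lines and
   sinks three lines (a classic drop initial). */
p::first-letter {
  initial-letter: 3;
  initial-letter-align: alphabetic;
}

/* A raised initial: three lines tall but sunk only one line. */
p.raised::first-letter {
  initial-letter: 3 1;
}
```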

The editors are looking for

any examples of drop initials in non-western scripts, especially Arabic and Indic scripts.

I have scanned some examples from newspapers (so, not high quality print).

In the section about initial-letter-align the spec says:

Input from those knowledgeable about non-Western typographic traditions would be very helpful in describing the appropriate alignments. More values may be required for this property.

Do you have detailed information about initial letter styling in a non-Latin script that you can contribute? If so, please write to www-style@w3.org (how to subscribe).

by r12a at 13 January 2015 12:13 PM

January 10, 2015

Wikimedia Foundation

"Cx-new-languages" by Runabhattacharjee, under CC-Zero


The new Content Translation tool’s language selector will make it easier to translate Wikipedia articles.
Content Translation tool screenshot by Runabhattacharjee, licensed under CC-0

In early December 2014, the Wikimedia Foundation’s Language Engineering team announced the release of the third version of our Content Translation tool, which aims to make it easier to translate Wikipedia articles. Since then, our focus has been to take the tool to the next step and make it more widely available. Encouraged by the feedback we have received in the last 6 months, we are now happy to announce that the tool will soon be available in 8 Wikipedias as a beta feature. Users of Catalan, Danish, Esperanto, Indonesian, Malay, Norwegian (Bokmål), Portuguese, and Spanish Wikipedias will be able to use Content Translation from mid-January 2015. The tool will also be enabled on the Norwegian (Nynorsk) and Swedish Wikipedias, but only to facilitate their use as sources for Norwegian (Bokmål) and Danish respectively.

Users of Catalan, Spanish and Portuguese wikis have already previewed the tool on the Wikimedia beta servers and it was a natural choice to add these three languages in our first set for deployment. The remaining five languages were chosen based on user survey results and community requests. These languages are also available on the Wikimedia beta servers where Content Translation has been hosted since July 2014.

Currently, the Language Engineering team is completing the final phases for enabling Content Translation as a beta feature. After deployment, users will be able to translate Wikipedia articles into the language of their choice (restricted to the above-mentioned eight languages) from appropriate source languages available for that language. For most of these languages, machine translation between the source and target language pairs will be made available through Apertium. English will be enabled as a source language for all languages, but without machine translation support, except for English to Esperanto, where the machine translations have been found to be satisfactory.

We will make further announcements as we close in on the deployment date. It’s possible that the beta feature may become available on the wikis for testing before the announcements are out. Prior to that, the Language Engineering team will also host an IRC ‘office hours’ discussion on January 14th at 1600 UTC on #wikimedia-office.

Meanwhile, we welcome users to try out Content Translation and to bring to our attention any issues or suggestions. You can also help us prepare Content Translation to support more languages by filling in the language evaluation survey.

Runa Bhattacharjee, Language Engineering, Wikimedia Foundation

by fflorin2015 at 10 January 2015 10:39 PM

January 06, 2015

ishida>>blog » i18n

The Combining Character Conundrum

I’m struggling to show combining characters on a page in a consistent way across browsers.

For example, while laying out my pickers, I want users to be able to click on a representation of a character to add it to the output field. In the past I resorted to pictures of the characters, but now that webfonts are available, I want to replace those with font glyphs. (That makes for much smaller and more flexible pages.)

Take the Bengali picker that I’m currently working on. I’d like to end up with something like this:

[image: the desired rendering of the Bengali picker’s combining-character cells]

I put a no-break space before each combining character, to give it some width, and because that’s what the Unicode Standard recommends (p60, Exhibiting Nonspacing Marks in Isolation). The result is close to what I was looking for in Chrome and Safari, except that you can see a gap for the nbsp to the left.

[image: the result in Chrome and Safari, with a small gap for the no-break space on the left]

But in IE and Firefox I get this:

[image: the result in IE and Firefox, with the combining characters misplaced]

This is especially problematic since it messes up the overall layout, but in some cases it also causes text to overlap.
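As a side note, when building a selection table like this it helps to check programmatically which characters are combining marks and so need some kind of base. A minimal sketch using Python’s standard unicodedata module (the Bengali vowel sign is just an example character, not special to the picker):

```python
import unicodedata

# U+00A0 NO-BREAK SPACE as the recommended base, followed by
# U+09C7 BENGALI VOWEL SIGN E, one of the marks discussed above.
base = "\u00A0"
mark = "\u09C7"
isolated = base + mark  # two code points, displayed as one cluster

# General category 'Mn' (non-spacing mark) or 'Mc' (spacing combining
# mark) identifies characters that need a base. U+09C7 is a spacing
# mark, so combining() reports class 0, but the category still flags it.
print(unicodedata.name(mark))        # BENGALI VOWEL SIGN E
print(unicodedata.category(mark))    # Mc
print(unicodedata.category(base))    # Zs
```

This only tells you which cells need special treatment; as the rest of the post shows, what the browser then draws for them is another matter entirely.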

I tried using a dotted circle Unicode character, instead of the no-break space. On Firefox this looked ok, but on Chrome it resulted in two dotted circles per combining character.

I considered using a consonant as the base character. It would work ok, but it would possibly widen the overall space needed (not ideal) and would make it harder to spot a combining character by shape. I tried putting a span around the base character to grey it out, but the various browsers reacted differently to the span. Vowel signs that appear on both sides of the base character no longer worked – the vowel sign appeared entirely after the base character. In other cases, the grey of the base character was inherited by the whole grapheme, regardless of the fact that the combining character was outside the span. (Here are some examples ে and ো.)

In the end, I settled for no preceding base character at all. The combining character was the first thing in the table cell or span that surrounded it. This gave the desired result for the font I had been using, though I needed to tweak the occasional character with padding to move it slightly to the right.

This was not a complete solution either, however. Whereas most of the fonts I planned to use produce the dotted circle in these conditions, one of my favourites (SolaimanLipi) doesn’t produce it. This leads to significant problems, since many combining characters appear far to the left; in some cases it is not possible to click on them, and in others you have to locate a blank space somewhere to the right and click on that. Not at all satisfactory.

[image: SolaimanLipi rendering with no dotted circles, the combining characters shifted far to the left]

I couldn’t find a better way to solve the problem, however, and since there were several Bengali fonts to choose from that did produce dotted circles, I settled for that as the best of a bad lot.

However, I then turned my attention to other pickers and tried the same solution. I found that only one of the many Thai fonts I tried for the Thai picker produced the dotted circles. So the approach here would have to be different. For Khmer, the main Windows font (Daunpenh) produced dotted circles only for some of the combining characters in Internet Explorer. And on Chrome, a sequence of two combining characters, one after the other, produced two dotted circles…

I suspect that I’ll need to choose an approach for each picker based on what fonts are available, and perhaps provide an option to insert or remove base characters before combining characters when someone wants to use a different font.

It would be nice to standardise behaviour here, and to do so in a way that involves the no-break space, as described in the Unicode Standard, or some other base character such as – why not? – the dotted circle itself. I assume that the fix for this would have to be handled by the browser, since there are already many font cats out of the bag.

Does anyone have an alternate solution? I thought I heard someone at the last Unicode conference mention some way of controlling the behaviour of dotted circles via some script or font setting…?

Update: See Marc Durdin’s blog for more on this topic, and his experiences while trying to design on-screen keyboards for Lao and other scripts.

by r12a at 06 January 2015 05:28 PM

January 05, 2015

Global By Design

Domain name registrations surpass 280 million

According to Verisign’s domain name industry brief, TLD registrations hit 280 million in the second quarter of 2014. Here’s a list of the leading domains overall: Note that the .TK ccTLD is technically a country code but is marketed as a generic TLD, and quite successfully it seems. So Germany is effectively the leading ccTLD. […]

by John Yunker at 05 January 2015 07:15 PM

ishida>>blog » i18n

Khmer character picker v16

[image: screenshot of the Khmer character picker, version 16]

I have uploaded a new version of the Khmer character picker.

The new version uses characters instead of images for the selection table, making it faster to load and more flexible. If you prefer, you can still access the previous version.

Other than a small rearrangement of the default selection table to accommodate fonts rather than images, and the significant standard features that version 16 brings, there are no additional changes in this version.

For more information about the picker, see the notes at the bottom of the picker page.

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

by r12a at 05 January 2015 10:12 AM

Devanagari, Gurmukhi & Uighur pickers available

[image: screenshot of the Uighur picker, version 16]

[image: screenshot of the Devanagari picker, version 16]

[image: screenshot of the Gurmukhi picker, version 16]

I have updated the Devanagari picker, the Gurmukhi picker and the Uighur picker to version 16.

You may have spotted a previous, unannounced, version of the Devanagari and Uighur pickers on the site, but essentially these versions should be treated as new. The Gurmukhi picker has been updated from a very old version.

In addition to the standard features that version 16 of the character pickers brings, things to note include the addition of hints for all pickers and, for the Devanagari picker, automated transcription from Devanagari to ISO 15919 and vice versa.
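The core of a Devanagari-to-ISO 15919 conversion is a lookup table plus handling for the inherent vowel and the virama. The sketch below is hypothetical and heavily simplified (tiny tables, no conjunct or digit handling), just to show the shape of the problem the picker’s converter solves:

```python
# Hypothetical, heavily simplified Devanagari -> ISO 15919 sketch.
# These tables cover only a handful of characters; a complete converter
# must handle many more signs, conjuncts, nasalisation and digits.
CONSONANTS = {"\u0915": "k", "\u0917": "g", "\u092E": "m", "\u0930": "r", "\u0932": "l"}
INDEPENDENT_VOWELS = {"\u0905": "a", "\u0906": "\u0101", "\u0907": "i"}
VOWEL_SIGNS = {"\u093E": "\u0101", "\u093F": "i", "\u0940": "\u012B"}
VIRAMA = "\u094D"

def to_iso15919(text):
    out, chars, i = [], list(text), 0
    while i < len(chars):
        c = chars[i]
        if c in CONSONANTS:
            out.append(CONSONANTS[c])
            nxt = chars[i + 1] if i + 1 < len(chars) else ""
            if nxt in VOWEL_SIGNS:          # explicit vowel sign replaces...
                out.append(VOWEL_SIGNS[nxt])
                i += 2
                continue
            if nxt == VIRAMA:               # ...and virama suppresses...
                i += 2
                continue
            out.append("a")                 # ...the inherent vowel
        elif c in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[c])
        else:
            out.append(c)                   # pass anything else through
        i += 1
    return "".join(out)

print(to_iso15919("\u0930\u093E\u092E"))    # राम -> rāma
```

Going the other way (Latin to Devanagari) is where the ambiguity pop-ups described in the picker notes come in, since one Latin letter can map to several Devanagari characters.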

For more information about the pickers, see the notes at the bottom of the relevant picker page.

by r12a at 05 January 2015 09:45 AM

January 04, 2015

ishida>>blog » i18n

More picker changes: Version 16

A couple of posts ago I mentioned that I had updated the Thai picker to version 16. I have now updated a few more. For ease of reference, I will list here the main changes between version 16 pickers and previous versions back to version 12.

  • Fonts rather than graphics. The main selection table in version 12 used images to represent characters. These have now gone, in favour of fonts. Most pickers include a web font download to ensure that you will see the characters. This reduces the size and download time significantly when you open a picker. Other source code changes have reduced the size of the files even further, so that the main file is typically only a small fraction of the size it was in version 14.

    It is also now possible, in version 16, to change the font of the main selection table and the font size.

  • UI. The whole look and feel of the user interface has changed from version 14 onwards, and includes useful links and explanations off the top of the normal work space.

    In particular, the vertical menu, introduced in version 14, has been adjusted so that input features can be turned on and off independently, and new panels appear alongside the others, rather than toggling the view from one mode to another. So, for example, you can have hints and shape-based selectors turned on at the same time. When something is switched on, its label in the menu turns orange, and the full text of the option is followed by a check mark.

  • Transcription panels. Some pickers had one or more transcription views in versions below 16. These enable you to construct some non-Latin text when working from a Latin transcription. In version 16 these alternate views are converted to panels that can be displayed at the same time as other information. They can be shown or hidden from the vertical menu. When there is ambiguity as to which characters to use, a pop up displays alternatives. Click on one to insert it into the output. There is also a panel containing non-ASCII Latin characters, which can be used when typing Latin transcriptions directly into the main output area. This panel is now hidden by default, but can be easily shown from the vertical menu.

  • Automated transcription. Version 16 pickers carry forward, and in some cases add, automated transcription converters. In some cases these are intended to generate only an approximation to the needed transcription, in order to speed up the transcription process. In other cases, they are complete. (See the notes for the picker to tell which is which.) Where there is ambiguity about how to transcribe a sequence of characters, the interface offers you a choice from alternatives. Just click on the character you want and it will replace all the options proposed. In some cases, particularly for South-East Asian scripts, the text you want to transcribe has to be split into syllables first, using spaces and/or hyphens. Where this is necessary, a condense button is provided, to quickly strip out the separators after the transcription is done.

  • Layout. The default layout of the main selection table has usually been improved, to make it easier to locate characters. Rarely used, deprecated, etc., characters appear below the main table, rather than to the right.

  • Hints. Very early versions of the pickers used to automatically highlight similar and easily confusable characters when you hovered over a character in the main selection table. This feature is being reintroduced as standard for version 16 pickers. It can be turned on or off from the vertical menu. This is very helpful for people who don’t know the script well.

  • Shape-based selection. In previous versions the shape-based view replaced the default view. In version 16 the shape selectors appear below the main selection table and highlight the characters in that table. This arrangement has several advantages.

  • Applying actions to ranges of text. When clicking on the Codepoints and Escapes buttons, it is possible to apply the action to a highlighted range of characters, rather than all the characters in the output area. It is also possible to transcribe only highlighted text, when using one of the automated transcription features.

  • Phoneme bank. When composing text from a Latin transcription in previous versions you had to make choices about phonetics. Those choices were stored on the UI to speed up generation of phonetic transcriptions in addition to the native text, but this feature somewhat complicated the development and use of the transcription feature. It has been dropped in version 16. Hopefully, the transcription panels and automated transcription features will be useful enough in future.

  • Font grid. The font grid view was removed in version 16. It is of little value when the characters are already displayed using fonts.

by r12a at 04 January 2015 12:53 PM

January 01, 2015

Global By Design

Most popular posts of 2014

Looking over traffic logs, here are some of the most-visited blog posts of 2014: Wikipedia and the Internet language chasm; The top 25 global websites from the 2014 Web Globalization Report Card; The worst global websites of the 2014 Web Globalization Report Card; Gmail to be first major platform to support non-Latin email addresses; Google […]

by John Yunker at 01 January 2015 04:12 PM

December 26, 2014

ishida>>blog » i18n

Language Subtag Lookup tool updated

This update to the Language Subtag Lookup tool brings back the Check function that had been out of action since last January. The code had to be pretty much completely rewritten to migrate it from the original PHP. In the process, I added support for extension and private use tags, and added several more checks. I also made various changes to the way the results are displayed.

Give it a try with this rather complicated, but valid language tag: zh-cmn-latn-CN-pinyin-fonipa-u-co-phonebk-x-mytag-yourtag

Or try this rather badly conceived language tag, to see some error messages: mena-fr-latn-fonipa-biske-x-mylongtag-x-shorter
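The overall structure such a checker has to deal with can be sketched in a few lines. This splitter is a rough illustration only, not the tool’s logic and not a validator (real checking needs the IANA registry and the full grammar from RFC 5646); it merely partitions a tag into main subtags, extension sequences, and private-use subtags:

```python
def split_language_tag(tag):
    """Partition a BCP 47-style language tag into (main subtags,
    extension sequences, private-use subtags).

    A rough sketch only: real validation needs the IANA subtag
    registry and the full ABNF from RFC 5646."""
    subtags = tag.lower().split("-")
    main, extensions, private = [], [], []
    current = main
    for s in subtags:
        if s == "x" and current is not private:
            current = private            # everything after x- is private use
        elif len(s) == 1 and current is not private:
            extensions.append([s])       # a singleton starts an extension
            current = extensions[-1]
        elif current is main:
            main.append(s)
        else:
            current.append(s)
    return main, extensions, private

main, ext, priv = split_language_tag(
    "zh-cmn-Latn-CN-pinyin-fonipa-u-co-phonebk-x-mytag-yourtag")
print(main)   # ['zh', 'cmn', 'latn', 'cn', 'pinyin', 'fonipa']
print(ext)    # [['u', 'co', 'phonebk']]
print(priv)   # ['mytag', 'yourtag']
```

Even this crude split shows why the second example tag above is badly conceived: it mixes subtags in orders and combinations that the registry and grammar rule out.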

The IANA database information is up-to-date. The tool currently supports the IANA Subtag registry of 2014-12-17. It reports subtags for 8,081 languages, 228 extlangs, 174 scripts, 301 regions, 68 variants, and 26 grandfathered subtags.

by r12a at 26 December 2014 08:11 AM

December 21, 2014

ishida>>blog » i18n

Thai character picker v16

I have uploaded another new version of the Thai character picker.

Sorry this follows so quickly on the heels of version 15, but as soon as I uploaded v15 several ideas on how to improve it popped into my head. This is the result. I will hopefully bring all the pickers, one by one, up to the new version 16 format. If you prefer, you can still access version 12.

The main changes include:

  • UI. Adjustment of the vertical menu, so that input features can be turned on and off independently, and new panels appear with the others, rather than toggling from one to another. So, for example, you can have hints and shape-based selectors turned on at the same time. When something is switched on, its label in the menu turns orange, and the full text of the option is followed by a check mark.
  • Transcription panels. Panels have been added to enable you to construct some Thai text when working from a Latin transcription. This brings the transcription inputs of version 12 into version 16, but in a more compact and simpler way, one that gives you continued access to the standard table for special characters.

    There are currently options to transcribe from ISO 11940-2 (although there are some gaps in that), or from the transcription used by Benjawan Poomsan Becker in her book, Thai for Beginners. These are both transcriptions based on phonetic renderings of the Thai, so there is often ambiguity about how to transcribe a particular Latin letter into Thai. When such an ambiguity occurs, the interface offers you a choice via a small pop-up. Just click on the character you want and it will be inserted into the main output area.

    The transcription panels are useful because you can add a whole vowel at a time, rather than picking the individual vowel signs that compose it. An issue arises, however, when the vowel signs that make up a given vowel contain one that appears to the left of the syllable-initial consonant(s). This is easily solved by highlighting the syllable in question and clicking on the reorder button. The vowel sign in question will then appear as the first item in the highlighted text.

    There is also a panel containing non-ASCII Latin characters, which can be used when typing Latin transcriptions directly into the main output area. (This was available in v15 too, but has been made into a panel like the others, which can be hidden when not needed.)

  • Tones for automatic IPA transcriptions. The automatic transcription to IPA now adds tone marks. These are usually correct, but, as with other aspects of the transcription, it doesn’t take into account the odd idiosyncrasy in Thai spelling, so you should always check that the output is correct. (Note that there is still an issue for some of the ambiguous transcription cases, mostly involving RA.)
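The reorder step described above (moving a left-appearing vowel sign to the front of a highlighted syllable) can be sketched roughly as follows; the preposed vowels are the standard Thai range U+0E40–U+0E44, but this is an illustration of the idea, not the picker’s actual code:

```python
# Thai vowels that display to the left of their consonant but are
# stored first: SARA E, SARA AE, SARA O, SARA AI MAIMUAN,
# SARA AI MAIMALAI (U+0E40..U+0E44).
PREPOSED = {chr(cp) for cp in range(0x0E40, 0x0E45)}

def reorder_syllable(syllable):
    """Move any preposed vowel signs to the start of the syllable,
    keeping the relative order of everything else."""
    prefix = [c for c in syllable if c in PREPOSED]
    rest = [c for c in syllable if c not in PREPOSED]
    return "".join(prefix + rest)

# KO KAI entered first, then SARA E (phonetic order): the vowel is
# moved to the front, giving the correct stored order for เก.
print(reorder_syllable("\u0E01\u0E40") == "\u0E40\u0E01")  # True
```

A syllable with no preposed vowel passes through unchanged, which is why the button is safe to apply to any highlighted text.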

For more information about the picker, see the notes at the bottom of the picker page.

by r12a at 21 December 2014 09:18 AM


Contact: Richard Ishida (ishida@w3.org).