So today is the big day for the people of Scotland as well as the UK. One question that occurs to country code geeks such as myself is what country code domain would Scotland use if/when it became separate from .UK? It turns out that one domain is already available right now: .scot. However, this isn’t technically a […]
The Planet Web I18n aggregates posts from various blogs that talk about Web internationalization (i18n). While it is hosted by the W3C Internationalization Activity, the individual entries represent only the opinions of their respective authors and do not reflect the position of the Internationalization Activity.
September 18, 2014
September 17, 2014
The Encoding specification has been published as a Candidate Recommendation. This is a snapshot of the WHATWG document, as of 4 September 2014, published after discussion with the WHATWG editors. No changes have been made in the body of this document other than to align with W3C house styles. The primary reason that W3C is publishing this document is so that HTML5 and other specifications may normatively refer to a stable W3C Recommendation.
Going forward, the Internationalization Working Group expects to receive more comments in the form of implementation feedback and test cases. The Working Group believes it will have satisfied its implementation criteria no earlier than 16 March 2015. If you would like to contribute test cases or information about implementations, please send mail to email@example.com.
The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding.
The other (legacy) encodings have been defined to some extent in the past. However, user agents have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification addresses those gaps so that new user agents do not have to reverse engineer encoding implementations and existing user agents can converge.
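The spec's requirement that decoders behave identically on errors can be illustrated with a small sketch. Python is not a browser, but its utf-8 codec follows the same rule of emitting U+FFFD for ill-formed sequences in "replace" mode; the byte values below are arbitrary examples:

```python
# Well-formed UTF-8 round-trips exactly.
text = "Ünïcödé"
encoded = text.encode("utf-8")
assert encoded.decode("utf-8") == text

# An ill-formed sequence: 0xC3 opens a two-byte sequence, but 0x28 ("(")
# cannot continue it, so a conforming decoder emits U+FFFD and resynchronizes.
bad = b"\xc3\x28"
assert bad.decode("utf-8", errors="replace") == "\ufffd("
```

The point of the specification is that every user agent resolves cases like `bad` the same way, instead of each reverse engineering the others.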
September 11, 2014
September 10, 2014
I’ve been meaning to write about this for a while. A few months ago, Apple CEO Tim Cook reportedly said this at an investor meeting: “When we work on making our devices accessible by the blind,” he said, “I don’t consider the bloody ROI.” I love this quote. And I love any CEO who knows when the […]
September 03, 2014
A few things. First, using flags to indicate language is almost always a mistake. Second, why are the language names all in English? Only the “English language” text needs to be in English. The purpose of the gateway is to communicate with speakers of other languages, not just English speakers. Finally, do we need “Language” at all? […]
August 27, 2014
I don’t know of any large company that translates all of its content into all of its target languages. I won’t go into the many reasons for why this is — money being the major reason — but I will say that if this is an issue you struggle with you’re not alone. The key to […]
August 26, 2014
I still look at the Buick brand as something for the post-60 demographic (though I must confess that demographic doesn’t feel quite so old anymore). It’s an image Buick has been working to change for years. But the beauty of globalization is that Buick doesn’t carry this sort of generational baggage in other countries. Like China. The Chinese apparently love […]
Amir Aharoni of the Wikimedia Language Engineering team introduces the Content Translation tool to the student delegation from Kazakhstan at Wikimania 2014, in London.
On July 17, 2014, the Wikimedia Language Engineering team announced the deployment of the ContentTranslation extension in Wikimedia Labs. This first deployment was targeted primarily for translation from Spanish to Catalan. Since then, users have expressed generally positive feedback about the tool. Most of the initial discussion took place in the Village pump (Taverna) of the Catalan Wikipedia. Later, we had the opportunity to showcase the tool to a wider audience at Wikimania in London.
In the first two weeks, 29 articles were created using the Content Translation tool and published in the Catalan Wikipedia. Article topics were diverse, ranging from places in Malta and companies in Italy to a river, a monastery, a political manifesto, and a prisoner of war. As the Content Translation tool is also used for testing by the developers and other volunteers, the full list of articles that make it to a Wikipedia is regularly updated. The Language Engineering team also started addressing some of the bugs that were encountered, such as issues with paragraph alignment and the stability of the machine translation controller.
The number of articles published using Content Translation has now crossed 100, and its usage has not been limited to the Catalan Wikipedia: users have been creating articles in other languages like Gujarati and Malayalam, although machine translation has not been extended beyond Spanish-Catalan yet. All the pages that were published as articles received further edits for wikification, grammar correction, and in some cases meaningful enhancement. A deeper look at the edits revealed that the additional changes were made first by the same user who made the initial translation, and later by other editors or bots.
Wikimania in London
The Language Engineering team members worked closely with Wikimedians to better understand the requirements of languages like Arabic, Persian, Portuguese, Tajik, Swedish and German; this input will be instrumental in extending support to these languages.
The development of ContentTranslation continues. Prior to Wikimania, the Language Engineering team met to evaluate the response and effectiveness of the first release of the tool, and prepared the goals for the next release. The second release is slated for the last week of September 2014. Among the features planned are support for more languages (machine translation, dictionaries), a smarter entry point to the translation UI, and basic editor formatting. It is expected that translation support from Catalan to Spanish will be activated by the end of August 2014. Read the detailed release plan and goals to know more.
Over the next couple of months, the Language Engineering team intends to work closely with our communities to better understand how the Content Translation tool has helped editors so far and how it can serve the global community better with the translation aids and resources currently integrated with the tool. We welcome feedback at the project talk page. Get in touch with the Language Engineering team for more information and feedback.
Amir Aharoni and Runa Bhattacharjee, Language Engineering, Wikimedia Foundation
August 23, 2014
August 22, 2014
It’s disappointing to see that non-standard uses of UTF-8 are being used by the BBC on their BBC Burmese Facebook page.
Take, for example, the following text.
On the actual BBC site it looks like this (click on the Burmese text to see a list of the characters used):
As far as I can tell, this is conformant use of Unicode codepoints.
Look at the same title on the BBC’s Facebook page, however, and you see:
Depending upon where you are reading this (as long as you have some Burmese font and rendering support), one of the two lines of Burmese text above will contain lots of garbage. For me, it’s the second (non-standard).
This non-standard approach uses visual encoding for combining characters that appear before or on both sides of the base, uses Shan or Rumai Palaung codepoints for subjoining consonants, uses the wrong codepoints for medial consonants, and uses the virama instead of the asat at the end of a word.
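One way to see the divergence described above is to inspect the actual codepoints, much like the character list linked from the BBC article. A small sketch; the example word "မြန်မာ" ("Myanmar") is my own choice, not taken from the post:

```python
import unicodedata

def codepoint_report(text):
    """List each codepoint with its Unicode name."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in text]

# Standards-conformant spelling of "Myanmar": the medial ra and the asat
# use their dedicated codepoints (U+103C, U+103A) rather than visual hacks.
standard = "\u1019\u103c\u1014\u103a\u1019\u102c"
for cp, name in codepoint_report(standard):
    print(cp, name)
```

Running the same report over text from a non-standard source would instead show Shan or Rumai Palaung codepoints, a virama (U+1039) where an asat belongs, and combining marks stored in visual rather than logical order.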
I assume that this is because of prevalent use of the non-standard approach on mobile devices (and that the BBC is just following that trend), caused by hacks that arose when people were impatient to get on the Web but script support was lagging in applications.
However, continuing this divergence does nobody any long-term good.
[Find fonts and other resources for the Myanmar script]
August 21, 2014
The W3C i18n Working Group has published a new Working Draft of Predefined Counter Styles. This document describes numbering systems used by various cultures around the world and can be used as a reference for those wishing to create user-defined counter styles for CSS. The latest draft synchronizes the document with changes to the related document CSS Counter Styles Level 3, for which a second Last Call is about to be announced. If you have comments on the draft, please send them to firstname.lastname@example.org.
August 13, 2014
Industry speakers lined up to discuss use cases and requirements for linked data and content analytics
The agenda of the 4th LIDER roadmapping workshop and LD4LT event has been published. A great variety of industry stakeholders will talk about linked data and content analytics. Industry areas represented include content analytics technology, multilingual conversational applications, localisation and more.
The workshop will take place on September 2nd in Leipzig, Germany and it will be collocated with the SEMANTiCS conference. The workshop will be organised as part of MLODE 2014 and will be preceded by a hackathon on the 1st of September.
August 08, 2014
XLIFF is the open standard bi-text format: Bi-text keeps source language and target language data in sync during localization.
The publication of XLIFF 2.0 is of high importance for W3C since several of the main ITS 2.0 data categories can be used within XLIFF 2.0 to provide content related information during the localization process. Full ITS 2.0 support is planned for the upcoming XLIFF 2.1 version.
August 06, 2014
A report summarizing the MultilingualWeb workshop in Madrid is now available from the MultilingualWeb site. It contains a summary of each session, with links to presentation slides and minutes taken during the workshop. The workshop was a huge success, with approximately 110 participants, and was complemented by the associated LIDER roadmapping workshop. The workshop was hosted by Universidad Politécnica de Madrid and sponsored by the EU-funded LIDER project, by Verisign and by
A new workshop in the MultilingualWeb series is planned for 2015.
July 31, 2014
This version updates the app to track the changes made during the beta phase of the specification, so that it now reflects the finalised Unicode 7.0.0.
The initial in-app help information displayed for new users was significantly updated, and the help tab now links directly to the help page.
A more significant improvement was the addition of links to character descriptions (on the right) where such details exist. This finally reintegrates the information that was previously pulled in from a database. Links are only provided where additional data actually exists. To see an example, go here and click on See character notes at the bottom right.
Rather than pull the data into the page, the link opens a new window containing the appropriate information. This has advantages for comparing data, but it was also the best solution I could find without using PHP (which is no longer available on the server I use). It also makes it easier to edit the character notes, so the amount of such detail should grow faster. In fact, some additional pages of notes were added along with this upgrade.
A pop-up window containing resource information used to appear when you used the query to show a block. This no longer happens.
Changes in version 7beta
I forgot to announce this version on my blog, so for good measure, here are the (pretty big) changes it introduced.
Some features that were available in version 6.1.0a are still not available, but they are minor.
Significant changes to the UI include the removal of the ‘popout’ box, and the merging of the search input box with that of the other features listed under Find.
In addition, the buttons that used to appear when you select a Unicode block have changed. Now the block name appears near the top right of the page with an icon. Clicking on the icon takes you to a page listing resources for that block, rather than listing the resources in the lower right part of UniView’s interface.
UniView no longer uses a database to display additional notes about characters. Instead, the information is being added to HTML files.
This is the result of a massive investment of resources and expertise — and I’m excited they’ve made it open source. From Adobe: Source Han Sans, available in seven weights, is a typeface family which provides full support for Japanese, Korean, Traditional Chinese, and Simplified Chinese, all in one font. It also includes Latin, Greek, and Cyrillic […]
July 22, 2014
Keeping in mind that this is a survey funded by Australia’s registry, the data points pretty clearly toward a preference for .au over .com. From the announcement: The report found .au remains Australia’s home on the Internet with more than double the level of trust over any other namespace. George Pongas, General Manager of Naming Services at […]
July 18, 2014
July 17, 2014
The projects in the Wikimedia universe can be accessed and used in a large number of languages from around the world. The Wikimedia websites, their MediaWiki software (both core and extensions) and their growing content benefit from standards-driven internationalization and localization engineering that makes the sites easy to use in every language across diverse platforms, both desktop and mobile.
However, a wide disparity exists in the numbers of articles across language wikis. The article count across Wikipedias in different languages is an often cited example. As the Wikimedia Foundation focuses on the larger mission of enabling editor engagement around the globe, the Wikimedia Language Engineering team has been working on a content translation tool that can greatly facilitate the process of article creation by new editors.
About the Tool
Particularly aimed at users fluent in two or more languages, the Content Translation tool has been in development since the beginning of 2014. It will provide a combination of editing and translation tools that can be used by multilingual users to bootstrap articles in a new language by translating an existing article from another language. The Content Translation tool has been designed to address basic templates, references and links found in Wikipedia articles.
Development of this tool has involved significant research and evaluation by the engineering team to handle elements like sentence segmentation, machine translation, rich-text editing, user interface design and scalable backend architecture. The first milestone for the tool’s rollout this month includes a comprehensive editor, limited capabilities in areas of machine translation, link and reference adaptation and dictionary support.
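To give a feel for just one of those elements, here is a deliberately naive sentence-segmentation sketch. This is not the project's actual implementation, which must handle abbreviations, quotations, and language-specific punctuation rules:

```python
import re

def naive_segment(text):
    # Split after sentence-final punctuation followed by whitespace.
    # Real segmenters must not split on "Sr.", ellipses, decimal points, etc.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_segment("Hola. ¿Qué tal? Bien."))
```

Segmentation matters here because the tool aligns source and target paragraphs sentence by sentence, so a wrong split propagates into the translation view.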
Why Spanish and Catalan as the first language pair?
Presently deployed at http://es.wikipedia.beta.wmflabs.org/wiki/Especial:ContentTranslation, the tool is open for wider testing and user feedback. Users will have to create an account on this wiki and log in to use the tool. For the current release, machine translation can only be used to translate articles between Spanish and Catalan. This pair was chosen because of the linguistic similarity of the two languages, as well as the availability of well-supported language aids like dictionaries and machine translation. Driven by a passionate community of contributors, the Catalan Wikipedia is an ideal medium-sized project for testing and feedback. We also hope to enhance the aided translation capabilities of the tool by generating parallel corpora of text from within the tool.
To view Content Translation in action, please follow the link to this instance and make the following selections:
- article name – the article you would like to translate
- source language – the language in which the article you wish to translate exists (restricted to Spanish at this moment)
- target language – the language in which you would like to translate the article (restricted to Catalan at this moment)
This will lead you to the editing interface where you can provide a title for the page, translate the different sections of the article and then publish the page in your user namespace in the same wiki. This newly created page will have to be copied over to the Wikipedia in the target language that you had earlier selected.
Users in languages other than Spanish and Catalan can also view the functionality of the tool by making a few tweaks.
We care about your feedback
Please provide your feedback on this page on the Catalan Wikipedia or at this topic on the project’s talk page. We will attempt to respond as soon as possible, based on the criticality of the issues surfaced.
Runa Bhattacharjee, Outreach and QA coordinator, Language Engineering, Wikimedia Foundation
July 16, 2014
This document builds upon the Character Model for the World Wide Web 1.0: Fundamentals to provide authors of specifications, software developers, and content developers with a common reference on string matching on the World Wide Web, and thereby to increase interoperability. String matching is the process by which a specification or implementation defines whether two string values are the same as or different from one another.
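The canonical example of the problem: two strings that render identically can differ codepoint-for-codepoint. A minimal Python sketch using Unicode Normalization Form C (one of the normalization forms the document discusses):

```python
import unicodedata

precomposed = "\u00e9"   # é as a single codepoint (U+00E9)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

assert precomposed != decomposed  # a raw comparison says "different"

# After normalizing both sides to NFC, the comparison succeeds.
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)
```

A specification that defines string matching without saying when (or whether) to normalize leaves implementations free to disagree on exactly this case.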
The main target audience of this specification is W3C specification developers. This specification and parts of it can be referenced from other W3C specifications and it defines conformance criteria for W3C specifications, as well as other specifications.
This version of this document represents a significant change from its previous edition. Much of the content is changed and the recommendations are significantly altered. This fact is reflected in a change to the name of the document from “Character Model: Normalization” to “Character Model for the World Wide Web: String Matching and Searching”.
July 14, 2014
The goal of the workshop is to gather input from experts and stakeholders in the area of content analytics, to identify areas and tasks where linked data and semantic technologies can contribute. The workshop will be organised as part of MLODE 2014 and will be preceded by a hackathon on the 1st of September.
June 30, 2014
Most Swedes have a basic understanding of English, but many of them are far from fluent. Hence, it is important that computer programs are localized so that they also work in Swedish and other languages. This helps people avoid mistakes and helps users work faster and more efficiently. But how is this done?
First and foremost, the different messages in the software need to be translated separately. To get the translation just right and to make sure that the language is consistent requires a lot of thought. In open source software, this work is often done by volunteers who double check each other’s work. This allows for the program to be translated into hundreds of different languages, including minority languages that commercial operators usually do not focus on. As an example, the MediaWiki software that is used in all Wikimedia projects (such as Wikipedia), is translated in this way. As MediaWiki is developed at a rapid pace, with a large amount of new messages each month, it is important for us that we have a large and active community of translators. This way we make sure that everything works in all languages as fast as possible. But what could the Wikimedia movement do to help build this translator community?
We are happy to announce that Wikimedia Sverige is about to start a new project with support from Internetfonden (.Se) (the Internet Fund). The Internet Fund supports projects that improve the Internet’s infrastructure. The idea of translating open software to help build the translator community is in line with their goals. We gave the project a zingy name: “Expanding the translatewiki.net – ‘Improved Swedish localization of open source, for easier online participation’.” This is the first time that Wikimedia Sverige has had a project that focuses on this important element of the user experience. Here we will learn many new things that we will try to share with the wider community while aiming to improve the basic infrastructure on translatewiki.net. The translation platform translatewiki.net currently has 27 programs ready to be translated into 213 languages by more than 6,400 volunteers from around the world.
We will carry out the project in cooperation with Umeå University and Meta Solutions Ltd, with support from the developers of translatewiki.net (who are employed by the Wikimedia Foundation). We will be working on several exciting things and together we will:
- Build a larger and more active community of Swedish-speaking translators on translatewiki.net;
- Design a system for Open Badges and explore how it can be integrated with MediaWiki software. (Do let us know if you are working on something similar so that we can help each other!);
- Complete translations into Swedish for at least five of the remaining programs that are on translatewiki.net;
- Improve usability by inventorying and clarifying the documentation, something that will be done in cooperation with and will benefit the entire community on translatewiki.net;
- Umeå University will conduct research on parts of the project so that we get a deeper understanding of the processes (what exactly they will focus their research on is yet to be determined); and
- Add MetaSolutions’ program EntryScape for translation on translatewiki.net, and document the steps and how it went. This case study will hopefully identify bottlenecks and make it easier for others to add their programs. MetaSolutions will also develop the necessary code to make it possible for similar programs to be added to translatewiki.net.
We will also organize several translation sprints where we can jointly translate as many messages as possible (you can also participate remotely). Last year we organized a translation sprint and discovered real value in sitting together. It made the work more enjoyable and it made it easier to arrive at the appropriate translations for the trickier messages. If you would like to be involved in the Swedish translations, please get in contact with us!
June 26, 2014
At the ICANN 50 conference Jordyn Buchanan of Google confirmed that Gmail would support EAI (email address internationalization) by the end of this month. This is significant news. But what does it mean exactly? I don’t have the details yet, but at a minimum I assume it means a Gmail user could create an email address using a […]
June 24, 2014
June 20, 2014
It’s been over a decade since the Unicode standard was made available for the Odia script. Odia is a language spoken by roughly 33 million people in Eastern India, and is one of the many official languages of India. Since its release, it has been challenging to get more content into Unicode, largely because many users accustomed to other non-Unicode standards have been unwilling to make the move. This created the need for a simple converter that could convert text typed in various non-Unicode fonts to Unicode. Such a converter could enrich Wikipedia and other Wikimedia projects by converting previously typed content and making it more widely available on the internet. The Odia language recently got one, making it possible to convert two of the fonts most popular among media professionals (AkrutiOriSarala99 and AkrutiOriSarala) into Unicode.
All of the non-Latin scripts came under one umbrella after the rollout of Unicode. Since then, many Unicode-compliant fonts have been designed, and the open source community has put effort into producing good quality fonts. Though contribution to Unicode-compliant portals like Wikipedia increased, the publication and printing industries in India were still stuck with the pre-existing ASCII and ISCII standards (ISCII being an Indian font encoding standard based on ASCII). Modified ASCII fonts used as typesets for newspapers, books, magazines and other printed documents still exist in these industries. This created a massive amount of content that is not searchable or reproducible because it is not Unicode compliant. The difference with a Unicode font is that the Indic script characters have their own codepoints, whereas the modified ASCII fonts simply replace Latin glyphs with Indic ones. So, when someone does not have a particular ASCII-standard font installed, the typed text looks absurd (see Mojibake), whereas text typed using one Unicode font can be read using another Unicode font on a different operating system. Most of the ASCII fonts used for typing Indic languages are proprietary, and many individuals and organizations even use pirated software and fonts. Having massive amounts of content in multiple legacy standards and little content in Unicode created a large gap for many languages, including Odia. Until all of this content gets converted to Unicode to make it searchable, sharable and reusable, the knowledge base it represents will remain inaccessible. Fortunately, some of the Indic languages have more and more contributors creating Unicode content. There is a need for technological development to convert non-Unicode content to Unicode and open it up for people to use.
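The legacy-font problem described above is a visual hack rather than a byte-level encoding mismatch, but the classic byte-level mojibake is easy to demonstrate and gives the flavour. A small sketch; the Odia word is just an example string:

```python
snippet = "ଓଡ଼ିଆ"                     # "Odia" in the Odia script
raw = snippet.encode("utf-8")
garbled = raw.decode("latin-1")      # wrong decoder: each UTF-8 byte becomes a stray character
print(garbled)
assert garbled != snippet

# The damage is reversible only if you know exactly which mis-decoding occurred:
assert garbled.encode("latin-1").decode("utf-8") == snippet
```

Font-hack text is harder: the bytes decode "successfully", so recovery needs a glyph-mapping table rather than a simple re-decode.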
There are a few different kinds of fonts used by media and publication houses; the most popular one is Akruti. Two other popular standards are LeapOffice and Shreelipi. Akruti software comes bundled with a variety of typefaces and an encoding engine that works well in Adobe Acrobat Creator, the most popular DTP software package. Industry professionals are comfortable using it for its reputation and seamless printing. The problem of migrating content from other standards to Unicode arose when the Odia Wikimedia community started reaching out to these industry professionals. Apparently authors, government employees and other professionals were more comfortable using one of the standards mentioned above. All of these people type using either a generic popular standard, Modular, or a universal standard, Inscript. Fortunately, the former is now incorporated into MediaWiki‘s Universal Language Selector (ULS) and the latter is in the process of being added to ULS. Once this is done, many folks can start contributing to Wikipedia easily.
Content that has been typed in various modified ASCII fonts includes encyclopedias that could help grow content on Wikisource and Wikiquote. All of it needs to be converted to Unicode. The non-profit group Srujanika first initiated a project to build a converter for two different Akruti fonts, AkrutiOriSarala99 and OR-TT Sarala; the former is now outdated and the latter was never very popular. The Rebati 1 converter built by the Srujanika team was not being maintained and was more of an orphan project. Fellow Wikimedian Manoj Sahukar and I used parts of the “Rebati 1 converter” code and worked on building another converter. The new “Akruti Sarala – Unicode Odia converter” can convert the more popular AkrutiOriSarala font and its predecessor AkrutiOriSarala99, which is still used by some. Odia Wikimedian Mrutyunjaya Kar and journalist Subhransu Panda have helped by reporting broken conjuncts, which helped fix problems before publication. Odia authors and journalists have already started using the converter, and many of them now post regularly in Odia. We are waiting for more authors to contribute to Wikipedia by converting their work and wikifying it.
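At its core, a legacy-font-to-Unicode converter is a mapping table applied longest-match-first over the text. The sketch below shows the general technique only; the legacy glyph strings here are invented placeholders, and the real AkrutiOriSarala tables are far larger and also reorder conjuncts and vowel signs:

```python
# Hypothetical legacy-glyph -> Unicode table (illustrative values only).
LEGACY_TO_UNICODE = {
    "@": "\u0b13",          # placeholder glyph -> ODIA LETTER O
    "xy": "\u0b21\u0b3c",   # placeholder two-char sequence -> ODIA LETTER DDA + NUKTA
}

def convert(text, table):
    """Replace legacy glyph sequences with Unicode, longest match first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(table[k])
                i += len(k)
                break
        else:
            out.append(text[i])  # pass unmapped characters through unchanged
            i += 1
    return "".join(out)
```

Longest-match-first matters because many legacy fonts encode a conjunct as a multi-character glyph sequence whose prefix is itself a valid single-glyph mapping.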
Even after getting classical status, the Odia language is not being used actively on the internet like some other Indian languages. The main reason is that our writing system has not been web-friendly. Most of those in Odisha who have typing skills use the Modular keyboard and Akruti fonts, and Akruti, as we know, is not web-compatible. There are thousands of articles, literary works and news stories typed in Akruti fonts lying unused (on the internet). Thanks to Subhashish Panigrahi and his associates, who have developed this new font converter that can convert your Akruti text into Unicode. I have checked it. It’s error-free. Now it’s easy for us to write articles online (for Wikipedia and other sites).
Yes, we are late entrants as far as use of vernacular languages on the internet is concerned. But this converter will help us catch up quickly. Let’s make Odia our language of communication and expression.
Subhashish Panigrahi, Odia Wikipedian and Programme Officer, Centre for Internet and Society