Keeping in mind that this is a survey funded by Australia’s registry, the data points pretty clearly toward a preference for .au over .com. From the announcement: The report found .au remains Australia’s home on the Internet with more than double the level of trust over any other namespace. George Pongas, General Manager of Naming Services at […]
The Planet Web I18n aggregates posts from various blogs that talk about Web internationalization (i18n). While it is hosted by the W3C Internationalization Activity, the content of the individual entries represent only the opinion of their respective authors and does not reflect the position of the Internationalization Activity.
July 21, 2014
July 18, 2014
July 17, 2014
The projects in the Wikimedia universe can be accessed and used in a large number of languages from around the world. The Wikimedia websites, their MediaWiki software (both core and extensions) and their growing content benefit from standards-driven internationalization and localization engineering that makes the sites easy to use in every language across diverse platforms, both desktop and mobile.
However, a wide disparity exists in the numbers of articles across language wikis. The article count across Wikipedias in different languages is an often cited example. As the Wikimedia Foundation focuses on the larger mission of enabling editor engagement around the globe, the Wikimedia Language Engineering team has been working on a content translation tool that can greatly facilitate the process of article creation by new editors.
About the Tool
Particularly aimed at users fluent in two or more languages, the Content Translation tool has been in development since the beginning of 2014. It will provide a combination of editing and translation tools that can be used by multilingual users to bootstrap articles in a new language by translating an existing article from another language. The Content Translation tool has been designed to address basic templates, references and links found in Wikipedia articles.
Development of this tool has involved significant research and evaluation by the engineering team to handle elements like sentence segmentation, machine translation, rich-text editing, user interface design and scalable backend architecture. The first milestone for the tool’s rollout this month includes a comprehensive editor, limited capabilities in areas of machine translation, link and reference adaptation and dictionary support.
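As an illustration of the sentence-segmentation element, a naive splitter might look like the sketch below. This is only a toy rule, not the tool's actual segmenter, which must also handle abbreviations, quotations and non-Latin punctuation.

```python
import re

def split_sentences(text):
    # naive rule: split after ., ! or ? followed by whitespace and a
    # capital letter; real segmenters need language-aware handling of
    # abbreviations ("Sr.", "núm.") and scripts without capitalization
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

sentences = split_sentences("La catedral es gótica. Fue construida en 1221.")
# two translation units, one per sentence
```

Segmenting first lets the editor pair each source sentence with its translation, which is also what makes it possible to harvest parallel corpora from the tool later.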
Why Spanish and Catalan as the first language pair?
Presently deployed at http://es.wikipedia.beta.wmflabs.org/wiki/Especial:ContentTranslation, the tool is open for wider testing and user feedback. Users will have to create an account on this wiki and log in to use the tool. For the current release, machine translation can only be used to translate articles between Spanish and Catalan. This language pair was chosen for its linguistic similarity as well as the availability of well-supported language aids like dictionaries and machine translation. Driven by a passionate community of contributors, the Catalan Wikipedia is an ideal medium-sized project for testing and feedback. We also hope to enhance the aided translation capabilities of the tool by generating parallel corpora of text from within the tool.
To view Content Translation in action, please follow the link to this instance and make the following selections:
- article name – the article you would like to translate
- source language – the language in which the article you wish to translate exists (restricted to Spanish at this moment)
- target language – the language in which you would like to translate the article (restricted to Catalan at this moment)
This will lead you to the editing interface where you can provide a title for the page, translate the different sections of the article and then publish the page in your user namespace in the same wiki. This newly created page will have to be copied over to the Wikipedia in the target language that you had earlier selected.
Users in languages other than Spanish and Catalan can also view the functionality of the tool by making a few tweaks.
We care about your feedback
Please provide your feedback on this page on the Catalan Wikipedia or at this topic on the project’s talk page. We will attempt to respond as soon as possible, based on the criticality of the issues surfaced.
Runa Bhattacharjee, Outreach and QA coordinator, Language Engineering, Wikimedia Foundation
July 16, 2014
This document builds on the Character Model for the World Wide Web 1.0: Fundamentals to provide authors of specifications, software developers, and content developers a common reference on string matching on the World Wide Web, and thereby increase interoperability. String matching is the process by which a specification or implementation defines whether two string values are the same as or different from one another.
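The heart of the problem is that canonically equivalent strings can differ code point by code point. A minimal Python sketch of normalization-sensitive matching (using only the standard library, not any W3C reference code):

```python
import unicodedata

precomposed = "caf\u00e9"    # é as a single code point (U+00E9)
decomposed  = "cafe\u0301"   # e followed by combining acute accent (U+0301)

# a raw comparison sees two different code point sequences
naive_match = precomposed == decomposed

# comparing NFC normalizations treats canonical equivalents as the same
nfc_match = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
```

Whether a specification should match `naive_match`-style (code point for code point) or `nfc_match`-style (after normalization) is exactly the kind of question this document gives guidance on.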
The main target audience of this specification is W3C specification developers. This specification and parts of it can be referenced from other W3C specifications and it defines conformance criteria for W3C specifications, as well as other specifications.
This version of this document represents a significant change from its previous edition. Much of the content is changed and the recommendations are significantly altered. This fact is reflected in a change to the name of the document from “Character Model: Normalization” to “Character Model for the World Wide Web: String Matching and Searching”.
June 30, 2014
Most Swedes have a basic understanding of English, but many of them are far from fluent. Hence, it is important that computer programs are localized so that they also work in Swedish and other languages. This helps people avoid mistakes and lets users work faster and more efficiently. But how is this done?
First and foremost, the different messages in the software need to be translated separately. To get the translation just right and to make sure that the language is consistent requires a lot of thought. In open source software, this work is often done by volunteers who double check each other’s work. This allows for the program to be translated into hundreds of different languages, including minority languages that commercial operators usually do not focus on. As an example, the MediaWiki software that is used in all Wikimedia projects (such as Wikipedia), is translated in this way. As MediaWiki is developed at a rapid pace, with a large amount of new messages each month, it is important for us that we have a large and active community of translators. This way we make sure that everything works in all languages as fast as possible. But what could the Wikimedia movement do to help build this translator community?
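The mechanism described above, each interface message translated separately with a fallback to the source language, can be sketched roughly as follows. The catalogue keys and strings here are invented for illustration; MediaWiki's real message system also handles plurals, gender and message parameters.

```python
# toy message catalogue keyed by language code; the "sv" entry is
# deliberately incomplete to show the fallback behaviour
CATALOGUE = {
    "en": {"save": "Save page", "cancel": "Cancel"},
    "sv": {"save": "Spara sidan"},   # "cancel" not yet translated
}

def message(key, lang):
    # fall back to the English source string when a translation is missing
    return CATALOGUE.get(lang, {}).get(key) or CATALOGUE["en"][key]
```

Keeping each message as a separate, keyed unit is what lets hundreds of volunteers translate independently and check each other's work.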
We are happy to announce that Wikimedia Sverige is about to start a new project with support from Internetfonden (.Se) (the Internet Fund). The Internet Fund supports projects that improve the Internet’s infrastructure. The idea of translating open software to help build the translator community is in line with their goals. We gave the project a zingy name: “Expanding the translatewiki.net – ‘Improved Swedish localization of open source, for easier online participation’.” This is the first time that Wikimedia Sverige has had a project that focuses on this important element of the user experience. Here we will learn many new things that we will try to share with the wider community while aiming to improve the basic infrastructure on translatewiki.net. The translation platform translatewiki.net currently has 27 programs ready to be translated into 213 languages by more than 6,400 volunteers from around the world.
We will carry out the project in cooperation with Umeå University and MetaSolutions AB, with support from the developers of translatewiki.net (who are employed by the Wikimedia Foundation). We will be working on several exciting things and together we will:
- Build a larger and more active community of Swedish-speaking translators on translatewiki.net;
- Design a system for Open Badges and explore how it can be integrated with MediaWiki software. (Do let us know if you are working on something similar so that we can help each other!);
- Complete translations into Swedish for at least five of the remaining programs that are on translatewiki.net;
- Improve usability by inventorying and clarifying the documentation, something that will be done in cooperation with and will benefit the entire community on translatewiki.net;
- Umeå University will conduct research on parts of the project so that we get a deeper understanding of the processes (what exactly they will focus their research on is yet to be determined); and
- Add MetaSolutions’ program EntryScape for translation on translatewiki.net, and document the steps and how it went. This case study will hopefully identify bottlenecks and make it easier for others to add their programs. MetaSolutions will also develop the necessary code to make it possible for similar programs to be added to translatewiki.net.
We will also organize several translation sprints where we can jointly translate as many messages as possible (you can also participate remotely). Last year we organized a translation sprint and discovered real value in sitting together. It made the work more enjoyable and it made it easier to arrive at the appropriate translations for the trickier messages. If you would like to be involved in the Swedish translations, please get in contact with us!
Translatewiki.net in the spotlight
Most Swedes have a basic understanding of English, but far from all can work unhindered in the language or feel entirely comfortable using it. It is therefore important that computer programs are adapted to work in Swedish and other languages as well. This helps people avoid mistakes and makes it easier for users to work quickly and efficiently. But how is this done in practice?
For this, the individual messages in the software need to be translated separately. This often requires a good deal of thought to get the meaning right and to keep the language consistent throughout. In open source software this work is very often done by volunteers who check each other’s work. This makes it possible to get translations into hundreds of languages at a very low cost, including minority languages that commercial actors would never focus on. For example, MediaWiki, the software used in all Wikimedia projects, is translated in this way. Since MediaWiki is developed at a rapid pace, with masses of new messages every month, it is important for us to have a large and active community of translators so that everything works in every language. We see this as genuinely valuable for free knowledge. But what can the Wikimedia movement do to develop this translator community?
We are delighted that Wikimedia Sverige will be starting a new project with support from Internetfonden (.Se), the Internet Fund. The Internet Fund supports projects that improve the Internet’s infrastructure, and our project is in line with their goals. We have given the project the somewhat zingy name “En expandering av translatewiki.net – Förbättrad svensk lokalisering av öppen källkod, för enklare onlinedeltagande” (“Expanding translatewiki.net – improved Swedish localization of open source, for easier online participation”). This is extra fun because Wikimedia Sverige has not previously run a project focused on this important part of the user experience. We will learn many new things here, which we will try to share with others who are keen on open software, while also improving translatewiki.net’s infrastructure. The translation platform translatewiki.net currently hosts 27 programs, which are translated into 213 languages by more than 6,400 volunteers from around the world.
We will carry out the project together with Umeå University and MetaSolutions AB, with support from developers of translatewiki.net, who are employed by the Wikimedia Foundation in the US. Within the project we will work on several exciting things! Together we will:
- Work to build a larger community of Swedish-speaking translators on translatewiki.net;
- Design a system for Open Badges and investigate how it can be integrated into the MediaWiki software;
- Complete the Swedish translations for at least five of the remaining programs hosted there;
- Improve usability by taking stock of and clarifying the documentation, which will benefit all of translatewiki.net;
- Have Umeå University conduct research on parts of the project so that we gain a deeper understanding of the processes involved (the exact focus of the research is not yet decided); and
- Add MetaSolutions’ program EntryScape for translation on translatewiki.net and document how it goes. This case study will hopefully identify bottlenecks and make it easier for others to add their projects. To add EntryScape, MetaSolutions will also write the code needed to make it easier for similar programs to be added to the platform.
We will also organize several translation sprints to jointly translate as many messages as possible. Of course, you can always take part remotely! We tried running a translation sprint last year and saw clear value in sitting together: it made the work more fun and made it easier to arrive at suitable translations for the trickiest messages. Everyone is welcome to help, and we will gladly point you in the right direction if you are new to translatewiki.net. Just send an email or give us a call!
June 26, 2014
At the ICANN 50 conference Jordyn Buchanan of Google confirmed that Gmail would support EAI (email address internationalization) by the end of this month. This is significant news. But what does it mean exactly? I don’t have the details yet, but at a minimum I assume it means a Gmail user could create an email address using a […]
June 24, 2014
June 20, 2014
It’s been over a decade since the Unicode standard was made available for the Odia script. Odia is a language spoken by roughly 33 million people in Eastern India, and is one of the many official languages of India. Since its release, it has been challenging to get more content into Unicode, because many who are used to other non-Unicode standards are unwilling to make the move. This created the need for a simple converter that could convert text typed in various non-Unicode fonts to Unicode. Such a tool could enrich Wikipedia and other Wikimedia projects by converting previously typed content and making it more widely available on the internet. The Odia language recently got such a converter, making it possible to convert two of the most popular fonts among media professionals (AkrutiOriSarala99 and AkrutiOriSarala) into Unicode.
All of the non-Latin scripts came under one umbrella after the rollout of Unicode. Since then, many Unicode-compliant fonts have been designed, and the open source community has put effort into producing good quality fonts. Though contributions to Unicode-compliant portals like Wikipedia increased, the publication and printing industries in India were still stuck with the pre-existing ASCII and ISCII standards (ISCII is an Indian font encoding standard based on ASCII). Modified ASCII fonts used as typesets for newspapers, books, magazines and other printed documents still persist in these industries. This has created a massive amount of content that is not searchable or reproducible because it is not Unicode compliant.
The difference with a Unicode font is that it contains separate glyphs for the Indic script characters alongside the Latin glyphs, whereas a modified ASCII font simply replaces the Latin glyphs with Indic ones. So, when someone does not have a particular ASCII-standard font installed, the typed text looks absurd (see Mojibake), whereas text typed using one Unicode font can be read using another Unicode font on a different operating system. Most of the ASCII fonts used for typing Indic languages are proprietary, and many individuals and organizations even use pirated software and fonts. Having massive amounts of content available in multiple standards and little content in Unicode created a large gap for many languages, including Odia. Until all of this content is converted to Unicode and thereby made searchable, shareable and reusable, the knowledge base it represents will remain inaccessible. Fortunately, some of the Indic languages have more and more contributors creating Unicode content. There remains a need for technological development to convert non-Unicode content to Unicode and open it up for people to use.
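In essence, such a converter remaps legacy glyph codes onto Unicode code points. The sketch below shows the idea only: the mapping entries are invented, not the real Akruti tables, and a production converter also needs reordering rules for vowel signs and conjuncts.

```python
# illustrative only: the legacy codes on the left are made up; a real
# converter ships a full per-font table plus glyph-reordering rules
LEGACY_TO_UNICODE = {
    "A": "\u0b05",   # pretend the legacy font rendered 'A' as Odia letter A (ଅ)
    "k": "\u0b15",   # pretend 'k' rendered as Odia letter KA (କ)
    "|": "\u0b3e",   # pretend '|' rendered as Odia vowel sign AA (ା)
}

def to_unicode(legacy_text):
    # characters without a mapping (spaces, punctuation) pass through
    return "".join(LEGACY_TO_UNICODE.get(ch, ch) for ch in legacy_text)
```

Once remapped, the text is ordinary Unicode Odia: searchable, copyable, and renderable by any Unicode font.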
There are a few different kinds of fonts used by media and publication houses; the most popular one is Akruti. Two other popular standards are LeapOffice and Shreelipi. The Akruti software comes bundled with a variety of typefaces and an encoding engine that works well in Adobe Acrobat Creator, the most popular DTP software package. Industry professionals are comfortable using it for its reputation and seamless printing. The problem of migrating content from other standards to Unicode arose when the Odia Wikimedia community started reaching out to these industry professionals. Apparently authors, government employees and other professionals were more comfortable using one of the standards mentioned above. All of these people type using either a generic popular standard, Modular, or a universal standard, Inscript. Fortunately, the former is now incorporated into MediaWiki‘s Universal Language Selector (ULS) and the latter is in the process of being added to ULS. Once this is done, many folks will be able to start contributing to Wikipedia easily.
Content that has been typed in various modified ASCII fonts includes encyclopedias that could help grow content on Wikisource and Wikiquote. All of this needs to be converted to Unicode. The non-profit group Srujanika first initiated a project to build a converter for two different Akruti fonts, AkrutiOriSarala99 and OR-TT Sarala; the former is outdated and the latter less popular. The Rebati 1 converter built by the Srujanika team was not being maintained and was essentially an orphan project. Fellow Wikimedian Manoj Sahukar and I used parts of the “Rebati 1 converter” code and worked on building another converter. The new “Akruti Sarala – Unicode Odia converter” can convert the more popular AkrutiOriSarala font and its predecessor AkrutiOriSarala99, which is still used by some. Odia Wikimedian Mrutyunjaya Kar and journalist Subhransu Panda helped by reporting broken conjuncts, which made it possible to fix problems before release. Odia authors and journalists have already started using the converter, and many of them now post regularly in Odia. We are waiting for more authors to contribute to Wikipedia by converting their work and wikifying it.
Even after gaining classical status, the Odia language is not being used as actively on the internet as some other Indian languages. The main reason is that our writing system has not been web-friendly. Most of those in Odisha with typing skills use the Modular keyboard and Akruti fonts, and Akruti, as we know, is not web-compatible. There are thousands of articles, literary works and news stories typed in Akruti fonts lying unused (on the internet). Thanks to Subhashish Panigrahi and his associates, who have developed this new font converter that can convert your Akruti text into Unicode. I have checked it; it’s error-free. Now it’s easy for us to write articles online (for Wikipedia and other sites).
Yes, we are late entrants as far as the use of vernacular languages on the internet is concerned. But this converter will help us move at full speed. Let’s make Odia our language of communication and expression.
Subhashish Panigrahi, Odia Wikipedian and Programme Officer, Centre for Internet and Society
June 17, 2014
One of the more common questions I hear when talking about country codes is why the inconsistency across countries. For instance, why do you have to register .co.uk in the UK while you can get a simple .fr in France or .de in Germany? My answer is that country codes are managed by their respective countries […]
Version 7.0 of the Unicode Standard is now available, adding 2,834 new characters. This latest version adds the new currency symbols for the Russian ruble and Azerbaijani manat, approximately 250 emoji (pictographic symbols), many other symbols, and 23 new lesser-used and historic scripts, as well as character additions to many existing scripts. These additions extend support for written languages of North America, China, India, other Asian countries, and Africa. See the link above for full details.
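Assuming a Python build whose character database is at Unicode 7.0 or later, the new currency characters can be inspected directly:

```python
import unicodedata

# U+20BD and U+20BC were added in Unicode 7.0
ruble = "\u20bd"
manat = "\u20bc"

ruble_name = unicodedata.name(ruble)       # 'RUBLE SIGN'
manat_name = unicodedata.name(manat)       # 'MANAT SIGN'
ruble_category = unicodedata.category(ruble)  # 'Sc', i.e. currency symbol
```

On an older build the `unicodedata.name()` calls would instead raise `ValueError`, since the code points were unassigned before 7.0.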
Most of the new emoji characters derive from characters in long-standing and widespread use in Wingdings and Webdings fonts.
Major enhancements were made to the Indic script properties. New property values were added to enable a more algorithmic approach to rendering Indic scripts. These include properties for joining behavior, new classes for numbers, and a further division of the syllabic categories of viramas and rephas. With these enhancements, the default rendering for newly added Indic scripts can be significantly improved.
Unicode character properties were extended to cover the new characters, and existing characters received enhancements to the Script and Alphabetic properties and to casing and line-breaking behavior. There are also nearly 3,000 new Cantonese pronunciation entries, as well as new or clarified stability policies for promoting interoperable implementations.
Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and have updates for Version 7.0. These will be released at the same time:
June 11, 2014
Industry stakeholders from many areas (localization, publishing, language technology applications etc.) and key researchers from linked data and language technology discussed promises and challenges around linguistic linked data. The report summarizes all presentations and includes an initial list of use cases and requirements for linguistic linked data. This and the overall outcome of the event will feed into work of the LD4LT group (see especially the LD4LT latest draft version of use cases), and the field of multilingual linked data in general.
June 09, 2014
I was intrigued to read recently that Mozilla is working on updating the Firefox Android mobile browser, codename Fennec, to allow the browser to offer more languages than the underlying Android system currently supports. Typically, apps leverage language support from the underlying operating system, which can sometimes be limiting. So it’s nice to see Mozilla moving beyond this limitation. […]
June 06, 2014
A Last Call Working Draft of Encoding has been published.
While encodings have been defined to some extent, implementations have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification attempts to fill those gaps so that new implementations do not have to reverse engineer encoding implementations of the market leaders and existing implementations can converge.
The body of this spec is an exact copy of the WHATWG version as of the date of its publication, intended to provide a stable reference for other specifications. We are hoping for people to review the specification and send comments about any technical areas that need attention (see the Status section for details).
Please send comments by 1 July 2014.
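One example of the gaps the specification addresses: the same byte sequence decodes differently depending on which encoding a label resolves to. Python follows the IANA registry rather than the WHATWG label table, which makes the difference easy to demonstrate:

```python
# byte 0x92 is a C1 control character in ISO 8859-1 but a right single
# quotation mark in windows-1252; the Encoding spec maps the label
# "iso-8859-1" to windows-1252 because web content expects the latter
raw = b"It\x92s"

as_latin1 = raw.decode("iso-8859-1")    # 'It\x92s', with U+0092 control char
as_cp1252 = raw.decode("windows-1252")  # 'It’s'
```

Per the Encoding spec, a browser handed content labelled `iso-8859-1` produces the `windows-1252` result, which is exactly the kind of reverse-engineered market behaviour the spec now writes down.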
Authoring HTML: Handling Right-to-left Scripts and Authoring HTML: Language declarations have both been updated to a new format that lists dos and don’ts but points to existing or new articles for detailed information. This will significantly help in keeping the material up to date as technology changes. The documents have also been thoroughly overhauled to reflect the latest changes and information.
The first document provides advice to content authors using HTML markup and CSS style sheets about how to create pages for languages that use right-to-left scripts, such as Arabic, Hebrew, Persian, Thaana, Urdu, etc. It explains how to create content in right-to-left scripts that builds on but goes beyond the Unicode bidirectional algorithm, as well as how to prepare content for localization into right-to-left scripts.
The second helps content authors specify the language of content, which is useful for a wide number of applications, from linguistically-sensitive searching to applying language-specific display properties. In some cases the potential applications for language information are still waiting for implementations to catch up, whereas in others it is a necessity today. Adding markup for language information to content is something that can and should be done now and as content is first developed. If not, it will be much more difficult to take advantage of any future developments.
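As a small illustration of consuming declared language information, the sketch below pulls `lang` attributes out of markup with Python's standard-library parser; real applications would feed these tags into spell-checking, font selection or language-sensitive search.

```python
from html.parser import HTMLParser

class LangSniffer(HTMLParser):
    """Collect (tag, lang) pairs from lang attributes in the markup."""
    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "lang":
                self.langs.append((tag, value))

sniffer = LangSniffer()
sniffer.feed('<html lang="ar"><p lang="fr">Bonjour</p></html>')
# sniffer.langs now records the document default and the inline override
```

Without those declarations there is simply nothing for a consumer to read, which is why the article urges adding the markup as content is first developed.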
June 04, 2014
June 02, 2014
May 30, 2014
May 28, 2014
One more week to FEISGILTT – learn about and contribute to the future of Web and localization standards!
On 4 June and as part of the Localization World conference in Dublin, the FEISGILTT event will again provide an opportunity to discuss latest developments around localization and multilingual Web technologies. The event is sponsored by the LIDER project.
Highlights include updates about ITS 2.0 and XLIFF 2.0, and a session about usage scenarios for linguistic linked data in localization. Speakers include Kevin O’Donnell (Microsoft), Bryan Schnabel (Tektronix), Yves Savourel (Enlaso) and many more.
Register now to meet the key players behind standards that will influence business today and in the future.
Factoids listed at the start of the EURid/UNESCO World Report on IDN Deployment 2013
- 5.1 million IDN domain names
- Only 2% of the world’s domain names are in non-Latin script
- The 5 most popular browsers have strong support for IDNs in their latest versions
- Poor support for IDNs in mobile devices
- 92% of the world’s most popular websites do not recognise IDNs as URLs in links
- 0% of the world’s most popular websites allow IDN email addresses as user accounts
- 99% correlation between IDN scripts and language of websites (Han, Hangul, Hiragana, Katakana)
About two weeks ago I attended the part of a 3-day Asia Pacific Top Level Domain Association (APTLD) meeting in Oman related to ‘Universal Acceptance’ of Internationalized Domain Names (IDNs), ie. domain names using non-ASCII characters. This refers to the fact that, although IDNs work reasonably well in the browser context, they are problematic when people try to use them in the wider world for things such as email and social media ids, etc. The meeting was facilitated by Don Hollander, GM of APTLD.
Here’s a summary of information from the presentations and discussions.
(By the way, Don Hollander and Dennis Tan Tanaka, Verisign, each gave talks about this during the MultilingualWeb workshop in Madrid the week before. You can find links to their slides from the event program.)
Internationalized Domain Names (IDNs) provide much improved accessibility to the web for local communities using non-Latin scripts, and are expected to particularly smooth entry for the 3 billion people not yet web-enabled. In advertising (such as on the side of a bus), for example, they are easier and much faster to recognise and remember, and they are also easier to note down and type into a browser.
The biggest collection of IDNs is under .com and .net, but there are new Brand TLDs emerging as well as IDN country codes. On the Web there is a near-perfect correlation between use of IDNs and the language of a web site.
The problems tend to arise where IDNs are used across cultural/script boundaries. These cross-cultural boundaries are encountered not just by users but by implementers/companies that create tools, such as email clients, that are deployed across multilingual regions.
It seems to be accepted that there is a case for IDNs, and that they already work pretty well in the context of the browser, but problems in widespread usage of internationalized domain names beyond the browser are delaying demand, and this apparently slow demand doesn’t convince implementers to make changes – it’s a chicken and egg situation.
The main question asked at the meeting was how to break the vicious cycle. The general opinion seemed to lean to getting major players like Google, Microsoft and Apple to provide end-to-end support for IDNs throughout their product range, to encourage adoption by others.
Domain names are used beyond the browser context. Problem areas include:
- email clients generally don’t support use of non-ASCII email addresses
- standards don’t address the username part of email addresses as thoroughly as the domain part
- there’s an issue to do with SMTPUTF8 not being visible in all the right places
- you can’t be sure that your email will get through, it may be dropped on the floor even if only one cc is IDN
- applications that accept email IDs or IDNs
- even Russian PayPal IDs fail for the .рф domain
- things to be considered include:
- plain text detection: you currently need http or www at the start in Google Docs to detect that something is a domain name
- input validation: no central validation repository of TLDs
- rendering: what if the user doesn’t have a font?
- storage & normalization: ids that exist as either IDN or punycode are not unique ids
- security and spam controls: Google won’t launch a solution without resolving phishing issues; some spam filters or anti-virus scanners think IDNs are dangerous abnormalities
- other integrations: add contact, create mail and send mail all show different views of IDN email address
- search: how do you search for IDNs in contacts list?
- search in general already works pretty well on Google
- I wasn’t clear about how equivalent IDN and Latin domain names will be treated
- mobile devices: surprisingly for the APTLD folks, it’s harder to find the needed fonts and input mechanisms to allow typing IDNs in mobile devices
- consistent rendering:
- some browsers display as punycode in some circumstances – not very user friendly
- there are typically differences between full and hybrid (ie. partial) internationalized domain names
- IDNs typed in twitter are sent as punycode (mouse over the link in the tweet on a twitter page)
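The storage and normalization point above, that one name can exist as either Unicode or punycode, can be seen with Python's built-in idna codec. Note this codec implements the older IDNA 2003 algorithm, so it is illustrative only; registries now follow IDNA 2008.

```python
# one logical name, two representations: storage, search and comparison
# need to normalize to a single form, usually the ASCII "xn--" form
unicode_form = "пример.рф"
ascii_form = unicode_form.encode("idna")   # ASCII-compatible encoding
round_trip = ascii_form.decode("idna")     # back to the Unicode form
```

This is why treating the two spellings as distinct ids (as some address books and spam filters do) breaks lookups: they identify the same domain.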
Google are working on enabling IDNs throughout their application space, including Gmail but also many other applications – they pulled back from fixing many small, unconnected bugs in order to develop a company-wide strategy and roll out fixes across all engineering teams. The Microsoft speaker echoed the same concerns and approaches.
In my talk, I expressed the hope that Google, Microsoft and others would collaborate to develop synergies and standards wherever feasible. Microsoft also called for a standard approach, rather than in-house, proprietary solutions, to ensure interoperability.
However, progress is slow because changes need to be made in so many places, not just the email client.
Google expects to have some support for international email addresses this summer. You won’t be able to sign up for Arabic/Chinese/etc email addresses yet, but you will be able to use Gmail to communicate with users on other providers who have internationalized addresses. Full implementation will take a little longer because there’s no real way to test things without raising inappropriate user expectations if the system is live.
SaudiNIC has been running Arabic emails for some time, but it’s a home-grown and closed system – they created their own protocols, because there were no IETF protocols at the time – the addresses are actually converted to punycode for transmission, but displayed as Arabic to the user (http://nic.sa).
Google uses system information about language preferences of the user to determine whether or not to display the IDN rather than punycode in Chrome’s address bar, but this could cause problems for people using a shared computer, for example in an internet café, a conference laptop etc. They are still worrying about users’ reactions if they can’t read/display an email address in non-ASCII script. For email, currently they’re leaning towards just always showing the Unicode version, with the caveat that they will take a hard line on mixed script (other than something mixed with ASCII) where they may just reject the mail.
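A crude sketch of the kind of mixed-script check described, using a heuristic based on Unicode character names; a real implementation would use the Unicode Script property as defined in UTS #39 rather than this shortcut.

```python
import unicodedata

def scripts_used(label):
    # heuristic: for letters, the first word of the Unicode character
    # name ('LATIN', 'CYRILLIC', ...) approximates the script; digits
    # and punctuation are ignored, as an ASCII mix is tolerated
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

def is_mixed_script(label):
    return len(scripts_used(label)) > 1

# classic spoof: Cyrillic а (U+0430) hiding among Latin letters
spoof = "p\u0430ypal"
```

A mail provider taking the "hard line" described might refuse to deliver to an address whose domain labels trip this kind of check.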
All the Arabic email addresses I saw were shown fully right to left, ie. <tld><domain>@<username>. I wonder whether this may dislodge some of the hesitation in the IETF about the direction in which web addresses should be displayed – perhaps they should therefore also flow right-to-left?? (especially if people write domain names without http://, which these guys seem to think they will).
Many of the people in the room wanted to dispense with the http:// for display of web addresses, to eliminate the ASCII altogether, also get rid of www. – problem is, how to identify the string as a domain name – is the dot sufficient?? We saw some examples of this, but they had something like “see this link” alongside.
By the way, Google is exploring the idea of showing the user, by default, only the domain name of a URL in future versions of the Chrome browser address bar. A Google employee at the workshop said “I think URLs are going away as far as something to be displayed to users – the only thing that matters is the domain name … users don’t understand the rest of the URL”. I personally don’t agree with this.
One participant proposed that government mandates could be very helpful in encouraging adaptation of technologies to support international domain names.
I gave a talk and was on a panel. Basically my message was:
Most of the technical developments for IDN and IRIs were developed at the IETF and the Unicode Consortium, but with significant support by people involved in the W3C Internationalization Working Group. Although the W3C hasn’t been leading this work, it is interested in understanding the issues and providing support where appropriate. We are, however, also interested in wider issues surrounding the full path name of the URL (not just the domain name), 3rd level domain labels, frag ids, IRI vs punycode for domain name escaping, etc. We also view domain names as general resource identifiers (eg. for use in linked data), not just for a web presence and marketing.
I passed on a message that groups such as the Wikimedia folks I met with in Madrid the week before are developing a very wide range of fonts and input mechanisms that may help users input non-Latin IDs on terminals, mobile devices and such like, especially when travelling abroad. It’s something to look into. (For more information about Wikimedia’s jQuery extensions, see here and here.)
I mentioned the idea of bidi issues related to both the overall direction of Arabic/Hebrew/etc URLs/domain names, and the more difficult question about to handle mixed direction text that can make the logical http://www.oman/muscat render to the user as http://www.muscat/oman when ‘muscat’ and ‘oman’ are in Arabic, due to the default properties of the Unicode bidi algorithm. Community guidance would be a help in resolving this issue.
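The reordering described above falls directly out of the character-level bidi classes, which can be inspected with the stdlib. The Arabic words below are assumptions for illustration, not the actual strings from the talk.

```python
import unicodedata

# Why a logical http://www.oman/muscat can display as .../muscat/oman
# when both words are Arabic: the letters carry bidi class AL
# (right-to-left Arabic letter), while "/" is a neutral separator
# (class CS). With no strong left-to-right character between them, the
# bidi algorithm lays out the whole run right to left, flipping the
# visual order of the two words.
oman, muscat = "عمان", "مسقط"

for ch in oman + "/" + muscat:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
```

Printing the classes makes it easy to see that nothing in the string anchors a left-to-right reading order.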
I said that the W3C is all about getting people together to find interoperable solutions via consensus, and that we could help with networking to bring the right people together. I’m not proposing that we should take on ownership of the general problem of Universal Acceptance, but I did suggest that if they can develop specific objectives for a given aspect of the problem, and identify a natural community of stakeholders for that issue, then they could use our Community Groups to give some structure to and facilitate discussions.
I also suggested that we all engage in grass-roots lobbying, requesting that service/tool providers allow us to use IDNs.
At the end of the first day, Don Hollander summed up what he had gathered from the presentations and discussions as follows:
People want IDNs to work, they are out there, and they are not going away. Things don’t appear quite so dire as he had previously thought, given that browser support is generally good, closed email communities are developing, and search and indexing works reasonably well. Also Google and Microsoft are working on it, albeit perhaps slower than people would like (but that’s because of the complexity involved). There are, however, still issues.
The question is how to go forward from here. He asked whether APTLD should coordinate all communities at a high level with a global alliance. After comments from panelists and participants, he concluded that APTLD should hold regular meetings to assess and monitor the situation, but should focus on advocacy. The objective would be to raise visibility of the issues and solutions. “The greatest contribution from Google and Microsoft may be to raise the awareness of their thousands of geeks.” ICANN offered to play a facilitation role and to generate more publicity.
One participant warned that we need a platform for forward motion, rather than just endless talking; I made the same point in my panel contributions. I was a little disappointed (though not particularly surprised) that APTLD didn’t try to grasp the nettle and set up subcommittees to bring players together to take practical steps toward interoperable solutions, but hopefully the advocacy will help move things forward, and developments by companies such as Google and Microsoft will start a ball rolling that eventually breaks the deadlock.
May 23, 2014
I’ve been trying to understand how web pages need to support justification of Arabic text, so that there are straight lines down both left and right margins.
The following is an extract from a talk I gave at the MultilingualWeb workshop in Madrid at the beginning of May. (See the whole talk.) It’s very high level, and basically just draws out some of the uncertainties that seem to surround the topic.
Let’s suppose that we want to justify the following Arabic text, so that there are straight lines at both left and right margins.
[Image: Unjustified Arabic text]
Generally speaking, received wisdom says that Arabic does this by stretching the baseline inside words, rather than stretching the inter-word spacing (as would be the case in English text).
To keep it simple, let’s just focus on the top two lines.
One way you may hear that this can be done is by using a special baseline extension character in Unicode, U+0640 ARABIC TATWEEL.
[Image: Justification using tatweels]
The picture above shows Arabic text from a newspaper where we have justified the first two lines using tatweels in exactly the same way it was done in the newspaper.
Apart from the fact that this looks ugly, one of the big problems with this approach is that there are complex rules for the placement of baseline extensions. These include:
- extensions can only appear between certain characters, and are forbidden around other characters
- the number of allowable extensions per word and per line is usually kept to a minimum
- words vary in appropriateness for extension, depending on word length
- there are rules about where in the line extensions can appear – usually not at the beginning
- different font styles have different rules
An ordinary web author who is trying to add tatweels to manually justify the text may not know how to apply these rules.
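A deliberately naive sketch of the tatweel approach shows how little the character itself knows about those rules; the function below is made up for illustration and ignores every constraint just listed.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL, the baseline-extension character
assert unicodedata.name(TATWEEL) == "ARABIC TATWEEL"

# Deliberately naive: stretch a word by inserting tatweels after its
# first letter. A real system would have to check which letter pairs
# may be stretched, how many extensions are allowed per word and line,
# the position in the line, and the font style in use.
def naive_stretch(word: str, n: int = 3) -> str:
    return word[0] + TATWEEL * n + word[1:]

print(naive_stretch("كتب"))   # كـــتب
```

The character is trivial to insert; knowing *where* it is legitimate to insert it is the hard part.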
A fundamental problem on the Web is that when text size or font is changed, or a window is stretched, etc, the tatweels will end up in the wrong place and cause problems. The tatweel approach is of no use for paragraphs of text that will be resized as the user stretches the window of a web page.
In the next picture we have simply switched to a font in the Naskh style. You can see that the tatweels applied to the word that was previously at the end of the first line now make the word too long to fit there. The word has wrapped to the beginning of the next line, and we have a large gap at the end of the first line.
[Image: Tatweels in the wrong place due to just a font change]
To further compound the difficulties mentioned above regarding the rules of placement for extensions, each different style of Arabic font has different rules. For example, the rules for where and how words are elongated are different in the Nastaliq version of the same text which you can see below. (All the characters are exactly the same, only the font has changed.) (See a description of how to justify Urdu text in the Nastaliq style.)
[Image: Same text in the Nastaliq font style]
And fonts in the Ruqah style never use elongation at all. (We’ll come back to how you justify text using Ruqah-style fonts in a moment.)
[Image: Same text in the Ruqah font style]
In the next picture we have removed all the tatweel characters, and we are showing the text using a Naskh-style font. Note that this text has more ligatures on the first line, so it is able to fit in more of the text on that line than the first font we saw. We’ll again focus on the first two lines, and consider how to justify them.
[Image: Same text in the Naskh font style]
High-end systems have the ability to allow relevant characters to be elongated by working with the font glyphs themselves, rather than requiring additional baseline extension characters.
[Image: Justification using letter elongation (kashida)]
In principle, if you are going to elongate words, this is a better solution for a dynamic environment. It means, however, that:
- the rules for applying the right-sized elongations to the right characters have to be applied at runtime by the application and font working together, and as the user or author stretches the window, changes font size, adds text, etc., the location and size of the elongations need to be reconfigured
- there needs to be some agreement about what those rules are, or at least a workable set of rules for an off-the-shelf, one-size-fits-all solution.
The latter is the fundamental issue we face. There is very little high-quality information available about how to do this, and a lack of consensus not only about what the rules are, but about how justification should be done.
Some experts will tell you that text elongation is the primary method for justifying Arabic text (for example), while others will tell you that inter-word and intra-word spacing (where there are gaps in the letter-joins within a single word) should be the primary approach, and kashida elongation may or may not be used in addition where the space method is strained.
[Image: Justification using inter-word spacing]
The space-based approach, of course, makes a lot of sense if you are dealing with fonts of the Ruqah style, which do not accept elongation. However, the fact that the rules for justification need to change according to the font that is used presents a new challenge for a browser that wants to implement justification for Arabic. How does the browser know the characteristics of the font being used and apply different rules as the font is changed? Fonts don’t currently indicate this information.
Looking at magazines and books on a recent trip to Oman I found lots of justification. Sometimes the justification was done using spaces, other times using elongations, and sometimes there was a mixture of both. In a later post I’ll show some examples.
By the way, for all the complexity described so far, this is still quite a simplistic overview of what’s involved in Arabic justification. For example, high-end systems that justify Arabic text also allow the typesetter to adjust the length of a line of text by manual adjustments that tweak such things as alternate letter shapes, various joining styles, different lengths of elongation, and discretionary ligation forms.
The key messages:
- We need an Arabic Layout Requirements document to capture the script needs.
- Then we need to figure out how to adapt Open Web Platform technologies to implement the requirements.
- To start all this, we need experts to provide information and develop consensus.
Any volunteers to create an Arabic Layout Requirements document? The W3C would like to hear from you!
When it comes to wrapping text at the end of a line in a web page, there are some special rules that should be applied if you know the language of the text is either Chinese or Japanese (ie. if the markup contains a lang attribute to identify the content as such).
There’s an open question in the editor’s draft about whether Korean has any special behaviours that need to be documented in the spec, when the markup uses lang to identify the content as Korean.
If you want to provide information, take a look at what’s in the CSS3 Text module and write to email@example.com and copy firstname.lastname@example.org.
If you put a span tag around one or two letters in an Arabic word, say to change the colour, it breaks the cursiveness in WebKit and Blink browsers. You can change things like colour in Mozilla and IE, but changing the font breaks the joining there too.
Breaking on colour change makes it hard to represent educational texts and things such as the Omantel logo, which I saw all over Muscat recently. (Omantel is the largest internet provider in Oman.) Note how, despite the colour change, the Arabic letters in the logo below (on the left) still join.
[Image: Multi-coloured Omantel logo on a building in Muscat.]
Here’s an example of an educational page that colours parts of words. You currently have to use Firefox or IE to get the desired effect.
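One workaround sometimes suggested for engines that break joining at element boundaries is to add ZERO WIDTH JOINER (U+200D) on each side of the markup boundary, so the letters keep their joined shapes. Results vary by browser and font, so treat this as an illustrative sketch; the helper function and the Arabic fragments are made up for the example.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER: requests cursive joining across a break

def colour_middle(before: str, middle: str, after: str, colour: str = "red") -> str:
    """Wrap `middle` in a coloured span, adding ZWJ at each in-word
    boundary so adjacent Arabic letters keep their joined shapes in
    engines that break joining at element boundaries. Sketch only."""
    out = [before]
    if before:
        out.append(ZWJ)           # keep the letter before the span joined
    out.append(f'<span style="color:{colour}">')
    if before:
        out.append(ZWJ)           # keep the first letter inside joined
    out.append(middle)
    if after:
        out.append(ZWJ)           # keep the last letter inside joined
    out.append("</span>")
    if after:
        out.append(ZWJ)           # keep the letter after the span joined
    out.append(after)
    return "".join(out)

html = colour_middle("عم", "ا", "ن")
print(html.count(ZWJ))  # 4: one on each side of each boundary
```

The extra characters are invisible, but they do change the underlying text, which matters for search, copy-and-paste and spell-checking, so this is very much a workaround rather than a fix.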
This led to questions about what to do if you convert block elements, such as li, into inline elements that sit side by side. You probably don’t want the character at the end of one li tag to join with the next one. What if there is padding or margins between them – should this cause bidi isolation as well as preventing joining behaviour?
See a related thread on the W3C Internationalization and CSS lists.
May 21, 2014
I too am surprised it took Google this long to launch a dedicated Thai website. YouTube has local versions in 61 countries and 61 different languages. Today’s launch in Thailand makes that the 62nd. Currently, there are over one billion unique visits to YouTube each month. Six billion hours of videos are watched monthly. 80 […]
May 19, 2014
The Unicode Consortium is pleased to announce the release of version 2014-05-16 of the Unicode Ideographic Variation Database (IVD). This release registers the new Moji_Joho collection, along with the first 10,710 sequences in that collection, 9,685 of which are shared by the registered Hanyo-Denshi collection. Details can be found at http://www.unicode.org/ivd/.
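Structurally, an entry in the IVD is an Ideographic Variation Sequence: a base CJK ideograph followed by a variation selector in the range U+E0100..U+E01EF. A minimal checker can be sketched as follows; the specific sequence used is illustrative only, not a claim about the Moji_Joho registration data.

```python
# An Ideographic Variation Sequence (IVS) is a base ideograph followed by
# a variation selector in U+E0100..U+E01EF (Variation Selectors
# Supplement). Registered collections like Moji_Joho and Hanyo-Denshi
# assign specific glyphs to specific sequences.
def is_ivs(seq: str) -> bool:
    return (len(seq) == 2
            and 0xE0100 <= ord(seq[1]) <= 0xE01EF)

print(is_ivs("\u845b\U000E0100"))  # True: ideograph + VS17
print(is_ivs("ab"))                # False
```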
May 15, 2014
The slides from the MultilingualWeb workshop (including several posters) and the LIDER roadmapping workshop are now available for download. Additional material (videos of the presentations, a workshop report and more) will follow in the coming weeks – stay tuned.
May 12, 2014
May 07, 2014
My latest post for client Pitney Bowes on making sense of the hype concerning the new generic TLDs.
The first wave is rolling in
More than 1,200 gTLDs have been applied for so far. You can check the status of each application online (https://gtldresult.icann.org/application-result/applicationstatus) and learn more about what each applicant plans to do with it. In many cases, applicants such as KPMG and Hermes are planning to register their brand names for internal uses. But many applicants are registering domains in the hope of creating a popular and lucrative new source of revenue.
So far, more than 125 domain names have been delegated, including such names as:
- онлайн (online)
- 公司 (company)
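In the DNS itself, these IDN TLDs are carried in their ASCII-compatible (punycode) forms. A quick sketch with Python's stdlib "idna" codec (IDNA 2003; registries use IDNA 2008, but these particular labels encode identically under both):

```python
# The two delegated IDN TLDs listed above, shown alongside the
# ASCII-compatible forms that actually appear in the DNS root zone.
for label in ("онлайн", "公司"):
    print(label, "->", label.encode("idna").decode("ascii"))
# онлайн -> xn--80asehdb
# 公司 -> xn--55qx5d
```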
My take: Many of these new TLDs are going to amount to nothing. But many will be quite successful and will usher in a new wave of innovations. So anyone who dismisses gTLDs altogether is mistaken.
May 06, 2014
The MultilingualWeb workshop on 7-8 May will be streamed live! Follow the event online if you cannot make it to Madrid. For details about speakers and presentations see the workshop program. The workshop is supported by the LIDER project and sponsored by Verisign and Lionbridge.