The Unicode Consortium has announced the release of Version 6.1 of the Unicode Standard, continuing Unicode’s long-term commitment to support the full diversity of languages around the world. This latest version adds characters to support additional languages of China, other Asian countries, and Africa. It also addresses educational needs in the Arabic-speaking world. A total of 732 new characters have been added.
This version of the Standard also brings technical improvements to support implementers. Changes to property values and their aliases mean that properties now have easy-to-specify labels. The new labels, combined with a new script extensions property, mean that regular expressions can be more straightforward and easier to validate.
Over 200 new Standardized Variants have been added for emoji characters, allowing implementations to distinguish preferred display styles between text and emoji styles. For example:
26FA FE0F TENT emoji style
26FD FE0E FUEL PUMP text style
26FD FE0F FUEL PUMP emoji style
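As a minimal sketch of how such sequences look to a program, the Python snippet below turns the hexadecimal notation used above into strings and reports which style the trailing variation selector requests. The helper names `parse_sequence` and `describe` are made up for illustration. The snippet assumes the Python build ships a Unicode database that knows these characters. Whether a renderer honors the requested style depends on the font and platform.

```python
import unicodedata

VS15 = "\uFE0E"  # VARIATION SELECTOR-15: requests text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: requests emoji presentation


def parse_sequence(hex_codes):
    """Turn a string such as '26FD FE0F' into the characters it denotes.

    Hypothetical helper for illustration only.
    """
    return "".join(chr(int(cp, 16)) for cp in hex_codes.split())


def describe(seq):
    """Name the base character and the style its variation selector requests."""
    base, selector = seq[0], seq[1:]
    styles = {VS15: "text style", VS16: "emoji style"}
    return f"{unicodedata.name(base)} {styles.get(selector, 'unspecified style')}"


for codes in ("26FA FE0F", "26FD FE0E", "26FD FE0F"):
    print(codes, describe(parse_sequence(codes)))
```

The base character and the selector remain two separate code points, so code that counts or truncates by code point should take care not to split the pair.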
Among the notable property changes and additions in Unicode 6.1 are two new line break property values, which improve the line-breaking behavior of Hebrew and Japanese text. Segmentation behavior was also improved for Thai, Lao, and similar languages.
Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and have updates for Version 6.1. These will be finalized in February:
UTS #10, Unicode Collation Algorithm
UTS #46, Unicode IDNA Compatibility Processing
The Unicode® Consortium will close its call for participation in the 35th Internationalization & Unicode® Conference (IUC 35) on Friday, March 25. If you want to present at the conference, submit your proposal soon.
The Program Committee will notify authors by Wednesday, April 20. Final presentation materials will be required from selected presenters by Wednesday, August 3.
The conference will take place in Santa Clara, Calif., USA, on October 17-19, 2011, and is sponsored by Adobe. The conference is produced by OMG®.
This is the premier conference on technologies and practices for the creation and management of global and multilingual software solutions. This annual event is praised for its excellent technical content, industry-tested recommendations and updates on the latest standards.
The Unicode 6.0 core specification includes information on scripts newly encoded in Unicode 6.0, as well as many updates and clarifications to other sections of the text. The release of the core specification completes the definitive documentation of the Unicode Standard, Version 6.0.
In Version 6.0, the standard grew by 2,088 characters. Over 1,000 of these characters are symbols used for text exchange on mobile phones. The Unicode Standard now also includes the recently created official symbol for the Indian rupee. Once computers and mobile phones update to Version 6.0, the rupee sign will be available for use just as $ and € are now.
In addition, this version adds many CJK Unified Ideographs in common use in China, Taiwan, and Japan, as well as characters for African language support, including extensions to the Tifinagh, Ethiopic, and Bamum scripts. Three scripts are supported for the first time: Mandaic, Batak, and Brahmi.
In October of 2010, the other portions of Unicode 6.0 were released: the Unicode Standard Annexes, code charts, and the Unicode Character Database. This allowed vendors to update their implementations of Unicode 6.0 as quickly as possible.
For more information on all of The Unicode Standard, Version 6.0, see http://www.unicode.org/versions/Unicode6.0.0/
RFC 6067 specifies an extension to BCP 47. BCP 47 provides subtags that specify language and/or locale-based behavior.
Many locale identifiers require additional “tailorings” or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange.
The maintaining authority for the extension defined by this document is the Unicode Consortium. The Unicode Consortium defines a standardized, structured set of locale data and identifiers for locale data in the “Common Locale Data Repository” or “CLDR”.
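To make the shape of the extension concrete, here is a minimal sketch of pulling the 'u' extension out of a language tag in Python. The function name `parse_u_extension` is hypothetical, and the sample tags are illustrations rather than text from the RFC. The sketch assumes the usual layout: a `u` singleton, optional attributes, then two-character keys each followed by their value subtags. It does not validate the tag or check keys against CLDR.

```python
def parse_u_extension(tag):
    """Return (attributes, keywords) found in the 'u' extension of a tag.

    Hypothetical helper: a simplified reading of the extension layout,
    with no validation. It also does not guard against a 'u' that appears
    inside a private-use ('x') section.
    """
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return [], {}
    start = subtags.index("u")

    attributes, keywords, key = [], {}, None
    for sub in subtags[start + 1:]:
        if len(sub) == 1:
            # Another singleton (a different extension or 'x') ends 'u'.
            break
        if len(sub) == 2:
            # A two-character subtag starts a new keyword.
            key = sub
            keywords[key] = []
        elif key is None:
            # Longer subtags before any key are attributes.
            attributes.append(sub)
        else:
            keywords[key].append(sub)
    return attributes, {k: "-".join(v) for k, v in keywords.items()}


print(parse_u_extension("de-DE-u-co-phonebk"))
print(parse_u_extension("th-TH-u-ca-buddhist-nu-thai"))
```

In these sample tags, `co`, `ca`, and `nu` are read as the collation, calendar, and numbering-system keys. The authoritative list of keys and values is the one maintained in CLDR.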
The newly finalized Unicode Version 6.0 adds 2,088 characters, with over 1,000 new symbols.
The October 2010 release includes the Unicode Character Database (UCD), Unicode Standard Annexes (UAXes), and code charts. With the release of these components, implementers are able to update their software to Unicode 6.0 without delay. The final text of the core specification will be available in early 2011.
A long-awaited feature of Unicode 6.0 is the encoding of hundreds of symbols for mobile phones. These emoji characters are in widespread use, especially in Japan, and have become an essential part of text messages there and elsewhere. Unicode 6.0 now provides for data interchange between different mobile vendors and across the internet. The symbols cover many domains: maps and transport, phases of the moon, UI symbols (such as fast-forward), and many others, including the symbol for the mobile phone itself.
A late-breaking addition is the newly created official symbol for the Indian rupee. “With the help of the Indian government and our colleagues in ISO, we were able to accelerate the encoding process,” said Mark Davis, president of the Consortium. “Once computers and mobile phones update to the new version of Unicode, people will be able to use the rupee sign like they use $ or € now.”
The Unicode CLDR committee is making Unicode locale-sensitive collation a major focus for the next release, CLDR 1.9. There are specific changes for a large number of languages, plus a change in the default ordering of punctuation vs symbols for all languages.
See the background document for more information:
If you have any feedback on any of the actions, please contact the Unicode Consortium as described in the background document.
Review period for this issue closes on October 1, 2010.
In addition to providing the basis for language identification on the Web, BCP 47 language tags also are used to control language and culturally specific APIs on many systems. Based on work done by the Unicode Consortium, the proposed Language Tag Extension ‘U’ provides additional subtags that can be used to refine locale-based details such as calendar, sort order, and other locale details.
The Unicode Consortium announced today the release of the new version of the Unicode Common Locale Data Repository (Unicode CLDR 1.8), providing key building blocks for software to support the world’s languages.
CLDR 1.8 contains data for 186 languages and 159 territories: 501 locales in all. Version 1.8 of the repository contains over 22% more locale data than the previous release, with over 42,000 new or modified data items from over 300 different contributors.
For this release, the Unicode Consortium partnered with ANLoc, the African Network for Localization, a project sponsored by Canada’s International Development Research Centre (IDRC), to help extend modern computing on the African continent. ANLoc’s vision is to empower Africans to participate in the digital age by enabling their languages in computers. A sub-project of ANLoc, called Afrigen, focuses on creating African locales.
For more information about Unicode CLDR 1.8, see the CLDR 1.8 Release Note.
The Unicode Consortium has just announced this new release of CLDR (Common Locale Data Repository), the largest and most extensive standard repository of locale data. This data is used for software internationalization and localization: adapting software to the conventions of different languages for such common software tasks as formatting of dates, times, time zones, numbers, and currency values; sorting text; choosing languages or countries by name; transliterating different alphabets; and many others. See more information about the Unicode CLDR project (including charts).
Version 5.2 of the Unicode Collation Algorithm has been released. This version resynchronizes the Unicode Collation Algorithm with all of the updates for the Unicode Standard, Version 5.2.
The rest of this post is taken from the Unicode Consortium’s release notification and details changes and issues for implementations.
- The text of UTS #10 has been updated. Among other changes, the revised text for UTS #10 makes it clear that the BASE for implicit generation of weights for Han characters does not include unassigned code points.
- There are small changes in Gujarati, Telugu, Malayalam (including weighting for chillus), Tamil, and Sinhala. While these changes move in the direction of expected behavior, good results will only come from tailoring for particular languages, such as with CLDR.
- There have been significant changes to the ordering of many combining marks. Many combining marks that are not in customary use in modern languages now have the same secondary weight, and will only be distinguished on a fourth level, by code point ordering. This can be seen by looking at the Unicode Collation Charts (http://unicode.org/charts/collation/). In 5.2, many characters now have a white background, indicating that they sort exactly the same as the previous character, unless a 4th (codepoint) level is used.
- Implementations of UCA should take note that the increased number of characters may cause overflows if the implementing code makes certain assumptions or optimizations. This can result either from the new character additions (which increase the number of distinct weights in the table) or because of changes in the way the weights, particularly for secondary weight values, are assigned in the table. The latter change may result in unexpected numbers of characters having the same weight.
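The effect of shared secondary weights and a code point tie-break can be sketched in a few lines of Python. Everything in the table below is invented for illustration: the weights are not taken from the UCA data files, and the sketch does not claim that these particular marks share a weight in the real table. It only shows the general idea of a multi-level sort key with an optional final code point level.

```python
# Toy collation elements: (primary, secondary, tertiary).
# HYPOTHETICAL values, not taken from the UCA weight table.
TOY_WEIGHTS = {
    "a": (0x10, 0x20, 0x02),
    "b": (0x11, 0x20, 0x02),
    "\u0300": (0x00, 0x21, 0x02),  # a mark given its own secondary weight
    "\u1DC4": (0x00, 0x30, 0x02),  # two marks given the SAME secondary
    "\u1DC5": (0x00, 0x30, 0x02),  # weight, for the sake of the example
}


def sort_key(text, codepoint_level=False):
    """Build a simplified multi-level sort key for text.

    Non-zero weights are gathered level by level, with 0 as a level
    separator. If codepoint_level is true, the code points are appended
    as a final tie-breaking level.
    """
    elements = [TOY_WEIGHTS[ch] for ch in text]
    key = []
    for level in range(3):
        key.extend(w[level] for w in elements if w[level] != 0)
        key.append(0)  # level separator
    if codepoint_level:
        key.extend(ord(ch) for ch in text)
    return tuple(key)


first, second = "a\u1DC4", "a\u1DC5"
print(sort_key(first) == sort_key(second))              # equal on three levels
print(sort_key(first, True) < sort_key(second, True))   # ordered by code point
```

A real implementation would load its weights from the published table and size its weight fields from that data rather than from assumptions about how many distinct values each level holds.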