The article was edited to make it easier for non-experts to follow. An example of an encoding declaration was added, as was a form for checking HTTP headers, and most of the text was reworked.
See the updated article.
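For a flavour of what the article covers: an in-document encoding declaration is a meta element placed early in the head. A minimal sketch (the page content here is hypothetical):

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <!-- Declare the character encoding as early as possible in the head -->
    <meta charset="utf-8">
    <title>Example page</title>
  </head>
  <body>
    <p>Hello, world</p>
  </body>
</html>
```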
The Planet Web I18n aggregates posts from various blogs that talk about Web internationalization (i18n). While it is hosted by the W3C Internationalization Activity, the content of the individual entries represents only the opinions of their respective authors and does not reflect the position of the Internationalization Activity.
A draft of a new article, Ruby Markup, is out for wide review. We are looking for comments by 5 May.
The article describes how to mark up HTML for ruby support. (It will later be followed by a similar article describing how to style ruby.)
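As a small taste of the markup involved, ruby annotations wrap base text and its reading in the ruby and rt elements. A minimal example (the word and reading here are just for illustration):

```html
<!-- 漢字 (kanji) annotated with its hiragana readings -->
<ruby>漢<rt>かん</rt>字<rt>じ</rt></ruby>
```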
Please send any comments as github issues by clicking on the link “Leave a comment” at the bottom of the article. (This will add some useful information to your comment.)
For twenty-five years the Internationalization & Unicode® Conference (IUC) has been the preeminent event highlighting the latest innovations and best practices of global and multilingual software providers. The 40th conference will be held this year on November 1-3, 2016 in Santa Clara, California.
The deadline for speaker submissions is Monday, 4 April, so don’t forget to send in an abstract if you want to speak at the conference.
The Program Committee will notify authors by Friday, May 13, 2016. Final presentation materials will be required from selected presenters by Friday, July 22, 2016.
Tutorial Presenters receive complimentary conference registration and two nights’ lodging, while Session Presenters receive a fifty percent conference discount and two nights’ lodging.
UniView now supports the characters introduced for the beta version of Unicode 9. Any changes made during the beta period will be added when Unicode 9 is officially released. (Images are not available for the Tangut additions, but the character information is available.)
It also brings in notes for individual characters where those notes exist, if Show notes is selected. These notes are not authoritative, but are provided in case they prove useful.
A new icon was added below the text area that inserts commas between the characters in the text area.
Links to the help page that used to appear on mousing over a control have been removed. Instead there is a noticeable, blue link to the help page, and the help page has been reorganised and uses image maps so that it is easier to find information. The reorganisation puts more emphasis on learning by exploration, rather than learning by reading.
Various tweaks were made to the user interface.
This new article addresses the question: If my site contains alternative language versions of the same page, what can I do to help the user see the page in their preferred language?
This article is relevant for pages for which there are complete translations of the content. If your alternative pages have different content, or are regional variants rather than translations, you may need to do things differently.
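One of the techniques in this space is to advertise the alternative language versions of a page with link elements carrying hreflang attributes, so that user agents and search engines can pick the right one. A sketch, assuming hypothetical English and Swedish URLs:

```html
<link rel="alternate" hreflang="en" href="https://example.com/en/page">
<link rel="alternate" hreflang="sv" href="https://example.com/sv/page">
```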
The article is accompanied by a Swedish translation, thanks to Olle Olsson.
I have just published a picker for Egyptian Hieroglyphs.
This Unicode character picker allows you to produce or analyse runs of Egyptian Hieroglyph text using the Latin script.
Characters are grouped into standard categories. Click on one of the orange characters, chosen as a nominal representative of the class, to show below all the characters in that category. Click on one of those to add it to the output box. As you mouse over the orange characters, you’ll see the name of the category appear just below the output box.
Just above the orange characters you can find buttons to insert RLO and PDF controls. RLO will make the characters that follow it progress from right to left. Alternatively, you can select more controls > Output direction to set the direction of the output box to RTL/LTR override. The latter approach will align the text to the right of the box. I haven’t yet found a Unicode font that also flips the glyphs horizontally as a result. I’m not entirely sure about the best way to apply directionality to Egyptian hieroglyphs, so I’m happy to hear suggestions.
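To illustrate what the RLO and PDF buttons actually insert: they are the invisible Unicode bidi control characters U+202E RIGHT-TO-LEFT OVERRIDE and U+202C POP DIRECTIONAL FORMATTING. A minimal Python sketch of wrapping a run of text in such a pair:

```python
# U+202E RIGHT-TO-LEFT OVERRIDE forces the characters that follow it to
# be laid out right to left; U+202C POP DIRECTIONAL FORMATTING ends the
# override. Both are invisible in rendered text.
RLO = "\u202E"
PDF = "\u202C"

def rtl_override(text: str) -> str:
    """Wrap text in an RLO...PDF pair, as the picker's buttons do."""
    return RLO + text + PDF

wrapped = rtl_override("abc")
print([hex(ord(c)) for c in wrapped])
# ['0x202e', '0x61', '0x62', '0x63', '0x202c']
```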
Alongside the direction controls are some characters used for markup in the Manuel de Codage, which allows you to prepare text for an engine that knows how to lay it out two-dimensionally. (The picker doesn’t do that.)
The Latin Characters panel, opened from the grey bar to the left, provides characters needed for transcription.
In case you’re interested, here is the text you can see in the picture. (You’ll need a font to see this, of course. Try the free Noto Sans font, if you don’t have one – or copy-paste these lines into the picker, where you have a webfont.)
The last two lines spell the name of Amenhotep using Manuel de Codage markup, according to the Unicode Standard (p 432).
A new article, What is Ruby?, is out for wide review. We are looking for comments by 10 February.
This new article will replace an older page, simply called Ruby, with more complete and up-to-date information. Other articles in preparation will address how to use markup and styling in HTML and CSS.
Please send any comments as github issues by clicking on the link “Leave a comment” at the bottom of the article. (This will add some useful information to your comment.) You may find that some links in the article won’t work, because this is a copy of the article which will eventually be published on the W3C site. There is no need to report those.
FREME is a project that is developing a Framework for multilingual and semantic enrichment of digital content. A key aspect of the framework is that it puts standards and best practices in the area of linguistic linked data and multilingual content processing in action. We will introduce the framework in a dedicated webinar on 22 February, 4 p.m. CET. If you are interested in participating please contact Nieves Sande and Felix Sasaki for further logistics.
I just received a query from someone who wanted to know how to figure out what characters are in and what characters are not in a particular legacy character encoding. So rather than just send the information to her I thought I’d write it as a blog post so that others can get the same information. I’m going to write this quickly, so let me know if there are parts that are hard to follow, or that you consider incorrect, and I’ll fix it.
A few preliminary notes to set us up: When I refer to ‘legacy encodings’, I mean any character encoding that isn’t UTF-8. Though, actually, I will only consider those that are specified in the Encoding spec, and I will use the data provided by that spec to determine what characters each encoding contains (since that’s what it aims to do for Web-based content). You may come across other implementations of a given character encoding, with different characters in it, but bear in mind that those are unlikely to work on the Web.
Also, the tools I will use refer to a given character encoding using the preferred name. You can use the table in the Encoding spec to map alternative names to the preferred name I use.
Let’s suppose you want to know what characters are in the character encoding you know as cseucpkdfmtjapanese. A quick check in the Encoding spec shows that the preferred name for this encoding is euc-jp.
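If you have Python to hand, its built-in codecs offer a quick way to test whether a single character is representable in euc-jp. Bear in mind this is an approximation: Python’s codec tables are close to, but not guaranteed identical to, the data in the Encoding spec that browsers use.

```python
def in_encoding(char: str, encoding: str = "euc-jp") -> bool:
    """Return True if char can be represented in the given legacy encoding.

    Caveat: Python's codec tables may differ in edge cases from the
    WHATWG Encoding spec data used on the Web.
    """
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(in_encoding("あ"))  # True: hiragana A is in JIS X 0208, hence euc-jp
print(in_encoding("Ͱ"))  # False: archaic Greek heta is not in euc-jp
```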
Go to http://r12a.github.io/apps/encodings/ and look for the selection control near the bottom of the page labelled show all the characters in this encoding.
Select euc-jp. It opens a new window that shows you all the characters.
This is impressive, but the list is so large that it’s not as useful as it could be.
So highlight and copy all the characters in the text area and go to https://r12a.github.io/apps/listcharacters/.
Paste the characters into the big empty box, and hit the button Analyse characters above.
This will now list for you those same characters, but organised by Unicode block. At the bottom of the page it gives a total character count, and adds up the number of Unicode blocks involved.
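The grouping that the app performs can be approximated in a few lines of Python. The standard library doesn’t expose Unicode block names, so the sketch below hardcodes a tiny illustrative subset of block ranges; a real script would load the full Blocks.txt data file from Unicode.

```python
# A tiny, illustrative subset of Unicode block ranges (from Blocks.txt).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x3040, 0x309F, "Hiragana"),
]

def block_of(char: str) -> str:
    """Return the block name for a character, from the sample table above."""
    cp = ord(char)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "Unknown"

def group_by_block(text: str) -> dict:
    """Group the distinct characters of text by Unicode block, in code point order."""
    groups = {}
    for char in sorted(set(text)):
        groups.setdefault(block_of(char), []).append(char)
    return groups

print(group_by_block("abcαβあ"))
```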
If instead you actually want to know what characters are not in the encoding for a given Unicode block you can follow these steps.
Go to UniView (http://r12a.github.io/uniview/) and select the block you are interested in where it says Show block, or alternatively type the range into the control labelled Show range (e.g. 0370:03FF).
Let’s imagine you are interested in Greek characters and you have therefore selected the Greek and Coptic block (or typed 0370:03FF in the Show range control).
On the edit buffer area (top right) you’ll see a small icon with an arrow pointing upwards. Click on this to bring all the characters in the block into the edit buffer area. Then hit the icon just to its left to highlight all the characters and then copy them to the clipboard.
Next open http://r12a.github.io/apps/encodings/ and paste the characters into the input area labelled Unicode characters to encode, and hit the Convert button.
The Encoding converter app will list all the characters in a number of encodings. If the character is part of the encoding, it will be represented as two-digit hex codes. If not, and this is what you’re looking for, it will be represented as decimal HTML escapes (e.g. &#880; for Ͱ). This way you can get the decimal code point values for all the characters not in the encoding. (If all the characters exist in the encoding, the block will turn green.)
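You can reproduce this behaviour in Python with the xmlcharrefreplace error handler, which substitutes a decimal numeric character reference for each character the encoding cannot represent. (As before, Python’s euc-jp tables may differ slightly from the Encoding spec data.)

```python
# Greek alpha is in euc-jp (via JIS X 0208); archaic heta (U+0370) is not.
text = "αͰ"

# Characters the encoding can't represent come out as decimal HTML
# escapes, e.g. U+0370 -> b'&#880;'; alpha comes out as its EUC bytes.
encoded = text.encode("euc-jp", errors="xmlcharrefreplace")
print(encoded)
```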
(If you want to see the list of characters, copy the results for the encoding you are interested in, go back to UniView and paste the characters into the input field labelled Find. Then click on Dec. Ignore all ASCII characters in the list that is produced.)
Note, by the way, that you can tailor the encodings that are shown by the Encoding converter by clicking on change encodings shown and then selecting the encodings you are interested in. There are 36 to choose from.
This tutorial workshop, sponsored by the Unicode Consortium and organized by the German University of Technology in Muscat, Oman, is a three-day event designed to familiarize the audience with the Unicode Standard and the concepts of internationalization. It is the first Unicode event to be held in the Middle East.
The workshop program includes an introduction to Writing Systems & Unicode, plus presentations on Arabic Typography, web best practices, mobile internationalization, and more.
The workshop website provides full information about the event. Early bird registration lasts until January 31, 2016, but register early to ensure a place.