Locale-based forms

Locale-based forms in HTML pages

Recently, a Blink intent to implement was published. It proposes that in HTML the form control UI should respect lang attribute values, instead of the browser UI locale. This feature will be applied to submit, reset, number, date, datetime-local, month, time, week input types, and so on.

This wiki page will attempt to capture and consolidate thinking on any issues associated with this, which on the face of it seems like a good idea.

There are two aspects to this. The first is that controls such as submit buttons would have translated strings for text that appears automatically, such as the word 'Submit'. The second is that of managing input by the user for things such as number, date and currency formats, etc.

Apart from needing to clarify what happens if the translation for a given language is not available, we do not see many issues with the first aspect mentioned above. We assume, however, that it would make sense to use the nearest language declaration to establish the language, and not require the lang attribute on the control. (This may be different for the second aspect.)

Most of the remainder of this wiki page focuses on issues and requirements related to data formats that change from one locale to another.

Examples

The number format for one thousand two hundred and thirty-four point five in England is:

1,234.5

In Germany it is:

1.234,5

It can be very important for the user to know what format is being used when the information is being displayed to them, or input by them. For example, if you are paying 1.003 Omani Rials for an item on a website, are you paying approximately one rial or approximately one thousand rials?
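The difference between the two conventions can be made concrete with a minimal sketch. The separator table below is hand-written for illustration only (a real browser would take this from locale data such as CLDR), and the name format_number is hypothetical.

```python
from decimal import Decimal

# Hypothetical, hand-written separator table (an assumption for this sketch).
SEPARATORS = {
    "en-GB": {"group": ",", "decimal": "."},
    "de-DE": {"group": ".", "decimal": ","},
}

def format_number(value: Decimal, locale: str) -> str:
    """Render a number using the group and decimal separators of a locale."""
    seps = SEPARATORS[locale]
    # Format with English-style separators first, then swap them.
    english = format(value, ",f")
    whole, _, fraction = english.partition(".")
    whole = whole.replace(",", seps["group"])
    return whole + (seps["decimal"] + fraction if fraction else "")

print(format_number(Decimal("1234.5"), "en-GB"))  # 1,234.5
print(format_number(Decimal("1234.5"), "de-DE"))  # 1.234,5
```

Under these assumptions, the value 1.003 formatted for en-GB and the value 1003 formatted for de-DE produce the same string, "1.003", which is exactly the ambiguity the rial example describes.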

Date formats are notoriously difficult when using only numbers. The date 1st February 2003 can be written as follows:

01/02/03
02/01/03 (generally North America only)
03/02/01

In addition to this, there are differences related to ordering of days on calendar views (MTWTFSS, SMTWTFS, SSMTWTF), and of course the separators are different (for example, 2013年2月1日 in Japan).
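The ambiguity of all-numeric dates can be illustrated with a short sketch. It writes 1st February 2003 in the three field orders listed above, then reads the string 01/02/03 back under each order. The order names and helper functions are hypothetical, and two-digit years are assumed to mean 20xx.

```python
from datetime import date

# Field orders from the list above (names are made up for this sketch).
ORDERS = {
    "day-month-year": ("d", "m", "y"),
    "month-day-year": ("m", "d", "y"),
    "year-month-day": ("y", "m", "d"),
}

def write_date(d: date, order: tuple) -> str:
    """Write a date as three two-digit fields in the given order."""
    values = {"d": d.day, "m": d.month, "y": d.year % 100}
    return "/".join(f"{values[k]:02d}" for k in order)

def read_date(text: str, order: tuple) -> date:
    """Read three two-digit fields in the given order (assumes 20xx)."""
    fields = dict(zip(order, (int(p) for p in text.split("/"))))
    return date(2000 + fields["y"], fields["m"], fields["d"])

for name, order in ORDERS.items():
    print(name, write_date(date(2003, 2, 1), order), read_date("01/02/03", order))
```

The same date yields three different strings, and the same string yields three different dates (1 February 2003, 2 January 2003 and 3 February 2001).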

Topics for discussion (issues, requirements, notes, questions)

Where to get the locale formats information from?

Currently, it appears that the browser uses the locale of the system to determine the format of input such as numbers and dates.

Another possibility is to use the language settings of the browser.

The disadvantage of both of the above approaches is that translated labels and formats are not necessarily consistent with the language of the page, for example where a bilingual person uses pages in more than one language, where a user is accessing the Web from an internet café or another person's computer, etc.

A third possibility would be to derive the appropriate format from Content-Language information, if there is any, in the header. This is, however, often not available, and so doesn't provide a useful source of information.

A fourth possibility is to use the value of the lang attribute in the HTML markup. The value of the lang attribute usually indicates a language, rather than a locale (which typically requires regional information in addition, e.g. French in Canada is fr-CA, French in Switzerland is fr-CH, etc.).

A fifth possibility is to use a new attribute specifically to indicate locales, but the i18n WG thinks that this would be unnecessary: the lang attribute should suffice, and relying on it alone keeps things simpler for the content author.

The i18n WG thinks that the lang attribute would serve well for locale identification. It is well-recognised by browsers, and can be set as needed by content authors to match the needs of the page. The values of the lang attribute are defined by BCP47, and can encompass regional, script and variant information, in addition to extensions that have been specifically designed with locale support in mind. (For more information, see Language tags in HTML and XML.)
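To show how much locale information a lang value can carry, here is a deliberately simplified sketch that splits a well-formed language tag into its parts. It is not a full BCP47 parser: it ignores extended language subtags, grandfathered tags and private-use rules, and the function name split_tag is hypothetical.

```python
def split_tag(tag: str) -> dict:
    """Split a well-formed language tag into its main components (simplified)."""
    subtags = tag.split("-")
    result = {"language": subtags[0].lower(), "script": None,
              "region": None, "variants": [], "extensions": {}}
    i = 1
    # Script: four letters, e.g. Hant.
    if i < len(subtags) and len(subtags[i]) == 4 and subtags[i].isalpha():
        result["script"] = subtags[i].title()
        i += 1
    # Region: two letters or three digits, e.g. CA or 419.
    if i < len(subtags) and (
        (len(subtags[i]) == 2 and subtags[i].isalpha())
        or (len(subtags[i]) == 3 and subtags[i].isdigit())
    ):
        result["region"] = subtags[i].upper()
        i += 1
    # Variants: anything longer than one character before the first singleton.
    while i < len(subtags) and len(subtags[i]) > 1:
        result["variants"].append(subtags[i].lower())
        i += 1
    # Extensions: a one-character singleton followed by its subtags.
    singleton = None
    while i < len(subtags):
        if len(subtags[i]) == 1:
            singleton = subtags[i].lower()
            result["extensions"][singleton] = []
        else:
            result["extensions"][singleton].append(subtags[i].lower())
        i += 1
    return result

print(split_tag("de-DE-u-co-phonebk"))
```

For de-DE-u-co-phonebk this yields the language de, the region DE, and a u extension carrying co and phonebk (phonebook collation), whereas fr alone yields only a language.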

How would the content author indicate that the system settings should be used, if they want to?

Setting the lang attribute to the null string (lang="") indicates that the language is undetermined. This could provide a way for the content author to specifically say that the page itself should not determine the locale.

Where to declare the language tag that will be used for locale setting?

If the language of a page is declared in the html tag, say, is it necessary to declare it again in the form control itself to determine the locale?

There may be advantages to restricting locale information to lang attributes on the form control itself (input, etc.). One is that you may want to be more specific in declaring the locale than in declaring the language. For example, de may be sufficient on the html tag, but you may need de-DE-u-co-phonebk for the form control. On the other hand, you can always do both.

See also the example in the following section about insufficiently detailed lang values.

It is possible to provide a lang value on the html element (or any other) that contains the full details needed for locale information, since the browser should be able to parse it for the relevant information at the point of need. If it was declared there, however, the content author would have to be diligent in ensuring that the same detailed information is available any time the lang attribute is used on a lower-level element that may contain a form control. This, again, points to a recommendation that locale-specific lang values are possibly best used close to or on the form controls themselves, rather than defined at the top of the document.
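The inheritance question can be sketched as a walk up the element tree. The Element class and effective_locale function below are hypothetical models, not browser APIs. The sketch assumes that the nearest lang declaration wins, that lang="" defers to the system settings (the idea floated above), and that a document with no lang at all also falls back to the system locale.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Element:
    name: str
    lang: Optional[str] = None          # None means no lang attribute here
    parent: Optional["Element"] = None

def effective_locale(element: Element, system_locale: str) -> str:
    """Return the nearest declared lang value, or the system locale."""
    node = element
    while node is not None:
        if node.lang is not None:
            # lang="" is treated as "undetermined": use the system settings.
            return node.lang or system_locale
        node = node.parent
    return system_locale  # assumption: no declaration anywhere

html = Element("html", lang="de")
form = Element("form", parent=html)
specific = Element("input", lang="de-DE-u-co-phonebk", parent=form)
inherited = Element("input", parent=form)
opted_out = Element("input", lang="", parent=form)

for control in (specific, inherited, opted_out):
    print(effective_locale(control, "en-US"))
```

Note that a lang attribute on an intermediate element (say, a div with lang="de") would shadow a more detailed value on the html element, which is the diligence problem described above.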

It could be claimed that a benefit of using a new locale attribute, rather than the lang attribute, would be that such things would be much clearer to the content author. Their mind would be much more concentrated on supplying an appropriate level of detail for the form control.

What to do if the locale is under-specified or invalid?

This point is related to the previous one. If the only locale information available is a language subtag 'fr', it is not sufficient to know whether to use the French, Swiss or Canadian formats for numbers, currency and date.

The browser needs to have a plan for dealing with this situation.

Browsers also need to define an approach for handling invalid values in lang attributes and pages without lang attributes.

(This problem may be avoided more frequently if content authors re-use a lang tag on the form control itself (or close by on a parent element) to provide the locale-specific information, since they stand a better chance of realising that they need to be more specific. On the other hand, many may not know how to do the right thing without some education.)
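One possible plan, offered only as a sketch, is to truncate the tag from the right until something supported is found, then apply a default region for a bare language, and finally fall back to the system locale. Both tables below are hypothetical; which region a bare fr should map to is precisely the open question.

```python
# Hypothetical data: locales the browser has formats for, and a default
# region per bare language. Neither table comes from a real browser.
SUPPORTED = {"fr-FR", "fr-CA", "fr-CH", "de-DE", "en-US", "en-GB"}
DEFAULT_REGION = {"fr": "fr-FR", "de": "de-DE", "en": "en-US"}

_SUPPORTED_BY_LOWER = {tag.lower(): tag for tag in SUPPORTED}

def resolve_locale(tag: str, system_locale: str) -> str:
    """Pick a formatting locale for a possibly vague or invalid lang value."""
    subtags = tag.lower().split("-") if tag else []
    while subtags:
        candidate = "-".join(subtags)
        if candidate in _SUPPORTED_BY_LOWER:
            return _SUPPORTED_BY_LOWER[candidate]
        if candidate in DEFAULT_REGION:
            return DEFAULT_REGION[candidate]
        subtags.pop()  # drop the last subtag and try again
    return system_locale  # missing or unrecognised value

for value in ("fr-CA", "fr", "fr-BE", "fr-CA-u-nu-latn", "xx-invalid", ""):
    print(repr(value), "->", resolve_locale(value, "en-GB"))
```

With this table, fr and fr-BE both resolve to fr-FR, which may or may not be what the author or the user expects; the sketch only shows that some explicit rule is needed.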

How would users know whether a comma is a thousands or decimal separator?

If twelve point three four five was presented to the user as 12,345, how would they know whether the form control had been set to the French locale (where a comma is a decimal separator) or an English locale (where the comma is a thousands separator)?
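The size of the misunderstanding can be shown with a minimal sketch that reads the same string under two conventions. The separator pairs are hand-written assumptions (real French locale data uses a no-break space for grouping rather than a plain space), and parse_number is a hypothetical helper.

```python
from decimal import Decimal

# Hypothetical (group separator, decimal separator) pairs for this sketch.
SEPARATORS = {
    "en": (",", "."),
    "fr": (" ", ","),
}

def parse_number(text: str, locale: str) -> Decimal:
    """Interpret a displayed number according to a locale's separators."""
    group, decimal = SEPARATORS[locale]
    cleaned = text.replace(group, "").replace(decimal, ".")
    return Decimal(cleaned)

print(parse_number("12,345", "en"))  # 12345
print(parse_number("12,345", "fr"))  # 12.345
```

The two readings differ by a factor of one thousand, and nothing in the string itself tells the user which one applies.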

Since the locale may or may not be determined by the content author, without some indicator the only way for the user to know would be to look in the source. Not all French pages they see would present the information in the same format, since not all content authors would indicate the locale.

This could be turned into a serious security hazard by people who intentionally try to create misdirection. How would the user protect themselves from such people?

How would users know which date format they have to use?

Imagine going to the English web site of a Malaysian airline and trying to book a flight for November 12. Should you type in "11/12" or "12/11"? It's quite likely that the web site is decorated with a UK flag (Malaysia used to be a British colony, and the US flag looks quite similar to the Malaysian one), and that the page's lang attribute is set to "en" because, well, it's in English...

How would back-end applications and databases know what the user meant to type?

Any number typed in by the user should be converted to a standard internal format by the browser and that standard format is what should be transmitted over the wire. The locale-appropriate format is generated by the browser when a number is displayed. This is simply a presentational device.

This only works, however, if the user typed what they intended, i.e. the browser has to make it clear enough to the user what the input means (for example, whether a comma is a decimal or thousands separator).

Can users override the browser's format for data with their preferred format?

If a user is filling in a form in another language and wants to check that they are entering the right date or number value, or if they simply have a preferred format, would they be able to tell the browser to use that format instead of the one related to the language of the content?

This could vary from simply choosing an alternative locale to specifying particular attributes of the formats used (as you can in MS Office products, choosing your preferred date separators, DDMMYYYY format, etc.).

This kind of feature would be particularly useful if significant differences exist between the format of data per the locale of the page versus the locale of the user. For example, if the form generated a different calendar from the one the user is used to, or for users who don't want to type ideographic separators in a date on a Japanese site, etc. But it can also be useful for them to be confident that they haven't misunderstood the meaning of a comma or period in a currency amount.

Is digit-shaping part of the localization of forms?

Many scripts and languages use non-European number digits. Will they be able to use these digits for form input?
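If such digits were accepted, the browser would presumably need to map them to ASCII digits before producing the canonical value. A minimal sketch of that mapping, using the Unicode decimal-digit property from Python's standard library, follows. The function name is hypothetical, and the sketch leaves separators such as the Arabic decimal separator untouched.

```python
import unicodedata

def to_ascii_digits(text: str) -> str:
    """Replace any Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None if ch is not a digit
        out.append(str(value) if value is not None else ch)
    return "".join(out)

# Arabic-Indic digits one to four, then Devanagari one, two, three.
print(to_ascii_digits("\u0661\u0662\u0663\u0664"))
print(to_ascii_digits("\u0967\u0968\u0969"))
```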

Should locale-specific formats be used in code?

If a content author wants to provide a number as a default value for a numeric form field, do they have to provide it in the local format or the canonical format? For example, would <input lang="fr-FR" type=number value="1,234".../> be appropriate as a number to 3 decimal places? Is the expectation that the browser would recognise that as the French format? If so, what would happen if the value was actually expressed as 1.234 (which could easily happen in the future as well as in legacy pages)? Also, if so, content authors would need access to a list of the display formats supported by the browser for each language tag value, so that they can enter information in the correct format.

If it's not the case, which seems likely to be easier to manage, and the canonical form needs to be used then the content author needs to resist the temptation to add something that looks like the actual displayed value. Should there be an error message if they deviate from the canonical form? (Perhaps there already is.) (Presumably this includes resisting the temptation to use non-European digits.)
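If the canonical form is required, an authoring tool or validator could check values in markup with something like the sketch below. The pattern reflects one reading of the HTML rules for a valid floating-point number (optional minus sign, ASCII digits, optional fraction, optional exponent) and should be treated as an approximation rather than the normative definition; is_canonical_number is a hypothetical name.

```python
import re

# Approximation of HTML's "valid floating-point number" (an assumption):
# ASCII digits only, "." as the only decimal separator, no grouping.
_VALID_FLOAT = re.compile(
    r"-?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?"
)

def is_canonical_number(value: str) -> bool:
    """Return True if the whole string matches the canonical number pattern."""
    return _VALID_FLOAT.fullmatch(value) is not None

for candidate in ("1.234", "1,234", "1.234,5", "\u0661\u0662\u0663"):
    print(repr(candidate), is_canonical_number(candidate))
```

Under this check, 1.234 passes, while 1,234, 1.234,5 and values written with non-European digits are rejected, so a French-looking default value would be flagged rather than silently misread.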

The same questions, of course, apply to other data formats.

None of this removes the need for content authors to build forms wisely

Any time a content author creates a form for date entry that simply asks for numeric input, they are looking for trouble. Date pickers, month names, explanatory text, four-digit years (if appropriate for the calendar), separate fields for format components, field validation, repeating back to the user what they input – all these and other methods should be considered to clarify for the user what they need to type where.

Useful spec links

[1] explains that formats of dates, times and numbers as presented to the user need to be converted to a canonical form to be stored in the DOM, processed, transmitted or received over the wire

[2] definition of the canonical form for numbers, and dates/times

[3] about the input element in various states


Tests

Here are some tests to explore behaviour (they only work in Firefox at the moment).

http://www.w3.org/International/tests/test-incubator/locales-in-forms/numbers.html

Other useful links

How Firefox localizes numbers: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/Input#Localization

Some information about decimal separators: http://en.wikipedia.org/wiki/Decimal_mark