GeoEncoding

From W3C Wiki


This page is now defunct

Do not edit this page.

See instead the new wiki

What is character encoding, and why should I care?

First, why should I care?

If you use anything other than the most basic letters and numbers of the English alphabet, people may not be able to read your text unless you tell them what character encoding you used. And you can also run into problems with even the basic English alphabet.

You may also have come across strange looking garbage in other people's text that you need to deal with. That's usually because they didn't properly identify the character encoding they are using.

For example, the author may intend the text to look like this:

mojibake1.gif

but it may actually display like this:

mojibake2.gif

Not only that, but data may not be searchable.

What's a character encoding?

Words and sentences in text are created from characters. Examples of characters include the Latin letter á or the Chinese ideograph 請 or the Devanagari ह.

Basically, all characters are stored in computers using a numeric code. Each number in memory corresponds to a particular character.

A character encoding is a key to unlock the code. It is a set of mappings between numbers and characters. For example, in the encoding called ISO 8859-1 the number 233 represents the letter é. In the encoding called ISO 8859-5, the same number represents the Cyrillic character щ.
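The mapping described above can be checked directly. The following is a minimal Python sketch, not part of the original page, and the variable names are illustrative only. It decodes a single byte with the value 233 under the two encodings named in the text.

```python
# One byte with the numeric value 233 (0xE9 in hexadecimal).
raw = bytes([233])

# The same number maps to a different character in each encoding.
as_latin1 = raw.decode("iso-8859-1")    # Western European
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic

print(as_latin1)    # é
print(as_cyrillic)  # щ
```

The byte itself never changes; only the key used to interpret it does.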

Unfortunately, there are many different character encodings, i.e. many different ways of mapping between the same numbers and different characters. This is especially true for Web pages, which may be written on many different platforms, each with its own encoding idiosyncrasies.

If you are to correctly decode the sequence of numbers in memory, you need to know which encoding was applied originally. Otherwise you will see garbage.

How does this affect me?

You will need to declare the encoding you used for others to read your stuff.

You will need to know what encoding your editor is saving stuff in.

You will need to choose the best encoding for your purposes.

You will need to ensure that the various parts of your system that communicate with each other understand which character encodings are being used and support all the necessary encodings.

You will need to hope that others did the same thing so that you can read their stuff.

Talk about default settings of browsers.

How do I do that?

A LITTLE MORE DETAIL

Character encodings vs. character sets

Examples of character encodings

Unicode

Characters and fonts

PREVIOUS STUFF FROM DAVID

Author: David Clarke

DRC New start

Why should I Care about Character Encoding and What is It?

You can’t read what I’m writing

The first clue to a character encoding problem is usually somebody contacting you to say they can’t read your email or web site because it is full of “garbage text”. Often this shows up as damaged currency symbols obscuring price information, or as damaged accented characters. In the extreme case, the entire text cannot be read.

For example:

You intend your web site to show
Users say they see

Of course, you may be on the other side of the problem and trying to read something that should be important.

What causes this?

Computers store characters as numbers. There are a very large number of ways in which the numbers are associated with the characters. These are the character sets.

If a program does not know which character set has been used, most will use a default or guess. If the program that is displaying the text makes an incorrect guess then the result can’t be read.
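The default-or-guess behaviour described above can be sketched in Python. This is an illustration under stated assumptions, not how any particular browser or mail program works: the function name `decode_text`, the choice of UTF-8 as the first guess, and the `windows-1252` default are all hypothetical.

```python
from typing import Optional

def decode_text(data: bytes, declared: Optional[str] = None,
                default: str = "windows-1252") -> str:
    """Decode bytes with the declared encoding if given; otherwise guess."""
    if declared is not None:
        # An explicit declaration removes the guesswork.
        return data.decode(declared, errors="replace")
    try:
        # Guess 1: try strict UTF-8 (assumed heuristic).
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Guess 2: fall back to a default, which may well be wrong.
        return data.decode(default, errors="replace")

# Russian text saved in ISO 8859-5, with no encoding information attached.
cyrillic_bytes = "Привет".encode("iso-8859-5")

print(decode_text(cyrillic_bytes))                # wrong guess, unreadable
print(decode_text(cyrillic_bytes, "iso-8859-5"))  # Привет
```

The second call succeeds only because the caller supplied the encoding, which is the information the rest of this page asks authors to declare.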

When I write content, what can I do to prevent this problem?

A simple way to prevent this incorrect display of text is to provide explicit information in documents or web pages about their character encoding, so that readers’ computer programs can use it and eliminate the guesswork. Care must be taken to ensure that the declared encoding is consistent with the actual encoding of the content.
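As a hedged illustration of keeping the declaration and the actual bytes in step, the Python sketch below builds a small HTML page and encodes it with the same encoding that its `meta` element names. The helper names `build_page` and `declaration_is_consistent` are hypothetical, and the use of the HTML `<meta charset="...">` form is an assumption; the original page does not say which declaration mechanism to use.

```python
import re

def build_page(body: str, encoding: str = "utf-8") -> bytes:
    """Return an HTML page whose charset declaration matches its bytes."""
    html = (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="{encoding}"><title>Example</title></head>\n'
        f"<body><p>{body}</p></body></html>\n"
    )
    # Encode with the very encoding that the declaration names.
    return html.encode(encoding)

def declaration_is_consistent(page: bytes) -> bool:
    """Check that the page decodes cleanly under the encoding it declares."""
    match = re.search(rb'<meta charset="([A-Za-z0-9_-]+)"', page)
    if match is None:
        return False
    declared = match.group(1).decode("ascii")
    try:
        page.decode(declared)
    except (UnicodeDecodeError, LookupError):
        return False
    return True

good = build_page("Prix : 10 €")
# Simulate a page whose label no longer matches how it was saved.
bad = good.replace(b'charset="utf-8"', b'charset="ascii"')

print(declaration_is_consistent(good))  # True
print(declaration_is_consistent(bad))   # False
```

A clean decode is a necessary check, not a sufficient one: a single-byte encoding such as ISO 8859-1 accepts any byte sequence, so a mislabelled page can pass this test and still display the wrong characters.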

When I want to read content, what can I do?

Web browsers and email programs can be forced to interpret a page as a specific encoding or the guesswork process can be given hints. For example, Firefox can be set to “auto-detect” with an expectation that the page uses some form of the many Japanese character encodings.

Why should I Care about Encoding?

Different character encodings used to be one of the biggest obstacles of interoperability between computers. They still make some emails and websites almost impossible to read because of apparently meaningless characters being displayed instead of the ones that were intended.

Be sure other people can see your message

For example: if a web site expects to display the following using the UTF-8 encoding hiraganautf8.gif but the web browser incorrectly assumes that the page uses ISO 8859-1, then the following will appear: hiraganautf8asiso8859-1.gif
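The same kind of misreading can be reproduced in a few lines of Python. This is a sketch only: the sample string is an assumption and is not necessarily the text shown in the images above.

```python
text = "ひらがな"  # assumed sample text
utf8_bytes = text.encode("utf-8")

# Each hiragana character occupies three bytes in UTF-8.
print(len(text), len(utf8_bytes))  # 4 12

# A reader that wrongly assumes ISO 8859-1 treats every byte as one character.
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread))  # 12

# The damage is reversible as long as the bytes themselves were not altered.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == text)  # True
```

Four intended characters become twelve unrelated ones, which is why the misread page looks like garbage rather than like slightly wrong Japanese.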

Interoperability - or how not to make content unreadable?

One of the simplest steps to prevent the incorrect display of text is to include explicit information in documents or web pages about their encoding. When you save a document it is essential to check that it really does use the encoding that is expected. Information can be included directly in XML documents or HTML pages. DRC see documents on correctly marking up charset info

Be sure you can read information from other sources

[[RI we should add:

Question: What is 'character encoding', and why should I care?

]]

Simple Explanation of Character Encoding

Internally, computers store characters as numeric codes. The relationship between these numbers and the characters they represent is the character encoding.

Why does this cause problems?

When different computer systems use different character encodings, the result is often unreadable because each one interprets the same bytes to mean different things.

Why should I care?

If other people need to read your content, then the character encoding settings must be compatible, or the recipients can't read it.

[[RI Although the question has it, I think, the right way round to grab attention, I think the explanation may be better if it started off with the problem statement - i.e. why should I care? I think this should be put in very simple, graphic terms: Characters are represented in a computer using bytes. If the bytes representing the characters in your text are misinterpreted you get a mess, like this... There are several schemes for associating bytes with characters, and determining what characters are included in a set. Examples are...

I fear the current text is too abstract and not simple enough. ]]

Background

Historically, different types of computer system from different countries or manufacturers have used different character encodings.

Most people have received an email or other document from a foreign source that displays as apparently random characters. This is often because the receiving program is not configured to support the same encoding as the original document. [[RI same what? You haven't defined or described 'encoding' yet. (Nor do you later ;-) ]]

Many web sites and email programs use a range of character encodings which they cannot guarantee will be readable by other people, or on differently configured computers.

Historical Background

RI''' Yawn. ;-)

Early developments in computing were generally in English speaking countries, so the choice of characters to represent was dictated by the characters used in English. One of the predominant encodings was ASCII {should be referenced}

Extensions to ASCII were provided for representing some other languages, by assigning locale-specific values whilst remaining within 8-bit character sets. This allowed for language-specific character sets as extensions of ASCII, but allowed only 128 more characters. This is adequate for adding French, German or Finnish accents, and even for adding small alphabets such as Greek.

RI''' This is all dudley boring. Can we add some pictures? See for example http://people.w3.org/rishida/scripts/tutorial/slides/Slide0190.html and following slides.

Extended ASCII cannot support large character sets such as Japanese kanji: the government-approved jōyō list of 1981 alone contains 1,945 characters (raised to 2,136 in 2010), and numerous additional characters are used for special purposes such as names.

To represent these large sets of characters, a number of mutually incompatible, and often proprietary, standards evolved for different software environments. While information from one system was not being shared with another, each of these standards was adequate.
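A short Python sketch makes the size problem and the incompatibility concrete. The sample character and the choice of Shift_JIS and EUC-JP as representative legacy encodings are assumptions made for illustration; the original text does not name specific standards.

```python
kanji = "漢"  # assumed sample character

# A single-byte encoding has no slot for this character.
try:
    kanji.encode("iso-8859-1")
    fits_in_one_byte = True
except UnicodeEncodeError:
    fits_in_one_byte = False
print(fits_in_one_byte)  # False

# Multi-byte encodings can represent it, each in its own way.
sjis = kanji.encode("shift_jis")
eucjp = kanji.encode("euc_jp")
utf8 = kanji.encode("utf-8")
print(len(sjis), len(eucjp), len(utf8))  # 2 2 3

# The legacy encodings disagree on the bytes for the same character.
print(sjis != eucjp)  # True
```

Because the byte sequences differ, text saved under one of these encodings cannot be read correctly by software that assumes the other.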

Many software environments are still being built using these historic or legacy encodings.

Character encoding remains a major component of locale definitions.

How do I deal with Encoding?

[Setting Encoding In Applications]

[[HTML (this is emphasis)]]