Intended audience: script developers (PHP, JSP, etc.), webmasters, Web project managers, and anyone who wants to understand how to set or send HTTP charset information.
When a server sends a document to a user agent (e.g. a browser), it also sends information about the document's data format in the Content-Type field of the accompanying HTTP header. This information is expressed using a MIME type label. This article provides a starting point for those needing to set the encoding information in the HTTP header.
You should look elsewhere for information about how to declare the character encoding in HTML pages, or how to check the character encoding information being sent in an HTTP header.
Documents transmitted with HTTP that are of type text, such as text/html, text/plain, etc., can send a charset parameter in the HTTP header to specify the character encoding of the document.
It is very important to always label Web documents explicitly. HTTP/1.1 says that the default charset is ISO-8859-1. In practice, however, so many unlabeled documents use other encodings that browsers fall back to the reader's preferred encoding when there is no explicit charset parameter.
The line in the HTTP header typically looks like this:
Content-Type: text/html; charset=utf-8
In theory, any character encoding that has been registered with IANA can be used, but there is no browser that understands all of them. The more widely a character encoding is used, the better the chance that a browser will understand it. A Unicode encoding such as UTF-8 is a good choice: it can represent text in virtually any language and is widely supported by browsers.
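As a rough illustration of how a client might read the Content-Type line shown earlier, the following Python sketch extracts the charset parameter and checks whether the local Python installation has a codec for that label. The function names are hypothetical, and Python's codec registry is only a stand-in for "what a given piece of software understands"; it is not the IANA registry.

```python
import codecs
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def is_known_charset(name):
    """True if this Python installation has a codec for the given label."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


charset = charset_from_content_type("text/html; charset=utf-8")
print(charset, is_known_charset(charset))
```

A header without a charset parameter, such as a bare text/html, yields None here, which is the unlabeled case the text warns about.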
How to make the server send out appropriate charset information depends on the server. You will need the appropriate administrative rights to be able to change server settings.
Apache. This can be done via the AddCharset (Apache 1.3.10 and later) or AddType directives, for directories or individual resources (files). With AddDefaultCharset (Apache 1.3.12 and later), it is possible to set the default charset for a whole server. For more information, see the article on Setting 'charset' information in .htaccess.
Jigsaw. Use an indexer in JigAdmin to associate extensions with charsets, or set the charset directly on a resource.
IIS 5 and 6. In Internet Services Manager, right-click "Default Web Site" (or the site you want to configure) and go to "Properties" => "HTTP Headers" => "File Types..." => "New Type...". Put in the extension you want to map, separately for each extension; IIS users will probably want to map .htm, .html,... Then, for Content type, add "text/html;charset=utf-8" (without the quotes; substitute your desired charset for utf-8; do not leave any spaces anywhere because IIS ignores all text after spaces). For IIS 4, you may have to use "HTTP Headers" => "Creating a Custom HTTP Header" if the above does not work.
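The server settings above all come down to mapping a file extension to a Content-Type value that carries a charset parameter. The Python sketch below illustrates that idea only; the table and the function name are hypothetical and do not reproduce the configuration format of Apache, Jigsaw or IIS.

```python
import posixpath

# Hypothetical extension table, playing the role of the per-extension
# mappings described above. Each entry is (MIME type, charset).
CHARSET_BY_EXTENSION = {
    ".html": ("text/html", "utf-8"),
    ".htm": ("text/html", "utf-8"),
    ".txt": ("text/plain", "utf-8"),
}


def content_type_for(path, default="application/octet-stream"):
    """Build the Content-Type header value for a requested path."""
    ext = posixpath.splitext(path)[1].lower()
    if ext not in CHARSET_BY_EXTENSION:
        # Unmapped resources get no charset parameter at all.
        return default
    mime, charset = CHARSET_BY_EXTENSION[ext]
    return f"{mime}; charset={charset}"


print(content_type_for("index.html"))
```

Only text types receive a charset parameter in this sketch; unmapped resources fall back to a generic type without one.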
The appropriate header can also be set in server-side scripting languages. For example:
Perl. Output the correct header before any part of the actual page. After the last header, use a double linebreak, e.g.:
print "Content-Type: text/html; charset=utf-8\n\n";
Python. Use the same solution as for Perl (except that you don't need a semicolon at the end, and that in Python 3 print is a function, so the string goes in parentheses).
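A minimal sketch of the Python case, assuming a CGI-style script that writes headers and body to standard output. The helper name cgi_response is hypothetical. It builds the header, the blank line that ends the header block, and the body encoded in the declared charset.

```python
def cgi_response(body_text, charset="utf-8"):
    """Return header, blank line and encoded body as a single bytes value."""
    # The double linebreak ends the header block, as in the Perl example.
    header = f"Content-Type: text/html; charset={charset}\n\n"
    # The body is encoded with the same charset that the header declares.
    return header.encode("ascii") + body_text.encode(charset)


response = cgi_response("<p>caf\u00e9</p>")
```

In an actual script the result would typically be written with sys.stdout.buffer.write(response), so that the bytes reach the server without being re-encoded.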
PHP. Use the header() function before generating any content, e.g.:
header('Content-type: text/html; charset=utf-8');
Java Servlets. Use the setContentType method on the ServletResponse before obtaining any object (Stream or Writer) used for output, e.g. by calling setContentType("text/html; charset=UTF-8").
If you use a Writer, the Servlet automatically takes care of the conversion from Java Strings to the encoding selected.
JSP. Use the page directive, e.g.:
<%@ page contentType="text/html; charset=UTF-8" %>
Output written with out.println() or with expression elements (<%= object %>) is automatically converted to the encoding selected. Also, the page itself is interpreted as being in this encoding.
ASP and ASP.Net. ContentType and charset are set independently, and are properties of the Response object. To set the charset, assign the desired value (e.g. "utf-8") to Response.Charset.
In ASP.Net, setting Response.ContentEncoding will take care both of the charset parameter in the HTTP Content-Type and of the actual encoding of the document sent out (which of course have to be the same). The default can be set in the globalization element in Machine.config (which is originally set to UTF-8).
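The requirement that the declared charset and the actual encoding of the document be the same applies in any environment. As a hedged illustration, the Python sketch below uses the standard WSGI calling convention and derives both the header and the body encoding from one constant. The names CHARSET and app are hypothetical.

```python
CHARSET = "utf-8"  # single source of truth for header and body encoding


def app(environ, start_response):
    """Minimal WSGI application that labels its output explicitly."""
    body = "<p>Gr\u00fc\u00dfe</p>".encode(CHARSET)
    headers = [
        ("Content-Type", f"text/html; charset={CHARSET}"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the application can be served with wsgiref.simple_server.make_server("", 8000, app).serve_forever() from the Python standard library. Changing CHARSET changes the label and the bytes together, so the two cannot drift apart.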