Setting charset information in .htaccess

Question

How do I use .htaccess directives on an Apache server to serve files with a specific encoding?

Background

It is important to ensure that any information about character encoding sent by the server is correct, since information in the HTTP header overrides information in the document itself.

Many Apache servers are configured to send files using the ISO-8859-1 (Latin-1) encoding. In the examples in this FAQ, we'll assume that you want to serve your file or files using an encoding different from the one specified in the default configuration. (For advice on choosing an encoding see Choosing & applying a character encoding.)

The following shows an example of an HTTP header that accompanies a file sent to a user agent. In this case the character encoding information is included in the Content-Type header on the second line from the bottom.

HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=utf-8
Content-Language: en

In the example the Content-Type header expresses both the MIME type of the file and the character encoding. The MIME type describes the format of the file being served. HTML files are typically served as text/html. The character encoding (or 'charset') of this file is UTF-8.

To learn how to view the HTTP header for a file see the article Checking HTTP Headers.

Files on an Apache server may be served with a default character encoding declaration in the HTTP header that conflicts with the actual encoding of the file. The character encoding sent by the server may be the out-of-the-box default, a default set by the system administrator, or a result of implementing various Apache directives. In other cases the server sends no character encoding information at all, even though a declaration is wanted.

If the server is set up to allow users or administrators to change information in .htaccess files, these can provide a way to override default settings. This FAQ shows you how.

Answer

There are a couple of different scenarios to bear in mind. In the first instance, you may want to change the default for all the files in a directory with the same extension. Alternatively, you may want to change the default for a single file or small number of files. We will explore these in turn.

In our examples we will assume that the default server configuration serves files as ISO-8859-1, but that you want to serve your file or files using UTF-8 (a very sensible strategy!).

Is this answer relevant to you?

This article is written for content authors, rather than system administrators. Setting the server's default encoding is beyond the scope of this article.

This advice is only relevant if you are happy to declare the character encoding of your document via the HTTP header. In some cases you may not want that.

Note that this FAQ also assumes that your server is set up to use .htaccess files, and that the directives described below work in .htaccess files on your server. It is also assumed that it is not appropriate to simply change the default configuration of the server. If you are not sure, contact your server administrator.

You should also be aware of the conventions in use on your server for associating character encoding information with extensions. In some cases the server may be set up in the expectation that character encodings are indicated by encoding-specific extensions, e.g. example.html.utf8, where it is the .utf8 that needs to be associated with a character encoding, rather than the .html (which may be associated with the file type).
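
As a rough sketch of what such a convention might look like, the following two lines keep the file type and the character encoding on separate extensions. The pairing shown here is an assumption for illustration only; the conventions on your own server may differ.

```apache
# The .html extension identifies the file type (MIME type).
AddType text/html .html
# The .utf8 extension identifies the character encoding.
AddCharset UTF-8 .utf8
```

With mappings like these, a file named example.html.utf8 would be expected to be served as text/html with a UTF-8 charset declaration.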

If these approaches fail, you should consult the Apache manuals (see attached links) or your server administrator.

Specifying by extension

Use the AddCharset directive to associate the character encoding with all files having a particular extension in the current directory and its subdirectories. For example, to serve all files with the extension .html as UTF-8, open the .htaccess file in a plain text editor and type the following line:

AddCharset UTF-8 .html

The extension can be specified with or without a leading dot. You can add multiple extensions to the same line. This will still work if you have file names such as example.en.html or example.html.en.
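
As a minimal sketch of the multiple-extension form, the line below maps two extensions to UTF-8 in a single directive. The .htm extension is an assumption chosen for illustration; substitute whatever extensions your files actually use.

```apache
# One directive, several extensions (the leading dots are optional).
# .htm is an illustrative extension, not one taken from a particular server setup.
AddCharset UTF-8 .html .htm
```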

The example will cause all files with the extension .html to be served as UTF-8. The HTTP header will contain a Content-Type line that ends with the 'charset' information, as shown in the example that follows.

Content-Type: text/html; charset=UTF-8

Note: All files with this extension in all subdirectories of the current location will also be served as UTF-8. If, for some reason, you need to serve the odd file with a different encoding you will need to override this using additional directives.
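
One way to override the setting for a whole subdirectory, sketched here under the assumption that the subdirectory is called legacy/ (a hypothetical name) and has its own .htaccess file, is to repeat the directive there with the other encoding. A mapping declared in the subdirectory replaces the inherited mapping for the same extension.

```apache
# Contents of legacy/.htaccess (hypothetical directory name).
# Replaces the inherited UTF-8 mapping for .html in this directory and below.
AddCharset ISO-8859-1 .html
```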

Note: You can associate the character encoding with any extension attached to your file. For example, suppose you do language negotiation and you have pages in two languages that follow the model example.en.html and example.ja.html. Let's also suppose that you are happy to serve English pages using your server's ISO-8859-1 default, but want to serve Japanese files in UTF-8. To do this, you can associate the character encoding with the language extension, as follows:

AddCharset UTF-8 .ja

Note, however, that where it is possible it may be a better solution to change the server default to UTF-8, or to serve all files in new directories as UTF-8.

Note: It is also possible to achieve the same result using the AddType directive, although this declares both the character encoding and the MIME type at the same time. The decision as to which is most appropriate will depend in part on how you are using extensions for content negotiation. If you are using different extensions to express the document type and the character encoding, AddType is less likely to be appropriate.

AddType 'text/html; charset=UTF-8' html

Changing the occasional file

Let's now assume that you want to serve only one file as UTF-8 in a large directory where all the other older files are correctly served as ISO-8859-1. The file you want to serve as UTF-8 is called example.html. Open the .htaccess file in a plain text editor and type the following:

<Files "example.html">
AddCharset UTF-8 .html
</Files>

What we did here was wrap the directive discussed in the previous section in some markup that identifies the specific file we are concerned with. If you have the need, there is also a slightly different syntax that allows you to specify a number of file names using a regular expression.
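
A sketch of one regular-expression form follows, assuming that Apache's FilesMatch container is permitted in .htaccess files on your server. The file names example.html and sample.html are hypothetical.

```apache
# Serve only example.html and sample.html as UTF-8;
# other .html files in the directory keep the server default.
<FilesMatch "^(example|sample)\.html$">
AddCharset UTF-8 .html
</FilesMatch>
```

The pattern is anchored at both ends so that it matches only those two file names and not, say, myexample.html.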

Note: It is also possible to achieve the same result using the AddType directive shown above, or, in this case, the ForceType directive, although these declare both the character encoding and the MIME type at the same time.

<Files "example.html">
ForceType 'text/html; charset=UTF-8'
</Files>

Note: Any files with the same name in a subdirectory of the current location will also be served as UTF-8, unless you create a counter directive in the relevant directory.
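
Such a counter directive might look like the following sketch, placed in the .htaccess file of the subdirectory concerned. It assumes that the same-named file there is encoded as ISO-8859-1; the subdirectory name archive/ is hypothetical.

```apache
# Contents of archive/.htaccess (hypothetical directory name).
# Restores ISO-8859-1 for the example.html that lives in this subdirectory.
<Files "example.html">
AddCharset ISO-8859-1 .html
</Files>
```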

More complex scenarios

When two extension rules apply to the same document the order of extensions is important. Thus, in the following example

AddCharset UTF-8 .utf8
AddCharset windows-1252 .html

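the file example.utf8.html will be served as windows-1252 and example.html.utf8 as UTF-8, because when more than one extension maps to a character encoding, the one furthest to the right is used.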