Ampersands, PHP Sessions and Valid HTML

Why using PHP sessions causes invalid HTML and XHTML to be generated, and how to fix it.

Status of this document

This document is an article contributed to the QA Interest Group. Feedback, suggestions and corrections are welcome, and should be sent to the publicly archived mailing-list www-qa.

Credits

Author(s)
David Dorward

Background

In HTML (and XHTML, along with other SGML and XML applications) certain characters have special meaning, a prime example being <, which indicates the beginning of a tag. Such characters cannot be simply typed into a document if you wish them to display - otherwise how could the user agent tell the difference between b<a (meaning b is less than a) and b<a (meaning b followed by the start of an anchor)?

In order to display reserved characters HTML and XHTML provide a mechanism called character references. The syntax of these is:

  1. an ampersand
  2. a "code" for the referenced character
  3. a semicolon

For example, the "less than" character is represented as &lt;.

Giving the ampersand special meaning makes it, like <, a reserved character, so it too must be written as a character reference to be used in a document: &amp;.
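In PHP, the built-in htmlspecialchars function performs this substitution. The following is a minimal sketch; the sample string and variable names are illustrative only.

```php
<?php
// Sample text containing two reserved characters (illustrative only).
$text = 'b<a & c';

// htmlspecialchars() replaces reserved characters with character references.
$escaped = htmlspecialchars($text);

echo $escaped, "\n"; // b&lt;a &amp; c
```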

Now for a small confession - there are exceptions to these rules, although they are not relevant when dealing with the issues caused by PHP sessions.

HTML and XHTML include blocks of what is called CDATA, where HTML special characters no longer have special meaning. Inside such blocks character references are no longer processed, so an ampersand must be typed as an ampersand, and not as its character reference. In HTML, the content of <script> and <style> elements is CDATA, while in XHTML they are marked explicitly. You can avoid the problem by placing scripts and style sheets in separate files and using <link> and <script src="…">.

The other exceptions are that sometimes the semicolon is optional, and sometimes ampersands can be written without being encoded as character references. In these situations it is never wrong to represent the character as a character reference terminated by a semicolon, so I won't go into more detail.

Problem

PHP has session handling code built in. This enables data to be stored on the server while being associated with a specific user (for, roughly, a single visit to the site).

To link the data with a user, the website has to hand the user agent a token which identifies it. This token is normally stored in a cookie, but not all user agents support cookies, and most of those which do allow them to be turned off.

PHP provides a fallback mechanism. If it discovers that cookies are not accepted by the client, it rewrites every link on the page to include that token in a query string. I believe this used to be enabled by default, but testing shows that, at least for the Fedora package of PHP 4.3.11 (Fedora release 2.4 of that package), it isn't. It can be turned on by setting the session.use_trans_sid directive.

This is, in theory, a pretty elegant solution to the problem (discounting the issues of the token hanging around for third parties to hoover up from public computers, bookmarking, link sharing, etc.), but the implementation is flawed.

For links with no query string, there isn't a problem. PHP appends ?PHPSESSID= followed by a random hexadecimal number. For links that do have a query string PHP appends &PHPSESSID=.

Ampersand characters used as argument separators pose no problem in plain old URLs. In URLs embedded in HTML, however, they still mean "start of character reference" (subject to the aforementioned exceptions, for which &PHPSESSID= does not qualify).
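The effect can be sketched in a few lines of PHP. This is not PHP's own rewriting code, only an imitation of the behaviour described above; the token value, the file name and the variable names are all hypothetical.

```php
<?php
// Hypothetical session token and link target, for illustration only.
$token = '0123456789abcdef';
$href  = 'catalogue.php?page=2';

// Imitate the rewriting: "?" when the URL has no query string, "&" when it has.
$separator = (strpos($href, '?') === false) ? '?' : '&';
$rewritten = $href . $separator . 'PHPSESSID=' . $token;

// Written straight into an attribute, the raw ampersand is invalid markup.
$invalid = '<a href="' . $rewritten . '">Next</a>';

// Escaping the URL for HTML turns the separator into &amp;, which is valid.
$valid = '<a href="' . htmlspecialchars($rewritten) . '">Next</a>';

echo $invalid, "\n", $valid, "\n";
```

The URL that the user agent requests is the same in both cases, because it decodes &amp; back to & before following the link.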

Most users won't notice a problem, because the majority of user agents are rather good at working around mistakes by authors. However, that does not mean authors should ignore the problem.

Solutions

Outputting a character reference

The character that PHP uses to separate arguments is configurable with the arg_separator.output directive. This can be set in a number of ways and is the solution suggested in the PHP manual.

Editing php.ini

The php.ini file contains the central configuration data for an install of PHP on a computer. You can specify a character reference to use there.

arg_separator.output = "&amp;"

Apache directives

The Apache web server can set PHP directives in all the usual places. This allows different directives to be set on a per site or per directory basis (in, for example, a <Location> block or .htaccess file).

php_value arg_separator.output &amp;

Per script basis

PHP configuration directives can be set on a per script basis with the ini_set function. Put the code to set the directives at the top of your script.

<?php ini_set('arg_separator.output','&amp;'); ?>
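The same directive is also used by http_build_query when no separator is passed to it, which gives a quick way to see the setting take effect. A minimal sketch, with hypothetical parameter names and file name:

```php
<?php
// Set the separator before building any URLs.
ini_set('arg_separator.output', '&amp;');

// Hypothetical parameters, for illustration only.
$query = http_build_query(['page' => 2, 'sort' => 'name']);

// The result is ready to be placed in an HTML attribute.
echo '<a href="list.php?' . $query . '">Next</a>', "\n";
```

With this setting in place, the output of http_build_query is only suitable for HTML. For other contexts, such as a Location header, pass '&' explicitly as its third argument.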

Using a different argument separator

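Since the ampersand character has special meaning in HTML, the HTML 4.01 specification (Appendix B.2.2) recommends that query string parsers accept a semicolon as an argument separator in place of the ampersand. PHP's parser is controlled by the arg_separator.input directive, which is set in php.ini or the Apache configuration; check that it includes the semicolon (for example, arg_separator.input = ";&"), since the built-in default accepts only the ampersand. You can then alter the output code to use a semicolon instead of an ampersand using the same techniques.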

Editing php.ini

arg_separator.output = ";"

Apache directives

php_value arg_separator.output ;

Per script basis

<?php ini_set('arg_separator.output',';'); ?>
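To try the semicolon form without changing any configuration, http_build_query accepts a separator as its third argument. The sketch below uses hypothetical parameter names, and the split at the end is a hand-rolled illustration of a parser that accepts either separator, not PHP's own query string parser.

```php
<?php
// Hypothetical parameters, for illustration only.
$params = ['page' => 2, 'sort' => 'name'];

// The third argument overrides arg_separator.output for this call.
$query = http_build_query($params, '', ';');

// Hand-rolled split that accepts either separator (not PHP's own parser).
$pairs = preg_split('/[;&]/', $query);

echo $query, "\n"; // page=2;sort=name
```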

Disable sessions for non-cookie users

This option has a number of advantages from a security point of view as it reduces the chance of the session token leaking to third parties. As a side effect it will render your session code useless for visitors who disable, block or otherwise do not support cookies (this has accessibility implications).

Editing php.ini

session.use_trans_sid = 0

Apache directives

php_value session.use_trans_sid 0

Per script basis

Whether this directive can be set on a per script basis depends on which version of PHP you are using. Where it is possible, the call must come before the session is started, and the syntax is as follows:

<?php ini_set('session.use_trans_sid','0'); ?>
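Because the call can fail, it may be worth checking its result. A minimal sketch, assuming it runs before session_start() and before any output is sent; the variable names and the log message are illustrative only.

```php
<?php
// Run this before session_start() and before any output is sent.
$previous = ini_set('session.use_trans_sid', '0');

if ($previous === false) {
    // ini_set() returns false when the directive could not be changed here.
    error_log('session.use_trans_sid could not be changed at run time');
}

// Read the value back to see what is actually in effect.
$transSidEnabled = (bool) ini_get('session.use_trans_sid');
```

If $transSidEnabled is still true afterwards, the directive has to be changed in php.ini or the Apache configuration instead.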

Created Date: 2005-04-15
Last modified $Date: 2011/12/16 02:59:19 $ by $Author: gerald $

Copyright © 2000-2003 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.