Authorizing Read Access to XML Content Using the <?access-control?> Processing Instruction 1.0

1 Introduction

A plethora of applications and data are exposed as XML over HTTP. User agents such as Voice and Web browsers fetch and execute applications but restrict the XML content accessible to those applications merely to the URLs located in the same domain as the application. To take advantage of the rich XML content available on the Web, application developers must resort to proxying the content through the domain hosting their application thereby increasing overhead and limiting scalability.

This note describes a mechanism being used in the industry that allows a content provider to use a processing instruction embedded within the XML prolog to specify the access policy of that content. In this model a user agent can safely extend the sandbox in which it has restricted the application to include access to the XML content if and only if the specified policy grants permission.

The processing instruction is designed explicitly to enable extending the sandbox and is not designed as a restriction mechanism. The expectation is that the user agent's default policy is stricter. Therefore, it is always safe to fall back to the default policy in the event of an error.

ISSUE: The Task Force would like to enable this mechanism as an HTTP header (e.g. Content-Access-Control). We expect to apply this change to a later draft.

2 <?access-control?> Processing Instruction Algorithm

Before allowing an application executing in the context of a user agent to manipulate external XML content, a user agent validates that the host requesting the content is allowed to access the content. This validation is performed by comparing the hostname and IP address of the document server from which the requesting application was fetched to the list of hostnames, hostname suffixes, and IP addresses listed in the <?access-control?> processing instruction included in the prolog of the XML content to be fetched. When comparing hostnames, the user agent must perform a case-insensitive match as specified in [RFC2616].

If the XML prolog specifies one or more <?access-control?> processing instructions, access to the content is allowed based on the following algorithm:

If the IP address of the requesting application matches a value in the deny attribute, access is denied, and the search algorithm is stopped.
If the IP address of the requesting application matches a value in the allow attribute, access is allowed, and the search algorithm is stopped.
If the fully qualified domain name of the requesting application exactly matches a value in the deny attribute, access is denied, and the search algorithm is stopped.
If the fully qualified domain name of the requesting application exactly matches a value in the allow attribute, access is allowed, and the search algorithm is stopped.
The user agent then searches for the best match using wildcards on the domain name. Best match is defined as the closest match using the wildcards (e.g. "bert.evil.example.com" matches "*.evil.example.com" more closely than "*.example.com").
If a best match occurs in the deny attribute, access is denied, and the search algorithm is stopped.
If a best match occurs in the allow attribute, access is allowed, and the search algorithm is stopped.
If there is no match on any of the allow or deny attributes, the search algorithm is stopped and the user agent should apply its own security policy.
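The steps above can be sketched in Python as follows. This is a minimal illustration rather than a normative implementation: the function names check_access and _best_wildcard are hypothetical, IP addresses are compared as plain strings, a bare "*" is treated as the least specific wildcard, and a tie between equally specific allow and deny wildcards is resolved in favor of deny. The last three points are assumptions that the text does not settle.

```python
from typing import Optional, Sequence


def _best_wildcard(host: str, items: Sequence[str]) -> int:
    """Return the length of the most specific wildcard suffix that matches
    host, 0 for a bare "*", or -1 when no wildcard item matches."""
    best = -1
    for item in items:
        if item == "*":
            best = max(best, 0)
        elif item.startswith("*."):
            suffix = item[2:]
            if host == suffix or host.endswith("." + suffix):
                best = max(best, len(suffix))
    return best


def check_access(host: str, ip: str,
                 allow: Sequence[str], deny: Sequence[str]) -> Optional[bool]:
    """Return True (allowed), False (denied), or None (no rule matched, so
    the user agent applies its own security policy)."""
    host = host.lower()
    ip = ip.lower()
    allow = [item.lower() for item in allow]
    deny = [item.lower() for item in deny]

    # IP address checks come first, deny before allow.
    if ip in deny:
        return False
    if ip in allow:
        return True
    # Exact fully qualified domain name checks, deny before allow.
    if host in deny:
        return False
    if host in allow:
        return True
    # Wildcard search: the closest (longest) matching suffix wins.
    # Assumption: on a tie, deny wins because it is checked first.
    best_deny = _best_wildcard(host, deny)
    best_allow = _best_wildcard(host, allow)
    if best_deny >= 0 and best_deny >= best_allow:
        return False
    if best_allow >= 0:
        return True
    return None
```

Returning None rather than a boolean keeps the "no match" outcome distinct, so the caller can apply whatever default policy the user agent uses.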

ISSUE: There are two choices. One is to assume that if a PI exists, the PI defines the universe of access allowed. The other, as currently specified, is that if a PI exists but does not explicitly allow or deny a host such as example.net, the user agent may default to its own policy. It may be better to say that if a PI exists but does not make a statement about example.net, then example.net must be denied access.

Processing instructions outside the XML prolog must be ignored.
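One way to honor this rule is to inspect only the nodes that precede the root element. The sketch below uses Python's standard xml.dom.minidom module for illustration; the function name prolog_access_control_data is hypothetical, and a real user agent would more likely do this inside its own streaming parser.

```python
from typing import List
from xml.dom import Node, minidom


def prolog_access_control_data(xml_text: str) -> List[str]:
    """Return the data of each access-control PI found in the prolog,
    in document order. PIs inside or after the root element are ignored."""
    doc = minidom.parseString(xml_text)
    found = []
    for node in doc.childNodes:
        if node is doc.documentElement:
            break  # the prolog ends where the root element begins
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "access-control"):
            found.append(node.data)
    return found
```

Each returned string is the text between the PI target and the closing "?>", which still has to be parsed into allow, deny, and require-secure values.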

If the user agent encounters multiple <?access-control?> processing instructions in the XML prolog of the retrieved XML content, the user agent must collect the attributes from each instruction and create a merged ruleset in document order before executing the access algorithm. For example, the following two blocks are equivalent:

<?access-control allow="*.example.com"
        deny="*.visitors.example.com"?>
<?access-control allow="www.example.org"
        deny="www.example.net"?>

<?access-control allow="*.example.com www.example.org"
        deny="*.visitors.example.com www.example.net"?>
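The merge described above amounts to concatenating the allow and deny lists in document order. A minimal Python sketch follows, assuming each processing instruction has already been reduced to a dictionary of raw attribute strings; the function name merge_rulesets and that input shape are illustrative choices, not part of this specification.

```python
from typing import Dict, Iterable, List


def merge_rulesets(instructions: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Combine the allow and deny values of several PIs, in document order."""
    merged: Dict[str, List[str]] = {"allow": [], "deny": []}
    for attrs in instructions:
        for name in ("allow", "deny"):
            # Items inside one attribute value are separated by whitespace.
            merged[name].extend(attrs.get(name, "").split())
    return merged
```

Applied to the two separate instructions in the example, this yields the same lists as the single combined instruction.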

ISSUE: This combination algorithm is under review as there are many options with different intended results. Feedback and use cases are greatly appreciated.

The require-secure attribute is optional. If not specified, the default is false. If the attribute require-secure is specified to be true, the user agent is responsible for ensuring that the requesting application's host has been validated using a secure protocol (e.g. HTTPS).
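As a rough illustration of this check, the sketch below treats an "https" URL scheme as evidence that the requesting application's host was validated over a secure protocol. That equivalence is an assumption made for the example: the actual validation (for instance, certificate verification) happens in the user agent's transport layer, and the function name secure_requirement_met is hypothetical.

```python
from urllib.parse import urlsplit


def secure_requirement_met(app_url: str, require_secure: bool = False) -> bool:
    """Return True when the require-secure condition is satisfied for the
    URL from which the requesting application was fetched."""
    if not require_secure:
        return True  # default: no secure-protocol requirement
    # Assumption: an https scheme stands in for "validated via HTTPS".
    return urlsplit(app_url).scheme == "https"
```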

If the XML prolog does not contain an <?access-control?> processing instruction, access to the XML content is dependent on the user agent's security environment. Similarly, if the requesting application is from a non-remote protocol (e.g. file://), access to the XML content is dependent on the user agent's security environment.

If the processing instruction is not well-formed for any reason, the user agent should ignore all processing instructions and default to its security policy.

The following grammar describes the syntax for the <?access-control?> processing instruction to be embedded in the XML content retrieved by the user agent. The grammar is specified using Extended Backus-Naur Form (EBNF) notation. For more information on this syntax, see section 6, Notation, in [XML]. For definitions of the HostName, IPv4address, and IPv6address productions, see [RFC3986].

ISSUE: The working group would like to move to using URIs in combination with or instead of hosts, domains, and IP addresses. This might also eliminate the need for the require-secure attribute and any special handling for non-remote URI schemes.

On 5/6/2006 Charles McCathieNevile wrote:

Current implementations are based on IP numbers, or domain names, allowing for wild cards.
There are alternatives, such as the rules for the scope attribute of P3P's HINT element [1].
Although those are similar, they allow a greater granularity, providing for scheme (e.g.
file:, http: etc) and port number constraints, as follows:

[quote cite=http://www.w3.org/TR/P3P/#hints]
... the host part of the authority MAY begin with a wildcard, as defined in Section 2.3.2.1.2.
The scope attribute MUST NOT contain a wildcard in any other position, MUST be encoded according
to the conventions in Section 2.3.2.1.2, and MUST NOT contain a path, query or fragment URI
component. Additionally, if the authority is a server, it SHOULD NOT contain a userinfo part.

For example, legal values for scope include:
http://www.example.com
http://www.example.com:81
http://*.example.com
ftp://ftp.example.org

The following are illegal values for the scope attribute:
http://www.*.com ; the wildcard can only be at the start
http://www.example.com/ ; the trailing slash is not allowed
www.example.com ; the scheme must be stated
*://www.example.com ; the scheme cannot contain a wildcard
http://www.example.com:*; the port cannot contain a wildcard

The path attribute is used to locate the policy reference file on the hinted site. It is a
relative URI whose base is the URI scheme and authority matched in the scope attribute. The
path attribute MUST NOT be an absolute URI, so that the policy reference file is always retrieved
from the same site that it is applied to.
[/quote]

Since this syntax is almost compatible with that already used, but provides greater granularity,
I think we could readily use it. In order to provide for backwards compatibility I think we should
allow domains to be specified without schemes (although in the general case it could raise more
security issues, which should be noted in the relevant section of the spec). Should we adapt it
to allow for wildcarding IP ranges?

There is another approach taken by the RDF-CL XG, which allows for specifying a base pattern, and
exceptions, covering particular files or groups of files. Although this is clearly useful, it goes
some way beyond the current approach and requires a significant and incompatible change of syntax.
I suggest that is beyond the scope of the current work, although it could be used by future work,
and might be worth pointing to as an informative reference. It is already used to provide access
control, so it does seem a natural complementary approach that we should at least mention.

Access Control Processing Instruction

[1]	AccessControlPI	::=	'<?access-control' (S 'allow="' AccessList '"' | S "allow='" AccessList "'")? (S 'deny="' AccessList '"' | S "deny='" AccessList "'")? (S 'require-secure="' ('true' | 'false') '"' | S "require-secure='" ('true' | 'false') "'")? S? '?>'
[2]	AccessList	::=	AccessItem (S AccessItem)* | '*'
[3]	AccessItem	::=	HostName | PartialHostName | IPv4address | genericuri
[4]	PartialHostName	::=	'*.' HostName
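The grammar can be approximated with a regular expression. The Python sketch below is illustrative only: it assumes that require-secure takes a quoted true or false in either quote style, it does not validate individual items against the HostName or IPv4address productions, and the name parse_access_control is hypothetical. It returns None for input that does not match, which corresponds to the case where the user agent ignores the processing instructions and falls back to its own security policy.

```python
import re
from typing import Dict, List, Optional

# Attributes must appear in the fixed order allow, deny, require-secure.
_PI = re.compile(
    r"""<\?access-control
        (?:\s+allow=(?:"(?P<a1>[^"]*)"|'(?P<a2>[^']*)'))?
        (?:\s+deny=(?:"(?P<d1>[^"]*)"|'(?P<d2>[^']*)'))?
        (?:\s+require-secure=(?:"(?P<s1>true|false)"|'(?P<s2>true|false)'))?
        \s*\?>""",
    re.VERBOSE,
)


def _items(match: "re.Match[str]", dq: str, sq: str) -> List[str]:
    """Split one AccessList value; raise ValueError if it is invalid."""
    value = match.group(dq) if match.group(dq) is not None else match.group(sq)
    if value is None:
        return []  # attribute absent
    items = value.split()
    # An AccessList is either one or more items, or a single '*'.
    if not items or ("*" in items and len(items) > 1):
        raise ValueError("invalid AccessList")
    return items


def parse_access_control(pi_text: str) -> Optional[Dict[str, object]]:
    """Parse one complete PI; return None if it is not well-formed."""
    match = _PI.fullmatch(pi_text.strip())
    if match is None:
        return None
    try:
        allow = _items(match, "a1", "a2")
        deny = _items(match, "d1", "d2")
    except ValueError:
        return None
    secure = (match.group("s1") or match.group("s2")) == "true"
    return {"allow": allow, "deny": deny, "require_secure": secure}
```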

In the following example, the hosts named "voice.example.com" and "voice.example.org" are allowed access to the XML content. An XML request from an application located on any other host (e.g. "www.example.com") will fail.

<?access-control allow="voice.example.com
        voice.example.org"?>

Numerous hosts within a domain may require XML content access, and listing them all is impractical. For this reason, the user agent should support wildcard matching through the use of an asterisk ('*') at the beginning of a domain name. Use of the '*' will provide access to any and all applications hosted in that domain. For example, "*.example.com" will allow applications from voice.example.com, www.example.com, mail.example.com, example.com, and any other host ending in example.com to access that data.
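A single-item wildcard test might look like the following Python sketch. The function name matches_pattern is hypothetical. The sketch also assumes that matching happens on label boundaries, so that "*.example.com" covers "example.com" and its subdomains but not an unrelated name such as "badexample.com"; the text above does not spell out that boundary rule.

```python
def matches_pattern(host: str, pattern: str) -> bool:
    """Return True if host matches one allow/deny item, ignoring case."""
    host = host.lower()
    pattern = pattern.lower()
    if pattern == "*":
        return True  # a bare asterisk matches every host
    if pattern.startswith("*."):
        base = pattern[2:]
        # Assumption: the wildcard covers the base domain itself and any
        # name that ends in "." followed by the base domain.
        return host == base or host.endswith("." + base)
    return host == pattern
```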

In the following example, all applications hosted within the "example.org" and "example.com" domains are allowed access to the XML content containing the processing instruction:

<?access-control allow="*.example.org
        *.example.com"?>

To allow any application hosted in any domain to access the XML content, set the value of allow to a single asterisk ('*') as shown in the following example:

<?access-control allow="*"?>

To allow any application hosted in the "example.com" domain with the exception of applications hosted within the "visitors.example.com" domain to access the XML content, set the value of allow to "*.example.com" and the value of deny to "*.visitors.example.com" as shown in the following example:

<?access-control allow="*.example.com"
        deny="*.visitors.example.com"?>

3 Security Considerations for User Agent Implementors and Application Authors

The processing instruction is designed explicitly to enable extending the sandbox for "read" access to XML content. It is not designed to enforce sandboxing restrictions itself or to provide generalized trust validation. The expectation is that the user agent's default sandboxing policy is stricter. Therefore, it is always safe to fall back to the default policy in the event of an error.

A user agent running inside a trusted corporate network and executing untrusted content should enforce a sandboxing policy by denying access. In contrast, it may be appropriate to relax this policy when the user agent is executing only trusted applications that require access to arbitrary XML feeds on the local network. User agent vendors that allow this sandboxing policy to be configured are encouraged to provide guidance on the appropriate settings. It is critical that network administrators understand the security issues pertinent to their environment and configure their systems appropriately. In tandem, developers and web server administrators must be aware of the dangers of trusting a user agent that can be configured to disable sandboxing.

User agents which implement this capability should take care not to expose other trusted data (cookies, HTTP header data) inappropriately. The access-control processing instruction is only designed to enable access to the XML content.

User agents which implement this capability should also take care to properly normalize Unicode and to properly interpret IDNs to prevent URL spoofing attacks.
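As an illustration of such normalization, the Python sketch below reduces a hostname to a lowercase ASCII form before any comparison. It relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules (RFC 3490); whether those rules are the right choice for a given user agent is an open assumption, and the function name normalize_host is hypothetical.

```python
import unicodedata


def normalize_host(host: str) -> str:
    """Return a lowercase ASCII (Punycode) form of host for comparison."""
    # Fold compatibility characters before IDNA conversion.
    host = unicodedata.normalize("NFKC", host)
    # The codec converts each non-ASCII label to its "xn--" form and
    # raises UnicodeError for labels it cannot encode.
    return host.encode("idna").decode("ascii").lower()
```

Comparing normalized forms on both sides, the requesting host and each allow or deny item, avoids treating visually identical names with different encodings as distinct.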

Application authors should be aware that XML content retrieved from another site is not itself trustworthy. Authors should take care not to expose themselves to cross-site scripting attacks by failing to validate the returned content or by executing the retrieved content directly.

A References

AC-NOTE: Authorizing Read Access to XML Content Using the <?access-control?> Processing Instruction 1.0, ed. Matt Oshry, Brad Porter, RJ Auburn. W3C Working Group Note, 13 June 2005. See http://www.w3.org/TR/access-control/.
DOM3LS: Document Object Model (DOM) Level 3 Load and Save Specification, ed. Johnny Stenback and Andy Heninger. W3C Recommendation, April 2004. See http://www.w3.org/TR/DOM-Level-3-LS/.
RFC2616: Hypertext Transfer Protocol -- HTTP/1.1, ed. R. Fielding et al. IETF RFC 2616, June 1999. See http://www.ietf.org/rfc/rfc2616.txt.
RFC3986: Uniform Resource Identifier (URI): Generic Syntax, ed. T. Berners-Lee et al. IETF RFC 3986, January 2005. See http://www.ietf.org/rfc/rfc3986.txt.
VXML21: VoiceXML 2.1, ed. Matt Oshry et al. W3C Candidate Recommendation, June 2005. See http://www.w3.org/TR/2005/CR-voicexml21-20050613/.
XML: Extensible Markup Language (XML) 1.0, ed. Tim Bray et al. W3C Recommendation, February 2004. See http://www.w3.org/TR/2004/REC-xml-20040204/.

W3C Working Draft 17 May 2006
