W3C

PR-PICSRules-971104

PICSRules 1.1

W3C Proposed Recommendation 4 Nov 1997

This is:
http://www.w3.org/TR/PR-PICSRules-971104
Editor:
Martin Presler-Marshall, IBM <mpresler@us.ibm.com>
Authors:
Christopher Evans, Microsoft <cevans@microsoft.com>
Clive D.W. Feather, Demon Internet Ltd. <clive@demon.net>
Alex Hopmann, Microsoft <alexhop@microsoft.com>
Martin Presler-Marshall, IBM <mpresler@us.ibm.com>
Paul Resnick, University of Michigan <presnick@umich.edu>

Status of this document

This document is in the course of review by the members of the World-Wide Web Consortium. This is a stable document derived from internal Working Drafts of the W3C PICSRules Working Group and the public working draft dated 26 October, 1997. Details of this review have been distributed to members' representatives. Comments by non-members should be sent to w3c-PICSRules-ask@w3.org.

The review period will end on 14 December 1997 24:00 GMT. Within 14 days from that time, the document's disposition will be announced: it may become a W3C Recommendation (possibly with minor changes), or it may revert to Working Draft status, or it may be dropped as a W3C work item. This document does not at this time imply any endorsement by the Consortium's staff or member organizations.

This document is part of the W3C (http://www.w3.org/) Metadata activity.

A list of current W3C Recommendations, Proposed Recommendations and Working Drafts can be found at: http://www.w3.org/TR


Abstract

This document defines a language for writing profiles, which are filtering rules that allow or block access to URLs based on PICS labels that describe those URLs.

Definitions

This specification uses the same words as RFC 1123 [RFC1123] for defining the significance of each particular requirement. These words are:

MUST
This word or the adjective "required" means that the item is an absolute requirement of the specification.
SHOULD
This word or the adjective "recommended" means that there may exist valid reasons in particular circumstances to ignore this item, but the full implications should be understood and the case carefully weighed before choosing a different course.
MAY
This word or the adjective "optional" means that this item is truly optional. One vendor may choose to include the item because a particular marketplace requires it or because it enhances the product, for example; another vendor may omit the same item.

An implementation is not compliant if it fails to satisfy one or more of the MUST requirements for the protocols it implements. An implementation that satisfies all the MUST and all the SHOULD requirements for its protocols is said to be "unconditionally compliant"; one that satisfies all the MUST requirements but not all the SHOULD requirements for its protocols is said to be "conditionally compliant." User-agents which process PICSRules are free to choose any interpretation they wish for constructs which fail to meet one of the MUST requirements.

This document assumes that the reader has a working knowledge of PICS-1.1. All labels referred to here are assumed to be PICS-1.1 compliant labels. See references [PicsServices] and [PicsLabels] for details.

The PICSRules language: examples

Example 1: Forbid access to certain URLs

 1 (PicsRule-1.1  
 2     (
 3     Policy (RejectByURL ("http://*@www.grody.com:*/*" 
                            "http://*@www.gross.net:*/*"))
 4     Policy (AcceptIf "otherwise")
 5     )
 6 )

The numbers on the left are line numbers for ease of reference; they aren't part of the actual rule.

This example forbids access to a specific set of URLs, without using any PICS labels. Any URL that specifies the host www.grody.com or www.gross.net will be blocked, regardless of the username, port number, or particular file path that is specified in the URL; any other URLs are considered acceptable.

Example 2: Forbid access based on PICS labels

 1 (PicsRule-1.1
 2     (
 3     serviceinfo (
 4         "http://www.coolness.org/ratings/V1.html"
 5         shortname "Cool"
 6         bureauURL "http://labelbureau.coolness.org/Ratings"
 7         UseEmbedded "N"
 8         )
 9     Policy (RejectIf "((Cool.Coolness <= 3) or (Cool.Graphics >= 3))")
10     Policy (AcceptIf "otherwise")
11     ) 
12 )

This rule checks the rating given to documents according to the "Cool" rating service ("http://www.coolness.org/ratings/V1.html"). Labels will be fetched from the label bureau "http://labelbureau.coolness.org/Ratings". Labels embedded in the document are ignored because the document authors can't be trusted to assess their own coolness.  Documents which are not sufficiently cool or have too many graphics will be blocked. Everything else, including unlabeled documents, will be allowed.

Example 3: Allow access based on PICS labels: block everything else

 1 (PicsRule-1.1
 2     (
 3     ServiceInfo (
 4         name "http://www.coolness.org/ratings/V1.html"
 5         shortname "Cool"
 6         bureauURL "http://labelbureau.coolness.org/Ratings"
 7         )
 8     Policy (RejectUnless "(Cool.Coolness)")
 9     Policy (AcceptIf "((Cool.Coolness > 3) and (Cool.Graphics < 3))")
10     Policy (RejectIf "otherwise")
11     ) 
12 )

This rule also checks the rating given to documents according to the "Cool" rating service. In this case, because UseEmbedded is not specified, it defaults to using embedded labels in addition to labels it fetches from the label bureau. Line 8 says that documents will be blocked unless we have a rating on the "Coolness" scale of the "Cool" rating system ("http://www.coolness.org"). Line 9 says that documents which are sufficiently cool, and don't have too many graphics, will be passed. Line 10 says to block all other documents.

Example 4: A more complex example

 1 (PicsRule-1.1
 2     (
 3         name    (rulename "Example 4"
 4                  description "Example 4 from PICSRules spec; simply shows 
                                 how PICSRules rules are formed. This rule is 
                                 not actually intended for use by real users.") 
 5         source (sourceURL 
                      "http://www1.raleigh.ibm.com/pics/PICSRulz/Example1.html")
 6         ServiceInfo (name "http://www.coolness.org/ratings/V1.html"
 7                      shortname "Cool"
 8                      bureauURL "http://labelbureau.coolness.org/Ratings")
 9         ServiceInfo ("http://www.kid-protectors.org/ratingsv01.html"
10                      shortname "KP")
11         Policy (RejectByURL ("http://*@www.badnews.com:*/*" 
                                "http://*@www.worsenews.com:*/*"
                                "*://*@18.0.0.0!8:*/*"))
12         Policy (AcceptByURL "http://*rated-g.org/movies*")
13         Policy (AcceptIf "(KP.educational = 1)" 
                    Explanation "Always allow educational content.")
14         Policy (RejectIf "(KP.violence >= 3)" 
                    Explanation "Blood's a %22scary%22 thing.")
15         Policy (RejectUnless "(Cool.Graphics < 4)" )
16         Policy (AcceptIf "otherwise")
17     )
18 )

Explanation of example

Line
Explanation
1
Defines this construct as a PICSRules rule, and gives the version number.
3
Provides a short, human-readable name for this rule. There is no requirement for uniqueness on this name; it's meant as a mnemonic for users when manipulating rules in some sort of a user interface.
4
Provides a longer, human-readable description of this rule. This is meant to be used for an explanation of the semantics of this rule, and is also intended for users when manipulating rules in some sort of a user interface.
5
Specifies "where the rule came from". This URL is intended to point to a human-readable Web page which will give more information about this rule, who created it, why it was created, possible updates, etc.
6-8
Defines the rating service "http://www.coolness.org/ratings/V1.html", with short name Cool and identifies a label bureau from which to fetch its labels.
9-10
Defines the rating service "http://www.kid-protectors.org/ratingsv01.html", with short name KP. No label bureau is defined for this service; only embedded labels will be used.
11
Reject any HTTP URLs from the www.badnews.com and www.worsenews.com hosts, and all URLs that specify a host whose IP address has 18 as its first eight bits (these are the addresses corresponding to mit.edu).
12
Accept URLs whose domain names end in rated-g.org and whose pathnames begin "movies", but only if no username or port number is specified. For example "http://www.mystuff.rated-g.org/movies/hello" would be accepted, but neither "http://joe@www.mystuff.rated-g.org/movies/hello" nor "http://www.mystuff.rated-g.org:8009/movies/hello" would be accepted at this point in the rule processing (although they might be accepted by one of the subsequent policy statements).
13
Specifies that documents which have an educational rating of 1 in the KP rating system (defined above) will be allowed. Documents which have no rating under this rating system, or which have a rating other than 1 will be examined according to the rules which follow.
14
Specifies that documents which have a violence rating of 3 or more in the KP rating system (defined above) will be blocked; explanatory text is provided for user-agents to display to users: after decoding, the text is: Blood's a "scary" thing. Documents which have no rating under this rating system, or which have a lower rating will be examined according to the rules which follow.
15
Specifies that documents which have a Graphics rating of 4 or more under the Cool rating system will be blocked. Documents which have no rating under the Cool system, or whose rating does not include the Graphics category, will be blocked. Documents which have a Graphics rating less than 4 will be examined according to the rules which follow.
16
Specifies that documents which have not been either passed or blocked by the filter rules above will be passed.

The summary of this rule is the following:

  1. Reject things from two sites and from one block of IP addresses; otherwise accept certain other things from the rated-g.org domain.
  2. Educational pages are OK, regardless of whether they have violence or any other objectionable content.
  3. Pages showing a lot of violence will be blocked unless they are educational.
  4. Except for educational pages, pages with too many graphics will be blocked.
  5. Anything else is fair game.

Full syntax

It is intended that this syntax will be registered as a MIME type, application/pics-rules.

Let us first consider the basic underpinnings of a PICSRules rule, then the general format of the rule, and finally the format of the expressions found in filter clauses.
 

Basic structure

PICSRules rules are based on a limited form of an S-expression, namely a parenthesized attribute-value pair. A value is either a quoted string or a parenthesized list of additional attribute-value pairs, thus allowing nesting. When a value for an attribute is a list of further pairs, there is a concept known as a "primary attribute". The name of the primary attribute may be omitted, for the sake of readability, so that only the value of the primary attribute is specified. A parser can syntactically distinguish values from attributes (values begin with either a quote or left parenthesis); any values that are not paired with attribute names automatically belong to the primary attribute. When a value for an attribute is a list of pairs, the list MUST include the primary attribute-value pair (with or without the primary attribute name specified); it MAY contain additional attribute-value pairs. The general grammar for these limited S-expressions is:
attrvalpair:: attribute whitespace value | value

attribute:: alphanumstr 

value:: quotedstring |'(' attrvalpair+ ')'

quotedstring:: '"'notdoublequotechar*'"' | "'"notsinglequotechar*"'"

alphanumstr:: (alphanum | '.')+

whitespace:: ' ' | '\t' | '\r' | '\n' 

alphanum:: '0' - '9' | 'A' - 'Z' | 'a' - 'z'

notdoublequotechar :: any Unicode character except "
notsinglequotechar :: any Unicode character except '

The grammar uses " to quote strings, but ' may be used instead, provided that the same character starts and ends the string:
        "string"
        'string'
    but not:
        "string'
        'string"

As a shorthand in the rest of the BNF, we will use "double quotes" for all quoted strings, with the understanding that single quotes are equally valid as a delimiter. Also as a shorthand, we use notquotechar to mean any Unicode character other than the quoting delimiter (either " or ')  used for the current string.
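The attribute-value grammar above can be exercised with a small recursive-descent parser. The following Python sketch is illustrative only: the function names (tokenize, parse_pairs, parse_value, parse) and the tuple representation are assumptions made for this example and are not defined by this specification. The sketch covers only the attrvalpair grammar; it does not handle the 'PicsRule-' header, comments, or % escapes.

```python
import re

# One token: '(', ')', a quoted string (either delimiter), or an attribute name.
TOKEN = re.compile(r"""[ \t\r\n]*(?:(\()|(\))|"([^"]*)"|'([^']*)'|([0-9A-Za-z.]+))""")

def tokenize(text):
    """Yield (kind, value) tokens; kind is '(', ')', 'str' or 'attr'."""
    text = text.rstrip(" \t\r\n")
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.group(1):
            yield ("(", "(")
        elif m.group(2):
            yield (")", ")")
        elif m.group(3) is not None:
            yield ("str", m.group(3))
        elif m.group(4) is not None:
            yield ("str", m.group(4))
        else:
            yield ("attr", m.group(5))

def parse_pairs(tokens, i):
    """Parse attrvalpair+ up to a ')' or the end; return (pairs, next_index).

    Each pair is (attribute_or_None, value). None marks a value whose
    attribute name was omitted, i.e. the primary attribute.
    """
    pairs = []
    while i < len(tokens) and tokens[i][0] != ")":
        attr = None
        if tokens[i][0] == "attr":
            attr = tokens[i][1].lower()  # attribute names are case-insensitive
            i += 1
        if i >= len(tokens):
            raise ValueError("attribute without a value")
        value, i = parse_value(tokens, i)
        pairs.append((attr, value))
    return pairs, i

def parse_value(tokens, i):
    """Parse a quoted string or a parenthesized list of pairs."""
    kind, text = tokens[i]
    if kind == "str":
        return text, i + 1
    if kind == "(":
        pairs, i = parse_pairs(tokens, i + 1)
        if i >= len(tokens):
            raise ValueError("missing ')'")
        if not pairs:
            raise ValueError("a list must contain at least one pair")
        return pairs, i + 1
    raise ValueError(f"expected a value, found {text!r}")

def parse(text):
    """Parse a sequence of attribute-value pairs."""
    tokens = list(tokenize(text))
    pairs, i = parse_pairs(tokens, 0)
    if i != len(tokens):
        raise ValueError("unbalanced ')'")
    return pairs
```

For instance, parsing serviceinfo ("http://www.coolness.org/ratings/V1.html" shortname "Cool") yields one pair whose value is a list; the first element of that list has None as its attribute because the primary attribute name was omitted.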

The other quoting character may appear within a string. In order to accommodate the use of both single and double quotes inside strings, the following escaping conventions apply:

  1. " may be encoded as %22
  2. ' may be encoded as %27
  3. % may be encoded as %25
  4. % followed by anything other than 22, 27, or 25 is syntactically invalid

Note that, although ", ', and % are encoded using the % hex hex encoding rule used for special characters in URLs, other % hex hex combinations are not valid and are not considered encodings of other characters.
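The escaping conventions can be sketched as a small decoder. This Python fragment is an illustration under stated assumptions: the name decode_quoted is hypothetical, the input is taken to be the text between the quote delimiters, and ValueError stands in for whatever error reporting a user-agent chooses.

```python
import re

# The only three escapes that are valid inside a quoted string.
_ESCAPES = {"%22": '"', "%27": "'", "%25": "%"}

def decode_quoted(body):
    """Decode the text found between the delimiters of a quoted string."""
    def replace(match):
        token = match.group(0)
        if token not in _ESCAPES:
            raise ValueError(f"invalid escape {token!r}")
        return _ESCAPES[token]
    # Match '%' plus up to two following characters, so that a truncated
    # or unknown escape is seen by replace() and rejected.
    return re.sub(r"%.{0,2}", replace, body, flags=re.DOTALL)
```

Because the substitution does not rescan its own output, an input such as %2522 decodes to the four characters %22 and is not decoded a second time.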
 

Character string as represented in a PICS Rule    Parsed and decoded character string
"string"                                          string
'string'                                          string
'This is "quoted" text.'                          This is "quoted" text.
"It's nice to quote."                             It's nice to quote.
"It%27s nice to %22quote.%22"                     It's nice to "quote."
"50%25 of test scores are above the median"       50% of test scores are above the median
"50% are below the median"                        <syntactically invalid string>

 

Internationalization

RFC 2070 [RFC2070] on internationalization of HTML describes the more general SGML distinction between the internal character encoding and external character encoding. In those terms, Unicode is the internal character set for PICSRules rules. Unicode is a character set that includes characters from most languages; it is a 16-bit character set. We designate UTF-8 as the official external encoding for PICSRules. UTF-8 [UTF-8] has the useful properties that all USASCII characters are represented by themselves, and that they do not appear as part of the encoding of anything else. This means that most processing need not know about UTF-8 provided that it does not strip the top bit of 8-bit bytes.

Note that in order to properly interpret a PICSRules rule, the UTF-8 transformation is applied first, to convert the rule into a sequence of Unicode characters. Each quoted string must then be passed through a converter that unescapes quotes, converting %22 to ", %27 to ', and %25 to %.

Note that all attribute names are case insensitive, while the case of values MUST be preserved. However, individual clauses and/or attributes MAY define their values to be case-insensitive.

Comments

The PICSRules syntax, which will be presented below, has a facility for descriptive text which can be shown to a user, in addition to various statements which influence the behavior of user-agents. However, it is frequently useful to have "source-level" comments: comments which are intended for individuals writing and/or editing rules, but which are not intended for display to end users. This is analogous to placing comments in source code; in an effort to encourage rule authors to write clear rules, we provide a facility for placing comments into PICSRules rules.

The syntax of a comment is:

comment:: '{' comment-text* '}'
comment-text:: any characters except '}'

Note that a result of the above syntax is that comments may not be nested.

Comments may appear anywhere in PICSRules rules. A user-agent MAY remove the comments during lexical analysis of the rule; text within comments MUST NOT influence the interpretation of the rule in any manner. Note also that user-agents which generate or export PICSRules rules MAY choose to strip out comments before generating, exporting, or transmitting them.
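A user-agent that removes comments during lexical analysis might do so along the following lines. This Python sketch rests on one assumption that the text above does not settle: a '{' that occurs inside a quoted string is treated as part of the string rather than as the start of a comment. The name strip_comments is hypothetical.

```python
def strip_comments(rule_text):
    """Return rule_text with every {...} comment removed."""
    out = []
    quote = None        # delimiter of the string being read, or None
    in_comment = False
    for ch in rule_text:
        if in_comment:
            # Comments do not nest: the first '}' always ends the comment.
            if ch == "}":
                in_comment = False
            continue
        if quote is not None:
            out.append(ch)
            if ch == quote:
                quote = None
            continue
        if ch == "{":
            in_comment = True
            continue
        if ch in "\"'":
            quote = ch
        out.append(ch)
    if in_comment:
        raise ValueError("unterminated comment")
    return "".join(out)
```

The whitespace on either side of a removed comment is kept, so tokens that were separated by a comment remain separated.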

PICSRules Rules

The general format of a PICSRules rule, in modified BNF, is as follows. Some elements, such as "policy-expression" and "URLpattern" are used here but defined later in the document.

rule:: '(' 'PicsRule-'verMajor'.'verMinor rule-body ')'

verMajor :: integer

verMinor :: integer

rule-body :: '(' rule-clauses ')'

rule-clauses :: rule-clause+

rule-clause :: policy-clause | 
               name-clause |
               source-clause |
               service-clause |
               opt-extension-clause |
               req-extension-clause |
               extension-aval

policy-clause :: 'Policy' '(' policy-attribute+ ')'

policy-attribute :: ['Explanation'] quotedstring |
                    'RejectByURL' URL-strings |
                    'AcceptByURL' URL-strings | 
                    'RejectIf' policy-string |
                    'RejectUnless' policy-string |
                    'AcceptIf' policy-string |
                    'AcceptUnless' policy-string |
                    extension-aval
URL-strings :: URL-string | '(' ['patterns'] URL-string+ ')'
URL-string :: '"'URLpattern'"'
policy-string :: '"'policy-expression'"'

name-clause :: 'name' '(' name-attribute+ ')'

name-attribute :: ['Rulename'] quotedstring |
                  'Description' quotedstring |
                  extension-aval

source-clause :: 'source' '(' source-attribute+ ')'

source-attribute :: ['SourceURL'] quotedURL |
                    'CreationTool' quotedstring |
                    'author' quoted-address |
                    'LastModified' quoted-ISO-date |
                    extension-aval

service-clause :: 'serviceinfo' '(' service-attribute+ ')'

service-attribute :: ['Name'] quotedURL |
                     'shortname' quotedstring |
                     'BureauURL' quotedURL |
                     'UseEmbedded' yes-no |
                     'Ratfile' quotedstring |
                     'BureauUnavailable' pass-fail |
                     extension-aval
yes-no :: '"'Y-N'"'
Y-N :: 'Y' | 'N'
pass-fail :: '"'P-F'"'
P-F :: 'PASS' | 'FAIL'
opt-extension-clause :: 'optextension' '(' extension-name+ ')'

extension-name :: ['extension-name'] quotedURL | 'shortname' quotedstring |
                  extension-aval

req-extension-clause :: 'reqextension' '(' extension-name+ ')'

extension-aval :: attrvalpair
quotedURL :: '"'URL'"'
URL :: as defined in RFC-1738 for URLs.

quoted-address :: '"'e-mail-address'"'

e-mail-address :: as defined in RFC-822 for addresses.
quoted-ISO-date :: '"'YYYY'-'MM'-'DD'T'hh':'mmStz'"'
     based on the ISO 8601:1988 date and time standard, restricted
     to the specific form described here:
     YYYY :: four-digit year
     MM :: two-digit month (01=January, etc.)
     DD :: two-digit day of month (01 through 31)
     hh :: two digits of hour (00 through 23) (am/pm NOT allowed)
     mm :: two digits of minute (00 through 59)
     S  :: sign of time zone offset from UTC ('+' or '-')
     tz :: four digit amount of offset from UTC
           (e.g., 1512 means 15 hours and 12 minutes)
     For example, "1994-11-05T08:15-0500" is a valid quoted-ISO-date
     denoting November 5, 1994, 8:15 am, US Eastern Standard Time
     Note: The ISO standard allows considerably greater
     flexibility than that described here.  PICS requires precisely
     the syntax described here -- neither the time nor the time zone may
     be omitted, none of the alternate formats are permitted, and
     the punctuation must be as specified here.
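
The restricted date form can be checked with a single regular expression. The Python sketch below is illustrative: the name parse_quoted_iso_date is an assumption, the argument is the text between the quotes, and range checking of the individual fields is delegated to the standard datetime type.

```python
import re
from datetime import datetime, timedelta, timezone

# YYYY-MM-DDThh:mmStz with every field present and no alternate forms.
_ISO_DATE = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2})([+-])([0-9]{2})([0-9]{2})"
)

def parse_quoted_iso_date(text):
    """Convert the restricted date form into a timezone-aware datetime."""
    m = _ISO_DATE.fullmatch(text)
    if m is None:
        raise ValueError(f"not in YYYY-MM-DDThh:mmStz form: {text!r}")
    year, month, day, hour, minute = (int(m.group(i)) for i in range(1, 6))
    offset = timedelta(hours=int(m.group(7)), minutes=int(m.group(8)))
    if m.group(6) == "-":
        offset = -offset
    # datetime() and timezone() raise ValueError for out-of-range fields.
    return datetime(year, month, day, hour, minute, tzinfo=timezone(offset))
```

Applied to 1994-11-05T08:15-0500, the function returns a datetime whose UTC offset is minus five hours.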

General Semantics

An application program will invoke a rule evaluator, providing a rule and a URL, and perhaps labels that were embedded in the document associated with the URL or passed in the HTTP headers along with the document associated with the URL. A yes (accept) or no (reject) answer is returned. The rule evaluator SHOULD also return the explanation string associated with the policy clause that determines the final answer, if such an explanation string is provided.

The serviceinfo clause or clauses specify how to find labels associated with a given URL (from one or more label bureaus or embedded in the document). The Policy clause or clauses determine whether an accept or reject answer is returned. Extension clauses (either required or optional) may cause additional labels to be collected or discarded, or otherwise change the meaning of a rule. The semantics of a rule are defined based on a user agent making a  best-effort attempt to retrieve labels from all the specified sources and using all the retrieved labels in evaluating policy clauses. A user agent may, however, perform optimizations, such as consulting a local source (a cache or a CD-ROM) that provides the same labels as those provided at a specified URL, or not collecting labels at all when those labels could not possibly change the rule's result.

Later in this document, we suggest that implementors adopt a particular evaluation order. Implementors should be very careful about any deviations from this suggested evaluation order. Note that it is possible to write rules that are non-monotonic in the receipt of labels: as more labels are received, the result could flip from accept to reject and back again. In some situations, however, it may be possible to infer that additional labels cannot alter the result of a rule: for example, the first policy clause may specify that a particular URL is to be accepted, based solely on its URL, regardless of any labels that are available. As an optimization, a user agent may use the policy clause(s) to determine an answer even before labels are available from all of the sources specified in the serviceinfo clause(s), but implementors should be careful to do this only in those situations where the additional labels, even if they were available, could not change the results of the evaluation.

Semantics & details of individual clauses

Policy
The Policy clause has seven defined attributes: RejectByURL, AcceptByURL, RejectIf, AcceptIf, RejectUnless, AcceptUnless,  and explanation. See the section on URL filtering for an explanation of the first two, which accept or reject items based solely on their URLs. See the section on Label Filtering  for an explanation of the next four, which accept or reject items based on the available labels that describe them. The primary attribute is explanation, and it has no default value. Any given Policy clause MUST have exactly one attribute from the set of {RejectIf, AcceptIf, RejectUnless, AcceptUnless, RejectByURL, AcceptByURL}. It is not acceptable for a Policy clause to have more than one of these attributes. The Policy clause may be repeated multiple times in a rule to impose a set of restrictions.
If multiple Policy clauses are given in a rule, the clauses are evaluated in the order given in the rule. Evaluation stops at the first clause which is satisfied, and the associated action is taken. The following table defines the attributes, how they are satisfied, and their meaning:
Attribute in clause    Satisfied by                                 Action
RejectByURL            URL matches any of the patterns specified    Block document
AcceptByURL            URL matches any of the patterns specified    Pass document
RejectIf               expression = true                            Block document
AcceptIf               expression = true                            Pass document
RejectUnless           expression = false                           Block document
AcceptUnless           expression = false                           Pass document
 

If none of the policy clauses is satisfied, then the document is passed. This is equivalent to making the final policy clause be AcceptIf "otherwise".
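
The first-match evaluation of Policy clauses can be sketched in Python as follows. This is a simplified illustration, not a definition: the names ACTIONS and evaluate, and the representation of each clause as an (attribute, test, explanation) triple whose test is a function returning a plain boolean, are assumptions made for the example.

```python
# attribute -> (test result that satisfies the clause, True if it passes the document)
ACTIONS = {
    "RejectByURL":  (True,  False),
    "AcceptByURL":  (True,  True),
    "RejectIf":     (True,  False),
    "AcceptIf":     (True,  True),
    "RejectUnless": (False, False),
    "AcceptUnless": (False, True),
}

def evaluate(policies, url, labels):
    """Return (accepted, explanation) for the first satisfied clause.

    policies is a list of (attribute, test, explanation) triples in the
    order they appear in the rule; test(url, labels) returns a bool.
    """
    for attribute, test, explanation in policies:
        satisfied_when, accepted = ACTIONS[attribute]
        if test(url, labels) == satisfied_when:
            return accepted, explanation
    # No clause was satisfied: the document is passed.
    return True, None

# Hypothetical tests modeled on Example 2; labels maps "Service.Category" to a value.
def too_uncool(url, labels):
    coolness = labels.get("Cool.Coolness")
    graphics = labels.get("Cool.Graphics")
    return (coolness is not None and coolness <= 3) or (graphics is not None and graphics >= 3)

def otherwise(url, labels):
    return True

example_policies = [
    ("RejectIf", too_uncool, None),
    ("AcceptIf", otherwise, None),
]
```

With example_policies, a document labeled with a Coolness of 2 is blocked by the first clause, while an unlabeled document falls through to the second clause and is passed.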

name
This clause provides a short, human-readable name for the rule being presented. It is intended that these names could be shown on a user-agent's user interface, to show a human operator which rules are loaded, active/inactive, etc.
There are 2 attributes, rulename and description, defined for the name clause. Rulename is the primary attribute for a name clause, and its value is the human-readable name of this rule. The value for description is a more-detailed analogue of name; it provides a human-readable description of the rule being presented. The description is intended for display in a user-agent's user interface, to allow a human operator to get some understanding of who created the rule, its semantics, etc. The exact contents of the value associated with description are left up to the rule author.
Note that this mechanism does not provide a transparent method for supporting multiple national languages. This is intentionally not being addressed in this version of PICSRules. If you wish to produce PicsRules-1.1 rules in multiple languages, you will have to produce multiple copies of the rule - one for each target language.
source
This clause gives information about where the rule came from. There are 4 attributes defined for source: sourceURL, creationTool, author, and lastModified. The primary attribute is sourceURL.
The sourceURL attribute gives the "rule's URL". It provides a location where a human user of this rule can go to get more information about the rule and/or its creator. The value of this attribute should be a URL where a user can find a human-readable description of this rule.
The creationTool attribute gives the ability to identify the tool, if any, that was used to create this rule. This is analogous to the User-Agent string in HTTP. The value of the creationTool is a quoted string. The string should be in the format toolname/version, as in "Cool-PICS-Rule-Editor/1.04".
The author attribute gives the e-mail address of the individual or organization who produced this rule. The value associated with this attribute MUST be a quoted e-mail address.
The lastModified attribute gives the date and time that this rule was last modified. The value MUST be a quoted-ISO-date, as defined in the PICS-1.1 Label Syntax and Communication Protocols.
serviceinfo
This clause specifies information about a rating service. There are 6 attributes defined for serviceinfo: name, shortname, bureauURL, UseEmbedded, ratfile, and bureauUnavailable. The primary attribute is name.
The name attribute is the servicename URL of a rating service. Its value specifies the name of the service which is being described.
The shortname attribute gives a shorter name to this rating service. The shortname will be used in writing filter clauses; its value is a string. For example, for the rating service "http://coolness.raleigh.ibm.com/ratings/V1.html", the shortname might be "Cool".
The bureauURL attribute specifies the URL of a label bureau that has ratings from this rating service. The value for this attribute is the URL of a label bureau.  This attribute MAY occur multiple times. The user agent MUST attempt to retrieve labels from all the URLs specified and use all of those labels in evaluating policies.
The UseEmbedded attribute determines whether to use labels transmitted in the HTTP header stream along with a document or embedded in an HTML document using the META element. If this attribute is omitted, the default is to use such labels. If the attribute is supplied with the value "N", then labels for this service that are embedded in the document are ignored, as are labels transmitted in the HTTP header stream. This may be useful if the writer of the rule does not trust authors of documents to be truthful in the labels they supply, and more reliable labels are available from a label bureau.
The bureauUnavailable attribute specifies what to do when none of the label bureau(s) listed in bureauURL attributes can be contacted. The defined values for this attribute are "PASS" and "FAIL", which cause the rule to return the corresponding value, regardless of what other labels are found.
The ratfile attribute presents the machine-readable rating system description (also known as "RAT file") that is used by this rating service. This may be specified in one of two ways: the value may be a quoted string which contains the entire machine-readable service description, or it may be of the syntax "[RATFile-URL]", where RATFile-URL is the URL of the rating system description; a user-agent SHOULD assume that dereferencing this URL will produce a document of type application/pics-service. There is no default value for the ratfile attribute. If the quoted string contains the machine-readable service description, then it MUST be escaped as mentioned above.
opt-extension-clause
opt-extension-clause and req-extension-clause are the extension mechanisms in PICSRules; they are modeled after the extension mechanism in the PICS label format. More information on the extension mechanism is given below.
The opt-extension-clause has two defined attributes: extension-name and shortname. The value of the extension-name attribute specifies the name of an extension that will be used by this rule. The name of the extension is the quotedURL; this URL should point to a human-readable description of this extension. URLs are used for extension names to ensure uniqueness without requiring a central naming body. The value of the shortname attribute is a quoted string, but MUST use only valid attribute name characters (a-z, A-Z, 0-9). The shortname is used as a prefix of attribute names, to identify attributes defined by this extension.
If a user-agent receives a rule which contains an optextension which it does not recognize, the user-agent should process the rule, ignoring any clauses it does not recognize. This means that any optional extensions MUST use the attribute-value syntax given above, so as to not break existing parsers. Note that declaring the use of an optional extension may appear to be redundant, as unrecognized attribute-value pairs are discarded by user-agents. The purpose of the optextension construct is for use as a documentation mechanism. User-agents MAY also display the names of optional extensions used by a rule, asking the user for confirmation, before making use of a rule.
req-extension-clause
This clause has two defined attributes: extension-name and shortname. The value of the extension-name attribute specifies the name of an extension that will be used by this rule. The name of the extension is the quotedURL; this URL should point to a human-readable description of this extension. URLs are used for extension names to ensure uniqueness without requiring a central naming body. The value of the shortname attribute is a quoted string, but MUST use only valid attribute name characters (a-z, A-Z, 0-9). The shortname is used as a prefix of attribute names, to identify attributes defined by this extension.
If a user-agent is asked to process a request about the acceptability of a URL, using a rule which contains a req-extension-clause which the user agent does not recognize, the user agent should signal an error.
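The differing treatment of unrecognized optional and required extensions can be sketched as follows. The name check_extensions, the (kind, url) representation of declared extensions, and the use of ValueError to signal the error are all assumptions made for this Python illustration; the example.org URLs in the usage note are placeholders.

```python
def check_extensions(declared, supported):
    """Return the unrecognized optional extensions, which may be ignored.

    declared:  list of (kind, url) pairs, where kind is 'optextension'
               or 'reqextension' and url is the extension name.
    supported: set of extension-name URLs this user-agent implements.
    Raises ValueError if a required extension is not recognized.
    """
    ignored = []
    for kind, url in declared:
        if url in supported:
            continue
        if kind.lower() == "reqextension":
            raise ValueError(f"unrecognized required extension: {url}")
        ignored.append(url)
    return ignored
```

A rule that declares http://www.example.org/ext/a as optional and is given to a user-agent that supports no extensions can still be processed; the same declaration made with reqextension causes an error to be signalled.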
verMajor
The major version number of PICSRules which this rule conforms to. This version of PICSRules uses '1' as its major version number.
verMinor
The minor version number of PICSRules which this rule conforms to. This version of PICSRules uses '1' as its minor version number.

Restrictions

The following semantic restrictions are imposed on rules:

  1. The name and source clauses MUST NOT appear more than once each in a PICSRules rule.
  2. The optextension, reqextension, and serviceinfo clauses MAY appear more than once in a PICSRules rule.
  3. Each Policy clause MUST have exactly one attribute from the set of {AcceptIf, RejectIf, AcceptUnless, RejectUnless, AcceptByURL, RejectByURL}.
  4. A rule MAY contain any number of Policy clauses.
  5. A Policy clause MUST NOT contain more than one explanation attribute.
  6. The shortname attribute of an extension clause or a service clause takes a quoted string as a value, but that string MUST include only characters that are acceptable for use in attribute names.
  7. A PICSRules parser MUST maintain the order of the values (or value-lists) given for the attributes in a rule.
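
Several of these restrictions can be checked mechanically once a rule has been parsed. The Python sketch below covers restrictions 1, 3 and 5 only. The name check_restrictions and the input representation, a list of (clause_name, pairs) where each pair is (attribute_or_None, value) and None stands for an omitted primary attribute name, are assumptions made for the example.

```python
ACTION_ATTRIBUTES = {
    "acceptif", "rejectif", "acceptunless", "rejectunless",
    "acceptbyurl", "rejectbyurl",
}
SINGLE_USE_CLAUSES = {"name", "source"}

def check_restrictions(clauses):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    seen = set()
    for clause_name, pairs in clauses:
        clause = clause_name.lower()
        # Restriction 1: name and source appear at most once each.
        if clause in SINGLE_USE_CLAUSES:
            if clause in seen:
                problems.append(f"{clause} clause appears more than once")
            seen.add(clause)
        if clause == "policy":
            # An unnamed value belongs to the primary attribute, explanation.
            attrs = [(attr or "explanation").lower() for attr, _ in pairs]
            # Restriction 3: exactly one accept/reject attribute.
            actions = [a for a in attrs if a in ACTION_ATTRIBUTES]
            if len(actions) != 1:
                problems.append(
                    f"Policy clause has {len(actions)} accept/reject attributes"
                )
            # Restriction 5: at most one explanation.
            if attrs.count("explanation") > 1:
                problems.append("Policy clause has more than one explanation")
    return problems
```

Returning a list of problems rather than raising on the first one is a design choice of this sketch; it lets a rule editor report every violation at once.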

URL-Based Filtering

In policy clauses, the AcceptByURL and RejectByURL attributes employ the URLpattern element, whose BNF is given below.

URLpattern :: internet-pattern | other-pattern
internet-pattern :: internet-scheme '://' 
                   [user '@'] hostoraddr [':' port] ['/' pathmatch]
internet-scheme :: '*' | 'ftp' | 'http' | 'gopher' | 'nntp' |
                   'irc' | 'prospero' | 'telnet'
user :: ['*' | '%*'] notquotechar* ['*' | '%*']

hostoraddr :: ['*' | '%*'] host | ipwild ['!' bitlength]

ipwild :: ipcomponent '.' ipcomponent '.' ipcomponent '.' ipcomponent
ipcomponent :: integer between '0' and '255' inclusive
bitlength :: integer between '0' and '32' inclusive

host :: substring of a fully qualified domain name as described 
        in Section 3.5 of [RFC1034]
port :: '*' | integerorwild [ '-' integerorwild ]
pathmatch :: ['*' | '%*'] notquotechar* ['*' | '%*']
integerorwild :: digit+ | '*'
digit :: '0' - '9'

other-pattern :: scheme ':' ['*' | '%*'] notquotechar* ['*' | '%*']
scheme :: '*' | schemechar+
schemechar :: 'a' - 'z' | 'A' - 'Z' | digit | '+' | '.' | '-' 
              (as specified in [RFC1738])

A RejectByURL policy clause causes the overall rule to "reject" if the URL match evaluates to TRUE. Similarly, an AcceptByURL policy clause causes the overall rule to "accept" if the URL match evaluates to TRUE. In either case, the explanation associated with the policy clause is returned. If a list of URL patterns is provided, the URL match evaluates to TRUE if any one of the patterns matches. If the URL match evaluates to FALSE, the policy clause is ignored and evaluation continues with the next policy clause.

When comparing a URLpattern to a URL, the rule interpreter MUST NOT unencode the URL (e.g., do not convert %2F to /). If the pattern can be interpreted as an internet-pattern, then the pattern is divided into its component parts, and the URL matches the pattern if a match occurs on every component that is included in the pattern.
scheme
'*' for the pattern matches every scheme. Otherwise, an exact string match is required, but the comparison is case-insensitive. The scheme component MUST NOT be omitted from the pattern.
user
'*' at the beginning or end of the pattern matches any number of characters in the URL string. '%*' at the beginning or end of the pattern matches the single character '*' in the URL string. Characters in the middle of the pattern must match exactly the characters in the URL string; this comparison is case-sensitive. A user component of "*" in the pattern also matches URLs that omit the user component. If the user component is omitted from the pattern, there is a match only if the user component is also omitted from the URL.
password
PICSRules patterns do not specify passwords. A pattern matches URLs that specify any password, as well as URLs that omit the password component.
ipwild
If the hostoraddr in the pattern is an ipwild, then the corresponding host component of the URL is first resolved into a set of IP addresses. The pattern matches if it matches any of the IP addresses. If '!' and a bitlength are specified, both the pattern and the URL's addresses are converted from decimal into binary notation and the first bitlength bits are compared. Thus, the '!' has the same semantics that '/' normally has when specifying subnets or CIDR blocks; we use '!' because '/' could be misinterpreted as the delimiter that precedes the path. 18.23.7.22!16 is equivalent to 18.23.0.0!16, because comparisons will be done only on the first 16 bits.
host
'*' at the beginning of the pattern matches any number of characters in the URL string. '%*' at the beginning of the pattern matches the single character '*' in the URL string. Subsequent characters in the pattern must exactly match the remaining characters in the URL string; this comparison is case-insensitive.
Note that if the pattern specifies a host name (or a host name with wildcards), it does not match a URL that specifies an IP address, even if the host name in the pattern would resolve to the IP address in the URL. This avoids the need to perform expensive reverse DNS lookups based on IP addresses in URLs. Either a host or an ipwild component MUST be specified in the URL pattern.
port
If the pattern specifies two numbers, it matches against any port number in that range. For example, if the pattern's port component is "80-82", it would match a URL whose port is 80, 81, or 82. The wildcard * as part of a range indicates an extreme value. That is, if the pattern's port is "*-82", it matches all ports less than or equal to 82; if the pattern's port is "80-*", it matches all ports greater than or equal to 80. If the pattern's port is simply "*", it matches URLs with any port number, including URLs that omit the port number component. If the pattern's port is omitted, it matches only URLs that also omit the port number.
path
'*' at the beginning or end of the pattern matches any number of characters in the URL string. '%*' at the beginning or end of the pattern matches the single character '*' in the URL string. Characters in the middle of the pattern must match exactly the characters in the URL string; this comparison is case-sensitive. A path component of "*" in the pattern also matches URLs that omit the path component. If the path component is omitted from the pattern, there is a match only if the path component is also omitted from the URL.
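
The bit-prefix comparison described for ipwild can be sketched as follows. This is an illustration only, not part of the specification: the function name ipwild_matches is hypothetical, the host is assumed to have been resolved to IPv4 addresses already, and a pattern without '!' is assumed to compare all 32 bits.

```python
import ipaddress

def ipwild_matches(pattern: str, resolved_addrs: list[str]) -> bool:
    """Return True if any resolved address matches an ipwild['!' bitlength] pattern."""
    addr_text, _, bits_text = pattern.partition("!")
    # Assumption: with no '!' part, all 32 bits are compared.
    bits = int(bits_text) if bits_text else 32
    if not 0 <= bits <= 32:
        raise ValueError("bitlength must be between 0 and 32")
    # Shift away the low-order bits that the pattern does not compare.
    shift = 32 - bits
    wanted = int(ipaddress.IPv4Address(addr_text)) >> shift
    return any(
        int(ipaddress.IPv4Address(addr)) >> shift == wanted
        for addr in resolved_addrs
    )
```

Under these assumptions, ipwild_matches("18.23.7.22!16", ["18.23.200.1"]) and ipwild_matches("18.23.0.0!16", ["18.23.200.1"]) give the same answer, because only the first 16 bits take part in the comparison.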

WARNING: if a component is not specified in the pattern, the pattern matches only URLs that omit that component. It is necessary to specify '*' for pattern components if the intention is to ignore that component of URLs. For example, to block access to all URLs that contain the string "buy" in the pathname, the correct pattern is "*://*@*:*/*buy*". While it might seem natural to write the pattern "*://*/*buy*" or even "*buy*", the first would match only URLs that omit the username and port number, and the second is simply not a valid pattern.
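
The port component is a convenient place to see the "omitted matches only omitted" behavior in action. The sketch below is illustrative; the name port_matches is hypothetical, None stands for an omitted component, and it assumes that a single number or a range in the pattern never matches a URL that omits the port (the text above states this only for an omitted pattern port and for "*").

```python
from typing import Optional

def port_matches(pattern: Optional[str], url_port: Optional[int]) -> bool:
    """Compare a pattern's port component with a URL's port (None means omitted)."""
    if pattern is None:
        # An omitted pattern port matches only URLs that also omit the port.
        return url_port is None
    if pattern == "*":
        # A bare wildcard matches any port, including an omitted one.
        return True
    if url_port is None:
        # Assumption: explicit numbers and ranges need an explicit URL port.
        return False
    low_text, dash, high_text = pattern.partition("-")
    if not dash:
        return int(low_text) == url_port
    # A '*' inside a range stands for the extreme value at that end.
    low = 0 if low_text == "*" else int(low_text)
    high = 65535 if high_text == "*" else int(high_text)
    return low <= url_port <= high
```

With this sketch, a pattern that leaves out the port rejects a URL with port 8080, which is why the pattern in the warning spells out ":*".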

If the pattern cannot be interpreted as an internet-pattern, it is divided into a scheme name and a scheme-specific part. '*' for the scheme name matches any URL's scheme; otherwise exact string matching is required; this comparison is case-insensitive. '*' at the beginning or end of the scheme-specific part of the pattern matches any number of characters in the URL string. '%*' at the beginning or end of the pattern matches the single character '*' in the URL string. Characters in the middle of the scheme-specific part of the pattern must match exactly the characters in the URL string; this comparison is case-sensitive.
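
The same leading and trailing wildcard rule applies to the user, path, and scheme-specific parts. A minimal sketch of that rule follows; the function name wildcard_match is hypothetical, and the code compares the raw, still-encoded strings case-sensitively.

```python
def wildcard_match(pattern: str, text: str) -> bool:
    """Match text against a pattern of the form ['*' | '%*'] literal ['*' | '%*']."""
    wild_start = wild_end = False
    prefix = suffix = ""
    # Leading marker: '%*' stands for a literal '*', a bare '*' is a wildcard.
    if pattern.startswith("%*"):
        prefix, pattern = "*", pattern[2:]
    elif pattern.startswith("*"):
        wild_start, pattern = True, pattern[1:]
    # Trailing marker, read from what remains after the leading marker.
    if pattern.endswith("%*"):
        suffix, pattern = "*", pattern[:-2]
    elif pattern.endswith("*"):
        wild_end, pattern = True, pattern[:-1]
    literal = prefix + pattern + suffix
    if wild_start and wild_end:
        return literal in text
    if wild_start:
        return text.endswith(literal)
    if wild_end:
        return text.startswith(literal)
    return text == literal
```

For instance, wildcard_match("*buy*", "/shop/buy/now") holds, while wildcard_match("*buy*", "/shop/Buy") does not, because the middle characters are compared case-sensitively.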
 
NOTE: It is not possible to write a URLpattern that matches exactly the URL string characters '%*'. This is not a limitation of the pattern matching language, however, because, in a valid URL, the '%' character must be followed by two hex digits. Thus, there are no URL strings containing the character sequence '%*'.

Known Limitations

Since %-encoded characters in URLs are not unencoded before comparison, a server may choose to treat two URLs as synonyms that the PICS rule evaluator will not treat as synonyms. That is, the URLs <http://www.student1.mit.edu/sex>, <http://www.student1.mit.edu/%73%65%78> and <http://www.student1.mit.edu/se%78> might all cause the server to send back the same page, if the server follows a rule of unencoding the URL path (%73 becomes 's', %65 becomes 'e' and %78 becomes 'x').

Unfortunately, the alternative matching rule, of always unencoding URLs before comparing to the pattern, can cause ambiguities. For example, in HTTP, ? is reserved as the query string delimiter; any naturally occurring ? is encoded as %3F. After unencoding it would no longer be possible to distinguish a query string delimiter from a naturally occurring ?. We felt it was better to make the pattern matching precise, at the expense of missing some synonyms.
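
Both sides of this trade-off can be observed with a standard URL-unquoting routine. The sketch below uses Python's urllib.parse.unquote purely as a stand-in for a server's unencoding step; the example.org URL is a made-up illustration.

```python
from urllib.parse import unquote

# The three synonyms from the text above, compared as raw strings.
urls = [
    "http://www.student1.mit.edu/sex",
    "http://www.student1.mit.edu/%73%65%78",
    "http://www.student1.mit.edu/se%78",
]
distinct_raw = len(set(urls))                       # what a rule evaluator sees
distinct_decoded = len({unquote(u) for u in urls})  # what a decoding server sees

# Unencoding first would hide which '?' is the query string delimiter.
ambiguous = unquote("http://example.org/what%3F?q=1")
```

Here distinct_raw is 3 and distinct_decoded is 1, while ambiguous contains two '?' characters that can no longer be told apart.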

Another, similar limitation is that IP addresses in URLs are not converted into host names for comparison to rule patterns. This means that host name-based patterns will miss matching against certain synonymous IP-address based URLs. The pattern "http://*.mit.edu" will match against fewer URLs than the pattern "http://18.0.0.0!8". The latter pattern will match against web sites ending in mit.edu, because they all will resolve to IP addresses beginning with 18. The reason that URLs containing IP addresses will not match against patterns that specify domain names is that performing a reverse lookup of the IP address in the URL is too expensive an operation to perform routinely. Hence, whenever it is practical to do so, rules may want to specify IP address matching rather than host name matching; beware, however, that this may require updating of the rule whenever a host name switches to a different IP address.

Label-Based Filtering

The attributes AcceptIf, RejectIf, AcceptUnless, and RejectUnless of the Policy clause all take a policy-expression as an argument. A policy-expression is an expression operating on various labels; this section defines the syntax and semantics for those expressions.

policy-expression :: simple-expression | or-expression |
                     and-expression | degenerate-expression

simple-expression :: '(' service ['.' category [op constant ] ] ')'

service :: any shortname defined in a serviceinfo clause within this rule
category :: transmit-name-char+ ['/' category]
    Note: as in the [PicsLabels] spec, if the rating service defines 
    hierarchically nested categories, the outermost category name goes 
    at the left, followed by a slash, then the next category name, etc.
transmit-name-char :: alphanumpm | '.' | '$' | ',' | ';' | ':' 
                | '&' | '=' | '?' | '!' | '*' | '~' | '@'
                | '#' | '_' | '%' hex hex

alphanumpm :: 'A' | ... | 'Z' | 'a' | ... | 'z' | '0' | ... | '9' | '+' | '-'
hex :: '0' | ... | '9' | 'A' | ... | 'F' | 'a' | ... | 'f'
op :: '>' | '<' | '=' | '>=' | '<='

constant :: [sign] alphanum* ['.' alphanum*]

or-expression :: '(' policy-expression [or policy-expression]+ ')'

or :: 'or' 

and-expression :: '(' policy-expression [and policy-expression]+ ')'

and :: 'and' 

sign :: '-'

degenerate-expression :: 'otherwise'

When evaluating a clause, the user-agent may use zero, one, or more labels from a given rating service (for more details, see the control flow section). A simple-expression evaluates to true if any available label from the specified service satisfies the condition of the expression. Intuitively, a rule evaluator will try to prove that an expression is satisfied, using any available labels as evidence.

We must deal with the situation where a simple-expression calls for a value from a label, and either no label is available, or the available labels do not have values for the specified category. In those situations, the simple-expression evaluates to false. This leads to an intuitive semantics: if a simple-expression has no associated label available, that expression cannot contribute evidence toward proving the claim made by the expression.
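
The "any available label as evidence" reading of a simple-expression with an operator can be sketched as below. The data layout is an assumption made for illustration: each label is a pair of a service shortname and a mapping from category name to a list of numeric values, and simple_expression is a hypothetical helper name.

```python
import operator

# The five comparison operators allowed by the 'op' production.
OPS = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}

def simple_expression(labels, service, category, op, constant):
    """True if some value in some label from the service satisfies the test."""
    compare = OPS[op]
    return any(
        compare(value, constant)
        for label_service, ratings in labels
        if label_service == service
        # A label without this category contributes no evidence.
        for value in ratings.get(category, [])
    )
```

With labels = [("Service", {"s": [2, 3]})], the call simple_expression(labels, "Service", "s", "<", 3) is true because one value is below 3, and the same call against an empty label list is false.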

Simple-expressions, as defined above, can use any types of operators on any types of data. More specifically, the semantics of expression evaluation are as follows:

Early drafts of PICSRules-1.0 included a != operator, which is intuitively useful. It was removed, because, in the presence of either zero or multiple values, the intuitive semantics for != are inconsistent with the semantics for other operators. For example, suppose that a label includes (s (2 3)), indicating values on the s dimension of both 2 and 3. This label would satisfy the policy-expression (Service.s < 3), because there exists a value less than 3. The intuitive semantics for !=, however, is to require that all the values be unequal to three. We found that smart people could easily get confused when mixing the existential quantification (there exists a value less than 3) with universal quantification (all values are unequal to 3). Moreover, "x != 3" is normally a synonym for "((x < 3) or (x > 3))". But in the presence of multiple values, this would not hold. We believed that it was worse to have an operator with non-intuitive semantics than to not have the operator at all, so it was removed from PICSRules-1.1.

The careful reader will also note the lack of the Boolean not operator, as well as the lack of universally quantified operators such as max, min, and forall. These omissions are deliberate, and for similar reasons to the omission of !=. Given that the available labels may provide either no values or multiple values for particular categories, rules become very difficult for people to understand when such operators are allowed in an unrestricted way. We have restricted the use of negation and universal quantification to appear only at the top-level, using the attributes AcceptIf, AcceptUnless, RejectIf, and RejectUnless, as described below. Our restricted language still has full expressiveness, however, by taking advantage of the fact that "forall x, g(x) holds" is mathematically equivalent to "there does not exist x such that g(x) does not hold". For example, suppose one wants to accept any URL so long as all the labels agree on an s-value equal to three. The policy clause would be:
Policy (AcceptUnless "((Service.s < 3) or (Service.s > 3))" )
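
The equivalence between this top-level negation and universal quantification can be checked on small value sets. The sketch below is an illustration under simplifying assumptions: the values list stands for every s-value found in the available labels, and both function names are hypothetical.

```python
def exists(values, predicate):
    """Existential test, as used inside policy expressions."""
    return any(predicate(v) for v in values)

def accept_unless_not_three(values):
    """Accepts when no value is below 3 and no value is above 3."""
    return not (exists(values, lambda v: v < 3) or exists(values, lambda v: v > 3))

def all_equal_three(values):
    """The universally quantified statement the clause is meant to express."""
    return all(v == 3 for v in values)
```

For the value sets [3, 3], [2, 3], [4] and the empty set, the two functions agree; in particular both accept when no values are available at all.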
 

Control Flow

The rule syntax and semantics given above define what can be placed in a rule, and the meaning of those constructs. In order to process these rules, a user-agent SHOULD adopt an internal data-flow as described here; this will ease the implementation of expected extensions to PICSRules, when they become formalized.
The standard user-agent which processes PICSRules rules SHOULD have four significant components: the rule parser, the label source, label validators, and a rule evaluator. Their roles are:
Rule parser
Parses PICSRules rules, possibly loaded from saved configuration information or over a network. In user-agents which may store multiple rules, such as proxy servers, this component is also responsible for deciding which rule to use for each specific request; subsequent modules assume that only one rule is being applied.
Label source
This component is responsible for finding labels. It takes as input information from the rule being evaluated; it MAY use this information to contact label bureaus for labels. It MAY also find labels embedded in HTML documents or transmitted in datastreams (HTTP, NNTP) which support label transmission. The output of this component is the set of labels which apply to the document in question. Note that as there are multiple potential label sources, the label source component may produce more than one label from a given service for a given document. However, the label source component is responsible for choosing the "most applicable" label when that is appropriate (i.e., picking specific labels over generic ones, and picking the most specific generic label if multiple generic labels are available). The label source will need to specify to the other components not only the label itself, but also how the label was obtained (embedded in content, from a label bureau, etc.).
Label validators
Label validators are responsible for determining which labels are acceptable. No validators are defined in the PICSRules language, but we expect extensions to be defined. For example, a label validator may be defined which rejects labels that lack an authorized digital signature. Another possible validator would examine whether a label's author has been vouched for by a trusted third party.
Rule evaluator
The rule evaluator takes as input the labels that pass any validators, and the Policy clauses that the rule parser found in the rule. It evaluates the permission and prohibition expressions and produces a pass/fail decision.
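
The data flow between the last three components can be sketched as a small pipeline. Everything here is a hypothetical shape chosen for illustration (the Label class, the decide function, and the callable signatures); the specification does not prescribe any particular interface, and rule parsing is assumed to have happened already.

```python
from dataclasses import dataclass

@dataclass
class Label:
    service: str   # shortname of the rating service
    ratings: dict  # category name -> list of values
    origin: str    # how the label was obtained, e.g. "embedded" or "bureau"

def decide(url, label_sources, validators, evaluate):
    """Run label sources, then validators, then the rule evaluator."""
    # Label source: gather every label that applies to the URL.
    labels = [label for source in label_sources for label in source(url)]
    # Label validators: keep only labels that every validator accepts.
    accepted = [lab for lab in labels if all(check(lab) for check in validators)]
    # Rule evaluator: produce the pass/fail decision from the surviving labels.
    return evaluate(url, accepted)
```

A caller would pass one callable per label source and per validator, so that a future extension can add a validator without touching the evaluator.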

Extension mechanism

Any well-designed network protocol provides a mechanism for extension. Here we present the extension mechanism provided with PICSRules.

Background

PICSRules is structured as a nested set of attribute-value pairs. Unrecognized attribute keywords are ignored by user-agents, and the associated values can be discarded by a PICSRules parser, as all values will be in a known syntax. The basic mechanism for extending PICSRules is to define new clauses and/or attribute-value pairs, their context, and their meaning. All new attribute-value pairs will be associated with a named extension. Names of extensions are URLs, and hence globally distinct. When used in a PICSRule, extension attribute names are preceded by a shortname for the extension that defines the attribute, so as to avoid potential attribute naming conflicts.

Details

To define a new extension:

  1. Determine if the extension is optional or required. Optional extensions may be ignored by user-agents which don't recognize the extension. In order for an extension to be "optional", the semantics of a rule which uses this extension must not be modified if the extension is ignored.
  2. Name the extension. Extensions must have a unique URL assigned to them. The URL should point to a human-readable document which explains the extension in detail. The URL must be in a domain controlled by the extension's creator, in order to ensure uniqueness of extension names.
  3. If an extension needs new clause names, define, using the new-clause-name attribute, the extension-clause-name that will be used for each new clause defined by this extension. Each extension SHOULD define no more than one new clause.
  4. Determine the new attribute-value pairs that this extension will define, and which clauses those attribute-value pairs may appear in.
  5. Define the semantics of each new attribute-value pair defined by this extension. In particular, if this extension overrides existing parts of PICSRules, then this behavior must be spelled out exactly. If an extension overrides the existing semantics of PICSRules, it should be a required extension (using reqextension rather than optextension).

Here's a simple example of a PICSRules rule that uses an optional extension:

 1 (PicsRule-1.1
 2     (
 3         ServiceInfo (
 4             "http://www.coolness.org/ratings/V1.html"
 5             shortname "Cool" 
 6             bureauURL "http://labelbureau.coolness.org/Ratings"
 7             )
 8         Policy (AcceptIf "((Cool.Coolness < 3) or (Cool.Graphics < 3))" )
 9         Policy (RejectIf "otherwise")
10         optextension (
               "http://www.si.umich.edu/~presnick/pics/extensions/PRsample.htm" 
11             shortname "extension1")
12         extension1.SampleAttribute (
13             UseExpired "YES" 
14             GroupFile "/etc/ics.grp"
15             )
16     )
17 )

This example makes use of an optional extension named "http://www.si.umich.edu/~presnick/pics/extensions/PRsample.htm". That extension defines the keyword SampleAttribute. User-agents which don't understand this extension can simply ignore the extension1.SampleAttribute clause and its attribute-value pairs (lines 12-15).

Note that there is only one "level" to declaring extensions, but attribute-value pairs defined by extensions may appear anywhere within a PICSRules rule. That is, all extensions should declare themselves with an optextension or reqextension clause within a rule-clause, but the attributes defined by an extension may appear nested several layers down within a rule.
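
How a user-agent might treat declared extensions can be sketched as follows. The representation is an assumption made for illustration: clauses are (name, value) pairs from an already parsed rule, the two dictionaries map each declared shortname to its extension URL, and usable_clauses is a hypothetical helper.

```python
def usable_clauses(clauses, optional, required, supported):
    """Drop clauses of unrecognized optional extensions; fail on required ones."""
    for shortname, url in required.items():
        if url not in supported:
            # An unrecognized required extension is an error.
            raise ValueError(f"unsupported required extension: {url}")
    ignored = {name for name, url in optional.items() if url not in supported}
    kept = []
    for name, value in clauses:
        prefix, dot, _ = name.partition(".")
        # Extension attributes carry the extension's shortname as a prefix.
        if dot and prefix in ignored:
            continue
        kept.append((name, value))
    return kept
```

Applied to the example rule above by a user-agent that does not support the sample extension, this sketch keeps the Policy clauses and drops extension1.SampleAttribute.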

References

[PicsLabels]
Jim Miller, editor. "PICS Label Distribution Label Syntax and Communication Protocols". http://www.w3.org/PICS/labels.html.
[PicsServices]
Jim Miller, editor. "Rating Services and Rating Systems (and Their Machine Readable Descriptions)". http://www.w3.org/PICS/services.html.
[RFC1034]
Mockapetris, P., "Domain Names - Concepts and Facilities", STD 13, RFC 1034, USC/Information Sciences Institute, November 1987.  http://ds.internic.net/rfc/rfc1034.txt
[RFC1123]
R. Braden, editor. "Requirements for Internet Hosts -- Application and Support". http://ds.internic.net/rfc/rfc1123.txt.
[RFC1738]
Tim Berners-Lee et al, "Uniform Resource Locators". http://ds.internic.net/rfc/rfc1738.txt.
[RFC2070]
F. Yergeau, G. Nicol, G. Adams, and M. Duerst. "Internationalization of the Hypertext Markup Language". http://ds.internic.net/rfc/rfc2070.txt.
[RFC822]
David H. Crocker, editor. "Standard for the Format of ARPA Internet Text Messages". http://ds.internic.net/rfc/rfc822.txt.
[UNICODE]
The Unicode Consortium, "The Unicode Standard -- Worldwide Character Encoding -- Version 1.0", Addison-Wesley, Volume 1, 1991, Volume 2, 1992, and Technical Report #4, 1993.
[UTF-8]
ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS Transformation Format 8 (UTF-8).

Acknowledgements

We thank the following for their assistance in writing this document; without their help, none of this would have been possible. Special thanks go to David Shapiro, whose parsing code made it possible to test changes in the syntax and examples as we made them.

Scott Berkun, Microsoft
Jonathan Brezin, IBM
Yang-hua Chu, MIT
Lorrie Cranor, AT&T
Jon Doyle, MIT
Ghirardelli Chocolate Co.
Brian LaMacchia, AT&T
Breen Liblong, NetShepherd
Jim Miller, W3C
Mary Ellen Rosen, IBM
Rick Schenk, IBM
Bob Schloss, IBM
David Shapiro, MIT
Ray Soular, SafeSurf