RFC 822: Part 3: Lexical Analysis of Messages

3. LEXICAL ANALYSIS OF MESSAGES

3.1. GENERAL DESCRIPTION

A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters. It is separated from the headers by a null line (i.e., a line with nothing preceding the CRLF).

3.1.1. LONG HEADER FIELDS

Each header field can be viewed as a single, logical line of ASCII characters, comprising a field-name and a field-body. For convenience, the field-body portion of this conceptual entity can be split into a multiple-line representation; this is called "folding". The general rule is that wherever there may be linear-white-space (NOT simply LWSP-chars), a CRLF immediately followed by AT LEAST one LWSP-char may instead be inserted. Thus, the single line
       To:  "Joe & J. Harvey" <ddd @Org>, JJV @ BBN

can be represented as:
       To:  "Joe & J. Harvey" <ddd @ Org>,
               JJV@BBN

and
       To:  "Joe & J. Harvey"
                       <ddd@ Org>, JJV
        @BBN

and
       To:  "Joe &
        J. Harvey" <ddd @ Org>, JJV @ BBN

The process of moving from this folded multiple-line representation of a header field to its single line representation is called "unfolding". Unfolding is accomplished by regarding CRLF immediately followed by a LWSP-char as equivalent to the LWSP-char.
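The unfolding rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `unfold` is hypothetical, and the input is assumed to be one raw field that uses CRLF line endings.

```python
import re

# A fold is a CRLF that is immediately followed by a SPACE or HTAB.
# The lookahead leaves the LWSP-char in place, as the rule requires.
_FOLD = re.compile(r"\r\n(?=[ \t])")

def unfold(raw_field: str) -> str:
    """Return the single-line form of a (possibly folded) header field."""
    return _FOLD.sub("", raw_field)

folded = 'To:  "Joe & J. Harvey" <ddd @ Org>,\r\n        JJV@BBN\r\n'
print(repr(unfold(folded)))
```

Only the CRLF is dropped; the terminating CRLF of the field is kept because no LWSP-char follows it.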

Note:

While the standard permits folding wherever linear-white-space is permitted, it is recommended that structured fields, such as those containing addresses, limit folding to higher-level syntactic breaks. For address fields, it is recommended that such folding occur between addresses, after the separating comma.

3.1.2. STRUCTURE OF HEADER FIELDS

Once a field has been unfolded, it may be viewed as being composed of a field-name followed by a colon (":"), followed by a field-body, and terminated by a carriage-return/line-feed. The field-name must be composed of printable ASCII characters (i.e., characters that have values between 33. and 126., decimal, except colon). The field-body may be composed of any ASCII characters, except CR or LF. (While CR and/or LF may be present in the actual text, they are removed by the action of unfolding the field.)
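As a rough sketch of this structure, the following Python splits an unfolded field into its field-name and field-body and applies the character checks stated above. The name `split_field` and the choice to raise `ValueError` are assumptions made for illustration.

```python
def split_field(unfolded: str):
    """Split an unfolded field into (field-name, field-body)."""
    # Drop the terminating CRLF if it is present.
    if unfolded.endswith("\r\n"):
        unfolded = unfolded[:-2]
    name, colon, body = unfolded.partition(":")
    if not colon:
        raise ValueError("field has no colon")
    # Field-name: printable ASCII 33-126; the colon is excluded by the split.
    if not name or any(not (33 <= ord(ch) <= 126) for ch in name):
        raise ValueError("illegal character in field-name")
    # Field-body: any ASCII character except CR or LF.
    if any(ch in "\r\n" or ord(ch) > 127 for ch in body):
        raise ValueError("illegal character in field-body")
    return name, body

print(split_field("Subject: Hello there\r\n"))
```

The leading SPACE of the body is kept here; whether to trim it is left to the higher-level parser.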

Certain field-bodies of headers may be interpreted according to an internal syntax that some systems may wish to parse. These fields are called "structured fields". Examples include fields containing dates and addresses. Other fields, such as "Subject" and "Comments", are regarded simply as strings of text.

Note:

Any field which has a field-body that is defined as other than simply <text> is to be treated as a structured field.

Field-names, unstructured field bodies and structured field bodies each are scanned by their own, independent "lexical" analyzers.

3.1.3. UNSTRUCTURED FIELD BODIES

For some fields, such as "Subject" and "Comments", no structuring is assumed, and they are treated simply as <text>s, as in the message body. Rules of folding apply to these fields, so that such field bodies which occupy several lines must therefore have the second and successive lines indented by at least one LWSP-char.

3.1.4. STRUCTURED FIELD BODIES

To aid in the creation and reading of structured fields, the free insertion of linear-white-space (which permits folding by inclusion of CRLFs) is allowed between lexical tokens. Rather than obscuring the syntax specifications for these structured fields with explicit syntax for this linear-white-space, the existence of another "lexical" analyzer is assumed. This analyzer does not apply for unstructured field bodies that are simply strings of text, as described above. The analyzer provides an interpretation of the unfolded text composing the body of the field as a sequence of lexical symbols.

These symbols are:

               -  individual special characters
               -  quoted-strings
               -  domain-literals
               -  comments
               -  atoms

The first four of these symbols are self-delimiting. Atoms are not; they are delimited by the self-delimiting symbols and by linear-white-space. For the purposes of regenerating sequences of atoms and quoted-strings, exactly one SPACE is assumed to exist, and should be used, between them. (Also, in the "Clarifications" section on "White Space", below, note the rules about treatment of multiple contiguous LWSP-chars.)

So, for example, the folded body of an address field

       ":sysmail"@  Some-Group. Some-Org,
       Muhammed.(I am  the greatest) Ali @(the)Vegas.WBA

is analyzed into the following lexical symbols and types:
               :sysmail              quoted string
               @                     special
               Some-Group            atom
               .                     special
               Some-Org              atom
               ,                     special
               Muhammed              atom
               .                     special
               (I am  the greatest)  comment
               Ali                   atom
               @                     special
               (the)                 comment
               Vegas                 atom
               .                     special
               WBA                   atom
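The analysis above can be approximated with a small hand-written scanner. This is a sketch under stated assumptions: the body has already been unfolded, the function name `tokenize` and the (text, type) tuple format are hypothetical, and delimiters such as the quotation marks are kept in the returned text rather than stripped as in the listing.

```python
SPECIALS = set('()<>@,;:\\".[]')

def _ends_atom(ch: str) -> bool:
    # Atoms stop at specials, SPACE and control characters (including DEL).
    return ch in SPECIALS or ch == " " or ord(ch) < 32 or ord(ch) == 127

def tokenize(body: str):
    """Return a list of (text, type) pairs for an unfolded structured body."""
    tokens, i, n = [], 0, len(body)
    while i < n:
        c = body[i]
        if c in " \t":                      # linear-white-space: skip
            i += 1
        elif c in '"[':                     # quoted-string or domain-literal
            close = '"' if c == '"' else "]"
            kind = "quoted-string" if c == '"' else "domain-literal"
            j = i + 1
            while j < n and body[j] != close:
                j += 2 if body[j] == "\\" else 1   # skip quoted-pairs
            if j >= n:
                raise ValueError("unterminated " + kind)
            tokens.append((body[i:j + 1], kind))
            i = j + 1
        elif c == "(":                      # comment, which may nest
            depth, j = 1, i + 1
            while j < n and depth:
                if body[j] == "\\":
                    j += 1                  # skip the quoted character
                elif body[j] == "(":
                    depth += 1
                elif body[j] == ")":
                    depth -= 1
                j += 1
            if depth:
                raise ValueError("unterminated comment")
            tokens.append((body[i:j], "comment"))
            i = j
        elif c in SPECIALS:
            tokens.append((c, "special"))
            i += 1
        else:                               # atom
            j = i
            while j < n and not _ends_atom(body[j]):
                j += 1
            if j == i:
                raise ValueError("unexpected control character")
            tokens.append((body[i:j], "atom"))
            i = j
    return tokens

body = ('":sysmail"@  Some-Group. Some-Org, '
        'Muhammed.(I am  the greatest) Ali @(the)Vegas.WBA')
for text, kind in tokenize(body):
    print(f"{text:24}{kind}")
```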

The canonical representations for the data in these addresses are the following strings:
                   ":sysmail"@Some-Group.Some-Org

and
                       Muhammed.Ali@Vegas.WBA

Note:

For purposes of display, and when passing such structured information to other systems, such as mail protocol services, there must be NO linear-white-space between <word>s that are separated by period (".") or at-sign ("@") and exactly one SPACE between all other <word>s. Also, headers should be in a folded form.
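One way to regenerate such a canonical string from a token sequence is sketched below. The (text, type) tuple format and the name `regenerate` are assumptions for illustration: comments are dropped, and exactly one SPACE is emitted between two adjacent words (atoms or quoted-strings).

```python
def regenerate(tokens):
    """Rebuild a canonical string from (text, type) pairs."""
    out = []
    prev_was_word = False
    for text, kind in tokens:
        if kind == "comment":
            # Comments are dropped; a comment separating two words still
            # yields the single SPACE inserted below.
            continue
        is_word = kind in ("atom", "quoted-string")
        if is_word and prev_was_word:
            out.append(" ")
        out.append(text)
        prev_was_word = is_word
    return "".join(out)

tokens = [("Muhammed", "atom"), (".", "special"),
          ("(I am  the greatest)", "comment"), ("Ali", "atom"),
          ("@", "special"), ("(the)", "comment"),
          ("Vegas", "atom"), (".", "special"), ("WBA", "atom")]
print(regenerate(tokens))
```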

3.2. HEADER FIELD DEFINITIONS

These rules show a field meta-syntax, without regard for the particular type or internal syntax. Their purpose is to permit detection of fields; also, they present to higher-level parsers an image of each field as fitting on one line.

field       =  field-name ":" [ field-body ] CRLF

field-name  =  1*<any CHAR, excluding CTLs, SPACE, and ":">

field-body  =  field-body-contents
               [CRLF LWSP-char field-body]

field-body-contents =
              <the ASCII characters making up the field-body, as
               defined in the following sections, and consisting
               of combinations of atom, quoted-string, and
               specials tokens, or else consisting of texts>


3.3. LEXICAL TOKENS

The following rules are used to define an underlying lexical analyzer, which feeds tokens to higher level parsers. See the ANSI references, in the Bibliography.
                                            ; (  Octal, Decimal.)
CHAR        =  <any ASCII character>        ; (  0-177,  0.-127.)
ALPHA       =  <any ASCII alphabetic character>
                                            ; (101-132, 65.- 90.)
                                            ; (141-172, 97.-122.)
DIGIT       =  <any ASCII decimal digit>    ; ( 60- 71, 48.- 57.)
CTL         =  <any ASCII control           ; (  0- 37,  0.- 31.)
                character and DEL>          ; (    177,     127.)
CR          =  <ASCII CR, carriage return>  ; (     15,      13.)
LF          =  <ASCII LF, linefeed>         ; (     12,      10.)
SPACE       =  <ASCII SP, space>            ; (     40,      32.)
HTAB        =  <ASCII HT, horizontal-tab>   ; (     11,       9.)
<">         =  <ASCII quote mark>           ; (     42,      34.)
CRLF        =  CR LF

LWSP-char   =  SPACE / HTAB                 ; semantics = SPACE

linear-white-space =  1*([CRLF] LWSP-char)  ; semantics = SPACE
                                            ; CRLF => folding

specials    =  "(" / ")" / "<" / ">" / "@"  ; Must be in quoted-
            /  "," / ";" / ":" / "\" / <">  ;  string, to use
            /  "." / "[" / "]"              ;  within a word.

delimiters  =  specials / linear-white-space / comment

text        =  <any CHAR, including bare    ; => atoms, specials,
                CR & bare LF, but NOT       ;  comments and
                including CRLF>             ;  quoted-strings are
                                            ;  NOT recognized.

atom        =  1*<any CHAR except specials, SPACE and CTLs>

quoted-string = <"> *(qtext/quoted-pair) <">; Regular qtext or
                                            ;   quoted chars.

qtext       =  <any CHAR excepting <">,     ; => may be folded
                "\" & CR, and including
                linear-white-space>

domain-literal =  "[" *(dtext / quoted-pair) "]"

dtext       =  <any CHAR excluding "[",     ; => may be folded
                "]", "\" & CR, & including
                linear-white-space>

comment     =  "(" *(ctext / quoted-pair / comment) ")"

ctext       =  <any CHAR excluding "(",     ; => may be folded
                ")", "\" & CR, & including
                linear-white-space>

quoted-pair =  "\" CHAR                     ; may quote any char

phrase      =  1*word                       ; Sequence of words

word        =  atom / quoted-string


3.4. CLARIFICATIONS

3.4.1. QUOTING

Some characters are reserved for special interpretation, such as delimiting lexical tokens. To permit use of these characters as uninterpreted data, a quoting mechanism is provided. To quote a character, precede it with a backslash ("\").

This mechanism is not fully general. Characters may be quoted only within a subset of the lexical constructs. In particular, quoting is limited to use within:

               -  quoted-string
               -  domain-literal
               -  comment

Within these constructs, quoting is REQUIRED for CR and "\" and for the character(s) that delimit the token (e.g., "(" and ")" for a comment). However, quoting is PERMITTED for any character.
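For quoted-strings, the rule might be applied as in the following sketch. The helper names `is_atom` and `quote_word` are hypothetical; the code quotes only the characters for which quoting is REQUIRED in a quoted-string (CR, backslash and the quotation mark) and leaves a string that is already a legal atom untouched.

```python
SPECIALS = '()<>@,;:\\".[]'

def is_atom(s: str) -> bool:
    """True if s is a non-empty run of printable, non-special ASCII."""
    return bool(s) and all(
        33 <= ord(ch) <= 126 and ch not in SPECIALS for ch in s)

def quote_word(s: str) -> str:
    """Return s as an atom if possible, otherwise as a quoted-string."""
    if is_atom(s):
        return s
    # Backslash-quote the characters that must be quoted.
    escaped = "".join("\\" + ch if ch in '"\\\r' else ch for ch in s)
    return '"' + escaped + '"'

print(quote_word("Full Name") + "@Domain")
```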

Note:

In particular, quoting is NOT permitted within atoms. For example when the local-part of an addr-spec must contain a special character, a quoted string must be used. Therefore, a specification such as:
                       Full\ Name@Domain

is not legal and must be specified as:
                       "Full Name"@Domain



3.4.2. WHITE SPACE

   Note:  In structured field bodies, multiple linear space ASCII
          characters  (namely  HTABs  and  SPACEs) are treated as
          single spaces and may freely surround any  symbol.   In
          all header fields, the only place in which at least one
          LWSP-char is REQUIRED is at the beginning of  continua-
          tion lines in a folded field.

   When passing text to processes  that  do  not  interpret  text
   according to this standard (e.g., mail protocol servers), then
   NO linear-white-space characters should occur between a period
   (".") or at-sign ("@") and a <word>.  Exactly ONE SPACE should
   be used in place of arbitrary linear-white-space  and  comment
   sequences.

   Note:  Within systems conforming to this standard, wherever  a
          member of the list of delimiters is allowed, LWSP-chars
          may also occur before and/or after it.

   Writers of  mail-sending  (i.e.,  header-generating)  programs
   should realize that there is no network-wide definition of the
   effect of ASCII HT (horizontal-tab) characters on the  appear-
   ance  of  text  at another network host; therefore, the use of
   tabs in message headers, though permitted, is discouraged.

3.4.3. COMMENTS

   A comment is a set of ASCII characters, which is  enclosed  in
   matching  parentheses  and which is not within a quoted-string.
   The comment construct permits message originators to add  text
   which  will  be  useful  for  human readers, but which will be
   ignored by the formal semantics.  Comments should be  retained
   while  the  message  is subject to interpretation according to
   this standard.  However, comments  must  NOT  be  included  in
   other  cases,  such  as  during  protocol  exchanges with mail
   servers.

   Comments nest, so that if an unquoted left parenthesis  occurs
   in  a  comment  string,  there  must  also be a matching right
   parenthesis.  When a comment acts as the delimiter  between  a
   sequence of two lexical symbols, such as two atoms, it is lex-
   ically equivalent with a single SPACE,  for  the  purposes  of
   regenerating  the  sequence, such as when passing the sequence
   onto a mail protocol server.  Comments are  detected  as  such
   only within field-bodies of structured fields.
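The nesting and SPACE-equivalence rules can be illustrated with a short Python sketch. The function name `strip_comments` is hypothetical, and the input is assumed to be the unfolded body of a structured field. Each top-level comment is replaced by one SPACE; parentheses inside quoted-strings are left alone.

```python
def strip_comments(body: str) -> str:
    """Replace each (possibly nested) comment with a single SPACE."""
    out = []
    depth = 0            # current comment nesting level
    in_quotes = False    # inside a quoted-string?
    i = 0
    while i < len(body):
        c = body[i]
        if c == "\\" and (depth or in_quotes) and i + 1 < len(body):
            # A quoted-pair: keep it in quoted-strings, drop it in comments.
            if not depth:
                out.append(body[i:i + 2])
            i += 2
            continue
        if depth:
            if c == "(":
                depth += 1
            elif c == ")":
                depth -= 1
                if depth == 0:
                    out.append(" ")
        elif in_quotes:
            out.append(c)
            if c == '"':
                in_quotes = False
        elif c == '"':
            in_quotes = True
            out.append(c)
        elif c == "(":
            depth = 1
        else:
            out.append(c)
        i += 1
    if depth:
        raise ValueError("unbalanced parentheses in comment")
    return "".join(out)

print(strip_comments("Ali @(the)Vegas.WBA"))
```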

   If a comment is to be "folded" onto multiple lines,  then  the
   syntax  for  folding  must  be  adhered to.  (See the "Lexical
   Analysis of Messages" section on "Folding Long Header  Fields"
   above,  and  the  section on "Case Independence" below.)  Note
   that  the  official  semantics  therefore  do  not  "see"  any
   unquoted CRLFs that are in comments, although particular pars-
   ing programs may wish to note their presence.  For these  pro-
   grams,  it would be reasonable to interpret a "CRLF LWSP-char"
   as being a CRLF that is part of the comment; i.e., the CRLF is
   kept  and  the  LWSP-char is discarded.  Quoted CRLFs (i.e., a
   backslash followed by a CR followed by a  LF)  still  must  be
   followed by at least one LWSP-char.

3.4.4. DELIMITING AND QUOTING CHARACTERS

   The quote character (backslash) and  characters  that  delimit
   syntactic  units  are not, generally, to be taken as data that
   are part of the delimited or quoted unit(s).   In  particular,
   the   quotation-marks   that   define   a  quoted-string,  the
   parentheses that define  a  comment  and  the  backslash  that
   quotes  a  following  character  are  NOT  part of the quoted-
   string, comment or quoted character.  A quotation-mark that is
   to  be  part  of  a quoted-string, a parenthesis that is to be
   part of a comment and a backslash that is to be part of either
   must  each be preceded by the quote-character backslash ("\").
   Note that the syntax allows any character to be quoted  within
   a  quoted-string  or  comment; however only certain characters
   MUST be quoted to be included as data.  These  characters  are
   the  ones that are not part of the alternate text group (i.e.,
   ctext or qtext).

   The one exception to this rule  is  that  a  single  SPACE  is
   assumed  to  exist  between  contiguous words in a phrase, and
   this interpretation is independent of  the  actual  number  of
   LWSP-chars  that  the  creator  places  between the words.  To
   include more than one SPACE, the creator must make  the  LWSP-
   chars be part of a quoted-string.

   Quotation marks that delimit a quoted string  and  backslashes
   that  quote  the  following character should NOT accompany the
   quoted-string when the string is passed to processes  that  do
   not interpret data according to this specification (e.g., mail
   protocol servers).
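Removing the delimiters and quote characters from a quoted-string could look like the sketch below. The name `unquote` is hypothetical, and the argument is assumed to be a complete quoted-string including its surrounding quotation marks.

```python
def unquote(quoted: str) -> str:
    """Return the data of a quoted-string, without quotes or backslashes."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a quoted-string")
    out = []
    i, end = 1, len(quoted) - 1
    while i < end:
        if quoted[i] == "\\":
            # The backslash is not data; the character after it is.
            i += 1
            if i >= end:
                raise ValueError("dangling backslash")
        out.append(quoted[i])
        i += 1
    return "".join(out)

print(unquote('":sysmail"'))
```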

3.4.5. QUOTED-STRINGS

   Where permitted (i.e., in words in structured fields)  quoted-
   strings  are  treated  as a single symbol.  That is, a quoted-
   string is equivalent to an atom, syntactically.  If a  quoted-
   string  is to be "folded" onto multiple lines, then the syntax
   for folding must be adhered to.  (See the "Lexical Analysis of
   Messages"  section  on "Folding Long Header Fields" above, and
   the section on "Case  Independence"  below.)   Therefore,  the
   official  semantics  do  not  "see" any bare CRLFs that are in
   quoted-strings; however particular parsing programs  may  wish
   to  note  their presence.  For such programs, it would be rea-
   sonable to interpret a "CRLF LWSP-char" as being a CRLF  which
   is  part  of the quoted-string; i.e., the CRLF is kept and the
   LWSP-char is discarded.  Quoted CRLFs (i.e., a backslash  fol-
   lowed  by  a CR followed by a LF) are also subject to rules of
   folding, but the presence of the quoting character (backslash)
   explicitly  indicates  that  the  CRLF  is  data to the quoted
   string.  Stripping off the first following LWSP-char  is  also
   appropriate when parsing quoted CRLFs.

3.4.6. BRACKETING CHARACTERS

   There is one type of bracket which must occur in matched pairs
   and may have pairs nested within each other:

       o   Parentheses ("(" and ")") are used  to  indicate  com-
           ments.

   There are three types of brackets which must occur in  matched
   pairs, and which may NOT be nested:

       o   Colon/semi-colon (":" and ";") are   used  in  address
           specifications  to  indicate that the included list of
           addresses is to be treated as a group.

       o   Angle brackets ("<" and ">")  are  generally  used  to
           indicate  the presence of a machine-usable reference
           (e.g., delimiting mailboxes), possibly  including
           source-routing to the machine.

       o   Square brackets ("[" and "]") are used to indicate the
           presence  of  a  domain-literal, which the appropriate
           name-domain  is  to  use  directly,  bypassing  normal
           name-resolution mechanisms.

3.4.7. CASE INDEPENDENCE

Except as noted, alphabetic strings may be represented in any combination of upper and lower case. The only syntactic units which require preservation of case information are:
               -  text
               -  qtext
               -  dtext
               -  ctext
               -  quoted-pair
               -  local-part, except "Postmaster"

   When matching any other syntactic unit, case is to be ignored.
   For  example, the field-names "From", "FROM", "from", and even
   "FroM" are semantically equal and should all be treated ident-
   ically.

   When generating these units, any mix of upper and  lower  case
   alphabetic  characters  may  be  used.  The case shown in this
   specification is suggested for message-creating processes.

   Note:  The reserved local-part address unit, "Postmaster",  is
          an  exception.   When  the  value "Postmaster" is being
          interpreted, it must be  accepted  in  any  mixture  of
          case, including "POSTMASTER", and "postmaster".
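A minimal sketch of these matching rules follows. Both function names are hypothetical. Field-names are compared without regard to case; local-parts are compared exactly, except for the reserved value "Postmaster".

```python
def same_field_name(a: str, b: str) -> bool:
    """Field-names match regardless of case."""
    return a.lower() == b.lower()

def same_local_part(a: str, b: str) -> bool:
    """Local-parts preserve case, except the reserved "Postmaster"."""
    if a.lower() == "postmaster" or b.lower() == "postmaster":
        return a.lower() == b.lower()
    return a == b

print(same_field_name("From", "FroM"))
print(same_local_part("POSTMASTER", "postmaster"))
print(same_local_part("Joe", "joe"))
```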

3.4.8. FOLDING LONG HEADER FIELDS

Each header field may be represented on exactly one line consisting of the name of the field and its body, and terminated by a CRLF; this is what the parser sees. For readability, the field-body portion of long header fields may be "folded" onto multiple lines of the actual field. "Long" is commonly interpreted to mean greater than 65 or 72 characters. The former length serves as a limit, when the message is to be viewed on most simple terminals which use simple display software; however, the limit is not imposed by this standard.

Note:

Some display software often can selectively fold lines, to suit the display terminal. In such cases, sender-provided folding can interfere with the display software.
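As an illustration of sender-side folding, the sketch below folds an address field only between addresses, after the separating comma, as recommended earlier for structured fields. The function name `fold_address_field`, the default limit of 72 and the single-SPACE continuation indent are assumptions; the standard itself imposes no limit.

```python
def fold_address_field(name, addresses, limit=72):
    """Build a folded field, breaking only after the separating commas."""
    lines = [name + ":"]
    last = len(addresses) - 1
    for i, addr in enumerate(addresses):
        piece = addr + ("," if i < last else "")
        candidate = lines[-1] + " " + piece
        if len(candidate) > limit and i > 0:
            # Continuation line: the leading SPACE is the required LWSP-char.
            lines.append(" " + piece)
        else:
            lines[-1] = candidate
    return "\r\n".join(lines) + "\r\n"

print(fold_address_field("To", ["ddd@Org", "JJV@BBN"], limit=12), end="")
```

A single address longer than the limit is left on its own line rather than split inside the address.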

3.4.9. BACKSPACE CHARACTERS

ASCII BS characters (Backspace, decimal 8) may be included in texts and quoted-strings to effect overstriking. However, any use of backspaces which effects an overstrike to the left of the beginning of the text or quoted-string is prohibited.
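A simple check for this restriction is sketched below. The name `backspaces_ok` is hypothetical, and the sketch assumes that every character other than BS advances the print position by exactly one column.

```python
def backspaces_ok(text: str) -> bool:
    """True if no backspace moves left of the start of the text."""
    column = 0
    for ch in text:
        if ch == "\b":
            column -= 1
            if column < 0:
                return False
        else:
            column += 1
    return True

print(backspaces_ok("a\b_"))   # overstrike within the text
print(backspaces_ok("\ba"))    # backspace before the first character
```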

3.4.10. NETWORK-SPECIFIC TRANSFORMATIONS

   During transmission through heterogeneous networks, it may  be
   necessary  to  force data to conform to a network's local con-
   ventions.  For example, it may be required that a CR  be  fol-
   lowed  either by LF, making a CRLF, or by <null>, if the CR is
   to stand alone.  Such transformations are reversed, when  the
   message exits that network.

   When  crossing  network  boundaries,  the  message  should  be
   treated  as  passing  through  two modules.  It will enter the
   first module containing whatever network-specific  transforma-
   tions  that  were  necessary  to  permit migration through the
   "current" network.  It then passes through the modules:

       o   Transformation Reversal

           The "current" network's idiosyncrasies are removed and
           the  message  is returned to the canonical form speci-
           fied in this standard.

       o   Transformation

           The "next" network's local idiosyncrasies are  imposed
           on the message.

                           ------------------
               From   ==>  | Remove Net-A   |
               Net-A       | idiosyncrasies |
                           ------------------
                                  ||
                                  \/
                             Conformance
                             with standard
                                  ||
                                  \/
                           ------------------
                           | Impose Net-B   |  ==>  To
                           | idiosyncrasies |       Net-B
                           ------------------
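The two modules can be sketched for the CR example given above. This is an illustration only: it assumes a hypothetical network convention in which a CR that stands alone must be followed by a NUL character (taken here as the meaning of <null>), and the function names are invented for the example.

```python
import re

def impose_net_convention(text: str) -> str:
    """Transformation: follow every CR that is not part of CRLF with NUL."""
    return re.sub(r"\r(?!\n)", "\r\0", text)

def remove_net_convention(text: str) -> str:
    """Transformation Reversal: drop the NUL that was added after a CR."""
    return text.replace("\r\0", "\r")

canonical = "line one\r\nbare\rreturn\r\n"
in_transit = impose_net_convention(canonical)
print(repr(in_transit))
print(remove_net_convention(in_transit) == canonical)
```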