Copyright © 1998 W3C (MIT, INRIA, Keio ), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This is a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced or made obsolete by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress." A list of current W3C working drafts is also available.
This document describes an experimental design for a multiplexing transport, intended for, but not restricted to, use with the Web. SMUX has been implemented as part of the HTTP/NG project. Use of this protocol is EXPERIMENTAL at this time and the protocol may change. In particular, transition strategies to use of SMUX have not been definitively worked out. You have been warned!
This document is part of a suite of documents describing the HTTP-NG design and prototype implementation:
Note: Since working drafts are subject to frequent change, you are advised to reference the above URL, rather than the URLs for working drafts themselves. This work is part of the W3C HTTP/NG Activity (for current status, see http://www.w3.org/Protocols/HTTP-NG/Activity).
Please send comments on this specification to <www-http-ng-comments@w3.org>.
This document defines the experimental multiplexing protocol referred to as "SMUX". SMUX is a session management protocol separating the underlying transport from the upper level application protocols. It provides a lightweight communication channel to the application layer by multiplexing data streams on top of a reliable stream oriented transport. By supporting coexistence of multiple application level protocols (e.g. HTTP and HTTP/NG), SMUX should ease transitions to future Web protocols, and communications of client applets using private protocols with servers over the same TCP connection as the HTTP conversation.
Tried to clarify terminology.
Moved comparison between SMUX and SCP(TMP) to end of the document, and extracted a goals section from it.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [7].
The Internet is suffering from the effects of the HTTP/1.0 protocol, which was designed without understanding of the underlying TCP [1] transport protocol. HTTP/1.0 opens a TCP connection for each URI [28] retrieved (at a cost of both packets and round trip times (RTTs)), and then closes the TCP connection. For small HTTP requests, these TCP connections have poor performance due to TCP slow start [9] [10] as well as the round trips required to open and close each TCP connection.
There are (at least) three reasons why multiple simultaneous TCP connections have come into widespread use on the Internet despite the apparent inefficiencies:
The "Keep-Alive" extension to HTTP/1.0 is a form of persistent TCP connections but does not work through HTTP/1.0 proxies and does not take pipelining of requests into account. Instead a revised version of persistent TCP connections was introduced in HTTP/1.1 as the default mode of operation.
HTTP/1.1 [6] persistent connections and pipelining [11] will reduce network traffic and the amount of TCP overhead caused by opening and closing TCP connections. However, the serialized behavior of HTTP/1.1 pipelining does not adequately support simultaneous rendering of inlined objects - part of most Web pages today; nor does it provide suitable fairness between protocol flows, or allow for graceful abortion of HTTP transactions without closing the TCP connection (quite common in HTTP operation).
Persistent connections and pipelining, however, do not fully address the rendering nor the fairness problems described above. A "hack" solution is possible using HTTP range requests; however, this approach does not, for example, allow a server to send just the metadata contained in embedded object before sending the object itself, nor does it solve the TCP connection abort problem.
Current TCP implementations do not share congestion information across multiple simultaneous TCP connections between two peers, which increases the overhead of opening new TCP connections. We expect that Transactional TCP [5] and sharing of congestion information in TCP control blocks [8] will improve TCP performance by using fewer RTTs and behaving better under congestion, making it more suitable for HTTP transactions.
The solution to these problems requires two actions; either by itself will not entirely discourage opening multiple TCP connections to the same server from a client.
This document describes such an experimental multiplexing protocol. It is designed to multiplex a TCP connection underneath HTTP so that HTTP itself does not have to change, and allow coexistence of multiple protocols (e.g. HTTP and HTTP/NG), which will ease transitions to future Web protocols, and communications of client applets using private protocols with servers over the same TCP connection as the HTTP conversation.
Ideas from this design come from Simon Spero's SCP [15] [16] description and from experience from the X Window System's protocol design [13].
We believe SMUX meets the following goals:
Multiplexing multiple sessions over a single transport TCP connection introduces a potential deadlock that SMUX is designed to avoid.
Here is an example of potential deadlock:
This example points out that deadlock is possible: SMUX must be able to buffer data independently of the consumers of the data. It must also have some way to throttle sessions where the consumer of the data is not responsive in the multiplexing layer (in this example, prevent the transmission of more than 50 Kbytes of data). Note that this deadlock is independent of the size of any multiplexing fragment, but strictly dependent on availability of buffer space in SMUX for a particular session.
In SMUX, the receiver makes a promise (sends a credit) to the transmitter that a certain amount of buffer space is available (or at least that it will consume the bytes, if not buffer them, e.g. a real time audio protocol where the data is disposed of), and the transmitter promises not to send more data than the receiver has promised (no more than the credit). If these promises are met, then SMUX will not deadlock.
A SMUX implementation MUST maintain and adhere to the credit system or it can deadlock. Implementations on systems with large amounts of memory (e.g. VM systems) may be quite different from ones on thin clients with limited, non-virtual memory. It is reasonable on a VM system to hand out credits freely (analogous to the virtual socket buffering found in TCP implementations); but your implementation must be careful to test its credit mechanisms so that they will interoperate with limited memory systems. Credit control messages MAY be sent on sessions that are not active.
Sessions have an initial credit size (initial_default_credit) of 16 KB on each session; there is a SMUX control message to set this initial credit to something larger than the default.
A transmitter MUST NOT transmit more data in a fragment than the available credit on the session (or it could deadlock).
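The transmitter side of the credit promise can be sketched in C as follows. This is only an illustration under stated assumptions: the struct and function names are hypothetical (the draft defines no API), the credit counter saturates rather than wraps, and the special case of an AddCredit value of zero (no limit) is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16 KB initial credit per session, as stated in the text. */
#define SMUX_INITIAL_DEFAULT_CREDIT (16u * 1024u)

/* Hypothetical per-session transmitter state; not defined by the draft. */
struct smux_tx_session {
    uint32_t credit; /* bytes the receiver has promised to accept */
};

void smux_tx_init(struct smux_tx_session *s)
{
    assert(s != NULL);
    s->credit = SMUX_INITIAL_DEFAULT_CREDIT;
}

/* The receiver granted more credit; add it, saturating at UINT32_MAX. */
void smux_tx_add_credit(struct smux_tx_session *s, uint32_t granted)
{
    assert(s != NULL);
    s->credit = (granted > UINT32_MAX - s->credit) ? UINT32_MAX
                                                   : s->credit + granted;
}

/* Returns how many of `want` bytes may be sent now and debits the credit.
 * A return of 0 means the session must wait for more credit. */
uint32_t smux_tx_reserve(struct smux_tx_session *s, uint32_t want)
{
    assert(s != NULL);
    uint32_t n = want < s->credit ? want : s->credit;
    s->credit -= n;
    return n;
}
```

A caller would ask `smux_tx_reserve` before building each fragment and send at most the returned number of bytes, so the transmitter never exceeds what the receiver promised.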
An SMUX implementation MUST fragment streams when transmitting them into fragments. The max_fragment_size, a variable which is maintained on (currently) a per transport TCP connection basis, determines the largest possible fragment a sender should ever send to a receiver. This determines the maximum latency introduced by a SMUX layer above and beyond the inherent TCP latencies (socket buffering on both sender and receiver and the delay-bandwidth product amount of data that could be in flight at any given instant). A client on a low bandwidth link, or with limited memory buffering might decide to set the max_fragment_size down to control latency and buffer space required. If max_fragment_size is set to zero, the transmitter is left to determine the fragment size and MAY take into account application protocol knowledge (e.g. a SMUX implementation for HTTP might send fragments of the metadata of embedded objects, or the next phase of a progressive image format, which it only knows). An implementation SHOULD honor the max_fragment_size as it transmits data, if it has been set by the receiver.
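One way a transmitter might pick the size of its next fragment is sketched below. The function name and parameters are hypothetical. When max_fragment_size is zero, the sketch simply falls back to the credit limit; a real implementation is free to apply application protocol knowledge instead.

```c
#include <stdint.h>

/* Picks the size of the next fragment for one session.
 *   pending           bytes queued for the session
 *   credit            outstanding credit for the session
 *   max_fragment_size limit set by the receiver, 0 meaning "no limit set"
 * The result never exceeds the credit, and never exceeds
 * max_fragment_size when one has been set. */
uint32_t smux_next_fragment_size(uint32_t pending, uint32_t credit,
                                 uint32_t max_fragment_size)
{
    uint32_t n = pending < credit ? pending : credit;
    if (max_fragment_size != 0 && n > max_fragment_size)
        n = max_fragment_size;
    return n;
}
```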
An SMUX implementation that does not have explicit knowledge or experience of good fragment sizes might use these guidelines as a starting point:
This would result in fragmentation roughly similar to TCP segmentation over multiple TCP connections.
An implementation should round robin between sessions with data to send in some fashion to avoid starving sessions, or allowing a single thread to monopolize the TCP connection. Exact details of such behavior are left to the implementation.
To achieve highest bandwidth and lowest overhead SMUX behavior, credits should be handed out in reasonably large chunks. TCP implementations typically send an ack message on every other packet, and it is very hard to arrange to piggyback acks on data segments in implementations. Therefore, for SMUX to have reasonably low overhead, credits should be handed out in some significant multiple (4 or more times larger) of the ~3000 bytes represented by two packets on an Ethernet. The outstanding credit balance across active sessions will also have to be larger than the bandwidth/delay product of the TCP connection if SMUX is not to become a limit on TCP transport performance.
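Since the exact round-robin behavior is left to the implementation, the following is only one possible sketch. The scheduler struct and function are hypothetical; a session is considered ready when it has both queued data and outstanding credit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMUX_MAX_SESSIONS 256u /* session_id is an 8 bit field */

/* Hypothetical scheduler state; not defined by the draft. */
struct smux_sched {
    uint32_t pending[SMUX_MAX_SESSIONS]; /* bytes queued per session */
    uint32_t credit[SMUX_MAX_SESSIONS];  /* outstanding credit per session */
    unsigned last;                       /* session served most recently */
};

/* Returns the next session after `last` that has data and credit,
 * or -1 if no session can send right now. */
int smux_sched_next(struct smux_sched *sc)
{
    assert(sc != NULL);
    for (unsigned i = 1; i <= SMUX_MAX_SESSIONS; i++) {
        unsigned id = (sc->last + i) % SMUX_MAX_SESSIONS;
        if (sc->pending[id] > 0 && sc->credit[id] > 0) {
            sc->last = id;
            return (int)id;
        }
    }
    return -1;
}
```

Starting the scan just past the last served session is what keeps one busy session from starving the others.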
Both of these arguments indicate that outstanding credits in many implementations should be 10K bytes or more. Implementations SHOULD piggyback credit messages on data packets where possible, to avoid unneeded packets on the wire. A careful implementation in which both ends of the TCP connection are regularly sending some payload should be able to avoid sending extra packets on the network.
If necessary, we could add in a future version fragmentation control messages to do some bandwidth allocation, but for now, we are not bothering.
SMUX headers are always in big endian byte order. 
If people want, we could expand out the union below on a control message
type basis (e.g. the way the C bindings to X events were written out...).
For this draft, I'm not doing so.
 #define MUX_CONTROL       0x00800000
 #define MUX_SYN           0x00400000
 #define MUX_FIN           0x00200000
 #define MUX_RST           0x00100000
 #define MUX_PUSH          0x00080000
 #define MUX_SESSION       0xFF000000
 #define MUX_LONG_LENGTH   0x00040000
 #define MUX_LENGTH        0x0003FFFF
 
 typedef unsigned int flagbit;
 struct w3mux_hdr {
     union {
        struct {
            unsigned int session_id : 8;
            flagbit control : 1;
            flagbit syn : 1;
            flagbit fin : 1;
            flagbit rst : 1;
            flagbit push : 1;
            flagbit long_length : 1;
            unsigned int fragment_size : 18;
            int long_fragment_size : 32; /* only present if long_length is set */
        } data_hdr;
         struct {
            unsigned int session_id : 8;
            flagbit control : 1;
            unsigned int control_code : 4;
            flagbit long_length : 1;
            unsigned int fragment_size : 18;
            int long_fragment_size : 32; /* only present if long_length is set */
        } control_message;
     } contents;
 };
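Because the layout of C bit-fields is implementation-defined, an implementation may prefer to build the 32 bit header word with shifts and masks. The sketch below assumes the long-length flag is the single bit 0x00040000 between the PUSH bit and the 18 bit length field; the function names are hypothetical, and serializing the word into wire byte order is left out.

```c
#include <assert.h>
#include <stdint.h>

#define MUX_CONTROL     0x00800000u
#define MUX_SYN         0x00400000u
#define MUX_FIN         0x00200000u
#define MUX_RST         0x00100000u
#define MUX_PUSH        0x00080000u
#define MUX_SESSION     0xFF000000u
#define MUX_LONG_LENGTH 0x00040000u /* assumed single-bit flag */
#define MUX_LENGTH      0x0003FFFFu

/* Builds a data header word (control bit clear, short length form). */
uint32_t smux_pack_data_hdr(uint8_t session_id, uint32_t flags,
                            uint32_t fragment_size)
{
    assert((flags & ~(MUX_SYN | MUX_FIN | MUX_RST | MUX_PUSH)) == 0);
    assert(fragment_size <= MUX_LENGTH);
    return ((uint32_t)session_id << 24) | flags | fragment_size;
}

/* Extracts the session_id from a header word. */
uint8_t smux_hdr_session(uint32_t hdr)
{
    return (uint8_t)((hdr & MUX_SESSION) >> 24);
}

/* Extracts the 18 bit fragment_size field from a header word. */
uint32_t smux_hdr_length(uint32_t hdr)
{
    return hdr & MUX_LENGTH;
}
```

For example, a first fragment on session 2 that opens the session for the protocol with ID 80 and marks a message boundary would be `smux_pack_data_hdr(2, MUX_SYN | MUX_PUSH, 80)`.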
The fragment_size is always the size in bytes of the fragment, excluding the SMUX header and any padding.
SMUX headers are always (at least) 32 bit aligned. To find the next SMUX header, take the fragment_size, and round up to the next 32 bit boundary.
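The rounding rule can be written as a small helper. This sketch assumes the header occupies 4 bytes, or 8 bytes when the long_length bit is set and the long_fragment_size word follows; the function names are hypothetical, and `fragment_size` here means the effective size (taken from long_fragment_size when the long form is used).

```c
#include <stdint.h>

/* Rounds a fragment size up to the next 32 bit (4 byte) boundary. */
uint64_t smux_padded_size(uint32_t fragment_size)
{
    return ((uint64_t)fragment_size + 3u) & ~(uint64_t)3u;
}

/* Offset of the next header in the stream, given the offset of the
 * current header, its effective fragment size, and whether the long
 * length form (one extra 32 bit word) is in use. */
uint64_t smux_next_header_offset(uint64_t hdr_offset, uint32_t fragment_size,
                                 int long_length)
{
    uint64_t hdr_bytes = long_length ? 8u : 4u;
    return hdr_offset + hdr_bytes + smux_padded_size(fragment_size);
}
```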
Transmitters MAY insert NoOp control messages to force 64 bit alignment of the protocol stream.
A SMUX header with the long_length bit set must use the 32 bits following the SMUX header (the long_fragment_size field) for the value of the fragment_size field, for whatever purpose the fragment_size field is being used for.
Atoms are integers that are used as short-hand names for strings, which are defined using the InternAtom control message. Atoms are only used as protocol ID's in this version of SMUX, though they might be used for other purposes in future versions. Since the atom might be redefined at any time, it is not safe to use an atom unless you have defined it (i.e. you cannot use atoms defined by the other end of a mux connection). Atoms are therefore not unique values, and only make sense in the context of a particular direction of a particular mux connection. This restriction is to avoid having to define some protocol for deallocating atoms, with any round trip overhead that would likely imply.
Strings are defined to be UTF-8 encoded UNICODE strings. (Note that an ascii string is valid UTF-8). The definition of structure of these strings is outside of the scope of this document, though we expect they will often be URI's, naming a protocol or stack of protocols. Atoms always have values between 0x20000 and 0x200ff (a maximum of 256 atoms can be defined).
Strings used for protocol id's MUST be URIs [28].
The protocol used by a session is identified by a Protocol ID, which can either be an IANA port number, or an atom.
In the simplest case, a protocol ID is just a value in the range of 0-0x1FFFF, and specifies the TCP port number (0x0000-0xffff) or UDP port number (0x10000-0x1ffff) of the protocol per the IANA port number registry [17]. Firewall proxies can presume that the bytes should conform to that protocol. Protocol ID's above 0x1ffff are atoms. The scheme name of the URI indicates the protocol family being used.
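A receiver might classify an incoming protocol ID as sketched below. The enum and function names are hypothetical. The sketch assumes atoms occupy exactly 0x20000 through 0x200ff, as stated for atoms earlier, and treats the remainder of the 18 bit space as unassigned.

```c
#include <stdint.h>

enum smux_proto_kind {
    SMUX_PROTO_TCP_PORT,   /* 0x00000-0x0ffff: IANA TCP port */
    SMUX_PROTO_UDP_PORT,   /* 0x10000-0x1ffff: IANA UDP port */
    SMUX_PROTO_ATOM,       /* 0x20000-0x200ff: interned string */
    SMUX_PROTO_UNASSIGNED  /* anything else (assumption of this sketch) */
};

enum smux_proto_kind smux_classify_protocol_id(uint32_t id)
{
    if (id <= 0xFFFFu)
        return SMUX_PROTO_TCP_PORT;
    if (id <= 0x1FFFFu)
        return SMUX_PROTO_UDP_PORT;
    if (id <= 0x200FFu)
        return SMUX_PROTO_ATOM;
    return SMUX_PROTO_UNASSIGNED;
}

/* For the two port kinds, the port number is the low 16 bits. */
uint16_t smux_protocol_port(uint32_t id)
{
    return (uint16_t)(id & 0xFFFFu);
}
```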
Each session is allocated a session identifier. Session Identifiers 0 and 1 are reserved for future use. Session IDs allocated by the initiator of the transport TCP connection are even; those allocated by the receiver of the transport connection are odd. Proxies that do not understand messages of reserved Session ID's should forward them unchanged. A session identifier MUST only be deallocated and potentially reused by new sessions when a session is fully closed in both directions.
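The even/odd split means each end can allocate identifiers without coordinating with its peer. A minimal allocator sketch follows, assuming identifiers 0 and 1 are the reserved ones; the pool struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical allocation state for one end of the connection. */
struct smux_id_pool {
    bool in_use[256];  /* session_id is an 8 bit field */
    bool is_initiator; /* true for the end that opened the TCP connection */
};

/* Returns the lowest free identifier of the right parity (even for the
 * initiator, odd for the receiver), skipping reserved IDs 0 and 1.
 * Returns -1 when this end's identifiers are exhausted. */
int smux_alloc_session_id(struct smux_id_pool *p)
{
    assert(p != NULL);
    for (int id = p->is_initiator ? 2 : 3; id < 256; id += 2) {
        if (!p->in_use[id]) {
            p->in_use[id] = true;
            return id;
        }
    }
    return -1;
}

/* Call only once the session is fully closed in both directions. */
void smux_free_session_id(struct smux_id_pool *p, int id)
{
    assert(p != NULL && id >= 2 && id < 256);
    p->in_use[id] = false;
}
```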
To establish a new session, the initiating end sends a SYN message, allocating a free session number out of its address space. A session is established by setting the SYN bit in the first message sent on that session. The session is specified by the session_id field. The fragment_size field is interpreted as the protocol ID of the session, as discussed above.
The receiver MUST either open the reverse path of that session (send a SYN message), or it MUST send a FIN message to indicate that the reverse path is not going to be used further, or send a RST message to indicate an error. This enables the initiator of a session to know when it is safe to reuse that session ID.
A session is ended by sending a fragment with the FIN bit set. Each end of a MUX connection may be closed independently.
MUX uses a half-close mechanism like TCP[1] to close data flowing in each direction in a session. After sending a FIN fragment, the sender MUST NOT send any more payload in that direction.
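The half-close rule can be tracked with two flags per session, as in the sketch below. The struct and function names are hypothetical, and RST handling is not modeled.

```c
#include <stdbool.h>

/* Hypothetical per-session close state. */
struct smux_session_state {
    bool local_fin;  /* this end has sent a FIN fragment */
    bool remote_fin; /* the peer has sent a FIN fragment */
};

/* After sending FIN, no more payload may be sent in that direction. */
bool smux_may_send_payload(const struct smux_session_state *s)
{
    return !s->local_fin;
}

/* The session ID may be reused only when both directions are closed. */
bool smux_id_reusable(const struct smux_session_state *s)
{
    return s->local_fin && s->remote_fin;
}
```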
A session may be terminated by sending a message with the RST bit set. All pending data for that session should be discarded. "No such protocol" errors detected by the receiver of a new session are signaled to the originator on session creation by sending a message with the RST bit set. (Same as in TCP).
The payload of the fragment containing the RST bit contains the null terminated string containing the URI of an error message (note that content negotiation makes this message potentially multi-lingual), followed by a null terminated UTF-8 string containing the reason for the reset (in case the URI is not accessible).
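A receiver could split the RST payload into its two strings as sketched here. The function name is hypothetical, and the sketch rejects any payload that does not contain two null terminators within the stated length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits an RST payload into the error URI and the reason string.
 * On success returns 0 and points *uri and *reason into `payload`.
 * Returns -1 if either null terminator is missing. */
int smux_parse_rst_payload(const char *payload, size_t len,
                           const char **uri, const char **reason)
{
    assert(payload != NULL && uri != NULL && reason != NULL);
    const char *nul1 = memchr(payload, '\0', len);
    if (nul1 == NULL)
        return -1;
    size_t rest = len - (size_t)(nul1 - payload) - 1;
    const char *second = nul1 + 1;
    if (rest == 0 || memchr(second, '\0', rest) == NULL)
        return -1;
    *uri = payload;
    *reason = second;
    return 0;
}
```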
A message boundary is marked by sending a message with the PUSH bit set. The boundary is set between the last octet in this message, including that octet, and the first byte of a subsequent message. This differs slightly from TCP, as PUSH can be reliably used as a record mark.
Flow control is determined by a simple credit scheme described above by using the AddCredits control message defined below. Fragments transmitted must never exceed the outstanding credit for that session. The initial outstanding credit for a session is 16Kbytes.
One of the major design goals of SMUX is to allow callbacks to objects in the process that initiated the transport TCP connection without requiring additional TCP connections (with the overhead in both machine resources and time that this would cause, or the problems with TCP connection establishment through firewalls).
The DefineEndpoint control message allows one to advertize that a particular (set of) URI's are reachable over the transport TCP connection.
The control bit of the SMUX header is always set in a control message. Control messages can be sent on any session, even sessions that are not (yet) open. The control_code reuses the SYN, FIN, RST, and PUSH bits of the SMUX header. The control_code of the control message determines the control message type. Any unused data in a control message must be ignored.
The revised version of SMUX means that a session creation costs 4 bytes (a control message with SYN set, and with the protocol ID in the message). Therefore the first fragment of payload has a total overhead of 8 bytes. (This is presuming using an IANA based protocol, rather than a named protocol). This is the same as the previous version, though it means two messages rather than one.
The individual control message types are listed below.
| code | Name | Dir | Description | 
|---|---|---|---|
| 0 | InternAtom | Both | The session_id is used as the Atom to be defined (offset by 0x20000, so a value of 0 defines atom 0x20000). The fragment_size field is the length of the UTF-8 encoded string. The fragment itself contains the string to be interned. This allows the interning of 256 strings. (Is this enough?) | 
| 1 | DefineEndpoint | Both | The session_id is ignored. The fragment_size is interpreted as the protocol ID, naming an endpoint actually available on this transport TCP connection. This enables a single transport TCP connection to be used for callbacks, or to advertise that a protocol endpoint can be reached to the process on the other end of the transport TCP connection. Whether this relative URI naming can be used depends upon the scheme of the URI [20], which defines its structure. For example, a firewall proxy might advertize just "http:" for the proxy, claiming it can be used to contact any HTTP protocol object anywhere, or "http://foo.com/bar/" to indicate that any object below that point in the URI space on the server foo.com may be reached by this TCP connection. A client might advertize that "http://myhost.com/" is available via this transport TCP connection. | 
| 2 | SetMSS | Both | This sets a limit on fragment sizes below the outstanding credit limit. The session_id must be zero. The fragment_size field is used as max_fragment_size (the largest fragment that may be sent on any session on this transport TCP connection). A max_fragment_size of zero means there is no limit on the fragment size allowed for this session. | 
| 3 | AddCredit | R->T | The session_id specifies the session. The fragment_size specifies the flow control credit granted (to be added to the current outstanding credit balance). A value of zero indicates no limit on how much data may be sent on this session. | 
| 4 | SetDefaultCredit | R->T | The session_id must be zero. The fragment_size field is used to set the initial default credit limit for any incoming MUX connections over this transport TCP connection (i.e. it is shorthand for sending a series of AddCredit messages for each session ID). | 
| 5 | NoOp | Both | This control message is defined to perform no function. Any data in the payload should be ignored. | 
| 6-15 | Undefined | | Reserved for future use. Must be ignored if not understood, and forwarded by any proxies. The fragment_size is always used for the length of the control message, and any data for the control message will be in the payload of the control message (to allow proxies to be able to forward future control messages). | 
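The control codes in the table can be packed into a header word as sketched below. This assumes the 4 bit control_code sits in the bit positions otherwise used by SYN, FIN, RST and PUSH (bits 22 down to 19 of the word), with the most significant code bit in the SYN position; that bit ordering is an assumption of the sketch. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum smux_control_code {
    SMUX_INTERN_ATOM        = 0,
    SMUX_DEFINE_ENDPOINT    = 1,
    SMUX_SET_MSS            = 2,
    SMUX_ADD_CREDIT         = 3,
    SMUX_SET_DEFAULT_CREDIT = 4,
    SMUX_NOOP               = 5
    /* 6-15 are reserved for future use */
};

#define SMUX_CONTROL_BIT  0x00800000u
#define SMUX_LENGTH_MASK  0x0003FFFFu
#define SMUX_CODE_SHIFT   19 /* assumed position of control_code */

/* Builds a control header word in the short length form. `value` is
 * whatever the message carries in fragment_size (credit, MSS, ...). */
uint32_t smux_pack_control(uint8_t session_id, enum smux_control_code code,
                           uint32_t value)
{
    assert((unsigned)code <= 15u);
    assert(value <= SMUX_LENGTH_MASK);
    return ((uint32_t)session_id << 24) | SMUX_CONTROL_BIT |
           ((uint32_t)code << SMUX_CODE_SHIFT) | value;
}

/* Extracts the control_code from a control header word. */
unsigned smux_control_code_of(uint32_t hdr)
{
    return (hdr >> SMUX_CODE_SHIFT) & 0xFu;
}
```

Granting 16 KB of additional credit on session 2 would then be `smux_pack_control(2, SMUX_ADD_CREDIT, 16384)`.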
Note that TIP (Transaction Internet Protocol) [21] defines a version of SCP called TMP .
Goals:
There are five issues that make SCP (TMP) inadequate for our use:
So far, SMUX is similar to SCP. There are some important differences:
Other comment on SCP:
SCP has 2^24 sessions, which seems highly excessive, and reserves 1024 of them for future use.
Some of the comments below allude to previous versions of the specification, and may not make sense in the context of the current version.
Henrik and I have convinced ourselves there are fundamental differences between a priority scheme and the credit scheme in this draft. They interact quite differently with TCP, and priority schemes have no way to limit the total amount of data being transmitted, though priority schemes are better matched to what the Web wants. We've decided, at least for now, to defer any priority schemes to higher level protocols.
ILU [22] style protocol stacks are a GOOD THING. There have been too many worries about the birthday problem for people to be comfortable with Bill Janssen's hashing schemes (see Henrik Frystyk Nielsen and Robert Thau's mail on this topic). We tried putting this directly in MUX in a previous version, and experience shows that it didn't really help an implementer (in particular, Bill Janssen while implementing ILU). This version has just the name of the protocol, and it is left to others to implement any stacking (e.g. ILU).
We believe the name of the protocol is necessary, if SMUX is ever to be used with firewalls. Application level firewall relays need the protocol information to sanity check the protocol being relayed. Application level relays are considered much more secure than just punching holes in the firewall for particular protocol families, which small organizations often find sufficient, as the relay can sanity check the protocol stream and enable better policy decisions (for example, to forbid certain datatypes in HTTP to transit a firewall). Large organizations and large targets typically only run application level proxies.
Wasting bytes in general, and in particular at TCP connection establishment, for a multiplexing transport must be avoided. There are several reasons for this:
So there is more than the usual tension between generality vs. performance.
Performance analysis
Human perception is about 30 milliseconds; if much more than this, the user perceives delay. At 14.4 K baud, one byte uncompressed costs .55 milliseconds (ignoring modem latencies). On an airplane via telephone today, you get a munificent 4800 baud, which is 3X slower. Cellular modems transmitting data (CDPD), as I understand it, will give us around 20Kbaud, when deployed.
So basic multiplexing @ 4 byte overhead costs ~ 2 milliseconds on common modems. This means basic overhead is small vs. human perception, for most low speed situations, a good position to be in.
On Mux connection open, with the above protocol we send 4 bytes in the setup message, and then must open a session, requiring at least 8 bytes more. 12 bytes == 7 milliseconds at 14.4K. Not 64 bit aligned, and 4 bytes costs of order 2 milliseconds. Ugh... Maybe a setup message isn't a good idea; other uses (e.g. security) can be dealt with by a control message.
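The per-byte figures used in this analysis follow from simple arithmetic, reproduced here as a sketch. It assumes 8 bits per byte on the wire, treats the quoted baud rate as bits per second, and ignores modem latency and compression, as the text does. The function name is hypothetical.

```c
#include <assert.h>

/* Time in milliseconds to put `bytes` bytes on a link of the given
 * speed, assuming 8 bits per byte and no other overhead. */
double smux_wire_ms(unsigned bytes, double bits_per_second)
{
    assert(bits_per_second > 0.0);
    return (double)bytes * 8.0 * 1000.0 / bits_per_second;
}
```

At 14400 bits per second this gives about 0.56 ms for one byte, about 2.2 ms for a 4 byte header, and about 6.7 ms for 12 bytes, in line with the rounded figures quoted above.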
We want to SMUX multiple protocols simultaneously over the same transport TCP connection, so we need to know what protocol is in use with each session, so the demultiplexer can hand the data to the right person (e.g. SUNRPC and DCERPC simultaneously).
There are two obvious ways I can see to do this:
Opinions? Mine is that b) is better than a). Answer: b) is the adopted strategy.
For a given stream, priority will affect which session is handled when multiplexing data; sending the priority on every block is unneeded, and would waste bytes. There is one case in which priority might be useful: at an intermediate proxy relaying sessions (and maybe remultiplexing them).
If so, it should be sent only when sessions are established or changed. Changes can be handled by a control message. Opinions?
A priority field can be hacked into the length field with the protocol field using b) above.
So the question is: is it important to send priority at all in this SMUX protocol? Or should priority control, if needed, be a control message?
Answer: Not in this protocol. Opens Pandora's box with remultiplexors, which could have denial of service attacks.
Is any setup message needed? I don't think it is, and initial bytes are precious (see performance discussion above), and it complicates trivial use. If we move the byte order flag to the SMUX header, and use control messages if other information needs to be sent, we can dispense with it, and the layer is simpler. This is my current position, and unless someone objects with reasons, I'll nuke it in the next version of this document.
Answer: Not needed. Nuked.
While higher layer protocols using host dependent byte order can be a performance win (when sending larger objects such as arrays of data), the overhead at this layer isn't much, and may not be worth bothering with. Worst case (naive code) would be four memory reads and 3 shift overhead/payload. Smart code is one load and appropriate shifts etc.
Opinions? I'm still leaning toward swapping bytes here, but there are other examples of byte load and shift (particularly slow on Alpha, but not much of an issue on other systems).
Answer: Not sufficient performance gain at SMUX level to be worth doing. Defined as LE byte order for SMUX headers.
There are several error conditions, probably best reported via control messages from server:
Any others? Any twists to worry about?
Answer: Only error that can occur is no such protocol, given no priority in the base protocol. May still be some unresolved issues here around "Christmas Tree" message (all bits turned on).
Any reason to believe that the 32 bit length field for a single payload is inadequate? I don't think so, and I live on an Alpha.
Answer: 32 bit extended length field for a single fragment is sufficient.
Does there need to be a bit saying the payload is compressed to avoid explosion of protocol types?
Answer: Yes; introduction of control message to allow specification of transport stacks achieves this.
I think that we should be able to multiplex any TCP, UDP, or IP protocol. Internet protocol numbers are 8 bit fields.
So we need 16 bits for TCP, one bit to distinguish TCP and UDP, and one bit more we can use for IP protocol numbers and address space we can allocate privately. This argues for an 18 bit length field to allow for this reuse.
* 18 bit length field
* 8 bit session field
* 4 control bits
* 1 long length bit
The last bit is used to define control messages, which reuse the syn, fin, rst, and push bits as a control_code to define the control message. There are escapes, both by undefined control codes, and by the reservation of two sessions for further use if there needs to be further extensions. The spec above reflects this.
Back to alignment. If we demand 4 byte alignment, for all requests that do not end up naturally aligned, we waste bytes. Two bytes are wasted on average. At 14.4Kbaud the overhead for protocols that do not pad up would on average be 6 bytes or ~3ms, rather than 4 bytes or ~ 2 ms (presuming even distributions of length). Note that this DOES NOT affect initial request latency (time to get first URL), and is therefore less critical than elsewhere.
I have one related worry; it can sometimes be painful to get padding bytes at the end of a buffer; I've heard of people losing by having data right up to the end of a page, so implementations are living slightly dangerously if they presume they can send the padding bytes by sending the 1, 2 or 3 bytes after the buffer (rather than an independent write to the OS for padding bytes).
Alternatively, the buffer alignment requirement can be satisfied by implementations remembering how many pad bytes have to be sent, and adjusting the beginning address of the subsequent write by that many bytes before the buffer where the SMUX header has been put. Am I being unnecessarily paranoid?
Opinion: I believe alignment of fragments in general is a GOOD THING, and will simplify both the SMUX transport and protocols at higher levels if they can make this presumption in their implementations. So I believe this overhead is worth the cost; if you want to do better and save these bytes, then start building an application specific compression scheme. If not, please make your case.
Are the four bits defined in Simon's flags field what we need? Are there any others?
Answer: no. More bits than we need. Current protocol doesn't use as many. I've ended back at the original bits specified, rather than the smaller set suggested by Bill Janssen. This enables full emulation of all the details of a socket interface, which would not otherwise be possible. See details around TCP and socket handling, discussed in books like "TCP/IP Illustrated," by W. Richard Stevens.
Am I all wet?
Opinion: I believe that we should do this.
Question: do we want/need a short control message? Right now, the out for extensibility are control messages sent in the reserved (and as yet unspecified) control session. This requires a minimum of 8 bytes on the wire. We could steal the last available bit, and allow for a 4 byte short control message, that would have 18 bits of payload.
Opinion: Flow control needs it; protocol/transport stacks need it. Document above now defines some control messages.
The above specification allows for someone who just wants to SMUX a single protocol to entirely ignore protocol ID's.
To be supplied