W3C WD: Simple MUX Protocol Specification

WD-MUX-970825

Simple Multiplexing Protocol

W3C Working Draft 25-August-97

Authors: Jim Gettys, W3C
Henrik Frystyk Nielsen, W3C
This version: /TR/WD-MUX-970825
$Id: WD-MUX-961023.html,v 1.5 1996/12/09 03:35:09 jigsaw Exp $
Latest version: /TR/WD-MUX
Previous versions: /TR/WD-MUX-961023; /TR/WD-MUX-970315

Status of this document

This is a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced or made obsolete by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress." A list of current W3C working drafts is also available.

This document describes an experimental design for a multiplexing transport, intended for, but not restricted to use with the Web. Use of this protocol is EXPERIMENTAL and the protocol is guaranteed to change. In particular, transition strategies to use of MUX have not been worked out. You have been warned!

Note: Since working drafts are subject to frequent change, you are advised to reference the above URL, rather than the URLs for working drafts themselves. This work is part of the W3C HTTP-NG Activity (for current status, see http://www.w3.org/Protocols/HTTP-NG/Activity)

Abstract

The Internet is suffering from the effects of the HTTP/1.0[7] protocol, which was designed without thorough understanding of the underlying TCP[1] transport protocol. HTTP/1.0 opens a TCP connection for each URI[4] retrieved (at a cost of both packets and round trip times (RTTs)), and then closes the connection. For small HTTP requests, these connections have poor performance due to TCP slow start [11][12] as well as the round trips required to open and close each TCP connection.

HTTP/1.1 persistent connections and pipelining will reduce network traffic and the amount of TCP overhead caused by opening and closing TCP connections [13]. However, the serialized behavior of HTTP/1.1 pipelining does not adequately support simultaneous rendering of inlined objects - part of most Web pages today; nor does it provide suitable fairness between protocol flows, or allow for graceful abortion of HTTP transactions without closing the TCP connection.

Current TCP implementations do not share congestion information across multiple simultaneous connections between two peers, which increases the overhead of opening new TCP connections. We expect that Transactional TCP[5] and sharing of congestion information in TCP control blocks [10] will improve TCP performance by using fewer RTTs, making it more suitable for HTTP transactions.

It is likely that the Web has caused the average packet train length on the Internet to decrease significantly over the last 2-3 years. Results from [13] and [21] indicate that sending fewer big packets is more cost effective than sending more small packets, due to less overhead in routers and hosts. By multiplexing multiple lightweight HTTP transactions onto the same underlying transport connection and deploying smart output buffer management, small packets can to a large extent be avoided.

Note: We need some references here to point to the impact of packet train length on router performance. I have often heard from router people that big packets are better but can't find a good paper describing it.

This document defines an experimental multiplexing protocol referred to as "MUX". MUX is intended as an intermediate session management protocol separating an underlying transport such as TCP from upper level application protocols such as HTTP and HTTP-NG. It is designed to provide a lightweight communication channel to the application layer by multiplexing data streams on top of a reliable stream oriented transport.

Contents

1. Introduction

1.1 Purpose

1.2 Design Goals

2. MUX Header

2.1 Session Identifiers

2.2 Session Establishment

2.3 Graceful Release

2.4 Disgraceful Release

2.5 Message Boundaries

2.6 Control Messages

3. Extending MUX

4. Session Management

5. Publishing a Protocol

6. Normative References

7. Bibliography - Informative References

8. Acknowledgements

9. Appendix

9.1 Summary of MUX State Transitions

Introduction

Purpose

The current widespread practice of using multiple simultaneous TCP connections is compounding HTTP/1.0's misdesign:

A client gains a significant perceived performance advantage from using multiple connections, because it retrieves the meta-data (e.g. size) of embedded objects in a page early. This allows a client to format a page sooner without suffering annoying reformatting of the page. Clients which open multiple connections in parallel to the same server, however, can cause self congestion on heavily congested links, since TCP opens and closes are not themselves congestion controlled.
To keep low bandwidth/high latency links busy, more than one connection has been necessary since slow start may cause the line to be partially idle.
Not only do the additional TCP opens cause performance problems in the network, but a client that opens multiple connections simultaneously to the same server may also receive an "unfair" bandwidth advantage in the network relative to clients that use a single connection. This problem is not solvable at the application level; only the network itself can enforce such "fairness".

The "Keep-Alive" extension to HTTP/1.0 is a form of persistent connections. HTTP/1.1's design differs in minor details from Keep-Alive to overcome a problem discovered when Keep-Alive is used with more than one proxy between a client and a server.

HTTP/1.1 persistent connections will go a long way to reduce the network traffic and some of the congestion problems caused by the HTTP/1.0 protocol; however, they will not succeed by themselves, as they address neither the rendering nor the fairness problems described above [13]. Neither do they allow for aborting an HTTP transaction without closing the TCP connection.

The solution to these problems requires two actions; neither action alone will entirely discourage clients from opening multiple connections to the same server.

Internet service providers should enable the Random Early Detection (RED)[14] algorithm in their routers to ensure bandwidth fairness to clients when the network is congested. RED also addresses queue length problems observed in routers today.
A multiplexing protocol for use with HTTP (and eventually other protocols) should be developed and deployed, so that multiple objects from a web server can be fetched approximately simultaneously over a single TCP connection, and so that the meta-data of each object can be sent to clients without waiting for the rest of the first object requested.

MUX is a session management protocol separating the underlying transport from the upper level application protocols. It provides a lightweight communication channel to the application layer by multiplexing data streams on top of a reliable stream oriented transport. By supporting coexistence of multiple application level protocols (e.g. HTTP and HTTP-NG), MUX will ease transitions to future Web protocols, and communications of client applets using private protocols with servers over the same connection as the HTTP conversation.

Design Goals

Ideas for this design come from Simon Spero's Session Control Protocol (SCP)[17][18] description and from experience with the X Window System's protocol design [22]. The goals include:

Unconfirmed service without negotiation;
SCP allows data to be sent with the session establishment; the recipient does not confirm successful connection establishment, but may reject unsuccessful attempts. This simplifies the design of the protocol, and removes the latency required for a confirmed operation;
Simple design and extensibility;
Performance where critical

There are four issues that make Simon Spero's SCP inadequate for our use:

SCP has no built in provision for multiplexing multiple protocols over the same transport connection, essential for graceful transition without dependency on the currently incomplete NG design, and to allow other uses which could use the same multiplexed connection (e.g. applet communication with servlets);
SCP's 8 byte overhead is not reasonable most of the time. MUX uses four bytes in the default case. The design below permits an 8 byte header if you care to preserve 64 bit alignment at the cost of bytes. In practice, there seem to be few data formats or architectures that actually require more than 32 bit alignment;
Without some form of buffer management mechanism, infinite buffering in clients would be required;
Alignment is not preserved in the data stream. Preserving alignment allows compact, high speed (un)marshalling code in implementations of binary protocols, without extra data copies, which in such protocols can be significant overhead.

Based on the design goals, MUX has the following characteristics that separate it from SCP:

Native support for multiple protocols to be multiplexed over the same connection;
Lower overhead than SCP, while preserving data alignment;
Support for 2⁸ sessions, of which 2 are reserved, instead of the 2²⁴ sessions available in SCP, which seems highly excessive, as do the 1024 sessions SCP reserves for future use.

The X Window system protocol did not (on day zero) make provisions for objects larger than 2¹⁸ bytes; this became a problem for poly graphics operations and for 3D extensions, particularly in shared memory situations. Large images, and either very fast transport (e.g. FDDI/ATM) or transport via shared memory to a local server encourage essentially unlimited lengths. This multiplexor can be used under all circumstances, even though for low bandwidth, typical sizes will be quite small. If a client wants to always preserve 64 bit alignment, the client can choose to always use the extended form described hereafter.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119[9].

MUX Header

A MUX fragment contains a MUX header and 0 or more bytes of data. MUX headers MUST be encoded in big endian byte order following the tradition from TCP/IP[1] and are defined as follows:

#define MUX_LONG_LENGTH         0x80000000
#define MUX_CTRL                0x40000000
#define MUX_SYN                 0x20000000
#define MUX_FIN                 0x10000000
#define MUX_RST                 0x08000000
#define MUX_PUSH                0x04000000
#define MUX_SESSION             0x03FC0000
#define MUX_LENGTH              0x0003FFFF

typedef unsigned int flagbit;
struct mux_hdr {
  union {
    struct {
      flagbit long_length : 1;
      flagbit ctrl : 1;
      flagbit syn : 1;
      flagbit fin : 1;
      flagbit rst : 1;
      flagbit push : 1;
      unsigned int session_id : 8;
      unsigned int data_size : 18;      /* Not if long_length set */
      int long_data_size : 32;          /* Only if long_length set */
    } data_hdr;
    struct {
      flagbit long_length : 1;
      flagbit ctrl : 1;
      unsigned int control_code : 4;
      unsigned int session_id : 8;
      unsigned int data_size : 18;      /* Not if long_length set */
      int long_data_size : 32;          /* Only if long_length set */
    } control_hdr;
  };                                    /* end of union */
};                                      /* end of struct mux_hdr */

The 18 bit data_size field contains the number of bytes of data (payload) in the fragment. A MUX header with the long_length bit set uses the 32 bit long_data_size field for the number of data bytes in the fragment instead of data_size. Agents can also use the long_length bit to force 64 bit alignment of outgoing protocol streams.

MUX headers are always (at least) 32 bit aligned. To find the next MUX header, take the data_size, and round up to the next 32 bit boundary. If the long_length bit is set, take the long_data_size and round up to the next 64 bit boundary. Unused bytes within a fragment are not guaranteed to be 0.
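The rounding rule above can be sketched in C as follows. This is only an illustration: the names mux_round_up and mux_fragment_span are hypothetical, and the sketch assumes that the header occupies 4 bytes in the short form and 8 bytes when long_length is set.

```c
#include <stdint.h>

/* Round n up to the next multiple of align (align must be a power of two). */
static uint64_t mux_round_up(uint64_t n, uint64_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Distance in bytes from the start of one MUX header to the start of the
 * next: the header itself plus the payload padded to a 32 bit boundary
 * (short form) or a 64 bit boundary (long_length set). */
static uint64_t mux_fragment_span(uint64_t payload_len, int long_length)
{
    /* Assumption: header size and alignment unit coincide (4 or 8 bytes). */
    uint64_t unit = long_length ? 8 : 4;
    return unit + mux_round_up(payload_len, unit);
}
```

For example, a 5 byte payload occupies 12 bytes on the wire in the short form (4 header, 5 data, 3 padding) and 16 bytes in the long form.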

The four flag bits, SYN, FIN, RST, and PUSH, are used to manage a MUX session (see sections 2.2, 2.3, 2.4, and 2.5). As for TCP, it is possible to set more than one of the four bits in the same MUX header [2].

The CTRL bit determines which header format to use: if 0 use data_hdr and if 1 then use control_hdr (see section 2.6).
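Because the layout of C bit-fields is implementation-defined, an implementation may prefer to build the header from the masks given above. The following is a minimal sketch of that approach for the short (32 bit) data header. The function names and MUX_SESSION_SHIFT are assumptions made for this illustration and are not defined by this specification.

```c
#include <stdint.h>

#define MUX_LONG_LENGTH   0x80000000u
#define MUX_CTRL          0x40000000u
#define MUX_SYN           0x20000000u
#define MUX_FIN           0x10000000u
#define MUX_RST           0x08000000u
#define MUX_PUSH          0x04000000u
#define MUX_SESSION       0x03FC0000u
#define MUX_LENGTH        0x0003FFFFu
#define MUX_SESSION_SHIFT 18  /* hypothetical name: session_id sits above the 18 length bits */

/* Build a short data header in host byte order. Only the four session
 * flags are accepted; data_size is silently truncated to 18 bits. */
static uint32_t mux_pack_short(uint32_t flags, uint8_t session_id, uint32_t data_size)
{
    return (flags & (MUX_SYN | MUX_FIN | MUX_RST | MUX_PUSH))
         | ((uint32_t)session_id << MUX_SESSION_SHIFT)
         | (data_size & MUX_LENGTH);
}

static uint8_t mux_session_of(uint32_t hdr)
{
    return (uint8_t)((hdr & MUX_SESSION) >> MUX_SESSION_SHIFT);
}

static uint32_t mux_length_of(uint32_t hdr)
{
    return hdr & MUX_LENGTH;
}

/* Serialize to, and parse from, big endian wire order. */
static void mux_put_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

static uint32_t mux_get_be32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Shifting and masking explicitly keeps the wire format independent of the compiler and of the host byte order.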

Session Identifiers

Each session is allocated an 8-bit session identifier, which uniquely identifies a session on a particular transport connection. Session identifiers 0 and 1 are reserved by MUX (see section 2.6). Session identifiers allocated by the initiator of the underlying transport connection are even (0, 2, 4…); those allocated by receivers are odd (1, 3, 5…). Session identifiers from terminated sessions can be reused on the same transport connection as soon as they are freed up (see sections 2.3 and 2.4).
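One possible way to hand out identifiers under these rules is sketched below. The structure and function names are hypothetical, and the lowest-free-first policy is an assumption; this specification does not prescribe an allocation order.

```c
#include <stdbool.h>
#include <string.h>

struct mux_ids {
    bool in_use[256];   /* one slot per 8-bit session identifier */
    bool initiator;     /* true if this side opened the transport connection */
};

static void mux_ids_init(struct mux_ids *t, bool initiator)
{
    memset(t->in_use, 0, sizeof t->in_use);
    t->initiator = initiator;
}

/* Return the lowest free identifier of the right parity, skipping the
 * reserved identifiers 0 and 1, or -1 if all are in use. */
static int mux_alloc_id(struct mux_ids *t)
{
    for (int id = t->initiator ? 2 : 3; id <= 255; id += 2) {
        if (!t->in_use[id]) {
            t->in_use[id] = true;
            return id;
        }
    }
    return -1;
}

/* Release an identifier once its session is closed in both directions. */
static void mux_free_id(struct mux_ids *t, int id)
{
    if (id >= 2 && id <= 255)
        t->in_use[id] = false;
}
```

With this scheme each side has 127 identifiers available for its own sessions.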

Session Establishment

A session is established by setting the SYN bit in the first message sent on that session. As MUX requires a reliable transport, there is no need to acknowledge a SYN fragment as in TCP; a single SYN fragment is enough to open a new session for data transport in both directions.

The protocol to be used in a MUX session is identified by a URI. Such protocol identifiers have the following properties:

They link the semantics of the protocol to the URI identifying the protocol, potentially allowing a recipient to interpret the protocol stream correctly with no prior agreement.
Any agent can generate unique protocol identifiers independently of other agents.

The SYN fragment carries the protocol identifier, which is the URI of the protocol or protocol stack to be used on the session (see section 5). The URI is located in the payload and the length of the URI is described as any other payload using the data_size or the long_data_size field. No other data can be sent in a SYN fragment.
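As a rough sketch, a SYN fragment in the short header form could be assembled as shown below. The function name mux_build_syn is hypothetical, and zero-filling the padding is a choice made here for tidiness; receivers cannot rely on it.

```c
#include <stdint.h>
#include <string.h>

#define MUX_SYN    0x20000000u
#define MUX_LENGTH 0x0003FFFFu

/* Write a short-form SYN fragment carrying the protocol URI into buf.
 * Returns the number of bytes written (header + URI + padding to a
 * 32 bit boundary), or 0 if the URI is too long or buf is too small. */
static size_t mux_build_syn(uint8_t *buf, size_t cap, uint8_t session_id, const char *uri)
{
    size_t len = strlen(uri);
    size_t padded = (len + 3) & ~(size_t)3;

    if (len > MUX_LENGTH || cap < 4 + padded)
        return 0;

    uint32_t hdr = MUX_SYN | ((uint32_t)session_id << 18) | (uint32_t)len;
    buf[0] = (uint8_t)(hdr >> 24);      /* big endian header */
    buf[1] = (uint8_t)(hdr >> 16);
    buf[2] = (uint8_t)(hdr >> 8);
    buf[3] = (uint8_t)hdr;

    memcpy(buf + 4, uri, len);              /* URI is the only payload */
    memset(buf + 4 + len, 0, padded - len); /* optional: clear padding */
    return 4 + padded;
}
```

A URI longer than 2¹⁸ - 1 bytes would need the long_length form, which this sketch does not cover.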

The advantage of including the extension identifier is that, at the cost of some extra bytes to spell out the URI, the use of a central registry of extension names is avoided. MUX can also be used to extend applications to support centrally registered protocols, assuming a URI is published as part of the registration.

Question: I think we should allow relative URIs and require that they be expanded relative to the last transmitted absolute URI in that direction. This would shrink the size of the URI down to the empty string whenever the URI is the same across multiple sessions but requires state in the recipient. We could also define a default location for expanding relative URIs, for example "http://www.iana.org".

Question: Robert Thau mentioned that "stacking URIs" may be useful. I think that it adds too much overhead; as I expect the number of stack combinations to be limited, they can be represented as a flat space.

Note: As many TCP stacks only allow a single TCP segment to be sent as the first data on a connection, it is important that applications do clever output buffering of MUX headers to avoid additional RTTs due to TCP slow start (see [13] for more information on application level pipelining).

Graceful Release

MUX uses a half-close mechanism like TCP[1] to close data flowing in each direction in a session. The generation of a FIN means that there will be no more data flowing in that direction. Only when a session is fully closed in both directions can the session identifier be released and potentially reused by new sessions.
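The half-close rule can be tracked with two booleans per session, as in the sketch below. The names are hypothetical, and this is not the state table of the appendix, only an illustration of when sending must stop and when an identifier may be reused.

```c
#include <stdbool.h>

struct mux_halfclose {
    bool fin_sent;      /* we have sent FIN: no more data from us */
    bool fin_received;  /* peer has sent FIN: no more data from the peer */
};

/* Data may be sent on the session until our own FIN has gone out. */
static bool mux_may_send(const struct mux_halfclose *s)
{
    return !s->fin_sent;
}

/* The session identifier may be released only when both directions are closed. */
static bool mux_id_reusable(const struct mux_halfclose *s)
{
    return s->fin_sent && s->fin_received;
}
```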

Disgraceful Release

A session may be terminated at any time by sending a MUX fragment with the RST bit set. All pending data for that session SHOULD be discarded immediately.

The RST bit can also be used to indicate that a recipient does not understand or does not wish to use a protocol by sending it in response to a SYN fragment. This is very similar to TCP's behavior when reacting to a connection request on an inactive port.

Message Boundaries

A message boundary is marked by sending a fragment with the PUSH bit set. The boundary is set between the last byte in this fragment, including that byte but excluding any alignment padding, and the first byte of a subsequent fragment.

Question: The PUSH flag is mainly for historic reasons in TCP - it is hardly ever used and can be handled internally if need be. In any case, there is no direct way of controlling it in any of the APIs that I know of. If we can find a better purpose of the bit then I am all for it.

Control Messages

The CTRL bit of the MUX header is always set in a control message. The control_code of the control message determines the control message type. Any unused data in the control message SHOULD be ignored. The long_length bit may be set in control headers if needed or if 64 bit alignment is desired. If the long_length bit is set, some control messages may reuse the data_size field for different purposes than the length. The individual control message types are listed below:

control_code	Name	Description
0x0	Switching	Allows MUX to extend itself dynamically (see section 3)
0x1	MDS	Specifies the Maximum Data Size for all MUX fragments (see section 4). The session_id is ignored. A data_size of zero means no limit on the fragment size.
0x8	Priority	Specifies the priority of the session (see section 4). The priority value is carried in the remaining three bits of the control_code, which allows for 8 priority levels. Level 4 is the default.
0x2-0x7	-	undefined - reserved for future use.
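A receiver might classify control headers along the lines of the sketch below. The enum, the function name and the MUX_CONTROL_CODE mask are hypothetical. The sketch assumes that control_code occupies the four bits directly below the CTRL bit, as in control_hdr, and that the priority level is held in its low three bits.

```c
#include <stdint.h>

#define MUX_CTRL               0x40000000u
#define MUX_CONTROL_CODE       0x3C000000u  /* hypothetical mask for control_code */
#define MUX_CONTROL_CODE_SHIFT 26

enum mux_ctrl_kind {
    MUX_CK_NOT_CONTROL,
    MUX_CK_SWITCHING,
    MUX_CK_MDS,
    MUX_CK_PRIORITY,
    MUX_CK_RESERVED
};

/* Classify a header given in host byte order. For Priority messages the
 * level (0-7) is stored through priority when it is not NULL. */
static enum mux_ctrl_kind mux_classify(uint32_t hdr, unsigned *priority)
{
    if (!(hdr & MUX_CTRL))
        return MUX_CK_NOT_CONTROL;

    unsigned code = (hdr & MUX_CONTROL_CODE) >> MUX_CONTROL_CODE_SHIFT;
    if (code & 0x8) {                   /* 0x8-0xF: Priority */
        if (priority)
            *priority = code & 0x7;
        return MUX_CK_PRIORITY;
    }
    if (code == 0x0)
        return MUX_CK_SWITCHING;
    if (code == 0x1)
        return MUX_CK_MDS;
    return MUX_CK_RESERVED;             /* 0x2-0x7 */
}
```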

Extending MUX

Different application level protocols have different requirements of the underlying transport. The MUX specification does not necessarily meet the requirements of all protocols that may be used on top of it. It is therefore important to have a well-defined mechanism for extending MUX dynamically while not breaking existing applications.

MUX reserves session identifiers 0 and 1 for certain types of control messages and for extending MUX itself. An application can change the semantics of MUX by sending a SYN fragment with a protocol extension identifier on session 0 if it has initiated the underlying transport, or session 1 if it has received the underlying transport connection.

The URI included in the SYN fragment defines the extended semantics of MUX. If the recipient accepts the extension, it SHOULD respond with a Switching control message. The acknowledgement is required in order for both parties to know when to use the new semantics. If the recipient does not understand or does not wish to use that extension, it SHOULD respond with a RST fragment.

This specification does not specify any other fragments with session identifiers 0 or 1.

Session Management

Note: This is more a discussion of pros and cons for credit schemes vs. session priorities than a spec. I am still discussing with Jim and we're not sure what is best and haven't tested it enough to say definitely.

When managing multiple streams within the same underlying transport connection, MUX must solve the following problems in order to prevent data from being dropped and deadlocks from occurring between multiple interdependent sessions on the receiving end of a protocol flow:

Guarantee fairness between multiple simultaneous sessions;
Avoid buffer overflow in sessions where the layers above MUX can not absorb data at the rate it arrives from the layers below MUX.

The first is a requirement on the sender and the latter is on the receiver. Note that MUX is full duplex and an agent can be both a sender and a receiver at the same time. Another goal is that the MUX session management mechanisms do not interfere with flow control mechanisms used in the underlying transport layer.

Experience shows [13] that in order to achieve the highest possible throughput in TCP, it is important that the recipient reads data as fast as possible from the internal TCP kernel buffers, which often cannot be reached by network applications. Otherwise, the recipient's TCP stack will react by sending fewer ACKs to the sender resulting in slowing down the transmission.

In protocols like HTTP/1.0 using multiple simultaneous TCP connections, this requirement is often relaxed as each connection has its own kernel buffers hence increasing the overall memory allocation used by the application. However, when using only a single underlying transport connection for multiple interleaved sessions, this requirement becomes crucial for high performance and suggests an upper limit on the number of sessions that should be run on top of a single underlying transport connection.

The session management can be handled using either a credit scheme or a priority based mechanism. While the former controls the amount of data available to the receiver, the latter controls the timing of when data becomes available to the receiver. A problem with a credit scheme is that the buffer capacity in a receiver is largely independent of the time it takes to consume the data and can not be used to estimate how much data can be outstanding on the network at any given time. The advantage of a priority scheme is that it allows unlimited amounts of in-flight data over the network and does not interfere with TCP flow control at all. However, it does not completely rule out the potential risk of a deadlock between two simultaneous sessions.

I think, however, that the latter will serve us best and for now, I propose a "preemptive, fixed priority scheduling" mechanism that is a deterministic scheduler used in many thread packages and in the Java thread package[23]. Sessions with the same priority get executed in a round-robin fashion. The priority can be set using the Priority control_code. The possible priority levels are 0-7 with 4 being the default.
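The proposed scheduling could look roughly like the following sketch, which picks the session whose fragment should be sent next. All names are hypothetical. The sketch also assumes that a larger number means a higher priority, which the text above does not state.

```c
#include <stdbool.h>
#include <string.h>

#define MUX_SESSIONS         256
#define MUX_LEVELS           8
#define MUX_DEFAULT_PRIORITY 4

struct mux_sched {
    bool ready[MUX_SESSIONS];             /* session has data queued */
    unsigned char priority[MUX_SESSIONS]; /* 0-7, set via the Priority control_code */
    int last_served[MUX_LEVELS];          /* round-robin cursor per level */
};

static void mux_sched_init(struct mux_sched *s)
{
    memset(s->ready, 0, sizeof s->ready);
    memset(s->priority, MUX_DEFAULT_PRIORITY, sizeof s->priority);
    for (int i = 0; i < MUX_LEVELS; i++)
        s->last_served[i] = MUX_SESSIONS - 1;   /* first scan starts at id 0 */
}

/* Return the session to serve next, or -1 if none is ready. The highest
 * non-empty level always wins; within a level, sessions take turns. */
static int mux_sched_next(struct mux_sched *s)
{
    for (int level = MUX_LEVELS - 1; level >= 0; level--) {
        for (int step = 1; step <= MUX_SESSIONS; step++) {
            int id = (s->last_served[level] + step) % MUX_SESSIONS;
            if (s->ready[id] && s->priority[id] == level) {
                s->last_served[level] = id;
                return id;
            }
        }
    }
    return -1;
}
```

Calling mux_sched_next once per fragment gives preemption at fragment granularity: a higher priority session that becomes ready is served before the next fragment of any lower priority session.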

The Maximum Data Size (MDS) is the scheduling unit for how much data can be sent in a single MUX fragment and can be set dynamically using the MDS control_code. We don't want to specify a fixed default value as this likely will lead to the same problems as MSS in TCP[3] where technology evolution has rendered the default value of 576 obsolete. A good default may be the Path MTU (PMTU) minus 40 bytes for the TCP/IP headers and 4 or 8 bytes for the MUX header, depending on whether the long_length bit is set or not.
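The suggested default is simple arithmetic, sketched here with a hypothetical helper. The 40 byte figure is the one given above and assumes TCP/IP headers without options.

```c
/* Suggested default MDS: Path MTU minus 40 bytes of TCP/IP headers and
 * the 4 byte (short) or 8 byte (long_length) MUX header. Returns 0 when
 * the PMTU leaves no room for payload; note that this differs from a
 * data_size of 0 in an MDS control message, which means "no limit". */
static unsigned mux_default_mds(unsigned pmtu, int long_length)
{
    unsigned overhead = 40u + (long_length ? 8u : 4u);
    return pmtu > overhead ? pmtu - overhead : 0u;
}
```

For a PMTU of 1500 bytes this gives 1456 bytes with the short header and 1452 bytes with the long header.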

Note: As an optimization, it is not required that we insert a MUX header if there is nothing to multiplex. This would be the case if there is only one active session on a connection. Hence a sender can augment the data_size in a single header and omit the next header if desired.

Publishing a Protocol

While the protocol or protocol stack definition should be published at the address of the protocol identifier, this is not a requirement of this specification. The only absolute requirement is that distinct names be used for distinct semantics. For example, one way to achieve this is to use a mid, cid, or uuid URI. The association between the protocol identifier and the specification might be made by distributing a specification, which references the protocol identifier.

It is strongly recommended that the integrity and persistence of the protocol identifier is maintained and kept unquestioned throughout the lifetime of the protocol stack. Care should be taken not to distribute conflicting specifications that reference the same name. Even when a URI is used to publish protocol specifications, care must be taken that the specification made available at that address does not change significantly over time. One agent may associate the identifier with the old semantics, and another might associate it with the new semantics.

The protocol definition may be made available in different representations ranging from

a human-readable specification defining the protocol semantics,
downloadable code which implements the semantics defined by the protocol,
a formal interface description provided by the protocol, to
a machine-readable specification defining the protocol semantics.

For example, a software component that implements the specification may reside at the same address as a human-readable specification (distinguished by content negotiation). The human-readable representation serves to document the protocol and encourage deployment, while the software component allows clients and servers to be dynamically extended.

Normative References

[1] J. Postel, "Transmission Control Protocol", RFC 793, Network Information Center, SRI International, September 1981
[2] J. Postel, "TCP and IP bake off", RFC 1025, September 1987
[3] J. Mogul, S. Deering, "Path MTU Discovery", RFC 1191, DECWRL, Stanford University, November 1990
[4] T. Berners-Lee, "Universal Resource Identifiers in WWW. A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as used in the World-Wide Web", RFC 1630, CERN, June 1994
[5] R. Braden, "T/TCP -- TCP Extensions for Transactions: Functional Specification", RFC 1644, USC/ISI, July 1994
[6] R. Fielding, "Relative Uniform Resource Locators", RFC 1808, UC Irvine, June 1995
[7] T. Berners-Lee, R. Fielding, H. Frystyk, "Hypertext Transfer Protocol -- HTTP/1.0", RFC 1945, W3C/MIT, UC Irvine, W3C/MIT, May 1996
[8] R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2068, U.C. Irvine, DEC W3C/MIT, DEC, W3C/MIT, W3C/MIT, January 1997
[9] S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", RFC 2119, Harvard University, March 1997
[10] J. Touch, "TCP Control Block Interdependence", RFC 2140, April 1997
[11] W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms", RFC 2001, January 1997
[12] V. Jacobson, "Congestion Avoidance and Control", Proceedings of SIGCOMM '88
[13] H. Frystyk Nielsen, J. Gettys, A. Baird-Smith, E. Prud'hommeaux, H. W. Lie, and C. Lilley, "Network Performance Effects of HTTP/1.1, CSS1, and PNG", to appear in Proceedings of SIGCOMM '97
[14] S. Floyd and V. Jacobson, "Random Early Detection Gateways for Congestion Avoidance", IEEE/ACM Trans. on Networking, vol. 1, no. 4, Aug. 1993
[15] R.W.Scheifler, J. Gettys, "The X Window System", ACM Transactions on Graphics # 63, Special Issue on User Interface Software
[16] V. Paxson, "Growth Trends in Wide-Area TCP Connections", IEEE Network, Vol. 8 No. 4, pp. 8-17, July 1994
[17] S. Spero, "Session Control Protocol, Version 1.0"
[18] S. Spero, "Session Control Protocol, Version 1.1"
[19] Keywords and Port numbers are maintained by IANA in the port-numbers registry.
[20] Keywords and Protocol numbers are maintained by IANA in the protocol-numbers registry.
[21] W. Richard Stevens, "TCP/IP Illustrated, Volume 1", Addison-Wesley, 1994
[22] R.W.Scheifler, J. Gettys, "The X Window System", ACM Transactions on Graphics # 63, Special Issue on User Interface Software
[23] D.J.Berg, "Java Threads", White Paper, Sun Microsystems, March 1996

Bibliography - Informative References

B. Braden, D. Clark, J. Crowcroft, B. Davie, S. Deering, D. Estrin, S. Floyd, V. Jacobson, G. Minshall, C. Partridge, L. Peterson, K. Ramakrishnan, S. Shenker, J. Wroclawski, L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet". Internet draft draft-irtf-e2e-queue-mgt-00.txt, ps, March 25, 1997. This is work in progress
John Heidemann, "Performance Interactions Between P-HTTP and TCP Implementations", USC/Information Sciences Institute, November 1996
Simon Spero, "Analysis of HTTP Performance problems", July 1994
Venkata N. Padmanabhan, Jeffrey C. Mogul, "Improving HTTP Latency", University of California -- Berkeley, Digital Equipment Corporation Western Research Laboratory, October 1994
B. Janssen, M. Spreitzer, "Inter-Language Unification"; in particular see the manual section on Protocols and Transports.

Acknowledgements

Thanks to Robert Thau, Anselm Baird-Smith, Bill Janssen, Mike Spreitzer and others for comments in the design phase of this protocol. TBD…

Appendix

Summary of MUX State Transitions

MORE