WD-mux-971202
SMUX Protocol Specification
W3C Working Draft 02-December-1997
-
This version:
-
[11]/TR/WD-mux-971104
$Id: WD-mux-971202.html,v 1.2 1999/04/17 00:20:41 frystyk Exp $
-
Latest version:
-
[12]/TR/WD-mux
-
Authors:
-
Jim Gettys, Digital Equipment Corporation, Visiting Scientist, World
Wide Web Consortium
-
Henrik Frystyk Nielsen, World Wide
Web Consortium
Status of this document
This is a W3C Working Draft for review by W3C members and other interested
parties. It is a draft document and may be updated, replaced or made obsolete
by other documents at any time. It is inappropriate to use W3C Working
Drafts as reference material or to cite them as other than "work in progress."
A list of current W3C
working drafts is also available.
This document describes an experimental design for a multiplexing transport,
intended for, but not restricted to use with the Web. Use of this protocol
is EXPERIMENTAL and the protocol is guaranteed to change. In particular,
transition strategies to use of SMUX have not been worked out. You have
been warned!
Note: Since working drafts are subject to frequent change,
you are advised to reference the above URL, rather than the URLs for working
drafts themselves. This work is part of the W3C HTTP/NG Activity (for current
status, see http://www.w3.org/Protocols/HTTP-NG/Activity)
Abstract
This document defines the experimental multiplexing protocol referred to
as "SMUX". SMUX is a session management protocol separating the underlying
transport from the upper level application protocols. It provides a lightweight
communication channel to the application layer by multiplexing data streams
on top of a reliable stream oriented transport. By supporting coexistence
of multiple application level protocols (e.g. HTTP and HTTP/NG), SMUX should
ease transitions to future Web protocols, and communications of client
applets using private protocols with servers over the same connection as
the HTTP conversation.
Contents
Introduction
Changes from Previous Version
Introduction and abstract cleaned up, per Henrik's message.
Converted to big-endian byte order, to keep the network police happy.
However, this should be revisited in any production version, as a higher
and higher fraction of conversations are little-endian servers talking
to little endian clients (not that it matters much for SMUX; it is significant
for marshaling protocols).
Credit control messages MAY be sent on sessions that are not active.
Added Henrik's graceful release paragraph.
Bill Janssen wanted indication of which way control messages are to
flow, to aid in understanding.
Reduced fragment size to 18 bits; put long length field back in.
Cleaned up references, thanks to Henrik.
DefineEndpoint semantics clarified, and defined to use URI's to interesting
effects. Now is the time to complain if you don't like this idea
for some reason.
Reset fragments can contain error indications for why the session was
terminated. (per John Wroclawski's suggestion not to repeat a TCP
mistake).
I still haven't dealt with Henrik's wants for some upgrade mechanism;
I wonder why one doesn't use whatever mechanism we use to know to use SMUX in
the first place to use some future version of SMUX. Comments?
Key Words
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document
are to be interpreted as described in RFC 2119 [7].
Purpose
The Internet is suffering from the effects of the HTTP/1.0
protocol, which was designed without understanding of the underlying
TCP [1] transport protocol. HTTP/1.0 opens a TCP
connection for each URI [28] retrieved (at a cost
of both packets and round trip times (RTTs)), and then closes the connection.
For small HTTP requests, these connections have poor performance due to
TCP slow start [9] [10]
as well as the round trips required to open and close each TCP connection.
There are (at least) three reasons why multiple simultaneous TCP connections
have come into widespread use on the Internet despite the apparent inefficiencies:
-
A client using multiple TCP connections gains a significant advantage in
perceived performance by the end-user, as it allows for early retrieval
of metadata (e.g. size) of embedded objects in a page. This allows a client
to format a page sooner without suffering annoying reformatting of the
page. Clients which open multiple connections in parallel to the same server,
however, could cause self congestion on heavily congested links, since packets
generated by TCP opens and closes are not themselves congestion controlled.
-
The additional TCP opens cause performance problems in the network, but
a client that opens multiple connections simultaneously to the same server
may also receive an "unfair" bandwidth advantage in the network relative
to clients that use a single connection. This problem is not solvable at
the application level; only the network itself can enforce such "fairness".
-
To keep low bandwidth/high latency links busy (e.g. dialup lines), more
than one connection has been necessary since slow start may cause the line
to be partially idle.
The "Keep-Alive" extension to HTTP/1.0 is a form of persistent TCP connections
but does not work through HTTP/1.0 proxies and does not take pipelining
of requests into account. Instead a revised version of persistent connections
was introduced in HTTP/1.1 as the default mode of operation.
HTTP/1.1 [6] persistent connections and pipelining
[11] will reduce network traffic and the
amount of TCP overhead caused by opening and closing TCP connections. However,
the serialized behavior of HTTP/1.1 pipelining does not adequately support
simultaneous rendering of inlined objects - part of most Web pages today;
nor does it provide suitable fairness between protocol flows, or allow
for graceful abortion of HTTP transactions without closing the TCP connection
(quite common in HTTP operation).
Persistent connections and pipelining, however, fully address neither
the rendering nor the fairness problems described above. A "hack"
solution is possible using HTTP range requests; however, this approach
does not, for example, allow a server to send just the metadata contained
in an embedded object before sending the object itself, nor does it solve
the connection abort problem.
Current TCP implementations do not share congestion information across
multiple simultaneous connections between two peers, which increases the
overhead of opening new TCP connections. We expect that Transactional TCP
[5] and sharing of congestion information in TCP
control blocks [8] will improve TCP performance
by using fewer RTTs and better congestion behavior, making it more suitable
for HTTP transactions.
The solution to these problems requires two actions; either by itself
will not entirely discourage opening multiple connections to the same server
from a client.
-
Internet service providers should enable the Random Early Detection (RED)
[12] or other active congestion control algorithms in
their routers to ensure bandwidth fairness to clients when the network
is congested. RED also addresses queue length problems observed in routers
today.
-
Development and deployment of a multiplexing protocol for use with HTTP
(and eventually other protocols), so that multiple objects from a web server
can be fetched approximately simultaneously over a single TCP connection,
and so that the metadata of each object can be sent to the client without
waiting for the rest of the first object requested.
This document describes such an experimental multiplexing protocol. It
is designed to multiplex a connection underneath HTTP so that HTTP itself
does not have to change, and allow coexistence of multiple protocols (e.g.
HTTP and HTTP/NG), which will ease transitions to future Web protocols,
and communications of client applets using private protocols with servers
over the same connection as the HTTP conversation.
Ideas for this design come from Simon Spero's SCP [15] [16] description
and from experience with the X
Window System's protocol design [13].
Goals and Comparison with SCP (TMP)
Note that TIP (Transaction Internet Protocol) [21] defines
a version of SCP called TMP.
Goals:
-
Unconfirmed service without negotiation.
-
SCP allows data to be sent with the session establishment; the recipient
does not confirm successful connection establishment, but may reject unsuccessful
attempts. This simplifies the design of the protocol, and removes the latency
required for a confirmed operation.
-
simple design
-
performance where critical
There are several issues that make Simon Spero's SCP inadequate for our use:
-
SCP can deadlock unless an unlimited amount of memory is available.
-
it has no provision for multiplexing multiple protocols over the same transport
connection, essential for graceful transition without dependency on the
currently incomplete NG design, and to allow other uses which could use
the same multiplexed connection (e.g. applet communication with serverlets).
-
SCP's 8 byte overhead is not reasonable most of the time. SMUX uses four
bytes in the default case. The design below permits an 8 byte header if
you care to preserve 64 bit alignment at the cost of bytes. In practice,
there seem to be few data formats or architectures that actually require more
than 32 bit alignment.
-
Without some form of flow control, infinite buffering in clients (receivers)
would be required.
-
Alignment is preserved in the data stream. This allows compact, high speed
(un)marshalling code in implementations of binary protocols, without extra
data copies, which in such protocols can be significant overhead.
-
SCP SYN in Version 2 requires a second message, which costs a round trip.
So far, SMUX is similar to SCP. There are some important differences:
-
deadlock-free (we believe), by a credit based flow control scheme.
-
allow multiple protocols to be multiplexed over same connection (not available
in SCP).
-
lower overhead than SCP, while preserving data alignment (very important
for binary protocol marshaling code)
-
ability to build a full function socket interface above this protocol.
-
SMUX avoids the SYN round trip of SCP V2 by session ID's being allocated
in independent address spaces. This also avoids many of the state
transitions of SCP, simplifying the protocol greatly.
Other comment on SCP:
SCP has 2^24 sessions, which seems highly excessive, and reserves
1024 of them for future use.
Operation and Deadlock Avoidance
Deadlock Scenario
Multiplexing multiple sessions over a single transport connection introduces
a potential deadlock that SMUX is designed to avoid.
Here is an example of potential deadlock:
-
Presume that each session is being handled by an independent thread and
that memory available to the SMUX is limited (for example, on a thin
client on a meter reader).
-
For the purposes of this example, presume the thin client has 50K bytes
of buffer available to its SMUX implementation, and cannot get more.
-
The sender of data decides to send, as part of a session request (SYN message),
100K bytes of initial data. There are no other senders, so all of
the data gets transmitted. But the thread to deal with the message
is blocked, and cannot make progress.
-
Unless SMUX can buffer all 100K (or 1 meg, or pick your favorite numbers),
any other session's data would be blocked behind this initial transmission
until and unless SMUX can read and buffer the data someplace (and since
it has no buffer available, the deadlock occurs). Many similar (but possibly
harder to explain) deadlocks are possible.
This example points out that deadlock is possible: SMUX must be able to
buffer data independently of the consumers of the data. It must also
have some way to throttle sessions where the consumer of the data is not
responsive in the multiplexing layer (in this example, prevent the transmission
of more than 50 Kbytes of data). Note that this deadlock is independent
of the size of any multiplexing fragment, but strictly dependent on availability
of buffer space in SMUX for a particular session.
Deadlock Avoidance
In SMUX, the receiver makes a promise (sends a credit) to the transmitter
that a certain amount of buffer space is available (or at least that it
will consume the bytes, if not buffer them, e.g. a real time audio protocol
where the data is disposed of), and the transmitter promises not to send
more data than the receiver has promised (no more than the credit).
If these promises are met, then SMUX will not deadlock.
A SMUX implementation MUST maintain and adhere to the credit system
or it can deadlock. Implementations on systems with large amounts
of memory (e.g. VM systems) may be quite different than ones on thin clients
with limited, non-virtual memory. It is reasonable on a VM system
to hand out credits freely (analogous to the virtual socket buffering found
in TCP implementations); but your implementation must be careful to test
its credit mechanisms so that they will interoperate with limited memory
systems. Credit control messages MAY be sent on sessions that are
not active.
Sessions have an initial credit size (initial_default_credit)
of 16 KB on each session; there is a SMUX control message to set this initial
credit to something larger than the default.
Operation and Implementation Considerations
A transmitter MUST NOT transmit more data in a fragment than the available
credit on the session (or it could deadlock).
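The transmitter side of this rule can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the type and function names (smux_tx_session, smux_tx_sendable and so on) are hypothetical and do not appear in this draft, and the "zero means no limit" case of the credit message is left out. Only the 16 KB initial credit is taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Initial per-session credit stated in the draft (16 KB). */
#define SMUX_INITIAL_DEFAULT_CREDIT (16u * 1024u)

/* Hypothetical transmitter-side state for one session. */
struct smux_tx_session {
    uint32_t credit;   /* bytes the receiver has promised to accept */
};

static void smux_tx_init(struct smux_tx_session *s)
{
    s->credit = SMUX_INITIAL_DEFAULT_CREDIT;
}

/* How many of `want` bytes may be sent now without exceeding the credit. */
static size_t smux_tx_sendable(const struct smux_tx_session *s, size_t want)
{
    return want < s->credit ? want : s->credit;
}

/* Record that `n` payload bytes were sent; sending more than the
 * outstanding credit would break the promise to the receiver. */
static void smux_tx_consume(struct smux_tx_session *s, size_t n)
{
    assert(n <= s->credit);
    s->credit -= (uint32_t)n;
}

/* Apply a credit grant received from the peer, saturating on overflow. */
static void smux_tx_add_credit(struct smux_tx_session *s, uint32_t grant)
{
    if (grant > UINT32_MAX - s->credit)
        s->credit = UINT32_MAX;
    else
        s->credit += grant;
}
```

A sender would call smux_tx_sendable before building each fragment and smux_tx_consume after writing it, so a 100K byte burst on a fresh session is cut off at 16384 bytes until more credit arrives.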
An SMUX implementation must split streams into fragments when transmitting
them. The max_fragment_size, a variable which
is maintained on (currently) a per transport connection basis, determines
the largest possible fragment a sender should ever send to a receiver.
This determines the maximum latency introduced by a SMUX layer above and
beyond the inherent TCP latencies (socket buffering on both sender and
receiver and the delay-bandwidth product amount of data that could be in
flight at any given instant). A client on a low bandwidth link, or
with limited memory buffering might decide to set the max_fragment_size
down to control latency and buffer space required. If max_fragment_size
is set to zero, the transmitter is left to determine the fragment size
and MAY take into account application protocol knowledge (e.g. a SMUX implementation
for HTTP might send fragments of the metadata of embedded objects, or the
next phase of a progressive image format, which it only knows). An
implementation SHOULD honor the max_fragment_size as it transmits
data, if it has been set by the receiver.
An SMUX implementation that does not have explicit knowledge or experience
of good fragment sizes might use these guidelines as a starting point:
-
The path_MTU of the TCP connection, if this information is available [3].
-
The MSS of the TCP connection, if the path_MTU is not available
-
In either case, you probably want to subtract 8 bytes to make sure a SMUX
header can be added without forcing another TCP segment.
This would result in fragmentation roughly similar to TCP segmentation
over multiple connections.
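The guideline above might be coded as the small helper below. It follows the text literally (path_MTU if known, otherwise MSS, minus 8 bytes for a SMUX header); the function name and the convention that 0 means "value not available" are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Room reserved for the largest SMUX header (4 bytes plus the
 * optional 4 byte long length field). */
#define SMUX_HEADER_RESERVE 8u

/* Suggest a starting fragment size. Pass 0 for a value that is not
 * available. Returns 0 when no hint can be derived, which leaves the
 * choice to the transmitter. */
static uint32_t smux_suggest_fragment_size(uint32_t path_mtu, uint32_t mss)
{
    /* Sanity check: when both are known, the MSS cannot exceed the MTU. */
    assert(path_mtu == 0 || mss == 0 || mss <= path_mtu);

    uint32_t base = path_mtu != 0 ? path_mtu : mss;
    if (base <= SMUX_HEADER_RESERVE)
        return 0;
    return base - SMUX_HEADER_RESERVE;
}
```

With a path_MTU of 1500 this yields 1492; with only an MSS of 1460 known it yields 1452.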
An implementation should round robin between sessions with data to send
in some fashion to avoid starving sessions, or allowing a single thread
to monopolize the connection. Exact details of such behavior are left
to the implementation. To achieve highest bandwidth and lowest overhead
SMUX behavior, credits should be handed out in reasonably large chunks.
TCP implementations typically send an ack message on every other packet,
and it is very hard to arrange to piggyback acks on data segments in implementations.
Therefore, for SMUX to have reasonably low overhead credits should be handed
out in some significant multiple (4 or more times larger) than the ~3000
bytes represented by two packets on an Ethernet. The outstanding
credit balance across active sessions will also have to be larger than
the bandwidth/delay product of the TCP connection if SMUX is not to become
a limit on TCP transport performance.
Both of these arguments indicate that outstanding credits in many implementations
should be 10K bytes or more. Implementations SHOULD piggyback credit
messages on data packets where possible, to avoid unneeded packets on the
wire. A careful implementation in which both ends of the TCP connection
are regularly sending some payload should be able to avoid sending extra
packets on the network.
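One way a receiver could hand out credit in large chunks is to accumulate the bytes its consumer has read and only emit a grant once a threshold is reached. The sketch below assumes a threshold of 4 x 3000 bytes, taken from the "4 or more times" guidance above; the names are hypothetical, and how the grant is actually sent (ideally piggybacked on outgoing data) is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed batching threshold: four times the ~3000 bytes of two
 * Ethernet packets mentioned in the text. */
#define SMUX_CREDIT_CHUNK (4u * 3000u)

/* Hypothetical receiver-side state for one session. */
struct smux_rx_session {
    uint32_t pending;   /* bytes consumed but not yet re-granted */
};

/* Called when the consumer has read `n` bytes from the session buffer.
 * Returns the size of the credit grant to send now, or 0 if the
 * pending amount is still below the threshold. */
static uint32_t smux_rx_consumed(struct smux_rx_session *s, uint32_t n)
{
    assert(n <= UINT32_MAX - s->pending);
    s->pending += n;
    if (s->pending < SMUX_CREDIT_CHUNK)
        return 0;

    uint32_t grant = s->pending;
    s->pending = 0;
    return grant;
}
```

Three reads of 5000 bytes each produce a single grant of 15000 bytes after the third read, rather than three small credit messages.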
If necessary, we could add in a future version fragmentation control
messages to do some bandwidth allocation, but for now, we are not bothering.
SMUX Header
SMUX headers are always in big endian byte order.
If people want, we could expand out the union below on a control
message type basis (e.g. the way the C bindings to X events were written
out...). For this draft, I'm not doing so.
#define MUX_CONTROL     0x00800000
#define MUX_SYN         0x00400000
#define MUX_FIN         0x00200000
#define MUX_RST         0x00100000
#define MUX_PUSH        0x00080000
#define MUX_SESSION     0xFF000000
#define MUX_LONG_LENGTH 0x00040000  /* single flag bit between PUSH and the length */
#define MUX_LENGTH      0x0003FFFF

typedef unsigned int flagbit;

struct w3mux_hdr {
    union {
        struct {
            unsigned int session_id : 8;
            flagbit control : 1;
            flagbit syn : 1;
            flagbit fin : 1;
            flagbit rst : 1;
            flagbit push : 1;
            flagbit long_length : 1;
            unsigned int fragment_size : 18;
            int long_fragment_size : 32;  /* only present if long_length is set */
        } data_hdr;
        struct {
            unsigned int session_id : 8;
            flagbit control : 1;
            unsigned int control_code : 4;
            flagbit long_length : 1;
            unsigned int fragment_size : 18;
            int long_fragment_size : 32;  /* only present if long_length is set */
        } control_message;
    } contents;
};
The fragment_size is always the size in bytes of the fragment, excluding
the SMUX header and any padding.
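Because C bit-field layout is compiler dependent, an implementation may prefer to build the header word with shifts and masks and write it out in big endian order explicitly. The following is a sketch of that approach. The helper names are hypothetical, and the long length flag is assumed to be the single bit 0x00040000 implied by the bit-field layout (one bit between PUSH and the 18 bit length).

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks for the first 32-bit header word. */
#define SMUX_CONTROL     0x00800000u
#define SMUX_SYN         0x00400000u
#define SMUX_FIN         0x00200000u
#define SMUX_RST         0x00100000u
#define SMUX_PUSH        0x00080000u
#define SMUX_LONG_LENGTH 0x00040000u  /* assumed: one bit below PUSH */
#define SMUX_LENGTH      0x0003FFFFu  /* 18-bit fragment_size */
#define SMUX_FLAG_BITS   0x00FC0000u  /* all six flag bits */

/* Combine session ID, flag bits and an 18-bit size into one word. */
static uint32_t smux_pack_header(uint8_t session_id, uint32_t flags,
                                 uint32_t size)
{
    assert((flags & ~SMUX_FLAG_BITS) == 0);
    assert(size <= SMUX_LENGTH);
    return ((uint32_t)session_id << 24) | flags | size;
}

/* Write the header word in big endian (network) byte order. */
static void smux_store_header(uint8_t out[4], uint32_t word)
{
    out[0] = (uint8_t)(word >> 24);
    out[1] = (uint8_t)(word >> 16);
    out[2] = (uint8_t)(word >> 8);
    out[3] = (uint8_t)word;
}

/* Read a header word back from its big endian wire form. */
static uint32_t smux_load_header(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

For example, a data fragment of 80 bytes on session 4 with SYN and PUSH set packs to the word 0x04480050, which goes on the wire as the bytes 04 48 00 50.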
Alignment
SMUX headers are always (at least) 32 bit aligned. To find the next SMUX
header, take the fragment_size, and round up to the next 32 bit
boundary.
Transmitters can insert a NoOp control message to force
64 bit alignment of the protocol stream.
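The rounding rule can be written as a one-line mask operation. The helper names below are hypothetical; offsets are counted in bytes from the start of the transport stream, and the payload is assumed to start on a 32 bit boundary, as it does directly after a 4 or 8 byte header.

```c
#include <assert.h>
#include <stdint.h>

/* Round a fragment size up to the next multiple of 4 bytes. */
static uint64_t smux_padded_size(uint32_t fragment_size)
{
    return ((uint64_t)fragment_size + 3u) & ~(uint64_t)3u;
}

/* Given the stream offset where a fragment's payload starts, return
 * the offset of the next SMUX header. */
static uint64_t smux_next_header_offset(uint64_t payload_offset,
                                        uint32_t fragment_size)
{
    assert((payload_offset & 3u) == 0);
    return payload_offset + smux_padded_size(fragment_size);
}
```

A fragment of 18 bytes whose payload starts at offset 4 is followed by 2 bytes of padding, so the next header starts at offset 24.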
Long Fragments
A SMUX header with the long_length bit set must use the 32 bits
following the SMUX header (the long_fragment_size field) for the
value of the fragment_size field, for whatever purpose the fragment_size
field is being used.
Clients can also use this bit to force 64 bit alignment of the protocol
stream.
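A receiver therefore has to look at the long_length bit before it knows how long the header is. The sketch below shows one way to do this. It assumes the long length flag is the single bit 0x00040000 implied by the bit-field layout and treats long_fragment_size as an unsigned big endian value; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMUX_LONG_LENGTH_BIT 0x00040000u  /* assumed position of long_length */
#define SMUX_SHORT_LENGTH    0x0003FFFFu

struct smux_length_info {
    size_t   header_bytes;   /* 4, or 8 when long_length is set */
    uint32_t fragment_size;  /* effective size field value */
};

static uint32_t smux_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 1 and fills `out` if `buf` holds a complete header,
 * or 0 if more bytes are needed. */
static int smux_read_length(const uint8_t *buf, size_t len,
                            struct smux_length_info *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 4)
        return 0;

    uint32_t word = smux_be32(buf);
    if (word & SMUX_LONG_LENGTH_BIT) {
        if (len < 8)
            return 0;
        out->header_bytes = 8;
        out->fragment_size = smux_be32(buf + 4);
    } else {
        out->header_bytes = 4;
        out->fragment_size = word & SMUX_SHORT_LENGTH;
    }
    return 1;
}
```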
Session ID Allocation
Each session is allocated a session identifier. Session identifiers
0 and 1 are reserved for future use. Session IDs allocated by the initiator
of the transport connection are even; those allocated by the receiver of
the transport connection are odd. Proxies or re-multiplexors that do not understand
messages of reserved Session ID's should forward them unchanged.
A session identifier MUST only be deallocated and potentially reused by
new sessions when a session is fully closed in both directions.
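A possible allocator for these rules is sketched below. It reads the reservation as covering IDs 0 and 1, so the initiator uses 2, 4, ..., 254 and the other side uses 3, 5, ..., 255. The table layout and all names are assumptions of this sketch, not part of the protocol.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Which end of the transport connection we are. */
enum smux_role { SMUX_INITIATOR = 0, SMUX_ACCEPTOR = 1 };

struct smux_id_table {
    enum smux_role role;
    bool in_use[256];
};

static void smux_id_table_init(struct smux_id_table *t, enum smux_role role)
{
    memset(t, 0, sizeof *t);
    t->role = role;
}

/* Returns the lowest free ID of our parity (even for the initiator,
 * odd for the acceptor), skipping the reserved IDs 0 and 1.
 * Returns -1 when all IDs are in use. */
static int smux_alloc_session_id(struct smux_id_table *t)
{
    for (int id = 2 + (int)t->role; id <= 255; id += 2) {
        if (!t->in_use[id]) {
            t->in_use[id] = true;
            return id;
        }
    }
    return -1;
}

/* An ID becomes reusable only when both directions are closed. */
static void smux_release_session_id(struct smux_id_table *t, int id,
                                    bool tx_closed, bool rx_closed)
{
    assert(id >= 2 && id <= 255);
    if (tx_closed && rx_closed)
        t->in_use[id] = false;
}
```

Because the two ends draw from disjoint sets, neither has to ask the other before opening a session.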
Atoms
Atoms are integers that are used as short-hand names for strings, which
are defined using the InternAtom control message. Atoms are
only used as protocol ID's in this version of SMUX, though they might be
used for other purposes in future versions. Since the atom might
be redefined at any time, it is not safe to use an atom unless you have
defined it (i.e. you cannot use atoms defined by the other end of a connection).
Atoms are therefore not unique values, and only make sense in the context
of a particular direction of a particular connection. This restriction
is to avoid having to define some protocol for deallocating atoms, with
any round trip overhead that would likely imply.
Strings are defined to be UTF-8 encoded UNICODE strings. (Note
that an ascii string is valid UTF-8). The definition of structure
of these strings is outside of the scope of this document, though we expect
they will often be URI's, naming a protocol or stack of protocols.
Atoms always have values between 0x20000 and 0x200ff (a maximum of 256
atoms can be defined).
Strings used for protocol id's MUST be URIs [28].
Protocol ID's
The protocol used by a session is identified by a Protocol ID, which can
either be an IANA port number, or an atom. Protocol IDs serve two purposes:
-
To allow higher layers to stack protocols (e.g. HTTP on top of deflate
compression, on top of TCP).
-
To identify the protocol or protocol stack in use so that application firewall
relays can perform sanity checking and policy enforcement on the multiplexed
protocols.
In the simplest case, a protocol ID is just a value in the range of 0-0x1FFFF,
and specifies the TCP port number (0x0000-0xffff) or UDP port number (0x10000-0x1ffff)
of the protocol per the IANA port number registry [17]. Firewall
proxies can presume that the bytes should conform to that protocol.
Protocol ID's above 0x1ffff are atoms. The scheme name of the URI indicates
the protocol family being used.
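The ranges can be summarized in a small classifier. This sketch assumes the atom range 0x20000 to 0x200ff given in the Atoms section and treats any other value that fits the 18 bit field as invalid; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum smux_protocol_kind {
    SMUX_PROTO_TCP_PORT,   /* 0x00000 - 0x0ffff */
    SMUX_PROTO_UDP_PORT,   /* 0x10000 - 0x1ffff */
    SMUX_PROTO_ATOM,       /* 0x20000 - 0x200ff */
    SMUX_PROTO_INVALID
};

static enum smux_protocol_kind smux_classify_protocol_id(uint32_t id)
{
    if (id <= 0x0FFFFu)
        return SMUX_PROTO_TCP_PORT;
    if (id <= 0x1FFFFu)
        return SMUX_PROTO_UDP_PORT;
    if (id >= 0x20000u && id <= 0x200FFu)
        return SMUX_PROTO_ATOM;
    return SMUX_PROTO_INVALID;
}

/* Extract the IANA port number from a TCP or UDP protocol ID. */
static uint16_t smux_protocol_port(uint32_t id)
{
    assert(id <= 0x1FFFFu);
    return (uint16_t)(id & 0xFFFFu);
}
```

Protocol ID 80 is TCP port 80 (HTTP), while 0x10035 is UDP port 53.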
Session Establishment
A session is established by setting the SYN bit in the first message sent
on that session. The session is specified by the session_id field. The
fragment_size field is interpreted as the protocol
ID of the session, as discussed above.
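As an illustration, the first header word of a new session could be built as shown here. The function name is hypothetical; the SYN mask is the MUX_SYN value from the header definition, and the protocol ID is placed in the 18 bit fragment_size field.

```c
#include <assert.h>
#include <stdint.h>

#define SMUX_SYN_BIT     0x00400000u
#define SMUX_LENGTH_BITS 0x0003FFFFu

/* Build the header word that opens `session_id` for `protocol_id`.
 * The fragment_size field carries the protocol ID, not a length. */
static uint32_t smux_syn_word(uint8_t session_id, uint32_t protocol_id)
{
    assert(protocol_id <= SMUX_LENGTH_BITS);
    return ((uint32_t)session_id << 24) | SMUX_SYN_BIT | protocol_id;
}
```

Opening session 2 for HTTP (TCP port 80) gives the word 0x02400050.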
Graceful Release
A session is ended by sending a fragment with the FIN bit set. Each end
of a connection may be closed independently.
SMUX uses a half-close mechanism like TCP [1] to close data flowing in
each direction in a session. After sending a FIN fragment, the sender MUST
NOT send any more payload in that direction.
Disgraceful Release
A session may be terminated by sending a message with the RST bit set.
All pending data for that session should be discarded. "No such protocol"
errors detected by the receiver of a new session are signaled to the originator
on session creation by sending a message with the RST bit set. (Same as
in TCP).
The payload of the fragment containing the RST bit contains a null
terminated string containing the URI of an error message (note that content
negotiation makes this message potentially multi-lingual), followed by
a null terminated UTF-8 string containing the reason for the reset (in
case the URI is not accessible).
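A receiver might split such a payload into its two strings as sketched below. The struct and function names are hypothetical; the function only checks that both terminators are present and does not validate the URI or the UTF-8 encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct smux_rst_info {
    const char *uri;     /* URI of the error message */
    const char *reason;  /* UTF-8 reason text */
};

/* Split an RST payload of `len` bytes into its two null terminated
 * strings. Returns 1 on success, 0 if either terminator is missing.
 * The returned pointers refer into `payload`. */
static int smux_parse_rst(const char *payload, size_t len,
                          struct smux_rst_info *out)
{
    assert(payload != NULL && out != NULL);

    const char *uri_end = memchr(payload, '\0', len);
    if (uri_end == NULL)
        return 0;

    const char *reason = uri_end + 1;
    size_t remaining = len - (size_t)(reason - payload);
    if (memchr(reason, '\0', remaining) == NULL)
        return 0;

    out->uri = payload;
    out->reason = reason;
    return 1;
}
```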
Message Boundaries
A message boundary is marked by sending a message with the PUSH bit set.
The boundary is set between the last octet in this message, including that
octet, and the first byte of a subsequent message. This differs slightly
from TCP, as PUSH can be reliably used as a record mark.
Flow Control
Flow control is determined by a simple credit scheme described above by
using the AddCredits control message defined below.
Fragments transmitted must never exceed the outstanding credit for that
session. The initial outstanding credit for a session is 16Kbytes.
End Points
One of the major design goals of SMUX is to allow callbacks to objects
in the process that initiated the transport connection without requiring
a second transport connection (with the overhead in both machine resources
and time that this would cause).
The DefineEndpoint control message allows one to advertise that
a particular (set of) URI's are reachable over the transport connection.
Control Messages
The control bit of the SMUX header is always set in a control message.
Control messages can be sent on any session, even sessions that are not
(yet) open. The control_code reuses the SYN, FIN, RST, and PUSH
bits of the SMUX header. The control_code of the control message
determines the control message type. Any unused data in a control message
must be ignored.
The revised version of SMUX means that a session creation costs 4
bytes (a control message with SYN set, and with the protocol ID in the
message). Therefore the first fragment of payload has a total overhead
of 8 bytes. (This is presuming using an IANA based protocol, rather
than a named protocol). This is the same as the previous version,
though it means two messages rather than one.
The individual control message types are listed below.
code | Name | Dir | Description
0 | InternAtom | Both | The session_id is used as the atom to be defined (offset by 0x20000, so a value of 0 defines atom 0x20000). The fragment_size field is the length of the UTF-8 encoded string. The fragment itself contains the string to be interned. This allows the interning of 256 strings. (Is this enough?)
1 | DefineEndpoint | Both | The session_id is ignored. The fragment_size is interpreted as the protocol ID, naming an endpoint actually available on this transport connection. This enables a single transport connection to be used for callbacks, or to advertise to the process on the other end of the connection that a protocol endpoint can be reached. For example, a connection to a firewall proxy might advertise just "http:" for the proxy, claiming it can be used to contact any HTTP protocol object anywhere, or "http://foo.com/bar/" to indicate that any object below that point in the URI space on the server foo.com may be reached by this connection. Whether this relative URI naming can be used depends upon the scheme of the URI [20], which defines its structure.
2 | SetMSS | Both | This sets a limit on fragment sizes below the outstanding credit limit. The session_id must be zero. The fragment_size field is used as max_fragment_size (the largest fragment that may be sent on any session on this transport connection). A max_fragment_size of zero means there is no limit on the fragment size allowed for this session.
3 | AddCredit | R->T | The session_id specifies the session. The fragment_size specifies the flow control credit granted (to be added to the current outstanding credit balance). A value of zero indicates no limit on how much data may be sent on this session.
4 | SetDefaultCredit | R->T | The session_id must be zero. The fragment_size field is used to set the initial default credit limit for any incoming connections over this transport connection (i.e. it is short hand for sending a series of AddCredit messages for each session ID).
5 | NoOp | Both | This control message is defined to perform no function. Any data in the payload should be ignored.
6-15 | - | | Undefined. Reserved for future use. Must be ignored if not understood, and forwarded by any proxies. The fragment_size is always used for the length of the control message, and any data for the control message will be in the payload of the control message (to allow proxies to be able to forward future control messages).
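The control message header can be encoded and decoded with the same shift-and-mask style as data headers. In this sketch the 4 bit control_code is assumed to occupy the SYN, FIN, RST and PUSH bit positions (bits 22 to 19), as the control_message bit-field layout suggests; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Control codes from the table above. */
enum smux_control_code {
    SMUX_INTERN_ATOM        = 0,
    SMUX_DEFINE_ENDPOINT    = 1,
    SMUX_SET_MSS            = 2,
    SMUX_ADD_CREDIT         = 3,
    SMUX_SET_DEFAULT_CREDIT = 4,
    SMUX_NOOP               = 5
};

#define SMUX_CTL_BIT        0x00800000u
#define SMUX_CTL_CODE_SHIFT 19
#define SMUX_CTL_CODE_MASK  0xFu
#define SMUX_CTL_VALUE_MASK 0x0003FFFFu

/* Build a short (4 byte) control message header word. */
static uint32_t smux_control_word(uint8_t session_id, unsigned code,
                                  uint32_t value)
{
    assert(code <= SMUX_CTL_CODE_MASK);
    assert(value <= SMUX_CTL_VALUE_MASK);
    return ((uint32_t)session_id << 24) | SMUX_CTL_BIT |
           ((uint32_t)code << SMUX_CTL_CODE_SHIFT) | value;
}

static int smux_is_control(uint32_t word)
{
    return (word & SMUX_CTL_BIT) != 0;
}

static unsigned smux_control_code_of(uint32_t word)
{
    return (unsigned)((word >> SMUX_CTL_CODE_SHIFT) & SMUX_CTL_CODE_MASK);
}
```

An AddCredit message granting 12000 bytes on session 2 encodes as 0x02982EE0. A proxy that sees a code in the reserved range 6 to 15 would forward the message using only the length field, without interpreting it.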
Remaining Issues for Discussion
-
When can MUX be used???
-
What are the appropriate strategies for determining if the simple multiplexing
protocol can be used? Name server hack? UPGRADE in HTTP? Remember that
previous UPGRADE to use MUX worked?
Closed Issues from Discussion and Mail
Some of the comments below allude to previous versions of the specification,
and may not make sense in the context of the current version.
Flow control: priority vs. credit schemes
Henrik and I have convinced ourselves there are fundamental differences
between a priority scheme and the credit scheme in this draft. They
interact quite differently with TCP, and priority schemes have no way to
limit the total amount of data being transmitted, though priority schemes
are better matched to what the Web wants. We've decided, at least
for now, to defer any priority schemes to higher level protocols.
Stacking Protocols and Transports (Stacks)
ILU [22] style protocol stacks are a GOOD THING. There have been too many
worries about the birthday problem for people to be comfortable with Bill
Janssen's hashing schemes (see Henrik
Frystyk Nielsen and Robert
Thau's mail on this topic). We tried putting this directly
in MUX in a previous version, and experience shows that it didn't really
help an implementer (in particular, Bill Janssen while implementing ILU).
This version has just the name of the protocol, and it is left to others
to implement any stacking (e.g. ILU).
I believe the name of the protocol is necessary, if SMUX is ever to
be used with firewalls. Application level firewall relays need the
protocol information to sanity check the protocol being relayed. Application
level relays are considered much more secure than just punching holes in
the firewall for particular protocol families, which small organizations
often find sufficient, as the relay can sanity check the protocol stream
and enable better policy decisions (for example, to forbid certain datatypes
in HTTP to transit a firewall). Large organizations and large targets
typically only run application level proxies.
Byte Usage
Wasting bytes in general, and in particular at connection establishment,
for a multiplexing transport must be avoided. There are several reasons
for this:
-
if the initial segment is too long, a network round trip will be lost to
TCP slow start, so bytes near the beginning of a conversation MAY BE much
more precious than bytes later in the conversation, once slow start overhead
has been paid. If the first segment is too long, you fall off a cliff.
-
Directly affects user perceived response; no cleverness of later packing
and batching of request can get the time back; each goes directly to perceived
latency when a user talks to the server for the first time.
So there is more than the usual tension between generality vs. performance.
Performance analysis
Human perception is about 30 milliseconds; if much more than this, the
user perceives delay. At 14.4 K baud, one byte uncompressed costs .55 milliseconds
(ignoring modem latencies). On an airplane via telephone today, you
get a munificent 4800 baud, which is 3X slower. Cellular modems transmitting
data (CDPD), as I understand it, will give us around 20Kbaud, when deployed.
So basic multiplexing @ 4 byte overhead costs ~ 2 milliseconds on common
modems. This means basic overhead is small vs. human perception, for most
low speed situations, a good position to be in.
On connection open, with above protocol we send 4 bytes in the setup
message, and then must open a session, requiring at least 8 bytes more.
12 bytes == 7 milliseconds at 14.4K. Not 64 bit aligned, and 4 bytes costs
of order 2 milliseconds. Ugh... Maybe a setup message isn't a good idea;
other uses (e.g. security) can be dealt with by a control message.
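The figures in this analysis can be reproduced with a short calculation. The helper below is only a sketch: it assumes, as the numbers above appear to, that "baud" means bits per second and that each byte costs 8 bits on the wire with no modem framing or compression. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Time in microseconds to transmit `bytes` at `bits_per_second`,
 * assuming 8 bits per byte and no other overhead (rounded down). */
static uint64_t smux_wire_time_us(uint32_t bytes, uint32_t bits_per_second)
{
    assert(bits_per_second > 0);
    return ((uint64_t)bytes * 8u * 1000000u) / bits_per_second;
}
```

At 14400 bits per second this gives about 555 microseconds for one byte, about 2.2 milliseconds for a 4 byte header and about 6.7 milliseconds for 12 bytes, in line with the ~2 and 7 millisecond figures quoted. At 4800 bits per second the 4 byte header alone costs about 6.7 milliseconds.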
Multiple protocols over one SMUX
We want to SMUX multiple protocols simultaneously over the same transport
connection, so we need to know what protocol is in use with each session,
so the demultiplexor can hand the data to the right person (e.g. SUNRPC
and DCERPC simultaneously).
There are two obvious ways I can see to do this:
-
a) Send a control message when a session is first used,
indicating the protocol.
-
Disadvantage: costs probably 8 bytes to do so (4 SMUX overhead, and 4 byte
message), and destroys potential 64 bit alignment.
-
b) If syn is set indicating new session, then steal
mux_length field to indicate protocol in use on that session.
-
(overhead; 4 bytes for the SMUX header used just to establish the session.)
Opinions? Mine is that b) is better than a). Answer: b) is the adopted strategy.
Priority...
For a given stream, priority will affect which session is handled when
multiplexing data; sending the priority on every block is unneeded, and
would waste bytes. There is one case in which priority might be useful:
at an intermediate proxy relaying sessions (and maybe remultiplexing them).
If so, it should be sent only when sessions are established or changed.
Changes can be handled by a control message. Opinions?
A priority field can be hacked into the length field with the protocol
field using b) above.
So the question is: is it important to send priority at all in this
SMUX protocol? Or should priority control, if needed, be a control message?
Answer: Not in this protocol. Opens Pandora's box with remultiplexors,
which could have denial of service attacks.
Setup message
Is any setup message needed? I don't think it is, and initial bytes are
precious (see performance discussion above), and it complicates trivial
use. If we move the byte order flag to the SMUX header, and use control
messages if other information needs to be sent, we can dispense with it,
and the layer is simpler. This is my current position, and unless someone
objects with reasons, I'll nuke it in the next version of this document.
Answer: Not needed. Nuked.
Byte order flags
While higher layer protocols using host dependent byte order can be a performance
win (when sending larger objects such as arrays of data), the overhead
at this layer isn't much, and may not be worth bothering with. Worst case
(naive code) would be four memory reads and three shifts of overhead per
payload. Smart code is one load and appropriate shifts etc.
Opinions? I'm still leaning toward swapping bytes here, but there are
other examples of byte load and shift (particularly slow on Alpha, but
not much of an issue on other systems).
Answer: Not sufficient performance gain at SMUX level to be worth doing.
SMUX headers are defined to use little-endian (LE) byte order.
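The two decoding styles discussed above can be sketched as follows, assuming a 4 byte header held in a bytes-like buffer; the function names are illustrative only.

```python
import struct


def decode_naive(buf):
    """Worst case from the text: four single-byte reads and three shifts."""
    return buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24)


def decode_word(buf):
    """One 32-bit load; '<I' fixes little-endian order on any host."""
    return struct.unpack_from("<I", buf, 0)[0]
```

Both return the same value on every host, which is the point of fixing the header byte order rather than negotiating it.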
Error handling
There are several error conditions, probably best reported via control
messages from server:
-
No such protocol. Some sort of serial number should be reported, I suppose;
this serial number can be implicit, as in X.
-
Bad message.
-
Some combinations of flag bits are not legal.
-
Priority if it exists?
Any others? Any twists to worry about?
Answer: The only error that can occur is no such protocol, given that there
is no priority in the base protocol. There may still be some unresolved issues
here around the "Christmas Tree" message (all bits turned on).
Length Field
Any reason to believe that the 32 bit length field for a single payload
is inadequate? I don't think so, and I live on an Alpha.
Answer: 32 bit extended length field for a single fragment is sufficient.
Compression
Does there need to be a bit saying the payload is compressed to avoid explosion
of protocol types?
Answer: Yes; introduction of control message to allow specification
of transport stacks achieves this.
Stacks
I think that we should be able to multiplex any TCP, UDP, or IP protocol.
Internet protocol numbers are 8 bit fields.
So we need 16 bits for TCP port numbers, one bit to distinguish TCP and UDP,
and one more bit we can use for IP protocol numbers and address space we can
allocate privately. This argues for an 18 bit length field to allow for
this reuse:
-
18 bit length field
-
8 bit session field
-
4 control bits
-
1 long length bit
The last bit is used to define control messages, which reuse the syn,
fin, rst, and push bits as a control_code to define the control message.
There are escapes, both by undefined control codes, and by the reservation
of two sessions for further use if there needs to be further extensions.
The spec above reflects this.
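As a rough check of the bit budget, the fields can be packed into one 32-bit word. The field widths come from the text; the ordering of the fields within the word, and the names pack_header and unpack_header, are assumptions made for this sketch.

```python
LENGTH_BITS = 18
SESSION_BITS = 8
FLAG_BITS = 4  # syn, fin, rst, push

# Assumed placement, least significant bit first; only the widths come
# from the text.
LONG_LENGTH_SHIFT = LENGTH_BITS              # 1 long length bit
FLAGS_SHIFT = LONG_LENGTH_SHIFT + 1          # 4 control bits
CONTROL_SHIFT = FLAGS_SHIFT + FLAG_BITS      # 1 bit marking a control message
SESSION_SHIFT = CONTROL_SHIFT + 1            # 8 bit session field

# 18 + 1 + 4 + 1 + 8 bits fill the 32-bit header exactly.
assert SESSION_SHIFT + SESSION_BITS == 32


def pack_header(session, flags, length, long_length=False, control=False):
    """Combine the fields into one 32-bit header word."""
    if not 0 <= session < (1 << SESSION_BITS):
        raise ValueError("session does not fit in 8 bits")
    if not 0 <= flags < (1 << FLAG_BITS):
        raise ValueError("flags do not fit in 4 bits")
    if not 0 <= length < (1 << LENGTH_BITS):
        raise ValueError("length does not fit in 18 bits")
    return (
        (session << SESSION_SHIFT)
        | (int(control) << CONTROL_SHIFT)
        | (flags << FLAGS_SHIFT)
        | (int(long_length) << LONG_LENGTH_SHIFT)
        | length
    )


def unpack_header(word):
    """Split a 32-bit header word back into its fields."""
    return {
        "session": (word >> SESSION_SHIFT) & ((1 << SESSION_BITS) - 1),
        "control": bool((word >> CONTROL_SHIFT) & 1),
        # For a control message these four bits are read as a control_code.
        "flags": (word >> FLAGS_SHIFT) & ((1 << FLAG_BITS) - 1),
        "long_length": bool((word >> LONG_LENGTH_SHIFT) & 1),
        "length": word & ((1 << LENGTH_BITS) - 1),
    }
```

With 18 bits, the length field is wide enough to hold a 16 bit port number plus the two extra bits described above when it is reused as a protocol id.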
Alignment
Back to alignment. If we demand 4 byte alignment, for all requests that
do not end up naturally aligned, we waste bytes. Two bytes are wasted on
average. At 14.4Kbaud the overhead for protocols that do not pad up would
on average be 6 bytes or ~3 ms, rather than 4 bytes or ~2 ms (presuming even
distributions of length). Note that this DOES NOT affect initial request
latency (time to get first URL), and is therefore less critical than elsewhere.
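The arithmetic behind these figures can be reproduced with a short sketch. The helper names are illustrative, and the timing treats 14.4Kbaud as 14400 bits per second with no modem framing or compression, which is an assumption.

```python
def pad_bytes(length, alignment=4):
    """Bytes needed to round length up to the next multiple of alignment."""
    return -length % alignment


def wire_time_ms(nbytes, bits_per_second=14400):
    """Serialization time in milliseconds, ignoring framing and compression."""
    return nbytes * 8 * 1000 / bits_per_second
```

Under this model a bare 4 byte header costs about 2.2 ms and 6 bytes about 3.3 ms, in line with the rounded figures above. If lengths are spread evenly over the four residues modulo 4, the mean padding is 1.5 bytes, so the two byte figure is a slight overestimate.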
I have one related worry; it can sometimes be painful to get padding
bytes at the end of a buffer; I've heard of people losing by having data
right up to the end of a page, so implementations are living slightly dangerously
if they presume they can send the padding bytes by sending the 1, 2
or 3 bytes after the buffer (rather than an independent write to the OS
for padding bytes).
Alternatively, the buffer alignment requirement can be satisfied by
implementations remembering how many pad bytes have to be sent, and adjusting
the beginning address of the subsequent write by that many bytes before
the buffer where the SMUX header has been put. Am I being unnecessarily
paranoid?
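One way to avoid reading past the end of the caller's buffer is to take the padding from a separate, fixed source and hand all the pieces to the OS together. The sketch below only builds the list of buffers; ZERO_PAD and fragment_buffers are hypothetical names, and 4 byte alignment is assumed.

```python
ZERO_PAD = bytes(3)  # shared padding source, never part of the caller's buffer


def fragment_buffers(header, payload):
    """Return the buffers for one fragment, padded to a 4 byte boundary."""
    pad = -len(payload) % 4
    buffers = [header, payload]
    if pad:
        # The padding comes from ZERO_PAD, so no byte beyond the end of
        # payload is ever touched.
        buffers.append(ZERO_PAD[:pad])
    return buffers
```

On systems that offer a gather write (for example os.writev on POSIX platforms), such a list can be written in one call, which avoids both the out-of-bounds read and a separate write for the padding bytes.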
Opinion: I believe alignment of fragments in general is a GOOD THING,
and will simplify both the SMUX transport and protocols at higher levels
if they can make this presumption in their implementations. So I believe
this overhead is worth the cost; if you want to do better and save these
bytes, then start building an application specific compression scheme.
If not, please make your case.
Control bits
Are the four bits defined in Simon's flags field what we need? Are there
any others?
Answer: no. More bits than we need. Current protocol doesn't use as
many. I've ended back at the original bits specified, rather than the smaller
set suggested by Bill Janssen. This enables full emulation of all the details
of a socket interface, which would not otherwise be possible. See details
around TCP and socket handling, discussed in books like "TCP/IP Illustrated,"
by W. Richard Stevens.
Am I all wet?
Opinion: I believe that we should do this.
Control Messages
Question: do we want/need a short control message? Right now, the out for
extensibility is control messages sent in the reserved (and as yet unspecified)
control session. This requires a minimum of 8 bytes on the wire. We could
steal the last available bit, and allow for a 4 byte short control message
that would have 18 bits of payload.
Opinion: Flow control needs it; protocol/transport stacks need it. Document
above now defines some control messages.
Simplicity of default Behavior
The above specification allows for someone who just wants to SMUX a single
protocol to entirely ignore protocol IDs.
Glossary
To be supplied
References
-
J. Postel, "Transmission Control Protocol", RFC
793, Network Information Center, SRI International, September 1981
-
J. Postel, "TCP and IP bake off", RFC
1025, September 1987
-
J. Mogul, S. Deering, "Path MTU Discovery", RFC
1191, DECWRL, Stanford University, November 1990
-
T. Berners-Lee, "Universal Resource Identifiers
in WWW. A Unifying Syntax for the Expression of Names and Addresses of
Objects on the Network as used in the World-Wide Web", RFC
1630, CERN, June 1994.
-
R. Braden, "T/TCP -- TCP Extensions for Transactions: Functional Specification",
RFC
1644, USC/ISI, July 1994
-
R. Fielding, "Relative Uniform Resource
Locators", RFC
1808, UC Irvine, June 1995.
-
T. Berners-Lee, R. Fielding, H. Frystyk, "Hypertext
Transfer Protocol -- HTTP/1.0", RFC
1945, W3C/MIT, UC Irvine, W3C/MIT, May 1996
-
R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, T. Berners-Lee, "Hypertext
Transfer Protocol -- HTTP/1.1", RFC
2068, U.C. Irvine, DEC W3C/MIT, DEC, W3C/MIT, W3C/MIT, January 1997
-
S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", RFC
2119, Harvard University, March 1997
-
J. Touch, "TCP Control Block Interdependence", RFC
2140, April 1997
-
W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit,
and Fast Recovery Algorithms", RFC
2001, January 1997
-
V. Jacobson, "Congestion
Avoidance and Control", Proceedings of SIGCOMM '88
-
H. Frystyk Nielsen, J. Gettys, A. Baird-Smith, E. Prud'hommeaux, H. W.
Lie, and C. Lilley, "Network
Performance Effects of HTTP/1.1, CSS1, and PNG", Proceedings of SIGCOMM
'97
-
S. Floyd and V. Jacobson, "Random
Early Detection Gateways for Congestion Avoidance", IEEE/ACM Trans.
on Networking, vol. 1, no. 4, Aug. 1993.
-
R. W. Scheifler, J. Gettys, "The X Window System", ACM Transactions
on Graphics # 63, Special Issue on User Interface Software, 5(2):79-109
(1986).
-
V. Paxson, "Growth
Trends in Wide-Area TCP Connections", IEEE Network, Vol. 8 No. 4, pp.
8-17, July 1994
-
S. Spero, "Session Control Protocol, Version 1.0"
-
S. Spero, "Session
Control Protocol, Version 2.0"
-
Keywords and Port numbers are maintained by IANA in the port-numbers registry.
-
Keywords and Protocol numbers are maintained by IANA in the protocol-numbers
registry.
-
W. Richard Stevens, "TCP/IP Illustrated,
Volume 1", Addison-Wesley, 1994
-
Berners-Lee, T., Fielding, R., Masinter, L., "Uniform Resource Identifiers
(URI): Generic Syntax and Semantics," Work in Progress of the IETF, November,
1997.
-
J. Lyon, K. Evans, J. Klein, "Transaction
Internet Protocol Version 2.0," Work in Progress of the Transaction
Internet Protocol Working Group, November, 1997.
-
B. Janssen, M. Spreitzer, "Inter-Language
Unification"; in particular see the manual section on Protocols
and Transports.