Section 1.1 of the Extensible Markup Language (XML) gives as a design goal that "Terseness in XML markup is of minimal importance." The Standard Generalized Markup Language (SGML), of which XML is a Profile, has a number of features intended to reduce typing when humans are entering markup directly, or to reduce file sizes, but these features were not included in XML.
The resulting XML specification gave us a highly regular language, but one that can use a considerable amount of bandwidth to transmit in any quantity. Furthermore, although parsing has been greatly simplified in terms of code complexity and run-time requirements, larger data streams necessarily entail greater I/O activity, and this can be significant in some applications.
There has been a steadily increasing demand to find ways to transmit pre-parsed XML documents and Schema-defined objects, in such a way that embedded, low-memory and/or low bandwidth devices can make use of an interoperable, accessible, internationalised, standard representation for structured information, yet without the overhead of parsing an XML text stream.
Multiple separate experimenters have reported significant savings in bandwidth, memory usage and CPU consumption using (for example) an ASN.1-based representation of XML documents. Others have claimed that gzip is adequate.
Advantages of a binary representation of a pre-parsed stream of Information Items (as defined by the XML Infoset) might include:
One potential and very serious disadvantage is that one might lose the View Source Principle which has helped the Web to spread.
In September 2003, The W3C ran a Workshop, hosted by Sun Microsystems in Santa Clara, California, USA, to study methods to compress XML documents, comparing Infoset-level representations with other methods, in order to determine whether a W3C Working Group might be chartered to produce an interoperable specification for such a transmission format.
The Workshop concluded that the W3C should do further work in this area, but that the work should be of an investigative nature, gathering requirements and use cases, and prepare a cost/benefit analysis; only after such work could there be any consideration of whether it would be productive for W3C to attempt to define a format or method for non-textual interchange of XML.
See also Next Steps below for the conclusions as they were stated at the end of the Workshop.
The scribe for recording comments, questions and discussion for the first day was Chris Lilley (W3C).
David Orchard gave a presentation from BEA Systems.
Anish Karmarkar, Oracle: if there is no single solution that works for all cases, would you prefer one, or none, or multiple solutions?
David Orchard, BEA: prefer zero to two or more
Steve Williams, HPT: Clarify point about research needed
David Orchard, BEA: Not inventing something new - lots of solutions out there. Carefully analyze the problem to be solved, and pick a good one. If no existing solution works, probably no new one would either.
Rick Marshall could not be present at the Workshop. The Chair (Liam Quin, W3C) read Rick Marshall's paper [PDF version].
Jim Trezzo, AgileDelta: Moore's law is fine, but does not apply to batteries.
John Schneider, AgileDelta: Energy conservation does not follow it either
John Schneider later expanded this as follows: Yes. Once we unplug a device from the wall, we have to remember the basic laws of physics. Work takes energy and every byte read takes work.
Margaret Green, Ontonet: XML, or Infoset? Need to be clear what is being discussed.
Liam: Agreed that the workshop title is the infoset, and Rick's talk is about XML and does not consider the infoset but is primarily about the serialisation.
Noah Mendelsohn, IBM: the Infoset is not what results from parsing an XML document; that is only sometimes true. But there are also synthetic infosets, e.g. created via the DOM, which may be serialised later, but need not be. Essential to be clear on definitions.
Note: a brief discussion period was set aside for people to discuss terminology and a definition of the Infoset.
Michael Rys gave a Microsoft presentation.
Steve Williams, HPT: Binary DOM [need not be] isomorphic with infoset
Michael Rys, Microsoft: not a direct representation of the data model implied by the DOM; however you could support a DOM
Louis Reich, NASA: Are the last three lines of your last slide out of scope for this workshop? Seem highly relevant to me. Problems currently outside XML scope might be brought into the fold by using a binxml solution (eg large binary blobs, etc). Which part of 80:20 are we looking at?
Michael Rys, Microsoft: much of this is very domain specific, need some sort of packaging format
Louis Reich, NASA: infoset can hold binary data. Our community would prefer a sub-optimal but standard way rather than our own, discipline-specific standard. A single W3C standard would be very exciting for us. Disagree with your assertion - people would indeed use this if it was a standard.
Michael Rys, Microsoft: would need to extend the infoset to do this. Comparison with image compression, lots of special ways, most with encumbering IPR, most specific to particular type of content - would people abandon these to get a single binary representation? Or, better to use a packaging format and keep the image in its special, efficient format. It's a payload packaging problem.
Larry Masinter, Adobe: Agree with much of the analysis, puzzled by the conclusion about whether W3C should work on this. There are clearly other areas where the solution was not clear (e.g. the Semantic Web), so what is the threshold of research required? It is not at all clear that standardisation work would increase fragmentation.
Michael Rys, Microsoft: semantic web is research, it's not standards work. It's still too early, research is needed and should not be done at W3C as it stifles innovation once a standard is set. MS has internal binary representations, but we use textual XML for interop. It's all that works in all cases.
[aside: the Chair pointed out that the W3C Semantic Web research is at least partly funded externally]
Eduardo Pelegri-Llopart, Sun: Interop in web services only gets to 80% using standards; the rest is reverse engineering and non-standardised extensions.
John Schneider, AgileDelta: LZ or Huffman (frequency-based) compression only works on large messages with high character redundancy. It does not work for high-frequency streams of small messages, typical of mobile environments and Web services. Zip will often make these bigger instead of smaller.
Also, I've heard many people expecting to see big improvements in user-visible performance. [supplied later by John Schneider: To be honest, it's not that clear that mobile users will see a noticeable speed increase given the high latency of mobile networks. The more significant benefits are economic. Carriers spend a great deal of money buying frequencies and putting up cell towers to increase capacity of their pipes. If the size of the data shrinks by 10 times, carriers can now fit 10 times more customers on the same pipe, meaning they can generate 10 times more revenue without huge infrastructure investments. For always-on packet-switched networks where users pay by the kilobyte, these savings are passed along to the customer.]
[originally minutes text: Latency eats up the performance improvement, but reduced bandwidth still helps the carrier get more customers on that network.]
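John Schneider's claim that zip can enlarge very small messages is easy to verify; a minimal Python sketch (the sample message text is invented for illustration):

```python
import gzip

# A small SOAP-style request, typical of the short messages described
# above (the message text itself is invented for illustration).
small = b"<m:GetQuote xmlns:m='urn:q'><m:sym>SUNW</m:sym></m:GetQuote>"

compressed = gzip.compress(small)

# gzip adds a fixed header and trailer (18 bytes) plus Deflate framing,
# so messages this small typically come out larger than the original.
print(len(small), len(compressed))
```

The fixed per-stream overhead is amortised on large documents but dominates on short messages.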
Michael Rys, Microsoft: Carriers are not worried about interop with other xml apps. They use gateways and can send whatever they want between the gateway and the mobile device.
John Schneider, AgileDelta: Mobile devices don't necessarily use gateways any more. They can now hit any URL and access enterprise infrastructure directly. If I hit a SOAP Web service using my mobile device, the payload comes back as raw XML, with no gateway in between. So mobile devices definitely need efficient access to XML everywhere, not just through gateways.
Zero is not an option - MPEG7 and ASN1 is already in train. We need a general purpose standard that can deal with mixed namespaces.
[John expanded this in email later, for clarification: Like others, I also prefer zero standards to two. Unfortunately, however, zero is not an option. There are already two mainstream standards organizations working on binary encodings for XML, MPEG7 and ASN1. Are they compatible? No. Is either one of them general purpose enough to handle all mainstream XML applications? No. For example, neither one can handle mixed content, which is required for XHTML -- a pretty popular use case. We need a general purpose standard that can deal with the broad uses of XML.]
David Orchard, BEA: Our position is not "no"; it's "be sure what you are designing" and "pick one". My question to Microsoft (who said there is no 80:20 point now): Do you think there will be an 80:20 point in the future?
Michael Rys, Microsoft: Yes, we would reconsider then, but finding such a point is hard. MS, for example, has been trying since 1998 and has not yet found one that works across the whole company.
Robin Berjon gave the Expway presentation [this is a ZIP archive of SVG].
Noah Mendelsohn, IBM: Since you send compressed schemas, could you get benefits by sending the schema for Schemas?
Robin Berjon, Expway: that schema is not valid; we had to hack it
Noah Mendelsohn, IBM: you could send the schema on a one-shot basis - does this recurse?
Robin Berjon, Expway: no, magical types need special processing. It does work, but needs normalisation. The proper solution is to do a mapping to the common schema model, then send that.
Noah Mendelsohn, IBM: OK, so you do that with a 10-20k schema; what overhead to send it with each message?
Robin Berjon, Expway: with each message? and depends on the richness of the schema. With each message, only send the schema parts that are actually used that time. Or send a schema with a set of messages. We can also send incremental schema updates.
John Schneider, AgileDelta: well done! You dealt with many of the problems that we encountered. About representing any general infoset - MPEG7 currently does not support mixed content models, namespace prefixes, and some other infoset items.
Robin Berjon, Expway: Namespace prefixes are supposed to be disposable
John Schneider, AgileDelta: needs to apply to all uses of xml, and hit the mass market rather than high cost niche solutions.
[by email, John expanded this: Actually, prefixes are part of the infoset and while many people don't care about preserving prefixes, there are some communities that require it. Our solution needs to apply to all uses of xml, and hit the mass market rather than high cost niche solutions]
Robin Berjon, Expway: BiM 1 does not support mixed content, BiM 2 does and so does our product. PI is also doable, comments could be added too.
John Schneider, AgileDelta: Is open content (e.g. unexpected content) supported?
Robin Berjon, Expway: yes. Generic Infoset encoding. You lose some of the benefits, but not all of them, and it still works.
Santiago Pericas-Geertsen, Sun: BiM has nice features like fragments, how much of that should be part of the format and how much left to other levels?
Robin Berjon, Expway: need to be sure the low level format is fragmentable, to let higher levels do it. Fragments need context like inscope ns declarations. BiM is for broadcast so it was designed to do that.
Oliver Goldman gave the Adobe Systems Inc. presentation.
(during the presentation, someone noted that base64 is 30% bigger - but if you need the clean form of base64, that is 133% bigger. Also questions of random access.)
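The base64 overhead mentioned in the note follows from the encoding itself: every 3 input bytes become 4 output characters. A quick Python check of the arithmetic:

```python
import base64

data = bytes(3000)  # 3000 bytes of arbitrary binary payload

b64 = base64.b64encode(data)

# Every 3 input bytes become 4 output characters, so the encoded
# form is exactly 4/3 the size of the input here (a ~33% increase).
print(len(data), len(b64))  # 3000 4000
```

MIME-style line wrapping (76 characters plus CRLF per line) adds slightly more on top of the 4/3 ratio.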
Louis Reich, NASA: PSVI: what is the problem exactly, 1 to 2 years down the road?
Oliver Goldman, Adobe: some people use Schema, but many do not or have non-validated data, so it's important to work with that
Robin Berjon, Expway: using the XQuery Data Model, both PSVI and the ordinary Infoset can be represented
Oliver Goldman, Adobe: not sure, need to look at that more. Could I round trip through that?
Robin Berjon, Expway: yes
Mike Conner, IBM: problem with off-line access, schema not found - let's not do that
Oliver Goldman, Adobe: same issue for form data defined by a schema
John Schneider, AgileDelta: is that schema in the pdf file now, to show the form?
Larry Masinter, Adobe: no, it just has the form data and the presentation
Stephen Williams gave his position paper.
Microsoft: having the wire form and the internal form the same is very constraining on the applications choice of internal form
Stephen Williams: yes, it needs to be portable, not just a memory dump
Don Brutzman, Web3D: Streaming over the wire - network byte order solves that
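Don Brutzman's point is that a portable binary stream fixes byte order by serializing in big-endian "network byte order" rather than dumping native memory; a minimal Python sketch:

```python
import struct

# A 32-bit length field packed in network (big-endian) byte order;
# '!' in the format string selects network byte order.
value = 0x0A0B0C0D
wire = struct.pack("!I", value)
print(wire.hex())  # 0a0b0c0d

# A receiver on any architecture unpacks the same value.
decoded = struct.unpack("!I", wire)[0]
```

On a little-endian machine a raw memory dump of the same integer would come out byte-reversed, which is exactly the portability problem network byte order avoids.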
unidentified/inaudible: (question about efficiency)
Stephen Williams: there needs to be little overhead between getting the data off the wire and starting to use it. Currently [with text/xml interchange] a whole lot happens: object creation, lots of moves of small amounts of data, pointer creation.
Erik Wilde, ETH Zurich: parsing is expensive, you benefit from locality it seems, so how expensive are things like namespace information which is not necessarily local
Steve Williams: I am not compiling, it's not an issue
John Schneider, AgileDelta: some applications want to preserve namespace prefixes, so those applications are upset by namespace prefix redefinition
Noah Mendelsohn, IBM: Whole identity of the in-memory and on-the-wire forms is at cross purposes to why many of us came to XML - so parties don't have to agree on their APIs and internal models, many of which are preexisting and already deployed. We tried to do this with DCOM and CORBA; now we are using XML because it decouples needing to care about all that byte-level pointer stuff. Your API still has elements and attributes?
Steve Williams: yes, it's like nested objects, most 3GL object oriented languages do this. Can choose to have overhead by mapping to native objects etc at some efficiency cost. Worst case is no worse than best case now, best case is a lot better.
Larry Masinter, Adobe: two types of pushback:
These are contradictory; swapping two bytes is a lot less expensive than parsing a document.
Need to scope the applicability of binary XML to those cases where parsing *is* a significant portion of the work.
Noah Mendelsohn gave IBM's position paper, which was accompanied by a presentation [PDF] by Mike Conner (also of IBM) on CBXML.
Eduardo Pelegri-Llopart, Sun: Throughput - how does it compare to processing the character form?
Mike Conner, IBM: no, because parsing technology is making big leaps forward and production parsers are much faster than stock, free parsers. Code quality and maturity is a major determining factor. [network costs were not significant - 6% - extensive tests. Measuring instructions per character processed]
You can't assume things that the schema has not told you. Table based processing has more predictability and improves performance. Avoid schema-dependent processing (not encoding, but processing).
Doing a lot of UTF-8 to UTF-16 conversions (eg in SAX-based tree traversal) is very slow.
Larry Masinter, Adobe: Examples are all message passing, but XML used in many other cases. Is a CD-ROM full of XML "a message"?
Noah Mendelsohn, IBM: a 30% increase is an insignificant reason for standardising, but could be significant for terabyte-range data. Not convinced that random-access addressing is tightly bound to a binary infoset.
Larry Masinter, Adobe: Not appropriate to look at the considerations independently.
Michael Rys, MS: Random access and compression do relate.
Santiago Pericas-Geertsen and Eduardo Pelegri-Llopart gave the Sun Microsystems position paper.
Note: useful link to X.694 W3C XML Schema to ASN.1 mapping linked from www.itu.int/ITU-T/asn1/database/itu-t/x/x694/2003/
Noah: Is it using Java reflection to do the classes?
Santiago Pericas-Geertsen, Sun: No.
Michael Rys, Microsoft: The protocol encoding is highly based on the schema implementation. How does it handle open content?
Santiago Pericas-Geertsen, Sun: It can be handled, the holes do not have full performance; they can be mixed ok.
Michael Rys, Microsoft: it's a very message-oriented architecture. So you want one standard for the WS application area, even if it's unsuitable for other areas? This will produce fragmentation.
Santiago Pericas-Geertsen, Sun: But WS are only interested in interop with other WS?
(several): No!
Michael Rys, Microsoft: MS, BEA, etc. have interop with textual XML; doing this will bifurcate web services.
Eduardo Pelegri-Llopart, Sun: This is coming from customer pressure, there is a real problem to solve.
Mark Nottingham: Could want to do XML signing, encryption etc and for that it needs to know the element and attribute names, etc. Variability in performance with fallback is less desirable than uniform performance.
David Orchard, BEA: So this is applicable to a part of WS, not even all of it, and you would rather do this than nothing?
Santiago Pericas-Geertsen, Sun: Right.
Eduardo Pelegri-Llopart, Sun: This addresses the needs of our customers, and they would rather do this with a standard. But they need it anyway.
David Orchard, BEA: So if everyone did that, BEA did it and IBM did it, are you prepared to live with someone else's standard?
Eduardo Pelegri-Llopart, Sun: We are not committed to this particular solution.
Noah Mendelsohn, IBM: (Looking at the slide with two multicolored bars - Performance Results, time spent in layers) So, there are four messages in that, roughly a gigahertz processor - we are seeing better numbers than that for textual XML in our applications. You are saying it's taking a million instructions to do the one message? 2000 instructions per character? Your left bar is too big by a factor of 10 or so.
Santiago Pericas-Geertsen, Sun: James Clark wrote about this on xml-dev, removing the whole network layer (scribe missed the point here)
steveibm: there are two optimisations, using the schema in the processing does get you better performance here
Eduardo Pelegri-Llopart, Sun: Have to do a binding to do something, but it's optimised to transfer the objects. Removes the databinding level.
Stephen Williams: We cannot compare these without a standard corpus of test files that are run on different implementations.
steveibm: RMI slows way down when there are deeply nested structures
ad: did you disable Nagle's algorithm, because of throttling effects?
Santiago Pericas-Geertsen, Sun: no, we did not
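For reference, the throttling mentioned here comes from Nagle's algorithm coalescing small writes; it is disabled per-socket with the standard TCP_NODELAY option. A minimal Python sketch (no connection is actually made):

```python
import socket

# Nagle's algorithm coalesces small writes into fewer TCP segments,
# which can throttle benchmarks made of many small request/response
# messages. TCP_NODELAY disables it for a given socket.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(nodelay)  # non-zero once the option is set
sock.close()
```

Benchmarks of small-message throughput can differ noticeably depending on whether this option was set, which is why the question matters.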
Larry Masinter (Adobe): Goal should be to improve the average over likely implementations, not to finely tune one implementation to the max - it would be just one sample point, and not address the wide variety of use cases. As far as creating a working group in W3C, do we need consensus on requirements to form a working group? Or can we find a group who is interested enough in a subset of requirements, and that everyone else agrees to leave them alone to do that work?
(several) laughs
Liam: no, this group is not expected to come to consensus on requirements. Looking for rough consensus on the desirability of further work in this area. The first step would be a requirements document, probably only that, and in a fixed time period. Need to know whether it's generic or industry-sector specific.
Larry Masinter (Adobe): so it might not happen in xml core
Liam: no, XML Core has not expressed interest in that area; a new group would need people experienced in that area, and also we don't want to take time from XML Core's existing workload.
Introduction
Break-out Groups to discuss remaining papers
noon: luncheon
First reports from break-out groups
Break-out Groups to discuss requirements
Summaries of Break-Out Work
These are not in any particular order. During the meeting we tried to merge ones that were obviously the same, but we did not work on consolidating them.
Maintain universal interoperability - all tools, all documents.
Continue to work with existing parsers and tools?
Do not want a domain-specific solution [e.g. wireless]
Efficient storage is important [compactness]
Efficient transmission is important
Support both storage & messaging representations
Want fast decompression, even if it means slower compression
10x faster than 2003 best practice with textual XML
Support parsing on low power device
Must reduce processing time (compared with parsing) including data binding time
Want to be able to create binary thingies directly, not just via pointy brackets
Performance comparable to (or better than) RMI
Must support packaging - bundle multiple files together, maybe with nesting
Want to be able to send deltas [updates, e.g. arbitrary sets of changes]
e.g. with versioned fragments
Must support random access based on infoset (XPath) or other boundaries, e.g. image, page, indexed; indexes built from stream
should be able to update in place based on XPath
Fragments - be able to start reading at fragment boundaries
[e.g. repeated sub-elements for repetitive broadcast]
interchange fragment context (e.g. location in document tree)
Must support progressive downloading [e.g. progressive rendering]
look at header before seeing the whole thing
Support streaming
Want progressive generation (e.g. no packet-length at start)
Want XML Fragment Interchange, e.g. for query results & document subsets
Transmitted format shall be easy to convert to/from native XML Schema data types
Must support XML Security, e.g. via canonical XML & reconstruction thereof
Want arbitrary precision numerical data formats
Want directory at the start, e.g. for allocation
Negotiation: fall back to text format if receiver can't understand binary
Must be able to distinguish text XML from binary format on inspection
Must be clear about MIME media type to be used
Must not rely on HTTP [e.g. file support]
Support one-way communication
Must work with asymmetry in bandwidth
Can use schemas to help encoding
Must support schema version detection, multiple schemas at once
Robustness in face of mismatched schemas [e.g. reader makes right]
May support download of new schemas, or some other way of schema evolution
Not all devices will accept new schemas:
support read-only schemas for recipient
Must support open content, e.g. elements, values, subtrees not in the schema
also want to be able to modify understood items & pass on (forward) subtrees, including elements you didn't understand
Self-describing format
Solution must be mass market, commodity price points
Must support fast access to individual parts of the infoset, [e.g. for headers] mixed mode encoding, some parts compressed
Must support custom & multiple compression schemes for data types
Want to continue to work with SAX and DOM, Pull parsing, e.g. existing APIs
Must require minimal changes to application layer
Should support validation on receiving
Must approach efficiency of hand-coded (binary) formats [as per information theory]
Must work for both data and documents
May be willing to have lossy compression, infoset subset
e.g. might be able to lose comments, processing instructions, whitespace (some applications may be able to lose some kinds of content, e.g. precision on SVG coordinates)
Want round-tripability wrt. canonical XML
Must be able to represent complete infoset
Should support arbitrary extensions to infoset
Consider efficient support for other data structures, e.g. linked list, directed graph, e.g. id/idref optimization to pointers, point to any character, xpointer or something
Consider option to propagate inherited data (e.g. xml:lang, xmlns:) down
Must work equally well on high-frequency stream of small messages that may grow bigger if you use gzip on them; also large files with many small objects, e.g. a million floats
Must minimize bandwidth for both small and large messages
Must be easy to implement
Want to specify the order of serialization... e.g. may want to send header up front, or may want leaf nodes first and may need to inform receiving end
More cycles available for compression than decompression
SOAP important
cell phones have high latency
memory & processor constrained but not power constrained
satellites, large images coming over low bandwidth (and high latency?) and must be archived for between 5 & 50 years
to explore - relationship between storage & message
bandwidth not major concern, memory footprint in the box (CPU), is.
extensibility of Infoset
existing military bandwidth requirements of 300 (effective) baud
Exchange live infosets, e.g. share in-memory representation?
i.e. use the binary format in a database
Access XML-based data from mobile device, wireless link
Store XML in db
Program passes XML to another program
pass live infoset between programs
display SVG on cell phone, PDA
SOAP messaging over wireless (e.g. cell phones)
e.g. customers' data, i.e. not fixed payload
deliver XML to low-powered devices over wireless or other low-bandwidth connection
Store infoset in mqml [?] system
Send infoset between devices on P2P network
Route SOAP messages
XQuery across multiple nodes
Access any page in large document
signing documents
Editing a portion of a document
Storing large documents
Annotating a document
adjournment (end of day two)
Do we go ahead? If so, what needs to be done? Use cases, test data, benchmarks?
Minutes for Friday were taken by Anish Karmarkar (assisted by Don Brutzman)
Should W3C do more work in this area?
Do we go ahead? If so, what needs to be done? Use cases, test data, benchmarks?
Jonathan Marsh (Microsoft) asked if we could have a discussion about people's view of the future, and the ramifications of not going ahead.
Jonathan: I don't have a good understanding as to what each of you want out of this workshop. I definitely want this to be part of XML. Products should be able to consume both formats, across the internet. Another option is that there are certain industries where there is a gateway between the internet and the private community. There could be common gateways which understand a common format. A third view is fast Web services.
Jim Trezzo: There could be a phased adoption. It could start out with isolated communities and may end up with wider adoption. The ultimate goal remains total adoption.
Eduardo: Total adoption may never happen.
John Schneider: what people want to achieve is to bring the economic benefits and interoperability benefits of XML to areas which cannot benefit from XML right now, e.g. wireless. I would expect that these areas (where the pain is the most) would be the first adopters. Once this is available, other areas would adopt it as well. Not everyone is going to switch in one day. Migration is going to happen.
[We need a migration plan. This is not as difficult as it might sound. I've done it in other similar domains.]
Michael (IBM): one of the criteria for getting this right is to have the right tools, where I don't care whether on disk it is binary or text. We do need to factor in (for transition) -- if binary XML is a solution for, say, footprint reduction, then such a solution requires that text XML not be supported.
John Schneider: you need to have some tool to read the binary format for human consumption. Unicode and ASCII have a lot more tools right now than binary. I don't want to lose the ability to read XML and put it in specs, to write human-readable examples on a whiteboard so people can read them, or put XML examples in books. We will always need the text XML encoding for human accessibility.
Prof. Kimmo Raatikainen: XML will be used more and more for machine-to-machine interaction. We should look at how trends in CPU speeds, memory, etc. are evolving. They don't evolve in the same way.
Robin: footprint problem - those devices would only support binary. I don't see that as a problem.
Show of hands: who has edited XML using non-XML tools? Response: overwhelming.
Anish: Jonathan -- in the options that you outlined, are you assuming that we will be able to come up with a single binary format that will satisfy everyone? It is not clear to me that we would indeed be able to do that.
Glen Adams: I am more in the gated-community camp, which would possibly not be connected to the internet.
Stephen Williams: perhaps a standard framework is the way to go
Don Brutzman: binary XML is not just for wireless. The benefits of binary XML should not be restricted to small areas such as wireless.
Craig Bruce: if you take a text file and gzip it, then you have a binary file.
MarkN: but one has to go through that tool before editing it.
Robin: but gzip is lossless.
Noah: we should not fudge the different requirements which are all over the place and conflicting. We have to be careful about finding a sweet spot. Competing requirements.
MarkN: if we want ubiquitous support, we may have to compromise on other properties. What are the trade-offs?
Liam: discussion on trade-offs is a discussion on core requirements.
Eduardo: schema v. no-schema - all schema proposals have to deal with the no-schema case as well.
Liam: a related question is, what if we don't go ahead? What are the options if the verdict of the W3C is no? It seems to me that we can draw up a list of options -
View of the future
Jonathan Marsh of Microsoft
what if we don't go ahead?
everyone uses text only
many binary formats that don't interoperate, fragmentation
what if we go ahead?
Noah: a possible answer to the tower of Babel. Binary XML may converge, but it is not XML; it may be compatible with XML.
MarkN: use cases and applications that don't find text XML useful won't use it until parsers and tools get efficient.
Liam: we have more and more layers, and with time things don't necessarily get faster (e.g. in 1989 window creation was faster than it is now).
Eduardo: there already exist ISO and other standards. The new future scenario is: we would really like to see that existing efforts do not get hosed by new standards from W3C. If we have a binary standard (even if it isn't from W3C), we will have dissonance. Everybody loses then.
Alessandro Triglia: there is a set of ISO/ITU-T standards already available. I want to stress that there are standards available now. If W3C does not do anything, people can already use those standards. They are final committee drafts.
Glen: a corollary to (b) is that W3C will effectively lose control.
Selim Balcisoy: option (b) is important to us. If we don't have a successful binary format then we lose the advantage of text XML as well. You may lose a substantial part of the web.
Liam: as soon as you have end-to-end non-XML stuff going on, then you end the story.
Selim: we do support XML right now, but the question is how long we can do that. There is a heavy burden on the devices. This is half of the web. What happens if we don't have a single binary implementation? There is a standard for mobile devices - 3GPP. But they are looking to W3C to do this.
Don: if we say half of the data on the internet is non-structured (from an XML point of view), then all the XML tools/standards (from XSLT up to the Semantic Web) cannot operate on that data and we have partitioning.
Liam: maybe a binary XML format would make sense to people who want to store movies in XML. We are fragmenting it on one hand and, on the other hand, we are expanding it.
[the background here is that someone in the film industry had approached Liam asking about representing entire movies in XML. This might become feasible if there were some more compact and efficient representation than textual XML.]
MarkN: spend some time capturing the risks of doing this.
Craig: wrt representing movies in XML, with my open-source implementation I get good compaction and performance. A binary representation of a movie can be embedded in an XML document.
MarkN: what is the purpose of this? You cannot use XPath, for example, on it.
Liam: embedding is a good use case.
Robin: the SVG WG is dealing with this as well. If we have an XML representation of raster images then we can use XML tools.
Liam: we seem to have two groups: one who says this makes great sense, whereas the other group says this is crazy.
Eduardo: if a binary format came from one vendor then the other vendors suffer.
Noah: I would like to respond to what MarkN said - what could go wrong if we go ahead? I would like to reinforce that we are trying to chase the technology curve, e.g. will cell phones be really fast or not? It is worth looking at, but not decidable. I don't buy that over time things get worse (Liam's example); I don't think things like that happen in this area. Look at the Ethernet standards. One of the risks is that we won't need it as badly as we think we do. This is what happened with non-TCP/IP. This is a risk.
Santiago: the approach -- wait for Moore's law to kick in. But that is not the best way to go, 'cause we may need this for other devices such as Java Card.
Arnaud Le Hors: look at the past to predict the future. We were looking at getting HTML to cell phones; people did not think that would happen, but it has.
Selim Balcisoy: bandwidth does not follow Moore's law. Nor does battery life. We will not have one browser, but multiple applications. So there will be a constant need to have more (for the wireless industry); a binary format is one way to get there.
John: wrt Moore's law - +1 to Selim's comment. Expectations: when I sit in front of my desktop, I have certain expectations. For mobile devices it's different. I will always want to squeeze more out of my mobile device to come closer to matching the capabilities of my desktop machine. Also, don't forget the tragedy of the commons: as more computing power becomes freely available, our computing needs rapidly expand to use the available power.
David Orchard (BEA): 2 points. Let us assume that we go through a benchmarking exercise and it comes out that it does not make that much of a difference in most cases. This might be a perception issue. My point is: if we have a benchmark that says that it is not a big deal, we can convince the world that it is not necessary. Second point, customers say we need faster XML. They are going to do something anyway (wrt faster XML). They don't necessarily make the best choices. One advantage of having a single spec (even if it is not the best solution): at least it is "one way" and a standard, and economies of scale may kick in.
MarkN: a tool that provides performance numbers may be useful for comparison. Some of the solutions actually add more functionality which does not have anything to do with binary encoding. There might be a need for a reworking of XML specs/APIs.
Oliver: 3a. is a dangerous argument. Compactness is considered important; that can be addressed by technology getting faster. But as machines get faster, people produce bigger and bigger documents. There are certain requirements that do need to be addressed, and we cannot assume that technology will take care of them.
David Orchard (BEA): there are certain aspects that are not addressed by Moore's law, such as random access to documents. Non-linear improvements are worth focusing on.
Noah: XML is tree-based and I am not sure if the community needs random access. We need to think about this. Some improvements such as random access may affect how we consider using XML, not just binary XML.
Liam: one way to do this is to look at what happens in DB. For example, there isn't a common table format but a common query language. This may be applicable to random access.
Liam: we can do something more or better. And we could do something that we could not do before. There is a difference.
Eduardo: We should separate things like binary XML and fragmentation (which are non-binary xml requirements)
Johnathan: there are a lot of problems, such as diluting the XML brand or interop. But one thing we need to consider is that we are not going to satisfy all the requirements, so we are going to disappoint someone and they will come up with their own format anyway.
Liam: we need to get into the discussion of where do we go from here/next step?
Liam: anyone here thinks there should not be any further discussion?
(no one raised their hands)
Liam: we should have a forum for this discussion. I would like to propose that we create a new W3C list for such a discussion.
ACTION: Liam to create a public W3C mailing list for discussion
John: should we document the objectives and deliverables for the list?
Noah: a lot of people are saying: can W3C do some really serious work in this area? Beyond discussing this, how do you write a charter for the WG? There is a debate as to whether anything should be chartered at all. Part of writing the charter is to figure out if it is a bad idea. Perhaps a task force is the way to go. The debate will surely happen in the AC. I am not happy with the idea that we should have a WG to figure out if we should have a WG; this will result in a lot of delay. Or saying that there is disagreement and the AC will do what it does. We could find a process for deciding, and this requires more than just a ML. We need someone from W3C listening in, figuring out where this is going.
David Orchard (BEA): what we need to do to decide the yes/no question for a WG: it is highly likely that we will need to do some kind of benchmarking. There is push back on this because of the time required to do it. What the next deliverable is matters more than the mechanism for doing this.
Noah: one other aspect is that we have to involve a broader range of people than are involved here in this room.
Eduardo: I don't think we should do benchmarks. Maybe micro-benchmarks or experiments, so as to convince people. It would be good to focus on the deliverables rather than the mechanism, but the mechanism is tied to the deliverables.
Jim Trezzo: the core people don't want to spend all their time arguing about this with the world. All we need is to show feasibility rather than benchmarks. We need to demonstrate that it is not a research project. And we want to do this really fast.
Liam: we could actually have a second event, where we discuss the findings. Similar to an interop workshop.
JimT: I don't want to take this too far. This may get too competitive.
David Orchard (BEA): various companies have benchmarks but they are not comparable (as they are not normalized).
MarkN: a mailing list is a horrible place to discuss a charter, as this requires some hard decisions. We should document the state of the community: use cases for binary formats and use cases that don't need them.
Nokia: I have concerns with benchmarks. Benchmarks do not deal with solutions that may come in the future; they should be for information purposes only. I would like to see some deadlines. A public email list is not the best way.
Liam: I don't think we mean to choose the one with the fastest benchmark. It is for demonstration purposes and for comparison.
Margaret Green: there are other avenues than email - blogging, webspace, a Wiki.
Liam: mailing list is good to have everyone's opinions heard and is a possibly good process to bring something to the attention of the AC. I don't want to get bogged down with the mechanism. Archives are helpful for mailing list.
Margaret: as far as measurements/benchmarks go, they should include a range of devices and domains. We should be able to demonstrate the boundaries of what can be solved.
Stephen: we should use a Wiki for collaboration. Measurements should acknowledge that there will be some discussion on possible approaches.
Liam: some of the W3C IGs have had MLs where people were asked to stick to a particular topic for a period of time and then move to a different topic after some time.
Don: big picture objective - near term - IG/WG so that we have a process to guide us. In the long term, I would suggest a goal. If we consider XML to be structured data, then the over-arching goal is - would XML be text+binary or would it be just text?
John: about benchmarks - we should compare apples to apples. The danger is that if I have a super-fast parser and compare it with binary XML built on Xerces/DOM, we don't have a meaningful comparison. Comparing numbers from different vendors doesn't really mean anything.
Liam: it is entirely possible that the act of saying "we need less bandwidth, memory etc." may prompt people to improve their parsers. There are DOM implementations that use two orders of magnitude more memory than others. Clearly one can push back on boundaries.
John: the point I am trying to make is that it is really hard to compare benchmarks without a controlled environment. This is not just limited to the hardware but also the software that is used.
Heimo Laamanen: measurements on top of different systems are very different (GPRS etc).
Liam: quite right. We need to have a set of data. We are not trying to benchmark implementations, but to see whether text (or binary) does better.
David Orchard (BEA): if we don't normalize the data sets and environments, the AC/public is going to poke holes in the benchmarks.
Liam: this will also result in challenging people to do better with text xml.
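[On controlled comparison: the normalized measurement being asked for can be sketched in a few lines. The sketch below (Python; numbers are illustrative of method only, not of any product) holds the document, parser and machine fixed and varies only the wire format, plain text vs. gzip, which is the apples-to-apples setup John and David describe.]

```python
# Illustrative benchmark harness: same document, same parser, same machine;
# only the wire format (plain vs. gzip) varies. Not a claim about any product.
import gzip
import timeit
import xml.etree.ElementTree as ET

# One shared, normalized data set (tiny here, for illustration only).
doc = "<root>" + "".join(f"<item id='{i}'>value</item>" for i in range(1000)) + "</root>"
plain = doc.encode("utf-8")
packed = gzip.compress(plain)

def parse_plain():
    ET.fromstring(plain)

def parse_packed():
    # Pay the decompression cost before parsing, as a receiver would.
    ET.fromstring(gzip.decompress(packed))

size_ratio = len(packed) / len(plain)  # bandwidth saved by gzip
t_plain = timeit.timeit(parse_plain, number=50)
t_packed = timeit.timeit(parse_packed, number=50)
print(f"gzip/plain size ratio: {size_ratio:.2f}; parse time {t_plain:.3f}s vs {t_packed:.3f}s")
```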
Stephen: test framework for various systems (wireless/wired etc) would be useful. There is some value in looking at what it would take in terms of the programs.
MarkN: all these things (benchmarks, test framework, measurements) require a WG to come to a consensus on. An IG is not going to come to a consensus.
Liam: WG - there are IPR issues, commitment requirements.
Arnaud: typically WGs have deliverables that are rec track. Participation requirements are much harder: attendance, standing etc. There are commitments.
Liam: I should point out that not all WG have rec deliverables.
MarkN: WG has IPR requirements and that gives me more comfort.
Don: people who care about this should read the process docs. I agree that a WG is more appropriate. Perhaps we can start with an IG and progress to a WG. Goal is a WG; fallback is an IG for continued deliberate work towards a WG. A ML alone is a waste of time.
Dmitry: we need to decide on two deadlines: for the IG/ML and another for the WG/rec/standards. The IG/ML deadline should be short.
David Orchard (BEA): IPR disclosures are required for a WG and not for an IG.
Liam: in principle we can invent something like an RF IG. Or make it a task force under the auspices of an existing WG.
David Orchard (BEA): the reason WG makes sense, is that a commitment requirement maps to a WG. Having said that, if IBM could not live with a WG but will live with IG, then we need to be flexible to be inclusive.
Arnaud: we have moved from a discussion about whether to have an IG to how to form a WG. But we don't know what the charter is. I don't think this work justifies a charter. We need a discussion on that.
Liam: if we came up with a charter (assume that for a minute) for binary XML, how many people would participate? (This requires commitment.)
some hands raised
Liam: how many people would not participate in a WG but in a ML/IG/etc?
a very few hands were raised
Liam: anyone would oppose a WG creation? Or cannot live with the question?
Johnathan: there are a lot of voices that have not been heard that are outside this room. this is not an easy question to answer.
no hands raised
Liam: are there people here who think W3C should not do further work?
no hands raised
Liam: consensus that W3C should do further work in this area.
Kimmo Raatikainen: can we discuss the deadline for a draft charter before we break?
Liam: perhaps before the AC meeting in Nov, or in May. May seems too far to me. It makes sense to aim for the AC meeting.
Workshop Adjourned.
Forum for further discussion
ACTION: Liam will create a public W3C mailing list for discussion
We need either a WG or IG to:
Volunteers for drafting a possible WG charter for Nov (to figure out what to do with binary XML): Mark Nottingham (BEA), Robin Berjon (Expway), Stephen Williams, Santiago Pericas-Geertsen (Sun), Selim Balcisoy (Nokia), Kimmo Raatikainen (University of Helsinki), John Schneider (AgileDelta), Alex Danilo (Canon), Don Brutzman (Web3D Consortium), Mike Cokus (Mitre)
Representative: Liam Quin
Representative: Chris Lilley
Representative: Philippe Le Hegaret
Representative: Eduardo Pelegri-Llopart
Representative: Santiago Pericas-Geertsen
Representative: Cédric Thiénot
Representative: Robin Berjon
Representative: Mike Champion
Representative: Trevor Ford
Representative: Michael (Mike) Conner
Representative: Noah Mendelsohn
Representative: Jimmy Zhang
Representative: Kevin Lovette
Representative: Andrew Graham
Representative: Bjørn Reese
Representative: Erik Wilde
Representative: Bill Eller
Representative: Krissa Ross
Representative: Mike Cokus
Representative: Dr. Scott Renner
Representative: Michael Leventhal
Representative: Eric Lemoine
Representative: John Schneider
Representative: Jim Trezzo
Representative: Alessandro Triglia
Representative: Michael Marchegay
Representative: Kazunori Matsumoto
Representative: Takanari Hayama, Ph.D
Representative: Don Brutzman
Representative: Alan D. Hudson
This activity will consume 30% of the time of one W3C staff member for chairing the workshop, and 10% of the time of [the same] W3C staff member for managing the workshop website. This workshop is part of the W3C XML Activity.
Copyright © 1996-2003 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.