XML Processing Model WG f2f, morning session

Henry S. Thompson
28 Feb 2006

1.   Attendance

Eric Bruchez, Rui Lopes, Murray Maloney, Alex Milowski, Henry S. Thompson, Richard Tobin, Norm Walsh, Konrad Lenz (observer)

2.   Pipeline system design

RT: I'm concerned we define what we're doing at a suitable level of abstraction, not getting distracted by the problems of designing an XML serialisation of a definition.

RT: So I'd say something like

A component has named inputs, named outputs and named parameters. A step is a binding of components to pipes, so that every component input is bound to output of a single pipe, and every component output is bound to at least one pipe input. A program is a connected collection of steps.
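
A minimal sketch of RT's definition as a data structure. The Python names (Component, Step, unbound_ports, is_connected) and the pipe identifiers are hypothetical, chosen for illustration; nothing here was proposed in the meeting.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    inputs: tuple = ()      # named inputs
    outputs: tuple = ()     # named outputs
    parameters: tuple = ()  # named parameters

@dataclass
class Step:
    component: Component
    # each input name is bound to exactly one pipe
    input_pipes: dict = field(default_factory=dict)
    # each output name is bound to a list of at least one pipe
    output_pipes: dict = field(default_factory=dict)

    def unbound_ports(self):
        """Names of inputs/outputs that violate the binding rule."""
        bad_in = [i for i in self.component.inputs if i not in self.input_pipes]
        bad_out = [o for o in self.component.outputs if not self.output_pipes.get(o)]
        return bad_in + bad_out

def pipes_of(step):
    """All pipe identifiers a step touches."""
    pipes = set(step.input_pipes.values())
    for bound in step.output_pipes.values():
        pipes.update(bound)
    return pipes

def is_connected(steps):
    """A program is a connected collection of steps (linked via shared pipes)."""
    if not steps:
        return True
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j, other in enumerate(steps):
            if j not in seen and pipes_of(steps[i]) & pipes_of(other):
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(steps)

parse = Component("parse", outputs=("doc",))
xslt = Component("xslt", inputs=("source", "stylesheet"),
                 outputs=("result",), parameters=("logo-uri",))
parse_step = Step(parse, {}, {"doc": ["p1"]})
xslt_step = Step(xslt, {"source": "p1", "stylesheet": "p2"}, {"result": ["p3"]})
program = [parse_step, xslt_step]
```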

RT: That's just an example of the level at which I think we should focus.

RL: What about a central repository for inputs, stylesheets, schema docs, etc.

[various]: discussion about why only one input, aggregation as a special important functionality.

EB: We use XInclude plus a pipeline-local URI scheme to manage connections in a generic way.

HST: [simple data flow vs. RM with locally-named infosets]

MM: Yes, I tried to tell a similar story about a global infoset, but NW pointed out that you can't have multiple DIIs in an infoset, so it became a multi-infoset, plus indeed named parameters with values.

RT: A benefit of the pure pipe story is it removes a class of mutability problems, which do arise when you have some kind of a blackboard story.

MM: So I've got two outputs, one is a sequence of chapters, the other is a provisional TOC. The first goes through several more steps, the TOC skips several. How does this work?

HST, RT: [Coloured picture]

KL: [CSS question]

NW: Components can do http GETs anytime they want to.

AM: I'm not happy with just saying multiple pipes can just be plugged into a single output. I think we need some kind of T construct.

EB: Really a question of syntax -- if it's easy to do this with a T, that's what will happen [scribe not sure he got this right]

RT: I agree that at the implementation level there will be something like a T, but conceptually it's just multiplexing the output.
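
One way to picture RT's point that a T is just multiplexing: a sketch using Python iterators as stand-ins for pipes. The function name multiplex and the sample documents are assumptions made for illustration only.

```python
from itertools import tee

def multiplex(output, n):
    """Return n independent readers over a single component output.

    itertools.tee buffers items internally, which is roughly what an
    implementation-level T would have to do for the slower consumer.
    """
    return tee(output, n)

# Documents are modelled as plain strings here.
chapters = iter(["<chapter n='1'/>", "<chapter n='2'/>"])
to_toc, to_formatter = multiplex(chapters, 2)
```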

KL: [Signature question]

MM: Sometimes it's really a pipe, but other times it will be a manufacturing process, with stuff going off-site and arriving from off-site and . . . We should just give up the metaphor.

NW: I think it still works.

RT: It's just like UN*X pipes.

MM: Not so, they are much more linear than what we're talking about.

MM: Real (fuel) pipelines are very different from this [example]

RT: The analogy is a bit strained, yes -- real pipelines don't transform things to the extent we're talking about.

EB: Is the terminology really a problem? I don't think so.

RT: Let's agree that we'll use the term without prejudice, no implication that any particular property of real pipelines carries over.

KL: Note if you don't provide a way for encapsulating all the inputs to a pipe you can't package it and re-use it reliably. You need a single source of documents for that.

HST: The web is that single point.

KL: But the web changes.

RT: Yes, but that's not a problem for most of us, it's a problem for your [security/signature] domain, maybe we can help, but not sure.

NW: So I understand and like most of [picture], but what if a component produces a number of outputs determined by its input. How does that work?

EB: When the result docs all go to a single pipe, we can accommodate this by supporting document streams in pipes, as previously discussed. If you want to send them to different pipes, we don't know how to do this.

RT: Three possibilities:

MM: If only XML documents flow through a pipe, surely we know what's happening, it's not like UNIX where EOF is really the end of it, there's only one document tree.

HST, others: [distraction about post-end-tag comments and PIs]

HST: Well, my experience is that some components are [streaming], and these normally work without change for sequence input as for single doc input, but other components are [not-streaming], they produce no output until all their input is finished, and they don't therefore manage document sequences badly.
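
The distinction HST draws can be sketched with Python generators, with strings standing in for documents. The component names and the trace helper are hypothetical; the point is only the order of reads and writes.

```python
def streaming_upper(docs):
    """Streaming: emits a result for each document as soon as it is read."""
    for doc in docs:
        yield doc.upper()

def non_streaming_sorted(docs):
    """Non-streaming: produces nothing until all input has been consumed."""
    buffered = list(docs)
    yield from sorted(buffered)

def trace(component, inputs):
    """Record the interleaving of reads and writes for a component."""
    events = []

    def source():
        for doc in inputs:
            events.append(("read", doc))
            yield doc

    for out in component(source()):
        events.append(("write", out))
    return events

print(trace(streaming_upper, ["<b/>", "<a/>"]))
print(trace(non_streaming_sorted, ["<b/>", "<a/>"]))
```

The first trace alternates reads and writes; the second shows every read before the first write.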

EB: That's why I like the XPath 2 data-model, it gives us a type system in which sequence of documents is already present. We may not actually need sequences of integers. . .

HST: We don't need all that complexity, but the subset which really only uses doc and set-of-doc to 'type' each input socket and output socket is certainly close to what we need.

EB: I like the way this is going. It doesn't use the auto-for-each approach we have in XPL, but I think that was probably a mistake anyway.

MM: Aren't we obliged by the W3C Process to re-use the XDM?

HST: We have at least DOM, Infoset and XDM to choose from. At the moment this is moot, because all we're saying so far is we like the doc/seq-of-doc abstraction from XDM. We're not talking about requiring support for the whole thing, e.g. [integer, foo elt, bar attr].

EB: Well, I'd like to look forward to being able to extend a bit in that direction. . .

KL: Portability will require a standard serialisation, won't it?

NW: We're not going to go there, if you want reproducible serialisations you need to add that.

RT: We certainly should encourage implementations to document how they pass data down pipes, so it's easy for people to write new components.

MM: Is it a standard property of infoset builders that you can parse and serialise with them and then reparse and get the same infoset back?

[various]: [No]

EB: Back to the problem of sequence of documents.

NW: Well, sounds to me that we've got a story based on type signatures for components wrt doc/seq-of-doc on each input and output.

RT: Consensus that we do want to support seq-of-doc.

NW: So a component has fixed number of inputs, ditto outputs, each either doc or seq-of-doc.

RT: And parameters
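
A sketch of what such a type signature might look like. The names PortType, Signature and can_connect are hypothetical, and the compatibility rule (a doc may feed a seq-of-doc input as a sequence of length one, but not the reverse) is an assumption; the meeting did not settle it.

```python
from dataclasses import dataclass
from enum import Enum

class PortType(Enum):
    DOC = "doc"
    SEQ_OF_DOC = "seq-of-doc"

@dataclass
class Signature:
    inputs: dict            # input name -> PortType (fixed number of inputs)
    outputs: dict           # output name -> PortType (fixed number of outputs)
    parameters: tuple = ()  # parameter names

def can_connect(out_type, in_type):
    """Assumed rule: types must match, or a doc is fed to a seq-of-doc input."""
    return out_type is in_type or in_type is PortType.SEQ_OF_DOC

xslt = Signature(
    inputs={"source": PortType.DOC, "stylesheet": PortType.DOC},
    outputs={"result": PortType.DOC},
    parameters=("logo-uri",),
)
split = Signature(
    inputs={"source": PortType.DOC},
    outputs={"chapters": PortType.SEQ_OF_DOC},
)
```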

EB: Well, I'd prefer to look again at the XDM, not for true static parameters, but for 'computed' parameters we should use the full power of the XDM, so we can say some component produces an integer, not a document. That could then connect to some other component input which needs an integer. This gets rid of the distinction between inputs and parameters. What else can we do for this sort of thing -- hacking the infoset seems quite wrong.

RT: I don't want to have to support that kind of computed parameters.

NW: Well, there's a conceptual problem here we have to confront: I have an XSLT step with document and stylesheet inputs and [corporate logo image uri] parameter, which is to be computed earlier in the pipe. We have no story about this.

HST: Eric's is a proposal on the table for this, right?

AM: Define the notion of scopes, and allow binding of names to values within a scope.

HST: I proposed something like that on the last call -- provide an API which allows you to post name-value bindings

MM: Where do the values come from?

AM: In my simple example, I extracted it with an XPath from a document. More complex cases, it could be computed.

HST: Imagine a component with two fixed parameters, a name and an XPath, and its function is to bind the name in the surrounding scope to the value of the XPath on the input document. [what it does in case of a seq-of-doc is an interesting question]

RT: None of this is necessary, just write a component which has two inputs, doc and image URI, and produces a carrier doc with both in it, and the next step is XSLT which extracts what it needs and does the right thing.

NW: No, I don't want to have to hack the stylesheet, it already works with a param.

EB: I've sent an illustration of a similar workaround, in which the first component generates a simple XSLT stylesheet which binds the parameter and imports the original stylesheet, then the next step just runs it.
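
A sketch of the kind of generated stylesheet EB describes, not his actual illustration. The function name, the file name original.xsl and the parameter name logo-uri are assumptions. It relies on the XSLT rule that definitions in the importing stylesheet take precedence over those in the imported one.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

def wrapper_stylesheet(original_href, param_name, value):
    """Generate a stylesheet that binds one parameter and imports the original.

    xsl:import must come first; the top-level xsl:param declared here then
    overrides the declaration of the same name in the imported stylesheet.
    """
    return (
        f'<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">\n'
        f'  <xsl:import href={quoteattr(original_href)}/>\n'
        f'  <xsl:param name={quoteattr(param_name)}>{escape(value)}</xsl:param>\n'
        f'</xsl:stylesheet>\n'
    )

generated = wrapper_stylesheet(
    "original.xsl", "logo-uri", "http://example.org/logo.png?a=1&b=2")
root = ET.fromstring(generated)  # check that the result is well-formed
print(generated)
```

The next step in the pipe would then run the generated stylesheet on the input document unchanged.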

AM: This all works for XSLT but doesn't generalise.

NW, EB: [New example which the scribe didn't get]

RT: What about a generic wrapper component which has n+1 inputs, takes one of them to provide bindings, and passes on the rest to a wrapped component.

AM: That's what I was talking about too. . .

HST: OK, so that's not the same as what MM and I were proposing, which is more API-orientated. . .

NW: What's crucial is that we're extracting the image uri from an input.

EB: So I don't understand how Alex's blackboard story works

HST: Stop, Alex's story is simpler than the blackboard story, more like RT's wrapper story. It's a wrapper which takes as static parameter a set of name/XPath pairs, and it binds the names to the values of the XPaths wrt a sub-pipeline.
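
A sketch of the wrapper as HST summarises it, assuming a sub-pipeline can be modelled as a function of a document and a dictionary of parameters. The names let and fake_xslt are hypothetical, and ElementTree's findtext supports only a limited XPath subset.

```python
import xml.etree.ElementTree as ET

def let(bindings, subpipeline):
    """Wrap a sub-pipeline with a static set of name/XPath pairs.

    All bindings are evaluated against the input document before the
    sub-pipeline starts, so the wrapped pipeline only sees fixed values.
    """
    def run(doc):
        params = {name: doc.findtext(path) for name, path in bindings.items()}
        return subpipeline(doc, params)
    return run

def fake_xslt(doc, params):
    """Stand-in for an XSLT step that already takes a 'logo-uri' parameter."""
    return f"{doc.findtext('title')} [{params['logo-uri']}]"

doc = ET.fromstring(
    "<book><meta><logo>http://example.org/logo.png</logo></meta>"
    "<title>Pipelines</title></book>")
step = let({"logo-uri": "meta/logo"}, fake_xslt)
print(step(doc))
```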

EB: Why not just do this for a single component? [scribe missed some discussion]

NW: So I don't have n inputs, I have n+m for m parameter inputs?

varia: No, n+1, a parameter stream

RT: In my wrapper example, I assumed a single param setting input.

HST: I don't like the idea of a parameter stream -- I can live with either a 'let' binding construct, or a readable/writable blackboard.

MM: So the scoping idea gives us a space within which the parameters are set.

NW: So from a components perspective, all parameter bindings are static, just some are local, some are let-bound and some are pipeline-invocation-bound.

RT: I don't like the blackboard story, it's too dangerous, asynchronous update/query, etc. I much prefer the let story, where once things are set you're done.

HST: But note this means that the let story breaks streaming, because you can't start the scoped pipeline until all the bindings are set.

[Coffee break]

EB: Still don't understand why you need a scope across more than one component

HST: Everywhere you can have a component you can have a pipeline. Suppose there's an input that provides a binding that doesn't survive to the place it's needed.

EB: So fork the document

HST: That's a huge amount of work

RT: How -- the param doc is tiny

HST: No, it's the whole input doc that has to be forked.

AM: The let proposal just streams things straight through.

HST: Point of order -- I think we need to see written designs to forestall the kinds of misunderstandings I keep seeing.

EB: I'd like to keep talking

NW: How about EB and AM put their heads together to clarify for each other what their proposals are, then bring it back to us.

RT: Question for AM -- are the let-bound names the same as the static names of parameters? What if two components have a parameter of the same name?

AM: Then you need to separate let scopes if you need separate bindings for a param with the same name, but this rarely happens.

KL: What about hiding?

AM: Nested scopes, yes, standard story.
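
The standard nested-scope story, sketched with Python's ChainMap. The parameter names and values are made up for illustration.

```python
from collections import ChainMap

# Outer let scope binds two parameters.
outer = ChainMap({"logo-uri": "http://example.org/a.png", "lang": "en"})

# An inner let scope rebinds one name; the outer binding is hidden, not changed.
inner = outer.new_child({"logo-uri": "http://example.org/b.png"})

print(inner["logo-uri"])  # inner binding hides the outer one
print(inner["lang"])      # unbound names fall through to the outer scope
print(outer["logo-uri"])  # outer scope is unaffected
```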

3.   Conditionals

NW: So, two branches. How do we choose a branch -- XPath expression?

HST: Major dichotomy: A pipeline language construct, which uses XPaths to switch, or just an exploitation of some kind of try-catch, so choice is achieved by failure and fallback

RT: Third possibility -- pipeline construct, but using component success or failure, not an XPath

RT: But are we clear that the output on all branches is the input?

EB: No, I showed use cases yesterday where that isn't sensible.

RT: Yeah, if you're chaining backward it certainly shouldn't be necessary.

NW: [example of case where switched sub-pipes don't use the input which fed the choice]

EB: Don't understand RT's component-based 'if'.

HST: It's like LISP if -- three arguments, eval the first, if it succeeds, eval the second, failing which the third.

HST: For pipelines, construct with a component and two subpipes -- run the component on stdin, if it succeeds, run first subpipe, otherwise second.
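
A sketch of that construct, assuming pipes are functions from a document to a document and that component success is modelled as a true return value. The names are hypothetical. The sketch also assumes both sub-pipes read the same stdin as the test component, which the discussion that follows treats as an open question.

```python
def if_construct(test, then_pipe, else_pipe):
    """Run test on stdin; on success run the first sub-pipe, else the second."""
    def run(doc):
        branch = then_pipe if test(doc) else else_pipe
        return branch(doc)
    return run

# Strings stand in for documents; these components are illustrative only.
def has_root_element(doc):
    return doc.lstrip().startswith("<")

def pass_through(doc):
    return doc

def wrap_as_text(doc):
    return f"<text>{doc}</text>"

pipe = if_construct(has_root_element, pass_through, wrap_as_text)
print(pipe("<doc/>"))
print(pipe("plain words"))
```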

AM: This is where push and pull come in to play.

HST: Hope that's hidden at this level -- isn't this just a special case of the general observation that at the implementation level you can have at most one pushed (callback) input?

RT: This points to another issue: we have a condition with two alternative sub-pipes -- flow of control may be separate from flow of stdin [picture (if cond [s1 s2] [s3 s4]) where it's s2 which takes stdin from the stdin of the pipe].

EB: So we could say 'if' only has flow of control, it has no output, if the sub-pipes need the input to the 'if' it has to have been dupped.

RT: We could do it that way. Note that if the 'if' uses a component to do the test, it's that component which gets the input.

NW: Seems to me XPath-based and component-failure-based are very different.

RT: You can reconstruct the first using the second.

EB, HST: Need to distinguish between failure and 'abort'

NW: Seems to me try-catch, e.g. recovering from an exception, is very different from testing a condition.

HST: Well, schema validation is a useful example -- we have implemented this as schema-validate, which may abort, but not fail. We then have a PSVI-testing component which succeeds or fails and switches the pipeline as a result. That's analogous to me catching a segfault on one branch and running a separate branch.

NW: I want to take a hard line in v1 -- no protection against seg-fault, if a component dies the whole pipe dies. No try-catch, we just do 'if'.

RT: I should have said "has different exit status" as opposed to "succeeds or fails", so reconstruct as 'if' controlled by exit status of a component.

: Nothing to stop having a component which runs a pipe and converts abort into failure and success into 'carry on'. . .

AM: I really want try-catch for implementing a web service, because a good citizen web service always produces an XML output, and I want to do that within my pipeline.

NW: But surely there is a layer of software above the pipeline engine which can handle that.

HST: Second that.

KL: Do you have in mind to escalate the failure in a controlled way, as in throw inside catch?

HST: RT, how do I get at stdout from the subpipe if it doesn't abort?

RT: 'if' doesn't have a component 'inside' the 'if' construct, rather 'if' takes as input a true/false doc produced by a 'test' component. So in the case you ask about, the 'test' component wraps a subpipe, and that subpipe's output goes to the failure branch, the 'test' component output goes to the 'if' component.

EB: In most use cases, I haven't found exception handling to be necessary. We don't have use cases.

HST: Second that -- MT Pipeline put a lot of work into failure domains, but we don't use them much.

AM: I use them all the time, at the top level, for web services.

HST: We did it for interaction with database systems, haven't built any of those, but still think we will need that if database interaction is important.

AM: That's the other case when I use try-catch.

NW: You're just not going to be helped in V1

HST: Agree.

NW: We could have a global "use this error document on failure" phenom.

EB: We should try to at least prepare the ground for what will come in this area, so AM can look forward to something.

NW: OK, but not at the expense of shipping in 10 months

HST: Agree.

MM: What happens to stderr -- that is, I have a 17-step pipeline, each step produces error/debug statements, where do they go? Do we have a story about this at all?

AM, others: No, not yet, we should.

NW: I'm content that it's completely out-of-band. E.g. xsl:message goes to some log file, which is not accessible from the pipeline.

RT: error output should never go to stdin of next component.

RT: We could say that every component has an error socket, which the pipeline author may plumb somewhere, but not required to.

MM: It's broken to lose stderr.

HST: Yeah, but server-side that's what happens, and it's OK, because it's for devs, not users.

AM: I see servers which do give generic client access to error logs, but you have to be a wizard to get to them.

AM: You could do what RT said, it's just another output, not an error output. That's very different from providing a debugging interface to generic error output -- I don't want to go there.

KL: Why not make this a default, we could allow a pipeline author to say "take all the errors and put them here".

AM: I don't want to try to design a logging system.

MM: I don't want that either, I just want a place for all error/debug output to go.

NW: A component writer could provide such a facility, but I don't want the system to provide this by default.

MM: That's fine, I'm happy that we just give guidance and some infrastructure to make it easy for people to do this.

RT: My pipeline engine just wraps all component error output in a CDATA section and an XML doc.
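
A sketch of that kind of wrapping, not RT's actual code. The element name errors and the component attribute are assumptions. The one subtlety is that the text itself may contain "]]>", which has to be split across two CDATA sections; characters that XML forbids outright (most control characters) are not handled here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

def wrap_errors(component_name, text):
    """Wrap raw error output in a CDATA section inside a small XML document."""
    # "]]>" would end the CDATA section early, so split it across two sections.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return (f"<errors component={quoteattr(component_name)}>"
            f"<![CDATA[{safe}]]></errors>")

raw = "bad <tag> & stray ]]> in stylesheet"
wrapped = wrap_errors("xslt", raw)
print(wrapped)

# Reparsing recovers the original error text exactly.
recovered = ET.fromstring(wrapped).text
```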

AM: Back to RT's diagram, it doesn't work for me. This is too complicated for the simple case. Sxpipe's simple conditional is very powerful. We should think again about restricting ourselves to one-in-one-out straight-through pipelines. All topological complexity is handled in special components. Multiple inputs are reconstructed as computed resources, i.e. backward chaining.

KL: Isn't this just an optimisation?

AM: No, it's really different.

NW: I don't understand how this does 'if'.

HST: One component, with one input, which embeds two pipelines. I test my input, run one or the other embedded pipe, use its output as mine.

AM: We could extend this to multiple outputs, but I want to try paring back down to simple pipes. I particularly like the backward-chaining access to computed resources.

NW: Still struggling.

RT: Crucial difference is that all branches on AM's proposal are straight-through pipes, whereas in my example I needed a more complex pipe with a join where the computed stylesheet joined in.

RT: You've simplified the pipe, but added a Resource Manager!

HST: My experience is that having computed resources does simplify the way the pipe gets written, so I'm torn.

EB: Can't we combine the two?

AM: Yes, that's what I want, because the pure dataflow view is too complex.

EB: So you replace the flow complexity with indirection through a name.

AM: Yes -- to preserve the simplicity of the initial straight-through flow story.

NW: You're using exaggerated terminology to describe the alternative you don't like.

EB: I think it's a question of what's streamable.

AM: Don't agree.

EB: My problem is this makes the pipeline more complex for the user to save work for the implementer.

HST: Disagree -- on AM's account, you only need one component for XSLT, with one input and one parameter. On the original account, you need another component, with two inputs.

RT: No, you just need the pipeline language itself to allow for any input to be supplied by a pipe or a URI.

MM: We're just going in circles, two valid views, let's do both.

NW: We don't see the two as having an obvious difference in complexity/performance, so AM, you need to give us a more detailed example.

EB: Under the covers XSLT is like this, because you want to 'compile' the stylesheet handler from the stylesheet before you start processing the input.

RT: That's not a necessary property, just a fault in the implementation.

NW: I didn't think this was just an efficiency issue, but that in some cases things just don't work.