Bug 14932 - [FO30] fn:unparsed-text
Status: CLOSED FIXED
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 3.0
Version: Member-only Editors Drafts
Hardware: PC Windows NT
Importance: P2 normal
Assigned To: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2011-11-25 09:56 UTC by Tim Mills
Modified: 2012-10-02 12:56 UTC (History)
Description Tim Mills 2011-11-25 09:56:02 UTC
In Bug 14831, it has been pointed out that fn:unparsed-text and fn:doc use different mechanisms for resolving URIs to resources.

"doc() uses a mapping of URIs to resources in the static context,
whereas unparsed-text() lacks this indirection: the spec is written to require
direct resolution of the URI supplied."

This means that 

fn:doc('http://www.example.com/')

and

fn:unparsed-text('http://www.example.com/')

may bear no relation to each other.  This seems to be unfortunate.
Comment 1 Michael Kay 2011-11-28 23:39:42 UTC
See also #14971 concerning uri-collection(). Perhaps there is scope for a grand unification of the way in which functions use URIs to refer to external resources.
Comment 2 Tim Mills 2011-11-29 08:32:57 UTC
Hopefully, yes.

I'd like to see access to resources structured with three layers: octet streams at the bottom, text in the middle and XML at the top.
Comment 3 Erik Wilde 2011-11-29 18:35:17 UTC
not sure whether that's a comment or should be filed as a new bug: fn:unparsed-text() is not really clear about whether a text resource should be requested (by using HTTP Accept, for example), or whether a requested resource should be passed into XQuery as a string. the difference might matter, because essentially, there are three things that users might want to do:

- GET XML, and then process as XDM. this is covered by doc().
- GET XML, and then process it as text (i.e., get serialized XDM).
- GET text/plain, i.e. a plain text variant of a resource, which could be very different, such as the server stripping out markup and delivering formatted text/plain.

i think all of this could be addressed in a more explicit way how XQuery interfaces with HTTP and generally fits into the web, and that's one of the things i am really interested in, but that's definitely only appropriate for 3.1.
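
A minimal sketch of the distinction drawn above between asking for the XML representation and asking for a text/plain variant, written in Python rather than XQuery. The helper name build_request is hypothetical, and nothing is sent over the network; the sketch only shows that the same URI can be requested with different Accept headers, which is the point where the two cases diverge.

```python
from urllib.request import Request

def build_request(uri, media_type):
    """Build (but do not send) a GET request asking for one media type.

    Hypothetical helper for illustration only.
    """
    return Request(uri, headers={"Accept": media_type}, method="GET")

# Same identifier, two different requested representations.
as_xml = build_request("http://www.example.com/", "application/xml")
as_text = build_request("http://www.example.com/", "text/plain")
```

Whether unparsed-text() should behave like the first or the second request is exactly what the comment says is unclear.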
Comment 4 Michael Kay 2011-11-29 21:34:25 UTC
The use cases for unparsed-text in XSLT (where it has existed for some years) are primarily for reading and parsing data files in non-XML formats, either for conversion into XML or for searching. I have used it, for example, to search Java source code. I think nearly all the applications I have come across are reading local files rather than doing HTTP access. But of course, that doesn't stop other people inventing other uses, and there is no reason why the spec should restrict the possibilities: mapping URIs to resources delivered as strings of characters seems a pretty general and powerful capability. (In fact, it makes you wonder why we need environment-variable()...)

Many of the applications involve collections of text resources, so we definitely need some kind of relationship between uri-collection() and unparsed-text().
Comment 5 Erik Wilde 2011-11-30 01:07:54 UTC
i've used unparsed-text() to crawl for robots.txt files and for CSS, and if you have extension functions for tidying HTML, you can do pretty amazing things with just writing XQuery.

i think it would be good to have a clear separation of concerns and talk about identification, interactions, and representation in well-separated steps. to a certain extent this exists with resolve-uri(), but there is also some mixing of concerns such as with doc(). in an ideal world, there should be resolve-uri(), then interaction (maybe specifically supported for HTTP), and then handling the result, so that users could resolve() a relative URI, GET() an XML resource, and then parse() it into an XDM. i think not separating these issues cleanly has created the situation reported in this bug, and the cleanest solution would be to separate the steps, expose them as functions, and then just define doc() as a combination of the basic function. maybe that's too ambitious, but personally, i'd like to see XQuery's web-friendliness increased anyway.
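
A rough sketch of the separation described above (identification, interaction, representation), in Python rather than XQuery. Everything here is an assumption for illustration: the RESOURCES dictionary stands in for real HTTP or file access, and the function names resolve, get, decode and parse are hypothetical, not proposed signatures. The last two functions show how unparsed-text() and doc() could in principle be defined as compositions of the basic steps.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Hypothetical in-memory "web": absolute URI -> octets.
RESOURCES = {
    "http://www.example.com/data/a.xml": "<a>hello</a>".encode("utf-8"),
}

def resolve(relative, base):
    """Identification: turn a relative reference into an absolute URI."""
    return urljoin(base, relative)

def get(uri):
    """Interaction: dereference the URI and return raw octets."""
    return RESOURCES[uri]

def decode(octets, encoding="utf-8"):
    """Representation, text layer: octets to a string."""
    return octets.decode(encoding)

def parse(text):
    """Representation, XML layer: string to an element tree."""
    return ET.fromstring(text)

def unparsed_text(relative, base):
    """Text access as a composition of the lower layers."""
    return decode(get(resolve(relative, base)))

def doc(relative, base):
    """XML access defined on top of text access."""
    return parse(unparsed_text(relative, base))
```

Under this model the two functions cannot disagree about a URI, because both go through the same resolve and get steps. As later comments point out, existing implementations of doc() do not necessarily work this way.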
Comment 6 Jonathan Robie 2011-11-30 15:04:39 UTC
(In reply to comment #5)
> in an ideal world, there should be resolve-uri(),
> then interaction (maybe specifically supported for HTTP), and then handling the
> result, so that users could resolve() a relative URI, GET() an XML resource,
> and then parse() it into an XDM. i think not separating these issues cleanly
> has created the situation reported in this bug, and the cleanest solution would
> be to separate the steps, expose them as functions, and then just define doc()
> as a combination of the basic function. maybe that's too ambitious, but
> personally, i'd like to see XQuery's web-friendliness increased anyway.

I think it's useful to have functions that do each of these things.

However, doc() does not necessarily involve GET() and parse(). Any solution needs to work well for:

* Retrieving Web resources
* Persistent XDM instances in native stores
* Identifying data in foreign systems like relational databases
* Local files
* Data converters

I'm probably missing some important use cases - implementations do wildly different things with doc().

Web friendliness is important. Let's make sure we don't achieve that at the expense of data integration.
Comment 7 Erik Wilde 2011-11-30 16:17:23 UTC
(In reply to comment #6)
> > the cleanest solution would
> > be to separate the steps, expose them as functions, and then just define doc()
> > as a combination of the basic function. maybe that's too ambitious, but
> > personally, i'd like to see XQuery's web-friendliness increased anyway.
> I think it's useful to have functions that do each of these things.

is there any way this could become part of a more long-term plan to improve XQuery/XSLT functionality in that area?

> However, doc() does not necessarily involve GET() and parse().

sure, it all depends on the URI scheme. but maybe it should be possible to (a) find out which URI schemes are supported by an implementation, and then act accordingly, such as dereferencing the URI. HTTP and file probably would be two schemes that would be supported by most implementations.

> Any solution needs to work well for:
> * Retrieving Web resources

and that could also mean using protocol-oriented schemes such as FTP, if supported by the implementation.

> * Persistent XDM instances in native stores

how are these identified right now? nothing prevents an implementation from not actually performing an HTTP request if the resource is available locally, but logically, this still is access based on HTTP URIs, right?

> * Identifying data in foreign systems like relational databases

how is this achieved right now? nothing that is just URI-based, i assume?

> * Local files

there's the file URI scheme for that.

> * Data converters

how is this achieved right now? nothing that is just URI-based, i assume?

> I'm probably missing some important use cases - implementations do wildly
> different things with doc().

do they do any magic that is not in line with web architecture? if not, then separating the currently mixed concerns might help.

> Web friendliness is important. Let's make sure we don't achieve that at the
> expense of data integration.

i don't see how a more orthogonal design of web-oriented functionality would create problems for data integration. on the contrary, data integration often will be based on being able to take full advantage of web architecture, and if i don't have control over, for example, HTTP methods and headers, then there are a lot of things i just cannot do with data that is exposed on the web.
Comment 8 Jonathan Robie 2011-11-30 19:39:40 UTC
(In reply to comment #7)

> > Web friendliness is important. Let's make sure we don't achieve that at the
> > expense of data integration.
> 
> i don't see how a more orthogonal design of web-oriented functionality would
> create problems for data integration. on the contrary, data integration often
> will be based on being able to take full advantage of web architecture, and if
> i don't have control over, for example, HTTP methods and headers, then there
> are a lot of things i just cannot do with data that is exposed on the web.

I think we should start with requirements and use cases, both for existing applications and those you envision.

> how is this achieved right now?

You ask this about a few items - I think that's a question we would have to look at across existing implementations. The ones I have worked with tend to use collection() for persistent XDM or relational tables, and use doc() for a variety of things.

I would not be at all surprised if some implementations use doc() in the same way implementations I have been involved with use collection().

> > * Data converters
> 
> how is this achieved right now? nothing that is just URI-based, i assume?

Depends a great deal on the vendor.

DataDirect's data converters (which I did not specify) use URIs to specify a conversion, conversion parameters can be specified as part of the URL:

doc("converter:Base64:newline=crlf:encoding=utf-8?file///w:/myfiles/base_to_xml.bin")
 
> > I'm probably missing some important use cases - implementations do wildly
> > different things with doc().
> 
> do they do any magic that is not in line with web architecture? if not, then
> separating the currently mixed concerns might help.

I honestly don't know what all implementations do. I think we would have to find out. We have to be careful if we forbid anything that was previously allowed. Sometimes we do that, but very carefully.
Comment 9 Erik Wilde 2011-12-05 18:22:56 UTC
(In reply to comment #8)
> > > Web friendliness is important. Let's make sure we don't achieve that at the
> > > expense of data integration.
> > 
> > i don't see how a more orthogonal design of web-oriented functionality would
> > create problems for data integration. on the contrary, data integration often
> > will be based on being able to take full advantage of web architecture, and if
> > i don't have control over, for example, HTTP methods and headers, then there
> > are a lot of things i just cannot do with data that is exposed on the web.
> I think we should start with requirements and use cases, both for existing
> applications and those you envision.

that's what i was trying to do, but maybe this bug tracker isn't the best place? i am looking at using XQuery in a service-oriented environment where it is used to both expose and consume web services. for this to be well-supported, we need to match how web architecture works, and how XQuery supports the key parts of it, which at a high level of abstraction are identification, interaction, and representation.

> You ask this about a few items - I think that's a question we would have to
> look at across existing implementations. The ones I have worked with tend to
> use collection() for persistent XDM or relational tables, and use doc() for a
> variety of things.
> I would not be at all surprised if some implementations use doc() in the same
> way implementations I have been involved with use collection().

maybe doc() and collection() would not be the best things to change to be better in line with web architecture, then, because they are well-established and implementations already have certain ways of how they implement them. in that case, having functions for

- handling identifiers,
- dereferencing identifiers, and
- handling representations

might be a better way to approach this, and while for some of these there already are functions (such as resolve-uri()), for others, there aren't.

> DataDirect's data converters (which I did not specify) use URIs to specify a
> conversion, conversion parameters can be specified as part of the URL:
> doc("converter:Base64:newline=crlf:encoding=utf-8?file///w:/myfiles/base_to_xml.bin")

wow, that's actually quite impressive. this pushes a lot of functionality into proprietary URI schemes, essentially turning those into a place where vendors then put their extensions. i wouldn't say that from a language design point of view, that's the best way of supporting this extensibility pattern.

> I honestly don't know what all implementations do. I think we would have to
> find out. We have to be careful if we forbid anything that was previously
> allowed. Sometimes we do that, but very carefully.

absolutely agreed. and i am not suggesting to disallow those things. but maybe we have an opportunity to revisit how essential parts of web architecture are being handled, and can try to come up with a different way how those are exposed in XQuery, instead of pushing everything into proprietary URI scheme behaviors.
Comment 10 Michael Kay 2012-02-13 23:04:56 UTC
Jonathan's proposal (member-only):
https://lists.w3.org/Archives/Member/w3c-xsl-query/2012Feb/0021.html

From today's minutes: DECIDED to accept Jonathan's proposal for resolving this with some modifications. Change "available resources" to "available text resources". Change the description of unparsed-text() to say it returns not "the resource" but "a string representation of the resource". We should note that there is no essential relationship between the sets of URIs accepted by the two functions unparsed-text() and doc() (a URI accepted by one may or may not be accepted by the other), and if a URI is accepted by both there is no essential relationship between the results (different resource representations are permitted by web architecture). We should also note that there are no constraints on the MIME type of the resource. Also unparsed-text() needs a similar section to the "Various aspects are implementation-defined" paragraph that currently exists for the doc() function.
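
As an illustration only, the decision can be modelled in Python as two independent mappings from URIs to resources. The dictionary contents, the ResourceError class and the function bodies are assumptions made for this sketch; they are not spec text and not any implementation's behaviour.

```python
import xml.etree.ElementTree as ET

class ResourceError(Exception):
    """Hypothetical error raised when a URI is not in the relevant mapping."""

# Two separate mappings: neither set of URIs is derived from the other.
available_documents = {
    "http://www.example.com/": "<page>parsed by doc()</page>",
}
available_text_resources = {
    "http://www.example.com/": "plain text variant",
    "http://www.example.com/robots.txt": "User-agent: *",
}

def doc(uri):
    """Look the URI up in the available documents and return a parsed tree."""
    if uri not in available_documents:
        raise ResourceError(uri)
    return ET.fromstring(available_documents[uri])

def unparsed_text(uri):
    """Look the URI up in the available text resources and return a string."""
    if uri not in available_text_resources:
        raise ResourceError(uri)
    return available_text_resources[uri]
```

In this model a URI accepted by unparsed_text (the robots.txt entry) is rejected by doc, and the one URI accepted by both yields unrelated results, which matches the "no essential relationship" wording above.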
Comment 11 Michael Kay 2012-10-01 20:33:33 UTC
The changes identified in comment #10 have been belatedly applied to the F+O spec.
Comment 12 Tim Mills 2012-10-02 12:56:34 UTC
Thanks.