Language semantics and operational meaning

W3C and other standards organizations are in the business of defining languages (conventions that organizations can choose to follow), not of mandating operational behavior (telling organizations and participants in the network how they are supposed to behave). Organizations (implementors, operators, administrators, software developers) are free to choose which standards they adopt and what their operational behavior will be.

In some posts on the www-tag mailing list, I was trying to point out the risks in defining languages such that the "meaning" of the language depends on operational behavior. In some ways, of course, keeping the two entirely separate is a fallacy: in general, what an utterance "means" in some operational sense depends on what the speaker intends and how the listener will interpret the utterance.

However, as an organization, W3C can, and should, define languages in which the meaning is defined in the document, in terms of abstractions rather than operational behavior. The result is more robust standards: standards that have wider applicability, that can be used for more purposes, and that make for a more vibrant and extensible web.

Search Engines take on Structured Data

Structured data on the web got a boost this week, with Google’s announcement of Rich Snippets and Rich Snippets in Custom Search. Structured data at such a large scale raises at least three issues:

  1. Syntax
  2. Vocabulary
  3. Policy

Google’s documentation shows support for both microformats and RDFa. The microformat support follows hReview syntax with small vocabulary changes (name vs. fn). Support for RDFa syntax, in theory, means support for vocabularies that anyone makes; but in practice, Google is starting with a clean slate: data-vocabulary.org. That’s a place to start, though it doesn’t provide synergy with those who already use FOAF or Dublin Core or the like to share their data.
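
For concreteness, here is roughly what the two flavors of markup look like for a review. The hReview class names below are the standard microformat ones; the RDFa version uses a placeholder vocabulary prefix rather than Google’s actual data-vocabulary.org terms, which I haven’t reproduced here.

<!-- hReview microformat, with the standard class names; Google's variant
     reportedly uses "name" where hReview uses "fn" -->
<div class="hreview">
  <span class="item"><span class="fn">Corner Bakery</span></span>
  rated <span class="rating">4.5</span> out of 5
  by <span class="reviewer vcard"><span class="fn">Alice</span></span>
</div>

<!-- roughly the same review in RDFa; the v: prefix and the property names
     are placeholders for whatever vocabulary the publisher declares -->
<div xmlns:v="http://example.org/review-vocab#" typeof="v:Review">
  <span property="v:itemReviewed">Corner Bakery</span>
  rated <span property="v:rating">4.5</span> out of 5
  by <span property="v:reviewer">Alice</span>
</div>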

The policy questions are perhaps the most difficult. Structured data is a pointy instrument; if anyone can say anything about anything, surely the system will be gamed and defrauded. Google’s rollout is one step at a time, starting with some trusted sites and an application process to get your site added. The O’Reilly interview with Guha and Hansson is an interesting look at where they hope to go after this first step; if you’re curious about how this fits in to HTML standards, see Sam Ruby’s microdata.

While issues remain (there are syntactic i’s to dot and t’s to cross, and even larger policy issues to work out), between Google’s rollout, Yahoo’s SearchMonkey, and the UK Central Office of Information rollout, it seems that the industry is ready to take on the challenges of using structured data in search engines.

Data interchange problems come in all sizes

I had a pretty small data interchange problem the other day: I just wanted to archive some playlists that I had compiled using various music player daemon (mpd) clients. The mpd server stores playlists as simple m3u files, i.e. line-oriented files with a path to the media file on each line (a sample appears below), but that’s too fragile for archive and interchange purposes. I had a similar problem a while back with iTunes playlists. In that episode, I chose hAudio, an HTML dialect in progress in the microformats community, as my target.
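
For reference, here’s roughly what such a file looks like (the paths are made up; the extended m3u variant adds #EXTM3U/#EXTINF comment lines, but the plain form is just paths):

music/john_denver/poems_prayers_and_promises.ogg
music/compilations/did_you_feel_the_mountains_tremble.ogg
music/trout_fishing_in_america/back_when_i_could_fly.ogg

No titles, no artists, just paths that only mean something on one machine; hence the appeal of something richer like hAudio.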

Unfortunately, hAudio changed out from under me between when I started and when I finished. So this time, a simple search found the music ontology, and I tried it with RDFa, which lets you use any RDF vocabulary in HTML*. I’m mostly pleased with the results:

  1. from A Song’s Best Friend_ The Very Best Of John Denver [Disc 1]
     by John Denver
     Poems, Prayers And Promises
  2. from WOW Worship (orange)
     by Compilations
     Did you Feel the Mountains Tremble
  3. from Family Music Party
     by Trout Fishing In America
     Back When I Could Fly

The album names come before the track names because I didn’t read enough of the RDFa primer when I was coding; RDFa includes @rev as well as @rel for reversing subject/object order. See an advogato episode on m3uin.py for details about the code.
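
To make the @rev point concrete, here is a sketch of one playlist entry in RDFa using the Music Ontology. This is not the markup my script actually generates; the mo: terms reflect my reading of the ontology, and the dc:creator shortcut glosses over its richer artist modeling.

<ol xmlns:mo="http://purl.org/ontology/mo/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <li about="#track3" typeof="mo:Track">
    <span property="dc:title">Back When I Could Fly</span>
    from
    <!-- @rev yields (#album3, mo:track, #track3): the album has this track,
         even though the track title comes first in the text -->
    <span rev="mo:track" resource="#album3"></span>
    <span about="#album3" typeof="mo:Record"
          property="dc:title">Family Music Party</span>
    by <span about="#album3" property="dc:creator">Trout Fishing In America</span>
  </li>
</ol>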

The Music Ontology was developed by a handful of people who
staked out a claim in URI space
(http://musicontology.org/...) and happily took comments from
as big a review community as they could manage, but they had no
obligation to get a really global consensus. The microformats process
is intended to reach a global consensus so that staking out a claim in
URI space is superfluous; it works well given certain initial
conditions about how common the problem is and availability of pre-web
designs to draw from. Perhaps playlists (and media syndication, as
hAudio seems to be expanding in scope to hMedia) will eventually reach
these conditions, but the music ontology already meets my needs, since
I’m the sort who doesn’t mind declaring my data vocabulary with URIs.

My view of Web architecture is shaped by episodes such as this
one. While giga-scale deployment is always impressive and definitely
something we should design for, small scale deployment is just as
important. The Web spread, initially, not because of global phenomena
such as Wikipedia and Facebook but because you didn’t need
your manager’s permission to try it out; you didn’t even
need a domain name; you could just run it on your LAN
or even on just one machine with no server at all.

In an Oct 2008 tech plenary session on web architecture, Henri Sivonen said:

I see the Web
as the public Web that people can access. The resources you can
navigate publicly. I define Web as the information space accessible to
the public via a browser.
If a mobile operator operates behind
walls, this is not part of the Web.

I can’t say that I agree with that perspective. I’m no great fan of
walled gardens either, but freedom means freedom to do things we don’t
like as well as freedom to do things we do like. And architecture and
policy should have a sort of church-and-state separation between
them.

Plus, data interchange happens not just at planetary scale, but
also within mobile devices, across devices, and across communities
and enterprises of all shapes and sizes.

* I’ve gone a little outside the scope of current standards; RDFa has only been specified for use in modular XHTML, with the application/xhtml+xml media type, so far.



Once more into Versioning — this time with HTML

The W3C TAG has worked on the general issue of “versioning” for many years, and many TAG members may be worn out on the issue.
However, undeterred by that history, I’m taking another run at it, this time looking specifically at the issues around versioning of HTML, CSS, JavaScript, and other parts of the standard web browser landscape.
Part of what’s new (I think) is looking at the cost/benefits around deployment. See the www-tag mailing list archive for the HTML and versioning threads.

Palm webOS approach to HTML extensibility: x-mojo-*

I got pretty excited about the iPhone,
and even more about the openness of Android and the G1, and then I
learn that the Palm Pre developer platform is basically just the open
web platform: HTML, CSS, and JavaScript.

Just after the mobile buzz at Web Directions North and the TAG declared victory on how to build The Self-Describing Web with URI-based Extensibility, I get some details on how Palm is building on the open web platform:

A widget is declared within your HTML as an empty div with an x-mojo-element attribute.

<div x-mojo-element="ToggleButton" id="my-toggle"></div>

Oh great; x- tokens… aren’t those passé by now?

The suggestion in the HTML 5 draft is data-* attributes. The ARIA draft suggests @role. The Palm design looks like new information for issue-41, Decentralized-extensibility, in the HTML WG.
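
For comparison, here is roughly how the same widget hook might be spelled in each approach. Only the first line is Palm’s actual markup; the data-* and namespace variants are my guesses at what an equivalent might look like, and nothing in ARIA ties role="button" to Mojo widgets.

<!-- Palm's x- attribute -->
<div x-mojo-element="ToggleButton" id="my-toggle"></div>

<!-- HTML 5 draft's data-* attributes (hypothetical spelling) -->
<div data-mojo-element="ToggleButton" id="my-toggle"></div>

<!-- ARIA-style @role: "button" is a real ARIA role, but it describes the
     widget's behavior rather than naming the Mojo element -->
<div role="button" id="my-toggle"></div>

<!-- URI-based extensibility, XHTML-style; the namespace URI is invented -->
<div xmlns:mojo="http://developer.palm.com/mojo#"
     mojo:element="ToggleButton" id="my-toggle"></div>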

Anybody know how frozen the Palm design is? Or if they looked at ARIA, data-* or URI-based namespaces?

JavaScript required for basic textual info? TRY AGAIN

Sam says he’s Online and Airborne. “Needless to say, this is seriously cool.” I’ll say! But when I follow the link to details from the service provider, I get:

Sorry. You must have JavaScript enabled to view this page. Click the
BACK button below or enable JavaScript in your browser preferences and
click TRY AGAIN.

Let’s turn that around, shall we? Sorry, if you’re a network provider and you want my business, read up on unobtrusive JavaScript (aka the rule of least power), go BACK to work on your web site design and TRY AGAIN.

How to evaluate Web Applications security designs?

I could use some help getting my head around security for Web
Applications and mashups.

The first time someone told me W3C should be working on specs to help the browser prevent sensitive data from leaking out of enterprises, I didn’t get it. “Use the browser as part of the trusted computing base? Are you kidding?” was my response. I didn’t see the bigger picture. Crockford explains in an April 2008 item:

… there are multiple interests involved in a web
application. We have here the interests of the user, of the site, and
of the advertiser. If we have a mashup, there can be many more
interests.

Most of my study of security protocols concentrated on whether a request from party A should be granted by party B. You know, Alice and Bob. Using BAN logic to analyze the Kerberos protocols was very interesting.

I also enjoyed studying capability security and the E system, which is a fascinating model of secure multi-party communication (not to mention lockless concurrency), though it seems an impossibly high bar to reach, given the worse-is-better tendency in software deployment, and it seemed to me that capabilities are a poor match for the way linking and access control work in the Web:

The Web provides several mechanisms
to control access to resources; these mechanisms do not rely on
hiding or suppressing URIs for those resources.

On the other hand, after wrestling with the patchwork of JavaScript security policies in browsers over the past few weeks, the capability approach in adsafe looks simple and elegant by comparison. Is there any chance we can move the state of the art that far? And what do we do in the meantime? Crockford’s Jan 2008 post is quite critical of W3C’s current work:

This same sort of wrong-end-of-the-network thinking can be seen today
in the HTML 5 working group’s crazy XHR access control language.

Access Control for Cross-Site Requests
is a mouthful, and “Access Control” is too generic, which leads to “W3C
Access Control”. Didn’t we already go through this with “W3C XML
Schema”? Generic names are awkward. I think I’ll call it WACL…
yeah… rhymes with spackle… let’s see if it sticks. Anyway…

Crockford’s comment cites his proposal and argues…

JSONRequest
does not allow the server to abdicate its responsibility of deciding if
the data should be delivered to the browser. Therefore, no policy
language is needed. JSONRequest requires explicit authorization.
Cookies and other tokens of ambient authority are neither sent nor
delivered.

I’m not sure I understand that. I’m glad to learn there’s more to
the difference between XMLHttpRequest and JSONRequest than just
<pointy-brackets> vs {curly-braces}, but I’d like to understand
better how “ambient authority” relates to the interests of users,
sites, advertisers, and the like.
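
Here is my rough picture of the distinction, using the header names the W3C proposal eventually settled on, so take the details as illustrative rather than as a transcription of either spec. With XMLHttpRequest plus access control, the browser attaches whatever cookies it holds for the target site (the “ambient authority”), and the policy decides which origins may read the answer:

GET /balances HTTP/1.1
Host: bank.example.com
Origin: http://mashup.example.org
Cookie: session=7f3a...

HTTP/1.1 200 OK
Access-Control-Allow-Origin: http://mashup.example.org

With JSONRequest, as I read Crockford’s proposal, no cookies ride along, so the server only ever hands out what it would hand to an anonymous caller; anything more requires explicit credentials in the request itself:

POST /balances HTTP/1.1
Host: bank.example.com
Content-Type: application/jsonrequest

{"credentials": "explicit, not ambient"}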

In response, the FAQ in the WACL spec says:

JSONRequest has been considered by the Web Applications Working
Group and the group has concluded that it does not meet the documented
requirements. E.g., requests originating from the JSONRequest API
cannot include credentials and JSONRequest is format specific.

Including credentials seems more like a solution than a
requirement; can someone help me understand how it relates to the
multiple interests involved in a web application?

Caching XML data at install time

The W3C web server is spending most of its time serving DTDs to
various bits of XML processing software. In a follow-up comment on an item on DTD traffic, Gerald says:

To try to help put these numbers into perspective, this blog post
is currently #1 on slashdot, #7 on reddit, the top page of
del.icio.us/popular , etc; yet www.w3.org is still serving more than
650 times as many DTDs as this blog post, according to a 10-min
sample of the logs I just checked.

Evidently there’s software out there that makes a lot of use of the
DTDs at W3C and they fetch a new copy over the Web for each use. As
far as this software is concerned, these DTDs are just data files,
much like the timezone database your operating system uses to convert
between UTC and local times. The tz database
is updated with respect to changes by various jurisdictions from time
to time and the latest version is published on the Web, but your
operating system doesn’t go fetch it over the Web for each use. It
uses a cached copy. A copy was included when your operating system
was installed and your machine checks for updates once a week or so
when it contacts the operating system vendor for security updates and
such. So why doesn’t XML software do likewise?

It’s pretty easy to put together an application out of components
in such a way that you don’t even realize that it’s fetching DTDs
all the time. For example, if you use xsltproc like this…

$ xsltproc agendaData.xsl weekly-agenda.html >,out.xml

… you might not even notice that it’s fetching the DTD and several related files. But with a tiny HTTP proxy, we can see the traffic. In one window, start the proxy:

$ python TinyHTTPProxy.py
Any clients will be served...
Serving HTTP on 0.0.0.0 port 8000 ...

And in another, run the same XSLT transformation with a proxy:

$ http_proxy=http://127.0.0.1:8000 xsltproc agendaData.xsl weekly-agenda.html

Now we can see what’s going on:

connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:00] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd HTTP/1.0" - -
connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent HTTP/1.0" - -
bye
bye
connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent HTTP/1.0" - -
bye
connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent HTTP/1.0" - -
bye

This is the default behaviour of xsltproc, but
it’s not the only choice:

  • You can pass --novalid to xsltproc to tell it to skip DTDs altogether (see the one-liner after this list).
  • You can set up an
    XML catalog
    as a form of local cache.
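
The first option is just a flag on the command line; run through the same proxy, it should produce no DTD traffic, as long as the document doesn’t depend on entities that only the DTD defines:

$ http_proxy=http://127.0.0.1:8000 xsltproc --novalid agendaData.xsl weekly-agenda.html >,out.xml

The catalog option takes a little more setup.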

To set up this sort of cache, first grab copies of
what you need:

$ mkdir xhtml1
$ cd xhtml1/
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
--15:29:04--  http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
=> `xhtml1-transitional.dtd'
Resolving www.w3.org... 128.30.52.53, 128.30.52.52, 128.30.52.51, ...
Connecting to www.w3.org|128.30.52.53|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32,111 (31K) [application/xml-dtd]
100%[====================================>] 32,111       170.79K/s
15:29:04 (170.65 KB/s) - `xhtml1-transitional.dtd' saved [32111/32111]
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
...
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
...
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
...
$ ls
xhtml1-transitional.dtd  xhtml-lat1.ent  xhtml-special.ent  xhtml-symbol.ent

And then in a file such as
xhtml-cache.xml, put a little catalog:

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<rewriteURI
uriStartString="http://www.w3.org/TR/xhtml1/DTD/"
rewritePrefix="./" />
</catalog>

Then point xsltproc to the catalog file and try it again:

$ export XML_CATALOG_FILES=~/xhtml1/xhtml-cache.xml
$ http_proxy=http://127.0.0.1:8000 xsltproc agendaData.xsl weekly-agenda.html

This time, the proxy won’t show any traffic. The data was all
accessed from local copies.

While XSLT processors such as xsltproc and Xalan have no
technical dependency on the XHTML DTDs, I suspect they’re used with
XHTML enough that shipping copies of the DTDs along with the XSLT
processing software is a win all around. Or perhaps the traffic comes
from the use of XSLT processors embedded in applications, and the DTDs
should be shipped with those applications. Or perhaps shipping the
DTDs with underlying operating systems makes more sense. I’d have to
study the traffic patterns more to be sure.
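
If the fix belongs in applications, here is a sketch of what “shipping the DTDs with the application” might look like in Python with lxml. The install location is hypothetical, and the same effect is available with no code at all by pointing XML_CATALOG_FILES at a catalog as above; this just shows that nothing forces an application to go back to www.w3.org for every parse.

# Resolve W3C XHTML DTD URIs from copies installed with the application
# instead of fetching them from www.w3.org on every parse.
import os
from lxml import etree

LOCAL_DTD_DIR = "/usr/share/myapp/xhtml1"   # hypothetical install location
W3C_PREFIX = "http://www.w3.org/TR/xhtml1/DTD/"

class LocalDTDResolver(etree.Resolver):
    def resolve(self, system_url, public_id, context):
        if system_url and system_url.startswith(W3C_PREFIX):
            local = os.path.join(LOCAL_DTD_DIR, os.path.basename(system_url))
            return self.resolve_filename(local, context)
        return None  # anything else: fall back to default resolution

parser = etree.XMLParser(load_dtd=True, no_network=True)
parser.resolvers.add(LocalDTDResolver())
doc = etree.parse("weekly-agenda.html", parser)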

p.s. I’d rather not deal with DTDs at all; newer schema technologies make them obsolete as far as I’m concerned. But

  • some systems were designed before better schema technology came along, and W3C’s commitment to persistence applies to those systems as well, and
  • the point I’m making here isn’t specific to DTDs; catalogs work for all sorts of XML data, and the general principle of caching at install time goes beyond XML altogether.

The details of data in documents: GRDDL, profiles, and HTML5

GRDDL, a mechanism for putting RDF data in XML/XHTML documents, is specified mostly at the XPath data model level. Some GRDDL software goes beyond XML and supports HTML as she is spoke, aka tag soup. HTML 5 is intended to standardize the connection between tag soup and XPath. The tidy use case for GRDDL anticipates that using HTML 5 concrete syntax rather than XHTML 1.x concrete syntax involves no changes at the XPath level.

But in GRDDL and HTML5,
Ian Hickson, editor of HTML 5, advocates dropping the profile attribute
of the HTML head element in favor of rel=”profile” or some such. I
dropped by the #microformats channel to think out loud about this stuff, and Tantek said similarly, “we may solve this with rel=”profile” anyway.” The rel-profile topic in the microformats wiki shows the idea goes pretty far back.
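
In markup terms the difference is small; the GRDDL profile URI below is the real one, the rest is just a sketch of the two spellings:

<!-- today: GRDDL's hook is head/@profile -->
<head profile="http://www.w3.org/2003/g/data-view">
  ...
</head>

<!-- the proposed alternative: a link relation instead of an attribute -->
<head>
  <link rel="profile" href="http://www.w3.org/2003/g/data-view"/>
  ...
</head>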

Possibilities I see include:

  • GRDDL implementations add support for rel=”profile” along with HTML 5 concrete syntax.
  • GRDDL
    implementors don’t change their code, so people who want to use GRDDL
    with HTML 5 features such as <video> stick to XML-wf-happy HTML 5
    syntax and they use the head/@profile attribute anyway, despite what
    the HTML 5 spec says.
  • People who want to use GRDDL stick to XHTML 1.x.
  • People who want to put data in their HTML documents use RDFa.

I don’t particularly care for the rel=”profile” design, but one should choose one’s battles and I’m not inclined to choose this one. I’m content for the market to choose.

life without MIME type sniffing?

In a recent item on IE8 Security, Eric Lawrence, Security Program Manager for Internet Explorer, introduced a work-around to the security risks associated with content-type sniffing: an authoritative=true parameter on the Content-Type header in HTTP. This re-started discussion of the content-type sniffing rules and the Support Existing Content design principle of HTML 5. In response to a challenge asking for evidence that supporting existing content requires sniffing, Adam made a suggestion that I’d like to pass along:

I encourage you to build a copy of Firefox without content sniffing
and try surfing the web. I tried this for a while, and I remember
there being a lot of broken sites …

That reminded me of an idea I heard in TAG discussions of MIME types and error recovery: a browser mode for “This is my content, show me problems rather
than fixing them for me silently.”

Though Adam offered a patch, building Firefox is not something I have mastered yet, so I’m interested to learn about run-time configuration options in IE (notes Julian) and Opera (notes Michael). Eric Lawrence’s reply points out:

Please do keep in mind, however, that most folks (even the ultra-web engaged on these lists) see but a small fraction of the web, especially considering private address space/intranets, etc.

A report from one developer suggests there’s light at the end of the tunnel, at least for sniffing associated with feeds:

I did, partly as an experiment, stop sniffing text/plain in the latest release of SimplePie (which, inevitably, isn’t the nicest of things to do, seeming there are tens of thousands of users). Next to nothing broke. I know for a fact this couldn’t have been done a year or two ago: things have certainly moved on in terms of the MIME types feeds are served with …

If you get a chance to try life without MIME type sniffing, please let us know how it goes.