Thanks for a great 15 years at W3C

After 15 years working with all of you all around the world on Web technologies and standards, I’m taking a position as a Biomedical Informatics Software Engineer in the Department of Biostatistics at the University of Kansas Medical Center.

The new job starts in just another week or two; I’ll update the contact information and such on my home page before I’m done here.
While my new position is likely to keep me particularly busy for a few months, I hope to surface on Mad Mode from time to time; it’s a blog where I’m consolidating my writing on free software, semantic web research, and other things I’m mad-passionate about.

Thanks to all of you who contribute to the work at W3C; I’m proud of a lot of things that we built together. And thanks to all my mentors and collaborators who taught me, helped me, and challenged me.

The Web is an incredibly important part of so many aspects of life these days, and W3C plays an important role in ensuring that it will work for everyone over the long haul. Although it’s hard to leave an organization with a mission I support, I am excited to get into bioinformatics, and I look forward to what W3C and the Web community come up with next as well.

The Mission of W3C

I’ve now been with W3C for almost three months. My first priority was to meet with the global stakeholders of the organization.

I began with W3C membership. Through meetings, phone calls, technical conferences, and informal sessions I’ve met upwards of one hundred members and have had in-depth conversations with many of them.

I also made a point of meeting with organizations that are part of the ecosystem within which W3C works. This includes other standards organizations, government ministers, students, researchers in Web science, and thought leaders in the industry.

I also reached out to organizations that “should” be in W3C. Often this includes presenting our activities and roadmaps. I’ve reached over one thousand people in this way.

And it was important to do this on a global basis. During these two and a half months I travelled to eight countries, and spoke with participants from many other locations.

The primary purpose of all of these meetings was to listen. W3C has been an effective organization, but any organization can do better. What are the stakeholders of W3C asking of us?

Four primary requests

W3C has established principles, including Web for All and Web on Everything. We’ve established a technical vision as well. There is broad agreement on these principles and this technical vision.

People are asking us to be more tangible and specific in how we achieve this.

There are many ways of summarizing the requests, but four recurring themes best capture the idea. W3C needs to:

  1. Drive a global and accessible Web. There is little dispute that we should work towards a Web for All. But so many are deprived of sufficient access – for reasons of handicap, language, poverty, and illiteracy – that we need a stronger technical program to improve the situation.
  2. Provide a better value proposition for users. Everyone is a consumer and everyone is an author. Yet our focus has been on vendors that build products. We need to complement that with a better user focus.
  3. Make W3C the best place for new standards work. I blogged last month about the expanding Web platform. There is so much new innovation that we must encourage the community to bring its work to W3C rapidly.
  4. Strengthen our core mission. With the expansion of innovation on the Web, we cannot do it all. We must be very crisp about what we achieve in W3C, what companion organizations achieve, and how we relate to one another.

Having identified clear imperatives, we are building teams that will look at each of these topics. Typically a team involves W3C staff, participating members, and outside experts. I expect to update you from time to time as this work gets underway.

One more focus area

As we drive a global and accessible Web, improve the value proposition for users, welcome new standards work, and strengthen delivery of our core mission, there is a legitimate danger that we will find more work to do without the resources to do it. So we will also make sure that this clearer exposition of our mission is aligned with the resources required to complete it!

Why does the address bar show the tempolink instead of the permalink?

An important feature of HTTP is the temporary redirect, where a resource can have a “permanent” URI while its content moves from place to place over time. For example,
http://purl.org/syndication/history/1.0 remains a constant name for that resource even though its location (as specified by a second URI) changes from time to time.
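
To make the mechanism concrete, here is a sketch of such an exchange (the Location target shown is hypothetical; the actual target of this purl.org name may differ):

GET /syndication/history/1.0 HTTP/1.1
Host: purl.org

HTTP/1.1 302 Found
Location: http://example.org/current/feed-history-spec

The client follows the Location URI to fetch the content, but because the redirect is temporary, future references – bookmarks, links, citations – should use the original purl.org URI.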

If this is such a useful feature, then why does the browser address bar show the temporary URI instead of the permanent one? After all, the permanent one is the one you want to copy and paste into email, to bookmark, to place in HTML documents, and so on. The HTTP specification says to hang on to the permanent link (“since the redirection MAY be altered on occasion, the client SHOULD continue to use the Request-URI for future requests.”). Tim Berners-Lee says the same thing in User Agent watch points (1998): “It is important that when a user agent follows a “Found” [302] link that the user does not refer to the second (less persistent) URI. Whether copying down the URI from a window at the top of a document, or making a link to the document, or bookmarking it, the reference should (except in very special cases) be to the original URI.” Karl Dubost amplifies this in his 2001-2003 W3C Note Common User Agent Problems: “Do not treat HTTP temporary redirects as permanent redirects…. Wrong: User agents usually show the user (in the user interface) the URI that is the result of a temporary (302 or 307) redirect, as they would do for a permanent (301) redirect.”

So why do browsers ignore the RFC and these repeated admonitions? Possibly due to lack of awareness of the issue, but more likely because the status quo is seen as protecting the user. If the original URI (the permalink) were shown, we might have the following scenario:

  1. an attacker discovers a way
    to establish a 3xx redirect from
    http://w3.org/resources/looksgood to
    http://phishingsite.org/pretendtobew3 – either because w3.org
    is being careless, or because of a conscious decision to deed part
    of its URI space to other parties

  2. user sees address bar = http://w3.org/resources/looksgood with content X, and concludes that X is attributable to the resource http://w3.org/resources/looksgood

  3. user treats the http://w3.org/ prefix as an informal credential
    and treats the http://w3.org/resources/looksgood content as
    coming from W3C (without any normative justification; they just
    do) when in fact it’s a phishing site pretending to be W3C

  4. user enters their W3C password into phishing form, etc.

Were the user to observe address bar = http://phishingsite.org/pretendtobew3 with the same content, she
might suspect an attack and decline to enter a password.

An attacker might make use of an explicit redirection service, similar to that provided by purl.org, or might exploit a redirect script that takes a URL as part of the query string, e.g. http://w3.org/redirect?uri=http://phishingsite.org/pretendtobew3 .

This line of reasoning is documented in the Wikipedia article URL redirection and its references and
in Mozilla bug 68423.

There are two possible objections. One is that the server in these cases is in error – it shouldn’t have allowed the redirects if it didn’t really mean for the content source to speak on behalf of the original resource (similar to an iframe or img element). The other is that the user is in error – s/he shouldn’t be making authorization decisions based on the displayed URI; other evidence, such as a certificate, should be demanded. Unfortunately, while correct in theory, neither of these objections is very compelling in practice.

If browser projects are unwilling to change address bar behavior – and
it seems unlikely that they will – is there any other remedy?

Perhaps some creative UI design might help. Displaying the permalink
in addition to the tempolink might be nice, so that it could be
selected (somehow) for bookmarking, but that might be confusing and
take too much screen real estate. One possible partial solution would
be an enhancement to the bookmark creation dialog. In Firefox on
selecting “Bookmark This Page” one sees a little panel with text
fields “name” and “tags” and pull-down “folder”. What if, in the case
of a redirection, there were an additional control that gave the
option of bookmarking the permalink URI in place of the substitute
URI? With further thought I bet someone could devise a solution that would work for URI copy/paste as well.

(Thanks to Dan Connolly, other TAG members, and David Wood for their
help with this note.)

Default Prefix Declaration


1. Disclaimer

The ideas behind the proposal presented here are neither
particularly new nor particularly mine. I’ve made the effort to
write this down so anyone wishing to refer to ideas in this space
can say “Something along the lines of [this posting]” rather than
“Something, you know, like, uhm, what we talked about, prefix
binding, media-type-based defaulting, that stuff”.

2. Introduction

Criticism of XML namespaces as an appropriate mechanism for enabling distributed extensibility for the Web typically targets two issues:

  1. Syntactic complexity
  2. API complexity

Of these, the first is arguably the more significant, because
the number of authors exceeds the number of developers by a large
margin. Accordingly, this proposal attempts to address the first
problem, by providing a defaulting mechanism for namespace prefix
bindings which covers the 99% case.

3. The proposal

  • Binding: Define a trivial XML language which provides a means to associate prefixes with namespace names (URIs);
  • Invoking from HTML: Define a link relation dpd for use in the (X)HTML header;
  • Invoking from XML: Define a processing instruction xml-dpd and/or an attribute xml:dpd for use at the top of XML documents;
  • Defaulting by Media Type: Implement a registry which maps from media types to a published dpd file;
  • Semantics: Define a precedence, which operates on a per-prefix basis, namely xmlns: >> explicit invocation >> application built-in default >> media-type-based default, and a semantics in terms of namespace information items (or the appropriate data-model equivalent) on the document element.
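
To make the invocation mechanisms concrete, a document might connect itself to a dpd file along these lines (a sketch only: the href pseudo-attribute and the file location are illustrative, not defined syntax):

<!-- in the (X)HTML document header -->
<link rel="dpd" href="http://example.org/bindings.dpd"/>

<!-- at the top of an XML document -->
<?xml-dpd href="http://example.org/bindings.dpd"?>

Per the precedence just given, an ordinary xmlns: declaration in the document would still override any binding the dpd file supplies for the same prefix.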

4. Why prefixes?

XML namespaces provide two essentially distinct mechanisms for
‘owning’ names, that is, preventing what would otherwise be a name
collision by associating names in some way with some additional
distinguishing characteristic:

  1. By prefixing the name, and binding the prefix to a particular
    URI;
  2. By declaring that within a particular subtree,
    unprefixed names are associated with a particular URI.

In XML namespaces as they stand today, the association with a
URI is done via a namespace declaration
which takes the form of an attribute, and whose impact is scoped to
the subtree rooted at the owner element of that attribute.
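
A minimal example of both mechanisms side by side (the element names are illustrative; the namespace URIs are the real SVG and XHTML names):

<doc xmlns:svg="http://www.w3.org/2000/svg">
  <!-- mechanism 1: a prefixed name, bound to the SVG namespace -->
  <svg:rect/>
  <!-- mechanism 2: unprefixed names in this subtree take the XHTML namespace -->
  <section xmlns="http://www.w3.org/1999/xhtml">
    <p>an unprefixed name, owned by XHTML</p>
  </section>
</doc>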

Liam Quin has proposed an additional, out-of-band and defaultable, approach to the association for unprefixed names, using patterns to identify the subtrees where particular URIs apply. I’ve borrowed some of his ideas about how to connect documents to prefix binding definitions.

The approach presented here is similar-but-different, in that its primary
goal is to enable out-of-band and defaultable associations of namespaces
to names with prefixes, with whole-document scope. The
advantages of focussing on prefixed names in this way are:

  • Ad-hoc extensibility mechanisms typically use prefixes.
    The HTML5 specification already has at least two of these:
    aria- and data-;
  • Prefixed names are more robust in the face of arbitrary
    cut-and-paste operations;
  • Authors are used to them: For example XSLT stylesheets and W3C
    XML Schema documents almost always use explicit prefixes
    extensively;
  • Prefix binding information can be very simple: just a set of
    pairs of prefix and URI.

Provision is also made for optionally specifying a binding for the default namespace at the document element, primarily for the media type registry case, where it makes sense to associate a primary namespace with a media type.

5. Example

If this proposal were adopted, and the following dpd document for use with HTML 4.01 or XHTML 1:

<dpd ns="http://www.w3.org/1999/xhtml">
<pd p="xf" ns="http://www.w3.org/2002/xforms"/>
<pd p="svg" ns="http://www.w3.org/2000/svg"/>
<pd p="ml" ns="http://www.w3.org/1998/Math/MathML"/>
</dpd>

was registered against the text/html media type, the following would result in a DOM with html and body elements in the XHTML namespace and an input element in the XForms namespace:

<html>
<body>
<xf:input ref="xyzzy">...</xf:input>
</body>
</html>

Orthogonality of Specifications


The general principle of platform design is that platforms consist of a set of standard interfaces. Standard interfaces allow substitution of components across the interface boundary, while independence of interfaces allows evolution of the interfaces themselves. In a PC, for example, the disk bus interface allows many different disk vendors to offer disk products independent of the model of display or keyboard, and it is this orthogonality that lets each interface evolve on its own. If the display interface were linked to the disk interface too tightly, it wouldn’t be possible to evolve ISA to SATA without updating VGA.

In the web platform, the three important interfaces are transport, format, and reference, and the current definitions of those interfaces are HTTP, HTML, and URI. The interfaces are standard, allowing many different implementations: the HTTP standard lets you use HTTP servers from many vendors, the HTML standard lets you use many different HTML authoring tools or template systems, and the URI specification allows identification of many different resources.

While HTTP is the current “common denominator” protocol that all web agents are expected to speak, the web should continue to work if web content is delivered by other protocols – FTP, shared file systems, email, instant messaging, and so forth. HTTP as it has evolved has severe difficulties, and designing a Web that only works with HTTP as it is currently implemented and deployed would be unfortunate. We should work harder to reduce the dependencies and isolate them.

HTML is the ‘lingua franca’, the common language that all agents are currently expected to be able to produce, process, read, and interpret (or at least a well-defined subset of it). Having a common language is important for interoperability, but the web should also work for other formats – extensions to HTML including scripting and DOM APIs, but also other formats and application environments such as XHTML, Java, PDF, Flash, Silverlight, XForms, 3D objects, SVG, other XML languages, and so forth. Certainly HTML as it has evolved is overly complex for the purposes for which it is designed.

The URI is the fundamental element of reference, but the URI itself is evolving to deal with internationalization, reference to session state, IRIs, LEIRIs, HREFs, and so forth. Many applications use URIs and IRIs – not just the formats described above, but other protocols and locations, including databases, directories, messaging, archiving, peer-to-peer sharing, and so forth.

The web is just one of many communication applications on the global Internet; for web browsing to integrate well with the rest of distributed networking, web components should be independent of the application and work well with messaging, instant messaging, news feeds, and so forth.

A sign of a breakdown of this architectural principle would be for a specification of a format (say HTML) to attempt to redefine, for its purposes, the protocol (say HTTP) or the method of reference (URI). The specifications should be independent, or at least their dependencies isolated, minimized, reduced. If those other elements of the web architecture are incorrect, need to evolve to meet current practice, or have flaws in their definitions, they need to evolve independently, so that orthogonality of the specifications and reusability of the components are promoted.

There may well be reasons to link some features of HTML to the fact that it is delivered over an interactive protocol, but linking HTML directly to HTTP in such a way that features would work only for HTTP and not for any other protocol with similar capabilities would be unfortunate. It might not matter in the short term (that’s all we have right now), but it is harmful to the long-term evolution of the web.

(Should go without saying, but just in case: this is a personal post, not reviewed by the TAG)

Language semantics and operational meaning

W3C and other standards organizations are in the business of defining languages — conventions that organizations can choose to follow — and not in mandating operational behavior — telling organizations and participants in the network how they are supposed to behave. Organizations (implementors, operators, administrators, software developers) are free to choose which standards they adopt, and what their operational behavior will be.

In some posts on the www-tag mailing list, I was trying to point out the risks in defining languages such that the “meaning” of the language depends on operational behavior. In some ways, of course, this is a fallacy: in general, what an utterance “means” in some operational way depends on what the speaker intends and how the listener will interpret the utterance.

However, as an organization, W3C can, and should, define languages whose meaning is specified in the defining document, in terms of abstractions rather than operational behavior. The result is more robust standards: standards that have wider applicability, that can be used for more purposes, and that create a more vibrant and extensible web.

Search Engines take on Structured Data

Structured data on the web got a boost this week, with Google’s announcement of Rich Snippets and Rich Snippets in Custom Search. Structured data at such a large scale raises at least three issues:

  1. Syntax
  2. Vocabulary
  3. Policy

Google’s documentation shows support for both microformats and RDFa. It follows the hReview microformat syntax with small vocabulary changes (name vs fn). Support for RDFa syntax, in theory, means support for vocabularies that anyone makes; but in practice, Google is starting with a clean slate: data-vocabulary.org. That’s a place to start, though it doesn’t provide synergy with anyone who uses FOAF or Dublin Core or the like to share their data.
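
For a flavor of what this looks like, here is a rough sketch of a review marked up in RDFa against data-vocabulary.org (the property names and namespace URI follow my reading of the announcement and may not match Google’s documentation exactly):

<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
  <span property="v:itemreviewed">Example Widget</span>
  reviewed by <span property="v:reviewer">Jane Doe</span>:
  rating <span property="v:rating">4.5</span> out of 5.
</div>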

The policy questions are perhaps the most difficult. Structured data is a pointy instrument; if anyone can say anything about anything, surely the system will be gamed and defrauded. Google’s rollout is one step at a time, starting with some trusted sites and an application process to get your site added. The O’Reilly interview with Guha and Hansson is an interesting look at where they hope to go after this first step; if you’re curious about how this fits in to HTML standards, see Sam Ruby’s microdata.

While issues remain – there are syntactic i’s to dot and t’s to cross, and even larger policy issues to work out – between Google’s rollout, Yahoo’s SearchMonkey, and the UK Central Office of Information rollout, it seems that the industry is ready to take on the challenges of using structured data in search engines.

Data interchange problems come in all sizes

I had a pretty small data interchange problem the other day: I just wanted to archive some playlists that I had compiled using various music player daemon (mpd) clients. The mpd server stores playlists as simple m3u files, i.e. line-oriented files with a path to the media file on each line. But that’s too fragile for archive and interchange purposes.
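
For instance, a playlist is little more than lines like these (the paths are hypothetical):

music/John Denver/Poems, Prayers And Promises.flac
music/Trout Fishing In America/Back When I Could Fly.ogg

Move or rename the media files and the playlist silently breaks; nothing in the file identifies the tracks independently of their current location.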
I had a similar problem a while back with iTunes playlists. In that episode, I chose hAudio, an HTML dialect in progress in the microformats community, as my target.

Unfortunately, hAudio changed out from under me between when I
started and when I finished. So this time, a simple search found the
music ontology and I tried it
with RDFa, which
lets you use any RDF vocabulary in HTML*.
I’m mostly pleased with the results:

  1. from A Song’s Best Friend_ The Very Best Of John Denver [Disc 1] by John Denver: Poems, Prayers And Promises
  2. from WOW Worship (orange) by Compilations: Did you Feel the Mountains Tremble
  3. from Family Music Party by Trout Fishing In America: Back When I Could Fly

The album names come before the track names because I didn’t read enough of the RDFa primer when I was coding; RDFa includes @rev as well as @rel for reversing subject/object order.
See an advogato episode on m3uin.py for details about the code.
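
For reference, the markup for one list item comes out roughly like this (a hand-simplified sketch; the Music Ontology terms are from my reading of its spec):

<div xmlns:mo="http://purl.org/ontology/mo/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     typeof="mo:Record">
  from <span property="dc:title">Family Music Party</span>
  <div rel="mo:track">
    <div typeof="mo:Track">
      <span property="dc:title">Back When I Could Fly</span>
    </div>
  </div>
</div>

Had I rooted the markup at the track and used @rev="mo:track" to point back at the record, the track title could have come first in document order while producing the same triples.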

The Music Ontology was developed by a handful of people who
staked out a claim in URI space
(http://musicontology.org/...) and happily took comments from
as big a review community as they could manage, but they had no
obligation to get a really global consensus. The microformats process
is intended to reach a global consensus so that staking out a claim in
URI space is superfluous; it works well given certain initial
conditions about how common the problem is and availability of pre-web
designs to draw from. Perhaps playlists (and media syndication, as
hAudio seems to be expanding in scope to hMedia) will eventually reach
these conditions, but the music ontology already meets my needs, since
I’m the sort who doesn’t mind declaring my data vocabulary with URIs.

My view of Web architecture is shaped by episodes such as this
one. While giga-scale deployment is always impressive and definitely
something we should design for, small scale deployment is just as
important. The Web spread, initially, not because of global phenomena
such as Wikipedia and Facebook but because you didn’t need
your manager’s permission to try it out; you didn’t even
need a domain name; you could just run it on your LAN
or even on just one machine with no server at all.

In an Oct 2008 tech plenary session on web architecture, Henri Sivonen said:

I see the Web as the public Web that people can access. The resources you can navigate publicly. I define Web as the information space accessible to the public via a browser. If a mobile operator operates behind walls, this is not part of the Web.

I can’t say that I agree with that perspective. I’m no great fan of
walled gardens either, but freedom means freedom to do things we don’t
like as well as freedom to do things we do like. And architecture and
policy should have a sort of church-and-state separation between
them.

Plus, data interchange happens not just at planetary scale, but
also within mobile devices, across devices, and across communities
and enterprises of all shapes and sizes.

* I’ve gone a little outside the scope of current standards; RDFa has only been specified for use in modular XHTML, with the application/xhtml+xml media type, so far.



Once more into Versioning — this time with HTML

The W3C TAG has worked on the general issue of “versioning” for many years, and many TAG members may be worn out on the issue.
However, undeterred by past history, I’m taking another run at it, this time trying to look specifically at the issues around versioning of HTML, CSS, JavaScript and other parts of the standard web browser landscape.
Part of what’s new (I think) is looking at the cost/benefits around deployment. See the www-tag mailing list archive for the HTML and versioning threads.