
All your spec are belong to us! Irrigating dev resources from specs

Facilitator: Dominique Hazaël-Massieux, François Daoust

Machines read specs too. This breakout session will review existing and possible tools and projects that make use of content automatically extracted from specs.

Slides

Minutes (including discussions that were not audio-recorded)



Transcript

Okay so, once again, welcome to this session titled All Your Specs Belong to Us.

My colleague François and myself, Dominique Hazaël-Massieux, from the W3C staff, thought we would use TPAC as an opportunity to look at how specifications are developed today for the web platform, what ecosystem of tools exists around them, and also what improvements we think would be worthwhile bringing into that ecosystem, both for the benefit of spec editors, implementers and people active in building the web platform, but also for the benefit that can be derived for developers, and hopefully what I mean by that will become clearer as we go through this presentation.

So, where are we coming from?

Why are we even bringing this topic?

If you look at how the Open Web Platform is developed today, you're looking at a very diverse set of communities and groups working together.

More specifically there are at least 26 working groups in W3C that are developing specs meant for web browsers.

At the very least nine community groups but possibly more.

For those of you not familiar with W3C, working groups develop formal Recommendations, whereas community groups do more pre-standardization and incubation work.

So on top of that work happening in W3C, there are 15 work streams happening in the WHATWG, including notably the HTML, DOM and Fetch specifications.

The core of the JavaScript language is developed in Ecma by TC39.

So that's a big piece of what people developing for browsers use on a daily basis, and on top of that you should at least also count the WebGL work done in the Khronos Group, and even that is probably only a subset of all the technologies, standards and specifications that are used to define what developers use on a day-to-day basis for building web pages and applications.

And the diversity is great.

It's a reflection of the diversity of communities and interests and so on that back these technologies but we are working together on a single platform so we cannot really afford for that diversity to have a detrimental effect on the consistency of the platform.

And, I like this so-called Conway's Law, which says any organization that designs a system will produce a design or a structure that's a copy of the organization's communication structure, which basically means that any product or solution you develop will be shaped by the boundaries of the subgroups that developed it.

And I think if you look at the way the web platform gets developed, you do see quite a bit of that law in application, where the fact that this or that technology takes this or that approach, or is built this or that way, is in many cases a reflection of the culture or the background of the group that developed it.

And in general, if we want to keep the consistency high despite that diversity, if you take Conway's Law to heart, it means we need to improve the communications across these groups to ensure that design patterns flow better between the various technologies.

But all these groups work at their own pace, on their own schedules, and under their own rules, so communication basically needs to be asynchronous and either decentralized or distributed.

For anyone who has put in place an asynchronous, distributed communication system, you know that they are hard to get right.

But the good thing is that although tools cannot solve the problem on their own, tools can help make the problem at least a lot more tractable, and that's what this presentation is trying to show, both in terms of what we are already doing in the Open Web Platform space and also highlighting things we could or should be doing to make that overall process smoother, easier and more natural for everyone.

So, I think one clear trend that anyone who has been involved in web standardization over the past decade will have seen is that a lot of inspiration has been taken from the way software is developed and applied to how standards are developed.

The notion of evergreen standards or living standards.

The notion of using GitHub as a way to get contributions from as many people as possible.

The notion that there needs to be a lot more specificity in, for instance, how algorithms get defined.

And, you know, different people will have different opinions on whether all of it was good or not but I think in terms of the structure that it has brought to the platform and its ability to grow into new spaces, there is probably broad consensus that this was a needed development.

And one of the ways we think specs can go further in that direction is by looking in more detail into dependency management.

Again, for better or for worse, dependency management is a pretty hot topic in the space of software development, with lots of packaging systems, registries, and systems to reduce or subset these dependencies.

But, so far in specs land, dependencies have been managed in a very coarse fashion.

Basically saying spec A has a normative dependency on spec B and spec C, but we are talking about specs some of which are several hundred pages long.

And so, saying that a given spec, for instance, has a dependency on the HTML spec, is not really saying much at all.

There must be very few specifications that don't have a dependency on HTML, and you could say, well, that doesn't really matter, but in fact it does, because a dependency in specs, as much as in software, means that if the piece of a spec you're depending on changes, you may need to adapt your own spec, as you would adapt your own software to reflect that change.

And right now this is managed in a very ad hoc fashion.

Sometimes you'll discover that something has changed because a link no longer works, or someone implementing the spec will find that it doesn't really make sense any more.

And, you know, it kind of works but it's certainly not smooth or seamless in any fashion.

And, one of the things we want to review today is that we now have the infrastructure and the tooling to make that a lot more precise and a lot more useful for groups and spec writers, and the model we want to go towards is the one illustrated on the right-hand side of the slide, which shows much more granular cross-dependencies between specs, where it's clear that Spec A depends on this particular definition in Spec B, this particular IDL fragment in Spec B, this and that algorithm in Spec C, and so on and so forth.

It also shows that not everything that is declared in Spec B or in Spec C is available for referencing.

Only a subset of the definitions, whether they're abstract concepts or IDL or anything else, are expected to be referenceable.

In software land, you would say these are the exported interfaces of your modules.
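
To make the picture concrete, here is a minimal sketch, in TypeScript, of what such a fine-grained dependency record could look like. The field names and the sample entries are hypothetical and do not reflect the actual format used by any of the tools discussed below.

```ts
// Hypothetical shape of a fine-grained dependency record between specs.
// Field names and sample data are illustrative only.

interface ExportedDefinition {
  term: string;        // the term being defined, e.g. "event loop"
  type: "dfn" | "interface" | "attribute" | "method" | "algorithm";
  url: string;         // anchor where the definition lives
  exported: boolean;   // only exported definitions may be referenced by other specs
}

interface SpecDependency {
  from: string;        // shortname of the spec that holds the reference
  to: string;          // shortname of the spec being referenced
  terms: string[];     // the specific definitions actually used
}

// Example: instead of "Spec A depends on HTML", record exactly what it uses.
const dependency: SpecDependency = {
  from: "spec-a",
  to: "html",
  terms: ["event loop", "queue a task"],
};

console.log(`${dependency.from} uses ${dependency.terms.length} definitions from ${dependency.to}`);
```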

And, again, the point I'm trying to convey here is that we have a lot of the infrastructure to make this possible today.

There is still some work to make it happen but we do see quite a bit of value in getting there.

And to illustrate that value, I want to shed some light on the ecosystem that exists around this spec development environment.

If you've been involved in writing or contributing to specifications in W3C and in the WHATWG, you will know that most of them are developed using two main authoring tools.

One is Bikeshed, the other is ReSpec.

And because of the many improvements that have occurred in these tools over the past years, we're now able to bring a lot more of this added value to the specification development effort.

Part of it is because, on top of these authoring tools, a number of tracking tools have emerged, which appear on the various bits of this diagram, but, again, if you've been involved in writing specs, you might have heard about Specref as a way to create references to other documents.

You may be less familiar with browser-specs or the W3C API or Shepherd but, again, we'll come back to those.

They play an increasingly important role in making sure this full ecosystem works reliably.

And one particular project that François and I have been working on for the past few years, with increased activity in the past few months, is called Reffy, with a companion project called Webref that exploits Reffy, and Reffy is a tool that is able to go through all of the specifications developed for the web platform, whether they are WHATWG, W3C, WICG or Khronos specs, and soon hopefully TC39.

So it crawls all of these specs and tries to extract as much reusable data as possible out of these specifications.

More specifically, it extracts all of the IDL fragments.

That is, the interface definition language fragments that define Web APIs for JavaScript.

It extracts all of the formal definitions of CSS properties and values, but it also extracts the pieces of the specs that are marked as definitions, with the ability to identify whether they are meant to be exported.

If you recall my earlier discussion about whether something is meant to be reused by another spec or not. And it also extracts which spec references which other specs.
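
The extracted IDL fragments can then be processed with an off-the-shelf WebIDL parser. Here is a minimal sketch using the webidl2.js parser, with a hand-written fragment standing in for what the extraction would produce; the fragment and the interface name are invented for the example.

```ts
import { parse } from "webidl2";

// Hand-written fragment standing in for IDL extracted from a spec.
const idl = `
  [Exposed=Window]
  interface ExampleWidget {
    readonly attribute DOMString label;
    undefined activate();
  };
`;

// webidl2.js turns the fragment into an array of definition objects.
const definitions = parse(idl);

for (const def of definitions) {
  if (def.type === "interface") {
    console.log(`interface ${def.name} with ${def.members.length} members`);
  }
}
```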

And what we've seen emerge out of this data that we extract on a regular basis is another collection of tools, some of which are meant for spec developers but many of which are meant for the broader web developer ecosystem, that exploit this data and make use of it in many fashions.

We'll discuss some of them but I think the key message here is that we don't even need to know about all of them.

We don't even need to understand all of their use.

It's clear that there is a demand from this web developer ecosystem to get access to this data.

And so, we have to make sure going forward that we make these extractions possible, we make them more useful, more powerful, and we see tools emerging out of the data we provide.

(mumbles) So I mentioned a number of tools and so I just wanted to give a quick overview of the tools that are already available today or are emerging.

I mentioned Specref, which is this tool developed by Tobie Langel, which enables specs to reference each other very easily without having to keep track of which version they're at, which used to be the hard part.

Again, if you've contributed to W3C specs, you might recognize this double square bracket syntax that can be used to do this automatic referencing.

And Specref gets automatic updates for the most part.

There are still parts that require manual updates, particularly around community groups, but this has clearly grown into a fundamental piece of how W3C specs, and WHATWG specs for that matter, keep track of each other.
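
Beyond spec authoring, Specref also exposes its bibliographic database over HTTP, so other tools can look up references programmatically. A small sketch of such a lookup, assuming the public api.specref.org endpoint and treating the exact response shape loosely:

```ts
// Minimal sketch: query Specref's public API for bibliographic entries.
// The endpoint is the one documented by the Specref project; the response
// shape is handled loosely here and may carry more fields in practice.
async function lookupRefs(refs: string[]): Promise<void> {
  const url = `https://api.specref.org/bibrefs?refs=${refs.join(",")}`;
  const response = await fetch(url);
  const entries: Record<string, { title?: string; href?: string }> = await response.json();

  for (const [id, entry] of Object.entries(entries)) {
    console.log(`[[${id}]] -> ${entry.title ?? "(no title)"} <${entry.href ?? "?"}>`);
  }
}

lookupRefs(["FETCH", "WEBIDL"]).catch(console.error);
```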

One piece that has emerged more recently over the last year or so, I guess a bit more now, is the ability to not only reference specifications but as I had alluded to earlier, reference specific definitions made in other specifications.

So right now that is based on this other spec processing tool called Shepherd, done by Peter Linss, but hopefully the work François and I have been doing on Reffy and Webref will complement that with a lot more timely, a lot more up-to-date data.

And again, that compilation of definitions is used in specifications so that, instead of referring to this definition in this specification at this URL, you just use a specific syntax, a syntax that varies depending on whether you're referencing, for instance, an IDL token or a definition.

We've given a couple of examples on this slide.

But once you do that, you basically get all the links done for free instead of having to use links that can break, and the tools will also warn you if you're referencing a definition that doesn't exist or no longer exists, which can help with spotting typos but also with spotting changes in definitions that, for instance, would have been removed.
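
The warning logic boils down to a set lookup. A toy sketch of that check, with made-up data structures rather than the ones Bikeshed, ReSpec or Reffy actually use:

```ts
// Toy reference checker: flag references to definitions that are not exported
// (or that no longer exist). Data structures are made up for illustration.

type DfnIndex = Map<string, Set<string>>; // spec shortname -> exported terms

const exported: DfnIndex = new Map([
  ["html", new Set(["event loop", "queue a task"])],
  ["dom", new Set(["fire an event"])],
]);

interface Reference { spec: string; term: string; }

function checkReferences(refs: Reference[], index: DfnIndex): Reference[] {
  return refs.filter(ref => !index.get(ref.spec)?.has(ref.term));
}

const broken = checkReferences(
  [
    { spec: "html", term: "queue a task" },      // fine
    { spec: "html", term: "queue a microtask" }, // typo or removed definition
  ],
  exported,
);

broken.forEach(ref => console.warn(`No exported definition "${ref.term}" in ${ref.spec}`));
```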

To make it more useful, we will need groups to step up in terms of making sure they export all the definitions they think their specs should expose.

And conversely, that they do not export definitions that they don't want other specs to be able to use.

If, for instance, you think that a definition is too specific to the context of the spec and shouldn't be reused, then that's an important message to convey.

That also means that there may be negotiation between groups in figuring out whether this algorithm or this definition should be exported or not.

But from our experience, these negotiations turn out to be useful coordination points, useful anchors for this asynchronous discussion paradigm that I was alluding to earlier.

And the nice bit about this formal approach to definitions is that it also provides very useful anchors for all the tooling on top of the specs.

You may have seen emerge over the past few months specifications which show, section by section, whether a given feature is implemented and in which browsers. That is done by re-exploiting data collected by the MDN Web Docs project, which is a body of documentation on web technologies and which includes, for as many technologies as possible, manually collected data about whether a given browser supports a given feature at a given version of the browser. So being able to know that a feature is defined at this specific place in the spec allows this kind of automated annotation.

Another, I think, pretty impressive change has occurred over the last year, based on this automatic extraction of data I was referring to.

Because now we are able to get the latest IDL automatically extracted from each and every spec.

These IDL fragments can be automatically brought into the web-platform-tests project, which is this huge test suite used by browsers, both as a regression and a conformance test suite, to ensure that conformance to web platform specs improves over time.

And using a semi-automated process set up by Philip Jägenstedt, we are now able to keep that set of IDL automatically updated and reviewed for testing purposes, which means that any time a spec is being tracked, its IDL gets automatically tested across the latest version of each of the major browser engines.
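
In web-platform-tests, those synchronized IDL files are typically exercised through the idlharness.js helper. The rough sketch below shows what such a test file can look like; the `idl_test` signature is declared here as an assumption about the harness, and the shortnames and objects are a hypothetical example rather than an actual test.

```ts
// Rough sketch of a web-platform-tests "interfaces" test built on idlharness.js.
// idl_test() loads the IDL files synchronized from the specs (by shortname),
// plus the IDL they depend on, and generates one test per interface member.
// Ambient declaration (an assumption) because idlharness.js is loaded by the
// test runner rather than imported.
declare function idl_test(
  srcs: string[],
  deps: string[],
  setup: (idlArray: { add_objects(objects: Record<string, string[]>): void }) => void
): void;

idl_test(
  ["geolocation"],   // spec under test (hypothetical example)
  ["html", "dom"],   // specs it depends on
  idlArray => {
    // Provide live objects so attribute and operation tests run against real instances.
    idlArray.add_objects({ Geolocation: ["navigator.geolocation"] });
  }
);
```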

I mentioned earlier the MDN Browser Compat Data which shows which browser implements which feature of the platform.

And that data is also well positioned to benefit from this automatic extraction we're doing because they need to know which feature exists.

They need to know where it is defined.

They need to know when it evolves, and right now, at least so far, that has been done as a manual process, which, of course, is not ideal, but once you're able to build on this extracted data, then you can at least see the information about which feature needs to be documented, and in the best cases you can even automatically test whether a given browser supports a given feature or not.

Or, if not completely automatically, you can get early results that can be (mumbles).

And that's what the MDN BCD collector project, which again uses the data we extract from specifications, is in the process of doing.
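
The core idea can be illustrated with a tiny sketch: from the name of an interface and a member extracted from the IDL, generate a runtime existence check that a browser can answer automatically. This is a heavy simplification of what the BCD collector does, with hypothetical helper names.

```ts
// Simplified sketch of deriving a support check from extracted IDL names.
// Hypothetical helpers; the real BCD collector generates far more detailed tests.

function supportsInterface(name: string): boolean {
  return typeof (globalThis as Record<string, unknown>)[name] !== "undefined";
}

function supportsMember(interfaceName: string, member: string): boolean {
  const iface = (globalThis as Record<string, unknown>)[interfaceName];
  if (typeof iface !== "function") return false;
  return member in iface.prototype;
}

// Example: names that could come straight from the extracted IDL.
console.log("IntersectionObserver:", supportsInterface("IntersectionObserver"));
console.log("IntersectionObserver.observe:", supportsMember("IntersectionObserver", "observe"));
```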

And you may or may not be aware of MDN Browser Compat Data.

You're very likely to have heard about caniuse.com.

Which nowadays, for a big part of its data, reuses the data exposed in MDN BCD. So again, being able to extract this data and exploit it has very deep repercussions in the developer ecosystem.

I think very few developers don't use either BCD or caniuse on a (mumbles) most likely daily basis to know what they can rely on in their development work.

Another tool that you may have come across if you have been involved in spec writing or spec editing, developed by Kagami Rosylight, is a tool that automatically fixes WebIDL across the specifications when there is a change in the WebIDL definition language itself.

For instance, recently the type that was known as void was renamed to undefined, and given that there are probably two or three hundred specifications that use WebIDL, getting that type of change adopted used to be really difficult, painful and challenging.

And it isn't quite completely seamless yet, but nowadays, thanks to that amazing tool, we can get pull requests submitted across a very large number of the specifications automatically so that they catch up with the latest syntactic change in the WebIDL spec.

And that, again, is based on the list of specs identified by the spec processing tools.
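
As a flavor of that kind of mechanical fix, here is a deliberately naive sketch of the void-to-undefined rename over an IDL fragment held in a string. The real tool works on the spec sources and opens pull requests; this toy version only performs a textual rewrite.

```ts
// Naive illustration of the void -> undefined rename across IDL fragments.
// The real tool parses the spec sources and opens pull requests; this toy
// version only rewrites "void" when it appears as an operation's return type.

function renameVoidReturnTypes(idl: string): string {
  // Match "void" used as the first token of an operation declaration.
  return idl.replace(/^(\s*)void(\s+\w+\s*\()/gm, "$1undefined$2");
}

const before = `
interface ExampleWidget {
  void activate();
  DOMString label();
};
`;

console.log(renameVoidReturnTypes(before));
// -> "undefined activate();" while other return types are left untouched
```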

Another project from the same developer that I thought was interesting to mention: some of you may know TypeScript, which is this variant of JavaScript that uses static type annotations to help developers keep their code correct and clean.

And, because TypeScript needs type definitions, getting these types automatically derived from the WebIDL typing that we develop for web specifications sounds like a really good reuse of that data, and that's what is being explored right now in the pull request shown on the slide: having basically all Web APIs automatically typed using the WebIDL annotations.

And again, that is based on the data automatically extracted by Webref.
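
To give a feel for the mapping, here is a small hand-written example of how a WebIDL interface might translate into a TypeScript declaration, along the lines of what lib.dom.d.ts contains; the IDL input is invented and the generated output is simplified.

```ts
// Hand-written illustration of the WebIDL -> TypeScript mapping.
//
// Given a (made-up) WebIDL fragment such as:
//
//   [Exposed=Window]
//   interface ExampleWidget : EventTarget {
//     readonly attribute DOMString label;
//     undefined activate(optional boolean force = false);
//   };
//
// a generator along the lines of the one discussed above could emit:

interface ExampleWidget extends EventTarget {
  readonly label: string;            // DOMString -> string, readonly preserved
  activate(force?: boolean): void;   // undefined return type -> void, optional argument -> ?
}

declare var ExampleWidget: {
  prototype: ExampleWidget;
  new (): ExampleWidget;
};
```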

And TypeScript, I guess, is a pretty popular variant of JavaScript nowadays.

So again, bringing all this work we are putting into formalizing our APIs and making it available to the broader developer ecosystem is, I think, a really powerful way of increasing the impact of our work significantly.

A smaller tool that I developed a few years ago, again based on this automatically extracted data, is called WebIDLpedia.

Basically, it lets you review all the interfaces, dictionaries and enums defined across the platform.

One reason for that is that, as a spec writer, if you can reuse the naming conventions that have been used in other specs, that's always good.

If you can review them, that makes it much easier.

It can also help with figuring out which specifications are extending a particular interface.

So if you're changing an interface, you can find out through this tool which specs might be affected.
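
That kind of query is easy to express over the extracted IDL. A small sketch, again using webidl2.js, with a made-up map of spec shortnames to IDL fragments standing in for the extracted data:

```ts
import { parse } from "webidl2";

// Made-up map of spec shortname -> extracted IDL, standing in for the real data.
const idlBySpec: Record<string, string> = {
  "base-spec": `[Exposed=Window] interface ExampleWidget { };`,
  "extension-spec": `partial interface ExampleWidget { attribute boolean extended; };`,
};

// Find every spec that extends a given interface through "partial interface".
function specsExtending(interfaceName: string): string[] {
  return Object.entries(idlBySpec)
    .filter(([, idl]) =>
      parse(idl).some(def => def.type === "interface" && def.partial && def.name === interfaceName))
    .map(([spec]) => spec);
}

console.log(specsExtending("ExampleWidget")); // -> ["extension-spec"]
```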

And every so often, we've seen other tools emerge that reuse this data.

For instance, the fact that we're exposing the CSS grammars, or the IDL grammars, has helped with building fuzz testing around the platform, to generate automatic tests to make sure the implementations are either correct or at least resist crashing when the input is not exactly correct.

You can also use it to build validators.
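
As an illustration, here is a toy sketch of turning extracted CSS property definitions into test declarations for a validator or a fuzzer. The data literal loosely mimics the kind of property and value-syntax information that gets extracted; it is not the actual format produced by the extraction tools.

```ts
// Toy generator of CSS test declarations from extracted property definitions.
// The data below loosely mimics extracted property/value syntax info; it is not
// the actual format produced by the extraction tools.

interface CssPropertyInfo {
  name: string;
  sampleValues: string[];   // plausible values derived from the value syntax
  junkValues: string[];     // deliberately invalid values for robustness testing
}

const properties: CssPropertyInfo[] = [
  { name: "gap", sampleValues: ["normal", "1em 2em"], junkValues: ["-37deg", "1em 2em 3em 4em"] },
  { name: "contain", sampleValues: ["layout paint"], junkValues: ["layout layout"] },
];

function* testDeclarations(props: CssPropertyInfo[]): Generator<{ css: string; expectValid: boolean }> {
  for (const prop of props) {
    for (const value of prop.sampleValues) yield { css: `${prop.name}: ${value}`, expectValid: true };
    for (const value of prop.junkValues) yield { css: `${prop.name}: ${value}`, expectValid: false };
  }
}

for (const test of testDeclarations(properties)) {
  console.log(`${test.expectValid ? "valid  " : "invalid"}  ${test.css}`);
}
```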

And again, there are probably many more uses than the one we can think of or that we need to think of.

(chuckles) And this is leading us to, I guess, our plea for those of you that are involved in editing specs or contributing to specifications.

We've seen huge value out of this extraction work.

We think there is a lot more pending value in further formalization of the definitions in specs.

Right now, there are still a lot of ad hoc conventions in specs; not all specs are aware of this distinction between what has to be exported or not, or what has to be marked as a concept versus a different type of definition, because not all specs have done this.

There is also a lot of legacy usage in how you reuse definitions from other specs.

So I guess this is a heads-up, if you are in a group, that we will probably be pushing for upgrades to the specifications in the months to come to clean that situation up, with the goal of providing a lot more visibility into this cross-dependency management I was referring to earlier.

What we want to enable is that whenever, for instance, you plan on removing a definition from your spec, you are able to figure out automatically which specs were using it.

Maybe then negotiate with the spec editors to see what new definitions they should be using.

And that's true for definitions in general, but there are some definitions, for instance formal algorithms, that expect a specific set of inputs and outputs, and right now there is no way to track when an algorithm expects a new output or a different type of input.

And that, we think, we can automate using this more structured approach to definition tracking.
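
A minimal sketch of the reverse lookup this enables: given the anchor of a definition you plan to remove, list the specs whose extracted outgoing links point at it. The data shapes and URLs below are invented for the example.

```ts
// Toy reverse-dependency lookup over extracted cross-references.
// The link data below is invented; the idea is simply to invert "spec -> links"
// into "definition anchor -> specs that reference it".

const outgoingLinks: Record<string, string[]> = {
  "spec-a": ["https://example.org/html/#queue-a-task", "https://example.org/dom/#fire-an-event"],
  "spec-b": ["https://example.org/html/#queue-a-task"],
};

function specsReferencing(anchor: string): string[] {
  return Object.entries(outgoingLinks)
    .filter(([, links]) => links.includes(anchor))
    .map(([spec]) => spec);
}

// Planning to remove "queue a task"? These specs would need to be notified.
console.log(specsReferencing("https://example.org/html/#queue-a-task")); // -> ["spec-a", "spec-b"]
```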

And in general, we think that getting this more fine-grained map of which spec depends on what specific pieces of which other specs can help a lot of the discussions around the maturity of a given spec, in particular in the context of the W3C process, understanding whether a given spec is stable enough to get the Recommendation label.

We think there are also a lot more opportunities around integration with BCD, the Browser Compat Data.

And so, again, bringing more visibility for developers about what's cooking in the platform, what's being implemented, what may not be implemented so quickly.

One way we are looking at making all this data more usable by other tools and software projects is by packaging it.

We've started building NPM packages for the IDL fragments that we extract from specifications.

We are hoping to do the same for CSS definitions shortly as well.
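
A sketch of what consuming such an IDL package could look like. The package name and entry point below are assumptions on my part rather than a confirmed API, so check the package's own documentation before relying on it.

```ts
import { parse } from "webidl2";

// The entry point and return shape of "@webref/idl" are assumptions here
// (a map of spec shortname -> objects exposing the raw IDL text); consult the
// package's documentation for the real API.
async function countInterfaces(): Promise<void> {
  const webrefIdl: any = await import("@webref/idl");
  const files: Record<string, { text(): Promise<string> }> = await webrefIdl.listAll();

  let total = 0;
  for (const file of Object.values(files)) {
    const definitions = parse(await file.text());
    total += definitions.filter(def => def.type === "interface").length;
  }
  console.log(`${total} interface definitions across the platform`);
}

countInterfaces().catch(console.error);
```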

And, as we extract more data, more definitions, more value out of the specifications, we will be trying to package it, again, to help other tools evolve on top of that ecosystem.

I also wanted to take the opportunity to highlight the fact that a lot of these projects are done as spare time projects.

Not all of it, but some of it, and not all the people involved in this are doing it as part of their employment.

So, in particular, I've linked to two Open Collectives if you want to support either SpecInfra, which is behind the Specref tool I was alluding to.

Or ReSpec, which again, is one of the major tools used by many W3C groups.

You may have seen, a few months ago, this xkcd drawing that illustrates the issue with a lot of the open source infrastructure nowadays, where a lot of it may look very stable and robust, but if it all depends on a single person doing the work in their spare time, then it's not quite as robust as it should be.

And so, you could contribute to these Open Collectives, but we're also very much looking for motivated people to join us in building the vision that I sketched earlier, and in working with the various specification writers and the various groups on adopting that approach and bringing their specs up to date with the latest practices.

So if you're interested and willing to help, do absolutely get in touch.

And, to finish this presentation, we also had a few questions that may serve for the discussion part, although we are also happy to take any other remarks or discussion points.

If you're aware of other tools in that ecosystem I was describing, that would be good for us to know; in particular, as we're looking at doing this packaging, understanding the requirements that the packaging needs to follow would be useful.

Is there other data that we should extract from specifications?

Other scenarios that these tools could be helping with that you're aware of?

When we are looking at this fine-grained dependency map, do you have ideas on how you would use it when developing specs?

Sometimes we do get push back on some of the changes we're trying to get groups to adopt.

So, again, if you have feedback on whether the value that we're seeing emerge is indeed worth the effort from groups, that's useful.

The other thing we noted is that, I think, we've seen a lot of really interesting evolutions in this tool ecosystem.

It's not clear, to me at least, that most groups and editors are aware of this evolution, and so we are wondering if you had suggestions on what to do beyond organizing another TPAC breakout session.

spec-prod@w3.org is a meeting place where at least a subset of the editors contribute, but it's not clear that it has that much traction with editors.

And we also have more fine-grained discussion points about what kinds of definitions, what kinds of definition types we should consider, but that's (background noise drowns out speaker).

I'll stop the recording now so that we can move to the organization part.
