W3C

– DRAFT –
wpt-coverage

30 October 2020

Attendees

Present
dom, florian, Hexcles, jes_daigle, jgraham, leobalter, muodov, smcgruer_[EST]
Regrets
-
Chair
-
Scribe
smcgruer_[EST]

Meeting minutes

dom: Session is not being recorded as it is mostly discussion

leobalter: This is a follow-up discussion from TPAC sessions last week

leobalter: Topic is what coverage means for WPT, how we measure it, and how to reach out to new contributors

leobalter: I have a lot of background with test262, extracting information from tests and specs
… first suggestion is to generate test plans, listing the observable parts
… no-one wants to add complexity; the goal is to create something that helps new contributors instead
… test plans help guide new contributors, and help increase coverage
… in ecmascript, I always did manual test plans, not automated. For the specs we are talking about here it may be challenging.
… that is all the slides I have, so open the floor to discussion

<Zakim> dom, you wanted to share experience tracking coverage in WebRTC spec

dom: wanted to share my experience with webrtc wg, tracking coverage in WPT for it
… you can enable a 'test annotation' toggle in the webrtc spec UI, spec will be annotated with green and pink sections
… green sections have associated WPTs, pink (or red) do not
… based on heuristics, so not perfect, but can be updated manually to fix up problems
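[For illustration, a minimal sketch of what the manual side of such per-section annotation can look like, assuming ReSpec's data-tests attribute; the section content and test filename are illustrative, and the green/pink heuristic overlay dom showed is specific to the webrtc spec, per his note below:]

    <section data-tests="RTCPeerConnection-addTrack.https.html">
      <h3>addTrack()</h3>
      <!-- normative prose; the spec UI can render the listed tests next to
           this section, and flag sections that list none -->
    </section>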

<Jemma> where can I find current status/spec of WPT? Is there any url?

leobalter: is this based on tests pointing to spec?

dom: no, spec points to tests via respec

<jgraham> Jemma: https://github.com/web-platform-tests/wpt and https://web-platform-tests.org

dom: I am also involved in reffy, which crawls specs and can identify items from them (e.g. idl definitions, which are exported to WPT)

<Hexcles> also wpt.fyi for test results

dom: as part of reffy, we have some very early results on simplifying cross-linking between tests and specs
… currently it goes in both directions, but it is cumbersome and there is no way to automatically update it. Hoping to discuss later this year.

leobalter: This seems very useful. I want to contribute tests to WPT, something like that helps guide me (albeit I would likely turn it into a test plan)

florian: In CSSWG, we have done something similar to what dom has shown
… we have metadata in the tests, pointing back to sections of the specification
… we have been fairly thorough about having this metadata
… often point to multiple sections or specs, when testing interaction
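[For illustration, a minimal sketch of the per-test metadata florian describes, using WPT's link rel="help" / meta name="assert" conventions; the spec URLs, assertion text and reference filename are illustrative:]

    <!DOCTYPE html>
    <meta charset="utf-8">
    <title>CSS Text: soft wrap opportunity around ideographic space</title>
    <!-- metadata pointing back at the spec section(s) under test;
         a single test may point to multiple sections or specs -->
    <link rel="help" href="https://drafts.csswg.org/css-text-3/#line-breaking">
    <link rel="help" href="https://drafts.csswg.org/css-text-3/#white-space-property">
    <meta name="assert" content="A soft wrap opportunity exists around an ideographic space.">
    <!-- for a reftest, the expected rendering -->
    <link rel="match" href="reference/soft-wrap-001-ref.html">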

florian: in the other direction, pointing from spec to test, we use bikeshed which also allows this
… it does not have the heuristics, only manual annotation, but it does let you 'watch' a directory in wpt and warn you if new tests appear that aren't listed
… maintenance is somewhat cumbersome, but does give a good sense of coverage
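[For illustration, a hedged sketch of the spec-to-test direction in Bikeshed, assuming its WPT Path Prefix metadata and <wpt> blocks; the title, path and test filenames are illustrative:]

    <pre class=metadata>
    Title: CSS Text Module Level 3
    WPT Path Prefix: css/css-text/
    </pre>

    Normative prose about line breaking goes here.

    <wpt>
    line-breaking/line-break-normal-001.html
    line-breaking/line-break-strict-001.html
    </wpt>

Bikeshed can then warn when files under the path prefix are not referenced from any <wpt> block, which would correspond to the 'watch a directory' behaviour florian mentions.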

florian: UX-wise, bikeshed does not let you toggle the test view on/off dynamically (you have to rebuild the spec), but there is an open issue to fix that

<dom> [for clarity, the heuristics bits I showed in the webrtc spec have been added to the WebRTC spec only, not in ReSpec as a whole https://github.com/w3c/webrtc-pc/blob/master/webrtc.js#L1 ]

florian: My experience with coverage: binary coverage is good for detecting no coverage at all, but it is hard to determine how much coverage is enough once there is some coverage

<fantasai> +1

<dom> +1 to Florian on evaluating detailed coverage needing more precision

florian: maybe we should go with yellow when it has some tests, and require a human to mark it green

<fantasai> +1

florian: For getting specs to a better state, I do create manual test plans. Right now doing it on CSS text level 3
… a thousand tests or so, so tedious (but getting there!)

florian: Clarification - the tedious part is the double-tracking of metadata, having to update them, having to have PRs reviewed, etc
… not the tests themselves!

leobalter: Sounds like a big part of the pain is having a lot of manual work

leobalter: The problem of having many tests per paragraph is also interesting; it shows the difference between human language and tests
… think we should try to have these lists, but try to automate them

<fantasai> I think also just listing tests isn't enough to understand coverage, you have to understand what cases the test is covering. For example, we had tests for border-radius clipping content, but we didn't have any for clipping replaced elements.

jgraham: I wanted to ask how this works for test262.
… the ecmascript spec seems to use a very formal and explicit style, which likely means tooling is easier to write
… whereas other web-platform specs tend to be more diverse in style
… not always formal, different editors make different decisions
… heuristics might end up having to be spec-specific?
… which is a lot of work.
… But maybe I've misunderstood, would be interested in how test262 works

leobalter: Actually, test262 does not have any sort of automation around their annotation
… even with 5 years of experience in test262, no way to do it yet! But in discussion with current editors to formalize more of the ecmascript spec to make it easier to do.
… so far all test plans were manual work

leobalter: So overall, I would say that ecmascript is far behind what the rest of the web-platform has

<Zakim> jgraham, you wanted to react to jgraham

jgraham: When you wrote the test plan for test262, was the goal to write tests for ???
… for web-platform, we usually ask browser engineers to write the feature and the test
… so it's different from when you have a dedicated QA team who can take the necessary time to write a dedicated test plan, etc

leobalter: Can you clarify the question?

jgraham: Was your role when working in test262 QA-specific (writing tests), or also developing features?

leobalter: Mostly QA, I was a test262 maintainer. Using my time to facilitate others writing tests.
… at tc39, whoever champions a proposal has to write tests, but people aren't ???
… test262 is also slightly more formal than WPT as it has required metadata/etc

florian: Jumping in; WPT has metadata but its optional
… I think the fact that WPT is mostly feature-implementor written, causes this difference
… from the CSSWG, we used to write tests before browsers were even sharing their tests, so it was a QA-driven effort outside the browsers and so metadata was part of the culture
… once browsers joined in, CSS kept its metadata culture
… this didn't hold for the rest of WPT, where much of it is more browser-driven and has far less metadata

leobalter: Grain of salt, but feels like browser engineers don't want to add metadata?

jgraham: I stand by not requiring metadata
… getting browser vendors to contribute at all requires reducing the friction to writing tests
… even now we have vendors with a substantial fraction of tests not shared due to friction even without metadata required

<fantasai> Browser engineers have learned to comment their functions to explain to future engineers what it's supposed to do. They should also be able to comment their tests for the same reason.

jgraham: even early in css/ tests, people didn't want to deal with the metadata when upstreaming large numbers of tests

<fantasai> Sure, some really simple tests are self-explanatory. But many aren't.

florian: Going back to annotating specs, I would welcome some automation but think we should be careful about which parts are most useful

<fantasai> If a test is to be maintainable, you have to understand what it's trying to test. Then if behavior changes in either the test or the infrastructure that sets it up, you can adjust the test without losing coverage.

<fantasai> You can't do that if you don't understand what it's covering.

florian: reasonably easy to write heuristics to check if you have tests for things like idl blocks, css property defs, etc

<fantasai> And for a lot of tests, it isn't obvious.

florian: but if we write this, we may cause people to write tests for syntax not behavior
… concerned that these syntax tests may lead to bad choices, e.g. due to PR pressure
… as far as I know not happening a whole lot right now, but has happened in the past
… want to avoid writing shallow tests

<Zakim> dom, you wanted to discuss tooling ideas and to share feedback I've heard from would-be external contributors (re the other topic) and to mention a heuristic I've used for behaviors (which could use better formalization)

dom: Agree with florians last point on shallow tests
… the heuristic for the webrtc spec covers algorithmic steps, not just idl
… but no formal shared markup for algorithmic content; could be a space where work with reffy could help

dom: One idea I have for webrtc is that when you do a PR, it should tell you "oh this section has this test, please look at that test".
… not sure if there are active conversations around such tooling?

dom: Want to discuss onboarding new contributors to WPT as well
… this is a topic which has received some strongly worded feedback so lets make sure we discuss

dom: One feedback we've heard several times (in particular one contributor) is that with the new automated submission from browser vendors (which has been positive), it feels harder as an outsider to contribute
… There's a large queue of PRs that don't get approved rapidly, which creates a two-tier system
… feels unwelcoming to new contributors
… want to share that feedback from a motivated person

leobalter: I've tried to mentor people to contribute to test262
… most questions are 'what should I test'
… so I think it comes back to coverage as well, in terms of a shared test-plan
… think that helps (a) avoid shallow tests, and (b) helps new contributors

leobalter: Browser vendors tend to have a better understanding of the depth of a feature, because they did the implementation. New contributors tend to need some guidance to avoid writing shallow tests.
… figuring out the set of tests to write takes a lot of time, and I'm used to it

<fantasai> On the other hand, browser vendors often don't write the obvious tests so sometimes there's entire "shallow" areas that are completely untested... you need both

leobalter: identifying gaps is one of the best ways to get people involved

<fantasai> Like, we have tons of tests on box-shadow parsing, but had hardly any on rendering

jgraham: Agree with florian's point about the backlog of PRs; it is difficult to get people to review them, and it's nobody's job to review them

<Hexcles> Also a long queue of "missing-coverage" issues: https://github.com/web-platform-tests/wpt/labels/type%3Amissing-coverage

jgraham: Need to find an incentive structure
… for browser vendors, this is 'avoid web compat issues' or 'improve the platform', but note that until we made it easy, they still didn't do it.
… so we made it possible inside their incentive structure; we put it inside their existing systems, where they are incentivized to contribute

<fantasai> +1

jgraham: You can argue we've tried hard on the PR review problem; we have tooling to assign people, we try to get it into their workflow, etc., but with limited success.
… if we really want to make progress, how do we make it so there exists an incentive for people to actually review PRs?
… One problem - there's no core WPT team for the *tests*. Every spec has its own set of experts.
… I can't review a PR for a css spec, because I don't have the expertise. The CSS reviewer probably can't review a PR for HTML or DOM.
… so need to bring a lot of people onboard to make the situation good
… most of those experts are already paid to work on a browser engine, so they have their incentives
… so far our attempts are basically 'send emails saying this PR is assigned to you' and people ignore them

jgraham: I don't want to defend the existing situation, but if we want to make it better we need to have a plausible story about why it's going to work this time

dom: To be clear, I appreciate this is a challenging issue

dom: Need to align our recommendation with our ability to have people review
… leo's point was that having test-plans would help, I agree
… but we also need to realize that we'd better have a good story on how to make contributions meaningful. Maybe associate test plans with named people who welcome the contributors.
… one thing that Mike and I discussed was to have each WG have an onboarding person assigned to it
… I don't know if that would be enough, onboarding skills might not mean reviewing expertise
… but generally having someone to smooth the challenges may recreate incentive structure

<dom> [to be clear, my proposal would not be that the onboarding person would do the reviews - they would ensure the reviews get done by their fellow group participants]

leobalter: If someone has the skills to guide new contributors, it seems like they should be writing tests instead. Reviewing PRs from new contributors can demand extra time to get done.
… so much to explain to contributor

<dom> [agree that writing good tests is hard - if we don't feel that's something newcomers can meaningfully contribute to, we should also be clear about it :) ]

leobalter: have to be careful about causing burnout in people reviewing (and contributors)

<Zakim> jgraham, you wanted to react to jgraham

[scribing doesn't leave much time for this, but my question would still be - do we *specifically* know who the people are we hope to review PRs? My suspicion is that it's mostly people paid by browser vendors, and those browser vendors have... made their decision? Would we need to push for a cultural shift in the browser vendors to achieve this?]

<fantasai> I learned to write tests by creating simplified testcases for specific bugs. Maybe that's easier than starting from a spec and trying to fix coverage?

jgraham: Think over the last decade, industry moved to a culture where engineers write the tests, and they find doing test plans extra work that they aren't rewarded for.
… even if you believe it's in their long-term interests, they don't see that
… doesn't mean I don't think we should do it, but we need to align interests

florian: Think there's a number of things we can do
… friction of using WPT is reduced but far from zero
… CI takes too long, fails too often
… documentation is far from perfect

florian: But when it comes to inviting newcomers, disheartening to have people get through this friction and write the PR, but then have nobody review it.
… so people just don't come back after their first contribution
… sadly I think the onboarding person isn't going to work - I know everyone in CSSWG and yet I still couldn't get reviewers
… can we just get W3C budget to fund someone to review test PRs?
… but money doesn't fall from trees :)

<Zakim> jgraham, you wanted to ask smcgruer_[EST]'s question

<Zakim> jgraham, you wanted to react to jgraham

jgraham: Asking question that smcgruer_[EST] posed, but he's scribing so asking for him
… do we specifically know who the people are we want to review PRs? Seems mostly people who are paid by browser vendors, who seem to have decided not to do this. Should we change internal structure or is there another pool of people we're looking at?

florian: Not all spec editors work for a browser vendor, but many do
… and nearly all spec editors seem uninterested, whether they are paid by browser vendors or not

jgraham: Quick follow-up; spec editors are experts, implementors are experts, and then we're out of experts

r12a: Apologize for arriving late, but want to raise a question: do we need to review tests? Could we just accept the tests?
… one of three things happens: the test is ok, the test gives a false-positive (maybe people would come back and spot this), or the test doesn't work properly and delivers a false result
… for this last one, the browser developer would triage this and see the test is bad
… just an idea, throwing it out there :)
… since we have an unmovable problem in terms of getting folks to review

jgraham: I have previously advocated for this eventual consistency

<dom> [in fact, I've found a surprising number of broken tests in the WebRTC test suite that no implementer flagged as problematic for one reason or another]

jgraham: think it has some place, but if you have a high fraction of broken tests, folks get grumpy and lose faith in the test suite

[Note: on the Chromium side already, we have teams pushing back against WPT saying there are too many bad tests - in their view]

[Which of course they think other people always added ;)]

<Zakim> florian, you wanted to respond to r12a

florian: There are different types of failures even within the categories you listed
… tests that always pass - annoying but ultimately ok
… but also tests that misunderstand the spec: they work, but work wrong, and don't test the spec
… so people may fix their implementation to match the test, not the spec
… not sure if review today actually catches this, but feels like a danger

<dom> [I wonder if there is a distinction to be made between new tests that fail current browsers vs those that don't (which the CI already identifies, although not very explicitly)]

florian: Note that a few folks in CSSWG today, myself included, have the power to merge tests without review

jgraham: I want to make clear - we are merging *many* PRs. They're just mostly (60%) coming from browser vendors
… so to me this is not existential, it's just that we're not doing as well as we could be for new contributors

<Zakim> dom, you wanted to invest in dashboard / monitoring

dom: Do we know if this is a uniform problem across all specs, or for specific specs?
… if it is specific specs, maybe we can do something about those specs specifically
… generally want a better understanding of the queue

<Hexcles> Maybe we can merge "unreviewed" PRs automatically, but label them as such either in the filename or the content, and surface them through the manifest.

jgraham: I know smcgruer_[EST] has some overall statistics, but not per-spec
… also we're two minutes over time

leobalter: I've learned a lot here about the bigger picture, which is great

<fantasai> A good test: passes when it's supposed to pass, fails when it's supposed to fail, and tests the thing it thinks it's testing.

<rtoyg_m2> +1

leobalter: think there's challenges, there's work to do, but also things we can do and make progress
… hope we can formalize a consistent format for test plans in the future
… but baby-steps first

<Hexcles> dom: we used to have pullpanda, which is being shut down, but even that didn't allow us to filter out e.g. auto-exported PRs, which is something we really need when analyzing reviews

dom: Ok, thanks everyone. Thanks leo for organizing, good discussion. Follow-ups to happen in the #testing channel, the public test infra mailing list, and WPT issues

Minutes manually created (not a transcript), formatted by scribe.perl version 124 (Wed Oct 28 18:08:33 2020 UTC).
