Web Platform Tests, Day 1, TPAC 2019 -- 16 Sep 2019

<jgraham> RRSAgent: Make logs public

<ato> Is it "present+ <nick>"?

<ato> Komehyo

<ato> Uh, https://www.w3.org/2002/03/RRSAgent

<jgraham> RRSAgent: make minutes

<jgraham> https://blog.mozilla.org/opendesign/firefox-the-evolution-of-a-brand/

<BitBot> (wpt) [PR] moz-wptsync-bot requested #19070 merge into master: [Gecko Bug 1315892] text-orientation: upright' forces used 'direction' to LTR. - https://git.io/JeOvA

<JohnJansen> jgraham: that's how you do it.

<zcorpan> https://bocoup.com/blog/how-to-scribe-at-tpac

<Hexcles> RRSAgent: halp!

<foolip> scribenick: foolip

Intro from Luke Bjerring

lukebjerring: we have an Agenda in a doc, will work on that after first break
... first status updates: https://docs.google.com/presentation/d/10pP5UdurCE3_5YMk6ds8ksz5GIj_Bxixfi9GJP2LOkw/edit?usp=sharing
... the dents in the Safari graph are infrastructure issues that affected results

jgraham: big change for Firefox is rewrite of encodings

<ato> RRSAgent: make minutes

lukebjerring: browser-specific failures and passes are interesting for interoperability

jugglinmike1: yes!
... meet.google.com/xyo-wzse-nss
... I thought you'd be out for the day

example of is:different: https://wpt.fyi/results/?label=master&label=experimental&aligned&q=is%3Adifferent

<MikeSmith> jugglinmike1, I can webrtc you in

example of the `all` query: https://wpt.fyi/results/?label=master&label=experimental&aligned&q=all%28status%3Aerror%29
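
The two example links above differ only in the `q` parameter. A minimal sketch of assembling such a link, assuming the parameter layout seen in those URLs (`label`, optional `product`, a bare `aligned` flag, then `q`); the helper name `results_url` is made up for illustration and is not part of wpt.fyi:

```python
from urllib.parse import quote

WPT_FYI_RESULTS = "https://wpt.fyi/results/"


def results_url(query, labels=("master", "experimental"), products=(), aligned=True):
    """Build a wpt.fyi results link for a structured search query."""
    parts = ["label=" + quote(label, safe="") for label in labels]
    parts += ["product=" + quote(product, safe="") for product in products]
    if aligned:
        # The links in these minutes pass "aligned" as a bare flag.
        parts.append("aligned")
    # Percent-encode everything in the query, including ":" and parentheses.
    parts.append("q=" + quote(query, safe=""))
    return WPT_FYI_RESULTS + "?" + "&".join(parts)


print(results_url("is:different"))
print(results_url("all(status:error)"))
```

Passing `products=("chrome", "edge")` restricts the comparison to those two browsers.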

<kevers> present

Live demo from Robert Ma

Hexcles: I'll do a live demo now!

<ato> Title: Web Platform Tests, Day 1, TPAC 2019

<ato> RRSAgent: make minutes

<ato> RRSAgent: agenda?

jgraham: Taskcluster is going to split into separate instances for Gecko CI and rest of world, will give us new URLs and new UI
... hope to do Android x86 emulator on Taskcluster by end of year
... want to run Chrome and Firefox. capacity is much more limited than on desktop
... will probably start with daily runs

foolip: also trying to get WebKitGTK running

jgraham: trying to add decision task for Taskcluster, would allow us to schedule dependent tasks
... for example, we can avoid getting different versions of Firefox
... or could have a build job

JohnJansen: Edge is now a Chromium browser
... have looked to take advantage of wpt.fyi. 99.9% results are the same as Chrome, so differences are interesting
... we have found bugs (regressions) where Edge is different to Chrome
... we aim to be 24 hours behind Chrome
... but we also have different features that can cause tests to break. usually turns into blocking bugs
... really appreciate the interop view
... when will Edge stable be released? we're bug driven, not date driven
... beta felt very stable to me, very impressed by quality of Chromium out of the box.
... wpt.fyi has helped us immensely. interesting to see the pass rate increase so much, now differences are really important

https://wpt.fyi/results/?label=master&label=experimental&product=chrome&product=edge&aligned&q=is%3Adifferent is a view that might be helpful

<jugglinmike1> foolip: I can talk docs

<MikeSmith> agenda: https://docs.google.com/document/d/1_d2xUBgNn6nmiIXM6m9zSLjzYhvDS4LMrmgoxMQJKuU/edit#

<JohnJansen> https://wpt.fyi/interop/IndexedDB/idbobjectstore_createIndex15-autoincrement.htm?label=master&label=experimental&product=chrome&product=edge&aligned&diff&filter=ADC

jugglinmike1: we've been working to improve the docs over the past year
... we've switched from GitHub Pages (Jekyll) to Sphinx, a Python project

<yigu> See Mike's screen for the figure here: https://meet.google.com/xyo-wzse-nss

<zcorpan> example search result https://web-platform-tests.org/search.html?q=assert_throws&check_keywords=yes&area=default

jgraham: thanks for improving the docs, it's a big improvement compared to what we had

<jugglinmike1> Docs: "2019 WPT Documentation Improvements" https://docs.google.com/document/d/16KJbWVRtIjZQX80CQhOvwG2RHlIqVM1ADQTG1Q-QdlE/edit

jgraham: Another update. We've now moved from Travis to GitHub Actions. Somewhat mixed results, some spurious failures. Mostly things that require a github token, because the secrets handling is easier

JohnJansen: some differences we see between Chrome and Edge might be because we run on Windows too.

lukebjerring: let's break for food

jgraham: let's be back here at 10:45

<Hexcles> (i.e. in 30min)

<JohnJansen> Meeting: Web-Platform-Tests TPAC 2019

<JohnJansen> back. intros...

JohnJansen: if you add `(edge:!pass&edge:!ok)` to the diff query you can see the stuff that's more likely a problem

<jorydotcom> +present Jory Burson, Bocoup

<jorydotcom> @jugglinmike1 sleep tight :D

<jorydotcom> lol

<gsnedders> so who have we agreed CSS WG joint meeting with on the CSS WG side?

We're now doing Agenda smithing

<BitBot> (wpt) [PR] chromium-wpt-export-bot requested #19071 merge into master: Port two webkit-xxx-interpolation.html to wpt/ - https://git.io/JeOJr

<BitBot> (wpt) [PR] chromium-wpt-export-bot requested #19072 merge into master: Delete three webkit-xxx-interpolation.html - https://git.io/JeOJK

<gsnedders> because https://wiki.csswg.org/planning/tpac-2019#tuesday doesn't have any joint meeting with us there?

<MikeSmith> https://www.w3.org/wiki/TPAC/2019/SessionIdeas

<JohnJansen> https://w3c.github.io/tpac-breakouts/sessions.html

<scribe> scribenick: foolip

Review of 2019 priorities from last TPAC starting

https://docs.google.com/document/d/1UE2KB7gvaEw5gvp4aAQNS9TrFYmnzgDzh5Sr4LQqaQo/edit?usp=sharing

<zcorpan> scribenick: zcorpan

zcorpan: documentation has been worked on

lukebjerring: i like the improved docs!

AutomatedTester: improve debugging, this has to do with reftests
... improving tooling about getting debugging info out

jgraham: duplicate with the later bullet point

JohnJansen: last year i couldn't figure out how to debug a test in python

ato: multiprocess debugging in python is fundamental limitation

JohnJansen: nobody knows how to debug a test running through wptrunner?

jgraham: in wdspec test?

JohnJansen: yes

jgraham: i don't know how to do that

foolip: printf()

jgraham: dunno if something spins up a new python process

gsnedders: yes

jgraham: maybe pause before the process starts and get the process id

JohnJansen: reftests have clearly improved in wpt.fyi

ato: pytest subprocess
... command line flag to something something

jgraham: the wdspec case is really one process, other things is waiting on it
... if waiting time is indefinite, which i think we can, then it could work

ato: also webdriver related timeouts you need to tweak

jgraham: for other test types we try to do that
... disabling timeouts when running gdb debugger

ato: we've set timeout multiplier to something very high
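
The ideas above (report the process id, then wait without a deadline, or scale timeouts by a multiplier) can be sketched as follows. This is not wptrunner code; `effective_timeout` and `announce_pid` are hypothetical helpers that only illustrate the approach being discussed:

```python
import os
import sys


def effective_timeout(base_seconds, multiplier=1.0, debugging=False):
    """Return the timeout to apply, or None to wait indefinitely."""
    if debugging:
        # While a debugger is attached, no deadline should fire.
        return None
    return base_seconds * multiplier


def announce_pid(stream=sys.stderr):
    """Report the current process id so a debugger can be attached to it."""
    pid = os.getpid()
    stream.write("test process pid: %d (attach a debugger now)\n" % pid)
    stream.flush()
    return pid


print(effective_timeout(10, multiplier=3))
print(effective_timeout(10, debugging=True))
```

A runner following this sketch would call `announce_pid()` before starting the test and pass the result of `effective_timeout(...)` to whatever wait primitive it uses.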

lukebjerring: feature request?

JohnJansen: i'm logging one

foolip: async_cleanup is major thing that happened
... how many have used reftests?

(show of hands)

<BitBot> (wpt) [issue] thejohnjansen opened #19073: We need a way to attach a debugger to the executing test when run from wptrunner - https://git.io/JeOUk

gsnedders: "has the assignee reviewed" is a metric we can check

<JohnJansen> I logged an issue for debugging tests: https://github.com/web-platform-tests/wpt/issues/19073

foolip: i set a filter to prio PRs where i'm assigned

<Hexcles> https://github-health.appspot.com/

lukebjerring: for people who use github-health this helps

jgraham: interesting to see if it's working for ppl who are not on infrastructure

<ato> What Mozilla uses: https://addons.mozilla.org/en-US/firefox/addon/myqonly/

foolip: i look at things i'm assigned to

zcorpan: me too, or reassign

<ato> Shows a notification in your browser for GitHub items you are assigned to.

foolip: ideas for wpt-pr-bot?

jgraham: improved a bit, but hasn't shifted a lot

gsnedders: how many of the 250 PRs from 2017 are still open?
... we need people who work on the web features in question to review them
... we've tried for years

foolip: teaching people how to do email filtering works

<foolip> zcorpan: we know that new contributors don't have a good time

<AutomatedTester> scribenick: foolip

zcorpan: their PRs are stuck for years
... do we want contributions from people who don't already work on browser engines?
... if the answer is no we shouldn't pretend that we accept PRs, but if we do we should allocate time to do the review

jgraham: the problem is who are the we who should do the reviews?
... ownership is fine-grained and we don't have a way to enforce that people do their job

zcorpan: we can explore incentives to do reviews, like celebrate those who do the most reviews

<jorydotcom> +1

<zcorpan> jgraham: it might be not worthwhile in the moment, but in the long run it may be

<zcorpan> jgraham: choice between possible impact over a long period vs definite impact over short period

<zcorpan> ato: are we sure that a stale review has been considered

<zcorpan> ato: if nobody has looked at it, how do we know if it's important

<zcorpan> miketaylr: has anyone retriaged?

<zcorpan> miketaylr: might be 50 out of the 440 that are amazing

<zcorpan> jgraham: i think we classify things in terms of their status (waiting for review, waiting for OP)

<zcorpan> jgraham: don't classify by importance, size

<zcorpan> miketaylr: if i say to other person at moz that 50 tests are important, it can be prioritized

<zcorpan> jgraham: if we can sort things by stuff that show browser issues

<zcorpan> jgraham: that is valuable

<zcorpan> foolip: come back to this?

<zcorpan> jgraham: yes

<zcorpan> jgraham: next bullet point: beginners onboarding

<zcorpan> foolip: more to say than docs?

<zcorpan> jgraham: mdn survey

<zcorpan> foolip: i think that survey can be helpful to prioritize our work

<zcorpan> jgraham: "make CI more robust"

<zcorpan> foolip: i think we did

<zcorpan> foolip: reliability question, can talk about PRs in next session

<zcorpan> foolip: running more, getting more complete results

<zcorpan> jgraham: problems with safari

<BitBot> (wpt) [PR] chromium-wpt-export-bot closed #18999: Revert "Reland "Started implementing the STAPIT algorithm"" - https://git.io/Jemrw

<zcorpan> foolip: github actions acting up, problems for PRs

<zcorpan> foolip: PR results and full run are better now than 1 year ago

<zcorpan> jgraham: "give web platform engineers the tools they need to prio"

<zcorpan> jgraham: is done or in progress

<zcorpan> gsnedders: don't have bug links

<zcorpan> jgraham: so that's in progress

<zcorpan> lukebjerring: every bug that i triage that has a crbug is fixed or explanation why it won't be in the short term

<zcorpan> lukebjerring: made some progress there

<zcorpan> foolip: would you encourage others to do the same?

<zcorpan> lukebjerring: yes

lukebjerring: https://wpt.fyi/insights

<zcorpan> lukebjerring: double down on what is an important failure, not yet really solved

<Hexcles> lukebjerring: RFC https://github.com/web-platform-tests/rfcs/pull/16 will move the needle more significantly

<zcorpan> jgraham: we've looked at firefox-only failures

<zcorpan> jgraham: triage that but resistance is always "ok these things fail but are they relevant to compat?"

<zcorpan> jgraham: we know if we fix compat bugs they often end up with new tests

<zcorpan> jgraham: don't have the reverse: does fixing a test fix web compat

<zcorpan> jgraham: data exists that building up internal understanding of when it's important

<zcorpan> jgraham: things that are different between firefox and safari removes some objections

<zcorpan> jgraham: or chrome and safari. firefox-specific failures

<zcorpan> jgraham: have better ways to tell if a failure is important, is valuable to us

<zcorpan> jgraham: we still don't understand how a given test failure impacts web compat

<zcorpan> JohnJansen: if we reduce a site bug and fix that, then track back to which tests now pass

<zcorpan> JohnJansen: but not reverse

<zcorpan> JohnJansen: interaction between features also

<zcorpan> jgraham: we could test that, but don't test all possible intersections

<zcorpan> jgraham: reducing things that go into the suite

<zcorpan> foolip: beyond harness errors, can we talk about flaky tests?

<zcorpan> jgraham: infra has improved

<zcorpan> jgraham: some ideas that we haven't followed up

<zcorpan> jgraham: if we can get to a situation where a library works around a browser bug, if they also file a browser bug

<zcorpan> jgraham: would be impactful

<zcorpan> foolip: triage metadata and bug linking, if there's labeling and start counts

<zcorpan> foolip: test to spec linking. bikeshed and respec have something!

<zcorpan> jgraham: progress, but relevant people aren't in the room

<zcorpan> jgraham: test coverage

<zcorpan> Hexcles: in blink we have coverage comparison between wpt and legacy layout tests

<zcorpan> Hexcles: results are pretty good overall

<zcorpan> lukebjerring: 10% difference between running all tests (wpt+layout) and only running wpt

<zcorpan> lukebjerring: identify low hanging fruit to upstream tests to wpt

<Hexcles> https://storage.googleapis.com/blink-wpt-coverage/201812/index.html

<zcorpan> Hexcles: we can easily measure coverage data

<zcorpan> Hexcles: i think there's effort to collect data on ongoing basis

<zcorpan> JohnJansen: region coverage?

<zcorpan> JohnJansen: what does it mean?

<zcorpan> gsnedders: basic blocks

<zcorpan> jgraham: for gecko we have a coverage metric

Tooltip is "Region coverage is the percentage of code regions which have been executed at least once. A code region may span multiple lines (e.g in a large function body with no control flow). However, it's also possible for a single line to contain multiple code regions (e.g in 'return x || y && z')."
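
As a worked illustration of that definition: region coverage is the number of regions hit at least once divided by the total number of regions. A minimal sketch, assuming the coverage tool exposes one execution counter per region (the input format here is an assumption, not the format of any particular tool):

```python
def region_coverage(execution_counts):
    """Percentage of code regions executed at least once.

    execution_counts holds one hit counter per code region.
    """
    if not execution_counts:
        return 0.0
    executed = sum(1 for count in execution_counts if count > 0)
    return 100.0 * executed / len(execution_counts)


# Four regions, two of which were never executed.
print(region_coverage([1, 0, 3, 0]))
```

How often a region ran does not matter; only whether it ran at all.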

<zcorpan> jgraham: daily coverage runs maybe

<zcorpan> jgraham: can see per testsuite

<ato> RRSAgent: make minutes, please

<zcorpan> jgraham: the UI doesn't yet display diffs between testsuites

<zcorpan> jgraham: I have some of that data for gecko

<ato> ScribeNick: zcorpan

jgraham: some bits are better covered by mochitests, like gamepad
... some areas where wpt has better coverage

<ato> RRSAgent: make minutes

Hexcles: one thing stood out
... accessibility is poorly tested in wpt

gsnedders: not possible to test

jgraham: coverage is a way to identify places where wpt is weak
... might not always be fixable (like maybe GC)
... gamepad or accessibility should be testable

foolip: that's test automation
... has coverage improved?

MikeSmith: how many tests do we have that are using testdriver.js
... cases manual tests are converted
... incentive to groups to automate manual tests
... awareness, people don't know this is available

foolip: 400 files testdriver.js

jgraham: automating stuff that we can't currently automate... some success

ato: in terms of testability extensions of webdriver api
... we've seen other specs e.g. permissions, write extensions
... but no implementations

foolip: generate test report has been implemented

jgraham: gecko hasn't implemented that yet

foolip: ~200 files testdriver.js a year ago
... pointer events, painful?

NavidZ_: let anyone add switch to testdriver protocol
... gives user activation to the page
... nobody can do that
... one way to ask the test author to click or something
... question is, how much of this do we want to expose to the testers
... not testing parts of the browser
... some apis already that can expose that
... if we want to go down that path, do it all the way
... on windows, behaves differently than on linux or mac
... chromium tests only, test can choose based on platform
... do we want to expose that in testdriver?
... focus next: can't get that without user interaction. difference between platforms
... adding automation for specific things, expose more and more of inner workings of the browser

foolip: do we want to add API for create user interaction defined in html?

ato: big discussion in browser testing and tools

<std-lunch>

<BitBot> (wpt) [PR] dirkschulze requested #19074 merge into master: Add smfr as reviewer for CSS Transforms - https://git.io/JeOTN

<BitBot> (wpt) [PR] dirkschulze merged #19074 into master: Add smfr as reviewer for CSS Transforms - https://git.io/JeOTN

RRSAgent: make minutes

<BitBot> (wpt) [PR] moz-wptsync-bot requested #19075 merge into master: [Gecko Bug 1579993] Add WPT subgrid tests and a few regular Grid baseline alignment tests. - https://git.io/JeOkz

participants & their position to help with scribing https://docs.google.com/spreadsheets/d/1cqPK6ze2OCLsho4twJHNLZUPktfejIiiDlMwv0TaZBg/edit#gid=0

<ato> RRSAgent: make minutes, please

<foolip> Just added https://www.w3.org/wiki/TPAC/2019/SessionIdeas#web-platform-tests_update_.26_discussion

<ato> ScribeNick: ato

Infra: making full use of the test results on PRs

foolip: PR checks are a bit noisy, so at the moment you want to ignore them. Fourteen in total.
... Flaky tests also. Should we do something about them?

jgraham: Let's talk about what is there already.

<foolip> Example: https://github.com/web-platform-tests/wpt/pull/19067

<BitBot> (wpt) [PR] dirkschulze merged #11169 into master: Remove tests for SVG transform with CSS syntax - part 1 - https://git.io/JeOk6

foolip: The Azure pipeline is noisy, just produces results for Safari.
... This is a request for the Azure team, to make this less noisy.

JohnJansen: You'd like the report to be merged into one? All the pipelines into one?

gsnedders: In GitHub checks it makes sense for them to be separate?

foolip: Talking to Microsoft about this makes sense I think.

jgraham: With TaskCluster you have to click through to get the specific URL for the job.
... This is not natural to everyone.
... I think switching that to checks will make it noisy, but may make it easier to figure out what's failing.
... A decision task would help also, because it would just run the tasks that are relevant.

<JohnJansen> ACTION: JohnJansen follow up with Azure Pipelines team for this

jgraham: Rather than the tasks we have decided to run for you.

[talk about priorities]

jgraham: With a decision task [on Taskcluster] it will run only the things that are dependencies, such as lints when source file changes, but essentially just the test jobs related.
... We've run up quite hard against the limits of GitHub.
... Checks vs. non-blocking checks
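
The decision-task idea (schedule only the jobs a change can affect) can be sketched as a mapping from changed paths to jobs. The patterns and job names below are hypothetical and are not the actual Taskcluster configuration:

```python
from fnmatch import fnmatch

# Hypothetical rules: a glob over changed paths and the jobs it triggers.
RULES = [
    ("tools/*", {"lint", "tools-unittest"}),
    ("*.py", {"lint"}),
    ("css/*", {"wpt-chrome-css", "wpt-firefox-css"}),
]


def schedule(changed_paths):
    """Return the sorted job names relevant to the changed files."""
    jobs = set()
    for path in changed_paths:
        for pattern, rule_jobs in RULES:
            if fnmatch(path, pattern):
                jobs |= rule_jobs
    return sorted(jobs)


print(schedule(["css/css-grid/example.html"]))
print(schedule(["tools/wpt/run.py"]))
```

A change that matches no rule schedules nothing, which is the point of letting a decision task choose instead of always running a fixed set.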

foolip: Final part, wpt.fyi.
... Sometimes there are more wpt.fyi non-blocking checks, due to deployment etc.
... Could we merge this into a single check somehow?

lukebjerring: If it's desirable we could reuse the same name and aggregate the information.
... Pending results will tie into these, so we will have pending checks whilst the result processor is still ongoing.

foolip: Sometimes I see the same problem in all three.

jgraham: I've never clicked on these that have neutral status.
... So I don't think it’s an effective signal to give users.
... Being able to look at the Firefox results specifically has been useful, but I don't think that the neutral results provide any value.

foolip: How do we indicate this in a better way? Make it fail?

jgraham: Failing with a button to "un-fail" it might be OK, but confusing.
... Maybe submit an issue on the code, somewhere?

foolip: Review comment?

jgraham: "This test appears to be erroring in this browser, if that’s fine you should dismiss this review and accept the PR."

<foolip> https://developer.github.com/v3/checks/runs/ has screenshot at top of possible outcomes

lukebjerring: If you click on the details for the wpt.fyi job, there’s a recompute button right now, and for the case that is neutral which I would hope to upgrade to blocking, it would add a comment saying which user clicked the button.
... "Luke marked this as passing using the ignore button"

foolip: There are a few different options, a red triangle and provide text.
... I suppose it wouldn’t be blocking then. We could decide not to make it blocking, I mean.

jgraham: I’m happy to experiment with stuff here, but I’m very cynical about how much people writing these tests are going to care to dig into the issues.
... At least initially people will be whining because their PRs are stuck and don’t bother investigating.

foolip: Low-frequency but serious solution we’re willing to experiment with?

jgraham: No pass everywhere.

foolip: Let's go with "consistent error" everywhere.
... A lot of tests that Chrome adds are harness errors in Safari.

jgraham: Maybe it should be more cynical about things that come from browser syncs.
... A test for an API that only exists in Chrome might be problematic.

<Hexcles> ACTION: lukebjerring Hexcles: explore setting wpt.fyi check results to failures when e.g. tests error everywhere

lukebjerring: It has to previously been failing and now passing at the moment, and I think it would be more useful to make it blocking than neutral.
... We have merged PRs that failed tests, but they were neutral in the PR and didn't check. We had to unroll a big list of changes.

jgraham: There needs to be some mechanism whereby the author has to say that “this is what I intended”.
... Pass going to fail happens often. The test could’ve been wrong.
... The test might’ve been passing in all browsers, but is now always failing.
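
One possible heuristic along these lines, written as a sketch only: treat a check as failing when a test goes from passing to not passing in any browser, or when it errors in every browser. The function and its input shape are assumptions for illustration; the returned strings are conclusion values used by GitHub check runs:

```python
PASSING = {"PASS", "OK"}


def check_conclusion(before, after):
    """Pick a check conclusion for one test.

    before/after map browser name -> status on the base and on the PR.
    """
    regressed = [
        browser
        for browser, status in after.items()
        if before.get(browser) in PASSING and status not in PASSING
    ]
    errors_everywhere = bool(after) and all(
        status == "ERROR" for status in after.values()
    )
    return "failure" if regressed or errors_everywhere else "success"


print(check_conclusion({"chrome": "PASS"}, {"chrome": "FAIL"}))
print(check_conclusion({}, {"chrome": "ERROR", "safari": "ERROR"}))
```

As noted in the discussion, an intended pass-to-fail change (a test that was wrong) would still need a way for the author to confirm that the change is deliberate.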

lukebjerring: If you fix a test that was passing incorrectly and becomes failing, you’ll be blocking because the statistics will be wrong [?]

gsnedders: It’s not the job failing within the Chromium CI system.

jgraham: People are less invested in getting their change landed in Chromium than they are in WPT.
... Or is your CL blocked until it lands in WPT?

lukebjerring: That’s a discussion we want to have today.
... We want to propagate that into the original CL before it lands, before the export happens. This reduces human manual intervention.

jgraham: [explains Mozilla process]

lukebjerring: Forcing someone to click an extra button is worth it balanced against the pain it puts on us later.
... Because of the privilege that WPT is given, it is important for WPT not to land into a bad state.

jgraham: Occasionally we land something that fails all our tests.
... Coming up with good heuristics is hard, but I agree it’s a problem.
... The problem is that this is a system platform developers don’t interact with often.

ato: Should upstream PRs be scrutinised harder?

jgraham: Maybe that case should have different heuristics.

CalebRouleau: Whitelist of things that is supposed to be passing?
... This would be a change in the code.

jgraham: The expectation metadata could be put in WPT perhaps.
... But it would meet a lot of resistance.

Hexcles: I agree.

jgraham: It would be like for the infra tests. You would have to go and update the expectations when you mean for something to change.
... It would work, because that’s how we do it for Gecko.

CalebRouleau: The proposal was _whitelist_, not expectation data for all.

lukebjerring: You’re going to end up exhaustively listing the metadata anyway, and you’ll want this per-browser basis.

jgraham: It would also increase the workload on authors because they would have to update the metadata also for other web browsers.

lukebjerring: Someone can submit something to the codebase and suddenly everything fails, without any warning.

jgraham: Human intervention needed at some level.
... Different rules for different directories?

ato: It’s not unprecedented that tests are wrong and we make them go from pass to fail expectedly.

jgraham: [explains a recent case]

<Hexcles> foolip: if you are heavily involved you can become a reviewer or even codeowner on github

lukebjerring: I think it’s reasonable to ask people to explain why they are making a test go from pass to fail.
... It’s easy to demonstrate why we would impose a blocking check for it that people would understand.

jgraham: I would need to see statistics to see that it’s usually problematic [?]
... Another case is where someone adds a test that passes only in a single browser, but times out or fails in all others.

lukebjerring: Do we make these a failing action on GitHub and allow it to be ignored?
... And when a regression happens, they are propagated into the Firefox CI?

jgraham: We’re starting up a project so that we can hopefully surface this stuff as feedback in the code review, as opposed to finding out after a change lands.
... I think we can surface this stuff to developers earlier and get them to look at it more readily.
... It’s unclear what the best mechanism for the feedback in WPT on GH should be.
... If you choose to ignore you have to give a reason I think.
... With checks you can't maybe.

foolip: The GH checks are made for this, they have a desired outcome and you have to give a reason.
... That flow is sort of built for this.

jgraham: I’m unconvinced they have got the UI right.

foolip: Well we haven’t tried it out.

jgraham: OK, maybe we should try the checks thing first, otherwise try the code review thing.

Hexcles: We could have different rules for different directories also.

jgraham: If we could initially roll it out for directories with developers we know are engaged.

foolip: If we see the flow in a PR, maybe this is going to be easier to assess.

lukebjerring: I do a deliberate regression PR to wpt.fyi for this

RRSAgent: make minutes

<Hexcles> ACTION: foolip to write an RFC for making regressions detected by wpt.fyi require actions

foolip: Understanding the flakiness is super-hard.
... We could make the logs less verbose, but then we’d have to increase it to find out what’s wrong.

jgraham: We have a log handler that picks out the things it thinks is important.

lukebjerring: Custom interpretation jobs is what wpt.fyi is doing by definition.

<Hexcles> RRSAgent make minutes

lukebjerring: Instead of designing TaskCluster to have custom log interpretation, we should do this in wpt.fyi.

jgraham: But if we had this in TaskCluster, we would use this consistently also for other things. It could produce an artifact we could reuse elsewhere.

<Hexcles> ACTION: Hexcles: switch Taskcluster to GitHub Checks

jgraham: [explains TaskCluster]
... Regarding flaky tests, recently the expected test status metadata at Mozilla gained support for multiple test statuses.
... For example, this test can either pass or fail.
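
The effect of allowing several expected statuses can be sketched as a membership test: a result is unexpected only when it matches none of the known statuses. This is an illustration, not the Gecko metadata format or wptrunner's implementation:

```python
def is_unexpected(actual, expected):
    """Return True when `actual` matches none of the expected statuses.

    `expected` is either a single status string or a list of statuses.
    """
    known = {expected} if isinstance(expected, str) else set(expected)
    return actual not in known


# A test marked as "either pass or fail" tolerates both outcomes.
print(is_unexpected("FAIL", ["PASS", "FAIL"]))
print(is_unexpected("TIMEOUT", ["PASS", "FAIL"]))
```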

lukebjerring: What is the main reason you’re against having expectation data upstream in WPT?

jgraham: On WPT we’re running things mostly on one platform.
... On Gecko we’re running on a vast number of platforms, so the expectation data is platform-specific.
... This could cause double work because you would have to care for other browsers’ expectation data in addition to your own.

Hexcles: There's a more fundamental problem: browser versions matter.
... If you have an upstream expectation that applies to a specific version, it will be even more difficult.

lukebjerring: There's an implicit flakiness if you have multiple acceptable statuses, and having a totally separate place to say something is flaky seems kind of bad because it duplicates information.

jgraham: If you had the WPT flakiness data in tree you could build some interesting tooling.
... "You’ve marked this test as not-flaky in Chrome, but it’s still flaky elsewhere."

foolip: Flakiness on master vs. flakiness on PRs?

lukebjerring: We run a cron job that looks at the last ten runs and checks for flaky tests and lets some person know about it.
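
A check of that kind can be sketched as: within the most recent runs, flag any test that produced more than one distinct status. The data shape (one dict of test name to status per run) is an assumption made for this example and is not the wpt.fyi implementation:

```python
def flaky_tests(runs, window=10):
    """Return tests with more than one distinct status in the last `window` runs.

    `runs` is a list of dicts, oldest first, mapping test name -> status.
    """
    seen = {}
    for run in runs[-window:]:
        for test, status in run.items():
            seen.setdefault(test, set()).add(status)
    return sorted(test for test, statuses in seen.items() if len(statuses) > 1)


history = [
    {"a.html": "PASS", "b.html": "PASS"},
    {"a.html": "FAIL", "b.html": "PASS"},
    {"a.html": "PASS", "b.html": "PASS"},
]
print(flaky_tests(history))
```

This only detects flakiness across runs of the same revision or of revisions that did not change the test; a real job would have to tell a deliberate status change apart from an intermittent one.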

jgraham: Can we have a checks page?

form on*

lukebjerring: Recompute, ignore
... You could also have "flag as flaky" which would change the metadata and re-run the computation, say.
... If you have a flaky test on TC and people aren’t bothered looking into it, they are already ignoring this and force merging.
... So it would not make the current situation any worse.

foolip: I don’t think we necessarily have anything we disagree on with regards to flaky tests.

JohnJansen: Explanation on how to deal with flaky tests.

lukebjerring: Documentation for “so you’ve been told your tests are flaky” sounds like a good idea.

Hexcles: A tutorial linked from the GH checks.

foolip: If we have a button that marks as flaky, you can be sure people are going to click it.
... This will eventually make the system useless if overused.

lukebjerring: It’s hard to identify false-positives.

jgraham: If you haven’t seen flakiness in the last month, then we probably don’t care.
... You can remove expectation data when the flakiness goes away.
... In Chromium you run every test on every commit, you can get backed out if the test becomes flaky as a result of the CL.
... In the Gecko case it matters less if the metadata is a little bit inaccurate.
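
The "not seen in the last month, so stop caring" rule can be sketched as pruning entries by the date flakiness was last observed. The record format and the 30-day default are assumptions for illustration:

```python
from datetime import date, timedelta


def prune_flaky(last_seen_flaky, today, max_age_days=30):
    """Keep only tests whose flakiness was observed within the window.

    `last_seen_flaky` maps test name -> date flakiness was last seen.
    """
    cutoff = today - timedelta(days=max_age_days)
    return {
        test: seen for test, seen in last_seen_flaky.items() if seen >= cutoff
    }


records = {"a.html": date(2019, 9, 1), "b.html": date(2019, 7, 1)}
print(prune_flaky(records, today=date(2019, 9, 16)))
```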

Python 3

jgraham: Python Foundation are going to stop maintaining Python 2 sometime next year.
... There is no need for immediate panic because RedHat will continue providing updates for another four years.
... But we should have a plan for migration to Python 3.
... Gecko is starting to move things to Python 3 slowly, and there is increased need for us to have a roadmap for this as well.

Hexcles: There was no conclusion what WebKit is going to do based on the email thread.

jgraham: Previously we had assumed that WebKit was a blocking concern.
... But now it looks like they are switching to Python 3, or possibly going to stop shipping Python altogether.

Hexcles: macOS 10.15
... We should find someone from the WebKit community.

RRSAgent: make minutes

jgraham: We need support for both Python 2 and 3.
... For example making the WPT frontend run in either, then make the commands it despatches to run Python 3.
... That seems to be the way Gecko works.
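
A sketch of that staged approach: the frontend itself stays runnable on either Python version, and commands that have been ported are launched under a Python 3 interpreter. The command set, the `commands.<name>` module layout, and the `python3` executable name are all hypothetical:

```python
import sys

# Hypothetical set of commands that have already been ported to Python 3.
PY3_ONLY_COMMANDS = {"lint"}


def build_command(name, args, python3="python3"):
    """Return the argv used to dispatch a subcommand.

    Ported commands run under `python3`; everything else reuses the
    interpreter that is running the frontend.
    """
    interpreter = python3 if name in PY3_ONLY_COMMANDS else sys.executable
    return [interpreter, "-m", "commands." + name] + list(args)


print(build_command("lint", ["--all"]))
```

The resulting list could be handed to `subprocess.call`; moving a command to Python 3 then only means adding its name to the set.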

Break.

<BitBot> (wpt) [PR] autofoolip requested #19076 merge into master: Update interfaces/IndexedDB.idl - https://git.io/JeOIW

<BitBot> (wpt) [PR] autofoolip requested #19077 merge into master: Update interfaces/gamepad.idl - https://git.io/JeOI8

<BitBot> (wpt) [PR] autofoolip requested #19078 merge into master: Update interfaces/geometry.idl - https://git.io/JeOI4

<BitBot> (wpt) [PR] autofoolip requested #19079 merge into master: Update interfaces/webmidi.idl - https://git.io/JeOIB

<BitBot> (wpt) [PR] autofoolip requested #19080 merge into master: Update interfaces/webrtc-stats.idl - https://git.io/JeOIR

WebXR

RRSAgent: make minutes

<foolip> A testing API exists: https://github.com/immersive-web/webxr-test-api

RRSAgent: this meeting spans midnight

<BitBot> (wpt) [PR] foolip merged #19076 into master: Update interfaces/IndexedDB.idl - https://git.io/JeOIW

RRSAgent: listen

RRSAgent: make minutes

jgraham: How to test things that you can't model to simple interaction.
... WebXR is an example of a spec implemented in more than just Chrome, that the tests won't work in other browsers because they have this Mojo crap in them.

<Manishearth> https://github.com/immersive-web/webxr-test-api/blob/master/explainer.md

jgraham: The question is what is the testing strategy for WebXR.

<BitBot> (wpt) [PR] foolip merged #19078 into master: Update interfaces/geometry.idl - https://git.io/JeOI4

mounir: There is a testing API in Chrome.
... The backend of that in Chrome is using Mojo.
... So this is not directly exposed to test code.
... The solution we have is to have an internal API or something.

Manishearth: There was a testing API proposal that was out of date, and no one implemented it.

<BitBot> (wpt) [PR] foolip merged #19080 into master: Update interfaces/webrtc-stats.idl - https://git.io/JeOIR

Manishearth: There were WPT that had a utils folder that did include specific things, but the tests were written on a shared API.
... It was backed by some Mojo stuff.
... We implemented a new API for testing that has a backend in Chrome.
... So that's why there's still Mojo in there.
... We're able to run the tests just fine [in Firefox] because we have a native implementation of the API.

jgraham: I heard an expectation that Mojo had to load, and the tests would fail irrespective of what the browser did.

Manishearth: I've certainly been able to run the tests without Mojo on Servo with success.
... In Servo it's a regular WebIDL that we expose with a pref when needed.

foolip: We haven't figured out how to get this to work in Chrome on regular Chrome build.

jgraham: Do the WebXR people have any needs from us?

mounir: We are challenged about where to put the Mojo bits [?]

jgraham: If you want us to install a magic extension we could facilitate that.

Hexcles: This seems like a very Chrome specific problem.

<BitBot> (14wpt) [PR] chromium-wpt-export-bot requested 13#19081 merge into 07master: [LayoutNG] Allow overflow-/word-wrap to work with keep-all - https://git.io/JeOIp

Hexcles: We're discussing archiving and fetching Mojo for testing.

jgraham: If you want to set a pref, put a file on the filesystem, or install an extension we could do that.
... For Gecko tests you can use internal APIs by loading a web extension first.

Hexcles: There are some challenges lining up the right Mojo version with the right Chrome version.
... We would need to map it on a revision by revision basis.

jgraham: It sounds like there are no fundamental WPT issues here.

lukebjerring: We build the chromedriver binary [unsigned?] in [some Google system].

Hexcles: We will need to add some logic to WPT to figure out the URL to fetch Mojo from.
... I suppose there's no objection to that.

jgraham: We're happy to do browser specific stuff for some tests.

mounir: Why don't you guys use a content shell for testing?

jgraham: The question is how representative it is of the user experience.

Hexcles: Internally we're moving towards running the full browser.

jgraham: It used to be the case that you couldn't run chromedriver against content shell.

Hexcles: It's supposed to work.

JohnChen: I can't speak to whether it works today, but it's meant to.

ato: The complication with Firefox is that it reads a bunch of prefs at startup time.

Manishearth: Is this the first time such a testing API is implemented?

jgraham: I think it is for something that is tested cross-browser.

Manishearth: It's not that we've made a grave mistake?

jgraham: No.

mounir: What is the timeline for moving to wptrunner for the Chrome infrastructure, so we run full builds?

foolip: I don't know the timeline, but we're working on it and have for some time.

[discussion about how to fetch the latest Chrome]

gsnedders: It would be useful to have Chromium nightly builds running in WPT.

<Hexcles> ^ That's unsigned Chromium, where EME etc. does not work

https://download-chromium.appspot.com/ provides chromium builds

foolip: Just to fix the Mojo problem, there has to be some JS to inject the Mojo files...

jgraham: We could mark certain directories to require the Mojo stuff or something.
... Unfortunately for the prefs stuff, this is encoded in the Mozilla metadata and this is not upstreamed to WPT.
... But there is an argument that this could be upstreamed, because at the moment you will occasionally see differences when running Firefox tests upstream.
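
A minimal sketch of the directory marking jgraham suggests, under stated assumptions: the `MOJO_DIRECTORIES` tuple and the `requires_mojo` helper are hypothetical names, not part of wptrunner, and `webxr` is used only because it is the directory under discussion.

```python
# Hypothetical list of test directories whose tests need the Mojo files injected.
MOJO_DIRECTORIES = ("webxr",)


def requires_mojo(test_path, marked=MOJO_DIRECTORIES):
    """Return True if `test_path` lies inside one of the marked directories."""
    parts = test_path.strip("/").split("/")
    for directory in marked:
        prefix = directory.strip("/").split("/")
        # Compare whole path components so "webxr-extra" does not match "webxr".
        if parts[:len(prefix)] == prefix:
            return True
    return False
```

Matching on path components rather than string prefixes keeps a sibling directory with a similar name from being picked up by accident.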

RRSAgent: make minutes

Python 3

whsieh: Older versions of macOS might not have Python 3 installed.
... It might still be years before we can drop Python 2.

jgraham: We need a strategy for us to move to Python 3 in a finite amount of time.
... But we can't move off Python 2 [yet].
... Does the manifest generation work in 3 now?

gsnedders: It generates a completely different manifest in 2 and 3.
... Is Python 2 going to be maintained past 2024?

jgraham: Gecko is moving away from Python 2, but it's going to be years.
... There are for example also dependencies on wptrunner.

foolip: So wptrunner can support only 3?

jgraham: I think it needs to support 2 and 3 for some time.

foolip: It would be nice if wptrunner keeps working on Python 2 on older Macs.
... So you don't have to download anything special.

jgraham: That has historically been a requirement.

Hexcles: It sounds like this is not an urgent matter.

gsnedders: Until Apple stops shipping Python 2.

jgraham: In Gecko, all new commands have to be Python 3.
... In practice I'm not sure if it matters.
... But there's a push to move to Python 3.
... One first step would be to run the infrastructure tests in both versions.
... We should maybe start writing new code in Python 3?
... For example, require entry-points to be Python 2+3 compatible.
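
A minimal sketch of what a Python 2+3 compatible entry point could look like, assuming nothing about the real wpt commands: `hypothetical-command`, `--verbose`, and the `paths` argument are illustrative names only.

```python
# Runs unchanged on Python 2.7 and Python 3.
from __future__ import print_function

import argparse
import sys


def parse_args(argv):
    # The program name and options here are hypothetical.
    parser = argparse.ArgumentParser(prog="hypothetical-command")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("paths", nargs="*")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(sys.argv[1:] if argv is None else argv)
    for path in args.paths:
        # print() is a function on both versions thanks to the __future__ import.
        print(path)
    if args.verbose:
        print("python %d.%d" % sys.version_info[:2], file=sys.stderr)
    return 0
```

A real entry point would call `sys.exit(main())` under a `__main__` guard. Taking `argv` as a parameter lets the same function run in infrastructure tests on both interpreters.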

Hexcles: There's currently no incentive.
... Every vendor seems to be postponing the migration indefinitely.

jgraham: Not sure that is the case for Mozilla.
... wptrunner is Python 2 only.
... The web server stuff works on 3; the handler scripts might be fine, but no one has checked.

Hexcles: They are definitely not fine!

<BitBot> (14wpt) [PR] chromium-wpt-export-bot requested 13#19082 merge into 07master: [webnfc] Add tests for NFCPushOptions.ignoreRead - https://git.io/JeOL8

jgraham: For new entry-points we could require 3.
... That's not a big ask, because people add these relatively seldom.
... Once we get the manifest generator to generate the same results in 2 and 3, there should be a unit test for the behaviour.
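
A minimal sketch of such a unit test, assuming a toy `{path: test type}` mapping rather than the actual WPT manifest format. `to_text` and `serialize_manifest` are hypothetical helpers. The idea is to normalise bytes to text and fix the key order, then compare against one golden string on both interpreters.

```python
import json


def to_text(value):
    """Decode bytes to text so both Python versions see the same type."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


def serialize_manifest(items):
    """Serialize a {path: test type} mapping deterministically.

    `items` is a stand-in for real manifest data, not the actual WPT format.
    """
    normalized = {to_text(path): to_text(kind) for path, kind in items.items()}
    # sort_keys and fixed separators remove dict-ordering and whitespace differences.
    return json.dumps(normalized, sort_keys=True, separators=(",", ":"))


def test_serialization_is_stable():
    # Mixed bytes and text inputs mimic the string handling that differs between 2 and 3.
    items = {b"dom/b.html": "testharness", "dom/a.html": b"reftest"}
    expected = '{"dom/a.html":"reftest","dom/b.html":"testharness"}'
    assert serialize_manifest(items) == expected
```

Running the same test file under both interpreters, as suggested for the infrastructure tests, would flag any divergence in output.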

Hexcles: Doesn't sound like there's a modular approach.

gsnedders: The manifest code does a lot of string manipulation, but there is less string conversion happening in wptrunner.

jgraham: I've started getting random patches for this.
... Often I'm scared of accepting these because testing is hard.

Hexcles: It's hard to modularise wptrunner.
... Someone needs to spend time to make it work on Python 3, then have integration tests for the Python 3 fixes.

jgraham: One first step would be to get it to import cleanly in Python 3 without SyntaxErrors.
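
A minimal sketch of that first step, assuming a hypothetical helper named `find_syntax_errors`. It only compiles each file on the running interpreter and never executes it, so it catches SyntaxErrors but not failures that appear at import time, such as renamed standard library modules.

```python
import os


def find_syntax_errors(root):
    """Compile every .py file under `root` without executing it.

    Returns a list of (path, line number, message) tuples for files that
    fail to parse on the running interpreter.
    """
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            # Read bytes so the compiler honours any encoding declaration.
            with open(path, "rb") as handle:
                source = handle.read()
            try:
                compile(source, path, "exec")
            except SyntaxError as error:
                failures.append((path, error.lineno, error.msg))
    return failures
```

Run under Python 3 against a source tree, an empty result means every file at least parses. Getting each module to import cleanly would be a separate, later check.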

Hexcles: Do you have an estimate?

jgraham: It's not a small amount of work.

gsnedders: There's a long tail of work.

jgraham: I can imagine some team at Mozilla might get an intern to do this.
... There seems to be some agreement that we need to do this work, and that it's acceptable to stand up tests for the Python 3 behaviour.
... I'm saying there are people working on this, and we should support the people doing work on this. Not that we should do the work right now.
... And maybe in two years macOS might be more of a force to dictate further progress.

MikeSmith: On macOS, homebrew installs Python 3 by default.
... This is a huge hurdle for contributors to WPT.
... Because it overrides the system default Python 2.

<BitBot> (14wpt) [PR] chromium-wpt-export-bot 03merged 13#19081 into 07master: [LayoutNG] Allow overflow-/word-wrap to work with keep-all - https://git.io/JeOIp

<foolip> Doubt in the room of whether this is correct.

gsnedders: The constraint comes from the WebKit community, who are opposed to installing any other software on the system.

MikeSmith: Increasingly there are more and more brew packages relying on Python 3.

[discussion about misguided Linux distributions and how they ship Python]

<gsnedders> MikeSmith: https://docs.brew.sh/Homebrew-and-Python#python-3x-or-python-2x says it shouldn't

gsnedders: The manifest migration is easily a month's work.
... There are performance challenges involved.
... And hard to do without making a complete mess of it.

[technical discussion about type annotations]

RRSAgent: make minutes

RRSAgent: stop


Attendees

Contents

Intro from Luke Bjerring

Live demo from Robert Ma

Review of 2019 priorities from last TPAC starting

Infra: making full use of the test results on PRs

Python 3

WebXR

Python 3

Summary of Action Items

Summary of Resolutions
