W3C

– DRAFT –
Immersive Web WG/CG 2024/March face-to-face Day 2

26 March 2024

Attendees

Present
ada, alcooper, atsushi, bajones, bialpio, Brandel, Brett__1, cabanier, cwilso, etienne, jfernandez, Laszlo_Gombos, Leonard, m-alkalbani, marisha, mblix, Omegahed, yonet
Regrets
-
Chair
ada
Scribe
bialpio, Brandel, etienne, marisha, mblix

Meeting minutes

Progress on the spec (immersive-web/model-element#78)

<yonet> immersive-web/model-element#78

rik: was recently looking at the spec, and it didn't look like there has been much progress lately; any updates?

Brandel: was recently looking at this. We currently have a POC implementation in Safari, but it's somewhat old, about 2 years old. Within the last month or two we've been looking at how to make more concrete, better-informed proposals for implementation

Brandel: Also focused on the standards bodies to move forward on model format support, as well as MaterialX as a way of handling materials for models, to resolve and clarify some open questions

cabanier: should we close many of the old open issues in the model element repo?

Brandel: we could potentially do that, often the backlog of GH issues may not be that relevant

Leonard: for privacy reasons we didn't want to expose an API that provided head tracking data, which is one reason the model element was proposed and created. Still skeptical that we need the model element when alternatives are available

Brandel: stereoscopic display of a 3D model in a privacy-conscious way is a significant advantage, and having the platform take care of rendering has other benefits. This is Apple's perspective.

Leonard: Are there alternatives that can achieve this, such as additional specs? Those could also move faster than OS/browser updates

Brandel: no, I don't think alternatives can match the privacy and platform advantages of rendering on the platform: reflections, stereo, etc. Stereo is a big aspect; I don't see a way JS could do that (without privacy concerns)

Leonard: would like to see the concerns detailed in writing, something to point to as the rationale

bajones: question about formats: where is that discussion happening? There was a recent meeting in the Bay Area of a format organization talking about formats. The meeting saw benefits of both glTF and USDZ in terms of delivery

<Leonard> Note to group from a previous Ada statement: Khronos is looking into procedural textures based on MaterialX

<alcooper> immersive-web/model-element#70

Brandel: an upcoming goal is to figure out interop more, but skeptical of how glTF could be embedded in USDZ, which some have proposed

<Leonard> A lot of investigation of glTF / USD interoperability is being done in the Metaverse Standards Forum.

alcooper: echoing Leonard, can we get more in the spec itself on the privacy reasons for the model element?

High level scene graph manipulation API (immersive-web/model-element#65)

<giorgio> https://matrix.org/blog/2023/06/07/introducing-third-room-tp-2-the-creator-update/

giorgio: there was a web scene graph api proposal, with attempt to have w3c take over

giorgio: wanted to bring up this proposal, and see if it could be a springboard for usage/inspiration when controlling a model element

Brandel: scene graph manipulation makes sense and we definitely want it, but in the context of trying to ship a v1, I suggest we see how far we can get without taking on this more complicated topic, which would be hard to standardize

Brandel: there is room to experiment and play with a lot of possibilities and alternatives on what a scene graph could be with this model element context, rather than trying to standardize at this time

Brandel: we can come back to this later, after figuring out the other model tag issues

Leonard: agree that it can wait, and also that this is likely the forum to handle this discussion and development, versus other venues like the Metaverse Standards Forum etc.

cabanier: doesn't standardizing a scene graph conflict with having a model element that supports formats from glTF to USDZ to others?

Brandel: the formats share enough in what they intrinsically do that a common scene graph would be able to work / be compatible. Converter apps that mostly work, preserving most things, demonstrate this

Brandel: there are a lot of formats, it might be useful to have the UA decide what to support. Don't see any inherent issues

cabanier: why not just do USDZ versus supporting multiple formats?

Brandel: different user agents can support different formats. There will be questions of exactly what formats (or subsets) different UAs and devices can support. E.g. supporting the full USD format might not make sense anytime soon

Brandel: USD has a lot of complicated features; they could be differentially supported

<yonet> Can we all mute notifications please, it's very distracting

nick-niantic: Apple has been implicitly developing opinions on 3D models/formats via their visionOS/RealityKit work. Is there a world in which there is a RealityKit for the web?

Brandel: There is a question of which levels of the stack to standardize on, like MaterialX for material properties for example. We don't want to over-index on an opaque system black box

nick-niantic: the model element proposal has some built-in limitations based on how it's embedded in a page, in a browser, etc. Maybe there's a way we could start from a different point that doesn't have the same implied limitations

Brandel: it's an interesting thought that something could work in a different way, like a web RealityKit

cwilso: the reason model-viewer started with a simple API without a scene graph was to keep it high level. The scene-graph level is harder to pin down. Formats affect the scene graph in a big way

bajones: there are tools that make formats interoperate, so we could get to a good middle ground. A scene graph API would be opinionated, and this is good. It would likely be influenced by whatever format is implemented and supported first

bajones: so there is a concern that we could, even unintentionally, tie too much to the shape of the first format. This is not without benefits for getting something working sooner, but we could be backed into a corner where one file format is essentially, or in part, made the best format to use, and we would like to avoid that

bajones: if everyone just used one format, i.e. USDZ, that would be one thing, but vendors don't necessarily want to support that format; they want to use other formats.

bajones: different platforms have built-in support for formats like USDZ and can make an easy choice to support that, but other platforms don't have that and would like to support something like glTF, which was designed as a transmission format. We can't take for granted that all platforms will even eventually get USDZ support

bajones: there is early discussion about making USD into a more web-centric format. Would like to hear about issues with supporting glTF in USDZ, etc.

Brandel: one issue is licensing, for instance the glTF implementers license

Brandel: on a technical level, glTF is simple but doesn't support some use cases involving variants, something supported in USD, which can do things more conditionally based on platform capabilities and choices

Brandel: we want to support a format that has that capability, versus modifying glTF, for example, to add it.

bajones: Example of conditionality: loading a model in VR would show the built-in lights of the model; loading in AR would want to ignore them and use the real-world (inferred) lighting

<Brandel> https://remedy-entertainment.github.io/USDBook/terminology/LIVRPS.html

Brandel: the format plays a part in that discussion: choosing a format that has layering mechanisms that support those use cases, versus a format like glTF that doesn't, or might not, have this support currently

bajones: want to understand in more detail how much of this must come from the model format, versus what a page itself could manage. It would be great to have examples

Brandel: Sharing one example where a USDZ downloaded from a webpage could itself switch out asset parts dynamically, with the platform making decisions about what to do

bajones: Does the USD format have this conditionality of "if this context do this, else if that context do that"?

Brett: Will the model element ever be able to be moved in front of the webpage in an XR browser context? What other capabilities will be available, like multiple models for one model element?

Brandel: not part of the proposal at this point. It's more a timeline question; the focus is getting out a v1 that resolves core uses

nick-niantic: WebXR is a low-level API, and things like RealityKit and model are higher level. Where is the potential for polyfilling support using (lower-level) WebXR?

Brandel: a higher-level API helps prevent over-indexing on assumptions related to the formats and features of one platform's 3D 'kit' like RealityKit, etc.

Brandel: example of WebGPU: WebGPU benefited from having multiple competing but similar low-level APIs to abstract over. For now it doesn't seem like there are multiple platforms with things like RealityKit, so the fertile ground that would help with that abstraction effort is missing

bajones: WebGPU took 7 years as well, which is a long time, and had some differences in what it was trying to handle for its v1

Leonard: want more information on the glTF license issue, will follow up

Supply an Image-based Light file (immersive-web/model-element#71)

<ada> immersive-web/model-element#71

Brandel: Image Based Lighting is very neat and important. Currently we see Model as a portal on a page. And people want to light it in very specific ways.
… we initially wanted to define this on a "root element" but got push back internally
… still think we should specify this. either as a JS API or ideally in a declarative way.
… there's merit to having a child object inside the tag but don't mind it if we put everything on the top level model element
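
[Illustrative sketch, not from the discussion verbatim: the two declarative shapes mentioned above might look roughly like this. All names are hypothetical, since neither the <model> element nor an IBL attribute is specified yet.]

// Attribute form (hypothetical names throughout).
const model = document.createElement('model');
model.setAttribute('src', 'teapot.usdz');           // example asset URL
model.setAttribute('environmentmap', 'studio.hdr'); // hypothetical IBL attribute
document.body.appendChild(model);

// Child-element form (also hypothetical), roughly:
// <model src="teapot.usdz">
//   <source type="environment" src="studio.hdr">
// </model>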

bajones: I agree that it is neat.

<Leonard> No embed with IBL - no extension in progress

bajones: of all the options you outlined (all fine), I'm wondering if we could embed these in some of the model formats themselves (USD or a glTF extension)
… why should it be separate from the model itself?

Brandel: you might want to use different IBL for the same model
… MaterialX also has the ability to define IBL, but for simplicity it makes sense to specify the IBL separately

bajones: no complaints

Leonard: in the glTF case lighting is part of the environment and not the model, for the same reasons Brandel mentioned
… IBLs are very specific, even changes based on the time of day

Brandel: for USD too, we have Quick Look, and I don't think anybody would want custom baked-in IBLs when showing the object in your environment
… that said the MaterialX route would override this and could be weird (and fun!)

Brandel: bajones said he's less worried about formats here
… and there was talk at TPAC around HDR image formats too
… should be part of the broader conversation

Specify a camera's pivot location (immersive-web/model-element#72)

Brandel: the PoC implementation of model in iOS/macOS has a camera pitch/yaw/scale
… inspired by Quick Look
… but I don't know that it is all that powerful
… so I've been looking at simply providing a transform
… it's not a camera but it is moving around the root element, and gives people the ability to pan or orbit
… it's appealing to try to make a simpler version for simple models, but for anything more complicated the "simple case" implementation would get in the way
… so I suggest that we use a transform (applying to the "stage"), in the context of a spatial display
… and maybe even that we _do not_ provide basic user interactions
… since it might not always be appropriate
… that's my suggestion, wonder what people think

cabanier: if you manipulate the transform, can you move the model out of the viewport of the portal?

Brandel: yes it's functionally moving the content inside of the view, like an SVG viewport
… you can make mistakes
… viewBox (SVG) can be used for incredible things
… while I'm not proposing a SVG-like scenegraph for now, aligning here would make sense and would be future proof
… that's why it's not a camera

cabanier: you also mentioned not having default orbit controls, but it sounds like an important thing most people want

Brandel: but the default case is in conflict with what people would want for more complex models
… and the default orbit controls are easy to port to the transform, a couple of lines of code
… and the simplest case might come at the cost of the more complicated one
… I could be wrong!

cabanier: does USD define where your pivot point is? I think in gLTF you can

Brandel: all 3D files have a system origin, it's up to people to be sensible

cabanier: but that's not the point of pivot for orbit

Brandel: I think most people just use bounding box center etc...
… some people have _logic_ to find the least silly pivot point
… but I would expect people putting assets on a page to know where this point should be
… and without scenegraph inspection we couldn't figure it out

cabanier: I think FBX (Autodesk) lets you define it

Brandel: without jumping back into formats... that might be a useful capability to look at and bring back to other formats
… it sounds neat

bajones: the idea of setting a transform programmatically is a baseline good idea
… it's going to be extremely useful
… but if it's strictly required, then we're back into a strictly imperative proposal
… it removes some of the appeal
… it also brings up a question about scale, and what the default presentation of the object is going to be
… if the object itself doesn't define it, you're probably going to have a pretty strict default, and that might put you _in_ the model
… most people will probably expect to zoom out and fully frame the object, for which you need to determine the scale
… in glTF you have ways to figure it out (Meta had an extensions proposal)
… but without the Scene Graph we're not giving people a way to do that
… that feels like a lot of work to do ahead of time; people might just go back to model-viewer, and you might lose some of the target audience
… no concerns about the capability, it's a powerful thing
… it opens up possibilities without going into SceneGraph API
… note: gLTF can include cameras, but models don't always include one
… other note: people are not always responsible, some models are not well designed
… "it looked good in my tool"
… some of this might be hurdle for people who don't have intimate knowledge of the model itself (think SketchFab)
… they do a lot of processing anyway, but there are situations where you don't have perfect knowledge about the model
… you need a way to toss it to the browser and ask for "reasonable defaults"

Brandel: that's all very reasonable. in a Spatial Context, we still get stereo which is neat and better than an image
… I could be convinced that we need a default behavior, but "interactive" might not be a good name for it
… Quick Look sets the pivot to the bounding box center of the first frame I think, and it's often wrong

bajones: Brandel, in agreement: it's still the least worst guess

Brandel: if we can agree on the default behavior it seems reasonable

bajones: there's still a leap from the default behavior (sometimes wrong) and something slightly more complicated
… it'll require some basic introspection (dimension and location of the model)
… it would be nice if we could expose a model min max to facilitate this, maybe just the first frame min max
… it'll let people pick up from where the default stops
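
[Illustrative sketch of the read-only bounds + transform idea above, in TypeScript. The property names (boundingBoxMin/Max, entityTransform) are hypothetical placeholders, not a specified API.]

const model = document.querySelector('model') as any;  // <model> is not yet standard

const min = model.boundingBoxMin as DOMPointReadOnly;   // hypothetical read-only bounds
const max = model.boundingBoxMax as DOMPointReadOnly;

// Orbit around the bounding-box centre, the "least worst" default pivot.
const center = new DOMPoint(
  (min.x + max.x) / 2,
  (min.y + max.y) / 2,
  (min.z + max.z) / 2,
);

function orbit(yawDegrees: number): void {
  model.entityTransform = new DOMMatrix()               // hypothetical transform property
    .translate(center.x, center.y, center.z)
    .rotate(0, yawDegrees, 0)
    .translate(-center.x, -center.y, -center.z);
}

orbit(30); // pick up from where the default framing stops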

Brandel: there's probably some basic high level scene details that can be shared

bajones: I'm fine with that, and it side steps some of the full Scene Graph API issues. and it's read only

Brandel is at peace with this

Leonard: there was some mention of FBX as a possible file format

Brandel: just as an example of a strategy

Leonard: there is no future to FBX according to AutoDesk

Brett__1: (in person demo), asking about moving model elements via a HTML attribute

marisha: imperatively via JS or via an attribute

?

bajones: since it's read-only you kind of have to use JavaScript
… not sure how much we want to go into CSS things there
… might not be well suited for this environment

Brandel: I'm sympathetic to the idea of using CSS here
… the imperative API will help us figure out the shape of things

ada: in Brett__1's example you would lookup the bounding box then figure out the transform, in JS
… to put the element in view

bajones: but that would be the default behavior too

Brett__1: is this an issue with units?

Brandel: one of the challenges of Model is that it bridges two modalities: the DOM and the model
… currently discussing things "inside of the model world"
… CSS top left might not map directly to this

Brett__1: could I shift the vertices while ingesting the model?

Brandel: we're proposing the model as an opaque format for now

<Leonard> You cannot translate on read because of the need to account for scale and rotation.

Brandel: you would need a parser for this

bajones: could be done as an offline process too
… you would update it before showing it online

Brett__1 (more in-person demo, waving a cookie around)

bajones, ada... discussing CSS units and their weirdness when using a DOMMatrix

bajones: generic 3D environments are typically unitless
… we could use 1 unit = 1 meter like WebXR
… but it wouldn't be a good fit for all models
… so CSS couldn't easily operate in that context
… and percentages are hard, because the universe has no clear bounds

Brett__1: but the box will have dimensions

<eddy_wong> I'd like to raise my hand (sorry first time raising my hand)

bajones: units

bajones: it generally doesn't matter until you bring it in the real space

bajones: Brandel said it well, manipulating this through CSS is probably not the right place to start

Brandel: all about timelines, we have to start simple

bajones: might be a different topic, but talking about the cameras... that's one half of the equation, the other is the projection property of the viewport itself
… any thoughts? probably want a reasonable set of defaults too
… do we want to allow that to be manipulated?

Brandel: the view transforms all the way to ground truth actually leaks where the user is looking
… so we don't want to expose that
… one of the rationale for "not providing a camera" is because the camera is your eyes
… but when not presented spatially it might be useful to have full cameras
… not sure when it is appropriate

bajones: wasn't thinking about the spatial use case, I see how you derive a lot of information from that
… but that won't be what most people use
… so we need a way to line up what I do on a flat page and what I see in a spatial context
… these shouldn't be completely separate
… in the flat page baseline, I need some sense of the field of view
… if it's very narrow I'll have to push the model back a lot, if it's really wide I'll position the model differently
… I also want a vague idea of where the near field is
… I'd like to avoid having a "magic value" of 1m, that might actually be different device to device
… for people to intelligently frame their objects, they need at least to know the defaults
… we need people to have _some_ context of how the viewport will be presented

Brandel: the near clipping plane is the view plane in our implementation (nothing comes out)

bajones: sure, but we can't put the near plane at 0 (because maths)
… so presumably you'll have some small value for the near z, we can decide on

Brandel: but the camera is your eye (so the z clipping plane is not at 0)

bajones: you're thinking spatial only again

Brandel: don't have strong feelings right now about the non-spatial display environment
… and I would personally want to play with FoV etc...
… the web has "real world units"; the window increases in pixel size when you resize it on visionOS too
… but we don't want people to know how big the window is (in real world units)
… so we can probably imply defaults

bajones is going to draw on the board
… fully onboard with not exposing private information
… but people should be able to express things across modalities
… first example: non spatial rendering, the browser picks a "eye" to setup the rendering of the model, the author wants it positioned inside of the view's frustum
… don't care about the physical size of the element actually
… now the spatial example, with someone's actual eye
… now we are looking through the page, it's a great effect!
… but as an author I don't know the view's frustum anymore, and I maybe only tested in a non-spatial context
… so now the model might not be framed properly
… so while the effect is cool, we need to mitigate the fact that it might not be setup properly
… I need to know enough about the environment (in spatial mode) to frame the element
… probably want to frame my object inside a "magic box" (drawing will be in the minutes)
… this is challenging but it can be done in a privacy preserving way
… and I'd like the logic to be mostly the same in the spatial and non-spatial modes

Brandel: this is very useful
… in the non-spatial mode, scrolling the page might move the view frustum
… or devices with a webcam could potentially do this too
… to make things "more spatial" in non-spatial mode
… in practice the bounding box might be enough to reasonably position the model (in spatial mode)
… because the page always points at you (on visionOS)
… not proposing dynamic shearing of the view frustum but it might be an interesting idea

(for non-spatial context)

bajones: the key thing here is that in the non-spatial case, people don't know things are wrong (badly positioned for spatial)
… we need to nudge people toward doing the right thing for spatial when viewing in non-spatial

Brandel: in principle I don't object to exposing the whole view transform when non-spatial
… and I agree that it would be good for people to act more consistently between the two contexts

bajones: this goes back to earlier discussions about "magic windows"
… we should look into past minutes

eddy_wong: want to share some learnings from my prototypes
… I prototyped object-fit, like I would use for an image
… the model tag bounds exactly fits the model's bound
… the benefit is that the web author doesn't need to do the maths, just object-fit: contain, then they can query the transform and manipulate it
… rotating, scaling etc...
… so without introspection on the model's bounds, the author can rely on the initial transform (due to the CSS properties)
… wanted to share this learning, might trigger more ideas
… and what we see on the whiteboard is a great problem
… in our mind the model is still a 2D thing (width: 100px, height: 100px), should it have a depth?
… would work well with the CSS object-fit
… and people would be guaranteed the position/scale of the model would be reliable
… so that would be my answer here, adding depth, but it's a very good question!
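
[Illustrative sketch of the object-fit prototype described above; the <model> element, its object-fit behaviour, and the entityTransform property are all hypothetical/unstandardized.]

const model = document.createElement('model') as any;
model.setAttribute('src', 'chair.usdz');
model.style.width = '100px';
model.style.height = '100px';
model.style.objectFit = 'contain'; // model's bounds fitted to the element box
document.body.appendChild(model);

// With the fit guaranteed by CSS, the author can start from the initial
// transform without introspecting the model's bounds:
const t: DOMMatrix = model.entityTransform;          // hypothetical read point
model.entityTransform = t.rotate(0, 45, 0).scale(1.2);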

Brett__1: the whiteboard illustrates a problem with 3D CSS too
… the non-spatial stuff is hard

bialpio: is allowing the web developer to introspect the model transform risky?
… could I craft something that would expose head pose

Brandel: maybe, the place where we do the computation probably wouldn't have head pose, just access to the scene geometry
… shouldn't be too hard to give a privacy-preserving, deliberately worse answer

bajones: you don't need it to be perfect from all angles
… just from a reasonable angle / distance

bialpio: I'm doing this from JS and I can create many models

bajones: the fit function should be stable even while you scroll etc...
… optimizing for the middle of the page, straight on etc...

bialpio: once I have the scroll position, and I create another element on top...

bajones: still think we'd only use the model bounding box

Brandel: maybe some timing attacks? putting things that are expensive out of the frame?
… but there's probably no stopping that

ada: maybe we shouldn't do view based culling

bajones: this is not a new problem for this context, we see this with WebGL too

bajones: we wanted textures that you could sample from but not read-back
… but people could still do timing attacks
… that's the reality of using the GPU

Brandel: that's an added reason for putting a firewall between the web content process and what's doing the rendering
… anything can do a timing attack but you don't have the granularity here
… more fine grained introspection might bring up new issues

nick-niantic: as we're looking into having the model shrink to the appropriate size, it reminded me of stuff we do for our AR system
… not physically accurate but more useful. could we scale the content based on the distance of the viewer to the page?

Brandel: making an initial computation based on the bounding box is adequate
… having a mapping from screen units to meters is probably the best we can do there, not based on the actual distance from the window
… visionOS expects everything to be 1.5 meters away

<Leonard> I won't join immediately afterwards - another meeting

<ada> immersive-web/model-element#76

Allow a <model> to specify a preference for spatial display (immersive-web/model-element#76)

Brandel: The stereoscopic spatial display of model and the affordances of a model tag are great, but not all devices can do that. There are some things you might want to do for a non-spatial, non-stereoscopic view

Brandel: On the basis of that divergence, it's good for people to know that's going to be the case. (In a similar vein as <a> rel and Quicklook)

Brandel: It would be good to indicate a stereoscopic property on a model

Brandel: For a portal environment it doesn't make sense to put it behind the surface of the page.. but if you do want to do those things, giving folks an option to do that would be desirable

Brandel: Should we consider exposing that in the API, setting a preference for spatial or non-spatial display?

bajones: This seems straightforward but I have some questions
… It seems the primary purpose is to dictate to show "mono" regardless of environment. Is there a sensible reason to dictate something to be seen in "stereo"?

Brandel: One thing Model can do is for materials and surface discrimination. Knowing that you're not going to do that could be useful.
… Something being intrinsically spatial might be desirable enough for constructing a page mockup

bajones: The way this is worded makes me think the intention is to give a page a display hint without necessarily being able to read back the actual value
… so does "knowing" I'm in a particular mode mean knowing the preference rather than the actual mode

Brandel: It's not that we want to prevent people from determining whether a user is on a spatial device, but it could be valuable for content tailoring

bajones: I broadly agree with not exposing information unless the user can do something useful with it (whether it applies here)
… For stating something only wants to be displayed in 'mono,' could it be triggered intentionally or otherwise via stylesheets?
… For scenarios like overlaying text on top of a background, the model might be forced into mono mode

Brandel: Other scenarios to force mono mode might include CSS filters
… Some effects will only apply to mono models rather than stereo

cabanier: Why would sepia not work for stereo

Brandel: It would compel a complete rewrite of the CSS stack
… since in Apple's case Model is not rendered with the web stack but with RealityKit

ada: We might need a list of CSS filters that would not work in spatial mode

cabanier: Let's say you use RealityKit to render the monoscopic model, you still wouldn't be able to use filters

Brandel: That's a matter of plumbing
… I might need to back out of things if it turns out my understanding is misaligned with reality
… A mono rendered thing may have the ability to respond to CSS filtering depending on the renderer

cabanier: -Asking clarification on Brandon's position-

cabanier: Could stereo or mono be a CSS feature instead of an attribute

Brandel: Maybe. Could possibly be a media query
… This idea could be used to discriminate between different types of experiences, but should also potentially give the user control
… Having an indicator of how well something is supported on a platform and for the dev to make decisions about rendering in that context could be valuable.

cabanier: I was thinking properties like 'stereo,' 'mono,' or 'auto'
… this could apply to other features like a scene graph

Brandel: Looking Glass exists, I don't know how conformant their mechanism for presentation is
… but stereo may be inadequate to describe things in scenarios like that
… we may want to be more encompassing of multiple types of contexts

bajones: I would love to see a media query for determining this
… In regard to Looking Glass, I don't see a problem referring to that as 'stereo,' it's generating more than two images but you're still perceiving two images (one for each eye)

Brandel: that makes sense
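
[Illustrative sketch: if a media feature along these lines existed (the name "spatial-display" is hypothetical), page-side detection could look like this.]

const spatialQuery = window.matchMedia('(spatial-display: stereo)'); // hypothetical feature

function applyDisplayMode(stereo: boolean): void {
  // e.g. swap assets or set a hypothetical attribute on <model> elements
  document.documentElement.dataset.display = stereo ? 'stereo' : 'mono';
}

applyDisplayMode(spatialQuery.matches);
spatialQuery.addEventListener('change', (e) => applyDisplayMode(e.matches));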

First pass at defining tracked sources (immersive-web/webxr#1361)

cabanier: I will introduce the problem space for this
… On Quest devices, a feature was introduced to track controllers and hands at the same time, so now there are 4 input sources
… But no WebXR experience supports this properly
… They draw either only hands or only controllers
… And it's hard to determine what the user's intent is in regard to what should be rendered when
… The WebXR spec has the notion of primary and auxiliary input sources
… and Apple Vision Pro has a problem today with transient pointers, there could be multiple input sources

ada: The decisions about unseen defaults - experiences tend to request both hands first - mean that we have a lot of difficulty determining inputs; it only behaves if you have one hand behind your back, and then you get a tracked pointer and a transient pointer

ada: Regarding your proposal: one piece of content that works on Vision Pro, Monster Hands, would break because you could no longer see the hands

cabanier: The proposal is to create a new attribute - a list of all sources that are tracked but do not create input. For Vision Pro this would be hands. On Quest this would be controllers after you put them down
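
[Illustrative sketch of the proposed shape (see webxr#1361): a hypothetical trackedSources list alongside the existing inputSources; renderTrackedThing is a hypothetical page helper, and types are kept loose.]

declare function renderTrackedThing(source: unknown, frame: unknown): void;

function onXRFrame(_time: number, frame: any): void {
  const session = frame.session;

  for (const source of session.inputSources) {
    renderTrackedThing(source, frame);   // primary inputs: these fire select events
  }
  for (const source of session.trackedSources ?? []) { // hypothetical attribute
    renderTrackedThing(source, frame);   // tracked only, e.g. hands while controllers are held
  }
  session.requestAnimationFrame(onXRFrame);
}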

ada: For my personal stance: I don't think we should specify granular input sources like "right and left" -

cabanier: I do think that though

ada: So we differ on that

ada: It would be nice to come up with a sensible agnostic model for inputs

cabanier: I want specifically right and left hands specified though

Brandel: My objection is that it is anthropocentric.
… Meta overindexes on what people currently already encounter on Meta hardware
… Handedness has implications about human participants with hands (rather than sightedness or other attributes associated with those inputs)
… There could be automated agents, or possibly accessibility tools that abstract away some of these principles

cabanier: But everything is broken everywhere because it's not web compatible

ada: In terms of WebXR's lifetime, I hope we're at the very beginning. There is some pain now, that this proposal will alleviate, and will give some time to work out things to become more generic
… This change means that the first two inputs that have select events will work, so if there are controllers, they will work

cabanier: It rubs me the wrong way to spec things that are counter to what everyone has implemented.

bajones: If you have multiple primary input sources, if the experience falls back to only recognize the first two sources, it would still be acceptable

Brandel: In general it's worth making sure the specs refer to real-world things like body plans and things that exist in the world - and then look at those as broadly as possible
… so ideas like handedness... in reference to the body tracking API, how well does that represent people with different limb configurations
… There are benefits and costs to reflecting reality vs what is ideal
… We want to make sure this accounts for all the ways WebXR could be consumed.

cabanier: If you introduce accessibility tools, now you have a third framework that will also not work

ada: Part of the goal here is to buy us time to get out of the current problem space of difficulty

alcooper: The pattern described of holding the controllers and then switching to becoming tracked sources, that sounds needlessly complicated - can't you just always track all sources and then change which ones emit input events

cabanier: That's mostly how that works but we wanted to enable folks to render the objects separately from input handling

alcooper: The spec doesn't currently say there's only one left and one right; web devs really seem to have overindexed on their own specific implementations
… It is unfortunate that they have overindexed on having only two input sources, but I don't love that we just add another attribute as a bandaid to the problem

nick-niantic: In the show One Piece, one character has a sword in each hand and one in his mouth. You could do an experience like this

nick-niantic: I don't know what will change or break here

cabanier: Nothing will break, it will just allow people to adjust things for the Vision Pro
… Tracked Sources is the new thing. These are objects that are tracked but don't generate input sources

cabanier: -speculation on how API would have to change to glob all objects together regardless of being an input source-

nick-niantic: Why on Oculus can you not generate events with both controllers and hands at the same time

cabanier: That's just the way it is. You can't do that with gamepad right now.

nick-niantic: Just trying to imagine different scenarios where you have different kinds of controller configuration

bajones: When you pick up a controller, do you get a hand object in the tracked sources and then when you put it down they switch places?

cabanier: yes

Brandel: Could you attach multiple controllers to a Quest device, like four controllers?

cabanier: I don't know if that is supported, but conceivably.

bialpio: How far off are we from having a WebXR system that can track multiple people?
… like having multiple players handled by a single WebXR session
… Having just one display device but having two users

bajones: I can imagine a Looking Glass scenario with multiple users coming up to a large display and their hands being tracked..
… That's a scenario where you can't utilize a lot of the drive-by web content out there
… It would have to be using very specialized hardware

bialpio: I'm just thinking of a future nearer than having robotic arms

bajones: Most specialized systems are not going to be picking up random three.js content - they are implemented for a very specific purpose that does not need to necessarily be served by standard APIs

ada: After the discussion yesterday about secondary views, maybe we should pivot those views to helping enable scenarios that are a bit more elaborate

Brandel: Like playing multiplayer Goldeneye

<Brandel> (on an autostereoscopic, lenticular display like Google project Starline)

bajones: If we have variable expectations for controllers, it might inconvenience but not break you. For requiring multiple views being rendered, that is a lot more constricting

ada: Part of this issue is that we conflate input rendering and input interaction. For gamepad it made sense, but one option might be to provide an array of things to render that have a position in 3D space, separate from the things you should listen to for select events
… It would be a significant re-architecture of the way we do input
… Not knowing what was coming has put us in a difficult situation

alcooper: It looks like the select event should be fired from the session object. What is content doing to ignore events?

bajones: Three.js specifically has you query controllers by index and then does work to ensure you don't have to think about disconnects and reconnects
… They do a lot to map input sources to controllers and surface them to the user by index

alcooper: And then Aframe is querying off gamepad events

ada: They started in the WebVR era and never changed it when we moved to WebXR

alcooper: I wish these sites hadn't all overindexed on the current implementation

Brandel: It's not even the current implementation

ada: Back in the day this was new, and we didn't have a strong design doc for inputs for WebXR, and then for years we've had one model of how inputs are supposed to work - a headset with two sticks
… and then Android devices came along - it was very hard to get uptake to support WebXR
… You don't see the same amount of content on handheld devices
… We haven't shipped WebXR on iPhones yet which means no cross compatibility
… We're in an awkward era now that headsets are starting to diverge - screens, Vision Pro, TV displays with multiple inputs..
… If we can get out of this hurdle we'll be healthier in the long run... but this solution is sensible in general because there is new hardware coming out that isn't an input source

alcooper: Things like tracked pucks are good candidates for this since they shouldn't be rendered

bajones: When this topic came up I thought this was initially already declared in the spec. In general I feel that we shouldn't change the API in backwards incompatible ways just because pages have decided to do something else
… but we do see more devices coming online that don't fit the general input mode
… and we expect to have a lot more accessories and tracking devices
… so being able to communicate these in a clear and concise way without imposing unnecessary expectations, and if we can also "fix" some of the existing web content to work on existing devices, sounds like a good path forward.

nick-niantic: If the Apple Vision Pro has an existing implementation and is expected to have a new implementation, we'd like to see that change as quickly as possible
… People are going to start making content for that device and so time is short to fix this

ada: Yeah the timing is awkward and painful for us

nick-niantic: I would advocate to make the shift while you're still in experimental mode, but also we want you to get out of experimental mode as quickly as possible

nick-niantic: Something else: You mentioned a future when iOS devices would have full webXR support (hypothetically)
… one of the things we have been fighting is that experiences that are developed for headsets and phones are very different
… if that hypothetical comes to pass, we would like to know what kind of device we are running on - a phone or headset
… Android presents itself as a headset to us which is why we do User Agent testing
… we don't want to have to do the same thing for iOS

Brandel: We already experience this with AR Quicklook
… [amusing anecdote about having to headbutt a virtual control intended for a phone]

ada: On Vision Pro you do have inputs, it's just not something you wave around. That would work very well today on something like a phone using transient inputs

nick-niantic: You wouldn't build beat saber for phones because you wouldn't get controllers

felix_z: Why is wrapping things so I can listen to input events from a particular source a bad way to do things? I don't have a universal select event that is left- or right-handed. What if I just have a gun in my hand and want to listen to that?

bajones: I think that was the mentality behind why things were originally built how they were
… the hope was that in a lot of cases, you can get by with fundamental actions starting/stopping, so if I have a play/pause button, and all I really care about is whether the button is pressed, all I have to do is listen to the session
… The expectation was that we were making it easy to work across the widest amount of hardware
… The concern is that some of these libraries make that more incompatible mechanism the default (whether it's best for the developer or not)
… Lots of experiences don't really care where the input is coming from. So determining which input event is coming from where is not the right model for everybody.
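
[Illustrative sketch of the session-level pattern described: a lowest-common-denominator select handler that ignores which input source fired. xrSession is an assumed active session; togglePlayback is a hypothetical page helper.]

declare const xrSession: any;
declare function togglePlayback(): void;

xrSession.addEventListener('select', (event: any) => {
  // event.inputSource identifies the hand/controller, but a simple
  // play/pause-style interaction can ignore it entirely.
  togglePlayback();
});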

Brandel: On the 2D web, we don't listen for events on the mouse - we listen on a target *from* a mouse - they are object-centric

felix_z: Okay I was just wondering why it would be bad for a framework to be this way. The examples you gave are samples, I don't know if they represent the majority

ada: The issue with these frameworks is that they make a lot of assumptions right out of the gate, and they exclude anything that doesn't look like the default "two sticks and a screen" hardware
… That model has worked well up until now because it's been the predominant available hardware, so it wasn't breaking anybody at the time. But now things are breaking (with some advance warning) and now we have to make the content work for a new paradigm.

bajones: I should say that I've been complaining about frameworks a lot but it's not really fair of me. The frameworks have been doing what they felt was right or even easiest to enable the type of content they were looking to enable.
… There's a little bit of frustration for the editors here where assumptions didn't match reality of new hardware, and we can complain that they didn't stick closely enough to our own spec.. but no one is being a bad actor. The spec just happened to make a better decision than the libraries did. It wasn't guaranteed to turn out that way
… as Rik pointed out, even some of the samples don't work properly here

alcooper: If we are thinking about tracked sources including elbow pads or knobs, should they be a new type of input sources?

ada: I think they should still be tracked input sources, with a property about whether they will be rendered or not
… additional properties: handedness: none, rendered: false, no pointer

alcooper: Was just wondering whether we should still call it an input source

ada: I think it makes sense to expose optional input features for existing input sources

yonet: How do you detect whether controller shouldn't be used as the input source?

cabanier: If you let go of the controller it's no longer detected as an input
… I think. It's a bit of a black box in the OS

<ada> scribenick Brandel

2024 recharter (immersive-web/administrivia#204)

yonet: We need to re-charter in the coming months, and there are some questions that we need to address

cwilso: for context, a charter at W3C must always be for a fixed term, to allow for fixed goals over that time period.

cwilso: These charters are things that your company needs to be able to assent to. The charters are broad and directional - what is "mostly right" about goals within the timeframe specified

cwilso: Especially, charters shouldn't be overly ambitious at the risk of driving some companies away

immersive-web/administrivia#204

cwilso: [providing background for a "Living Spec" in the context for potentially adopting it in IWWG]

cwilso: This has given rise to the notion of a "Candidate Recommendation Snapshot", which is a way to update a specification in-place, rather than working strictly to a fixed goal

cwilso: There are Intellectual Property distinctions between a V1/2/3 CR model and a living spec-based one

<atsushi> current draft

yonet: Some of the WG's goals seem like a good fit for living spec, and there are some things that the charter may want to aim to promote from CG to WG

bajones: WebGPU is also considering moving to a living Spec, based on the expectation of an ongoing stream of feature/extensions that would be possible to more gracefully accommodate under this model

bajones: Would it be better for us to pursue this approach for the additional _modules_ in this kind of mode, rather than have them scattered through a wider array of documents?

<Zakim> bajones, you wanted to ask about Living Spec in relation to our modules (once Chris is done explaining)

yonet: It seems like this would have the greatest impact on the editors and the work of integrating more content into the base specification documents

<Zakim> cwilso, you wanted to react to bajones

ada: One benefit of having separated module specifications is that a given implementation may have no notion of gamepad / hand / etc. If a single spec covers all of the functional contents of the modules, that would limit what is permitted to constitute a "compliant" platform

cwilso: This is down to the language in the spec describing whether components are mandatory - more that "when hands exist, they exist in this way" rather than "all platforms need hands"

bajones: Going back to the comparison with webGPU, there is a clear baseline compliant implementation - but there are extensions that may be available

bajones: We could spec things in a way that states a minimum implementation - the situation is a little different

cwilso: The language should say "there is a baseline you must meet." The living standard can do that as well

ada: This seems like something we can and should take back to our companies and lawyers to assess the practicality

bajones: There are two questions - whether to livingSpec, and secondly to roll things into a central one rather than pursue modules as independent components

<Zakim> alcooper, you wanted to ask about how changes to CR happen/are expected to be implemented

alcooper: What does a living spec change, in practice, about accepting a spec change?

cwilso: breaking changes remain important - living standard reflects the reality of how people have worked on and with HTML in the past anyway

cwilso: It tries to allow usage to reflect the ideal, rather than demand strict adherence to a momentary understanding and saying "this is ISO 999"

atsushi: we need to get to a review before getting to a CR. a CR snapshot still needs to undergo IP review

<cwilso> (and other reviews, like horizontal reviews and TAG reviews)

atsushi: This only changes the use of a Rec or a WD - reviews still apply when something becomes CR

alcooper: so this functionally still requires referring to a CR as of a date - it's not the case that any new amendments are automatically CR

atsushi: a CRS should reflect an update 6-24 months after a "substantive change"

atsushi: the CRS should be republished in a "timely manner" after "reasonable substantive changes"

bialpio: It seems like it would make things easier if modules are rolled into a single spec because it would be easier to detect cross-cutting changes

bialpio: To Ada's point, extra modules tend to have a strong statement of optionality - "IF your device supports things, this is the way to do it". not that they all _must_ - integrating them into a single document shouldn't change that

bialpio: Affecting more webIDL definitions may require updating more things for more people, but most new parameters are specified as nullable

bialpio: It's not clear that splitting into modules has a bearing on the amount of work that vendors are obliged to do

yonet: And what do we move to WG? WebGPU bindings was mentioned

ada: We have also indicated that a WG charter goal should require two implementers to support, given that a CR ultimately needs two implementations to pursue

[bajones and cabanier indicate they would support webGPU bindings moving to WG]

<ada> https://www.w3.org/immersive-web/list_spec.html

ada [raising the 'real-world meshing' feature as a promotion candidate]

bajones: Real-world _Geometry_ is the name of the plane system on Android and Quest. As an aside, a single repo would mean we don't have to come up with different names!

bajones: Real-world meshing is the process of identifying the three-dimensional _shape_ of the world, rather than just surfaces (or 2D bounding geometry)

ada: It sounds like we may want to check with our people to decide if it's reasonable (inside 2 years' time)

atsushi: We already have RW Geometry as a candidate - isn't that enough?

alcooper: I think we can promote from CG to WG mid-charter if it's already on the general radar for the group(s)

bialpio: We did this for some of the geometry work over the last charter window

ada: The charter is often used by the legal departments of vendors to check for the risk to IP portfolios. As long as we have adequate mention in the WG of those keywords that most interest the lawyers we should be okay

ada: [enumerating the list of candidates in the CG listing]

yonet: We should consider a mechanism to demote CG modules / interests to indicate they are archived or no longer pursued

alcooper: This is a place where there may be a distinction between the contents of the charter and what happens to have a repo today

ada: Could we indicate that by removing the 'w3c.json' such that it's no longer regarded as an active proposal?

yonet: We also have tags to denote incubation to indicate that there's no expectation that implementations are incoming

atsushi: We can mark documents as "Articles", such that the document is not a recommendation and there is no active development expected

ada: returning to the list of candidates - "capture?" recording of sessions?

ada: No work has progressed on it, but it seems like there is a valid need for it

ada: Demote "computer vision", as it became "raw camera access"

ada: Demote the 3D CSS stuff for the time being

Brett: detached elements are easy from a specification perspective, it's easy to say what they should do

ada: I was interested in this while at Samsung, but then moved to Apple - so the two parties are now Apple. If there are rumblings about it we can keep it on the docket

bajones: If we don't put this in the charter and CSS decides they _would_ like to pursue it - would we be upset about that?

ada: I think we would be _so_ happy if that happened.

brett: Anyone could develop an interest in VR at any moment in time

ada: would there be any objection to adding it, running the risk that there may be no progress in the 2-year window?

cabanier: It was added 5 years ago to CSS and then removed already

ada: Front-facing camera? keep it where it is (yes)

ada: Geo alignment?

yonet: It's not identical to shared anchors. People are interested in it, but the people responsible for making past progress in MS are no longer assigned to it

alcooper: this concept of demotions should be considered on the basis of moving to WG or not

ada: Marker tracking? (yes, promote)

alcooper: The last disagreement was over the type for marker tracking, QR vs other objects

ada: Model? (promote/ add to list of 'potential deliverables')

ada: I think we should move "Navigation" to "potential"

cabanier: We shipped it, but there's no spec

yonet: Under potential deliverables, we have anchors, model element and body tracking.

yonet: Navigation to move to WG? (move to charter, don't move it to WG)

nick-niantic: re: Splats - I would want to know why we add splats into a specific place - e.g. webGPU, model etc.

nick-niantic: I'm fine to say it, but want to know why _we_ are saying it rather than things we could just do _with_ the technology

nick-niantic: This is like scene graphs, in that we are generally interested but not sure what to do with.

<yonet> https://w3c.github.io/immersive-web-wg-charter/immersive-web-wg-charter.html

<atsushi> https://w3c.github.io/charter-drafts/2024/immersive-web-wg.html

ada: note that you don't have to be comfortable answering now - but take this to your people so we can have a substantive conversation soon

Update on Metaverse Standards Forum (immersive-web/administrivia#206)

immersive-web/administrivia#206

<ada> scribenick bialpio

matlu: [presenting slides]
… metaverse standards forum is ~1yr old, it is not a standards organization but the ambition is to create open/interoperable metaverse
… would like to assist standards organizations
… idea is to get input from industry and feed it back to standardization bodies to help with use cases and accelerate the standardization work
… various membership levels, including free membership, first board meeting coming up soon
… members can suggest creating exploratory groups & working groups
… multiple domains the forum focuses on, 3d interoperability among others
… accessibility also moves to be an exploratory group
… forum also creates standards register to help figure out where standardization happens
… plenty of presentations available
… glTF USD Asset Interoperability WG is also active
… interoperable characters / avatars WG
… [more working groups described]

Brett: where do you meet?

matlu: zoom calls, f2f meetings

<matlu> Go here and click on the small padlock icon on the top right to join with you organization email

<matlu> https://metaverse-standards.org/

<ada> hi

immersive-web/proposals#79

<webirc83> hello this is Diego

webirc83: apple released transient pointer
… it allows integrating gaze with WebXR
… but the missing part is the lack of possibility to highlight UI elements on hover using gaze
… what'd be the acceptable path to making something like this possible?

ada: I understand this is an issue
… apple's general stance on how the hover should be done is out-of-process glow (?)
… I'd like to move this forward but I don't know yet what path we'd be able to take
… and until I figure this out I don't think we'd be able to come up with a proposal
… unlikely that gaze tracking is going to be exposed, it is not currently exposed on AVP even to native apps
… very privacy invasive
… other browsers may disagree but we're not likely to implement a gaze tracking api if it were exposed in WebXR

bajones: when mozilla was more active, they also had pretty strong stance on this as well, they did not want to expose gaze vectors

Brett4: we'd like to be able to look eye to eye in VR
… to solve the problem, could there be a half-pinch that's distinctive from full pinch

ada: what I suggest to the authors is that when selectstart arrives, they can enter the state that marks the start of the gaze gesture
… and selectend is when the interaction is processed
… it is a workaround but not too bad

cabanier: we chatted about accessibility tree yesterday - in theory, device like AVP could use the tree to render highlights over UI elements

ada: with the accessibility buffer, we wouldn't know what to do, but may be doable with accessibility tree

etienne: safari does not draw the glow itself

<webirc83> An app might want to highlight 2D UI elements and also 3D objects in the scene that the user can interact with

Brandel: there is a possibility that UI elements can be vectorized
… Safari is a shared-mode app, unlike WebXR / Unity which is rendered in fully exclusive mode; it may have different rendering paths
… there's work for us to do but we don't yet know what it'll look like

ada: it'd require OS changes and I have ideas and fallback ideas, as soon as we figure out what we can do I'll let the group know

cabanier: to summarize, exposing the gaze is a no-go but we'll try to have some alternative

<webirc83> a small demo I was building with A-Frame that would love to port to AVP https://aframe.io/aframe/examples/showcase/spatial-ui/

ada: start highlighting on selectstart + raycast to know what to highlight, and fire the UI element event on selectend
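
[Illustrative sketch of this workaround using existing WebXR events; hitTestUI, highlight, and activate are hypothetical page-side helpers, and xrSession/refSpace are an assumed active session and reference space.]

declare const xrSession: any;
declare const refSpace: any;
declare function hitTestUI(rayTransform: unknown): Element | null;
declare function highlight(el: Element | null): void;
declare function activate(el: Element): void;

let hovered: Element | null = null;

xrSession.addEventListener('selectstart', (event: any) => {
  const pose = event.frame.getPose(event.inputSource.targetRaySpace, refSpace);
  hovered = pose ? hitTestUI(pose.transform) : null;
  highlight(hovered); // show the hover state for the duration of the gesture
});

xrSession.addEventListener('selectend', () => {
  if (hovered) activate(hovered); // commit the interaction on release
  highlight(null);
  hovered = null;
});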

webirc83: there is a laser pointer mode that you could enable?

Brandel: that's an accessibility setting, yes

webirc83: is this something that could be exposed?

Brandel: we haven't looked into exposing it

nick-niantic: we talked about scene description and challenges w/ solutions
… hover may be one of the benefits
… rather than specifying detailed information, we could look into specifying more coarse grained data that could help here

ada: I'd rather have accessibility buffers first, vectorized data 2nd, and triangles last

Brandel: is there anything else that you're targeting here?

webirc83: I'm considering general user experience
… later we'd like to have avatars w/ eye tracking but that's a separate problem

Brandel: treating user input as 2-step process allows it to be cancellable

Brandel: can the OS construct such a view where we could intervene? Can the OS make changes to the session in opaque ways?

ada: we don't have anything in the spec that would disallow the OS from showing its UI on top of the session

bajones: correct, we should not have any language like that but I can double-check

ada: we're looking into how to make all of this work

bajones: spec has "trusted immersive UI" concept
… [reads the spec]
… this is not explicitly forbidden by spec

alcooper: [reads another part of the spec]

bajones: this is probably not something that is generally spoofable

<alcooper> Section I quoted: https://immersive-web.github.io/webxr/#exclusive-access

<alcooper> Section Brandon quoted: https://immersive-web.github.io/webxr/#trusted-ui

<yonet> https://avibarzeev.medium.com/for-xr-the-eyes-are-the-prize-25d43a533f2a

yonet: eyes are windows to the soul, and that's exactly why we don't want to expose them
… there are dangers of exposing them

Brett4: having eye tracking work would be a great carrot to encourage devs to use accessibility tree

ada: accessibility is also important w/o being a carrot but it's nice when things work out together like that

Brett4: [setting up presentation]

marisha: question / clarification: 628mb on the cube, what is that?

Brett4: that's just bytes

bajones: is it text only or binary?

Brett4: it has both

bajones: so 628 bytes was ascii or binary?

Brett4: ascii
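
[For context: an ASCII PLY file is a short text header followed by whitespace-separated numbers. The header below is illustrative only (the demo file itself is not reproduced in the minutes), with a tiny scan to show how little parsing it needs.]

const plyHeader = `ply
format ascii 1.0
element vertex 8
property float x
property float y
property float z
element face 6
property list uchar int vertex_indices
end_header`;

// Tiny header scan: pull out element names and counts.
const elements = [...plyHeader.matchAll(/^element (\w+) (\d+)$/gm)]
  .map(([, name, count]) => ({ name, count: Number(count) }));
console.log(elements); // [{ name: 'vertex', count: 8 }, { name: 'face', count: 6 }]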

Brandel: simple - yes, but simple for whom? there are reasons why we don't use pbm for images

bajones: nice companion presentation for splats introduced yesterday
… I don't see anything about materials here - is this something that people put alongside .ply?

Brett4: there's no standards body, the data can be placed in the file and it'd be up to the viewer to interpret what's inside

nick-niantic: people use .ply for splats because it's easy representation
… they omit some required data but include other data, and it is something that's driven by consensus
… ply is simple and understood widely, but it is rarely used when something more than the lowest common denominator is needed
… the files themselves can be huge
… the reason glTF exists is that there's plenty of other information that people want to use that doesn't map nicely to vertex data
… we talked about other, proprietary formats; the reason they are proprietary is that they contain a lot of cleverness related to data representation
… although there are formats like webp which is open
… but not simple
… simplicity has both costs and benefits

Brandel: having a body that discusses what is a standard implementation is actually a benefit for us

Brett4: being able to agree between ourselves is beneficial

Brandel: I don't know if web standards group is the place where we need to come up with formats

bajones: a lesson learned from working on a browser is that there's a big difference between getting things working at first and getting them working properly
… first prototype for webvr took me ~1 week
… and that was a very fun time
… and then it took another 6 years

ada: WebVR was 2yrs young when it got deprecated

bajones: and it was 10 years ago
… there's a draw to simplicity, but we will need to go far beyond what it takes to implement an MVP

marisha: the simplicity, and the fact that it's from the '90s, is interesting
… ply looks like a primitive format that could have grown alongside the web
… but didn't

bajones: it's easy to polyfill
… we could try something out and prove this concept
… if this is something you're enthusiastic about, you don't need a permission from anyone in this room

ada: it should be possible for you to polyfill .ply by converting it to USD; if people were happy with it, that'd be a signal for us
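
[Illustrative sketch of that polyfill idea: convertPlyToUsdz is a hypothetical page-provided converter, not an existing library, and <model> itself is not yet standard.]

declare function convertPlyToUsdz(plyText: string): Blob;

async function upgradePlyModels(): Promise<void> {
  for (const el of Array.from(document.querySelectorAll('model[src$=".ply"]'))) {
    const plyText = await (await fetch(el.getAttribute('src')!)).text();
    const usdzUrl = URL.createObjectURL(convertPlyToUsdz(plyText));
    el.setAttribute('src', usdzUrl); // hand the element a format it already understands
  }
}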

marisha: mode switch in order to view e.g. an image is a limitation

bajones: I'd like to see a way to feed the buffers directly to a model tag to display, we'd not have to have blessed formats then

Minutes manually created (not a transcript), formatted by scribe.perl version 221 (Fri Jul 21 14:01:30 2023 UTC).

Maybe present: Brett, Brett4, eddy_wong, felix_z, giorgio, matlu, nick-niantic, rik, webirc83

All speakers: ada, alcooper, atsushi, bajones, bialpio, Brandel, Brett, Brett4, Brett__1, cabanier, cwilso, eddy_wong, etienne, felix_z, giorgio, Leonard, marisha, matlu, nick-niantic, rik, webirc83, yonet

Active on IRC: ada, alcooper, atsushi, bajones, bialpio, Brandel, Brett4, Brett__1, cabanier, cwilso, eddy_wong, etienne, felix_z, giorgio, jfernandez, Leonard, lgombos, m-alkalbani, marisha, matlu, mblix, nick-niantic, Omegahed, webirc83, webirc94, webirc99, yonet