<Vlad> scribenick: Persa_Zula
<myles> ScribeNick: Persa_Zula
Garret: Starting to think about how to implement the protocol for enrichment, because we need the details to perform analysis
<Vlad> Document link: https://docs.google.com/document/d/19K5MCElyjdUZknoxHepcC3s7tc-i4I8yK2M1Eo2IXFw/edit
Garret: Biggest problem is
sending subset codepoints back and forth; CJK in particular has
thousands. The naive approach is inefficient
... Generated sample sets based on chars from each script;
encoded them to get a sense of which method might work
best
... First 2 are more naive approaches, each int as a series of
bytes. Then tried diff. styles of bitsets
... Tried bloom filters as well - a probabilistic data
structure
... Analysis was informative - the bitsets performed the best
except for very small sets
... GZIP or Brotli compression is helpful as well
Myles: even on the bitsets?
Garret: Yes, even with the
bitsets, it was surprising.
... Even sparse bit sets when applying brotli saw
improvements
... A bitset seems like the way to go from this data, but we
might want to support multiple types of encoding and choose
what is best for that set of codepoints
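The trade-off Garret describes (support multiple encodings and pick the smallest per set) can be sketched roughly like this; the two formats and the one-byte tag are invented for illustration, not the actual protocol:

```python
# Sketch (not the actual protocol): compare two candidate encodings
# for a codepoint set and keep whichever comes out smaller.

def encode_uint24_list(codepoints):
    """Naive encoding: each codepoint as 3 big-endian bytes."""
    out = bytearray()
    for cp in sorted(codepoints):
        out += cp.to_bytes(3, "big")
    return bytes(out)

def encode_bitset(codepoints):
    """Dense bitset covering 0 up to the largest codepoint."""
    top = max(codepoints)
    bits = bytearray((top // 8) + 1)
    for cp in codepoints:
        bits[cp // 8] |= 1 << (cp % 8)
    return bytes(bits)

def encode_smallest(codepoints):
    """Pick the shorter encoding, tagged with a one-byte format id."""
    a = encode_uint24_list(codepoints)
    b = encode_bitset(codepoints)
    return (b"\x00" + a) if len(a) <= len(b) else (b"\x01" + b)
```

A sparse set of high codepoints favors the integer list; a dense CJK block favors the bitset, which matches the pattern in Garret's analysis.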
... Even the integer style is useful for some
... Tried some more advanced methods as well
... Not documented. Extension of sparse bit set to allow
encoding of ranges
... Between this start and this endpoint. We can get a smaller
encoding with a range. First encoded a series of ranges.
Anything that we don't want in a range can then be encoded in the
bitset. Then you can add in smarts to figure out whether to add
anything to a range or the bitset. Experiments show it
outperforms the brotli-compressed bitset, which was previously
the best
... Will write this approach up and send it to group
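One reading of the range-extension idea, as a sketch (the greedy split heuristic and the min_run threshold are invented; Garret's write-up will have the real format):

```python
# Sketch of the range-plus-bitset idea: greedily pull long consecutive
# runs out into (start, end) ranges, and leave the leftover codepoints
# for the bitset. min_run is an invented tuning knob.

def split_ranges(codepoints, min_run=16):
    """Return (ranges, remainder): runs of >= min_run consecutive
    codepoints become ranges; everything else stays for the bitset."""
    cps = sorted(codepoints)
    ranges, remainder = [], []
    i = 0
    while i < len(cps):
        j = i
        while j + 1 < len(cps) and cps[j + 1] == cps[j] + 1:
            j += 1
        if j - i + 1 >= min_run:
            ranges.append((cps[i], cps[j]))
        else:
            remainder.extend(cps[i:j + 1])
        i = j + 1
    return ranges, remainder
```

A long contiguous block costs two endpoints instead of a run of bits, which is where the win over a plain compressed bitset would come from.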
Myles: This approach will not be applicable to the range request approach?
Garret: Correct, because you'd be
stuck with HTTP request protocol
... We can't use alternate range encodings, must use what
protocol uses
Myles: Relative to the size of the fonts and overhead in transfers, roughly what % is the charset itself
Garret: Glyph data is the biggest
part of the payload on the response side
... CJK font: the client loads a large portion of it, 4k codepoints;
encounters just a few more chars and sends a request back. They
actually have to specify that whole set: roughly 4KB of data to
encode it, when they're requesting just a few bytes of glyph data
... In this case, you're sending much more to get little
back
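Back-of-envelope for the asymmetry Garret describes, assuming roughly 1 byte per codepoint after encoding and compression (the figure quoted above is an assumption, not a measured protocol cost):

```python
# Illustrative only: in a stateless protocol the full subset
# definition is re-sent on every request, so upload cost scales
# with everything already loaded, not with what is newly needed.

BYTES_PER_CODEPOINT = 1  # assumed post-compression cost

def request_size(already_loaded, newly_needed):
    """Stateless request: the whole codepoint set goes every time."""
    return (already_loaded + newly_needed) * BYTES_PER_CODEPOINT

# 4,000 codepoints loaded, 3 new ones encountered:
# ~4 KB uploaded to fetch what may be only a few hundred bytes
# of glyph data.
```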
Myles: Every instance of progressive enrichment gets worse and worse the longer the page gets
Garret: Yes. Need to find how to optimize how it works as the page gets bigger
Myles: The analysis is not going to model a single page where characters are needed over the long lifetime; but a series of pages and their payloads?
Garret: Yes, we need to look at more pageviews
Myles: We can tweak this knob; the chain can be really long or really short. We should change that knob to best match the customer experience
Garret: We need to know the avg. # of pageviews before we get to the end of the sequence
Myles: if you req. 50% of font in
one request vs. in 100 requests, results can be dramatically
different.
... we need to be intentional about the decision on how to
model it
Garret: This comes back to the idea to send more back to the client than they requested to avoid extra overhead in getting it later
Myles: don't think the range request method suffers from this at all
Garret: Correct
... unless you're requesting a large set all at once
Myles: maybe we can both think about this and how we can intentionally include it in the future analysis that we do
Vlad: the initial assumption is
that when we send the request back, we include everything
that's already been requested in the past
... if we can ID request coming from each client by client ID,
then the client shouldn't have to repeat requests for what has
already been delivered
Garret: That's another approach
worth thinking about, esp if there is material overhead
... But that requires us to move from stateless to stateful and
it has problems with privacy
Myles: Could be really bad for servers and cacheability too
Vlad: stateless would simplify things, but with each incremental request the total amount of data goes up while the amount of new data goes down.
Garret: exploring additional techniques. Would be good to have something that gets better as it gets bigger
Vlad: Have you tried to estimate the overhead of each request going up?
Garret: No, not beyond the set encoding work. Going to explore more. In parallel, working on the design doc; once it's done we'll get a sense of the overhead
Myles: If we really want to go
the path of making the protocol stateful - as in the server has
a database, then the # of websites that can use it
significantly dwindles. Maybe we need two approaches: one for
huge websites that can do this and one for everyone else
... assuming the difference between the two is significant
Garret: yes, if they're otherwise close on performance
Vlad: a wish for additional work: the document you put together, Garret, is good. It would be useful to extend it with a discussion of stateful vs. stateless
Garret: Yes, will include it
Vlad: will set an initial starting point by stating the issue with each approach
<scribe> ACTION: Garret: Extend document with discussion on stateful vs stateless approach
<trackbot> Created ACTION-208 - Extend document with discussion on stateful vs stateless approach [on Garret Rieger - due 2019-08-05].
KenLundeADBE: has anyone
considered how to add sequences?
... Codepoints can have sequences associated with them, and some
codepoints are not directly encoded. The Format 14 cmap subtable
handles variation selectors that are not associated with any glyph
in the font, but are processed when the table is processed
... Many of the codepoints are unencoded and require a
sequence
Myles: Item 3 on the agenda
KenLundeADBE: this is something that Adobe's implementation handles quite well
<scribe> ACTION: Persa_Zula: Bring back to the group how we do this
<trackbot> Error finding 'Persa_Zula'. You can review and register nicknames at <https://www.w3.org/Fonts/WG/track/users>.
<scribe> ACTION: Persa_Zula: Bring back to the group how we handle sequences
<trackbot> Error finding 'Persa_Zula'. You can review and register nicknames at <https://www.w3.org/Fonts/WG/track/users>.
Vlad: First approach: if you send
back to the server a list of codepoints, then the server has to make
smart decisions about those codepoints and what that subset
is, including anything that could be needed
... Second approach: get all the info for shaping and layout; then
you know exactly the glyph IDs needed to render the text, and you
request those glyph IDs
myles: We need data on this
Vlad: that might vary significantly
from one font and one script to another
... both approaches would be irrelevant for simple fonts; but
with more stylistic sets (8 or so), you have 8
glyphs for each codepoint and you will only use one of
them
... vs the glyph closure so it will send back everything
associated with that codepoint
myles: we should look at both
options and analyze the data, each option could handle
this
... client can ask for shaping, run it locally, and then send
for codepoints. or the client can list the codepoints and the
server can find the closure
... the font can be sorted so the shaping info is in place
and the client sends it to the server; or the sorting algorithm can
ignore shaping and instead try to sort by which chars are
popular and make a map of the chars needed on the page
Vlad: we have to analyze two different approaches for the smart-server type of process, and my gut feeling, which might need justification, is that the glyph-ID byte-based request would be best
Garret: 3rd option: send the set of features that is required. Then, if you need features beyond that on the client side, ask for them and the server can include them in the closure, to avoid sending stylistic sets you don't need
Vlad: when client knows small caps is enabled, then just send it over.
myles: approach seems like a good option. should be extensible because future extensions to OT; might be able to extend
Garret: when I design the protocol for this, we need to make sure it's flexible to be able to incrementally transfer. will try to build it in and try to make sure we can add new things down the road
myles: can one of us do some analysis to find the relationship with # of codepoints and closure
Garret: yeah if you can take it
on that would be great
... can use FontTools for it - to keep all features it will
compute the full closure on everything
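A toy illustration of what the glyph closure computation does (the real analysis would use fontTools.subset, as Garret notes; the cmap and substitution rules below are invented for illustration):

```python
# Toy glyph-closure illustration. Substitution rules such as ligatures
# and stylistic sets can pull in glyphs beyond the direct cmap mapping,
# so the closure is computed to a fixed point.

CMAP = {ord("f"): "f", ord("i"): "i"}          # codepoint -> glyph
SUBSTITUTIONS = {("f", "i"): "f_i",            # liga: f + i -> ligature
                 ("f_i",): "f_i.ss01"}         # stylistic-set variant

def glyph_closure(codepoints):
    """All glyphs reachable from the codepoints via substitutions."""
    glyphs = {CMAP[cp] for cp in codepoints if cp in CMAP}
    changed = True
    while changed:
        changed = False
        for inputs, output in SUBSTITUTIONS.items():
            if set(inputs) <= glyphs and output not in glyphs:
                glyphs.add(output)
                changed = True
    return glyphs
```

This is the relationship the action item is about: a handful of codepoints can close over many more glyphs once features are kept.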
myles: the idea is that if the
client sends exactly the codepoints that the server needs,
that's effectively all the codepoints on the page that the
user is viewing. If the font service is a different party, now the
font service knows almost certainly what someone is
reading. This is particularly true for Chinese
... happens often with Chinese fonts. it would be a privacy
violation
<Vlad> ACTION: Myles: analysis of relationships between codepoints and glyph closure
<trackbot> Created ACTION-209 - Analysis of relationships between codepoints and glyph closure [on Myles Maxfield - due 2019-08-05].
myles: this is scary because a
service that hosts many fonts on many webpages, you could
almost construct a history of the user's browsing
... proposal on how to solve this is to require the client to
request more codepoints than it actually needs to obfuscate the
signal of what's on the page
Garret: yes, we need to do this. TachyFont does do fuzzing this way, because we request more than we need. This is a good addition to require.
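A sketch of the padding idea (all parameters are invented for illustration): the client pads its requested set with random codepoints drawn from the same Unicode blocks, so the server cannot recover the exact page content from the request:

```python
# Illustrative padding for privacy: request a superset of what the
# page actually needs. ratio and block_size are invented knobs; the
# right values are exactly the open analysis question discussed above.

import random

def pad_request(needed, ratio=1.3, block_size=256, rng=random):
    """Return a superset of `needed`, roughly ratio * len(needed) big,
    padded with random codepoints from the blocks already touched."""
    padded = set(needed)
    target = int(len(needed) * ratio)
    blocks = {cp - (cp % block_size) for cp in needed}
    while len(padded) < target:
        base = rng.choice(sorted(blocks))
        padded.add(base + rng.randrange(block_size))
    return padded
```

Drawing the padding from blocks the page already touched keeps the extra download plausible for the script in use, though, as raised below, it may not defeat an adversary looking for specific character sequences.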
Vlad: a number of times during past discussions we brought up predicting characters needed in addition to what is requested. To solve both problems, we can marry these two approaches: not just extra codepoints for fuzzing, but also making the next request more efficient by using a predictor for missing chars that may be needed
myles: right. Both having the client request more than needed and the server giving back more than needed are important, for different reasons. If each adds 30% on top of what is needed, that's more than 60% additional characters. But saying privacy should be on by default is important to include
Garret: will include in design doc
myles: another question - how far to turn that knob? is 30% reasonable? needs more analysis to find out what roughly that number might be. have some ideas that aren't ready to share yet. different browsers have different privacy requirements. this is a privacy and performance tradeoff. we need to include a minimum that can guarantee privacy for every user on the web
Vlad: don't like the word
guarantee because we can't prove that. As discussed in the
email, depending on the script, sometimes it's easy to predict the
content and sometimes it's not
... can't be guaranteed for each user. but much better to do
something about this
jpamental: is there a relationship between how much we need to request and the total number of glyphs in a given font?
myles: there's a difference
between a single chinese font that has thousands of chars and a
font that has a zillion languages supported
... for non-alphabetic languages it might be something more
complicated, but for alphabetic ones it would be more
straightforward. It has to do with usage and language
jpamental: i wonder if it's something that might be identified in the font
Garret: if we're dealing with codepoints in these languages, we need to add these additional characters for obfuscation, coming up with reasonable levels for each language
ned: ideographic scripts would suffer from this issue. Each codepoint carries more meaning than in other scripts; from an information standpoint, the greater the number of characters needed to convey something in a language,
John Hudson: on the CJK front, there's the privacy issue of someone knowing the fairly specific content of a site that someone's visited. But if a bad agent is looking to track what kind of sites someone has been looking at, they may be looking for specific chars or specific sequences of chars indicating something. Not sure that simply adding a bunch of other chars would prevent someone from determining that a user is following protests in Hong Kong
myles: I wish that one of us could come up with a better solution
Vlad: we haven't really given this much time to think about yet; we need better knowledge of what the problem is and how we can possibly solve it. If we can, having more than one approach up for discussion is a good thing
Garret: in the current state of the art, font requests include referrers. This is pretty effective at telling which fonts are being used on what pages. Not to say we shouldn't deal with this, but we already have a big privacy hole with font requests
myles: the perfect shouldn't be the enemy of the good. Just because we have problems now doesn't mean we should make those problems worse
Vlad: it's an excellent idea to keep things like privacy right at the beginning of a design for a solution because it will inform our design. we don't have a proposed solution at this point. let's keep an open mind about this
myles: the thing I was hoping for in this discussion is that we require that user agents make some effort to solve this problem
Vlad: I think the answer is yes;
typically this comes up at the end of the process when we put
something subject to wide review. thinking of this ahead of
time is definitely a benefit
... in essence, the w3c process requires a statement at some point
that we thought of this. Sometimes these are broad statements,
but in this case it might be our concern
myles: it is absolutely our concern
Vlad: for woff2 or woff1 it was
not our problem, because we were a transfer agent
... but here we are dealing with the content given to us
... we now have a much clearer picture of what is needed
... no time left to discuss Face to Face topic
... at the end of F2F we want a well defined timeline; there
are other things that have to happen. we still have time to do
it, I think. in the next months, things that can be done should
be done. should have a good estimate of what to do. includes
what's been done by garret and the documents he has created. we
also need that for myles' proposal, because that approach
depends on the tool that takes the font data in and makes it more
efficient[CUT]
... myles, we need an estimate of what it would take to have
that tool available
... jpamental, any more info on logistics?
jpamental: the director of ATypI has
promised that we will have a space available for the full day,
and the latest is that it will be in the main conference venue
rather than somewhere else
... the complication is that ATypI did not have the venue for the
day before, but she will make arrangements for us
... getting ready to launch schedule site this week so will
have more info soon
Vlad: let us know as you get more
info on the email list
... out of office next week, no discussion next Monday
<jpamental> exit
<jpamental> bye
Thanks for letting me give it a shot :)
This is scribe.perl Revision: 1.154 of Date: 2018/09/25 16:35:56
Check for newer version at http://dev.w3.org/cvsweb/~checkout~/2002/scribe/
Present: Vlad, Garret, ChristopherChapman, Persa_Zula
Regrets: Jonathan_Kew
ScribeNick: Persa_Zula
Found Date: 29 Jul 2019
[End of scribe.perl diagnostic output]