<Vlad> scribenick: Persa_Zula
<myles> ScribeNick: Persa_Zula
Garret: Starting to think about how to implement the protocol for enrichment, because we need the details to perform analysis
<Vlad> Document link: https://docs.google.com/document/d/19K5MCElyjdUZknoxHepcC3s7tc-i4I8yK2M1Eo2IXFw/edit
Garret: Biggest problem is
sending subset codepoints back and forth; CJK in particular has
thousands. The naive approach is inefficient
... Generated sample sets based on chars from each script;
encoded them to get a sense of which method might work
best
... First 2 are more naive approaches, each int as a series of
bytes. Then tried diff. styles of bitsets
... Tried bloom filters as well - a probabilistic data
structure
... Analysis was informative - the bitsets performed the best
except for very small sets
... GZIP or Brotli compression is helpful as well
Myles: even on the bitsets?
Garret: Yes, even with the
bitsets, it was surprising.
... Even sparse bit sets when applying brotli saw
improvements
... A bitset seems like the way to go from this data, but we
might want to support multiple types of encoding and choose
what is best for that set of codepoints
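The trade-off Garret describes (support multiple encodings and pick the smallest per set) can be sketched roughly like this; the two formats and the one-byte tag are invented for illustration, not the actual protocol:

```python
# Sketch (not the actual protocol): compare two candidate encodings
# for a codepoint set and keep whichever comes out smaller.

def encode_uint24_list(codepoints):
    """Naive encoding: each codepoint as 3 big-endian bytes."""
    out = bytearray()
    for cp in sorted(codepoints):
        out += cp.to_bytes(3, "big")
    return bytes(out)

def encode_bitset(codepoints):
    """Dense bitset covering 0 up to the largest codepoint."""
    top = max(codepoints)
    bits = bytearray((top // 8) + 1)
    for cp in codepoints:
        bits[cp // 8] |= 1 << (cp % 8)
    return bytes(bits)

def encode_smallest(codepoints):
    """Pick the shorter encoding, tagged with a one-byte format id."""
    a = encode_uint24_list(codepoints)
    b = encode_bitset(codepoints)
    return (b"\x00" + a) if len(a) <= len(b) else (b"\x01" + b)
```

A sparse set of high codepoints favors the integer list; a dense CJK block favors the bitset, which matches the pattern in Garret's analysis.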
... Even the integer style is useful for some
... Tried some more advanced methods as well
... Not documented. Extension of sparse bit set to allow
encoding of ranges
... Between this start and this endpoint. We can get a smaller
encoding with a range. First encoded a series of ranges.
Anything that we don't want in a range can then be encoded in the
bitset. Then you can add in smarts to figure out whether to add
anything to a range or the bitset. Experiments show it
outperforms the brotli-compressed bitset, which was previously
the best
... Will write this approach up and send it to group
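One reading of the range-extension idea, as a sketch (the greedy split heuristic and the min_run threshold are invented; Garret's write-up will have the real format):

```python
# Sketch of the range-plus-bitset idea: greedily pull long consecutive
# runs out into (start, end) ranges, and leave the leftover codepoints
# for the bitset. min_run is an invented tuning knob.

def split_ranges(codepoints, min_run=16):
    """Return (ranges, remainder): runs of >= min_run consecutive
    codepoints become ranges; everything else stays for the bitset."""
    cps = sorted(codepoints)
    ranges, remainder = [], []
    i = 0
    while i < len(cps):
        j = i
        while j + 1 < len(cps) and cps[j + 1] == cps[j] + 1:
            j += 1
        if j - i + 1 >= min_run:
            ranges.append((cps[i], cps[j]))
        else:
            remainder.extend(cps[i:j + 1])
        i = j + 1
    return ranges, remainder
```

A long contiguous block costs two endpoints instead of a run of bits, which is where the win over a plain compressed bitset would come from.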
Myles: This approach will not be applicable to the range request approach?
Garret: Correct, because you'd be
stuck with HTTP request protocol
... We can't use alternate range encodings, must use what
protocol uses
Myles: Relative to the size of the fonts and overhead in transfers, roughly what % is the charset itself
Garret: Glyph data is the biggest
part of the payload on the response side
... CJK font: the client loads a large portion of it, 4k codepoints;
encounters just a few more chars and sends a request back. They
actually have to specify that whole set: roughly 4KB of data to
encode it, when they're requesting just a few bytes of glyph data
... In this case, you're sending much more to get little
back
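Back-of-envelope for the asymmetry Garret describes, assuming roughly 1 byte per codepoint after encoding and compression (the figure quoted above is an assumption, not a measured protocol cost):

```python
# Illustrative only: in a stateless protocol the full subset
# definition is re-sent on every request, so upload cost scales
# with everything already loaded, not with what is newly needed.

BYTES_PER_CODEPOINT = 1  # assumed post-compression cost

def request_size(already_loaded, newly_needed):
    """Stateless request: the whole codepoint set goes every time."""
    return (already_loaded + newly_needed) * BYTES_PER_CODEPOINT

# 4,000 codepoints loaded, 3 new ones encountered:
# ~4 KB uploaded to fetch what may be only a few hundred bytes
# of glyph data.
```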
Myles: Every instance of progressive enrichment gets worse and worse the longer the page gets
Garret: Yes. Need to find how to optimize how it works as the page gets bigger
Myles: The analysis is not going to model a single page where characters are needed over the long lifetime; but a series of pages and their payloads?
Garret: Yes, we need to look at more pageviews
Myles: We can tweak this knob; the chain can be really long or really short. We should change that knob to best match the customer experience
Garret: We need to know the avg. # of pageviews before we get to the end of the sequence
Myles: if you req. 50% of font in
one request vs. in 100 requests, results can be dramatically
different.
... we need to be intentional about the decision on how to
model it
Garret: This comes back to the idea to send more back to the client than they requested to avoid extra overhead in getting it later
Myles: don't think the range request method suffers from this at all
Garret: Correct
... unless you're requesting a large set all at once
Myles: maybe we can both think about this and how we can intentionally include it in the future analysis that we do
Vlad: the initial assumption is
that when we send the request back, we include everything
that's already been requested in the past
... if we can ID request coming from each client by client ID,
then the client shouldn't have to repeat requests for what has
already been delivered
Garret: That's another approach
worth thinking about, esp if there is material overhead
... But that requires us to move from stateless to stateful and
it has problems with privacy
Myles: Could be really bad for servers and cacheability too
Vlad: stateless would simplify things, but with each incremental request the total amount of data goes up while the amount of new data goes down.
Garret: exploring additional techniques. Would be good to have something that gets better as it gets bigger
Vlad: Have you tried to estimate the overhead of each request going up?
Garret: No, not beyond the set encoding work. Going to explore more. In parallel, working on the design doc; once it's done we'll get a sense of the overhead
Myles: If we really want to go
the path of making the protocol stateful - as in the server has
a database, then the # of websites that can use it
significantly dwindles. Maybe we need two approaches: one for
huge websites that can do this and one for everyone else
... assuming the difference between the two is significant
Garret: yes, if they're otherwise close on performance
Vlad: a wish for additional work: the document you put together, Garret, is good. It would be useful to extend it with a discussion of stateful vs. stateless
Garret: Yes, will include it
Vlad: will set an initial starting point by stating the issue with each approach
<scribe> ACTION: Garret: Extend document with discussion on stateful vs stateless approach
<trackbot> Created ACTION-208 - Extend document with discussion on stateful vs stateless approach [on Garret Rieger - due 2019-08-05].
KenLundeADBE: has anyone
considered how to add sequences?
... Codepoints can have sequences associated with them, and some
codepoints are not directly encoded. The Format 14 cmap subtable
handles variation selectors that are not associated with any glyph
in the font, but are processed when the table is processed
... Many of the codepoints are unencoded and require a
sequence
Myles: Item 3 on the agenda
KenLundeADBE: this is something that Adobe's implementation handles quite well
<scribe> ACTION: Persa_Zula: Bring back to the group how we do this
<trackbot> Error finding 'Persa_Zula'. You can review and register nicknames at <https://www.w3.org/Fonts/WG/track/users>.
<scribe> ACTION: Persa_Zula: Bring back to the group how we handle sequences
<trackbot> Error finding 'Persa_Zula'. You can review and register nicknames at <https://www.w3.org/Fonts/WG/track/users>.
Vlad: First approach: if you send
back to the server a list of codepoints, then the server has to make
smart decisions about those codepoints and what that subset
is, including anything that could be needed
... Second approach: get all the info for shaping and layout; then
you know exactly the glyph IDs needed to render the text, and you
request those glyph IDs
myles: We need data on this
Vlad: that might vary significantly
from one font and one script to another
... both approaches would be irrelevant for simple fonts; but
with more stylistic sets (8 or so), you have 8
glyphs for each codepoint and you will only use one of
them
... vs the glyph closure so it will send back everything
associated with that codepoint
myles: we should look at both
options and analyze the data, each option could handle
this
... client can ask for shaping, run it locally, and then send
for codepoints. or the client can list the codepoints and the
server can find the closure
... the font can be sorted so the shaping info is in place
and the client sends it to the server; or the sorting algorithm can
ignore shaping and instead try to sort by which chars are
popular and make a map of the chars needed on the page
Vlad: we have to analyze two different approaches for the smart-server type of process, and my gut feeling, which might need justification, is that the glyph-ID byte-based request would be best
Garret: 3rd option: send the set of features that is required. Then, if you need features beyond that on the client side, ask for them and the server can include them in the closure, to avoid sending stylistic sets you don't need
Vlad: when client knows small caps is enabled, then just send it over.
myles: approach seems like a good option. should be extensible because future extensions to OT; might be able to extend
Garret: when I design the protocol for this, we need to make sure it's flexible to be able to incrementally transfer. will try to build it in and try to make sure we can add new things down the road
myles: can one of us do some analysis to find the relationship with # of codepoints and closure
Garret: yeah if you can take it
on that would be great
... can use FontTools for it - to keep all features it will
compute the full closure on everything
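A toy illustration of what the glyph closure computation does (the real analysis would use fontTools.subset, as Garret notes; the cmap and substitution rules below are invented for illustration):

```python
# Toy glyph-closure illustration. Substitution rules such as ligatures
# and stylistic sets can pull in glyphs beyond the direct cmap mapping,
# so the closure is computed to a fixed point.

CMAP = {ord("f"): "f", ord("i"): "i"}          # codepoint -> glyph
SUBSTITUTIONS = {("f", "i"): "f_i",            # liga: f + i -> ligature
                 ("f_i",): "f_i.ss01"}         # stylistic-set variant

def glyph_closure(codepoints):
    """All glyphs reachable from the codepoints via substitutions."""
    glyphs = {CMAP[cp] for cp in codepoints if cp in CMAP}
    changed = True
    while changed:
        changed = False
        for inputs, output in SUBSTITUTIONS.items():
            if set(inputs) <= glyphs and output not in glyphs:
                glyphs.add(output)
                changed = True
    return glyphs
```

This is the relationship the action item is about: a handful of codepoints can close over many more glyphs once features are kept.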
myles: the idea is that if the
client sends exactly the codepoints that the server needs,
that's effectively all the codepoints on the page that the
user is viewing. If the font service is a different party, now the
font service knows almost certainly what someone is
reading. This is particularly true for Chinese
... happens often with Chinese fonts. it would be a privacy
violation
<Vlad> ACTION: Myles: analysis of relationships between codepoints and glyph closure
<trackbot> Created ACTION-209 - Analysis of relationships between codepoints and glyph closure [on Myles Maxfield - due 2019-08-05].
myles: this is scary because a
service that hosts many fonts on many webpages, you could
almost construct a history of the user's browsing
... proposal on how to solve this is to require the client to
request more codepoints than it actually needs to obfuscate the
signal of what's on the page
Garret: yes, we need to do this. TachyFont does do fuzzing this way, because we request more than we need. This is a good addition to require.
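A sketch of the padding idea (all parameters are invented for illustration): the client pads its requested set with random codepoints drawn from the same Unicode blocks, so the server cannot recover the exact page content from the request:

```python
# Illustrative padding for privacy: request a superset of what the
# page actually needs. ratio and block_size are invented knobs; the
# right values are exactly the open analysis question discussed above.

import random

def pad_request(needed, ratio=1.3, block_size=256, rng=random):
    """Return a superset of `needed`, roughly ratio * len(needed) big,
    padded with random codepoints from the blocks already touched."""
    padded = set(needed)
    target = int(len(needed) * ratio)
    blocks = {cp - (cp % block_size) for cp in needed}
    while len(padded) < target:
        base = rng.choice(sorted(blocks))
        padded.add(base + rng.randrange(block_size))
    return padded
```

Drawing the padding from blocks the page already touched keeps the extra download plausible for the script in use, though, as raised below, it may not defeat an adversary looking for specific character sequences.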
Vlad: a number of times during past discussions we brought up predicting characters needed in addition to what is requested. To solve both problems, we can marry these two approaches: not just extra codepoints for fuzzing, but also making the next request more efficient by using a predictor for missing chars that may be needed
myles: right. Both having the client request more than needed and the server giving back more than needed are important, for different reasons. If each adds 30% on top of what is needed, that's more than 60% additional characters. But saying privacy should be on by default is important to include
Garret: will include in design doc
myles: another question - how far to turn that knob? is 30% reasonable? needs more analysis to find out what roughly that number might be. have some ideas that aren't ready to share yet. different browsers have different privacy requirements. this is a privacy and performance tradeoff. we need to include a minimum that can guarantee privacy for every user on the web
Vlad: don't like the word
guarantee because we can't prove that. As discussed in the
email, depending on the script, sometimes it's easy to predict the
content and sometimes it's not
... can't be guaranteed for each user. but much better to do
something about this
jpamental: is there a relationship between how much we need to request and the total number of glyphs in a given font?
myles: there's a difference
between a single chinese font that has thousands of chars and a
font that has a zillion languages supported
... for non-alphabetic languages it might be something more
complicated, but for alphabetic ones it would be more
straightforward. It has to do with usage and language
jpamental: i wonder if it's something that might be identified in the font
Garret: if we're dealing with codepoints in these languages, we need to add these additional characters for obfuscation, coming up with reasonable levels for each language
ned: ideographic scripts would suffer from this issue. Each codepoint carries more meaning than in other scripts; from an information standpoint, the greater the number of characters needed to convey something in a language,
John Hudson: on the CJK front, there's the privacy issue of someone knowing the fairly specific content of a site that someone's visited. But if a bad agent is looking to track what kind of sites someone has been looking at, they may be looking for specific chars or specific sequences of chars indicating something. Not sure that simply adding a bunch of other chars would prevent someone from determining that a user is following protests in Hong Kong
myles: I wish that one of us could come up with a better solution
Vlad: we haven't really given this much time to think about yet; we need better knowledge of what the problem is and how we can possibly solve it. If we can, having more than one approach up for discussion is a good thing
Garret: in the current state of the art, font requests include referrers. This is pretty effective at telling which fonts are being used on what pages. Not to say we shouldn't deal with this, but we already have a big privacy hole with font requests
myles: the perfect shouldn't be the enemy of the good. Just because we have problems now doesn't mean we should make those problems worse
Vlad: it's an excellent idea to keep things like privacy right at the beginning of a design for a solution because it will inform our design. we don't have a proposed solution at this point. let's keep an open mind about this
myles: the thing I was hoping for in this discussion is that we require that user agents make some effort to solve this problem
Vlad: I think the answer is yes;
typically this comes up at the end of the process when we put
something subject to wide review. thinking of this ahead of
time is definitely a benefit
... in essence, the w3c process requires a statement at some point
that we thought of this. Sometimes these are broad statements,
but in this case it might be our concern
myles: it is absolutely our concern
Vlad: for woff2 or woff1 it was
not our problem, because we were a transfer agent
... but here we are dealing with the content given to us
... we now have a much clearer picture of what is needed
... no time left to discuss Face to Face topic
... at the end of F2F we want a well defined timeline; there
are other things that have to happen. we still have time to do
it, I think. in the next months, things that can be done should
be done. should have a good estimate of what to do. includes
what's been done by garret and the documents he has created. we
also need that for myles' proposal, because that approach
depends on the tool that takes the font data in and makes it more
efficient[CUT]
... myles, we need an estimate of what it would take to have
that tool available
... jpamental, any more info on logistics?
jpamental: the director of ATypI has
promised that we will have a space available for the full day,
and the latest is that it will be in the main conference venue
rather than somewhere else
... the complication is that ATypI did not have the venue for the
day before, but she will make arrangements for us
... getting ready to launch schedule site this week so will
have more info soon
Vlad: let us know as you get more
info on the email list
... out of office next week, no discussion next Monday
<jpamental> exit
<jpamental> bye
Thanks for letting me give it a shot :)
This is scribe.perl Revision: 1.154 of Date: 2018/09/25 16:35:56
Check for newer version at http://dev.w3.org/cvsweb/~checkout~/2002/scribe/
Present: Vlad, Garret, ChristopherChapman, Persa_Zula
Regrets: Jonathan_Kew
ScribeNick: Persa_Zula
Found Date: 29 Jul 2019
[End of scribe.perl diagnostic output]