W3C

- DRAFT -

Web Fonts Working Group Teleconference

09 Dec 2019

Attendees

Present
Vlad, jpamental, Persa_Zula, myles, Garret
Regrets
Chair
Scribe
myles

Contents


jpamental: 👋

<Persa_Zula> present

<scribe> ScribeNick: myles

Vlad: Action on me was to find out how to use Monotype font data for analysis
... This was more complicated because, for different types of analysis, we need to get our hands on the full raw font files so we can restructure them; the data needs to be given up for manipulation. I managed to convince people that it's something we should be doing, but my colleagues were only comfortable with sharing it within the WG, not publicly.
... The product that our customers use to get our fonts is called Mosaic. In Mosaic, we have a separate account for this WG. It will be populated with fonts that are downloadable so we can get our hands on raw font files on that account. That account will be full of fonts we need for testing.
... I'm working with my coworkers to identify different font options. But the account is open.

Garret: That's good news.
... I've been hard at work coding the analysis framework and simulations of transfer methods.
... It's on the repo.
... I'll walk through the pieces.
... We've populated the open source repo with code. A couple of pieces: 1) The PFE framework, in the analysis directory. The most important piece is analyzer.py. This takes in some input data and a set of fonts, runs through that input data using a variety of enrichment methods and network models (defined at the top), and outputs resulting data based on running the simulations.
... The input data is protos. We're going to have sequences of page views. Each page view is a set of code points. A single page view can have multiple fonts mapping to different code points.
... That performs the analysis and gets results back
... It's mostly done
... Next, I implemented some common font transfer methods. Three exist so far: 1) Whole font: transferring one single font file. 2) Unicode range: using unicode-range based subsets, which are the same subsets Google Fonts uses. If other people have subsets, we can use them too; for now, this is okay. 3) Optimal transfer: a theoretical minimum. We can use this as a benchmark to compare the other methods against.
... There will be more of these (Myles is supposed to write one)
... You need to create a session object, which tracks state (see the sketch after this update). <lists methods inside the class>
... There are 3 examples, you can look at those if you want to make a new one.
... I also implemented patch and subset inside patch_subset, using C code with Python wrappers. This is not fully complete yet, but it's partially there.
... Can simulate basic usage.
... I used "bazel" for the build system. There are some examples in the README.
... There are a couple of English pages in a sequence we can use to test things.
... The raw data is hard to parse, so there are some tools. One just prints the total cost of each network model.
... We have continuous integration running. Each commit runs all the tests. We also have a format checker.
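
A hedged sketch, in Python, of the pieces Garret describes above: the per-page-view input shape (a set of code points per font) and a minimal transfer-method session object for the whole-font case. The class and method names are hypothetical, not the actual proto schema or repo interface.

  class WholeFontSession:
      """Simplest method: transfer each full font file once, then reuse it."""

      def __init__(self, font_sizes_by_name):
          self.font_sizes_by_name = font_sizes_by_name  # bytes per font file
          self.already_sent = set()
          self.request_byte_counts = []

      def page_view(self, codepoints_by_font):
          # codepoints_by_font mirrors one page view from the input data:
          # font name -> set of code points used on that page view.
          # For the whole-font method the code points are irrelevant; any font
          # the simulated client does not yet have is downloaded in full.
          for font_name in codepoints_by_font:
              if font_name not in self.already_sent:
                  self.request_byte_counts.append(self.font_sizes_by_name[font_name])
                  self.already_sent.add(font_name)

      def total_bytes_transferred(self):
          return sum(self.request_byte_counts)

  # A two-page-view sequence: the second view needs one more CJK code point
  # but transfers nothing new, since the whole font was already sent.
  session = WholeFontSession({"NotoSansCJK": 16_000_000, "Roboto": 170_000})
  session.page_view({"NotoSansCJK": {0x4E00, 0x4E8C}, "Roboto": {0x41, 0x42}})
  session.page_view({"NotoSansCJK": {0x4E00, 0x4E09}})
  total = session.total_bytes_transferred()   # 16_170_000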

myles: can the format checker be configured per directory? I don't want to have to rewrite all my code

Garret: yes
... And you can have your own formatter, or no formatter, whatever you want
... Pretty much everything has tests, we should continue this practice.
... please let me know if you need help.
... Right now, the transfer models are simplistic and don't account for headers and overhead in HTTP. That needs to be included. The cost function is random right now, so we need to pick more reasonable values instead and tune it. Same with the network model parameters. And patch-and-subset is not fully implemented yet.
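
A rough sketch of the kind of cost function being described once HTTP overhead is accounted for; the numbers below are placeholders, not the tuned values still to be picked.

  def request_cost_ms(payload_bytes,
                      rtt_ms=100.0,                  # round-trip latency
                      bandwidth_bytes_per_ms=625.0,  # roughly 5 Mbit/s
                      header_overhead_bytes=500):    # HTTP request/response headers
      """Estimated wall-clock cost of one request under a simple linear model."""
      total_bytes = payload_bytes + header_overhead_bytes
      return rtt_ms + total_bytes / bandwidth_bytes_per_ms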

<Garret> Myles: doing research activity on sorting font files.

<Garret> Myles: gave a presentation months ago with some initial work. Have been building on that to improve it.

<Garret> Myles: started with attempts at finding an order for a font file.

<Garret> Myles: input was character frequency data. Output was some different orders.

<Garret> Myles: additional work has been extending this.

<Garret> Myles: the strategy that I've been using is, instead of finding one solution, creating a toolkit of different solutions, since one solution will not work well for all font files.

<Garret> Myles: all of these solutions take some amount of frequency data, or the state of nature, as input. The output is an ordering.

<Garret> Myles: so the approach that I've taken has three levels. Stage 1 is picking one or a set of orderings from data.

<Garret> Myles: the ordering from a couple months ago was from co-occurrence. For each character, it found other characters that also appeared with it.

<Garret> Myles: over the past few months I've improved on this by tracking individual bytes. This accounts for things like some glyphs being larger than others.

<Garret> Myles: for counting bytes this matters.

<Garret> Myles: the second way that I've improved: rather than downloading 100 webpages manually, I used an off-the-shelf web crawler. Got a corpus of 100k pages now. The crawler has built-in functionality for grabbing just text content.

<Garret> Myles: the corpus is not page views; the data is just a collection of (url, text content) tuples.

<Garret> Myles: that's level one of this: creating orderings based on this type of data.
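
A small sketch of the level-one idea: deriving an ordering from a corpus of (url, text content) tuples by character frequency. A fuller version would also track co-occurrence and per-glyph byte sizes, as described above.

  from collections import Counter

  def frequency_ordering(corpus):
      counts = Counter()
      for url, text in corpus:
          counts.update(text)
      # Most frequent characters first, so the earliest bytes of a reordered
      # font cover the most likely content.
      return [char for char, _ in counts.most_common()]

  ordering = frequency_ordering([
      ("https://example.com/a", "the quick brown fox"),
      ("https://example.com/b", "the lazy dog"),
  ])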

<Garret> Myles: another type is taken from Facebook Research, which publishes a data set for modeling languages. Every word in a language is associated with a multi-dimensional vector.

<Garret> Myles: these vectors have been trained so that they have semantic meaning associated with a word.

<Garret> Myles: say if you take the word queen, subtract woman, and add man, you get king.

<Garret> Myles: Facebook has published this model.

<Garret> Myles: this provides a semantic grouping of words, and because of how Chinese works, I'm pursuing treating these words as single characters.

<Garret> Myles: there are a couple of other implementations of this, but Facebook's is the easiest to use.
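
A toy illustration of the word-vector arithmetic described above, using made-up low-dimensional stand-ins; the real fastText vectors are 300-dimensional and come from Facebook's published models.

  import numpy as np

  vectors = {
      "king":  np.array([0.9, 0.8, 0.1]),
      "queen": np.array([0.9, 0.1, 0.8]),
      "man":   np.array([0.1, 0.9, 0.1]),
      "woman": np.array([0.1, 0.1, 0.9]),
  }

  def nearest(target):
      def cosine(a, b):
          return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
      return max(vectors, key=lambda word: cosine(vectors[word], target))

  # queen - woman + man lands nearest to king in this toy space.
  print(nearest(vectors["queen"] - vectors["woman"] + vectors["man"]))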

<Garret> Myles: level one is coming up with orderings based on that semantic info.

<Garret> Myles: level 2 is coming up with better orderings.

<Garret> Myles: used a couple of different strategies: 1) a genetic algorithm, 2) gradient descent, 3) simulated annealing, 4) particle simulation.

<Garret> Myles: they accept an ordering and a fitness function and produce better orderings.
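
A hedged sketch of one of those level-two optimizers, simulated annealing over orderings; the fitness function here is a placeholder for whatever cost the simulation framework reports.

  import math
  import random

  def simulated_annealing(ordering, cost, iterations=10000,
                          start_temp=1.0, cooling=0.999):
      """Search for a lower-cost ordering by randomly swapping pairs."""
      current = list(ordering)
      current_cost = cost(current)
      best, best_cost = list(current), current_cost
      temp = start_temp
      for _ in range(iterations):
          i, j = random.sample(range(len(current)), 2)
          candidate = list(current)
          candidate[i], candidate[j] = candidate[j], candidate[i]
          candidate_cost = cost(candidate)
          delta = candidate_cost - current_cost
          # Always accept improvements; sometimes accept regressions,
          # less often as the temperature drops.
          if delta < 0 or random.random() < math.exp(-delta / temp):
              current, current_cost = candidate, candidate_cost
              if current_cost < best_cost:
                  best, best_cost = list(current), current_cost
          temp *= cooling
      return best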

<Garret> Myles: level 3, optimizing the optimizers.

<Garret> Myles: in ML there's this concept of hyperparameter optimization.

<Garret> Myles: a genetic algorithm has parameters like which random distribution to pick to work best, or, in simulated annealing, the rate at which the temperature decreases.

<Garret> Myles: when you create these optimizers there are knobs that need to be tuned to get good results.

<Garret> Myles: so a human can pick some numbers and hope for the best. In the literature there's a bunch of work on how to pick values for these parameters.

<Garret> Myles: so level 3 is picking good numbers for the optimizers. Using "hyperopt" to optimize them. That's the scope of what I've been working on.
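
A sketch of the level-three idea using hyperopt (the library named above) to tune the knobs of the annealing sketch shown earlier; the search-space bounds and the placeholder cost function are made up.

  from hyperopt import fmin, tpe, hp

  initial_ordering = list("abcdefgh")   # placeholder; really a glyph ordering

  def simulated_transfer_cost(ordering):
      # Placeholder fitness; real code would run the transfer simulation.
      return sum(i * ord(char) for i, char in enumerate(ordering))

  def objective(params):
      # simulated_annealing is the optimizer sketched earlier.
      ordering = simulated_annealing(initial_ordering,
                                     simulated_transfer_cost,
                                     start_temp=params["start_temp"],
                                     cooling=params["cooling"])
      return simulated_transfer_cost(ordering)

  best_params = fmin(
      fn=objective,
      space={
          "start_temp": hp.loguniform("start_temp", -3, 3),
          "cooling": hp.uniform("cooling", 0.9, 0.9999),
      },
      algo=tpe.suggest,
      max_evals=100,
  )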

<Garret> Myles: have some results, better than a couple of months ago, but not ready to share yet.

<Garret> Myles: so in the next few weeks, finish up the pieces, open source the code, gather results. Then hook up into the framework.

<Garret> Myles: written in Objective-C.

<Garret> Myles: one more thing: taking the fastText hyper-dimensional vectors and projecting them into one dimension. That yields an ordering, so I want to explore hyperparameter optimization to come up with a projection matrix.

<Garret> Myles: hopefully that'll be another input.

<Garret> Myles: Chinese fonts have thousands of glyphs. The search space is very large. The work I'm doing is finding good areas in that space without searching the whole thing. The size of the projection matrix is only the size of the vector space (300 vs 8000).
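
A small sketch of that projection idea: tune a 300-entry projection vector that maps each character's fastText vector to a single number, then sort by that number to get an ordering. Names and dimensions here are illustrative.

  import numpy as np

  def ordering_from_projection(char_vectors, projection):
      # char_vectors: character -> 300-d vector; projection: 300-d vector.
      scores = {char: float(vec @ projection) for char, vec in char_vectors.items()}
      return sorted(scores, key=scores.get)

  rng = np.random.default_rng(0)
  char_vectors = {chr(0x4E00 + i): rng.normal(size=300) for i in range(100)}
  projection = rng.normal(size=300)   # the hyperparameter search would tune this
  ordering = ordering_from_projection(char_vectors, projection)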

<Garret> Myles: so hopefully that will yield some good results.

Vlad: how much more time do you need?

Garret: roughly 2 weeks.

<Vlad> Myles: will be out until end of January

Garret: no update on the corpus yet

<Vlad> Next tentative call date - Feb 3rd, 2020

Summary of Action Items

Summary of Resolutions

[End of minutes]

Minutes manually created (not a transcript), formatted by David Booth's scribe.perl version 1.154 (CVS log)
$Date: 2019/12/09 17:41:13 $
