jpamental: 👋
<Persa_Zula> present
<scribe> ScribeNick: myles
Vlad: Action on me was to find
out how to use Monotype font data for analysis
... This was more complicated than expected: for different
types of analysis, we need to get our hands on the full raw
font files so we can restructure them, which means the data has
to be given up for manipulation. I managed to convince people
that it's something we should be doing. My colleagues were
relatively comfortable with sharing it within the WG, but not
publicly.
... The product that our customers use to get our fonts is
called Mosaic. In Mosaic, we have a separate account for this
WG. It will be populated with fonts that are downloadable so we
can get our hands on raw font files on that account. That
account will be full of fonts we need for testing.
... I'm working with my coworkers to identify different font
options. But the account is open.
Garret: That's good news.
... I've been hard at work coding the analysis framework and
simulations of transfer methods.
... It's on the repo.
... I'll walk through the pieces.
... We've populated the open source repo with code. A couple of
pieces: 1) The PFE framework, in the analysis directory. The
most important piece is analyzer.py. This takes in some input data
and a set of fonts, and runs through that input data using a
variety of enrichment methods and network models. (Defined at
top) and outputs resulting data based on running the
simulations.
... The input data is protos. We're going to have sequences of
page views. Each page view is a set of code points. A single
page view can have multiple fonts mapping to different code
points.
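The page-view input described above could be sketched as follows. The real framework stores this as protos, so the class and field names here are illustrative assumptions, not the repo's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class PageView:
    # font name -> code points rendered with that font on this page view
    codepoints_by_font: Dict[str, Set[int]] = field(default_factory=dict)

@dataclass
class SequenceData:
    # an ordered browsing sequence of page views
    page_views: List[PageView] = field(default_factory=list)

# A two-page browsing sequence where the first page uses two fonts:
sequence = SequenceData(page_views=[
    PageView({"NotoSans": {ord(c) for c in "Hello"},
              "NotoSerif": {ord(c) for c in "World"}}),
    PageView({"NotoSans": {ord(c) for c in "Hello again"}}),
])
```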
... That performs the analysis and gets results back
... It's mostly done
... Next, I implemented some common font transfer methods. Three
exist. 1) Whole font - transferring one single font file. 2)
Unicode range, using unicode-range based subsets, which are the
same subsets google fonts uses. If other people have subsets,
we can use them too. For now, this is okay. 3) Optimal transfer
method, which is a theoretical minimum. We can use this as a
benchmark to compare against other methods.
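A rough sketch of how the three methods compare in bytes transferred, under made-up assumptions (a flat byte cost per glyph, a two-bucket unicode-range partition); none of this is the framework's real code.

```python
# Hypothetical font: Basic Latin plus Latin-1 Supplement, flat glyph cost.
FONT_GLYPHS = set(range(0x20, 0x7F)) | set(range(0xA0, 0x100))
BYTES_PER_GLYPH = 100

# Toy unicode-range partition: two buckets, each downloaded whole on any hit.
SUBSETS = [set(range(0x20, 0x7F)), set(range(0xA0, 0x100))]

def whole_font_cost(needed):
    # Method 1: the full file is transferred regardless of what is needed.
    return len(FONT_GLYPHS) * BYTES_PER_GLYPH

def unicode_range_cost(needed):
    # Method 2: any bucket containing a needed code point is sent in full.
    sent = set()
    for bucket in SUBSETS:
        if bucket & needed:
            sent |= bucket
    return len(sent) * BYTES_PER_GLYPH

def optimal_cost(needed):
    # Method 3: theoretical minimum - only the glyphs actually used.
    return len(needed) * BYTES_PER_GLYPH

needed = {ord(c) for c in "Hello"}
assert optimal_cost(needed) <= unicode_range_cost(needed) <= whole_font_cost(needed)
```

The optimal method serves as the benchmark floor: any real method's cost falls between it and the whole-font ceiling.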
... There will be more of these (Myles is supposed to write
one)
... You need to create a session object, which tracks state.
<lists methods inside the class>
... There are 3 examples, you can look at those if you want to
make a new one.
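The session idea can be sketched like this for the whole-font method; the class and method names are assumptions, not the repo's actual API.

```python
class WholeFontSession:
    """Tracks per-client state across page views for the whole-font method."""

    def __init__(self, font_size_bytes):
        self.font_size_bytes = font_size_bytes
        self.downloaded = False  # state carried between page views

    def page_view(self, codepoints):
        """Returns bytes transferred to render this page view."""
        if self.downloaded or not codepoints:
            return 0                    # font already cached, nothing to send
        self.downloaded = True
        return self.font_size_bytes     # first use downloads the whole file

session = WholeFontSession(font_size_bytes=250_000)
costs = [session.page_view({72, 105}), session.page_view({66, 121, 101})]
```

A unicode-range or patch-and-subset session would keep richer state (which buckets or glyphs the client already holds) but expose the same per-page-view interface.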
... I also implemented the patch and subset inside
patch_subset, using C code with Python wrappers. This is not
fully complete yet.
... Can simulate basic usage.
... I used "bazel" for the build system. There are some
examples in the README.
... There are a couple of English pages in a sequence we can
use to test things.
... The raw data is hard to parse, so there are some tools. One
just prints the total cost of each network model.
... We have continuous integration running. Each commit runs
all the tests. We also have a format checker.
myles: can the format be per directory? I don't want to have to rewrite all my code
Garret: yes
... And you can have your own formatter, or no formatter,
whatever you want
... Pretty much everything has tests, we should continue this
practice.
... please let me know if you need help.
... Right now, the transfer models are simplistic, and don't
account for headers and overhead in HTTP. That needs to be
included. The cost function is random right now, so we need to
pick more reasonable values instead, and tune it. Same with the
network model parameters. And patch-and-subset is not fully
implemented yet.
<Garret> Myles: doing research activity on sorting font files.
<Garret> Myles: gave a presentation months ago with some initial work. Have been building on that to improve it.
<Garret> Myles: started with attempts for finding an order for a font file.
<Garret> Myles: input was character frequency data. Output was some different orders.
<Garret> Myles: additional work has been extending this.
<Garret> Myles: the strategy I've been using is, instead of finding one solution, creating a toolkit of different solutions, since one solution will not work well for all font files.
<Garret> Myles: all of these solutions take some amount of frequency data, or the state of nature, as input. The output is an ordering.
<Garret> Myles: so the approach that I've taken has three levels. Level 1 is picking one or a set of orderings from data.
<Garret> Myles: the ordering from a couple of months ago was from co-occurrence: for each character, we found the other characters that also appeared with it.
<Garret> Myles: over the past few months I've improved on this by tracking individual bytes. This captures things like some glyphs being larger than others.
<Garret> Myles: for counting bytes, this matters.
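The byte-weighted co-occurrence counting might look roughly like this; the glyph sizes are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical per-glyph byte sizes (real values come from the font file).
GLYPH_BYTES = {"a": 120, "b": 150, "c": 200}

def cooccurrence(pages):
    """Weight each character pair by the bytes it pulls in when co-occurring."""
    counts = defaultdict(float)
    for text in pages:
        chars = sorted(set(text) & GLYPH_BYTES.keys())
        for x, y in combinations(chars, 2):
            counts[(x, y)] += GLYPH_BYTES[x] + GLYPH_BYTES[y]
    return dict(counts)

counts = cooccurrence(["abc", "ab", "ac"])
```

Pairs of large glyphs that frequently appear together thus score highest, which is what matters when the objective is bytes transferred rather than character counts.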
<Garret> Myles: the second way I've improved is that rather than downloading 100 webpages manually, I used an off-the-shelf web crawler. Got a corpus of 100k pages now. The crawler has built-in functionality for grabbing just text content.
<Garret> Myles: the corpus is not page views, the data is just collections of tuples (url, text content)
<Garret> Myles: that's level one of this. Creating ordering based on this type of data.
<Garret> Myles: another type of input is taken from Facebook research, which published a data set for modeling languages. Every word in a language is associated with a multi-dimensional vector.
<Garret> Myles: these vectors have been trained so that they carry the semantic meaning associated with a word.
<Garret> Myles: say if you take the word queen, subtract woman, and add man, you get king.
<Garret> Myles: facebook has published this model.
<Garret> Myles: this provides semantic grouping of words, and because of how Chinese works I'm pursuing treating these words as single characters.
<Garret> Myles: there are a couple of other implementations of this, but Facebook's is the easiest to use.
<Garret> Myles: level one is coming up with orderings based on that semantic info.
<Garret> Myles: level 2 is coming up with better orderings.
<Garret> Myles: used a couple of different strategies: 1) a genetic algorithm, 2) gradient descent, 3) simulated annealing, 4) a particle simulation.
<Garret> Myles: they accept an ordering and a fitness function and produce better orderings.
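One of the level-2 optimizers, simulated annealing, can be sketched as follows: it takes an ordering and a fitness function (lower is better) and returns an improved ordering. All parameter values here are illustrative, not tuned.

```python
import math
import random

def anneal(ordering, fitness, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over orderings via random pairwise swaps."""
    rng = random.Random(seed)
    best = cur = list(ordering)
    best_f = cur_f = fitness(cur)
    t = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(cur)), 2)
        cand = list(cur)
        cand[i], cand[j] = cand[j], cand[i]      # propose: swap two positions
        cand_f = fitness(cand)
        # Accept improvements always; accept regressions with probability
        # that shrinks as the "temperature" t cools.
        if cand_f < cur_f or rng.random() < math.exp((cur_f - cand_f) / t):
            cur, cur_f = cand, cand_f
            if cur_f < best_f:
                best, best_f = cur, cur_f
        t *= cooling
    return best

# Toy fitness: distance from alphabetical order.
target = list("abcdef")
fit = lambda o: sum(abs(i - target.index(c)) for i, c in enumerate(o))
result = anneal(list("fedcba"), fit)
```

The cooling rate and initial temperature are exactly the kind of knobs the level-3 step below is meant to tune.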
<Garret> Myles: level 3, optimizing the optimizers.
<Garret> Myles: in ML there's this concept of hyperparameter optimization.
<Garret> Myles: a genetic algorithm has parameters like which random distribution to pick so it works best, or in simulated annealing the rate at which energy decreases.
<Garret> Myles: when you create these optimizers there are knobs that need to be tuned to get good results.
<Garret> Myles: a human can pick some numbers and hope for the best, but in the literature there's a bunch of work on how to pick values for these parameters.
<Garret> Myles: so level 3 is picking good numbers for the optimizers. Using "hyperopt" to optimize them. That's the scope of what I've been working on.
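The minutes name hyperopt for this level; as a dependency-free stand-in, here is a random-search sketch of the same idea: sample optimizer parameters, score each by the result the tuned optimizer achieves, and keep the best. The parameter name and range are made up.

```python
import random

def tune(run_optimizer, param_range, trials=20, seed=0):
    """Random-search hyperparameter tuning: lower returned score is better."""
    rng = random.Random(seed)
    best_param, best_score = None, float("inf")
    for _ in range(trials):
        p = rng.uniform(*param_range)   # sample a candidate, e.g. a cooling rate
        score = run_optimizer(p)        # run the optimizer with that knob setting
        if score < best_score:
            best_param, best_score = p, score
    return best_param, best_score

# Toy stand-in for "run the annealer with cooling rate c and report fitness":
# pretend the optimizer works best near cooling = 0.95.
toy = lambda c: (c - 0.95) ** 2
param, score = tune(toy, (0.5, 1.0))
```

Tools like hyperopt replace the uniform sampling with smarter strategies (e.g. tree-structured Parzen estimators) but keep the same objective-function shape.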
<Garret> Myles: have some results, better than a couple of months ago, but not ready to share yet.
<Garret> Myles: so in the next few weeks, finish up the pieces, open source the code, gather results. Then hook up into the framework.
<Garret> Myles: written in Objective-C.
<Garret> Myles: one more thing: taking the fastText hyper-dimensional vectors and projecting them into one dimension. That yields an ordering, so I want to explore hyperparameter optimization to come up with a projection matrix.
<Garret> Myles: hopefully that'll be another input.
<Garret> Myles: Chinese fonts have thousands of glyphs. The search space is very large. The work I'm doing is finding good areas in that space without searching the whole thing. The size of the projection matrix is only the size of the vector space (300 vs 8000)
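The projection idea can be sketched as: dot each character's embedding with a projection vector to get a scalar, then sort characters by that scalar to obtain an ordering. Real fastText vectors are 300-dimensional; the 3-dimensional vectors below are invented for illustration.

```python
def ordering_from_projection(embeddings, projection):
    """Project each embedding to a scalar and sort characters by it."""
    def score(char):
        vec = embeddings[char]
        return sum(v * p for v, p in zip(vec, projection))  # dot product
    return sorted(embeddings, key=score)

# Made-up 3-d "embeddings" standing in for fastText's 300-d vectors.
embeddings = {
    "日": [0.9, 0.1, 0.0],
    "本": [0.8, 0.2, 0.1],
    "語": [0.1, 0.9, 0.5],
}
order = ordering_from_projection(embeddings, projection=[1.0, -1.0, 0.0])
```

Tuning the projection vector (rather than the full ordering) is what shrinks the search space from thousands-of-glyphs permutations down to the embedding dimensionality.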
<Garret> Myles: so hopefully that will yield some good results.
Vlad: how much more time do you need?
Garret: roughly 2 weeks.
<Vlad> Myles: will be out until end of January
Garret: no update on the corpus yet
<Vlad> Next tentative call date - Feb 3rd, 2020
Present: Vlad, jpamental, Persa_Zula, myles, Garret
ScribeNick: myles
Date: 09 Dec 2019