https://lists.w3.org/Archives/Public/public-automotive/2019May/0009.html
Audio problems, stick with mail discussion
Josh: I was asked to share
details about our data models
... at present going through process to open them fully and
will do so before the September workshop, this is an
overview
... 10k ft view at Uber
... 200k managed datasets, we passed the 10B trip mark last
year, quite a bit more sensor data
... we measure the on-demand data in rows and the rest in TB
... we have been working on schemas for on-demand and streaming
datasets, RPC
... not much top down structure, standardizing internally and
putting quite a bit of effort in using some central standard
schemas
... mostly for RPC, less for streaming and storage
... need to track drivers, devices, vehicles - easy and obvious
identifiers but needed normalizing to be able to bring datasets
together
... we are using this internally and interested in opening
up
... UUIDs, timestamps - you would be surprised at all the
different types of timestamps being used
... sensors, money, geospatial
... we have a notion of entities and their relationships linked
by UUID
... we use primarily protobuf, thrift, avro, RDF and Property
Graph
... we have a common data model we can map from
... we have tooling to map data schemas
... all of these 200k+ datasets have metadata around them,
around privacy (GDPR), retention etc
... it is not readily intuitive what is user information or
PII
... we are annotating our data to make clear what it is, user,
vehicle etc so we can share information across schemas
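The point about the many timestamp types can be illustrated with a minimal sketch, not Uber's actual tooling. It normalizes a few common encodings to integer milliseconds since the Unix epoch in UTC. The function name `to_utc_millis`, the magnitude thresholds, and the rule that naive datetimes are treated as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def to_utc_millis(value):
    """Normalize a timestamp to integer milliseconds since the Unix epoch (UTC)."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption: a naive datetime is already expressed in UTC.
            value = value.replace(tzinfo=timezone.utc)
        return int(round(value.timestamp() * 1000))
    if isinstance(value, str):
        # ISO 8601 text; map a trailing "Z" to an explicit UTC offset.
        text = value[:-1] + "+00:00" if value.endswith("Z") else value
        return to_utc_millis(datetime.fromisoformat(text))
    if isinstance(value, (int, float)):
        # Heuristic (assumption): guess the unit from the magnitude.
        magnitude = abs(value)
        if magnitude < 1e11:        # epoch seconds
            return int(round(value * 1000))
        if magnitude < 1e14:        # epoch milliseconds
            return int(round(value))
        return int(round(value / 1000))  # epoch microseconds
    raise TypeError(f"unsupported timestamp type: {type(value).__name__}")


if __name__ == "__main__":
    for raw in (1558000000, 1558000000000, "2019-05-16T09:46:40Z"):
        print(raw, "->", to_utc_millis(raw))
```

All three inputs in the usage loop denote the same instant and normalize to 1558000000000.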
[algebraic pg slide]
Josh: attended W3C workshop in
May on Graphs
... here are some of the main formats used. at the top is our
common logical format. we use rdf, yaml...
... this will give some glimpses of our schemas - position
events...
... we author the schemas in YAML format, we can then map them
to protobuf, thrift, avro, rdf, generate documentation
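As a rough sketch of the idea of authoring a schema once and generating several target formats from it, the Python below uses a plain dict as a stand-in for a parsed YAML schema. The schema layout, the `PositionEvent` field names, and the type mappings are hypothetical and were not shown in the meeting. Only proto3 text and an Avro record are generated here.

```python
# Stand-in for a schema that would be authored in YAML and loaded into a dict.
# Field names and logical types are hypothetical.
SCHEMA = {
    "name": "PositionEvent",
    "fields": [
        {"name": "vehicle_id", "type": "uuid"},
        {"name": "timestamp", "type": "timestamp_millis"},
        {"name": "latitude", "type": "double"},
        {"name": "longitude", "type": "double"},
    ],
}

# Assumed mappings from logical types to target-format primitive types.
PROTO_TYPES = {"uuid": "string", "timestamp_millis": "int64", "double": "double"}
AVRO_TYPES = {"uuid": "string", "timestamp_millis": "long", "double": "double"}


def to_proto(schema):
    """Render the logical schema as a proto3 message definition."""
    lines = ['syntax = "proto3";', "", f"message {schema['name']} {{"]
    for number, field in enumerate(schema["fields"], start=1):
        lines.append(f"  {PROTO_TYPES[field['type']]} {field['name']} = {number};")
    lines.append("}")
    return "\n".join(lines)


def to_avro(schema):
    """Render the logical schema as an Avro record schema (a JSON-ready dict)."""
    return {
        "type": "record",
        "name": schema["name"],
        "fields": [
            {"name": f["name"], "type": AVRO_TYPES[f["type"]]}
            for f in schema["fields"]
        ],
    }


if __name__ == "__main__":
    print(to_proto(SCHEMA))
    print(to_avro(SCHEMA))
```

Keeping the type mappings in per-target tables means a new target format only needs one more table and one more renderer.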
Ted: are your YAML2x tools open source by chance? we do similar with our data model VSS - YAML2x
Josh: which language?
Ted: Benjamin or Daniel? (as I don't use it much)
Daniel: Python
[interest in leveraging each others' tooling]
Benjamin: can we go back to your
mapping
... don't you lose information or need to draw assumptions in
going between these formats?
Josh: if you ignore named graphs
in RDF, the mapping drops a subset; going back from it is a challenge
... avro or protobuf, eg position event, you have ordering but
not in rdf. there are things that don't map correctly and what
we tend to do is embed the rest in comments and that way the
conversion can retain it and recover it when possible
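The comment-embedding approach described above might look roughly like the following sketch. The minutes do not say how the comments are encoded, so the `rdfs:comment` predicate and the JSON payload are assumptions. Field order, which a set of triples cannot express, is carried in a comment triple and used to restore the order on the way back.

```python
import json

COMMENT = "rdfs:comment"  # assumption: where the extra information is stored


def record_to_triples(subject, ordered_fields):
    """Convert an ordered list of (name, value) pairs to an unordered triple set."""
    triples = {(subject, name, value) for name, value in ordered_fields}
    # The triple set has no field order, so embed it in a comment.
    order = json.dumps({"field_order": [name for name, _ in ordered_fields]})
    triples.add((subject, COMMENT, order))
    return triples


def triples_to_record(subject, triples):
    """Rebuild the ordered record, using the embedded comment when present."""
    values = {p: o for s, p, o in triples if s == subject}
    comment = values.pop(COMMENT, None)
    if comment is None:
        # Order information was lost; fall back to alphabetical order.
        return sorted(values.items())
    order = json.loads(comment)["field_order"]
    return [(name, values[name]) for name in order]


if __name__ == "__main__":
    fields = [("timestamp", 1558000000000), ("latitude", 37.77), ("longitude", -122.42)]
    triples = record_to_triples("event:1", fields)
    print(triples_to_record("event:1", triples) == fields)
```

If the comment triple is dropped along the way, the values still convert but the original order cannot be recovered, which matches the "recover when possible" caveat.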
Benjamin: ok understand your 1:1 with comments. based on your experiences which do you prefer, what is the primary choice for analysis?
Josh: our logical model which I
called algebraic property graphs in earlier slide
... I gave a talk at GraphDay, title is "A Graph is a Graph is a
Graph", that explains in more detail these mappings
<PatrickLue> Other presentation from Joshua: https://www.slideshare.net/joshsh/a-graph-is-a-graph-is-a-graph-equivalence-transformation-and-composition-of-graph-data-models-129403012
Josh: RDF mapping is used mainly for visualization and integration with other systems
Ted: can you comment on other systems used?
Josh: we use an in-house system
on top of Cassandra for OLTP, realtime graph
... patches for apache spark for our jobs, submitted by data
scientists manually and they can get results back in
hours
... for realtime we get 100ms responses
... we use GraphSAGE for knowledge graphs and related
techniques
... in the case of the metadata, we have table and field level
and a service to retrieve it on a dataset basis
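A minimal sketch of table- and field-level metadata retrieved per dataset could look like this. It is an in-memory illustration only; the `FieldMeta` structure, the `trips` dataset, its fields, and the retention values are all hypothetical and do not describe the actual service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMeta:
    name: str
    entity: str          # what the field describes, e.g. "user" or "vehicle"
    pii: bool            # whether the field is treated as personal data
    retention_days: int  # how long the field may be kept


# Hypothetical catalog keyed by dataset name.
CATALOG = {
    "trips": [
        FieldMeta("rider_id", "user", True, 365),
        FieldMeta("vehicle_id", "vehicle", False, 730),
        FieldMeta("pickup_location", "user", True, 90),
    ],
}


def dataset_metadata(dataset):
    """Return the field-level metadata for one dataset."""
    return CATALOG[dataset]


def pii_fields(dataset):
    """List the field names of a dataset that are annotated as PII."""
    return [f.name for f in dataset_metadata(dataset) if f.pii]


def min_retention_days(dataset):
    """Table-level view: the shortest retention period among the fields."""
    return min(f.retention_days for f in dataset_metadata(dataset))


if __name__ == "__main__":
    print(pii_fields("trips"))
    print(min_retention_days("trips"))
```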
Ted: field level granularity mostly for privacy handling or other metadata?
Josh: yes, privacy and
retention
... we are considering different options for our graph db
... our PoC for inference on categories is CQL (Categorical Query
Language by David Spivak at MIT and his group)
Daniel: this effort is pretty transformative, how much manpower was behind this in the last year?
Josh: wish we had more people,
looking to grow the team. it is an organizational challenge
about where it should reside
... it is a diverse team, coordinating between the different
departments
Benjamin: anxious to learn about
the specific ontologies of course
... from the sample excerpt and your mention of coordinating on
eg timestamp are you following spatial data best practices from
w3c?
Josh: initial push is internally
and then we need to work on externalizing, standardizing
... of the thousands of ontologies, we focused on schema.org for
alignment
... our time ontology is roughly based on the owl one. we have
a number of existing schemas dealing with time which was an
issue
Benjamin: wanting something stable and not requiring major changes and wanting to map in a future proof way
Josh: mapping is hard and will
remain so
... we did not find a good standard ontology for addresses for
example. we started with a schema.org like format
... our address experts have pushed for a more nuanced internal
standard. it is similar to the format in the Google API, we have
addresses for display and storage (key/value map), interservice
communication
Ted: can you provide a list of your schemas under consideration for this workshop?
Josh: simple type defs: URLs, UUIDs, timestamps. contact information: emails, addresses. spherical geography, currency amounts, exchange rates
[all on Data Standardization slide]
Josh: Trip table is not a
standard schema but a table
... I cannot elaborate too much on sensor events yet
Ted: sensors, along with vehicle signals could be useful for federated learning for autonomous vehicles. issue would be in the variation of placement etc across manufacturers
Benjamin: that would be just
another schema to handle how they differ across vehicles for
consumption
... another challenge would be the volume of data
... do you intend to contribute to schema.org as well some of
your fundamental ontologies
Josh: we are considering contributing to them or extending them. the RDFS format wasn't a good fit and we do not have a clear path for contributing to them at present
Benjamin: you can propose schemas for peer review. you should queue up what may be used by the community
Josh: we would have to look at
the different datasets on the web and their relationships with
our schemas
... there is an iot.schema.org but we have not looked at the
vehicles one
Benjamin: best to not create
competing ones. there is information about a vehicle going
simply from a to b but it is limited
... you will find lots of datasets, eg public taxi, it is too
basic
... you do not have much about vehicle sensors, it is a
different case
... you are more into the mobility data area than the vehicle
sensor area
Josh: we have looked around very carefully for some schemas
Ted: could you perhaps ask someone from ATG to talk with us on sensors, adas etc signals?
Josh: hope to bring them in
Benjamin: there are some ways to
coordinate via iot.schema.org and maybe bring in their
perspective
... simple interaction patterns
... schema.org prefers step by step extensions, it is meant to
be really central