I recently spoke with Shoaib Mufti, YarcData Vice President of R&D, about Big Data and Semantic Web technology. YarcData is a Cray subsidiary, accustomed to crunching lots of data. YarcData recently joined W3C.
Ian: Why did YarcData join W3C?
SM: YarcData has products, including the Urika data appliance, for manipulating graphs. Instead of reinventing the wheel, we found benefits in existing standards. We also decided to contribute to the significant semantic effort at W3C.
Ian: How do you see Big Data relating to the Semantic Web?
SM: We think that for Big Data you need standards, and that linked data standards fit the bill. What’s exciting is the opportunity to find in data something of value that was non-obvious. For this you need tools to join terms, to do reasoning and inference, and to query. These are the fundamentals for getting value from big data. There are open standards for these capabilities, so we have moved away from proprietary solutions.
Ian: Let’s start with a story.
SM: One of our customers is in the financial sector. They are very interested in understanding how changes in one part of their portfolio affect the rest of their portfolio. For example, if a company goes bankrupt, what is the downstream effect on other assets in a mutual fund? What happens to the companies that depended on the now bankrupt company? How should the financial institution reorganize its portfolio based on these events?
SM: The first challenge the financial institution faces is data integration. They want to integrate public and private data, in large amounts. Because the market moves quickly and the financial stakes are enormous, they need fast integration. They cannot afford to wait months for the results of an analysis.
SM: The second challenge relates to query. They have multiple questions to ask over 50 billion triples, and they need to reduce the cost of running those queries. This particular company identified some “forbidden queries” that would lead to a huge performance hit on their servers.
Ian: How did Semantic Web technology help?
SM: It makes them less dependent on database optimization. RDF and SPARQL are schema-less, which makes it much faster and easier to ask ad-hoc questions without the performance hit. The flexibility to do ad-hoc queries efficiently has given this company a big competitive advantage.
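To illustrate the kind of ad-hoc question described in this story, here is a minimal sketch in plain Python. It is not YarcData's software and does not use SPARQL; the triples, the `holds` and `dependsOn` predicate names, and the helper functions are all hypothetical. It only shows the idea that, once data is a set of subject-predicate-object triples, a new question (which funds are exposed if one company fails?) can be asked without designing a schema or index for it first.

```python
from collections import deque

# Hypothetical toy data: (subject, predicate, object) triples.
TRIPLES = {
    ("fund:Growth", "holds", "co:Acme"),
    ("fund:Growth", "holds", "co:Beta"),
    ("fund:Income", "holds", "co:Delta"),
    ("co:Beta", "dependsOn", "co:Acme"),
    ("co:Gamma", "dependsOn", "co:Beta"),
    ("co:Delta", "dependsOn", "co:Omega"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def downstream(triples, company):
    """Companies that depend, directly or transitively, on `company`."""
    seen, queue = set(), deque([company])
    while queue:
        current = queue.popleft()
        for dependant, _, _ in match(triples, p="dependsOn", o=current):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return seen


def exposed_funds(triples, company):
    """Funds holding `company` or any company downstream of it."""
    affected = downstream(triples, company) | {company}
    return {s for s, _, o in match(triples, p="holds") if o in affected}


affected = downstream(TRIPLES, "co:Acme")
funds = exposed_funds(TRIPLES, "co:Acme")
```

In a real deployment the same question would more likely be written as a SPARQL query against a triple store; the linear scan in `match` is only for readability and would not scale to billions of triples.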
SM: The story doesn’t end there. Although their initial interest concerned portfolio optimization, the company found another use for the technology. There are legal penalties and public relations nightmares around insider trading. Insider trading can happen in many ways (such as someone providing a friend with insider information), which makes it challenging to detect. This financial institution realized they could use Semantic Web technology to detect insider trading effectively and improve compliance.
Ian: This is the second time in recent months that people have told me about Semantic Web technology and compliance; see my interview with Paul Groth and Luc Moreau.
SM: That example of serendipity is not unique. In 2012 we held a contest — the YarcData Graph Analytics Challenge — for people to solve some Big Data graph problems. The winners, from The Institute for Systems Biology (ISB), studied drug repurposing.
Ian: What is drug repurposing?
SM: I’ll explain with an example: Viagra. Viagra was originally developed for managing heart problems. The trials revealed an interesting side effect. And so the drug was repurposed.
Ian: Yay, science!
SM: Drug companies realize that for a number of “failed” projects, there are great opportunities to repurpose the drugs. Our contest winners studied some data sets and found that a particular HIV drug could be repurposed to treat breast cancer. By querying diverse data sets from research literature and clinical trials, they were able to find a common pathway. The whole project took about six weeks, which is astonishing compared to the usual time it takes to develop a drug. What’s more, FDA approval time for repurposed drugs is much shorter than for new drugs.
Ian: How have these technologies benefited YarcData?
SM: First, in cost savings. We can use software available in the ecosystem instead of writing proprietary utilities. For instance, we used to convert relational data to graphs with a proprietary tool; now we can use something like d2r. There are many such tools, related to inference and other capabilities.
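As a rough illustration of what converting relational data to a graph involves, the sketch below maps table rows to triples: one subject per row, one predicate per column. This is a simplified assumption about the general approach, not the API or output of d2r or of any YarcData tool; the `rows_to_triples` function, the `http://example.org/` base, and the sample table are hypothetical.

```python
def rows_to_triples(table, rows, key, base="http://example.org/"):
    """Map each row to triples: one subject per row, one predicate per column."""
    triples = []
    for row in rows:
        # The subject identifier is built from the table name and primary key.
        subject = f"{base}{table}/{row[key]}"
        triples.append((subject, "rdf:type", f"{base}{table}"))
        for column, value in row.items():
            if column == key or value is None:
                continue  # the key is already in the subject; NULLs yield no triple
            triples.append((subject, f"{base}{table}#{column}", value))
    return triples


# Hypothetical relational table, represented as a list of row dictionaries.
companies = [
    {"id": 1, "name": "Acme", "sector": "Manufacturing"},
    {"id": 2, "name": "Beta", "sector": None},
]

triples = rows_to_triples("company", companies, key="id")
```

A full mapping tool would also turn foreign keys into links between subjects, which is what makes the result a graph rather than a restatement of the table.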
SM: The second benefit is value. RDF is simpler than our former custom format, and this has made data integration both simpler and faster for us. One of our engineers ran a data integration project using other techniques; the integration required several months. With RDF it took a week. And, we can more easily reuse existing data sets.
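One reason integration can be simpler with triples is that two data sets sharing identifiers can be combined by set union, with no table schemas to reconcile. The sketch below shows that idea in plain Python; the data, the identifiers, and the `describe` helper are hypothetical, and real projects still need to agree on identifiers and vocabulary before the union is useful.

```python
# Hypothetical public data set about a company.
public = {
    ("co:Acme", "name", "Acme Corp"),
    ("co:Acme", "sector", "Manufacturing"),
}

# Hypothetical private data set that refers to the same company identifier.
private = {
    ("fund:Growth", "holds", "co:Acme"),
    ("co:Acme", "name", "Acme Corp"),  # duplicate fact, removed by the union
}

# Integration is a set union: shared identifiers line up automatically.
merged = public | private


def describe(triples, subject):
    """Return the sorted (predicate, object) pairs known about a subject."""
    return sorted((p, o) for s, p, o in triples if s == subject)


acme_facts = describe(merged, "co:Acme")
holders = {s for s, p, o in merged if p == "holds" and o == "co:Acme"}
```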
SM: I think many organizations face similar data integration challenges. In the enterprise, people use a bunch of heterogeneous systems: email, plain text, unstructured data, and structured data. Managing all of the data is a big challenge for any organization.
Ian: Do any of your customers choose to migrate to RDF after you’ve worked with them?
SM: Absolutely. People use our appliance on premises because much of their data is sensitive. We work with them to convert their data to RDF, which we feed to our appliance. They see firsthand our efficient conversion process and how fast we can do integration. One large clinic with a lot of unstructured data in data warehouses was so impressed they issued a directive that any new data they create will be available in RDF. Once people understand the simplicity, they say “Let’s make this the way we do things going forward.” Only a few of our customers are doing this now, but we see it increasing.
Ian: Which industries do you see adopting RDF?
SM: I think the farthest along is life sciences, then financial, and then US government.
Ian: Shoaib, thank you very much for sharing those stories!