Last week, I was at Provenance Week 2016. This event happens once every two years and brings together a wide range of researchers working on provenance. You can check out my trip report from the last Provenance Week in 2014. This year Provenance Week combined:
- the International Provenance and Annotation Workshop (IPAW 2016);
- the Theory and Practice of Provenance (TaPP 2016) workshop;
- PROV: 3 years latter; and
- Provenance-based Security and Transparent Computing.
For me, Provenance Week is like coming home, lots of old friends and a favorite subject of mine. It’s also a good event to attend because it crosses the subfields of computer science, everything from security in operating systems to scientific workflows on to database theory. In one day, I went from a discussion on the role of indirection in data citation to staring at the C code of a database. Marta, Boris and Sarah really put together a solid program. There were about 60 attendees across the four days:
So what was I doing there? Having served as co-chair of the W3C PROV working group, I thought it was important to be at the PROV: Three years later event where we reflected on the status of PROV, it’s uptake and usage. I presented some ongoing work on measuring the usage of provenance on the web of data. Additionally, I gave the presentation of joint work led by my student Manolis Stamatogiannakis and done in conjunction with Ashish Gehani‘s group at SRI. The work focused on using benchmarks to help inform decisions on what provenance capture system to use. Slides:
I’ll now walk through my 3 big take aways from the event.
Provenance to attack Advanced Persistent Threats
DARPA’s $60 million transparent computing explicitly calls out the use of provenance to address the problem of what’s called an Advanced Persistent Threat (APTs). APTs are attacks that are long terms, look like standard business processes, and involve the attacker knowing the system well. This has led to a number of groups exploring the use of system level provenance capture techniques (e.g. SPADE and OPUS) and then integrating that from multiple distributed sources using PROV inspired data models. This was well described by David Archer is his talk as assembling multiple causal graphs from event streams. James Cheney’s talk on provenance segmentation also addressed these issues well. This reminded me some what of the work on distributed provenance capture using structured logs that the Netlogger and Pegasus teams do, however, they leverage the structure of a workflow system to help with the assembly.
I particularly liked Yang Ji, Sangho Lee and Wenke Lee‘s work on using user level record and replay to track and replay provenance. This builds upon some of our work that used system level record replay as mechanism for separating provenance capture and instrumentation. But now in user space using the nifty rr tool from Mozilla. I think this thread of being able to apply provenance instrumentation after the fact on an execution trace holds a lot of promise.
Overall, it’s great to see this level of attention on the use of provenance for security and in more broadly of using long term records of provenance to do analysis.
PROV as the starting point
Given that this was the ten year anniversary of IPAW, it was appropriate that Luc Moreau gave one of the keynotes. As really one of the drivers of the community, Luc gave a review of the development of the community and its successes.One of those outcomes was the W3C PROV standards.
Overall, it was nice to see the variety of uses of PROV and the tools built around it. It’s really become the jumping off point for exploration. For example, Pete Edwards team combined PROV and a number of other ontologies including (P-Plan) to create a semantic representation of what’s going on within a professional kitchen in order to check food safety compliance.
Another example is the use of PROV as a jumping off point for the investigation into the provenance model of HL7 FHIR (a new standard for electronic healthcare records interchange).
As whole, I think the attendees felt that what was missing was an active central point to see what was going on with PROV and pointers to resources for implementation. The aim is to make sure that the W3c PROV wiki is up-to-date and is a better resource overall.
Provenance as lens: Data Citation, Documents & Versioning
An interesting theme was the use of provenance concepts to give a frame for other practices. For example, Susan Davidson gave a great keynote on data citation and how using a variant of provenance polynomials can help us understand how to automatically build citations for various parts of curated databases. The keynote was based off her work with James Frew and Peter Buneman that will appear in CACM (preprint). Another good example of provenance to support data citation was Nick Car’s work for Geoscience Australia.
Furthermore, the notion of provenance as the substructure for complex documents appeared several times. For example, the Impacts on Human Health of Global Climate Change report from globalchange.gov uses provenance as a backbone. Both the OPUS and PoeM systems are exploring using provenance to generate high-level experiment reports.
Finally, I thought David Koop‘s versioning of version trees showed how using provenance as lens can help better understand versioning of version trees themselves. (I have to give David credit for presenting a super recursive concept so well).
Overall, another great event and I hope we can continue to attract new CS researchers focusing on provenance.
- PROV in JSON-LD – good for streaming
- Theoretical provenance paper recipe = extend provenance polynomials to deal with new operators. Prove nice result. e.g. now for Linear Algebra.
- Prefixes! R-PROV, P-PROV, D-PROV, FS-PROV, SC-PROV, — let me know if I missed any..
- Intel Secure Guard Extensions (SGX) – interesting
- Surprised how dependent I’ve become on taking pictures in conferences for note taking. Not being able to really impacted my flow. Plus, there are less pictures for this
- Thanks to Adriane for hosting!
- A provenance based data science environment
- 👍Learning Health Systems – from Vasa Curcin