Last week, I was at Theory and Practice of Provenance 2015 (TaPP’15), held in Edinburgh. This was the seventh edition of the workshop. You can check out my trip report from last year’s event, which was held during Provenance Week, here. TaPP aims to be a venue where people can present their early and innovative research ideas.
The event is useful because it brings together a cross section of researchers from different CS communities, ranging from databases, programming language theory, and distributed systems to e-science and the semantic web. While it’s nice to see old friends at this event, one discussion we had during the two days was how we can connect back more strongly to these larger communities, especially as interest in provenance increases within them.
Below I discuss the three themes I pulled from the event, but you can take a look at all of the papers online at the event’s site and see what you think.
1. Execution traces as a core primitive
I was happy to be presenting on behalf of one of my students, Manolis Stamatogiannakis, who’s been studying how to capture provenance of desktop systems using virtual machines and other technologies from the systems community. (He’s hanging out at SRI with Ashish Gehani for the summer so couldn’t make it.) A key idea in the paper we presented was to separate the capture of an execution trace from the instrumentation needed to analyze provenance (paper). The slides for the talk are embedded below:
The mechanism used to do this is a technology called record & replay (we use PANDA), but this notion of capturing a lightweight execution trace and then replaying it deterministically is also popping up in other communities. For example, Boris Glavic has been using it successfully for database provenance in his work on GProM and reenactment queries. There he uses the audit logging and time travel features of modern databases (i.e. an execution trace) to support rich provenance queries.
This need to separate capture from queries was emphasized by David Gammack and Adriane Chapman’s work on developing agent-based models to figure out what instrumentation needs to be applied in order to capture provenance. Until we can efficiently capture everything, this is still going to be a stumbling block for completely provenance-aware systems. I think that treating execution traces as a core primitive for provenance systems may be a way forward.
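To make the separation of capture and analysis concrete, here is a toy sketch in Python. It is not how PANDA or GProM work internally; the `record`, `replay`, and `program` functions are hypothetical names made up for illustration. The recording run logs only the nondeterministic inputs, and the provenance instrumentation is switched on later, during a deterministic replay.

```python
import random


def program(read_input, emit):
    """A stand-in for the application being traced (hypothetical)."""
    a = read_input()
    emit(("read", "a", a))
    b = read_input()
    emit(("read", "b", b))
    total = a + b
    emit(("derive", "total", ("a", "b")))
    return total


def record(prog, seed):
    """Run once, cheaply: log only the nondeterministic inputs."""
    rng = random.Random(seed)
    trace = []

    def read_input():
        value = rng.randint(0, 9)
        trace.append(value)
        return value

    # Instrumentation is a no-op during capture.
    result = prog(read_input, lambda event: None)
    return trace, result


def replay(prog, trace, observer):
    """Re-run deterministically from the trace with instrumentation on."""
    inputs = iter(trace)
    return prog(lambda: next(inputs), observer)


trace, recorded_result = record(program, seed=42)

# Later, possibly on another machine: decide what provenance to collect.
events = []
replayed_result = replay(program, trace, events.append)
```

The point of the design is that the trace stays small (two integers here), while the analysis applied during replay can be as heavy as needed and can change after the fact.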
2. Workflow lessons in non-workflow environments
There are numerous benefits to using (scientific) workflow systems for computational experiments, one of which is that they provide a good mechanism for capturing provenance in a declarative form. However, not all users can or will adopt workflow environments. Many use computational notebooks (e.g. Jupyter) or just shell scripts. The YesWorkflow system (very inside community joke here) uses convention and a series of comments to help users produce a workflow and provenance structure from their scripts and file system (paper). Likewise, work on combining noWorkflow, a provenance tracking system for Python, and IPython notebooks shows real promise (paper). This reminded me of the PROV-O-Matic work by Rinke Hoekstra.
Of course, you can combine YesWorkflow and noWorkflow together into one big system.
Overall, I like the trend towards applying workflow concepts in situ. It got me thinking about applying the scientific workflow results to the abstractions provided by Apache Spark. Just a hunch that this might be an interesting direction.
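As a rough illustration of the comment-based approach, the sketch below pulls a workflow structure out of a script that is marked up with `@begin`, `@end`, `@in`, and `@out` comments in the style of YesWorkflow. The parser is my own simplified assumption for this post, not the actual YesWorkflow tool, and the script content and helper names (`extract_workflow`, `dataflow_edges`) are hypothetical.

```python
import re

# A script annotated with YesWorkflow-style comments (hypothetical example).
SCRIPT = '''\
# @begin clean_data
# @in raw_file
# @out clean_file
def clean_data(): ...
# @end clean_data

# @begin plot_data
# @in clean_file
# @out figure
def plot_data(): ...
# @end plot_data
'''

ANNOTATION = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")


def extract_workflow(source):
    """Collect the declared inputs and outputs of each annotated step."""
    steps = {}
    current = None
    for line in source.splitlines():
        match = ANNOTATION.match(line.strip())
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            current = name
            steps[current] = {"in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            steps[current][tag].append(name)
    return steps


def dataflow_edges(steps):
    """Link steps wherever one step's output is another step's input."""
    edges = []
    for producer, produced in steps.items():
        for consumer, consumed in steps.items():
            for data in produced["out"]:
                if data in consumed["in"]:
                    edges.append((producer, data, consumer))
    return edges


workflow = extract_workflow(SCRIPT)
edges = dataflow_edges(workflow)
```

Nothing in the script is executed here: the workflow view comes entirely from the comments, which is why the approach carries over to any scripting language that has comments.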
3. Completing the pipeline
The last theme I wanted to pull out is that I think we are inching towards being able to truly connect provenance generated by applications. My first example is the work by Glavic and his students on importing and ingesting PROV-JSON into a database. This lets you query the provenance of query results while also including information on the pipelines that got the data there.
This is something I’ve wanted to do for ages with Marcin Wylot’s work on TripleProv. I was a bit bummed that Boris got there first, but I’m glad somebody did it 🙂
The second example was the continued push forward for provenance in the systems community. In particular, there are the OPUS and SPADE systems, which I was aware of, and now also the work on Linux Provenance Modules by Adam Bates, which was introduced to me at TaPP. These all point to the ability to leverage key operating system constructs to capture and manage provenance. For example, Adam showed how to make use of mandatory access control policies to provide focused and complete capture of provenance for particular applications.
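To give a flavour of what ingesting PROV-JSON into a database might look like, here is a minimal sketch using Python’s built-in `sqlite3`. This is not the approach from Glavic’s paper; the table layout, the `load_prov` helper, and the `ex:` identifiers are assumptions invented for illustration. Only the top-level `used` and `wasGeneratedBy` sections of the PROV-JSON document are loaded.

```python
import json
import sqlite3

# A tiny PROV-JSON document (made-up example data).
PROV_JSON = """
{
  "prefix": {"ex": "http://example.org/"},
  "entity": {"ex:raw": {}, "ex:clean": {}},
  "activity": {"ex:cleaning": {}},
  "used": {
    "_:u1": {"prov:activity": "ex:cleaning", "prov:entity": "ex:raw"}
  },
  "wasGeneratedBy": {
    "_:g1": {"prov:entity": "ex:clean", "prov:activity": "ex:cleaning"}
  }
}
"""


def load_prov(conn, document):
    """Store the used / wasGeneratedBy relations in two plain tables."""
    conn.execute("CREATE TABLE used (activity TEXT, entity TEXT)")
    conn.execute("CREATE TABLE generated (entity TEXT, activity TEXT)")
    for rel in document.get("used", {}).values():
        conn.execute(
            "INSERT INTO used VALUES (?, ?)",
            (rel["prov:activity"], rel["prov:entity"]),
        )
    for rel in document.get("wasGeneratedBy", {}).values():
        conn.execute(
            "INSERT INTO generated VALUES (?, ?)",
            (rel["prov:entity"], rel["prov:activity"]),
        )


conn = sqlite3.connect(":memory:")
load_prov(conn, json.loads(PROV_JSON))

# Which entities did the activity that generated ex:clean read?
sources = conn.execute(
    "SELECT u.entity FROM generated g "
    "JOIN used u ON g.activity = u.activity "
    "WHERE g.entity = ?",
    ("ex:clean",),
).fetchall()
```

Once the external provenance sits in ordinary tables like this, it can be joined with whatever provenance the database itself tracks for query results.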
I have high hopes here.
To conclude, here are some thoughts from the notebook.
- I really enjoyed Renée Miller’s talk on Big Data curation. Most of data science is data munging!
- Also, re Miller’s talk, it’s great to see how Linked Data is now making database people think. Her group did LinkedCT but is firmly in the database community.
- I had the pleasure of doing a cameo appearance in Luc Moreau’s talk on integrating provenance in publications using LaTeX styles. I talked about the issues of getting this metadata through the publishing pipeline at Elsevier. Note we actually did it. More on this soon.
- Check prov-sty out.
- Lots of discussion of new RCUK research data requirements, which require certain provenance metadata. Maybe a boon to the use of provenance systems.
I hope to see many familiar and new faces at next year’s Provenance Week (which combines TaPP and IPAW).