A couple of weeks ago I was at Provenance Week 2018 – a biennial conference that brings together various communities working on data provenance. Personally, it’s a fantastic event as it’s an opportunity to see the range of work going on from provenance in astronomy data to the newest work on database theory for provenance. Bringing together these various strands is important as there is work from across computer science that touches on data provenance.
The week is anchored by the International Provenance and Annotation Workshop (IPAW) and the workshop on Theory and Practice of Provenance (TaPP), and includes events focused on emerging areas of interest, including incremental re-computation, provenance-based security, and algorithmic accountability. There were 90 attendees, up from ~60 at the prior events, and here they are:
The folks at King's College London, led by Vasa Curcin, did a fantastic job of organizing the event, including great social outings on top of their department building and a boat ride along the Thames. They also catered to the World Cup fans. Thanks Vasa!
I had the following major takeaways from the conference:
Improved Capture Systems
The two years since the last Provenance Week have seen a number of improved systems for capturing provenance. In the systems setting, DARPA's Transparent Computing program has given a boost to scaling out provenance capture systems. These systems use deep operating system instrumentation to capture logs; over the past several years they have become more efficient and scalable (e.g. CamFlow, SPADE). This connects with the work we've been doing on improving capture using whole-system record-and-replay. You can now run these systems almost full-time, although they capture significant amounts of data (3 days = ~110 GB). Indeed, the folks at Galois presented an impressive-looking graph database specifically focused on working with provenance and time series data streaming from these systems.
Beyond the security use case, sciunit.run was a neat tool that uses execution traces to produce reproducible computational experiments.
There were also a number of systems for improving the generation of instrumentation to capture provenance. UML2PROV automatically generates provenance instrumentation from UML diagrams and source code using the provenance templates approach. (Also used to capture provenance in an IoT setting.) Curator implements provenance capture for micro-services using existing logging libraries. Similarly, UNICORE now implements provenance for its HPC environment. I still believe structured logging is one of the underrated ways of integrating provenance capture into systems.
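To sketch what I mean by structured logging for provenance: each log record can carry a machine-readable description of an activity, its inputs, and its outputs, which later tooling can stitch into a provenance graph. The helper and field names below are my own invention, loosely modeled on W3C PROV terms, not any particular system's API:

```python
import json
import logging
import sys
import uuid
from datetime import datetime, timezone

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("prov")

def log_activity(activity, used, generated):
    """Emit one structured (JSON) log record describing a PROV-style
    activity: which data entities it used and which it generated."""
    record = {
        "id": str(uuid.uuid4()),
        "type": "prov:Activity",
        "activity": activity,
        "used": used,            # input entities
        "generated": generated,  # output entities
        "endedAtTime": datetime.now(timezone.utc).isoformat(),
    }
    log.info(json.dumps(record))
    return record

# A hypothetical pipeline step: cleaning a raw file into a derived one.
rec = log_activity("clean-data", used=["raw/data.csv"], generated=["clean/data.csv"])
```

Because the records are plain JSON lines, an existing log aggregator can collect them unchanged, and a downstream job can reconstruct used/generated edges between entities without any deeper instrumentation.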
Finally, there was some interesting work on reconstructing provenance. In particular, I liked Alexander Rasin's work on reconstructing the contents of a database from its environment to answer provenance queries:
Also, the IPAW best paper looked at using annotations in a workflow to infer dependency relations:
Lastly, there was some initial work on extracting provenance of health studies directly from published literature, which I thought was an interesting way of recovering provenance.
Provenance for Accountability
Another theme (mirrored by the event noted above) was the use of provenance for accountability. This has always been a major use for provenance as pointed out by Bertram Ludäscher in his keynote:
However, I think that, due to increasing awareness around personal data usage and privacy, the need for provenance is being recognized. See, for example, the Royal Society's report on Data management and use: Governance in the 21st century. At Provenance Week, there were several papers addressing provenance for GDPR, see:
Also, I was impressed with the demo from Imosphere, which uses provenance for accountability and trust in health data:
Re-computation & Its Applications
Using provenance to determine what to recompute seems to have a number of interesting applications in different domains. Paolo Missier showed, for example, how it can be used to determine when to recompute in next-generation sequencing pipelines.
I particularly liked their notion of a re-computation front – the set of past executions you need to re-execute in order to address a change in data.
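A rough illustration of the idea (this is my own toy sketch, not Missier's actual algorithm, and the execution records are invented): given past executions recorded with the data each one read and wrote, the front is the set of executions transitively invalidated by a change, computed to a fixed point.

```python
# Hypothetical past executions; real systems would recover these
# records from provenance logs.
executions = {
    "align":    {"reads": {"sample.fastq", "ref.fa"}, "writes": {"aligned.bam"}},
    "call":     {"reads": {"aligned.bam"},            "writes": {"variants.vcf"}},
    "annotate": {"reads": {"variants.vcf", "db.tsv"}, "writes": {"report.txt"}},
}

def recomputation_front(changed, executions):
    """Return the set of executions that must re-run because an input
    changed, propagating through the data each re-run would rewrite."""
    dirty = set(changed)  # data items invalidated so far
    front = set()
    progress = True
    while progress:  # iterate to a fixed point
        progress = False
        for name, ex in executions.items():
            if name not in front and ex["reads"] & dirty:
                front.add(name)          # this execution must re-run...
                dirty |= ex["writes"]    # ...which invalidates its outputs
                progress = True
    return front

front = recomputation_front({"ref.fa"}, executions)
```

Changing `ref.fa` pulls in `align`, whose rewritten output pulls in `call`, and so on down the pipeline, while a change to `db.tsv` alone would only require re-running `annotate`.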
Wrattler was a neat extension of the computational notebook idea that showed how provenance can be used to automatically propagate changes through notebook executions and support suggestions.
Marta Mattoso's team discussed applying provenance to track the adjustments made when steering executions of complex HPC applications.
The work of Melanie Herschel's team on provenance for data integration points to the benefits of applying provenance-based re-computation to speed up the iterative nature of data integration, as she enumerated in her presentation at the re-computation workshop.
You can see all the abstracts from the workshop here. I understand from Paolo that they will produce a report from the discussions there.
Overall, I left provenance week encouraged by the state of the community, the number of interesting application areas, and the plethora of research questions to work on.
- Very nice introduction to Provenance in Databases + Semirings from Pierre Senellart.
- ProvSQL – database provenance implemented in/over postgres
- Answer Set Programming implementation
- RDA provenance patterns working group
- W3C PROV popped up in a ton of talks; it clearly serves as an excellent reference point in the community and even enables some interoperability.
- A 2017 provenance survey.
- Good to see the relaunch of openprovenance.org – lots of good tools for working with W3C PROV.
- Principles of Provenance and Galois connections
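To make the semiring idea from Pierre Senellart's introduction concrete: in the how-provenance semiring, each base tuple carries an abstract annotation, joins multiply annotations, and alternative derivations add them. The toy representation below (polynomials as sorted tuples of monomials) is my own sketch, not ProvSQL's implementation:

```python
# A provenance polynomial is a sorted tuple of monomials; a monomial is a
# sorted tuple of base-tuple ids. Sorting gives a canonical form.

def times(p, q):
    """Semiring product: used when tuples are joined."""
    return tuple(sorted(tuple(sorted(m + n)) for m in p for n in q))

def plus(p, q):
    """Semiring sum: used when an output tuple has several derivations."""
    return tuple(sorted(p + q))

# Relations R(a, b) and S(b, c); each row pairs a tuple with its annotation.
R = [(("a1", "b1"), (("r1",),)), (("a2", "b1"), (("r2",),))]
S = [(("b1", "c1"), (("s1",),))]

# Project the join R ⋈ S onto c, accumulating how-provenance per output value.
out = {}
for (ra, rb), rp in R:
    for (sb, sc), sp in S:
        if rb == sb:
            out[sc] = plus(out.get(sc, ()), times(rp, sp))
```

Here `out["c1"]` comes out as the polynomial r1·s1 + r2·s1, recording that the output can be derived from either R row combined with the single S row; specializing the semiring (e.g. to counting or boolean trust values) then answers different provenance questions from the same annotations.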