Thoughts from the Dagstuhl Principles of Provenance Workshop
Last week, I attended a workshop at Dagstuhl on the Principles of Provenance. Before talking about the content of the workshop itself, it’s worth describing the experience of Dagstuhl. The venue itself is a manor house located pretty much in the middle of nowhere in southwest Germany. The nature around the venue is nice and it really is away from it all. All the attendees stay in the manor house so you spend not only the scheduled workshop times with your colleagues but also breakfast, lunch, dinner and evenings. They also have small tricks to ensure that everyone mingles, for example, by pseudo-randomly seating people at different tables for meals. Additionally, Dagstuhl is specifically for computer science – they have a good internet connection and one of the best computer science libraries I’ve seen. All these things together make Dagstuhl a unique intellectually intense environment. It’s one of the nicest traditions in computer science.
With that context in mind, the organizers of the Principles of Provenance workshop (James Cheney, Wang-Chiew Tan, Bertram Ludaescher, Stijn Vansummeren) brought together computer scientists studying provenance from the perspective of databases, the semantic web, scientific workflow, programming languages and software engineering. While I knew most of the people in this broad community (at least from paper titles), I met some new people and got to know people better. The organizers started the workshop with overviews of provenance work from 4 areas:
- Provenance in Database Systems
- Provenance in Workflows and Scientific Computation
- Provenance in Software Engineering, programming languages and security
- Provenance interchange on the web (i.e. the W3C standardization effort)
These tutorials were a great idea because they provided a common basis for communication throughout the week. The rest of the week combined quite a few talks and plenty of discussion. The organizers are putting together a report right now containing abstracts and presentations, so I won’t go into that more here. What I do want to do is pull out 3 take-aways that I had from the week.
1) Connecting consensus models to formal foundations
Because provenance often spans multiple systems (my data is often sourced from somewhere else), there is a need for provenance systems to interoperate. There have been a number of efforts to enable this interoperability, including the creation of the Open Provenance Model as well as the current standardization effort at the W3C. Because these efforts are trying to bridge across multiple implementations, they are driven by community consensus: what models can we agree upon, what is minimally necessary for interchange, what is easy to understand and implement.
Separately, there is quite a lot of work on formal foundations of provenance, especially within the database community. This work is grounded in applications but also in formal theory that ensures that provenance information has nice properties. Concretely, one can show that certain types of provenance within a database context can be expressed as polynomials, manipulated algebraically, and related to one another (semirings!). Plus, provenance polynomials sounds nice. Check out T.J. Green’s thesis for starters:
Todd J. Green. Collaborative Data Sharing with Mappings and Provenance. PhD thesis, University of Pennsylvania, 2009
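To make the polynomial idea a bit more concrete, here is a minimal sketch in Python. It is my own illustration and not code from the thesis: the tuple identifiers (r1, r2, s1, s2) and the helper names (var, plus, times, evaluate) are hypothetical. The underlying idea is that alternative derivations of an output tuple are added, joint uses of input tuples are multiplied, and the resulting polynomial can then be evaluated in different semirings to answer different questions.

```python
from collections import Counter
from itertools import product
import operator

# A polynomial is a Counter mapping a monomial (a sorted tuple of
# input-tuple identifiers) to its natural-number coefficient.

def var(name):
    """Polynomial consisting of a single input-tuple identifier."""
    return Counter({(name,): 1})

def plus(p, q):
    """Alternative derivations (union, projection): add polynomials."""
    return Counter(p) + Counter(q)

def times(p, q):
    """Joint use of inputs (join): multiply polynomials."""
    out = Counter()
    for (m1, c1), (m2, c2) in product(p.items(), q.items()):
        out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

def evaluate(poly, assignment, add, mul, zero, one):
    """Evaluate a polynomial in another semiring (add, mul, zero, one)."""
    total = zero
    for monomial, coeff in poly.items():
        term = one
        for v in monomial:
            term = mul(term, assignment[v])
        for _ in range(coeff):  # a coefficient n means n copies of the term
            total = add(total, term)
    return total

# Hypothetical example: an output tuple that can be derived either by
# joining r1 with s1 or by joining r2 with s2.
poly = plus(times(var("r1"), var("s1")), times(var("r2"), var("s2")))

# Counting semiring: how many copies of the output if r1 occurs twice?
copies = evaluate(poly, {"r1": 2, "s1": 1, "r2": 1, "s2": 1},
                  operator.add, operator.mul, 0, 1)

# Boolean semiring: does the output survive if r1 is deleted?
survives = evaluate(poly, {"r1": False, "s1": True, "r2": True, "s2": True},
                    operator.or_, operator.and_, False, True)

print(poly, copies, survives)
```

The same polynomial answers both questions (3 copies in the counting case, and the tuple survives the deletion in the Boolean case), which is the appeal of the algebraic view: the provenance is computed once and interpreted many ways.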
During the workshop, it became clear to me that the consensus-based models (which are often graphical in nature) can not only be formalized but also be directly connected to these database-focused formalizations. I just needed to get over the differences in syntax. This could imply that we could have a nice way to trace provenance across systems and through databases and be able to understand the mathematical properties of this interconnection.
2) Social implications of producing provenance
For a couple of years now, I’ve been asked by people and have asked myself, so what do you do with provenance? I think there are a lot of good answers for that (e.g. requirements for provenance in e-science). However, the community has spent a lot of time thinking about how to capture provenance from a technical point of view asking questions like: how do we instrument systems? how do we store provenance efficiently? can we leverage execution environments for tracing?
At Dagstuhl, Carole Goble asked another question: why would people record and share provenance in the first place? There are big social implications that we need to grapple with: producing provenance may expose information that we are not ready to share, it may require us to change work practice, leading to effort that we may not want to give, or it may be in a form that is too raw to be useful. Developing techniques to address these issues is, from my point of view, a new and important area of work.
From my perspective, we are starting to work on the ideas of how to reconstruct provenance from data that will hopefully reduce the effort for producers of provenance.
3) Provenance is important for messy data integration
A key use case for provenance is tracking back to original data sources after data has been integrated. This is particularly important when the data integration requires complex processing (e.g. natural language processing). Christopher Ré gave a fantastic example of this with a demonstration of the WiscI system, part of the Hazy project. This application enriches Wikipedia pages with facts collected from a (~40 TB) web crawl and provides links back to a supporting source for those facts. It was a great example of how provenance is really foundational to providing confidence in these systems.
Beyond these points, there was a lot more discussed, which will be summarized in the forthcoming report. This was a great workshop for me. From my point of view, I wanted to thank the organizers for putting it together. It’s a lot of effort. Additionally, thanks to all of the participants for really great conversations.
Using Provenance in the Semantic Web – JWS Special Issue
The Journal of Web Semantics recently published a special issue on Using Provenance in the Semantic Web edited by myself and Yolanda Gil. (Vol 9, No 2 (2011)). All articles are available on the journal’s preprint server.
The issue highlights top research at the intersection of provenance and the Semantic Web. The papers addressed a range of topics including:
- tracking provenance of DBpedia back to the underlying Wikipedia edits [Orlandi & Passant];
- how to enable reproducibility using Semantic techniques [Moreau];
- how to use provenance to effectively reason over large amounts (1 billion triples) of messy data [Bonatti et al.]; and
- how to begin to capture semantically the intent of scientists [Pignotti et al.].
A common thread through these papers is the use of already existing provenance ontologies. As the community comes to an increasing agreement on the commonalities of provenance representations through efforts such as the W3C Provenance Working Group, this will further enable new research on the use of provenance. This continues the fruitful interaction between standardization and research that is one of the hallmarks of the Semantic Web.
Overall, this set of papers demonstrates the latest approaches to enabling a Web that provides rich descriptions of how, when, where and why Web resources are produced, and shows the sorts of reasoning and applications that these provenance descriptions make possible.
Finally, it’s important to note that this issue wouldn’t have been possible without the quick and competent reviews done by the anonymous reviewers. This is my public thank you to them.
I hope you take a chance to take a look at this interesting work.
Why provenance is fundamental for people
Here’s an interesting TED talk by cognitive psychologist Paul Bloom about the origins of pleasure. What’s cool to me is he uses the same anecdotes (Han van Meegeren, Joshua Bell) that I’ve used previously to illustrate the need for provenance. I often make a technical case for provenance for automated systems. He makes a compelling case that provenance is fundamental for people. Check out the video below… and let me know what you think.
Thanks to Shiyong Lu for the pointer.
CFP: Using Provenance on the Semantic Web
I’m editing along with Yolanda Gil a special issue of the Journal of Web Semantics on using provenance in the semantic web. You can check out the complete call at the JWS blog. Here’s the first paragraph of the call to get you excited about submitting something (or reading the resulting issue).
The Web is a decentralized system full of information provided by diverse open sources of varying quality. For any given question there will be a multitude of answers offered, raising the need for assessing their relative value and for making decisions about what sources to trust. In order to make effective use of the Web, we routinely evaluate the information we get, the sources that provided it, and the processes that produced it. A trust layer was always present in the Web architecture, and Berners-Lee envisioned an “oh-yeah?” button in the browser to check the sources of an assertion. The Semantic Web raises these questions in the context of automated applications (e.g. reasoners, aggregators, or agents), whether trying to answer questions using the Linked Data cloud, use a mashup appropriately or determine trust on a social network. Therefore, provenance is an important aspect of the Web that becomes crucial in Semantic Web research.
Where did that tweet come from?
Check out Dan Brickley’s post on the chaos around tweets about Iran. The key quote from the post in my opinion is:
Without tools to trace reports to their source, to claims about their source from credible intermediaries, or evidence, this isn’t directly useful. Even grassroots journalists needs evidence.
Even with retweets, it’s hard to figure out where information is coming from and from whom, especially as it flows in real time.
Paul on Provenance
I’ve been on vacation during the week of Thanksgiving, but before I took off to Amsterdam I gave a talk on provenance for multi-institutional applications for the ISI Intelligent Systems Division AI Seminar series. It’s about an hour long with questions and answers. If you have time, let me know what you think. The slides are on the talk’s page as well, so you can zip through those if you don’t have time to listen to the whole thing.
I recommend checking out the AI Seminar page. We have some really great speakers come and talk to us at ISD and most of the talks are streamed and archived.
a really good source
The New York Times has a fun article up about two filmmakers who developed a fake public policy expert and advisor to the McCain campaign, Martin Eisenstadt, who’s been sourced by several news organizations (MSNBC, Los Angeles Times). Here’s a clip of Martin Eisenstadt responding to a non-existent BBC news documentary featuring him:
Here’s a clip of the same fake advisor being used as a source on MSNBC:
Besides being funny, I think the chief point is that in today’s always-on media world, where there’s a constant scramble for the next story, publishers (whether news organizations or bloggers) fail to verify the provenance of a story. Likewise, it’s difficult for consumers of the story to easily check its veracity.
Others have suggested that news organizations put up all their source material. The New York Times has begun to do this in some cases. Unfortunately, most news consumers don’t have the time to actually look at the sources themselves. Maybe the solution is an automated or algorithmic mechanism for verifying the provenance of a story. hmm….
Yet more food provenance…
The Globe and Mail has an article up about Dr. Nick Low at the University of Saskatchewan talking about his chromatography technique. I never knew about this technique for tracing the origins of the contents of food. The two key quotes from the article are:
He can trace the botanical origins of a food product, such as wine or tequila or apple juice, and determine whether the manufacturer’s list of ingredients on the box or bottle is true.
…10 per cent of all food is not what it claims to be.
The article doesn’t really get into the details of the research, spending more time talking about the department’s grants and possibilities of commercialization. But I thought I’d put the link up as a good reminder of what food scientists are doing with respect to provenance.

