communicating provenance

Yesterday, Luc (my coauthor) and I received our physical copies of Provenance: An Introduction to PROV in the mail. Even though the book is primarily designed to be distributed digitally, it’s always great to actually hold a copy in your hands. You can now order your own physical copy on Amazon. The Amazon page for the book also lets you look inside it.


Cross-posted from

Here’s an interesting TED talk by cognitive psychologist Paul Bloom about the origins of pleasure. What’s cool to me is that he uses the same anecdotes (Han van Meegeren, Joshua Bell) that I’ve used previously to illustrate the need for provenance. I often make a technical case for provenance in automated systems. He makes a compelling case that provenance is fundamental for people. Check out the video below… and let me know what you think.

Thanks to Shiyong Lu for the pointer.

I’m in London for a number of meetings. Last week I had a great time talking with chemists and IT people about how to deal with chemistry data in the new project I’m working on, OpenPhacts. You’ll probably hear more about this from me as the project gets up and running. This week I’m at a workshop discussing and hacking on some next-generation ways of measuring impact in science.

Anyway, over the weekend I got to visit some of London’s fantastic museums. I spend a lot of my time thinking about ways of describing the provenance of things, particularly data. This tends to get rather complicated… But visiting these museums, you see how some very simple provenance can add a lot to understanding something. Here are some examples:

A very cool-looking map of Britain from the Natural History Museum:

Checking out the bit of text that goes with it:

We now know that it was produced by William Smith, working alone, in 1815 and that this version is a facsimile. Furthermore, we find out that it was the first geological map of Britain. That little bit of information about the map’s origins makes it even cooler to look at.

Another example, this time from the Victoria and Albert Museum. An action-packed sculpture:

And we look at the text associated with it:

and find some interesting provenance information. We have a rough idea of when it was produced (1622–23) and who made it (Bernini). Interestingly, we also find out how it passed through a series of owners: from Cardinal Montalto to Joshua Reynolds, then into the Yarborough Collection, and finally purchased by the museum. This chain of ownership is classic provenance. In fact, Wikipedia has even more complete provenance for the sculpture.
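Structurally, the chain of ownership on the label is just an ordered list of custody records. Here’s a minimal Python sketch of that idea (the data structure and role labels are my own illustration; the names and date come from the museum label above):

```python
# Toy model: an object's provenance as an ordered chain of custody.
# Only the first entry has a known date in this example.
ownership_chain = [
    ("Gian Lorenzo Bernini", "creator", "1622-23"),
    ("Cardinal Montalto", "owner", None),
    ("Joshua Reynolds", "owner", None),
    ("Yarborough Collection", "owner", None),
    ("Victoria and Albert Museum", "owner", None),
]

# Walking the chain in order reproduces the classic provenance trail.
for holder, role, when in ownership_chain:
    suffix = f", {when}" if when else ""
    print(f"{holder} ({role}{suffix})")
```

Even this flat form supports the basic provenance question: who had the object, in what capacity, and in what order.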

These examples illustrate how a bit of provenance can add so much more richness and meaning to objects. I’m going to be on the lookout for provenance in the wild.

If you spot some cool examples of provenance, let me know.

One of the things I’ve been wondering about for a while now is how easy it is to develop end-user applications that take advantage of provenance. Is the software infrastructure there? Do we have appropriate interface components? Are things fast enough? To test this out, we held a hackathon at the International Provenance and Annotation Workshop (IPAW 2010).

The hackathon had three core objectives:

  1. Come up with a series of end user application ideas
  2. Develop cool apps
  3. Understand where we are in terms of enabling app development

Another thing I was hoping to do was get people from different groups to collaborate. So how did it turn out?

We had 18 participants who divided up into the following teams:

  • Team Electric Bill
    • Paulo Pinheiro da Silva (UTEP)
    • Timothy Lebo (RPI)
    • Eric Stephan (UTEP)
    • Leonard Salayandia (RPI)
  • Team GEXP
    • Vitor Silva (Universidade Federal do Rio de Janeiro)
    • Eduardo Ogasawara (Universidade Federal do Rio de Janeiro)
  • Team Social Provenance
    • Aida Gandara (UTEP)
    • Alvaro Graves (RPI)
    • Evan Patton (UTEP)
  • Team MID
    • Iman Naja (University of Southampton)
    • Markus Kunde (DLR)
    • David Koop (University of Utah)
  • Team TheCollaborators
    • Jun Zhao (Oxford)
    • Alek Slominski (Indiana University)
    • Paolo Missier (University of Manchester)
  • Team Crowd Wisdom
    • James Michaelis (RPI)
    • Lynda Niemeyer (AMSEC, LLC)
  • Team Science
    • Elaine Angelino (Harvard)

From these teams, we had a variety of great ideas:

  • Team Electric Bill – Understand energy consumption in the home
  • Team GEXP – Create associations between abstract experiments and workflow trials
  • Team Social Provenance – Track the provenance of tweets on Twitter
  • Team MID – Add geographic details to provenance
  • Team TheCollaborators – A research paper that embeds the provenance of an artifact
  • Team Crowd Wisdom – Use provenance to filter the information from crowdsourced websites
  • Team Science – Find the impact of a change in a script on the other parts of the script

Obviously, implementing these ideas completely would take quite a while, but amazingly these teams got quite far. For example, Team Social Provenance was able to recreate Twitter conversations for a number of hashtag topics, including the World Cup and IPAW, in Proof Markup Language. Here’s a screenshot:

Here’s another screenshot from Team MID, showing how you can navigate through an Open Provenance Model graph with geo annotations:

Geo Provenance Mashup from Team MID

Those are just two examples; the other teams got quite far as well, given that we ended at 4pm.

So where are we? We had a brief conversation at the end of the hackathon (and I also received a number of emails) about whether we were at a place where we could hack provenance end-user apps. The broad conclusions were as follows:

  • The maturity of tools is not there yet, especially for semantic web apps. The libraries aren’t reliable and lack documentation.
  • Time was spent generating provenance, not necessarily using it.
  • It would be good to have guidelines for how to enhance applications with provenance. What’s the boundary between provenance and application data?
  • It would be nice to have a common representation of provenance to work on. (Go W3C incubator!)
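To make the “common representation” point concrete, here’s a minimal sketch of the kind of triple-style provenance record the teams were trading. The relation names echo the Open Provenance Model’s wasGeneratedBy / wasControlledBy edges, but the identifiers and helper functions here are purely illustrative, not any team’s actual code:

```python
# OPM-style relation names; the identifier scheme below is made up.
WAS_GENERATED_BY = "wasGeneratedBy"
WAS_CONTROLLED_BY = "wasControlledBy"

triples = set()

def record(subject, predicate, obj):
    """Assert one provenance statement as a (subject, predicate, object) triple."""
    triples.add((subject, predicate, obj))

# A tweet was generated by a process, which was controlled by a user.
record("tweet:42", WAS_GENERATED_BY, "process:tweeting")
record("process:tweeting", WAS_CONTROLLED_BY, "user:alice")

def sources_of(artifact):
    """Follow generation edges back to the agents controlling those processes."""
    agents = set()
    for s, p, o in triples:
        if s == artifact and p == WAS_GENERATED_BY:
            for s2, p2, o2 in triples:
                if s2 == o and p2 == WAS_CONTROLLED_BY:
                    agents.add(o2)
    return agents

print(sources_of("tweet:42"))  # {'user:alice'}
```

The appeal of a shared representation is exactly this: a tool that only generates triples and a tool that only queries them can be written by different teams and still interoperate.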

You can find some more thoughts about this from Tim Lebo here. As for the hackathon itself, the participants were really enthusiastic, and several said that they would continue building on the ideas they developed there.

Hackathon Winners

Jim Myers (NCSA), Luc Moreau (IPAW PC co-chair), and I judged the apps and came up with what we thought were the top three. Our judging criteria were: whether the app was aimed at the end user, whether it worked, whether provenance was required, and the coolness factor. We will announce the winners tomorrow at the closing session of IPAW. The winners will receive some great prizes sponsored by the Large Knowledge Collider Project (LarKC). LarKC sponsored this hackathon because provenance is becoming a crucial part of semantic web applications; the hackathon let LarKC see how it can ensure that its platform supports hackers in building great provenance-enabled semantic web apps.


I was impressed with all the participants and the apps that were produced. We are a fairly new research community, so to see what could be built in so little time is great. We are getting there, and I can imagine that very soon we will have the infrastructure necessary to build user-facing provenance apps fast.

Along with Yolanda Gil, I’m editing a special issue of the Journal of Web Semantics on using provenance in the semantic web. You can check out the complete call at the JWS blog. Here’s the first paragraph of the call to get you excited about submitting something (or reading the resulting issue).

The Web is a decentralized system full of information provided by diverse open sources of varying quality. For any given question there will be a multitude of answers offered, raising the need for assessing their relative value and for making decisions about what sources to trust. In order to make effective use of the Web, we routinely evaluate the information we get, the sources that provided it, and the processes that produced it. A trust layer was always present in the Web architecture, and Berners-Lee envisioned an “oh-yeah?” button in the browser to check the sources of an assertion. The Semantic Web raises these questions in the context of automated applications (e.g. reasoners, aggregators, or agents), whether trying to answer questions using the Linked Data cloud, use a mashup appropriately, or determine trust on a social network. Provenance is therefore an important aspect of the Web that becomes crucial in Semantic Web research.

Check out Dan Brickley’s post on the chaos around tweets about Iran. The key quote from the post in my opinion is:

Without tools to trace reports to their source, to claims about their source from credible intermediaries, or evidence, this isn’t directly useful. Even grassroots journalists need evidence.

Even with retweets, it’s hard to figure out where information is coming from and from whom, especially as it flows in real time.


There’s much to be said about today’s inauguration of President Obama, but the thing I want to focus on is how well documented this event was from multiple viewpoints. Here’s a link to a slideshow from Flickr with, at the time, over 5,000 photos from over 2,000 people (I’m sure both stats will grow as more people add their photos). Each photo is from a different vantage point, a different person, and a slightly different time, all documenting the same event: an incredible range of views.

This notion of views, as explored in my thesis, or accounts, as it’s called in the Open Provenance Model, is critical to understanding how things have happened. The more documented perspectives we have of an event, the more likely we are to be able to understand its true nature (free from bias and in more detail) after the fact. So much documentation is hard to process, and thus a clear story is sometimes difficult to construct from it. However, it’s clear that this is one area where technology can help us. Here, I’m thinking about technologies such as Photosynth (see CNN’s The Moment) that help munge multiple sources into one element. The hard part is: once we have the synthesis of information, how do we understand its provenance?
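The idea of multiple accounts of a single event can be sketched in a few lines of Python. Everything below (observer names, vantage points, the fields themselves) is made up for illustration; it is not the Open Provenance Model’s actual schema:

```python
from dataclasses import dataclass

# Toy sketch of OPM-style "accounts": independent records of one event.
@dataclass(frozen=True)
class Account:
    observer: str
    vantage: str
    timestamp: str  # ISO 8601, when the observation was made

event = "inauguration-2009"
accounts = [
    Account("alice", "Capitol west front", "2009-01-20T12:01:00"),
    Account("bob", "National Mall", "2009-01-20T12:01:30"),
    Account("carol", "press riser", "2009-01-20T12:01:05"),
]

# More independent observers means more cross-checking power.
observers = {a.observer for a in accounts}
print(f"{event}: {len(accounts)} accounts from {len(observers)} observers")
```

The point is simply that each account stays a separate record over the same event, so disagreements between observers remain visible rather than being merged away.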

On my way back from my Christmas holiday in Amsterdam, I was looking through the shows available on the in-flight entertainment system and found one called How It’s Made. The show’s premise is obvious: it goes through how various products are made. Below is a clip showing how a compass is constructed. I actually think it would be neat if every product purchased could be linked to this type of video. I doubt it would be difficult for the PR department of most businesses to produce these short videos. They really provide a broad description of the provenance of the particular product in question and are very accessible to most consumers.

I discussed the idea of using quick videos to describe provenance in a previous post that focused on using them for science. The key is to connect specific instances of a product to the video that describes, in general, how products of that type are constructed.

The New York Times has a fun article up about two filmmakers who invented a fake public policy expert and adviser to the McCain campaign, Martin Eisenstadt, who has been cited as a source by several news organizations (MSNBC, the Los Angeles Times). Here’s a clip of Martin Eisenstadt responding to a non-existent BBC documentary featuring him:

Here’s a clip of the same fake advisor being used as a source on MSNBC:

Besides being funny, I think the chief point is that in today’s always-on media world, where there’s a constant scramble for the next story, publishers (whether news organizations or bloggers) fail to verify the provenance of a story. Likewise, it’s difficult for consumers of the story to easily check its veracity.

Others have suggested that news organizations put up all their source material. The New York Times has begun to do this in some cases. Unfortunately, most news consumers don’t have the time to actually look at the sources themselves. Maybe the solution is an automated or algorithmic mechanism for verifying the provenance of a story. hmm….
