Last week, I was at the Theory and Practice of Provenance 2015 (TaPP’15) workshop held in Edinburgh. This was the seventh edition of the workshop. You can check out my trip report from last year’s event, which was held during Provenance Week, here. TaPP aims to be a venue where people can present their early and innovative research ideas.

The event is useful because it brings together a cross section of researchers from different CS communities, ranging from databases, programming language theory, and distributed systems to e-science and the semantic web. While it’s nice to see old friends at this event, one discussion that ran through the two days was how we can connect back more strongly to these larger communities, especially as the interest in provenance increases within them.

I discuss the three themes I pulled from the event but you can take a look at all of the papers online at the event’s site and see what you think.

1. Execution traces as a core primitive

I was happy to be presenting on behalf of one of my students, Manolis Stamatogiannakis, who’s been studying how to capture provenance of desktop systems using virtual machines and other technologies from the systems community. (He’s hanging out at SRI with Ashish Gehani for the summer so couldn’t make it.) A key idea in the paper we presented was to separate the capture of an execution trace from the instrumentation needed to analyze provenance (paper). The slides for the talk are embedded below:

The mechanism used to do this is a technology called record & replay (we use PANDA), but this notion of capturing a lightweight execution trace and then replaying it deterministically is also popping up in other communities. For example, Boris Glavic has been using it successfully for database provenance in his work on GProM and reenactment queries. There he uses the audit logging and time travel features of modern databases (i.e. an execution trace) to support rich provenance queries.

This need to separate capture from queries was emphasized by David Gammack and Adriane Chapman’s work on developing agent-based models to figure out what instrumentation needs to be applied in order to capture provenance. Until we can efficiently capture everything, this is still going to be a stumbling block for completely provenance-aware systems. I think that treating execution traces as a core primitive for provenance systems may be a way forward.
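To make the capture/analysis split a bit more concrete, here is a toy sketch in Python. It is not PANDA and not how our system works internally; `job`, `record` and `replay` are made-up names for illustration, and the only assumption is that all nondeterminism enters the program through one input function.

```python
import random

def job(read_input):
    """A 'program' whose result depends on nondeterministic inputs."""
    a = read_input()
    b = read_input()
    return a * 2 + b

def record(program, source):
    """Run once, cheaply logging only the nondeterministic inputs."""
    trace = []

    def logged():
        value = source()
        trace.append(value)
        return value

    return program(logged), trace

def replay(program, trace, on_read=None):
    """Re-run deterministically from the trace, with optional instrumentation."""
    remaining = iter(trace)

    def replayed():
        value = next(remaining)
        if on_read is not None:
            on_read(value)  # heavier analysis hooks in here, after the fact
        return value

    return program(replayed)

# Capture: lightweight, no analysis attached.
result, trace = record(job, lambda: random.randint(0, 100))

# Analysis: decided later, plugged in at replay time.
observed = []
replayed_result = replay(job, trace, on_read=observed.append)
```

The point of the design is that the decision about what to instrument moves from capture time to replay time, so the same trace can be analyzed again with different instrumentation.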

2. Workflow lessons in non-workflow environments

There are numerous benefits to using (scientific) workflow systems for computational experiments, one of which is that they provide a good mechanism for capturing provenance in a declarative form. However, not all users can or will adopt workflow environments. Many use computational notebooks (e.g. Jupyter) or just shell scripts. The YesWorkflow system (very inside community joke here) uses convention and a series of comments to help users produce a workflow and provenance structure from their scripts and file system (paper). Likewise, work on combining noWorkflow, a provenance tracking system for Python, and IPython notebooks shows real promise (paper). This reminded me of the PROV-O-Matic work by Rinke Hoekstra.

Of course you can combine YesWorkflow and noWorkflow together into one big system.
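As a rough illustration of the comment-based idea, here is a small Python sketch that pulls a workflow structure out of an annotated shell script. It only handles a simplified subset of YesWorkflow-style tags (`@begin`, `@in`, `@out`, `@end`); the script, the file names and the helper functions are all invented for this example, and the real tool does a good deal more.

```python
import re

# A made-up shell script with YesWorkflow-style comment annotations.
SCRIPT = '''\
# @begin clean_data
# @in raw.csv
# @out clean.csv
grep -v NA raw.csv > clean.csv
# @end clean_data

# @begin plot
# @in clean.csv
# @out figure.png
python plot.py clean.csv figure.png
# @end plot
'''

TAG = re.compile(r"#\s*@(begin|in|out|end)\s+(\S+)")

def extract_blocks(text):
    """Collect {block: {'in': [...], 'out': [...]}} from tagged comments."""
    blocks, current = {}, None
    for line in text.splitlines():
        match = TAG.match(line.strip())
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            current = name
            blocks[current] = {"in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            blocks[current][tag].append(name)
    return blocks

def dataflow_edges(blocks):
    """Link a producing block to a consuming block when data names match."""
    edges = []
    for producer, p in blocks.items():
        for consumer, c in blocks.items():
            for name in p["out"]:
                if name in c["in"]:
                    edges.append((producer, name, consumer))
    return edges

blocks = extract_blocks(SCRIPT)
edges = dataflow_edges(blocks)
```

Nothing here executes the script: the structure comes purely from convention in the comments, which is why it works for shell scripts and other non-workflow code.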

Overall, I like the trend towards applying workflow concepts in-situ. It got me thinking about applying the scientific workflow results to the abstractions provided by Apache Spark. Just a hunch that this might be an interesting direction.

3. Completing the pipeline

The last theme I wanted to pull out is that I think we are inching towards being able to truly connect provenance generated by applications. My first example is the work by Glavic and his students on importing and ingesting PROV-JSON into a database. This lets you query the provenance of query results while also including information on the pipelines that produced the data.

This is something I’ve wanted to do for ages with Marcin Wylot’s work on TripleProv. I was a bit bummed that Boris got there first but I’m glad somebody did it 🙂

The second example was the continued push forward for provenance in the systems community. In particular, there are the OPUS and SPADE systems, which I was aware of, but now also the work on Linux Provenance Modules by Adam Bates that was introduced to me at TaPP. These all point to the ability to leverage key operating system constructs to capture and manage provenance. For example, Adam showed how to make use of mandatory access control policies to provide focused and complete capture of provenance for particular applications.
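Here is a minimal sketch of what ingesting PROV-JSON into a database can look like, using SQLite from Python. The table layout, the `load` and `lineage` helpers and the `ex:` identifiers are my own assumptions for illustration and are not the schema used by Glavic’s group; only the `used` and `wasGeneratedBy` relations of PROV-JSON are handled.

```python
import json
import sqlite3

# A tiny, made-up PROV-JSON document: raw -> cleaning -> clean -> query -> result.
PROV_JSON = json.dumps({
    "entity": {"ex:raw": {}, "ex:clean": {}, "ex:result": {}},
    "activity": {"ex:cleaning": {}, "ex:query": {}},
    "used": {
        "_:u1": {"prov:activity": "ex:cleaning", "prov:entity": "ex:raw"},
        "_:u2": {"prov:activity": "ex:query", "prov:entity": "ex:clean"},
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:clean", "prov:activity": "ex:cleaning"},
        "_:g2": {"prov:entity": "ex:result", "prov:activity": "ex:query"},
    },
})

def load(conn, text):
    """Flatten used / wasGeneratedBy into one 'depends on' edge table."""
    doc = json.loads(text)
    conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
    for rel in doc.get("used", {}).values():
        conn.execute("INSERT INTO edge VALUES (?, ?)",
                     (rel["prov:activity"], rel["prov:entity"]))
    for rel in doc.get("wasGeneratedBy", {}).values():
        conn.execute("INSERT INTO edge VALUES (?, ?)",
                     (rel["prov:entity"], rel["prov:activity"]))

def lineage(conn, node):
    """Everything 'node' transitively depends on, via a recursive query."""
    rows = conn.execute(
        """
        WITH RECURSIVE up(n) AS (
            SELECT dst FROM edge WHERE src = ?
            UNION
            SELECT edge.dst FROM edge JOIN up ON edge.src = up.n
        )
        SELECT n FROM up ORDER BY n
        """,
        (node,),
    ).fetchall()
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
load(conn, PROV_JSON)
ancestors = lineage(conn, "ex:result")
```

Once the application-level provenance sits in the same database as the data, the lineage of a query result can be followed back through the pipeline steps that happened outside the database.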

I have high hopes here.

Random thoughts

To conclude I’ll end with some thoughts from the notebook.

I hope to see many familiar and new faces at next year’s Provenance Week (which combines TaPP and IPAW).

Last week I got back from a great 8 (!) days in Riva del Garda, Italy attending the 2014 International Semantic Web Conference and associated events. This is one of those events where your colleagues on Facebook get annoyed with the pretty pictures of lakes and mountains that their other colleagues keep posting:


ISWC is the key conference for semantic web research and the place to see what’s happening. This year’s conference had 630 attendees which is a strong showing for the event. The conference is as usual selective:
Interestingly, the numbers were about on par with last year except for the in-use track, where we had a much larger number of submissions. I suspect this is because all tracks had synchronized submission deadlines, whereas last year the in-use track deadline came after the research track. The replication, dataset, software, and benchmark track is a new addition to the conference and a good one, I might add. Having a place to present these sorts of scholarly output is important. You can find the papers (published and in preprint form) on the website. More importantly, you can find a big chunk of the slides presented on Eventifier.

So why was I hanging out in Italy (other than the pasta)? I was co-organizer of the Doctoral Consortium for the event.

Additionally, I was on a panel for the Context Interpretation and Meaning workshop. I also attended a pre-meeting on archiving linked data for the PRELIDA project. Lastly, we had an in-use paper in the conference on adaptive linking used within the Open PHACTS platform to support chemistry. Alasdair Gray did a fantastic job of leading and presenting the paper.

So on to the show. Three themes, which I discuss in turn:

  1. It’s not Volume, it’s Variety
  2. Variety & the Semantic Spectrum
  3. Fuzziness & Metrics

It’s not Volume, it’s Variety

I’m becoming more convinced that the issue for most “big” data problems isn’t volume or velocity, it’s variety. In particular, I think the hardware/systems folks are addressing the first two problems at a rate that means that for many (most?) workloads the software abstractions provided are enough to deal with the data sizes and speed involved. This inkling was confirmed to me a couple of weeks ago when I saw a talk by Peter Hofstee, the designer of the Cell microprocessor, talking about his recent work on computer architectures for big data.

This notion was further confirmed at ISWC. Bryan Thompson, of BigData triple store fame, presented his new work (mapgraph.io) that can do graph processing on hundreds of millions of nodes using GPUs, with abstractions similar to Signal/Collect or GraphLab. Additionally, as I was sitting in the session on Large Scale RDF processing, I noticed that many of the systems were focused on a clustered environment but used ~100 million triple test sets, even though you can process these with a single beefy server. It seems that online analytics workloads can be handled with a simple server setup, while truly web scale workloads will be at the level of clusters that can be provisioned fairly straightforwardly in the cloud. In our community, the best examples are webdatacommons.org or the work of the VU team on LODLaundry – both of these process graphs in the billions using the Hadoop ecosystem on either local or Amazon based clusters. Furthermore, the best paper in the in-use track (Semantic Traffic Diagnosis with STAR-CITY: Architecture and Lessons Learned from Deployment in Dublin, Bologna, Miami and Rio) from IBM actually scrapped using a specific streaming system because even data coming from traffic sensors wasn’t fast enough to make it worthwhile.

Indeed, in Prabhakar Raghavan’s (yes! the Intro. to Information Retrieval and Google guy) keynote, he noted that he would love to have problems that were just computational in nature. Likewise, Yolanda Gil discussed how the difficulties and challenges lie not necessarily in data analysis but in data preparation (i.e. it’s a data mess!).

The hard part is data variety and heterogeneity, which transitions nicely into our next theme…

Variety & the Semantic Spectrum

Chris Bizer gave an update to the measurements of the Linked Data Cloud – this was a highlight talk.

The Linked Data Cloud has grown, essentially doubling (generously, towards ~1000 datasets), but schema.org based data (see the Microdata+RDFa series ISWC 2014 paper) has grown to ~500,000 datasets. Chris gave an interesting analysis about what he thinks this means in a nice mailing list post. The comparison is summed up below:

So what we are dealing with is really a spectrum of semantics, from extremely rich knowledge bases to more shallow mark-up. (As a side note: Guha’s thoughts on Schema.org are always worth a revisit.) To address this spectrum, I saw quite a few papers using a variety of CS techniques, from NLP to databases. Indeed, two of the best papers were related to this subject:

Also on this front were works on optimizing link discovery (HELIOS), machine reading (SHELDON), entity recognition, and querying probabilistic triple stores. All of these works had in common trying to take approaches from other CS fields and adapt or improve them to deal with these problems of variety within a spectrum of semantics.

Fuzziness & Metrics

The final theme that I pulled out of the conference was evaluation metrics, in particular ones that deal with or cater for the fact that there are no hard truths, especially when using corpora developed using human judgements. The quintessential example of this is my colleague Lora Aroyo’s work on Crowd Truth – trying to capture disagreement in the process of creating gold standard corpora in crowd sourcing environments. Another example is the very nice work from Michelle Cheatham and Pascal Hitzler on creating an uncertain OAEI conference benchmark. Raghavan’s keynote also homed in on the need for more metrics, especially as we have a change in the type of search interfaces that we typically use (going from keyword searches to more predictive contextual search). This theme was also prevalent in the workshops, in particular the question of how we measure in the face of changing contexts. Examples include:

A Note on the Best Reviewers

Good citizens:

A nice note: some were nominated by authors of papers that the reviewer rejected because the review was so good. That’s what good peer review is about – improving our science.

Random Notes

  • Love the work Bizer and crew are doing on Web Tables. Check it out.
  • Conferences are so good for quick lit reviews. Thanks to Bijan Parsia who sent me in the direction of Pavel Klinov’s work on probabilistic reasoning over inconsistent ontologies.
  • grafter.org – nice site
  • Yes, you can reproduce results.
  • There’s more provenance on the Web of Data than ever. (Unfortunately, PROV is still small percentage wise.)
  • On the other hand, PROV was in many talks like last year. It’s become a touch point. Another post on this is on the way.
  • The work by Halpin and Cheney on using SPARQL update for provenance tracking is quite cool. 
  • A win from the VU: DIVE 3rd place in the semantic web challenge 
  • Amazing wifi at the conference! Unbelievable!
  • +1 to the Poster & Demo crew: keeping 160 lightning talks going on time and fun – that’s hard
  • 10 year award goes to software: Protégé – well deserved
  • http://ws.nju.edu.cn/explass/
  • From Nigel’s keynote: it seems that the killer app of open data is …. insurance
  • Two years in a row that stuff I worked on has gotten a shout out in a keynote (Social Task Networks). 😃
  • ….. I don’t think the streak will last
  • 99% of queries have nouns (i.e. entities)
  • I hope I did Sarven’s Call for Linked Research justice
  • We really ought to archive LOV – vocabularies are small but they take a lot of work. It’s worth it.
  • The Media Ecology project is pretty cool. Clearly, people who have lived in LA (e.g. Mark Williams) just know what it takes 😉
  • Like: Linked Data Fragments – that’s the way to question assumptions.
  • A low-carb diet in Italy – lots of running

Welcome to a massive multimedia extravaganza trip report from Provenance Week, held earlier this month, June 9-13.

Provenance Week brought together two workshops on provenance plus several co-located events. It had roughly 65 participants. It’s not a huge event but it’s a pivotal one for me as it brings together all the core researchers working on provenance from a range of computer science disciplines. That means you hear the latest research on the topic ranging from great deployments of provenance systems to the newest ideas on theoretical properties of provenance. Here’s a picture of the whole crew:

Given that I’m deeply involved in the community, it’s going to be hard to summarize everything of interest because…well…everything was of interest. It also means I had a lot of stuff going on. So what was I doing there?

Activities


 

PROV Tutorial

Together with Luc Moreau and Trung Dong Huynh, I kicked off the week with a tutorial on the W3C PROV provenance model. The tutorial was based on my recent book with Luc. From my count, we had ~30 participants for the tutorial.

We’ve given tutorials in the past on PROV but we made a number of updates as PROV is becoming more mature. First, as the audience had a more diverse technical background, we started from a conceptual model (UML) point of view instead of a Semantic Web perspective. Furthermore, we presented both tools and recipes for using PROV. The number of tools we now have out for PROV is growing – ranging from conversion of PROV from various version control systems to neuroimaging workflow pipelines that support PROV.

I think the hit of the show was Dong’s demonstration of interacting with PROV using his Prov python module (pypi) and Southampton’s Prov Store.

Papers & Posters

I had two papers in the main track of the International Provenance and Annotation Workshop (IPAW) as well as a demo and a poster.

Manolis Stamatogiannakis presented his work with me and Herbert Bos – Looking Inside the Black-Box: Capturing Data Provenance using Dynamic Instrumentation. In this work, we looked at applying dynamic binary taint tracking to capture high-fidelity provenance on desktop systems. This work solves what’s known as the n-by-m problem in provenance systems: if a program is treated as a black box, each of its m outputs has to be assumed to depend on all n of its inputs. Essentially, our approach allows us to see how data flows within an application without having to instrument that application up-front. This lets us know exactly which output of a program is connected to which inputs. The work was well received and we had a bunch of different questions, both around the speed of the approach and whether we can track high-level application semantics. A demo video is below and you can find all the source code on github.

We also presented our work on converting PROV graphs to IPython notebooks for creating scientific documentation (Generating Scientific Documentation for Computational Experiments Using Provenance). Here we looked at how to create documentation from provenance that is gathered in a distributed setting and put that together in an easy-to-use fashion. This work was part of a larger discussion at the event on the connection between provenance gathered in these popular notebook environments and that gathered on more heterogeneous systems. Source code again on github.

I presented a poster on our (with Marcin Wylot and Philippe Cudré-Mauroux) recent work on instrumenting a triple store (i.e. graph database) with provenance. We use a long-standing technique from the database community, provenance polynomials, but applied to large scale RDF graphs. It was good to be able to present this to those from the database community that were at the conference. I got some good feedback, in particular, on some efficiencies we might implement.
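To give a feel for the idea, here is a toy taint-tracking sketch in Python. It works at the level of Python values rather than binaries, so it is only an analogy for what dynamic binary taint tracking does; the `Tainted` class, `read_input`, `program` and the file names are all hypothetical.

```python
class Tainted:
    """A value that remembers which inputs influenced it."""

    def __init__(self, value, sources):
        self.value = value
        self.sources = frozenset(sources)

    def __add__(self, other):
        if not isinstance(other, Tainted):
            other = Tainted(other, ())  # constants carry no taint
        # The result is influenced by everything that influenced the operands.
        return Tainted(self.value + other.value, self.sources | other.sources)

def read_input(name, value):
    """Label a value with the (hypothetical) file it was read from."""
    return Tainted(value, {name})

def program(a, b, c):
    """An uninstrumented 'application' with 3 inputs and 2 outputs."""
    return {"sum.txt": a + b, "shifted.txt": c + 10}

outputs = program(
    read_input("a.txt", 1),
    read_input("b.txt", 2),
    read_input("c.txt", 3),
)

# Which inputs actually reached which output?
links = {name: sorted(value.sources) for name, value in outputs.items()}
```

A black-box view of `program` would have to report 3 x 2 = 6 possible input-output dependencies; following the data as it flows reduces that to the 3 that actually exist.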
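For readers who haven’t met provenance polynomials, here is a small Python sketch of the general idea: each source tuple (or triple) gets a variable, joint use multiplies, and alternative derivations add. This is a generic illustration and not how our triple store implements it; the `Counter`-based representation and the helper names are my own choices.

```python
from collections import Counter

def var(name):
    """Polynomial consisting of a single source-triple identifier."""
    return Counter({(name,): 1})

def plus(p, q):
    """Alternative derivations (e.g. union) add."""
    return p + q

def times(p, q):
    """Joint use (e.g. join) multiplies: merge the monomials pairwise."""
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

def evaluate(p, assignment):
    """Plug numbers in for the variables (here: 1 = present, 0 = deleted)."""
    total = 0
    for monomial, coeff in p.items():
        term = coeff
        for v in monomial:
            term *= assignment[v]
        total += term
    return total

# An answer derived by joining t3 with either t1 or t2: (t1 + t2) * t3
answer = times(plus(var("t1"), var("t2")), var("t3"))
```

Evaluating the polynomial answers "what if" questions without rerunning the query: with every triple present the answer has two derivations, and setting `t3` to 0 shows that it disappears once `t3` is deleted.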

 

I also demoed (see above) the really awesome work by Rinke Hoekstra on his PROV-O-Viz provenance visualization service (Paper, Code). This was a real hit with a number of people wanting to integrate this with their provenance tools.

Provenance Reconstruction + ProvBench

At the end of the week, we co-organized with the ProvBench folks an afternoon about challenge tasks and benchmark datasets. In particular, we looked at the challenge of provenance reconstruction – how do you recreate provenance from data when you didn’t track it in the first place? Together with Tom De Nies we produced a number of datasets for use with this task. It was pretty cool to see that Hazeline Asuncion used these datasets in one of her classes, where her students used a wide variety of off-the-shelf methods.

From the performance scores, precision was ok but very dataset dependent, and it relies a lot on knowledge of the domain. We’ll be working with Hazeline to look at defining different aspects of this problem going forward.

Provenance reconstruction is just one task where we need datasets. ProvBench is focused on gathering those datasets and also defining new challenge tasks to go with them. Check out this github for a number of datasets. The PROV standard is also making it easier to consume benchmark datasets because you don’t need to write a new parser to get a hold of the data. The dataset I most liked was the Provenance Capture Disparities dataset from the Mitre crew (paper). They provide a gold standard provenance dataset capturing everything that goes on in a desktop environment, plus two different provenance traces from different kinds of capture systems. This is great both for testing provenance reconstruction and for looking at how to merge independent capture sources to achieve a full picture of provenance.

There is also a nice tool to convert Wikipedia edit histories to PROV.

Themes


I think I picked out three large themes from provenance week.

  1. Transparent collection
  2. Provenance aggregation, slicing and dicing
  3. Provenance across sources

Transparent Collection

One issue with provenance systems is getting people to install provenance collection systems in the first place, let alone installing new modified provenance-aware applications. A number of papers reported on techniques aimed at making capture more transparent.

A couple of approaches tackled this for programming languages. One system focused on R (RDataTracker) and the other on Python (noWorkflow). I particularly enjoyed the noWorkflow Python system as they provided not only transparent capture for provenance systems but also a number of utilities for working with the captured provenance, including a diff tool and a conversion from provenance to Prolog rules (I hope Jan reads this). The Prolog conversion includes rules that allow for provenance-specific queries to be formulated. (On Github). noWorkflow is similar to Rinke’s PROV-O-Matic tool for tracking provenance in Python (see video below). I hope we can look into sharing work on a really good Python provenance solution.

An interesting discussion point that arose from this work was: how much should we expose provenance to the user? Indeed, the team that did RDataTracker specifically inserted simple on/off statements in their system so the scientific user could control the capture process in their R scripts.

Tracking provenance by instrumenting the operating system level has long been an approach to provenance capture. Here, we saw a couple of techniques that tried to reduce that tracking to simply launching a system background process in user space while improving the fidelity of provenance. This was the approach of our system Data Tracker and Cambridge’s OPUS (specific challenges in dealing with interposition on the std lib were discussed). Ashish Gehani was nice enough to work with me to get his SPADE system set up on my Mac. It was pretty much just a checkout, build, and run to start capturing reasonable provenance right away – cool.

Databases have consistently been a central place for provenance research. I was impressed by Boris Glavic’s vision (paper) of a completely transparent way to report provenance for database systems by leveraging two common database functions – time travel and an audit log. Essentially, through the use of query rewriting and query replay he’s able to capture/report provenance for database query results. Talking to Boris, they have a lot of it implemented already in collaboration with Oracle. Based on prior history (PostgreSQL with provenance), I bet it will happen shortly. What’s interesting is that his approach requires no modification of the database and instead sits as middleware above the database.

Finally, in the discussion session after the TaPP practice session, I asked the presenters who represented the range of these systems to ballpark what kind of overhead they saw for capturing provenance. The conclusion was that we could get between 1% – 15% overhead. In particular, for deterministic replay style systems you can really press down the overhead at capture time.
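As a rough Python analogue of transparent capture with a user-controlled switch, here is a sketch built on the standard `sys.setprofile` hook. It is not how noWorkflow or RDataTracker are implemented; `capture_on`, `capture_off` and the three example functions are invented for illustration, and only Python-level function calls are recorded.

```python
import sys

calls = []  # the captured "provenance": names of functions that ran

def _profiler(frame, event, arg):
    # Record Python-level calls, skipping the switch functions themselves.
    name = frame.f_code.co_name
    if event == "call" and name not in ("capture_on", "capture_off"):
        calls.append(name)

def capture_on():
    """User-visible switch: start recording."""
    sys.setprofile(_profiler)

def capture_off():
    """User-visible switch: stop recording."""
    sys.setprofile(None)

# An ordinary script; none of these functions know about provenance.
def load():
    return [3, 1, 2]

def analyse(data):
    return sorted(data)

def scratch():
    return "not worth recording"

capture_on()
result = analyse(load())
capture_off()
scratch()  # runs outside the captured region
```

The script’s own functions stay untouched, which is the "transparent" part, while the two switch calls show the kind of explicit control over capture that came up in the discussion.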
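The following toy sketch shows the flavour of replaying an audit log over an earlier snapshot to recover provenance. It is plain Python rather than SQL and is not Boris’s system: the `snapshot`, the `audit_log` entries (an id, a WHERE-like predicate and an update function) and the `reenact` helper are all assumptions made for the example.

```python
# "Time travel": the table state before the logged statements ran.
snapshot = {1: {"balance": 100}, 2: {"balance": 20}}

# "Audit log": each entry is (statement id, WHERE predicate, update).
audit_log = [
    ("u1", lambda r: r["balance"] < 50, lambda r: {"balance": r["balance"] + 10}),
    ("u2", lambda r: r["balance"] >= 30, lambda r: {"balance": r["balance"] * 2}),
]

def reenact(snapshot, log):
    """Replay the logged updates on a copy, noting which ones touched each row."""
    state = {key: dict(row) for key, row in snapshot.items()}
    provenance = {key: [] for key in snapshot}
    for stmt_id, where, change in log:
        for key, row in state.items():
            if where(row):
                row.update(change(row))
                provenance[key].append(stmt_id)
    return state, provenance

state, provenance = reenact(snapshot, audit_log)
```

Nothing had to be recorded when the updates originally ran beyond what the database keeps anyway; the per-row provenance is computed afterwards, which is why this style of approach can sit above an unmodified database.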

Provenance aggregation, slicing and dicing

I think Susan Davidson said it best in her presentation on provenance for crowdsourcing – we are at the OLAP stage of provenance. How do we make it easy to combine, recombine, summarize, and work with provenance? What kind of operators, systems, and algorithms do we need? Two interesting applications came to the fore for this kind of need – crowdsourcing and security. Susan’s talk exemplified this but at the Provenance Analytics event there were several other examples (Huynh et al., Dragon et al.).

The other area was security. Roly Perera presented his impressive work with James Cheney on cataloging various mechanisms for transforming provenance graphs for the purposes of obfuscating or hiding sensitive parts of the provenance graph. This paper is great reference material for various mechanisms to deal with provenance summarization. One technique for summarization that came up several times, in particular with respect to this domain, was the use of annotation propagation through provenance graphs (e.g. see ProvAbs by Missier et al. and work by Moreau’s team.)

Provenance across sources

The final theme I saw was how to connect provenance across sources. One could also call this provenance integration. Both Chapman and the Mitre crew with their Provenance Plus tracking system and Ashish with his SPADE system are experiencing this problem of provenance coming from multiple different sources and needing to integrate these sources to get a complete picture of provenance both within a system and spanning multiple systems. I don’t think we have a solution yet but they both (Ashish, Chapman) articulated the problem well and have some good initial results.

This is not just a systems problem; it’s fundamental that provenance extends across systems. Two of the cool use cases I saw exemplified the need to track provenance across multiple sources.

The Kiel Center for Marine Science (GEOMAR) has developed a provenance system to track their data throughout their entire organization, stemming from data collected on their boats all the way through to a data publication. Yes, you read that right, provenance gathered on awesome boats! This involves digital pens, workflow systems and data management systems.

The other was the recently released US National Climate Change Assessment. The findings of that report stem from 13 different institutions within the US Government. The data backing those findings is represented in a structured fashion including the use of PROV. Curt Tilmes presented more about this amazing use case at Provenance Analytics.

In many ways, the W3C PROV standard was created to help solve these issues. I think it does help but having a common representation is just the start.


Final thoughts

I didn’t mention it but I was heartened to see that the community has taken to using PROV as a mechanism for interchanging data and for having discussions. My feeling is that if you can talk provenance polynomials and PROV graphs, you can speak with pretty much anybody in the provenance community no matter which “home” they have – whether systems, databases, scientific workflows, or the semantic web. Indeed, one of the great things about provenance week is that one was able to see diverse perspectives on this cross cutting concern of provenance.

Lastly, there seemed to be many good answers at provenance week but, more importantly, lots of good questions. Now, I think as a community we should really expose more of the problems we’ve found to a wider audience.

Random Notes

  • It was great to see the interaction between a number of different services supporting PROV (e.g. git2prov.org, prizms, prov-o-viz, prov store, prov-pings, PLUS)
  • ProvBench on datahub – thanks Tim
  • DLR did a fantastic job of organizing. Great job Carina, Laura and Andreas!
  • I’ve never had happy birthday sung to me by 60 people at a conference dinner – surprisingly in tune – Kölsch is pretty effective. Thanks everyone!
  • Stefan Woltran’s keynote on argumentation theory was pretty cool. Really stepped up to the plate to give a theory keynote the night after the conference dinner.
  • Speaking of theory, I still need to get my head around Bertram’s work on Provenance Games. It looks like a neat way to think about the semantics of provenance.
  • Check out Daniel’s trip report on provenance week.
  • I think this is long enough…..

A couple of weeks ago, I was at the European Data Forum in Athens talking about the Open PHACTS project. You can find a video of my talk with slides here. Slides are embedded below.

Yesterday, Luc (my coauthor) and I received our physical copies of Provenance: An Introduction to PROV in the mail. Even though the book is primarily designed to be distributed digitally, it’s always great actually holding a copy in your hands. You can now order your own physical copy on Amazon. The Amazon page for the book also includes the ability to look inside the book.


Cross-posted from blog.provbook.org

If you follow this blog, you’ll know that one of the main themes of my research is data provenance – one of the main use cases for it is reproducibility and transparency in science. I’ve been attending and speaking at quite a few events talking about data sharing, reproducibility and making science more transparent. I’ve even published [1, 2] on these topics.

In this context, I’ve been thinking about my own process as a scientist and whether I’m “eating my own dogfood”. Indeed at the Beyond the PDF 2 conference in March, I stood up at the end and in front of ~200 people said that I would change my work practice – we have enough tools to really change how we do science. I knew I could do better.

So this post is about doing just that. In general, my research work consists of larger infrastructure projects in collaborations and then smaller work developing experimental prototypes and mucking with new algorithms. For the former, the projects use all the standard software development stuff (github, jira, wikis) so this gets documented fairly well.

The bit that’s not as good as it should be is for the smaller scale things. I think my co-authors and I do an ok job at publishing the code and the data associated with our publications, although this could be improved. (It’s too often on our own websites). The major issue I have is that the methods are probably not as reproducible or transparent as they should be – essentially it’s a bit messy for other people to figure out exactly what I was up to when doing something new. It’s not in one place nor is it clearly documented. It also hurts my process in that a lot of the mucking about I do gets lost or it takes time to find. I see this as a particular problem as I do more web science research, where gathering, cleaning and reanalyzing data is a critical part of the endeavor.

With that in mind, I’ve decided to get my act together and follow in the footsteps of the likes of Titus Brown and Carl Boettiger  and do more of my science in a reproducible and open fashion.

To do this, I’ve decided to adopt IPython Notebooks as my new note taking environment. This solves the problem of allowing me to try different things out and keep track of all the parts of a project together. Additionally, it lets me “narrate my work” – that is mix commentary with my code, which is pretty cool.  My notebook is on github and also contains information about how my system is setup including versions of libraries I’m relying on.

There’s still a long way to go to pass Phil’s test for research programming effectiveness (see also Why use make?), but I think this is a step in the right direction.

To honor this step, I’m giving $100 to FORCE11 to spread the word about how we can make scholarship better.
