Monthly Archives: March 2012

For the last two days, I was at the kickoff for the COMMIT/ Program – a major computer science (i.e. ICT) research initiative in the Netherlands (110 million euros in funding). The entire program has 15 projects covering most of the major hot topics in computer science, everything from sensor networks to large scale data management. It involves 76 partners, both academic and industrial. The event itself was attended by ~220 people.

I’m involved in COMMIT/ as part of the Data2Semantics project where we’re developing approaches that enable scientists to more easily publish, share and reuse data.

The projects of COMMIT/

The aim of the kickoff was (from my perspective) twofold:

  1. to connect these different projects;
  2. to encourage participants to think beyond traditional academic output.

With respect to 1, the leaders of COMMIT/ are trying to create a cohesive program and not just a bunch of separate projects that happened to be funded by the same source. This is going to be a hard task but the kickoff was a good start. The event focused extensively on networking exercises, for example, developing demonstrator ideas with mixed groups of partners. In addition, they gave out t-shirts, which is always a good way to create cohesion 🙂 Indeed, the theme of the event was to position COMMIT/ as an unfolding story.

During one session of lightning talks, I drew the above picture trying to capture a quick visual summary of each project and the common themes across them: health, scale, storytelling.

With respect to 2, it’s clear that one of the main goals of the program is to have impact in the world beyond academia. This was shown by the emphasis on research communication to the outside world. I attended a fantastic workshop on visual storytelling given by Onno van der Venn from Zeeno. In addition, there was emphasis placed on creating companies or helping develop products within existing companies in the various projects. A number of support opportunities were discussed.

The event was well organized and it was great to be able to network under the COMMIT/ banner. The program is just beginning, so it will be interesting to see how the story progresses and whether these various, already big, projects can indeed be brought together.

Personally, I’m hoping to see if we can come up with something that can use the kind of tech transfer instruments the program is encouraging…so back to work 🙂

Note: this post has been cross-posted at the International Collaboration of Early Career Researchers.

This past Thursday, I had the opportunity to participate in a mini-symposium on open data for science held by the VU University Amsterdam (where I work), titled “Open Data for Science: Will it hurt or help?”

The symposium consisted of three 15-minute talks and then some lively discussion with the audience of, I think, ~60 people from the university. We were lucky to have Jos Engelen, the chairman of the NWO (the Dutch NSF), discuss the perspective from research policy makers. The main takeaway I got from his presentation and the subsequent discussion is that open data (despite all reservations) is a worthy endeavor to pursue and something that research funders should (and will) encourage. Furthermore, just his presence means that policy makers are reaching out to see what the academic community thinks and that the community will have a say in how (open) data management policies will be rolled out in the Netherlands.

The most difficult talk to give was by Eco de Geus, who was asked to reflect on the more negative aspects of open data. He presented important points about incentive structures (will I be scooped?), privacy, and the tendency towards one-size-fits-all open data policies. I think what made the reservations more poignant is that Prof. de Geus is not anti open data; indeed, he is deeply involved in a large open data project in his domain.

I talked about the view from a scientist starting out in their career. I told two stories:

  1. how open data really benefited a collaborator of mine in her study of interdisciplinary work practices. For a consumer of data, open data really removes a number of barriers.
  2. in an analogy to open code, I discussed how open source code I produced during my PhD led to more citations, a new collaboration, and others comparing their work to mine. However, these benefits were contrasted with the need to do support and having to be comfortable exposing my work practices.

I ended by making the following points about open data:

  1. Open data is a boon to young scientists when they are acting as consumers of data.
  2. It’s a more difficult position for producers of data. There are trade-offs including concerns about credit, time for support, and time to prepare data.
  3. Given 2, if we want to help scientists as consumers of data, we need to give support to producers.
  4. Clear simple guidelines for data publication are critical. Scientists shouldn’t need to be lawyers to either produce or consume data sets.
  5. Credit where credit is due. For open data to succeed, we need data citation on par with traditional citation.

You’ll find the slides to my talk below, although they are mostly images so they may not make much sense on their own.

Overall, I thought the talks and discussion were excellent. It’s great to see this sort of discussion happening where I work. I hope it’s happening in many other institutions as well.

Last week, I attended a workshop at Dagstuhl on the Principles of Provenance. Before talking about the content of the workshop itself, it’s worth describing the experience of Dagstuhl. The venue itself is a manor house located pretty much in the middle of nowhere in southwest Germany. The nature around the venue is nice and it really is away from it all. All the attendees stay in the manor house so you spend not only the scheduled workshop times with your colleagues but also breakfast, lunch, dinner and evenings. They also have small tricks to ensure that everyone mingles, for example, by pseudo-randomly seating people at different tables for meals. Additionally, Dagstuhl is specifically for computer science – they have a good internet connection and one of the best computer science libraries I’ve seen. All these things together make Dagstuhl a unique, intellectually intense environment. It’s one of the nicest traditions in computer science.

Me at the Principles of Provenance workshop

With that context in mind, the organizers of the Principles of Provenance workshop (James Cheney, Wang-Chiew Tan, Bertram Ludaescher, Stijn Vansummeren) brought together computer scientists studying provenance from the perspective of databases, the semantic web, scientific workflow, programming languages and software engineering. While I knew most of the people in this broad community (at least from paper titles), I met some new people and got to know people better. The organizers started the workshop with overviews of provenance work from 4 areas:

  1. Provenance in Database Systems
  2. Provenance in Workflows and Scientific Computation
  3. Provenance in Software Engineering, Programming Languages and Security
  4. Provenance Interchange on the Web (i.e. the W3C standardization effort)

These tutorials were a great idea because they provided a common basis for communication throughout the week. The rest of the week combined quite a few talks and plenty of discussion. The organizers are putting together a report right now containing abstracts and presentations so I won’t go into that more here. What I do want to do is pull out 3 take-aways that I had from the week.

1) Connecting consensus models to formal foundations

Because provenance often spans multiple systems (my data is often sourced from somewhere else), there is a need for provenance systems to interoperate. There have been a number of efforts to enable this interoperability, including the creation of the Open Provenance Model as well as the current standardization effort at the W3C. Because these efforts are trying to bridge across multiple implementations, they are driven by community consensus: what models can we agree upon, what is minimally necessary for interchange, what is easy to understand and implement.

Separately, there is quite a lot of work on formal foundations of provenance, especially within the database community. This work is grounded in applications but also in formal theory that ensures that provenance information has nice properties. Concretely, one can show that certain types of provenance within a database context can be expressed as polynomials, algebraically manipulated, and also related to one another (semirings!). Plus, provenance polynomials sounds nice. Check out T.J. Green’s thesis for starters:

Todd J. Green. Collaborative Data Sharing with Mappings and Provenance. PhD thesis, University of Pennsylvania, 2009
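
To make the polynomial idea a bit more concrete, here is a minimal sketch in Python. It is my own toy illustration, not code from the thesis: the function names (`var`, `plus`, `times`, `evaluate`) and the tuple identifiers `r1`, `r2`, `s1` are made up for the example. The assumption is the usual one from the semiring view of provenance: alternative derivations of an output tuple are added, joint derivations (as in a join) are multiplied, and the resulting polynomial can then be evaluated in another semiring such as counting or true/false.

```python
import operator
from collections import Counter

# A polynomial is a Counter mapping a monomial (sorted tuple of
# tuple identifiers) to its natural-number coefficient.

def var(name):
    """Polynomial for a single input tuple, e.g. r1."""
    return Counter({(name,): 1})

def plus(p, q):
    """Alternative derivations (union, projection): add polynomials."""
    result = Counter(p)
    for mono, coeff in q.items():
        result[mono] += coeff
    return result

def times(p, q):
    """Joint derivations (join): multiply polynomials."""
    result = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            result[tuple(sorted(m1 + m2))] += c1 * c2
    return result

def evaluate(poly, assignment, add, mul, zero, one):
    """Evaluate the polynomial in another semiring (add, mul, zero, one)."""
    total = zero
    for mono, coeff in poly.items():
        term = one
        for name in mono:
            term = mul(term, assignment[name])
        for _ in range(coeff):  # coefficient n means "term" added n times
            total = add(total, term)
    return total

# An output tuple that can be derived by joining r1 with s1 OR r2 with s1:
# (r1 + r2) * s1 = r1*s1 + r2*s1
poly = times(plus(var("r1"), var("r2")), var("s1"))

# Counting semiring: how many copies of the output tuple under bag semantics?
copies = evaluate(poly, {"r1": 2, "r2": 1, "s1": 3},
                  operator.add, operator.mul, 0, 1)  # 2*3 + 1*3 = 9

# Boolean semiring: does the output survive if r1 is deleted?
survives = evaluate(poly, {"r1": False, "r2": True, "s1": True},
                    operator.or_, operator.and_, False, True)  # True
```

The point of the sketch is that the polynomial is computed once and the different questions (how many copies? is the tuple still there after a deletion?) are just different evaluations of it.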

During the workshop, it became clear to me that the consensus based models (which are often graphical in nature) can not only be formalized but also be directly connected to these database focused formalizations. I just needed to get over the differences in syntax. This could imply that we could have a nice way to trace provenance across systems and through databases and be able to understand the mathematical properties of this interconnection.

2) Social implications of producing provenance

For a couple of years now, I’ve been asked by people and have asked myself, so what do you do with provenance? I think there are a lot of good answers for that (e.g. requirements for provenance in e-science). However, the community has spent a lot of time thinking about how to capture provenance from a technical point of view, asking questions like: how do we instrument systems? how do we store provenance efficiently? can we leverage execution environments for tracing?

At Dagstuhl, Carole Goble asked another question: why would people record and share provenance in the first place? There are big social implications that we need to grapple with: producing provenance may expose information that we are not ready to share, it may require us to change work practice, leading to effort that we may not want to give, or it may be in a form that is too raw to be useful. Developing techniques to address these issues is from my point of view a new and important area of work.

For our part, we are starting to work on ideas for reconstructing provenance from data, which will hopefully reduce the effort for producers of provenance.

3) Provenance is important for messy data integration

A key use case for provenance is tracking back to original data sources after data has been integrated. This is particularly important when the data integration requires complex processing (e.g. natural language processing). Christopher Ré gave a fantastic example of this with a demonstration of the WiscI system, part of the Hazy project. This application enriches Wikipedia pages with facts collected from a (~40 TB) web crawl and provides links back to a supporting source for those facts. It was a great example of how provenance is really foundational to providing confidence in these systems.

Beyond these points, there was a lot more discussed, which will be summarized in the forthcoming report. This was a great workshop for me. I want to thank the organizers for putting it together. It’s a lot of effort. Additionally, thanks to all of the participants for really great conversations.
