I’m pleased to announce that at the beginning of 2015 I’ll be joining Elsevier Labs as Disruptive Technology Director 1.

In the past several years, there has been an explosion of creativity with respect to both research communication and research infrastructure. Whether it’s new ways to think about the impact of research (e.g. altmetrics), the outsourcing of experiments (e.g. Science Exchange) or the impact of massive datasets on the creation of large scale models (e.g. Big Mechanism), this is an exciting space to be in.

I’ve been lucky to be part of teams2 that have been addressing the issues of research infrastructure and communication through novel computer science. At Elsevier Labs, I’ll continue to focus on this area in an environment with amazing data, resources, people and potential for impact. This ability to focus is one of the reasons I’ve decided to make the jump from academia.3 In my new position, I’ll probably be out and about even more talking and writing about this area.4

Finally, a word on open science. My view on open science is strongly shaped by Cameron Neylon’s articulation of the need to reduce friction in the science system.5 The removal of barriers is central to being able to do science better. I think there is a strong role for commercial organizations to facilitate this reduction in friction in an open environment.6 Indeed, the original role of publishers did just that. From my discussions with the Labs team and others at Elsevier, the organization is absolutely receptive to this view and is moving in this direction.7 My hope is that I can help Elsevier use its many strengths to support a better, more open, and frictionless science ecosystem.


  1. See Horace Dediu’s Disruption FAQ
  2. e.g. Open PHACTS, Data2Semantics, SMS, Wings, PASOA 
  3. Lada Adamic does a much better job of summing up the reasons for leaving academia and discussing the trade-offs. Many of her points ring true to me. 
  4. I really like writing trip reports 
  5. See also Please Keep it Simple 
  6. In a completely different context, see this discussion of how DigitalOcean works with open source.
  7. e.g. Mendeley, Research Data Services @ Elsevier 

Welcome to a massive multimedia extravaganza trip report from Provenance Week, held earlier this month, June 9-13.

Provenance Week brought together two workshops on provenance plus several co-located events. It had roughly 65 participants. It’s not a huge event but it’s a pivotal one for me as it brings together all the core researchers working on provenance from a range of computer science disciplines. That means you hear the latest research on the topic ranging from great deployments of provenance systems to the newest ideas on theoretical properties of provenance. Here’s a picture of the whole crew:

Given that I’m deeply involved in the community, it’s going to be hard to summarize everything of interest because…well…everything was of interest. It also means I had a lot of stuff going on. So what was I doing there?

Activities


 

PROV Tutorial

Together with Luc Moreau and Trung Dong Huynh, I kicked off the week with a tutorial on the W3C PROV provenance model. The tutorial was based on my recent book with Luc. From my count, we had ~30 participants for the tutorial.

We’ve given tutorials on PROV in the past, but we made a number of updates as PROV is becoming more mature. First, as the audience had a more diverse technical background, we started from a conceptual model (UML) point of view instead of a Semantic Web perspective. Furthermore, we presented both tools and recipes for using PROV. The number of tools we now have out for PROV is growing – ranging from conversion of PROV from various version control systems to neuroimaging workflow pipelines that support PROV.

I think the hit of the show was Dong’s demonstration of interacting with PROV using his Prov python module (pypi) and Southampton’s Prov Store.
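For readers who have not seen PROV before, here is a minimal sketch of the core idea in plain Python. It is not Dong's prov module or its API; the `ProvDocument` class, its method names and the `ex:` identifiers are all made up for illustration, and the output strings are only loosely modelled on PROV-N. It just shows entities, activities and agents tied together by the `used`, `wasGeneratedBy` and `wasAssociatedWith` relations.

```python
from dataclasses import dataclass, field


@dataclass
class ProvDocument:
    """A tiny, illustrative stand-in for a PROV document (not the real prov API)."""
    entities: set = field(default_factory=set)
    activities: set = field(default_factory=set)
    agents: set = field(default_factory=set)
    relations: list = field(default_factory=list)  # (relation, subject, object)

    def used(self, activity, entity):
        # An activity consumed an entity.
        self.activities.add(activity)
        self.entities.add(entity)
        self.relations.append(("used", activity, entity))

    def was_generated_by(self, entity, activity):
        # An entity was produced by an activity.
        self.entities.add(entity)
        self.activities.add(activity)
        self.relations.append(("wasGeneratedBy", entity, activity))

    def was_associated_with(self, activity, agent):
        # An agent was responsible for an activity.
        self.activities.add(activity)
        self.agents.add(agent)
        self.relations.append(("wasAssociatedWith", activity, agent))

    def statements(self):
        """Render the document as strings loosely modelled on PROV-N."""
        lines = [f"entity({e})" for e in sorted(self.entities)]
        lines += [f"activity({a})" for a in sorted(self.activities)]
        lines += [f"agent({g})" for g in sorted(self.agents)]
        lines += [f"{rel}({s}, {o})" for rel, s, o in self.relations]
        return lines


# Hypothetical example: Alice runs a plotting step that turns a CSV into a chart.
doc = ProvDocument()
doc.used("ex:plot", "ex:data.csv")
doc.was_generated_by("ex:chart.png", "ex:plot")
doc.was_associated_with("ex:plot", "ex:alice")
print("\n".join(doc.statements()))
```

A real toolkit adds namespaces, attributes, timestamps and proper serializations on top of this, which is exactly the part you do not want to write yourself.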

Papers & Posters

I had two papers in the main track of the International Provenance and Annotation Workshop (IPAW) as well as a demo and a poster.

Manolis Stamatogiannakis presented his work with me and Herbert Bos – Looking Inside the Black-Box: Capturing Data Provenance using Dynamic Instrumentation. In this work, we looked at applying dynamic binary taint tracking to capture high-fidelity provenance on desktop systems. This work solves what’s known as the n-by-m problem in provenance systems: when an application is treated as a black box, each of its m outputs appears to depend on all n of its inputs. Essentially, it allows us to see how data flows within an application without having to instrument that application up-front. This lets us know exactly which output of a program is connected to which inputs. The work was well received and we had a bunch of different questions both around speed of the approach and whether we can track high-level application semantics. A demo video is below and you can find all the source code on github.
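To give a feel for what taint tracking buys you, here is a toy sketch in Python. It is not how our system works (that operates on binaries at run time); the `Tainted` wrapper, `run_program` and the file names are hypothetical. The point is only that if every value carries the set of inputs that influenced it, each output ends up linked to exactly the inputs it depends on rather than to all of them.

```python
class Tainted:
    """A value paired with the set of input labels that influenced it."""

    def __init__(self, value, sources=()):
        self.value = value
        self.sources = frozenset(sources)

    def _combine(self, other, op):
        # The result of an operation is tainted by the union of its operands' sources.
        if isinstance(other, Tainted):
            return Tainted(op(self.value, other.value), self.sources | other.sources)
        return Tainted(op(self.value, other), self.sources)

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._combine(other, lambda a, b: a * b)


def run_program(a, b, c):
    """Hypothetical 'application' with three inputs and two outputs."""
    out1 = a + b   # depends on a and b only
    out2 = c * 2   # depends on c only
    return out1, out2


out1, out2 = run_program(
    Tainted(1, {"a.txt"}),
    Tainted(2, {"b.txt"}),
    Tainted(5, {"c.txt"}),
)
print(sorted(out1.sources))  # inputs that influenced the first output
print(sorted(out2.sources))  # inputs that influenced the second output
```

A black-box view would have linked both outputs to all three files. The taint sets show that the first output never saw `c.txt` and the second never saw `a.txt` or `b.txt`.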

We also presented our work on converting PROV graphs to IPython notebooks for creating scientific documentation (Generating Scientific Documentation for Computational Experiments Using Provenance). Here we looked at how to create documentation from provenance that is gathered in a distributed setting and put it together in an easy-to-use fashion. This work was part of a larger discussion at the event on the connection between provenance gathered in these popular notebook environments and that gathered on more heterogeneous systems. Source code again on github.

I presented a poster on our recent work (with Marcin Wylot and Philippe Cudré-Mauroux) on instrumenting a triple store (i.e. graph database) with provenance. We use a long-standing technique from the database community, provenance polynomials, but apply it to large-scale RDF graphs. It was good to be able to present this to those from the database community who were at the conference. I got some good feedback, in particular on some efficiencies we might implement.
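If provenance polynomials are new to you, the following small Python sketch shows the general idea as I understand it from the database literature. It is not the implementation in our triple store; the triples, the `t1`..`t4` annotation names and the helper functions are made up for illustration. Each triple gets a variable, combining triples in a join multiplies their variables, and alternative ways of deriving the same answer are added together.

```python
from collections import Counter


def var(name):
    """Polynomial consisting of a single variable."""
    return Counter({(name,): 1})


def plus(p, q):
    """Alternative derivations: add the polynomials."""
    result = Counter(p)
    result.update(q)  # Counter.update adds coefficients
    return result


def times(p, q):
    """Joint use of inputs (a join): multiply the polynomials."""
    result = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            result[tuple(sorted(m1 + m2))] += c1 * c2
    return result


def show(p):
    """Render a polynomial as text, e.g. 't1*t3 + t2*t4'."""
    return " + ".join(
        ("" if c == 1 else f"{c}*") + "*".join(m) for m, c in sorted(p.items())
    )


# Hypothetical annotated triples: (subject, predicate, object, annotation).
triples = [
    ("alice", "knows", "bob", "t1"),
    ("alice", "knows", "carol", "t2"),
    ("bob", "worksAt", "VU", "t3"),
    ("carol", "worksAt", "VU", "t4"),
]


def who_knows_someone_at(org, triples):
    """Answer '?x knows ?y . ?y worksAt org' and annotate each ?x with its polynomial."""
    results = {}
    for s1, p1, o1, a1 in triples:
        if p1 != "knows":
            continue
        for s2, p2, o2, a2 in triples:
            if p2 == "worksAt" and s2 == o1 and o2 == org:
                derivation = times(var(a1), var(a2))
                results[s1] = plus(results.get(s1, Counter()), derivation)
    return results


answers = who_knows_someone_at("VU", triples)
print(show(answers["alice"]))  # t1*t3 + t2*t4
```

Reading the polynomial back: alice is in the answer either because of triples `t1` and `t3` together, or because of `t2` and `t4` together. Delete `t1` and she is still an answer via the second derivation.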

 

I also demoed (see above) the really awesome work by Rinke Hoekstra on his PROV-O-Viz provenance visualization service (Paper, Code). This was a real hit, with a number of people wanting to integrate it with their provenance tools.

Provenance Reconstruction + ProvBench

At the end of the week, we co-organized with the ProvBench folks an afternoon about challenge tasks and benchmark datasets. In particular, we looked at the challenge of provenance reconstruction – how do you recreate provenance from data when you didn’t track it in the first place? Together with Tom De Nies, we produced a number of datasets for use with this task. It was pretty cool to see that Hazeline Asuncion used these datasets in one of her classes, where her students used a wide variety of off-the-shelf methods.

From the performance scores, precision was ok but very dataset dependent and relies a lot on knowledge of the domain. We’ll be working with Hazeline to look at defining different aspects of this problem going forward.

Provenance reconstruction is just one task where we need datasets. ProvBench is focused on gathering those datasets and also defining new challenge tasks to go with them. Check out this github for a number of datasets. The PROV standard is also making it easier to consume benchmark datasets because you don’t need to write a new parser to get hold of the data. The dataset I most liked was the Provenance Capture Disparities dataset from the Mitre crew (paper). They provide a gold standard provenance dataset capturing everything that goes on in a desktop environment, plus two different provenance traces from different kinds of capture systems. This is great both for testing provenance reconstruction and for looking at how to merge independent capture sources to achieve a full picture of provenance.

There is also a nice tool to convert Wikipedia edit histories to PROV.

Themes


I think I picked out three large themes from provenance week.

  1. Transparent collection
  2. Provenance aggregation, slicing and dicing
  3. Provenance across sources

Transparent Collection

One issue with provenance systems is getting people to install provenance collection systems in the first place, let alone installing new modified provenance-aware applications. A number of papers reported on techniques aimed at making capture more transparent.

A couple of approaches tackled this for programming languages. One system focused on R (RDataTracker) and the other on Python (noWorkflow). I particularly enjoyed the noWorkflow Python system as they provided not only transparent capture for provenance systems but also a number of utilities for working with the captured provenance, including a diff tool and a conversion from provenance to Prolog rules (I hope Jan reads this). The Prolog conversion includes rules that allow for provenance-specific queries to be formulated. (On Github). noWorkflow is similar to Rinke’s PROV-O-Matic tool for tracking provenance in Python (see video below). I hope we can look into sharing work on a really good Python provenance solution.

An interesting discussion point that arose from this work was – how much should we expose provenance to the user? Indeed, the team that did RDataTracker specifically inserted simple on/off statements in their system so the scientific user could control the capture process in their R scripts.

Tracking provenance by instrumenting at the operating system level has long been an approach to provenance capture. Here, we saw a couple of techniques that tried to reduce that tracking to simply launching a system background process in user space while improving the fidelity of provenance. This was the approach of our system Data Tracker and Cambridge’s OPUS (specific challenges in dealing with interposition on the std lib were discussed). Ashish Gehani was nice enough to work with me to get his SPADE system set up on my Mac. It was pretty much just a checkout, build, and run to start capturing reasonable provenance right away – cool.

Databases have consistently been a central place for provenance research. I was impressed by Boris Glavic’s vision (paper) of a completely transparent way to report provenance for database systems by leveraging two common database functions - time travel and an audit log. Essentially, through the use of query rewriting and query replay he’s able to capture/report provenance for database query results. Talking to Boris, they have a lot of it implemented already in collaboration with Oracle. Based on prior history (PostgreSQL with provenance), I bet it will happen shortly. What’s interesting is that his approach requires no modification of the database and instead sits as middleware above the database.
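Here is how I picture the combination of time travel and an audit log, as a toy Python sketch. This is my own simplification and not Boris's design: `VersionedTable`, `run_query`, `replay_with_provenance` and the in-memory `audit_log` are all hypothetical. Normal queries run untouched and are only logged; provenance is computed afterwards by replaying a logged query against the table state as of the time it originally ran.

```python
import bisect


class VersionedTable:
    """Keeps every committed snapshot so past states can be queried ('time travel')."""

    def __init__(self):
        self._times = []
        self._snapshots = []

    def commit(self, time, rows):
        # Assumes commits arrive in increasing time order.
        self._times.append(time)
        self._snapshots.append(list(rows))

    def as_of(self, time):
        # Latest snapshot committed at or before the given time.
        idx = bisect.bisect_right(self._times, time) - 1
        return self._snapshots[idx] if idx >= 0 else []


audit_log = []  # (time, predicate) for every query that was run


def run_query(table, time, predicate):
    """Ordinary query path: no provenance work, just an audit log entry."""
    audit_log.append((time, predicate))
    return [row for row in table.as_of(time) if predicate(row)]


def replay_with_provenance(table, log_entry):
    """Replay a logged query, rewritten to also report which stored row produced each result."""
    time, predicate = log_entry
    snapshot = table.as_of(time)
    return [
        (row, ("as_of", time, index))
        for index, row in enumerate(snapshot)
        if predicate(row)
    ]


table = VersionedTable()
table.commit(1, [{"id": 1, "amount": 10}])
table.commit(5, [{"id": 1, "amount": 10}, {"id": 2, "amount": 50}])


def at_least_ten(row):
    return row["amount"] >= 10


early = run_query(table, 3, at_least_ten)  # sees only the first snapshot
late = run_query(table, 6, at_least_ten)   # sees both rows
explained = replay_with_provenance(table, audit_log[0])
print(explained)
```

The appeal is that the capture-time cost is just the log entry, and nothing in the storage layer itself has to change.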

Finally, in the discussion session after the TaPP practice session, I asked the presenters who represented the range of these systems to ballpark what kind of overhead they saw for capturing provenance. The conclusion was that we could get between 1% and 15% overhead. In particular, for deterministic replay style systems you can really press down the overhead at capture time.

Provenance aggregation, slicing and dicing

I think Susan Davidson said it best in her presentation on provenance for crowdsourcing – we are at the OLAP stage of provenance. How do we make it easy to combine, recombine, summarize, and work with provenance? What kind of operators, systems, and algorithms do we need? Two interesting applications came to the fore for this kind of need – crowdsourcing and security. Susan’s talk exemplified the former, but at the Provenance Analytics event there were several other examples (Huynh et al., Dragon et al.).

The other area was security. Roly Perera presented his impressive work with James Cheney on cataloging various mechanisms for transforming provenance graphs for the purposes of obfuscating or hiding sensitive parts of the provenance graph. This paper is great reference material for various mechanisms to deal with provenance summarization. One technique for summarization that came up several times, in particular with respect to this domain, was the use of annotation propagation through provenance graphs (e.g. see ProvAbs by Missier et al. and work by Moreau’s team).
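As a rough illustration of annotation propagation, here is a short Python sketch. It is one generic reading of the idea and not the ProvAbs algorithm or anything from the papers above; the `propagate` function, the node names and the "sensitive" label are assumptions of mine. Labels placed on a few nodes are pushed along derivation edges so that everything derived from a labelled node inherits the label.

```python
from collections import deque


def propagate(derived_from, seeds):
    """Push labels from seed nodes to everything derived from them.

    derived_from maps a node to the nodes it was derived from;
    seeds maps a node to its initial set of labels.
    Returns a dict from node to the set of labels it ends up with.
    """
    # Invert the derivation edges so we can walk downstream.
    downstream = {}
    for node, sources in derived_from.items():
        for src in sources:
            downstream.setdefault(src, set()).add(node)

    labels = {node: set(ls) for node, ls in seeds.items()}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, ()):
            before = len(labels.setdefault(child, set()))
            labels[child] |= labels[node]
            if len(labels[child]) > before:  # only revisit when something changed
                queue.append(child)
    return labels


# Hypothetical derivation graph.
derived_from = {
    "cleaned": ["raw_patients"],
    "report": ["cleaned", "public_stats"],
    "summary": ["public_stats"],
}
labels = propagate(derived_from, {"raw_patients": {"sensitive"}})
print(sorted(labels))  # nodes that carry at least one label
```

Once the labels are in place, a summarization or obfuscation step can decide what to do with the labelled part of the graph, for example collapsing it into a single abstract node before sharing.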

Provenance across sources

The final theme I saw was how to connect provenance across sources. One could also call this provenance integration. Both Chapman and the Mitre crew with their Provenance Plus tracking system and Ashish with his SPADE system are experiencing this problem of provenance coming from multiple different sources and needing to integrate these sources to get a complete picture of provenance both within a system and spanning multiple systems. I don’t think we have a solution yet but they both (Ashish, Chapman) articulated the problem well and have some good initial results.

This is not just a systems problem; it’s fundamental that provenance extends across systems. Two of the cool use cases I saw exemplified the need to track provenance across multiple sources.

The Kiel Center for Marine Science (GEOMAR) has developed a provenance system to track their data throughout their entire organization, stemming from data collected on their boats all the way through to a data publication. Yes, you read that right, provenance gathered on awesome boats! This involves digital pens, workflow systems and data management systems.

The other was the recently released US National Climate Change Assessment. The findings of that report stem from 13 different institutions within the US Government. The data backing those findings is represented in a structured fashion including the use of PROV. Curt Tilmes presented more about this amazing use case at Provenance Analytics.

In many ways, the W3C PROV standard was created to help solve these issues. I think it does help but having a common representation is just the start.


Final thoughts

I didn’t mention it but I was heartened to see that the community has taken to using PROV as a mechanism for interchanging data and for having discussions. My feeling is that if you can talk provenance polynomials and PROV graphs, you can speak with pretty much anybody in the provenance community no matter which “home” they have – whether systems, databases, scientific workflows, or the semantic web. Indeed, one of the great things about provenance week is that one was able to see diverse perspectives on this cross-cutting concern of provenance.

Lastly, there seemed to be many good answers at provenance week but, more importantly, lots of good questions. Now, I think as a community we should really expose more of the problems we’ve found to a wider audience.

Random Notes

  • It was great to see the interaction between a number of different services supporting PROV (e.g. git2prov.org, prizims, prov-o-viz, prov store, prov-pings, PLUS)
  • ProvBench on datahub – thanks Tim
  • DLR did a fantastic job of organizing. Great job Carina, Laura and Andreas!
  • I’ve never had happy birthday sung to me by 60 people at a conference dinner – surprisingly in tune – Kölsch is pretty effective. Thanks everyone!
  • Stefan Woltran’s keynote on argumentation theory was pretty cool. Really stepped up to the plate to give a theory keynote the night after the conference dinner.
  • Speaking of theory, I still need to get my head around Bertram’s work on Provenance Games. It looks like a neat way to think about the semantics of provenance.
  • Check out Daniel’s trip report on provenance week.
  • I think this is long enough…..

I seem to be a regular attendee of the Extended Semantic Web Conference series (2013 trip report). This year ESWC was back in Crete, which means that you can get photos like the one below taken to make your colleagues jealous:

[Photo: 2014-05-26 18.11.15]

 

As I write this, the conference is still going on but I had to leave early to head to Iceland where I will briefly gate crash the natural language processing crowd at LREC 2014. Let’s begin with the stats of ESWC:

  • 204 submissions
  • 25% acceptance rate
  • ~ 4.5 reviews per submission

The number of submissions was up from last year. I don’t have the numbers on attendance but it seemed in-line with last year as well. So, what was I doing at the conference?

This year ESWC introduced a semantic web evaluation track. We participated in two of these new evaluation tracks. I showed off our linkitup tool for the Semantic Web Publishing Challenge [paper]. The tool lets you enrich research data uploaded to Figshare with links to external sites. Valentina Maccatrozzo presented her contribution to the Linked Open Data Recommender Systems challenge. She’s exploring using richer semantics to do recommendation, which, from the comments on her poster, was seen as a novel approach by the attendees. Overall, I think all our work went over well. However, it would be good to see more of the VU Semweb group content in the main track. The Netherlands only had 14 paper submissions. It was also nice to see PROV mentioned in several places. Finally, conferences are great places to do face-2-face work. I had nice chats with quite a few people, in particular with Tobias Kuhn on the development of the nanopublications spec and with Avi Bernstein on our collaboration leveraging his group’s Signal & Collect framework.

So what were the big themes of this year’s conference? I pulled out three:

  1. Easing development with Linked Data
  2. Entities everywhere
  3. Methodological maturity

Easing development

As a community, we’ve built interesting infrastructure for machine readable data sharing, querying, vocabulary publication and the like. Now that we have all this data, the community is turning towards making it easier to develop applications with it. This is not necessarily a new problem and people have tackled it before (e.g. ActiveRDF). But the availability of data seems to be renewing attention to this problem. This was reflected by Stefan Staab’s Keynote on Programming the Semantic Web. I think the central issue he identified was how to program against flexible data models that are the hallmark of semantic web data. Stefan argued strongly for static typing and programmer support but, as an audience member noted, there is a general trend in development circles towards document style databases with weaker type systems. It will be interesting to see how this plays out.

Aside: A thought I had was whether we could easily publish the type systems that developers create when programming back out onto the web and merge them with existing vocabularies….

This notion of easing development was also present in the SALAD workshop (a workshop on APIs). This is dear to my heart. I’ve seen in my own projects how APIs really help developers make use of semantic data when building applications. There was quite a lot of discussion around the role of SPARQL with respect to APIs as well as whether to supply data dumps or an API and what type of API that should be. I think it’s fair enough to say that Web APIs are winning (see the paper RESTful or RESTless – Current State of Today’s Top Web APIs), and we need to devise systems that deal with that while still leveraging all our semantic goodness. That being said, it’s nice to see mature tooling appearing for Linked Data/Semantic Web data (e.g. RedLink tools, Marin Dimitrov’s talk on selling semweb solutions commercially).

Entities everywhere

There were a bunch of papers on entity resolution, disambiguation, etc. Indeed, Linked Data provides a really fresh arena to do this kind of work as both the data and schemas are structured and yet at the same time messy. I had quite a few nice discussions with Pedro Szekely on the topic and am keen to work on getting some of our ideas on linking into the Karma system he is developing with others. From my perspective, two papers caught my eye. One was on using coreference to actually improve SPARQL query performance. Oftentimes we think of all these equality links as a performance penalty, so it’s interesting to think about whether they can actually help us improve performance on different tasks. The other paper was “A Probabilistic Approach for Integrating Heterogeneous Knowledge Sources”, which uses Markov Logic Networks to align web information extraction data (e.g. NELL) to DBpedia. This is interesting as it allows us to enrich clean background knowledge with data gathered from the web. It’s also neat in that it’s another example of the combination of statistical inference and (soft) rules.

This emphasis on entities is in contrast with the thought-provoking keynote by Oxford philosopher Luciano Floridi, who discussed various notions of complexity and argued that we need to think not in terms of entities but in terms of interactions. This was motivated by the following statistic – that by 2020 there will be 7.5 billion people vs. 50 billion devices, and all of these things will be interconnected and talking.

Indeed, while dealing with entities, especially in messy data, is far from being a solved problem, we are starting to see dynamics emerging as a clear area of interest. This is reflected by the best student paper Hybrid Acquisition of Temporal Scopes for RDF Data.

Methodological maturity

The final theme I wanted to touch on was methodological maturity. The semantic web project is 15 years old (young in scientific terms) and the community has now become focused on having rigorous evaluation criteria. I think every paper I saw at ESWC had a strong evaluation section (or at least a strongly defensible one). This is a good thing! However, this focus pushes people towards safe methodological choices (witness the plethora of papers that use LUBM), which can in turn lead towards safe research. We had an excellent discussion about this trend in the EMPIRICAL workshop – check out a brief write up here. Indeed, it makes one wonder whether

  1. these simpler methodologies (my system is faster than yours on benchmark x) exacerbate a tendency to do engineering and not answer scientific questions; and
  2. the amalgamation of ideas that characterizes semantic web research is toned down, leading to less exciting research.

One answer to this trend is to encourage a more wide spread acceptance and knowledge of different scientific methodologies (e.g. ethnography), which would allow us to explore other areas.

Finally, I would recommend Abraham Bernstein & Natasha Noy – “Is This Really Science? The Semantic Webber’s Guide to Evaluating Research Contributions”, which I found out about at the EMPIRICAL workshop.

Final Notes

Here are some other pointers that didn’t fit into my themes.

 

A couple of weeks ago, I was at the European Data Forum in Athens talking about the Open PHACTS project. You can find a video of my talk with slides here. Slides are embedded below.

One of my guilty pleasures is listening to Mac-oriented tech podcasts. One that I listen to is the Accidental Tech Podcast, which features Marco Arment (of Tumblr and Instapaper fame), John Siracusa (of Ars Technica and long Mac reviews fame) and Casey Liss (…). All three are programmers working on everything from .NET consultancy to iOS apps. As somebody who has spent my career as part of computer science research/higher education, I find it interesting to hear what people in the software industry find useful from their education. So I sent the guys the following question:

I actually had a question related to the whole software methodology discussion. I’m a CS professor and I’m always curious what particular things that we teach turn out to be useful in the end. You had asked each other what one thing you would take from software methodology. My question is what are the one/two things from your CS education that you find the most useful when coding?

On their recent show (#56, The Woodpecker), they answered the question (starting 35:30 in). You can listen to their thoughtful answers. But I’ll try to summarize it. I heard 3 main points:

  1. Learning from the ground up. They talked about the importance of learning the entire stack from designing a chip on up. In particular, knowing operating systems, memory management (pointers!) and assembly language helps them make smarter decisions while programming. It’s not that you use these “low-level/behind the scenes” things often in practice but understanding them helps one make better choices at higher levels of abstraction.
  2. Dealing with diversity. They pointed out how they learned to use multiple different pieces of technology during their degrees. Marco singled out what I would call a programming languages course. This is a course where you learn and program a little bit in all types of languages and learn about the concepts that underlie them (e.g. functional vs. imperative, pass-by-reference vs. pass-by-value etc.). This means that learning a new language in the real world, whether it’s Objective-C or Perl, is that much easier. In general, getting practice in picking up a new technology and applying it immediately to a problem was seen as helpful.
  3. Core concepts and principles. They noted that having learned core CS topics like data structures and algorithms and general CS principles was useful. It’s not that they are used every day but “knowing what to look up on wikipedia” is useful. They also noted that in business there is less/no time to learn these core principles. Furthermore, it’s hard to learn them if you’re not forced to do so.

From my perspective, it’s nice to hear a response that fits with what I (and I think most CS professors) would say. We should be teaching core concepts and principles and letting students learn the whole stack of computing. The one thing I think I’ll probably take away from this for our own curriculum is maybe not to worry so much about consistency in programming languages across courses. Indeed, that may be a feature not a bug.

Anyway, if you’re interested in this sort of thing, check out the podcast.

This is my 100th blog post here at Think Links. I started blogging October 23, 2008 with a post about the name of the blog. That’s about 5 years of blogging averaging about 20 posts a year. So not a huge amount but consistent. This blog is what I would consider an academic blog or at least a work related one. As a forum of scholarly communication, I’ve found blogging to be very beneficial. Here are 10 things that I like personally about the medium (yes, a listicle!):

  1. It provides a home for material that is useful but wouldn’t belong in a more formal setting. For example, comments on work practice, teaching or neat randomly related stuff.
  2. It’s quick. If I have something to note, I can just put it out there.
  3. The public nature forces me to make my own notes better. In particular, I’ve been doing trip reports, which have been really helpful in synthesizing my notes on various events. Even though most are not read, the fact that they are public makes my writing more coherent.
  4. Embedding multimedia. It provides a way to aggregate a lot of different content into one place. Lately, I’ve been using the embed tweet feature to capture some of that conversation in context.
  5. Memories of the 5 paragraph essay. I had a very good history teacher in high school who drilled into us how to write 5 paragraph essays quickly. I find posts fairly easy to write because of this training. (I know there’s criticism of this style but I think the form helps to write.)
  6. It lets me put another take on research papers that we’ve done in a more personal voice.
  7. A single searchable history. Reverse chronological order is a helpful way to review what’s gone on. Furthermore, because it’s on the web you get all that fancy search stuff.
  8. Analytics are fun to look at. – altmetrics anyone?
  9. It’s part of the future of academic discourse…
  10. Links.

There’s more I’d like to do with this blog. Publishing directly from code. Personal videos. Interactive visualizations. Whether I do those things or not, having this space on the web in this format has been great for my own thinking and I hope for others as well. If you’re reading this, thanks and I hope you keep following.
