
Monthly Archives: December 2008

Wired Science Blog has an interesting post up about the increase in wave height around the world, and particularly at a surfing spot in the US Northwest, reported at this year’s American Geophysical Union meeting. The key excerpt from the post that I want to focus on is the following:

“This is high quality data and you didn’t have enough data to do this kind of analysis until very recently,” Ruggiero said.

This is a great example of data-driven science, where the availability of data (in this case from a buoy) leads to new hypotheses and new theories. This approach to science was a big topic at the e-Science conference last week, as it relies fundamentally on IT. Furthermore, one of the interesting things to note was that after the discovery of this change in wave height in selected locations, researchers wanted even more data from around the globe.

As scientists increasingly rely on data from multiple sources furnished from around the globe, it becomes vital that mechanisms exist to easily find out where the data is coming from.
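To make that concrete, here’s a minimal sketch of what I mean by knowing where data comes from: a small record that travels with the dataset. The field names, URL, and buoy ID below are purely illustrative, not any particular standard:

```python
# A minimal sketch of a source record attached to a dataset, so downstream
# users can trace where the data came from. All names here are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class SourceRecord:
    source_url: str      # where the raw data was fetched from (placeholder URL)
    instrument_id: str   # e.g., the buoy that produced the readings (made up)
    retrieved_at: datetime

wave_heights = [2.1, 2.4, 3.0]  # meters; placeholder values
provenance = SourceRecord(
    source_url="https://example.org/buoy-archive",
    instrument_id="buoy-12345",
    retrieved_at=datetime.now(timezone.utc),
)
print(f"{len(wave_heights)} readings from {provenance.instrument_id}")
```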

I’m watching a talk about swiss-experiment.ch, which is a great project using a semantic wiki as a data infrastructure for a sensor network. The presenter started out by showing a graph of temperature data from a sensor in the Swiss Alps in winter. The graph looked reasonable (i.e. it was cold). However, he pointed out that in fact the sensor was broken and the data was invalid. His quote (roughly):

Never trust someone else’s sensor without the metadata.

Fundamentally, you need provenance to enable the confident use of other people’s data.
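As a toy illustration of what such a check might look like in practice, here’s a sketch in Python. The metadata fields (a status flag and a calibration date) are my invention, not what swiss-experiment.ch actually stores:

```python
# A minimal sketch, assuming sensor metadata carries a status flag and a
# calibration date; field names are illustrative only.
from datetime import date

def trustworthy(metadata: dict, max_calibration_age_days: int = 365) -> bool:
    """Reject readings from sensors flagged as faulty or long uncalibrated."""
    if metadata.get("status") != "ok":
        return False
    age = (date.today() - metadata["last_calibrated"]).days
    return age <= max_calibration_age_days

sensor_meta = {"status": "broken", "last_calibrated": date(2008, 1, 15)}
if not trustworthy(sensor_meta):
    print("Discarding readings: the metadata says this sensor cannot be trusted")
```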

Hi, I’m giving a talk  at e-Science 2008 this morning on a distributed provenance query algorithm I developed. You can actually tune in by going to the multimedia portion of the e-Science site. For those of you watching the talk, I’d be interested in your comments, which you can leave here. Here’s the abstract of the talk:

As computational techniques for tracking provenance have become more widely used, applications are beginning to produce large quantities of provenance information. Furthermore, many of these applications are composed from distributed components (e.g., scientific workflows) that may, for reasons of scalability, security, or policy, need to store this information across multiple sites. In this paper, we describe an algorithm, D-PQuery, for determining the provenance of data from distributed sources of provenance information in a parallel fashion. To enable scientists to use D-PQuery on already existing Grid infrastructure, we present an implementation of the algorithm as a Condor DAGMan workflow that works across Kickstart records, which are produced in several production e-Science applications including the example application used in this paper, the astronomy application Montage. Initial performance benchmarks are also presented.
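The paper has the details, but the basic pattern is a parallel fan-out over the provenance stores at each site. Here’s a toy sketch of that pattern in Python; it is not D-PQuery itself (no DAGMan, no Kickstart records), and the sites and edge format are made up:

```python
# A sketch of querying distributed provenance stores in parallel: a
# breadth-first ancestor search that asks every site about the current
# frontier at each level. The stores and file names are invented.
from concurrent.futures import ThreadPoolExecutor

# Each site maps an artifact to the artifacts it was derived from.
site_stores = [
    {"result.fits": ["mosaic.fits"]},              # site A
    {"mosaic.fits": ["raw1.fits", "raw2.fits"]},   # site B
]

def ancestors_at_site(store, items):
    return [a for x in items for a in store.get(x, [])]

def provenance(artifact):
    """Return every artifact the given one was (transitively) derived from."""
    seen, frontier = set(), {artifact}
    with ThreadPoolExecutor() as pool:
        while frontier:
            hits = list(pool.map(
                lambda store, f=frontier: ancestors_at_site(store, f),
                site_stores))
            frontier = {a for h in hits for a in h} - seen
            seen |= frontier
    return seen

print(provenance("result.fits"))  # {'mosaic.fits', 'raw1.fits', 'raw2.fits'}
```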

Unfortunately, I’m on at the same time as End-to-End e-Science: Integrating Workflow, Query, Visualization, and Provenance at an Ocean Observatory, which I’d like to see. Maybe I can use the video archives…we’ll see…

Just saw a great talk by Alexander Szalay about the work at Johns Hopkins University to develop a cluster for data-intensive computing. The resulting cluster just won the HPC Storage Challenge at Supercomputing 2008.

Their approach is to reach back to Gene Amdahl’s rules of thumb for computer architecture and apply them to large-scale parallel machines. See Szalay’s article with Gordon Bell and Jim Gray in IEEE Computer 2006 (10.1109/MC.2006.29). It’s a nice way of looking at how to design such machines, and it works.
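For the curious, the rules are easy to apply as back-of-the-envelope arithmetic. As I remember the article, a balanced system has roughly one bit of sequential I/O per second per instruction per second (the Amdahl number) and roughly one byte of memory per instruction per second. The machine numbers below are made up for illustration:

```python
# Back-of-the-envelope check of Amdahl's balance ratios, as revisited by
# Bell, Gray, and Szalay. The hardware figures here are invented.
cpu_ips = 10e9        # instructions per second across the cluster
io_bits_per_s = 8e9   # sequential I/O bandwidth, bits per second
memory_bytes = 12e9   # total RAM, bytes

amdahl_number = io_bits_per_s / cpu_ips  # balanced system: ~1 bit of I/O per instruction
memory_ratio = memory_bytes / cpu_ips    # balanced system: ~1 byte of memory per IPS

print(f"Amdahl number: {amdahl_number:.2f} (want ~1)")
print(f"Memory ratio:  {memory_ratio:.2f} (want ~1)")
```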

This week I’m attending the Microsoft e-Science Workshop and the IEEE e-Science Conference so I’ll be posting some quick links of things I find interesting.

The first quick link is to e-Science Central, a project by Paul Watson that provides a science platform for scientists, leveraging existing cloud computing infrastructure. In particular, I like the ability to dynamically deploy services from a service repository. This is critical for provenance: we need a mechanism to be able to get hold of old analysis routines.
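Just to sketch the idea (this is not e-Science Central’s actual API, merely an illustration): if routines are registered under a name and version, a provenance record only needs to store that pair to retrieve the exact routine later:

```python
# A minimal sketch of a versioned routine repository, so an old analysis
# can be re-run exactly. All names here are hypothetical.
registry = {}

def register(name, version):
    """Decorator that files a routine under (name, version)."""
    def wrap(fn):
        registry[(name, version)] = fn
        return fn
    return wrap

@register("smooth", "1.0")
def smooth_v1(xs):
    # Simple moving average over a window of up to three points.
    return [sum(xs[max(0, i - 1):i + 2]) / len(xs[max(0, i - 1):i + 2])
            for i in range(len(xs))]

# A provenance record can store ("smooth", "1.0") and later fetch
# exactly the routine that produced a result:
routine = registry[("smooth", "1.0")]
print(routine([1.0, 4.0, 1.0]))
```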

While working in Boston, I noticed a flyer for a talk to be given by the CEO of Jove (Journal of Visualized Experiments). Jove is a site that lets scientists post videos of the protocols they develop for use in the lab. This is a great addition to text-based protocols like those found at Nature Protocols. The site is nicely done with links to the original paper, comments, tags, etc. An important point is that all the videos are peer reviewed so that the quality is maintained.

Here’s a link to one example: Staining Protocol in Gels. Unfortunately, I don’t think the site allows you to embed the video protocols in other sites.

This is a great example of how technology can support the reproducibility of scientific work. I see this as a great adjunct to computational workflows where the input to the workflow comes from a bench experiment. For example, one could track the provenance of a digital result back through the computational procedure to the video of the procedure the lab scientists used to generate the input data. (myExperiment + Jove anyone?)
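To sketch what that combination could look like: a provenance chain whose first step points at a protocol video rather than a computation. The records, file names, and the video URL below are all hypothetical placeholders:

```python
# A minimal sketch of a provenance chain that starts at the bench. The
# Jove URL is a placeholder, not a real protocol page.
provenance_chain = [
    {"step": "bench protocol", "video": "https://www.jove.com/...",
     "output": "gel_image.tif"},
    {"step": "image analysis workflow", "input": "gel_image.tif",
     "output": "band_intensities.csv"},
]

def trace(chain, artifact):
    """Walk back from a digital result to the lab procedure that fed it."""
    for record in reversed(chain):
        if record["output"] == artifact:
            print(f"{artifact} was produced by: {record['step']}")
            if "video" in record:
                print(f"  see protocol video: {record['video']}")
            artifact = record.get("input", artifact)

trace(provenance_chain, "band_intensities.csv")
```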

I’ve been on vacation during the week of Thanksgiving, but before I took off to Amsterdam I gave a talk on provenance for multi-institutional applications for the ISI Intelligent Systems Division AI Seminar series. It’s about an hour long with questions and answers. If you have time, let me know what you think. The slides are on the talk’s page as well, so you can zip through those if you don’t have time to listen to the whole thing.

I recommend checking out the AI Seminar page. We have some really great speakers come and talk to us at ISD and most of the talks are streamed and archived.
