Wired Science Blog has an interesting post up about the increase in wave heights around the world, and in particular at a surfing spot in the US Northwest, reported at this year's American Geophysical Union meeting. The key excerpt from the post that I want to focus on is the following:
“This is high quality data and you didn’t have enough data to do this kind of analysis until very recently,” Ruggiero said.
This is a great example of data-driven science, where the availability of data (in this case from a buoy) leads to new hypotheses and new theories. This approach to science was a big topic at the e-Science conference last week, as it relies fundamentally on IT. Furthermore, it was interesting to note that after discovering this change in wave height at selected locations, researchers wanted even more data from around the globe.
As scientists increasingly rely on data furnished from multiple sources around the globe, it becomes vital that mechanisms are in place to easily find out where the data comes from.
Me, in a discussion at the end of the SWBES08: Challenging Issues in Workflow Applications workshop. Lots of people are talking about provenance, and starting to build things as well, in the context of workflows for e-Science.
The back of me at e-Science.
I’m watching a talk about swiss-experiment.ch, a great project using a semantic wiki as the data infrastructure for a sensor network. The presenter started out by showing a graph of sensor data measuring temperature in the Swiss Alps in winter. The graph looked reasonable (i.e., it was cold). However, he pointed out that the sensor was in fact broken and the data was invalid. His quote (roughly):
Never trust someone else’s sensor without the metadata.
Fundamentally, you need provenance to enable the confident use of other people’s data.
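To make the point concrete, here is a minimal sketch (in Python, with an invented metadata structure — these field names don't come from any real sensor-network API) of how a consumer might refuse readings whose metadata flags a fault, or that carry no metadata at all:

```python
# Hypothetical sensor readings with attached metadata.
# The "metadata" structure is illustrative, not a real format.

def trustworthy(reading):
    """Accept a reading only if its metadata says the sensor was healthy."""
    meta = reading.get("metadata")
    if meta is None:
        return False  # no metadata: never trust someone else's sensor
    return meta.get("status") == "ok" and meta.get("calibrated", False)

readings = [
    {"value": -12.3, "metadata": {"status": "ok", "calibrated": True}},
    {"value": -11.8, "metadata": {"status": "fault", "calibrated": True}},
    {"value": -40.0},  # plausible-looking value, but no metadata at all
]

valid = [r["value"] for r in readings if trustworthy(r)]
print(valid)  # [-12.3]
```

The second reading looks perfectly plausible on a graph, which is exactly the trap the presenter described: only the metadata reveals that it should be discarded.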
Hi, I’m giving a talk at e-Science 2008 this morning on a distributed provenance query algorithm I developed. You can actually tune in by going to the multimedia portion of the e-Science site. For those of you watching the talk, I’d be interested in your comments, which you can leave here. Here’s the abstract of the talk:
As computational techniques for tracking provenance have become more widely used, applications are beginning to produce large quantities of provenance information. Furthermore, many of these applications are composed from distributed components (e.g., scientific workflows) that may, for reasons of scalability, security, or policy, need to store this information across multiple sites. In this paper, we describe an algorithm, D-PQuery, for determining the provenance of data from distributed sources of provenance information in a parallel fashion. To enable scientists to use D-PQuery on already existing Grid infrastructure, we present an implementation of the algorithm as a Condor DAGMan workflow that works across Kickstart records, which are produced in several production e-Science applications including the example application used in this paper, the astronomy application Montage. Initial performance benchmarks are also presented.
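The full algorithm is in the paper, but the core idea — following provenance edges across multiple stores, querying each one independently — can be sketched roughly as follows. The store layout and record fields here are invented for illustration; they are not the Kickstart record format, and a real deployment would run the per-site lookups as parallel jobs (e.g., as a DAGMan workflow) rather than in a local loop:

```python
from collections import deque

# Each "site" maps an artifact id to the artifacts it was derived from.
# These dicts stand in for remote provenance stores.
site_a = {"mosaic.fits": ["proj_1.fits", "proj_2.fits"]}
site_b = {"proj_1.fits": ["raw_1.fits"], "proj_2.fits": ["raw_2.fits"]}
sites = [site_a, site_b]

def provenance(artifact):
    """Breadth-first walk over all sites, collecting every ancestor."""
    ancestors, frontier = set(), deque([artifact])
    while frontier:
        current = frontier.popleft()
        for site in sites:  # in D-PQuery these lookups proceed in parallel
            for parent in site.get(current, []):
                if parent not in ancestors:
                    ancestors.add(parent)
                    frontier.append(parent)
    return ancestors

print(sorted(provenance("mosaic.fits")))
# ['proj_1.fits', 'proj_2.fits', 'raw_1.fits', 'raw_2.fits']
```

No single site holds the whole derivation history of the mosaic; the query only completes by stitching together answers from both stores, which is the situation the algorithm is designed for.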
Unfortunately, I’m on at the same time as End-to-End e-Science: Integrating Workflow, Query, Visualization, and Provenance at an Ocean Observatory, which I’d like to see. Maybe I can use the video archives…we’ll see…
Just saw a great talk by Alexander Szalay about the work at Johns Hopkins University to develop a cluster for data-intensive computing. The resulting cluster just won the HPC Storage Challenge at Supercomputing 2008.
Their approach is to reach back to Gene Amdahl’s rules of thumb for computer architecture and apply them to large-scale parallel machines. See his article with Gordon Bell and Jim Gray in IEEE Computer 2006 (10.1109/MC.2006.29). It’s a nice way of looking at how to design such machines, and it works.
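As a rough illustration of those rules of thumb — a balanced machine wants on the order of one bit of I/O per second and one byte of memory per instruction per second (see the Bell/Gray article for the precise statements) — here is a back-of-the-envelope calculator:

```python
def balanced_system(instr_per_sec):
    """Amdahl's balanced-system rules of thumb:
    ~1 bit/s of I/O and ~1 byte of memory per instruction per second."""
    return {
        "io_bytes_per_sec": instr_per_sec / 8,  # 1 bit of I/O per instruction/s
        "memory_bytes": instr_per_sec,          # 1 byte of memory per instruction/s
    }

# A 1 GIPS (10^9 instructions/s) processor would want roughly
# 125 MB/s of I/O bandwidth and 1 GB of memory to stay balanced.
spec = balanced_system(1e9)
print(spec["io_bytes_per_sec"], spec["memory_bytes"])  # 125000000.0 1e+09
```

The striking consequence, and the point of the talk, is that modern CPUs have run far ahead of disk bandwidth, so a data-intensive cluster has to be provisioned around storage throughput rather than raw compute.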
This week I’m attending the Microsoft e-Science Workshop and the IEEE e-Science Conference so I’ll be posting some quick links of things I find interesting.
The first quick link is to e-Science Central, a project by Paul Watson providing a scientific platform to scientists by leveraging existing cloud computing infrastructure. In particular, I like the ability to dynamically deploy services from a service repository; this is critical for provenance, since we need a mechanism to get hold of old analysis routines.