4store Amazon Machine Image and Billion Triple Challenge Data Set
As part of our entry to the 2009 Billion Triple Challenge (BTC), we have been using two pieces of great infrastructure: Amazon Web Services and the quad store 4store. Today, we are making an Amazon Machine Image (AMI) for 4store publicly available. Additionally, we are making available an Elastic Block Store (EBS) snapshot of the BTC dataset for 4store. Thus, developers can easily get started using 4store with a billion triples on Amazon’s cloud.
Technical Notes:
- We assume you have used Amazon EC2 before.
- The 4store AMI and the associated EBS snapshot are currently only available in the EU-West Amazon region.
- The ID of the AMI is: ami-62547f16
- The ID of the BTC snapshot is: snap-1a8c6073
- The 4store AMI is based on Debian Squeeze 64-bit. We use the AMI (ami-745b7000) provided by alestic.com as the starting point.
Using the 4store AMI:
- The AMI is 64-bit, so you need to start it on a 64-bit EC2 instance.
- Check out 4store.org for documentation about using 4store.
- If you’re going to use 4store without the BTC dataset, you need to create the directory /mnt/4store once the instance has started.
Using the 4store AMI with the BTC dataset:
- Start the AMI as above.
- Make sure that the Security Group you use allows HTTP traffic on the port range 4000-4060, as we start a 4store instance for roughly every 20 million triples.
- Create an EBS volume from the BTC snapshot and attach it to your EC2 instance.
- Mount the volume at /mnt/4store.
- In the root home directory (~/), you’ll find a shell script called btc.sh, which starts 4store for the BTC dataset. Run “btc.sh start”. This will launch all the 4store backends and HTTP servers. Startup takes a while, around 30 minutes to an hour.
- Once this is complete, you’ll be able to access the billion triples over the 50-some SPARQL endpoints that have been started on ports 4000 – 4057.
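Because the data is spread over one endpoint per port, a client has to talk to each port in turn. Here is a minimal Python sketch of how that could look. The host name is a placeholder, the helper names are made up for illustration, and the /sparql/ path is an assumption about how the 4store HTTP server exposes its endpoint, so check the 4store documentation before relying on it.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Port range taken from the notes above (4000 to 4057 inclusive).
FIRST_PORT = 4000
LAST_PORT = 4057


def endpoint_urls(host, first_port=FIRST_PORT, last_port=LAST_PORT, path="/sparql/"):
    """Build one endpoint URL per port. The "/sparql/" path is an assumption."""
    return [f"http://{host}:{port}{path}" for port in range(first_port, last_port + 1)]


def query_url(endpoint, sparql):
    """Encode a SPARQL query as a GET request URL for the given endpoint."""
    return endpoint + "?" + urlencode({"query": sparql})


def run_query(endpoint, sparql, timeout=60):
    """Send the query to one endpoint and return the raw response body as text."""
    request = Request(
        query_url(endpoint, sparql),
        headers={"Accept": "application/sparql-results+xml"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # "ec2-host.example.com" is a placeholder for your instance's public DNS name.
    urls = endpoint_urls("ec2-host.example.com")
    print(len(urls), "endpoints, first:", urls[0])
```

To query the whole dataset you would loop over endpoint_urls(...) and call run_query on each one, then merge the results yourself, since each endpoint only holds its own slice of the triples.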
Contact: pgroth@gmail.com
Have fun!
Paul Groth, Christophe Guéret, Stefan Schlobach
Knowledge Representation and Reasoning Group
Department of Artificial Intelligence
Vrije Universiteit Amsterdam
Where did that tweet come from?
Check out Dan Brickley’s post on the chaos around tweets about Iran. The key quote from the post in my opinion is:
Without tools to trace reports to their source, to claims about their source from credible intermediaries, or evidence, this isn’t directly useful. Even grassroots journalists needs evidence.
Even with retweets, it’s hard to figure out where information is coming from, and from whom, especially as it flows in real time.
Provenance Challenge 3 Workshop Today and Tomorrow
Today was the start of the Provenance Challenge 3 workshop. The challenge focused on interoperability between computational provenance systems using the Open Provenance Model. Fourteen teams participated and are now presenting their submissions and discussing what’s next for the model and the community at the workshop. The workshop is being held at the University of Amsterdam and is sponsored by VL-e (a Dutch e-Science project) and Microsoft. The event is off to a great start, with plenty of interesting observations and some cool extra tools (provenance -> workflows).
Provenance = Food Safety
Should you be responsible for the safety of your food? The article, Food Companies Are Placing the Onus for Safety on Consumers, in the New York Times is scary. The fundamental point is that it’s extremely difficult for companies that make ready-made frozen meals to verify the safety of their food because the supply chains have gotten so complex and they cannot track the provenance of the ingredients. Furthermore, the manufacturers have resisted putting in place tracking systems. From the article:
But government efforts to impose tougher trace-back requirements for ingredients have met with resistance from food industry groups including the Grocery Manufacturers Association, which complained to the Food and Drug Administration: “This information is not reasonably needed and it is often not practical or possible to provide it.”
Instead of instituting a trace-back mechanism, the manufacturers are trying to get consumers to cook their meals safely, reaching a “kill step” where bacteria are destroyed. However, as discussed in the article, this is actually very hard to do for some meals.
Personally, I’m going to lay off ready-made meals, which is unfortunate because they do come in handy. Generally, I want to know about provenance even if I can destroy all the bacteria with a smoking microwave. Additionally, I wonder how we can get the computer science research products we have been developing in the CS Provenance Community into the hands of these manufacturers. I really believe that, using our technology, collecting and managing the kind of documentation they need can be significantly cheaper and more effective than they expect.
Check out further discussion at the New York Times’ Room For Debate blog.
RAND and Data Now
I just came back from an Issues in Focus talk at RAND in Santa Monica about whether the United States is losing its edge in science and technology. You can read the full report by Titus Galama and James Hosek here. But to sum it up very succinctly, the answer is no. The US is still extremely competitive and looks to remain that way according to their research. Obviously, there’s much to be debated about this topic and they weren’t as blunt in their assessment as my one-word summary. However, instead of focusing on their research (which their report summarizes well), I want to focus on a question that came up several times from the audience: is there more current data?
Many of the graphs that Dr. Galama showed during his talk were compelling, but they were plotted over time and ended somewhere between 2001 and 2005. This is not because of some omission on Dr. Galama’s and Dr. Hosek’s part; it is because the data was just not available. They mentioned this several times in response to the audience questions. Talking to Dr. Galama after the Q&A, it was clear that he wants the most current data possible. Indeed, a recommendation from their report is all about obtaining data now:
Establish a permanent commitment to a funded, chartered entity responsible for periodically monitoring, critically reviewing, and analyzing U.S. S&T performance and the condition of the S&E workforce.
They essentially recommend an organization whose whole responsibility is to get good current data and synthesize it. However, the establishment of such an organization takes time and indeed may never happen. What should researchers do in the meantime? I believe the solution lies in taking advantage of the web. In particular, as the web becomes increasingly current (e.g. this blog post, tweets, etc.) and increasingly structured (RDFa, YQL, Linked Data), the kind of data that Dr. Galama needs will be available. The key then is making it accessible for synthesis. Once that (non-trivial) problem is solved and Dr. Galama can use an up-to-the-moment graph in his talk, the audience question will change from “Is there more current data?” to “Where did that data come from?”.
A Better Place
A Better Place is a new electric car infrastructure company that’s rolling out a whole new concept of how electric cars should work. The battery is the fuel. The founder of the company, Shai Agassi, explains the idea much better than I can in his TED talk embedded below. It’s an audacious idea but somehow very convincing. One key point, pertinent to this blog, is that all the electricity is sourced from renewable energy.
For even more details, you can watch Agassi’s Wired Science interview as well.
Hmmm…. maybe it’s time to invest in lithium-ion batteries.
Review: The Myth of the Paperless Office
A couple of weeks ago, I finished reading The Myth of the Paperless Office by Abigail Sellen and Richard Harper. In the midst of the publicity surrounding the Kindle and other e-Ink based e-book readers, this is a perfect book to understand why paper is such a pervasive and useful tool even in environments that are predominantly computerized (for example, a computer science research lab).
While the title of the book (I suspect) is meant to be controversial, the content is not. The authors are not Luddites; they are researchers trying to understand how paper can inform the development of electronic tools. They do this through several case studies that involved the transition from paper to a technological solution. The case studies ranged from police work and air traffic control to an office at the IMF. To me, the case studies were informative because they made me realize how paper enables people to do their job in very tiny but important ways.
For example, in the police case study, the police department wanted to have statements and notes entered electronically at the crime scene so the department could have immediate updates to its reporting system. However, the introduction of the electronic system was less than successful, not because the system didn’t work technically but because of the nature of police work. Police officers are not just investigators at a crime scene; they are also social workers. They have to comfort and attend to the witnesses or victims at the scene. This is a difficult task as it involves the officer being aware of the interviewee’s psychological state. The electronic system given to the officers was not adapted to this kind of sensitive environment and got in the way of the social work aspect of the task. Paper, unlike the electronic system, did not get in the way of the officers’ job while still allowing them to gather data.
Throughout these case studies the authors highlight how understanding the affordances of paper is critical when designing new technology, especially technology that is supposed to integrate with existing work practices. They identified the following affordances:
- A single sheet is light and physically flexible.
- It is porous, which means that it is markable and that marks are fixed and spatially invariant with respect to the underlying medium.
- It is a tangible, physical object.
- Engagement with paper for the purpose of marking or reading is direct and local. In other words, the medium is immediately responsive to executed actions, and interaction depends on physical copresence.
These affordances lead to certain consequences. For example, the fact that paper is tangible and has locality means that when a paper is on my desk at work, it acts as a reminder to do something about it. Or, the fact that paper can be easily bent means that I can easily tell what pages I should go back to when writing the blog post about this book. The book has many more examples of these sorts of consequences.
The final thing I learned from the book (or had reinforced) was that paper is not a complete or even cursory repository of people’s knowledge. More often than not, it is used as a trigger for people’s recall. Indeed, it turns out that making all an organization’s information electronic does not provide instant access to an institution’s knowledge. From the book:
In other words, despite the managers’ best efforts to leverage the knowledge in their documentation, ultimately the knowledge resided in the minds of the engineers.
This is important to keep in mind when tracking provenance: it reminds us how difficult it is to document the true origins of things, especially when they involve human thought and analysis.
Overall, The Myth of the Paperless Office was a worthwhile read. Even though it was published in 2001, it still provides lessons for technology designers now. I hope that the authors publish an updated version soon.
That being said, even after reading the book, I still want a Kindle.
PC3-Start!
Below is the call for participation for the Third Provenance Challenge, which I’m helping to organize. If you have any questions about it, contact me. We are obviously looking for participation but it’s also interesting to just hear comments on the approach of having a common format for provenance.
The Third Provenance Challenge – Call for Participation
Data products are increasingly being produced by the composition of services and data supplied by multiple parties using a variety of data analysis, management, and collection technologies. This approach is particularly evident in e-Science, where scientists combine sensor data and shared Web-accessible databases using a variety of local and remote data analysis routines to produce experimental results. In such environments, provenance (also referred to as audit trail, lineage, and pedigree) plays a critical role as it enables users to understand, verify, reproduce, and ascertain the quality of data products.
Because of the importance of provenance, many areas have developed techniques and tools for determining provenance, including scientific and business process workflow, visualization, digital libraries and semantic web technologies. An important challenge in the context of heterogeneous compositional applications is how to integrate the provenance produced by these techniques to be able to construct the full provenance of complex data products. To that end, the community has endeavored to develop a common understanding and model of provenance to aid interoperability through the Open Provenance Model (OPM).
Help chart the future of provenance interoperability by participating in the Third Provenance Challenge.
Details:
You can find information on the challenge definition and how to participate at the Third Provenance Challenge Wiki.
To keep up-to-date, subscribe to the Provenance Challenge mailing list.
Key Dates:
- March 2 – The Third Provenance Challenge Starts
- Make the workflow work with individual team’s systems [Mar. 2 - Mar. 30]
- Generate provenance for the challenge workflow & run queries on it [Mar. 30 - Apr. 13]
- Export OPM Graphs and import from others [Apr. 13 - May 4]
- Run queries on imported OPM graph [Apr. 27 - Jun. 1]
- Prepare slides for challenge [Jun. 1 - Jun. 8]
- PC3 Workshop June 10 – 11 held in Amsterdam.
Contact:
For details or questions, contact Paul Groth (pgroth -at- isi.edu).
Organizers:
- Paul Groth, ISI / University of Southern California
- Yogesh Simmhan, Microsoft Research
- Luc Moreau, University of Southampton
Local Organizers:
- Adam Belloum, University of Amsterdam
- Zhiming Zhao, University of Amsterdam
History
At the 2006 International Provenance and Annotation Workshop (IPAW), the community agreed to hold the First Provenance Challenge, which emphasized understanding the commonalities and differences between existing approaches. Held in Washington DC in September 2006, the 17-team workshop identified several commonalities and resulted in agreement that a Second Provenance Challenge focusing on interoperability would be beneficial. At the Second Provenance Challenge workshop held at the High Performance Distributed Computing conference on June 26, 2007, teams presented their results demonstrating the ability to interoperate between several systems. Discussions at this challenge led to the specification of a common data model, the Open Provenance Model (OPM). This model was further discussed and developed at a subsequent workshop held at IPAW’08. Discussions at this workshop led to this Third Provenance Challenge focusing on interoperability using OPM. More information can be found at http://twiki.ipaw.info.
Status
It’s been a while since I’ve posted…. I still owe a post on the rest of Borgman’s book. I’ve been a bit sidetracked. I got engaged a couple of weeks ago and I started reading another fascinating book on the role of paper in knowledge work.
I’ll be in Germany and the Netherlands next week (Mar. 8 – 17). If you’re interested in meeting up, send me an email.
Scholarship Now
Just over a week ago, we had a great AI seminar here at ISI with Christine Borgman. A professor at UCLA, she is at the leading edge of understanding how the academic process, in particular academic communication, is done now (i.e. with the advent of the interweb). I wanted to wait and post about the talk until after I had finished her book Scholarship in the Digital Age but, as I’m only halfway through, in the interest of freshness I thought I’d put up the link now.
I’ll save my own thoughts on her ideas with respect to provenance until I’ve completed the book. But you should definitely check out her talk. She presents some really compelling ideas about the information value chain.
