For International Open Access Week the VU University Amsterdam is working with other Dutch universities to provide information to academics on open access. Here’s a video they took of me talking about the relationship between open access and social media. The two go hand-in-hand!
For the last 1.5 years, I’ve been working on the Open PHACTS project – a public private partnership for pharmacology data integration. Below is a good video from Lee Harland the CTO giving an overview of the project from the tech/pharma perspective.
I just received 50 copies of this in the mail today:
Literature is so much better in its electronic form but its still fun to get a physical copy. Most importantly, this proceedings represents scientific content and a scientific community that I’m proud to be be part of. You can obviously access the full proceedings online. Preprints are also available from most of the author’s sites. You can also read my summary of the 4th International Provenance and Annotation Workshop (IPAW 2012) .
If you read this blog a bit, you’ll know I’m a fairly big fan of RDF as a data format. It’s really great for easily mashing different data sources together. The common syntax gets rid of a lot of headaches before you can start querying the data and you get nice things like simple reasoning to boot.
One thing that I’ve been looking for is a nice way to store a ton of RDF data, which is
- easy to deploy;
- easy to query;
- works well with scalable analysis & reasoning techniques (e.g. stuff built using MapReduce/Hadoop);
- oh and obviously scalable.
This past spring I was at a dagstuhl workshop where I had the chance to briefly talk to Chris Ré about the data storage environment used by the Hazy – one of the leading projects in the world on large scale statistical inference. At the time, he was fairly enthusiastic about using HBase as a storage layer.
Based on that suggestion, I played around with deploying HBase myself on Amazon. Using whirr it was pretty straightforward to deploy a pretty nice environment in a matter of hours. In addition, HBase has the nice side effect that it uses the same file system as Hadoop (HDFS) so you can run Hadoop jobs over the data that’s stored in the database.
With that I wanted to see a) what was a good way to store a bunch of RDF in HBase and b) if the retrieval of RDF was performant. Sever Fundatureanu worked on this as his master’s thesis.
One of the novel things he looked at was using coprocessors (built in user defined functions in hbase) to try and improve the building of indexes for RDF within the database. That is instead of running multiple hadoop load jobs you run ~one and then let the coprocessors in each worker node build the rest of the x indexes you want to improve retrieval. While it didn’t improve performance, I thought the idea was cool. I’m still interested in how much user side processing you can shove into the worker nodes within HBase. Below you’ll find an abstract and link to his full thesis.
I’m still keen on using HBase as the basis for the analysis and reasoning over RDF data. We’re continuing to look into this area. If you have some cool ideas, let us know.
The exponential growth of the Semantic Web leads to a need for a scalable storage solution for RDF data. In this project, we design a quad store based on HBase, a NoSQL database which has proven to scale out to thousands of nodes. We adopt an Id-based schema and argue why it enables a good trade-off between loading and retrieval performance. We devise a novel bulk loading technique based on HBase coprocessors and we compare it to a traditional Map-Reduce technique. The evaluation shows that our technique does not scale as well as the traditional approach. Instead, with Map-Reduce, we achieve a loading throughput of 32152 quads/second on a cluster of 13 nodes. For retrieval, we obtain a peak throughput of 56447 quads/second.