Another 5 Linked Data Business Models
There has been a growing movement to make data available online in a manner that’s easy for developers to access and query. These data sets range from Yelp reviews and government statistics to beer quality. In particular, there has been a rapid increase in Linked Data (i.e. interconnected data sets structured using web standards). I posted to Twitter a back-of-the-envelope calculation that Linked Data tripled in size in 9 months of 2009 to almost 13.1 billion triples. Interestingly, the first message to me after posting was: What is the business case for Linked Data?
It is critical to answer this question in order to maintain the viability of Linked Data over time. Obviously, some data (e.g. from governments) will be made available as a public service. But, especially as this data becomes more popular and thus more expensive to host, there needs to be a way to support these data sets and encourage more and better data sets to come online. Additionally, if there are ways to make money from or around Linked Data, we will see a stronger Linked Data developer and support ecosystem. Such an ecosystem will make Linked Data even more valuable both as a public and private resource.
With that in mind, I’ve thought of the following business models for Linked Data. This is in addition to Scott Brinker’s 7 business models. You can find more discussion here and here.
1. Tools for Analytics and Business Intelligence
I like to say that Linked Data is analytics enabled. It provides for rich machine understandable data sets with common formats and query languages. When done correctly, it connects out to a wider set of data enriching local data with very little developer effort. I believe there is space to develop advanced business intelligence tools that can leverage the linked nature of these data sets. Furthermore, it should be easy to adapt such tools from domain to domain by making use of common ontologies. Thus, these tools can be easily customized for particular business needs. By the way, the business intelligence software market is $8.8 billion.
2. High Resolution and Realtime Data Sets
Open Linked Data sets can be loss leaders for information providers to sell high quality or up-to-the-minute data sets. This is often done by data providers; the classic example is delayed stock ticker information. But it is also common for web services to charge for high numbers of queries and better access. In the realm of Linked Data, the Ordnance Survey is exposing mapping Linked Data for free as a public service while charging for higher resolution datasets. We already see the emergence of data marketplaces for purchasing data sets. I think this trend will continue, allowing data providers to provide quality data and charge those who need the very best, now.
3. Exposing Data Sets
If companies see the benefits of exposing their data as Linked Data either to sell it or to use new analytic tools, there will obviously be a role for firms that specialize in this practice. I doubt this will be a high revenue business model. However, given the availability of tools, it might be highly profitable.
4. Aggregation
Like the Web, the Web of Data is messy and large. We’ve seen on the Web how aggregators can create tremendous value by collecting, cleaning and ranking data. Building the infrastructure to do this task is expensive, and for most companies it’s better to let someone else do it and pay for access. My favorite company that has this model is spinn3r, which provides an indexed, cleaned, structured view of the blogosphere.
5. Tailored Push or Data Set Recommendation
With an ever increasing amount of data, there is the need to find and curate data tailored to an individual corporation’s needs. I envision a subscription service where specifically designed data sets are obtained either through advanced algorithms searching the Linked Data Cloud or by manual creation. These specialized data sets could be built by observing the queries that were made when using the analytics tools discussed previously. Imagine I’m examining data sets about beer imports from Holland to America, and the next day an accurate breakdown by type of beer for the last 3 weeks appears in my data inbox.
I hope this post has helped contribute to the conversation on Linked Data and business. Now it’s time to implement some of these.
Data DJs
Last week on Wednesday, I gave a seminar at LARGE (Learning Agents Research Group at Erasmus). Thanks to Wolfgang Ketter for both inviting me and more importantly the excellent discussion. The nice thing about seminars is that they often give you an opportunity to do something different. In this case, I did an expanded version of the October WAI talk I gave introducing myself to my new colleagues in the VU’s AI department. When I moved to the VU, I started trying to think about a better organization or perspective on my research. Something that encompassed what I’ve already done but also something that pointed towards the future. The thing that kept coming to my mind is that what we really need is tools that make remixing data as easy as it is to remix music or video. Essentially, why don’t we have Data DJs? (or should it just be DJ – Data Jockey??)
Indeed, we can see that a lot of what scientists (i.e. data analysis pros) do is remix data. In a happy coincidence, this idea was reinforced to me last night when I watched the documentary RiP: A Remix Manifesto. It’s both an entertaining and important documentary about the impact of strong intellectual property laws on creative freedom. Most importantly for this post, it describes the analogy between a music DJ and the practice of science in vivid terms. See the clip below starting at about 4 minutes in.
The other thing about the documentary to me was it showed how accessible the tools for working with music and video were to people. I think we can make tools that are just as good or better for data sitting in files, spreadsheets and databases. I think this DJ to data-analysis analogy is a powerful framework to think about how we can make such tools. I summarize the analogy as follows:
- records = data in one format (linked data?)
- turntable and mixers = end-user programming (workflows)
- recording equipment = capturing what goes on during data analysis (provenance)
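To make the analogy a little more concrete, here is a minimal Python sketch of the three pieces working together. It is an illustration only, not the tooling from my talk: the names ProvenanceLog, run_workflow and the two beer-related steps are hypothetical, and the rows are made-up sample data. The records are a list of dictionaries in one common shape, the mixer is a small workflow that chains steps, and the recording equipment is a log of what each step consumed and produced.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class ProvenanceLog:
    """The 'recording equipment': remembers what each step did."""
    entries: List[dict] = field(default_factory=list)

    def record(self, step: str, inputs: Any, output: Any) -> None:
        self.entries.append({"step": step, "input": inputs, "output": output})


def run_workflow(data: Any, steps: List[Callable[[Any], Any]], log: ProvenanceLog) -> Any:
    """The 'turntable and mixer': feed the output of each step into the next."""
    for step in steps:
        result = step(data)
        log.record(step.__name__, data, result)
        data = result
    return data


# Two hypothetical remix steps over the sample records.
def keep_imports_from_holland(rows: List[dict]) -> List[dict]:
    return [row for row in rows if row["origin"] == "Holland"]


def total_litres(rows: List[dict]) -> int:
    return sum(row["litres"] for row in rows)


if __name__ == "__main__":
    # The 'records': made-up data in one common format.
    rows = [
        {"origin": "Holland", "litres": 100},
        {"origin": "Belgium", "litres": 50},
        {"origin": "Holland", "litres": 25},
    ]
    log = ProvenanceLog()
    print(run_workflow(rows, [keep_imports_from_holland, total_litres], log))
    for entry in log.entries:
        print(entry["step"], "->", entry["output"])
```

The point of the design is that the log is filled in by the workflow runner rather than by each step, so whoever writes a step gets provenance without doing anything extra.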
The slides at the end of this post are from the talk I gave at LARGE, explaining how my research fits into this framework.
I want to be a Data DJ, do you?
Content-based Trust for Electronic Contracts
I’m at the 10th Annual International Workshop “Engineering Societies in the Agents’ World” (ESAW 2009) and gave a talk this morning about how an electronic agent can use past experience (e.g. process documentation) with contracts to predict whether it should trust a new contract proposal. You can check out the slides below. It’s interesting to be at an agents workshop… they use a vocabulary I haven’t heard in a couple of years, but it’s fun.
I was also on a panel where we had a nice discussion on the overlap between reputation and content based trust and the role of context in trust. Roles seem to be a super important topic here.
Big idea from Frank Dignum – trust reduces to machine learning
Something to think about.
4store Amazon Machine Image and Billion Triple Challenge Data Set
As part of our entry to the 2009 Billion Triple Challenge (BTC), we have been using two pieces of great infrastructure: Amazon Web Services and the quad store – 4store. Today, we are making publicly available an Amazon Machine Image for 4store. Additionally, we are making an Elastic Block Storage snapshot of the BTC dataset for 4store. Thus, developers can easily get started using 4store with a billion triples on Amazon’s cloud.
Technical Notes:
- We assume you have used Amazon EC2 before.
- The 4store AMI and the associated EBS snapshot are currently only available in the EU-West Amazon region.
- The id of the AMI is: ami-62547f16
- The id of the BTC snapshot is: snap-1a8c6073
- The 4store AMI is based on Debian Squeeze 64-bit. We use the AMI (ami-745b7000) provided by alestic.com as the starting point.
Using the 4store AMI:
- The AMI is 64-bit so you need to start it on a 64-bit EC2 instance
- Check out 4store.org for documentation about using 4store.
- If you’re going to use 4store without the BTC dataset, you need to create the directory /mnt/4store once the instance has started.
Using the 4store AMI with the BTC dataset:
- Start the AMI as above.
- Make sure that the Security Group you use allows for HTTP traffic on the port range 4000-4060 as we start a 4store instance for roughly every 20 million triples.
- Create an EBS volume from the BTC snapshot and attach it to your EC2 instance.
- Mount the volume at /mnt/4store
- In the root home directory (~/), you’ll find a shell script called btc.sh. This will allow you to start 4store for BTC. Run “btc.sh start”. This will launch all the 4store backends and HTTP servers. Startup takes a while, around 30 minutes to an hour.
- Once this is complete, you’ll be able to access the billion triples over the 50-some SPARQL endpoints that have been started on ports 4000-4057.
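Once the endpoints are up, something like the following Python sketch can be used to talk to them. Treat it as a rough starting point under a few assumptions: the host name is a placeholder for your instance's public DNS name, the /sparql/ path is where we would expect the 4store HTTP server to listen (check the 4store documentation if it differs), and the helper names are made up for this example. The query is sent as a plain HTTP GET, as in the standard SPARQL protocol.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

HOST = "localhost"                     # placeholder: use your instance's DNS name
FIRST_PORT, LAST_PORT = 4000, 4057     # port range from the notes above
SPARQL_PATH = "/sparql/"               # assumed endpoint path


def endpoint_urls(host=HOST, first=FIRST_PORT, last=LAST_PORT, path=SPARQL_PATH):
    """Return one endpoint URL per port in the inclusive range."""
    return [f"http://{host}:{port}{path}" for port in range(first, last + 1)]


def build_query_url(endpoint, sparql):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": sparql})


def run_query(endpoint, sparql, timeout=60):
    """Send the query and return the raw response body as text."""
    request = Request(
        build_query_url(endpoint, sparql),
        headers={"Accept": "application/sparql-results+xml"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    query = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"
    # Each endpoint serves its own slice of the data, so ask every one of them.
    for url in endpoint_urls():
        try:
            print(url)
            print(run_query(url, query))
        except OSError as error:       # endpoint not reachable (yet)
            print(f"{url} failed: {error}")
```

Because the data is split across the endpoints, a query only sees the triples held by the endpoint it is sent to; combining results across all of them is left to the client.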
Contact: pgroth@gmail.com
Have fun!
Paul Groth, Christophe Guéret, Stefan Schlobach
Knowledge Representation and Reasoning Group
Department of Artificial Intelligence
Vrije Universiteit Amsterdam
Where did that tweet come from?
Check out Dan Brickley’s post on the chaos around tweets about Iran. The key quote from the post in my opinion is:
Without tools to trace reports to their source, to claims about their source from credible intermediaries, or evidence, this isn’t directly useful. Even grassroots journalists needs evidence.
Even with retweets it’s hard to figure out where information is coming from and from whom especially as it flows in real time.
Provenance Challenge 3 Workshop Today and Tomorrow
Today was the start of the Provenance Challenge 3, a challenge focused on interoperability between computational provenance systems using the Open Provenance Model. 14 teams participated and are now presenting their submissions and discussing what’s next for the model and the community at the workshop. The workshop is being held at the University of Amsterdam and is sponsored by VL-e (a Dutch e-Science project) and Microsoft. The event is already off to a great start, with plenty of interesting observations and some cool extra tools (provenance -> workflows).
Provenance = Food Safety
Should you be responsible for the safety of your food? The article, Food Companies Are Placing the Onus for Safety on Consumers, in the New York Times is scary. The fundamental point is that it’s extremely difficult for companies that make ready-made frozen meals to verify the safety of their food because the supply chains have gotten so complex and they cannot track the provenance of the ingredients. Furthermore, the manufacturers have resisted putting in place tracking systems. From the article:
But government efforts to impose tougher trace-back requirements for ingredients have met with resistance from food industry groups including the Grocery Manufacturers Association, which complained to the Food and Drug Administration: “This information is not reasonably needed and it is often not practical or possible to provide it.”
Instead of instituting a trace-back mechanism, the manufacturers are trying to get consumers to ensure they cook their meals safely, reaching a “kill-step” where bacteria are destroyed. However, as discussed in the article, this is actually very hard to do for some meals.
Personally, I’m going to lay off ready-made meals, which is unfortunate because they do come in handy. Generally, I want to know about provenance even if I can destroy all the bacteria with a smoking microwave. Additionally, I wonder how we can get the computer science research products we have been developing in the CS Provenance Community into the hands of these manufacturers. I really believe that collecting and managing the kind of documentation they need can be significantly cheaper and more effective than they expect using our technology.
Check out further discussion at the New York Times’ Room For Debate blog.
Rand and Data Now
I just came back from an Issues in Focus talk at RAND in Santa Monica about whether the United States is losing its edge in science and technology. You can read the full report by Titus Galama and James Hosek, here. But to sum it up very succinctly, the answer is no. The US is still extremely competitive and looks to remain that way according to their research. Obviously, there’s much to be debated about this topic and they weren’t as blunt in their assessment as my one word summary. However, instead of focusing on their research (which their report summarizes well), I want to focus on a question that came up several times from the audience: is there more current data?
Many of the graphs that Dr. Galama showed during his talk were compelling, but they were plotted over time and roughly ended between 2001 and 2005. This is not because of some omission on Dr. Galama and Hosek’s part; it is because the data was just not available. They mentioned this several times in response to the audience questions. Talking to Dr. Galama after the Q&A, it was clear that he wants the most current data possible. Indeed, a recommendation from their report is all about obtaining data now:
Establish a permanent commitment to a funded, chartered entity responsible for periodically monitoring, critically reviewing, and analyzing U.S. S&T performance and the condition of the S&E workforce.
They essentially recommend an organization whose whole responsibility is to get good current data and synthesize it. However, the establishment of such an organization takes time and indeed may never happen. What should researchers do in the meantime? I believe the solution lies in taking advantage of the web. In particular, as the web becomes increasingly current (i.e. this blog post, tweets, etc.) and increasingly structured (RDFa, YQL, Linked Data) the kind of data that Dr. Galama needs will be available. The key then is making it accessible for synthesis. Once that (non-trivial) problem is solved and Dr. Galama can use an up to the moment graph in his talk, then the audience question will change from “Is there more current data?” to “Where did that data come from?”.
A Better Place
A Better Place is a new electric car infrastructure company that’s rolling out a whole new concept of how electric cars should work. The battery is the fuel. The founder of the company, Shai Agassi, explains the idea much better than I can in his TED talk embedded below. It’s an audacious idea but somehow very convincing. One key point, pertinent to this blog, is that all the electricity is sourced from renewable energy.
For even more details, you can watch Agassi’s Wired Science interview as well.
Hmmm…. maybe it’s time to invest in lithium-ion batteries.
Review: The Myth of the Paperless Office
A couple of weeks ago, I finished reading The Myth of the Paperless Office by Abigail Sellen and Richard Harper. In the midst of the publicity surrounding the Kindle and other e-Ink based e-book readers, this is a perfect book to understand why paper is such a pervasive and useful tool even in environments that are predominantly computerized (for example, a computer science research lab).
While the title of the book (I suspect) is meant to be controversial, the content is not. The authors are not Luddites; they are researchers trying to understand how paper can inform the development of electronic tools. They do this through several case studies that involved the transition from paper to a technological solution. The case studies ranged from police work and air traffic control to an office at the IMF. To me, the case studies were informative because they made me realize how paper enables people to do their job in very tiny but important ways.
For example, in the police case study, the police department studied wanted to have statements and notes at the crime scene electronically entered so the department could have immediate updates to their reporting system. However, the introduction of the electronic system was less than successful, not because the system didn’t work technically but because of the nature of police work. Police officers are not just investigators at a crime scene; they are also social workers. They have to comfort and attend to the witnesses or victims at the scene. This is a difficult task as it involves the officer being aware of the interviewee’s psychological state. The electronic system given to the officers was not adapted to this kind of sensitive environment and got in the way of the social work aspect of the task. Paper, unlike the electronic system, did not get in the way of the officers’ job while still allowing them to gather data.
Throughout these case studies the authors highlight how understanding the affordances of paper is critical when designing new technology, especially technology that is supposed to integrate with existing work practices. They identified the following affordances:
- A single sheet is light and physically flexible.
- It is porous, which means that it is markable and that marks are fixed and spatially invariant with respect to the underlying medium.
- It is a tangible, physical object.
- Engagement with paper for the purpose of marking or reading is direct and local. In other words, the medium is immediately responsive to executed actions, and interaction depends on physical copresence.
These affordances lead to certain consequences. For example, the fact that paper is tangible and has locality means that when a paper is on my desk at work, it acts as a reminder to do something about it. Or, the fact that paper can be easily bent means that I can easily tell what pages I should go back to when writing the blog post about this book. The book has many more examples of these sorts of consequences.
The final thing I learned from the book (or had reinforced) was that paper is not a complete or even cursory repository of people’s knowledge. More often than not, it is used as a trigger for people’s recall. Indeed, it turns out that making all an organization’s information electronic does not provide instant access to an institution’s knowledge. From the book…
In other words, despite the managers’ best efforts to leverage the knowledge in their documentation, ultimately the knowledge resided in the minds of the engineers.
Hence, with respect to tracking provenance, it’s important to keep this in mind: it reminds us how difficult it is to document the true origins of things, especially when they involve human thought and analysis.
Overall, The Myth of the Paperless Office was a worthwhile read. Even though it was published in 2001, it still provides lessons for technology designers now. I hope that the authors publish an updated version soon.
That being said, even after reading the book, I still want a Kindle.