linked data

This past week I attended a workshop, Evolution and Variation of Classification Systems, organized by the Knowescape EU project. The project studies how knowledge evolves and makes cool maps of it.

The aim of the workshop was to discuss how knowledge organization systems and classification systems change. By knowledge organization systems, we mean things like the Universal Decimal Classification system or the Wikipedia Category Structure. My interest here is the interplay between the change in data and the change in the organization system used for that data. For example, I may use a certain vocabulary or ontology to describe a dataset (i.e. the columns). How does it impact data analysis procedures when the meaning of that organization system changes? Many of our visualization and analysis decisions are based on how we categorize data (whether manually or automatically) according to such organizational structures. Albert Meroño-Peñuela gave an excellent example of that with his work on Dutch historical census data. Furthermore, the organization system used may impact the ability to repurpose and combine data.
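To make the point about changing organization systems concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the records, the two scheme versions, and the helper names `summarize` and `changed_categories` are made up for illustration and are not taken from the census work mentioned above. It only shows that the same data, summarized under two versions of a classification, can yield different category counts.

```python
from collections import Counter

# Hypothetical occupation records (illustrative only, not real census data).
records = ["weaver", "spinner", "farm hand", "clerk", "weaver"]

# Two hypothetical versions of the same classification scheme.
scheme_v1 = {
    "weaver": "textiles",
    "spinner": "textiles",
    "farm hand": "agriculture",
    "clerk": "services",
}
scheme_v2 = {
    "weaver": "manufacturing",
    "spinner": "manufacturing",
    "farm hand": "agriculture",
    "clerk": "administration",
}


def summarize(records, scheme):
    """Count records per category under the given scheme."""
    return Counter(scheme.get(r, "unclassified") for r in records)


def changed_categories(before, after):
    """Return {category: (count_before, count_after)} where the counts differ."""
    categories = set(before) | set(after)
    return {
        c: (before.get(c, 0), after.get(c, 0))
        for c in sorted(categories)
        if before.get(c, 0) != after.get(c, 0)
    }


if __name__ == "__main__":
    v1 = summarize(records, scheme_v1)
    v2 = summarize(records, scheme_v2)
    for category, (n1, n2) in changed_categories(v1, v2).items():
        print(f"{category}: {n1} -> {n2}")
```

The data never changed here, only the scheme did, yet any chart built on the per-category counts would look different. That is the kind of effect one would want to detect before comparing analyses made at different times.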

Interestingly, even though we’ve seen highly automated approaches emerge for search and other information analysis tasks, Knowledge Organization Systems (KOSs) still often provide extremely useful information. For example, we’ve seen how structures such as Wikipedia’s have been central to the emergence of knowledge graphs. Likewise, extremely adaptable organization systems such as hashtags have been foundational for other services.

At the workshop, I particularly enjoyed Joseph Tennis’s keynote on the diversity and stability of KOSs. His work on ontogeny is starting to measure that change. He demonstrated this by looking at the Dewey Decimal System, but others have shown that the change is apparent in other KOSs (1, 2, 3, 4). Understanding this change could help in constructing better and more applicable organization systems.

From both Joseph’s talk and the talk by Richard Smiraglia (one of the leaders in Knowledge Organization), it’s clear that, as with many other sciences, our ability to understand information systems can now become much more deeply empirical. Because the objects of study (e.g. vocabularies, ontologies, taxonomies, dictionaries) are available on the Web in digital form, we can now analyze them. This is the promise of Web Observatories. Indeed, an interesting outcome of the workshop was that the construction of a KOS observatory is not that far-fetched and could be done using aggregators such as Linked Open Vocabularies and Taxonomy Warehouse. I’ll be interested to see if this gets built.

Finally, it occurred to me that there is a major lack of studies on the evolution of the Urban Dictionary as a KOS. Someone ought to do something about it 🙂

Random Notes

NewsReader Amsterdam Hackathon

This past Wednesday (Jan. 21, 2015) I was at the NewsReader Hackathon. NewsReader is an EU project to extract events and build stories from the news. They use a sophisticated NLP pipeline combined with semantic background knowledge to perform this task. The hackathon was an opportunity to talk to members of one of the leading NLP groups in the Netherlands (CLTL) and find out more about their current pipeline. Additionally, one of the project partners is Lexis Nexis, a sister company of Elsevier, so it was nice to see how their content was being used as a basis for event extraction and also to meet some of my colleagues. The combination of news and research is of particular interest in light of the recent Elsevier acquisition of NewsFlo.

Besides the chance to meet people, I also got to do some hacking myself to see how the NewsReader API worked. I used the API to plot the number and type of events featuring universities. (The resulting IPython Notebook)

A couple of pointers for future reference:

A couple of weeks ago, I was at the European Data Forum in Athens talking about the Open PHACTS project. You can find a video of my talk with slides here. Slides are embedded below.

It’s been about a week since I got back from Australia after attending the International Semantic Web Conference (ISWC 2013). This is the premier forum for the latest in research on using semantics on the Web. Overall, it was a great conference – well run and with a good buzz. (Note, I’m probably a bit biased – I was chair of this year’s In-Use track.)

ISWC is a fairly hard conference to get into and the quality is strong.

More importantly, almost all the talks I went to were worth thinking about. You can find the proceedings of the conference online either as a complete zip here or published by Springer. You can find more stats on the conference here.

As an aside, before digging into the meat of the conference – Sydney was great. Really a fantastic city – very cosmopolitan and with great coffee. I suggest Single Origin Roasters.  Also, Australia has wombats – wombats are like the chillest animal ever.


From my perspective, there were three main themes to take away from the conference:

  1. Impressive applications of semantic web technologies
  2. Core ontologies as the framework for connecting complex integration and retrieval tasks
  3. Starting to come to grips with messiness


We are really seeing how semantic technologies can power great applications. All three keynotes highlighted the use of semantic tech. I think Ramanathan Guha’s keynote probably highlighted this the best in his discussion of the growth of schema.org.

He brought representatives from Yandex, Yahoo, and Microsoft on stage to join Google in describing how they are using schema.org, and noted that Drupal and WordPress will have schema.org in their cores in 2014. Schema.org is being used to drive everything from veteran-friendly job search to rich pins on Pinterest and enabling OpenTable reservations to be easily put into your calendar. So schema.org is clearly a success.

Peter Mika presented a paper on how Yahoo is using ontologies to drive entity recommendations in searches. For example, you search for Brad Pitt and they show you related entities like Angelina Jolie or Fight Club. The nice thing about the paper is that it showed how the deployment in production (in Yahoo! Web Search in the US) increases click-through rates.

Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. International Semantic Web Conference (2) 2013: 33-48

I think it was probably Yves Raimond’s conference – he showed some amazing things being done at the BBC using semantic web technology. He had an excellent keynote at the COLD workshop, which also highlighted some challenges on where we need to improve to ease the use of these technologies in production. I recommend you check out his slides. Of all the applications, I would single out their work on mining the BBC World Service archive to enrich content being created. This work won the Semantic Web Challenge.

In the biomedical domain, there were two papers showing how semantics can be embedded in tools that regular users use. One showed how the development of ICD-11 (ICD is the most widely used clinical classification, developed by the WHO) is supported using semtech. The other I liked was the use of Excel templates (developed using RightField) that transparently captured data according to a domain model for systems biology.

Also in the biomedical domain, IBM presented an approach for using semantic web technologies to help coordinate health and social care at the Semantic Web Challenge.

Finally, there was a neat application presented by Jane Hunter applying these technologies to art preservation: The Twentieth Century in Paint.

I did a review of all the in-use papers leading up to the conference but it’s good enough to say that there were numerous impressive applications. Also, I think it says something about the health of the community when you see slides like this:

Core Ontologies + Other Methods

There were a number of interesting papers that were around the idea of using a combination of well-known ontologies and then either record linkage or other machine learning methods to populate knowledge bases.

A paper that I like a lot (and that also won the best student paper award), Knowledge Graph Identification (by Jay Pujara, Hui Miao, Lise Getoor and William Cohen), sums it up nicely:

Our approach, knowledge graph identification (KGI) combines the tasks of entity resolution, collective classification and link prediction mediated by rules based on ontological information.

Interesting papers under this theme were:

From my perspective, it was also nice to see the use of the W3C Provenance Model (PROV) as one of these core ontologies in many different papers and two of the keynotes. People are using it as a substructure on which to build a number of different applications – I intend to write a whole post on this – but until then here’s proof by Twitter:
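For readers who haven’t seen PROV, here is a minimal sketch of what using it as a substructure can look like. It uses plain Python tuples rather than any RDF library. The `prov:` terms (`prov:Entity`, `prov:Activity`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasDerivedFrom`) are real PROV-O terms, but the `ex:` resources and the helper `sources_of` are hypothetical and only there for illustration.

```python
# A tiny provenance graph as (subject, predicate, object) triples.
# The ex: names are made up; the prov: terms come from the W3C PROV-O vocabulary.
triples = [
    ("ex:raw_survey", "rdf:type", "prov:Entity"),
    ("ex:dataset", "rdf:type", "prov:Entity"),
    ("ex:chart", "rdf:type", "prov:Entity"),
    ("ex:plotting", "rdf:type", "prov:Activity"),
    ("ex:plotting", "prov:used", "ex:dataset"),
    ("ex:chart", "prov:wasGeneratedBy", "ex:plotting"),
    ("ex:chart", "prov:wasDerivedFrom", "ex:dataset"),
    ("ex:dataset", "prov:wasDerivedFrom", "ex:raw_survey"),
]


def sources_of(entity, triples):
    """Follow prov:wasDerivedFrom edges transitively from `entity`.

    Returns the entities it was derived from, in the order discovered.
    """
    found = []
    seen = {entity}  # guards against cycles in the graph
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if (
                subject == current
                and predicate == "prov:wasDerivedFrom"
                and obj not in seen
            ):
                seen.add(obj)
                found.append(obj)
                frontier.append(obj)
    return found


if __name__ == "__main__":
    print(sources_of("ex:chart", triples))
```

An application then only has to agree on the handful of `prov:` relations to ask questions like "where did this chart come from?", whatever domain vocabulary describes the entities themselves.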

Coming to grips with messiness

It’s pretty evident that when dealing with the web things are messy. There were a couple of papers that documented this empirically either in terms of the availability of endpoints or just looking at the heterogeneity of the markup available from web pages.

In some sense, the papers mentioned in the prior theme also try to deal with this messiness. Here are another couple of papers looking at essentially how to deal with or even use this messiness.

One thing that seemed a lot more present in this year’s conference than last year was the term entity. This is obviously popular because of things like the Google Knowledge Graph – but in some sense maybe it gives a better description of what we are aiming to get out of the data we have – machine-readable descriptions of real-world concepts/things.


There are some things that are of interest that don’t fit neatly into the themes above. So I’ll just try a bulleted list.

  • We won the Best Demo Paper Award for
  • Our paper on using NoSQL stores for RDF went over very well. Congrats to Marcin for giving a good presentation.
  • The format of mixing talks from different tracks by topic and having only 20 minutes per talk was great.
  • VUA had a great showing – 3 main track papers, a bunch of workshop papers, a couple of different posters, 4 workshop organizers giving talks at the workshop summary session, 2 organizing committee members, alumni all over the place, plus a bunch of stuff I probably forgot to mention.
  • The colocation with Web Directions South was great – it added a nice extra energy to the conference.
  • There were best reviewer awards won by Oscar Corcho, Tania Tudorache, and Aidan Hogan.
  • Peter Fox seemed to give a keynote just for me – concept maps, PROV followed with abductive reasoning.
  • Did I mention that the coffee in Sydney (and Newcastle) is really good and lots of places serve proper breakfast!

Yesterday, Luc (my coauthor) and I received our physical copies of Provenance: An Introduction to PROV in the mail. Even though the book is primarily designed to be distributed digitally, it’s always great actually holding a copy in your hands. You can now order your own physical copy on Amazon. The Amazon page for the book also includes the ability to look inside the book.


Cross-posted from

In April, we launched the Open PHACTS Discovery Platform with a corresponding API allowing developers to create drug discovery applications without having to worry about the complexities and pain of integrating multiple databases. We’ve had some great applications being developed on top of this API. If you’re a developer in this space, I encourage you to take a look and see what you can create.  Below is a slide set and a webinar about getting started with the API. You can also check out for developer documentation and getting an account.


This past week we (Achille Fokoue & I) sent the paper notifications for the 2013 International Semantic Web Conference’s In-Use Track. The track seeks to highlight innovative semantic technologies being applied and deployed in practice. With the selection made by the program committee (Thanks!), I think we have definitely achieved that goal.

So if you’re coming to Sydney (& you should definitely be coming to Sydney) here’s what’s in store. (Papers are listed below.) You’ll see  a number of papers where semantic technologies are being deployed in companies to help end users including:

  • how semantic technologies are helping the BBC expose its archive to its journalists [1];
  • how OWL and RDF are being combined to give energy saving tips to 300,000 customers at EDF [2];
  • and how the search result pages in Yahoo! Search are being improved through the use of knowledge bases [3].


Dealing with streaming data has been a growing research theme in recent years. In the in-use track, we are seeing some of the fruits of that research, in particular with respect to monitoring city events. Balduini et al. report on the use of the Streaming Linked Data Framework for monitoring the London Olympic Games 2012 and Milano Design Week 2013. (Yes, the semantic web is fashionable) [4]. IBM will present its work on the real-time urban monitoring of Dublin, which requires solutions that both scale and offer low latency [5].

Life sciences

Semantic technologies have a long history of being deployed in healthcare and life sciences. We’ll see that again at this year’s conference. We get a progress report on the usage of these technologies in the development of the 11th revision of the International Classification of Diseases (ICD-11) [6]. ICD-11 involves 270 domain experts using the iCAT tool. We see how intermixing (plain-old) spreadsheets and semantic technologies is enabling systems biology to better share its data [7]. In the life sciences, and in particular in drug discovery, both public and private data are critical; we see how the Open PHACTS project is tackling the problem of intermixing such data [8].

Semantics for Science & Research

Continuing on the science theme, the track will have reports on improving the reliability of scientific workflows [9]; how linked data is being leveraged to understand the economic impact of R&D in Europe [10]; and how our community is “eating its own dogfood” to enable better scientometric analysis of journals [11]. Lastly, you’ll get a talk on the use of semantic annotations to help crowd source 3D representations of Greek pottery for cultural heritage (a paper that I just think is so cool – I hope for videos) [12].

Semantic Data Availability

Reasoning relies on the availability of data exposed with its associated semantics. We’ve seen how the Linking Open Data movement helped bootstrap the uptake of Semantic Web technologies. Likewise, the widespread deployment of RDFa and microformats has dramatically increased the amount of data available. But what’s out there? Bizer et al. give us a report based on analyzing 3 billion web pages. (I expect some awesome charts in this presentation) [13].

Enriching data with semantics has benefits but also comes at a cost. Based on a case study of converting the Norwegian Petroleum Directorate’s FactPages, we’ll get insight into those trade-offs [14]. Reducing the effort for such conversions, and particularly for interlinking, is a key challenge. The Cross-language Service Retrieve system is tackling this for open government data across multiple languages [15].

Finally, in practice, a key way to “semantize” data is through the use of natural language processing tools. You’ll see how semantic tech is facilitating the reusability and interoperability of NLP tools using the NIF 2.0 framework [16].


I hope you’ll agree that this really represents the best from the semantic web community. These 16 papers were selected from 79 submissions. The program committee (for the most part) did a great job both with their reviews and, importantly, the discussion. In many cases it was a hard decision, and the PC’s ability to discuss and revise their views was crucial in making the final selection. Thanks to the PC: it is a lot of work to do and we definitely asked them to do it in a fairly compact way. Thank you!

A couple of other thoughts: I think the decision to institute an abstract submission for the in-use track was a good one, and author rebuttals are more helpful than I thought they would be.

ISWC 2013 is going to be a fantastic conference. I’m looking forward to the location, the sessions and the community. I look forward to seeing you there. There are many ways to participate so check out 


  1. Yves Raimond, Michael Smethurst, Andrew McParland and Christopher Lowis. Using the past to explain the present: interlinking current affairs with archives via the Semantic Web
  2. Pierre Chaussecourte, Birte Glimm, Ian Horrocks, Boris Motik and Laurent Pierre. The Energy Management Adviser at EDF
  3. Roi Blanco, Berkant Barla Cambazoglu, Peter Mika and Nicolas Torzec. Entity recommendations in Web Search
  4. Marco Balduini, Emanuele Della Valle, Daniele Dell’Aglio, Themis Palpanas, Mikalai Tsytsarau and Cristian Confalonieri. Social listening of City Scale Events using the Streaming Linked Data Framework
  5. Simone Tallevi-Diotallevi, Spyros Kotoulas, Luca Foschini, Freddy Lecue and Antonio Corradi. Real-time Urban Monitoring in Dublin using Semantic and Stream Technologies
  6. Tania Tudorache, Csongor I Nyulas, Natasha F. Noy and Mark Musen. Using Semantic Web in ICD-11: Three Years Down the Road
  7. Katherine Wolstencroft, Stuart Owen, Olga Krebs, Quyen Nguyen, Jacky L. Snoep, Wolfgang Mueller and Carole Goble. Semantic Data and Models Sharing in Systems Biology: The Just Enough Results Model and the SEEK Platform
  8. Carole Goble, Alasdair J. G. Gray, Lee Harland, Karen Karapetyan, Antonis Loizou, Ivan Mikhailov, Yrjana Rankka, Stefan Senger, Valery Tkachenko, Antony Williams and Egon Willighagen. Incorporating Private and Commercial Data into an Open Linked Data Platform for Drug Discovery
  9. José Manuel Gómez-Pérez, Esteban García-Cuesta, Aleix Garrido and José Enrique Ruiz. When History Matters – Assessing Reliability for the Reuse of Scientific Workflows
  10. Amrapali Zaveri, Joao Ricardo Nickenig Vissoci, Cinzia Daraio and Ricardo Pietrobon. Using Linked Data to evaluate the impact of Research and Development in Europe: a Structural Equation Model
  11. Yingjie Hu, Krzysztof Janowicz, Grant Mckenzie, Kunal Sengupta and Pascal Hitzler. A Linked Data-driven Semantically-enabled Journal Portal for Scientometrics
  12. Chih-Hao Yu, Tudor Groza and Jane Hunter. Reasoning on crowd-sourced semantic annotations to facilitate cataloguing of 3D artefacts in the cultural heritage domain
  13. Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher and Johanna Völker. Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis
  14. Martin G. Skjæveland, Espen H. Lian and Ian Horrocks. Publishing the Norwegian Petroleum Directorate’s FactPages as Semantic Web Data
  15. Fedelucio Narducci, Matteo Palmonari and Giovanni Semeraro. Cross-language Semantic Retrieval and Linking of E-gov Services
  16. Sebastian Hellmann, Jens Lehmann, Sören Auer and Martin Brümmer. Integrating NLP using Linked Data