A couple of weeks ago, I was at the European Data Forum in Athens talking about the Open PHACTS project. You can find a video of my talk with slides here. Slides are embedded below.

It's been about a week since I got back from Australia, where I attended the International Semantic Web Conference (ISWC 2013). This is the premier forum for the latest research on using semantics on the Web. Overall, it was a great conference – both well run and with a good buzz. (Note: I'm probably a bit biased – I was chair of this year's In-Use track.)

ISWC is a fairly hard conference to get into and the quality is strong.

More importantly, almost all the talks I went to were worth thinking about. You can find the proceedings of the conference online either as a complete zip here or published by Springer. You can find more stats on the conference here.

As an aside, before digging into the meat of the conference – Sydney was great. Really a fantastic city – very cosmopolitan and with great coffee. I suggest Single Origin Roasters.  Also, Australia has wombats – wombats are like the chillest animal ever.

From my perspective, there were three main themes to take away from the conference:

  1. Impressive applications of semantic web technologies
  2. Core ontologies as the framework for connecting complex integration and retrieval tasks
  3. Starting to come to grips with messiness

Applications

We are really seeing how semantic technologies can power great applications. All three keynotes highlighted the use of semantic tech, and I think Ramanathan Guha's keynote highlighted this best in his discussion of the growth of schema.org.

Beyond the slide above, he brought representatives from Yandex, Yahoo, and Microsoft on stage to join Google in describing how they are using schema.org. Drupal and WordPress will have schema.org support in their cores in 2014. Schema.org is being used to drive everything from veteran-friendly job search, to rich pins on Pinterest, to letting OpenTable reservations be easily added to your calendar. So schema.org is clearly a success.
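
To make this concrete, here's a minimal sketch of the kind of schema.org markup involved: a JobPosting serialized as JSON-LD, the sort of block a site embeds in a page for search engines to pick up. All the values below are made up for illustration.

```python
import json

# A minimal schema.org JobPosting, serialized as JSON-LD. Embedding a
# <script type="application/ld+json"> block like this in a page is one
# of the ways search engines pick up schema.org data.
job_posting = {
    "@context": "https://schema.org",
    "@type": "JobPosting",
    "title": "Logistics Coordinator",
    "hiringOrganization": {"@type": "Organization", "name": "Example Corp"},
    "jobLocation": {
        "@type": "Place",
        "address": {"@type": "PostalAddress", "addressLocality": "Sydney"},
    },
    "employmentType": "FULL_TIME",
}

print(json.dumps(job_posting, indent=2))
```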

Peter Mika presented a paper on how Yahoo is using ontologies to drive entity recommendations in search. For example, you search for Brad Pitt and they show you related entities like Angelina Jolie or Fight Club. The nice thing about the paper is that it showed how the deployment in production (in Yahoo! Web Search in the US) increased click-through rates.

Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. International Semantic Web Conference (2) 2013: 33-48
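
Purely as a toy illustration of the task (the paper itself learns a ranking over knowledge-base features – this is not Yahoo's method), here's a sketch that recommends related entities by co-occurrence counts over some hypothetical entity sets:

```python
from collections import Counter

# Toy related-entity ranking: count how often candidate entities co-occur
# with the queried entity across a corpus of entity sets (e.g. sessions).
# NOT the paper's method; it just illustrates the input/output shape of
# entity recommendation. Data is invented.
sessions = [
    {"Brad Pitt", "Angelina Jolie", "Fight Club"},
    {"Brad Pitt", "Fight Club", "Edward Norton"},
    {"Brad Pitt", "Angelina Jolie"},
]

def recommend(entity, k=3):
    cooc = Counter()
    for s in sessions:
        if entity in s:
            cooc.update(s - {entity})
    return cooc.most_common(k)

print(recommend("Brad Pitt"))
# [('Angelina Jolie', 2), ('Fight Club', 2), ('Edward Norton', 1)]
```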

I think it was probably Yves Raimond's conference – he showed some amazing things being done at the BBC using semantic web technology. He gave an excellent keynote at the COLD workshop, also highlighting some of the areas where we need to improve to ease the use of these technologies in production. I recommend you check out the slides above. Of all the applications, their work on mining the BBC World Service archive to enrich content as it is being created stood out; it won the Semantic Web Challenge.

In the biomedical domain, there were two papers showing how semantics can be embedded in tools that regular users use. One showed how the development of ICD-11 (ICD is the most widely used clinical classification, developed by the WHO) is supported using semantic technology. The other I liked was the use of Excel templates (developed using RightField) that transparently capture data according to a domain model for systems biology.

Also in the biomedical domain, IBM presented an approach at the Semantic Web Challenge that uses semantic web technologies to help coordinate health and social care.

Finally, there was a neat application presented by Jane Hunter applying these technologies to art preservation: The Twentieth Century in Paint.

I did a review of all the in-use papers in the run-up to the conference, but suffice it to say that there were numerous impressive applications. Also, I think it says something about the health of the community when you see slides like this:

Core Ontologies + Other Methods

There were a number of interesting papers built around the idea of combining well-known ontologies with record linkage or other machine learning methods to populate knowledge bases.

A paper that I liked a lot (and that also won the Best Student Paper award), Knowledge Graph Identification (by Jay Pujara, Hui Miao, Lise Getoor and William Cohen), sums it up nicely:

Our approach, knowledge graph identification (KGI) combines the tasks of entity resolution, collective classification and link prediction mediated by rules based on ontological information.
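
As a toy sketch of what "mediated by rules based on ontological information" can mean in practice (the paper itself jointly solves these tasks with probabilistic soft logic – this is only the flavour), consider pruning candidate sameAs links between entities whose types are declared disjoint. All names and scores are invented.

```python
# Toy ontology-mediated entity resolution: candidate owl:sameAs links are
# pruned when the linked entities have disjoint types. Illustrative only.
types = {
    "ex:Paris_FR": "City",
    "ex:Paris_Hilton": "Person",
    "ex:Paris_TX": "City",
}
disjoint = {("City", "Person"), ("Person", "City")}

candidate_same_as = [
    ("ex:Paris_FR", "ex:Paris_TX", 0.7),
    ("ex:Paris_FR", "ex:Paris_Hilton", 0.6),  # name match, wrong type
]

accepted = [
    (a, b, conf)
    for a, b, conf in candidate_same_as
    if (types[a], types[b]) not in disjoint
]
print(accepted)  # only the City/City pair survives
```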

A number of other interesting papers fell under this theme as well.

From my perspective, it was also nice to see the W3C Provenance Model (PROV) used as one of these core ontologies in many different papers and in two of the keynotes. People are using it as a substructure for a number of different applications – I intend to write a whole post on this, but until then, here's proof by Twitter:
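
For readers who haven't used PROV, here's a minimal sketch of asserting PROV statements with rdflib. The prov: namespace is the real W3C one; the ex: resources are invented for illustration.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

# A minimal PROV assertion: a chart entity generated by an activity that
# used a dataset entity, so the chart was derived from the dataset.
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)
g.add((EX.chart, RDF.type, PROV.Entity))
g.add((EX.dataset, RDF.type, PROV.Entity))
g.add((EX.plotting, RDF.type, PROV.Activity))
g.add((EX.chart, PROV.wasGeneratedBy, EX.plotting))
g.add((EX.plotting, PROV.used, EX.dataset))
g.add((EX.chart, PROV.wasDerivedFrom, EX.dataset))

print(g.serialize(format="turtle"))
```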

Coming to grips with messiness

It's pretty evident that when dealing with the web, things are messy. There were a couple of papers that documented this empirically, either in terms of the availability of endpoints or the heterogeneity of the markup available in web pages.

In some sense, the papers mentioned under the prior theme also try to deal with this messiness. Here are another couple of papers looking at how to deal with, or even make use of, this messiness.

One thing that seemed a lot more present at this year's conference than last year's was the term entity. This is obviously popular because of things like the Google Knowledge Graph, but in some sense it may give a better description of what we are aiming to get out of the data we have: machine-readable descriptions of real-world concepts/things.

Misc.

There are some things that are of interest that don’t fit neatly into the themes above. So I’ll just try a bulleted list.

  • We won the Best Demo Paper Award for git2prov.org
  • Our paper on using NoSQL stores for RDF went over very well. Congrats to Marcin for giving a good presentation.
  • The format of mixing talks from different tracks by topic and having only 20 minutes per talk was great.
  • VUA had a great showing – 3 main track papers, a bunch of workshop papers, a couple of different posters, 4 workshop organizers giving talks at the workshop summary session, 2 organizing committee members, alumni all over the place, plus a bunch of stuff I probably forgot to mention.
  • The colocation with Web Directions South was great – it added a nice extra energy to the conference.
  • Best Reviewer awards went to Oscar Corcho, Tania Tudorache, and Aidan Hogan.
  • Peter Fox seemed to give a keynote just for me – concept maps and PROV, followed by abductive reasoning.
  • Did I mention that the coffee in Sydney (and Newcastle) is really good and lots of places serve a proper breakfast?

Yesterday, Luc (my coauthor) and I received our physical copies of Provenance: An Introduction to PROV in the mail. Even though the book is primarily designed to be distributed digitally, it's always great actually holding a copy in your hands. You can now order your own physical copy on Amazon; the Amazon page also lets you look inside the book.

(Photos: copies of the book on a shelf, and the PROV book cover)


Cross-posted from blog.provbook.org

In April, we launched the Open PHACTS Discovery Platform with a corresponding API, allowing developers to create drug discovery applications without having to worry about the complexities and pain of integrating multiple databases. Some great applications have already been built on top of this API. If you're a developer in this space, I encourage you to take a look and see what you can create. Below are a slide set and a webinar about getting started with the API. You can also check out https://dev.openphacts.org for developer documentation and to get an account.
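
As a rough sketch of what calling the API looks like from Python: the base URL, path, and parameter names below are from memory and purely illustrative – treat https://dev.openphacts.org as the authority for the real ones.

```python
import requests

# Sketch of calling the Open PHACTS API with an app_id/app_key pair
# (the credential scheme used at the time). Endpoint path, version and
# parameters are ASSUMED for illustration -- check dev.openphacts.org.
BASE = "https://beta.openphacts.org/1.3"  # assumed host/version

params = {
    "uri": "http://www.conceptwiki.org/concept/example",  # a compound URI
    "app_id": "YOUR_APP_ID",
    "app_key": "YOUR_APP_KEY",
    "_format": "json",
}
resp = requests.get(f"{BASE}/compound", params=params, timeout=30)
resp.raise_for_status()
print(resp.json())
```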

This past week we (Achille Fokoue & myself) sent the paper notifications for the 2013 International Semantic Web Conference’s In-Use Track. The track seeks to highlight innovative semantic technologies being applied and deployed in practice. With the selection made by the program committee (Thanks!), I think we have definitely achieved that goal.

So if you're coming to Sydney (& you should definitely be coming to Sydney), here's what's in store. (Papers are listed below.) You'll see a number of papers where semantic technologies are being deployed in companies to help end users, including:

  • how semantic technologies are helping the BBC expose its archive to its journalists [1];
  • how OWL and RDF are being combined to give energy-saving tips to 300,000 customers at EDF [2];
  • and how the search result pages in Yahoo! Search are being improved through the use of knowledge bases [3].

Streaming

Dealing with streaming data has been a growing research theme in recent years. In the In-Use track, we are seeing some of the fruits of that research, in particular with respect to monitoring city events. Balduini et al. report on the use of the Streaming Linked Data Framework for monitoring the London 2012 Olympic Games and Milano Design Week 2013 (yes, the semantic web is fashionable) [4]. IBM will present its work on the real-time urban monitoring of Dublin, which requires both scale and low-latency solutions [5].

Life sciences

Semantic technologies have a long history of being deployed in healthcare and the life sciences. We'll see that again at this year's conference. We get a progress report on the usage of these technologies in the development of the 11th revision of the International Classification of Diseases (ICD-11), which involves 270 domain experts using the iCAT tool [6]. We see how the intermixing of (plain old) spreadsheets and semantic technologies is enabling systems biology to better share its data [7]. In the life sciences, and in particular in drug discovery, both public and private data are critical; we see how the Open PHACTS project is tackling the problem of intermixing such data [8].

Semantics for Science & Research

Continuing on the science theme, the track will have reports on improving the reliability of scientific workflows [9], how linked data is being leveraged to understand the economic impact of R&D in Europe [10], and how our community is "eating its own dogfood" to enable better scientometric analysis of journals [11]. Lastly, you'll get a talk on the use of semantic annotations to help crowdsource 3D representations of Greek pottery for cultural heritage (a paper that I just think is so cool – I hope for videos) [12].

Semantic Data Availability

Reasoning relies on the availability of data exposed with its associated semantics. We've seen how the Linking Open Data movement helped bootstrap the uptake of Semantic Web technologies. Likewise, the widespread deployment of RDFa and microformats has dramatically increased the amount of data available. But what's out there? Bizer et al. give us a report based on analyzing 3 billion web pages (I expect some awesome charts in this presentation) [13].
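
To get a feel for the kind of markup such a crawl harvests, here's a small sketch that pulls JSON-LD blocks out of a page. A real analysis (and libraries like extruct) would also handle RDFa and microdata; this is just the simplest case, with an invented page.

```python
import json
import re

# Quick-and-dirty peek at the schema.org/JSON-LD markup a page carries:
# extract <script type="application/ld+json"> blocks and parse them.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Pavlova"}
</script>
</head><body>...</body></html>
"""

pattern = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)
for block in pattern.findall(html):
    data = json.loads(block)
    print(data.get("@type"), "->", data.get("name"))
# Recipe -> Pavlova
```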

Enriching data with semantics has benefits but also comes at a cost. Based on a case study of converting the Norwegian Petroleum Directorate's FactPages, we'll get insight into those trade-offs [14]. Reducing the effort for such conversions, and particularly for interlinking, is a key challenge; the Cross-language Semantic Retrieval system is tackling this for open government data across multiple languages [15].

Finally, in practice, a key way to "semantize" data is through the use of natural language processing tools. You'll see how semantic tech is facilitating the reusability and interoperability of NLP tools using the NIF 2.0 framework [16].
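
Here's a minimal sketch of what a NIF-style annotation looks like, built with rdflib: a document context plus one substring annotated as referring to a DBpedia entity. The nif: and itsrdf: namespaces are the published ones as I recall them – double-check against the NIF spec before relying on this.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# NIF annotates substrings of a text via offset-based URIs; itsrdf links
# a mention to the entity it refers to. Document URI is made up.
NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")

text = "Wombats live in Australia."
doc = URIRef("http://example.org/doc#char=0,26")
mention = URIRef("http://example.org/doc#char=16,25")

g = Graph()
g.add((doc, RDF.type, NIF.Context))
g.add((doc, NIF.isString, Literal(text)))
g.add((mention, RDF.type, NIF.String))
g.add((mention, NIF.referenceContext, doc))
g.add((mention, NIF.anchorOf, Literal("Australia")))
g.add((mention, NIF.beginIndex, Literal(16, datatype=XSD.nonNegativeInteger)))
g.add((mention, NIF.endIndex, Literal(25, datatype=XSD.nonNegativeInteger)))
g.add((mention, ITSRDF.taIdentRef, URIRef("http://dbpedia.org/resource/Australia")))

print(g.serialize(format="turtle"))
```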

Conclusion

I hope you'll agree that this really represents the best from the semantic web community. These 16 papers were selected from 79 submissions. The program committee (for the most part) did a great job, both with their reviews and, importantly, with the discussion. In many cases it was a hard decision, and the PC's ability to discuss and revise their views was crucial in making the final selection. It is a lot of work, and we asked them to do it on a fairly compact timeline – thank you!

A couple of other thoughts: I think the decision to institute an abstract submission for the In-Use track was a good one, and author rebuttals were more helpful than I thought they would be.

ISWC 2013 is going to be a fantastic conference. I’m looking forward to the location, the sessions and the community. I look forward to seeing you there. There are many ways to participate so check out http://iswc2013.semanticweb.org. 

Papers

  1. Yves Raimond, Michael Smethurst, Andrew McParland and Christopher Lowis. Using the past to explain the present: interlinking current affairs with archives via the Semantic Web
  2. Pierre Chaussecourte, Birte Glimm, Ian Horrocks, Boris Motik and Laurent Pierre. The Energy Management Adviser at EDF
  3. Roi Blanco, Berkant Barla Cambazoglu, Peter Mika and Nicolas Torzec. Entity recommendations in Web Search
  4. Marco Balduini, Emanuele Della Valle, Daniele Dell’Aglio, Themis Palpanas, Mikalai Tsytsarau and Cristian Confalonieri. Social listening of City Scale Events using the Streaming Linked Data Framework
  5. Simone Tallevi-Diotallevi, Spyros Kotoulas, Luca Foschini, Freddy Lecue and Antonio Corradi. Real-time Urban Monitoring in Dublin using Semantic and Stream Technologies
  6. Tania Tudorache, Csongor I Nyulas, Natasha F. Noy and Mark Musen. Using Semantic Web in ICD-11: Three Years Down the Road
  7. Katherine Wolstencroft, Stuart Owen, Olga Krebs, Quyen Nguyen, Jacky L. Snoep, Wolfgang Mueller and Carole Goble. Semantic Data and Models Sharing in Systems Biology: The Just Enough Results Model and the SEEK Platform
  8. Carole Goble, Alasdair J. G. Gray, Lee Harland, Karen Karapetyan, Antonis Loizou, Ivan Mikhailov, Yrjana Rankka, Stefan Senger, Valery Tkachenko, Antony Williams and Egon Willighagen. Incorporating Private and Commercial Data into an Open Linked Data Platform for Drug Discovery
  9. José Manuel Gómez-Pérez, Esteban García-Cuesta, Aleix Garrido and José Enrique Ruiz. When History Matters – Assessing Reliability for the Reuse of Scientific Workflows
  10. Amrapali Zaveri, Joao Ricardo Nickenig Vissoci, Cinzia Daraio and Ricardo Pietrobon. Using Linked Data to evaluate the impact of Research and Development in Europe: a Structural Equation Model
  11. Yingjie Hu, Krzysztof Janowicz, Grant Mckenzie, Kunal Sengupta and Pascal Hitzler. A Linked Data-driven Semantically-enabled Journal Portal for Scientometrics
  12. Chih-Hao Yu, Tudor Groza and Jane Hunter. Reasoning on crowd-sourced semantic annotations to facilitate cataloguing of 3D artefacts in the cultural heritage domain
  13. Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher and Johanna Völker. Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis
  14. Martin G. Skjæveland, Espen H. Lian and Ian Horrocks. Publishing the Norwegian Petroleum Directorate’s FactPages as Semantic Web Data
  15. Fedelucio Narducci, Matteo Palmonari and Giovanni Semeraro. Cross-language Semantic Retrieval and Linking of E-gov Services
  16. Sebastian Hellmann, Jens Lehmann, Sören Auer and Martin Brümmer. Integrating NLP using Linked Data

Since I moved to Europe, I've been attending ESWC (the Extended/European Semantic Web Conference), and I always get something out of the event. There are plenty of familiar faces, but also quite a few new people, and it's a great environment for having chats. In addition, the quality of the content is always quite good. This year the event was held in Montpellier and was, for the most part, well organized: the main conference wifi worked!

The stats:

  • 300 participants
  • 42 accepted papers from 162 submissions
  • 26% acceptance rate
  • 11 workshops + 7 tutorials

So what was I doing there:

The VU Semantic Web group also had a strong showing:

  • Albert Meroño-Peñuela won the best PhD symposium paper for his work on digital humanities and the semantic web.
  • Datasets from the USEWOD workshop (led by Laura Hollink) were used by a number of main track papers for evaluation.
  • Stefan Schlobach and Laura Hollink were on the organizing committee, and we organized a couple of workshops & tutorials.
  • Posters/Demos:
    • Albert Meroño-Peñuela, Rinke Hoekstra, Andrea Scharnhorst, Christophe Guéret and Ashkan Ashkpour. Longitudinal Queries over Linked Census Data.
    • Niels Ockeloen, Victor de Boer and Lora Aroyo. LDtogo: A Data Querying and Mapping Framework for Linked Data Applications.
  • Several workshop papers.

I’ll try to pull out what I thought were the highlights of the event.

What is a semantic web application?

(Keynote slide: "Can you escape Frank?")

The keynotes from Enrico Motta and David Karger focused on trying to define what a Semantic Web application is. This starts with the question of whether a Semantic Web application needs to use the Semantic Web set of standards (e.g. RDF, OWL, etc.). From my perspective, the answer is no. These standards are great infrastructure for building these applications, but are they necessary? No (see the Google Knowledge Graph). Then what is a Semantic Web application?

From what I could gather, Motta would define it as an application that is scalable, uses the web, and embraces model-theoretic semantics. For me that's rather limiting: there are many other semantics that may be appropriate – we can ground meaning in something other than model theory. I think a good example of this is the work on Pragmatic Semantics that my colleague Stefan Schlobach presented at the Artificial Intelligence meets the Semantic Web workshop. Or we can reach back into AI and the discussion in Brooks' classic paper Elephants Don't Play Chess. I felt that Karger's definition (in what was a great keynote) was getting somewhere. He defined a semantic web application essentially as:

An application whose schema is expected to change.

This seems to me to capture the semantic portion of the definition, in the sense that the semantics need to be understood on the fly. However, I think we need to roll the web back into this definition… Overall, I thought this discussion was worth having, and it helps the field define what it is that we are aiming at. To be continued…
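
One way to read that definition in code – a toy of my own, not a position from either keynote – is to store records as property bags (in spirit, triples), so that attributes nobody anticipated need no schema migration:

```python
from collections import defaultdict

# Toy schema-flexible store: subject -> {property: value}. New properties
# can appear at any time without altering any schema definition.
store = defaultdict(dict)

def add(subject, prop, value):
    store[subject][prop] = value

add("talk:42", "title", "Homebrew databases")
add("talk:42", "speaker", "David Karger")
# Later, a property nobody anticipated -- no migration needed:
add("talk:42", "wifiWorked", True)

print(store["talk:42"])
```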

Homebrew databases

As I said, I thought Karger’s keynote was great. He gave a talk within a talk, on the subject of homebrew databases from this paper in CHI 2011:

Amy Voida, Ellie Harmon, and Ban Al-Ani. 2011. Homebrew databases: complexities of everyday information management in nonprofit organizations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11). ACM, New York, NY, USA, 915-924. DOI=10.1145/1978942.1979078 http://doi.acm.org/10.1145/1978942.1979078

They define a homebrew database as "an assemblage of information management resources that people have pieced together to satisfice their information management needs." This is just what we see all the time: the combination of Excel, Word, email, databases, and (don't forget) plain paper brought together to attack information management problems. A number of our use cases from the pharma industry, as well as from science, reflect essentially this practice. It's great to see a good definition of this problem grounded in ethnographic studies.

The Concerns of Linking

There were a couple of good papers on generating linkage across datasets (the central point of linked data). In Open PHACTS, we've been dealing with the notion of essentially context-dependent linkages, and I think this notion is becoming more prevalent in the community; we had a lot of positive response on this when presenting Open PHACTS at the poster session. Probably my favorite paper was on linking the Smithsonian American Art Museum to the Linked Data cloud. They use PROV to drive their link generation: essentially, proposing links to humans, who then verify the connections.
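
As a toy version of that propose-then-verify pattern (made-up names and thresholds; the actual work records provenance with PROV proper), candidate links can be generated by string similarity, each carrying a record of how it was produced for a curator to review:

```python
import difflib

# Generate candidate sameAs links by string similarity; each candidate
# carries a small provenance record and awaits human verification.
museum = ["Georgia O'Keeffe", "Winslow Homer"]
dbpedia = ["Georgia O'Keeffe", "Winslow Homer", "Homer Simpson"]

candidates = []
for a in museum:
    for b in dbpedia:
        score = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score > 0.85:
            candidates.append({
                "source": a,
                "target": b,
                "score": round(score, 2),
                "prov": {"activity": "string-match", "verified": False},
            })

for c in candidates:
    print(c)  # a curator would flip "verified" after review
```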

I also liked the following paper on which hardware environment you should use when doing link discovery. Result: use GPUs – they're fast!

Additionally, I think the following paper is cool because they use network statistics not just to measure but to do something, namely create links.

APIs

APIs were a growing theme of the event, with things like the Linked Data Platform working group and the successful SALAD workshop (fantastic acronym), although I was surprised people in the workshop hadn't heard of the Linked Data API. We had a lot of good feedback on the Open PHACTS API. It's just the case that there is more developer expertise in using web service APIs than semweb tech. I've actually seen a lot of demand for semweb skills, and while we're doing our best to train people, there is still a gap. It's good, then, that we are thinking about how these two technologies can play together nicely.

Random Notes

In the context of the Open PHACTS and Linked Data Benchmark Council projects, Antonis Loizou and I have been looking at how to write better SPARQL queries. In the Open PHACTS project, we've been writing super complicated queries to integrate multiple data sources, and from experience we realized that different combinations and factors can dramatically impact performance. With this experience, we decided to do something more systematic and test how the different techniques we came up with map to database theory and work in practice. We just submitted a paper on the outcome for review. You can find a preprint (On the Formulation of Performant SPARQL Queries) on arxiv.org at http://arxiv.org/abs/1304.0567. The abstract is below; the fancy graphs are in the paper.

But if you're just looking for ways to write better queries, here are the main rules-of-thumb that we found (a short sketch illustrating two of them follows the list).

  1. Minimise optional triple patterns: Reduce the number of optional triple patterns by identifying those triple patterns for a given query that will always be bound, using dataset statistics.
  2. Localise SPARQL subpatterns: Use named graphs to specify the subset of triples in a dataset that portions of a query should be evaluated against.
  3. Replace connected triple patterns: Use property paths to replace connected triple patterns where the object of one triple pattern is the subject of another.
  4. Reduce the effects of cartesian products: Use aggregates to reduce the size of solution sequences.
  5. Specify alternative URIs: Consider different ways of specifying alternative URIs beyond UNION.
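
Here's the sketch promised above: two of the heuristics shown as before/after SPARQL snippets. These are toy queries for illustration, not the Open PHACTS production queries from the paper.

```python
# Heuristics 2 and 3 as illustrative SPARQL strings.
PREFIXES = "PREFIX ex: <http://example.org/>\n"

# Heuristic 3: replace connected triple patterns (the object of one
# pattern is the subject of the next) with a property path.
before = PREFIXES + """
SELECT ?grandparent WHERE {
  ?grandparent ex:hasChild ?parent .
  ?parent ex:hasChild ?person .
}"""
after = PREFIXES + """
SELECT ?grandparent WHERE {
  ?grandparent ex:hasChild/ex:hasChild ?person .
}"""

# Heuristic 2: localise a subpattern to a named graph so the store only
# evaluates it against the relevant subset of triples.
localised = PREFIXES + """
SELECT ?compound ?assay WHERE {
  GRAPH <http://example.org/graphs/assays> {
    ?compound ex:testedIn ?assay .
  }
}"""

for q in (before, after, localised):
    print(q, end="\n\n")
```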

Finally, one thing we did learn: test, test, test. The performance of the same query can vary dramatically across stores.
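
A minimal harness for that kind of testing might look like the following; the endpoint URLs are placeholders for whatever stores you're comparing.

```python
import time

from SPARQLWrapper import SPARQLWrapper, JSON

# Time the same query against several SPARQL endpoints. The URLs below
# are placeholders -- point them at your own store installations.
ENDPOINTS = [
    "http://localhost:8890/sparql",    # e.g. a Virtuoso instance
    "http://localhost:3030/ds/query",  # e.g. a Fuseki instance
]
QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

for url in ENDPOINTS:
    sparql = SPARQLWrapper(url)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    start = time.perf_counter()
    results = sparql.query().convert()
    elapsed = time.perf_counter() - start
    n = results["results"]["bindings"][0]["n"]["value"]
    print(f"{url}: {n} triples in {elapsed:.2f}s")
```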

Title: On the Formulation of Performant SPARQL Queries
Authors: Antonis Loizou and Paul Groth

The combination of the flexibility of RDF and the expressiveness of SPARQL provides a powerful mechanism to model, integrate and query data. However, these properties also mean that it is nontrivial to write performant SPARQL queries. Indeed, it is quite easy to create queries that tax even the most optimised triple stores. Currently, application developers have little concrete guidance on how to write "good" queries. The goal of this paper is to begin to bridge this gap. It describes 5 heuristics that can be applied to create optimised queries. The heuristics are informed by formal results in the literature on the semantics and complexity of evaluating SPARQL queries, which ensures that queries following these rules can be optimised effectively by an underlying RDF store. Moreover, we empirically verify the efficacy of the heuristics using a set of openly available datasets and corresponding SPARQL queries developed by a large pharmacology data integration project. The experimental results show improvements in performance across 6 state-of-the-art RDF stores.
