Archive

Tag Archives: #iswc2015

Last week, I hung out in Bethlehem, Pennsylvania for the the 14th International Semantic Web Conference. Bethlehem is famous for the Lehigh University Benchmark  (LUBM) and Bethlehem Steel. This is the major conference focused on the intersection of semantics and web technologies. In addition to being technically super cool, it was a great chance for me to meet many friends and make some new ones.

Let’s begin with some stats:

  • ~450 attendees
  • The conference continues to be selective:
    • Research track: 22% acceptance rate
    • Empirical studies track: 29% acceptance rate
    • In-use track: 40% acceptance rate
    • Datasets and Ontologies: 22% acceptance rate
  • There were 265 submissions across all tracks which is surprisingly the same number as last year.
  • More stats and info in Stefan’s slides (e.g. move to Portugal if you want to get your papers in the conference.)
  • Fancy visualizations courtesy of the STKO group

Before getting into what I thought were the major themes of the conference, a brief note. Reviewing is at the heart of any academic conference. While we can always try and improve review quality, it’s worth calling out good reviewing. The best reviewers were Maribel Acosta (research) and Markus Krötzsch (applied). As data sets and ontologies track co-chair, I can attest to how important good reviewers are.  For this new track we relied heavily on reviewers being flexible and looking at these sorts of contributions differently. So thanks to them!

For me there were three themes of ISWC:

  1. The Spectrum of Entity Resolution
  2. The Spectrum of Linked Data Querying
  3. Buy more RAM

The Spectrum of Entity Resolution

Maybe its because I attended the NLP & DBpedia workshop or the conversation I had about string similarity with Michelle Cheatham, but one theme that I saw was the continued amalgamation of natural language processing (NLP) style entity resolution with database entity resolution (i.e. record linkage). This movement stems from the fact that an increasing amount of linked data is a combination of data extracted from semi-structured sources as well as from NLP. But in addition to that, NLP sources rely on some of these semi-structured datasources to do NLP.

Probably, the best example of that idea is the work that Andrew McCallum presented in his keynote on “epistemlogical knowledge bases”.

Briefly, the idea is to reason with all the information coming from both basic low level NLP (e.g. basic NER, or even surface forms) as well as the knowledge base jointly (plus, anything else) to generate a knowledge base.  One method to do this is universal schemas. For a good intro, check out Sebastien Riedel’s slides.

From McCallum, I like the following papers which gives a good justification and results of doing collective/joint inference.

(Self promotion aside: check out Sara Magliacane’s work on Probabilistic Soft Logics for another way of doing joint inference.)

Following on from this notion of reasoning jointly, Hulpus, Prangnawarat and Hayes showed how to use the graph-based structure of linked data to to perform joint entity and word sense disambiguation from text. Likewise, Prokofyev et al. use the properties of a knowledge graph to perform better co-reference resolution. Essentially, they use this background knowledge to split the clusters of co-referrent entities produced by Stanford CoreNLP. On the same idea, but for more structured data, the TableEL system uses a joint model with soft constraints to perform entity linking for web tables, improving performance by up-to 75% on web tables. (code & data)

One approach to entity linking that I liked was from the Raphael Troncy’s crew titled “Reveal Entities From Texts With a Hybrid Approach” (paper, slides). (Shouldn’t it be “Revealing..”?). They showed that by using essentially the provenance of the data sources they are able to build an adaptive entity linking pipeline. Thus, one doesn’t necessarily have to do as much domain tuning to use these pipelines.

While not specifically about entity resolution, a paper worth pointing out is Type-Constrained Representation Learning in Knowledge Graphs from Denis Krompaß, Stephan Baier and Volker Tresp. They show how background knowledge about entity types can help improve link prediction tasks for generating knowledge graphs. Again, use the kitchen sink and you’ll perform better.

There were a couple of good resources presented for entity resolution tasks.  Bryl, Bizer and Paulheim produced a dataset of surface forms for dbpedia entities. They were able to boost performance up to 20% for extracting accurate surface forms for entities through filtering. Another tool, LANCE looks great for systematically generating benchmark and test sets for instance matching (i.e. entity linking). Also, Michel Dumontier presented work that had a benchmark for entity linking from the life sciences domain.

Finally, as we get better at entity resolution, I think people will turn towards fusion (getting the best possible representation for a real world entity). Examples include:

The Spectrum of Linked Data Querying

So Linked Data Fragments from Ruben Verborgh was the huge breakout of the conference. Oscar Corcho’s excellent COLD keynote was a riff off thinking about the spectrum (from data dumps through to full sparql queries) that was introduced by Reuben. Another example was the work of Maribel Acosta and Maria-Esther Vidal on “Networks of Linked Data Eddies: An Adaptive Web Query Processing Engine for RDF Data”. They developed an adaptive client side spraql query engine for linked data fragments. This allows the server side to support a much simpler API by having a more intelligent client side. (An aside, kids this is how a technical talk should be done. Precise, clean, technical, understandable. Can’t wait to have the the video lecture for reference.)

Even the most centralized solution, the LODLaundromat which is a clean crawl of the entire web of data supports Linked Data Fragments. In some sense, by asking the server to do less you can handle more linked data, and thus do more powerful analysis. This is exemplified by the best paper LODLab byLaurens Rietveld, Wouter Beek, and Stefan Schlobach, which allowed for the reproduction of 3 existing analysis of the web of data at scale.

I think Olaf Hartig, in his paper on LDQL, framed the problem best as (N, Q) (slides). First define the “crawl” of the web you want to query (N)  and then define the query (Q). When we think about what and where are crawls are, we can think about what execution strategies and types of queries we can best support. Or put another way:

More Main Memory = better Triple Stores

Designing scalable graph / triple stores has always been a challenge. We’ve been trapped by the limits of RAM. But computer architecture is changing, and we now have systems that have a lot of main memory either in one machine or across multiple machines. This is a boon to triple stores and graph processing in general. See for example Leskovec team’s work from SIGMOD:

We saw that theme at ISWC as well:

Moral of the story: Buy RAM

Conclusion

This years conference explored the many spectra of the combination of the web and semantics. I liked the mix of methods used by papers and the range of practical (the industry session was packed) to theoretical results. I also think the community is no longer hemmed in by the standards but are using them as solid starting point. This was pointed out by Ian Horrocks in his keynote:
Additionally, this flexibility was exemplified by the best applied paper, “Building and Using a Knowledge Graph to Combat Human Trafficking” by  Pedro Szekely et al.. They used the parts of the semantic web stack that helped (like ontologies and JSON-LD) but used elastic search for storage to create a vital and important solution to a real challenging problem.
Overall, this was an excellent conference.  Next year’s conference is in Kobe, I hope you submit some great papers and I’ll seen you there!

Random Thoughts

Advertisements

Next week is the 2015 International Semantic Web Conference. I had the opportunity with the Michel Dumontier to chair a new track on Datasets and Ontologies. A key part of of the Semantic Web has always been shared resources, whether it’s common standards through the W3C or open datasets like those found in the LOD cloud. Indeed, one of the major successes of our community is the availability of these resources.

ISWC over the years has experimented with different ways of highlighting these contributions and bringing them into the scientific literature. For the past couple of years, we have had an evaluation track specifically devoted to reproducibility and evaluation studies. Last year datasets were included to form a larger RDBS track. This year we again have a specific Empirical Studies and Evaluation track along side the Data & Ontologies track.

The reviewers had a tough job for this track. First, it was new so it’s hard to make a standard judgment. Secondly, we asked reviewers not only to review the paper but the resource itself along a number of dimensions. Overall, I think they did a good job. Below you’ll find the resources chosen for presentation at the conference and a brief headline of what to me is interesting about the paper. In the spirt of the track, I link to the resource as well as the paper.

Datasets

  •  Automatic Curation of Clinical Trials Data in LinkedCT by Oktie Hassanzadeh and Renée J Miller (paper) – clinicaltrials.gov published as linked data in an open and queryable. This resource has been around since 2008. I love the fact that they post downtime and other status info on twitter https://twitter.com/linkedct
  • LSQ: Linked SPARQL Queries Dataset by Muhammad Saleem, Muhammad Intizar Ali, Qaiser Mehmood, Aidan Hogan and Axel-Cyrille Ngonga Ngomo (paper). – Query logs are becoming an ever more important resource from everything from search engines to database query optimization. See for example USEWOD. This resource provides queryable versions in SPARQL of the query logs from several major datasets including dbpedia and linked geo data.
  • Provenance-Centered Dataset of Drug-Drug Interactions by Juan Banda, Tobias Kuhn, Nigam Shah and Michel Dumontier (paper) – this resources provides aggregated set of drug-drug interactions coming from 8 different sources. I like how they provided a doi for the bulk download of their datasource as well as spraql endpoint. It also uses nanopublications as the representation format.
  • Semantic Bridges for Biodiversity Science by Natalia Villanueva-Rosales, Nicholas Del Rio, Deana Pennington and Luis Garnica Chavira (paper) – this resource allows biodiversity scientist to work with species distribution models. The interesting thing about this resource is that it not only provides linked data, a spraql endpoint and ontologies but also semantic web services (i.e. SADI) for orchestrating these models.
  • DBpedia Commons: Structured Multimedia Metadata for Wikimedia Commons by Gaurav Vaidya, Dimitris Kontokostas, Magnus Knuth, Jens Lehmann and Sebastian Hellmann  (paper) – this is another chapter in exposing wikimedia content as structured data. This resource provides structured information for the media content in Wikimedia commons. Now you can spraql for all images with a CC-by-sa v2.0 license.

Ontologies

Overall, I think this is a good representation of the plethora of deep datasets and ontologies that the community is creating.  Take a minute and check out these new resources.

%d bloggers like this: