Last week, I hung out in Bethlehem, Pennsylvania for the 14th International Semantic Web Conference (ISWC). Bethlehem is famous for the Lehigh University Benchmark (LUBM) and Bethlehem Steel. ISWC is the major conference focused on the intersection of semantics and web technologies. In addition to being technically super cool, it was a great chance for me to meet many friends and make some new ones.
Let’s begin with some stats:
- ~450 attendees
- The conference continues to be selective:
- Research track: 22% acceptance rate
- Empirical studies track: 29% acceptance rate
- In-use track: 40% acceptance rate
- Datasets and Ontologies: 22% acceptance rate
- There were 265 submissions across all tracks, which is, surprisingly, the same number as last year.
- More stats and info in Stefan’s slides (e.g. move to Portugal if you want to get your papers in the conference.)
- Fancy visualizations courtesy of the STKO group
Before getting into what I thought were the major themes of the conference, a brief note. Reviewing is at the heart of any academic conference. While we can always try and improve review quality, it’s worth calling out good reviewing. The best reviewers were Maribel Acosta (research) and Markus Krötzsch (applied). As data sets and ontologies track co-chair, I can attest to how important good reviewers are. For this new track we relied heavily on reviewers being flexible and looking at these sorts of contributions differently. So thanks to them!
For me there were three themes of ISWC:
- The Spectrum of Entity Resolution
- The Spectrum of Linked Data Querying
- Buy more RAM
The Spectrum of Entity Resolution
Maybe it’s because I attended the NLP & DBpedia workshop or the conversation I had about string similarity with Michelle Cheatham, but one theme that I saw was the continued amalgamation of natural language processing (NLP) style entity resolution with database entity resolution (i.e. record linkage). This movement stems from the fact that an increasing amount of linked data is a combination of data extracted from semi-structured sources as well as from NLP. But in addition to that, NLP sources rely on some of these semi-structured data sources to do NLP.
Probably the best example of that idea is the work that Andrew McCallum presented in his keynote on “epistemological knowledge bases”.
Briefly, the idea is to reason jointly over all the information coming from basic low-level NLP (e.g. basic NER, or even surface forms) and from the knowledge base (plus anything else) to generate a knowledge base. One method to do this is universal schemas. For a good intro, check out Sebastian Riedel’s slides.
From McCallum, I like the following papers, which give a good justification for, and results from, doing collective/joint inference.
(Self promotion aside: check out Sara Magliacane’s work on Probabilistic Soft Logics for another way of doing joint inference.)
Following on from this notion of reasoning jointly, Hulpus, Prangnawarat and Hayes showed how to use the graph-based structure of linked data to perform joint entity and word sense disambiguation from text. Likewise, Prokofyev et al. use the properties of a knowledge graph to perform better co-reference resolution. Essentially, they use this background knowledge to split the clusters of co-referent entities produced by Stanford CoreNLP. On the same idea, but for more structured data, the TableEL system uses a joint model with soft constraints to perform entity linking for web tables, improving performance by up to 75% on web tables. (code & data)
One approach to entity linking that I liked was from Raphael Troncy’s crew, titled “Reveal Entities From Texts With a Hybrid Approach” (paper, slides). (Shouldn’t it be “Revealing..”?) They showed that, essentially by using the provenance of the data sources, they are able to build an adaptive entity linking pipeline. Thus, one doesn’t necessarily have to do as much domain tuning to use these pipelines.
While not specifically about entity resolution, a paper worth pointing out is Type-Constrained Representation Learning in Knowledge Graphs from Denis Krompaß, Stephan Baier and Volker Tresp. They show how background knowledge about entity types can help improve link prediction tasks for generating knowledge graphs. Again, use the kitchen sink and you’ll perform better.
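To make the type-constraint idea a bit more concrete, here is a minimal Python sketch. It is not the authors’ model (they build the constraints into representation learning); it only illustrates the simpler, related idea of using a relation’s expected object type to discard implausible candidates before ranking. The entity types, the `locatedIn` relation and the scores are all made-up toy values.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy background knowledge (assumed for illustration only).
ENTITY_TYPES: Dict[str, Set[str]] = {
    "Bethlehem": {"City"},
    "Lehigh_University": {"University"},
    "Andrew_McCallum": {"Person"},
}
# Expected type of the object (the "range") for each relation.
RELATION_RANGE: Dict[str, str] = {"locatedIn": "City"}


def rank_objects(relation: str, scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank candidate objects for (subject, relation, ?).

    Candidates whose types do not include the relation's expected range
    type are dropped; relations without a known range keep every candidate.
    """
    expected = RELATION_RANGE.get(relation)
    allowed = {
        entity: score
        for entity, score in scores.items()
        if expected is None or expected in ENTITY_TYPES.get(entity, set())
    }
    # Highest score first.
    return sorted(allowed.items(), key=lambda item: item[1], reverse=True)


# Made-up scores from some untyped link predictor.
raw_scores = {"Andrew_McCallum": 0.9, "Bethlehem": 0.7, "Lehigh_University": 0.4}
ranked = rank_objects("locatedIn", raw_scores)
print(ranked)
```

Without the type filter, the top-scoring candidate here would be a person, which makes no sense as the object of `locatedIn`; with it, only the city survives.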
There were a couple of good resources presented for entity resolution tasks. Bryl, Bizer and Paulheim produced a dataset of surface forms for DBpedia entities. They were able to boost performance up to 20% for extracting accurate surface forms for entities through filtering. Another tool, LANCE, looks great for systematically generating benchmark and test sets for instance matching (i.e. entity linking). Also, Michel Dumontier presented work that had a benchmark for entity linking from the life sciences domain.
Finally, as we get better at entity resolution, I think people will turn towards fusion (getting the best possible representation for a real world entity). Examples include:
The Spectrum of Linked Data Querying
So Linked Data Fragments from Ruben Verborgh was the huge breakout of the conference. Oscar Corcho’s excellent COLD keynote was a riff on the spectrum (from data dumps through to full SPARQL queries) that was introduced by Ruben. Another example was the work of Maribel Acosta and Maria-Esther Vidal on “Networks of Linked Data Eddies: An Adaptive Web Query Processing Engine for RDF Data”. They developed an adaptive client-side SPARQL query engine for linked data fragments. This allows the server side to support a much simpler API by having a more intelligent client side. (An aside: kids, this is how a technical talk should be done. Precise, clean, technical, understandable. Can’t wait to have the video lecture for reference.)
Even the most centralized solution, the LODLaundromat, which is a clean crawl of the entire web of data, supports Linked Data Fragments. In some sense, by asking the server to do less you can handle more linked data, and thus do more powerful analysis. This is exemplified by the best paper, LODLab, by Laurens Rietveld, Wouter Beek, and Stefan Schlobach, which allowed for the reproduction of three existing analyses of the web of data at scale.
I think Olaf Hartig, in his paper on LDQL, framed the problem best as (N, Q) (slides). First define the “crawl” of the web you want to query (N) and then define the query (Q). When we think about what and where our crawls are, we can think about what execution strategies and types of queries we can best support. Or put another way:
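To illustrate the “simple server, smart client” split behind Linked Data Fragments, here is a rough Python sketch. It is not the real Linked Data Fragments API or any of the engines mentioned above: the “server” is just an in-memory function, `fragment`, that can only answer a single triple pattern, and the “client”, `solve`, does the join itself with a naive nested loop. The triples and the names `fragment` and `solve` are assumptions made for the example.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy dataset standing in for a remote server's data.
DATA: List[Triple] = [
    ("iswc2015", "locatedIn", "Bethlehem"),
    ("iswc2015", "hasKeynote", "McCallum"),
    ("Bethlehem", "inState", "Pennsylvania"),
]


def fragment(s: Optional[str], p: Optional[str], o: Optional[str]) -> Iterator[Triple]:
    """'Server' side: answer exactly one triple pattern; None is a wildcard."""
    for triple in DATA:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


def solve(
    patterns: List[Triple], binding: Optional[Dict[str, str]] = None
) -> Iterator[Dict[str, str]]:
    """'Client' side: join several patterns; terms starting with '?' are variables."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    # Already-bound variables become constants; unbound ones become wildcards.
    lookup = [binding.get(term) if term.startswith("?") else term for term in first]
    for triple in fragment(*lookup):
        new_binding = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            # setdefault binds a new variable or returns its existing value.
            if term.startswith("?") and new_binding.setdefault(term, value) != value:
                consistent = False
                break
        if consistent:
            yield from solve(rest, new_binding)


# "In which state did ISWC 2015 take place?" as two joined triple patterns.
results = list(solve([("iswc2015", "locatedIn", "?city"), ("?city", "inState", "?state")]))
print(results)
```

The server never sees the whole query, only one pattern at a time, which is what keeps its interface cheap; all the join logic (and any cleverness about ordering or adaptivity) lives in the client.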
More Main Memory = better Triple Stores
Designing scalable graph / triple stores has always been a challenge. We’ve been trapped by the limits of RAM. But computer architecture is changing, and we now have systems that have a lot of main memory, either in one machine or across multiple machines. This is a boon to triple stores and graph processing in general. See, for example, the Leskovec team’s work from SIGMOD:
We saw that theme at ISWC as well:
Moral of the story: Buy RAM
This year’s conference explored the many spectra of the combination of the web and semantics. I liked the mix of methods used by papers and the range from practical (the industry session was packed) to theoretical results. I also think the community is no longer hemmed in by the standards but is using them as a solid starting point. This was pointed out by Ian Horrocks in his keynote:
Additionally, this flexibility was exemplified by the best applied paper, “Building and Using a Knowledge Graph to Combat Human Trafficking” by Pedro Szekely et al. They used the parts of the semantic web stack that helped (like ontologies and JSON-LD) but used Elasticsearch for storage to create a vital and important solution to a real, challenging problem.
Overall, this was an excellent conference. Next year’s conference is in Kobe. I hope you submit some great papers, and I’ll see you there!