Monthly Archives: October 2015

Last week, I hung out in Bethlehem, Pennsylvania for the 14th International Semantic Web Conference. Bethlehem is famous for the Lehigh University Benchmark (LUBM) and Bethlehem Steel. ISWC is the major conference focused on the intersection of semantics and web technologies. In addition to being technically super cool, it was a great chance for me to meet many friends and make some new ones.

Let’s begin with some stats:

  • ~450 attendees
  • The conference continues to be selective:
    • Research track: 22% acceptance rate
    • Empirical studies track: 29% acceptance rate
    • In-use track: 40% acceptance rate
    • Datasets and Ontologies: 22% acceptance rate
  • There were 265 submissions across all tracks, which is, surprisingly, the same number as last year.
  • More stats and info in Stefan’s slides (e.g. move to Portugal if you want to get your papers in the conference.)
  • Fancy visualizations courtesy of the STKO group

Before getting into what I thought were the major themes of the conference, a brief note. Reviewing is at the heart of any academic conference. While we can always try to improve review quality, it’s worth calling out good reviewing. The best reviewers were Maribel Acosta (research) and Markus Krötzsch (applied). As co-chair of the Datasets and Ontologies track, I can attest to how important good reviewers are. For this new track we relied heavily on reviewers being flexible and looking at these sorts of contributions differently. So thanks to them!

For me there were three themes of ISWC:

  1. The Spectrum of Entity Resolution
  2. The Spectrum of Linked Data Querying
  3. Buy more RAM

The Spectrum of Entity Resolution

Maybe it’s because I attended the NLP & DBpedia workshop, or because of the conversation I had about string similarity with Michelle Cheatham, but one theme that I saw was the continued amalgamation of natural language processing (NLP) style entity resolution with database entity resolution (i.e. record linkage). This movement stems from the fact that an increasing amount of linked data combines data extracted from semi-structured sources with data extracted via NLP. On top of that, NLP systems themselves rely on some of these semi-structured data sources to do their work.

Probably the best example of that idea is the work that Andrew McCallum presented in his keynote on “epistemological knowledge bases”.

Briefly, the idea is to reason jointly over all the information coming from basic low-level NLP (e.g. basic NER, or even surface forms) and from the knowledge base (plus anything else) to generate a knowledge base. One method to do this is universal schemas. For a good intro, check out Sebastian Riedel’s slides.

From McCallum, I like the following papers, which give a good justification for, and results of, doing collective/joint inference.

(Self promotion aside: check out Sara Magliacane’s work on Probabilistic Soft Logics for another way of doing joint inference.)

Following on from this notion of reasoning jointly, Hulpus, Prangnawarat and Hayes showed how to use the graph-based structure of linked data to perform joint entity and word sense disambiguation from text. Likewise, Prokofyev et al. use the properties of a knowledge graph to perform better co-reference resolution. Essentially, they use this background knowledge to split the clusters of coreferent entities produced by Stanford CoreNLP. On the same idea, but for more structured data, the TableEL system uses a joint model with soft constraints to perform entity linking for web tables, improving performance by up to 75% on web tables. (code & data)

One approach to entity linking that I liked was from Raphael Troncy’s crew, titled “Reveal Entities From Texts With a Hybrid Approach” (paper, slides). (Shouldn’t it be “Revealing..”?) They showed that by essentially using the provenance of the data sources, they are able to build an adaptive entity linking pipeline. Thus, one doesn’t necessarily have to do as much domain tuning to use these pipelines.

While not specifically about entity resolution, a paper worth pointing out is Type-Constrained Representation Learning in Knowledge Graphs from Denis Krompaß, Stephan Baier and Volker Tresp. They show how background knowledge about entity types can help improve link prediction tasks for generating knowledge graphs. Again, use the kitchen sink and you’ll perform better.
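To make the “use background knowledge about types” idea concrete, here is a minimal Python sketch. It is not the method from the paper; it is a much simpler stand-in that only filters link prediction candidates by a relation’s expected range at ranking time. The entity names, types, the bornIn relation, and the scores are all made up for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical background knowledge: entity types and the expected
# range (object type) of each relation. None of this comes from the paper.
ENTITY_TYPES: Dict[str, Set[str]] = {
    "Amsterdam": {"City"},
    "Kobe": {"City"},
    "Alice": {"Person"},
    "ISWC": {"Conference"},
}
RELATION_RANGE: Dict[str, str] = {"bornIn": "City"}


def rank_objects(
    relation: str, scores: Dict[str, float], use_types: bool = True
) -> List[Tuple[str, float]]:
    """Rank candidate objects for a relation, best score first.

    With use_types=True, candidates whose types do not include the
    relation's expected range are dropped before ranking.
    """
    allowed = RELATION_RANGE.get(relation)
    candidates = [
        (entity, score)
        for entity, score in scores.items()
        if not use_types
        or allowed is None
        or allowed in ENTITY_TYPES.get(entity, set())
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Made-up scores, as if produced by some link prediction model
# for the query (Alice, bornIn, ?).
raw_scores = {"ISWC": 0.9, "Amsterdam": 0.8, "Alice": 0.7, "Kobe": 0.3}

untyped = rank_objects("bornIn", raw_scores, use_types=False)
typed = rank_objects("bornIn", raw_scores)
```

Without the type filter, the top candidate is a conference, which can never be a birthplace. With it, only the two cities remain in the ranking.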

There were a couple of good resources presented for entity resolution tasks. Bryl, Bizer and Paulheim produced a dataset of surface forms for DBpedia entities. They were able to boost performance up to 20% for extracting accurate surface forms for entities through filtering. Another tool, LANCE, looks great for systematically generating benchmark and test sets for instance matching (i.e. entity linking). Also, Michel Dumontier presented work that had a benchmark for entity linking from the life sciences domain.

Finally, as we get better at entity resolution, I think people will turn towards fusion (getting the best possible representation for a real world entity). Examples include:

The Spectrum of Linked Data Querying

So Linked Data Fragments from Ruben Verborgh was the huge breakout of the conference. Oscar Corcho’s excellent COLD keynote was a riff on the spectrum (from data dumps through to full SPARQL queries) that was introduced by Ruben. Another example was the work of Maribel Acosta and Maria-Esther Vidal on “Networks of Linked Data Eddies: An Adaptive Web Query Processing Engine for RDF Data”. They developed an adaptive client-side SPARQL query engine for Linked Data Fragments. This allows the server side to support a much simpler API by having a more intelligent client side. (An aside: kids, this is how a technical talk should be done. Precise, clean, technical, understandable. Can’t wait to have the video lecture for reference.)
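To give a feel for the “simple server, smart client” split, here is a minimal Python sketch. It assumes a server that can only answer single triple patterns (mocked here as an in-memory list) and a client that joins those answers itself. The data, the ex: names, and the function names are all hypothetical, and the join is a naive nested loop, not the adaptive eddies technique from the paper.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy dataset standing in for a remote server (all names are made up).
DATA: List[Triple] = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:bob", "ex:livesIn", "ex:Amsterdam"),
    ("ex:carol", "ex:livesIn", "ex:Kobe"),
]


def fragment(
    s: Optional[str], p: Optional[str], o: Optional[str]
) -> Iterator[Triple]:
    """The only operation the mock server offers: match one triple pattern.

    None acts as a wildcard in any position.
    """
    for triple in DATA:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


def is_var(term: str) -> bool:
    return term.startswith("?")


def solve(
    patterns: List[Triple], binding: Optional[Dict[str, str]] = None
) -> Iterator[Dict[str, str]]:
    """Client-side nested-loop join over a list of triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    # Bound variables are replaced by their values; unbound ones become wildcards.
    lookup = [binding.get(term) if is_var(term) else term for term in first]
    for triple in fragment(*lookup):
        extended = dict(binding)
        # Bind each variable, rejecting triples that contradict earlier bindings.
        consistent = all(
            extended.setdefault(term, value) == value
            for term, value in zip(first, triple)
            if is_var(term)
        )
        if consistent:
            yield from solve(rest, extended)


# Where do the people known by the people Alice knows live?
query = [
    ("ex:alice", "ex:knows", "?x"),
    ("?x", "ex:knows", "?y"),
    ("?y", "ex:livesIn", "?city"),
]
results = list(solve(query))
```

All the join logic lives in solve on the client; the server side never sees more than one pattern at a time, which is what keeps its API so simple.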

Even the most centralized solution, the LODLaundromat, which is a clean crawl of the entire web of data, supports Linked Data Fragments. In some sense, by asking the server to do less you can handle more linked data, and thus do more powerful analysis. This is exemplified by the best paper, LODLab, by Laurens Rietveld, Wouter Beek, and Stefan Schlobach, which allowed for the reproduction of three existing analyses of the web of data at scale.

I think Olaf Hartig, in his paper on LDQL, framed the problem best as (N, Q) (slides). First define the “crawl” of the web you want to query (N) and then define the query (Q). When we think about what and where our crawls are, we can think about what execution strategies and types of queries we can best support. Or put another way:

More Main Memory = better Triple Stores

Designing scalable graph / triple stores has always been a challenge. We’ve been trapped by the limits of RAM. But computer architecture is changing, and we now have systems that have a lot of main memory either in one machine or across multiple machines. This is a boon to triple stores and graph processing in general. See, for example, the Leskovec team’s work from SIGMOD:

We saw that theme at ISWC as well:

Moral of the story: Buy RAM

Conclusion

This year’s conference explored the many spectra of the combination of the web and semantics. I liked the mix of methods used by papers and the range from practical (the industry session was packed) to theoretical results. I also think the community is no longer hemmed in by the standards but is using them as a solid starting point. This was pointed out by Ian Horrocks in his keynote:

Additionally, this flexibility was exemplified by the best applied paper, “Building and Using a Knowledge Graph to Combat Human Trafficking” by Pedro Szekely et al. They used the parts of the semantic web stack that helped (like ontologies and JSON-LD) but used Elasticsearch for storage to create a vital and important solution to a real, challenging problem.

Overall, this was an excellent conference. Next year’s conference is in Kobe. I hope you submit some great papers, and I’ll see you there!

Random Thoughts

Last week (Oct 7 – 9) the altmetrics community made its way to Amsterdam for 2:AM (the second altmetrics conference) and altmetrics15 (the 4th altmetrics workshop). The conference is aimed more at practitioners while the workshop has a bit more research focus. I enjoyed the events from both a content perspective (I’m biased as a co-organizer) and a logistics perspective (I could bike from home). This was the five-year anniversary of the altmetrics manifesto, so it was a great opportunity to reflect on the status of the community. Plus the conference organizers brought cake!

This was the first time that all of the authors were in the same room together and we got a chance to share some of our thoughts. The video is here if you want to hear us pontificate:

From my perspective, I think you can summarize the past years in two bullet points:

  • Amazing what the community has done: multiple startups on altmetrics, big companies having altmetric products, many articles and other research objects having altmetric scores, a small but vibrant research community is alive
  • It would be great to focus more on altmetrics to improve the research process rather than just their potential use in research evaluation.

Beyond the reflection on the community itself, I took three themes from the conference:

More & different data please

An interesting aspect is that most studies and implementations rely on social media data (Twitter, Mendeley, Facebook, blogs, etc). As an aside, it’s worth noting you can do amazing things with this data in a very short amount of time…

However, there is increasing interest in having data from other sources or having more contextualized data.

There were several good examples. There was a good talk about trying to get at the data behind who tweets about scientific articles. I’m excited to see how better population data can help us. The folks at altmetric.com are starting to provide data that looks at how articles are being used in public policy documents. Finally, moving beyond articles, Peter van den Besselaar is looking at data derived from grant review processes to study, for example, gender bias.

It’s also good to see developments such as the DOI Event Tracker that makes the aggregation of altmetrics data easier. This is hopefully just the start and we will see a continued expansion of the variety of data available for studies.

The role of theory

There was quite a bit of discussion about the appropriateness of using altmetrics for different tasks, ranging from the development of global evaluation measures to their role in understanding the science system. There was a long discussion of the quality of altmetrics data, in particular the transparency of how aggregators integrate and provide data.

A number of presenters discussed the need for theory in trying to interpret altmetrics signals. Cameron Neylon gave an excellent talk about his view of the need for a different theoretical view. There was also a breakout session at the workshop discussing the role of theory, and I look forward to the Etherpad becoming something more well defined. Peter van den Besselaar and I also tried to argue for a question-driven approach when using altmetrics.

Finally, I enjoyed the work of Stefanie Haustein, Timothy Bowman, and Rodrigo Costas on interpreting the meaning of altmetrics. This is definitely a must read.

Going beyond research evaluation

I had a number of good conversations with people about the desire to do something that moves beyond the focus on research evaluation. In all honesty, being able to tell stories with a variety of metrics is probably why altmetrics has gained traction.

However, I think the exciting bit is a world in which understanding the signals produced by the research system can be used to improve research. There were some hints of this. In particular, I was compelled by the work of Kristi Holmes on using measures to improve translational medicine at Northwestern.

Wrap-up

Overall, it’s great to see all the activity around altmetrics. There are a bunch of great summaries of the event. Check out the altmetrics conference blog and Julie Birkholz’s summary.

Next week is the 2015 International Semantic Web Conference. I had the opportunity, together with Michel Dumontier, to chair a new track on Datasets and Ontologies. A key part of the Semantic Web has always been shared resources, whether it’s common standards through the W3C or open datasets like those found in the LOD cloud. Indeed, one of the major successes of our community is the availability of these resources.

ISWC over the years has experimented with different ways of highlighting these contributions and bringing them into the scientific literature. For the past couple of years, we have had an evaluation track specifically devoted to reproducibility and evaluation studies. Last year datasets were included to form a larger RDBS track. This year we again have a specific Empirical Studies and Evaluation track alongside the Data & Ontologies track.

The reviewers had a tough job for this track. First, it was new, so it’s hard to make a standard judgment. Secondly, we asked reviewers to review not only the paper but also the resource itself along a number of dimensions. Overall, I think they did a good job. Below you’ll find the resources chosen for presentation at the conference and a brief headline of what to me is interesting about each paper. In the spirit of the track, I link to the resource as well as the paper.

Datasets

  • Automatic Curation of Clinical Trials Data in LinkedCT by Oktie Hassanzadeh and Renée J Miller (paper) – clinicaltrials.gov published as linked data in an open and queryable form. This resource has been around since 2008. I love the fact that they post downtime and other status info on Twitter: https://twitter.com/linkedct
  • LSQ: Linked SPARQL Queries Dataset by Muhammad Saleem, Muhammad Intizar Ali, Qaiser Mehmood, Aidan Hogan and Axel-Cyrille Ngonga Ngomo (paper) – Query logs are becoming an ever more important resource for everything from search engines to database query optimization. See for example USEWOD. This resource provides SPARQL-queryable versions of the query logs from several major datasets including DBpedia and LinkedGeoData.
  • Provenance-Centered Dataset of Drug-Drug Interactions by Juan Banda, Tobias Kuhn, Nigam Shah and Michel Dumontier (paper) – this resource provides an aggregated set of drug-drug interactions coming from 8 different sources. I like how they provided a DOI for the bulk download of their data source as well as a SPARQL endpoint. It also uses nanopublications as the representation format.
  • Semantic Bridges for Biodiversity Science by Natalia Villanueva-Rosales, Nicholas Del Rio, Deana Pennington and Luis Garnica Chavira (paper) – this resource allows biodiversity scientists to work with species distribution models. The interesting thing about this resource is that it not only provides linked data, a SPARQL endpoint and ontologies but also semantic web services (i.e. SADI) for orchestrating these models.
  • DBpedia Commons: Structured Multimedia Metadata for Wikimedia Commons by Gaurav Vaidya, Dimitris Kontokostas, Magnus Knuth, Jens Lehmann and Sebastian Hellmann (paper) – this is another chapter in exposing Wikimedia content as structured data. This resource provides structured information for the media content in Wikimedia Commons. Now you can SPARQL for all images with a CC-by-sa v2.0 license.

Ontologies

Overall, I think this is a good representation of the plethora of deep datasets and ontologies that the community is creating.  Take a minute and check out these new resources.
