linked data

At the end of last week, I was at a small workshop held by the EXCITE project around the state of the art in extracting references from academic papers (in particular PDFs). This was an excellent workshop that brought together people who are deep into the weeds of this subject including, for example, the developers of ParsCit and CERMINE. While reference string extraction sounds fairly obscure, the task itself touches on a lot of the challenges one needs to tackle in general to make sense of the scholarly literature.

Begin aside: Yes, I did run a conference called Beyond the PDF 2 and  have been known to tweet things like:

But, there’s a lot of great information in papers so we need to get our machines to read. end aside.

You can roughly categorize the steps of reference extraction as follows:

  1. Extract the structure of the article.  (e.g. find the reference section)
  2. Extract the reference string itself
  3. Parse the reference string into its parts (e.g. authors, journal, issue number, title, …)

Check out these slides from Dominika Tkaczyk that give a nice visual overview of this process. In general, performance on this task is pretty good (~.9 F1) for the reference parsing step but gets harder when including all steps.
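To make the "~.9 F1" figure a bit more concrete, here is a minimal sketch of how a field-level F1 score for the parsing step could be computed. It assumes exact-match scoring over (field, value) pairs, which is only one of several ways these systems get evaluated; the function name and the toy reference fields are made up for illustration.

```python
from typing import Dict, Tuple


def field_f1(predicted: Dict[str, str], gold: Dict[str, str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exact-match (field, value) pairs.

    Assumption: a field counts as correct only if both its name and its
    value match the gold annotation exactly.
    """
    pred_pairs = set(predicted.items())
    gold_pairs = set(gold.items())
    true_pos = len(pred_pairs & gold_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data (invented): the parser glued the volume onto the journal name.
gold = {"authors": "A. Author", "year": "2012", "journal": "Some Journal", "volume": "63"}
predicted = {"authors": "A. Author", "year": "2012", "journal": "Some Journal 63"}

precision, recall, f1 = field_f1(predicted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Errors in the earlier steps (a missed reference section, a badly cut reference string) would lower these counts before parsing even starts, which is one way to see why the end-to-end numbers are harder.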

There were three themes that popped out for me:

  1. The reading experience
  2. Resources
  3. Reading from the image

The Reading Experience

Min-Yen Kan gave an excellent talk about how text mining of the academic literature could improve the ability of researchers to come to grips with the state of science. He positioned the field as one where we have the groundwork and are working on building enabling tools (e.g. search, management, policies), but there’s still a long way to go in really building systems that give insights to researchers. As custodian of the ACL Anthology, he also talked about trying to put these innovations into practice. Prof. Kan is based in Singapore but gave probably one of the best Skype talks I have ever been part of. Slides are below but you should check it out on YouTube.

Another example of improving the reading experience was David Thorne’s presentation around some of the newer things being added to Utopia docs – a souped-up PDF reader. In particular, he showed the work on the Lazarus project, which, by extracting assertions from the full text of the article, allows one to traverse an “idea” graph alongside the “citation” graph. On a small note, I really like how the articles that are found can be traversed in the reader without having to download them separately. You can just follow the links. As usual, the Utopia team wins the “we hacked something really cool just now” award by integrating directly with the EXCITE project’s citation lookup API.

Finally, on the reading experience front, Andreas Hotho presented BibSonomy, the social reference manager his research group has been operating over the past ten years. It’s a pretty amazing success, resulting in 23 papers, 160 papers that use the dataset, 96 million Google hits and ~1000 weekly active users. Obviously, it’s a challenge running this user-facing software from an academic group, but clearly it has paid dividends. The main takeaway I had in terms of reader experience is that it’s important to identify what types of users you have and how the information they produce can help or hinder its application for other users (see this paper).


Resources

The interesting thing about this area is the number of resources available (both software and data) and how resources are also the outcome of the work (e.g. citation databases). Here’s a listing of the open resources that I heard called out:

This is not to mention the more general sources of information like CiteSeer, arXiv or PubMed. What also was nice to see is how many systems are built on top of other software. I was also happy to see the following:

An interesting issue was the transparency of algorithms and the quality of the resulting citation databases. Nees Jan van Eck from CWTS, the developer of VOSviewer, gave a nice overview of trying to determine the quality of reference matching in the Web of Science. Likewise, Lee Giles gave a review of his work looking at author disambiguation for CiteSeerX and using an external source to compare that process. A pointer that I hadn’t come across was the work by Jurafsky on author disambiguation:

Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. 2012. Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology 63:5, 1030-1047.

Reading from the image

On the second day of the workshop, we broke out into discussion groups. In my group, we focused on understanding the role of deep learning in the entire extraction process. Almost all the groups are pursuing this.

I was thankful to both Akansha Bhardwaj and Roman Kern for walking us through their pipelines. In particular, Akansha is using scanned images of reference sections as her source and starting to apply CNNs for semantic segmentation, with pretty good success so far.

We discussed the potential for doing the task completely from the ground up using a deep neural network. This was an interesting discussion, as current state-of-the-art techniques already use quite a lot of positional information for training. This can be extracted from the PDF, and some of the systems already use the images directly. However, there’s a lot of fiddling that needs to go on to deal with the PDF contents, so maybe the image actually provides a cleaner place to start. But then we get back to the issue of resources and how to appropriately generate the necessary training data.

Random Notes

  • The organizers set up a Slack backchannel, which was useful.
  • I’m not a big fan of Skype talks, but they were able to get two important speakers that way and they organized it well. When it’s the difference between having field leaders and not, it makes a big difference.
  • EU projects can have a legacy – Roman Kern is still using code from a project where Mendeley was a consortium member.
  • Kölsch is dangerous but tasty.
  • More workshops should try the noon to noon format.



Last week, I was in Japan for the 15th International Semantic Web Conference. 

For me, this was a big event as I was research track program co-chair together with the amazing Elena Simperl. Being a program chair is a funny thing: you’re not directly responsible for any individual paper, presentation or review, but you feel responsible for the entirety. And obviously, organizing 664 reviews for 212 submissions isn’t something to be taken lightly. Beyond my service as research track chair, I think my main contribution was finding good coffee near the event:

With all that said, I think the entire program was really solid. All the preprints are on the website and the proceedings are available from Springer. I’ll try to summarize my main takeaways below. But first some quick stats:

  • 430 participants
  • 212 (research track) + 43 (application track) + 71 (resources track) = 326 submissions
    • that’s up by 61 submissions from last year!
  • Acceptance rates:
    • 39/212  =  18% (research track)
    • 12/43 = 28% (application track)
    • 24/71 = 34%  (resources track)
    • I think these reflect the aims of the individual tracks
  • We also had 102 posters and demos and 12 journal track papers
  • 35 student travel winners
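If you want to check the arithmetic behind these stats, here is a quick sketch. The accepted/submitted counts are the ones quoted above; rounding to whole percentages is my assumption about how the rates were reported.

```python
# (accepted, submitted) per track, as quoted in the stats above.
tracks = {
    "research": (39, 212),
    "application": (12, 43),
    "resources": (24, 71),
}

total_submissions = sum(submitted for _, submitted in tracks.values())

# Assumption: acceptance rates are rounded to the nearest whole percent.
rates = {name: round(100 * accepted / submitted)
         for name, (accepted, submitted) in tracks.items()}

# 664 reviews were organized for the 212 research track submissions.
reviews_per_submission = 664 / 212

print(total_submissions)
print(rates)
print(f"{reviews_per_submission:.1f} reviews per research track submission")
```

So the research track averaged a bit over three reviews per paper.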

My four main takeaways:

  1. Frames are back!
  2. semantics on the web (notice the case)
  3. Science as the next challenge
  4. SPARQL as a driver for other CS communities

(Oh and apologies for the gratuitous use of images and twitter embeds)

Frames are back!

For the past couple of years, a chunk of the community has been focused on the problem of entity resolution/disambiguation, whether that’s from text to a KB or across multiple KBs. Indeed, one of the best paper winners (yes, we gave out two – both nominees had great papers), by ISI’s Information Integration Group, was an excellent approach to multi-type entity resolution. Likewise, Axel and crew gave a pretty heavy-duty tutorial on link discovery. On the NLP front, Stefano Faralli presented a nice resource that disambiguates text to lexical resources with a focus on providing both symbolic and distributional representations.


What struck me at the conference was the number of papers beginning to think not just about entities and their relations but the context they are in. This need for context was well motivated by the folks at IBM Research working on medical question answering.

Essentially, this means thinking about classic AI frames, but asking how to obtain them automatically. A clear example of this is the (ongoing) work on FRED:

Similarly, the NewsReader system for extracting information into situated events is another example, as is extracting process graphs from medical texts. Finally, in the NLP community there’s an increasing focus on developing resources in order to build automated parsers for frame-style semantic representations (e.g. Abstract Meaning Representation). Such representations can be enhanced by connections to semantic web resources as discussed by Burns et al. (I knew this was a great idea in 2015!)


In summary,  I think we’re beginning to see how the background knowledge available on the Semantic Web combined with better parsers can help us start to deal better with context in an automated fashion.

semantics on the web

Chris Bizer gave an insightful keynote reflecting on what the community’s expectations were for the semantic web and where we currently are.

He presented stats on the growth of Linked Data (e.g. stuff in the LOD cloud) as well as web data (e.g. marked-up pages), but really the main takeaway is the boom in the latter. About 30% of the Web has HTML-embedded data, something like 12 million websites. There’s an 86% adoption rate on top travel websites. I think the choice quote was:

“Probably, every hotel on earth is represented as web data.”

The problem is that this sort of data is not clean, it’s messy – it’s webby data, which brings us to Chris’s important point for the community:


While standards have brought us a lot, I think we are starting as a research community to think increasingly about different kinds of semantics and different kinds of structured data.  Some examples from the conference:

An embrace of the whole spectrum of semantics on the web is really a valuable move for the research community. Interestingly enough, I think we can truly experiment with web data through things like Common Crawl and the Web Data Commons. As knowledge graphs, triple stores, and ontologies become increasingly commonplace, especially in enterprise deployments, I’m heartened by these new areas of investigation.

The next challenge: Science

Personally, I found the third keynote of ISWC inspirational. Professor Hiroaki Kitano, the CEO of Sony CSL, creator (among other things) of the AIBO and founder of RoboCup, laid out what he sees as the next AI grand challenge:

[Photo from the keynote]

It will be hard for me to do justice to the keynote as the material per second ratio was pretty much off the chart, but he has an AI Magazine article laying out the vision.

Broadly, he used RoboCup as a framework for discussing how to organize a challenge and pointed to its effectiveness (e.g. Kiva Systems, a RoboCup spinout, was acquired by Amazon for $770 million). He then focused on the issue of the inefficiency in scientific discovery and in particular how assembling knowledge is just too difficult.

[Photos from the keynote]

Assembling this by hand is way too hard!

He then went on to reframe the scientific question as one of massive search and verification over the hypothesis space.

I walked out of that keynote pretty charged up.

I think the semantic web community can be a big part of tackling this grand challenge. Science and medicine have always been important domains for applying these technologies and that showed up at this conference as well:

SPARQL as a driver for other CS communities

The 10-year award was given to Jorge Perez, Marcelo Arenas and Claudio Gutierrez for their paper Semantics and Complexity of SPARQL. Jorge gave just a beautiful 10-minute reflection on the paper and the relationship between theory and practice. I think his slide below really sums up the impact that SPARQL has had not just on the semantic web community but on CS as a whole:

[Photo: Jorge’s slide on the impact of SPARQL]

As further evidence, I thought one of the best technical talks of the conference (even through an earthquake) was by Peter Boncz on emergent schemas for RDF querying.

It was a clear example of how the DB and semweb communities are learning from one another, and of how the semantic web’s different requirements (e.g. around schemas) drive new research.

As a whole, it’s hard to beat a conference where you learn a ton and that has the following:

[Photo]

Random Pointers

It’s kind of appropriate that my last post of 2015 was about the International Semantic Web Conference (ISWC) and my first post of 2016 will be about ISWC.

This year’s conference will be held in Kobe, Japan, and it already has a number of great things in store. We already have a stellar list of keynote speakers:

  • Kathleen McKeown – Professor of Computer Science at Columbia University, Director of the Institute for Data Sciences and Engineering, and Director of the North East Big Data Hub. I was at the hub’s launch last year and it’s really amazing which researchers she brought together through that hub.
  • Hiroaki Kitano – CEO of Sony Computer Science Laboratory and President of the Systems Biology Institute. A truly inspirational figure who has done everything from RoboCup to systems biology. He was even an invited artist at MoMA.
  • Chris Bizer – Professor at the University of Mannheim and Director of the Institute of Computer Science and Business Informatics there. If you’re in the Semantic Web community – you know the amazing work Chris has done. He really kicked the entire move toward Linked Data into high gear.

We have three tracks for you to submit to:

  1. The classic Research Track. Elena and I hope to get your most innovative and groundbreaking work on the cross between semantics and the web writ large. We’ve put together a top notch PC to give you feedback.
  2. The Resources Track. Reusable resources like datasets, ontologies, benchmarks and tools are crucial for many research disciplines and especially ours. This track focuses on highlighting them. Alasdair and Marta have put together a rich set of guidelines for great reusable resources. Check them out.
  3. The Applications Track provides an area to discuss the benefits and challenges of applying semantic technologies. This track, organized by Markus and Freddy, is accepting three different types of submissions: in-use applications, industry applications and emerging applications.

In addition to these tracks, ISWC 2016 will have a full program of workshops, posters, demos and student opportunities.

This year we’ll also be allowing submissions to be HTML, letting you experiment with new ways of conveying your contributions. I’m excited to see the creativity in the community using web technologies.

So get those submissions in. Abstracts are due April 20, Full submissions April 30th!





Next week is the 2015 International Semantic Web Conference. I had the opportunity, together with Michel Dumontier, to chair a new track on Datasets and Ontologies. A key part of the Semantic Web has always been shared resources, whether it’s common standards through the W3C or open datasets like those found in the LOD cloud. Indeed, one of the major successes of our community is the availability of these resources.

ISWC over the years has experimented with different ways of highlighting these contributions and bringing them into the scientific literature. For the past couple of years, we have had an evaluation track specifically devoted to reproducibility and evaluation studies. Last year datasets were included to form a larger RDBS track. This year we again have a specific Empirical Studies and Evaluation track alongside the Data & Ontologies track.

The reviewers had a tough job for this track. First, it was new, so it’s hard to make a standard judgment. Secondly, we asked reviewers not only to review the paper but the resource itself along a number of dimensions. Overall, I think they did a good job. Below you’ll find the resources chosen for presentation at the conference and a brief headline of what to me is interesting about the paper. In the spirit of the track, I link to the resource as well as the paper.


  • Automatic Curation of Clinical Trials Data in LinkedCT by Oktie Hassanzadeh and Renée J Miller (paper) – clinical trials data published as linked data in an open and queryable form. This resource has been around since 2008. I love the fact that they post downtime and other status info on Twitter.
  • LSQ: Linked SPARQL Queries Dataset by Muhammad Saleem, Muhammad Intizar Ali, Qaiser Mehmood, Aidan Hogan and Axel-Cyrille Ngonga Ngomo (paper) – Query logs are becoming an ever more important resource for everything from search engines to database query optimization. See for example USEWOD. This resource provides versions, queryable in SPARQL, of the query logs from several major datasets including DBpedia and LinkedGeoData.
  • Provenance-Centered Dataset of Drug-Drug Interactions by Juan Banda, Tobias Kuhn, Nigam Shah and Michel Dumontier (paper) – this resource provides an aggregated set of drug-drug interactions coming from 8 different sources. I like how they provided a DOI for the bulk download of their data source as well as a SPARQL endpoint. It also uses nanopublications as the representation format.
  • Semantic Bridges for Biodiversity Science by Natalia Villanueva-Rosales, Nicholas Del Rio, Deana Pennington and Luis Garnica Chavira (paper) – this resource allows biodiversity scientists to work with species distribution models. The interesting thing about this resource is that it not only provides linked data, a SPARQL endpoint and ontologies but also semantic web services (i.e. SADI) for orchestrating these models.
  • DBpedia Commons: Structured Multimedia Metadata for Wikimedia Commons by Gaurav Vaidya, Dimitris Kontokostas, Magnus Knuth, Jens Lehmann and Sebastian Hellmann (paper) – this is another chapter in exposing Wikimedia content as structured data. This resource provides structured information for the media content in Wikimedia Commons. Now you can use SPARQL to find all images with a CC-BY-SA v2.0 license.


Overall, I think this is a good representation of the plethora of deep datasets and ontologies that the community is creating.  Take a minute and check out these new resources.

This past week I attended a workshop on the Evolution and Variation of Classification Systems organized by the Knowescape EU project. The project studies how knowledge evolves and makes cool maps like this one:

The aim of the workshop was to discuss how knowledge organization systems and classification systems change. By knowledge organization systems, we mean things like the Universal Decimal Classification system or the Wikipedia Category Structure. My interest here is the interplay between the change in data and the change in the organization system used for that data. For example, I may use a certain vocabulary or ontology to describe a dataset (i.e. the columns). How does it impact data analysis procedures when the meaning of that organization system changes? Many of our visualization decisions and analyses are based on how we categorize data (whether manually or automatically) according to such organizational structures. Albert Meroño-Peñuela gave an excellent example of that with his work on Dutch historical census data. Furthermore, the organization system used may impact the ability to repurpose and combine data.

Interestingly, even though we’ve seen highly automated approaches emerge for search and other information analysis tasks, Knowledge Organization Systems (KOSs) still often provide extremely useful information. For example, we’ve seen how structures such as Wikipedia’s have been central to the emergence of knowledge graphs. Likewise, extremely adaptable organization systems such as hashtags have been foundational for other services.

At the workshop, I particularly enjoyed Joseph Tennis’s keynote on the diversity and stability of KOSs. His work on ontogeny is starting to measure that change. He demonstrated this by looking at the Dewey Decimal System, but others have shown that the change is apparent in other KOSs (1, 2, 3, 4). Understanding this change could help in constructing better and more applicable organization systems.
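As a rough illustration of what measuring change between two versions of a KOS could look like, here is a minimal sketch. To be clear, this is not the ontogeny method from the keynote: it assumes a scheme can be represented as a simple mapping from class label to parent label, and the class names and the stability measure are invented for the example.

```python
from typing import Dict, Optional, Set

# Assumption: a scheme is a mapping from class label to parent label
# (None for top-level classes).
Scheme = Dict[str, Optional[str]]


def diff_schemes(old: Scheme, new: Scheme) -> Dict[str, Set[str]]:
    """Report which classes were added, removed, or moved to a new parent."""
    old_classes, new_classes = set(old), set(new)
    kept = old_classes & new_classes
    return {
        "added": new_classes - old_classes,
        "removed": old_classes - new_classes,
        "moved": {c for c in kept if old[c] != new[c]},
    }


def stability(old: Scheme, new: Scheme) -> float:
    """Share of the old classes that survive with the same parent."""
    if not old:
        return 1.0
    changes = diff_schemes(old, new)
    unchanged = len(old) - len(changes["removed"]) - len(changes["moved"])
    return unchanged / len(old)


# Toy versions of a scheme (invented labels).
v1 = {"Science": None, "Technology": None,
      "Computing": "Science", "Telegraphy": "Technology"}
v2 = {"Science": None, "Technology": None,
      "Computing": "Technology", "Web": "Computing"}

changes = diff_schemes(v1, v2)
print(changes)
print(stability(v1, v2))
```

Any dataset whose columns were described with the "Computing" class in v1 would silently end up under a different branch in v2, which is exactly the kind of interplay between data and organization system mentioned above.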

From both Joseph’s talk as well as the talk by Richard Smiraglia (one of the leaders in Knowledge Organization), it’s clear that, as with many other sciences, our ability to understand information systems can now become much more deeply empirical. Because the objects of study (e.g. vocabularies, ontologies, taxonomies, dictionaries) are available on the Web in digital form, we can now analyze them. This is the promise of Web Observatories. Indeed, an interesting outcome of the workshop was that the construction of a KOS observatory was not that far-fetched and could be done using aggregators such as Linked Open Vocabularies and Taxonomy Warehouse. I’ll be interested to see if this gets built.

Finally, it occurred to me that there is a major lack of studies on the evolution of the Urban Dictionary as a KOS. Someone ought to do something about it 🙂

Random Notes

NewsReader Amsterdam Hackathon

This past Wednesday (Jan. 21, 2015) I was at the NewsReader Hackathon. NewsReader is an EU project to extract events and build stories from the news. They use a sophisticated NLP pipeline combined with semantic background knowledge to perform this task. The hackathon was an opportunity to talk to members of one of the leading NLP groups in the Netherlands (CLTL) and find out more about their current pipeline. Additionally, one of the project partners is Lexis Nexis, a sister company of Elsevier, so it was nice to see how their content was being used as a basis for event extraction and also meet some of my colleagues. The combination of news and research is of particular interest in light of the recent Elsevier acquisition of NewsFlo.

Besides the chance to meet people, I also got to do some hacking myself to see how the NewsReader API worked. I used the API to plot the number and type of events featuring universities. (The resulting IPython Notebook)

A couple of pointers for future reference:

A couple of weeks ago, I was at the European Data Forum in Athens talking about the Open PHACTS project. You can find a video of my talk with slides here. Slides are embedded below.
