Archive

linked data

Last week, I was in Japan for the 15th International Semantic Web Conference. 

For me, this was a big event as I was research track program co-chair together with the amazing Elena Simperl. Being a program chair is a funny thing: you’re not directly responsible for any individual paper, presentation, or review, but you feel responsible for the entirety. And obviously, organizing 664 reviews for 212 submissions isn’t something to be taken lightly. Beyond my service as research track chair, I think my main contribution was finding good coffee near the event:

With all that said, I think the entire program was really solid. All the preprints are on the website and the proceedings are available from Springer. I’ll try to summarize my main takeaways below. But first some quick stats:

  • 430 participants
  • 212 (research track) + 43 (application track) + 71 (resources track) = 326 submissions
    • that’s up by 61 submissions from last year!
  • Acceptance rates:
    • 39/212  =  18% (research track)
    • 12/43 = 28% (application track)
    • 24/71 = 34%  (resources track)
    • I think these reflect the aims of the individual tracks
  • We also had 102 posters and demos and 12 journal track papers
  • 35 student travel winners

My four main takeaways:

  1. Frames are back!
  2. semantics on the web (notice the case)
  3. Science as the next challenge
  4. SPARQL as a driver for other CS communities

(Oh and apologies for the gratuitous use of images and twitter embeds)

Frames are back!

For the past couple of years, a chunk of the community has been focused on the problem of entity resolution/disambiguation, whether that’s from text to a KB or across multiple KBs. Indeed, one of the best paper winners (yes, we gave out two – both nominees had great papers) by ISI’s Information Integration Group was an excellent approach to multi-type entity resolution. Likewise, Axel and crew gave a pretty heavy-duty tutorial on link discovery. On the NLP front, Stefano Faralli presented a nice resource that disambiguates text to lexical resources with a focus on providing both symbolic and distributional representations.


What struck me at the conference was the number of papers beginning to think not just about entities and their relations but about the context they are in. This need for context was well motivated by the folks at IBM Research working on medical question answering.

Essentially, this is thinking about classic AI frames, but asking how to obtain them automatically. A clear example of this is the (ongoing) work on FRED:

Similarly, the News Reader system for extracting information into situated events is another example. Another example is extracting process graphs from medical texts. Finally, in the NLP community there’s an increasing focus on developing resources in order to build automated parsers for frame-style semantic representations (e.g. Abstract Meaning Representation). Such representations can be enhanced by connections to semantic web resources as discussed by Burns et al. (I knew this was a great idea in 2015!)


In summary, I think we’re beginning to see how the background knowledge available on the Semantic Web, combined with better parsers, can help us start to deal better with context in an automated fashion.

semantics on the web

Chris Bizer gave an insightful keynote reflecting on what the community’s expectations were for the semantic web and where we currently are.

He presented stats on the growth of Linked Data (e.g. stuff in the LOD cloud) as well as web data (e.g. schema.org-marked pages), but really the main takeaway is the boom in the latter. About 30% of the Web has HTML-embedded data, something like 12 million websites. There’s an 86% adoption rate on top travel websites. I think the choice quote was:

“Probably, every hotel on earth is represented as web data.”

The problem is that this sort of data is not clean, it’s messy – it’s webby data, which brings us to Chris’s important point for the community:
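Concretely, most of this web data reaches pages as microdata, RDFa, or JSON-LD. Here’s a minimal sketch of pulling schema.org JSON-LD out of a page; the page and the hotel in it are invented, and a real extractor would be far more forgiving about markup variations:

```python
import json
import re

# A toy page carrying schema.org markup as JSON-LD -- the kind of
# HTML-embedded data the Web Data Commons crawls count.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org",
 "@type": "Hotel",
 "name": "Example Hotel Kobe",
 "address": {"@type": "PostalAddress", "addressLocality": "Kobe"}}
</script>
</head><body>...</body></html>
"""

def extract_jsonld(page):
    """Return every JSON-LD block embedded in an HTML page."""
    pattern = r'<script type="application/ld\+json">(.*?)</script>'
    return [json.loads(m) for m in re.findall(pattern, page, re.DOTALL)]

for item in extract_jsonld(html):
    print(item["@type"], "-", item["name"])  # Hotel - Example Hotel Kobe
```

Real pipelines (libraries like extruct, or the Web Data Commons extraction itself) also handle microdata and RDFa and all the messy variants, but the principle is the same: structured descriptions hiding in ordinary pages.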


While standards have brought us a lot, I think we are starting as a research community to think increasingly about different kinds of semantics and different kinds of structured data.  Some examples from the conference:

An embrace of the whole spectrum of semantics on the web is really a valuable move for the research community. Interestingly enough, I think we can truly experiment with web data through things like Common Crawl and the Web Data Commons. As knowledge graphs, triple stores, and ontologies become increasingly commonplace, especially in enterprise deployments, I’m heartened by these new areas of investigation.

The next challenge: Science

The third keynote of ISWC, by Professor Hiroaki Kitano (the CEO of Sony CSL, creator of, among other things, the AIBO, and founder of RoboCup), was an inspirational speech laying out what he sees as the next AI grand challenge:


It will be hard for me to do justice to the keynote as the material-per-second ratio was pretty much off the chart, but he has an AI Magazine article laying out the vision.

Broadly, he used RoboCup as a framework for discussing how to organize a challenge and pointed to its effectiveness (e.g. Kiva Systems, a RoboCup spinout, was acquired by Amazon for $770 million). He then focused on the issue of the inefficiency in scientific discovery and in particular how assembling knowledge is just too difficult.


Assembling this by hand is way too hard!

He then went on to reframe the scientific question as one of a massive search and verification of hypothesis space.

I walked out of that keynote pretty charged up.

I think the semantic web community can be a big part of tackling this grand challenge. Science and medicine have always been important domains for applying these technologies and that showed up at this conference as well:

SPARQL as a driver for other CS communities

The 10-year award was given to Jorge Perez, Marcelo Arenas and Claudio Gutierrez for their paper Semantics and Complexity of SPARQL. Jorge gave just a beautiful 10-minute reflection on the paper and the relationship between theory and practice. I think his slide below really sums up the impact that SPARQL has had, not just on the semantic web community but on CS as a whole:


As further evidence, I thought one of the best technical talks of the conference (even through an earthquake) was by Peter Boncz on emergent schemas for RDF querying.

It was a clear example of how the DB and semweb communities are learning from one another, and of how the semantic web’s different requirements (e.g. around schemas) drive new research.
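The intuition behind emergent schemas can be sketched with characteristic sets: subjects that share the same set of predicates are candidates for rows of one relational table. This toy illustration (with invented triples) is just the grouping idea, not the actual system from the talk:

```python
from collections import defaultdict

# Triples are (subject, predicate, object); the data is made up.
triples = [
    ("alice",  "name",  "Alice"), ("alice", "email", "a@x.org"),
    ("bob",    "name",  "Bob"),   ("bob",   "email", "b@x.org"),
    ("sparql", "title", "Semantics and Complexity of SPARQL"),
]

def characteristic_sets(triples):
    """Group subjects by their predicate set -> one candidate table each."""
    preds = defaultdict(set)
    for s, p, _ in triples:
        preds[s].add(p)
    tables = defaultdict(list)
    for s, ps in preds.items():
        tables[frozenset(ps)].append(s)
    return tables

for cols, rows in characteristic_sets(triples).items():
    print(sorted(cols), "->", sorted(rows))
```

A real system then has to handle the long tail of subjects that almost (but not quite) fit a table, which is exactly where the webby messiness comes back in.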

As a whole, it’s hard to beat a conference where you learn a ton and that has the following:


Random Pointers

It’s kind of appropriate that my last post of 2015 was about the International Semantic Web Conference (ISWC) and my first post of 2016 will be about ISWC.

This year’s conference will be held in Kobe, Japan, and already has a number of great things in store. We have a stellar list of keynote speakers:

  • Kathleen McKeown – Professor of Computer Science at Columbia University,
    Director of the Institute for Data Sciences and Engineering, and Director of the North East Big Data Hub. I was at the hub’s launch last year and it’s really amazing the researchers she brought together through that hub.
  • Hiroaki Kitano – CEO of Sony Computer Science Laboratory and President of the Systems Biology Institute. A truly inspirational figure who has done everything from RoboCup to systems biology. He was even an invited artist at MoMA.
  • Chris Bizer – Professor at the University of Mannheim and Director of the Institute of Computer Science and Business Informatics there. If you’re in the Semantic Web community – you know the amazing work Chris has done. He really kicked the entire move toward Linked Data into high gear.

We have three tracks for you to submit to:

  1. The classic Research Track. Elena and I hope to get your most innovative and groundbreaking work on the cross between semantics and the web writ large. We’ve put together a top notch PC to give you feedback.
  2. The Resources Track. Reusable resources like datasets, ontologies, benchmarks and tools are crucial for many research disciplines and especially ours. This track focuses on highlighting them. Alasdair and Marta have put together a rich set of guidelines for great reusable resources. Check them out.
  3. The Applications Track provides an area to discuss the benefits and challenges of applying semantic technologies. This track, organized by Markus and Freddy, is accepting three different types of submissions on in-use applications, emerging applications and industry applications.

In addition to these tracks, ISWC 2016 will have a full program of workshops, posters, demos and student opportunities.

This year we’ll also be allowing submissions to be HTML, letting you experiment with new ways of conveying your contributions. I’m excited to see the creativity in the community using web technologies.

So get those submissions in. Abstracts are due April 20, full submissions April 30th!

Next week is the 2015 International Semantic Web Conference. I had the opportunity with Michel Dumontier to chair a new track on Datasets and Ontologies. A key part of the Semantic Web has always been shared resources, whether it’s common standards through the W3C or open datasets like those found in the LOD cloud. Indeed, one of the major successes of our community is the availability of these resources.

ISWC over the years has experimented with different ways of highlighting these contributions and bringing them into the scientific literature. For the past couple of years, we have had an evaluation track specifically devoted to reproducibility and evaluation studies. Last year datasets were included to form a larger RDBS track. This year we again have a specific Empirical Studies and Evaluation track alongside the Data & Ontologies track.

The reviewers had a tough job for this track. First, it was new, so it was hard to make standard judgments. Secondly, we asked reviewers not only to review the paper but also the resource itself along a number of dimensions. Overall, I think they did a good job. Below you’ll find the resources chosen for presentation at the conference and a brief headline of what to me is interesting about each paper. In the spirit of the track, I link to the resource as well as the paper.

Datasets

  • Automatic Curation of Clinical Trials Data in LinkedCT by Oktie Hassanzadeh and Renée J Miller (paper) – clinicaltrials.gov published as linked data in an open and queryable form. This resource has been around since 2008. I love the fact that they post downtime and other status info on twitter https://twitter.com/linkedct
  • LSQ: Linked SPARQL Queries Dataset by Muhammad Saleem, Muhammad Intizar Ali, Qaiser Mehmood, Aidan Hogan and Axel-Cyrille Ngonga Ngomo (paper) – Query logs are becoming an ever more important resource for everything from search engines to database query optimization. See for example USEWOD. This resource provides queryable versions in SPARQL of the query logs from several major datasets including DBpedia and LinkedGeoData.
  • Provenance-Centered Dataset of Drug-Drug Interactions by Juan Banda, Tobias Kuhn, Nigam Shah and Michel Dumontier (paper) – this resource provides an aggregated set of drug-drug interactions coming from 8 different sources. I like how they provided a DOI for the bulk download of their data source as well as a SPARQL endpoint. It also uses nanopublications as the representation format.
  • Semantic Bridges for Biodiversity Science by Natalia Villanueva-Rosales, Nicholas Del Rio, Deana Pennington and Luis Garnica Chavira (paper) – this resource allows biodiversity scientists to work with species distribution models. The interesting thing about this resource is that it not only provides linked data, a SPARQL endpoint and ontologies, but also semantic web services (i.e. SADI) for orchestrating these models.
  • DBpedia Commons: Structured Multimedia Metadata for Wikimedia Commons by Gaurav Vaidya, Dimitris Kontokostas, Magnus Knuth, Jens Lehmann and Sebastian Hellmann (paper) – this is another chapter in exposing Wikimedia content as structured data. This resource provides structured information for the media content in Wikimedia Commons. Now you can SPARQL for all images with a CC-by-sa v2.0 license.
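To make that last example concrete, here is a hedged sketch of preparing a licence query for a SPARQL endpoint from Python. The endpoint URL and the licence property are illustrative assumptions, not checked against the actual DBpedia Commons vocabulary:

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_licence_query(licence_uri, limit=10):
    """Build a SPARQL query for media with a given licence.
    The dcterms:license property is an assumed modelling choice."""
    return ("SELECT ?file WHERE { ?file "
            "<http://purl.org/dc/terms/license> <" + licence_uri + "> } "
            "LIMIT " + str(limit))

def sparql_request(endpoint, query):
    """Prepare (but don't send) a GET request for a SPARQL endpoint."""
    params = urlencode({"query": query,
                        "format": "application/sparql-results+json"})
    return Request(endpoint + "?" + params)

q = build_licence_query("http://creativecommons.org/licenses/by-sa/2.0/")
# Hypothetical endpoint URL, for illustration only:
req = sparql_request("http://commons.dbpedia.org/sparql", q)
print(req.full_url[:60])
```

In practice you would check the dataset’s documentation for the real endpoint and property URIs and then actually send the request (e.g. with urlopen).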

Ontologies

Overall, I think this is a good representation of the plethora of deep datasets and ontologies that the community is creating.  Take a minute and check out these new resources.

This past week I attended a workshop on the evolution and variation of classification systems, organized by the Knowescape EU project. The project studies how knowledge evolves and makes cool maps like this one:

The aim of the workshop was to discuss how knowledge organization systems and classification systems change. By knowledge organization systems, we mean things like the Universal Decimal Classification system or the Wikipedia category structure. My interest here is the interplay between the change in data and the change in the organization system used for that data. For example, I may use a certain vocabulary or ontology to describe a dataset (i.e. the columns); how does that impact data analysis procedures when that organization system’s meaning changes? Many of our visualization decisions and analyses are based on how we categorize (whether manually or automatically) data according to such organizational structures. Albert Meroño-Peñuela gave an excellent example of that with his work on Dutch historical census data. Furthermore, the organization system used may impact the ability to repurpose and combine data.

Interestingly, even though we’ve seen highly automated approaches emerge for search and other information analysis tasks, Knowledge Organization Systems (KOSs) still often provide extremely useful information. For example, we’ve seen how schema.org and the Wikipedia category structure have been central to the emergence of knowledge graphs. Likewise, extremely adaptable organization systems such as hashtags have been foundational for other services.

At the workshop, I particularly enjoyed Joseph Tennis’s keynote on the diversity and stability of KOSs. His work on ontogeny is starting to measure that change. He demonstrated this by looking at the Dewey Decimal System, but others have shown that the change is apparent in other KOSs (1, 2, 3, 4). Understanding this change could help in constructing better and more applicable organization systems.

From both Joseph’s talk and the talk by Richard Smiraglia (one of the leaders in Knowledge Organization), it’s clear that, as with many other sciences, our ability to understand information systems can now become much more deeply empirical. Because the objects of study (e.g. vocabularies, ontologies, taxonomies, dictionaries) are available on the Web in digital form, we can now analyze them. This is the promise of Web Observatories. Indeed, an interesting outcome of the workshop was that the construction of a KOS observatory is not that far-fetched and could be done using aggregators such as Linked Open Vocabularies and Taxonomy Warehouse. I’ll be interested to see if this gets built.

Finally, it occurred to me that there is a major lack of studies on the evolution of Urban Dictionary as a KOS. Somebody ought to do something about it 🙂

Random Notes

NewsReader Amsterdam Hackathon

This past Wednesday (Jan. 21, 2015) I was at the NewsReader Hackathon. NewsReader is an EU project to extract events and build stories from the news. They use a sophisticated NLP pipeline combined with semantic background knowledge to perform this task. The hackathon was an opportunity to talk to members of one of the leading NLP groups in the Netherlands (CLTL) and find out more about their current pipeline. Additionally, one of the project partners is LexisNexis, a sister company of Elsevier, so it was nice to see how their content was being used as a basis for event extraction and also to meet some of my colleagues. The combination of news and research is particularly of interest in light of the recent Elsevier acquisition of NewsFlo.

Besides the chance to meet people, I also got to do some hacking myself to see how the NewsReader API worked. I used the API to plot the number and type of events featuring universities. (The resulting iPython Notebook)
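The aggregation step of that notebook can be reconstructed as a toy: given event records of the shape a NewsReader-style API might return (the records and field names here are invented), count event types per actor:

```python
from collections import Counter

# Invented event records standing in for API responses.
events = [
    {"actor": "VU University Amsterdam", "type": "motion"},
    {"actor": "VU University Amsterdam", "type": "speech"},
    {"actor": "University of Oxford",    "type": "motion"},
    {"actor": "VU University Amsterdam", "type": "motion"},
]

def event_type_counts(events, actor):
    """Count events per type for one actor."""
    return Counter(e["type"] for e in events if e["actor"] == actor)

print(event_type_counts(events, "VU University Amsterdam"))
# Counter({'motion': 2, 'speech': 1})
```

From a Counter like this, a bar chart of event types per university is one matplotlib call away, which is essentially what the notebook plots.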

A couple of pointers for future reference:

A couple of weeks ago, I was at the European Data Forum in Athens talking about the Open PHACTS project. You can find a video of my talk with slides here. Slides are embedded below.

It’s been about a week since I got back from Australia, where I attended the International Semantic Web Conference (ISWC 2013). This is the premier forum for the latest research on using semantics on the Web. Overall, it was a great conference: both well run and with a good buzz. (Note, I’m probably a bit biased – I was chair of this year’s In-Use track.)

ISWC is a fairly hard conference to get into and the quality is strong.

More importantly, almost all the talks I went to were worth thinking about. You can find the proceedings of the conference online either as a complete zip here or published by Springer. You can find more stats on the conference here.

As an aside, before digging into the meat of the conference – Sydney was great. Really a fantastic city – very cosmopolitan and with great coffee. I suggest Single Origin Roasters.  Also, Australia has wombats – wombats are like the chillest animal ever.

Wombat

From my perspective, there were three main themes to take away from the conference:

  1. Impressive applications of semantic web technologies
  2. Core ontologies as the framework for connecting complex integration and retrieval tasks
  3. Starting to come to grips with messiness

Applications

We are really seeing how semantic technologies can power great applications. All three keynotes highlighted the use of semantic tech. I think Ramanathan Guha’s keynote probably highlighted this best in his discussion of the growth of schema.org.

Beyond the slide above, he brought up representatives from Yandex, Yahoo, and Microsoft on stage to join Google to tell how they are using schema.org. Drupal and WordPress will have schema.org in their cores in 2014. Schema.org is being used to drive everything from veteran friendly job search, to rich pins on Pinterest and enabling Open Table reservations to be easily put into your calendar. So schema.org is clearly a success.

Peter Mika presented a paper on how Yahoo is using ontologies to drive entity recommendations in searches. For example, you search for Brad Pitt and they show you related entities like Angelina Jolie or Fight Club. The nice thing about the paper is that it showed how the deployment in production (in Yahoo! Web Search in the US) increases click-through rates.

Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. International Semantic Web Conference (2) 2013: 33-48

I think it was probably Yves Raimond’s conference: he showed some amazing things being done at the BBC using semantic web technology. He had an excellent keynote at the COLD workshop, also highlighting some challenges on where we need to improve to ease the use of these technologies in production. I recommend you check out the slides above. Of all the applications, their work on mining the World Service archive of the BBC to enrich content being created stood out. This work won the Semantic Web Challenge.

In the biomedical domain, there were two papers showing how semantics can be embedded in tools that regular users use. One showed how the development of ICD-11 (ICD is the most widely used clinical classification, developed by the WHO) is supported using semtech. The other I liked was the use of Excel templates (developed using RightField) that transparently captured data according to a domain model for systems biology.

Also in the biomedical domain, IBM presented an approach for using semantic web technologies to help coordinate health and social care at the semantic web challenge.

Finally, there was a neat application presented by Jane Hunter applying these technologies to art preservation: The Twentieth Century in Paint.

I did a review of all the in-use papers leading up to the conference, but suffice it to say that there were numerous impressive applications. Also, I think it says something about the health of the community when you see slides like this:

Core Ontologies + Other Methods

There were a number of interesting papers that were around the idea of using a combination of well-known ontologies and then either record linkage or other machine learning methods to populate knowledge bases.

A paper that I liked a lot (and that also won the best student paper award), titled Knowledge Graph Identification (by Jay Pujara, Hui Miao, Lise Getoor and William Cohen), sums it up nicely:

Our approach, knowledge graph identification (KGI) combines the tasks of entity resolution, collective classification and link prediction mediated by rules based on ontological information.
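One ingredient of that combination, using ontological rules to veto candidate facts, can be sketched in a few lines. Everything here (the types, domain constraints, and facts) is invented for illustration, and the actual KGI work jointly optimizes over probabilistic soft logic rather than hard filtering:

```python
# Domain constraints: predicate -> required type of the subject.
domains = {"bornIn": "Person", "capitalOf": "City"}
types = {"Brad_Pitt": "Person", "Canberra": "City"}

# Candidate extracted facts, some noisy (all invented).
facts = [
    ("Brad_Pitt", "bornIn", "Shawnee"),
    ("Canberra", "bornIn", "Australia"),    # violates bornIn's domain
    ("Canberra", "capitalOf", "Australia"),
]

def consistent(facts, domains, types):
    """Keep only facts whose subject satisfies the predicate's domain."""
    keep = []
    for s, p, o in facts:
        required = domains.get(p)
        if required is None or types.get(s) == required:
            keep.append((s, p, o))
    return keep

print(consistent(facts, domains, types))
```

The point of the paper is that doing this jointly with entity resolution and classification (soft constraints, not a hard filter) works much better than pipelining the steps.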

Interesting papers under this theme were:

From my perspective, it was also nice to see the use of the W3C Provenance Model (PROV) as one of these core ontologies in many different papers and two of the keynotes. People are using it as a substructure to do a number of different applications – I intend to write a whole post on this – but until then here’s proof by twitter:

Coming to grips with messiness

It’s pretty evident that when dealing with the web things are messy. There were a couple of papers that documented this empirically either in terms of the availability of endpoints or just looking at the heterogeneity of the markup available from web pages.

In some sense, the papers mentioned in the prior theme also try to deal with this messiness. Here are another couple of papers looking at essentially how to deal with, or even use, this messiness.

One thing that seemed a lot more present at this year’s conference than last year’s was the term entity. This is obviously popular because of things like the Google Knowledge Graph, but in some sense maybe it gives a better description of what we are aiming to get out of the data we have: machine-readable descriptions of real-world concepts/things.

Misc.

There are some things that are of interest that don’t fit neatly into the themes above. So I’ll just try a bulleted list.

  • We won the Best Demo Paper Award for git2prov.org
  • Our paper on using NoSQL stores for RDF went over very well. Congrats to Marcin for giving a good presentation.
  • The format of mixing talks from different tracks by topic and having only 20 minutes per talk was great.
  • VUA had a great showing – 3 main track papers, a bunch of workshop papers, a couple of different posters, 4 workshop organizers giving talks at the workshop summary session, 2 organizing committee members, alumni all over the place, plus a bunch of stuff I probably forgot to mention.
  • The colocation with Web Directions South was great – it added a nice extra energy to the conference.
  • There were best reviewer awards won by Oscar Corcho, Tania Tudorache, and Aidan Hogan
  • Peter Fox seemed to give a keynote just for me – concept maps, PROV followed with abductive reasoning.
  • Did I mention that the coffee in Sydney (and Newcastle) is really good and lots of places serve proper breakfast!