It’s been about a week since I got back from Australia, where I attended the International Semantic Web Conference (ISWC 2013). This is the premier forum for the latest research on using semantics on the Web. Overall, it was a great conference – well run and with a good buzz. (Note: I’m probably a bit biased – I was chair of this year’s In-Use track.)
ISWC is a fairly hard conference to get into, and the quality of the accepted work is strong.
More importantly, almost all the talks I went to were worth thinking about. You can find the proceedings of the conference online either as a complete zip here or published by Springer. You can find more stats on the conference here.
As an aside, before digging into the meat of the conference – Sydney was great. Really a fantastic city – very cosmopolitan and with great coffee. I suggest Single Origin Roasters. Also, Australia has wombats – wombats are like the chillest animal ever.
From my perspective, there were three main themes to take away from the conference:
We are really seeing how semantic technologies can power great applications. All three keynotes highlighted the use of Semantic Tech. I think Ramanathan Guha’s keynote probably highlighted this the best in his discussion of the growth of schema.org.
Beyond the slide above, he brought representatives from Yandex, Yahoo, and Microsoft on stage to join Google in telling how they are using schema.org. Drupal and WordPress will have schema.org support in their cores in 2014. Schema.org is being used to drive everything from veteran-friendly job search, to rich pins on Pinterest, to letting Open Table reservations be easily added to your calendar. So schema.org is clearly a success.
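To make this concrete, here’s a small, purely illustrative sketch (in Python, to keep all the examples here in one language) of the kind of schema.org description a page might embed as JSON-LD. The type and property names are real schema.org terms, but the values are invented for the example:

```python
import json

# Illustrative only: a minimal schema.org JobPosting expressed as JSON-LD,
# the kind of markup a site embeds so that search engines and consumers
# (job-search verticals, rich pins, etc.) can pick it up.
# The property values below are made up for this example.
job_posting = {
    "@context": "http://schema.org",
    "@type": "JobPosting",
    "title": "Data Engineer",
    "hiringOrganization": {"@type": "Organization", "name": "Example Corp"},
    "jobLocation": {
        "@type": "Place",
        "address": {"@type": "PostalAddress", "addressLocality": "Sydney"},
    },
    "datePosted": "2013-11-01",
}

# This string would sit inside a <script type="application/ld+json"> element.
print(json.dumps(job_posting, indent=2))
```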
Peter Mika presented a paper on how Yahoo is using ontologies to drive entity recommendations in search. For example, if you search for Brad Pitt, they show you related entities like Angelina Jolie or Fight Club. The nice thing about the paper is that it showed how the production deployment (in Yahoo! Web Search in the US) increased click-through rates.
I think it was probably Yves Raimond’s conference – he showed some amazing things being done at the BBC using semantic web technology. He gave an excellent keynote at the COLD workshop, also highlighting some challenges where we need to improve to ease the use of these technologies in production. I recommend you check out the slides above. Of all the applications, the one that stood out for me was their work on mining the BBC World Service archive to enrich content as it is being created. This work won the Semantic Web Challenge.
In the biomedical domain, there were two papers showing how semantics can be embedded in tools that regular users use. One showed how the development of ICD-11 (ICD is the most widely used clinical classification, developed by the WHO) is supported using semantic technologies. The other, which I liked, showed the use of Excel templates (developed using RightField) that transparently capture data according to a domain model for systems biology.
Finally, there was a neat application presented by Jane Hunter applying these technologies to art preservation: The Twentieth Century in Paint.
I did a review of all the in-use papers leading up to the conference, but suffice it to say that there were numerous impressive applications. Also, I think it says something about the health of the community when you see slides like this:
There were a number of interesting papers built around the idea of combining well-known ontologies with record linkage or other machine learning methods to populate knowledge bases.
A paper that I liked a lot (and which also won the best student paper award), Knowledge Graph Identification by Jay Pujara, Hui Miao, Lise Getoor and William Cohen, sums up the idea nicely:
Our approach, knowledge graph identification (KGI) combines the tasks of entity resolution, collective classification and link prediction mediated by rules based on ontological information.
Interesting papers under this theme were:
From my perspective, it was also nice to see the W3C Provenance Model (PROV) used as one of these core ontologies in many different papers and in two of the keynotes. People are using it as a substructure for a number of different applications – I intend to write a whole post on this – but until then, here’s proof by Twitter:
It’s pretty evident that when dealing with the web, things are messy. There were a couple of papers that documented this empirically, either in terms of the availability of endpoints or by looking at the heterogeneity of the markup available from web pages.
In some sense, the papers mentioned in the prior theme also try to deal with this messiness. Here are another couple of papers looking at how to deal with, or even take advantage of, this messiness.
One thing that seemed much more present at this year’s conference than last year was the term entity. This is obviously popular because of things like the Google Knowledge Graph – but in some sense it may give a better description of what we are aiming to get out of the data we have: machine-readable descriptions of real-world concepts/things.
There are some things that are of interest that don’t fit neatly into the themes above. So I’ll just try a bulleted list.
Altmetrics has seen increasing interest as an alternative to traditional measures of academic performance. This past week I gave a talk in Amsterdam for Open Access Week about how altmetrics can be used by academics and their organizations to highlight their broader set of contributions. These can be used to tell a richer and fuller story about how what we do has impact. The talk had a nice turnout of librarians, faculty and administrators (friendly faces below).
In relation to the talk, I was interviewed by the Dutch national newspaper de Volkskrant on the same theme (Twitter neemt wetenschap steeds meer de maat – roughly, “Twitter increasingly takes the measure of science”).
You can find the slides of the talk below, and I’m told there will be video as well. A big thanks to the altmetrics community – the recent PLOS ALM workshop was a great resource for material – and to Cameron Neylon for allowing me to reuse some of his slides. Overall, I hope that I helped some more people understand how these new forms of metrics can help in showing the impact of what they do.
Yesterday, Luc (my coauthor) and I received our physical copies of Provenance: An Introduction to PROV in the mail. Even though the book is primarily designed to be distributed digitally, it’s always great to actually hold a copy in your hands. You can now order your own physical copy on Amazon. The Amazon page for the book also lets you look inside it.
Cross-posted from blog.provbook.org
If you follow this blog, you’ll know that one of the main themes of my research is data provenance, and that one of its main use cases is reproducibility and transparency in science. I’ve been attending and speaking at quite a few events on data sharing, reproducibility and making science more transparent. I’ve even published [1, 2] on these topics.
In this context, I’ve been thinking about my own process as a scientist and whether I’m “eating my own dogfood”. Indeed, at the Beyond the PDF 2 conference in March, I stood up at the end and, in front of ~200 people, said that I would change my work practice – we have enough tools to really change how we do science. I knew I could do better.
So this post is about doing just that. In general, my research work consists of larger infrastructure projects done in collaboration, and smaller work developing experimental prototypes and mucking about with new algorithms. The former projects use all the standard software development tools (GitHub, Jira, wikis), so they get documented fairly well.
The bit that’s not as good as it should be is the smaller-scale work. I think my co-authors and I do an OK job of publishing the code and data associated with our publications, although this could be improved (it’s too often only on our own websites). The major issue I have is that the methods are probably not as reproducible or transparent as they should be – essentially, it’s a bit messy for other people to figure out exactly what I was up to when doing something new. It’s not in one place, nor is it clearly documented. It also hurts my own process, in that a lot of the mucking about I do gets lost or takes time to find. I see this as a particular problem as I do more web science research, where gathering, cleaning and reanalyzing data is a critical part of the endeavor.
To address this, I’ve decided to adopt IPython Notebooks as my new note-taking environment. This lets me try different things out and keep all the parts of a project together. Additionally, it lets me “narrate my work” – that is, mix commentary with my code – which is pretty cool. My notebook is on GitHub and also contains information about how my system is set up, including the versions of the libraries I’m relying on.
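For instance, one small habit this enables is recording the environment right next to the analysis. Here’s a minimal sketch of the kind of cell that could sit at the top of a notebook; the specific libraries listed are just placeholders for whatever a project actually depends on:

```python
import sys
import platform

# Record the environment a notebook was run in, right next to the analysis.
# The libraries listed are placeholders for whatever the project depends on.
print("Python:", sys.version.split()[0], "on", platform.platform())

for name in ["numpy", "pandas", "matplotlib"]:
    try:
        module = __import__(name)
        print(name, getattr(module, "__version__", "unknown version"))
    except ImportError:
        print(name, "not installed")
```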
There’s still a long way to go to pass Phil’s test for research programming effectiveness (see also Why use make?), but I think this is a step in the right direction.
To honor this step, I’m giving $100 to FORCE11 to spread the word about how we can make scholarship better.
In April, we launched the Open PHACTS Discovery Platform with a corresponding API, allowing developers to create drug discovery applications without having to worry about the complexities and pain of integrating multiple databases. We’ve had some great applications developed on top of this API. If you’re a developer in this space, I encourage you to take a look and see what you can create. Below is a slide set and a webinar on getting started with the API. You can also check out https://dev.openphacts.org for developer documentation and to get an account.
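As a rough illustration of what getting started looks like, here’s a hypothetical sketch of calling the API over HTTP from Python with the requests library. The base URL, endpoint path, parameter names, and the app_id/app_key credentials shown are assumptions for illustration only – see https://dev.openphacts.org for the actual documented calls:

```python
import requests

# Hypothetical sketch of calling the Open PHACTS API over HTTP.
# The base URL, endpoint path, parameter names and the app_id/app_key
# credentials are assumptions for illustration only; consult
# https://dev.openphacts.org for the actual documented calls.
BASE_URL = "https://beta.openphacts.org/1.3"   # assumed base URL
params = {
    "uri": "http://example.org/compound/123",  # placeholder compound URI
    "app_id": "YOUR_APP_ID",                   # from your developer account
    "app_key": "YOUR_APP_KEY",
    "_format": "json",
}

response = requests.get(BASE_URL + "/compound", params=params)
response.raise_for_status()
print(response.json())
```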
This past week we (Achille Fokoue & myself) sent the paper notifications for the 2013 International Semantic Web Conference’s In-Use Track. The track seeks to highlight innovative semantic technologies being applied and deployed in practice. With the selection made by the program committee (Thanks!), I think we have definitely achieved that goal.
So if you’re coming to Sydney (and you should definitely be coming to Sydney), here’s what’s in store. (Papers are listed below.) You’ll see a number of papers where semantic technologies are being deployed in companies to help end users, including:
Dealing with streaming data has been a growing research theme in recent years. In the in-use track, we are seeing some of the fruits of that research, in particular with respect to monitoring city events. Balduini et al. report on the use of the Streaming Linked Data Framework for monitoring the London 2012 Olympic Games and Milano Design Week 2013. (Yes, the semantic web is fashionable.) IBM will present its work on real-time urban monitoring of Dublin, requiring both scale and low-latency solutions.
Semantic technologies have a long history of being deployed in healthcare and the life sciences, and we’ll see that again at this year’s conference. We get a progress report on the usage of these technologies in the development of the 11th revision of the International Classification of Diseases (ICD-11), which involves 270 domain experts using the iCAT tool. We see how intermixing (plain-old) spreadsheets with semantic technologies is enabling systems biology to better share its data. In the life sciences, and in particular in drug discovery, both public and private data are critical; we see how the Open PHACTS project is tackling the problem of intermixing such data.
Continuing on the science theme, the track will have reports on improving the reliability of scientific workflows, how linked data is being leveraged to understand the economic impact of R&D in Europe, and how our community is “eating its own dogfood” to enable better scientometric analysis of journals. Lastly, you’ll get a talk on the use of semantic annotations to help crowdsource 3D representations of Greek pottery for cultural heritage (a paper that I just think is so cool – I hope there are videos).
Reasoning relies on the availability of data exposed with its associated semantics. We’ve seen how the Linking Open Data movement helped bootstrap the uptake of Semantic Web technologies. Likewise, the widespread deployment of RDFa and microformats has dramatically increased the amount of data available. But what’s out there? Bizer et al. give us a report based on analyzing 3 billion web pages. (I expect some awesome charts in this presentation.)
Enriching data with semantics has benefits but also comes at a cost. Based on a case study of converting the Norwegian Petroleum Directorate’s FactPages, we’ll get insight into those trade-offs. Reducing the effort for such conversions, and particularly for interlinking, is a key challenge. The Cross-language Service Retrieve system is tackling this for open government data across multiple languages.
Finally, in practice, a key way to “semantize” data is through the use of natural language processing tools. You’ll see how semantic tech is facilitating the reusability and interoperability of NLP tools using the NIF 2.0 framework.
I hope you’ll agree that this really represents the best from the semantic web community. These 16 papers were selected from 79 submissions. The program committee (for the most part) did a great job, both with their reviews and, importantly, with the discussion. In many cases it was a hard decision, and the PC’s ability to discuss and revise their views was crucial in making the final selection. It is a lot of work, and we definitely asked them to do it in a fairly compressed timeframe. Thank you!
A couple of other thoughts: I think the decision to institute abstract submissions for the in-use track was a good one, and author rebuttals were more helpful than I thought they would be.
ISWC 2013 is going to be a fantastic conference. I’m looking forward to the location, the sessions and the community. I look forward to seeing you there. There are many ways to participate so check out http://iswc2013.semanticweb.org.