It’s kind of appropriate that my last post of 2015 was about the International Semantic Web Conference (ISWC) and my first post of 2016 will be about ISWC.

This year’s conference will be held in Kobe, Japan, and it already has a number of great things in store. We have a stellar list of keynote speakers:

  • Kathleen McKeown – Professor of Computer Science at Columbia University,
    Director of the Institute for Data Sciences and Engineering, and Director of the North East Big Data Hub. I was at the hub’s launch last year, and it’s amazing the range of researchers she has brought together through it.
  • Hiroaki Kitano – CEO of Sony Computer Science Laboratory and President of the Systems Biology Institute. A truly inspirational figure who has done everything from RoboCup to systems biology. He was even an invited artist at MoMA.
  • Chris Bizer – Professor at the University of Mannheim and Director of the Institute of Computer Science and Business Informatics there. If you’re in the Semantic Web community, you know the amazing work Chris has done. He really kicked the entire move toward Linked Data into high gear.

We have three tracks for you to submit to:

  1. The classic Research Track. Elena and I hope to get your most innovative and groundbreaking work at the intersection of semantics and the web writ large. We’ve put together a top-notch PC to give you feedback.
  2. The Resources Track. Reusable resources like datasets, ontologies, benchmarks and tools are crucial for many research disciplines, and especially ours. This track focuses on highlighting them. Alasdair and Marta have put together a rich set of guidelines for great reusable resources. Check them out.
  3. The Applications Track provides an area to discuss the benefits and challenges of applying semantic technologies. This track, organized by Markus and Freddy, is accepting three different types of submissions, including papers on in-use and industry applications.

In addition to these tracks, ISWC 2016 will have a full program of workshops, posters, demos and student opportunities.

This year we’ll also be allowing submissions in HTML, letting you experiment with new ways of conveying your contributions. I’m excited to see the creativity of the community in using web technologies.

So get those submissions in. Abstracts are due April 20 and full submissions April 30th!


Last week, I hung out in Bethlehem, Pennsylvania for the 14th International Semantic Web Conference. Bethlehem is famous for the Lehigh University Benchmark (LUBM) and Bethlehem Steel. This is the major conference focused on the intersection of semantics and web technologies. In addition to being technically super cool, it was a great chance for me to meet many friends and make some new ones.

Let’s begin with some stats:

  • ~450 attendees
  • The conference continues to be selective:
    • Research track: 22% acceptance rate
    • Empirical studies track: 29% acceptance rate
    • In-use track: 40% acceptance rate
    • Datasets and Ontologies: 22% acceptance rate
  • There were 265 submissions across all tracks which is surprisingly the same number as last year.
  • More stats and info in Stefan’s slides (e.g. move to Portugal if you want to get your papers in the conference.)
  • Fancy visualizations courtesy of the STKO group

Before getting into what I thought were the major themes of the conference, a brief note. Reviewing is at the heart of any academic conference. While we can always try to improve review quality, it’s worth calling out good reviewing. The best reviewers were Maribel Acosta (research) and Markus Krötzsch (applied). As Datasets and Ontologies track co-chair, I can attest to how important good reviewers are. For this new track we relied heavily on reviewers being flexible and looking at these sorts of contributions differently. So thanks to them!

For me there were three themes of ISWC:

  1. The Spectrum of Entity Resolution
  2. The Spectrum of Linked Data Querying
  3. Buy more RAM

The Spectrum of Entity Resolution

Maybe its because I attended the NLP & DBpedia workshop or the conversation I had about string similarity with Michelle Cheatham, but one theme that I saw was the continued amalgamation of natural language processing (NLP) style entity resolution with database entity resolution (i.e. record linkage). This movement stems from the fact that an increasing amount of linked data is a combination of data extracted from semi-structured sources as well as from NLP. But in addition to that, NLP sources rely on some of these semi-structured datasources to do NLP.

Probably the best example of that idea is the work that Andrew McCallum presented in his keynote on “epistemological knowledge bases”.

Briefly, the idea is to reason jointly with all the information coming from basic low-level NLP (e.g. basic NER, or even surface forms) as well as the knowledge base (plus anything else) in order to generate a knowledge base. One method to do this is universal schemas. For a good intro, check out Sebastian Riedel’s slides.
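
To make the universal-schema idea a bit more concrete, here is a toy sketch of my own (not code from McCallum’s or Riedel’s groups): rows are entity pairs, columns mix KB relations with textual patterns, and a low-rank factorization fills in the missing cells.

```python
# Toy sketch of the universal-schema idea: rows are entity pairs, columns mix
# KB relations and textual surface patterns, and a low-rank factorization
# predicts missing cells. Illustrative only; all names and data are made up.
import numpy as np

rng = np.random.default_rng(0)

pairs = ["(Obama, USA)", "(Merkel, Germany)", "(Kobe, Japan)"]
relations = ["kb:leaderOf", "kb:locatedIn", 'text:"X leads Y"', 'text:"X is a city in Y"']

# Observed (pair, relation) cells, e.g. extracted from text or taken from a KB.
observed = [(0, 0), (0, 2), (1, 2), (2, 1), (2, 3)]

k = 4  # embedding dimension
P = 0.1 * rng.standard_normal((len(pairs), k))      # entity-pair embeddings
R = 0.1 * rng.standard_normal((len(relations), k))  # relation/pattern embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Logistic matrix factorization trained with randomly sampled (noisy) negatives.
for epoch in range(2000):
    for i, j in observed:
        g = 1.0 - sigmoid(P[i] @ R[j])                        # positive update
        P[i], R[j] = P[i] + 0.05 * g * R[j], R[j] + 0.05 * g * P[i]
        jn = int(rng.integers(len(relations)))                # sampled negative
        g = -sigmoid(P[i] @ R[jn])
        P[i], R[jn] = P[i] + 0.05 * g * R[jn], R[jn] + 0.05 * g * P[i]

# The joint-inference payoff: because "(Merkel, Germany)" shares the textual
# pattern "X leads Y" with "(Obama, USA)", the model should also give a
# relatively high score to the unobserved cell (Merkel, Germany) x kb:leaderOf.
print(sigmoid(P[1] @ R[0]))
```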

From McCallum, I liked the following papers, which give a good justification and results for doing collective/joint inference.

(Self-promotion aside: check out Sara Magliacane’s work on Probabilistic Soft Logic for another way of doing joint inference.)

Following on from this notion of reasoning jointly, Hulpus, Prangnawarat and Hayes showed how to use the graph-based structure of linked data to perform joint entity and word sense disambiguation from text. Likewise, Prokofyev et al. use the properties of a knowledge graph to perform better co-reference resolution. Essentially, they use this background knowledge to split the clusters of co-referent entities produced by Stanford CoreNLP. On the same idea, but for more structured data, the TableEL system uses a joint model with soft constraints to perform entity linking for web tables, improving performance by up to 75% on web tables. (code & data)

One approach to entity linking that I liked came from Raphael Troncy’s crew and is titled “Reveal Entities From Texts With a Hybrid Approach” (paper, slides). (Shouldn’t it be “Revealing…”?) They showed that by using, essentially, the provenance of the data sources, they are able to build an adaptive entity linking pipeline. Thus, one doesn’t necessarily have to do as much domain tuning to use these pipelines.

While not specifically about entity resolution, a paper worth pointing out is Type-Constrained Representation Learning in Knowledge Graphs from Denis Krompaß, Stephan Baier and Volker Tresp. They show how background knowledge about entity types can help improve link prediction tasks for generating knowledge graphs. Again, use the kitchen sink and you’ll perform better.

There were a couple of good resources presented for entity resolution tasks. Bryl, Bizer and Paulheim produced a dataset of surface forms for DBpedia entities. Through filtering, they were able to boost performance by up to 20% for extracting accurate surface forms for entities. Another tool, LANCE, looks great for systematically generating benchmark and test sets for instance matching (i.e. entity linking). Also, Michel Dumontier presented work that provides a benchmark for entity linking in the life sciences domain.
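
To give a flavor of what frequency-based filtering of surface forms can look like, here is a small illustrative sketch; it is my own toy example and not the actual method of Bryl, Bizer and Paulheim.

```python
# A hypothetical sketch of frequency-based surface-form filtering
# (illustrative only, not the filtering used by Bryl, Bizer and Paulheim).
from collections import Counter, defaultdict

# (surface form, entity) mentions, e.g. harvested from anchor texts
mentions = [
    ("Kobe", "dbr:Kobe"), ("Kobe", "dbr:Kobe"), ("Kobe", "dbr:Kobe_Bryant"),
    ("Big Apple", "dbr:New_York_City"), ("Apple", "dbr:Apple_Inc."),
    ("Apple", "dbr:Apple_Inc."), ("Apple", "dbr:Apple"),
]

counts = Counter(mentions)
by_form = defaultdict(list)
for (form, entity), n in counts.items():
    by_form[form].append((entity, n))

def filtered_surface_forms(min_count=2, min_precision=0.6):
    """Keep (form, entity) pairs that are frequent and unambiguous enough."""
    keep = []
    for form, entities in by_form.items():
        total = sum(n for _, n in entities)
        for entity, n in entities:
            if n >= min_count and n / total >= min_precision:
                keep.append((form, entity, round(n / total, 2)))
    return keep

print(filtered_surface_forms())  # e.g. keeps ("Kobe", "dbr:Kobe") but drops rare, ambiguous forms
```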

Finally, as we get better at entity resolution, I think people will turn towards fusion (getting the best possible representation for a real world entity). Examples include:

The Spectrum of Linked Data Querying

So Linked Data Fragments from Ruben Verborgh was the huge breakout of the conference. Oscar Corcho’s excellent COLD keynote was a riff on thinking about the spectrum (from data dumps through to full SPARQL queries) that was introduced by Ruben. Another example was the work of Maribel Acosta and Maria-Esther Vidal on “Networks of Linked Data Eddies: An Adaptive Web Query Processing Engine for RDF Data”. They developed an adaptive client-side SPARQL query engine for Linked Data Fragments. This allows the server side to support a much simpler API by having a more intelligent client side. (An aside: kids, this is how a technical talk should be done. Precise, clean, technical, understandable. Can’t wait to have the video lecture for reference.)

Even the most centralized solution, the LOD Laundromat, which is a clean crawl of the entire web of data, supports Linked Data Fragments. In some sense, by asking the server to do less you can handle more linked data, and thus do more powerful analysis. This is exemplified by the best paper, LOD Lab by Laurens Rietveld, Wouter Beek, and Stefan Schlobach, which allowed for the reproduction of three existing analyses of the web of data at scale.

I think Olaf Hartig, in his paper on LDQL, framed the problem best as (N, Q) (slides). First define the “crawl” of the web you want to query (N), and then define the query (Q). When we think about what and where our crawls are, we can think about what execution strategies and types of queries we can best support. Or put another way:
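
Here is a rough sketch of that (N, Q) framing in code. This is my own illustration of the spirit of the idea rather than Hartig’s LDQL semantics, and it assumes rdflib plus a couple of dereferenceable DBpedia documents.

```python
# A rough (N, Q) sketch: first fix the set of web documents to "crawl" (N),
# then run the query (Q) over their union. Illustrative only; this is not
# Olaf Hartig's LDQL semantics, just the spirit of the framing.
from rdflib import Graph

# N: the slice of the web of data we choose to query
N = [
    "http://dbpedia.org/resource/Kobe",                       # assumed dereferenceable
    "http://dbpedia.org/resource/Bethlehem,_Pennsylvania",
]

g = Graph()
for url in N:
    try:
        g.parse(url)  # content negotiation should hand back RDF
    except Exception as e:
        print(f"skipping {url}: {e}")

# Q: an ordinary SPARQL query evaluated over the fetched graph
Q = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label WHERE {
  ?s rdfs:label ?label .
  FILTER (lang(?label) = "en")
} LIMIT 5
"""
for row in g.query(Q):
    print(row.s, row.label)
```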

More Main Memory = better Triple Stores

Designing scalable graph/triple stores has always been a challenge. We’ve been trapped by the limits of RAM. But computer architecture is changing, and we now have systems with a lot of main memory, either in one machine or across multiple machines. This is a boon to triple stores and graph processing in general. See, for example, the Leskovec team’s work from SIGMOD:

We saw that theme at ISWC as well:

Moral of the story: Buy RAM

Conclusion

This year’s conference explored the many spectra of the combination of the web and semantics. I liked the mix of methods used by papers and the range of results, from the practical (the industry session was packed) to the theoretical. I also think the community is no longer hemmed in by the standards but is using them as a solid starting point. This was pointed out by Ian Horrocks in his keynote:
Additionally, this flexibility was exemplified by the best applied paper, “Building and Using a Knowledge Graph to Combat Human Trafficking” by Pedro Szekely et al. They used the parts of the Semantic Web stack that helped (like ontologies and JSON-LD) but used Elasticsearch for storage, creating a vital solution to a really challenging problem.
Overall, this was an excellent conference. Next year’s conference is in Kobe; I hope you submit some great papers and I’ll see you there!

Random Thoughts

Last week (Oct 7–9) the altmetrics community made its way to Amsterdam for 2:AM (the second altmetrics conference) and altmetrics15 (the 4th altmetrics workshop). The conference is aimed more at practitioners, while the workshop has a bit more of a research focus. I enjoyed the events both from a content perspective (I’m biased as a co-organizer) and from a logistics perspective (I could bike from home). This was the five-year anniversary of the altmetrics manifesto, so it was a great opportunity to reflect on the status of the community. Plus, the conference organizers brought cake!

This was the first time that all of the manifesto’s authors were in the same room together, and we got a chance to share some of our thoughts. The video is here if you want to hear us pontificate:

From my perspective, I think you can summarize the past years in two bullet points:

  • It’s amazing what the community has done: multiple startups on altmetrics, big companies with altmetric products, many articles and other research objects with altmetric scores, and a small but vibrant research community.
  • It would be great to focus more on altmetrics to improve the research process rather than just their potential use in research evaluation.

Beyond the reflection on the community itself, I took three themes from the conference:

More & different data please

An interesting aspect is that most studies and implementations rely on social media data (Twitter, Mendeley, Facebook, blogs, etc.). As an aside, it’s worth noting you can do amazing things with this data in a very short amount of time…

However, there is increasing interest in having data from other sources or having more contextualized data.

There were several good examples. One talk looked at trying to get the data behind who tweets about scientific articles; I’m excited to see what better population data can help us do. The folks at altmetric.com are starting to provide data on how articles are being used in public policy documents. Finally, moving beyond articles, Peter van Besselaar is looking at data derived from grant review processes to study, for example, gender bias.

It’s also good to see developments such as the DOI Event Tracker, which makes the aggregation of altmetrics data easier. This is hopefully just the start, and we will see a continued expansion of the variety of data available for studies.
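
As a sketch of the kind of thing easier aggregation enables, here is how tallying events per source for a paper might look. The endpoint URL and JSON shape are placeholders I have made up for illustration, not the DOI Event Tracker’s actual API.

```python
# A sketch of tallying "events" about a paper by source. The endpoint URL and
# the JSON response shape below are placeholders for illustration, NOT the
# actual DOI Event Tracker API.
from collections import Counter
import requests

AGGREGATOR_URL = "https://aggregator.example.org/events"  # hypothetical endpoint

def events_by_source(doi: str) -> Counter:
    resp = requests.get(AGGREGATOR_URL, params={"doi": doi}, timeout=30)
    resp.raise_for_status()
    events = resp.json().get("events", [])   # assumed response shape
    return Counter(e.get("source", "unknown") for e in events)

if __name__ == "__main__":
    # e.g. Counter({"twitter": 42, "mendeley": 17, "policy": 2})
    print(events_by_source("10.1234/example.doi"))
```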

The role of theory

There was quite a bit of discussion about the appropriateness of altmetrics for different tasks, ranging from the development of global evaluation measures to their role in understanding the science system. There was a long discussion of the quality of altmetrics data, in particular the transparency of how aggregators integrate and provide data.

A number of presenters discussed the need for theory in trying to interpret altmetrics signals. Cameron Neylon gave an excellent talk about his view of the need for a different theoretical view. There was also a breakout session at the workshop discussing the role of theory, and I look forward to the etherpad becoming something more well defined. Peter van Besselaar and I also tried to argue for a question-driven approach when using altmetrics.

Finally, I enjoyed the work of Stefanie Haustein, Timothy Bowman, and Rodrigo Costas on interpreting the meaning of altmetrics. This is definitely a must read.

Going beyond research evaluation

I had a number of good conversations with people about the desire to do something that moves beyond the focus on research evaluation. In all honesty, being able to tell stories with a variety of metrics is probably why altmetrics has gained traction.

However, I think a world in which understanding the signals produced by the research system can be used to improve research is the exciting bit. There were some hints of this. In particular, I was compelled by the work of Kristi Holmes on using measures to improve translational medicine at Northwestern.

Wrap-up

Overall, it’s great to see all the activity around altmetrics. There are a bunch of great summaries of the event. Check out the altmetrics conference blog and Julie Birkholz’s summary.

Next week is the 2015 International Semantic Web Conference. I had the opportunity, with Michel Dumontier, to chair a new track on Datasets and Ontologies. A key part of the Semantic Web has always been shared resources, whether it’s common standards through the W3C or open datasets like those found in the LOD cloud. Indeed, one of the major successes of our community is the availability of these resources.

ISWC over the years has experimented with different ways of highlighting these contributions and bringing them into the scientific literature. For the past couple of years, we have had an evaluation track specifically devoted to reproducibility and evaluation studies. Last year datasets were included to form a larger RDBS track. This year we again have a specific Empirical Studies and Evaluation track alongside the Datasets and Ontologies track.

The reviewers had a tough job for this track. First, it was new, so it’s hard to make a standard judgment. Secondly, we asked reviewers not only to review the paper but also the resource itself along a number of dimensions. Overall, I think they did a good job. Below you’ll find the resources chosen for presentation at the conference and a brief headline of what to me is interesting about each paper. In the spirit of the track, I link to the resource as well as the paper.

Datasets

  • Automatic Curation of Clinical Trials Data in LinkedCT by Oktie Hassanzadeh and Renée J Miller (paper) – clinicaltrials.gov published as linked data in an open and queryable form. This resource has been around since 2008. I love the fact that they post downtime and other status info on Twitter: https://twitter.com/linkedct
  • LSQ: Linked SPARQL Queries Dataset by Muhammad Saleem, Muhammad Intizar Ali, Qaiser Mehmood, Aidan Hogan and Axel-Cyrille Ngonga Ngomo (paper) – Query logs are becoming an ever more important resource for everything from search engines to database query optimization; see, for example, USEWOD. This resource provides queryable SPARQL versions of the query logs from several major datasets, including DBpedia and LinkedGeoData. (See the query sketch after this list.)
  • Provenance-Centered Dataset of Drug-Drug Interactions by Juan Banda, Tobias Kuhn, Nigam Shah and Michel Dumontier (paper) – this resource provides an aggregated set of drug-drug interactions coming from 8 different sources. I like how they provide a DOI for the bulk download of their data source as well as a SPARQL endpoint. It also uses nanopublications as the representation format.
  • Semantic Bridges for Biodiversity Science by Natalia Villanueva-Rosales, Nicholas Del Rio, Deana Pennington and Luis Garnica Chavira (paper) – this resource allows biodiversity scientists to work with species distribution models. The interesting thing about this resource is that it provides not only linked data, a SPARQL endpoint and ontologies but also semantic web services (i.e. SADI) for orchestrating these models.
  • DBpedia Commons: Structured Multimedia Metadata for Wikimedia Commons by Gaurav Vaidya, Dimitris Kontokostas, Magnus Knuth, Jens Lehmann and Sebastian Hellmann (paper) – this is another chapter in exposing Wikimedia content as structured data. This resource provides structured information for the media content in Wikimedia Commons. Now you can write a SPARQL query for all images with a CC BY-SA 2.0 license.
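
Since several of these resources expose SPARQL endpoints, here is a minimal sketch of poking at one with SPARQLWrapper. The endpoint URL below is a placeholder, so swap in the address from the resource’s own documentation.

```python
# A minimal sketch of querying one of these SPARQL endpoints with SPARQLWrapper.
# The endpoint URL here is a placeholder for illustration; check each
# resource's documentation for the real address and vocabulary.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://example.org/sparql")  # hypothetical endpoint
endpoint.setQuery("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```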

Ontologies

Overall, I think this is a good representation of the plethora of deep datasets and ontologies that the community is creating.  Take a minute and check out these new resources.

I was in southern California for essentially a big chunk of August. I had a day visit to the Information Sciences Institute (slides here), some nice discussions with friends, and also a chance to hang out at the ocean. So here are 10 observations:

  1. I still think hooking up  Abstract Meaning Representation to linked data semantics is something worth trying out.
  2. What is data? I like Christine Borgman’s definition: “Data refers to entities used as evidence of phenomena for the purposes of research or scholarship” (p. 29).
  3. Silicon Beach is like a thing. Overheard in Venice, literally: “Tech dude: We need to iterate and test our MVP. Product dude: Steve Jobs didn’t ask what the market wanted. We need vision!”
  4. “a future incarnation of Siri, Cortana or other digital companions will be more like a knowledgeable colleague than a personal assistant.” 
  5. JSON-LD + PROV + Elasticsearch + lots of other stuff is awesome. I DIG it. Looking forward to hearing more at ISWC.
  6. Something to check out for altmetrics fans: Media Impact Project
  7. UCSB has a sweet campus….
  8. A nice ontology for software metadata: OntoSoft.
  9. AirBnB is great but this is the first trip where I encountered negative responses from neighbors / neighborhood.
  10. You can predict transformative scientific research


Last week, I was at Theory and Practice of Provenance 2015 (TaPP’15), held in Edinburgh. This is the seventh edition of the workshop. You can check out my trip report from last year’s event, which was held during Provenance Week, here. TaPP aims to be a venue where people can present their early and innovative research ideas.

The event is useful because it brings together a cross section of researchers from different CS communities, ranging from databases, programming language theory, and distributed systems to e-science and the semantic web. While it’s nice to see old friends at this event, one discussion over the two days was how we can connect back more strongly to these larger communities, especially as their interest in provenance increases.

I discuss the three themes I pulled from the event below, but you can take a look at all of the papers online at the event’s site and see what you think.

1. Execution traces as a core primitive

I was happy to be presenting on behalf of one of my students, Manolis Stamatogiannakis, who’s been studying how to capture provenance of desktop systems using virtual machines and other technologies from the systems community. (He’s hanging out at SRI with Ashish Gehani for the summer so couldn’t make it.) A key idea in the paper we presented was to separate the capture of an execution trace from the instrumentation needed to analyze provenance (paper). The slides for the talk are embedded below:

The mechanism used to do this is a technology called record & replay (we use PANDA) but this notion of capturing a light weight execution trace and then replaying it deterministically is also popping up in other communities. For example, Boris Glavic has been using it successfully for database provenance in his work on GProM and reenactment queries. There he uses the audit logging and time travel features of modern databases (i.e. execution trace) to support rich provenance queries. 

This need to separate capture from queries was emphasized by David Gammack and Adriane Chapman’s work on developing agent-based models to figure out what instrumentation needs to be applied in order to capture provenance. Until we can efficiently capture everything, this is still going to be a stumbling block for completely provenance-aware systems. I think that treating execution traces as a core primitive for provenance systems may be a way forward.

2. Workflow lessons in non-workflow environments

There are numerous benefits to using (scientific) workflow systems for computational experiments, one of which is that they provide a good mechanism for capturing provenance in a declarative form. However, not all users can or will adopt workflow environments. Many use computational notebooks (e.g. Jupyter) or just shell scripts. The YesWorkflow system (very inside community joke here) uses convention and a series of comments to help users produce a workflow and provenance structure from their scripts and file system (paper). Likewise, work on combining noWorkflow, a provenance tracking system for Python, and IPython notebooks shows real promise (paper). This reminded me of the PROV-O-Matic work by Rinke Hoekstra.
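
To give a flavor of the YesWorkflow approach, here is a small annotated script. The tag syntax is from memory, so check the YesWorkflow documentation for the exact conventions; the file paths are made up for illustration.

```python
# Roughly the flavor of comment annotation YesWorkflow mines from a script
# (from memory; check the YesWorkflow documentation for the exact tag syntax).
# The tool never executes the code: it reads the @begin/@in/@out/@end tags
# and reconstructs a dataflow/provenance graph from them.

# @begin clean_temperatures
# @in raw_csv @uri file:data/raw_temps.csv
# @out clean_csv @uri file:data/clean_temps.csv
import csv

with open("data/raw_temps.csv") as src, open("data/clean_temps.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if row and row[-1] != "NA":   # drop rows with missing readings
            writer.writerow(row)
# @end clean_temperatures
```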

Of course you can combine yesWorkflow and noWorkflow together into one big system.

Overall, I like the trend towards applying workflow concepts in situ. It got me thinking about applying scientific workflow results to the abstractions provided by Apache Spark. Just a hunch that this might be an interesting direction; a rough sketch of what I mean follows.
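
Spark already records the lineage of an RDD, which feels like a natural hook for workflow-style provenance capture. A minimal sketch, not a worked-out design; the input path is hypothetical.

```python
# A sketch of the hunch: Spark tracks the lineage of every RDD, which looks
# like a natural substrate for workflow-style provenance capture.
from pyspark import SparkContext

sc = SparkContext("local[2]", "lineage-sketch")

lines = sc.textFile("data/experiment.log")        # hypothetical input file
errors = lines.filter(lambda l: "ERROR" in l)
codes = errors.map(lambda l: l.split()[1])

# toDebugString() describes the chain of transformations that produced `codes`;
# a provenance system could record this structure rather than just printing it.
print(codes.toDebugString().decode("utf-8"))

sc.stop()
```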

3. Completing the pipeline

The last theme I wanted to pull out is that I think we are inching towards being able to truly connect provenance generated by different applications. My first example is the work by Glavic and his students on importing and ingesting PROV-JSON into a database. This lets you query the provenance of query results while also including information on the pipelines that got the data there.
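
In the same spirit, and nothing like GProM’s actual implementation, here is a toy sketch of loading PROV-JSON derivation edges into SQLite so provenance can be joined against query results. The PROV-JSON snippet is hand-rolled for illustration.

```python
# A toy sketch (not GProM) of pulling PROV-JSON derivation edges into SQLite
# so that provenance can be joined with ordinary query results.
import json
import sqlite3

prov_json = """
{
  "entity": {"ex:result": {}, "ex:input": {}},
  "wasDerivedFrom": {
    "_:d1": {"prov:generatedEntity": "ex:result", "prov:usedEntity": "ex:input"}
  }
}
"""
doc = json.loads(prov_json)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE derivation (generated TEXT, used TEXT)")
for edge in doc.get("wasDerivedFrom", {}).values():
    db.execute(
        "INSERT INTO derivation VALUES (?, ?)",
        (edge["prov:generatedEntity"], edge["prov:usedEntity"]),
    )

# Now "where did ex:result come from?" is just a SQL query.
for (used,) in db.execute("SELECT used FROM derivation WHERE generated = ?", ("ex:result",)):
    print("ex:result was derived from", used)
```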

This is something I’ve wanted to do for ages with Marcin Wylot’s work on TripleProv. I was a bit bummed that Boris got there first, but I’m glad somebody did it :-)

The second example was the continued push forward for provenance in the systems community: in particular, the OPUS and SPADE systems, which I was aware of, but now also the work on Linux Provenance Modules by Adam Bates, which was introduced to me at TaPP. These all point to the ability to leverage key operating system constructs to capture and manage provenance. For example, Adam showed how to make use of mandatory access control policies to provide focused and complete capture of provenance for particular applications.

I have high hopes here.

Random thoughts

To conclude I’ll end with some thoughts from the notebook.

I hope to see many familiar and new faces at next year’s Provenance Week (which combines TaPP and IPAW).

From Florence, I headed to Washington D.C. to attend the Society for Scholarly Publishing Annual Meeting (SSP) last week. This conference is a big event for academic publishers. It’s primarily attended by people who work for publishers, as well as by companies that provide services to them. It also includes a smattering of librarians and technologists. The demographic was quite different from WWW – more suits and more women (attention, CS community). This reflects the make-up of the academic publishing industry as a whole, as shown by the survey done by Amy Brand, which was presented at the beginning of the conference. Full data here. Here’s a small glimpse.

[photo: a slide from Amy Brand’s survey of the industry]

Big Literature, Big Usage

I was at SSP primarily to give a talk and then be on a panel in the Big Literature, Big Usage session. The session, with Jan Velterop and Paul Cohen, went well. Jan presented the need for large-scale analysis of the literature in order for science to progress (shout out to the Steve Pettifer-led Lazarus project). He had a great set of slides showing how fast one would have to read in order to keep up with every paper being deposited in PubMed (the pages just flash by). I followed up with a discussion of recent work on machine reading (see below) and how it impacts publishers. The aim was to show that automated analysis and reading of the literature is not somewhere off in the future but is viable now.

Paul Cohen from DARPA followed up with a discussion of their Big Mechanism program. This effort is absolutely fascinating. Paul’s claim was that we currently cannot understand large-scale complex systems. He characterized this as a pull vs. a push approach. Paraphrasing: we currently try to pull all the knowledge into individuals’ heads and then make connections (i.e. build a model), vs. a push approach where we push the information out of individuals and have the computer build a model. The former makes understanding such large-scale systems, for all intents and purposes, impossible. To attack this problem, the program’s aim is to automatically build large-scale causal computational models directly from the literature. Paul pointed out there are still difficulties with machine reading (e.g. coreference resolution is still a challenge); however, the progress is there. Amazingly, they are having success with building models in cancer biology. Both the vision and the passion in Paul’s talk were compelling. (As a nice aside, folks at Elsevier (e.g. Anita de Waard) are participating in this program.)

We followed up with a Q&A panel session. We discussed the fact that all of us believe that sharing and publishing computational models is really the future; it’s just unfortunate to lock this information up in text. We answered a number of questions around feasibility (yes, this is happening, even if it’s hard). Also, we discussed the impossibility of doing some of this science without having computers deeply involved. Things are just too complicated, and we are not getting the requisite productivity.

So that was my bit. I also got to attend a number of sessions and catch up with a number of people. What’s up @MarkHahnel  and @jenniferlin15  – thanks for letting me heckle from the back of the room:-)

Business Stuff

I attended a number of sessions that discussed the business of academic publishing. The big factor seems to be fairly flat growth in library budgets combined with a growing number of journals. This was mentioned in the talk by Jayne Marks from Wolters Kluwer, as well as in a whole session on where to find growth. I thought the mergers and acquisitions talk from @lamb was of interest. It seems that there is even more room for consolidation in the industry.


I also feel that the availability of large amounts of cheap capital has not been fully taken advantage of in the industry. Beyond consolidation, new product development seems to be the best way to get growth. I think one notion that’s of interest is the transition towards the Big Funder Deal, where funders are essentially paying in bulk for their research to be published.

I enjoyed the session on the cost of OA business models. A very interesting set of slides from Robert Kiley about the Wellcome Trust’s open access journal spend is embedded below. It is well worth a look in terms of where costs are coming from, and it is a clarion call to all publishers to deliver what they say they are going to deliver.

Pete Binfield of PeerJ gave an insightful talk about the disruptive nature of non-legacy, digital-only OA publishers. However, I think it may overestimate the costs of the current subscription infrastructure. Also, as Marks noted in her talk, 50% of physicians still require print and a majority of students want print textbooks. I wonder how much this is the predominant factor in legacy costs.


Overall, throughout the sessions, it still felt a bit… umm… slow. We are still talking about papers, maybe with a bit of metrics or data thrown in for spice, but I think there’s much more to be done in helping us scientists do better, and that’s where scholarly media/information providers could step up.

Media & Technology

The conference lined up three excellent keynote sessions. The former CEO of MakerBot, Jenny Lawton, gave a “life advice” talk. I think the best line was “do a gap analysis on yourself”. Actually, the most interesting bit was her answer to a question about open source. Her answer was that we live in a very IP- and patent-oriented world, and if you want to be an open company it’s important to figure out how to work strategically in that world. The interview format with the New Yorker author Ken Auletta worked great. His book Googled: The End of the World As We Know It is now on my to-read list. A couple of interesting points:

  • New York Times subscribers read 35 minutes a day (print) vs 35 minutes a month (online)
  • Human factors drive more decisions in the highest levels of business than let on.
  • Editorial and fact checking at the level of the New Yorker is a game changer for an author.
  • He’s really big on having a calling card.

Finally, Charles Watkinson gave a talk about how monograph publishing is experimenting with digital and how it’s adopting many of the same features as journal articles. He called out Morgan & Claypool’s Synthesis series as an example innovator in this space — I’m an editor 😉

I always enjoy T. Scott Plutchak’s talks. He talked about his new role in bringing together data wranglers across his university. He made a good case that this role is really necessary in today’s science. I agree, but it’s unclear how one can keep the talent needed for data wrangling within academia, especially in the library.

Overall, SSP was useful in understanding the current state and thinking of this industry.

Random Thoughts
