Monthly Archives: October 2014

Last week I got back from a great 8 days (!) in Riva del Garda, Italy, attending the 2014 International Semantic Web Conference (ISWC) and associated events. This is one of those events where your colleagues on Facebook get annoyed with the pretty pictures of lakes and mountains that their other colleagues keep posting:

[Photo: lake and mountains at Riva del Garda]

ISWC is the key conference for semantic web research and the place to see what’s happening. This year’s conference had 630 attendees, which is a strong showing for the event. The conference is, as usual, selective:
[Photo: slide with the submission and acceptance statistics]
Interestingly, the numbers were about on par with last year, except for the in-use track, where we had a much larger number of submissions. I suspect this is because all tracks had synchronized submission deadlines, whereas last year the in-use deadline came after the research track. The replication, dataset, software, and benchmark track is a new addition to the conference, and a good one, I might add. Having a place to present these sorts of scholarly output is important and, from my perspective, a good move by the conference. You can find the papers (published and in preprint form) on the website. More importantly, you can find a big chunk of the slides presented on Eventifier.

So why was I hanging out in Italy (other than the pasta)? I was a co-organizer of the Doctoral Consortium for the event.

Additionally, I was on a panel for the Context Interpretation and Meaning workshop. I also attended a pre-meeting on archiving linked data for the PRELIDA project. Lastly, we had an in-use paper in the conference on adaptive linking used within the Open PHACTS platform to support chemistry. Alasdair Gray did a fantastic job of leading and presenting the paper.

So on to the show. Three themes stood out, which I discuss in turn:

  1. It’s not Volume, it’s Variety
  2. Variety & the Semantic Spectrum
  3. Fuzziness & Metrics

It’s not Volume, it’s Variety

I’m becoming more convinced that the issue for most “big” data problems isn’t volume or velocity, it’s variety. In particular, I think the hardware/systems folks are addressing the first two problems at a rate that means that for many (most?) workloads the software abstractions provided are enough to deal with the data sizes and speed involved. This inkling was confirmed to me a couple of weeks ago when I saw a talk by Peter Hofstee, the designer of the Cell microprocessor, talking about his recent work on computer architectures for big data.

This notion was further confirmed at ISWC. Bryan Thompson, of BigData triple store fame, presented his new work (mapgraph.io), which can do graph processing on hundreds of millions of nodes using GPUs, with abstractions similar to Signal/Collect or GraphLab. Additionally, as I was sitting in the session on large-scale RDF processing, I noticed that many of the systems were focused on a clustered environment but used test sets of ~100 million triples, even though you can process these with a single beefy server. It seems that online analytics workloads can be handled with a simple server setup, while truly web-scale workloads will be at the level of clusters that can be provisioned fairly straightforwardly using the cloud. In our community the best examples are webdatacommons.org and the work of the VU team on LODLaundry – both process graphs with billions of triples using the Hadoop ecosystem on either local or Amazon-based clusters. Furthermore, the best paper in the in-use track (Semantic Traffic Diagnosis with STAR-CITY: Architecture and Lessons Learned from Deployment in Dublin, Bologna, Miami and Rio), from IBM, actually scrapped using a specific streaming system because even the data coming from traffic sensors wasn’t fast enough to make it worthwhile.
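As an aside, the programming model that mapgraph, Signal/Collect, and GraphLab share is easy to get a feel for. Below is a toy Python sketch of the idea (plain PageRank on a three-node graph); it is purely illustrative and has nothing to do with mapgraph's actual API.

    # Toy sketch of the signal/collect programming model: vertices "signal"
    # values along their outgoing edges, then "collect" the incoming signals
    # to update their state. Here: PageRank on a tiny hand-made graph.
    edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # directed adjacency lists
    rank = {v: 1.0 for v in edges}

    for _ in range(30):                                 # fixed number of supersteps
        signals = {v: [] for v in edges}
        for v, targets in edges.items():                # signal phase
            for w in targets:
                signals[w].append(rank[v] / len(targets))
        for v in edges:                                 # collect phase
            rank[v] = 0.15 + 0.85 * sum(signals[v])

    print(rank)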

Indeed, in Prabhakar Raghavan's keynote (yes! the Introduction to Information Retrieval and Google guy), he noted that he would love to have problems that were just computational in nature. Likewise, Yolanda Gil discussed how the challenges lie not necessarily in data analysis but in data preparation (i.e. it’s a data mess!).

The hard part is data variety and heterogeneity, which transitions nicely into our next theme…

Variety & the Semantic Spectrum

Chris Bizer gave an update on the measurements of the Linked Data Cloud – this was a highlight talk.

The Linked Data Cloud has essentially doubled in size (to, generously, ~1,000 datasets), while schema.org-based data (see the Microdata+RDFa series ISWC 2014 paper) now covers ~500,000 datasets. Chris gave an interesting analysis of what he thinks this means, along with a summary comparison, in a nice mailing list post.

So what we are dealing with is really a spectrum of semantics, from extremely rich knowledge bases to more shallow mark-up. (As a side note: Guha’s thoughts on Schema.org are always worth a revisit.) I saw quite a few papers trying to address this spectrum using a variety of CS techniques, from NLP to databases. Indeed, two of the best papers were related to this subject.

Also on this front were works on optimizing link discovery (HELIOS), machine reading (SHELDON), entity recognition, and querying probabilistic triple stores. All of these works have in common that they take approaches from other CS fields and adapt or improve them to deal with these problems of variety within a spectrum of semantics.
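Before moving on, here is a minimal sketch of that rich-versus-shallow contrast using Python and rdflib (the snippets are hand-made examples, not data from Chris's crawl): a knowledge-base style description carries typed relations into an ontology, while typical schema.org markup asserts only a handful of generic properties.

    from rdflib import Graph

    # Hand-made "rich" knowledge-base style description: typed links into an ontology.
    rich = """
    @prefix dbo: <http://dbpedia.org/ontology/> .
    @prefix ex:  <http://example.org/> .

    ex:RivaDelGarda a dbo:Town ;
        dbo:country ex:Italy ;
        dbo:populationTotal 15000 .
    """

    # "Shallow" schema.org-style markup of the kind embedded in ordinary web pages.
    shallow = """
    @prefix schema: <http://schema.org/> .
    @prefix ex:     <http://example.org/> .

    ex:page1 a schema:Place ;
        schema:name "Riva del Garda" .
    """

    for label, data in [("rich", rich), ("shallow", shallow)]:
        g = Graph()
        g.parse(data=data, format="turtle")
        print(label, "->", len(g), "triples")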

Fuzziness & Metrics

The final theme that I pulled out of the conference was evaluation metrics – in particular, metrics that deal with or cater for the fact that there are no hard truths, especially when using corpora developed from human judgements. The quintessential example of this is my colleague Lora Aroyo’s work on Crowd Truth – trying to capture disagreement in the process of creating gold standard corpora in crowdsourcing environments. Another example is the very nice work from Michelle Cheatham and Pascal Hitzler on creating an uncertain OAEI conference benchmark. Raghavan’s keynote also homed in on the need for more metrics, especially as we see a change in the type of search interfaces we typically use (going from keyword searches to more predictive, contextual search). This theme was also prevalent in the workshops, in particular the question of how we measure in the face of changing contexts. Examples include:

A Note on the Best Reviewers

Good citizens:

A nice note: some reviewers were nominated by authors of papers that the reviewer had rejected, because the review was so good. That’s what good peer review is about – improving our science.

Random Notes

  • Love the work Bizer and crew are doing on Web Tables. Check it out.
  • Conferences are so good for quick lit reviews. Thanks to Bijan Parsia, who sent me in the direction of Pavel Klinov‘s work on probabilistic reasoning over inconsistent ontologies.
  • grafter.org – nice site
  • Yes, you can reproduce results.
  • There’s more provenance on the Web of Data than ever. (Unfortunately, PROV is still a small percentage of it.)
  • On the other hand, PROV was in many talks like last year. It’s become a touch point. Another post on this is on the way.
  • The work by Halpin and Cheney on using SPARQL update for provenance tracking is quite cool. 
  • A win from the VU: DIVE took 3rd place in the Semantic Web Challenge
  • Amazing wifi at the conference! Unbelievable!
  • +1 to the Poster & Demo crew: keeping 160 lightning talks going on time and fun – that’s hard
  • The 10-year award went to software: Protégé – well deserved
  • http://ws.nju.edu.cn/explass/
  • From Nigel’s keynote: it seems that the killer app of open data is …. insurance
  • Two years in a row that stuff I worked on has gotten a shout-out in a keynote (Social Task Networks). 😃
  • ….. I don’t think the streak will last
  • 99% of queries have nouns (i.e. entities)
  • I hope I did Sarven’s Call for Linked Research justice
  • We really ought to archive LOV – vocabularies are small but they take a lot of work. It’s worth it.
  • The Media Ecology project is pretty cool. Clearly, people who have lived in LA (e.g. Mark Williams) just know what it takes 😉
  • Like: Linked Data Fragments – that’s the way to question assumptions.
  • A low-carb diet in Italy – lots of running

[Photo: ISWC-DC 2014 t-shirt]

Next Monday (Oct. 20) we’ll be holding the International Semantic Web Conference’s Doctoral Consortium. This year, I helped organize the consortium with Natasha Noy.

This is a chance for the PhD students in our community to get direct feedback on their proposals and to connect with each other. We had 41 submissions this year. It was tough to choose, but with our excellent PC we were able to select 16 proposals for discussion at the conference. Even though we had to make a selection, we worked hard to give each of these students reviews that could really help them progress as researchers.

Six of the selected papers were of high enough quality to be included in the official Springer conference proceedings. The others we’ve made available online as CEUR supplementary proceedings.

Below you’ll find a list of all the accepted papers, with links to the PDFs. By going through these papers, I think you can get a great sense of where the field is headed, ranging from combining information extraction and the semantic web to unique applications of semantics. If you’re attending the conference on Monday, I hope you stop by to talk with this next generation of researchers.

Also, if you haven’t heard Natasha and Avi’s talk “The Semantic Webber’s Guide to Evaluating Research Contributions” – it’s awesome. It kicks off the day, so be sure to show up for that.


The NIH has produced a report on the requirements for a Software Discovery Index. Below you’ll find my comments posted there.


I think this report reflects both demand for and consensus around the issues of software identification and use in biomedical science. As a creator and user of scientific software – both installable and accessible through Web APIs – my perspective is that anything that helps bring transparency and visibility to what scientific software developers do is fantastic.

I would like to reiterate the comments that we need to build on what software developers already use and what is already in the wild, and call out two main points.

Package management

There was some mention of package managers, but I would suggest that this should be a stronger focal point, as much of the metadata is already in these management systems. Examples include:

My suggestion would be to define a simple metadata element that would allow software developers to advertise that their work should be indexed by the Software Discovery Index. Then the process would just involve indexing the existing package managers that are used. (Note, this also works for VMs; see https://registry.hub.docker.com.)
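As a purely hypothetical illustration (no such opt-in mechanism exists today), a Python developer could signal this through their existing packaging metadata, and the Index would simply harvest it from the package repository:

    # setup.py -- hypothetical opt-in via existing packaging metadata.
    # The "software-discovery-index" keyword is invented for illustration;
    # the classifiers below are real PyPI trove classifiers.
    from setuptools import setup

    setup(
        name="my-bio-tool",                      # placeholder project name
        version="1.2.0",
        description="Example scientific analysis tool",
        url="https://github.com/example/my-bio-tool",
        author="Jane Scientist",
        license="MIT",
        keywords=["bioinformatics", "software-discovery-index"],  # hypothetical opt-in flag
        classifiers=[
            "Intended Audience :: Science/Research",
            "Topic :: Scientific/Engineering :: Bio-Informatics",
        ],
    )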

For those who choose not to use a package repository, I would suggest leveraging Schema.org, which already has a metadata description for software (http://schema.org/SoftwareApplication). This metadata description can already be recognized by search engines.
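For example, a minimal schema.org SoftwareApplication description can be emitted as JSON-LD and embedded in a project's web page (the values below are placeholders):

    import json

    # Minimal schema.org SoftwareApplication description (placeholder values),
    # serialized as JSON-LD for embedding in a <script type="application/ld+json"> tag.
    software = {
        "@context": "http://schema.org",
        "@type": "SoftwareApplication",
        "name": "my-bio-tool",
        "softwareVersion": "1.2.0",
        "applicationCategory": "Scientific analysis",
        "operatingSystem": "Linux, macOS",
        "url": "https://github.com/example/my-bio-tool",
        "license": "https://opensource.org/licenses/MIT",
    }

    print(json.dumps(software, indent=2))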

Software vs Software Paper

One thing that I didn’t think came through strongly enough in the report was the distinction between the software itself and a potential scientific paper describing the software. Many developers of scientific code also publish papers that tease out important aspects of, e.g., the design, applicability, or vision of the software. In most cases, these authors would like citations to accrue to that one paper. See for example Scikit-learn [1], the many pages you find on software sites with a “please cite” section [2], or the idea of software articles [3, 4].

This differs from the idea that, as the author of a publication, one should reference the specific software version that was used. My suggestion would be to decouple these tasks and suggest that both are done. This could be achieved by allowing author-side minting of DOIs for particular software versions [5] and suggesting that authors also include a reference to the designated software paper. The Software Discovery Index could facilitate this process by suggesting the appropriate reference pair. This is only one idea, but I would suggest that the difference between the software and the software paper be considered in any future developments.

This could also address some concerns with referencing APIs.
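As a rough sketch of how that suggestion could look in practice (the record format, identifiers, and helper function below are all invented for illustration):

    # Hypothetical sketch: given an index record for a software release, suggest
    # the pair of references an author should include -- the versioned software
    # DOI plus the designated software paper. All identifiers are made up.
    record = {
        "name": "my-bio-tool",
        "version": "1.2.0",
        "version_doi": "10.5281/example.1234",
        "software_paper": "Scientist, J. et al. (2014). my-bio-tool: an example "
                          "analysis tool. Journal of Example Software. "
                          "doi:10.1000/example.5678",
    }

    def suggested_references(rec):
        """Return the citation pair a Software Discovery Index could suggest."""
        return [
            "{} version {}. doi:{}".format(rec["name"], rec["version"], rec["version_doi"]),
            rec["software_paper"],
        ]

    for ref in suggested_references(record):
        print(ref)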

Summary

In summary, I would emphasize the need to reuse existing infrastructures for software development, publication of scientific results, and search. There is no need to “roll your own”. I think some useful, simple, and practical guidelines would actually go a long way in helping the scientific software ecosystem.

[1] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12 (November 2011), 2825-2830.
[2] http://matplotlib.org
[3] http://www.elsevier.com/about/content-innovation/original-software-publications
[4] http://www.biomedcentral.com/bmcbioinformatics/authors/instructions/software
[5] e.g. http://www.webcitation.org
