Last week I got back from a great 8(!) days in Riva del Garda, Italy, attending the 2014 International Semantic Web Conference and associated events. This is one of those events where your colleagues on Facebook get annoyed with the pretty pictures of lakes and mountains that their other colleagues keep posting.
ISWC is the key conference for semantic web research and the place to see what’s happening. This year’s conference had 630 attendees, which is a strong showing for the event. The conference was, as usual, selective.
Interestingly, the numbers were about on par with last year except for the in-use track, where we had a much larger number of submissions. I suspect this is because all tracks had synchronized submission deadlines this year, whereas last year the in-use deadline came after the research track’s. The replication, dataset, software, and benchmark track is a new addition to the conference, and a good one I might add. Having a place to present these sorts of scholarly output is important. You can find the papers (published and in preprint form) on the website. More importantly, you can find a big chunk of the slides presented on Eventifier.
So why was I hanging out in Italy (other than the pasta)? I was co-organizer of the Doctoral Consortium for the event.
Additionally, I was on a panel for the Context Interpretation and Meaning workshop. I also attended a pre-meeting on archiving linked data for the PRELIDA project. Lastly, we had an in-use paper in the conference on adaptive linking used within the Open PHACTS platform to support chemistry. Alasdair Gray did a fantastic job of leading and presenting the paper.
So, on to the show. Three themes stood out, which I discuss in turn:
- It’s not Volume, it’s Variety
- Variety & the Semantic Spectrum
- Fuzziness & Metrics
It’s not Volume, it’s Variety
I’m becoming more convinced that the issue for most “big” data problems isn’t volume or velocity, it’s variety. In particular, I think the hardware/systems folks are addressing the first two problems at a rate that means that for many (most?) workloads the software abstractions provided are enough to deal with the data sizes and speed involved. This inkling was confirmed to me a couple of weeks ago when I saw a talk by Peter Hofstee, the designer of the Cell microprocessor, talking about his recent work on computer architectures for big data.
This notion was further confirmed at ISWC. Bryan Thompson, of BigData triple store fame, presented his new GPU-based work (mapgraph.io), which can do graph processing on hundreds of millions of nodes using abstractions similar to Signal/Collect or GraphLab. Additionally, in the session on Large Scale RDF processing, many of the systems were focused on a clustered environment but used ~100 million triple test sets, even though you can process these with a single beefy server. It seems that online analytics workloads can be handled with a simple server setup, while truly web-scale workloads need clusters of a size that can be provisioned fairly straightforwardly in the cloud. In our community the best examples are webdatacommons.org and the work of the VU team on LODLaundry – both of these process graphs in the billions using the Hadoop ecosystem on either local or Amazon-based clusters. Furthermore, the best paper in the in-use track (Semantic Traffic Diagnosis with STAR-CITY: Architecture and Lessons Learned from Deployment in Dublin, Bologna, Miami and Rio) from IBM actually scrapped using a specific streaming system because even data coming from traffic sensors wasn’t fast enough to make it worthwhile.
Indeed, in Prabhakar Raghavan’s (yes! the Intro. to Information Retrieval and Google guy) keynote, he noted that he would love to have problems that were just computational in nature. Likewise, Yolanda Gil argued that the challenges lie not necessarily in data analysis but in data preparation (i.e. it’s a data mess!).
The hard part is data variety and heterogeneity, which transitions nicely into our next theme…
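To put a rough number on the "single beefy server" point, here is a back-of-the-envelope sketch. It is my own estimate, not something from any of the talks, and it rests on loudly stated assumptions: triples are dictionary-encoded into fixed-width 8-byte integer ids, the store keeps three index permutations (say SPO, POS and OSP), and the dictionary itself plus any per-index overhead are ignored. The function name and defaults are purely illustrative.

```python
def estimate_index_bytes(num_triples, bytes_per_id=8, num_indexes=3):
    """Rough size of dictionary-encoded triple indexes.

    Assumptions (illustrative only): each triple is three fixed-width
    integer ids, stored once per index permutation. The string
    dictionary and any index overhead are NOT counted.
    """
    return num_triples * 3 * bytes_per_id * num_indexes


GIB = 1024 ** 3

# The ~100 million triple test sets mentioned above, plus a 10x larger case.
for n in (100_000_000, 1_000_000_000):
    size_gib = estimate_index_bytes(n) / GIB
    print(f"{n:,} triples: ~{size_gib:.1f} GiB of index data")
```

Under these assumptions, 100 million triples come to roughly 7 GiB of index data, which fits in main memory on a single well-equipped machine. Real stores will differ depending on compression, dictionary size and index layout.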
Variety & the Semantic Spectrum
Chris Bizer gave an update to the measurements of the Linked Data Cloud – this was a highlight talk.
The Linked Data Cloud has grown, essentially doubling (to, generously, ~1000 datasets), but schema.org-based data (see the Microdata+RDFa series ISWC 2014 paper) has grown to ~500,000 datasets. Chris gave an interesting analysis of what he thinks this means in a nice mailing list post. The comparison is summed up below:
So what we are dealing with is really a spectrum of semantics, from extremely rich knowledge bases to more shallow mark-up. (As a side note: Guha’s thoughts on Schema.org are always worth a revisit.) I saw quite a few papers trying to address this spectrum using a variety of CS techniques, from NLP to databases. Indeed, two of the best papers were related to this subject:
- AGDISTIS – Graph-Based Disambiguation of Named Entities using Linked Data – was on how to use background knowledge to help improve the accuracy of named entity recognition and disambiguation. (software) – best research paper
- OBDA: Query Rewriting or Materialization? In Practice, Both! – the best student paper by Juan Sequeda about how to efficiently support mappings from a relational database to some ontology. By the way kids, if you want to see how to present related work check out Juan’s slides:
Also on this front were works on optimizing link discovery (HELIOS), machine reading (SHELDON), entity recognition, and querying probabilistic triple stores. All of these works had in common that they take approaches from other CS fields and adapt or improve them to deal with these problems of variety within a spectrum of semantics.
Fuzziness & Metrics
The final theme that I pulled out of the conference was evaluation metrics, specifically ones that deal with or cater for the fact that there are no hard truths, especially when using corpora developed from human judgements. The quintessential example of this is my colleague Lora Aroyo’s work on Crowd Truth – trying to capture disagreement in the process of creating gold standard corpora in crowdsourcing environments. Another example is the very nice work from Michelle Cheatham and Pascal Hitzler on creating an uncertain OAEI conference benchmark. Raghavan’s keynote also homed in on the need for more metrics, especially as the type of search interface we typically use changes (going from keyword searches to more predictive contextual search). This theme was also prevalent in the workshops, in particular the question of how we measure in the face of changing contexts. Examples include:
- Linked Science: Capturing Provenance for a Linkset of Convenience
- LD4IE: Inductive Typing Entity Alignment
- COLD: Capturing the Currency of DBpedia Descriptions and Get Insight into their Validity
- URSW: Towards a Distributional Semantic Web Stack
- The whole CIM workshop
A Note on the Best Reviewers
— Paul Groth (@pgroth) October 21, 2014
A nice note: some were nominated by authors of papers that the reviewer rejected because the review was so good. That’s what good peer review is about – improving our science.
- Love the work Bizer and crew are doing on Web Tables. Check it out.
- Conferences are so good for quick lit reviews. Thanks to Bijan Parsia, who sent me in the direction of Pavel Klinov’s work on probabilistic reasoning over inconsistent ontologies.
- grafter.org – nice site
- Yes, you can reproduce results.
- There’s more provenance on the Web of Data than ever. (Unfortunately, PROV is still small percentage-wise.)
- On the other hand, PROV was in many talks like last year. It’s become a touch point. Another post on this is on the way.
- The work by Halpin and Cheney on using SPARQL update for provenance tracking is quite cool.
- A win from the VU: DIVE 3rd place in the semantic web challenge
- Amazing wifi at the conference! Unbelievable!
- +1 to the Poster & Demo crew: keeping 160 lightning talks going on time and fun – that’s hard
- 10 year award goes to software: Protege: well deserved
- From Nigel’s keynote: it seems that the killer app of open data is …. insurance
- Two years in a row that stuff I worked on has gotten a shout out in a keynote (Social Task Networks). 😃
- ….. I don’t think the streak will last
- 99% of queries have nouns (i.e. entities)
- I hope I did Sarven’s Call for Linked Research justice
- We really ought to archive LOV – vocabularies are small but they take a lot of work. It’s worth it.
- The Media Ecology project is pretty cool. Clearly, people who have lived in LA (e.g. Mark Williams) just know what it takes 😉
- Like: Linked Data Fragments – that’s the way to question assumptions.
A low-carb diet in Italy – lots of running