Archive

Tag Archives: #semtech

Last week, I was in Japan for the 15th International Semantic Web Conference. 

For me, this was a big event as I was research track program co-chair together with the amazing Elena Simperl. Being a program chair is a funny thing, you’re not directly responsible for any individual paper, presentation or review but you feel responsible for the entirety. And obviously, organizing 664 reviews for 212 submissions isn’t something to be taken lightly. Beyond my service as research track chair, I think my main contribution was finding good coffee near the event:

With all that said, I think the entire program was really solid. All the preprints are on the website and the proceedings are available from Springer. I’ll try to summarize my main takeaways below. But first some quick stats:

  • 430 participants
  • 212 (research track) + 43 (application track) + 71 (resources track) = 326 submissions
    • that’s up by 61 submission from last year!
  • Acceptance rates:
    • 39/212  =  18% (research track)
    • 12/43 = 28% (application track)
    • 24/71 = 34%  (resources track)
    • I think these reflect the aims of the individual tracks
  • We also had 102 posters and demos and 12 journal track papers
  • 35 student travel winners

My three main takeaways:

  1. Frames are back!
  2. semantics on the web (notice the case)
  3. Science as the next challenge
  4. SPARQL as a driver for other CS communities

(Oh and apologies for the gratuitous use of images and twitter embeds)

Frames are back!

For the past couple of years, a chunk of the community has been focused on the problem of entity resolution/disambiguation whether that’s from text to a KB or across multiple KBs. Indeed, one of the best paper winners (yes, we gave out two – both nominees had great papers) by ISI’s Information Integration Group was an excellent approach to do multi-type entity resolution.  Likewise, Axel and crew gave a pretty heavy duty tutorial on link discovery. On the NLP front, Stefano Faralli presented a nice resource that disambiguates text to lexical resources with a focus on providing both symbolic and distributional representations .

2016-10-21 10.32.59.jpg2016-10-21 10.34.59.jpg

What struck me at the conference were the number of papers beginning to think not just about entities and their relations but the context they are in. This need for context was well motivated by the folks at IBM research working on medical question answering.

Essentially, thinking about classic AI frames but how do obtain these automatically. A clear example of this is the (ongoing) work on FRED:

Similarly, the News Reader system for extracting information into situated events is another example. Another example is extracting process graphs from medical texts. Finally, in the NLP community there’s an increasing focus on developing resources in order to build automated parsers for frame-style semantic representations (e.g. Abstract Meaning Representation). Such representations can be enhanced by connections to semantic web resources as discussed by Burns et al. (I knew this was a great idea in 2015!)

2016-10-21 10.58.15.jpg

In summary,  I think we’re beginning to see how the background knowledge available on the Semantic Web combined with better parsers can help us start to deal better with context in an automated fashion.

semantics on the web

Chris Bizer gave an insightful keynote reflecting on what the community’s expectations were for the semantic web and where we currently are at.

He presented stats on the growth of Linked Data (e.g. stuff in the LOD cloud) as well as web data (e.g. schema.org marked pages) but really the main take away is the boom in the later. About 30% of the Web has html embedded data something like 12 million websites.  There’s an 86% adoption rate on top travel website.  I think the choice quote was:

“Probably, every hotel on earth is represented as web data.”

The problem is that this sort of data is not clean, it’s messy – it’s webby data, which brings to Chris’s important point for the community:

2016-10-20-09-46-12

While standards have brought us a lot, I think we are starting as a research community to think increasingly about different kinds of semantics and different kinds of structured data.  Some examples from the conference:

An embrace of the whole spectrum of semantics on the web is really a valuable move for the research community. Interestingly enough, I think we can truly experiment with web data through things like Common Crawl and the Web Data Commons. As knowledge graphs, triple stores, and ontologies become increasingly common place especially in enterprise deployments, I’m heartened by these new areas of investigation.

The next challenge: Science

Personally, the third keynote of ISWC by Professor Hiroaki Kitano – the CEO of Sony CSL and creator among other things of the AIBO and founder of RoboCup gave a inspirational speech laying out what he sees as the next AI grand challenge:

2016-10-21 09.20.30.jpg

It will be hard for me to do justice to the keynote as the material per second ratio was pretty much off the chart but he has AI magazine article laying out the vision.

Broadly, he used RoboCup as a framework for discussing how to organize a challenge and pointed to its effectiveness. (e.g Kiva systems a RoboCup spinout was acquired by Amazon for $770 million). He then focused on the issue of the inefficiency in scientific discovery and in particular how assembling knowledge is just too difficult.

:2016-10-21 09.21.04.jpg

2016-10-21 09.40.53 copy.jpg

Assembling this by hand is way too hard!

He then went on to reframe the scientific question as one of a massive search and verification of hypothesis space. 2016-10-21 09.52.29.jpg

I walked out of that keynote pretty charged up.

I think the semantic web community can be a big part of tackling this grand challenge. Science and medicine have always been important domains for applying these technologies and that showed up at this conference as well:

SPARQL as a driver for other CS communities

The 10 year award was given to  Jorge Perez , Marcelo Arenas and Claudio Gutierrez for their paper Semantics and Complexity of SPARQL. Jorge gave just a beautiful 10 minute reflection on the paper and the relationship between theory and practice. I think his slide below really sums up the impact that SPARQL has had not just on the semantic web community but CS as a whole:

2016-10-19 09.35.39.jpg

As further evidence, I thought one of the best technical talks of the conference (even through an earthquake) was by Peter Bonz on emergent schemas for RDF querying.

It was a clear example of how the two DB and semweb communities are learning from one another and that by the semantic web having different requirements (e.g. around schemas), this drives new research.

As a whole, it’s hard to beat a conference where you learn a ton and has the following:

2016-10-19 10.52.15.jpg

Random Pointers

Last week I got from a great 8! days in Riva del Garda, Italy attending the 2014 International Semantic Web Conference and associated events. This is one of those events where your colleagues on Facebook get annoyed with the pretty pictures of a lakes and mountains that their other colleagues keep posting:

2014-10-23 06.57.15

ISWC is the key conference for semantic web research and the place to see what’s happening. This year’s conference had 630 attendees which is a strong showing for the event. The conference is as usual selective:
2014-10-21 09.11.38
Interestingly, the numbers were about on par with last year except for the in-use track where we had a much larger number of submissions. I suspect this is because all tracks had synchronized submission deadlines whereas the in-use track was after the research track last year. The replication, dataset, software, and benchmark track is a new addition to the conference and a good one I might add. Having a place to present for these sorts of scholarly output is important and from my perspective a good move by the conference. You can find the papers (published and in preprint form) on the website.. More importantly you can find a big chunk of the slides presented on Eventifier.

So why am I hanging out in Italy (other than the pasta).  I was co-organizer the Doctoral Consortium for the event.

Additionally, I was on a panel for the Context Interpretation and Meaning workshop. I also attended a pre-meeting on archiving linked data for the PRELIDA project. Lastly, we had an in-use paper in the conference on adaptive linking used within the Open PHACTS platform to support chemistry.. Alasdair Gray did a fantastic job of leading and presenting the paper.

So on to the show.Three themes, which I discuss in turn:

  1. It’s not Volume, it’s Variety
  2. Variety & the Semantic Spectrum
  3. Fuzziness & Metrics

It’s not Volume, it’s Variety

I’m becoming more convinced that the issue for most “big” data problems isn’t volume or velocity, it’s variety. In particular, I think the hardware/systems folks are addressing the first two problems at a rate that means that for many (most?) workloads the software abstractions provided are enough to deal with the data sizes and speed involved. This inkling was confirmed to me a couple of weeks ago when I saw a talk by Peter Hofstee, the designer of the Cell microprocessor, talking about his recent work on computer architectures for big data.

This notion was further confirmed at ISWC. Bryan Thompson of BigData triple store fame, presented his new work using GPUs (mapgraph.io) that can do graph processing on hundreds of millions of nodes using GPUs using similar abstractions to Signal/Collect or GraphLab. Additionally, as I was sitting in the session on Large Scale RDF processing – many of the systems were focused on a clustered environment but using ~100 million triple test sets even though you can process these with a single beefy server. It seems that for online analytics workloads you can do these with a simple server setup and for truly web scale workloads these will be at the level of clusters that can be provisioned fairly straightforwardly using THE cloud. I mean in our community the best examples are webdatacommons.org or the work of the VU team on LODLaundry  – both of these process graphs in the billions using the Hadoop ecosystem on either local or Amazon based clusters. Furthermore, the best paper in the in-use track (Semantic Traffic Diagnosis with STAR-CITY: Architecture and Lessons Learned from Deployment in Dublin, Bologna, Miami and Rio) from IBM actually scrapped using a specific streaming system because even data coming from traffic sensors wasn’t fast enough to make it worthwhile.

Indeed, in Prabhakar Raghavan‘s  (yes! the Intro. to Information Retrieval and Google guy) keynote, he noted that he would love to have problems that were just computational in nature. Likewise, Yolanda Gil discussed that the difficulties and that the challenges lay not in necessarily data analysis but in data preparation (i.e. it’s a data mess!) 2014-10-21 14.08.27

The hard part is data variety and heterogeneity, which transitions, nicely, into our next theme…

Variety & the Semantic Spectrum

Chris Bizer gave an update to the measurements of the Linked Data Cloud – this was a highlight talk.

The Linked Data Cloud has grown essentially doubling (towards generously ~1000 datasets) but the growth of schema.org based data (see the Microdata+RDFa series ISWC 2014 paper) has ~500,000 datasets. Chris gave an interesting analysis about what he thinks this means in a nice mailing list post. The comparison is summed up below:

So what we are dealing with is really a spectrum of semantics from extremely rich knowledge bases to more shallow mark-up (As a side note: Guha’s thought’s on Schema.org are always worth a revisit.) To address, this spectrum, I saw quite a few papers trying to deal with it using a variety of CS techniques from NLP to databases. Indeed, two of the best papers were related to this subject:

Also on this front were works on optimizing linked discovery (HELIOS), machine reading (SHELDON), entity recognition, and query probabilistic triple stores. All of these works hand in common trying to take approaches from other CS fields and adapt or improve them to deal with these problems of variety within a spectrum of semantics.

Fuzziness & Metrics

The final theme that I pulled out of the conference was the area of evaluation metrics but ones that either dealt with or catered for the fact that there are no hard truths, especially, when using corpora developed using human judgements. The quintessential example of this is my colleague Lora Aroyo’s work on Crowd Truth – trying to capture disagreement in the process of creating gold standard corpora in crowd sourcing environments. Other example is the very nice work from Michelle Cheatham and Pascal Hitzler on creating an uncertain OAEI conference benchmark.  Raghavan‘s keynote also homed in on the need for more metrics especially as we have a change in the type of search interfaces that we typically use (going from keyword searches to more predictive contextual search). This theme was also prevalent in the workshops in particular how to do we measure in the face of changing contexts. Examples include:

A Note on the Best Reviewers

Good citizens:

A nice note: some were nominated by authors of papers that the reviewer rejected because the review was so good. That’s what good peer review is about – improving our science.

Random Notes

  • Love the work Bizer and crew are doing on Web Tables. Check it out.
  • Conferences are so good for quick lit reviews. Thanks to Bijan Parsia who sent me the direction of Pavel Klinov‘s work on probabilistic reasoning over inconsistent ontologies.
  • grafter.org – nice site
  • Yes, you can reproduce results.
  • There’s more provenance on the Web of Data than ever. (Unfortunately, PROV is still small percentage wise.)
  • On the other hand, PROV was in many talks like last year. It’s become a touch point. Another post on this is on the way.
  • The work by Halpin and Cheney on using SPARQL update for provenance tracking is quite cool. 
  • A win from the VU: DIVE 3rd place in the semantic web challenge 
  • Amazing wifi at the conference! Unbelievable!
  • +1 to the Poster & Demo crew: keeping 160 lightening talks going on time and fun – that’s hard
  • 10 year award goes to software: Protege: well deserved
  • http://ws.nju.edu.cn/explass/
  • From Nigel’s keynote: it seems that the killer app of open data is …. insurance
  • Two years in a row that stuff I worked has gotten a shout out in a keynote (Social Task Networks). 😃
  • ….. I don’t think the streak will last
  • 99% of queries have nouns (i.e. entities)
  • I hope I did Sarven’s Call for Linked Research justice
  • We really ought to archive LOV – vocabularies are small but they take a lot of work. It’s worth it.
  • The Media Ecology project is pretty cool. Clearly, people who have lived in LA (e.g. Mark Williams) just know what it takes 😉
  • Like: Linked Data Fragments – that’s the way to question assumptions.
  • A low-carb diet in italy – lots of running
%d bloggers like this: