Archive


Two weeks ago, I had the pleasure of attending the 17th International Semantic Web Conference held at the Asilomar Conference Grounds in California, a tremendously beautiful setting in a state park along the ocean. This trip report is somewhat later than normal because I took the opportunity to hang out for another week along the coast of California.

Before getting into the content of the conference, I think it’s worth saying: if you don’t believe that there are capable, talented, smart and awesome women in computer science at every level of seniority, the ISWC 2018 organizing committee + keynote speakers is the mic drop of counterexamples:

Now some stats:

  •  438 attendees
  •  Papers
    •  Research Track: 167 submissions – 39 accepted – 23% acceptance rate
    •  In Use: 55 submissions – 17 accepted – 31% acceptance rate
    •  Resources: 31 submissions – 6 accepted – 19% acceptance rate
  •  38 Posters & 39 Demos
  • 14 industry presentations
  • Over 1000 reviews

These are roughly the same as the last time ISWC was held in the United States. So on to the major themes I took away from the conference plus some asides.

Knowledge Graphs as enterprise assets

It was hard to walk away from the conference without being convinced that knowledge graphs are becoming fundamental to delivering modern information solutions in many domains. The enterprise knowledge graph panel was a demonstration of this idea. A big chunk of the majors were represented:

The stats are impressive. Google’s Knowledge Graph has 1 billion things and 70 billion assertions. Facebook’s knowledge graph, which they distinguish from their social graph and which has just ramped up this year, has 50 million entities and 500 million assertions. More importantly, these are critical assets for applications: at eBay the KG is central to creating product pages; at Google and Microsoft, KGs are key to entity search and assistants; and at IBM they use it as part of their corporate offerings. But you know it’s really in use when knowledge graphs are used for emoji:

It wasn’t just the majors who have or are deploying knowledge graphs. The industry track in particular was full of good examples of knowledge graphs being used in practice. Some that stood out were: Bosch’s use of knowledge graphs for question answering in the DIY domain; multiple use cases for digital twin management (Siemens, Aibel); use in a healthcare chatbot (Babylon Health); and helping to regulate the US finance industry (FINRA). I was also very impressed with Diffbot’s platform for creating KGs from the Web. I contributed to the industry session, presenting how Elsevier is using knowledge graphs to drive new products in institutional showcasing and healthcare.

Beyond the wide use of knowledge graphs, there were a number of things I took away from this thread of industrial adoption.

  1. Technology heterogeneity is really the norm. All sorts of storage, processing and representation approaches were being used. It’s good we have the W3C Semantic Web stack but it’s even better that the principles of knowledge representation for messy data are being applied. This is exemplified by Amazon Neptune’s support for TinkerPop & SPARQL.
  2. It’s still hard to build these things. Microsoft said it was hard at scale. IBM said it was hard for unique domains. I had several people come to me after my talk about Elsevier’s H-Graph discussing similar challenges faced in other organizations that are trying to bring their data together, especially for machine-learning-based applications. Note: McCusker’s work is some of the better publicly available thinking on trying to address the entire KG construction lifecycle.
  3. Identity is a real challenge. I think one of the important moves in the success of knowledge graphs was not to over-ontologize. However, record linkage and deciding when to unify entities are still not solved problems. One common approach was to move the creation of an identifiable entity closer to query time to account for the query context, but that removes the shared conceptualization that is one of the benefits of a knowledge graph (a toy sketch of this kind of query-time, threshold-based linkage follows this list). Indeed, the clarion call by Google’s Jamie Taylor to teach knowledge representation was an outcome of the need for people who can think about these kinds of problems.
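To make the identity point concrete, here’s a toy sketch of query-time record linkage with a similarity threshold. This is my own illustration, not any of the panelists’ systems; the records and the 0.5 threshold are invented.

```python
# Toy record linkage: decide at query time whether two records describe
# the same entity, using Jaccard similarity over name tokens.
# All records, fields, and the 0.5 threshold are illustrative only.

def tokens(record):
    """Lowercase word tokens of the record's name field."""
    return set(record["name"].lower().split())

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def same_entity(r1, r2, threshold=0.5):
    """Heuristic identity decision made in the context of a query."""
    return jaccard(tokens(r1), tokens(r2)) >= threshold

if __name__ == "__main__":
    rec_a = {"name": "International Business Machines Corporation"}
    rec_b = {"name": "International Business Machines (IBM)"}
    rec_c = {"name": "IBM Research Zurich"}
    print(same_entity(rec_a, rec_b))  # True  - high token overlap
    print(same_entity(rec_a, rec_c))  # False - treated as a different entity
```

The point of the toy: the linkage decision lives in the query path, so two applications with different thresholds (or different query contexts) can disagree about what “the same entity” means, which is exactly the lost shared conceptualization.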

In terms of research challenges, much of what was discussed reflects the same kinds of ideas that were discussed at the recent Dagstuhl Knowledge Graph Seminar so I’ll point you to my summary from that event.

Finally, for most enterprises, their knowledge graph(s) were considered a unique asset to the company. This led to an interesting discussion about how to share “common knowledge” and the need to be able to merge such knowledge with local knowledge. This leads to my next theme from the conference.

Wikidata as the default option

When discussing “common knowledge”, Wikidata has become a focal point. In the enterprise knowledge graph panel, it was mentioned as the natural place to collaborate on common knowledge. The mechanics of the contribution structure (e.g. open to all, provenance on statements) and institutional attention/authority (i.e. the Wikimedia Foundation) help with this. An example of Wikidata acting as a default is its use to help collate data on genes.
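As a concrete (and hedged) illustration of treating Wikidata as the default source of common knowledge, here’s a minimal Python sketch that asks the public Wikidata SPARQL endpoint for a few gene items using the SPARQLWrapper library. The identifiers wd:Q7187 (“gene”) and wdt:P31 (“instance of”) are from memory and worth verifying against Wikidata before relying on them.

```python
# Minimal sketch: pull a handful of gene items from Wikidata's public
# SPARQL endpoint. Assumes the SPARQLWrapper package (pip install sparqlwrapper);
# the identifiers wd:Q7187 ("gene") and wdt:P31 ("instance of") should be
# double-checked against Wikidata before use.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?gene ?geneLabel WHERE {
  ?gene wdt:P31 wd:Q7187 .                     # items that are instances of "gene"
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .   # fetch English labels
  }
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT, agent="trip-report-example/0.1")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["gene"]["value"], row["geneLabel"]["value"])
```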

Fittingly enough, Markus Krötzsch and team won the best in-use paper with a convincing demonstration of how well semantic technologies have worked as the query environment for Wikidata. Furthermore, Denny Vrandečić (one of the founders of Wikidata) won the best blue sky paper with the idea of rendering Wikipedia articles directly from Wikidata.

Deep Learning diffusion

As with practically every other conference I’ve been to this year, deep learning as a technique has really been taken up. It’s become just part of the semantic web researcher’s toolbox. This was particularly clear in the knowledge graph construction area. Papers I liked with DL as part of the solution:

While not DL per se, I’ll lump embeddings into this section as well. Papers I thought were interesting:

The presentation of the above paper was excellent. I particularly liked their slide on related work.


As an aside, the work on learning rules and the complementarity of rules to other forms of prediction was an interesting thread in the conference. Besides the above paper, see the work from Heiner Stuckenschmidt’s group on evaluating rules and embedding approaches for knowledge graph completion. The work of Fabian Suchanek’s group on the representativeness of knowledge bases applies here as well, since it helps tell whether rules learned from a knowledge graph come from a representative source, and it is interesting in its own right. Lastly, I thought the use of rules in Beretta et al.’s work to quantify the evidence for an assertion in a knowledge graph, to help improve reliability, was neat.

Information Quality and Context of Use

The final theme is a bit harder for me to solidify and articulate but it lies at the intersection of information quality and how that information is being used. It’s not just knowing the provenance of information but it’s knowing how information propagates and was intended to be used. Both the upstream and downstream need to be considered. As a consumer of information I want to know the reliability of the information I’m consuming. As a producer I want to know if my information is being used for what it was intended for.

The latter problem was demonstrated by the keynote from Jennifer Golbeck on privacy. She touched on a wide variety of work, but in particular it’s clear that people don’t know what is happening to their data but are concerned about it.

There was also quite a bit of discussion going on about the decentralized web and Tim Berners-Lee’s Solid project throughout the conference. The workshop on decentralization was well attended. Something to keep your eye on.

The keynote by Natasha Noy also touched more broadly on the necessity of quality information, this time with respect to scientific data.

The notion of propagation of bias through our information systems was also touched on and is something I’ve been thinking about in terms of data supply chains:

That being said, I think there’s an interesting path forward for using technology to address these issues. Yolanda Gil’s work on the need for AI to address our own biases in science is a step forward in that direction; this point came through in her excellent keynote at the SemSci Workshop.


All this is to say that this is an absolutely critical topic and one where the standard “more research is needed” is very true. I’m happy to see this community thinking about it.

Final Thought

The Semantic Web community has produced a lot (as Natasha’s keynote pointed out).


ISWC 2018 definitely added to that body of knowledge but more importantly I think did a fantastic job of reinforcing and exciting the community.

Random Notes

A couple of weeks ago I was at Provenance Week 2018 – a biennial conference that brings together various communities working on data provenance. Personally, it’s a fantastic event as it’s an opportunity to see the range of work going on from provenance in astronomy data to the newest work on database theory for provenance. Bringing together these various strands is important as there is work from across computer science that touches on data provenance.

The week is anchored by the International Provenance and Annotation Workshop (IPAW) and the Theory and Practice of Provenance (TaPP) and includes events focused on emerging areas of interest: incremental re-computation, provenance-based security and algorithmic accountability. There were 90 attendees, up from ~60 at the prior events.


The folks at King’s College London, led by Vasa Curcin, did a fantastic job of organizing the event, including great social outings on top of their department building and a boat ride along the Thames. They catered to the World Cup fans as well. Thanks Vasa!


I had the following major takeaways from the conference:

Improved Capture Systems

The two years since the last Provenance Week have seen a number of improved systems for capturing provenance. In the systems setting, DARPA’s Transparent Computing program has given a boost to scaling out provenance capture systems. These systems use deep operating system instrumentation to capture logs; over the past several years they have become more efficient and scalable (e.g. CamFlow, SPADE). This connects with the work we’ve been doing on improving capture using whole-system record-and-replay. You can now run these systems almost full-time, although they capture significant amounts of data (3 days = ~110 GB). Indeed, the folks at Galois presented an impressive-looking graph database specifically focused on working with provenance and time-series data streaming from these systems.

Beyond the security use case, sciunit.run was a neat tool using execution traces to produce reproducible computational experiments.

There were also a number of systems for improving the generation of instrumentation to capture provenance. UML2PROV automatically generates provenance instrumentation from UML diagrams and source code using the provenance templates approach. (Also used to capture provenance in an IoT setting.) Curator implements provenance capture for micro-services using existing logging libraries. Similarly, UNICORE now implements provenance for its HPC environment. I still believe structured logging is one of the underrated ways of integrating provenance capture into systems.
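To spell out the structured-logging point, here’s a generic sketch using Python’s standard logging module (not the API of UML2PROV, Curator, or UNICORE): emit one JSON record per activity with the entities it used and generated, and you already have most of a PROV-style graph sitting in your logs.

```python
# Sketch: provenance capture via structured logging. Each log record is a
# JSON object describing one activity, the entities it used, and the
# entities it generated, loosely following PROV vocabulary. A generic
# illustration only, not any of the tools mentioned above.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("provenance")

def log_activity(activity, used, generated, agent):
    """Emit one structured provenance record as a JSON line."""
    record = {
        "id": str(uuid.uuid4()),
        "prov:activity": activity,
        "prov:used": used,            # input entities (e.g. file paths, dataset ids)
        "prov:generated": generated,  # output entities
        "prov:wasAssociatedWith": agent,
        "endTime": time.time(),
    }
    log.info(json.dumps(record))

# Example: a cleaning step that reads a raw file and writes a cleaned one.
log_activity(
    activity="clean-measurements",
    used=["raw/measurements.csv"],
    generated=["clean/measurements.parquet"],
    agent="pipeline-worker-01",
)
```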

Finally, there was some interesting work on reconstructing provenance. In particular, I liked Alexander Rasin’s work on reconstructing the contents of a database from its environment to answer provenance queries.

Also, the IPAW best paper looked at using annotations in a workflow to infer dependency relations:

Lastly, there was some initial work on extracting the provenance of health studies directly from published literature, which I thought was an interesting way of recovering provenance.

Provenance for Accountability

Another theme (mirrored by the event noted above) was the use of provenance for accountability. This has always been a major use for provenance as pointed out by Bertram Ludäscher in his keynote:

However, I think that due to increasing awareness around personal data usage and privacy, the need for provenance is being recognized. See, for example, the Royal Society’s report on Data management and use: Governance in the 21st century. At Provenance Week, there were several papers addressing provenance for GDPR, see:

I was also impressed with the demo from Imosphere using provenance for accountability and trust in health data.

Re-computation & Its Applications

Using provenance to determine what to recompute seems to have a number of interesting applications in different domains. Paolo Missier showed, for example, how it can be used to determine when to recompute in next-generation sequencing pipelines.

I particularly liked their notion of a re-computation front: the set of past executions you need to re-execute in order to address a change in the data.
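Here’s a toy sketch of that notion, under my own simplified formulation rather than the paper’s actual algorithm: if you record which past executions consumed which outputs, the re-computation front for a changed input is the downstream closure in that dependency graph.

```python
# Toy recomputation front: given a dependency graph of past executions
# (edges point from an execution to the executions that consumed its
# outputs), find everything downstream of a changed input. This is my own
# simplification, not the algorithm from the paper.
from collections import deque

def recomputation_front(downstream, changed):
    """Return the set of executions that must be re-run when `changed` changes."""
    front, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for dependant in downstream.get(node, []):
            if dependant not in front:
                front.add(dependant)
                queue.append(dependant)
    return front

# Example pipeline: raw reads -> alignment -> variant calling -> report,
# with an annotation database also feeding variant calling.
downstream = {
    "raw_reads": ["alignment"],
    "annotation_db": ["variant_calling"],
    "alignment": ["variant_calling"],
    "variant_calling": ["report"],
}

print(recomputation_front(downstream, "annotation_db"))
# {'variant_calling', 'report'} - only part of the pipeline needs re-running
```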

Wrattler was a neat extension of the computational notebook idea that showed how provenance can be used to automatically propagate changes through notebook executions and support suggestions.

Marta Mattoso’s team discussed the application of provenance to track the adjustments made when steering executions in complex HPC applications.

The work of Melanie Herschel’s team on provenance for data integration points to the benefits of applying provenance-driven recomputation to speed up the iterative nature of data integration, as she enumerated in her presentation at the recomputation workshop.

You can see all the abstracts from the workshop here. I understand from Paolo that they will produce a report from the discussions there.

Overall, I left provenance week encouraged by the state of the community, the number of interesting application areas, and the plethora of research questions to work on.

Random Links

 

The early part of last week I attended the Web Science 2018 conference. It was hosted here in Amsterdam, which was nice for me: I could go home in the evening.

Web Science is an interesting research area in that it treats the Web itself as an object of study. It’s a highly interdisciplinary area that combines primarily social science with computer science. I always envision it as a loop, with studies of what’s actually going on on the Web leading to new interventions on the Web, which we then need to study.

There were, I’d guess, a hundred or so people there … it’s a small but fun community. I won’t give a complete rundown of the conference; you can find summaries of each day by Cat Morgan (Workshop Day, Day 1, Day 2, Day 3). Instead, here’s an assortment of things that stuck out for me:

And some tweets:

Last week, I was at Provenance Week 2016. This event happens once every two years and brings together a wide range of researchers working on provenance. You can check out my trip report from the last Provenance Week in 2014.  This year Provenance Week combined:

For me, Provenance Week is like coming home: lots of old friends and a favorite subject of mine. It’s also a good event to attend because it crosses the subfields of computer science, everything from security in operating systems to scientific workflows to database theory. In one day, I went from a discussion on the role of indirection in data citation to staring at the C code of a database. Marta, Boris and Sarah really put together a solid program. There were about 60 attendees across the four days.


So what was I doing there? Having served as co-chair of the W3C PROV working group, I thought it was important to be at the PROV: Three Years Later event, where we reflected on the status of PROV, its uptake and usage. I presented some ongoing work on measuring the usage of provenance on the web of data. Additionally, I gave a presentation of joint work led by my student Manolis Stamatogiannakis and done in conjunction with Ashish Gehani’s group at SRI. The work focused on using benchmarks to help inform decisions on which provenance capture system to use. Slides:

I’ll now walk through my three big takeaways from the event.

Provenance to attack Advanced Persistent Threats

DARPA’s $60 million Transparent Computing program explicitly calls out the use of provenance to address the problem of what are called Advanced Persistent Threats (APTs). APTs are attacks that are long term, look like standard business processes, and involve the attacker knowing the system well. This has led to a number of groups exploring the use of system-level provenance capture techniques (e.g. SPADE and OPUS) and then integrating that capture from multiple distributed sources using PROV-inspired data models. This was well described by David Archer in his talk as assembling multiple causal graphs from event streams. James Cheney’s talk on provenance segmentation also addressed these issues well. This reminded me somewhat of the work on distributed provenance capture using structured logs that the NetLogger and Pegasus teams do; however, they leverage the structure of a workflow system to help with the assembly.

I particularly liked Yang Ji, Sangho Lee and Wenke Lee’s work on using user-level record and replay to track and replay provenance. This builds upon some of our work that used system-level record and replay as a mechanism for separating provenance capture and instrumentation, but now in user space using the nifty rr tool from Mozilla. I think this thread of being able to apply provenance instrumentation after the fact on an execution trace holds a lot of promise.

Overall, it’s great to see this level of attention on the use of provenance for security and, more broadly, on using long-term records of provenance to do analysis.

PROV as the starting point

Given that this was the ten-year anniversary of IPAW, it was appropriate that Luc Moreau gave one of the keynotes. As one of the real drivers of the community, Luc gave a review of its development and successes. One of those outcomes was the W3C PROV standards.

Overall, it was nice to see the variety of uses of PROV and the tools built around it. It’s really become the jumping-off point for exploration. For example, Pete Edwards’ team combined PROV and a number of other ontologies, including P-Plan, to create a semantic representation of what’s going on within a professional kitchen in order to check food safety compliance.


Another example is the use of PROV as a jumping off point for the investigation into the provenance model of HL7 FHIR (a new standard for electronic healthcare records interchange).

As a whole, I think the attendees felt that what was missing was an active central point to see what is going on with PROV and pointers to resources for implementation. The aim is to make sure that the W3C PROV wiki is up-to-date and is a better resource overall.

Provenance as lens: Data Citation, Documents & Versioning

An interesting theme was the use of provenance concepts to give a frame for other practices. For example, Susan Davidson gave a great keynote on data citation and how using a variant of provenance polynomials can help us understand how to automatically build citations for various parts of curated databases. The keynote was based on her work with James Frew and Peter Buneman that will appear in CACM (preprint). Another good example of provenance to support data citation was Nick Car’s work for Geoscience Australia.

Furthermore, the notion of provenance as the substructure for complex documents appeared several times. For example, the Impacts on Human Health of Global Climate Change report from globalchange.gov uses provenance as a backbone. Both the OPUS and PoeM systems are exploring using provenance to generate high-level experiment reports.

Finally, I thought David Koop’s work on versioning version trees showed how using provenance as a lens can help us better understand versioning itself. (I have to give David credit for presenting a super recursive concept so well.)

Overall, another great event and I hope we can continue to attract new CS researchers focusing on provenance.

Random Notes

  • PROV in JSON-LD – good for streaming
  • Theoretical provenance paper recipe = extend provenance polynomials to deal with new operators. Prove nice result. E.g. now for linear algebra. (A sketch of the basic polynomial rules follows this list.)
  • Prefixes! R-PROV, P-PROV, D-PROV, FS-PROV, SC-PROV – let me know if I missed any.
  • Intel Secure Guard Extensions (SGX) – interesting
  • Surprised how dependent I’ve become on taking pictures at conferences for note-taking. Not being able to take them really impacted my flow. Plus, there are fewer pictures for this post.
  • Thanks to Adriane for hosting!
  • A provenance based data science environment
  • 👍Learning Health Systems – from Vasa Curcin
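For anyone new to that recipe, here is a rough sketch of the basic provenance polynomial rules from the semiring framework (my recollection of Green et al.’s provenance semirings, so check the original for the precise statement). “Extending to a new operator” means supplying a rule like these for it; the linear algebra case below is purely illustrative.

```latex
% Tuples carry annotations from a commutative semiring (K, +, \cdot, 0, 1).
% Relational operators combine annotations; e.g. for union and join:
\begin{align*}
  \mathrm{prov}_{R \cup S}(t) &= \mathrm{prov}_R(t) + \mathrm{prov}_S(t) \\
  \mathrm{prov}_{R \bowtie S}(t_1 \bowtie t_2) &= \mathrm{prov}_R(t_1) \cdot \mathrm{prov}_S(t_2)
\end{align*}
% A "new operator" paper supplies an analogous rule, e.g. annotating the
% entries of a matrix product (illustrative, not from a specific paper):
\[
  \mathrm{prov}\big((A B)_{ij}\big) = \sum_{k} \mathrm{prov}(a_{ik}) \cdot \mathrm{prov}(b_{kj})
\]
```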

Next week is the 2015 International Semantic Web Conference. I had the opportunity with Michel Dumontier to chair a new track on Datasets and Ontologies. A key part of the Semantic Web has always been shared resources, whether it’s common standards through the W3C or open datasets like those found in the LOD cloud. Indeed, one of the major successes of our community is the availability of these resources.

ISWC over the years has experimented with different ways of highlighting these contributions and bringing them into the scientific literature. For the past couple of years, we have had an evaluation track specifically devoted to reproducibility and evaluation studies. Last year datasets were included to form a larger RDBS track. This year we again have a specific Empirical Studies and Evaluation track alongside the Datasets & Ontologies track.

The reviewers had a tough job for this track. First, it was a new track, so it was hard to make standard judgments. Secondly, we asked reviewers to review not only the paper but also the resource itself along a number of dimensions. Overall, I think they did a good job. Below you’ll find the resources chosen for presentation at the conference and a brief headline of what to me is interesting about each paper. In the spirit of the track, I link to the resource as well as the paper.

Datasets

  • Automatic Curation of Clinical Trials Data in LinkedCT by Oktie Hassanzadeh and Renée J. Miller (paper) – clinicaltrials.gov published as linked data in an open and queryable form. This resource has been around since 2008. I love the fact that they post downtime and other status info on Twitter: https://twitter.com/linkedct
  • LSQ: Linked SPARQL Queries Dataset by Muhammad Saleem, Muhammad Intizar Ali, Qaiser Mehmood, Aidan Hogan and Axel-Cyrille Ngonga Ngomo (paper) – query logs are becoming an ever more important resource for everything from search engines to database query optimization. See, for example, USEWOD. This resource provides queryable SPARQL versions of the query logs from several major datasets, including DBpedia and LinkedGeoData.
  • Provenance-Centered Dataset of Drug-Drug Interactions by Juan Banda, Tobias Kuhn, Nigam Shah and Michel Dumontier (paper) – this resource provides an aggregated set of drug-drug interactions coming from 8 different sources. I like how they provide a DOI for the bulk download of their data source as well as a SPARQL endpoint. It also uses nanopublications as the representation format.
  • Semantic Bridges for Biodiversity Science by Natalia Villanueva-Rosales, Nicholas Del Rio, Deana Pennington and Luis Garnica Chavira (paper) – this resource allows biodiversity scientists to work with species distribution models. The interesting thing about this resource is that it provides not only linked data, a SPARQL endpoint and ontologies but also semantic web services (i.e. SADI) for orchestrating these models.
  • DBpedia Commons: Structured Multimedia Metadata for Wikimedia Commons by Gaurav Vaidya, Dimitris Kontokostas, Magnus Knuth, Jens Lehmann and Sebastian Hellmann (paper) – this is another chapter in exposing Wikimedia content as structured data. This resource provides structured information for the media content in Wikimedia Commons. Now you can SPARQL for all images with a CC-BY-SA v2.0 license.

Ontologies

Overall, I think this is a good representation of the plethora of deep datasets and ontologies that the community is creating.  Take a minute and check out these new resources.

Last week I was in Florence, Italy for the 24th International World Wide Web Conference (WWW 2015). This is the leading computer science conference focused on web technology writ large. It’s a big conference – 1400 attendees this year. WWW is excellent for getting a good bearing on the latest across multiple subfields of computer science. Another way to say it is that I ran into friends from the semantic web community, NLP community, data mining community, web standards community, the scholarly communication community, etc. I think on the Tuesday night I traversed four different venues hanging out with various groups.

This is the first time since 2010 that I attended WWW. It was good to be back. I was there the entire week so there was a ton but I’ll try to boil what I saw down into 3 takeaways. But first…

What was I doing there?

First, I co-authored a research track paper with Marcin Wylot and Philippe Cudré-Mauroux of the eXascale Infolab (cool name) on Executing Provenance Queries over Web Data (slides, paper). We showed that because of the highly selective nature of provenance on the web of data, we can actually improve query performance within a triple store. I was super happy to have this accepted given the ~14% acceptance rate!

Second, I gave the opening talk of the Semantics, Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD) workshop. I discussed the current state of scholarly productivity and used the notion of the burden of knowledge as a motivation for knowledge graphs as a mechanism to help increase that productivity. I even went web for my slides.

Continuing on the theme of knowledge graphs, I participated on a panel in the industry track around knowledge graphs. More thoughts on this coming up.


The Takeaways

From my perspective there were three core takeaways:

  1. Knowledge Graphs/Bases everywhere
  2. Assume the Web
  3. Scholarly applications are interesting applications

1. Knowledge Graphs/Bases everywhere

I could call this Entities everywhere. Perhaps it was the sessions I chose to attend, but it felt like when I was at the conference in 2010, when every other paper was about online advertising. There were a ton of papers on entity linking, entity disambiguation, entity everything (etc.), and many others had knowledge base construction as a motivation.


There were two tutorials on knowledge graphs; both of them were full, and the one from Google/Facebook involved moving to a completely new room. Both were excellent. The one from the Yago team has really good material. As a side note, it was interesting to sit in on tutorials where I already have a decent handle on the material. It let me compare my own intellectual framework for the material with others out there. For example, I liked the Yago tutorial’s distinction between source-centric and yield-centric information extraction and how we pursue the yield approach when doing automated knowledge base construction. A recommended exercise for the reader.

Beyond just being a plethora of stuff, I think our panel discussion highlighted themes that appeared across several papers.

Dealing with long tail entities
In general, approaches to knowledge base construction have relied on well-known entities (e.g. Wikipedia) and frequency (if you’re mentioned a lot, you’re an entity). For many domain-specific entities, for example in science, and also for emergent entities, this is a challenge. A number of authors tried to tackle this by:

  • looking at web page titles as a potential data source for entities (Song et al.)
  • using particular types of web tables to help assign entities to classes (Wang et al.)
  • using social context to help entity extraction (Jie Tang et al.)
  • discovering new meta relations between entities (Meng et al.)

Quality
All the organizations on the industry panel spend significant resources on quality maintenance of their knowledge graphs. The question here is how to best decrease the amount of human input and increase automation.

An interesting example that was talked about quite frequently is the move of Freebase to Wikidata. Wikidata runs under the same guidelines as Wikipedia, so all facts need to have claims grounded in sources from the Web. Well, it turns out this is difficult because many facts are sourced from Wikipedia itself. This kind of, dare I say it, provenance is really important. Most current large-scale knowledge graphs support provenance, but as we automate more it would be nice to be able to automatically judge these sources using that provenance.

One paper that I saw that addressed quality issues was GERBIL – General Entity Annotator Benchmarking Framework. This 25-author paper(!) devised a common framework for testing entity linking tools. It’s great to see the community looking at these sorts of common QA frameworks.

Multimedia
This seemed to be bubbling up. On the panel, the company Tagasauris was looking at constructing a mediaGraph by analyzing video content. During the Yago tutorial, the presenters mentioned potential future work on extracting common sense knowledge by looking at videos. In general, both extracting facts from multimedia and using knowledge graphs to understand multimedia seem like challenging but fruitful areas. One particular example was the paper “Tagging Personal Photos with Transfer Deep Learning”. What was cool was the injection of a personal photo ontology into the training of the network as priors. This led to both better results and, probably more importantly, decreased training time. Another example is the work from Gerhard Weikum’s group on extracting knowledge from movie scripts.

Finally, as I commented at the Linked Data on the Web Workshop, the growth of knowledge graphs is a triumph of the semantic web and linked data. Making knowledge bases open and available on the Web using reusable schemes has really been a boon to the area.

2. Assume the Web

It’s obvious but is worth repeating: the web is really big!

These stats were from Andrei Broder’s excellent keynote. The size of the web motivates the need for better web technology (e.g. search), and as that improves so do our expectations. Broder called out three axes of progress:

  1. scaling up with quality
  2. faster response
  3. higher functionality levels

We progress on all these dimensions. But the scale of the web doesn’t just change the technology we need to develop; it changes our methods.

For example, a paper I liked a lot was “Leveraging Pattern Semantics for Extracting Entities in Enterprises”. This bears resemblance to problems we face extracting entities that are not found on the web because they’re only mentioned within a private environment (e.g. internal product names). But even in this environment they rely on the Web: they rank the semantic patterns they extract using relations extracted from the web.

For me, it means that even if the application isn’t necessarily for “the web”, I should think about the web as a potential part of the solution.

3. Scholarly applications are interesting applications

I’m biased, but I think scholarly applications are particularly interesting, and you saw that at WWW. I attended two workshops dealing with technology and scholarship: SAVE-SD and Big Scholar. I was particularly impressed with the scholarly knowledge graph that’s being built on top of the Bing Satori knowledge graph, which covers venues, authors, papers, and organizations from 100 million papers. (It seems there are probably 120 million total on the web.) At their demo they showed some awesome queries that you can do, like “papers on multiple sclerosis citing artificial intelligence”. Another example is venues appearing at the side of Bing searches with related venues, due dates, etc.

See Kuansan Wang’s (@kuansanw) talk for more info (slides). As far as I understand, MSR will also be releasing the Microsoft Academic Graph for experimentation in a couple of weeks. Based on this graph, MSR is co-organizing the 2016 WSDM Cup with Antonio Gulli from Elsevier.

It was a pleasure to meet C. Lee Giles of CiteSeerX. It was good seeing an overview of that system and he had some good pointers (e.g. GROBID for metadata extraction and ParsCit for citation extraction).

From SAVE-SD there were two papers that caught my eye:

There were also a number of main track papers that applied methods to scholarly content.

Overall, WWW 2015 was a huge event so this trip report really is just what I could touch. I didn’t even get the chance to go to the W3C sessions and Web Science talks. You can check out all the proceedings here, definitely worth a look.

Random thoughts

  • The web isn’t scale free – it’s log-log. Gotta check out Clauset et al 2009, Power-law distributions in empirical data
  • If you’re a researcher, remember that Broder’s “A taxonomy of web search” was originally rejected from WWW 2002; it now has 1700+ citations.
  • Aidan Hogan + 1 for colorful slides and showing that we need to just deal with blank nodes and not get so hung up about it.  (paper, code)
  • If you do machine learning, do your parameter studies. Most papers had them.
  • PROV and information diffusion combined. So awesome.
  • Ah conference internet… It’s always hard.
  • People are hiring like crazy. Booths from Baidu, Facebook, Yahoo, LinkedIn. Oh, and never discount how frisbees can motivate highly educated geeks.
  • On the hiring note, I liked how the companies listed their attendees and their talks.
  • Tons and tons of talks with authors from companies. I should really do some stats. It was like every paper.
  • Italy, food, florentine steak – yummy!
  • Corollary, running is necessary but running in Florence is beautiful. Head by the Duomo across the river and up through the gardens.
  • What you can do with Foursquare data!
  • Larry and Sergey won the test of time award.
  • Gotta ask the folks at Insight about their distributional semantics work.

Earlier this week, I attended the SNN Symposium – Intelligent Machines. SNN is the Dutch Foundation for Neural Networks, which coordinates the Netherlands’ national platform on machine learning, connecting most of the ML groups in the country.

It’s not typical for a one-day, Dutch-specific academic symposium to sell out – but this one did. This is down to a combination of the topic (machine learning is hot!) and the speakers. The organizers put together a great line-up:

It’s not typical to get essentially four keynotes in one day. Instead of going through each talk in turn, I’ll try to draw out some of the major items that I took away from across the talks.

The Case for Probability Theory

Both Prof. Ghahramani and Dr. Herbrich made strong arguments for probability as the core way to think about machine learning/intelligence, and in particular a Bayesian view of the world. Herbrich summarized the argument for using probability as follows:

  • Probability is a calculus of uncertainty (argued using the “naturalness” of Cox’s axioms)
  • It maps well to computational systems (factor graphs allow for computational distribution)
  • It decouples inference, prediction and decision (a tiny worked example follows this list)
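A tiny worked example of that last point, with made-up numbers of my own: inference produces a posterior, and the decision is a separate step that combines that posterior with a loss.

```python
# Tiny illustration of "decouples inference, prediction and decision":
# inference computes a posterior; the decision combines that posterior with
# a loss function. All numbers are made up.

# Inference: Bayes' rule for a binary hypothesis given one observation.
prior_faulty = 0.01                 # P(machine is faulty)
p_alarm_given_faulty = 0.95         # P(alarm | faulty)
p_alarm_given_ok = 0.10             # P(alarm | ok)

p_alarm = (p_alarm_given_faulty * prior_faulty
           + p_alarm_given_ok * (1 - prior_faulty))
posterior_faulty = p_alarm_given_faulty * prior_faulty / p_alarm
print(f"P(faulty | alarm) = {posterior_faulty:.3f}")   # ~0.088

# Decision: pick the action with the lowest expected loss under the posterior.
loss = {                                        # loss[action][true state]
    "inspect": {"faulty": 1.0, "ok": 1.0},      # inspection always costs a bit
    "ignore":  {"faulty": 50.0, "ok": 0.0},     # missing a fault is expensive
}
for action, l in loss.items():
    expected = l["faulty"] * posterior_faulty + l["ok"] * (1 - posterior_faulty)
    print(action, round(expected, 2))
# Even with a low posterior, the asymmetric loss makes "inspect" the better action.
```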

Factor Graphs!

For me, it was a nice reminder to think of optimization as an approximation for computing probabilities. More generally, coming back to a simplified high-level framework makes understanding the complexities of the algorithms easier. Ghahramani did a great job of connecting this framework with the underlying mathematics. Slides from his ML course are here – unfortunately without the lecturer himself.

The Rise of Reinforcement Learning

The presentations by Daan Wierstra and Sethu Vijayakumar both featured pretty amazing demos. Dr. Wierstra was on the team that developed algorithms that can learn to play Atari games purely from pixels and a knowledge of the game score. This uses reinforcement learning to train a convolutional neural network. The key invention here was to keep past experience around and replay it when training the neural network.
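Here is a minimal sketch of that idea, experience replay, as I understand it; this is just the core data structure, not DeepMind’s implementation, and the surrounding training loop is only indicated in comments.

```python
# Minimal experience replay buffer, the "keep past experience around" idea.
# Illustrative only; the full DQN approach adds much more (target networks, etc.).
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        """Store one interaction with the environment."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Random minibatch; breaks correlation between consecutive frames."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Sketch of the training loop around it (env, agent, and learn() are assumed):
# buffer = ReplayBuffer()
# for step in range(num_steps):
#     action = agent.act(state)
#     next_state, reward, done = env.step(action)
#     buffer.push(state, action, reward, next_state, done)
#     if len(buffer) >= batch_size:
#         agent.learn(buffer.sample(batch_size))   # train on replayed experience
#     state = env.reset() if done else next_state
```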

Likewise, Prof. Vijayakumar showed how robots can also learn via reinforcement. Here’s an example of a robot arm learning to balance a pole.

Reinforcement learning can help attack the problem of data efficiency that’s faced by machine learning. Essentially, it’s hard to get enough training data, let alone labelled training data. We’ve seen the rise of unsupervised methods to take advantage of the data we do have. (Side note: unsupervised approaches just keep getting better.) But by situating the agent in an environment, it’s easier to provide the sort of training necessary. Instead of examples, one needs to provide the appropriate feedback environment. From Wierstra’s talk, again, the apparent difficulty for reinforcement learning is temporal abstraction – using knowledge from the past to learn. Both the Atari and robot examples receive fairly immediate reinforcement on their tasks.

This takes us back to the classic ideas of situated cognition and of course the work of Luc Steels.

Good Task Formulation

Sometimes half the battle in research is coming up with a good task formulation. This is obvious, but it’s actually quite difficult. What struck me was that each of the speakers was good at formulating their problem and the metrics by which to test it. For example, Prof. Ghahramani was able to articulate his goals and measures of success for the development of the Automatic Statistician – a system for finding a good model of given data and providing a nifty, human-readable and transparent report. Here’s one for affairs 🙂

(Side note: the combination of parameter search and search through components reminds me of work on the Wings workflow environment.)

Likewise, Dr. Herbrich was good at translating the various problems faced within Amazon into specific ML tasks. For example, he gave a crisp definition for content linkage.

He then broke this down into specific well-defined tasks through the rest of the talk. The important thing here is to keep coming back to these core tasks and having well-defined evaluation criteria. (See also Watson’s approach.)

Attacking General AI?


One thing that stood out to me was the audacity of the Google DeepMind goal – to solve general AI. Essentially, designing “AI that can operate over a wide range of tasks”. Why now? Wierstra emphasized the available compute power and advances in different algorithms. I thought the interesting comment was that they have something like a 30-year time horizon within a company. Of course, funding may not last that long, but articulating that goal and demonstrably attacking it is something that I would expect more from academia. Indeed, I wonder if we are not thinking big enough. They already have very impressive results: the Atari example, but also their DRAW algorithm for learning to generate images.

I also like their approach of Neural Turing Machines – using a recurrent neural network to create a computer itself. By adding memory to neural networks they’re trying to tackle the “memory” problem discussed above.

Overall, it was an invigorating day.

Random thoughts:

  • Robots demos are cool!

  • Text Kernel and Potsdam’s use of word2vec for entity extraction in CVs was interesting.