Earlier this week, I attended the SNN Symposium – Intelligent Machines. SNN is the Dutch Foundation for Neural Networks, which coordinates the Netherlands’ national platform on machine learning, connecting most of the ML groups in the country.

It’s not typical for a one-day, Netherlands-specific academic symposium to sell out – but this one did. That was down to a combination of the topic (machine learning is hot!) and the speakers. The organizers put together a great line-up:

It’s also not typical to get essentially four keynotes in one day. Instead of going through each talk in turn, I’ll try to draw out some of the major items I took away from across the talks.

The Case for Probability Theory

Both Prof. Ghahramani and Dr. Herbrich made strong arguments for probability as the core way to think about machine learning/intelligence, and in particular for a Bayesian view of the world. Herbrich summarized the argument for probability as:

  • Probability is a calculus of uncertainty (argued via the “naturalness” of Cox’s axioms)
  • It maps well to computational systems (factor graphs allow for computational distribution)
  • It decouples inference, prediction and decision (a toy sketch of this decoupling follows the list)
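
To make that decoupling concrete, here’s a toy numeric sketch – all the numbers and the loss table are invented for illustration. Bayes’ rule does the inference; the decision step consumes only the resulting posterior:

```python
# Toy diagnostic-test example with made-up numbers, showing that
# inference (the posterior) is computed separately from the decision.
prior = 0.01                 # P(condition) before any evidence
p_pos_given_cond = 0.95      # P(positive test | condition)
p_pos_given_healthy = 0.05   # P(positive test | no condition)

# Inference: Bayes' rule after observing a positive test.
evidence = p_pos_given_cond * prior + p_pos_given_healthy * (1 - prior)
posterior = p_pos_given_cond * prior / evidence  # ~0.16

# Decision: pick the action minimising expected loss under the posterior.
# The loss table is entirely hypothetical.
loss = {("treat", True): 1,   ("treat", False): 10,
        ("wait",  True): 100, ("wait",  False): 0}

def expected_loss(action, p):
    return p * loss[(action, True)] + (1 - p) * loss[(action, False)]

best_action = min(("treat", "wait"), key=lambda a: expected_loss(a, posterior))
print(f"posterior = {posterior:.3f}, best action = {best_action}")
```

Swapping in a different loss table changes the decision without touching the inference – that’s the decoupling.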

Factor Graphs!

For me, it was a nice reminder to think of optimization as an approximation for computing probabilities. More generally, coming back to a simplified high-level framework makes understanding the complexities of the algorithms easier. Ghahramani did a great job of connecting this framework with the underlying mathematics. Slides from his ML course are here – unfortunately without the lecturer himself.

The Rise of Reinforcement Learning

The presentations by Daan Wierstra and Sethu Vijayakumar both featured pretty amazing demos. Dr. Wierstra was on the team that developed algorithms that can learn to play Atari games purely from pixels and knowledge of the game score. This uses reinforcement learning to train a convolutional neural network. The key invention here was to keep past experience around and replay it as input back into the neural network during training (a minimal sketch of this idea is below).
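
Here’s a minimal sketch of that idea – an experience replay buffer. The class and parameters are my own illustration, not DeepMind’s code:

```python
import random
from collections import deque

class ReplayBuffer:
    """Keeps past (state, action, reward, next_state, done) transitions
    and serves random minibatches for training, which breaks the strong
    correlation between consecutive game frames."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # oldest experience is evicted

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Train on a random slice of the past, not just the latest frames.
        return random.sample(self.buffer, batch_size)
```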

Likewise, Prof. Vijayakumar showed how robots can also learn via reinforcement. Here’s an example of a robot arm learning to balance a pole.

Reinforcement learning can help attack the problem of data efficiency faced by machine learning. Essentially, it’s hard to get enough training data, let alone labelled training data. We’ve seen the rise of unsupervised methods to take advantage of the data we do have. (Side note: unsupervised approaches just keep getting better.) But by situating the agent in an environment, it’s easier to provide the sort of training necessary. Instead of examples, one needs to provide the appropriate feedback environment. From Wierstra’s talk, the apparent difficulty for reinforcement learning is temporal abstraction – using knowledge from the past to learn. Both the Atari and robot examples receive fairly immediate reinforcement on their tasks.

This takes us back to the classic ideas of situated cognition and of course the work of Luc Steels.

Good Task Formulation

Sometimes half the battle in research is coming up with a good task formulation. This sounds obvious, but it’s actually quite difficult. What struck me was that each of the speakers was good at formulating their problem and the metrics by which they could test it. For example, Prof. Ghahramani was able to articulate his goals and measures of success for the development of the Automatic Statistician – a system for finding a good model of given data and providing a nifty, human-readable and transparent report. Here’s one for affairs :-)

(Side note: the combination of parameter search and search through components reminds me of work on the Wings workflow environment.)

Likewise, Dr. Herbrich was good at translating the various problems faced within Amazon into specific ML tasks. For example, here’s his definition of Content Linkage:

[Slide: Herbrich’s definition of Content Linkage]

He then broke this down into specific, well-defined tasks through the rest of the talk. The important thing here is to keep coming back to these core tasks and to have well-defined evaluation criteria. (See also Watson’s approach.)

Attacking General AI?

DeepMind – general AI

One thing that stood out to me was the audacity of the Google DeepMind goal – to solve general AI; essentially, designing “AI that can operate over a wide range of tasks”. Why now? Wierstra emphasized the available compute power and advances in different algorithms. I thought the interesting comment was that they have something like a 30-year time horizon within a company. Of course, the funding may not last that long, but articulating that goal and demonstrably attacking it is something that I would expect more from academia. Indeed, I wonder if we are not thinking big enough. They already have very impressive results: the Atari example, but also their DRAW algorithm for learning to generate images:

I also like their approach of Neural Turing Machines – using a recurrent neural network to create a computer itself. By adding memory to neural networks, they’re trying to tackle the “memory” problem discussed above.

Overall, it was an invigorating day.

Random thoughts:

  • Robot demos are cool!

  • Textkernel and Potsdam’s use of word2vec for entity extraction in CVs was interesting (a rough sketch of the general idea follows below).
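
I don’t know Textkernel’s actual pipeline, but a rough sketch of the general technique with gensim (version 4+; the tiny CV corpus below is a stand-in) might look like this:

```python
from gensim.models import Word2Vec

# Stand-in corpus: tokenised sentences from CVs. A real pipeline would
# train on millions of sentences, not three.
sentences = [
    ["senior", "java", "developer", "spring", "hibernate"],
    ["python", "machine", "learning", "engineer"],
    ["data", "scientist", "python", "pandas", "scikit-learn"],
]

# Train small vectors; the hyperparameters are illustrative only.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)

# Nearest neighbours of a known skill term surface candidate entities.
print(model.wv.most_similar("python", topn=3))
```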

This past week I attended a workshop on the Evolution and Variation of Classification Systems, organized by the Knowescape EU project. The project studies how knowledge evolves and makes cool maps like this one:

The aim of the workshop was to discuss how knowledge organization systems and classification systems change. By knowledge organization systems, we mean things like the Universal Decimal Classification or the Wikipedia category structure. My interest here is the interplay between change in data and change in the organization system used for that data. For example, I may use a certain vocabulary or ontology to describe a dataset (i.e. the columns); how does that impact data analysis procedures when the organization system’s meaning changes? Many of our visualization decisions and analyses are based on how we categorize data (whether manually or automatically) according to such organizational structures. Albert Meroño-Peñuela gave an excellent example of that with his work on Dutch historical census data. Furthermore, the organization system used may impact the ability to repurpose and combine data.

Interestingly, even though we’ve seen highly automated approaches emerge for search and other information analysis tasks, Knowledge Organization Systems (KOSs) still often provide extremely useful information. For example, we’ve seen how schema.org and the Wikipedia category structure have been central to the emergence of knowledge graphs. Likewise, extremely adaptable organization systems such as hashtags have been foundational for other services.

At the workshop, I particularly enjoyed Joseph Tennis’s keynote on the diversity and stability of KOSs. His work on ontogeny is starting to measure that change. He demonstrated this by looking at the Dewey Decimal System, but others have shown that the change is apparent in other KOSs (1, 2, 3, 4). Understanding this change could help in constructing better and more applicable organization systems. A toy sketch of this kind of measurement follows.
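
As a toy illustration of the kind of measurement that ontogeny work implies – the term sets here are invented, not from actual Dewey editions:

```python
# Two invented "editions" of a classification scheme.
edition_1 = {"railways", "telegraphy", "natural philosophy", "chemistry"}
edition_2 = {"railways", "telecommunications", "physics", "chemistry"}

# Jaccard similarity as a crude stability score between editions.
stability = len(edition_1 & edition_2) / len(edition_1 | edition_2)

print(f"stability: {stability:.2f}")          # 0.33 for these sets
print(f"added:   {edition_2 - edition_1}")    # new classes
print(f"removed: {edition_1 - edition_2}")    # retired classes
```

Run over successive editions, a score like this would give a simple time series of a scheme’s churn.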

From both Joseph’s talk as well as the talk by Richard Smiraglia (one of the leaders in Knowledge Organization), it’s clear that, as with many other sciences, our ability to understand information systems can now become much more deeply empirical. Because the objects of study (e.g. vocabularies, ontologies, taxonomies, dictionaries) are available on the Web in digital form, we can now analyze them. This is the promise of Web Observatories. Indeed, an interesting outcome of the workshop was that the construction of a KOS observatory is not that far-fetched and could be done using aggregators such as Linked Open Vocabularies and Taxonomy Warehouse. I’ll be interested to see if this gets built.

Finally, it occurred to me that there is a major lack of studies on the evolution of Urban Dictionary as a KOS. Somebody ought to do something about it :-)

Random Notes

Last week (Jan 29 & 30), I was at the NSF & Sloan Foundation workshop: Supporting Scientific Discovery through Norms and Practices for Software and Data Citation and Attribution. The workshop was in the context of the NSF’s Dear Colleague Letter on the subject. The workshop brought together a range of backgrounds and organizations, from Mozilla to NIH and NASA. I got to catch up with several friends and was able to meet some new folks as well. Check out the workshop’s GitHub page with a list of 22 use cases submitted to the workshop.

I was pleased to see the impact of the work of FORCE11 on helping drive this space. In particular, the Joint Declaration of Data Citation Principles and Research Resource Identifiers (RRIDs) seem to be helping the community focus on citing other forms of scholarly output, and were brought up several times in the meeting.

I think there were two main points from the workshop:

  1. We have the infrastructure.
  2. Sustainability is the open challenge.

Infrastructure

It was clear that we have much of the infrastructure in place to enable the citation and referencing of outputs such as software and data.

In terms of software, piggybacking off existing infrastructure seems to be the most likely approach. The versioning/release mindset built into software development means that hosting infrastructure such as GitHub or Google Code provides a strong start. These can then be integrated with existing scholarly attribution systems. My colleague Sweitze Roffel presented Elsevier’s work on Original Software Publications. This approach leverages the existing journal-based ecosystem to provide the permanence and context associated with things in the scientific record. Another approach is to use the data hosting/citation infrastructure to give code a DOI, e.g. by using Zenodo. Both approaches work with GitHub.
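
As a hedged sketch of the Zenodo route – the endpoints follow Zenodo’s documented REST API as I understand it, and the token, file name, and metadata are placeholders:

```python
import requests

TOKEN = "YOUR-ZENODO-TOKEN"  # placeholder personal access token
BASE = "https://zenodo.org/api/deposit/depositions"

# 1. Create an empty deposition.
dep = requests.post(BASE, params={"access_token": TOKEN}, json={}).json()

# 2. Upload the code snapshot (e.g. a GitHub release tarball).
with open("myproject-v1.0.tar.gz", "rb") as fp:
    requests.post(dep["links"]["files"],
                  params={"access_token": TOKEN},
                  data={"name": "myproject-v1.0.tar.gz"},
                  files={"file": fp})

# 3. Attach minimal metadata, then publish; publishing mints the DOI.
metadata = {"metadata": {
    "title": "myproject v1.0",
    "upload_type": "software",
    "description": "Snapshot of myproject for citation.",
    "creators": [{"name": "Doe, Jane"}],
}}
requests.put(dep["links"]["self"], params={"access_token": TOKEN},
             json=metadata)
published = requests.post(dep["links"]["publish"],
                          params={"access_token": TOKEN}).json()
print(published.get("doi"))
```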

The biggest thing will be promoting the actual use of proper citations. James Howison of the University of Texas at Austin presented interesting deep-dive results on how people refer to software in the scientific literature (slide set below) (GitHub). It shows that people want to do this but often don’t know how. His study was focused on a sample of the literature; I’d like to do this same study in an automatic fashion on the whole of the literature. I know he’s working with others on training machine learning models for finding software mentions, so that would be quite cool. Maybe it would be possible to back-fill the software citation graph this way?

In terms of data citation, we are much farther along because many of the existing data repositories support the minting of data citations. Many of the questions asked were about cases with changing or mashed-up data. These are important edge cases to look at. I think progress will be made here by leveraging the landing pages for data to provide additional metadata. Indeed, Joan Starr from the California Digital Library is going to bring this back to the DataCite working group to talk about how to enable this. I was also impressed with the PLOS-led Making Data Count project and Martin Fenner’s continued development of the Lagotto altmetrics platform. In particular, there was discussion about getting a supplementary guideline for software and data downloads included in COUNTER. This would be a great step in getting data and software usage properly counted.

Sustainability

Sustainability is one of the key questions that has been going around in the larger discussion: how do we fund the software and data resources necessary for the community? I think the distinction that arose was the need to differentiate between:

  • software as an infrastructure; and
  • software as an experiment/method.

This seems rather obvious, but the tendency is for the latter to become the former, and this causes issues, in particular for sustainability.

Issues include:

  1. It’s difficult to identify which software will become key to the community and thus where to provide the investment.
  2. Scientific infrastructure software tends to be funded on a project-to-project basis, or sometimes as a sideline of a lab.
  3. Software that begins as an experiment is often not engineered correctly.
  4. As Luis Ibanez from Google pointed out, we often lose the original developers over time and there’s a need to involve new contributors.

The Software Sustainability Institute in the UK has begun to tackle some of these problems. But there is still a lack of clear avenues for aggregating the necessary funding. One popular model is the creation of a non-profit foundation to support a piece of software, but this leads to “foundation fatigue.” Other approaches shift the responsibility to university libraries, but libraries may not have the required organizational capabilities. Katherine Skinner’s recent talk at FORCE 2015 covered some of the same ground here.

One of the interesting ideas that came up at the workshop was the use of other parts of the University institution to help tap into different funding streams (e.g. the IPR office; university development office). An example of this is Internet2 which is sponsored directly by universities. However, as pointed out by Dan Katz, to support this sort of sustainability there is a need to have insight into the deeper impact of this sort of software for the scientific community.

Conclusion

You can see a summary of the outcomes here. In particular, take a look at the critical asks. These concrete requests were formulated by the workshop attendees to address some of the identified issues. I’ll be interested to see the report that comes out of the workshop and how that can help move us forward.

NewsReader Amsterdam Hackathon

This past Wednesday (Jan. 21, 2015) I was at the NewsReader Hackathon. NewsReader is an EU project to extract events and build stories from the news. They use a sophisticated NLP pipeline combined with semantic background knowledge to perform this task. The hackathon was an opportunity to talk to members of one of the leading NLP groups in the Netherlands (CLTL) and find out more about their current pipeline. Additionally, one of the project partners is LexisNexis, a sister company of Elsevier, so it was nice to see how their content was being used as the basis for event extraction and also to meet some of my colleagues. The combination of news and research is particularly of interest in light of the recent Elsevier acquisition of NewsFlo.

Besides the chance to meet people, I also got to do some hacking myself to see how the NewsReader API worked. I used the API to plot the number and type of events featuring universities. (The resulting iPython Notebook)
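
The shape of the hack was roughly as follows. Note that the endpoint URL and the parameter/field names here are hypothetical stand-ins rather than the actual NewsReader API – the real calls are in the notebook:

```python
import requests
from collections import Counter

# Hypothetical endpoint and parameters; see the notebook for the real API.
API_URL = "https://example.org/newsreader/event_search"

resp = requests.get(API_URL, params={"actor_type": "University",
                                     "limit": 500})
events = resp.json().get("events", [])  # hypothetical response shape

# Tally events by type; this Counter is what I plotted.
counts = Counter(event.get("event_type", "unknown") for event in events)
for event_type, n in counts.most_common(10):
    print(event_type, n)
```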

A couple of pointers for future reference:

Last week, I was at FORCE 2015 – the Future of Research Communications and e-Scholarship conference held in Oxford. This is the third conference in a series that started with Beyond the PDF in 2011 and continued with Beyond the PDF 2, whose organization I led in Amsterdam in 2013 (my wrap-up is here). This conference provides one of the only forums that brings together the variety of people in the vanguard of scholarly communication, from librarians and computer scientists to researchers, funders and publishers. Pretty much every role was represented in the ~250 attendees.

To give you an idea, I saw the developers of the Papers reference manager, the editorial director of PLOS ONE, a funder from the Wellcome Trust, a librarian from the University of Iowa, and junior public policy researchers from Brazil/Germany.

The curators (i.e. conference chairs), Dave De Roure and Melissa Haendel, did a great job of pulling in a whole range of topics and styles in a great venue. We even had the opportunity to see copies of the Philosophical Transactions. Speaking from experience, this is a tough conference to organize because everything is pretty dynamic and there are lots of different styles (e.g. Dave’s last-minute beer run for the hackathon!).

So what was I doing there? I helped organize the hackathon, which gave some space to work on content extraction and on reference manager support for data citation, and for people to talk over pizza. This led to proposals for two 1k challenges. (Remember to vote for which one you want to give 1000 pounds to.) I also helped organize the poster and demo / geek-out sessions. A trailer for those sessions is below:

Themes

The conference was too packed to go through everything, but I wanted to go through the core themes that I got out of it:

1. Scholarly media is not just text

Data, images, slides, videos, software – scholarly media is not just text. It never was, but it’s clear that the primacy of text is slowly being reduced, and these other forms of output will eventually be treated on par with it. This is being made possible by the number of new platforms being introduced, whether it’s Figshare or Zenodo for data, GitHub for code, or HUBzero for the entire analytics lifecycle. It’s about sharing the actual research object rather than the textual argument. I think what brought this home to me is the amount of time spent discussing and presenting how these content types can be shoehorned into traditional text environments (e.g. journal citations).

2. Not access, understanding

The assumption at FORCE 2015 is that scholarship will be open access. The question then arises: what do you do with the open access content? Phil Bourne, in his closing remarks, mentioned the lack of things being done with the current open access corpus. This notion of the need to do more came over clearly in the keynote by Chris Lintott, founder of Galaxy Zoo:

He discussed how the literature is a barrier to amateurs contributing more to science. Specifically, he mentioned accessible research summaries. But, in general, there is a need to consider a more diverse audience in our communication: not only amateurs but also scientists from other disciplines and policy makers, for example.

3. Quality under pressure

The amount of scholarship continues to grow, and there are perverse incentives. Scott Edmunds from GigaScience brought this out in his vision talk.

The current answer to this is peer review. But as most researchers will tell you, we are already overwhelmed. I get tons of requests to review, and it’s hard to turn down my colleagues. Maybe a market for peer review will develop (see below), but what we need is more automated mechanisms of quality control, or for publishers to do more quality control before things get sent to reviewers. Maybe we should see peer review as constructive feedback and not a filter. Likewise, by valuing other parts of the system, maybe we can increase both the transparency and overall quality of the science.

4. Science as a service

The poster below from Bianca Kramer and Jeroen Bosman highlighted the explosion in services available for scholarly communication. This continues a theme that I emphasized last year and that Ian Foster has talked about – the ability to do more and more science by just calling an API. Why can’t I build my lab from a cafe?

Wrap-up & Random Notes

The FORCE community is a special one. I hope we can continue to work together to push scholarly communication forward. I’m already looking forward to FORCE 2016 in Portland. There’s lots to be excited about as the way we do research rapidly changes. Finally, here are some random notes from the conference:

Last week I got back from a great 8(!) days in Riva del Garda, Italy, attending the 2014 International Semantic Web Conference and associated events. This is one of those events where your colleagues on Facebook get annoyed with the pretty pictures of lakes and mountains that their other colleagues keep posting.


ISWC is the key conference for semantic web research and the place to see what’s happening. This year’s conference had 630 attendees, which is a strong showing for the event. The conference was, as usual, selective.
Interestingly, the numbers were about on par with last year, except for the in-use track, where we had a much larger number of submissions. I suspect this is because all tracks had synchronized submission deadlines, whereas last year the in-use deadline was after the research track. The replication, dataset, software, and benchmark track is a new addition to the conference, and a good one I might add. Having a place to present these sorts of scholarly outputs is important and, from my perspective, a good move by the conference. You can find the papers (published and in preprint form) on the website. More importantly, you can find a big chunk of the slides presented on Eventifier.

So why was I hanging out in Italy (other than the pasta)? I was a co-organizer of the Doctoral Consortium for the event.

Additionally, I was on a panel at the Context Interpretation and Meaning workshop. I also attended a pre-meeting on archiving linked data for the PRELIDA project. Lastly, we had an in-use paper in the conference on adaptive linking used within the Open PHACTS platform to support chemistry. Alasdair Gray did a fantastic job of leading and presenting the paper.

So on to the show. Three themes, which I discuss in turn:

  1. It’s not Volume, it’s Variety
  2. Variety & the Semantic Spectrum
  3. Fuzziness & Metrics

It’s not Volume, it’s Variety

I’m becoming more convinced that the issue for most “big” data problems isn’t volume or velocity, it’s variety. In particular, I think the hardware/systems folks are addressing the first two problems at a rate that means that for many (most?) workloads the software abstractions provided are enough to deal with the data sizes and speed involved. This inkling was confirmed to me a couple of weeks ago when I saw a talk by Peter Hofstee, the designer of the Cell microprocessor, talking about his recent work on computer architectures for big data.

This notion was further confirmed at ISWC. Bryan Thompson, of Bigdata triple store fame, presented his new work (mapgraph.io), which can do graph processing on hundreds of millions of nodes using GPUs, with abstractions similar to Signal/Collect or GraphLab. Additionally, as I sat in the session on large-scale RDF processing, many of the systems were focused on clustered environments but used test sets of ~100 million triples, even though you can process these with a single beefy server. It seems that online analytics workloads can be done with a simple server setup, and truly web-scale workloads will be at the level of clusters that can be provisioned fairly straightforwardly using the cloud. In our community, the best examples are webdatacommons.org and the work of the VU team on LOD Laundromat – both process graphs in the billions using the Hadoop ecosystem on either local or Amazon-based clusters. Furthermore, the best paper in the in-use track (Semantic Traffic Diagnosis with STAR-CITY: Architecture and Lessons Learned from Deployment in Dublin, Bologna, Miami and Rio) from IBM actually scrapped using a specific streaming system because even data coming from traffic sensors wasn’t fast enough to make it worthwhile.

Indeed, in Prabhakar Raghavan’s keynote (yes, the Introduction to Information Retrieval and Google guy), he noted that he would love to have problems that were just computational in nature. Likewise, Yolanda Gil discussed how the challenges lie not necessarily in data analysis but in data preparation (i.e. it’s a data mess!).

The hard part is data variety and heterogeneity, which transitions nicely into our next theme…

Variety & the Semantic Spectrum

Chris Bizer gave an update on the measurements of the Linked Data Cloud – this was a highlight talk.

The Linked Data Cloud has grown, essentially doubling (to, generously, ~1000 datasets), but schema.org-based data (see the Microdata+RDFa ISWC 2014 paper) spans ~500,000 datasets. Chris gave an interesting analysis of what he thinks this means in a nice mailing list post. The comparison is summed up below:

So what we are dealing with is really a spectrum of semantics, from extremely rich knowledge bases to shallower mark-up. (As a side note: Guha’s thoughts on schema.org are always worth a revisit.) To address this spectrum, I saw quite a few papers trying a variety of CS techniques, from NLP to databases. Indeed, two of the best papers were related to this subject:

Also on this front were works on optimizing link discovery (HELIOS), machine reading (SHELDON), entity recognition, and querying probabilistic triple stores. All of these works had in common taking approaches from other CS fields and adapting or improving them to deal with these problems of variety within a spectrum of semantics. For a feel of the rich end of that spectrum, see the sketch below.
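
Here’s a minimal, hedged SPARQL sketch against DBpedia’s public endpoint using SPARQLWrapper – endpoint availability and the exact classes/properties may vary:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Query a rich knowledge base: lakes in Italy, straight from DBpedia.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?lake ?label WHERE {
      ?lake a dbo:Lake ;
            dbo:country dbr:Italy ;
            rdfs:label ?label .
      FILTER (lang(?label) = "en")
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["label"]["value"])
```

Contrast that with the shallow end, where the same “data” might only exist as schema.org mark-up on half a million sites.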

Fuzziness & Metrics

The final theme that I pulled out of the conference was evaluation metrics – in particular, metrics that either deal with or cater for the fact that there are no hard truths, especially when using corpora developed from human judgements. The quintessential example of this is my colleague Lora Aroyo’s work on CrowdTruth – trying to capture disagreement in the process of creating gold-standard corpora in crowdsourcing environments. Another example is the very nice work from Michelle Cheatham and Pascal Hitzler on creating an uncertain OAEI conference benchmark. Raghavan’s keynote also homed in on the need for more metrics, especially as the type of search interface we typically use changes (going from keyword searches to more predictive, contextual search). This theme was also prevalent in the workshops – in particular, how do we measure in the face of changing contexts? Examples include:

A Note on the Best Reviewers

Good citizens:

A nice note: some were nominated by authors of papers that the reviewer rejected because the review was so good. That’s what good peer review is about – improving our science.

Random Notes

  • Love the work Bizer and crew are doing on Web Tables. Check it out.
  • Conferences are so good for quick lit reviews. Thanks to Bijan Parsia, who sent me in the direction of Pavel Klinov’s work on probabilistic reasoning over inconsistent ontologies.
  • grafter.org – nice site
  • Yes, you can reproduce results.
  • There’s more provenance on the Web of Data than ever. (Unfortunately, PROV is still small percentage wise.)
  • On the other hand, PROV was in many talks like last year. It’s become a touch point. Another post on this is on the way.
  • The work by Halpin and Cheney on using SPARQL update for provenance tracking is quite cool. 
  • A win from the VU: DIVE took 3rd place in the Semantic Web Challenge
  • Amazing wifi at the conference! Unbelievable!
  • +1 to the Poster & Demo crew: keeping 160 lightning talks going on time and fun – that’s hard
  • The 10-year award went to software: Protégé – well deserved
  • http://ws.nju.edu.cn/explass/
  • From Nigel’s keynote: it seems that the killer app of open data is …. insurance
  • Two years in a row that stuff I worked on has gotten a shout-out in a keynote (Social Task Networks). 😃
  • ….. I don’t think the streak will last
  • 99% of queries have nouns (i.e. entities)
  • I hope I did Sarven’s Call for Linked Research justice
  • We really ought to archive LOV – vocabularies are small but they take a lot of work. It’s worth it.
  • The Media Ecology project is pretty cool. Clearly, people who have lived in LA (e.g. Mark Williams) just know what it takes ;-)
  • Like: Linked Data Fragments – that’s the way to question assumptions.
  • A low-carb diet in Italy – lots of running