I was in southern California for essentially a big chunk of August. I had a day visit to the Information Sciences Institute (slides here), some nice discussions with friends and also a chance to hang out at the ocean. So here are 10 observations:

  1. I still think hooking up  Abstract Meaning Representation to linked data semantics is something worth trying out.
  2. What is data? I like Christine Borgman’s definition: “Data refers to entities used as evidence of phenomena for the purposes of research or scholarship” (p. 29).
  3. Silicon Beach is like a thing. Overheard in Venice, literally, “Tech dude: We need to iterate and test our MVP. Product dude: Steve Jobs didn’t ask what the market wanted. We need vision!”.
  4. “a future incarnation of Siri, Cortana or other digital companions will be more like a knowledgeable colleague than a personal assistant.” 
  5. JSON-LD + PROV + Elastic Search + lots of other stuff is awesome. I DIG it. Looking forward to hearing more at ISWC.
  6. Something to check out for altmetrics fans: Media Impact Project
  7. UCSB has a sweet campus….
  8. A nice ontology for software metadata: OntoSoft.
  9. AirBnB is great but this is the first trip where I encountered negative responses from neighbors / neighborhood.
  10. You can predict transformative scientific research


Last week, I was at the Theory and Practice of Provenance 2015 workshop (TaPP’15), held in Edinburgh. This is the seventh edition of the workshop. You can check out my trip report from last year’s event, which was held during Provenance Week, here. TaPP’s aim is to be a venue where people can present their early and innovative research ideas.

The event is useful because it brings together a cross section of researchers from different CS communities, ranging from databases, programming language theory and distributed systems to e-science and the semantic web. While it’s nice to see old friends at this event, one discussion we had over the two days was how we can connect back more strongly to these larger communities, especially as interest in provenance increases within them.

I discuss the three themes I pulled from the event but you can take a look at all of the papers online at the event’s site and see what you think.

1. Execution traces as a core primitive

I was happy to be presenting on behalf of one of my students, Manolis Stamatogiannakis, who’s been studying how to capture provenance of desktop systems using virtual machines and other technologies from the systems community. (He’s hanging out at SRI with Ashish Gehani for the summer so couldn’t make it.) A key idea in the paper we presented was to separate the capture of an execution trace from the instrumentation needed to analyze provenance (paper). The slides for the talk are embedded below:

The mechanism used to do this is a technology called record & replay (we use PANDA), but this notion of capturing a lightweight execution trace and then replaying it deterministically is also popping up in other communities. For example, Boris Glavic has been using it successfully for database provenance in his work on GProM and reenactment queries. There he uses the audit logging and time travel features of modern databases (i.e. an execution trace) to support rich provenance queries.
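To make the separation of capture and analysis concrete, here is a toy sketch in Python. It is not PANDA's API and not the implementation from our paper; the names `program`, `record` and `replay` are made up for illustration. The recording run only logs the nondeterministic inputs, and the (potentially expensive) instrumentation is attached later, during a deterministic replay.

```python
import random
from typing import Callable, List, Tuple


def _ignore(op: str, value: int) -> None:
    """Default hook: do nothing, so recording stays cheap."""


def program(read_input: Callable[[], int],
            on_step: Callable[[str, int], None] = _ignore) -> int:
    """A stand-in 'application' that sums three nondeterministic inputs."""
    total = 0
    for _ in range(3):
        value = read_input()
        on_step("read", value)  # instrumentation hook (hypothetical)
        total += value
    return total


def record(seed=None) -> Tuple[int, List[int]]:
    """Run the program once, logging only its nondeterministic inputs."""
    rng = random.Random(seed)
    trace: List[int] = []

    def recording_input() -> int:
        value = rng.randint(0, 100)
        trace.append(value)
        return value

    return program(recording_input), trace


def replay(trace: List[int], on_step: Callable[[str, int], None]) -> int:
    """Re-run the program from the trace, now with analysis attached."""
    inputs = iter(trace)
    return program(lambda: next(inputs), on_step)


result, trace = record()
events = []  # provenance-style events gathered only during replay
replayed = replay(trace, lambda op, value: events.append((op, value)))
assert replayed == result  # replay is deterministic
```

The point of the design is that you can replay the same trace many times with different analysis hooks, without re-running (or slowing down) the original execution.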

This need to separate capture from queries was emphasized by David Gammack and Adriane Chapman’s work on trying to develop agent based models to figure out what instrumentation needs to be applied in order to capture provenance. Until we can efficiently capture everything, this is still going to be a stumbling block for completely provenance aware systems. I think that thinking about execution traces as a core primitive for provenance systems may be a way forward.

2. Workflow lessons in non-workflow environments

There are numerous benefits to using (scientific) workflow systems for computational experiments, one of which is that they provide a good mechanism for capturing provenance in a declarative form. However, not all users can or will adopt workflow environments. Many use computational notebooks (e.g. Jupyter) or just shell scripts. The YesWorkflow system (very inside community joke here) uses convention and a series of comments to help users produce a workflow and provenance structure from their scripts and file system (paper). Likewise, work on combining noWorkflow, a provenance tracking system for Python, and IPython notebooks shows real promise (paper). This reminded me of the PROV-O-Matic work by Rinke Hoekstra.
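As a rough illustration of the comment-based idea, here is a small Python sketch that pulls a workflow structure out of an annotated shell script. It handles only a simplified subset of YesWorkflow-style tags (`@begin`, `@end`, `@in`, `@out`) and is not the actual YesWorkflow tool; the script contents and the helper names `extract_steps` and `dataflow_edges` are invented for the example.

```python
import re
from typing import Dict, List, Tuple

# A made-up shell script annotated with workflow comments.
SCRIPT = '''\
# @begin clean_data
# @in raw_file
# @out clean_file
cat raw.csv | grep -v NA > clean.csv
# @end clean_data

# @begin plot
# @in clean_file
# @out figure
python plot.py clean.csv figure.png
# @end plot
'''

TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")


def extract_steps(script: str) -> List[Dict]:
    """Collect each annotated block with its declared inputs and outputs."""
    steps, current = [], None
    for line in script.splitlines():
        match = TAG.match(line.strip())
        if not match:
            continue  # ordinary code line, ignored by the extractor
        tag, name = match.groups()
        if tag == "begin":
            current = {"name": name, "in": [], "out": []}
        elif tag == "end":
            if current is not None:
                steps.append(current)
            current = None
        elif current is not None:
            current[tag].append(name)
    return steps


def dataflow_edges(steps: List[Dict]) -> List[Tuple[str, str, str]]:
    """Link a step's output to every step that declares it as an input."""
    return [(producer["name"], data, consumer["name"])
            for producer in steps
            for consumer in steps
            for data in producer["out"]
            if data in consumer["in"]]


steps = extract_steps(SCRIPT)
edges = dataflow_edges(steps)
print(edges)  # [('clean_data', 'clean_file', 'plot')]
```

Note that the script itself is never executed here: the structure comes purely from the comments, which is what makes the approach language independent.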

Of course you can combine yesWorkflow and noWorkflow together into one big system.

Overall, I like the trend towards applying workflow concepts in-situ. It got me thinking about applying the scientific workflow results to the abstractions provided by Apache Spark. Just a hunch that this might be an interesting direction.

3. Completing the pipeline

The last theme I wanted to pull out is that I think we are inching towards being able to truly connect provenance generated by applications. My first example, is the work by Glavic and his students on importing and ingesting PROV-JSON into a database. This lets you query the provenance of query results but include information on the pipelines that got it there.
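To give a feel for what ingesting PROV-JSON into a database buys you, here is a minimal Python sketch using SQLite. This is not how GProM does it; the two-table schema, the `ex:` identifiers and the `load_prov` helper are assumptions made for illustration. It loads the `used` and `wasGeneratedBy` relations from a small PROV-JSON document and then asks, in SQL, for everything a result was derived from.

```python
import json
import sqlite3

# A small hand-written PROV-JSON document (identifiers are made up).
PROV_JSON = '''
{
  "entity":   {"ex:raw": {}, "ex:clean": {}, "ex:report": {}},
  "activity": {"ex:cleaning": {}, "ex:analysis": {}},
  "used": {
    "_:u1": {"prov:activity": "ex:cleaning", "prov:entity": "ex:raw"},
    "_:u2": {"prov:activity": "ex:analysis", "prov:entity": "ex:clean"}
  },
  "wasGeneratedBy": {
    "_:g1": {"prov:entity": "ex:clean",  "prov:activity": "ex:cleaning"},
    "_:g2": {"prov:entity": "ex:report", "prov:activity": "ex:analysis"}
  }
}
'''


def load_prov(conn: sqlite3.Connection, doc: dict) -> None:
    """Store the two core PROV relations in plain relational tables."""
    conn.execute("CREATE TABLE used (activity TEXT, entity TEXT)")
    conn.execute("CREATE TABLE generated (entity TEXT, activity TEXT)")
    for rel in doc.get("used", {}).values():
        conn.execute("INSERT INTO used VALUES (?, ?)",
                     (rel["prov:activity"], rel["prov:entity"]))
    for rel in doc.get("wasGeneratedBy", {}).values():
        conn.execute("INSERT INTO generated VALUES (?, ?)",
                     (rel["prov:entity"], rel["prov:activity"]))


# Recursive query: follow generated -> used links back to the sources.
LINEAGE = """
WITH RECURSIVE upstream(entity) AS (
    SELECT u.entity
    FROM generated g JOIN used u ON u.activity = g.activity
    WHERE g.entity = ?
  UNION
    SELECT u.entity
    FROM upstream s
    JOIN generated g ON g.entity = s.entity
    JOIN used u ON u.activity = g.activity
)
SELECT entity FROM upstream ORDER BY entity
"""

conn = sqlite3.connect(":memory:")
load_prov(conn, json.loads(PROV_JSON))
ancestors = [row[0] for row in conn.execute(LINEAGE, ("ex:report",))]
print(ancestors)  # ['ex:clean', 'ex:raw']
```

Once the pipeline provenance sits in tables like these, it can be joined with whatever provenance the database itself tracks for query results, which is the part that makes the end-to-end story interesting.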

This is something I’ve wanted to do for ages with Marcin Wylot’s work on TripleProv. I was a bit bummed that Boris got there first but I’m glad somebody did it 🙂

The second example was the continued push forward for provenance in the systems community. In particular, the OPUS and SPADE systems, which I was aware of, but now also the work on Linux Provenance Modules by Adam Bates that was introduced to me at TaPP. These all point to the ability to leverage key operating system constructs to capture and manage provenance. For example, Adam showed how to make use of mandatory access control policies to provide focused and complete capture of provenance for particular applications.

I have high hopes here.

Random thoughts

To conclude I’ll end with some thoughts from the notebook.

I hope to see many familiar and new faces at next year’s Provenance Week (which combines TaPP and IPAW).

From Florence, I headed to Washington D.C. to attend the Society for Scholarly Publishing Annual Meeting (SSP) last week. This conference is a big event for academic publishers. It’s primarily attended by people who work either for publishers or for companies that provide services for them. It also includes a smattering of librarians and technologists. The demographic was quite different from WWW – more in suits and more women (attention CS community). This reflects the make-up of the academic publishing industry as a whole, as shown by the survey done by Amy Brand, which was presented at the beginning of the conference. Full data here. Here’s a small glimpse.

2015-05-27 16.12.52

Big Literature, Big Usage

I was at SSP primarily to give a talk and then be on a panel in the Big Literature, Big Usage session. The session with Jan Velterop and Paul Cohen went well. Jan presented the need for large scale analysis of the literature in order for science to progress (shout outs to the Steve Pettifer-led Lazarus project). He had a great set of slides showing how fast one would have to read in order to read every paper being deposited in PubMed (the pages just flash by). I followed up with a discussion of recent work on Machine Reading (see below) and how it impacts publishers. The aim was to show that automated analysis and reading of the literature is not somewhere off in the future but is viable now.

Paul Cohen from DARPA followed up with a discussion of their Big Mechanism program. This effort is absolutely fascinating. Paul’s claim was that we currently cannot understand large scale complex systems. He characterized this as a pull vs. a push approach. Paraphrasing, we currently try to pull the knowledge into individuals’ heads and then make connections (i.e. build a model), vs. a push approach where we push all the information from individuals out and have the computer build a model. The former makes understanding such large scale systems for all intents and purposes impossible. To attack this problem, the program’s aim is to automatically build large scale causal computational models directly from the literature. Paul pointed out there are still difficulties with machine reading (e.g. coreference resolution is still a challenge); however, the progress is there. Amazingly, they are having success with building models in cancer biology. Both the vision and the passion in Paul’s talk were just compelling. (As a nice aside, folks at Elsevier (e.g. Anita De Waard) are a small part of the project.)

We followed up with a Q/A panel session. We discussed the fact that all of us believe that sharing/publishing of computational models is really the future. It’s just unfortunate to lock this information up in text. We answered a number of questions around feasibility (yes, this is happening even if it’s hard). Also, we discussed the impossibility of doing some of this science without having computers deeply involved. Things are just too complicated and we are not getting the requisite productivity.

So that was my bit. I also got to attend a number of sessions and catch up with a number of people. What’s up @MarkHahnel  and @jenniferlin15  – thanks for letting me heckle from the back of the room 🙂

Business Stuff

I attended a number of sessions that discussed the business of academic publishing. The big factor seems to be fairly flat growth in library budgets but with a growing number of journals. This was mentioned in Jayne Marks from Wolters Kluwer’s talk as well as a whole session on where to get growth. I thought the mergers and acquisition talk from @lamb was of interest. It seems that there is even more room for consolidation in the industry.


I also feel that the availability of large amounts of cheap capital has not been taken full advantage of in the industry. Beyond consolidation, new product development seems to be the best way to get growth. I think one notion that’s of interest is the transition towards the Big Funder Deal, where funders are essentially paying in bulk for their research to be published.

I enjoyed the cost of OA business models session. A very interesting set of slides from Robert Kiley about Wellcome Trust’s open access journal spend is embedded below. This is a must look at in terms of where costs are coming from. It is a clarion call to all publishers in terms of delivering what they say they are going to deliver.

Pete Binfield of PeerJ gave an insightful talk about the disruptive nature of non-legacy digital only OA publishers. However, I think it may overestimate the costs of the current subscription infrastructure. Also, as Marks noted in her talk, 50% of physicians still require print and a majority of students want print textbooks. I wonder how much this is the predominant factor in legacy costs?


Overall, throughout the sessions, it still felt a bit … umm…. slow. We are still talking papers, maybe with a bit of metrics or data in for spice, but I think there’s much more to be done in helping us scientists do better and that scholarly media/information providers.

Media & Technology

The conference lined up three excellent keynote sessions. The former CEO of MakerBot, Jenny Lawton, gave a “life advice” talk. I think the best line was “do a gap analysis on yourself”. Actually, the most interesting bit was her answer to a question about open source. Her answer was that we live in a very IP and patent oriented world and, if you want to be an open company, it’s important to figure out how to work strategically in that world. The interview format with the New Yorker author Ken Auletta worked great. His book Googled: The End of the World As We Know It is now on my to-read list. A couple of interesting points:

  • New York Times subscribers read 35 minutes a day (print) vs 35 minutes a month (online)
  • Human factors drive more decisions in the highest levels of business than let on.
  • Editorial and fact checking at the level of the New Yorker is a game changer for an author.
  • He’s really big on having a calling card.

Finally, Charles Watkinson gave a talk about how monograph publishing is experimenting with digital and how it’s adopting many of the same features as journal articles. He called out Morgan and Claypool’s Synthesis Series as an example innovator in this space (I’m an editor 😉).

I always enjoy T Scott Plutchak’s talks. He talked about his new role in bringing together data wranglers across his university. He made a good case that this role is really necessary in today’s science. I agree. But it’s unclear how one can keep the talent needed for data wrangling within academia especially in the library.

Overall, SSP was useful in understanding the current state and thinking of this industry.

Random Thoughts

Last week I was in Florence, Italy for the 24th International World Wide Web Conference (WWW 2015). This is the leading computer science conference focused on web technology writ large. It’s a big conference – 1400 attendees this year. WWW is excellent for getting a good bearing on the latest across multiple subfields in computer science. Another way to say it is that I run into friends from the semantic web community, NLP community, data mining community, web standards community, the scholarly communication community, etc. I think on the Tuesday night I traversed four different venues hanging out with various groups.

This is the first time since 2010 that I attended WWW. It was good to be back. I was there the entire week so there was a ton but I’ll try to boil what I saw down into 3 takeaways. But first…

What was I doing there?

First, I co-authored a research track paper with Marcin Wylot and Philippe Cudré-Mauroux of the eXascale Infolab (cool name) on Executing Provenance Queries over Web Data (slides, paper). We showed that because of the highly selective nature of provenance on the web of data, we can actually improve query performance within a triple store. I was super happy to have this accepted given the ~14% (!) acceptance rate.

Second, I gave the opening talk of the Semantics, Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD) workshop. I discussed the current state of scholarly productivity and used the notion of the burden of knowledge as a motivation for knowledge graphs as a mechanism to help increase that productivity. I even went web for my slides.

Continuing on the theme of knowledge graphs, I participated on a panel in the industry track around knowledge graphs. More thoughts on this coming up.

Knowledge graph panel www

The Takeaways

From my perspective there were three core takeaways:

  1. Knowledge Graphs/Bases everywhere
  2. Assume the Web
  3. Scholarly applications are interesting applications

1. Knowledge Graphs/Bases everywhere

I could call this Entities everywhere. Perhaps it was the sessions I chose to attend, but it felt like when I was at the conference in 2010, where every other paper was about online advertising. There were a ton of papers on entity linking, entity disambiguation, entity (etc.), and many others had knowledge base construction as a motivation.


There were two tutorials on knowledge graphs. Both of them were full, and the one from Google/Facebook involved moving to a completely new room. Both were excellent. The one from the Yago team has really good material. As a side note, it was interesting to sit in on tutorials where I already have a decent handle on the material. It let me compare my own intellectual framework for the material with others out there. For example, I liked the Yago tutorial’s distinction between source-centric and yield-centric information extraction and how we pursue the yield approach when doing automated knowledge base construction. A recommended exercise for the reader.

Beyond just being a plethora of stuff, I think our panel discussion highlighted themes that appeared across several papers.

Dealing with long tail entities
In general, approaches to knowledge base construction have relied on well known entities (e.g. Wikipedia) and frequency (you’re mentioned a lot, you’re an entity). For many domain specific entities, for example in science, and also for emergent entities, this is a challenge. A number of authors tried to tackle this by:

  • looking at web page titles as a potential data source for entities (Song et al.)
  • using particular types of web tables to help assign entities to classes (Wang et al.)
  • using social context to help entity extraction (Jie Tang et al.)
  • discovering new meta relations between entities (Meng et al.)

All the organizations on the industry panel spend significant resources on quality maintenance of their knowledge graphs. The question here is how to best decrease the amount of human input and increase automation.

An interesting example that was talked about quite frequently is the move of Freebase to Wikidata. Wikidata runs under the same guidelines as Wikipedia, so all facts need to have claims grounded in sources from the Web. Well, it turns out this is difficult because many facts are sourced from Wikipedia itself. This kind of, dare I say it, provenance is really important. Most current large scale knowledge graphs support provenance, but as we automate more it would be nice to be able to automatically judge these sources using that provenance.

One paper that I saw that addressed quality issues was GERBIL – General Entity Annotator Benchmarking Framework. This 25 author paper (!) devised a common framework for testing entity linking tools. It’s great to see the community looking at these sorts of common QA frameworks.

This seemed to be bubbling up. On the panel, the company Tagasauris was looking at constructing a mediaGraph by analyzing video content. During the Yago tutorial, the presenters mentioned potential future work on extracting common sense knowledge by looking at videos. In general, both extracting facts from multimedia and using knowledge graphs to understand multimedia seem like a challenging but fruitful area. One particular example was the paper “Tagging Personal Photos with Transfer Deep Learning”. What was cool was the injection of a personal photo ontology into the training of the network as priors. This led to better results and, probably more importantly, decreased the training time. Another example is the work from Gerhard Weikum’s group on extracting knowledge from movie scripts.

Finally, as I commented at the Linked Data on the Web Workshop, the growth of knowledge graphs is a triumph of the semantic web and linked data. Making knowledge bases open and available on the Web using reusable schemes has really been a boon to the area.

2. Assume the Web

It’s obvious but is worth repeating: the web is really big!

These stats were from Andrei Broder’s excellent keynote. The size of the web motivates the need for better web technology (e.g. search) and as that improves so do our expectations. Broder called out three axes of progress:

  1. scaling up with quality
  2. faster response
  3. higher functionality levels

We progress on all these dimensions. But the scale of the web doesn’t just change the technology we need to develop but it changes our methods.

For example, a paper I liked a lot was “Leveraging Pattern Semantics for Extracting Entities in Enterprises”. This bears a resemblance to problems we face extracting entities that are not found on the web because they’re only mentioned within a private environment (e.g. internal product names). But even in this environment they rely on the Web. They rank the semantic patterns they extract by using relations extracted from the web.

For me, it means that even if the application isn’t necessarily for “the web”, I should think about the web as a potential part of the solution.

3. Scholarly applications are interesting applications

I’m biased, but I think scholarly applications are particularly interesting and you saw that at WWW. I attended two workshops dealing with technology and scholarship: SAVE-SD and Big Scholar. I was particularly impressed with the scholarly knowledge graph that’s being built on top of the Bing Satori Knowledge Graph, which covers venues, authors, papers, and organizations from 100 million papers. (It seems there are probably 120 million total on the web.) At their demo they showed some awesome queries that you can do, like “papers on multiple sclerosis citing artificial intelligence”. Another example is venues appearing in the side of Bing searches with related venues, due dates, etc.:

See Kuansan Wang’s (@kuansanw) talk for more info (slides). As far as I understand, MSR will also be releasing the Microsoft Academic Graph for experimentation in a couple of weeks. Based on this graph, MSR is co-organizing the WSDM Cup in 2016 with Antonio Gulli from Elsevier.

It was a pleasure to meet C. Lee Giles of CiteSeerX. It was good seeing an overview of that system and he had some good pointers (e.g. GROBID for metadata extraction and ParsCit for citation extraction).

From SAVE-SD there were two papers that caught my eye:

There were also a number of main track papers that applied methods to scholarly content.

Overall, WWW 2015 was a huge event so this trip report really is just what I could touch. I didn’t even get the chance to go to the W3C sessions and Web Science talks. You can check out all the proceedings here, definitely worth a look.

Random thoughts

  • The web isn’t scale free – it’s log-log. Gotta check out Clauset et al 2009, Power-law distributions in empirical data
  • If you’re a researcher, remember that Broder’s “A taxonomy of web search” was originally rejected from WWW 2002. It now has 1700+ citations.
  • Aidan Hogan + 1 for colorful slides and showing that we need to just deal with blank nodes and not get so hung up about it.  (paper, code)
  • If you do machine learning, do your parameter studies. Most papers had them.
  • PROV and information diffusion combined. So awesome.
  • Ah conference internet… It’s always hard.
  • People are hiring like crazy. Booths from Baidu, Facebook, Yahoo, LinkedIn. Oh, and never discount how frisbees can motivate highly educated geeks.
  • On the hiring note, I liked how the companies listed their attendees and their talks.
  • Tons and tons of talks with authors from companies. I should really do some stats. It was like every paper.
  • Italy, food, florentine steak – yummy!
  • Corollary, running is necessary but running in Florence is beautiful. Head by the Duomo across the river and up through the gardens.
  • What you can do with Foursquare data: [photo]
  • Larry and Sergey won the test of time award.
  • Gotta ask the folks at Insight about their distributional semantics work.

I was proud to be part of the Open PHACTS project for three years. The project built a platform for drug discovery that integrates multiple different kinds of chemistry and biological data, currently connecting information about compounds, targets, pathways, diseases and tissues. The platform is still going strong and is now supported by a foundation that is supported by its users from companies such as GSK, Roche, Janssen and Lilly. The foundation is also involved in several projects such as Big Data for Europe. The project was large and produced many outputs including numerous publications. I wanted to tell a brief story of Open PHACTS by just categorizing the publications. This will hopefully help people navigate the results of the project. Note, I removed the authors for readability but click through to find all the great people who did this work.


Speaks for itself…

Use cases

The information needs of drug discovery scientists. 83 use cases gathered and analyzed. 20 prioritized use case questions as the result.

Platform design and construction

Semantic technologies are great for integration – How do we get them to be fast and easy for developers? Leverage APIs

Applying the platform to do drug discovery

Can the platform do what it says it can do? Yep. 16/20 use case questions could be answered, plus some we didn’t think of. Plus, some cool end-user applications (e.g. the Open PHACTS Explorer and Chembionavigator).

Interesting computer science

Along the way we addressed some computer science challenges like: How do we scale up querying over RDF? How do we deal with the multiplicity of mappings? How do we mix commercial, private and public data?

Supporting Better Data

The project supported data providers in creating and updating RDF versions of their datasets.

Supporting Interoperability

Many members of the project worked within a number of communities to develop specifications that help with dataset description (especially in terms of provenance) and interchange.

Overall, the Open PHACTS project not only delivered a data integration platform for drug discovery but also helped  through the construction of more interoperable datasets and lessons about how to construct such platforms. I look forward to seeing what happens as the platform continues to be developed but maybe more importantly the impact of the results of the project as they diffuse.

This past week I was asked to attend an offsite meeting of a local research group where they were discussing ethics. They asked me to present a topic around ethics within science and scholarship. This gave me an opportunity to try to condense some of my recent thoughts. Roughly, I’ve been playing around with the idea that there is a growing conflict between how those outside of scholarship view the practice of scholarship (“an ideal”) and how the actually messy practice of it works (“the norms”). In the slides below, I try to make a start of an argument that we should be clear about the norms that we have. Articulate them and embrace them. I try to boil this down into two:

  1. be transparent,
  2. embrace the iterative nature of scholarship

I’d love to hear your thoughts on this line of thinking.

Earlier this week, I attended the SNN Symposium –  Intelligent Machines. SNN is the Dutch foundation for Neural Networks, which coordinates the Netherlands national platform on machine learning, which connects most of the ML groups in the Netherlands.

It’s not typical for a one-day, Dutch-specific academic symposium to sell out – but this one did. That’s down to the topic (machine learning is hot!) but also the speakers. The organizers put together a great line-up:

It’s not typical to get essentially 4 keynotes in one day. Instead of going through each talk in turn, I’ll try to draw some of the major items that I took away from across the talks.

The Case for Probability Theory

Both Prof. Ghahramani and Dr. Herbrich made strong arguments for probability as the core way to think about machine learning/intelligence, and in particular for a Bayesian view of the world. Herbrich summarized the argument to use probability as:

  • Probability is a calculus of uncertainty (argued using the “naturalness” of Cox Axioms)
  • It maps well to computational systems (factor graphs allow for computational distribution)
  • It decouples inference, prediction and decision
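To see what that last bullet means in practice, here is a tiny Python sketch with a Beta-Bernoulli model. It is my own toy illustration, not something from either talk; the function names and the numbers are made up. Inference, prediction and decision are three separate steps, and only the last one knows anything about costs.

```python
from fractions import Fraction


def infer(prior_a: int, prior_b: int, successes: int, failures: int):
    """Inference: Beta(a, b) prior plus Bernoulli data gives a Beta posterior."""
    return prior_a + successes, prior_b + failures


def predict(a: int, b: int) -> Fraction:
    """Prediction: posterior predictive probability that the next trial succeeds."""
    return Fraction(a, a + b)


def decide(p_success: Fraction, gain: int, loss: int) -> bool:
    """Decision: act only if the expected utility of acting is positive."""
    return p_success * gain - (1 - p_success) * loss > 0


a, b = infer(1, 1, successes=7, failures=3)  # uniform prior, 10 observations
p = predict(a, b)                            # Beta(8, 4) gives 8/12 = 2/3

print(p)                            # 2/3
print(decide(p, gain=1, loss=1))    # True:  2/3 - 1/3 > 0
print(decide(p, gain=1, loss=3))    # False: 2/3 - 1 < 0
```

The same posterior supports both decisions: changing the stakes changes the action without touching the inference, which is exactly the decoupling being argued for.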

Factor Graphs!

For me, it was a nice reminder to think of optimization as an approximation for computing probabilities. More generally, coming back to a simplified high-level framework makes understanding the complexities of the algorithms easier. Ghahramani did a great job of connecting this framework with the underlying mathematics. Slides from his ML course are here – unfortunately without the lecturer himself.

The Rise of Reinforcement Learning

The presentations by Daan Wierstra and Sethu Vijayakumar both featured pretty amazing demos. Dr. Wierstra was on the team that developed algorithms that can learn to play Atari games purely from pixels and a knowledge of the game score. This uses reinforcement learning to train a convolutional neural network. The key invention here was to keep around the past experience when providing input back into the neural network.
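That idea of keeping past experience around is usually called experience replay. Here is a minimal Python sketch of a replay buffer, assuming transitions are simple `(state, action, reward, next_state, done)` tuples. The class and its parameters are my own illustration, not DeepMind's code, and the network training step is left out entirely.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of past transitions, sampled uniformly at random."""

    def __init__(self, capacity: int, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest transitions fall off
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done) -> None:
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Random mini-batches break the correlation between consecutive frames.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self) -> int:
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for t in range(5):  # push five fake transitions; only the last three are kept
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = buffer.sample(2)  # this batch would be fed to the network update
print(len(buffer), len(batch))  # 3 2
```

In a full agent, each game step pushes one transition and each training step samples a batch, so the network keeps seeing old experience alongside new.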

Likewise, Prof. Vijayakumar showed how robots can also learn via reinforcement. Here’s an example of a robot arm learning to balance a pole.

Reinforcement learning can help attack the problem of data efficiency that’s faced by machine learning. Essentially, it’s hard to get enough training data, let alone labelled training data. We’ve seen the rise of unsupervised methods to take advantage of the data we do have. (Side note: unsupervised approaches just keep getting better.) But by situating the agent in an environment, it’s easier to provide the sort of training necessary. Instead of examples, one needs to provide the appropriate feedback environment. From Wierstra’s talk, again the apparent difficulty for reinforcement learning is temporal abstraction – using knowledge from the past to learn. Both the Atari and robot examples receive fairly immediate reinforcement on their tasks.

This takes us back to the classic ideas of situated cognition and of course the work of Luc Steels.

Good Task Formulation

Sometimes half the battle in research is coming up with a good task formulation. This is obvious but it’s actually quite difficult. What struck me was that each of the speakers was good at formulating their problem and the metrics by which they can test it. For example, Prof. Ghahramani was able to articulate his goals and measure of success for the development of the Automatic Statistician – a system for finding a good model of a given dataset and providing a nifty human readable and transparent report. Here’s one for affairs 🙂

(Side note: the combination of parameter search and search through components reminds me of work on the Wings Workflow environment.)

Likewise, Dr. Herbrich was good at translating the various problems faced within Amazon into specific ML tasks. For example, here’s his definition for Content Linkage:



He then broke this down into specific well defined tasks through the rest of the talk. The important thing here is to keep coming back to these core tasks and having well defined evaluation criteria. (See also Watson’s approach)

Attacking General AI?

Deep Mind - general AI

One thing that stood out to me was the audacity of the Google DeepMind goal – to solve General AI. Essentially, designing “AI that can operate over a wide range of tasks”. Why now? Wierstra emphasized the available compute power and advances in different algorithms. I thought the interesting comment was that they have something like a 30 year time horizon within a company. Of course, funding may not last long, but articulating that goal and demonstrably attacking it is something that I would expect more from academia. Indeed, I wonder if we are not thinking big enough. They already have very impressive results: the Atari example but also their DRAW algorithm for learning to generate images:

I also like their approach of Neural Turing Machines – using a recurrent neural network to create a computer itself. By adding memory to neural networks they’re trying to tackle the “memory” problem discussed above.

Overall, it was an invigorating day.

Random thoughts:

  • Robot demos are cool!

  • Text Kernel and Potsdam’s use of word2vec for entity extraction in CVs was interesting.