Last week, I was at Provenance Week 2016. This event happens once every two years and brings together a wide range of researchers working on provenance. You can check out my trip report from the last Provenance Week in 2014.  This year Provenance Week combined:

For me, Provenance Week is like coming home, lots of old friends and a favorite subject of mine. It’s also a good event to attend because it crosses the subfields of computer science, everything from security in operating systems to scientific workflows on to database theory. In one day, I went from a discussion on the role of indirection in data citation to staring at the C code of a database. Marta, Boris and Sarah really put together a solid program. There were about 60 attendees across the four days:

[Photo: Provenance Week 2016]

So what was I doing there? Having served as co-chair of the W3C PROV working group, I thought it was important to be at the “PROV: Three years later” event, where we reflected on the status of PROV, its uptake and usage. I presented some ongoing work on measuring the usage of provenance on the web of data. Additionally, I gave the presentation of joint work led by my student Manolis Stamatogiannakis and done in conjunction with Ashish Gehani’s group at SRI. The work focused on using benchmarks to help inform decisions on what provenance capture system to use. Slides:

I’ll now walk through my 3 big takeaways from the event.

Provenance to attack Advanced Persistent Threats

DARPA’s $60 million Transparent Computing program explicitly calls out the use of provenance to address the problem of what’s called an Advanced Persistent Threat (APT). APTs are attacks that are long term, look like standard business processes, and involve the attacker knowing the system well. This has led to a number of groups exploring the use of system-level provenance capture techniques (e.g. SPADE and OPUS) and then integrating the results from multiple distributed sources using PROV-inspired data models. This was well described by David Archer in his talk as assembling multiple causal graphs from event streams. James Cheney’s talk on provenance segmentation also addressed these issues well. This reminded me somewhat of the work on distributed provenance capture using structured logs that the Netlogger and Pegasus teams do; however, they leverage the structure of a workflow system to help with the assembly.
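
To make the idea of assembling a causal graph from event streams a bit more concrete, here is a minimal Python sketch. The event format (timestamp, process, action, artifact), the host and file names, and both function names are assumptions made up for illustration; this is not the output format of SPADE or OPUS, nor the method from any of the talks.

```python
from collections import defaultdict

# Hypothetical events from two capture sources: (timestamp, process, action, artifact).
stream_a = [
    (1, "hostA:proc1", "write", "/shared/report.txt"),
    (4, "hostA:proc2", "read", "/shared/report.txt"),
]
stream_b = [
    (2, "hostB:proc7", "read", "/shared/report.txt"),
    (3, "hostB:proc7", "write", "/tmp/copy.txt"),
]

def build_causal_graph(*streams):
    """Merge the streams in timestamp order and map each node to its direct causes.

    Assumes synchronized clocks. Artifacts are not versioned here, so a read is
    linked to every writer of the artifact, even later ones.
    """
    depends_on = defaultdict(set)
    for _, process, action, artifact in sorted(e for s in streams for e in s):
        if action == "read":
            depends_on[process].add(artifact)   # the process was influenced by the artifact
        elif action == "write":
            depends_on[artifact].add(process)   # the artifact was generated by the process
    return depends_on

def ancestors(graph, node):
    """Return every node that `node` transitively depends on."""
    seen, stack = set(), [node]
    while stack:
        for cause in graph.get(stack.pop(), ()):
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen

graph = build_causal_graph(stream_a, stream_b)
lineage = ancestors(graph, "/tmp/copy.txt")
```

Here `lineage` traces the copy on host B back through the shared file to the process on host A that wrote it, which is the kind of cross-source question a merged graph is meant to answer.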

I particularly liked Yang Ji, Sangho Lee and Wenke Lee’s work on using user-level record and replay to track and replay provenance. This builds upon some of our work that used system-level record and replay as a mechanism for separating provenance capture and instrumentation, but now in user space using the nifty rr tool from Mozilla. I think this thread of being able to apply provenance instrumentation after the fact on an execution trace holds a lot of promise.

Overall, it’s great to see this level of attention on the use of provenance for security and, more broadly, on using long-term records of provenance to do analysis.

PROV as the starting point

Given that this was the ten year anniversary of IPAW, it was appropriate that Luc Moreau gave one of the keynotes. As one of the real drivers of the community, Luc gave a review of the development of the community and its successes. One of those outcomes was the W3C PROV standards.

Overall, it was nice to see the variety of uses of PROV and the tools built around it. It’s really become the jumping off point for exploration. For example, Pete Edwards’ team combined PROV and a number of other ontologies (including P-Plan) to create a semantic representation of what’s going on within a professional kitchen in order to check food safety compliance.

Another example is the use of PROV as a jumping off point for the investigation into the provenance model of HL7 FHIR (a new standard for electronic healthcare records interchange).

As a whole, I think the attendees felt that what was missing was an active central point to see what was going on with PROV and pointers to resources for implementation. The aim is to make sure that the W3C PROV wiki is up-to-date and is a better resource overall.

Provenance as lens: Data Citation, Documents & Versioning

An interesting theme was the use of provenance concepts to give a frame for other practices. For example, Susan Davidson gave a great keynote on data citation and how using a variant of provenance polynomials can help us understand how to automatically build citations for various parts of curated databases. The keynote was based off her work with James Frew and Peter Buneman that will appear in CACM (preprint). Another good example of provenance to support data citation was Nick Car’s work for Geoscience Australia.

Furthermore, the notion of provenance as the substructure for complex documents appeared several times. For example, the Impacts on Human  Health of Global Climate Change report from globalchange.gov uses provenance as a backbone. Both the OPUS and PoeM systems are exploring using provenance to generate high-level experiment reports.

Finally, I thought David Koop’s versioning of version trees showed how using provenance as a lens can help better understand versioning of version trees themselves. (I have to give David credit for presenting a super recursive concept so well.)

Overall, another great event and I hope we can continue to attract new CS researchers focusing on provenance.

Random Notes

  • PROV in JSON-LD – good for streaming
  • Theoretical provenance paper recipe = extend provenance polynomials to deal with new operators. Prove nice result. e.g. now for Linear Algebra.
  • Prefixes! R-PROV, P-PROV, D-PROV, FS-PROV, SC-PROV, — let me know if I missed any..
  • Intel Secure Guard Extensions (SGX) – interesting
  • Surprised how dependent I’ve become on taking pictures in conferences for note taking. Not being able to really impacted my flow. Plus, there are fewer pictures for this post.
  • Thanks to Adriane for hosting!
  • A provenance based data science environment
  • 👍Learning Health Systems – from Vasa Curcin
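
On the “PROV in JSON-LD” note above: one plausible reading of “good for streaming” (my assumption, the note does not spell it out) is that each provenance statement can be shipped as a small self-contained JSON document, one per line. A minimal sketch, where the `prov:` terms come from the W3C PROV-O vocabulary and the `ex:` identifiers and function names are made up for illustration:

```python
import json

# The PROV namespace is the real W3C one; the example.org identifiers are invented.
CONTEXT = {"prov": "http://www.w3.org/ns/prov#", "ex": "http://example.org/"}

def prov_record(entity, activity, agent):
    """Build one self-contained JSON-LD document: `entity` was generated by `activity`."""
    return {
        "@context": CONTEXT,
        "@id": entity,
        "@type": "prov:Entity",
        "prov:wasGeneratedBy": {
            "@id": activity,
            "@type": "prov:Activity",
            "prov:wasAssociatedWith": {"@id": agent, "@type": "prov:Agent"},
        },
    }

def to_stream(records):
    """Serialize records as newline-delimited JSON, one document per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)

stream = to_stream([
    prov_record("ex:chart1", "ex:plotting", "ex:alice"),
    prov_record("ex:report1", "ex:writing", "ex:bob"),
])
lines = stream.splitlines()
```

Because every line carries its own `@context`, a consumer can parse and interpret each record without waiting for the rest of the stream, at the cost of repeating the context.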

Next week is the 2015 International Semantic Web Conference. I had the opportunity with Michel Dumontier to chair a new track on Datasets and Ontologies. A key part of the Semantic Web has always been shared resources, whether it’s common standards through the W3C or open datasets like those found in the LOD cloud. Indeed, one of the major successes of our community is the availability of these resources.

ISWC over the years has experimented with different ways of highlighting these contributions and bringing them into the scientific literature. For the past couple of years, we have had an evaluation track specifically devoted to reproducibility and evaluation studies. Last year datasets were included to form a larger RDBS track. This year we again have a specific Empirical Studies and Evaluation track alongside the Data & Ontologies track.

The reviewers had a tough job for this track. First, it was new so it’s hard to make a standard judgment. Secondly, we asked reviewers not only to review the paper but the resource itself along a number of dimensions. Overall, I think they did a good job. Below you’ll find the resources chosen for presentation at the conference and a brief headline of what to me is interesting about the paper. In the spirit of the track, I link to the resource as well as the paper.

Datasets

  • Automatic Curation of Clinical Trials Data in LinkedCT by Oktie Hassanzadeh and Renée J Miller (paper) – clinicaltrials.gov published as linked data in an open and queryable form. This resource has been around since 2008. I love the fact that they post downtime and other status info on twitter https://twitter.com/linkedct
  • LSQ: Linked SPARQL Queries Dataset by Muhammad Saleem, Muhammad Intizar Ali, Qaiser Mehmood, Aidan Hogan and Axel-Cyrille Ngonga Ngomo (paper) – Query logs are becoming an ever more important resource for everything from search engines to database query optimization. See for example USEWOD. This resource provides queryable versions in SPARQL of the query logs from several major datasets including DBpedia and LinkedGeoData.
  • Provenance-Centered Dataset of Drug-Drug Interactions by Juan Banda, Tobias Kuhn, Nigam Shah and Michel Dumontier (paper) – this resource provides an aggregated set of drug-drug interactions coming from 8 different sources. I like how they provided a DOI for the bulk download of their data source as well as a SPARQL endpoint. It also uses nanopublications as the representation format.
  • Semantic Bridges for Biodiversity Science by Natalia Villanueva-Rosales, Nicholas Del Rio, Deana Pennington and Luis Garnica Chavira (paper) – this resource allows biodiversity scientists to work with species distribution models. The interesting thing about this resource is that it not only provides linked data, a SPARQL endpoint and ontologies but also semantic web services (i.e. SADI) for orchestrating these models.
  • DBpedia Commons: Structured Multimedia Metadata for Wikimedia Commons by Gaurav Vaidya, Dimitris Kontokostas, Magnus Knuth, Jens Lehmann and Sebastian Hellmann (paper) – this is another chapter in exposing Wikimedia content as structured data. This resource provides structured information for the media content in Wikimedia Commons. Now you can query with SPARQL for all images with a CC-by-sa v2.0 license.

Ontologies

Overall, I think this is a good representation of the plethora of deep datasets and ontologies that the community is creating.  Take a minute and check out these new resources.

Last week I was in Florence, Italy for the 24th International World Wide Web Conference (WWW 2015). This is the leading computer science conference focused on web technology writ large. It’s a big conference – 1400 attendees this year. WWW is excellent for getting a good bearing on the latest across multiple subfields in computer science. Another way to say it is that I run into friends from the semantic web community, NLP community, data mining community, web standards community, the scholarly communication community, etc. I think on the Tuesday night I traversed four different venues hanging out with various groups.

This is the first time since 2010 that I attended WWW. It was good to be back. I was there the entire week so there was a ton but I’ll try to boil what I saw down into 3 takeaways. But first…

What was I doing there?

First, I co-authored a research track paper with Marcin Wylot and Philippe Cudré-Mauroux of the eXascale Infolab (cool name) on Executing Provenance Queries over Web Data (slides, paper). We showed that because of the highly selective nature of provenance on the web of data, we can actually improve query performance within a triple store. I was super happy to have this accepted given the ~14% acceptance rate!
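
The intuition that a selective provenance constraint shrinks the work a query has to do can be illustrated with a toy example. This is only a sketch under my own assumptions: the quads, source names and the function are invented, and it is not the algorithm or the storage layout from the paper.

```python
# Each quad is (subject, predicate, object, source). All values are invented.
QUADS = [
    ("ex:Florence", "ex:country", "ex:Italy", "src:trusted-gazetteer"),
    ("ex:Florence", "ex:country", "ex:France", "src:anonymous-edit"),
    ("ex:Oxford", "ex:country", "ex:UK", "src:trusted-gazetteer"),
    ("ex:Oxford", "ex:population", "150000", "src:old-census"),
]

def provenance_scoped_query(quads, predicate, allowed_sources):
    """Keep only quads from allowed sources, then match the predicate on what is left.

    Returns the matches and the number of candidate quads the matching step saw.
    A real store would use an index on the source rather than a linear filter.
    """
    candidates = [q for q in quads if q[3] in allowed_sources]
    matches = [(s, o) for s, p, o, _ in candidates if p == predicate]
    return matches, len(candidates)

matches, candidate_count = provenance_scoped_query(
    QUADS, "ex:country", {"src:trusted-gazetteer"}
)
```

With the provenance constraint applied first, the pattern matching only touches two of the four quads, and the conflicting statement from the untrusted source never reaches the result.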

Second, I gave the opening talk of the Semantics, Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD) workshop. I discussed the current state of scholarly productivity and used the notion of the burden of knowledge as a motivation for knowledge graphs as a mechanism to help increase that productivity. I even went web for my slides.

Continuing on the theme of knowledge graphs, I participated on a panel in the industry track around knowledge graphs. More thoughts on this coming up.

The Takeaways

From my perspective there were three core takeaways:

  1. Knowledge Graphs/Bases everywhere
  2. Assume the Web
  3. Scholarly applications are interesting applications

1. Knowledge Graphs/Bases everywhere

I could call this Entities everywhere. Perhaps it was the sessions I chose to attend, but it felt like when I was at the conference in 2010, where every other paper was about online advertising. There were a ton of papers on entity linking, entity disambiguation, entity (etc.), and many others had knowledge base construction as a motivation.

There were two tutorials on knowledge graphs. Both of them were full, and the one from Google/Facebook involved moving to a completely new room. Both were excellent. The one from the Yago team has really good material. As a side note, it was interesting to sit in on tutorials where I already have a decent handle on the material. It let me compare my own intellectual framework for the material with others out there. For example, I liked the Yago tutorial’s distinction between source-centric and yield-centric information extraction and how we pursue the yield approach when doing automated knowledge base construction. A recommended exercise for the reader.

Beyond just being a plethora of stuff, I think our panel discussion highlighted themes that appeared across several papers.

Dealing with long tail entities
In general, approaches to knowledge base construction have relied on well known entities (e.g. Wikipedia) and frequency (you’re mentioned a lot, you’re an entity). For many domain specific entities, for example in science, and also emergent entities this is a challenge. A number of authors tried to tackle this by:

  • looking at web page titles as a potential data source for entities (Song et al.)
  • using particular types of web tables to help assign entities to classes (Wang et al.)
  • using social context to help entity extraction (Jie Tang et al.)
  • discovering new meta relations between entities (Meng et al.)

Quality
All the organizations on the industry panel spend significant resources on quality maintenance of their knowledge graphs. The question here is how to best decrease the amount of human input and increase automation.

An interesting example that was talked about quite frequently is the move of Freebase to Wikidata. Wikidata runs under the same guidelines as Wikipedia so all facts need to have claims grounded in sources from the Web. Well, it turns out this is difficult because many facts are sourced from Wikipedia itself. This kind of, dare I say it, provenance is really important. Most current large scale knowledge graphs support provenance but as we automate more it would be nice to be able to automatically judge these sources using that provenance.

One paper that I saw that addressed quality issues was GERBIL – General Entity Annotator Benchmarking Framework. This 25-author paper (!) devised a common framework for testing entity linking tools. It’s great to see the community looking at these sorts of common QA frameworks.

Multimedia
This seemed to be bubbling up. On the panel, the company Tagasauris was looking at constructing a mediaGraph by analyzing video content. During the Yago tutorial, the presenters mentioned potential future work on extracting common sense knowledge by looking at videos. In general, both extracting facts from multimedia and using knowledge graphs to understand multimedia seem like challenging but fruitful areas. One particular example was the paper “Tagging Personal Photos with Transfer Deep Learning”. What was cool was the injection of a personal photo ontology into the training of the network as priors. This led to better results and, probably more importantly, decreased the training time. Another example is the work from Gerhard Weikum’s group on extracting knowledge from movie scripts.

Finally, as I commented at the Linked Data on the Web Workshop, the growth of knowledge graphs is a triumph of the semantic web and linked data. Making knowledge bases open and available on the Web using reusable schemes has really been a boon to the area.

2. Assume the Web

It’s obvious but is worth repeating: the web is really big!

These stats were from Andrei Broder’s excellent keynote. The size of the web motivates the need for better web technology (e.g. search) and as that improves so do our expectations. Broder called out three axes of progress:

  1. scaling up with quality
  2. faster response
  3. higher functionality levels

We progress on all these dimensions. But the scale of the web doesn’t just change the technology we need to develop but it changes our methods.

For example, a paper I liked a lot was “Leveraging Pattern Semantics for Extracting Entities in Enterprises”. This bears a resemblance to problems we face extracting entities that are not found on the web because they’re only mentioned within a private environment (e.g. internal product names). But even in this environment they rely on the Web. They rank the semantic patterns they extract by using relations extracted from the web.

For me, it means that even if the application isn’t necessarily for “the web”, I should think about the web as a potential part of the solution.

3. Scholarly applications are interesting applications

I’m biased, but I think scholarly applications are particularly interesting and you saw that at WWW. I attended two workshops dealing with technology and scholarship: SAVE-SD and Big Scholar. I was particularly impressed with the scholarly knowledge graph that’s being built on top of the Bing Satori Knowledge Graph, which covers venues, authors, papers, and organizations from 100 million papers. (It seems there are probably 120 million total on the web.) At their demo they showed some awesome queries that you can do, like “papers on multiple sclerosis citing artificial intelligence”. Another example is venues appearing in the side of Bing searches with related venues, due dates, etc.:

See Kuansan Wang’s (@kuansanw) talk for more info (slides). As far as I understand, MSR will also be releasing the Microsoft Academic Graph for experimentation in a couple of weeks. Based on this graph MSR is co-organizing with Antonio Gulli from Elsevier the WSDM Cup in 2016.

It was a pleasure to meet C. Lee Giles of CiteSeerX. It was good seeing an overview of that system and he had some good pointers (e.g. GROBID for metadata extraction and ParsCit for citation extraction).

From SAVE-SD there were two papers that caught my eye:

There were also a number of main track papers that applied methods to scholarly content.

Overall, WWW 2015 was a huge event so this trip report really is just what I could touch. I didn’t even get the chance to go to the W3C sessions and Web Science talks. You can check out all the proceedings here, definitely worth a look.

Random thoughts

  • The web isn’t scale free – it’s log-log. Gotta check out Clauset et al 2009, Power-law distributions in empirical data
  • If you’re a researcher remember that Broder’s “A taxonomy of web search” – was originally rejected from WWW 2002, it now has 1700+ citations.
  • Aidan Hogan + 1 for colorful slides and showing that we need to just deal with blank nodes and not get so hung up about it.  (paper, code)
  • If you do machine learning, do your parameter studies. Most papers had them.
  • PROV and information diffusion combined. So awesome.
  • Ah conference internet… It’s always hard.
  • People are hiring like crazy. Booths from Baidu, Facebook, Yahoo, LinkedIn. Oh, and never discount how frisbees can motivate highly educated geeks.
  • On the hiring note, I liked how the companies listed their attendees and their talks.
  • Tons and tons of talks with authors from companies. I should really do some stats. It was like every paper.
  • Italy, food, florentine steak – yummy!
  • Corollary, running is necessary but running in Florence is beautiful. Head by the Duomo across the river and up through the gardens.
  • What you can do with Foursquare data: [photo]
  • Larry and Sergey won the test of time award.
  • Gotta ask the folks at Insight about their distributional semantics work.

Earlier this week, I attended the SNN Symposium –  Intelligent Machines. SNN is the Dutch foundation for Neural Networks, which coordinates the Netherlands national platform on machine learning, which connects most of the ML groups in the Netherlands.

It’s not typical for a 1 day Dutch specific academic symposium to sell out – but this one did. This is a combination of the topic (machine learning is hot!) but also the speakers. The organizers put together a great line-up:

It’s not typical to get essentially 4 keynotes in one day. Instead of going through each talk in turn, I’ll try to draw some of the major items that I took away from across the talks.

The Case for Probability Theory

Both Prof. Ghahramani and Dr. Herbrich made strong arguments for probability as the core way to think about machine learning/intelligence and in particular a Bayesian view of the world. Herbrich summarized the argument to use probability as:

  • Probability is a calculus of uncertainty (argued using the “naturalness” of Cox Axioms)
  • It maps well to computational systems – (factor graphs allow for computational distribution)
  • It decouples inference, prediction and decision
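
The third point, decoupling inference, prediction and decision, can be seen in a tiny Beta-Bernoulli example. This is my own illustration rather than anything shown in the talks; the counts, the utilities and the function names are made up.

```python
from fractions import Fraction

def infer(alpha, beta, successes, failures):
    """Inference: update a Beta(alpha, beta) prior with observed outcomes."""
    return alpha + successes, beta + failures

def predict(alpha, beta):
    """Prediction: posterior predictive probability that the next trial succeeds."""
    return Fraction(alpha, alpha + beta)

def decide(p_success, gain, loss):
    """Decision: act only if the expected utility of acting is positive."""
    expected = p_success * gain - (1 - p_success) * loss
    return ("act" if expected > 0 else "wait"), expected

# Uniform prior Beta(1, 1), then 7 successes and 3 failures observed.
alpha, beta = infer(1, 1, successes=7, failures=3)
p = predict(alpha, beta)

# The same prediction supports different decisions under different costs.
cheap_mistake = decide(p, gain=1, loss=1)
costly_mistake = decide(p, gain=1, loss=3)
```

The posterior and the prediction are computed once; only the last step changes when the cost of a mistake changes, which is the sense in which the three stages are decoupled.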

Factor Graphs!

For me, it was a nice reminder to think of optimization as an approximation for computing probabilities. More generally, coming back to a simplified high-level framework makes understanding the complexities of the algorithms easier. Ghahramani did a great job of connecting this framework with the underlying mathematics. Slides from his ML course are here – unfortunately without the lecturer himself.

The Rise of Reinforcement Learning

The presentations by Daan Wierstra and Sethu Vijayakumar both featured pretty amazing demos. Dr. Wierstra was on the team that developed algorithms that can learn to play Atari games purely from pixels and a knowledge of the game score. This uses reinforcement learning to train a convolutional neural network. The key invention here was to keep around the past experience when providing input back into the neural network.
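
Keeping past experience around is commonly called experience replay. Below is a minimal sketch of a replay buffer, assuming a simple (state, action, reward, next_state, done) transition format; the class and its parameters are my own illustration, not DeepMind’s implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of past transitions, sampled at random for training."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # the oldest transitions fall off the front
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a random batch without replacement, breaking up temporal correlation."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    # Toy transitions: the state is just the step number.
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(2)
```

In a training loop, the network would be updated on `batch` rather than only on the most recent transition, so each experience can be reused many times.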

Likewise, Prof. Vijayakumar showed how robots can also learn via reinforcement. Here’s an example of a robot arm learning to balance a pole.

Reinforcement learning can help attack the problem of data efficiency that’s faced by machine learning. Essentially, it’s hard to get enough training data, let alone labelled training data. We’ve seen the rise of unsupervised methods to take advantage of the data we do have. (Side note: unsupervised approaches just keep getting better.) But by situating the agent in an environment, it’s easier to provide the sort of training necessary. Instead of examples, one needs to provide the appropriate feedback environment. From Wierstra’s talk, again the apparent difficulty for reinforcement learning is temporal abstraction – using knowledge from the past to learn. Both the Atari and robot examples receive fairly immediate reinforcement on their tasks.

This takes us back to the classic ideas of situated cognition and of course the work of Luc Steels.

Good Task Formulation

Sometimes half the battle in research is coming up with a good task formulation. This is obvious but it’s actually quite difficult. What struck me was that each of the speakers was good at formulating their problem and the metrics by which they can test it. For example, Prof. Ghahramani was able to articulate his goals and measure of success for the development of the Automatic Statistician – a system for finding a good model of a given dataset and providing a nifty human readable and transparent report. Here’s one for affairs 🙂

(Side note: the combination of parameter search and search through components reminds me of work on the Wings Workflow environment.)

Likewise, Dr. Herbrich was good at translating the various problems faced within Amazon into specific ML tasks. For example, here’s his definition for Content Linkage:

[Slide: definition of Content Linkage]

He then broke this down into specific, well defined tasks through the rest of the talk. The important thing here is to keep coming back to these core tasks and having well defined evaluation criteria. (See also Watson’s approach)

Attacking General AI?

One thing that stood out to me was the audacity of the Google DeepMind goal – to solve General AI. Essentially, designing “AI that can operate over a wide range of tasks”. Why now? Wierstra emphasized the available compute power and advances in different algorithms. I thought the interesting comment was that they have something like a 30 year time horizon within a company. Of course, funding may not last that long, but articulating that goal and demonstrably attacking it is something that I would expect more from academia. Indeed, I wonder if we are not thinking big enough. They already have very impressive results: the Atari example but also their DRAW algorithm for learning to generate images:

I also like their approach of Neural Turing Machines – using a recurrent neural network to create a computer itself. By adding memory to neural networks they’re trying to tackle the “memory” problem discussed above.

Overall, it was an invigorating day.

Random thoughts:

  • Robots demos are cool!

  • Text Kernel and Potsdam’s use of word2vec for entity extraction in CVs was interesting.

Last week (Jan 29 & 30), I was at the NSF & Sloan Foundation workshop: Supporting Scientific Discovery through Norms and Practices for Software and Data Citation and Attribution. The workshop is in the context of the NSF’s Dear Colleague Letter on the subject. The workshop brought together a range of backgrounds and organizations from Mozilla to NIH and NASA. I got to catch up with several friends and was able to meet some new folks as well. Check out the workshop’s GitHub page with a list of 22 use cases submitted to the workshop.

I was pleased to see the impact of the work of FORCE11 on helping drive this space. In particular, the Joint Principles on Data Citation and Resource Identifiers (RRIDs) seem to be helping the community focus on citing other forms of scholarly output and were brought up several times in the meeting.

I think there were two main points from the conference:

  1. We have the infrastructure.
  2. Sustainability

Infrastructure

It was clear that we have much of the infrastructure in-place to enable the citation and referencing of outputs such as software and data.

In terms of software, piggybacking off existing infrastructures seems to be the most likely approach. The versioning/release mindset built into software development means that hosting infrastructure such as GitHub or Google Code provides a strong start. These can then be integrated with existing scholarly attribution systems. My colleague Sweitze Roffel presented Elsevier’s work on Original Software Publications. This approach leverages the existing journal based ecosystem to provide the permanence and context associated with things in the scientific record. Another approach is to use the data hosting/citation infrastructure to give code a DOI, e.g. by using Zenodo. Both approaches work with GitHub.

The biggest thing will be promoting the actual use of proper citations. James Howison of University of Texas Austin presented interesting deep dive results on how people refer to software in the scientific literature (slide set below) (GitHub). It shows that people want to do this but often don’t know how. His study was focused. I’d like to do this same study in an automatic fashion on the whole of the literature. I know he’s working with others on training machine learning models for finding software mentions so that would be quite cool. Maybe it would be possible to back-fill the software citation graph this way?

In terms of data citation, we are much farther along because many of the existing data repositories support the minting of data citations. Many of the questions asked were about cases with changing data or mash-ups of data. These are important edge cases to look at. I think progress will be made here by leveraging the landing pages for data to provide additional metadata. Indeed, Joan Starr from the California Digital Library is going to bring this back to the DataCite working group to talk about how to enable this. I was also impressed with the PLOS-led Making Data Count project and Martin Fenner’s continued development of the Lagotto altmetrics platform. In particular there was discussion about getting a supplementary guideline for software and data downloads included in COUNTER. This would be a great step in getting data and citation properly counted.

Sustainability

Sustainability is one of the key questions that have been going around in the larger discussion. How do we fund the software and data resources necessary for the community? I think the distinction that arose was the need to differentiate between:

  • software as an infrastructure; and
  • software as an experiment/method.

This seems rather obvious but the tendency is for the latter to become the former and this causes issues, in particular for sustainability.

Issues include:

  1. It’s difficult to identify which software will become key to the community and thus where to provide the investment.
  2. Scientific infrastructure software tends to be funded on a project-to-project basis or sometimes as a sideline of a lab.
  3. Software that begins as an experiment is often not engineered correctly.
  4. As Luis Ibanez from Google pointed out, we often lose the original developers over time and there’s a need to involve new contributors.

The Software Sustainability Institute in the UK has begun to tackle some of these problems. But there is still a lack of clear avenues for aggregating the funding necessary. One of the popular models is the creation of a non-profit foundation to support a piece of software, but this leads to “foundation fatigue.” Other approaches shift the responsibility to university libraries, but libraries may not have the required organizational capabilities. Katherine Skinner’s recent talk at FORCE 2015 covered some of the same ground here.

One of the interesting ideas that came up at the workshop was the use of other parts of the University institution to help tap into different funding streams (e.g. the IPR office; university development office). An example of this is Internet2 which is sponsored directly by universities. However, as pointed out by Dan Katz, to support this sort of sustainability there is a need to have insight into the deeper impact of this sort of software for the scientific community.

Conclusion

You can see a summary of the outcomes here. In particular, take a look at the critical asks. These concrete requests were formulated by the workshop attendees to address some of the identified issues. I’ll be interested to see the report that comes out of the workshop and how that can help move us forward.

NewsReader Amsterdam Hackathon

This past Wednesday (Jan. 21, 2015) I was at the NewsReader Hackathon. NewsReader is an EU project to extract events and build stories from the news. They use a sophisticated NLP pipeline combined with semantic background knowledge to perform this task. The hackathon was an opportunity to talk to members of one of the leading NLP groups in the Netherlands (CLTL) and find out more about their current pipeline. Additionally, one of the project partners is Lexis Nexis, a sister company of Elsevier, so it was nice to see how their content was being used as a basis for event extraction and also meet some of my colleagues. The combination of news and research is particularly of interest in light of the recent Elsevier acquisition of NewsFlo.

Besides the chance to meet people, I also got to do some hacking myself to see how the NewsReader API worked. I used the API to plot the number and type of events featuring universities. (The resulting IPython Notebook)

A couple of pointers for future reference:

2015-01-13 10.06.28Last week, I was at FORCE 2015 – the future of research communications and e-scholarship conference held in Oxford. This is the third conference in a series that started with Beyond the PDF in 2011 and continued with Beyond the PDF 2 that I led the organization of in Amsterdam in 2013 (my wrap-up is here). This conference provides one of the only forums that brings together a variety of people who are in the vanguard of scholarly communication from librarians and computer scientists, to researchers, funders and publishers. Pretty much every role was represented in the ~250 attendees.

To give you an idea, I saw the developers of the Papers reference manager, the editorial director of PLOS One, a funder from the Wellcome Trust, a librarian from University of Iowa, and public policy junior researchers from Brazil/Germany.

The curators (i.e. conference chairs), Dave De Roure and Melissa Haendel, did a great job of pulling in a whole range of topics and styles in a great venue. We even had the opportunity to see copies of the Philosophical Transactions. Speaking from experience, this is a tough conference to organize because everything is pretty dynamic and there are lots of different styles (e.g. Dave and the last-minute beer run for the Hackathon!).

So what was I doing there? I helped organize the hackathon, which gave some space to work on content extraction and reference manager support for data citation, and for people to talk over pizza. This led to proposals for two 1k challenges. (Remember to vote for which one you want to give 1000 pounds to.) I also helped organize the poster and demo / geek out sessions. A trailer for those sessions is below:

Themes

The conference was too packed to go through everything, but I wanted to go through the core themes that I got out of it:

1. Scholarly media is not just text

Data, images, slides, videos, software – scholarly media is not just text. It never was, but it’s clear that the primacy of text is slowly being reduced and that text will eventually be treated on par with these other forms of output. This is being made possible by the number of new platforms being introduced, whether it’s Figshare or Zenodo for data, GitHub for code or HUBzero for the entire analytics lifecycle. It’s about sharing the actual research object rather than the textual argument. I think what brought this home to me is the amount of time spent discussing and presenting how these content types can be shoehorned into traditional text environments (e.g. journal citations).

2. Not access, understanding

The assumption at FORCE 2015 is that scholarship will be open access. The question then arises: what do you do with the open access content? Phil Bourne, in his closing remarks, mentioned the lack of things being done with the current open access corpus. This need to do more came across clearly in the keynote by Chris Lintott, founder of Galaxy Zoo:

He discussed how the literature was a barrier to amateurs contributing more to science. Specifically, he mentioned accessible research summaries. But, in general, there is a need to consider a more diverse audience in our communication, not only amateurs but also scientists from other disciplines or policy makers, for example.

3. Quality under pressure

The amount of scholarship continues to grow and there are perverse incentives. Scott Edmunds from GigaScience brought this out in his vision ideas talk.

The current answer to this is peer review. But as most researchers will tell you, we are already overwhelmed. I get tons of requests to review and it’s hard to turn down my colleagues. Maybe a market for peer review will develop (see below) but what we need is more automated mechanisms of quality control or for publishers to do more quality control before things get sent to reviewers. Maybe we should see peer review as constructive feedback and not a filter. Likewise, by valuing other parts of the system maybe we can increase both the transparency and overall quality of the science.

4. Science as a service

The poster below from Bianca Kramer and Jeroen Bosman highlighted the explosion in services available for scholarly communication. This continues a theme that I emphasized last year and that Ian Foster has talked about – the ability to do more and more science by just calling an API. Why can’t I build my lab from a cafe?

Wrap-up & Random Notes

The FORCE community is a special one. I hope we can continue to work together to push scholarly communication forward. I’m already looking forward to FORCE 2016 in Portland. There’s lots to be excited about as the way we do research rapidly changes. Finally, here are some random notes from the conference:
