Early last week I attended the Web Science 2018 conference. It was hosted here in Amsterdam, which was convenient for me: it was nice to be at a conference where I could go home in the evening.
Web Science is an interesting research area in that it treats the Web itself as an object of study. It’s a highly interdisciplinary area that combines primarily social science with computer science. I always envision it as a loop: studies of what’s actually going on on the Web lead to new interventions on the Web, which we then need to study.
There were, I guess, a hundred or so people there … it’s a small but fun community. I won’t give a complete rundown of the conference – you can find summaries of each day by Cat Morgan (Workshop Day, Day 1, Day 2, Day 3) – but instead I’ll give an assortment of things that stuck out for me:
The conference also hosted Tim Berners-Lee’s ACM Turing Award lecture, which is obviously a big deal. It was opened up to the public and there were ~900 people there. It was an excellent talk giving a history of the Web and thoughts about its current status. Video will be available soon.
Interesting work on how people evaluate the credibility of online news in search engine result pages.
Nice reproducibility pack from Laura Hollink & co at CWI on gender differences on Wikipedia.
My favorite talk by far was “Not Every Remix is an Innovation”, which looked at tracking remixing in an online 3D-printing sharing community. Amazing insights into the process.
Just like Global Warming, Facebook is anthropogenic – humans created it and it’s a lot easier to change (than global warming). You have an obligation to replace and fix it – and it’s an interdisciplinary endeavour to guide us on how #WebSci18 #turingaward @timberners_lee pic.twitter.com/0zm2EdC38d
It's amazing to consider that something so profoundly simple (the humble URL), can be so powerful, and of course, scalable. At the same time, smart people still struggle to grock this concept. #WebSci18 #webscience @W3C #linkeddata https://t.co/a2XrydT3Sm
Lots of case studies here at #websci18 – always highly interesting but I’m wondering about generalizability – maybe need websci meta reviews? https://t.co/4cY4pIdfcS
Last week, I had the pleasure of attending a bilateral meeting between the Royal Society and the KNAW. The aim was to strengthen the relationship between the UK and Dutch scientific communities. The meeting focused on three scientific areas: quantum physics & technology; nanochemistry; and responsible data science. I was there for the latter. The event was held at Chicheley Hall, which is a classic baroque English country house (think Pride & Prejudice). It’s a marvelous venue – similar in concept to Dagstuhl (but with an English vibe) – where you are wholly immersed in academic conversation.
One of the fun things about the event was getting a glimpse of what colleagues from other technical disciplines are doing. It was cool to see Prof. Bert Weckhuysen’s enthusiasm for using imaging technologies to understand catalysts at the nanoscale. Likewise, seeing both the progress and the investment (!) in quantum computing from Prof. Ian Walmsley was informative. I also got an insider intro to the challenges of engineering a quantum computer from Dr. Ruth Oulton.
The responsible data science track had ~15 people. What I liked was that the organizers included not only computer scientists but also legal scholars, politicians, social scientists, philosophers and policy makers. The session consisted primarily of talks, but luckily everyone was open to discussion throughout. Broadly, responsible data science covers the ethics of the practice and implications of data science, or put another way:
It was apparent throughout the meeting that the computer science community is confronting these challenges head on. A compelling example was the talk by Dr. Alastair Beresford from Cambridge about Device Analyzer, a system that captures the activity of users’ mobile phones in order to provide data to improve device security, which it has:
He talked compellingly about the trade-offs between consent and privacy and how the project tries to manage these issues. In particular, I thought how they handle data sharing with other researchers was interesting. It reminded me very much of how the Dutch Central Bureau of Statistics manages microdata on populations.
Another example was the discussion by Prof. Maarten De Rijke on the work going on with diversity for recommender and search systems. He called out the Conference on Fairness, Accountability, and Transparency (FAT*) that was happening just after this meeting, where the data science community is engaging on these issues. Indeed, one of my colleagues was tweeting from that meeting:
Huge Kudos to https://t.co/uUaHfDb28i for reporting & helping fix this space. IBM replicated their results internally and released a new, improved API! 👏 #FAT2018
Julian Huppert, former MP, discussed the independent review board set up by DeepMind Health to enable transparency about their practices; he is part of that board. Interestingly, Richard Horton, Editor of The Lancet, is also a member. Furthermore, Prof. Bart Jacobs discussed the polymorphic-encryption-based privacy system he’s developing for a collaboration between Google’s Verily and Radboud University around Parkinson’s disease. These are examples of how even the majors are engaged around these notions of responsibility. To emphasize this engagement even more, during the meeting a new report on the Malicious Uses of AI came out from a number of well-known organizations.
One thing that I kept thinking is that we need more assets or concrete artifacts that data scientists can apply in practice.
So my question is how do I build values into a standard development life cycle? Need actionable artifacts Things like https://t.co/VG63JibOau are important for practice #rdsuknl @hoven_j
As data scientists, we tend to want to use an experimental / data driven approach even to these notions surrounding responsibility.
Computer/data scientists tend to look at news #diversity as a statistical measure of similarity/serendipity etc, while social scientists often look at it as a precondition for democracy – @nhelberger #rdsuknl #filterbubble #fakenews
Even though I think there’s absolutely a role here for a data-driven approach, it’s worth looking at other, more qualitative methods, for example survey instruments, an ethnographic approach, or even studying the textual representation of the regulatory apparatus. For instance, reflecting on the notion of Thick Data is compelling for data science practice. This was brought home by Dr. Ian Brown in his talk on data science and regulation, which combined both an economic and a survey view:
Personally, I tried to bring some social science literature to bear when discussing the need for transparency in how we source our data. I also argued for the idea that adopting a responsible approach is also actually good for the operational side of data science practice:
While I think it’s important for computer scientists to look at different methods, it’s also important for other disciplines to gain insight into the actual process of data science itself, as Dr. Linnet Taylor grappled with in her talk about observing a data governance project.
Overall, I enjoyed both the setting and the content of the meeting. If we can continue to have these sorts of conversations, I think the data science field will be much better placed to deal with the ethical and other implications of our technology.
Lots of discussion of the large internet majors and monopolies. There’s lots of academic work on this, but I really like Ben Thompson’s notion of aggregators as the way to think about them.
Merkle trees are great – but blockchain is a nicer name 😉
I was proud to be part of the Open PHACTS project for three years. The project built a platform for drug discovery that integrates data over multiple different kinds of chemistry and biological data sources, currently connecting information about compounds, targets, pathways, diseases and tissues. The platform is still going strong and is now supported by a foundation funded by its users, including companies such as GSK, Roche, Janssen and Lilly. The foundation is also involved in several projects such as Big Data for Europe. The project was large and produced many outputs including numerous publications. I wanted to tell a brief story of Open PHACTS by categorizing the publications. This will hopefully help people navigate the results of the project. Note, I removed the authors for readability, but click through to find all the great people who did this work.
Vision
Speaks for itself…
Open PHACTS: semantic interoperability for drug discovery, Drug Discovery Today, Volume 17, Issues 21–22, November 2012, Pages 1188-1198, ISSN 1359-6446, http://dx.doi.org/10.1016/j.drudis.2012.05.016
Use cases
The information needs of drug discovery scientists. 83 use cases gathered and analyzed. 20 prioritized use case questions as the result.
Can the platform do what it says it can do? Yep. 16/20 use case questions could be answered, plus some we didn’t think of. There were also some cool end-user applications (e.g. the Open PHACTS Explorer and Chembionavigator).
Along the way we addressed some computer science challenges like: How do we scale up querying over RDF? How do we deal with the multiplicity of mappings? How do we mix commercial, private and public data?
Many members of the project worked within a number of communities to develop specifications that help for dataset description (especially in terms of provenance) and interchange.
Overall, the Open PHACTS project not only delivered a data integration platform for drug discovery but also helped through the construction of more interoperable datasets and through lessons about how to build such platforms. I look forward to seeing what happens as the platform continues to be developed and, maybe more importantly, to the impact of the project’s results as they diffuse.
Welcome to a massive multimedia extravaganza trip report from Provenance Week, held earlier this month (June 9–13).
Provenance Week brought together two workshops on provenance plus several co-located events. It had roughly 65 participants. It’s not a huge event but it’s a pivotal one for me as it brings together all the core researchers working on provenance from a range of computer science disciplines. That means you hear the latest research on the topic ranging from great deployments of provenance systems to the newest ideas on theoretical properties of provenance. Here’s a picture of the whole crew:
Given that I’m deeply involved in the community, it’s going to be hard to summarize everything of interest because… well… everything was of interest. It also means I had a lot of stuff going on. So what was I doing there?
Together with Luc Moreau and Trung Dong Huynh, I kicked off the week with a tutorial on the W3C PROV provenance model. The tutorial was based on my recent book with Luc. From my count, we had ~30 participants for the tutorial.
We’ve given tutorials on PROV in the past, but we made a number of updates as PROV becomes more mature. First, as the audience had a more diverse technical background, we came at it from a conceptual model (UML) point of view instead of starting with a Semantic Web perspective. Furthermore, we presented both tools and recipes for using PROV. The number of tools we now have for PROV is growing – ranging from converters that extract PROV from various version control systems to neuroimaging workflow pipelines that support PROV.
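To give a flavor of the kind of recipe the tutorial covered, here is a minimal sketch of building and serializing a tiny provenance document with the Python prov package (one of the PROV tools in this ecosystem); the namespace and the entity/activity/agent names are purely illustrative.

```python
from prov.model import ProvDocument

# Build a tiny PROV document: a report generated by an analysis run by a researcher.
doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')  # illustrative namespace

report = doc.entity('ex:report')
analysis = doc.activity('ex:analysis')
researcher = doc.agent('ex:researcher')

doc.wasGeneratedBy(report, analysis)         # the report came out of the analysis
doc.wasAssociatedWith(analysis, researcher)  # the researcher ran the analysis
doc.wasAttributedTo(report, researcher)      # so the report is attributed to them

# Serialize to PROV-N (human readable) and PROV-JSON.
print(doc.get_provn())
print(doc.serialize(format='json'))
```

The same document object can be serialized to the various PROV representations, which is part of what makes it a handy interchange point between tools.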
I had two papers in the main track of the International Provenance and Annotation Workshop (IPAW) as well as a demo and a poster.
Manolis Stamatogiannakis presented his work with me and Herbert Bos – Looking Inside the Black-Box: Capturing Data Provenance using Dynamic Instrumentation. In this work, we looked at applying dynamic binary taint tracking to capture high-fidelity provenance on desktop systems. This work addresses what’s known as the n-by-m problem in provenance systems: it allows us to see how data flows within an application without having to instrument that application up front, so we know exactly which outputs of a program are connected to which inputs. The work was well received and we got a bunch of different questions, both around the speed of the approach and whether we can track high-level application semantics. A demo video is below and you can find all the source code on GitHub.
We also presented our work on converting PROV graphs to IPython notebooks for creating scientific documentation (Generating Scientific Documentation for Computational Experiments Using Provenance). Here we looked at how to create documentation from provenance gathered in a distributed setting and put it together in an easy-to-use fashion. This work was part of a larger discussion at the event on the connection between provenance gathered in these popular notebook environments and provenance gathered on more heterogeneous systems. Source code again on GitHub.
I presented a poster on our recent work (with Marcin Wylot and Philippe Cudré-Mauroux) on instrumenting a triple store (i.e. a graph database) with provenance. We use a long-standing technique from the database community, provenance polynomials, but applied to large-scale RDF graphs. It was good to be able to present this to those from the database community who were at the conference. I got some good feedback, in particular on some efficiencies we might implement.
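For readers who haven’t seen them, here is a minimal illustration of the general idea behind provenance polynomials (standard semiring provenance, not the specifics of our triple-store work): each source tuple gets an identifier, multiplication records joint use in a derivation, and addition records alternative derivations.

```latex
% A query result r derived either by joining source tuples t_1 and t_2,
% or alternatively from tuple t_3 alone, carries the polynomial:
\[
  \mathrm{prov}(r) \;=\; t_1 \cdot t_2 \;+\; t_3
\]
% Setting t_3 = 0 (removing that source) leaves t_1 \cdot t_2, so r survives;
% setting t_1 = 0 as well gives 0, so r disappears.
```

Evaluating the polynomial under different assignments to the source identifiers is what lets you answer questions like “does this result still hold if a given dataset is removed?”.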
I also demoed (see above) the really awesome work by Rinke Hoekstra on his PROV-O-Viz provenance visualization service (Paper, Code). This was a real hit, with a number of people wanting to integrate it with their provenance tools.
Provenance Reconstruction + ProvBench
At the end of the week, we co-organized an afternoon with the ProvBench folks about challenge tasks and benchmark datasets. In particular, we looked at the challenge of provenance reconstruction – how do you recreate provenance from data when you didn’t track it in the first place? Together with Tom De Nies we produced a number of datasets for use with this task. It was pretty cool to see that Hazeline Asuncion used these datasets in one of her classes, where her students applied a wide variety of off-the-shelf methods.
Awesome! The @provenanceweek provenance reconstruction datasets were used in a 10-week project course at University of Washington
From the performance scores, precision was ok but very dataset dependent and relies a lot on knowledge of the domain. We’ll be working with Hazeline to look at defining different aspects of this problem going forward.
Provenance reconstruction is just one task where we need datasets. ProvBench is focused on gathering those datasets and also defining new challenge tasks to go with them. Check out this GitHub repository for a number of datasets. The PROV standard is also making it easier to consume benchmark datasets because you don’t need to write a new parser to get hold of the data. The dataset I liked most was the Provenance Capture Disparities dataset from the Mitre crew (paper). They provide a gold-standard provenance dataset capturing everything that goes on in a desktop environment, plus two different provenance traces from different kinds of capture systems. This is great for testing provenance reconstruction, but also for looking at how to merge independent capture sources to achieve a full picture of provenance.
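As a small illustration of the “no new parser” point: with the same Python prov package sketched earlier, loading a benchmark trace that is published as PROV-JSON is essentially a one-liner (the filename here is just a placeholder, not an actual ProvBench file).

```python
from prov.model import ProvDocument

# Load a benchmark provenance trace published as PROV-JSON.
# 'trace.provn.json' is a placeholder filename, not a file from ProvBench.
doc = ProvDocument.deserialize(source='trace.provn.json', format='json')

# Once loaded, the records can be inspected generically, whatever system produced them.
for record in doc.get_records():
    print(record.get_type(), record.identifier)
```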
I think I picked out three large themes from Provenance Week:
Transparent collection
Provenance aggregation, slicing and dicing
Provenance across sources
Transparent Collection
One issue with provenance systems is getting people to install provenance collection systems in the first place, let alone installing new, modified provenance-aware applications. A number of papers reported on techniques aimed at making provenance capture more transparent.
A couple of approaches tackled this at the programming-language level. One system focused on R (RDataTracker) and the other on Python (noWorkflow). I particularly enjoyed the noWorkflow system, as it provides not only transparent provenance capture but also a number of utilities for working with the captured provenance, including a diff tool and a conversion from provenance to Prolog rules (I hope Jan reads this). The Prolog conversion includes rules that allow provenance-specific queries to be formulated (on GitHub). noWorkflow is similar to Rinke’s PROV-O-Matic tool for tracking provenance in Python (see video below). I hope we can look into sharing work on a really good Python provenance solution.
An interesting discussion point that arose from this work was: how much should we expose provenance to the user? Indeed, the team behind RDataTracker specifically inserted simple on/off statements in their system so the scientific user could control the capture process in their R scripts.
Tracking provenance by instrumenting at the operating system level has long been an approach to provenance capture. Here, we saw a couple of techniques that tried to reduce that tracking to simply launching a background process in user space, while improving the fidelity of provenance. This was the approach of our DataTracker system and Cambridge’s OPUS (specific challenges in dealing with interposition on the standard library were discussed). Ashish Gehani was nice enough to work with me to get his SPADE system set up on my Mac. It was pretty much just a checkout, build, and run to start capturing reasonable provenance right away – cool.
Databases have consistently been a central place for provenance research. I was impressed by Boris Glavic’s vision (paper) of a completely transparent way to report provenance for database systems by leveraging two common database functions: time travel and an audit log. Essentially, through query rewriting and query replay he’s able to capture and report provenance for database query results. Talking to Boris, they already have a lot of it implemented in collaboration with Oracle. Based on prior history (PostgreSQL with provenance), I bet it will happen shortly. What’s interesting is that his approach requires no modification of the database and instead sits as middleware above it.
Finally, in the discussion session after the TaPP practice session, I asked the presenters, who represented the range of these systems, to ballpark what kind of overhead they saw for capturing provenance. The conclusion was that we can get between 1% and 15% overhead. In particular, for deterministic-replay-style systems you can really press down the overhead at capture time.
Provenance aggregation, slicing and dicing
I think Susan Davidson said it best in her presentation on provenance for crowdsourcing: we are at the OLAP stage of provenance. How do we make it easy to combine, recombine, summarize, and otherwise work with provenance? What kinds of operators, systems, and algorithms do we need? Two interesting applications came to the fore for this kind of need: crowdsourcing and security. Susan’s talk exemplified this, but at the Provenance Analytics event there were several other examples (Huynh et al., Dragon et al.).
The other area was security. Roly Perera presented his impressive work with James Cheney on cataloging various mechanisms for transforming provenance graphs for the purposes of obfuscating or hiding sensitive parts of the graph. This paper is great reference material for mechanisms to deal with provenance summarization. One summarization technique that came up several times, in particular with respect to this domain, was the use of annotation propagation through provenance graphs (e.g. see ProvAbs by Missier et al. and work by Moreau’s team).
Provenance across sources
The final theme I saw was how to connect provenance across sources – one could also call this provenance integration. Both Chapman and the Mitre crew, with their Provenance Plus tracking system, and Ashish, with his SPADE system, are experiencing this problem of provenance coming from multiple different sources and needing to integrate those sources to get a complete picture of provenance, both within a system and spanning multiple systems. I don’t think we have a solution yet, but they both (Ashish, Chapman) articulated the problem well and have some good initial results.
This is not just a systems problem; it’s fundamental that provenance extends across systems. Two of the cool use cases I saw exemplified the need to track provenance across multiple sources.
In many ways, the W3C PROV standard was created to help solve these issues. I think it does help but having a common representation is just the start.
Final thoughts
I didn’t mention it above, but I was heartened to see that the community has taken to using PROV as a mechanism for interchanging data and for having discussions. My feeling is that if you can talk provenance polynomials and PROV graphs, you can speak with pretty much anybody in the provenance community, no matter which “home” they have – whether systems, databases, scientific workflows, or the Semantic Web. Indeed, one of the great things about Provenance Week is that one is able to see diverse perspectives on this cross-cutting concern of provenance.
Lastly, there seemed to be many good answers at Provenance Week but, more importantly, lots of good questions. Now, I think as a community we should really expose more of the problems we’ve found to a wider audience.
DLR did a fantastic job of organizing. Great job Carina, Laura and Andreas!
I’ve never had happy birthday sung to me by 60 people at a conference dinner – surprisingly in tune – Kölsch is pretty effective. Thanks everyone!
Stefan Woltran’s keynote on argumentation theory was pretty cool. Really stepped up to the plate to give a theory keynote the night after the conference dinner.
Speaking of theory, I still need to get my head around Bertram’s work on Provenance Games. It looks like a neat way to think about the semantics of provenance.
If you follow this blog, you’ll know that one of the main themes of my research is data provenance – one of the main use cases for it is reproducibility and transparency in science. I’ve been attending and speaking at quite a few events talking about data sharing, reproducibility and making science more transparent. I’ve even published [1, 2] on these topics.
In this context, I’ve been thinking about my own process as a scientist and whether I’m “eating my own dogfood”. Indeed, at the Beyond the PDF 2 conference in March, I stood up at the end and, in front of ~200 people, said that I would change my work practice – we have enough tools to really change how we do science. I knew I could do better.
So this post is about doing just that. In general, my research work consists of larger infrastructure projects in collaborations and then smaller work developing experimental prototypes and mucking with new algorithms. For the former, the projects use all the standard software development stuff (github, jira, wikis) so this gets documented fairly well.
The bit that’s not as good as it should be is the smaller-scale things. I think my co-authors and I do an ok job of publishing the code and the data associated with our publications – although this could be improved (it’s too often on our own websites). The major issue I have is that the methods are probably not as reproducible or transparent as they should be – essentially it’s a bit messy for other people to figure out exactly what I was up to when doing something new. It’s not in one place, nor is it clearly documented. It also hurts my own process, in that a lot of the mucking about I do gets lost or takes time to find. I see this as a particular problem as I do more web science research, where gathering, cleaning and reanalyzing data is a critical part of the endeavor.
With that in mind, I’ve decided to get my act together and follow in the footsteps of the likes of Titus Brown and Carl Boettiger and do more of my science in a reproducible and open fashion.
To do this, I’ve decided to adopt IPython Notebooks as my new note-taking environment. This solves the problem of letting me try different things out while keeping all the parts of a project together. Additionally, it lets me “narrate my work” – that is, mix commentary with my code, which is pretty cool. My notebook is on github and also contains information about how my system is set up, including the versions of the libraries I’m relying on.
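As a small sketch of what I mean by recording the setup, a notebook cell along these lines captures the interpreter and library versions alongside the analysis (the package list here is just an example, not necessarily what I use):

```python
# Record the environment directly in the notebook so the analysis is easier to rerun later.
import sys
import platform
import importlib

print("Python:", sys.version.split()[0], "on", platform.platform())

# Example packages; swap in whatever the analysis actually uses.
for name in ["numpy", "pandas", "matplotlib"]:
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "unknown"))
    except ImportError:
        print(name, "not installed")
```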
In April, we launched the Open PHACTS Discovery Platform with a corresponding API, allowing developers to create drug discovery applications without having to worry about the complexities and pain of integrating multiple databases. We’ve had some great applications developed on top of this API. If you’re a developer in this space, I encourage you to take a look and see what you can create. Below is a slide set and a webinar about getting started with the API. You can also check out https://dev.openphacts.org for developer documentation and getting an account.
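Just to give a feel for the style of integration the API offers, here is a rough Python sketch of calling it over HTTP. The base URL, endpoint path and parameter names below are illustrative assumptions only; check the documentation at https://dev.openphacts.org for the real ones and for how the application credentials work.

```python
import requests

# Illustrative values only; the real base URL, path and parameters are in the
# developer documentation at https://dev.openphacts.org.
BASE_URL = "https://api.example.org/openphacts"  # placeholder, not the real endpoint
params = {
    "uri": "http://www.conceptwiki.org/concept/example-compound",  # placeholder compound URI
    "app_id": "YOUR_APP_ID",    # credentials from your developer account (assumed names)
    "app_key": "YOUR_APP_KEY",
    "_format": "json",
}

response = requests.get(f"{BASE_URL}/compound", params=params, timeout=30)
response.raise_for_status()
print(response.json())  # integrated chemistry/biology data about the compound, across sources
```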
Last week, I attended ACM CHI 2013 and Web Science 2013 in Paris. I had a great time and wanted to give a recap of both conferences, which were collocated.
CHI
This was my first time at CHI – the main computer-human interaction conference. It’s not my main field of study, but I was there to Data DJ. I had an interactivity submission accepted with Ayman from Yahoo! Research on using turntables to manipulate data. Here’s the abstract:
Spinning Data: Remixing live data like a music DJ
This demonstration investigates data visualization as a performance through the use of disc jockey (DJs) mixing boards. We assert that the tools DJs use in-situ can deeply inform the creation of data mixing interfaces and performances. We present a prototype system, DMix, which allows one to filter and summarize information from social streams using an audio mixing deck. It enables the Data DJ to distill multiple feeds of information in order to give an overview of a live event.
Paul Groth and David A. Shamma. 2013. Spinning data: remixing live data like a music dj. In CHI ’13 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’13). ACM, New York, NY, USA, 3063-3066. DOI=10.1145/2468356.2479611 http://doi.acm.org/10.1145/2468356.2479611 (PDF)
It was a fun experience… although it was a lot of demo giving (reception + all coffee breaks). The reactions were really positive. Essentially, once a person touched the deck they really got the interaction. Plus, a couple of notable people stopped by who seemed to like it: Jakob Nielsen and @kristw from Twitter data science. The kind of response I got made me really want to pursue the project more. I also learned how we can make the interaction better.
In addition to my demo, I was impressed with the cool stuff on display (e.g. traceable skateboards) as well as the number of companies there looking for talent. The conference itself was huge with 3500 people and it was the first conference I attended where they had multiple sponsored parties.
WebSci
Web Science was after CHI and is more in my area of research.
These papers were chiefly done by their first authors, both students at the VU. Anca attended Web Science and did a great job presenting our poster on using Google Scholar to measure academic independence. There was a lot of interest and we got quite a few ideas on how to improve the paper (bigger sample!).
The other paper, by Fabian Eikelboom, was very well received. It compared online and offline prayer cards and tried to see how the web modified this form of communication. Here are a couple of tweets:
I found quite a few things that I really liked at this year’s web science. A couple of pointers:
Henry S. Thompson, Jonathan A. Rees and Jeni Tennison: URIs in data: for entities, or for descriptions of entities: A critical analysis – talked about the httpRange-14 issue and the problem of unintended extensibility points within standards. I think a critical area of Web Science is how the social construction of technical standards impacts the Web and its development. This is an example of that kind of research.
Catherine C. Marshall and Frank M. Shipman: Experiences Surveying the Crowd: Reflections on methods, participation, and reliability – really got me thinking about the notion of hypotheticals in law and how this relates to provenance on the web.
Panagiotis Metaxas and Eni Mustafaraj: The Rise and the Fall of a Citizen Reporter – a compelling example of how twitter influences the mexican drug war and how trust is difficult to determine online. The subsequent Trust Trails project looks interesting.
The folks over at the UvA at digitalmethods.net are doing a lot of fun work with respect to studying the web as a social object. It’s worth looking at their work.
Sebastien Heymann and Benedicte Le Grand. Towards A Redefinition of Time in Information Networks?
Unfortunately, there were some things that I hope will improve for next year. First, as you can tell above, the papers were not available online during the conference. This is really a bummer when you’re trying to tweet about things you see and follow up later. Secondly, I thought there were a few too many philosophy papers. In particular, it worries me when a computer scientist is presenting a philosophy paper at a science conference. I think the program committee needs to watch out for spreading too thinly in the name of interdisciplinarity. Finally, the pecha kucha session was a real success – short, succinct presentations that really raised interest in the work. This, however, didn’t carry over into the main sessions, which often ran too long.
Overall, both CHI and Web Science were well worth the time – I made a bunch of connections and saw some good research that will influence some of my work. Oh and it turns out Paris has some amazing coffee:
Wow! The last three days have been crazy, hectic, awesome and inspiring. We just finished putting on The Future of Research Communication and e-Scholarship (FORCE11)’s Beyond the PDF 2 conference here in Amsterdam. (I was chair of the organizing committee and in charge of local arrangements.) The idea behind Beyond the PDF was to bring together a diverse set of people (scholars, technologists, policy experts, librarians, start-ups, publishers, …) all interested in making scholarly and research communication better. In that sense, I think we achieved our goal. We had 210 attendees from across the spectrum. Below are two charts: one of the types of organizations the attendees came from and one of the domains they are from.
The program of the conference was varied. We covered new tools, business models, the context of the approach, research evaluation, visions for the future and how to move forward. I won’t go over the entire conference here. We’ll have a complete video online soon (thanks Elsevier). I just wanted to call out some personal highlights.
Keynotes
We had two great keynotes: one from Kathleen Fitzpatrick of the Modern Language Association and the other from Carol Tenopir (Chancellor’s Professor at the School of Information Sciences at the University of Tennessee, Knoxville). Kathleen discussed how essential it is for the humanities to embrace new forms of scholarly communication, as it allows for faster dissemination of their work. Carol discussed the practice of reading for academics. She’s done in-depth tracking of how scientists read. Some interesting tidbits: successful scientists read more, and so far social media use has not decreased the amount of reading that scientists do. The keynotes were really a sign of how much more present the humanities were at this conference than at Beyond the PDF 1.
Kathleen Fitzpatrick (@kfitz), Director of Scholarly Communication, Modern Language Association
The tools are there
Just two years ago at the first Beyond the PDF, there were mainly initial ideas and drafts for next-generation research communication tools. At this year’s conference, there was really a huge number of tools that are ready to be used: Figshare, PDFX, Authorea, Mendeley, IsaTools, StemBook, Commons in a Box, IPython, ImpactStory and on…
Furthermore, there are different ways of publishing, from PeerJ to Hypothes.is and even just posting to a blog. Probably the most interesting idea of the conference was the use of GitHub to essentially publish.
This made me think it’s time to revisit my own scientific workflow and figure out how to update it to make better use of these tools in practice.
People made connections
At the end of the conference, I asked if people had made a new connection. Almost every hand went up. It was great to see publishers, technologists and librarians all talking together. The Twitter back channel at the conference was great. We saw a lot of conversations that kept going on #btpdf2 and also people commenting while watching the live stream. Check out a great Storify of the social media stream of the conference done by Graham Steel.
We gave a challenge to the community: “What would you do with 1k today that would change scholarly communication for the better?” The challenge was well received and we had a bunch of different ideas, from sponsoring viewing parties to encouraging the adoption of DOIs in the developing world and by small publishers.
The Challenge of Evaluation
We had a great discussion around the role of evaluation. The format used by Carole Goble for the evaluation session – role playing the key players in the evaluation of research and researchers – really highlighted the fact that we have a first-mover problem. None of the roles felt that they should go first. It was unclear how to push past that challenge.
Summary
Personally, I had a great time. FORCE11 is a unique community and I think it brings together people that need to talk to change the way we communicate scholarship. These were my quick thoughts on the event; there’s a lot more to come. We will have the video of the event up soon. We will also have drawn notes posted, provided by Jongens van de Tekeningen, and we will award a series of 1k grants to support ongoing work. Finally, I hope to see many more blog posts documenting the different views of attendees.
Thanks
We had many great sponsors that helped make this a great event. Things like live streaming, student scholarships, a professional set-up, demos & dinner ensure that an event like this works.
PLOS is a major open-access online publisher and the publisher of the leading megajournal PLOS ONE. A megajournal is one that accepts any scientifically sound manuscript: there is no decision on novelty, just a decision on whether the work was done in a scientifically sound way. The consequence is that much more science gets published, with a corresponding need for even better filters and search systems for science.
As an online publisher, PLOS tracks many of what are termed article-level metrics – these metrics go beyond traditional scientific citations and include things like page views, PDF downloads, mentions on Twitter, etc. Article-level metrics are, to my mind, altmetrics aggregated at the article level.
PLOS provides a comprehensive API to obtain these metrics and wants to encourage their broader adoption and usage, so they organized this workshop. There was a variety of people attending (https://sites.google.com/site/altmetricsworkshop/attendees/attendee-bios), from publishers (including open access ones and the traditional big ones), funders and librarians to technologists. I was a bit disappointed not to see more social scientists there, but I think the push here has been primarily from the representative communities. The goal was to outline key challenges for altmetrics and then corresponding concrete actions that could take place in the next 6 months to help address these challenges. It was an unconference, so no presentations and lots of discussion. I found it quite intense, as we often broke up into small groups where one had to be fully engaged. The organizers are putting together a report that digests the work that was done. I’m excited to see the results.
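For the curious, here is a rough Python sketch of what pulling article-level metrics over HTTP can look like. The host, API version, parameters and response fields below are assumptions on my part and may well differ; check the PLOS ALM documentation (and get an API key) before using anything like this.

```python
import requests

# Illustrative values; the real host, version and parameters are in the PLOS ALM docs.
ALM_URL = "http://alm.example.org/api/v3/articles"  # placeholder host, not the real endpoint
params = {
    "ids": "10.1371/journal.pone.0000000",  # placeholder DOI
    "api_key": "YOUR_API_KEY",
}

response = requests.get(ALM_URL, params=params, timeout=30)
response.raise_for_status()
print(response.json())  # per-article counts of views, downloads, tweets, etc.
```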
Me actively contributing 🙂 Thanks Ian Mulvany!
Highlights
Launch of the PLOS Altmetrics Collection. This was really exciting for me, as I was one of the organizers of getting this collection produced. Our editorial is here. This collection provides a nice home for future articles on altmetrics.
I was impressed by the availability of APIs. There are now several aggregators and good sources of altmetrics after just a bit of time: ImpactStory, altmetric.com, the PLOS ALM APIs, Mendeley, figshare.com, Microsoft Academic Search.
rOpenSci (http://ropensci.org) is a cool project that provides R APIs to many of these altmetric and other sources for analyzing data.
There’s quite a bit of interest in services around these metrics. For example, Plum Analytics (http://www.plumanalytics.com) has a test being done at the University of Pittsburgh. I also talked to other people who were seeing interest in using these alternative impact measures, and heard that a number of companies are now providing this sort of analytics service.
I talked a lot to Mark Hahnel from Figshare.com about the Data2Semantics LinkItUp service. He is super excited about it and loved the demo. I’m really excited about this collaboration.
Microsoft Academic Search is getting better, they are really turning it into a production product with better and more comprehensive data. I’m expecting a really solid service in the next couple of months.
I learned from Ian Mulvany of eLife that Graph theory is mathematically “the same as” statistical mechanics in physics.
Context, Context, Context – there was a ton of discussion about the importance of context for the numbers one gets from altmetrics, for example being able to quickly compare to some baseline or knowing the population to which the number applies.
White board thoughts on context! thanks Ian Mulvany
Related to context was the need for simple semantics – there was a notion that, for example, we need to know whether a retweet on Twitter was positive or negative and what kind of person retweeted the paper (i.e. a scientist, a member of the public, a journalist, etc.). This is because, unlike citations, the population altmetrics draws on is not as clearly defined: it exists in a communication medium that doesn’t just contain scholarly communication.
I had a nice discussion with Elizabeth Iorns, the founder of https://www.scienceexchange.com. They’re doing cool stuff around building markets for performing and replicating experiments.
Independently of the conference, I met up with some people I know from the natural language processing community, and one of the things they were excited about is computational semantics using statistical approaches. It seems like this is very hot in that community and something we in the knowledge representation & reasoning community should pay attention to.
Hackathon
Associated with the workshop was a hackathon held at the PLOS offices. I worked in a group that built a quick demo called rerank.it. This was a bookmarklet that would highlight papers in PubMed search results based on their online impact according to ImpactStory, so you would get color-coded results based on altmetric scores. This only took a day’s worth of work and really showed me how far these APIs have come in allowing applications to be built. It was a fun environment and I was really impressed with the other work that came out.
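To make the idea concrete, here is a toy Python sketch of the ranking and color-coding logic behind the demo; it is not the actual bookmarklet (that was client-side JavaScript hitting live APIs), and the DOIs and scores below are made-up stand-in data.

```python
# Toy sketch of the rerank.it idea: color-code papers by an altmetric score.
# The scores dictionary is stand-in data; in the real demo the scores came
# from an altmetrics API such as ImpactStory.

SCORES = {  # hypothetical altmetric scores keyed by DOI
    "10.1371/journal.pone.0000001": 250.0,
    "10.1371/journal.pone.0000002": 12.0,
    "10.1371/journal.pone.0000003": 0.5,
}

def highlight_color(score: float) -> str:
    """Bucket a score into a traffic-light style highlight color."""
    if score >= 100:
        return "green"
    if score >= 10:
        return "yellow"
    return "gray"

# Sort results by descending impact and show the highlight each would get.
for doi, score in sorted(SCORES.items(), key=lambda kv: kv[1], reverse=True):
    print(doi, highlight_color(score))
```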
Random thoughts on San Francisco
Four Barrel coffee serves really really nice coffee – but get there early before the influx of ultra cool locals
The guys at Goody Cafe are really nice and also serve good coffee
If you’re in the touristy Fisherman’s Wharf area, walk to Fort Mason for fantastic views of the Golden Gate Bridge. The hostel there also looks cool.