At the end of last week, I was at a small workshop held by the EXCITE project around the state of the art in extracting references from academic papers (in particular PDFs). This was an excellent workshop that brought together people who are deep into the weeds of this subject including, for example, the developers of ParsCit and CERMINE. While reference string extraction sounds fairly obscure the task itself touches on a lot of the challenges one needs in general for making sense of the scholarly literature.

Begin aside: Yes, I did run a conference called Beyond the PDF 2 and  have been known to tweet things like:

But, there’s a lot of great information in papers so we need to get our machines to read. end aside.

You can roughly catergorize the steps of reference extraction as follows:

  1. Extract the structure of the article.  (e.g. find the reference section)
  2. Extract the reference string itself
  3. Parsing the reference string into its parts (e.g. authors, journal, issue number, title, …)

Check out these slides from Dominika Tkaczyk that give a nice visual overview of this process. In general, performance on this task is pretty good (~.9 F1) for the reference parsing step but gets harder when including all steps.

There were three themes that popped out for me:

  1. The reading experience
  2. Resources
  3. Reading from the image

The Reading Experience

Min-Yen Kan gave an excellent talk about how text mining of the academic literature could improve the ability for researchers to come to grips with the state of science. He positioned the field as one where we have the ground work  and are working on building enabling tools (e.g. search, management, policies) but there’s still a long way to go in really building systems that give insights to researchers. As custodian of the ACL Anthology about trying to put these innovations into practice. Prof. Kan is based in Singapore but gave probably one of the best skype talks I have ever been part of it. Slides are below but you should check it out on youtube.

Another example of improving the reading experience was David Thorne‘s presentation around some of the newer things being added to Utopia docs – a souped-up PDF reader. In particular, the work on the Lazarus project which by extracting assertions from the full text of the article allows one to traverse an “idea” graph along side the “citation” graph. On a small note, I really like how the articles that are found can be traversed in the reader without having to download them separately. You can just follow the links. As usual, the Utopia team wins the “we hacked something really cool just now” award by integrating directly with the Excite projects citation lookup API.

Finally, on the reading experience front. Andreas Hotho presented BibSonomy the social reference manager his research group has been operating over the past ten years. It’s a pretty amazing success resulting in 23 papers, 160 papers use the dataset, 96 million google hits, ~1000 weekly active users active. Obviously, it’s a challenge running this user facing software from an academic group but clearly it has paid dividends. The main take away I had in terms of reader experience is that it’s important to identify what types of users you have and how the resulting information they produce can help or hinder in its application for other users (see this paper).


The interesting thing about this area is the number of resources available (both software and data) and how resources are also the outcome of the work (e.g. citation databases).  Here’s a listing of the open resources that I heard called out:

This is not to mention the more general sources of information like, CiteSeer, ArXiv or PubMed, etc. What also was nice to see is how many systems built on-top of other software. I was also happy to see the following:

An interesting issue was the transparency of algorithms and quality of the resulting citation databases.  Nees Jan van Eck from CWTS and developer of VOSViewer gave a nice overview of trying to determine the quality of reference matching in the Web of Science. Likewise, Lee Giles gave a review of his work looking at author disambiguation for CiteSeerX and using an external source to compare that process. A pointer that I hadn’t come across was the work by Jurafsky on author disambiguation:

Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. 2012. Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology 63:5, 1030-1047.

Reading from the image

In the second day of the workshop, we broke out into discussion groups. In my group, we focused on understanding the role of deep learning in the entire extraction process. Almost all the groups are pursing this.

I was thankful to both Akansha Bhardwaj and Roman Kern for walking us through their pipelines. In particular, Akansha is using scanned images of reference sections as her source and starting to apply CNN’s for doing semantic segmentation where they were having pretty good success.

We discussed the potential for doing the task completely from the ground up using a deep neural network. This was an interesting discussion as current state of the art techniques already use quite a lot of positional information for training This can be gotten out of the pdf and some of the systems already use the images directly. However, there’s a lot of fiddling that needs to go on to deal with the pdf contents so maybe the image actual provides a cleaner place to start. However, then we get back to the issue of resources and how to appropriately generate the training data necessary.

Random Notes

  • The organizers set-up a slack backchannel which was useful.
  • I’m not a big fan of skype talks, but they were able to get two important speakers that way and they organized it well. When it’s the difference between having field leaders and not, it makes a big difference.
  • EU projects can have a legacy – Roman Kern is still using code from where Mendeley was a consortium member.
  • Kölsch is dangerous but tasty
  • More workshops should try the noon to noon format.



Last week, I was at in Malta for a small workshop on building or thinking about the need for observatories for knowledge organization systems (KOSs). Knowledge organization systems are things like taxonomies, classification schemes, ontologies  or concept maps.  The event was hosted by the EU COST action KNOWeSCAPE, which focuses on understanding the dynamics of knowledge through their analysis and importantly visualization.

This was a follow-up to a previous workshop I attended on KOS evolution. Inspired by that workshop, I began to think with my colleague Mike Lauruhn about how the process of constructing KOS is changing with the incorporation of software agents and non-professional contributors (e.g. crowdsourcing). In particular, we wanted to try and get a handle on what a manager of a KOS should think about when dealing with its inevitable evolution especially with the introduction of these new factors. We wrote about this in our article Sources of Change for Modern Knowledge Organization Systems. Knowl. Org. 43(2016)No.8. (preprint).

In my talk (slides below), I presented our article in the context of building large knowledge graphs at Elsevier. The motivating slides were taken from Brad Allen’s keynote from the Dublin Core conference on metadata in the machine age. My aim was to motivate the need for KOS observatories in order to  provide empirical evidence for how to deal with changing KOS.

Both Joseph Tennis and Richard P. Smiraglia gave excellent views on the current state-of-the-art of KOS ontogeny in information systems. In particular, I think the definitional terms introduced by Tennis are useful.  He had the clearest motivation for the need for an observatory – we need to have a central dataset that is collected overtime in order to go beyond case study analysis (e.g. 1 or two KOS) to a population based approach.

I really enjoyed Shenghui Wang‘s talk on her and Rob Koopman’s experiments embeddings to start to try and detect concept drift within journal articles. Roughly put they used different vector spaces for each time duration and were able to see how particular terms changed with respect to other terms in those vector spaces. I’m looking forward to seeing how this work progresses.

2017-02-01 12.20.09 copy.jpg

The workshop was co-organized with the Wikimedia Community Malta so there was good representation from various members of the community. I particular enjoyed meeting John Cummings who is a Wikimedian in Residence at UNESCO. He told me about one of his project to help create high-quality wikipedia pages from UNESCO reports and other open access documents. It’s really cool seeing how deep research based content can be used to expand Wikipedia and the ramifications that has on its evolution. Another Wikipedian Rebecca O’Neill gave a fascinating talk about her rethinking the relationship between citizen curators and traditional memory institutions. Lot’s of stuff at her site so check it out.

Overall, the event confirmed my belief  that there’s lots more that knowledge organization studies can do with respect to large scale knowledge graphs and also those building these graphs can learn from the field.

Random Notes




Last week, I was in Japan for the 15th International Semantic Web Conference. 

For me, this was a big event as I was research track program co-chair together with the amazing Elena Simperl. Being a program chair is a funny thing, you’re not directly responsible for any individual paper, presentation or review but you feel responsible for the entirety. And obviously, organizing 664 reviews for 212 submissions isn’t something to be taken lightly. Beyond my service as research track chair, I think my main contribution was finding good coffee near the event:

With all that said, I think the entire program was really solid. All the preprints are on the website and the proceedings are available from Springer. I’ll try to summarize my main takeaways below. But first some quick stats:

  • 430 participants
  • 212 (research track) + 43 (application track) + 71 (resources track) = 326 submissions
    • that’s up by 61 submission from last year!
  • Acceptance rates:
    • 39/212  =  18% (research track)
    • 12/43 = 28% (application track)
    • 24/71 = 34%  (resources track)
    • I think these reflect the aims of the individual tracks
  • We also had 102 posters and demos and 12 journal track papers
  • 35 student travel winners

My three main takeaways:

  1. Frames are back!
  2. semantics on the web (notice the case)
  3. Science as the next challenge
  4. SPARQL as a driver for other CS communities

(Oh and apologies for the gratuitous use of images and twitter embeds)

Frames are back!

For the past couple of years, a chunk of the community has been focused on the problem of entity resolution/disambiguation whether that’s from text to a KB or across multiple KBs. Indeed, one of the best paper winners (yes, we gave out two – both nominees had great papers) by ISI’s Information Integration Group was an excellent approach to do multi-type entity resolution.  Likewise, Axel and crew gave a pretty heavy duty tutorial on link discovery. On the NLP front, Stefano Faralli presented a nice resource that disambiguates text to lexical resources with a focus on providing both symbolic and distributional representations .

2016-10-21 10.32.59.jpg2016-10-21 10.34.59.jpg

What struck me at the conference were the number of papers beginning to think not just about entities and their relations but the context they are in. This need for context was well motivated by the folks at IBM research working on medical question answering.

Essentially, thinking about classic AI frames but how do obtain these automatically. A clear example of this is the (ongoing) work on FRED:

Similarly, the News Reader system for extracting information into situated events is another example. Another example is extracting process graphs from medical texts. Finally, in the NLP community there’s an increasing focus on developing resources in order to build automated parsers for frame-style semantic representations (e.g. Abstract Meaning Representation). Such representations can be enhanced by connections to semantic web resources as discussed by Burns et al. (I knew this was a great idea in 2015!)

2016-10-21 10.58.15.jpg

In summary,  I think we’re beginning to see how the background knowledge available on the Semantic Web combined with better parsers can help us start to deal better with context in an automated fashion.

semantics on the web

Chris Bizer gave an insightful keynote reflecting on what the community’s expectations were for the semantic web and where we currently are at.

He presented stats on the growth of Linked Data (e.g. stuff in the LOD cloud) as well as web data (e.g. marked pages) but really the main take away is the boom in the later. About 30% of the Web has html embedded data something like 12 million websites.  There’s an 86% adoption rate on top travel website.  I think the choice quote was:

“Probably, every hotel on earth is represented as web data.”

The problem is that this sort of data is not clean, it’s messy – it’s webby data, which brings to Chris’s important point for the community:


While standards have brought us a lot, I think we are starting as a research community to think increasingly about different kinds of semantics and different kinds of structured data.  Some examples from the conference:

An embrace of the whole spectrum of semantics on the web is really a valuable move for the research community. Interestingly enough, I think we can truly experiment with web data through things like Common Crawl and the Web Data Commons. As knowledge graphs, triple stores, and ontologies become increasingly common place especially in enterprise deployments, I’m heartened by these new areas of investigation.

The next challenge: Science

Personally, the third keynote of ISWC by Professor Hiroaki Kitano – the CEO of Sony CSL and creator among other things of the AIBO and founder of RoboCup gave a inspirational speech laying out what he sees as the next AI grand challenge:

2016-10-21 09.20.30.jpg

It will be hard for me to do justice to the keynote as the material per second ratio was pretty much off the chart but he has AI magazine article laying out the vision.

Broadly, he used RoboCup as a framework for discussing how to organize a challenge and pointed to its effectiveness. (e.g Kiva systems a RoboCup spinout was acquired by Amazon for $770 million). He then focused on the issue of the inefficiency in scientific discovery and in particular how assembling knowledge is just too difficult.

:2016-10-21 09.21.04.jpg

2016-10-21 09.40.53 copy.jpg

Assembling this by hand is way too hard!

He then went on to reframe the scientific question as one of a massive search and verification of hypothesis space. 2016-10-21 09.52.29.jpg

I walked out of that keynote pretty charged up.

I think the semantic web community can be a big part of tackling this grand challenge. Science and medicine have always been important domains for applying these technologies and that showed up at this conference as well:

SPARQL as a driver for other CS communities

The 10 year award was given to  Jorge Perez , Marcelo Arenas and Claudio Gutierrez for their paper Semantics and Complexity of SPARQL. Jorge gave just a beautiful 10 minute reflection on the paper and the relationship between theory and practice. I think his slide below really sums up the impact that SPARQL has had not just on the semantic web community but CS as a whole:

2016-10-19 09.35.39.jpg

As further evidence, I thought one of the best technical talks of the conference (even through an earthquake) was by Peter Bonz on emergent schemas for RDF querying.

It was a clear example of how the two DB and semweb communities are learning from one another and that by the semantic web having different requirements (e.g. around schemas), this drives new research.

As a whole, it’s hard to beat a conference where you learn a ton and has the following:

2016-10-19 10.52.15.jpg

Random Pointers

Last week, I was at Provenance Week 2016. This event happens once every two years and brings together a wide range of researchers working on provenance. You can check out my trip report from the last Provenance Week in 2014.  This year Provenance Week combined:

For me, Provenance Week is like coming home, lots of old friends and a favorite subject of mine. It’s also a good event to attend because it crosses the subfields of computer science, everything from security in operating systems to scientific workflows on to database theory. In one day, I went from a discussion on the role of indirection in data citation to staring at the C code of a database. Marta, Boris and Sarah really put together a solid program. There were about 60 attendees across the four days:


So what was I doing there? Having served as co-chair of the W3C PROV working group, I thought it was important to be at the PROV: Three years later event where we reflected on the status of PROV, it’s uptake and usage. I presented some ongoing work on measuring the usage of provenance on the web of data.  Additionally, I gave the presentation of joint work led by my student Manolis Stamatogiannakis and done in conjunction with Ashish Gehani‘s group at SRI. The work focused on using benchmarks to help inform decisions on what provenance capture system to use. Slides:

I’ll now walk through my 3 big take aways from the event.

Provenance to attack Advanced Persistent Threats

DARPA’s $60 million transparent computing explicitly calls out the use of provenance to address the problem of what’s called an Advanced Persistent Threat (APTs). APTs are attacks that are long terms, look like standard business processes, and involve the attacker knowing the system well. This has led to a number of groups exploring the use of system level provenance capture techniques (e.g. SPADE and OPUS) and then integrating that from multiple distributed sources using PROV inspired data models. This was well described by David Archer is his talk as assembling multiple causal graphs from event streams.  James Cheney’s talk on provenance segmentation also addressed these issues well. This reminded me some what of the work on distributed provenance capture using structured logs that the Netlogger and Pegasus teams do, however, they leverage the structure of a workflow system to help with the assembly.

I particularly liked Yang JiSangho Lee and  Wenke Lee‘s work on using user level record and replay to track and replay provenance. This builds upon some of our work that used system level record replay as mechanism for separating provenance capture and instrumentation. But now in user space using the nifty rr tool from Mozilla. I think this thread of being able to apply provenance instrumentation after the fact  on an execution trace holds a lot of promise.

Overall, it’s great to see this level of attention on the use of provenance for security and in more broadly of using long term records of provenance to do analysis.

PROV as the starting point

Given that this was the ten year anniversary of IPAW, it was appropriate that Luc Moreau gave one of the keynotes. As really one of the drivers of the community, Luc gave a review of the development of the community and its successes.One of those outcomes was the W3C PROV standards. 

Overall, it was nice to see the variety of uses of PROV and the tools built around it. It’s really become the jumping off point for exploration. For example, Pete Edwards team combined PROV and a number of other ontologies including (P-Plan) to create a semantic representation of what’s going on within a professional kitchen in order to check food safety compliance. 


Another example is the use of PROV as a jumping off point for the investigation into the provenance model of HL7 FHIR (a new standard for electronic healthcare records interchange).

As whole, I think the attendees felt that what was missing was an active central point to see what was going on with PROV and pointers to resources for implementation. The aim is to make sure that the W3c PROV wiki is up-to-date and is a better resource overall.

Provenance as lens: Data Citation, Documents & Versioning

An interesting theme was the use of provenance concepts to give a frame for other practices. For example, Susan Davidson gave a great keynote on data citation and how using a variant of provenance polynomials can help us understand how to automatically build citations for various parts of curated databases. The keynote was based off her work with James Frew and Peter Buneman that will appear in CACM (preprint). Another good example of provenance to support data citation was Nick Car’s work for Geoscience Australia.

Furthermore, the notion of provenance as the substructure for complex documents appeared several times. For example, the Impacts on Human  Health of Global Climate Change report from uses provenance as a backbone. Both the OPUS and PoeM systems are exploring using provenance to generate high-level experiment reports.

Finally, I thought David Koop‘s versioning of version trees showed how using provenance as lens can help better understand versioning of version trees themselves. (I have to give David credit for presenting a super recursive concept so well).

Overall, another great event and I hope we can continue to attract new CS researchers focusing on provenance.

Random Notes

  • PROV in JSON-LD – good for streaming
  • Theoretical provenance paper recipe = extend provenance polynomials to deal with new operators. Prove nice result. e.g. now for Linear Algebra.
  • Prefixes! R-PROV, P-PROV, D-PROV, FS-PROV, SC-PROV, — let me know if I missed any..
  • Intel Secure Guard Extensions (SGX) – interesting
  • Surprised how dependent I’ve become on taking pictures in conferences for note taking. Not being able to really impacted my flow. Plus, there are less pictures for this
  • Thanks to Adriane for hosting!
  • A provenance based data science environment
  • 👍Learning Health Systems – from Vasa Curcin

It’s kind of appropriate that my last post of 2015 was about the International Semantic Web Conference (ISWC) and my first post of 2016 will be about ISWC.

This years conference will be held in Kobe Japan. This year’s conference already has a number of great things in store. We already have a stellar list of keynote speakers:

  • Kathleen McKeown – Professor of Computer Science at Columbia University,
    Director of the Institute for Data Sciences and Engineering, and Director of the North East Big Data Hub. I was at the hub’s launch last year and it’s really amazing the researchers she brought together through that hub.
  • Hiroaki Kitano – CEO of Sony Computer Science Laboratory and President of the systems biology institute. A truly inspirational figure who done everything from RoboCup to systems biology. He was even an invited artist at MoMA.
  • Chris Bizer – Professor at the Univesity of Mannheim  and Director of the Institute of Computer Science and Business Informatics there. If you’re in the Semantic Web community – you know the amazing work Chris has done. He really kicked the entire move toward Linked Data into high gear.

We have three tracks for you to submit to:

  1. The classic Research Track. Elena and I hope to get your most innovative and groundbreaking work on the cross between semantics and the web writ large. We’ve put together a top notch PC to give you feedback.
  2. The Resources Tracks. Reusable resources like datasets, ontologies, benchmarks and tools are crucial for many research disciplines and especially ours. This track focuses on highlighting them. Alasdair and Marta have put together a rich set of guidelines for a great reusable resources. Check them out.
  3. The Applications Track provides an area to discuss the benefits and challenges of applying semantic technologies. This track, organized by Markus and Freddy, is accepting three different types of submissions on in-use applications, industry applications and industry applications.

In addition to these tracks, ISWC 2016 will have a full program of workshops, posters, demos and student opportunities.

This year we’ll also be allowing submissions to be HTML, letting you experiment with new ways of conveying your contributions. I’m excited to see the creativity in the community using web technologies.

So get those submissions in. Abstracts are due April 20, Full submissions April 30th!





Last week, I hung out in Bethlehem, Pennsylvania for the the 14th International Semantic Web Conference. Bethlehem is famous for the Lehigh University Benchmark  (LUBM) and Bethlehem Steel. This is the major conference focused on the intersection of semantics and web technologies. In addition to being technically super cool, it was a great chance for me to meet many friends and make some new ones.

Let’s begin with some stats:

  • ~450 attendees
  • The conference continues to be selective:
    • Research track: 22% acceptance rate
    • Empirical studies track: 29% acceptance rate
    • In-use track: 40% acceptance rate
    • Datasets and Ontologies: 22% acceptance rate
  • There were 265 submissions across all tracks which is surprisingly the same number as last year.
  • More stats and info in Stefan’s slides (e.g. move to Portugal if you want to get your papers in the conference.)
  • Fancy visualizations courtesy of the STKO group

Before getting into what I thought were the major themes of the conference, a brief note. Reviewing is at the heart of any academic conference. While we can always try and improve review quality, it’s worth calling out good reviewing. The best reviewers were Maribel Acosta (research) and Markus Krötzsch (applied). As data sets and ontologies track co-chair, I can attest to how important good reviewers are.  For this new track we relied heavily on reviewers being flexible and looking at these sorts of contributions differently. So thanks to them!

For me there were three themes of ISWC:

  1. The Spectrum of Entity Resolution
  2. The Spectrum of Linked Data Querying
  3. Buy more RAM

The Spectrum of Entity Resolution

Maybe its because I attended the NLP & DBpedia workshop or the conversation I had about string similarity with Michelle Cheatham, but one theme that I saw was the continued amalgamation of natural language processing (NLP) style entity resolution with database entity resolution (i.e. record linkage). This movement stems from the fact that an increasing amount of linked data is a combination of data extracted from semi-structured sources as well as from NLP. But in addition to that, NLP sources rely on some of these semi-structured datasources to do NLP.

Probably, the best example of that idea is the work that Andrew McCallum presented in his keynote on “epistemlogical knowledge bases”.

Briefly, the idea is to reason with all the information coming from both basic low level NLP (e.g. basic NER, or even surface forms) as well as the knowledge base jointly (plus, anything else) to generate a knowledge base.  One method to do this is universal schemas. For a good intro, check out Sebastien Riedel’s slides.

From McCallum, I like the following papers which gives a good justification and results of doing collective/joint inference.

(Self promotion aside: check out Sara Magliacane’s work on Probabilistic Soft Logics for another way of doing joint inference.)

Following on from this notion of reasoning jointly, Hulpus, Prangnawarat and Hayes showed how to use the graph-based structure of linked data to to perform joint entity and word sense disambiguation from text. Likewise, Prokofyev et al. use the properties of a knowledge graph to perform better co-reference resolution. Essentially, they use this background knowledge to split the clusters of co-referrent entities produced by Stanford CoreNLP. On the same idea, but for more structured data, the TableEL system uses a joint model with soft constraints to perform entity linking for web tables, improving performance by up-to 75% on web tables. (code & data)

One approach to entity linking that I liked was from the Raphael Troncy’s crew titled “Reveal Entities From Texts With a Hybrid Approach” (paper, slides). (Shouldn’t it be “Revealing..”?). They showed that by using essentially the provenance of the data sources they are able to build an adaptive entity linking pipeline. Thus, one doesn’t necessarily have to do as much domain tuning to use these pipelines.

While not specifically about entity resolution, a paper worth pointing out is Type-Constrained Representation Learning in Knowledge Graphs from Denis Krompaß, Stephan Baier and Volker Tresp. They show how background knowledge about entity types can help improve link prediction tasks for generating knowledge graphs. Again, use the kitchen sink and you’ll perform better.

There were a couple of good resources presented for entity resolution tasks.  Bryl, Bizer and Paulheim produced a dataset of surface forms for dbpedia entities. They were able to boost performance up to 20% for extracting accurate surface forms for entities through filtering. Another tool, LANCE looks great for systematically generating benchmark and test sets for instance matching (i.e. entity linking). Also, Michel Dumontier presented work that had a benchmark for entity linking from the life sciences domain.

Finally, as we get better at entity resolution, I think people will turn towards fusion (getting the best possible representation for a real world entity). Examples include:

The Spectrum of Linked Data Querying

So Linked Data Fragments from Ruben Verborgh was the huge breakout of the conference. Oscar Corcho’s excellent COLD keynote was a riff off thinking about the spectrum (from data dumps through to full sparql queries) that was introduced by Reuben. Another example was the work of Maribel Acosta and Maria-Esther Vidal on “Networks of Linked Data Eddies: An Adaptive Web Query Processing Engine for RDF Data”. They developed an adaptive client side spraql query engine for linked data fragments. This allows the server side to support a much simpler API by having a more intelligent client side. (An aside, kids this is how a technical talk should be done. Precise, clean, technical, understandable. Can’t wait to have the the video lecture for reference.)

Even the most centralized solution, the LODLaundromat which is a clean crawl of the entire web of data supports Linked Data Fragments. In some sense, by asking the server to do less you can handle more linked data, and thus do more powerful analysis. This is exemplified by the best paper LODLab byLaurens Rietveld, Wouter Beek, and Stefan Schlobach, which allowed for the reproduction of 3 existing analysis of the web of data at scale.

I think Olaf Hartig, in his paper on LDQL, framed the problem best as (N, Q) (slides). First define the “crawl” of the web you want to query (N)  and then define the query (Q). When we think about what and where are crawls are, we can think about what execution strategies and types of queries we can best support. Or put another way:

More Main Memory = better Triple Stores

Designing scalable graph / triple stores has always been a challenge. We’ve been trapped by the limits of RAM. But computer architecture is changing, and we now have systems that have a lot of main memory either in one machine or across multiple machines. This is a boon to triple stores and graph processing in general. See for example Leskovec team’s work from SIGMOD:

We saw that theme at ISWC as well:

Moral of the story: Buy RAM


This years conference explored the many spectra of the combination of the web and semantics. I liked the mix of methods used by papers and the range of practical (the industry session was packed) to theoretical results. I also think the community is no longer hemmed in by the standards but are using them as solid starting point. This was pointed out by Ian Horrocks in his keynote:
Additionally, this flexibility was exemplified by the best applied paper, “Building and Using a Knowledge Graph to Combat Human Trafficking” by  Pedro Szekely et al.. They used the parts of the semantic web stack that helped (like ontologies and JSON-LD) but used elastic search for storage to create a vital and important solution to a real challenging problem.
Overall, this was an excellent conference.  Next year’s conference is in Kobe, I hope you submit some great papers and I’ll seen you there!

Random Thoughts

Last week (Oct 7 – 9) the altmetrics community made its way to Amsterdam for 2:AM (the second altmetrics conference) and altmetrics15 (the 4th altmetrics workshop). The conference is aimed more at practitioners while the workshop has a bit more research focus. I enjoyed the events from both a content (I’m biased as a co-organizer) as well as logistics perspective (I could bike from home). This was the five year anniversary of the altmetrics manifesto so it was a great opportunity to reflect on the status of the community. Plus the conference organizers brought cake!

This was the first time that all of the authors were in the same room together and we got a chance to share some of our thoughts. The video is here if you want to hear us pontificate:

From my perspective, I think you can summarize the past years in two bullet points:

  • Amazing what the community has done: multiple startups on altmetrics, big companies having altmetric products, many articles and other research objects having altmetric scores, a small but vibrant research community is alive
  • It would be great to focus more on altmetrics to improve the research process rather than just their potential use in research evaluation.

Beyond the reflection on the community itself, I took three themes from the conference:

More & different data please

An interesting aspect is that most studies and implementations rely on social media data (twitter, mendeley, Facebook, blogs, etc). As an aside, it’s worth noting you can do amazing things with this data in a very short amount of time…

However, there is increasing interest in having data from other sources or having more contextualized data.

There were several good examples.  gave a good talk about trying to get data behind who tweets about scientific articles. I’m excited to see how better population data can help us have. The folks at are starting to provide data that looks at how articles are being used in public policy documents. Finally, moving beyond articles, Peter van Besselaar looking at data derived from grant review processes to study, for example, gender bias.

It’s also good to see developments such as the DOI Event Tracker that makes the aggregation of altmetrics data easier. This is hopefully just the start and we will see a continued expansion of the variety of data available for studies.

The role of theory

There was quite a bit of discussions about the appropriateness of the use of altmetrics for different tasks ranging from the development of global evaluation measures to their role in understanding the science system. There was a long discussion of the quality of altmetrics data in particular the transparency of how aggregator’s integrate and provide data.

A number of presenters discussed the need for theory in trying to interpret altmetrics signal. Cameron Neylon gave an excellent talk about his view of the need for a different theoretical view. There was also a break out session at the workshop discussing the role of theory and I look forward to the ether pad becoming something more well defined.  Peter van Bessellaar and I also tried to argue for a question driven approach when using altmetrics.

Finally, I enjoyed the work of Stefanie Haustein, Timothy Bowman, and Rodrigo Costas on interpreting the meaning of altmetrics. This is definitely a must read.

Going beyond research evaluation

I had a number of good conversations with people about the desire to do something that moves beyond the focus of research evaluation. In all honesty, being able or tell stories with a variety of metrics is probably why altmetrics has gained traction.

However, I think a world in which understanding the signals produced by the research system can be used to improve research is the exciting bit. There were some hints of this. In particular, I was compelled by the work of Kristi Holmes on using measures to improve translational medicine at northwestern.


Overall, It’s great to see all the great activity around altmetrics. There are a bunch of great summaries of the event. Check out the altmetrics conference blog and Julie Brikholz’s summary.

%d bloggers like this: