At the end of last week, I was at a small workshop held by the EXCITE project around the state of the art in extracting references from academic papers (in particular PDFs). This was an excellent workshop that brought together people who are deep into the weeds of this subject including, for example, the developers of ParsCit and CERMINE. While reference string extraction sounds fairly obscure the task itself touches on a lot of the challenges one needs in general for making sense of the scholarly literature.
Begin aside: Yes, I did run a conference called Beyond the PDF 2 and have been known to tweet things like:
But, there’s a lot of great information in papers so we need to get our machines to read. end aside.
You can roughly catergorize the steps of reference extraction as follows:
- Extract the structure of the article. (e.g. find the reference section)
- Extract the reference string itself
- Parsing the reference string into its parts (e.g. authors, journal, issue number, title, …)
Check out these slides from Dominika Tkaczyk that give a nice visual overview of this process. In general, performance on this task is pretty good (~.9 F1) for the reference parsing step but gets harder when including all steps.
There were three themes that popped out for me:
- The reading experience
- Resources
- Reading from the image
The Reading Experience
Min-Yen Kan gave an excellent talk about how text mining of the academic literature could improve the ability for researchers to come to grips with the state of science. He positioned the field as one where we have the ground work and are working on building enabling tools (e.g. search, management, policies) but there’s still a long way to go in really building systems that give insights to researchers. As custodian of the ACL Anthology about trying to put these innovations into practice. Prof. Kan is based in Singapore but gave probably one of the best skype talks I have ever been part of it. Slides are below but you should check it out on youtube.
Another example of improving the reading experience was David Thorne‘s presentation around some of the newer things being added to Utopia docs – a souped-up PDF reader. In particular, the work on the Lazarus project which by extracting assertions from the full text of the article allows one to traverse an “idea” graph along side the “citation” graph. On a small note, I really like how the articles that are found can be traversed in the reader without having to download them separately. You can just follow the links. As usual, the Utopia team wins the “we hacked something really cool just now” award by integrating directly with the Excite projects citation lookup API.
Finally, on the reading experience front. Andreas Hotho presented BibSonomy the social reference manager his research group has been operating over the past ten years. It’s a pretty amazing success resulting in 23 papers, 160 papers use the dataset, 96 million google hits, ~1000 weekly active users active. Obviously, it’s a challenge running this user facing software from an academic group but clearly it has paid dividends. The main take away I had in terms of reader experience is that it’s important to identify what types of users you have and how the resulting information they produce can help or hinder in its application for other users (see this paper).
Resources
The interesting thing about this area is the number of resources available (both software and data) and how resources are also the outcome of the work (e.g. citation databases). Here’s a listing of the open resources that I heard called out:
- Software
- Data
This is not to mention the more general sources of information like, CiteSeer, ArXiv or PubMed, etc. What also was nice to see is how many systems built on-top of other software. I was also happy to see the following:
An interesting issue was the transparency of algorithms and quality of the resulting citation databases. Nees Jan van Eck from CWTS and developer of VOSViewer gave a nice overview of trying to determine the quality of reference matching in the Web of Science. Likewise, Lee Giles gave a review of his work looking at author disambiguation for CiteSeerX and using an external source to compare that process. A pointer that I hadn’t come across was the work by Jurafsky on author disambiguation:
Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. 2012. Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology 63:5, 1030-1047.
Reading from the image
In the second day of the workshop, we broke out into discussion groups. In my group, we focused on understanding the role of deep learning in the entire extraction process. Almost all the groups are pursing this.
I was thankful to both Akansha Bhardwaj and Roman Kern for walking us through their pipelines. In particular, Akansha is using scanned images of reference sections as her source and starting to apply CNN’s for doing semantic segmentation where they were having pretty good success.
We discussed the potential for doing the task completely from the ground up using a deep neural network. This was an interesting discussion as current state of the art techniques already use quite a lot of positional information for training This can be gotten out of the pdf and some of the systems already use the images directly. However, there’s a lot of fiddling that needs to go on to deal with the pdf contents so maybe the image actual provides a cleaner place to start. However, then we get back to the issue of resources and how to appropriately generate the training data necessary.
Random Notes
- The organizers set-up a slack backchannel which was useful.
- I’m not a big fan of skype talks, but they were able to get two important speakers that way and they organized it well. When it’s the difference between having field leaders and not, it makes a big difference.
- EU projects can have a legacy – Roman Kern is still using code from http://code-research.eu where Mendeley was a consortium member.
- Kölsch is dangerous but tasty
- More workshops should try the noon to noon format.