trip report

The early part of last week I attended the Web Science 2018 conference. It was hosted here in Amsterdam which was nice for me. It was nice to be at a conference where I could go home in the evening.

Web Science is an interesting research area in that it treats the Web itself as an object of study. It’s a highly interdisciplinary area that combines primarily social science with computer science. I always envision it as a loop with studies of what’s actually going on the Web leading to new interventions on the Web which we then need to study.

There were what I guess a hundred or so people there … it’s a small but fun community. I won’t give a complete rundown of the conference. You can find summaries of each day done by Cat Morgan (Workshop DayDay 1Day 2Day 3) but instead give an assortment of things that stuck out for me:

And some tweets:

I had the pleasure of attending the Web Conference 2018 in Lyon last week along with my colleague Corey Harper . This is the 27th addition of the largest conference on the World Wide Web. I have tremendous difficulty  not calling it WWW but I’ll learn! Instead of doing two trip reports the rest of this is a combo of Corey and my thoughts. Before getting to what we took away as main themes of the conference let’s look at the stats and organization:

It’s also worth pointing out that this is just the research track. There were 27 workshops,  21 tutorials, 30 demos (Paul was co-chair), 62 posters, four collocated conferences/events, 4 challenges, a developer track and programming track, a project track, an industry track, and… We are probably missing something as well. Suffice to say, even with the best work of the organizers it was hard to figure out what to see. Organizing an event with 2200+ attendees is a thing is a massive task – over 80 chairs were involved not to mention the PC and the local heavy lifting. Congrats to Fabien, Pierre-Antoine, Lionel and the whole committee for pulling it off.  It’s also great to see as well that the proceedings are open access and available on the web.

Given the breadth of the conference, we obviously couldn’t see everything but from our interests we pulled out the following themes:

  • Dealing with a Polluted Web
  • Tackling Tabular Data
  • Observational Methods
  • Scientific Content as a Driver

Dealing with a Polluted Web

The Web community is really owning it’s responsibility to help mitigate the destructive uses to which the Web is put. From the “Recoding Black Mirror” workshop, which we were sad to miss, through the opening keynote and the tracks on Security and Privacy and Fact Checking, this was a major topic throughout the conference.

Oxford professor Luciano Floridi gave an excellent first keynote  on “The Good Web” which addressed this topic head on. He introduced a number of nice metaphors to describe what’s going on:

  • Polluting agents in the Web ecosystem are like extremphiles, making the environment hostile to all but themselves
  • Democracy in some contexts can be like antibiotics: too much gives growth to antibiotic resistant bacteria.
  • His takeaway is that we need a bit of paternalism in this context now.

His talk was pretty compelling,  you can check out the full video here.

Additionally, Corey was able to attend the panel discussion that opened the “Journalism, Misinformation, and Fact-Checking” track, which included representation from the Credibility Coalition, the International Fact Checking Network, MIT, and WikiMedia. There was a discussion of how to set up economies of trust in the age of attention economies, and while some panelists agreed with Floridi’s call for some paternalism, there was also a warning that some techniques we might deploy to mitigate these risks could lead to “accidental authoritarianism.” The Credibility Coalition also provided an interesting review of how to define credibility indicators for news looking at over 16 indicators of credibility.

We were able to see parts of the “Web and Society track”, which included a number of papers related to social justice oriented themes. This included an excellent paper that showed how recommender systems in social networks often exacerbate and amplify gender and racial disparity in social network connections and engagement. Additionally, many papers addressed the relationship between the mainstream media and the web. (e.g. political polarization and social media, media and public attention using the web).

Some more examples: The best demo was awarded to a system that automatically analyzed privacy policies of websites and summarized them with respect to GDPR and:

More generally, it seems the question is how do we achieve quality assessment at scale?

Tackling Tabular Data

Knowledge graphs and heterogenous networks (there was a workshop on that) were a big part of the conference. Indeed the test of time paper award went to the original Yago paper. There were a number of talks about improving knowledge graphs for example for improving on question answering tasks, determining attributes that are needed to complete a KG or improving relation extraction. While tables have always been an input to knowledge graph construction (e.g. wikpedia infoboxes), an interesting turn was towards treating tabular data as a focus area.

As Natasha Noy from Google noted in her  keynote at the SAVE-SD workshop,  this is an area with a number of exciting research challenges:img_0034_google_savesd.jpg

There was a workshop on data search with a number of papers on the theme. In that workshop, Maarten de Rijke gave a keynote on the work his team has been doing in the context of data search project with Elsevier.

In the main track, there was an excellent talk on Ad-Hoc Table Retrieval using Semantic Similarity. They looked at finding semantically central columns to provide a rank list of columns. More broadly they are looking at spreadsheet compilation as the task (see and the dataset for that task.) Furthermore, the paper Towards Annotating Relational Data on the Web with Language Models looked at enriching tables through linking into a knowledge graph.

Observational Methods

Observing  user behavior has been a part of research on the Web, any web search engine is driven by that notion. What did seem to be striking is the depth of the observational data being employed. Prof. Lorrie Cranor gave an excellent keynote on the user experience of web security (video here). Did you know that if you read all the privacy policies of all the sites you visit it wold take 244 hours per year? Also, the idea of privacy as nutrition labels is pretty cool:

But what was interesting was her labs use of an observatory of 200 participants who allowed their Windows home computers to be instrumented. This kind of instrumentation gives deep insight into how users actually use their browsers and security settings.

Another example of deep observational data, was the use of mouse tracking on search result pages to detect how people search under anxiety conditions:

In the paper by Wei Sui and co-authors on Computational Creative Advertisements presented at the HumL workshop – they use in-home facial and video tracking to measure emotional response to ads by volunteers.

The final example was the use of FMRI scans to track brain activity of participants during web search tasks. All these examples provide amazing insights into how people use these technologies but as these sorts of methods are more broadly adopted, we need to make sure to adopt the kinds of safe-guards adopted by these researchers – e.g. consent, IRBs, anonymization.

Scientific Content as a Driver

It’s probably our bias but we saw a lot of work tackling scientific content. Probably because it’s both interesting and provides a number of challenges. For example, the best paper of the conference (HighLife) was about extracting n-ary relations for knowledge graph construction motivated by the need for such types of relations in creating biomedical knowledge graphs. The aforementioned work on tabular data often is motivated by the needs of research. Obviously SAVE-SD covered this in detail:

In the demo track, the search engine was presented to summarize and visualization of scientific papers. Kuansan Wang at the BigNet workshop talked about Microsoft Academic Search and the difficulties and opportunities in processing so much scientific data.


Paul gave a keynote at the same workshop also using science as the motivation for new methods for building out knowledge graphs. Slides below:

In the panel, Structured Data on the Web 7.0, Google’s Evgeniy Gabrilovich – creator of the Knowledge Vote – noted the challenges of getting highly correct data for Google’s Medical Knowledge graph and that doing this automatically is still difficult.

Finally, using DOIs for studying persistent identifier use over time on the Web.


Overall, we had a fantastic web conference. Good research, good conversations and good food:

Random Thoughts


Last week, I had the pleasure to be able to attend a bilateral meeting between the Royal Society and the KNAW. The aim was to strengthen the relation between the UK and Dutch scientific communities. The meeting focused on three scientific areas: quantum physics & technology; nanochemistry; and responsible data science. I was there for the latter. The event was held at Chicheley Hall which is a classic baroque English country house (think Pride & Prejudice). It’s a marvelous venue – very much similar in concept to Dagstuhl (but with an English vibe) where you are really wholly immersed in academic conversation.


One of the fun things about the event was getting a glimpse of what other colleagues from other technical disciplines are doing. It was cool to see Prof. Bert Weckhuysen enthusiasm for using imaging technologies to understand catalysts at the nanoscale. Likewise, seeing both the progress and the investment (!) in quantum computing from Prof. Ian Walmsley was informative. I also got an insider intro to the challenges of engineering a quantum computer from Dr. Ruth Oulton.

The responsible data science track had ~15 people. What I liked was that the organizers not only included computer scientists but also legal scholars, politicians, social scientists, philosophers and policy makers. The session consisted primarily of talks but luckily everyone was open to discussion throughout. Broadly, responsible data science covers the ethics of the practice and implications of data science or put another way:

For more context, I suggest starting with two sources: 1) The Dutch consortium on responsible data science 2) the paper 10 Simple Rules for Responsible Big Data Research. I took away two themes both from the track as well as my various chats with people during coffee breaks, dinner and the bar.

1) The computer science community is engaging

It was apparent through out the meeting that the computer science community is confronting the challenges head on. A compelling example was the talk by Dr. Alastair Beresford from Cambridge about Device Analyzer a system that captures the activity of user’s mobile phones in order to provide data to improve device security, which it has:

He talked compellingly about the trade-offs between consent and privacy and how the project tries to manage these issues. In particular, I thought how they handle data sharing with other researchers was interesting. It reminded me very much of how the Dutch Central Bureau of Statistics manages microdata on populations.

Another example was the discussion by Prof. Maarten De Rijke on the work going on with diversity for recommender and search systems. He called out the Conference on Fairness, Accountability, and Transparency (FAT*) that was happening just after this meeting, where the data science community is engaging on these issues. Indeed, one of my colleagues was tweeting from that meeting:

Julian Huppert, former MP, discussed the independent review board setup up by DeepMind Health to enable transparency about their practices. He is part of that board.  Interestingly, Richard Horton, Editor of the Lancet is also part of that board Furthermore, Prof. Bart Jacobs discussed the polymorphic encryption based privacy system he’s developing for a collaboration between Google’s Verily and Radboud University around Parkinson’s disease. This is an example that  even the majors are engaged around these notions of responsibility. To emphasize this engagement notion even more, during the meeting a new report on the Malicious Uses of AI came out from a number or well-known organizations.

One thing that I kept thinking is that we need more assets or concrete artifacts that data scientists can apply in practice.

For example, I like the direction outlined in this article from Dr. Virginia Dignum about defining concrete principles using a design for values based approach. See TU Delft’s Design for Values Institute for more on this kind of approach.

2) Other methods needed

As data scientists, we tend to want to use an experimental / data driven approach even to these notions surrounding responsibility.

Even though I think there’s absolutely a role here for a data driven approach, it’s worth looking at other kinds of more qualitative methods, for example, by using survey instruments or an ethnographic approach or even studying the textual representation of the regulatory apparatus.  For instance, reflecting on the notion of Thick Data is compelling for data science practice. This was brought home by Dr. Ian Brown in his talk on data science and regulation which combined both an economic and survey view:

Personally, I tried to bring some social science literature to bear when discussing the need for transparency in how we source our data. I also argued for the idea that adopting a responsible approach is also actually good for the operational side of data science practice:

While I think it’s important for computer scientists to look at different methods, it’s also important for other disciplines to gain insight into the actual process of data science itself as Dr. Linnet Taylor grappled within in her talk about observing a data governance project.

Overall, I enjoyed both the setting and the content of the meeting. If we can continue to have these sorts of conversations, I think the data science field will be much better placed to deal with the ethical and other implications of our technology.

Random Thoughts

  • Peacocks!
  • Regulating Code – something for the reading list
  • Somebody remind me to bring a jacket next time I go to an English Country house!
  • I always love it when egg codes get brought up when talking about provenance.
  • I was told that I had a “Californian conceptualization” of things – I don’t think it was meant as a complement – but I’ll take it as such 🙂
  • Interesting pointer to work by Seda Gurses about in privacy and software engineering from @1Br0wn
  • Lots of discussion of large internet majors and monopolies. There’s lots of academic work on this but I really like Ben Thompson’s notion of aggregator’s as the way to think about them.
  • Merkle trees are great – but blockchain is a nicer name 😉


Last week, I conferenced! I attended the 16th International Semantic Web Conference (ISWC 2017) in Vienna at the beginning of the week and then headed up to FORCE 2017 in Berlin for the back half of the week. For the last several ISWC, I’ve been involved in the organizing committee, but this year I got to relax. It was a nice chance to just be an attendee and see what was up. This was made even nicer by the really tremendous job Axel, Jeff and their team did  in organizing both the logistics and program. The venues were really amazing and the wifi worked!

Before getting into what I thought were the major themes of the conference, lets do some stats:

  • 624 participants
  • Papers
    • Research track: 197 submissions – 44 accepted – 23% acceptance rate
    • In-use: 27 submissions – 9  accepted – 33% acceptance rate
    • Resources: 76 submissions – 23 accepted – 30% acceptance rate
  • 46 posters & 61 demos
  • Over 1000 reviews were done excluding what was done for the workshop / demos / posters. Just a massive amount of work in helping work get better.

This year they expanded the number of best reviewers and I was happy to be one of them:

You can find all the papers online as preprints.

The three themes I took away from the conference were:

  1. Ecosystems for knowledge engineering
  2. Learn from everything
  3. More media

Ecosystems for knowledge engineering

This was a hard theme to find a title for but there were several talks about how to design and engineer the combination of social and technical processes to build knowledge graphs. Deborah McGuinness in her keynote talked about how it took a village to create effective knowledge driven systems. These systems are the combination of experts, knowledge specialists, systems that do ML, ontologies, and data sources. Summed up by the following slide:

My best idea is that this would fall under the rubric of knowledge engineering. Something that has always been part of the semantic web community. What I saw though was the development of more extensive ideas and guidelines about how to create and put into practice not just human focused systems but entire social-techical ecosystems that leveraged all manner of components.

Some examples: Gil et al.’s paper on  creating a platform for high-quality ontology development and data annotation explicitly discusses the community organization along with the platform used to enable it. Knoblock et al’s paper on creating linked data for the American Art Collaborative discusses not only the technology for generating linked data from heterogenous sources but the need for a collaborative workflow facilitated by a shared space (Github) but also the need for tools used to do expert review.  In one of my favorite papers, Piscopo et al evaluated the the provenance of Wikidata statements and also developed machine learning models that could judge authoritativeness & relevance of potential source material. This could provide a helpful tool in allowing Wikidata editors to garden the statements automatically added by bots. As a last example, Jamie Taylor in his keynote discussed how at Google they have a Knowledge Graph Schema team that is there to support a developers in creating interlocking data structures. The team is focused on supporting and maintaining quality of the knowledge graph.

A big discussion area was the idea coming out of the US for a project / initiative around an Open Knowledge Network introduced by Guha. Again, I’ll put this under the notion of how to create these massive social-technical knowledge systems.

I think more work needs to be done in this space not only with respect to the dynamics of these ecosystems as Michael Lauruhn and I discussed in a recent paper but also from a reuse perspective as Pascal Hitzler has been talking about with ontology design patterns.

Learn from everything

The second theme for me was learning from everything. Essentially, this is the use of the combination of structured knowledge and unstructured data within machine learning scenarios to achieve better results. A good example of this was presented by Achim Rettinger on using cross modal embeddings to improve semantic similarity and type prediction tasks:

Likewise, Nada Lavrač discussed in her keynote how to different approaches for semantic data mining, which also leverages different sources of information for learning. In particular, what was interesting is the use of network analysis to create a smaller knowledge network to learn from.

A couple of other examples include:

It’s worth calling out the winner of the renewed  Semantic Web Challenge from IBM, which used deep learning in combination with sources such as dbpedia, geonames and background assumptions for relation learning.

2017-10-23 20.44.14.jpg

Socrates – Winner SWC

(As an aside, I think it’s pretty cool that the challenge was won by IBM on data provided by Thomson Reuters with an award from Elsevier. Open innovation at its best.)

For a more broad take on the complementarity between deep learning and the semantic web, Dan Brickley’s paper is a fun read. Indeed, as we start to potentially address common sense knowledge we will have to take more opportunity to learn from everywhere.

More media

Finally, I think we saw an increase in the number of works dealing with different forms of media. I really enjoyed the talk on Improving Visual Relationship Detection using Semantic Modeling of Scene Descriptions given by Stephan Brier. Where they used a background knowledge base to improve relation prediction between portions of images:


There was entire session focused on multimodal linked data including talks on audio ( MIDI LOD cloud, the Internet Music Archive as linked data) and images IMGPedia content analyzed linked data descriptions of Wikimedia commons.  You can even mash-up music with the SPARQL-DJ.


DBpedia won the 10 year award paper. 10 years later semantic technologies and in particular the notion of a knowledge graph are mainstream (e.g. Thomson Reuters has a 100 billion node knowledge graph). While we may still be focused too much on the available knowledge graphs  for our research work, it seems to me that the community is branching out to begin to answer a range new questions (how to build knowledge ecosystems?, where does learning fit?, …) about the intersection of semantics and the web.

Random Notes:

A month ago, I had the opportunity to attend the Dagstuhl Seminar  Citizen Science: Design and Engagement. Dagstuhl is really a wonderful place. This was my fifth time there. You can get an impression of the atmosphere from the report I wrote about my first trip there. I have primarily been to Dagstuhl for technical topics in the area of data provenance and semantic data management as well as for conversations about open science/research communication.

This seminar was a great chance for me to learn more about citizen science and discuss its intersection with the practice of open science. There was a great group of people there covering the gamut from creators of citizen science platforms to crowd-sourcing researchers. 17272.01.l

As usual with Dagstuhl seminars, it’s less about presentations and more about the conversations. There will be a report documenting the outcome and hopefully a paper describing the common thoughts of the participants. Neal Reeves took vast amounts of notes so I’m sure that this will be a good report :-). Here’s a whiteboard we had full of input:

2017-07-05 11.28.24.jpg

Thus, instead of trying to relay what we came up with (you’ll have to wait for the report), I’ll just pull out some of my own brief highlights.

Background on Citizen Science

There were a lot of good pointers on where to start understand current thinking around citizen science. First, two tutorials from the seminar:

What do citizen science projects look like:

Example projects:

How should citizen science be pursued:

And a Book:

Open Science & Citizen Science

Claudia Göbel gave an excellent talk about the overlap of citizen science and open science. First, she gave an important reminder that science in particular in the 1700s was done as public demonstrations walking us through the example painting below. 2017-07-04 11.23.02

She then looked at the overlap between citizen science and open science. Summarized below:


A follow-on discussion at the with some of the seminar participants led to input for a whitepaper that is being developed through the ECSA on Citizen & Open Science for Europe. Check out the preliminary draft. I look forward to seeing the outcome.

Questioning Assumptions

One thing that I left the seminar thinking about was was the need to question my own (and my field’s) assumptions. This was really inspired by talking to Chris Welty and reflecting on his work with Lora Aroyo on the issues in human annotation and the construction of gold sets.  Some assumptions to question:

  • What qualifications you need to have to be considered a scientist.
  • Interoperability is a good thing to pursue.
  • Openness is a worthy pursuit.
  • We can safely assume a lack of dynamics in computational systems.
  • That human performance is good performance.

Indeed, in Marissa Ponti she pointed to the example below and highlighted some of the potential ramifications of what each of these (what at first blush are positive) citizen science projects could lead to. 2017-07-03 10.06.36

That being said, the ability to rapidly engage more people in the science system seems to be a good thing indeed. An an assumption I’m happy to hold.


Last week, I was the first Language, Data and Knowledge Conference (LDK 2017) hosted in Galway, Ireland. If you show up at a natural language processing conference (especially someplace like LREC) you’ll find a group of people who think about and use linked/structured data. Likewise, if you show up at a linked data/semantic web conference, you’ll find folks who think about and use NLP. I would characterize LDK2017 as place where that intersection of people can hang out for a couple of days.

The conference had ~80 attendees from my count. I enjoyed the setup of a single track, plenty of time to talk, and also really trying to build the community by doing things together. I also enjoyed the fact that there were 4 keynotes for just two days. It really helped give spark to the conference.

Here are some my take-aways from the conference:

Social science as a new challenge domain

Antal van den Bosch gave an excellent keynote emphasizing the need for what he termed holistic approach to language especially for questions in the humanities and social science (tutorial here). This holistic approach takes into account the rich context that word occur in. In particular, he called out the notions of ideolect and socialect that are ways word are understood/used individually and in a particular social group. He are argued the understanding of these computational is a key notion in driving tasks like recommendation.

I personally was interested in Antal’s joint work with Folgert Karsdorp (checkout his github repos!) on Story Networks – constructing networks of how stories are told and retold. For example, how the story of Red Riding Hood has morphed and changed overtime and what are the key sources for its work. This reminded me of the work on information diffusion in social networks. This has direct bearing on how we can detect and track how ideas and technologies propagate in science communication.

I had a great discussion with SocialAI team (Erica Briscoe & Scott Appling) from Georgia Tech about their work on computational social science. In particular, two pointers: the new DARPA next generation social science program to scale-up social science research and their work on characterizing technology capabilities from data for innovation assessment.

Turning toward the long tail of entities

There were a number of talks that focused on how to deal with entities that aren’t necessarily popular. Bichen Shi presented work done at Nokia Bell Labs on entity mention disambiguation. They used Apache Spark to train 700,000 classifiers – one per every entity mention in wikipedia. This allowed them to obtain much more accurate per-mention entity links. Note they used Gerbil for their evaluation. Likewise, Hendrik ter Horst focused on entity linking specifically targeting technical domains (i.e. MeSH & chemicals). During Q/A it was clear that straight-up gazeetering provides an extremely strong baseline in this task. Marieke van Erp presented work on fine-grained entity typing in Spanish and Dutch using word embeddings to go classify hundreds up types.

Natural language generation from KBs is worth a deeper look

Natural language generation from knowledge bases continues a pace. Kathleen McKeown‘s keynote touched on this, in particular, her recent work on mining paraphrasal templates that combines both knowledge bases and free text.  I was impressed with the work of Nina Dethlefs on using deep learning for generating textual description from  a knowledge base. The key insight was how to quickly generate systems to do NLG where the data was sparse using hierarchical composition. In googling around when writing this trip report I stumbled upon Ehud Reiter’s blog which is a good read.

A couple of nice overview slides

While not a theme, there we’re some really nice slides describingfundamentals.

From C. Maria Keet:

2017-06-20 10.09.40

From Christian Chiarcos/Bettina Klimek:


From Sangha Nam

2017-06-19 11.07.02

Overall, it was a good kick-off to a conference. Very well organized and some nice research.

Random Thoughts

At the end of last week, I was at a small workshop held by the EXCITE project around the state of the art in extracting references from academic papers (in particular PDFs). This was an excellent workshop that brought together people who are deep into the weeds of this subject including, for example, the developers of ParsCit and CERMINE. While reference string extraction sounds fairly obscure the task itself touches on a lot of the challenges one needs in general for making sense of the scholarly literature.

Begin aside: Yes, I did run a conference called Beyond the PDF 2 and  have been known to tweet things like:

But, there’s a lot of great information in papers so we need to get our machines to read. end aside.

You can roughly catergorize the steps of reference extraction as follows:

  1. Extract the structure of the article.  (e.g. find the reference section)
  2. Extract the reference string itself
  3. Parsing the reference string into its parts (e.g. authors, journal, issue number, title, …)

Check out these slides from Dominika Tkaczyk that give a nice visual overview of this process. In general, performance on this task is pretty good (~.9 F1) for the reference parsing step but gets harder when including all steps.

There were three themes that popped out for me:

  1. The reading experience
  2. Resources
  3. Reading from the image

The Reading Experience

Min-Yen Kan gave an excellent talk about how text mining of the academic literature could improve the ability for researchers to come to grips with the state of science. He positioned the field as one where we have the ground work  and are working on building enabling tools (e.g. search, management, policies) but there’s still a long way to go in really building systems that give insights to researchers. As custodian of the ACL Anthology about trying to put these innovations into practice. Prof. Kan is based in Singapore but gave probably one of the best skype talks I have ever been part of it. Slides are below but you should check it out on youtube.

Another example of improving the reading experience was David Thorne‘s presentation around some of the newer things being added to Utopia docs – a souped-up PDF reader. In particular, the work on the Lazarus project which by extracting assertions from the full text of the article allows one to traverse an “idea” graph along side the “citation” graph. On a small note, I really like how the articles that are found can be traversed in the reader without having to download them separately. You can just follow the links. As usual, the Utopia team wins the “we hacked something really cool just now” award by integrating directly with the Excite projects citation lookup API.

Finally, on the reading experience front. Andreas Hotho presented BibSonomy the social reference manager his research group has been operating over the past ten years. It’s a pretty amazing success resulting in 23 papers, 160 papers use the dataset, 96 million google hits, ~1000 weekly active users active. Obviously, it’s a challenge running this user facing software from an academic group but clearly it has paid dividends. The main take away I had in terms of reader experience is that it’s important to identify what types of users you have and how the resulting information they produce can help or hinder in its application for other users (see this paper).


The interesting thing about this area is the number of resources available (both software and data) and how resources are also the outcome of the work (e.g. citation databases).  Here’s a listing of the open resources that I heard called out:

This is not to mention the more general sources of information like, CiteSeer, ArXiv or PubMed, etc. What also was nice to see is how many systems built on-top of other software. I was also happy to see the following:

An interesting issue was the transparency of algorithms and quality of the resulting citation databases.  Nees Jan van Eck from CWTS and developer of VOSViewer gave a nice overview of trying to determine the quality of reference matching in the Web of Science. Likewise, Lee Giles gave a review of his work looking at author disambiguation for CiteSeerX and using an external source to compare that process. A pointer that I hadn’t come across was the work by Jurafsky on author disambiguation:

Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. 2012. Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology 63:5, 1030-1047.

Reading from the image

In the second day of the workshop, we broke out into discussion groups. In my group, we focused on understanding the role of deep learning in the entire extraction process. Almost all the groups are pursing this.

I was thankful to both Akansha Bhardwaj and Roman Kern for walking us through their pipelines. In particular, Akansha is using scanned images of reference sections as her source and starting to apply CNN’s for doing semantic segmentation where they were having pretty good success.

We discussed the potential for doing the task completely from the ground up using a deep neural network. This was an interesting discussion as current state of the art techniques already use quite a lot of positional information for training This can be gotten out of the pdf and some of the systems already use the images directly. However, there’s a lot of fiddling that needs to go on to deal with the pdf contents so maybe the image actual provides a cleaner place to start. However, then we get back to the issue of resources and how to appropriately generate the training data necessary.

Random Notes

  • The organizers set-up a slack backchannel which was useful.
  • I’m not a big fan of skype talks, but they were able to get two important speakers that way and they organized it well. When it’s the difference between having field leaders and not, it makes a big difference.
  • EU projects can have a legacy – Roman Kern is still using code from where Mendeley was a consortium member.
  • Kölsch is dangerous but tasty
  • More workshops should try the noon to noon format.



%d bloggers like this: