Archive

trip report

About two weeks ago, I had the pleasure of attending the 1st Conference on Automated Knowledge Base Construction held in Amherst, Massachusetts. This conference follows up on a number of successful workshops held at venues like NeurIPS and NAACL. Why a conference and not another workshop? The general chair and host of the conference (and he really did feel like a host), Andrew McCallum articulated this as coming from three drivers: 1) the community spans a number of different research areas but was getting its own identity; 2) the workshop was outgrowing typical colocation opportunities and 3) the motivation to have a smaller event where people could really connect in comparison to some larger venues.

I don’t know the exact total but I think there was just over 110 people at the conference. Importantly, there were top people in the field and they stuck around and hung out. The size, the location, the social events (a lovely group walk in the forest in MA), all made it so that the conference achieved the goal of having time to converse in depth. It reminded me a lot of our Provenance Week events in the scale and depth of conversation.

Oh and Amherst is a terribly cute college town:

2019-05-19 16.56.06.jpg

Given that the conference subject is really central to my research, I found it hard to boil down everything into a some themes but I’ll give it a shot:

  • Representational polyglotism
  • So many datasets so little time
  • The challenges of knowledge (graph) engineering
  • There’s lots more to do!

Representational polyglotism

Untitled 2.png

One of the main points that came up frequently both in talks and in conversation was around what one should use as representation language for knowledge bases and for what purpose. Typed graphs have clearly shown their worth over the last 10 years but with the rise of knowledge graphs in a wide variety of industries and applications. The power of the relational approach especially in its probabilistic form  was shown in excellent talks by Lise Getoor on PSL and by Guy van den Broeck. For efficient query answering and efficiency in data usage, symbolic solutions work well. On the other hand, the softness of embedding or even straight textual representations enables the kind of fuzziness that’s inherent in human knowledge. Currently, our approach to unify these two views is often to encode the relational representation in an embedding space, reason about it geometrically, and then through it back over the wall into symbolic/relational space.  This was something that came up frequently and Van den Broek took this head on in his talk.

Then there’s McCallum’s notion of text as a knowledge graph. This approach was used frequently to different degrees, which is to be expected given that much of the contents of KGs is provided through information extraction. In her talk, Laura Dietz, discussed her work where she annotated the edges of a knowledge graph with paragraph text to improve entity ranking in search.  Likewise, the work presented by Yejin Choi around common sense reasoning used natural language as the representational “formalism”. She discussed the ATOMIC (paper) knowledge graph  which represents a crowed sourced common sense knowledge as natural language text triples (e.g. PersonX finds ___ in the literature).  She then described transformer based, BERT-esque, architectures  (COMET: Commonsense Transformers for Knowledge Graph Construction) that perform well on common sense reasoning tasks based on these kinds of representations.

The performance of BERT style language models on all sorts of tasks, led to Sebastian Riedel considering whether one should treat these models as the KB:

IMG_0738.jpg

It turns out that out-of-the box BERT performs pretty well as a knowledge base for single tokens that have been seen frequently by the model. That’s pretty amazing. Is storing all our knowledge in the parameters of a model the way to go? Maybe not but surely it’s good to investigate the extent of the possibilities here. I guess I came away from the event thinking that we are moving toward an environment where KBs will maintain heterogenous representations and that we are at a point where we need to embrace this range of representations to produce results in order face the challenges of the fuzzy. For example, the challenge of reasoning:

or of disagreement around knowledge as discussed by Chris Welty:

wWPblo8e.jpeg

So many datasets so little time

Progress in this field is driven by data and there were a lot of new datasets presented at the conference. Here’s my (probably incomplete) list:

  • OPIEC – from the makers of the MINIE open ie system – 300 million open information extracted triples with a bunch of interesting annotations;
  • TREC CAR dataset – cool task, auto generate articles for a search query;
  • HAnDS – a new dataset for fined grained entity typing  to support thousands of types;
  • HellaSwag – a new dataset for common sense inference designed to be hard for state-of-the-art transformer based architectures (BERT);
  • ShARC – conversational question answering dataset focused on follow-up questions
  • Materials Synthesis annotated data for extraction of material synthesis recipes from text. Look up in their GitHub repo for more interesting stuff
  • MedMentions – annotated corpora of UMLs mentions in biomedical papers from CZI
  • A bunch of datasets that were submitted to EMNLP so expect those to come soon – follow @nlpmattg.

The challenges of knowledge (graph) engineering

Juan Sequeda has been on this topic for a while – large scale knowledge graphs are really difficult to engineer. The team at DiffBot – who were at the conference – are doing a great job of supplying this engineering as a service through their knowledge graph API.  I’ve been working with another start-up SeMI who are also trying to tackle this challenge. But this is still complicated task as underlined for me when talking to Francois Scharffe who organized the recent industry focused Knowledge Graph Conference. The complexity of KG (social-technical) engineering was one of the main themes of that conference. An example of the need to tackle this complexity at AKBC was the work presented about the knowledge engineering going on for the KG behind Apple’s Siri. Xiao Ling emphasized that they spent a lot of their time thinking about and implementing systems for knowledge base construction developer workflow:

Thinking about these sorts of challenges was also behind several of the presentations in  the Open Knowledge Network workshop: Vicki Tardif from the Google Knowledge Graph discussed these issues in particular with reference to the muddiness of knowledge representation (e.g. how to interpret facets of a single entity? or how to align the inconsistencies of people with that of machines?). Jim McCusker and Deborah McGuinness’ work on the provenance/nanopublication driven WhyIs framework for knowledge graph construction is an important in that their software views a knowledge graph not as an output but as a set of tooling for engineering that graph.

The best paper of the conference Alexandria: Unsupervised High-Precision Knowledge Base Construction using a Probabilistic Program was also about how to lower the barrier to defining knowledge base construction steps using a simple probabilistic program. Building a KB from a single seed fact is impressive but then you need the engineering effort to massively scale probabilistic inference.

Alexandra Meliou’s work on using provenance to help diagnose these pipelines was particularly relevant to this issue. I have now added a bunch of her papers to the queue.

There’s lots more to do

One of the things I most appreciated was that many speakers had a set of research challenges at the end of their presentations. So here’s a set of things you could work on in this space curated from the event. Note these may be paraphrased.

  • Laura Dietz:
    • General purpose schema with many types
    • High coverage/recall (40%?)
    • Extraction of complex relations (not just triples + coref)
    • Bridging existing KGs with text
    • Relevant information extraction
    • Query-specific knowledge graphs
  • Fernando Pereira
    • combing source correlation and grounding
  • Guy van den Broeck
    • Do more than link predication
    • Tear down the wall between query evaluation and knowledge base completion
    • The open world assumption – take it seriously
  • Waleed Ammar
    • Bridge sentence level and document level predictions
    • Summarize published results on a given problem
    • Develop tools to facilitate peer review
    • How do we crowd source annotations for a specialized domain
    • What are leading indicators of a papers impact?
  • Sebastian Riedel
    • Determine what BERT actually knows or what it’s guessing
  • Xian Ren
    • Where can we source complex rules that help AKBC?
    • How do we induce transferrable latent structures from pre-trained models?
    • Can we have modular neural networks for modeling compositional rules?
    • Ho do we model “human effort” in the objective function during training?
  • Matt Gardner
    • Make hard reading datasets by baking required reasoning into them

Finally, I think the biggest challenge that was laid down was from Claudia Wagner, which is how to think a bit more introspectively about the theory behind our AKBC methods and how we might even bring the rigor of social science methodology to our technical approaches:

I left AKBC 2019 with a fountain of ideas and research questions, which I count as a success. This is a community to watch.  AKBC 2020 is definitely on my list of events to attend next year.

Random Pointers

 

Two weeks ago, I had the pleasure of attending the 17th International Semantic Web Conference held at Asiolomar Conference Grounds in California. A tremendously beautiful setting in a state park along the ocean. This trip report is somewhat later than normal because I took the opportunity to hang out for another week along the coast of California.

Before getting into the content of the conference, I think it’s worth saying, if you don’t believe that there are capable, talented, smart and awesome women in computer science at every level of seniority, the ISWC 2018 organizing committee + keynote speakers is the mike drop of counter examples:

Now some stats:

  •  438 attendees
  •  Papers
    •  Research Track: 167 submissions – 39 accepted – 23% acceptance rate
    •  In Use: 55 submissions – 17 accepted – 31% acceptance rate
    •  Resources: 31 submissions – 6 accepted – 19% acceptance rate
  •  38 Posters & 39 Demos
  • 14 industry presentations
  • Over 1000 reviews

These are roughly the same as the last time ISWC was held in the United States. So on to the major themes I took away from the conference plus some asides.

Knowledge Graphs as enterprise assets

It was hard to walk away from the conference without being convinced that knowledge graphs are becoming fundamental to delivering modern information solutions in many domains. The enterprise knowledge graph panel was a demonstration of this idea. A big chunk of the majors were represented:

The stats are impressive. Google’s Knowledge Graph has 1 billion things and 70 billion assertions. Facebook’s knowledge graph which they distinguish from their social graph and has just ramped up this year has 50 Million Entities and 500 million assertions. More importantly, they are critical assets for applications, for example, at eBay their KG is central to creating product pages, at Google and Microsoft, KGs are key to entity search and assistants, and at IBM they use it as part of their corporate offerings. But you know it’s really in-use when knowledge graphs are used for emoji:

It wasn’t just the majors who have or are deploying knowledge graphs. The industry track in particular was full of good examples of knowledge graphs being used in practice. Some ones that stood out were: Bosch’s use of knowledge graphs for question answering in DIY, multiple use cases for digital twin management (Siemens, Aibel); use in a healthcare chatbot (Babylon Health); and for helping to regulate the US finance industry (FINRA). I was also very impressed with Diffbot’s platform for creating KGs from the Web. I contributed to the industry session presenting how Elsevier is using knowledge graphs to drive new products in institutional showcasing and healthcare.

Beyond the wide use of knowledge graphs, there was a number of things I took away from this thread of industrial adoption.

  1. Technology heterogeneity is really the norm. All sorts of storage, processing and representation approaches were being used. It’s good we have the W3C Semantic Web stack but it’s even better that the principles of knowledge representation for messy data are being applied. This is exemplified by Amazon Neptune’s support for TinkerPop & SPARQL.
  2. It’s still hard to build these things. Microsoft said it was hard at scale. IBM said it was hard for unique domains. I had several people come to me after my talk about Elsevier’s H-Graph discussing similar challenges faced in other organizations that are trying to bring their data together especially for machine learning based applications. Note, McCusker’s work is some of the better publicly available thinking on trying to address the entire KG construction lifecycle.
  3. Identity is a real challenge. I think one of the important moves in the success of knowledge graphs was not to over ontologize. However, record linkage and thinking when to unify an entity is still not a solved problem. One common approach was towards moving the creation of an identifiable entity closer to query time to deal with the query context but that removes the shared conceptualization that is one of the benefits of a Knowledge Graph. Indeed, the clarion call by Google’s Jamie Taylor to teach knowledge representation was an outcome of the need for people who can think about these kinds of problem.

In terms of research challenges, much of what was discussed reflects the same kinds of ideas that were discussed at the recent Dagstuhl Knowledge Graph Seminar so I’ll point you to my summary from that event.

Finally, for most enterprises, their knowledge graph(s) were considered a unique asset to the company. This led to an interesting discussion about how to share “common knowledge” and the need to be able to merge such knowledge with local knowledge. This leads to my next theme from the conference.

Wikidata as the default option

When discussing “common knowledge”, Wikidata has become a focal point. In the enterprise knowledge graph panel, it was mentioned as the natural place to collaborate on common knowledge. The mechanics of the contribution structure (e.g. open to all, provenance on statements) and institutional attention/authority (i.e. Wikimedia foundation) help with this. An example of Wikidata acting as a default is the use of Wikidata to help collate data on genes

Fittingly enough, Markus Krötzsch and team won the best in-use paper with a convincing demonstration of how well semantic technologies have worked as the query environment for Wikidata. Furthermore, Denny Vrandečić (one of the founders of Wikidata) won the best blue sky paper with the idea of rendering Wikipedia articles directly from Wikidata.

Deep Learning diffusion

As with practically every other conference I’ve been to this year, deep learning as a technique has really been taken up. It’s become just part of the semantic web researchers toolbox. This was particularly clear in the knowledge graph construction area. Papers I liked with DL as part of the solution:

While not DL per sea , I’ll lump embeddings in this section as well. Papers I thought that were interesting are:

The presentation of the above paper was excellent. I particularly liked their slide on related work:

iswc2018-1fd3fcf3.png

As an aside, the work on learning rules and the complementarity of rules to other forms of prediction was an interesting thread in the conference. Besides the above paper, see the work from Heiner Stuckenschmidt’s group on evaluating rules and embedding approaches for knowledge graph completion. The work of Fabian Suchanek’s group on the representativeness of knowledge bases is applicable as well in order to tell whether rule learning from knowledge graphs is coming from a representative source and is also interesting in its own right. Lastly, I thought the use of rules in Beretta et al.’s work to quantify the evidence of an assertion in a knowledge graph to help improve reliability was neat.

Information Quality and Context of Use

The final theme is a bit harder for me to solidify and articulate but it lies at the intersection of information quality and how that information is being used. It’s not just knowing the provenance of information but it’s knowing how information propagates and was intended to be used. Both the upstream and downstream need to be considered. As a consumer of information I want to know the reliability of the information I’m consuming. As a producer I want to know if my information is being used for what it was intended for.

The later problem was demonstrated by the keynote from Jennifer Golbeck on privacy. She touched on a wide variety of work but in particular it’s clear that people don’t know but are concerned with what is happening to their data.

There was also quite a bit of discussion going on about the decentralized web and Tim Berners-Lee’s Solid project throughout the conference. The workshop on decentralization was well attended. Something to keep your eye on.

The keynote by Natasha Noy also touched more broadly on the necessity of quality information this time with respect to scientific data.

The notion of propagation of bias through our information systems was also touched on and is something I’ve been thinking about in terms of data supply chains:

That being said I think there’s an interesting path forward for using technology to address these issues. Yolanda Gil’s work on the need for AI to address our own biases in science is a step forward in that direction. This is a slide from her excellent keynote at SemSci Workshop:

iswc2018-09cc97c4.png

All this is to say that this is an absolutely critical topic and one where the standard “more research is needed” is very true. I’m happy to see this community thinking about it.

Final Thought

The Semantic Web community has produced a lot (see this slide from Nataha’s keynote:

iswc2018-d5af2fed.png

ISWC 2018 definitely added to that body of knowledge but more importantly I think did a fantastic job of reinforcing and exciting the community.

Random Notes

Last week, I was at Dagstuhl for a seminar on knowledge graphs specifically focused on new directions for knowledge representation. Knowledge Graphs have exploded in practice since the release of Google’s Knowledge Graph in 2012. Examples include knowledge graphs at AirBnb, Zalando, and Thomson Reuters. Beyond commercial knowledge graphs, there are many successful academic/public knowledge graphs including WikiData, Yago, and Nell.

The emergence of these knowledge graphs has led to expanded research interest in constructing, producing and maintaining knowledge bases. As an indicator checkout the recent growth in papers using the term knowledge graph (~10x papers per year since 2012):

knowledgegraph-dagstuhl-20180910-f32c5b3e.png

The research in this area is found across fields of computer science ranging from the semantic web community to natural language processing and machine learning and databases. This is reflected in the recent CFP for the new Automated Knowledge Base Construction Conference.

This particular seminar primarily brought together folks who had a “home” community in the semantic web but were deeply engaged with another community. For example, Prof. Maria-Esther Vidal who is well versed in the database literature. This was nice in that there was already quite a lot of common ground but people who could effectively communicate or at least point to what’s happening in other areas. This was different than many of the other Dagstuhl seminars I’ve been to (this was my 6th), which were much more about bringing together different areas. I think both styles are useful but it felt like we could go faster as the language barrier was lower.

The broad aim of the seminar was to come with research challenges coming from the experience that we’ve had over the last 10 years. There will be a follow-up report that should summarize the thoughts of the whole group. There were a lot of sessions and a lot of amazing discussions both during the day and in the evening facilitated by cheese & wine (a benefit of Dagstuhl) so it’s hard to summarize everything even just on a personal level but I wanted to pull out the things that have stuck with me now that I’m back at home:

1) Knowledge Graphs of Everything

We are increasingly seeing knowledge graphs that cover an entire category of entities. For example, Amazon’s product graph aims to be a knowledge graph of all products in the world, one can think of Google and Apple maps as databases of every location in the world, a database of every company that has ever had a web page, or a database of everyone in India. Two things stand out. One, is that these are large sets of instance data. I would contend their focus is not deeply modeling the domain in some expressive logic ala Cyc. Second, a majority of these databases are built by private companies. I think it’s an interesting question as to whether things like Wikidata can equal these private knowledge graphs in a public way.

Once you start thinking at this scale, a number of interesting questions arise: how you keep these massive graphs up to date; can you integrate these graphs, how do you manage access control and policies (“controlled access”); what can you do with this; can we extend these sorts of graphs to the physical system (e.g. in IoT); what about a knowledge graph of happenings (ie. events). Fundamentally, I think this “everything notion” is a useful framing device for research challenges.

2) Knowledge Graphs as a communication medium

A big discussion point during the seminar was the integration of symbolic and sub-symbolic representations. I think that’s obvious given the success of deep learning and importantly in the representation space – embeddings. I liked how Michael Witbrock framed symbols as a strong prior on something being the case. Indeed, using background knowledge has been shown to improve learning performance on several tasks (e.g. Baier et al. 2018, Marino et al. 2017).

But this topic in general got us thinking about the usefulness of knowledge graphs as an exchange mechanism for machines. There’s is a bit of semantic web dogma that expressing things in a variant of logic helps for machine to machine communication. This is true to some degree but you can imagine that machines might like to consume a massive matrix of numbers instead of human readable symbols with logical operators.

Given that, then, what’s the role of knowledge graphs? One can hypothesize that it is for the exchange of large scale information between humanity and machines and vis versa. Currently, when people communicate large amounts of data they turn towards structure (i.e. libraries, websites with strong information architectures, databases). Why not use the same approach to communicate with machines then. Thus, knowledge graphs can be thought of as a useful medium of exchange between what machines are generating and what humanity would like to consume.

On a somewhat less grand note, we discussed the role of integrating different forms of representation in one knowledge graph. For example, keeping images represented as images and audio represented as audio alongside facts within the same knowledge graph. Additionally, we discussed different mechanisms for attaching semantics to the symbols in knowledge graphs (e.g. latent embeddings of symbols). I tried to capture some of that thinking in a brief overview talk.

In general, as we think of knowledge graphs as a communication medium we should think how to both tweak and expand the existing languages of expression we use for them and the semantics of those languages.

3) Knowledge graphs as social-technical processes

The final kind of thing that stuck in my mind is that at the scale we are talking about much of the issues resolve around the notions of the complex interplay between humans and machines in producing, using and maintaining knowledge graphs. This was reflected in multiple threads:

  • Juan Sequeda’s thinking emerging from his practical experience on the need for knowledge / data engineers to build knowledge graphs and the lack of tooling for them. In some sense, this was a call to revisit the work of ontology engineering but now in the light of this larger scale and extensive adoption.
  • The facts established by the work of Wouter Beek and co on empirical semantics that in large scale knowledge graphs actually how people express information differs from the intended underlying semantics.
  • The notions of how biases and perspectives are reflected in knowledge graphs and the steps taken to begin to address these. A good example is the work of wikidata community to present the biases and gaps in its knowledge base.
  • The success of schema.org and managing the overlapping needs of communities. This stood out because of the launch of Google Dataset search service based on schema.org metadata.

While not related directly to knowledge graphs during the seminar the following piece on the relationship between AI systems and humans came was circulating:

Kate Crawford and Vladan Joler, “Anatomy of an AI System: The Amazon Echo As An Anatomical Map of Human Labor, Data and Planetary Resources,” AI Now Institute and Share Lab, (September 7, 2018) https://anatomyof.ai

There is critical need for more data about the interface between the knowledge graph and its maintainers and users.

As I mentioned, there was lots more that was discussed and I hope the eventual report will capture this. Overall, it was fantastic to spend a week with the people below – both fun and thought provoking.

Random ponters:

A couple of weeks ago I was at Provenance Week 2018 – a biennial conference that brings together various communities working on data provenance. Personally, it’s a fantastic event as it’s an opportunity to see the range of work going on from provenance in astronomy data to the newest work on database theory for provenance. Bringing together these various strands is important as there is work from across computer science that touches on data provenance.

The week is anchored by the International Provenance and Annotation Workshop (IPAW) and the Theory and Practice of Provenance (TaPP) and includes events focused on emerging areas of interest including incremental re-computation , provenance-based security and algorithmic accountability. There were 90 attendees up from ~60 in the prior events and here they are:

IMG_0626-2.jpg

The folks at Kings College London, led by Vasa Curcin, did a fantastic job of organizing the event including great social outings on-top of their department building and with a boat ride along the thames. They also catered to the world cup fans as well. Thanks Vasa!

2018-07-11 21.29.07

I had the following major takeaways from the conference:

Improved Capture Systems

The two years since the last provenance week have seen a number of improved systems for capturing provenance. In the systems setting, DARPAs Transparent Computing program has given a boost to scaling out provenance capture systems. These systems use deep operating system instrumentation to capture logs over the past several years these have become more efficient and scalable e.g. Camflow, SPADE. This connects with the work we’ve been doing on improving capture using whole system record-and-replay. You  can now run these systems almost full-time although they capture significant amounts of data (3 days = ~110 GB). Indeed, the folks at Galois presented an impressive looking graph database specifically focused on working with provenance and time series data streaming from these systems.

Beyond the security use case, sciunit.run was a a neat tool using execution traces to produce reproducible computational experiments.

There were also a number of systems for improving the generation of instrumentation to capture provenance. UML2PROV automatically generates provenance instrumentation from UML diagrams and source code using the provenance templates approach. (Also used to capture provenance in an IoT setting.) Curator implements provenance capture for micro-services using existing logging libraries. Similarly, UNICORE now implements provenance for its HPC environment. I still believe structured logging is one of the under rated ways of integrating provenance capture into systems.

Finally, there was some interesting work on reconstructing provenance. In particular, I liked Alexander Rasin‘s work on reconstructing the contents of a database from its environment to answer provenance queries:2018-07-10 16.34.08.jpg

Also, the IPAW best paper looked at using annotations in a workflow to infer dependency relations:

Lastly, there was some initial work on extracting provenance of  health studies directly from published literature which I thought was a interesting way of recovering provenance.

Provenance for Accountability

Another theme (mirrored by the event noted above) was the use of provenance for accountability. This has always been a major use for provenance as pointed out by Bertram Ludäscher in his keynote:

However, I think due to increasing awareness around personal data usage and privacy the need for provenance is being recognized. See, for example, the Royal Society’s report on Data management and use: Governance in the 21st century. At Provenance Week, there were several papers addressing provenance for GDPR, see:

Also, the I was impressed with the demo from Imosphere using provenance for accountability and trust in health data:

Re-computation & Its Applications

Using provenance to determine what to recompute seems to have a number of interesting applications in different domains. Paolo Missier showed for example how it can be used to determine when to recompute in next generation sequencing pipelines.

I particular liked their notion of a re-computation front – what set of past executions do you need to re-execute in order to address the change in data.

Wrattler was a neat extension of the computational notebook idea that showed how provenance can be used to automatically propagate changes through notebook executions and support suggestions.

Marta Mattoso‘s team discussed the application of provenance to track the adjustments when performing steering of executions in complex HPC applications.

The work of Melanie Herschel‘s team on provenance for data integration points to the benefits of potentially applying recomputation using provenance to make the iterative nature of data integration speedier as she enumerated in her presentation at the recomputation worskhop.2018-07-12 15.01.42.jpg

You can see all the abstracts from the workshop here. I understand from Paolo that they will produce a report from the discussions there.

Overall, I left provenance week encouraged by the state of the community, the number of interesting application areas, and the plethora of research questions to work on.

Random Links

 

The early part of last week I attended the Web Science 2018 conference. It was hosted here in Amsterdam which was nice for me. It was nice to be at a conference where I could go home in the evening.

Web Science is an interesting research area in that it treats the Web itself as an object of study. It’s a highly interdisciplinary area that combines primarily social science with computer science. I always envision it as a loop with studies of what’s actually going on the Web leading to new interventions on the Web which we then need to study.

There were what I guess a hundred or so people there … it’s a small but fun community. I won’t give a complete rundown of the conference. You can find summaries of each day done by Cat Morgan (Workshop DayDay 1Day 2Day 3) but instead give an assortment of things that stuck out for me:

And some tweets:

I had the pleasure of attending the Web Conference 2018 in Lyon last week along with my colleague Corey Harper . This is the 27th addition of the largest conference on the World Wide Web. I have tremendous difficulty  not calling it WWW but I’ll learn! Instead of doing two trip reports the rest of this is a combo of Corey and my thoughts. Before getting to what we took away as main themes of the conference let’s look at the stats and organization:

It’s also worth pointing out that this is just the research track. There were 27 workshops,  21 tutorials, 30 demos (Paul was co-chair), 62 posters, four collocated conferences/events, 4 challenges, a developer track and programming track, a project track, an industry track, and… We are probably missing something as well. Suffice to say, even with the best work of the organizers it was hard to figure out what to see. Organizing an event with 2200+ attendees is a thing is a massive task – over 80 chairs were involved not to mention the PC and the local heavy lifting. Congrats to Fabien, Pierre-Antoine, Lionel and the whole committee for pulling it off.  It’s also great to see as well that the proceedings are open access and available on the web.

Given the breadth of the conference, we obviously couldn’t see everything but from our interests we pulled out the following themes:

  • Dealing with a Polluted Web
  • Tackling Tabular Data
  • Observational Methods
  • Scientific Content as a Driver

Dealing with a Polluted Web

The Web community is really owning it’s responsibility to help mitigate the destructive uses to which the Web is put. From the “Recoding Black Mirror” workshop, which we were sad to miss, through the opening keynote and the tracks on Security and Privacy and Fact Checking, this was a major topic throughout the conference.

Oxford professor Luciano Floridi gave an excellent first keynote  on “The Good Web” which addressed this topic head on. He introduced a number of nice metaphors to describe what’s going on:

  • Polluting agents in the Web ecosystem are like extremphiles, making the environment hostile to all but themselves
  • Democracy in some contexts can be like antibiotics: too much gives growth to antibiotic resistant bacteria.
  • His takeaway is that we need a bit of paternalism in this context now.

His talk was pretty compelling,  you can check out the full video here.

Additionally, Corey was able to attend the panel discussion that opened the “Journalism, Misinformation, and Fact-Checking” track, which included representation from the Credibility Coalition, the International Fact Checking Network, MIT, and WikiMedia. There was a discussion of how to set up economies of trust in the age of attention economies, and while some panelists agreed with Floridi’s call for some paternalism, there was also a warning that some techniques we might deploy to mitigate these risks could lead to “accidental authoritarianism.” The Credibility Coalition also provided an interesting review of how to define credibility indicators for news looking at over 16 indicators of credibility.

We were able to see parts of the “Web and Society track”, which included a number of papers related to social justice oriented themes. This included an excellent paper that showed how recommender systems in social networks often exacerbate and amplify gender and racial disparity in social network connections and engagement. Additionally, many papers addressed the relationship between the mainstream media and the web. (e.g. political polarization and social media, media and public attention using the web).

Some more examples: The best demo was awarded to a system that automatically analyzed privacy policies of websites and summarized them with respect to GDPR and:

More generally, it seems the question is how do we achieve quality assessment at scale?

Tackling Tabular Data

Knowledge graphs and heterogenous networks (there was a workshop on that) were a big part of the conference. Indeed the test of time paper award went to the original Yago paper. There were a number of talks about improving knowledge graphs for example for improving on question answering tasks, determining attributes that are needed to complete a KG or improving relation extraction. While tables have always been an input to knowledge graph construction (e.g. wikpedia infoboxes), an interesting turn was towards treating tabular data as a focus area.

As Natasha Noy from Google noted in her  keynote at the SAVE-SD workshop,  this is an area with a number of exciting research challenges:img_0034_google_savesd.jpg

There was a workshop on data search with a number of papers on the theme. In that workshop, Maarten de Rijke gave a keynote on the work his team has been doing in the context of data search project with Elsevier.

In the main track, there was an excellent talk on Ad-Hoc Table Retrieval using Semantic Similarity. They looked at finding semantically central columns to provide a rank list of columns. More broadly they are looking at spreadsheet compilation as the task (see smarttables.cc and the dataset for that task.) Furthermore, the paper Towards Annotating Relational Data on the Web with Language Models looked at enriching tables through linking into a knowledge graph.

Observational Methods

Observing  user behavior has been a part of research on the Web, any web search engine is driven by that notion. What did seem to be striking is the depth of the observational data being employed. Prof. Lorrie Cranor gave an excellent keynote on the user experience of web security (video here). Did you know that if you read all the privacy policies of all the sites you visit it wold take 244 hours per year? Also, the idea of privacy as nutrition labels is pretty cool:

But what was interesting was her labs use of an observatory of 200 participants who allowed their Windows home computers to be instrumented. This kind of instrumentation gives deep insight into how users actually use their browsers and security settings.

Another example of deep observational data, was the use of mouse tracking on search result pages to detect how people search under anxiety conditions:

In the paper by Wei Sui and co-authors on Computational Creative Advertisements presented at the HumL workshop – they use in-home facial and video tracking to measure emotional response to ads by volunteers.

The final example was the use of FMRI scans to track brain activity of participants during web search tasks. All these examples provide amazing insights into how people use these technologies but as these sorts of methods are more broadly adopted, we need to make sure to adopt the kinds of safe-guards adopted by these researchers – e.g. consent, IRBs, anonymization.

Scientific Content as a Driver

It’s probably our bias but we saw a lot of work tackling scientific content. Probably because it’s both interesting and provides a number of challenges. For example, the best paper of the conference (HighLife) was about extracting n-ary relations for knowledge graph construction motivated by the need for such types of relations in creating biomedical knowledge graphs. The aforementioned work on tabular data often is motivated by the needs of research. Obviously SAVE-SD covered this in detail:

In the demo track, the etymo.io search engine was presented to summarize and visualization of scientific papers. Kuansan Wang at the BigNet workshop talked about Microsoft Academic Search and the difficulties and opportunities in processing so much scientific data.

IMG_0495.JPG

Paul gave a keynote at the same workshop also using science as the motivation for new methods for building out knowledge graphs. Slides below:

In the panel, Structured Data on the Web 7.0, Google’s Evgeniy Gabrilovich – creator of the Knowledge Vote – noted the challenges of getting highly correct data for Google’s Medical Knowledge graph and that doing this automatically is still difficult.

Finally, using DOIs for studying persistent identifier use over time on the Web.

Wrap-up

Overall, we had a fantastic web conference. Good research, good conversations and good food:

Random Thoughts

 

Last week, I had the pleasure to be able to attend a bilateral meeting between the Royal Society and the KNAW. The aim was to strengthen the relation between the UK and Dutch scientific communities. The meeting focused on three scientific areas: quantum physics & technology; nanochemistry; and responsible data science. I was there for the latter. The event was held at Chicheley Hall which is a classic baroque English country house (think Pride & Prejudice). It’s a marvelous venue – very much similar in concept to Dagstuhl (but with an English vibe) where you are really wholly immersed in academic conversation.

.IMG_0290

One of the fun things about the event was getting a glimpse of what other colleagues from other technical disciplines are doing. It was cool to see Prof. Bert Weckhuysen enthusiasm for using imaging technologies to understand catalysts at the nanoscale. Likewise, seeing both the progress and the investment (!) in quantum computing from Prof. Ian Walmsley was informative. I also got an insider intro to the challenges of engineering a quantum computer from Dr. Ruth Oulton.

The responsible data science track had ~15 people. What I liked was that the organizers not only included computer scientists but also legal scholars, politicians, social scientists, philosophers and policy makers. The session consisted primarily of talks but luckily everyone was open to discussion throughout. Broadly, responsible data science covers the ethics of the practice and implications of data science or put another way:

For more context, I suggest starting with two sources: 1) The Dutch consortium on responsible data science 2) the paper 10 Simple Rules for Responsible Big Data Research. I took away two themes both from the track as well as my various chats with people during coffee breaks, dinner and the bar.

1) The computer science community is engaging

It was apparent through out the meeting that the computer science community is confronting the challenges head on. A compelling example was the talk by Dr. Alastair Beresford from Cambridge about Device Analyzer a system that captures the activity of user’s mobile phones in order to provide data to improve device security, which it has:

He talked compellingly about the trade-offs between consent and privacy and how the project tries to manage these issues. In particular, I thought how they handle data sharing with other researchers was interesting. It reminded me very much of how the Dutch Central Bureau of Statistics manages microdata on populations.

Another example was the discussion by Prof. Maarten De Rijke on the work going on with diversity for recommender and search systems. He called out the Conference on Fairness, Accountability, and Transparency (FAT*) that was happening just after this meeting, where the data science community is engaging on these issues. Indeed, one of my colleagues was tweeting from that meeting:

Julian Huppert, former MP, discussed the independent review board setup up by DeepMind Health to enable transparency about their practices. He is part of that board.  Interestingly, Richard Horton, Editor of the Lancet is also part of that board Furthermore, Prof. Bart Jacobs discussed the polymorphic encryption based privacy system he’s developing for a collaboration between Google’s Verily and Radboud University around Parkinson’s disease. This is an example that  even the majors are engaged around these notions of responsibility. To emphasize this engagement notion even more, during the meeting a new report on the Malicious Uses of AI came out from a number or well-known organizations.

One thing that I kept thinking is that we need more assets or concrete artifacts that data scientists can apply in practice.

For example, I like the direction outlined in this article from Dr. Virginia Dignum about defining concrete principles using a design for values based approach. See TU Delft’s Design for Values Institute for more on this kind of approach.

2) Other methods needed

As data scientists, we tend to want to use an experimental / data driven approach even to these notions surrounding responsibility.

Even though I think there’s absolutely a role here for a data driven approach, it’s worth looking at other kinds of more qualitative methods, for example, by using survey instruments or an ethnographic approach or even studying the textual representation of the regulatory apparatus.  For instance, reflecting on the notion of Thick Data is compelling for data science practice. This was brought home by Dr. Ian Brown in his talk on data science and regulation which combined both an economic and survey view:

Personally, I tried to bring some social science literature to bear when discussing the need for transparency in how we source our data. I also argued for the idea that adopting a responsible approach is also actually good for the operational side of data science practice:

While I think it’s important for computer scientists to look at different methods, it’s also important for other disciplines to gain insight into the actual process of data science itself as Dr. Linnet Taylor grappled within in her talk about observing a data governance project.

Overall, I enjoyed both the setting and the content of the meeting. If we can continue to have these sorts of conversations, I think the data science field will be much better placed to deal with the ethical and other implications of our technology.

Random Thoughts

  • Peacocks!
  • Regulating Code – something for the reading list
  • Somebody remind me to bring a jacket next time I go to an English Country house!
  • I always love it when egg codes get brought up when talking about provenance.
  • I was told that I had a “Californian conceptualization” of things – I don’t think it was meant as a complement – but I’ll take it as such 🙂
  • Interesting pointer to work by Seda Gurses about in privacy and software engineering from @1Br0wn
  • Lots of discussion of large internet majors and monopolies. There’s lots of academic work on this but I really like Ben Thompson’s notion of aggregator’s as the way to think about them.
  • Merkle trees are great – but blockchain is a nicer name 😉

 

%d bloggers like this: