Archive

Tag Archives: dagstuhl

Last week, I was at Dagstuhl for a seminar on knowledge graphs specifically focused on new directions for knowledge representation. Knowledge Graphs have exploded in practice since the release of Google’s Knowledge Graph in 2012. Examples include knowledge graphs at AirBnb, Zalando, and Thomson Reuters. Beyond commercial knowledge graphs, there are many successful academic/public knowledge graphs including WikiData, Yago, and Nell.

The emergence of these knowledge graphs has led to expanded research interest in constructing, producing and maintaining knowledge bases. As an indicator checkout the recent growth in papers using the term knowledge graph (~10x papers per year since 2012):

knowledgegraph-dagstuhl-20180910-f32c5b3e.png

The research in this area is found across fields of computer science ranging from the semantic web community to natural language processing and machine learning and databases. This is reflected in the recent CFP for the new Automated Knowledge Base Construction Conference.

This particular seminar primarily brought together folks who had a “home” community in the semantic web but were deeply engaged with another community. For example, Prof. Maria-Esther Vidal who is well versed in the database literature. This was nice in that there was already quite a lot of common ground but people who could effectively communicate or at least point to what’s happening in other areas. This was different than many of the other Dagstuhl seminars I’ve been to (this was my 6th), which were much more about bringing together different areas. I think both styles are useful but it felt like we could go faster as the language barrier was lower.

The broad aim of the seminar was to come with research challenges coming from the experience that we’ve had over the last 10 years. There will be a follow-up report that should summarize the thoughts of the whole group. There were a lot of sessions and a lot of amazing discussions both during the day and in the evening facilitated by cheese & wine (a benefit of Dagstuhl) so it’s hard to summarize everything even just on a personal level but I wanted to pull out the things that have stuck with me now that I’m back at home:

1) Knowledge Graphs of Everything

We are increasingly seeing knowledge graphs that cover an entire category of entities. For example, Amazon’s product graph aims to be a knowledge graph of all products in the world, one can think of Google and Apple maps as databases of every location in the world, a database of every company that has ever had a web page, or a database of everyone in India. Two things stand out. One, is that these are large sets of instance data. I would contend their focus is not deeply modeling the domain in some expressive logic ala Cyc. Second, a majority of these databases are built by private companies. I think it’s an interesting question as to whether things like Wikidata can equal these private knowledge graphs in a public way.

Once you start thinking at this scale, a number of interesting questions arise: how you keep these massive graphs up to date; can you integrate these graphs, how do you manage access control and policies (“controlled access”); what can you do with this; can we extend these sorts of graphs to the physical system (e.g. in IoT); what about a knowledge graph of happenings (ie. events). Fundamentally, I think this “everything notion” is a useful framing device for research challenges.

2) Knowledge Graphs as a communication medium

A big discussion point during the seminar was the integration of symbolic and sub-symbolic representations. I think that’s obvious given the success of deep learning and importantly in the representation space – embeddings. I liked how Michael Witbrock framed symbols as a strong prior on something being the case. Indeed, using background knowledge has been shown to improve learning performance on several tasks (e.g. Baier et al. 2018, Marino et al. 2017).

But this topic in general got us thinking about the usefulness of knowledge graphs as an exchange mechanism for machines. There’s is a bit of semantic web dogma that expressing things in a variant of logic helps for machine to machine communication. This is true to some degree but you can imagine that machines might like to consume a massive matrix of numbers instead of human readable symbols with logical operators.

Given that, then, what’s the role of knowledge graphs? One can hypothesize that it is for the exchange of large scale information between humanity and machines and vis versa. Currently, when people communicate large amounts of data they turn towards structure (i.e. libraries, websites with strong information architectures, databases). Why not use the same approach to communicate with machines then. Thus, knowledge graphs can be thought of as a useful medium of exchange between what machines are generating and what humanity would like to consume.

On a somewhat less grand note, we discussed the role of integrating different forms of representation in one knowledge graph. For example, keeping images represented as images and audio represented as audio alongside facts within the same knowledge graph. Additionally, we discussed different mechanisms for attaching semantics to the symbols in knowledge graphs (e.g. latent embeddings of symbols). I tried to capture some of that thinking in a brief overview talk.

In general, as we think of knowledge graphs as a communication medium we should think how to both tweak and expand the existing languages of expression we use for them and the semantics of those languages.

3) Knowledge graphs as social-technical processes

The final kind of thing that stuck in my mind is that at the scale we are talking about much of the issues resolve around the notions of the complex interplay between humans and machines in producing, using and maintaining knowledge graphs. This was reflected in multiple threads:

  • Juan Sequeda’s thinking emerging from his practical experience on the need for knowledge / data engineers to build knowledge graphs and the lack of tooling for them. In some sense, this was a call to revisit the work of ontology engineering but now in the light of this larger scale and extensive adoption.
  • The facts established by the work of Wouter Beek and co on empirical semantics that in large scale knowledge graphs actually how people express information differs from the intended underlying semantics.
  • The notions of how biases and perspectives are reflected in knowledge graphs and the steps taken to begin to address these. A good example is the work of wikidata community to present the biases and gaps in its knowledge base.
  • The success of schema.org and managing the overlapping needs of communities. This stood out because of the launch of Google Dataset search service based on schema.org metadata.

While not related directly to knowledge graphs during the seminar the following piece on the relationship between AI systems and humans came was circulating:

Kate Crawford and Vladan Joler, “Anatomy of an AI System: The Amazon Echo As An Anatomical Map of Human Labor, Data and Planetary Resources,” AI Now Institute and Share Lab, (September 7, 2018) https://anatomyof.ai

There is critical need for more data about the interface between the knowledge graph and its maintainers and users.

As I mentioned, there was lots more that was discussed and I hope the eventual report will capture this. Overall, it was fantastic to spend a week with the people below – both fun and thought provoking.

Random ponters:

A month ago, I had the opportunity to attend the Dagstuhl Seminar  Citizen Science: Design and Engagement. Dagstuhl is really a wonderful place. This was my fifth time there. You can get an impression of the atmosphere from the report I wrote about my first trip there. I have primarily been to Dagstuhl for technical topics in the area of data provenance and semantic data management as well as for conversations about open science/research communication.

This seminar was a great chance for me to learn more about citizen science and discuss its intersection with the practice of open science. There was a great group of people there covering the gamut from creators of citizen science platforms to crowd-sourcing researchers. 17272.01.l

As usual with Dagstuhl seminars, it’s less about presentations and more about the conversations. There will be a report documenting the outcome and hopefully a paper describing the common thoughts of the participants. Neal Reeves took vast amounts of notes so I’m sure that this will be a good report :-). Here’s a whiteboard we had full of input:

2017-07-05 11.28.24.jpg

Thus, instead of trying to relay what we came up with (you’ll have to wait for the report), I’ll just pull out some of my own brief highlights.

Background on Citizen Science

There were a lot of good pointers on where to start understand current thinking around citizen science. First, two tutorials from the seminar:

What do citizen science projects look like:

Example projects:

How should citizen science be pursued:

And a Book:

Open Science & Citizen Science

Claudia Göbel gave an excellent talk about the overlap of citizen science and open science. First, she gave an important reminder that science in particular in the 1700s was done as public demonstrations walking us through the example painting below. 2017-07-04 11.23.02

She then looked at the overlap between citizen science and open science. Summarized below:

citizenopenscience.png

A follow-on discussion at the with some of the seminar participants led to input for a whitepaper that is being developed through the ECSA on Citizen & Open Science for Europe. Check out the preliminary draft. I look forward to seeing the outcome.

Questioning Assumptions

One thing that I left the seminar thinking about was was the need to question my own (and my field’s) assumptions. This was really inspired by talking to Chris Welty and reflecting on his work with Lora Aroyo on the issues in human annotation and the construction of gold sets.  Some assumptions to question:

  • What qualifications you need to have to be considered a scientist.
  • Interoperability is a good thing to pursue.
  • Openness is a worthy pursuit.
  • We can safely assume a lack of dynamics in computational systems.
  • That human performance is good performance.

Indeed, in Marissa Ponti she pointed to the example below and highlighted some of the potential ramifications of what each of these (what at first blush are positive) citizen science projects could lead to. 2017-07-03 10.06.36

That being said, the ability to rapidly engage more people in the science system seems to be a good thing indeed. An an assumption I’m happy to hold.

Random

Last week, I was at a seminar on Semantic Data Management at Dagstuhl. A month ago I was at Dagstuhl discussing the principles of provenance. You can read more about the atmosphere and style of a Dagstuhl event at the post on the provenance event. From my perspective, it’s pretty cool to get invited to multiple Dagstuhl events in short succession… I think it just happens that two of my main research areas overlap and were scheduled in the same time period.

Semantic Data Management Group photo

Obligatory Dagstuhl Group Photo

Indeed, one of the topics for discussion at the seminar was provenance. The others were scalability, dynamcity, and search. The organizers (Elena Simprel, Karl AbererGrigoris AntoniouOscar Corcho and Rudi Studer) will put together a report summarizing all the outcomes. What I wanted to do is focus on the key points that I took away from the seminar.

Scaling semantic data management = scaling graph databases

There was some discussion around what it means to scale in terms of semantic data management. For the most part this boiled down to, what does it mean to scale RDF databases? The organizers did a good job of bringing members of industry in that have actual experience in building scalable RDF systems. The first day contained some great discussion about the guts of databases and what makes scaling hard – issues such as the latency of storage infrastructure and what the right join algorithm were. Steve Harris brought out the difficulty of backup and restore in real world systems and the lack of research in that area.  But my primary feeling was the challenges of scalability are ones of how we deal with large graphs. In my open work in Open PHACTs, I’ve seen how using graphs has increased our flexibility but challenged us in terms of scalability.

Dealing with large graphs is hard but I think the Semantic Web community can lead the way here because we have a nice substrate, namely, an exchange model for graphs and a common query language.  This leads to the next point:

Benchmarks! Benchmarks! Benchmarks!

Throughout the week there was discussion of the need for all types of benchmarks. LUBM and BSBM  have served us well but we need better benchmarks: more and different types of queries, more realistic datasets, configurable benchmarks, etc.  There was also discussions of other types of benchmarks, for example, a provenance corpus or a corpus that combines structured and unstructured data for ranking. One comment that I heard in terms of benchmarks is where should you publish them? Unlike the IR community we don’t have  something thing like TREC. Although, I think USEWOD is a good example of bootstrapping this sort of activity.

Let’s just be uncertain

One of the cross-cutting themes of the symposium was the need to deal with uncertainty. From dealing with crawled data, to information extraction systems, to even data created by classic knowledge capture, there is a need to express and use uncertainty.  In the area of provenance, I was impressed Martin Theobald’s URDF system that deals with both uncertain data and uncertain rules.

One major handicap is that RDF systems have is that reification let’s you associate confidence values with statements but is just extremely verbose. At the symposium, Bryan Thompson  and Orri Erling led the way in constructing a proposal to expose statement level identifiers that are compatible with reification. Olaf Hartig even worked out an approach that makes this compatible with SPARQL semantics. I’m looking forward to seeing their final proposal. This will make associating uncertainity and other evidence related information to triples.

One final thing to say is that these discussions made me glad that attributes are included in the PROV model. This provides an important hook for this kind of uncertainty information.

Crowdsourcing is a component

There was quite a lot of talk about integrating crowdsourcing into the data management stack (See Wolf-Tilo Balke’s work). It’s clear that when we are designing semantic data management systems that crowdsourcing is clearly an important component. Just as ontology engineers are boxes in many of our architectures maybe the crowd should be there by default as well.

Provenance – get it out – we’re ready

Beyond being a discussant in the conversation. I also gave an intro to provenance research based on the categorization of content, management and use produced in the Provenance Incubator. Luc Moreau, Olaf Hartig and Paolo Missier gave a walkthrough of the PROV spec coming from the W3C.  We had some interesting technical feedback but the general impression I got was – it looks pretty good, get it out there, this is something we need and can use – now.

For example, I had discussions with Manuel Salvadores about using PROV as the ontology for describing provenance in BioPortal. Satya S. Sahoo (a working group member) is extending PROV for capturing provenance in sleep studies. There was discussion of connecting PROV with the Semantic Sensor Network ontology. As with other Semantic Web standards, PROV will provide the basis for both applications and future research. It’s now up to us as working group to get these documents out.

Embracing other communities

I think the community as a whole has been doing a good job in embracing other communities. This has been shown by those working on RDF stores who have embraced the database community. Also in semantic search there is a good conversation that is bridging the IR community and the database field. Interestingly, semantic search is really a driver of that conversation. I learned about a good survey paper by Thanh Tran and Peter Mika at Dagstuhl – highly recommended.

Federation is a spectrum

There was lots of talk about federation at the symposium. My general impression is that federation is not something that we can say yes or no. Instead different applications will require different kinds of federation. I think there is lots of room to research how we can systematically place systems on the federation spectrum. I come with a series of requirements, where and how should I include federation in my data management scheme. For example, I may want to trade-of computational overhead for space as suggested by Olaf Hartig in his Link Traversal Based Query Execution approach (i.e. follow your nose). This caused some of the most entertaining discussions at the symposium. Should you need a data center to query the Web of Data? Let’s find out.

Conclusion

I think the report coming from this symposium will provide a good document sketching out the research challenges in semantic data management for the next several years. I’m looking forward to it. I’ll end with a quote from a slide in José Manuel Gomez-Perez‘s talk. According to the IDC 2011 Digital Universe study, metadata is the fastest growing data category.

There’s demand for the work we are doing and there are many challenges remaining – this promises to be a fun couple of years.

Last week, I attended a workshop at Dagstuhl on the Principles of Provenance. Before talking about the content of the workshop itself, it’s worth describing the experience of Dagstuhl. The venue itself is a manor house located pretty much in the middle of nowhere in southwest Germany.  The nature around the venue is nice and it really is away from it all. All the attendees stay in the manor house so you spend not only the scheduled workshop times with your colleagues but also breakfast, lunch, dinner and evenings. They also have small tricks to ensure that everyone mingles, for example, by pseudo-randomly seating people at different tables for meals. Additionally, Dagstuhl is specifically for computer science – they have a good internet connection and one of the best computer science libraries I’ve seen.  All these things together make Dagstuhl a unique  intellectually intense environment. It’s one of the nicest traditions in computer science.

Me at the Principles of Provenance workshop

With that context in mind, the organizers of the Principles of Provenance workshop (James Cheney, Wang-Chiew Tan, Bertram Ludaescher, Stijn Vansummeren) brought together computer scientists studying provenance from the perspective of databases, the semantic web, scientific workflow, programming languages and software engineering. While I knew most of the people in this broad community (at least from paper titles), I met some new people and got to know people better. The organizers started the workshop with overviews of provenance work from 4 areas:

  1. Provenance in Database Systems
  2. Provenance in Workflows and Scientific Computation
  3. Provenance in Software Engineer, programming languages and security
  4. Provenance interchange on the web (i.e. the w3C standardization effort)

These tutorials were a great idea because they provided a common basis for communication throughout the week. The rest of the week combined quite a few talks and plenty of discussion The organizers are putting together a report right now containing abstracts and presentations so I won’t go into that more here. What I do want to do is pull out 3 take-aways that I had from the week.

1) Connecting consensus models to formal foundations

Because provenance often spans multiple systems (my data is often sourced from somewhere else), there is a need for provenance systems to interoperate. There have be a number of efforts to enable this interoperability including the creation of the Open Provenance Model as well as the current standardization effort at the W3C. Because these efforts are trying to bridge across multiple implementation, they are driven by community consensus: what models can we agree upon, what is minimally necessary for interchange, what is easy to understand and implement.

Separately, there is quite a lot of work on formal foundations of provenance especially within the database community. This work is grounded in applications but also in formal theory that ensures that provenance information has nice properties. Concretely, one can show that certain types of provenance within a database context can be expressed as polynomials, algebraically manipulated, and also related. (semirings!) Plus, provenance polynomials sounds nice. Check out T.J. Green’s thesis for starters:

Todd J. Green. Collaborative Data Sharing with Mappings and Provenance. PhD thesis, University of Pennsylvania, 2009

During the workshop, it became clear to me that the consensus based models (which are often graphical in nature) can not only be formalized but also be directly connected to these database focused formalizations. I just needed to get over the differences in syntax.  This could imply that we could have nice way to trace provenance across systems and through databases and be able to understand the mathematical properties of this interconnection.

2) Social implications of producing provenance

For a couple of years now, I’ve been asked by people and have asked myself, so what do you do with provenance? I think there are a lot of good answers for that (e.g. requirements for provenance in e-science). However, the community has spent a lot of time thinking about how to capture provenance from a technical point of view asking questions like: how do we instrument systems? how do we store provenance efficiently? can we leverage execution environments for tracing?

At Dagstuhl, Carole Goble asked another question, why would people record and share provenance in the first place? There are big social implications that we need to grapple with: producing provenance may expose information that we are not ready to share, it may require us to change work practice  leading to effort that we may not want to give or it may be in form that is to raw to be useful. Developing techniques to address these issues is from my point of view a new and important area of work.

From my perspective, we are starting to work on the ideas of how to reconstruct provenance from data that will hopefully reduce the effort for producers of provenance.

3) Provenance is important for messy data integration

A key usecase for  provenance is tracking back to original data sources after data has been integrated. This is particularly important when the data integration requires complex processing (e.g. natural language processing). Christopher Ré gave a fantastic example of this with a demonstration the WiscI system part of the Hazy project. This application enriches Wikipedia pages with facts collected from a (~40 TB) web crawl and provides links back to a supporting source for those facts. It was a great example of how provenance is really foundational to providing confidence in these systems.

Beyond these points, there was a lot more discussed, which will be summarized in the forthcoming report. This was a great workshop for me. From my point of view, I wanted to thank the organizers for putting it together. It’s a lot of effort. Additionally, thanks to all of the participants for really great conversations.

%d bloggers like this: