Archive

trip report

Last week, I conferenced! I attended the 16th International Semantic Web Conference (ISWC 2017) in Vienna at the beginning of the week and then headed up to FORCE 2017 in Berlin for the back half of the week. For the last several ISWC, I’ve been involved in the organizing committee, but this year I got to relax. It was a nice chance to just be an attendee and see what was up. This was made even nicer by the really tremendous job Axel, Jeff and their team did  in organizing both the logistics and program. The venues were really amazing and the wifi worked!

Before getting into what I thought were the major themes of the conference, lets do some stats:

  • 624 participants
  • Papers
    • Research track: 197 submissions – 44 accepted – 23% acceptance rate
    • In-use: 27 submissions – 9  accepted – 33% acceptance rate
    • Resources: 76 submissions – 23 accepted – 30% acceptance rate
  • 46 posters & 61 demos
  • Over 1000 reviews were done excluding what was done for the workshop / demos / posters. Just a massive amount of work in helping work get better.

This year they expanded the number of best reviewers and I was happy to be one of them:

You can find all the papers online as preprints.

The three themes I took away from the conference were:

  1. Ecosystems for knowledge engineering
  2. Learn from everything
  3. More media

Ecosystems for knowledge engineering

This was a hard theme to find a title for but there were several talks about how to design and engineer the combination of social and technical processes to build knowledge graphs. Deborah McGuinness in her keynote talked about how it took a village to create effective knowledge driven systems. These systems are the combination of experts, knowledge specialists, systems that do ML, ontologies, and data sources. Summed up by the following slide:

My best idea is that this would fall under the rubric of knowledge engineering. Something that has always been part of the semantic web community. What I saw though was the development of more extensive ideas and guidelines about how to create and put into practice not just human focused systems but entire social-techical ecosystems that leveraged all manner of components.

Some examples: Gil et al.’s paper on  creating a platform for high-quality ontology development and data annotation explicitly discusses the community organization along with the platform used to enable it. Knoblock et al’s paper on creating linked data for the American Art Collaborative discusses not only the technology for generating linked data from heterogenous sources but the need for a collaborative workflow facilitated by a shared space (Github) but also the need for tools used to do expert review.  In one of my favorite papers, Piscopo et al evaluated the the provenance of Wikidata statements and also developed machine learning models that could judge authoritativeness & relevance of potential source material. This could provide a helpful tool in allowing Wikidata editors to garden the statements automatically added by bots. As a last example, Jamie Taylor in his keynote discussed how at Google they have a Knowledge Graph Schema team that is there to support a developers in creating interlocking data structures. The team is focused on supporting and maintaining quality of the knowledge graph.

A big discussion area was the idea coming out of the US for a project / initiative around an Open Knowledge Network introduced by Guha. Again, I’ll put this under the notion of how to create these massive social-technical knowledge systems.

I think more work needs to be done in this space not only with respect to the dynamics of these ecosystems as Michael Lauruhn and I discussed in a recent paper but also from a reuse perspective as Pascal Hitzler has been talking about with ontology design patterns.

Learn from everything

The second theme for me was learning from everything. Essentially, this is the use of the combination of structured knowledge and unstructured data within machine learning scenarios to achieve better results. A good example of this was presented by Achim Rettinger on using cross modal embeddings to improve semantic similarity and type prediction tasks:

Likewise, Nada Lavrač discussed in her keynote how to different approaches for semantic data mining, which also leverages different sources of information for learning. In particular, what was interesting is the use of network analysis to create a smaller knowledge network to learn from.

A couple of other examples include:

It’s worth calling out the winner of the renewed  Semantic Web Challenge from IBM, which used deep learning in combination with sources such as dbpedia, geonames and background assumptions for relation learning.

2017-10-23 20.44.14.jpg

Socrates – Winner SWC

(As an aside, I think it’s pretty cool that the challenge was won by IBM on data provided by Thomson Reuters with an award from Elsevier. Open innovation at its best.)

For a more broad take on the complementarity between deep learning and the semantic web, Dan Brickley’s paper is a fun read. Indeed, as we start to potentially address common sense knowledge we will have to take more opportunity to learn from everywhere.

More media

Finally, I think we saw an increase in the number of works dealing with different forms of media. I really enjoyed the talk on Improving Visual Relationship Detection using Semantic Modeling of Scene Descriptions given by Stephan Brier. Where they used a background knowledge base to improve relation prediction between portions of images:

tresp.png

There was entire session focused on multimodal linked data including talks on audio ( MIDI LOD cloud, the Internet Music Archive as linked data) and images IMGPedia content analyzed linked data descriptions of Wikimedia commons.  You can even mash-up music with the SPARQL-DJ.

Conclusion

DBpedia won the 10 year award paper. 10 years later semantic technologies and in particular the notion of a knowledge graph are mainstream (e.g. Thomson Reuters has a 100 billion node knowledge graph). While we may still be focused too much on the available knowledge graphs  for our research work, it seems to me that the community is branching out to begin to answer a range new questions (how to build knowledge ecosystems?, where does learning fit?, …) about the intersection of semantics and the web.

Random Notes:

Advertisements

A month ago, I had the opportunity to attend the Dagstuhl Seminar  Citizen Science: Design and Engagement. Dagstuhl is really a wonderful place. This was my fifth time there. You can get an impression of the atmosphere from the report I wrote about my first trip there. I have primarily been to Dagstuhl for technical topics in the area of data provenance and semantic data management as well as for conversations about open science/research communication.

This seminar was a great chance for me to learn more about citizen science and discuss its intersection with the practice of open science. There was a great group of people there covering the gamut from creators of citizen science platforms to crowd-sourcing researchers. 17272.01.l

As usual with Dagstuhl seminars, it’s less about presentations and more about the conversations. There will be a report documenting the outcome and hopefully a paper describing the common thoughts of the participants. Neal Reeves took vast amounts of notes so I’m sure that this will be a good report :-). Here’s a whiteboard we had full of input:

2017-07-05 11.28.24.jpg

Thus, instead of trying to relay what we came up with (you’ll have to wait for the report), I’ll just pull out some of my own brief highlights.

Background on Citizen Science

There were a lot of good pointers on where to start understand current thinking around citizen science. First, two tutorials from the seminar:

What do citizen science projects look like:

Example projects:

How should citizen science be pursued:

And a Book:

Open Science & Citizen Science

Claudia Göbel gave an excellent talk about the overlap of citizen science and open science. First, she gave an important reminder that science in particular in the 1700s was done as public demonstrations walking us through the example painting below. 2017-07-04 11.23.02

She then looked at the overlap between citizen science and open science. Summarized below:

citizenopenscience.png

A follow-on discussion at the with some of the seminar participants led to input for a whitepaper that is being developed through the ECSA on Citizen & Open Science for Europe. Check out the preliminary draft. I look forward to seeing the outcome.

Questioning Assumptions

One thing that I left the seminar thinking about was was the need to question my own (and my field’s) assumptions. This was really inspired by talking to Chris Welty and reflecting on his work with Lora Aroyo on the issues in human annotation and the construction of gold sets.  Some assumptions to question:

  • What qualifications you need to have to be considered a scientist.
  • Interoperability is a good thing to pursue.
  • Openness is a worthy pursuit.
  • We can safely assume a lack of dynamics in computational systems.
  • That human performance is good performance.

Indeed, in Marissa Ponti she pointed to the example below and highlighted some of the potential ramifications of what each of these (what at first blush are positive) citizen science projects could lead to. 2017-07-03 10.06.36

That being said, the ability to rapidly engage more people in the science system seems to be a good thing indeed. An an assumption I’m happy to hold.

Random

Last week, I was the first Language, Data and Knowledge Conference (LDK 2017) hosted in Galway, Ireland. If you show up at a natural language processing conference (especially someplace like LREC) you’ll find a group of people who think about and use linked/structured data. Likewise, if you show up at a linked data/semantic web conference, you’ll find folks who think about and use NLP. I would characterize LDK2017 as place where that intersection of people can hang out for a couple of days.

The conference had ~80 attendees from my count. I enjoyed the setup of a single track, plenty of time to talk, and also really trying to build the community by doing things together. I also enjoyed the fact that there were 4 keynotes for just two days. It really helped give spark to the conference.

Here are some my take-aways from the conference:

Social science as a new challenge domain

Antal van den Bosch gave an excellent keynote emphasizing the need for what he termed holistic approach to language especially for questions in the humanities and social science (tutorial here). This holistic approach takes into account the rich context that word occur in. In particular, he called out the notions of ideolect and socialect that are ways word are understood/used individually and in a particular social group. He are argued the understanding of these computational is a key notion in driving tasks like recommendation.

I personally was interested in Antal’s joint work with Folgert Karsdorp (checkout his github repos!) on Story Networks – constructing networks of how stories are told and retold. For example, how the story of Red Riding Hood has morphed and changed overtime and what are the key sources for its work. This reminded me of the work on information diffusion in social networks. This has direct bearing on how we can detect and track how ideas and technologies propagate in science communication.

I had a great discussion with SocialAI team (Erica Briscoe & Scott Appling) from Georgia Tech about their work on computational social science. In particular, two pointers: the new DARPA next generation social science program to scale-up social science research and their work on characterizing technology capabilities from data for innovation assessment.

Turning toward the long tail of entities

There were a number of talks that focused on how to deal with entities that aren’t necessarily popular. Bichen Shi presented work done at Nokia Bell Labs on entity mention disambiguation. They used Apache Spark to train 700,000 classifiers – one per every entity mention in wikipedia. This allowed them to obtain much more accurate per-mention entity links. Note they used Gerbil for their evaluation. Likewise, Hendrik ter Horst focused on entity linking specifically targeting technical domains (i.e. MeSH & chemicals). During Q/A it was clear that straight-up gazeetering provides an extremely strong baseline in this task. Marieke van Erp presented work on fine-grained entity typing in Spanish and Dutch using word embeddings to go classify hundreds up types.

Natural language generation from KBs is worth a deeper look

Natural language generation from knowledge bases continues a pace. Kathleen McKeown‘s keynote touched on this, in particular, her recent work on mining paraphrasal templates that combines both knowledge bases and free text.  I was impressed with the work of Nina Dethlefs on using deep learning for generating textual description from  a knowledge base. The key insight was how to quickly generate systems to do NLG where the data was sparse using hierarchical composition. In googling around when writing this trip report I stumbled upon Ehud Reiter’s blog which is a good read.

A couple of nice overview slides

While not a theme, there we’re some really nice slides describingfundamentals.

From C. Maria Keet:

2017-06-20 10.09.40

From Christian Chiarcos/Bettina Klimek:

2017-06-20-11-09-34.jpg

From Sangha Nam

2017-06-19 11.07.02

Overall, it was a good kick-off to a conference. Very well organized and some nice research.

Random Thoughts

At the end of last week, I was at a small workshop held by the EXCITE project around the state of the art in extracting references from academic papers (in particular PDFs). This was an excellent workshop that brought together people who are deep into the weeds of this subject including, for example, the developers of ParsCit and CERMINE. While reference string extraction sounds fairly obscure the task itself touches on a lot of the challenges one needs in general for making sense of the scholarly literature.

Begin aside: Yes, I did run a conference called Beyond the PDF 2 and  have been known to tweet things like:

But, there’s a lot of great information in papers so we need to get our machines to read. end aside.

You can roughly catergorize the steps of reference extraction as follows:

  1. Extract the structure of the article.  (e.g. find the reference section)
  2. Extract the reference string itself
  3. Parsing the reference string into its parts (e.g. authors, journal, issue number, title, …)

Check out these slides from Dominika Tkaczyk that give a nice visual overview of this process. In general, performance on this task is pretty good (~.9 F1) for the reference parsing step but gets harder when including all steps.

There were three themes that popped out for me:

  1. The reading experience
  2. Resources
  3. Reading from the image

The Reading Experience

Min-Yen Kan gave an excellent talk about how text mining of the academic literature could improve the ability for researchers to come to grips with the state of science. He positioned the field as one where we have the ground work  and are working on building enabling tools (e.g. search, management, policies) but there’s still a long way to go in really building systems that give insights to researchers. As custodian of the ACL Anthology about trying to put these innovations into practice. Prof. Kan is based in Singapore but gave probably one of the best skype talks I have ever been part of it. Slides are below but you should check it out on youtube.

Another example of improving the reading experience was David Thorne‘s presentation around some of the newer things being added to Utopia docs – a souped-up PDF reader. In particular, the work on the Lazarus project which by extracting assertions from the full text of the article allows one to traverse an “idea” graph along side the “citation” graph. On a small note, I really like how the articles that are found can be traversed in the reader without having to download them separately. You can just follow the links. As usual, the Utopia team wins the “we hacked something really cool just now” award by integrating directly with the Excite projects citation lookup API.

Finally, on the reading experience front. Andreas Hotho presented BibSonomy the social reference manager his research group has been operating over the past ten years. It’s a pretty amazing success resulting in 23 papers, 160 papers use the dataset, 96 million google hits, ~1000 weekly active users active. Obviously, it’s a challenge running this user facing software from an academic group but clearly it has paid dividends. The main take away I had in terms of reader experience is that it’s important to identify what types of users you have and how the resulting information they produce can help or hinder in its application for other users (see this paper).

Resources

The interesting thing about this area is the number of resources available (both software and data) and how resources are also the outcome of the work (e.g. citation databases).  Here’s a listing of the open resources that I heard called out:

This is not to mention the more general sources of information like, CiteSeer, ArXiv or PubMed, etc. What also was nice to see is how many systems built on-top of other software. I was also happy to see the following:

An interesting issue was the transparency of algorithms and quality of the resulting citation databases.  Nees Jan van Eck from CWTS and developer of VOSViewer gave a nice overview of trying to determine the quality of reference matching in the Web of Science. Likewise, Lee Giles gave a review of his work looking at author disambiguation for CiteSeerX and using an external source to compare that process. A pointer that I hadn’t come across was the work by Jurafsky on author disambiguation:

Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. 2012. Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology 63:5, 1030-1047.

Reading from the image

In the second day of the workshop, we broke out into discussion groups. In my group, we focused on understanding the role of deep learning in the entire extraction process. Almost all the groups are pursing this.

I was thankful to both Akansha Bhardwaj and Roman Kern for walking us through their pipelines. In particular, Akansha is using scanned images of reference sections as her source and starting to apply CNN’s for doing semantic segmentation where they were having pretty good success.

We discussed the potential for doing the task completely from the ground up using a deep neural network. This was an interesting discussion as current state of the art techniques already use quite a lot of positional information for training This can be gotten out of the pdf and some of the systems already use the images directly. However, there’s a lot of fiddling that needs to go on to deal with the pdf contents so maybe the image actual provides a cleaner place to start. However, then we get back to the issue of resources and how to appropriately generate the training data necessary.

Random Notes

  • The organizers set-up a slack backchannel which was useful.
  • I’m not a big fan of skype talks, but they were able to get two important speakers that way and they organized it well. When it’s the difference between having field leaders and not, it makes a big difference.
  • EU projects can have a legacy – Roman Kern is still using code from http://code-research.eu where Mendeley was a consortium member.
  • Kölsch is dangerous but tasty
  • More workshops should try the noon to noon format.

 

 

Last week, I was at in Malta for a small workshop on building or thinking about the need for observatories for knowledge organization systems (KOSs). Knowledge organization systems are things like taxonomies, classification schemes, ontologies  or concept maps.  The event was hosted by the EU COST action KNOWeSCAPE, which focuses on understanding the dynamics of knowledge through their analysis and importantly visualization.

This was a follow-up to a previous workshop I attended on KOS evolution. Inspired by that workshop, I began to think with my colleague Mike Lauruhn about how the process of constructing KOS is changing with the incorporation of software agents and non-professional contributors (e.g. crowdsourcing). In particular, we wanted to try and get a handle on what a manager of a KOS should think about when dealing with its inevitable evolution especially with the introduction of these new factors. We wrote about this in our article Sources of Change for Modern Knowledge Organization Systems. Knowl. Org. 43(2016)No.8. (preprint).

In my talk (slides below), I presented our article in the context of building large knowledge graphs at Elsevier. The motivating slides were taken from Brad Allen’s keynote from the Dublin Core conference on metadata in the machine age. My aim was to motivate the need for KOS observatories in order to  provide empirical evidence for how to deal with changing KOS.

Both Joseph Tennis and Richard P. Smiraglia gave excellent views on the current state-of-the-art of KOS ontogeny in information systems. In particular, I think the definitional terms introduced by Tennis are useful.  He had the clearest motivation for the need for an observatory – we need to have a central dataset that is collected overtime in order to go beyond case study analysis (e.g. 1 or two KOS) to a population based approach.

I really enjoyed Shenghui Wang‘s talk on her and Rob Koopman’s experiments embeddings to start to try and detect concept drift within journal articles. Roughly put they used different vector spaces for each time duration and were able to see how particular terms changed with respect to other terms in those vector spaces. I’m looking forward to seeing how this work progresses.

2017-02-01 12.20.09 copy.jpg

The workshop was co-organized with the Wikimedia Community Malta so there was good representation from various members of the community. I particular enjoyed meeting John Cummings who is a Wikimedian in Residence at UNESCO. He told me about one of his project to help create high-quality wikipedia pages from UNESCO reports and other open access documents. It’s really cool seeing how deep research based content can be used to expand Wikipedia and the ramifications that has on its evolution. Another Wikipedian Rebecca O’Neill gave a fascinating talk about her rethinking the relationship between citizen curators and traditional memory institutions. Lot’s of stuff at her site so check it out.

Overall, the event confirmed my belief  that there’s lots more that knowledge organization studies can do with respect to large scale knowledge graphs and also those building these graphs can learn from the field.

Random Notes

 

 

 

Last week, I was in Japan for the 15th International Semantic Web Conference. 

For me, this was a big event as I was research track program co-chair together with the amazing Elena Simperl. Being a program chair is a funny thing, you’re not directly responsible for any individual paper, presentation or review but you feel responsible for the entirety. And obviously, organizing 664 reviews for 212 submissions isn’t something to be taken lightly. Beyond my service as research track chair, I think my main contribution was finding good coffee near the event:

With all that said, I think the entire program was really solid. All the preprints are on the website and the proceedings are available from Springer. I’ll try to summarize my main takeaways below. But first some quick stats:

  • 430 participants
  • 212 (research track) + 43 (application track) + 71 (resources track) = 326 submissions
    • that’s up by 61 submission from last year!
  • Acceptance rates:
    • 39/212  =  18% (research track)
    • 12/43 = 28% (application track)
    • 24/71 = 34%  (resources track)
    • I think these reflect the aims of the individual tracks
  • We also had 102 posters and demos and 12 journal track papers
  • 35 student travel winners

My three main takeaways:

  1. Frames are back!
  2. semantics on the web (notice the case)
  3. Science as the next challenge
  4. SPARQL as a driver for other CS communities

(Oh and apologies for the gratuitous use of images and twitter embeds)

Frames are back!

For the past couple of years, a chunk of the community has been focused on the problem of entity resolution/disambiguation whether that’s from text to a KB or across multiple KBs. Indeed, one of the best paper winners (yes, we gave out two – both nominees had great papers) by ISI’s Information Integration Group was an excellent approach to do multi-type entity resolution.  Likewise, Axel and crew gave a pretty heavy duty tutorial on link discovery. On the NLP front, Stefano Faralli presented a nice resource that disambiguates text to lexical resources with a focus on providing both symbolic and distributional representations .

2016-10-21 10.32.59.jpg2016-10-21 10.34.59.jpg

What struck me at the conference were the number of papers beginning to think not just about entities and their relations but the context they are in. This need for context was well motivated by the folks at IBM research working on medical question answering.

Essentially, thinking about classic AI frames but how do obtain these automatically. A clear example of this is the (ongoing) work on FRED:

Similarly, the News Reader system for extracting information into situated events is another example. Another example is extracting process graphs from medical texts. Finally, in the NLP community there’s an increasing focus on developing resources in order to build automated parsers for frame-style semantic representations (e.g. Abstract Meaning Representation). Such representations can be enhanced by connections to semantic web resources as discussed by Burns et al. (I knew this was a great idea in 2015!)

2016-10-21 10.58.15.jpg

In summary,  I think we’re beginning to see how the background knowledge available on the Semantic Web combined with better parsers can help us start to deal better with context in an automated fashion.

semantics on the web

Chris Bizer gave an insightful keynote reflecting on what the community’s expectations were for the semantic web and where we currently are at.

He presented stats on the growth of Linked Data (e.g. stuff in the LOD cloud) as well as web data (e.g. schema.org marked pages) but really the main take away is the boom in the later. About 30% of the Web has html embedded data something like 12 million websites.  There’s an 86% adoption rate on top travel website.  I think the choice quote was:

“Probably, every hotel on earth is represented as web data.”

The problem is that this sort of data is not clean, it’s messy – it’s webby data, which brings to Chris’s important point for the community:

2016-10-20-09-46-12

While standards have brought us a lot, I think we are starting as a research community to think increasingly about different kinds of semantics and different kinds of structured data.  Some examples from the conference:

An embrace of the whole spectrum of semantics on the web is really a valuable move for the research community. Interestingly enough, I think we can truly experiment with web data through things like Common Crawl and the Web Data Commons. As knowledge graphs, triple stores, and ontologies become increasingly common place especially in enterprise deployments, I’m heartened by these new areas of investigation.

The next challenge: Science

Personally, the third keynote of ISWC by Professor Hiroaki Kitano – the CEO of Sony CSL and creator among other things of the AIBO and founder of RoboCup gave a inspirational speech laying out what he sees as the next AI grand challenge:

2016-10-21 09.20.30.jpg

It will be hard for me to do justice to the keynote as the material per second ratio was pretty much off the chart but he has AI magazine article laying out the vision.

Broadly, he used RoboCup as a framework for discussing how to organize a challenge and pointed to its effectiveness. (e.g Kiva systems a RoboCup spinout was acquired by Amazon for $770 million). He then focused on the issue of the inefficiency in scientific discovery and in particular how assembling knowledge is just too difficult.

:2016-10-21 09.21.04.jpg

2016-10-21 09.40.53 copy.jpg

Assembling this by hand is way too hard!

He then went on to reframe the scientific question as one of a massive search and verification of hypothesis space. 2016-10-21 09.52.29.jpg

I walked out of that keynote pretty charged up.

I think the semantic web community can be a big part of tackling this grand challenge. Science and medicine have always been important domains for applying these technologies and that showed up at this conference as well:

SPARQL as a driver for other CS communities

The 10 year award was given to  Jorge Perez , Marcelo Arenas and Claudio Gutierrez for their paper Semantics and Complexity of SPARQL. Jorge gave just a beautiful 10 minute reflection on the paper and the relationship between theory and practice. I think his slide below really sums up the impact that SPARQL has had not just on the semantic web community but CS as a whole:

2016-10-19 09.35.39.jpg

As further evidence, I thought one of the best technical talks of the conference (even through an earthquake) was by Peter Bonz on emergent schemas for RDF querying.

It was a clear example of how the two DB and semweb communities are learning from one another and that by the semantic web having different requirements (e.g. around schemas), this drives new research.

As a whole, it’s hard to beat a conference where you learn a ton and has the following:

2016-10-19 10.52.15.jpg

Random Pointers

Last week, I was at Provenance Week 2016. This event happens once every two years and brings together a wide range of researchers working on provenance. You can check out my trip report from the last Provenance Week in 2014.  This year Provenance Week combined:

For me, Provenance Week is like coming home, lots of old friends and a favorite subject of mine. It’s also a good event to attend because it crosses the subfields of computer science, everything from security in operating systems to scientific workflows on to database theory. In one day, I went from a discussion on the role of indirection in data citation to staring at the C code of a database. Marta, Boris and Sarah really put together a solid program. There were about 60 attendees across the four days:

ProvenanceWeek_2016-06-08_D4S2484

So what was I doing there? Having served as co-chair of the W3C PROV working group, I thought it was important to be at the PROV: Three years later event where we reflected on the status of PROV, it’s uptake and usage. I presented some ongoing work on measuring the usage of provenance on the web of data.  Additionally, I gave the presentation of joint work led by my student Manolis Stamatogiannakis and done in conjunction with Ashish Gehani‘s group at SRI. The work focused on using benchmarks to help inform decisions on what provenance capture system to use. Slides:

I’ll now walk through my 3 big take aways from the event.

Provenance to attack Advanced Persistent Threats

DARPA’s $60 million transparent computing explicitly calls out the use of provenance to address the problem of what’s called an Advanced Persistent Threat (APTs). APTs are attacks that are long terms, look like standard business processes, and involve the attacker knowing the system well. This has led to a number of groups exploring the use of system level provenance capture techniques (e.g. SPADE and OPUS) and then integrating that from multiple distributed sources using PROV inspired data models. This was well described by David Archer is his talk as assembling multiple causal graphs from event streams.  James Cheney’s talk on provenance segmentation also addressed these issues well. This reminded me some what of the work on distributed provenance capture using structured logs that the Netlogger and Pegasus teams do, however, they leverage the structure of a workflow system to help with the assembly.

I particularly liked Yang JiSangho Lee and  Wenke Lee‘s work on using user level record and replay to track and replay provenance. This builds upon some of our work that used system level record replay as mechanism for separating provenance capture and instrumentation. But now in user space using the nifty rr tool from Mozilla. I think this thread of being able to apply provenance instrumentation after the fact  on an execution trace holds a lot of promise.

Overall, it’s great to see this level of attention on the use of provenance for security and in more broadly of using long term records of provenance to do analysis.

PROV as the starting point

Given that this was the ten year anniversary of IPAW, it was appropriate that Luc Moreau gave one of the keynotes. As really one of the drivers of the community, Luc gave a review of the development of the community and its successes.One of those outcomes was the W3C PROV standards. 

Overall, it was nice to see the variety of uses of PROV and the tools built around it. It’s really become the jumping off point for exploration. For example, Pete Edwards team combined PROV and a number of other ontologies including (P-Plan) to create a semantic representation of what’s going on within a professional kitchen in order to check food safety compliance. 

burger

Another example is the use of PROV as a jumping off point for the investigation into the provenance model of HL7 FHIR (a new standard for electronic healthcare records interchange).

As whole, I think the attendees felt that what was missing was an active central point to see what was going on with PROV and pointers to resources for implementation. The aim is to make sure that the W3c PROV wiki is up-to-date and is a better resource overall.

Provenance as lens: Data Citation, Documents & Versioning

An interesting theme was the use of provenance concepts to give a frame for other practices. For example, Susan Davidson gave a great keynote on data citation and how using a variant of provenance polynomials can help us understand how to automatically build citations for various parts of curated databases. The keynote was based off her work with James Frew and Peter Buneman that will appear in CACM (preprint). Another good example of provenance to support data citation was Nick Car’s work for Geoscience Australia.

Furthermore, the notion of provenance as the substructure for complex documents appeared several times. For example, the Impacts on Human  Health of Global Climate Change report from globalchange.gov uses provenance as a backbone. Both the OPUS and PoeM systems are exploring using provenance to generate high-level experiment reports.

Finally, I thought David Koop‘s versioning of version trees showed how using provenance as lens can help better understand versioning of version trees themselves. (I have to give David credit for presenting a super recursive concept so well).

Overall, another great event and I hope we can continue to attract new CS researchers focusing on provenance.

Random Notes

  • PROV in JSON-LD – good for streaming
  • Theoretical provenance paper recipe = extend provenance polynomials to deal with new operators. Prove nice result. e.g. now for Linear Algebra.
  • Prefixes! R-PROV, P-PROV, D-PROV, FS-PROV, SC-PROV, — let me know if I missed any..
  • Intel Secure Guard Extensions (SGX) – interesting
  • Surprised how dependent I’ve become on taking pictures in conferences for note taking. Not being able to really impacted my flow. Plus, there are less pictures for this
  • Thanks to Adriane for hosting!
  • A provenance based data science environment
  • 👍Learning Health Systems – from Vasa Curcin
%d bloggers like this: