I had the pleasure of attending the Web Conference 2018 in Lyon last week along with my colleague Corey Harper. This is the 27th edition of the largest conference on the World Wide Web. I have tremendous difficulty not calling it WWW but I’ll learn! Instead of doing two trip reports, the rest of this is a combo of Corey’s and my thoughts. Before getting to what we took away as main themes of the conference let’s look at the stats and organization:

It’s also worth pointing out that this is just the research track. There were 27 workshops, 21 tutorials, 30 demos (Paul was co-chair), 62 posters, four collocated conferences/events, 4 challenges, a developer track and programming track, a project track, an industry track, and… We are probably missing something as well. Suffice to say, even with the best work of the organizers it was hard to figure out what to see. Organizing an event with 2200+ attendees is a massive task – over 80 chairs were involved, not to mention the PC and the local heavy lifting. Congrats to Fabien, Pierre-Antoine, Lionel and the whole committee for pulling it off. It’s also great to see that the proceedings are open access and available on the web.

Given the breadth of the conference, we obviously couldn’t see everything but from our interests we pulled out the following themes:

  • Dealing with a Polluted Web
  • Tackling Tabular Data
  • Observational Methods
  • Scientific Content as a Driver

Dealing with a Polluted Web

The Web community is really owning its responsibility to help mitigate the destructive uses to which the Web is put. From the “Recoding Black Mirror” workshop, which we were sad to miss, through the opening keynote and the tracks on Security and Privacy and Fact Checking, this was a major topic throughout the conference.

Oxford professor Luciano Floridi gave an excellent first keynote  on “The Good Web” which addressed this topic head on. He introduced a number of nice metaphors to describe what’s going on:

  • Polluting agents in the Web ecosystem are like extremophiles, making the environment hostile to all but themselves
  • Democracy in some contexts can be like antibiotics: too much gives rise to antibiotic-resistant bacteria.
  • His takeaway is that we need a bit of paternalism in this context now.

His talk was pretty compelling; you can check out the full video here.

Additionally, Corey was able to attend the panel discussion that opened the “Journalism, Misinformation, and Fact-Checking” track, which included representation from the Credibility Coalition, the International Fact Checking Network, MIT, and WikiMedia. There was a discussion of how to set up economies of trust in the age of attention economies, and while some panelists agreed with Floridi’s call for some paternalism, there was also a warning that some techniques we might deploy to mitigate these risks could lead to “accidental authoritarianism.” The Credibility Coalition also provided an interesting review of how to define credibility indicators for news looking at over 16 indicators of credibility.

We were able to see parts of the “Web and Society” track, which included a number of papers on social justice oriented themes. This included an excellent paper that showed how recommender systems in social networks often exacerbate and amplify gender and racial disparity in social network connections and engagement. Additionally, many papers addressed the relationship between the mainstream media and the web (e.g. political polarization and social media, media and public attention using the web).

Some more examples: The best demo was awarded to a system that automatically analyzed privacy policies of websites and summarized them with respect to GDPR and:

More generally, it seems the question is how do we achieve quality assessment at scale?

Tackling Tabular Data

Knowledge graphs and heterogeneous networks (there was a workshop on that) were a big part of the conference. Indeed the test of time paper award went to the original YAGO paper. There were a number of talks about improving knowledge graphs, for example for question answering tasks, determining the attributes needed to complete a KG, or improving relation extraction. While tables have always been an input to knowledge graph construction (e.g. Wikipedia infoboxes), an interesting turn was towards treating tabular data as a focus area.

As Natasha Noy from Google noted in her keynote at the SAVE-SD workshop, this is an area with a number of exciting research challenges:

There was a workshop on data search with a number of papers on the theme. In that workshop, Maarten de Rijke gave a keynote on the work his team has been doing in the context of data search project with Elsevier.

In the main track, there was an excellent talk on Ad-Hoc Table Retrieval using Semantic Similarity. They looked at finding semantically central columns to provide a ranked list of columns. More broadly they are looking at spreadsheet compilation as the task (see smarttables.cc and the dataset for that task). Furthermore, the paper Towards Annotating Relational Data on the Web with Language Models looked at enriching tables through linking into a knowledge graph.
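To give a rough feel for what ranking tables by semantic similarity can mean, here is a minimal sketch. It is not the method from the paper: the toy word vectors, the centroid-based `embed` helper and the two example tables are all assumptions made up for illustration. A real system would use pre-trained embeddings and a much richer table representation.

```python
import math

# Toy word vectors, invented for this example only.
TOY_VECTORS = {
    "country": [1.0, 0.0, 0.0],
    "capital": [0.9, 0.1, 0.0],
    "population": [0.8, 0.2, 0.1],
    "player": [0.0, 1.0, 0.0],
    "team": [0.1, 0.9, 0.0],
    "goals": [0.0, 0.8, 0.2],
}


def embed(terms):
    """Represent a bag of terms as the centroid of their known vectors."""
    known = [TOY_VECTORS[t] for t in terms if t in TOY_VECTORS]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_tables(query_terms, tables):
    """Score each table by similarity between the query and its column headers."""
    query_vector = embed(query_terms)
    scored = [
        (name, cosine(query_vector, embed(headers)))
        for name, headers in tables.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical tables, described only by their column headers.
TABLES = {
    "world_capitals": ["country", "capital", "population"],
    "league_scorers": ["player", "team", "goals"],
}

ranking = rank_tables(["capital", "country"], TABLES)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The point of the sketch is only that the query never has to share an exact keyword with the table; closeness in the embedding space is what drives the ranking.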

Observational Methods

Observing user behavior has long been a part of research on the Web; any web search engine is driven by that notion. What did seem striking is the depth of the observational data being employed. Prof. Lorrie Cranor gave an excellent keynote on the user experience of web security (video here). Did you know that if you read all the privacy policies of all the sites you visit it would take 244 hours per year? Also, the idea of privacy as nutrition labels is pretty cool:

But what was interesting was her lab’s use of an observatory of 200 participants who allowed their Windows home computers to be instrumented. This kind of instrumentation gives deep insight into how users actually use their browsers and security settings.

Another example of deep observational data was the use of mouse tracking on search result pages to detect how people search under anxiety conditions:

In the paper by Wei Sui and co-authors on Computational Creative Advertisements presented at the HumL workshop – they use in-home facial and video tracking to measure emotional response to ads by volunteers.

The final example was the use of fMRI scans to track brain activity of participants during web search tasks. All these examples provide amazing insights into how people use these technologies but as these sorts of methods are more broadly adopted, we need to make sure to adopt the kinds of safeguards used by these researchers – e.g. consent, IRBs, anonymization.

Scientific Content as a Driver

It’s probably our bias but we saw a lot of work tackling scientific content. Probably because it’s both interesting and provides a number of challenges. For example, the best paper of the conference (HighLife) was about extracting n-ary relations for knowledge graph construction motivated by the need for such types of relations in creating biomedical knowledge graphs. The aforementioned work on tabular data often is motivated by the needs of research. Obviously SAVE-SD covered this in detail:

In the demo track, the etymo.io search engine was presented for summarizing and visualizing scientific papers. Kuansan Wang at the BigNet workshop talked about Microsoft Academic Search and the difficulties and opportunities in processing so much scientific data.

Paul gave a keynote at the same workshop also using science as the motivation for new methods for building out knowledge graphs. Slides below:

In the panel, Structured Data on the Web 7.0, Google’s Evgeniy Gabrilovich – creator of the Knowledge Vault – noted the challenges of getting highly correct data for Google’s Medical Knowledge graph and that doing this automatically is still difficult.

Finally, there was work using DOIs to study persistent identifier use over time on the Web.

Wrap-up

Overall, we had a fantastic web conference. Good research, good conversations and good food:

Random Thoughts

 

Last week, I had the pleasure to be able to attend a bilateral meeting between the Royal Society and the KNAW. The aim was to strengthen the relation between the UK and Dutch scientific communities. The meeting focused on three scientific areas: quantum physics & technology; nanochemistry; and responsible data science. I was there for the latter. The event was held at Chicheley Hall which is a classic baroque English country house (think Pride & Prejudice). It’s a marvelous venue – very much similar in concept to Dagstuhl (but with an English vibe) where you are really wholly immersed in academic conversation.

One of the fun things about the event was getting a glimpse of what colleagues from other technical disciplines are doing. It was cool to see Prof. Bert Weckhuysen’s enthusiasm for using imaging technologies to understand catalysts at the nanoscale. Likewise, seeing both the progress and the investment (!) in quantum computing from Prof. Ian Walmsley was informative. I also got an insider intro to the challenges of engineering a quantum computer from Dr. Ruth Oulton.

The responsible data science track had ~15 people. What I liked was that the organizers not only included computer scientists but also legal scholars, politicians, social scientists, philosophers and policy makers. The session consisted primarily of talks but luckily everyone was open to discussion throughout. Broadly, responsible data science covers the ethics of the practice and implications of data science or put another way:

For more context, I suggest starting with two sources: 1) The Dutch consortium on responsible data science 2) the paper 10 Simple Rules for Responsible Big Data Research. I took away two themes both from the track as well as my various chats with people during coffee breaks, dinner and the bar.

1) The computer science community is engaging

It was apparent throughout the meeting that the computer science community is confronting the challenges head on. A compelling example was the talk by Dr. Alastair Beresford from Cambridge about Device Analyzer, a system that captures the activity of users’ mobile phones in order to provide data to improve device security, which it has:

He talked compellingly about the trade-offs between consent and privacy and how the project tries to manage these issues. In particular, I thought how they handle data sharing with other researchers was interesting. It reminded me very much of how the Dutch Central Bureau of Statistics manages microdata on populations.

Another example was the discussion by Prof. Maarten De Rijke on the work going on with diversity for recommender and search systems. He called out the Conference on Fairness, Accountability, and Transparency (FAT*) that was happening just after this meeting, where the data science community is engaging on these issues. Indeed, one of my colleagues was tweeting from that meeting:

Julian Huppert, former MP, discussed the independent review board set up by DeepMind Health to enable transparency about their practices. He is part of that board. Interestingly, Richard Horton, Editor of the Lancet, is also part of that board. Furthermore, Prof. Bart Jacobs discussed the polymorphic encryption based privacy system he’s developing for a collaboration between Google’s Verily and Radboud University around Parkinson’s disease. This is an example that even the majors are engaged around these notions of responsibility. To emphasize this engagement notion even more, during the meeting a new report on the Malicious Uses of AI came out from a number of well-known organizations.

One thing that I kept thinking is that we need more assets or concrete artifacts that data scientists can apply in practice.

For example, I like the direction outlined in this article from Dr. Virginia Dignum about defining concrete principles using a design for values based approach. See TU Delft’s Design for Values Institute for more on this kind of approach.

2) Other methods needed

As data scientists, we tend to want to use an experimental / data driven approach even to these notions surrounding responsibility.

Even though I think there’s absolutely a role here for a data driven approach, it’s worth looking at other kinds of more qualitative methods, for example, by using survey instruments or an ethnographic approach or even studying the textual representation of the regulatory apparatus.  For instance, reflecting on the notion of Thick Data is compelling for data science practice. This was brought home by Dr. Ian Brown in his talk on data science and regulation which combined both an economic and survey view:

Personally, I tried to bring some social science literature to bear when discussing the need for transparency in how we source our data. I also argued for the idea that adopting a responsible approach is also actually good for the operational side of data science practice:

While I think it’s important for computer scientists to look at different methods, it’s also important for other disciplines to gain insight into the actual process of data science itself, as Dr. Linnet Taylor grappled with in her talk about observing a data governance project.

Overall, I enjoyed both the setting and the content of the meeting. If we can continue to have these sorts of conversations, I think the data science field will be much better placed to deal with the ethical and other implications of our technology.

Random Thoughts

  • Peacocks!
  • Regulating Code – something for the reading list
  • Somebody remind me to bring a jacket next time I go to an English Country house!
  • I always love it when egg codes get brought up when talking about provenance.
  • I was told that I had a “Californian conceptualization” of things – I don’t think it was meant as a compliment – but I’ll take it as such 🙂
  • Interesting pointer to work by Seda Gurses on privacy and software engineering from @1Br0wn
  • Lots of discussion of large internet majors and monopolies. There’s lots of academic work on this but I really like Ben Thompson’s notion of aggregators as the way to think about them.
  • Merkle trees are great – but blockchain is a nicer name 😉

 

Last week, I conferenced! I attended the 16th International Semantic Web Conference (ISWC 2017) in Vienna at the beginning of the week and then headed up to FORCE 2017 in Berlin for the back half of the week. For the last several ISWCs, I’ve been involved in the organizing committee, but this year I got to relax. It was a nice chance to just be an attendee and see what was up. This was made even nicer by the really tremendous job Axel, Jeff and their team did in organizing both the logistics and program. The venues were really amazing and the wifi worked!

Before getting into what I thought were the major themes of the conference, let’s do some stats:

  • 624 participants
  • Papers
    • Research track: 197 submissions – 44 accepted – 23% acceptance rate
    • In-use: 27 submissions – 9  accepted – 33% acceptance rate
    • Resources: 76 submissions – 23 accepted – 30% acceptance rate
  • 46 posters & 61 demos
  • Over 1000 reviews were done excluding what was done for the workshop / demos / posters. Just a massive amount of work in helping work get better.

This year they expanded the number of best reviewers and I was happy to be one of them:

You can find all the papers online as preprints.

The three themes I took away from the conference were:

  1. Ecosystems for knowledge engineering
  2. Learn from everything
  3. More media

Ecosystems for knowledge engineering

This was a hard theme to find a title for but there were several talks about how to design and engineer the combination of social and technical processes to build knowledge graphs. Deborah McGuinness in her keynote talked about how it took a village to create effective knowledge driven systems. These systems are the combination of experts, knowledge specialists, systems that do ML, ontologies, and data sources. Summed up by the following slide:

My best idea is that this would fall under the rubric of knowledge engineering, something that has always been part of the semantic web community. What I saw though was the development of more extensive ideas and guidelines about how to create and put into practice not just human focused systems but entire social-technical ecosystems that leverage all manner of components.

Some examples: Gil et al.’s paper on creating a platform for high-quality ontology development and data annotation explicitly discusses the community organization along with the platform used to enable it. Knoblock et al.’s paper on creating linked data for the American Art Collaborative discusses not only the technology for generating linked data from heterogeneous sources but also the need for a collaborative workflow facilitated by a shared space (GitHub) and for tools to support expert review. In one of my favorite papers, Piscopo et al. evaluated the provenance of Wikidata statements and also developed machine learning models that could judge the authoritativeness & relevance of potential source material. This could provide a helpful tool in allowing Wikidata editors to garden the statements automatically added by bots. As a last example, Jamie Taylor in his keynote discussed how at Google they have a Knowledge Graph Schema team that is there to support developers in creating interlocking data structures. The team is focused on supporting and maintaining quality of the knowledge graph.

A big discussion area was the idea coming out of the US for a project / initiative around an Open Knowledge Network introduced by Guha. Again, I’ll put this under the notion of how to create these massive social-technical knowledge systems.

I think more work needs to be done in this space not only with respect to the dynamics of these ecosystems as Michael Lauruhn and I discussed in a recent paper but also from a reuse perspective as Pascal Hitzler has been talking about with ontology design patterns.

Learn from everything

The second theme for me was learning from everything. Essentially, this is the use of the combination of structured knowledge and unstructured data within machine learning scenarios to achieve better results. A good example of this was presented by Achim Rettinger on using cross modal embeddings to improve semantic similarity and type prediction tasks:

Likewise, Nada Lavrač discussed in her keynote different approaches to semantic data mining, which also leverage different sources of information for learning. In particular, what was interesting is the use of network analysis to create a smaller knowledge network to learn from.

A couple of other examples include:

It’s worth calling out the winner of the renewed Semantic Web Challenge from IBM, which used deep learning in combination with sources such as DBpedia, GeoNames and background assumptions for relation learning.

Socrates – Winner SWC

(As an aside, I think it’s pretty cool that the challenge was won by IBM on data provided by Thomson Reuters with an award from Elsevier. Open innovation at its best.)

For a broader take on the complementarity between deep learning and the semantic web, Dan Brickley’s paper is a fun read. Indeed, as we start to potentially address common sense knowledge we will have to take more opportunity to learn from everywhere.

More media

Finally, I think we saw an increase in the number of works dealing with different forms of media. I really enjoyed the talk on Improving Visual Relationship Detection using Semantic Modeling of Scene Descriptions given by Stephan Baier, where they used a background knowledge base to improve relation prediction between portions of images:

There was an entire session focused on multimodal linked data including talks on audio (MIDI LOD cloud, the Internet Music Archive as linked data) and images (IMGpedia, content-analyzed linked data descriptions of Wikimedia Commons). You can even mash up music with the SPARQL-DJ.

Conclusion

DBpedia won the 10-year paper award. 10 years later, semantic technologies and in particular the notion of a knowledge graph are mainstream (e.g. Thomson Reuters has a 100 billion node knowledge graph). While we may still be focused too much on the available knowledge graphs for our research work, it seems to me that the community is branching out to begin to answer a range of new questions (how to build knowledge ecosystems?, where does learning fit?, …) about the intersection of semantics and the web.

Random Notes:

Last week, I was at the first Language, Data and Knowledge Conference (LDK 2017) hosted in Galway, Ireland. If you show up at a natural language processing conference (especially someplace like LREC) you’ll find a group of people who think about and use linked/structured data. Likewise, if you show up at a linked data/semantic web conference, you’ll find folks who think about and use NLP. I would characterize LDK 2017 as a place where that intersection of people can hang out for a couple of days.

The conference had ~80 attendees from my count. I enjoyed the setup of a single track, plenty of time to talk, and also really trying to build the community by doing things together. I also enjoyed the fact that there were 4 keynotes for just two days. It really helped give spark to the conference.

Here are some of my take-aways from the conference:

Social science as a new challenge domain

Antal van den Bosch gave an excellent keynote emphasizing the need for what he termed a holistic approach to language, especially for questions in the humanities and social science (tutorial here). This holistic approach takes into account the rich context that words occur in. In particular, he called out the notions of idiolect and sociolect, the ways words are understood/used individually and in a particular social group. He argued that understanding these computationally is a key notion in driving tasks like recommendation.

I personally was interested in Antal’s joint work with Folgert Karsdorp (check out his GitHub repos!) on Story Networks – constructing networks of how stories are told and retold. For example, how the story of Red Riding Hood has morphed and changed over time and what are the key sources for its work. This reminded me of the work on information diffusion in social networks. This has direct bearing on how we can detect and track how ideas and technologies propagate in science communication.

I had a great discussion with the SocialAI team (Erica Briscoe & Scott Appling) from Georgia Tech about their work on computational social science. In particular, two pointers: the new DARPA next generation social science program to scale up social science research and their work on characterizing technology capabilities from data for innovation assessment.

Turning toward the long tail of entities

There were a number of talks that focused on how to deal with entities that aren’t necessarily popular. Bichen Shi presented work done at Nokia Bell Labs on entity mention disambiguation. They used Apache Spark to train 700,000 classifiers – one for every entity mention in Wikipedia. This allowed them to obtain much more accurate per-mention entity links. Note they used Gerbil for their evaluation. Likewise, Hendrik ter Horst focused on entity linking specifically targeting technical domains (i.e. MeSH & chemicals). During Q/A it was clear that straight-up gazetteering provides an extremely strong baseline in this task. Marieke van Erp presented work on fine-grained entity typing in Spanish and Dutch using word embeddings to classify hundreds of types.
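Since gazetteering came up as such a strong baseline, here is a minimal sketch of what a dictionary-lookup entity linker can look like. The tiny gazetteer and the `EX:` identifiers are placeholders invented for this example (they are not real MeSH IDs), and longest-match-first lookup is just one common design choice, not a description of any of the systems presented.

```python
import re

# Hypothetical gazetteer: surface form (lowercase) -> made-up identifier.
GAZETTEER = {
    "acetylsalicylic acid": "EX:0001",
    "aspirin": "EX:0001",
    "heart attack": "EX:0002",
    "acid": "EX:0003",
}


def link_entities(text, gazetteer):
    """Return (start, end, surface, identifier) for each dictionary match.

    Names are tried longest first, so 'acetylsalicylic acid' is preferred
    over the shorter entry 'acid' when both could match at a position.
    """
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), gazetteer[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


mentions = link_entities(
    "Aspirin (acetylsalicylic acid) lowers heart attack risk.", GAZETTEER
)
for start, end, surface, identifier in mentions:
    print(start, end, surface, identifier)
```

There is no learning and no context here at all, which is exactly why it is interesting that this kind of lookup is so hard to beat in technical domains with well-curated vocabularies.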

Natural language generation from KBs is worth a deeper look

Natural language generation from knowledge bases continues apace. Kathleen McKeown’s keynote touched on this, in particular, her recent work on mining paraphrasal templates that combines both knowledge bases and free text. I was impressed with the work of Nina Dethlefs on using deep learning for generating textual descriptions from a knowledge base. The key insight was how to quickly generate systems to do NLG where the data was sparse using hierarchical composition. In googling around when writing this trip report I stumbled upon Ehud Reiter’s blog which is a good read.

A couple of nice overview slides

While not a theme, there were some really nice slides describing fundamentals.

From C. Maria Keet:

From Christian Chiarcos/Bettina Klimek:

From Sangha Nam

Overall, it was a good kick-off to a conference. Very well organized and some nice research.

Random Thoughts

At the end of last week, I was at a small workshop held by the EXCITE project around the state of the art in extracting references from academic papers (in particular PDFs). This was an excellent workshop that brought together people who are deep into the weeds of this subject including, for example, the developers of ParsCit and CERMINE. While reference string extraction sounds fairly obscure, the task itself touches on a lot of the challenges one faces in general when making sense of the scholarly literature.

Begin aside: Yes, I did run a conference called Beyond the PDF 2 and  have been known to tweet things like:

But, there’s a lot of great information in papers so we need to get our machines to read. end aside.

You can roughly categorize the steps of reference extraction as follows:

  1. Extract the structure of the article (e.g. find the reference section)
  2. Extract the reference string itself
  3. Parse the reference string into its parts (e.g. authors, journal, issue number, title, …)

Check out these slides from Dominika Tkaczyk that give a nice visual overview of this process. In general, performance on this task is pretty good (~.9 F1) for the reference parsing step but gets harder when including all steps.
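To make the third step concrete, here is a minimal sketch of parsing one reference string into its parts with a regular expression. This is an illustration only: as far as I know, tools like ParsCit and CERMINE rely on trained sequence models rather than a hand-written pattern, and the pattern below assumes one specific citation style (authors, year, title, journal, volume:issue, pages). The function and field names are made up for this example.

```python
import re

# Assumed layout: "Authors. Year. Title. Journal Volume:Issue, Pages."
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\.\s+"
    r"(?P<year>\d{4})\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>.+?)\s+"
    r"(?P<volume>\d+):(?P<issue>\d+),\s+"
    r"(?P<pages>\d+-\d+)\.?$"
)


def parse_reference(reference):
    """Split a reference string into fields, or return None if it does not fit."""
    match = REFERENCE_PATTERN.match(reference.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Split "A, B, and C" into individual author names.
    fields["authors"] = re.split(r",\s*(?:and\s+)?", fields["authors"])
    return fields


example = (
    "Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. 2012. "
    "Citation-based bootstrapping for large-scale author disambiguation. "
    "Journal of the American Society for Information Science and Technology "
    "63:5, 1030-1047."
)

parsed = parse_reference(example)
for field, value in parsed.items():
    print(f"{field}: {value}")
```

A pattern like this breaks as soon as the citation style changes (initials with periods, missing issue numbers, titles containing full stops), which is a good part of why the field moved to learned models.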

There were three themes that popped out for me:

  1. The reading experience
  2. Resources
  3. Reading from the image

The Reading Experience

Min-Yen Kan gave an excellent talk about how text mining of the academic literature could improve the ability of researchers to come to grips with the state of science. He positioned the field as one where we have the groundwork and are working on building enabling tools (e.g. search, management, policies) but there’s still a long way to go in really building systems that give insights to researchers. As custodian of the ACL Anthology, he also talked about trying to put these innovations into practice. Prof. Kan is based in Singapore but gave probably one of the best Skype talks I have ever been part of. Slides are below but you should check it out on YouTube.

Another example of improving the reading experience was David Thorne’s presentation around some of the newer things being added to Utopia Docs – a souped-up PDF reader. In particular, he covered the work on the Lazarus project, which by extracting assertions from the full text of the article allows one to traverse an “idea” graph alongside the “citation” graph. On a small note, I really like how the articles that are found can be traversed in the reader without having to download them separately. You can just follow the links. As usual, the Utopia team wins the “we hacked something really cool just now” award by integrating directly with the EXCITE project’s citation lookup API.

Finally, on the reading experience front, Andreas Hotho presented BibSonomy, the social reference manager his research group has been operating over the past ten years. It’s a pretty amazing success, resulting in 23 papers, 160 papers that use the dataset, 96 million Google hits and ~1000 weekly active users. Obviously, it’s a challenge running this user facing software from an academic group but clearly it has paid dividends. The main take away I had in terms of reader experience is that it’s important to identify what types of users you have and how the resulting information they produce can help or hinder in its application for other users (see this paper).

Resources

The interesting thing about this area is the number of resources available (both software and data) and how resources are also the outcome of the work (e.g. citation databases).  Here’s a listing of the open resources that I heard called out:

This is not to mention the more general sources of information like CiteSeer, arXiv or PubMed. What also was nice to see is how many systems are built on top of other software. I was also happy to see the following:

An interesting issue was the transparency of algorithms and quality of the resulting citation databases. Nees Jan van Eck from CWTS and developer of VOSviewer gave a nice overview of trying to determine the quality of reference matching in the Web of Science. Likewise, Lee Giles gave a review of his work looking at author disambiguation for CiteSeerX and using an external source to compare that process. A pointer that I hadn’t come across was the work by Jurafsky on author disambiguation:

Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. 2012. Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology 63:5, 1030-1047.

Reading from the image

On the second day of the workshop, we broke out into discussion groups. In my group, we focused on understanding the role of deep learning in the entire extraction process. Almost all the groups are pursuing this.

I was thankful to both Akansha Bhardwaj and Roman Kern for walking us through their pipelines. In particular, Akansha is using scanned images of reference sections as her source and starting to apply CNNs for doing semantic segmentation where they were having pretty good success.

We discussed the potential for doing the task completely from the ground up using a deep neural network. This was an interesting discussion as current state of the art techniques already use quite a lot of positional information for training. This can be gotten out of the PDF and some of the systems already use the images directly. However, there’s a lot of fiddling that needs to go on to deal with the PDF contents so maybe the image actually provides a cleaner place to start. However, then we get back to the issue of resources and how to appropriately generate the training data necessary.

Random Notes

  • The organizers set up a Slack backchannel which was useful.
  • I’m not a big fan of Skype talks, but they were able to get two important speakers that way and they organized it well. When it’s the difference between having field leaders and not, it makes a big difference.
  • EU projects can have a legacy – Roman Kern is still using code from http://code-research.eu where Mendeley was a consortium member.
  • Kölsch is dangerous but tasty
  • More workshops should try the noon to noon format.

 

 

Last week, I was in Japan for the 15th International Semantic Web Conference. 

For me, this was a big event as I was research track program co-chair together with the amazing Elena Simperl. Being a program chair is a funny thing, you’re not directly responsible for any individual paper, presentation or review but you feel responsible for the entirety. And obviously, organizing 664 reviews for 212 submissions isn’t something to be taken lightly. Beyond my service as research track chair, I think my main contribution was finding good coffee near the event:

With all that said, I think the entire program was really solid. All the preprints are on the website and the proceedings are available from Springer. I’ll try to summarize my main takeaways below. But first some quick stats:

  • 430 participants
  • 212 (research track) + 43 (application track) + 71 (resources track) = 326 submissions
    • that’s up by 61 submissions from last year!
  • Acceptance rates:
    • 39/212  =  18% (research track)
    • 12/43 = 28% (application track)
    • 24/71 = 34%  (resources track)
    • I think these reflect the aims of the individual tracks
  • We also had 102 posters and demos and 12 journal track papers
  • 35 student travel winners
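For anyone who wants to double check the numbers, here is a minimal sketch that recomputes the submission total and the acceptance rates from the figures above. The dictionary layout and variable names are just my own illustration.

```python
# Figures copied from the stats above: track -> (accepted, submitted)
tracks = {
    "research": (39, 212),
    "application": (12, 43),
    "resources": (24, 71),
}

# Total number of submissions across the three tracks
total_submissions = sum(submitted for _, submitted in tracks.values())

# Acceptance rate per track, rounded to the nearest whole percent
rates = {
    name: round(100 * accepted / submitted)
    for name, (accepted, submitted) in tracks.items()
}

print(f"total submissions: {total_submissions}")
for name, pct in rates.items():
    print(f"{name}: {pct}%")
```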

My four main takeaways:

  1. Frames are back!
  2. semantics on the web (notice the case)
  3. Science as the next challenge
  4. SPARQL as a driver for other CS communities

(Oh and apologies for the gratuitous use of images and twitter embeds)

Frames are back!

For the past couple of years, a chunk of the community has been focused on the problem of entity resolution/disambiguation, whether that’s from text to a KB or across multiple KBs. Indeed, one of the best paper winners (yes, we gave out two – both nominees had great papers) by ISI’s Information Integration Group was an excellent approach to do multi-type entity resolution. Likewise, Axel and crew gave a pretty heavy duty tutorial on link discovery. On the NLP front, Stefano Faralli presented a nice resource that disambiguates text to lexical resources with a focus on providing both symbolic and distributional representations.


What struck me at the conference were the number of papers beginning to think not just about entities and their relations but the context they are in. This need for context was well motivated by the folks at IBM research working on medical question answering.

Essentially, this means thinking about classic AI frames, but asking how to obtain them automatically. A clear example of this is the (ongoing) work on FRED:

Similarly, the News Reader system for extracting information into situated events is another example. Another example is extracting process graphs from medical texts. Finally, in the NLP community there’s an increasing focus on developing resources in order to build automated parsers for frame-style semantic representations (e.g. Abstract Meaning Representation). Such representations can be enhanced by connections to semantic web resources as discussed by Burns et al. (I knew this was a great idea in 2015!)


In summary, I think we’re beginning to see how the background knowledge available on the Semantic Web combined with better parsers can help us start to deal better with context in an automated fashion.

semantics on the web

Chris Bizer gave an insightful keynote reflecting on what the community’s expectations were for the semantic web and where we currently are.

He presented stats on the growth of Linked Data (e.g. stuff in the LOD cloud) as well as web data (e.g. schema.org marked pages), but really the main takeaway is the boom in the latter. About 30% of the Web has HTML-embedded data, something like 12 million websites. There’s an 86% adoption rate on top travel websites. I think the choice quote was:

“Probably, every hotel on earth is represented as web data.”

The problem is that this sort of data is not clean, it’s messy – it’s webby data, which brings us to Chris’s important point for the community:

2016-10-20-09-46-12

While standards have brought us a lot, I think we are starting as a research community to think increasingly about different kinds of semantics and different kinds of structured data. Some examples from the conference:

An embrace of the whole spectrum of semantics on the web is really a valuable move for the research community. Interestingly enough, I think we can truly experiment with web data through things like Common Crawl and the Web Data Commons. As knowledge graphs, triple stores, and ontologies become increasingly commonplace, especially in enterprise deployments, I’m heartened by these new areas of investigation.

The next challenge: Science

The third keynote of ISWC was by Professor Hiroaki Kitano, the CEO of Sony CSL, creator among other things of the AIBO, and founder of RoboCup. Personally, I found it an inspirational speech, laying out what he sees as the next AI grand challenge:

2016-10-21 09.20.30.jpg

It will be hard for me to do justice to the keynote as the material per second ratio was pretty much off the chart, but he has an AI Magazine article laying out the vision.

Broadly, he used RoboCup as a framework for discussing how to organize a challenge and pointed to its effectiveness (e.g. Kiva Systems, a RoboCup spinout, was acquired by Amazon for $770 million). He then focused on the issue of the inefficiency in scientific discovery and in particular how assembling knowledge is just too difficult.

2016-10-21 09.21.04.jpg

2016-10-21 09.40.53 copy.jpg

Assembling this by hand is way too hard!

He then went on to reframe the scientific question as one of a massive search and verification of hypothesis space.

I walked out of that keynote pretty charged up.

I think the semantic web community can be a big part of tackling this grand challenge. Science and medicine have always been important domains for applying these technologies and that showed up at this conference as well:

SPARQL as a driver for other CS communities

The 10 year award was given to Jorge Perez, Marcelo Arenas and Claudio Gutierrez for their paper Semantics and Complexity of SPARQL. Jorge gave just a beautiful 10 minute reflection on the paper and the relationship between theory and practice. I think his slide below really sums up the impact that SPARQL has had not just on the semantic web community but CS as a whole:

2016-10-19 09.35.39.jpg

As further evidence, I thought one of the best technical talks of the conference (even through an earthquake) was by Peter Boncz on emergent schemas for RDF querying.

It was a clear example of how the DB and semweb communities are learning from one another, and of how the semantic web’s different requirements (e.g. around schemas) drive new research.

As a whole, it’s hard to beat a conference where you learn a ton and that has the following:

2016-10-19 10.52.15.jpg

Random Pointers

Last week, I was at Provenance Week 2016. This event happens once every two years and brings together a wide range of researchers working on provenance. You can check out my trip report from the last Provenance Week in 2014.  This year Provenance Week combined:

For me, Provenance Week is like coming home: lots of old friends and a favorite subject of mine. It’s also a good event to attend because it crosses the subfields of computer science, everything from security in operating systems to scientific workflows on to database theory. In one day, I went from a discussion on the role of indirection in data citation to staring at the C code of a database. Marta, Boris and Sarah really put together a solid program. There were about 60 attendees across the four days:

ProvenanceWeek_2016-06-08_D4S2484

So what was I doing there? Having served as co-chair of the W3C PROV working group, I thought it was important to be at the PROV: Three years later event, where we reflected on the status of PROV, its uptake and usage. I presented some ongoing work on measuring the usage of provenance on the web of data. Additionally, I gave the presentation of joint work led by my student Manolis Stamatogiannakis and done in conjunction with Ashish Gehani’s group at SRI. The work focused on using benchmarks to help inform decisions on what provenance capture system to use. Slides:

I’ll now walk through my three big takeaways from the event.

Provenance to attack Advanced Persistent Threats

DARPA’s $60 million Transparent Computing program explicitly calls out the use of provenance to address the problem of what’s called an Advanced Persistent Threat (APT). APTs are attacks that are long term, look like standard business processes, and involve the attacker knowing the system well. This has led to a number of groups exploring the use of system-level provenance capture techniques (e.g. SPADE and OPUS) and then integrating the results from multiple distributed sources using PROV-inspired data models. This was well described by David Archer in his talk as assembling multiple causal graphs from event streams. James Cheney’s talk on provenance segmentation also addressed these issues well. This reminded me somewhat of the work on distributed provenance capture using structured logs that the Netlogger and Pegasus teams do; however, they leverage the structure of a workflow system to help with the assembly.
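To make the idea of assembling a causal graph from event streams a bit more concrete, here is a toy sketch. It is not how SPADE, OPUS or any of the systems presented actually work: the event tuple format, the host and file names, and the `build_graph` and `lineage` helpers are all made up for illustration. It also assumes that an artifact name means the same thing on every host, and it ignores event ordering when answering queries, so the lineage it returns is an over-approximation.

```python
import heapq
from collections import defaultdict

# Hypothetical event format: (timestamp, host, process, action, artifact).
# Each stream is assumed to be sorted by timestamp already.
stream_a = [
    (1, "hostA", "scp", "read", "secrets.db"),
    (2, "hostA", "scp", "write", "upload.tmp"),
]
stream_b = [
    (3, "hostB", "curl", "read", "upload.tmp"),
    (4, "hostB", "curl", "write", "outbound.pkt"),
]


def build_graph(*streams):
    """Merge time-sorted event streams into one dependency graph.

    Edges point from an effect to its causes, loosely in the spirit of PROV:
    a process depends on the artifacts it read ("used"), and an artifact
    depends on the process that wrote it ("wasGeneratedBy").
    """
    depends_on = defaultdict(set)
    for _ts, host, process, action, artifact in heapq.merge(*streams):
        proc = f"{host}:{process}"
        if action == "read":
            depends_on[proc].add(artifact)
        elif action == "write":
            depends_on[artifact].add(proc)
    return depends_on


def lineage(graph, node):
    """Return everything the given node transitively depends on."""
    seen, stack = set(), [node]
    while stack:
        for cause in graph.get(stack.pop(), ()):
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


graph = build_graph(stream_a, stream_b)
print(sorted(lineage(graph, "outbound.pkt")))
```

In this toy example, asking for the lineage of the outbound packet on one host walks back through the temporary file to the process and database on the other host, which is the kind of cross-machine question the integrated graph is meant to answer.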

I particularly liked Yang Ji, Sangho Lee and Wenke Lee’s work on using user-level record and replay to track and replay provenance. This builds upon some of our work that used system-level record and replay as a mechanism for separating provenance capture and instrumentation, but now in user space, using the nifty rr tool from Mozilla. I think this thread of being able to apply provenance instrumentation after the fact on an execution trace holds a lot of promise.

Overall, it’s great to see this level of attention on the use of provenance for security and, more broadly, on using long term records of provenance to do analysis.

PROV as the starting point

Given that this was the ten year anniversary of IPAW, it was appropriate that Luc Moreau gave one of the keynotes. As really one of the drivers of the community, Luc gave a review of the development of the community and its successes. One of those outcomes was the W3C PROV standards.

Overall, it was nice to see the variety of uses of PROV and the tools built around it. It’s really become the jumping off point for exploration. For example, Pete Edwards’ team combined PROV and a number of other ontologies (including P-Plan) to create a semantic representation of what’s going on within a professional kitchen in order to check food safety compliance.

burger

Another example is the use of PROV as a jumping off point for the investigation into the provenance model of HL7 FHIR (a new standard for electronic healthcare records interchange).

As a whole, I think the attendees felt that what was missing was an active central point to see what was going on with PROV and pointers to resources for implementation. The aim is to make sure that the W3C PROV wiki is up-to-date and is a better resource overall.

Provenance as a lens: Data Citation, Documents & Versioning

An interesting theme was the use of provenance concepts to give a frame for other practices. For example, Susan Davidson gave a great keynote on data citation and how using a variant of provenance polynomials can help us understand how to automatically build citations for various parts of curated databases. The keynote was based off her work with James Frew and Peter Buneman that will appear in CACM (preprint). Another good example of provenance to support data citation was Nick Car’s work for Geoscience Australia.

Furthermore, the notion of provenance as the substructure for complex documents appeared several times. For example, the Impacts on Human Health of Global Climate Change report from globalchange.gov uses provenance as a backbone. Both the OPUS and PoeM systems are exploring using provenance to generate high-level experiment reports.

Finally, I thought David Koop’s versioning of version trees showed how using provenance as a lens can help better understand versioning of version trees themselves. (I have to give David credit for presenting a super recursive concept so well).

Overall, another great event and I hope we can continue to attract new CS researchers focusing on provenance.

Random Notes

  • PROV in JSON-LD – good for streaming
  • Theoretical provenance paper recipe = extend provenance polynomials to deal with new operators. Prove nice result. e.g. now for Linear Algebra.
  • Prefixes! R-PROV, P-PROV, D-PROV, FS-PROV, SC-PROV... let me know if I missed any.
  • Intel Software Guard Extensions (SGX) – interesting
  • Surprised how dependent I’ve become on taking pictures at conferences for note taking. Not being able to really impacted my flow. Plus, there are fewer pictures for this post.
  • Thanks to Adriane for hosting!
  • A provenance based data science environment
  • 👍Learning Health Systems – from Vasa Curcin