
I’m just getting back from a nice trip to the US where I attended the Academic Data Science Alliance leadership summit and, before that, the Web Conference 2023 (WWW 2023) in Austin, Texas. This is the premier academic conference on the Web. The conference organisation was led by two friends and collaborators, Dr. Juan Sequeda and Dr. Ying Ding. They did a fantastic job with the structure, food and keynotes (e.g. ACM Turing Award winner Bob Metcalfe), and who can’t give two thumbs up to BBQ and Austin live music? The last in-person Web Conference I was at was in 2018 in Lyon, so it was good to be back and to catch up with a lot of folks in the community.

Provenance Week 2023

The main reason that I was at Web Conf was for Provenance Week 2023, which was co-located. That name is a bit of a misnomer, since this time it was a special two-day event. In the past, we’ve run it as a separate week-long event, but coming out of the pandemic the steering committees felt that co-locating would be better. There were about 20 attendees. I was presenting the work we’ve done, led by Stefan Grafberger, on mlinspect and use cases for provenance in end-to-end machine learning. It was also nice to meet Julia Stoyanovich, our co-author on this work, for the first time in person. I was also very happy to celebrate the 10th anniversary of the W3C PROV provenance recommendation.

For that we organised a panel with the other co-chair of the working group (Prof. Luc Moreau) and two co-editors (Prof. Paolo Missier, Prof. Deborah McGuinness). All three are also leaders in provenance research. We were also joined by Bryon Jacob – CTO of data.world. It was excellent to have Bryon there, as data.world is a heavy user of PROV but wasn’t involved in the standardisation effort. He commented on how, from his perspective, the spec was really usable. We discussed the uptake of PROV. The panel felt that uptake has been good, with demonstrable use and impact, but the committee members were hoping for more. The fact that it is often used within systems or as a frame of reference (e.g. HL7 FHIR) means that it’s not as widely known as hoped. I think the panel did agree that provenance is needed now more than ever. For example, Bryon focused on data governance, where data.world employs provenance. It’s becoming critical to know where data comes from, both to understand the broader data estate and to deal with legal issues related to provenance. Additionally, generative AI is placing further demands on provenance. Here, I would point to work being pushed by Adobe, specifically the Content Authenticity Initiative and their Firefly tools + LLM. Overall, the panel reinforced to me the need for interoperable provenance and the role that PROV has played in providing a reference point.

Beyond the panel, I took four things from the workshop:

  1. The intersection of provenance and data science/AI pipelines is promising. There’s a clear demand for it (and broadly for ML-ops), but it also provides particular constraints that make designing provenance systems (somewhat) easier. You can make some assumptions about the kind of frameworks being used, and the targeted applications are not too specific but also not completely general purpose. There’s also space for empirical insights to drive system development. This intersection was being investigated not only in our work mentioned above. For example, in Vanessa Braganholo’s keynote on the noWorkflow provenance system, I thought their work on analysing 1.4 million Jupyter notebooks was cool. I’d also mention a number of other systems discussed at the workshop, including Data Provenance for Data Science, Vizier, and a new system focused on deep learning and data integration. Lastly, much of this work also touches on the importance of data cleaning in data science. Here, I liked the work presented by Bertram Ludäscher on using prospective provenance to document data cleaning and to reuse such pipelines.
  2. Provenance-by-design – I found this notion introduced by Luc in his paper interesting. Instead of trying to retrofit provenance gathering to applications either through instrumentation or logging, one should first design what provenance to capture and then integrate the business logic with that. In some sense, this is thinking about your workflow but also what you need to report. I can imagine this being beneficial in regulated environments such as banking or in sustainability applications as described in the paper above.
  3. Interesting tasks for provenance in databases: I liked a couple of different papers that used the provenance functionality of database systems (i.e. provenance polynomials – a small worked example follows this list) for various tasks. For example, the work by Tanja Auge on using provenance to help create sharable portions of databases; or the work on expanding the explanation of queries to include contextual information (+10 points for using basketball examples); or the use of this functionality to support database education as presented by Sudeepa Roy in her keynote; or even to support provenance for SHACL.
  4. Provenance as a measure for data value: I really enjoyed Boris Glavic’s keynote on relevance-based data management. In particular, the idea of determining relevance of data and using that to understand which data has value and which doesn’t. Also check out his deck if you want a checklist for doing a keynote 😀
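For readers who haven’t come across provenance polynomials before, here is a small worked example of my own (purely illustrative, not taken from any of the papers above): each input tuple gets a variable, joint use of tuples multiplies, and alternative derivations add.

```latex
% Toy example: a query result r derived either by joining input tuples
% t_1 and t_2, or independently from tuple t_3, carries the polynomial
\[
  \mathrm{prov}(r) \;=\; t_1 \cdot t_2 \;+\; t_3 .
\]
% Evaluating the polynomial with t_i = 0 (tuple deleted) or t_i = 1
% (tuple present) tells you whether r still exists, which is what makes
% the representation handy for deletion propagation, explanations, and
% carving out sharable portions of a database.
```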

Overall, I think the workshop was a success. It was good to catch up with old friends, and it was also nice to hear from the younger scholars there that they felt connected to the community. Thanks to Yuval and Daniel for organising and also for giving plenty of time for discussion during the workshop.


In addition to Provenance Week, there were a number of things that caught my eye at the conference. First, the Web Conf remains a top-tier conference that’s challenging to get into, with 1891 submissions and a 19% acceptance rate in the research track. Given the quality of the conference, there is an increasing number of submissions that maybe don’t really belong to the venue. Hence, I thought it was a great initiative by the organisers to really focus on defining what makes a Web Conference paper:

Generative AI

There was a lot of background discussion going on about generative AI and the implications for the Web. Here, I would point to three of the keynotes. From the perspective of misinformation and the potential for generative AI to expand it, the keynote by David Rand specifically addressed misinformation and how to combat it from a social science perspective. More broadly, Barbara Poblete advocated forcefully for inclusion in the development of AI systems and LLMs, based on her research on developing social media and AI systems in Chile. Bob Metcalfe, in his Turing Award speech, discussed the idea of an engineering mindset and embracing the problems and opportunities of new technology. In his case, it was the internet, but why not for generative AI?

From the research talks, I liked the work on creating a pretrained knowledge graph model that can then be used by prompting. I also liked the work on doing query log analysis on prompt logs from users of generative models to help understand user intent. This is a pretty interesting analysis over quite a lot of prompts:

Generative AI also provides a new source of knowledge. A nifty example of this was from the Creative Web track, where Bhavya et al. mined and, importantly, assessed creative analogies from GPT-3. There was also a nice example of extracting cultural common sense knowledge and examining the respective strengths and weaknesses of LLMs and knowledge graphs.

Wikidata

I spent some time in the history of the web sessions. This was really fun. Here, I would particularly call out the really great talk about the creation of Wikidata by Denny Vrandečić. It’s an amazing success story. Definitely check out the whole talk on YouTube.

More broadly, there were a number of useful talks about enriching Wikidata, specifically about completing Wikidata tables using GPT-3 and about using Wikidata to seed an information extractor for the web. This latter paper is interesting for me because it uses QA-based information extraction with an LLM, a technique that we’ve been researching heavily. What I thought was interesting is that they do the QA directly on the HTML source itself. They use Wikidata to fine-tune this extraction model.
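As a rough sketch of the general QA-style extraction idea (not the paper’s actual model, pipeline, or training setup), here is what asking a relation as a question over raw HTML can look like with an off-the-shelf extractive QA model; the HTML snippet and question are invented for illustration.

```python
# Minimal sketch of QA-style information extraction over raw HTML.
# Assumes the Hugging Face transformers library; the snippet and the
# question are placeholders, and the default QA model stands in for the
# paper's Wikidata-fine-tuned one.
from transformers import pipeline

qa = pipeline("question-answering")  # default extractive QA model

html = '<div class="infobox"><span>Born</span><span>14 March 1879, Ulm</span></div>'

# Phrase the slot we want to fill as a natural-language question and let
# the model point at the answer span directly in the HTML source.
result = qa(question="Where was the person born?", context=html)
print(result["answer"], result["score"])
```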

Taxonomies are back and a thought on KG completion

There were quite a number of papers on building taxonomies, including the student best paper award winner. Pointers:

In general, automatically created hierarchies are useful for browsing and also for computer vision problems. Whether these papers are truly tackling taxonomies or just building hierarchies was a discussion we were having in the coffee break.

More broadly, in the sessions where these papers were presented, there were a lot of papers on link prediction/node classification in knowledge graphs, whether with metapaths, tackling temporal knowledge graphs, or using multiple modalities. I’ve done work on this task myself, but it would be nice to see different topics and, more importantly, different evaluation datasets. As Denny noted, Freebase shut down in 2014 and we’re still doing evaluation on it.

Overall, I think the Web as an evolving platform still presents some of the most exciting research challenges in CS. Austin was a great place to have it. Kudos to the team and the community.


Two weeks ago, I had the pleasure of attending the 17th International Semantic Web Conference held at the Asilomar Conference Grounds in California – a tremendously beautiful setting in a state park along the ocean. This trip report is somewhat later than normal because I took the opportunity to hang out for another week along the coast of California.

Before getting into the content of the conference, I think it’s worth saying: if you don’t believe that there are capable, talented, smart and awesome women in computer science at every level of seniority, the ISWC 2018 organizing committee + keynote speakers is the mic drop of counterexamples:

Now some stats:

  •  438 attendees
  •  Papers
    •  Research Track: 167 submissions – 39 accepted – 23% acceptance rate
    •  In Use: 55 submissions – 17 accepted – 31% acceptance rate
    •  Resources: 31 submissions – 6 accepted – 19% acceptance rate
  •  38 Posters & 39 Demos
  • 14 industry presentations
  • Over 1000 reviews

These are roughly the same as the last time ISWC was held in the United States. So on to the major themes I took away from the conference plus some asides.

Knowledge Graphs as enterprise assets

It was hard to walk away from the conference without being convinced that knowledge graphs are becoming fundamental to delivering modern information solutions in many domains. The enterprise knowledge graph panel was a demonstration of this idea. A big chunk of the majors were represented:

The stats are impressive. Google’s Knowledge Graph has 1 billion things and 70 billion assertions. Facebook’s knowledge graph, which they distinguish from their social graph and which has just ramped up this year, has 50 million entities and 500 million assertions. More importantly, these are critical assets for applications: at eBay their KG is central to creating product pages, at Google and Microsoft KGs are key to entity search and assistants, and at IBM they use it as part of their corporate offerings. But you know it’s really in use when knowledge graphs are used for emoji:

It wasn’t just the majors who have or are deploying knowledge graphs. The industry track in particular was full of good examples of knowledge graphs being used in practice. Some that stood out were: Bosch’s use of knowledge graphs for question answering in DIY, multiple use cases for digital twin management (Siemens, Aibel), use in a healthcare chatbot (Babylon Health), and helping to regulate the US finance industry (FINRA). I was also very impressed with Diffbot’s platform for creating KGs from the Web. I contributed to the industry session, presenting how Elsevier is using knowledge graphs to drive new products in institutional showcasing and healthcare.

Beyond the wide use of knowledge graphs, there were a number of things I took away from this thread of industrial adoption.

  1. Technology heterogeneity is really the norm. All sorts of storage, processing and representation approaches were being used. It’s good we have the W3C Semantic Web stack but it’s even better that the principles of knowledge representation for messy data are being applied. This is exemplified by Amazon Neptune’s support for TinkerPop & SPARQL.
  2. It’s still hard to build these things. Microsoft said it was hard at scale. IBM said it was hard for unique domains. I had several people come to me after my talk about Elsevier’s H-Graph discussing similar challenges faced in other organizations that are trying to bring their data together especially for machine learning based applications. Note, McCusker’s work is some of the better publicly available thinking on trying to address the entire KG construction lifecycle.
  3. Identity is a real challenge. I think one of the important moves in the success of knowledge graphs was not to over-ontologize. However, record linkage and deciding when to unify an entity is still not a solved problem. One common approach was to move the creation of an identifiable entity closer to query time to deal with the query context, but that removes the shared conceptualization that is one of the benefits of a knowledge graph. Indeed, the clarion call by Google’s Jamie Taylor to teach knowledge representation was an outcome of the need for people who can think about these kinds of problems.

In terms of research challenges, much of what was discussed reflects the same kinds of ideas that were discussed at the recent Dagstuhl Knowledge Graph Seminar so I’ll point you to my summary from that event.

Finally, for most enterprises, their knowledge graph(s) were considered a unique asset to the company. This led to an interesting discussion about how to share “common knowledge” and the need to be able to merge such knowledge with local knowledge. This leads to my next theme from the conference.

Wikidata as the default option

When discussing “common knowledge”, Wikidata has become a focal point. In the enterprise knowledge graph panel, it was mentioned as the natural place to collaborate on common knowledge. The mechanics of the contribution structure (e.g. open to all, provenance on statements) and institutional attention/authority (i.e. the Wikimedia Foundation) help with this. An example of Wikidata acting as a default is its use to help collate data on genes.

Fittingly enough, Markus Krötzsch and team won the best in-use paper with a convincing demonstration of how well semantic technologies have worked as the query environment for Wikidata. Furthermore, Denny Vrandečić (one of the founders of Wikidata) won the best blue sky paper with the idea of rendering Wikipedia articles directly from Wikidata.

Deep Learning diffusion

As with practically every other conference I’ve been to this year, deep learning as a technique has really been taken up. It’s become just part of the semantic web researcher’s toolbox. This was particularly clear in the knowledge graph construction area. Papers I liked with DL as part of the solution:

While not DL per se, I’ll lump embeddings into this section as well. Papers I thought were interesting:

The presentation of the above paper was excellent. I particularly liked their slide on related work:


As an aside, the work on learning rules and the complementarity of rules to other forms of prediction was an interesting thread in the conference. Besides the above paper, see the work from Heiner Stuckenschmidt’s group on evaluating rules and embedding approaches for knowledge graph completion. The work of Fabian Suchanek’s group on the representativeness of knowledge bases is applicable as well in order to tell whether rule learning from knowledge graphs is coming from a representative source and is also interesting in its own right. Lastly, I thought the use of rules in Beretta et al.’s work to quantify the evidence of an assertion in a knowledge graph to help improve reliability was neat.

Information Quality and Context of Use

The final theme is a bit harder for me to solidify and articulate but it lies at the intersection of information quality and how that information is being used. It’s not just knowing the provenance of information but it’s knowing how information propagates and was intended to be used. Both the upstream and downstream need to be considered. As a consumer of information I want to know the reliability of the information I’m consuming. As a producer I want to know if my information is being used for what it was intended for.

The latter problem was demonstrated by the keynote from Jennifer Golbeck on privacy. She touched on a wide variety of work, but in particular it’s clear that people don’t know what is happening to their data yet are concerned about it.

There was also quite a bit of discussion going on about the decentralized web and Tim Berners-Lee’s Solid project throughout the conference. The workshop on decentralization was well attended. Something to keep your eye on.

The keynote by Natasha Noy also touched more broadly on the necessity of quality information this time with respect to scientific data.

The notion of propagation of bias through our information systems was also touched on and is something I’ve been thinking about in terms of data supply chains:

That being said I think there’s an interesting path forward for using technology to address these issues. Yolanda Gil’s work on the need for AI to address our own biases in science is a step forward in that direction. This is a slide from her excellent keynote at SemSci Workshop:


All this is to say that this is an absolutely critical topic and one where the standard “more research is needed” is very true. I’m happy to see this community thinking about it.

Final Thought

The Semantic Web community has produced a lot (see this slide from Natasha’s keynote):


ISWC 2018 definitely added to that body of knowledge but more importantly I think did a fantastic job of reinforcing and exciting the community.


Last week, I was at the first Language, Data and Knowledge Conference (LDK 2017), hosted in Galway, Ireland. If you show up at a natural language processing conference (especially someplace like LREC) you’ll find a group of people who think about and use linked/structured data. Likewise, if you show up at a linked data/semantic web conference, you’ll find folks who think about and use NLP. I would characterize LDK 2017 as a place where that intersection of people can hang out for a couple of days.

The conference had ~80 attendees by my count. I enjoyed the setup of a single track, plenty of time to talk, and a real effort to build the community by doing things together. I also enjoyed the fact that there were 4 keynotes for just two days. It really helped give the conference a spark.

Here are some of my take-aways from the conference:

Social science as a new challenge domain

Antal van den Bosch gave an excellent keynote emphasizing the need for what he termed a holistic approach to language, especially for questions in the humanities and social science (tutorial here). This holistic approach takes into account the rich context that words occur in. In particular, he called out the notions of idiolect and sociolect – the ways words are understood/used by an individual and within a particular social group. He argued that understanding these computationally is a key notion in driving tasks like recommendation.

I personally was interested in Antal’s joint work with Folgert Karsdorp (check out his GitHub repos!) on Story Networks – constructing networks of how stories are told and retold. For example, how the story of Red Riding Hood has morphed and changed over time, and what the key sources for it are. This reminded me of the work on information diffusion in social networks. It has direct bearing on how we can detect and track how ideas and technologies propagate in science communication.

I had a great discussion with the SocialAI team (Erica Briscoe & Scott Appling) from Georgia Tech about their work on computational social science. In particular, two pointers: the new DARPA Next Generation Social Science program to scale up social science research, and their work on characterizing technology capabilities from data for innovation assessment.

Turning toward the long tail of entities

There were a number of talks that focused on how to deal with entities that aren’t necessarily popular. Bichen Shi presented work done at Nokia Bell Labs on entity mention disambiguation. They used Apache Spark to train 700,000 classifiers – one for every entity mention in Wikipedia. This allowed them to obtain much more accurate per-mention entity links. Note they used Gerbil for their evaluation. Likewise, Hendrik ter Horst focused on entity linking specifically targeting technical domains (i.e. MeSH & chemicals). During Q/A it was clear that straight-up gazetteering provides an extremely strong baseline in this task. Marieke van Erp presented work on fine-grained entity typing in Spanish and Dutch using word embeddings to classify hundreds of types.

Natural language generation from KBs is worth a deeper look

Natural language generation from knowledge bases continues apace. Kathleen McKeown‘s keynote touched on this, in particular her recent work on mining paraphrasal templates that combines both knowledge bases and free text. I was impressed with the work of Nina Dethlefs on using deep learning for generating textual descriptions from a knowledge base. The key insight was how to quickly build NLG systems where the data is sparse, using hierarchical composition. In googling around while writing this trip report I stumbled upon Ehud Reiter’s blog, which is a good read.

A couple of nice overview slides

While not a theme, there were some really nice slides describing fundamentals.

From C. Maria Keet:


From Christian Chiarcos/Bettina Klimek:


From Sangha Nam:


Overall, it was a good kick-off for a new conference. Very well organized, with some nice research.


Last week, I was at the Theory and Practice of Provenance 2015 workshop (TaPP’15), held in Edinburgh. This is the seventh edition of the workshop. You can check out my trip report from last year’s event, which was held during Provenance Week, here. TaPP’s aim is to be a venue where people can present their early and innovative research ideas.

The event is useful because it brings a cross section of researchers from different CS communities ranging from databases, programming language theory, distributed systems, to e-science and the semantic web. While it’s nice to see old friends at this event, one discussion that was had during the two days was how we can connect back in a stronger fashion to these larger communities especially as the interest in provenance increases within them.

I discuss the three themes I pulled from the event but you can take a look at all of the papers online at the event’s site and see what you think.

1. Execution traces as a core primitive

I was happy to be presenting on behalf of one of my students, Manolis Stamatogiannakis, who’s been studying how to capture provenance of desktop systems using virtual machines and other technologies from the systems community. (He’s hanging out at SRI with Ashish Gehani for the summer so couldn’t make it.) A key idea in the paper we presented was to separate the capture of an execution trace from the instrumentation needed to analyze provenance (paper). The slides for the talk are embedded below:

The mechanism used to do this is a technology called record & replay (we use PANDA), but this notion of capturing a lightweight execution trace and then replaying it deterministically is also popping up in other communities. For example, Boris Glavic has been using it successfully for database provenance in his work on GProM and reenactment queries. There he uses the audit logging and time travel features of modern databases (i.e. an execution trace) to support rich provenance queries.

This need to separate capture from queries was emphasized by David Gammack and Adriane Chapman’s work on trying to develop agent-based models to figure out what instrumentation needs to be applied in order to capture provenance. Until we can efficiently capture everything, this is still going to be a stumbling block for completely provenance-aware systems. I think that treating execution traces as a core primitive for provenance systems may be a way forward.

2. Workflow lessons in non-workflow environments

There are numerous benefits to using (scientific) workflow systems for computational experiments, one of which is that they provide a good mechanism for capturing provenance in a declarative form. However, not all users can or will adopt workflow environments. Many use computational notebooks (e.g. Jupyter) or just shell scripts. The YesWorkflow system (very inside community joke here) uses convention and a series of comments to help users produce a workflow and provenance structure from their scripts and file system (paper). Likewise, work on combining noWorkflow, a provenance tracking system for Python, and IPython notebooks shows real promise (paper). This reminded me of the PROV-O-Matic work by Rinke Hoekstra.
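To give a flavour of the convention-based approach, here is a minimal sketch of YesWorkflow-style comment annotations on an ordinary Python script. The tags follow my recollection of the YesWorkflow convention (@begin/@end blocks with @in/@out declarations); the script itself is an invented example, not from the paper.

```python
# @begin clean_temperatures @desc Drop rows with missing sensor readings
# @in raw_file @uri file:data/raw_temps.csv
# @out clean_file @uri file:data/clean_temps.csv
import csv

def clean_temperatures(raw_file="data/raw_temps.csv",
                       clean_file="data/clean_temps.csv"):
    with open(raw_file) as src, open(clean_file, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row and row[-1] not in ("", "NA"):  # keep complete readings only
                writer.writerow(row)

if __name__ == "__main__":
    clean_temperatures()
# @end clean_temperatures
```

The point of the convention is that the script runs completely unchanged; the tool only parses the comments and recovers a dataflow graph (prospective provenance) from the @in/@out declarations.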

Of course you can combine yesWorkflow and noWorkflow together into one big system.

Overall, I like the trend towards applying workflow concepts in-situ. It got me thinking about applying the scientific workflow results to the abstractions provided by Apache Spark. Just a hunch that this might be an interesting direction.

3. Completing the pipeline

The last theme I wanted to pull out is that I think we are inching towards being able to truly connect provenance generated by different applications. My first example is the work by Glavic and his students on importing and ingesting PROV-JSON into a database. This lets you query the provenance of query results while including information on the pipelines that got the data there.

This is something I’ve wanted to do for ages with Marcin Wylot’s work on TripleProv. I was a bit bummed that Boris got there first, but I’m glad somebody did it 🙂

The second example was the continued push forward for provenance in the systems community. In particular, the OPUS and SPADE systems, which I was aware of, but now also the work on Linux Provenance Modules by Adam Bates that was introduced to me at TaPP. These all point to the ability to leverage key operating system constructs to capture and manage provenance. For example, Adam showed how to make use of mandatory access control policies to provide focused and complete capture of provenance for particular applications.

I have high hopes here.

Random thoughts

To conclude I’ll end with some thoughts from the notebook.

I hope to see many familiar and new faces at next year’s Provenance Week (which combines TaPP and IPAW).

From Florence, I headed to Washington D.C. to attend the Society for Scholarly Publishing Annual Meeting (SSP) last week. This conference is a big event for academic publishers. It’s primarily attended by people who either work for publishers or for companies that provide services to them. It also includes a smattering of librarians and technologists. The demographic was quite different from WWW – more people in suits and more women (attention CS community). This reflects the make-up of the academic publishing industry as a whole, as shown by the survey done by Amy Brand, which was presented at the beginning of the conference. Full data here. Here’s a small glimpse.


Big Literature, Big Usage

I was at SSP primarily to give a talk and then be on a panel in the Big Literature, Big Usage session. The session with Jan Velterop and Paul Cohen went well. Jan presented the need for large-scale analysis of the literature in order for science to progress (shout out to the Steve Pettifer-led Lazarus project). He had a great set of slides showing how fast one would have to read in order to keep up with every paper being deposited in PubMed (the pages just flash by). I followed up with a discussion of recent work on machine reading (see below) and how it impacts publishers. The aim was to show that automated analysis and reading of the literature is not somewhere off in the future but is viable now.

Paul Cohen from DARPA followed up with a discussion of their Big Mechanism program. This effort is absolutely fascinating. Paul’s claim was that we currently cannot understand large-scale complex systems. He characterized this as a pull vs. a push approach. Paraphrasing: we currently try to pull knowledge into individuals’ heads and then make connections (i.e. build a model), versus a push approach where we push all the information out of individuals and have the computer build a model. The former makes understanding such large-scale systems for all intents and purposes impossible. To attack this problem, the program’s aim is to automatically build large-scale causal computational models directly from the literature. Paul pointed out there are still difficulties with machine reading (e.g. coreference resolution is still a challenge); however, the progress is there. Amazingly, they are having success with building models in cancer biology. Both the vision and the passion in Paul’s talk were just compelling. (As a nice aside, folks at Elsevier (e.g. Anita De Waard) are a small part of the project and are participating in this program.)

We followed up with a Q/A panel session. We discussed the fact that all of us believe that sharing/publishing of computational models is really the future. It’s just unfortunate to lock this information up in text. We answered a number of questions around feasibility (yes, this is happening even if it’s hard). Also, we discussed the impossibility of doing some of this science without having computers deeply involved. Things are just too complicated and we are not getting the requisite productivity.

So that was my bit. I also got to attend a number of sessions and catch up with a number of people. What’s up @MarkHahnel  and @jenniferlin15  – thanks for letting me heckle from the back of the room 🙂

Business Stuff

I attended a number of sessions that discussed the business of academic publishing. The big factor seems to be fairly flat growth in library budgets combined with a growing number of journals. This was mentioned in the talk by Jayne Marks from Wolters Kluwer as well as in a whole session on where to find growth. I thought the mergers and acquisitions talk from @lamb was of interest. It seems that there is even more room for consolidation in the industry.


I also feel that the availability of large amounts of cheap capital has not been taken full advantage of by the industry. Beyond consolidation, new product development seems to be the best way to get growth. I think one notion that’s of interest is the transition towards the Big Funder Deal, where funders essentially pay in bulk for their research to be published.

I enjoyed the session on the cost of OA business models. A very interesting set of slides from Robert Kiley about the Wellcome Trust’s open access journal spend is embedded below. This is a must-look in terms of where costs are coming from. It is a clarion call to all publishers in terms of delivering what they say they are going to deliver.

Pete Binfield of PeerJ gave an insightful talk about the disruptive nature of non-legacy, digital-only OA publishers. However, I think it may overestimate the costs of the current subscription infrastructure. Also, as Marks noted in her talk, 50% of physicians still require print and a majority of students want print textbooks. I wonder how much this is the predominant factor in legacy costs?


Overall, throughout the sessions, it still felt a bit… umm… slow. We are still talking papers, maybe with a bit of metrics or data in for spice, but I think there’s much more that scholarly media/information providers can do to help us scientists do better.

Media & Technology

The conference lined up three excellent keynote sessions. The former CEO of MakerBot, Jenny Lawton, gave a “life advice” talk. I think the best line was “do a gap analysis on yourself”. Actually, the most interesting bit was her answer to a question about open source. Her answer was that we live in a very IP- and patent-oriented world, and if you want to be an open company it’s important to figure out how to work strategically in that world. The interview format with the New Yorker author Ken Auletta worked great. His book Googled: The End of the World As We Know It is now on my to-read list. A couple of interesting points:

  • New York Times subscribers read 35 minutes a day (print) vs 35 minutes a month (online)
  • Human factors drive more decisions in the highest levels of business than let on.
  • Editorial and fact checking at the level of the New Yorker is a game changer for an author.
  • He’s really big on having a calling card.

Finally, Charles Watkinson gave a talk about how monograph publishing is experimenting with digital and how it’s adopting many of the same features as journal articles. He called out Morgan & Claypool’s Synthesis series as an example of an innovator in this space – I’m an editor 😉

I always enjoy T Scott Plutchak’s talks. He talked about his new role in bringing together data wranglers across his university. He made a good case that this role is really necessary in today’s science. I agree. But it’s unclear how one can keep the talent needed for data wrangling within academia especially in the library.

Overall, SSP was useful in understanding the current state and thinking of this industry.


Last week, I was at FORCE 2015 – the future of research communications and e-scholarship conference – held in Oxford. This is the third conference in a series that started with Beyond the PDF in 2011 and continued with Beyond the PDF 2, which I led the organization of in Amsterdam in 2013 (my wrap-up is here). This conference provides one of the only forums that brings together a variety of people who are in the vanguard of scholarly communication, from librarians and computer scientists to researchers, funders and publishers. Pretty much every role was represented among the ~250 attendees.

To give you an idea, I saw the developers of the Papers reference manager, the editorial director of PLOS One, a funder from the Wellcome Trust, a librarian from the University of Iowa, and junior public policy researchers from Brazil and Germany.

The curators (i.e. conference chairs), Dave De Roure and Melissa Haendel, did a great job of pulling in a whole range of topics and styles in a great venue. We even had the opportunity to see copies of the Philosophical Transactions. Speaking from experience, this is a tough conference to organize because everything is pretty dynamic and there are lots of different styles (e.g. Dave and the last-minute beer run for the hackathon!).

So what was I doing there? I helped organize the hackathon, which gave people some space to work on content extraction and on reference manager support for data citation, and to talk over pizza. This led to proposals for two 1k challenges. (Remember to vote for which one you want to give 1000 pounds to.) I also helped organize the poster and demo / geek-out sessions. A trailer for those sessions is below:

Themes

The conference was too packed to go through everything, but I wanted to cover the core themes that I got out of it:

1. Scholarly media is not just text

Data, images, slides, videos, software – scholarly media is not just text. It never was, but it’s clear that the primacy of text is slowly being reduced and that these other forms of output will eventually be treated on par with it. This is being made possible by the number of new platforms being introduced, whether it’s Figshare or Zenodo for data, GitHub for code, or HUBzero for the entire analytics lifecycle. It’s about sharing the actual research object rather than the textual argument. I think what brought this home to me is the amount of time spent discussing and presenting how these content types can be shoehorned into traditional text environments (e.g. journal citations).

2. Not access, understanding

The assumption at FORCE 2015 is that scholarship will be open access. The question then arises: what do you do with the open access content? Phil Bourne, in his closing remarks, mentioned the lack of things being done with the current open access corpus. This notion of the need to do more clearly came over in the keynote by Chris Lintott, founder of Galaxy Zoo:

He discussed how the literature is a barrier to amateurs contributing more to science. Specifically, he mentioned accessible research summaries. But, in general, there is a need to consider a more diverse audience in our communication, not only amateurs but also scientists from other disciplines or policy makers, for example.

3. Quality under pressure

The amount of scholarship continues to grow and there are perverse incentives. Scott Edmunds from GigaScience brought this out in his vision talk.

The current answer to this is peer review. But as most researchers will tell you, we are already overwhelmed. I get tons of requests to review and it’s hard to turn down my colleagues. Maybe a market for peer review will develop (see below) but what we need is more automated mechanisms of quality control or for publishers to do more quality control before things get sent to reviewers. Maybe we should see peer review as constructive feedback and not a filter. Likewise, by valuing other parts of the system maybe we can increase both the transparency and overall quality of the science.

4. Science as a service

The poster below from Bianca Kramer and  Jeroen Bosman highlighted the explosion in services available for scholarly communication. This continues a theme that I emphasized last year and that Ian Foster has talked about – the ability to do more and more science by just calling an API. Why can’t I build my lab from a cafe?

Wrap-up & Random Notes

The FORCE community is a special one. I hope we can continue to work together to push scholarly communication forward. I’m already looking forward to FORCE 2016 in Portland. There’s lots to be excited about as the way we do research rapidly changes. Finally, here are some random notes from the conference:

Last week I got back from a great 8(!) days in Riva del Garda, Italy, attending the 2014 International Semantic Web Conference and associated events. This is one of those events where your colleagues on Facebook get annoyed with the pretty pictures of lakes and mountains that their other colleagues keep posting:


ISWC is the key conference for semantic web research and the place to see what’s happening. This year’s conference had 630 attendees which is a strong showing for the event. The conference is as usual selective:
Interestingly, the numbers were about on par with last year, except for the in-use track where we had a much larger number of submissions. I suspect this is because all tracks had synchronized submission deadlines, whereas last year the in-use deadline was after the research track’s. The replication, dataset, software, and benchmark track is a new addition to the conference, and a good one I might add. Having a place to present these sorts of scholarly output is important and, from my perspective, a good move by the conference. You can find the papers (published and in preprint form) on the website. More importantly, you can find a big chunk of the slides presented on Eventifier.

So why was I hanging out in Italy (other than the pasta)? I was a co-organizer of the Doctoral Consortium for the event.

Additionally, I was on a panel for the Context, Interpretation and Meaning workshop. I also attended a pre-meeting on archiving linked data for the PRELIDA project. Lastly, we had an in-use paper in the conference on adaptive linking used within the Open PHACTS platform to support chemistry. Alasdair Gray did a fantastic job of leading and presenting the paper.

So on to the show. Three themes, which I discuss in turn:

  1. It’s not Volume, it’s Variety
  2. Variety & the Semantic Spectrum
  3. Fuzziness & Metrics

It’s not Volume, it’s Variety

I’m becoming more convinced that the issue for most “big” data problems isn’t volume or velocity, it’s variety. In particular, I think the hardware/systems folks are addressing the first two problems at a rate that means that for many (most?) workloads the software abstractions provided are enough to deal with the data sizes and speed involved. This inkling was confirmed to me a couple of weeks ago when I saw a talk by Peter Hofstee, the designer of the Cell microprocessor, talking about his recent work on computer architectures for big data.

This notion was further confirmed at ISWC. Bryan Thompson, of BigData triple store fame, presented his new work (mapgraph.io) that can do graph processing on hundreds of millions of nodes using GPUs, with abstractions similar to Signal/Collect or GraphLab. Additionally, sitting in the session on large-scale RDF processing, I noticed that many of the systems were focused on a clustered environment but used ~100 million triple test sets, even though you can process these with a single beefy server. It seems that for online analytics workloads you can get by with a simple server setup, and for truly web-scale workloads you need clusters that can be provisioned fairly straightforwardly using the cloud. In our community the best examples are webdatacommons.org or the work of the VU team on LODLaundry – both of these process graphs in the billions using the Hadoop ecosystem on either local or Amazon-based clusters. Furthermore, the best paper in the in-use track (Semantic Traffic Diagnosis with STAR-CITY: Architecture and Lessons Learned from Deployment in Dublin, Bologna, Miami and Rio) from IBM actually scrapped the use of a dedicated streaming system because even the data coming from traffic sensors wasn’t fast enough to make it worthwhile.

Indeed, in Prabhakar Raghavan‘s (yes! the Introduction to Information Retrieval and Google guy) keynote, he noted that he would love to have problems that were just computational in nature. Likewise, Yolanda Gil discussed that the challenges lie not necessarily in data analysis but in data preparation (i.e. it’s a data mess!).

The hard part is data variety and heterogeneity, which transitions, nicely, into our next theme…

Variety & the Semantic Spectrum

Chris Bizer gave an update to the measurements of the Linked Data Cloud – this was a highlight talk.

The Linked Data Cloud has essentially doubled in size (to, generously, ~1000 datasets), but schema.org-based data (see the Microdata+RDFa ISWC 2014 paper) spans ~500,000 datasets. Chris gave an interesting analysis of what he thinks this means in a nice mailing list post. The comparison is summed up below:

So what we are dealing with is really a spectrum of semantics, from extremely rich knowledge bases to more shallow mark-up. (As a side note: Guha’s thoughts on schema.org are always worth a revisit.) To address this spectrum, I saw quite a few papers trying to deal with it using a variety of CS techniques, from NLP to databases. Indeed, two of the best papers were related to this subject:

Also on this front were works on optimizing link discovery (HELIOS), machine reading (SHELDON), entity recognition, and querying probabilistic triple stores. All of these works have in common that they take approaches from other CS fields and adapt or improve them to deal with these problems of variety within a spectrum of semantics.

Fuzziness & Metrics

The final theme that I pulled out of the conference was evaluation metrics, in particular ones that either deal with or cater for the fact that there are no hard truths, especially when using corpora developed from human judgements. The quintessential example of this is my colleague Lora Aroyo’s work on CrowdTruth – trying to capture disagreement in the process of creating gold standard corpora in crowdsourcing environments. Another example is the very nice work from Michelle Cheatham and Pascal Hitzler on creating an uncertain OAEI conference benchmark. Raghavan‘s keynote also homed in on the need for more metrics, especially as the type of search interface we typically use changes (going from keyword searches to more predictive contextual search). This theme was also prevalent in the workshops, in particular how do we measure in the face of changing contexts. Examples include:

A Note on the Best Reviewers

Good citizens:

A nice note: some were nominated by authors of papers that the reviewer rejected because the review was so good. That’s what good peer review is about – improving our science.

Random Notes

  • Love the work Bizer and crew are doing on Web Tables. Check it out.
  • Conferences are so good for quick lit reviews. Thanks to Bijan Parsia who sent me in the direction of Pavel Klinov‘s work on probabilistic reasoning over inconsistent ontologies.
  • grafter.org – nice site
  • Yes, you can reproduce results.
  • There’s more provenance on the Web of Data than ever. (Unfortunately, PROV is still small percentage wise.)
  • On the other hand, PROV was in many talks like last year. It’s become a touch point. Another post on this is on the way.
  • The work by Halpin and Cheney on using SPARQL update for provenance tracking is quite cool. 
  • A win from the VU: DIVE 3rd place in the semantic web challenge 
  • Amazing wifi at the conference! Unbelievable!
  • +1 to the Poster & Demo crew: keeping 160 lightening talks going on time and fun – that’s hard
  • 10 year award goes to software: Protege: well deserved
  • http://ws.nju.edu.cn/explass/
  • From Nigel’s keynote: it seems that the killer app of open data is …. insurance
  • Two years in a row that stuff I worked has gotten a shout out in a keynote (Social Task Networks). 😃
  • ….. I don’t think the streak will last
  • 99% of queries have nouns (i.e. entities)
  • I hope I did Sarven’s Call for Linked Research justice
  • We really ought to archive LOV – vocabularies are small but they take a lot of work. It’s worth it.
  • The Media Ecology project is pretty cool. Clearly, people who have lived in LA (e.g. Mark Williams) just know what it takes 😉
  • Like: Linked Data Fragments – that’s the way to question assumptions.
  • A low-carb diet in Italy – lots of running

Welcome to a massive multimedia extravaganza trip report from Provenance Week, held earlier this month, June 9-13.

Provenance Week brought together two workshops on provenance plus several co-located events. It had roughly 65 participants. It’s not a huge event but it’s a pivotal one for me as it brings together all the core researchers working on provenance from a range of computer science disciplines. That means you hear the latest research on the topic ranging from great deployments of provenance systems to the newest ideas on theoretical properties of provenance. Here’s a picture of the whole crew:

Given that I’m deeply involved in the community, it’s going to be hard to summarize everything of interest because… well… everything was of interest. It also means I had a lot of stuff going on. So what was I doing there?

Activities


 

PROV Tutorial

Together with Luc Moreau and Trung Dong Huynh, I kicked off the week with a tutorial on the W3C PROV provenance model. The tutorial was based on my recent book with Luc. From my count, we had ~30 participants for the tutorial.

We’ve given tutorials in the past on PROV but we made a number of updates as PROV is becoming more mature. First, as the audience had a more diverse technical background we came from a conceptual model (UML) point of view instead of starting with a Semantic Web perspective. Furthermore, we presented both tools and recipes for using PROV. The number of tools we now have out for PROV is growing – ranging from  conversion of PROV from various version control systems to neuroimaging workflow pipelines that support PROV.

I think the hit of the show was Dong’s demonstration of interacting with PROV using his Prov python module (pypi) and Southampton’s Prov Store.
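If you haven’t seen the prov package before, here is a minimal sketch of my own (a toy example, not the demo from the tutorial) of building and serializing a small PROV document with it:

```python
# Minimal sketch using the prov package (pip install prov). The namespace
# and identifiers below are invented purely for illustration.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

report = doc.entity("ex:trip-report")
writing = doc.activity("ex:writing-it-up")
author = doc.agent("ex:paul")

doc.wasGeneratedBy(report, writing)      # the report came out of the writing activity
doc.wasAssociatedWith(writing, author)   # the author carried out that activity
doc.wasAttributedTo(report, author)      # and gets the credit

print(doc.get_provn())   # PROV-N text form
print(doc.serialize())   # PROV-JSON, ready to push to a store
```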

Papers & Posters

I had two papers in the main track of the International Provenance and Annotation Workshop (IPAW) as well as a demo and a poster.

Manolis Stamatogiannakis presented his work with me and Herbert Bos – Looking Inside the Black-Box: Capturing Data Provenance using Dynamic Instrumentation. In this work, we looked at applying dynamic binary taint tracking to capture high-fidelity provenance on desktop systems. This work solves what’s known as the n-by-m problem in provenance systems. Essentially, it allows us to see how data flows within an application without having to instrument that application up-front. This lets us know exactly which output of a program is connected to which inputs. The work was well received and we had a bunch of different questions both around speed of the approach and whether we can track high-level application semantics. A demo video is below and you can find all the source code on github.
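As a toy illustration of the taint-tracking idea (this has nothing to do with the actual PANDA-based implementation, which works on unmodified binaries), the sketch below shows how propagating taint labels through operations recovers exactly which outputs depend on which inputs – the n-by-m question:

```python
# Toy taint propagation: every value carries the set of input labels it
# depends on, and operations union those sets.
class Tainted:
    def __init__(self, value, taints):
        self.value, self.taints = value, set(taints)

    def __add__(self, other):
        return Tainted(self.value + other.value, self.taints | other.taints)

a = Tainted(3, {"input_a"})
b = Tainted(4, {"input_b"})
c = Tainted(10, {"input_c"})

out1 = a + b   # derived from input_a and input_b
out2 = c + c   # derived only from input_c

print(out1.value, sorted(out1.taints))  # 7 ['input_a', 'input_b']
print(out2.value, sorted(out2.taints))  # 20 ['input_c']
```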

We also presented our work on converting PROV graphs to IPython notebooks for creating scientific documentation (Generating Scientific Documentation for Computational Experiments Using Provenance). Here we looked at how to create documentation from provenance that is gathered in a distributed setting and put it together in an easy-to-use fashion. This work was part of a larger discussion at the event on the connection between provenance gathered in these popular notebook environments and that gathered on more heterogeneous systems. Source code, again, on github.

I presented a poster on our recent work (with Marcin Wylot and Philippe Cudré-Mauroux) on instrumenting a triple store (i.e. a graph database) with provenance. We use a long-standing technique from the database community, provenance polynomials, but applied to large-scale RDF graphs. It was good to be able to present this to those from the database community who were at the conference. I got some good feedback, in particular on some efficiencies we might implement.

 

I also demoed (see above) the really awesome work by Rinke Hoekstra on his PROV-O-Viz provenance visualization service (Paper, Code). This was a real hit, with a number of people wanting to integrate it with their provenance tools.

Provenance Reconstruction + ProvBench

At the end of the week, we co-organized an afternoon with the ProvBench folks about challenge tasks and benchmark datasets. In particular, we looked at the challenge of provenance reconstruction – how do you recreate provenance from data when you didn’t track it in the first place? Together with Tom De Nies we produced a number of datasets for use with this task. It was pretty cool to see that Hazeline Asuncion used these datasets in one of her classes, where her students applied a wide variety of off-the-shelf methods.

From the performance scores, precision was OK but very dataset dependent and relied a lot on knowledge of the domain. We’ll be working with Hazeline to look at defining different aspects of this problem going forward.

Provenance reconstruction is just one task where we need datasets. ProvBench is focused on gathering those datasets and also defining new challenge tasks to go with them. Check out this GitHub repository for a number of datasets. The PROV standard is also making it easier to consume benchmark datasets because you don’t need to write a new parser to get hold of the data. The dataset I most liked was the Provenance Capture Disparities dataset from the Mitre crew (paper). They provide a gold standard provenance dataset capturing everything that goes on in a desktop environment, plus two different provenance traces from different kinds of capture systems. This is great for testing provenance reconstruction, but also for looking at how to merge independent capture sources to achieve a full picture of provenance.

There is also a nice tool to convert Wikipedia edit histories to PROV.

Themes


I think I picked out three large themes from Provenance Week.

  1. Transparent collection
  2. Provenance aggregation, slicing and dicing
  3. Provenance across sources

Transparent Collection

One issue with provenance systems is getting people to install provenance collection systems in the first place, let alone installing new, modified provenance-aware applications. A number of papers reported on techniques aimed at making capture more transparent.

A couple of approaches tackled this at the programming language level. One system focused on R (RDataTracker) and the other on Python (noWorkflow). I particularly enjoyed the noWorkflow system, as it provides not only transparent capture but also a number of utilities for working with the captured provenance, including a diff tool and a conversion from provenance to Prolog rules (I hope Jan reads this). The Prolog conversion includes rules that allow provenance-specific queries to be formulated. (On GitHub.) noWorkflow is similar to Rinke’s PROV-O-Matic tool for tracking provenance in Python (see video below). I hope we can look into sharing work on a really good Python provenance solution.

An interesting discussion point that arose from this work was: how much should we expose provenance to the user? Indeed, the team that did RDataTracker specifically inserted simple on/off statements in their system so the scientific user could control the capture process in their R scripts.

Tracking provenance by instrumenting at the operating system level has long been an approach to provenance capture. Here, we saw a couple of techniques that tried to reduce that tracking to simply launching a background process in user space while improving the fidelity of provenance. This was the approach of our system DataTracker and Cambridge’s OPUS (specific challenges in dealing with interposition on the standard library were discussed). Ashish Gehani was nice enough to work with me to get his SPADE system set up on my Mac. It was pretty much just a checkout, build, and run to start capturing reasonable provenance right away – cool.

Databases have consistently been a central place for provenance research. I was impressed by Boris Glavic’s vision (paper) of a completely transparent way to report provenance for database systems by leveraging two common database functions – time travel and an audit log. Essentially, through the use of query rewriting and query replay he’s able to capture/report provenance for database query results. Talking to Boris, they have a lot of it implemented already in collaboration with Oracle. Based on prior history (PostgreSQL with provenance), I bet it will happen shortly. What’s interesting is that his approach requires no modification of the database and instead sits as middleware above the database.

Finally, in the discussion session after the TaPP practice session, I asked the presenters, who represented the range of these systems, to ballpark what kind of overhead they saw for capturing provenance. The conclusion was that we could get between 1% and 15% overhead. In particular, for deterministic-replay-style systems you can really press down the overhead at capture time.

Provenance aggregation, slicing and dicing

I think Susan Davidson said it best in her presentation on provenance for crowdsourcing – we are at the OLAP stage of provenance. How do we make it easy to combine, recombine, summarize, and work with provenance? What kinds of operators, systems, and algorithms do we need? Two interesting applications came to the fore for this kind of need – crowdsourcing and security. Susan’s talk exemplified this, but at the Provenance Analytics event there were several other examples (Huynh et al., Dragon et al.).

The other area was security. Roly Perera presented his impressive work with James Cheney on cataloging various mechanisms for transforming provenance graphs for the purposes of obfuscating or hiding sensitive parts of the graph. This paper is great reference material for various mechanisms to deal with provenance summarization. One technique for summarization that came up several times, in particular with respect to this domain, was the use of annotation propagation through provenance graphs (e.g. see ProvAbs by Missier et al. and work by Moreau’s team).
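To make the annotation-propagation idea a bit more concrete, here is a minimal toy sketch of my own (not ProvAbs or any of the systems above): push a “sensitive” label forward along derivation edges so that anything derived from a sensitive entity can be grouped, summarized, or hidden together.

```python
from collections import defaultdict, deque

# Toy provenance graph: derived entity -> the entities it was derived from.
derived_from = {
    "report": ["salary_table", "org_chart"],
    "summary": ["report"],
    "press_release": ["summary", "logo"],
}

def propagate(sensitive, derived_from):
    """Return every node that is (transitively) derived from a sensitive one."""
    # Invert the edges so we can walk forward from sources to derivations.
    derivations = defaultdict(list)
    for target, sources in derived_from.items():
        for source in sources:
            derivations[source].append(target)
    tainted, queue = set(sensitive), deque(sensitive)
    while queue:
        node = queue.popleft()
        for nxt in derivations[node]:
            if nxt not in tainted:
                tainted.add(nxt)
                queue.append(nxt)
    return tainted

print(propagate({"salary_table"}, derived_from))
# {'salary_table', 'report', 'summary', 'press_release'}
```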

Provenance across sources

The final theme I saw was how to connect provenance across sources – one could also call this provenance integration. Both Chapman and the MITRE crew, with their Provenance Plus tracking system, and Ashish, with his SPADE system, are running into the problem of provenance coming from multiple different sources and needing to integrate those sources to get a complete picture, both within a system and spanning multiple systems. I don't think we have a solution yet, but both Ashish and Chapman articulated the problem well and have some good initial results.

This is not just a systems engineering problem – it is fundamental that provenance extends across systems. Two of the cool use cases I saw exemplified the need to track provenance across multiple sources.

The Kiel Center for Marine Science (GEOMAR) has developed a provenance system to track data throughout their entire organization, stemming from data collected on their boats all the way through to a data publication. Yes, you read that right: provenance gathered on awesome boats! This involves digital pens, workflow systems and data management systems.

The other was the recently released US National Climate Change Assessment. The findings of that report stem from 13 different institutions within the US Government. The data backing those findings is represented in a structured fashion, including the use of PROV. Curt Tilmes presented more about this amazing use case at Provenance Analytics.

In many ways, the W3C PROV standard was created to help solve these issues. I think it does help, but having a common representation is just the start.
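For example, with the Python prov package, PROV documents produced by two different systems can at least be concatenated and unified on shared identifiers; the hard integration work, of course, is deciding when identifiers refer to the same thing. A minimal sketch, assuming the package's ProvDocument/update/unified API and with made-up namespaces and records:

```python
# Sketch of combining PROV from two systems with the Python `prov` package.
# Assumes ProvDocument.update() / unified() behave as documented; namespaces
# and record names are made up.
from prov.model import ProvDocument

def system_a():
    d = ProvDocument()
    d.add_namespace("ex", "http://example.org/")
    d.entity("ex:raw-data")
    d.activity("ex:sensor-capture")
    d.wasGeneratedBy("ex:raw-data", "ex:sensor-capture")
    return d

def system_b():
    d = ProvDocument()
    d.add_namespace("ex", "http://example.org/")
    d.entity("ex:raw-data")          # same identifier: the shared hand-off point
    d.entity("ex:published-dataset")
    d.wasDerivedFrom("ex:published-dataset", "ex:raw-data")
    return d

combined = system_a()
combined.update(system_b())          # pull in the second system's records
combined = combined.unified()        # merge records that share an identifier
print(combined.get_provn())
```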


Final thoughts

I didn't mention it above, but I was heartened to see that the community has taken to using PROV as a mechanism for interchanging data and for having discussions. My feeling is that if you can talk provenance polynomials and PROV graphs, you can speak with pretty much anybody in the provenance community, no matter which "home" they have – systems, databases, scientific workflows, or the semantic web. Indeed, one of the great things about Provenance Week is that you get to see diverse perspectives on this cross-cutting concern.

Lastly, there seemed to be many good answers at Provenance Week but, more importantly, lots of good questions. Now, I think as a community we should really expose more of the problems we've found to a wider audience.

Random Notes

  • It was great to see the interaction between a number of different services supporting PROV (e.g. git2prov.org, prizims, prov-o-viz, prov store, prov-pings, PLUS)
  • ProvBench on datahub – thanks Tim
  • DLR did a fantastic job of organizing. Great job Carina, Laura and Andreas!
  • I've never had happy birthday sung to me by 60 people at a conference dinner – surprisingly in tune – Kölsch is pretty effective. Thanks everyone!
  • Stefan Woltran’s keynote on argumentation theory was pretty cool. Really stepped up to the plate to give a theory keynote the night after the conference dinner.
  • Speaking of theory, I still need to get my head around Bertram’s work on Provenance Games. It looks like a neat way to think about the semantics of provenance.
  • Check out Daniel’s trip report on provenance week.
  • I think this is long enough…..

I seem to be a regular attendee of the Extended Semantic Web Conference series (2013 trip report). This year ESWC was back in Crete, which means that you can get photos like the one below taken to make your colleagues jealous:

[conference photo: 2014-05-26 18.11.15]

As I write this, the conference is still going on, but I had to leave early to head to Iceland, where I will briefly gate-crash the natural language processing crowd at LREC 2014. Let's begin with the stats of ESWC:

  • 204 submissions
  • 25% acceptance rate
  • ~ 4.5 reviews per submission

The number of submissions was up from last year. I don’t have the numbers on attendance but it seemed in-line with last year as well. So, what was I doing at the conference?

This year ESWC introduced a semantic web evaluation track, and we participated in two of the new challenges. I showed off our linkitup tool for the Semantic Web Publishing Challenge [paper]. The tool lets you enrich research data uploaded to Figshare with links to external sites. Valentina Maccatrozzo presented her contribution to the Linked Open Data Recommender Systems challenge. She's exploring using richer semantics to do recommendation, which, from the comments on her poster, was seen as a novel approach by the attendees. Overall, I think all our work went over well. However, it would be good to see more VU Semweb group content in the main track; the Netherlands only had 14 paper submissions. It was also nice to see PROV mentioned in several places. Finally, conferences are great places to do face-to-face work. I had nice chats with quite a few people, in particular with Tobias Kuhn on the development of the nanopublications spec and with Avi Bernstein on our collaboration leveraging his group's Signal & Collect framework.

So what were the big themes of this year's conference? I pulled out three:

  1. Easing development with Linked Data
  2. Entities everywhere
  3. Methodological maturity

Easing development

As a community, we've built interesting infrastructure for machine-readable data sharing, querying, vocabulary publication and the like. Now that we have all this data, the community is turning towards making it easier to develop applications with it. This is not necessarily a new problem and people have tackled it before (e.g. ActiveRDF), but the availability of data seems to be renewing attention to it. This was reflected in Stefan Staab's keynote on programming the Semantic Web. I think the central issue he identified was how to program against the flexible data models that are the hallmark of semantic web data. Stefan argued strongly for static typing and programmer support but, as an audience member noted, there is a general trend in development circles towards document-style databases with weaker type systems. It will be interesting to see how this plays out.
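My own low-tech take on the typing question is to pull the flexible RDF into explicitly typed objects at the application boundary. A sketch with rdflib and dataclasses (the graph contents and the Person type are invented for illustration):

```python
# Sketch: program against flexible RDF through a small typed boundary layer.
# Uses rdflib (a real library); the data and the Person dataclass are invented.
from dataclasses import dataclass
from typing import Optional
from rdflib import Graph

TTL = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .
ex:alice a foaf:Person ; foaf:name "Alice" ; foaf:mbox <mailto:alice@example.org> .
ex:bob   a foaf:Person ; foaf:name "Bob" .
"""

@dataclass
class Person:
    name: str
    mbox: Optional[str]  # the flexible model makes this optional; the type says so

def load_people(graph: Graph) -> list:
    query = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name ?mbox WHERE {
        ?p a foaf:Person ; foaf:name ?name .
        OPTIONAL { ?p foaf:mbox ?mbox }
    }"""
    return [Person(name=str(row.name),
                   mbox=str(row.mbox) if row.mbox else None)
            for row in graph.query(query)]

g = Graph()
g.parse(data=TTL, format="turtle")
print(load_people(g))
```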

Aside: A thought I had was whether we could easily publish the type systems that developers create when programming back out onto the web and merge them with existing vocabularies….

This notion of easing development was also present in the SALAD workshop (a workshop on APIs). This is dear to my heart: I've seen in my own projects how APIs really help developers make use of semantic data when building applications. There was quite a lot of discussion around the role of SPARQL with respect to APIs, as well as whether to supply data dumps or an API, and what type of API that should be. I think it's fair to say that Web APIs are winning (see the paper RESTful or RESTless – Current State of Today's Top Web APIs), and we need to devise systems that deal with that while still leveraging all our semantic goodness. That being said, it's nice to see mature tooling appearing for Linked Data/Semantic Web data (e.g. the RedLink tools and Marin Dimitrov's talk on selling semweb solutions commercially).
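In practice, "Web APIs winning" often just means hiding a SPARQL query behind a small JSON endpoint. A sketch using Flask and SPARQLWrapper against the public DBpedia endpoint; the endpoint's availability and the exact query shape are, of course, assumptions.

```python
# Sketch: a minimal JSON Web API that hides a SPARQL query from API consumers.
# Uses Flask and SPARQLWrapper (both real libraries); the DBpedia endpoint and
# the query are illustrative assumptions.
from flask import Flask, jsonify
from SPARQLWrapper import SPARQLWrapper, JSON

app = Flask(__name__)
ENDPOINT = "https://dbpedia.org/sparql"

@app.route("/cities/<country>")
def cities(country):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?name WHERE {{
            ?city a dbo:City ;
                  dbo:country <http://dbpedia.org/resource/{country}> ;
                  rdfs:label ?name .
            FILTER (lang(?name) = "en")
        }} LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    names = [b["name"]["value"] for b in results["results"]["bindings"]]
    return jsonify(cities=names)   # consumers never see SPARQL, just JSON

if __name__ == "__main__":
    app.run(debug=True)  # e.g. GET /cities/Netherlands
```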

Entities everywhere

There were a bunch of papers on entity resolution, disambiguation, etc. Indeed, Linked Data provides a really fresh arena for this kind of work, as both the data and the schemas are structured and yet at the same time messy. I had quite a few nice discussions with Pedro Szekely on the topic and am keen to work on getting some of our ideas on linking into the Karma system he is developing with others. From my perspective, two papers caught my eye. One was on using coreference to actually improve SPARQL query performance. We often think of all these equality links as a performance penalty, so it's interesting to ask whether they can actually help us improve performance on different tasks. The other paper was "A Probabilistic Approach for Integrating Heterogeneous Knowledge Sources", which uses Markov Logic Networks to align web information extraction data (e.g. NELL) to DBpedia. This is interesting as it allows us to enrich clean background knowledge with data gathered from the web. It's also neat in that it's another example of the combination of statistical inference and (soft) rules.
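The coreference-for-performance idea, as I read it, boils down to collapsing owl:sameAs cliques to canonical identifiers ahead of query time, so joins and lookups touch one IRI instead of an equivalence class. A toy union-find version of that pre-processing step (the sameAs pairs are invented):

```python
# Toy illustration of using coreference (owl:sameAs) as a pre-processing step:
# collapse each sameAs clique to a canonical IRI so later joins hit a single
# identifier. The pairs below are invented.
same_as = [
    ("dbpedia:Amsterdam", "freebase:m.0k3p"),
    ("freebase:m.0k3p", "geonames:2759794"),
    ("dbpedia:Berlin", "geonames:2950159"),
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in same_as:
    union(a, b)

canonical = {node: find(node) for node in parent}
print(canonical["dbpedia:Amsterdam"] == canonical["geonames:2759794"])  # True
```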

This emphasis on entities is in contrast with the thought-provoking keynote by Oxford philosopher Luciano Floridi, who discussed various notions of complexity and argued that we need to think not in terms of entities but in terms of interactions. This was motivated by the following statistic: by 2020 there will be 7.5 billion people versus 50 billion devices, and all of these things will be interconnected and talking.

Indeed, while entity resolution, especially over messy data, is far from a solved problem, we are starting to see dynamics emerge as a clear area of interest. This is reflected by the best student paper, Hybrid Acquisition of Temporal Scopes for RDF Data.

Methodological maturity

The final theme I wanted to touch on was methodological maturity. The semantic web project is 15 years old (young in scientific terms) and the community has now become focused on having rigorous evaluation criteria. I think every paper I saw at ESWC had a strong evaluation section (or at least a strongly defensible one). This is a good thing! However, this focus pushes people towards safety in their methodology, for instance the plethora of papers that use LUBM, which can in turn lead towards safety in research. We had an excellent discussion about this trend in the EMPIRICAL workshop – check out a brief write-up here. Indeed, it makes one wonder:

  1. whether these simpler methodologies (my system is faster than yours on benchmark x) exacerbate a tendency to do engineering rather than answer scientific questions; and
  2. whether the amalgamation of ideas that characterizes semantic web research gets toned down, leading to less exciting research.

One answer to this trend is to encourage more widespread acceptance and knowledge of different scientific methodologies (e.g. ethnography), which would allow us to explore other areas.

Finally, I would recommend Abraham Bernstein & Natasha Noy – "Is This Really Science? The Semantic Webber's Guide to Evaluating Research Contributions", which I found out about at the EMPIRICAL workshop.

Final Notes

Here are some other pointers that didn’t fit into my themes.


This past week I was at Academic Publishing in Europe 9 (APE 2014) for two days, where I was invited to talk about altmetrics. This was something of an update of the double act that I did last year with Mike Taylor from Elsevier Labs at another publishing conference, UKSG. You can find the slides of my talk from APE below along with a video of my presentation. Overall, the talk was well received.

I think the biggest thing for publishers is to recognize this as something they play a role in, as well as to emphasize that altmetrics broaden the measurement space. It's also interesting that authors want support for telling people about their research – and need help doing so.

Given that it was a publishing conference, it's always interesting to see the themes being talked about. Here are some highlights from my perspective.

The Netherlands going gold

Open Access was, as usual, a discussion point. The Dutch State Secretary of Science, Sander Dekker, was there, giving a full-throated endorsement of gold open access. I thought the discussion by Michael Jubb on monitoring the progress of the UK's Open Access push after the Finch Report was interesting. I think seeing how the UK manages and measures this transition will be critical to understanding the ramifications of open access. However, I have a feeling that they may not be looking enough at the impact on faculty, in particular how money is distributed to cover gold open access pricing.

Big Data – It’s the variety!

There was a session on big data. Technically, I thought I wouldn't get a lot out of this session because, with my computer science hat on, I've heard quite a few technical talks on the subject. However, this session really confirmed to me that we are facing a problem not of data processing or storage but of data variety.

This was confirmed by the fantastic talk by Jason Swedlow on the Open Microscopy project. The project looks at how to manage massive amounts of image data and the interoperability of those images. (You can find one of the images that they published here – 281 gigapixels!) If you're thinking about data integration or interoperability you should check out this project and his talk. I also liked the notion of images as a measurement technique. He noted that their software deals with data size and processing, but the difficulties were around the variety and general dirtiness of all that data.

The issue of data variety was also emphasized by Simon Hodson from CODATA, who gave an overview of a number of e-science projects in which data variety was the central problem.

Data / Other Stuff Citation

Data citation was another theme of the conference. As a community member, it was good to see force11.org mentioned frequently, in particular the work on data citation principles being facilitated by the community. There was also the Resource Identification Initiative, another FORCE11 community group, through which researchers can identify specific resources (e.g. model organisms, software) in their publications in a machine-readable way. This has already been endorsed by a number of journals (~25) and publishers. This ability to "cite" seems to be central to how all these other scientific products are beginning to get woven into the scholarly literature. (See also ideacite.org.)

A good example of this was Hans Pfeiffenberger's talk on the Earth System Science Data journal, which was created specifically for data coming from large-scale earth measurements. An interesting issue that came up was the need for bidirectional citation – that is, publishing the data and the associated commentary at the same time, each including references to the other using permanent identifiers, even when they sit with different publishers.

Digital Preservation

There was also some talk about preservation of content born online. Two things stood out for me here:

  1. Peter Burnhill's talk on thekeepers.org and hiberlink.org, both projects to detect what content is being preserved. I was shocked to hear that only 20% of online serials are stored in long-term archives.
  2. This report seems pretty comprehensive on this front. Note to self – it will be good input for thinking about preserving linked data in the Prelinda project.

Science from the coffee shop

The conference had a session (dotcoms-to-watch) on startups in publishing. What caught my attention was that we are really moving toward the idea that Ian Foster has been talking about, namely, science as a service. With services like scrawl and Science Exchange, we're starting to be able to run even lab-based experiments entirely from a laptop. I think this is going to be huge. I already see this in computer science, where more and more of my colleagues and I turn to the Amazon cloud to boot up our test environments. Pretty soon you'll be able to do your science just by calling an API.

Random Notes

My Slides & Talk
