What is data?

I’ve posted a couple of times on this blog about events organized at the VU University Amsterdam to encourage interdisciplinary collaboration. One of the major issues to come out of these prior events was that data sharing is a critical mechanism for enabling interdisciplinary research. However, it’s often difficult for scientists to know:

  1. who has what data; and
  2. whether that data is interesting to them.

This second point is important. Because different disciplines use different vocabularies, it is often hard to understand whether a data set is truly useful or interesting in the context of a new domain. What is data for one domain may or may not be data in another domain.

To help bridge this gap, Iina Hellsten (Organizational Science), Leonie Houtman (Business Research) and I (Computer Science) organized a Network Institute workshop this past Wednesday (March 23, 2011) titled What is Data?

The goal of the workshop was to bring people together from these different domains to discuss the data they use in their everyday practice and to describe what makes data useful to them.

Our goal wasn’t to come up with a philosophical answer to the question but instead to build a map of what researchers from these disciplines consider to be useful data. More important still was bringing these researchers together to talk to one another.

I was very impressed with the turnout. Around 25 people showed up from social science, business/management research and computer science. Critically, the attendees were fully engaged and together produced a fantastic result.

The attendees

The Process

To build a map of data, we used a variant of a classic knowledge acquisition technique called card sorting. The attendees were divided into groups (shown above), making sure that each group had a mix of researchers from each discipline. Within each group, every researcher was asked to give examples of the data they worked with on a daily basis and to explain to the others a bit about what they did with that data. This was a chance for people to get to know each other and have discussions in smaller groups. At the end of this step, each group had a pile of index cards with examples of data sets.

Writing down example data sets

The groups were then asked to sort these examples into collections and give each collection a label. This was probably the most difficult part of the process and led to lots of interesting discussions:

Discussion about grouping

Here’s an example result from one of the groups (the green post-it notes are the collection labels):

Sorted cards

The next step was that everyone in the room got to walk around and label the example data sets from all groups with attributes that they thought were important to them. For example, a social networking data set is interesting to me if I can access it programmatically. Each discipline got its own color: pink = computer science, orange = social science, yellow = management science.

This resulted in very colorful tables:

After labelling

Once this process was complete, we merged the various table groupings together by data set and category (i.e. collection label), leading to a map of data sets:

The Results

A Map of Data

Above is the map created by the group. You can find a (more or less faithful) transcription of the map here. Here are some highlights.

There were 10 categories of data:

  1. Elicited data (e.g. surveys)
  2. Data based on measurement (e.g. logfiles)
  3. Data with a particular format (e.g. XML)
  4. Structured-only data (e.g. databases)
  5. Machine data (e.g. results of a simulation)
  6. Textual data (e.g. interview transcripts)
  7. Social data (e.g. email)
  8. Indexed data (e.g. Web of Science)
  9. Data useful for both quantitative and qualitative analysis (e.g. newspapers)
  10. Data about the researchers themselves (e.g. how did they do an analysis)
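
For readers who think in code, the merging step described above (combining every group's labeled cards by category and data set) can be sketched in a few lines of Python. This is an illustration only, not a description of how the transcription was actually done: the group names, the card tuples, and the `merge_tables` function are all hypothetical, and the handful of cards below are made-up examples built from categories and attributes mentioned in this post.

```python
from collections import defaultdict

# Hypothetical cards, one tuple per post-it:
# (category label, example data set, discipline, attribute)
cards_by_group = {
    "group_a": [
        ("Social data", "email", "computer science", "programmatic access"),
        ("Elicited data", "surveys", "social science", "representative"),
    ],
    "group_b": [
        ("Social data", "email", "social science", "credible"),
        ("Data based on measurement", "logfiles", "computer science", "strong structure"),
    ],
}


def merge_tables(cards_by_group):
    """Merge all groups' cards into category -> data set -> {(discipline, attribute)}."""
    data_map = defaultdict(lambda: defaultdict(set))
    for cards in cards_by_group.values():
        for category, data_set, discipline, attribute in cards:
            # Cards for the same data set from different groups end up together.
            data_map[category][data_set].add((discipline, attribute))
    # Convert to plain dicts so missing keys raise instead of silently appearing.
    return {category: dict(sets) for category, sets in data_map.items()}


data_map = merge_tables(cards_by_group)

for category, data_sets in sorted(data_map.items()):
    print(category)
    for data_set, labels in sorted(data_sets.items()):
        for discipline, attribute in sorted(labels):
            print(f"  {data_set}: {attribute} ({discipline})")
```

Keeping the discipline next to each attribute plays the same role as the post-it colors: it lets you see at a glance which community cares about which property of a data set.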

After transcribing the data, I would say that computer scientists are interested in having strong structure in the data, whereas social scientists and business scientists are deeply concerned with having high-quality data that is representative, credible, and collected with care. Across all disciplines, temporality (or having things on a timeline) seemed to be a critical attribute of useful data.

What’s next?

At the end of the workshop, we discussed where to go from here. The plan is to have a follow-up workshop where each discipline can present its own data sets using these categorizations. To help focus the workshop, we are looking for two interdisciplinary teams within the VU that are willing to try data sharing and present the results of that trial at the workshop. If you have a data set you would like to share, please post it to the Network Institute LinkedIn group. Once you have a team, let Leonie, Iina, or me know.




