Monthly Archives: January 2011

In preparation for Science Online 2011, Mark Hahnel from Science 3.0 asked me whether I could do some analysis of the blogs that they’ve been aggregating since October (25 thousand posts from 1506 authors). Mark, along with Dave Munger, will be talking more about the role and importance of aggregators in a session on Saturday morning at 9am (Developing an aggregator for all science blogs). These analyses provide a high-level overview of the content of science blogs. Here are the results.

The first analysis tried to find the topics of blogs and their relationships. We used title words as a proxy for topics and co-occurrence of those words as representative of the relationships between those topics. Here’s the map (click the image to see a larger size):

The words cluster together according to their co-occurrence. The hotter the color, the more often those words occur. You’ll notice, for example, that Science and Blog are close to one another. Darwin and days, as well as fumbling and tenure, are close too. The visualization was done with the VOSviewer software.
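To make the idea concrete, here is a minimal sketch of how word and word-pair counts could be derived from post titles before handing them to a mapping tool. It is not the code used for the map above: the stop-word list, the function names, and the sample titles are all made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "is"}

def title_words(title):
    """Lower-case a title and return its distinct non-stop words, sorted."""
    words = re.findall(r"[a-z]+", title.lower())
    return sorted(set(words) - STOP_WORDS)

def cooccurrence(titles):
    """Count how many titles each word and each unordered word pair appears in."""
    word_counts = Counter()
    pair_counts = Counter()
    for title in titles:
        words = title_words(title)
        word_counts.update(words)
        # Sorting the words first means each pair has one canonical order.
        pair_counts.update(combinations(words, 2))
    return word_counts, pair_counts

# Made-up titles, purely for illustration.
titles = [
    "Darwin Days at the museum",
    "Fumbling towards tenure",
    "Science blog roundup",
    "A science blog about Darwin",
]
word_counts, pair_counts = cooccurrence(titles)
print(word_counts.most_common(3))
print(pair_counts[("blog", "science")])
```

The word counts correspond to the "heat" of a term and the pair counts to how strongly two terms pull together on the map.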

I also looked at how blogs cite research papers. We looked for DOIs as well as Research Blogging-style citations in all the blog posts. We found 964 posts with these sorts of citations. I thought there would be more, but maybe this is down to how I implemented it.
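As a rough illustration of the DOI part of this count, the sketch below scans post bodies with a regular expression. It is an assumption about how such a scan could work, not the implementation behind the 964 figure. The pattern is a commonly used approximation for modern DOIs and will miss some older ones, Research Blogging-style citations are not handled, and the sample posts and DOIs are invented.

```python
import re

# Approximate pattern for modern DOIs; an assumption, not an exhaustive matcher.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)

def find_dois(text):
    """Return the distinct DOIs found in a post body.

    Trailing punctuation is stripped, which can clip the rare DOI
    that genuinely ends in one of those characters.
    """
    return sorted({match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)})

def count_citing_posts(posts):
    """Count the posts that contain at least one DOI."""
    return sum(1 for body in posts if find_dois(body))

# Invented post bodies with invented DOIs.
posts = [
    "See doi:10.1234/example.5678 for details.",
    "No citations here, just a link to a news story.",
    'Paper: <a href="http://dx.doi.org/10.5555/abc-123">link</a>.',
]
print(count_citing_posts(posts))
```

A pattern this strict is one plausible reason a count could come out lower than expected, since citations written without a DOI are invisible to it.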

Finally, I looked at what URLs were most commonly used in all the blog posts. Here are the top 20:

URL Occurrences 4476 3920 1002 930 789 648 533 485 482 376 350 336 295 271 269 266 265 232 232 195

I was quite happy with this list because they are pretty much all science links. I thought there would be a lot more links to non-science places.
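A tally like the one above could be produced along these lines. This is only a sketch: grouping links by host name is my assumption here (counting full URLs would work the same way), and the function name and sample posts are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simple pattern for http(s) links in plain text or HTML.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def top_hosts(posts, n=20):
    """Return the n most frequently linked host names across all posts."""
    counts = Counter()
    for body in posts:
        for url in URL_PATTERN.findall(body):
            host = urlparse(url).netloc.lower()
            if host:
                counts[host] += 1
    return counts.most_common(n)

# Invented post bodies, for illustration only.
posts = [
    "Read http://example.org/a and http://example.org/b",
    "Also http://example.com/x",
    "http://EXAMPLE.org/c again",
]
for host, occurrences in top_hosts(posts, n=2):
    print(host, occurrences)
```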

I hope the results can provide a useful discussion piece. Obviously, this is just the start and we can do a lot more interesting analyses. In particular, I think such statistics can be the basis for alt-metrics style measures. If you’re interested in talking to me about these analyses, come find me at Science Online.

The university where I work asks us to register all our publications for the year in a central database [1].  Doing this obviously made me think of doing an ego search on my academic papers. Plus, it’s the beginning of the year, which always seems like a good time to look at these things.

The handy tool Publish-or-Perish calculates all sorts of citation metrics based on a search of Google Scholar. The tool lets you pick the set of publications to consider. (For example, I left out all the publications from another Paul Groth who’s a professor of architecture at Berkeley.) I did a cursory run-through to remove publications that weren’t mine, but I didn’t spend much time on it, so all the standard disclaimers apply: there may be duplicates, it includes technical reports, etc. For transparency, you can find the set of publications considered in the Excel file here. Also, it’s worth noting that the Google Scholar corpus has its own problems; in particular, it makes you look better. With all that in mind, let’s get to the fun stuff.

My stats as of Jan. 4, 2011 are:

  • Papers:93,
  • Citations:1318,
  • Years:12,
  • Cites/year:109.83,
  • Cites/paper:14.17/4.0/0,
  • Cites/author:416.35,
  • Papers/author:43.27,
  • Authors/paper:3.04/3.0/2,
  • h-index:21,
  • g-index:34,
  • hc-index:16,
  • hI-index:5.58,
  • hI-norm:11,
  • AWCR:224.17,
  • AW-index:14.97,
  • AWCRpA:70.96,
  • e-index:24.98,
  • hm-index:9.07,

You can find the definitions for these metrics here.
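For a feel of what a few of these numbers measure, here is a small sketch using the standard textbook definitions of the h-index and g-index. It is not Publish-or-Perish’s own code, and the per-paper citation counts are invented. The two ratios at the end use the paper, citation, and year totals listed above.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g

# Invented citation counts for five papers.
sample = [10, 8, 5, 4, 3]
print(h_index(sample), g_index(sample))

# Simple ratios from the totals in the list above.
papers, citations, years = 93, 1318, 12
print(round(citations / years, 2))   # cites per year
print(round(citations / papers, 2))  # mean cites per paper
```

This version of the g-index is capped at the number of papers; some variants allow it to exceed that by padding with zero-citation papers.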

What does it all mean? I don’t know 🙂 I think it’s not half bad.

For comparison, here’s a list of the h-indexes for top computer scientists, computed using Google Scholar. All have an h-index of 40 or greater. A quick scan through that list shows that there’s a pretty strong correlation between being a top computer scientist and a high h-index. Thus, I conclude that I should continue concentrating on being a good computer scientist and the statistics will follow.

[1] I don’t know why my university doesn’t support importing publication information from BibTeX or RIS. Everything has to be added by hand, which takes a while.

    One of the nice things about using cloud services is that sometimes you get a feature that you didn’t expect. Below is a nice set of stats about how well Think Links did in 2010. I was actually quite happy with 12 posts, which works out to one post a month. I will be trying to increase the rate of posts this year. If you’ve been reading this blog, thanks, and have a great 2011! The stats are below:

    Here’s a high-level summary of this blog’s overall health:

    Healthy blog!

    The Blog-Health-o-Meter™ reads Fresher than ever.

    Crunchy numbers

    A Boeing 747-400 passenger jet can hold 416 passengers. This blog was viewed about 4,500 times in 2010. That’s about 11 full 747s.


    In 2010, there were 12 new posts, growing the total archive of this blog to 46 posts. There were 12 pictures uploaded, taking up a total of 5 MB. That’s about a picture per month.

    The busiest day of the year was October 13th with 176 views. The most popular post that day was Data DJ realized….well at least version 0.1.

    Where did they come from?

    The top referring sites in 2010 were,,,, and

    Some visitors came searching, mostly for provenance open gov, think links, ready made food, 4store, and thinklinks.

    Attractions in 2010

    These are the posts and pages that got the most views in 2010.


    Data DJ realized….well at least version 0.1 October 2010


    4store Amazon Machine Image and Billion Triple Challenge Data Set October 2009


    Linking Slideshare Data June 2010


    A First EU Proposal April 2010


    Two Themes from WWW 2010 May 2010
