puzzling.org · mary.gardiner.id.au · Macquarie University
Mary Gardiner · Weblog

Wed, 17 Sep 2008

Arguments in favour of published code in computer science/computational linguistics

The September 2008 issue of Computational Linguistics (volume 34:3) has a Last Words column by Ted Pedersen, "Empiricism is Not a Matter of Faith". Pedersen has already made a PDF version available. In the column, Pedersen argues that computational linguistics is suffering because there is no tradition of publishing source code with research results.

In an email discussion about this article today, I was arguing in favour of the same idea (which you can take with a grain of salt as I have not published my source code to date). The opposing argument, by the way, was that everything important for a reimplementation should be described in written form in the paper, and if it isn't, then the paper is rubbish.

My arguments:

There are various cases where you want to reproduce something in computing research:

  1. you want to do a very similar task to published method X, but you have an approach Y that you think will work better. This is probably the case where source code is least useful, because you will be spending a lot of time implementing Y with or without the source code to X. However, it is sometimes useful:
    1. you want to contrast your method Y with published method X, but on a slightly different dataset, for example because your task differs slightly and you want to see whether X still works on it
    2. your method Y includes some of method X... but the results you are getting aren't nearly as good as claimed in the paper about X and therefore you can't test the things you had hoped to about Y because you can't get a comparable baseline/input from X
    3. your method Y is heaps better than X... and you'd like to prove it definitively by reproducing X's claimed numbers and running Y on the same data... but you can't.
    In 1.1, reimplementation is possible (in fact, I've spent most of my time on this) but having the source code would be a lot faster, and more results for the same effort is good for science. In 1.2 and 1.3, reimplementation is also slow, and it is VERY common to find that you cannot reproduce the numbers.
  2. you want to use result X as a stepping stone to result Y, e.g. you want to use the best possible stemmer to clean up things for your fancy new IR engine. In this case, access to the source code of X is very useful because it may save you weeks or months of debugging your own version of X when you wanted to be working on Y all along. (A working compiled version will do in most cases, I guess, but then you must reimplement if you want to adapt any aspect.) You also run the risk described in 1 (and by Pedersen) that your implementation of X does not work nearly as well as claimed, because X is more fragile than the authors realised, and that you have spent a lot of time on X when X isn't actually your research problem.

Even if you don't need much of the code for X because Y is sufficiently different, for the case of reproducing results think of the code as backup. Sometimes, if you can't reproduce a result but you're lucky and not much time has passed, you can mail the authors and they will send you the source code or help you out.

However, if time has passed and/or the authors have sloppy data preservation, have quit research, wiped their hard drives, got into drugs, died, or were frauds in the first place and you realise you can't reproduce from the written description (even though arguably this makes it a bad paper), at the moment in CL you are out of luck.

Minimising being out of luck is a good thing for science.



posted at: 17:26 | path: /research-practice
