In an experiment I’m running, I need to compute correlations between very
large matrices. I was surprised to see that a single correlation
calculation could take many hours, so I decided it was worth
benchmarking my two main statistics options, R and Numpy, to see which
would do it faster.

Here’s the code I used to do the Python:
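(The listing below is a reconstruction, not the original script: the use of `scipy.stats.kendalltau`, the random test data, and the hard-coded size are all my assumptions.)

```python
# Reconstruction -- scipy.stats.kendalltau, random data, and the
# hard-coded size are assumptions, not the original script.
import numpy as np
from scipy.stats import kendalltau

n = 10  # in the real run, the enclosing shell script passed this in

rng = np.random.default_rng(0)
a = rng.random((n, n))
b = rng.random((n, n))

# kendalltau wants 1-d inputs, so the 2-d arrays are flattened first
tau, p = kendalltau(a.ravel(), b.ravel())
print(tau)
```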

Here’s the code I used to do the R (it’s not exactly the same, because
R doesn’t give the option to run this correlation on a matrix directly,
so I’m just using a linear array):
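(Again a reconstruction rather than the original: the `runif` test data and `cor(..., method = "kendall")` call are my assumptions.)

```r
# Reconstruction -- runif test data and cor(..., method = "kendall")
# are assumptions, not the original script.
n <- 10  # in the real run, the enclosing shell script passed this in
x <- runif(n * n)
y <- runif(n * n)
# cor() on two vectors gives a single coefficient
print(cor(x, y, method = "kendall"))
```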

And here’s the enclosing shell script, which runs each of the above for
different sizes (each size runs only once, both because the long runs
take so long and because very long runs suffer less from common
sources of timing error):
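(A sketch of what that driver could look like: the file names `corr.R` and `corr.py` are placeholders, not the original names, and the size list matches the results table below.)

```shell
#!/bin/sh
# Reconstruction of the driver script; corr.R and corr.py are
# placeholder names for the two benchmark scripts above.
for n in 10 50 250 1250; do
    echo "size ${n}*${n}"
    time Rscript corr.R "$n"
    time python corr.py "$n"
done
```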

Results

| size      | R      | Numpy  |
|-----------|--------|--------|
| 10*10     | 0.004s | 0.076s |
| 50*50     | 0.138s | 2.023s |
| 250*250   | 77s    | 98s    |
| 1250*1250 | 50239s | 40414s |

Unscientific conclusion: R wins easily at small sizes (a possible
explanation is that it doesn’t need to pre-process the 2-d array into a
1-d array), but at the large size I need, Numpy is the way to go.

Update, many months later: the reason it’s abominably slow in both is
that I’m using Kendall’s tau, a rank-based measure of correlation whose
straightforward implementation takes O(n^2) time in the number of
observations! The real conclusion is therefore: consider Spearman’s rho
instead, which is also rank-based but can be computed in O(n log n)
time by sorting. However, there is a can of worms here: the two
measures don’t give the same results, and it’s not obvious how to judge
which is “better”.
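To make the disagreement concrete, here’s a small sketch (random illustrative data, not the experiment’s): both coefficients are rank-based, yet they land on visibly different values for the same inputs.

```python
# Illustration only: Spearman's rho and Kendall's tau computed on the
# same data come out noticeably different.
import numpy as np
from scipy.stats import kendalltau, spearmanr

rng = np.random.default_rng(0)
x = rng.random(1000)
y = x + 0.5 * rng.random(1000)  # positively correlated, with noise

tau, _ = kendalltau(x, y)
rho, _ = spearmanr(x, y)
print(f"tau={tau:.3f} rho={rho:.3f}")
```

So a threshold tuned against one measure can’t simply be reused with the other.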