Question or problem about Python programming:
In numpy / scipy, is there an efficient way to get frequency counts for unique values in an array?
Something along these lines:
x = array( [1,1,1,2,2,2,5,25,1,1] ) y = freq_count( x ) print y >> [[1, 5], [2,3], [5,1], [25,1]]
( For you, R users out there, I’m basically looking for the table() function )
How to solve the problem:
Solution 1:
Take a look at np.bincount
:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html
import numpy as np x = np.array([1,1,1,2,2,2,5,25,1,1]) y = np.bincount(x) ii = np.nonzero(y)[0]
And then:
zip(ii,y[ii]) # [(1, 5), (2, 3), (5, 1), (25, 1)]
or:
np.vstack((ii,y[ii])).T # array([[ 1, 5], [ 2, 3], [ 5, 1], [25, 1]])
or however you want to combine the counts and the unique values.
Solution 2:
As of Numpy 1.9, the easiest and fastest method is to simply use numpy.unique
, which now has a return_counts
keyword argument:
import numpy as np x = np.array([1,1,1,2,2,2,5,25,1,1]) unique, counts = np.unique(x, return_counts=True) print np.asarray((unique, counts)).T
Which gives:
[[ 1 5] [ 2 3] [ 5 1] [25 1]]
A quick comparison with scipy.stats.itemfreq
:
In [4]: x = np.random.random_integers(0,100,1e6) In [5]: %timeit unique, counts = np.unique(x, return_counts=True) 10 loops, best of 3: 31.5 ms per loop In [6]: %timeit scipy.stats.itemfreq(x) 10 loops, best of 3: 170 ms per loop
Solution 3:
Update: The method mentioned in the original answer is deprecated, we should use the new way instead:
>>> import numpy as np >>> x = [1,1,1,2,2,2,5,25,1,1] >>> np.array(np.unique(x, return_counts=True)).T array([[ 1, 5], [ 2, 3], [ 5, 1], [25, 1]])
Original answer:
you can use scipy.stats.itemfreq
>>> from scipy.stats import itemfreq >>> x = [1,1,1,2,2,2,5,25,1,1] >>> itemfreq(x) /usr/local/bin/python:1: DeprecationWarning: `itemfreq` is deprecated! `itemfreq` is deprecated and will be removed in a future version. Use instead `np.unique(..., return_counts=True)` array([[ 1., 5.], [ 2., 3.], [ 5., 1.], [ 25., 1.]])
Solution 4:
I was also interested in this, so I did a little performance comparison (using perfplot, a pet project of mine). Result:
y = np.bincount(a) ii = np.nonzero(y)[0] out = np.vstack((ii, y[ii])).T
is by far the fastest. (Note the log-scaling.)
Code to generate the plot:
import numpy as np import pandas as pd import perfplot from scipy.stats import itemfreq def bincount(a): y = np.bincount(a) ii = np.nonzero(y)[0] return np.vstack((ii, y[ii])).T def unique(a): unique, counts = np.unique(a, return_counts=True) return np.asarray((unique, counts)).T def unique_count(a): unique, inverse = np.unique(a, return_inverse=True) count = np.zeros(len(unique), np.int) np.add.at(count, inverse, 1) return np.vstack((unique, count)).T def pandas_value_counts(a): out = pd.value_counts(pd.Series(a)) out.sort_index(inplace=True) out = np.stack([out.keys().values, out.values]).T return out perfplot.show( setup=lambda n: np.random.randint(0, 1000, n), kernels=[bincount, unique, itemfreq, unique_count, pandas_value_counts], n_range=[2 ** k for k in range(26)], logx=True, logy=True, xlabel="len(a)", )
Solution 5:
Using pandas module:
>>> import pandas as pd >>> import numpy as np >>> x = np.array([1,1,1,2,2,2,5,25,1,1]) >>> pd.value_counts(x) 1 5 2 3 25 1 5 1 dtype: int64