Question or problem about Python programming:
I can’t seem to get a simple dtype check working with Pandas’ improved Categoricals in v0.15+. Basically I just want something like is_categorical(column) -> True/False.
import pandas as pd
import numpy as np
import random

df = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'y': np.linspace(0, 20, 6),
    'cat_column': random.sample('abcdef', 6)
})
# Convert the string column to a categorical
df['cat_column'] = pd.Categorical(df['cat_column'])
We can see that the dtype for the categorical column is ‘category’:
df.cat_column.dtype
Out[20]: category
And normally we can do a dtype check by just comparing to the name of the dtype:
df.x.dtype == 'float64'
Out[21]: True
But this doesn’t seem to work when trying to check if the x column is categorical:
df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in ()
----> 1 df.x.dtype == 'category'

TypeError: data type "category" not understood
Is there any way to do these types of checks in pandas v0.15+?
How to solve the problem:
Solution 1:
Use the name property to do the comparison instead. It should always work because it’s just a string:
>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> arr.dtype.name
'int64'
>>> import pandas as pd
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.dtype.name
'category'
So, to sum up, you can end up with a simple, straightforward function:
def is_categorical(array_like):
    return array_like.dtype.name == 'category'
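As a quick check of the helper above, here is a minimal sketch that applies it to every column of a small frame. The frame `demo` and its column names are made up for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd


def is_categorical(array_like):
    """Return True if the object's dtype is named 'category'."""
    return array_like.dtype.name == 'category'


# Hypothetical example frame, used only to exercise the helper.
demo = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'label': pd.Categorical(list('abcabc')),
})

# Map each column name to the result of the check.
result = {col: is_categorical(demo[col]) for col in demo.columns}
print(result)  # {'x': False, 'label': True}
```

Because the comparison is between two plain strings, it does not depend on how a given dtype implements equality.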
Solution 2:
First, the string representation of the dtype is 'category' and not 'categorical', so this works:
In [41]: df.cat_column.dtype == 'category'
Out[41]: True
But indeed, as you noticed, this comparison gives a TypeError for other dtypes, so you would have to wrap it in a try/except block.
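A minimal sketch of that try/except wrapper might look like the following. The function name `dtype_is_category` is hypothetical, and whether the comparison actually raises depends on the installed pandas and NumPy versions; the wrapper is written to return False either way.

```python
import pandas as pd


def dtype_is_category(series):
    """Compare the dtype to 'category', treating a TypeError as 'not categorical'."""
    try:
        return bool(series.dtype == 'category')
    except TypeError:
        # Some pandas/NumPy combinations raise here for non-categorical dtypes.
        return False


cat_series = pd.Series(pd.Categorical(['a', 'b', 'a']))
float_series = pd.Series([1.0, 2.0, 3.0])

cat_result = dtype_is_category(cat_series)
float_result = dtype_is_category(float_series)
```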
Other ways to check using pandas internals:
In [42]: isinstance(df.cat_column.dtype, pd.api.types.CategoricalDtype)
Out[42]: True

In [43]: pd.api.types.is_categorical_dtype(df.cat_column)
Out[43]: True
For non-categorical columns, those statements will return False instead of raising an error. For example:
In [44]: pd.api.types.is_categorical_dtype(df.x)
Out[44]: False
For much older versions of pandas, replace pd.api.types in the above snippet with pd.core.common.
Solution 3:
In my pandas version (v1.0.3), a shorter version of joris’ answer is available.
df = pd.DataFrame({'noncat': [1, 2, 3], 'categ': pd.Categorical(['A', 'B', 'C'])})

print(isinstance(df.noncat.dtype, pd.CategoricalDtype))  # False
print(isinstance(df.categ.dtype, pd.CategoricalDtype))   # True

print(pd.CategoricalDtype.is_dtype(df.noncat))  # False
print(pd.CategoricalDtype.is_dtype(df.categ))   # True
Solution 4:
Just putting this here because pandas.DataFrame.select_dtypes() is what I was actually looking for:
df['column'].name in df.select_dtypes(include='category').columns
Thanks to @Jeff.
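To illustrate the select_dtypes() approach, here is a small sketch that collects all categorical column names once and then tests membership. The frame `demo` is an assumption modelled on the question's data, not the asker's actual frame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame resembling the one in the question.
demo = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'cat_column': pd.Categorical(list('abcdef')),
})

# Names of all columns whose dtype is categorical.
categorical_columns = list(demo.select_dtypes(include='category').columns)

# Membership test for individual column names.
is_cat = 'cat_column' in categorical_columns
is_x_cat = 'x' in categorical_columns
```

This is convenient when several columns need checking, since the selection is done once for the whole frame.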
Solution 5:
I ran into this thread looking for the exact same functionality, and found another option in the pandas documentation.
It looks like the canonical way to check if a pandas dataframe column is a categorical Series should be the following:
hasattr(column_to_check, 'cat')
So, as per the example given in the initial question, this would be:
hasattr(df.cat_column, 'cat')  # True
hasattr(df.x, 'cat')           # False, x is a float column
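A short sketch of how the hasattr check behaves across the columns of a frame is shown below. The frame `demo` is a made-up example. Note that this approach relies on the Series `.cat` accessor, so it is meant for Series (DataFrame columns) rather than for dtype objects.

```python
import pandas as pd

# Hypothetical frame with one plain integer column and one categorical column.
demo = pd.DataFrame({
    'noncat': [1, 2, 3],
    'categ': pd.Categorical(['A', 'B', 'C']),
})

# Accessing .cat on a non-categorical Series raises AttributeError,
# which hasattr() turns into False.
has_cat_accessor = {col: hasattr(demo[col], 'cat') for col in demo.columns}
print(has_cat_accessor)  # {'noncat': False, 'categ': True}
```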