Question or problem about Python programming:
I have a dataframe, df, in which some columns are of type float64 and the others are of type object. Because of the mixed types, I cannot use
df.fillna('unknown') #getting error "ValueError: could not convert string to float:"
because the error is raised for the columns of type float64 (what a misleading error message!),
so I wish I could do something like
for col in df.columns[]:
    df[col] = df[col].fillna("unknown")
So my question is whether there is a filter expression that I can use with df.columns.
I guess alternatively, less elegantly, I could do:
for col in df.columns:
    if (df[col].dtype == dtype('O')):  # for object type
        df[col] = df[col].fillna('')
        # still puzzled: only the empty string works as a replacement; 'unknown' fails for
        # certain values with "ValueError: Error parsing datetime string "unknown" at position 0"
I would also like to know why, in the code above, replacing '' with 'unknown' works for some cells but fails for one cell with the error "ValueError: Error parsing datetime string "unknown" at position 0".
Thanks a lot!
Yu
How to solve the problem:
Solution 1:
You can see what the dtype is for all the columns using the dtypes attribute:
In [11]: df = pd.DataFrame([[1, 'a', 2.]])

In [12]: df
Out[12]:
   0  1  2
0  1  a  2

In [13]: df.dtypes
Out[13]:
0      int64
1     object
2    float64
dtype: object

In [14]: df.dtypes == object
Out[14]:
0    False
1     True
2    False
dtype: bool
To access the object columns:
In [15]: df.loc[:, df.dtypes == object]
Out[15]:
   1
0  a
I think it’s most explicit to use (I’m not sure that inplace would work here):
In [16]: df.loc[:, df.dtypes == object] = df.loc[:, df.dtypes == object].fillna('')
That said, I recommend you use NaN for missing data.
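As a self-contained sketch of the same idea, the boolean mask can also be applied to df.columns, which is close to the filter expression the question asks for. The column names and values below are made up for illustration, and the text column is created with dtype=object explicitly because newer pandas versions may infer a dedicated string dtype instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df = pd.DataFrame({
    "name": pd.Series(["alice", None, "carol"], dtype=object),
    "score": pd.Series([1.5, np.nan, 3.0], dtype="float64"),
})

# Boolean mask over the columns: True where the dtype is object.
is_object = (df.dtypes == object).to_numpy()

# Use the mask as a filter on df.columns.
object_cols = df.columns[is_object]

# Fill missing values in the object columns only; float columns keep NaN.
df[object_cols] = df[object_cols].fillna("unknown")

print(df)
```

Only the object column is filled here, so the float column never sees the string replacement.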
Solution 2:
This is more concise:
# select the float columns
df_num = df.select_dtypes(include=['float64'])
# select the non-numeric columns
df_non_num = df.select_dtypes(exclude=[np.number])
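To make the selection concrete, here is a small runnable sketch on made-up data (the column names are hypothetical). It also shows one way to write the filled values back, since select_dtypes returns a new DataFrame and filling that result alone would leave df unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one object, one int and one float column.
df = pd.DataFrame({
    "name": pd.Series(["alice", None], dtype=object),
    "count": pd.Series([1, 2], dtype="int64"),
    "score": pd.Series([np.nan, 2.5], dtype="float64"),
})

# Only the float64 column is selected here.
float_part = df.select_dtypes(include=["float64"])

# Everything that is not numeric (here: the object column).
non_numeric_part = df.select_dtypes(exclude=["number"])

# Assign the filled selection back to the same columns of df.
df[non_numeric_part.columns] = non_numeric_part.fillna("unknown")

print(df)
```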
Solution 3:
As @RNA said, you can use pandas.DataFrame.select_dtypes. Applied to the example from the question, the code would look like this:
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].fillna('unknown')
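A possible variant without the explicit loop, offered as a sketch rather than as part of the answer above: fillna also accepts a dict that maps column names to fill values, so the object columns can be filled in a single call. The sample data and the names fill_values and filled are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the text column is forced to object dtype.
df = pd.DataFrame({
    "city": pd.Series(["Oslo", None, "Lima"], dtype=object),
    "temp": pd.Series([np.nan, 21.0, 18.5], dtype="float64"),
})

# Names of the object columns.
object_cols = df.select_dtypes(include=["object"]).columns

# Map each object column to its fill value; other columns are left out.
fill_values = {col: "unknown" for col in object_cols}

# fillna with a dict fills only the listed columns and returns a new frame.
filled = df.fillna(fill_values)

print(filled)
```

Because fillna returns a new DataFrame here, df itself keeps its missing values unless the result is assigned back.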