Question or problem about Python programming:
What is the quickest/simplest way to drop nan and inf/-inf values from a pandas DataFrame without resetting mode.use_inf_as_null? I’d like to be able to use the subset and how arguments of dropna, but with inf values treated as missing, like:
df.dropna(subset=["col1", "col2"], how="all", with_inf=True)
Is this possible? Is there a way to tell dropna to include inf in its definition of missing values?
How to solve the problem:
Solution 1:
The simplest way would be to first replace infs with NaN:
df.replace([np.inf, -np.inf], np.nan)
and then use dropna:
df.replace([np.inf, -np.inf], np.nan).dropna(subset=["col1", "col2"], how="all")
For example:
In [11]: df = pd.DataFrame([1, 2, np.inf, -np.inf])

In [12]: df.replace([np.inf, -np.inf], np.nan)
Out[12]:
     0
0    1
1    2
2  NaN
3  NaN
The same method would work for a Series.
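To connect this back to the call the question asked for, here is a minimal sketch that wraps the replace-then-dropna chain in a small helper. The name dropna_with_inf and the sample data are hypothetical, chosen only for illustration; pandas itself has no with_inf argument.

```python
import numpy as np
import pandas as pd


def dropna_with_inf(df, subset=None, how="any"):
    """Hypothetical helper: behave like dropna, but treat +/-inf as missing."""
    # replace returns a new DataFrame, so the caller's df is left unchanged
    cleaned = df.replace([np.inf, -np.inf], np.nan)
    return cleaned.dropna(subset=subset, how=how)


# Made-up sample data for illustration
df = pd.DataFrame(
    {
        "col1": [1.0, np.inf, np.nan, 4.0],
        "col2": [np.nan, -np.inf, np.inf, 5.0],
        "col3": [10.0, 20.0, 30.0, 40.0],
    }
)

# Rows 1 and 2 have only inf/NaN in col1 and col2, so how="all" drops them
result = dropna_with_inf(df, subset=["col1", "col2"], how="all")
print(result)
```

Note that any inf values in the rows that survive are returned as NaN. If the original values should be kept, one option is to select from the original frame with df.loc[result.index].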
Solution 2:
With option context, this is possible without permanently setting use_inf_as_na. For example:
with pd.option_context('mode.use_inf_as_na', True):
    df = df.dropna(subset=['col1', 'col2'], how='all')
Of course, pandas can also be set to treat inf as NaN permanently with
pd.set_option('use_inf_as_na', True)
For older versions, replace use_inf_as_na with use_inf_as_null. Note that the use_inf_as_na option itself is deprecated as of pandas 2.1.
Solution 3:
Here is another method using .loc to replace inf with nan on a Series:
s.loc[(~np.isfinite(s)) & s.notnull()] = np.nan
So, in response to the original question:
df = pd.DataFrame(np.ones((3, 3)), columns=list('ABC'))
for i in range(3):
    df.iat[i, i] = np.inf

df
          A         B         C
0       inf  1.000000  1.000000
1  1.000000       inf  1.000000
2  1.000000  1.000000       inf

df.sum()
A    inf
B    inf
C    inf
dtype: float64

df.apply(lambda s: s[np.isfinite(s)].dropna()).sum()
A    2
B    2
C    2
dtype: float64
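The .loc assignment above can be tried on a small Series as follows. This is only a sketch with made-up values; it works on a copy so the original Series can be compared afterwards.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration
s = pd.Series([1.0, np.inf, np.nan, -np.inf, 5.0])

# True only where the value is +inf or -inf; existing NaN entries are excluded
is_inf = (~np.isfinite(s)) & s.notnull()

s_clean = s.copy()            # keep the original Series untouched
s_clean.loc[is_inf] = np.nan  # overwrite the infinities with NaN

# dropna now removes both the original NaN and the former infinities
print(s_clean.dropna())
```

The s.notnull() term does not change the result here, since assigning NaN over an existing NaN is a no-op, but it keeps the mask limited to the infinite entries.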
Solution 4:
Use (fast and simple):
df = df[np.isfinite(df).all(1)]
This answer is based on DougR’s answer to another question.
Here is some example code:
import pandas as pd
import numpy as np

df = pd.DataFrame([1, 2, 3, np.nan, 4, np.inf, 5, -np.inf, 6])
print('Input:\n', df, sep='')

df = df[np.isfinite(df).all(1)]
print('\nDropped:\n', df, sep='')
Result:
Input:
        0
0  1.0000
1  2.0000
2  3.0000
3     NaN
4  4.0000
5     inf
6  5.0000
7    -inf
8  6.0000

Dropped:
     0
0  1.0
1  2.0
2  3.0
4  4.0
6  5.0
8  6.0
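One caveat with this approach: np.isfinite is defined for numeric data, so a frame that also holds string columns would typically make the call fail. A possible workaround, sketched below with made-up data, is to build the mask from the numeric columns only and then apply it to the full frame.

```python
import numpy as np
import pandas as pd

# Made-up sample data with one non-numeric column
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "x": [1.0, np.nan, 3.0, 4.0],
        "y": [1.0, 2.0, np.inf, 4.0],
    }
)

# Build the mask from numeric columns only, so isfinite never sees strings
numeric = df.select_dtypes(include="number")
finite_rows = np.isfinite(numeric).all(axis=1)

# Apply the row mask to the full frame; the "name" column is preserved
filtered = df[finite_rows]
print(filtered)
```

As in the original one-liner, this drops a row if any numeric value in it is NaN or infinite, which corresponds to how="any" rather than how="all".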
Solution 5:
Yet another solution would be to use the isin method. Use it to determine whether each value is infinite or missing, and then chain the all method to determine whether all the values in each row are infinite or missing.
Finally, negate that result and use it for boolean indexing, which keeps only the rows that have at least one finite, non-missing value.
all_inf_or_nan = df.isin([np.inf, -np.inf, np.nan]).all(axis='columns')
df[~all_inf_or_nan]
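The same idea can be restricted to a few columns to mimic the subset and how arguments from the question. The sketch below is a variant, not the exact code above: it uses isin for the infinities and isna for the missing values, and the column names and data are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame(
    {
        "col1": [1.0, np.inf, np.nan, 4.0],
        "col2": [np.nan, -np.inf, np.inf, 5.0],
        "col3": [10.0, 20.0, 30.0, 40.0],
    }
)

cols = ["col1", "col2"]  # plays the role of dropna's subset argument

# Element-wise mask: True where a value is +/-inf or missing
inf_or_nan = df[cols].isin([np.inf, -np.inf]) | df[cols].isna()

# Equivalent of how="all": drop rows where every subset value is inf/NaN
kept_all = df[~inf_or_nan.all(axis="columns")]

# Equivalent of how="any": drop rows where at least one subset value is inf/NaN
kept_any = df[~inf_or_nan.any(axis="columns")]

print(kept_all)
print(kept_any)
```

Because this selects from the original frame instead of replacing values, any inf entries in the surviving rows are left as they were.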