Question or problem about Python programming:
np.where has the semantics of a vectorized if/else (similar to Apache Spark’s when/otherwise DataFrame method). I know that I can use np.where on pandas Series, but pandas often defines its own API to use instead of raw numpy functions, which is usually more convenient with pd.Series/pd.DataFrame.
Sure enough, I found pandas.DataFrame.where. However, at first glance, it has completely different semantics. I could not find a way to rewrite the most basic example of np.where using pandas where:
# df is pd.DataFrame
# how to write this using df.where?
df['C'] = np.where((df['A']<0) | (df['B']>0), df['A']+df['B'], df['A']/df['B'])
Am I missing something obvious? Or is pandas where intended for a completely different use case, despite having the same name as np.where?
How to solve the problem:
Try:
(df['A'] + df['B']).where((df['A'] < 0) | (df['B'] > 0), df['A'] / df['B'])
The difference between the numpy where and the DataFrame where is that the default values (those kept where the condition is True) are supplied by the DataFrame that the where method is being called on (docs).
I.e. np.where(m, A, B) is roughly equivalent to A.where(m, B).
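As a quick check of that equivalence, here is a minimal sketch on a small made-up DataFrame. The column values are illustrative assumptions, not data from the question, and were chosen so that no division by zero occurs.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the question).
df = pd.DataFrame({"A": [-1.0, 2.0, 3.0, -4.0], "B": [2.0, -4.0, 6.0, -8.0]})

mask = (df["A"] < 0) | (df["B"] > 0)

# numpy: condition first, then value-if-true, then value-if-false
via_numpy = np.where(mask, df["A"] + df["B"], df["A"] / df["B"])

# pandas: value-if-true is the caller, then condition, then value-if-false
via_pandas = (df["A"] + df["B"]).where(mask, df["A"] / df["B"])

# Compare the raw values of both results.
same = np.array_equal(via_numpy, via_pandas.to_numpy())
print(same)
```

One practical difference: np.where returns a plain ndarray, while the where method returns a Series that keeps the original index.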
If you wanted a similar call signature using pandas, you could take advantage of the way method calls work in Python:
pd.Series.where(cond=(df['A'] < 0) | (df['B'] > 0), self=df['A'] + df['B'], other=df['A'] / df['B'])
or without kwargs (note that the positional order of the arguments is different from the numpy where argument order):
pd.Series.where(df['A'] + df['B'], (df['A'] < 0) | (df['B'] > 0), df['A'] / df['B'])
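If the numpy argument order is what you are after, the unbound-method call can be wrapped in a small helper. This is only a sketch: np_style_where is a hypothetical name, not part of pandas, and the sample data is made up. Since the operands here are Series, the helper goes through pd.Series.where, which should behave the same way regardless of pandas version.

```python
import pandas as pd

def np_style_where(cond, if_true, if_false):
    """Hypothetical helper: np.where argument order, pandas Series result."""
    # Unbound method call: if_true plays the role of self.
    return pd.Series.where(if_true, cond, other=if_false)

# Hypothetical sample data (not from the question).
df = pd.DataFrame({"A": [-1.0, 2.0], "B": [2.0, -4.0]})

df["C"] = np_style_where(
    (df["A"] < 0) | (df["B"] > 0),  # condition
    df["A"] + df["B"],              # used where the condition is True
    df["A"] / df["B"],              # used where the condition is False
)
print(df["C"].tolist())
```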