0

I am trying to use numpy where in conjunction with applymap in pandas.

Sample DF:

f = [[1,5],[20,40],[100,21],[15,19],[-46,101]]
test = pd.DataFrame(f,columns=["A","B"])
test 

OP:

    A   B
0   1   5
1   20  40
2   100 21
3   15  19
4   -46 101

Condition is, if a column value is greater than 50 or less than 25 it should be changed to 0 or it should remain as it is.

Code:

test = test.applymap(lambda x:np.where((test[x]>50)| (test[x]<25), 0,test[x]) )
test

Error:

    KeyError                                  Traceback (most recent call last)
~\AppData\Local\Continuum\miniconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: (1, 'occurred at index A')

Any suggestions will be helpful

0

2 Answers 2

3

Use DataFrame.mask:

test.mask(test.lt(25)|test.gt(50),0)

or DataFrame.where

test.where(test.ge(25) & test.le(50),0)

Output

   A   B
0  0   0
1  0  40
2  0   0
3  0   0
4  0   0

Using DataFrame.applymap we could do:

test.applymap(lambda x: 0 if (x>50) or (x<25) else x)

but this could become slow for large data frames


Solution with np.where

import numpy as np
pd.DataFrame(np.where((test<25)|(test>50),0,test),index = test.index,columns = test.columns)

EDIT

mean_test = test.mean()
limit = 5
df_filtered = test.mask(test.gt(mean_test.add(limit))|
                        test.lt(mean_test.sub(limit)),0)
print(df_filtered)
    A   B
0   0   0
1  20  40
2   0   0
3  15   0
4   0   0
Sign up to request clarification or add additional context in comments.

6 Comments

this is great! I have 1 more doubt, what if I want to calculate the greater than number (50 in this case) and the lesser than number (25 in this case) for each column say through mean, which could be the better option?
What would be the expected output in that case?
Do you want to calculate the average of those over 50 and those under 25 for each column?
The condition is still the same. But instead of hardcoding the upper limit as 50 and lower limit as 25 for each column, I want to set the upper limit as mean(column) + 5 and lower limit as mean(column) - 5
you may find it useful to know that you can calculate the variance and the standard deviation : pandas.pydata.org/pandas-docs/stable/reference/api/… , pandas.pydata.org/pandas-docs/stable/reference/api/… ,for example: mean_test.add(test.std().mul(3**1/2))
|
0
sample_df = pd.DataFrame(np.random.randint(1,20,size=(10, 2)), columns=list('BC'))
sample_df["date"]= ["2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01",
                    "2020-02-02","2020-02-02","2020-02-02","2020-02-02","2020-02-02"]
sample_df["date"] = pd.to_datetime(sample_df["date"])
sample_df.set_index(sample_df["date"],inplace=True)
sample_df["A"]=[10,10,10,10,10,12,1,3,4,2]
del sample_df["date"]
sample_df


def func(df,n_bins):
    try:
        proc_col = pd.qcut(df["A"].values, n_bins, labels=range(0,n_bins))
        return proc_col
    except:
        proc_col = pd.qcut(df.mean(axis =1).values, n_bins, labels=range(0,n_bins))
        return proc_col

sample_df["A"]=sample_df.groupby([sample_df.index.get_level_values(0)])[["C","A"]].apply(lambda df: func(df,3))
sample_df


B   C   A
date            
2020-02-01  1   16  [1, 2, 1, 0, 0] Categories (3, int64): [0 < 1 ...
2020-02-01  5   19  [1, 2, 1, 0, 0] Categories (3, int64): [0 < 1 ...
2020-02-01  2   16  [1, 2, 1, 0, 0] Categories (3, int64): [0 < 1 ...
2020-02-01  12  11  [1, 2, 1, 0, 0] Categories (3, int64): [0 < 1 ...
2020-02-01  15  10  [1, 2, 1, 0, 0] Categories (3, int64): [0 < 1 ...
2020-02-02  19  17  [2, 0, 1, 2, 0] Categories (3, int64): [0 < 1 ...
2020-02-02  17  7   [2, 0, 1, 2, 0] Categories (3, int64): [0 < 1 ...
2020-02-02  14  1   [2, 0, 1, 2, 0] Categories (3, int64): [0 < 1 ...
2020-02-02  19  13  [2, 0, 1, 2, 0] Categories (3, int64): [0 < 1 ...
2020-02-02  15  13  [2, 0, 1, 2, 0] Categories (3, int64): [0 < 1 .

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.