How to conditionally drop rows from a Pandas DataFrame

Question

My intent is to split the master dataframe purchases into two dataframes: a normal one, and one that contains outliers, based on the proportion of NaN values in each row. The code below should loop over the whole dataframe, but it throws an exception: IndexError: index 4 is out of bounds for axis 0 with size 3

The print statements show that the conditions are evaluated correctly, yet the results (when using for i in range(0,m-1): to avoid the exception) are wrong, probably because of the way rows are dropped:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = {
    'apples': [3, 2, 0, np.nan, 2],
    'oranges': [0, 7, 7, 2, 7],
    'figs':[1, np.nan, 10, np.nan, 10]
}
purchases = pd.DataFrame(data)
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David', 'Bob'])
# calculate the proportion of NaN per row
l = len(purchases.columns)
m = len(purchases)
n_nan=0
W= [0 for w in range(m)]
X = [i for i in range(0,m)]
for i in range(0,m):
    n_nan = purchases.iloc[i,:].isna().sum()
    print('row ',i,' number of NaN ',n_nan,' % of Nan ',n_nan*100/l)
    W[i]=n_nan*100/l
# Write code to divide the data into two subsets based on the number of missing
# values in each row.
# use https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
purchases_normal = purchases.copy()
purchases_outliers = purchases.copy()
print('purchases ')
print(purchases)
print('----------------------')
#
for j in range(0,m-0):
#    print('row ',j,' W = ',W[j])
    if W[j]> 20:
        print('at iteration ',j, ' going to drop from purchases_normal as W= ', W[j],' is > 20')
        purchases_normal.drop(purchases_normal.index[j], inplace=True)
    else:
        print('at iteration ',j, ' going to drop from purchases_outliers as W= ', W[j],' is < 20')
        purchases_outliers.drop(purchases_outliers.index[j], inplace=True)
print('purchases normal')
print(purchases_normal)
print('------')
print('purchases outliers')
print(purchases_outliers)

In your in-place drop you are changing the index, so on the next iteration index[j] will not be what you expect.
– Rajat Jain, Commented May 10, 2020 at 10:36

Rajat Jain · Accepted Answer · 2020-05-10 10:34:41Z

1

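Try the loop below. It looks up each label in the original purchases dataframe, which never changes, instead of in the dataframe that is being modified: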

for j in range(0,m-0):
#    print('row ',j,' W = ',W[j])
    if W[j]> 20:
        print('at iteration ',j, ' going to drop from purchases_normal as W= ', W[j],' is > 20')
        purchases_normal = purchases_normal.drop(purchases.index[j])
    else:
        print('at iteration ',j, ' going to drop from purchases_outliers as W= ', W[j],' is < 20')
        purchases_outliers = purchases_outliers.drop(purchases.index[j])
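
To illustrate why the lookup source matters, here is a minimal sketch on a hypothetical three-row frame named demo (not part of the question's data). Dropping by position from the frame being modified shifts every later position, while the untouched original keeps them stable:

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
demo = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])

# Positional lookup on the frame being modified: positions shift after a drop.
shrinking = demo.copy()
shrinking.drop(shrinking.index[0], inplace=True)  # removes 'a'
label_now_at_1 = shrinking.index[1]               # position 1 now holds 'c'

# Positional lookup on the untouched original: positions stay stable.
stable = demo.copy()
stable = stable.drop(demo.index[0])               # removes 'a'
label_still_at_1 = demo.index[1]                  # position 1 still holds 'b'

print(label_now_at_1, label_still_at_1)
```

In the question's loop the same shift means that later iterations drop the wrong rows and, once a copy has shrunk to three rows, position 4 no longer exists, which matches the reported IndexError.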


1 Comment

joseph pareti Over a year ago

Both proposed solutions work fine. Thank you so much. What was the problem in my code, besides the fact I am still using for loops?

Prayson W. Daniel · Accepted Answer · 2020-05-10 11:03:23Z

1

Pandas is built so that you do not have to use a for loop. If you find yourself writing one, there is a 98% chance you are doing it wrong.

If I understand your objectives:
1. Find the number of NaN values per row
2. Get the percentage used for the logic (drop when X)

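# ...
# count the original columns first, so the helper columns are not included
n_cols = len(df.columns)
df['number_nan'] = df.isna().sum(axis=1)
# fraction (0 to 1) of missing values in each row
df['pct_nan'] = df['number_nan'] / n_cols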

With these additional columns in place, you can filter:

above_20 = .2
# dt = rows of df with more than 20 percent missing values
dt = df[df['pct_nan'] > above_20]
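
Applied to the data from the question, the same idea can be sketched without helper columns by building a boolean mask and using it and its negation to get both subsets. The names pct_nan, threshold and is_outlier are my own choices here, and the 20 percent cut-off is taken from the question's loop:

```python
import numpy as np
import pandas as pd

data = {
    'apples': [3, 2, 0, np.nan, 2],
    'oranges': [0, 7, 7, 2, 7],
    'figs': [1, np.nan, 10, np.nan, 10],
}
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David', 'Bob'])

# Fraction of missing values per row: the mean of a boolean row is its NaN share.
pct_nan = purchases.isna().mean(axis=1)

threshold = 0.2  # assumed cut-off, mirroring the "> 20" test in the question
is_outlier = pct_nan > threshold

# Boolean indexing returns new frames; purchases itself is left unchanged.
purchases_outliers = purchases[is_outlier]
purchases_normal = purchases[~is_outlier]

print(purchases_normal)
print(purchases_outliers)
```

Because nothing is dropped in place, there is no index to keep in sync between iterations.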

Let me know if I understand your objectives.

edited May 10, 2020 at 11:03

answered May 10, 2020 at 10:52

