My intent is to split the master dataframe purchases into two dataframes: a normal one, and one that contains the outliers, based on the proportion of NaN values in each row. The code below should loop over every row of the dataframe, but it throws an exception: IndexError: index 4 is out of bounds for axis 0 with size 3

The print statements show that the conditions are evaluated correctly, yet the results (when using for i in range(0,m-1):) are wrong, probably because of the way rows are dropped:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = {
    'apples': [3, 2, 0, np.nan, 2],
    'oranges': [0, 7, 7, 2, 7],
    'figs':[1, np.nan, 10, np.nan, 10]
}
purchases = pd.DataFrame(data)
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David', 'Bob'])
# calculate the proportion of NaN per row
l = len(purchases.columns)
m = len(purchases)
n_nan=0
W= [0 for w in range(m)]
X = [i for i in range(0,m)]
for i in range(0,m):
    n_nan = purchases.iloc[i,:].isna().sum()
    print('row ',i,' number of NaN ',n_nan,' % of Nan ',n_nan*100/l)
    W[i]=n_nan*100/l
# Write code to divide the data into two subsets based on the number of missing
# values in each row.
# use https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
purchases_normal = purchases.copy()
purchases_outliers = purchases.copy()
print('purchases ')
print(purchases)
print('----------------------')
#
for j in range(0,m-0):
#    print('row ',j,' W = ',W[j])
    if W[j]> 20:
        print('at iteration ',j, ' going to drop from purchases_normal as W= ', W[j],' is > 20')
        purchases_normal.drop(purchases_normal.index[j], inplace=True)
    else:
        print('at iteration ',j, ' going to drop from purchases_outliers as W= ', W[j],' is < 20')
        purchases_outliers.drop(purchases_outliers.index[j], inplace=True)
print('purchases normal')
print(purchases_normal)
print('------')
print('purchases outliers')
print(purchases_outliers)
  • In your inplace drop you are changing the index, so on the next iteration index[j] will not be the row you expect. Commented May 10, 2020 at 10:36

2 Answers

Try the loop below:

for j in range(0,m-0):
#    print('row ',j,' W = ',W[j])
    if W[j]> 20:
        print('at iteration ',j, ' going to drop from purchases_normal as W= ', W[j],' is > 20')
        purchases_normal = purchases_normal.drop(purchases.index[j])
    else:
        print('at iteration ',j, ' going to drop from purchases_outliers as W= ', W[j],' is < 20')
        purchases_outliers = purchases_outliers.drop(purchases.index[j])
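
To see why looking up the label in purchases.index (rather than in the frame being shrunk) makes the difference, here is a minimal sketch using the data from the question. The names shrinking, labels_before and labels_after are only illustrative and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
purchases = pd.DataFrame(
    {
        "apples": [3, 2, 0, np.nan, 2],
        "oranges": [0, 7, 7, 2, 7],
        "figs": [1, np.nan, 10, np.nan, 10],
    },
    index=["June", "Robert", "Lily", "David", "Bob"],
)

# 'shrinking' is a hypothetical name for a copy that is dropped from in place.
shrinking = purchases.copy()
labels_before = list(shrinking.index)
shrinking.drop(shrinking.index[1], inplace=True)  # removes the 'Robert' row
labels_after = list(shrinking.index)

# Position 3 pointed at 'David' before the drop and points at 'Bob' afterwards,
# so shrinking.index[j] no longer matches W[j] once any row has been removed.
print(labels_before[3], "->", labels_after[3])

# The untouched original keeps all five labels, so purchases.index[j]
# always lines up with W[j].
print(list(purchases.index))
```

Once enough rows have been dropped, the position j eventually exceeds the length of the shrunken index, which is consistent with the IndexError reported in the question.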

1 Comment

Both proposed solutions work fine. Thank you so much. What was the problem in my code, besides the fact I am still using for loops?

Pandas is built so that you rarely have to write a for loop. If you find yourself writing one, there is a 98% chance you are doing it the wrong way.

If I understand your objectives:
1. Find the number of NaN values in each row
2. Get the percentage used for the logic (drop when X)

# ...
n_cols = len(df.columns)  # count the data columns before adding helper columns
df['number_nan'] = df.isna().sum(axis=1)
df['pct_nan'] = df['number_nan'] / n_cols

Now that you have these additional columns, you can filter:

above_20 = .2
# dt = rows of df with more than 20 percent missing values
dt = df[df['pct_nan'] > above_20]
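
Applied to the purchases frame from the question, the same idea could look like the sketch below. It assumes the 20% threshold from the question and uses a boolean mask instead of helper columns; the names pct_nan and is_outlier are my own and do not come from the original code.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
purchases = pd.DataFrame(
    {
        "apples": [3, 2, 0, np.nan, 2],
        "oranges": [0, 7, 7, 2, 7],
        "figs": [1, np.nan, 10, np.nan, 10],
    },
    index=["June", "Robert", "Lily", "David", "Bob"],
)

# Fraction of missing values per row: the mean of the boolean isna() mask.
pct_nan = purchases.isna().mean(axis=1)

# Rows with more than 20% missing values are treated as outliers.
is_outlier = pct_nan > 0.20

purchases_outliers = purchases[is_outlier]
purchases_normal = purchases[~is_outlier]

print(purchases_normal)
print(purchases_outliers)
```

Because the mask is computed once from the original frame and nothing is dropped in place, there is no index to go out of sync.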

Let me know if I understand your objectives.
