I want to split the master dataframe purchases into two dataframes: a normal one, and one containing the outlier rows, where a row counts as an outlier if more than 20% of its values are NaN. The loop below should iterate over every row of the dataframe, but it throws an exception:
IndexError: index 4 is out of bounds for axis 0 with size 3
The print statements show that the conditions are evaluated correctly, yet the results (when using for j in range(0,m-1): instead) are wrong, which is probably due to the way rows are dropped:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = {
    'apples': [3, 2, 0, np.nan, 2],
    'oranges': [0, 7, 7, 2, 7],
    'figs': [1, np.nan, 10, np.nan, 10]
}
purchases = pd.DataFrame(data)
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David', 'Bob'])
# calculate the proportion of NaN per row
l = len(purchases.columns)
m = len(purchases)
n_nan=0
W= [0 for w in range(m)]
X = [i for i in range(0,m)]
for i in range(0,m):
    n_nan = purchases.iloc[i,:].isna().sum()
    print('row ',i,' number of NaN ',n_nan,' % of Nan ',n_nan*100/l)
    W[i]=n_nan*100/l
# Write code to divide the data into two subsets based on the number of missing
# values in each row.
# use https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
purchases_normal = purchases.copy()
purchases_outliers = purchases.copy()
print('purchases ')
print(purchases)
print('----------------------')
#
for j in range(0,m-0):
    # print('row ',j,' W = ',W[j])
    if W[j]> 20:
        print('at iteration ',j, ' going to drop from purchases_normal as W= ', W[j],' is > 20')
        purchases_normal.drop(purchases_normal.index[j], inplace=True)
    else:
        print('at iteration ',j, ' going to drop from purchases_outliers as W= ', W[j],' is <= 20')
        purchases_outliers.drop(purchases_outliers.index[j], inplace=True)
print('purchases normal')
print(purchases_normal)
print('------')
print('purchases outliers')
print(purchases_outliers)
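index[j] will not be what you expect: each drop shortens the dataframe and shifts the remaining rows up, so position j no longer points at the original j-th row. By the last iteration purchases_outliers has only 3 rows left, which is why index[4] raises the IndexError.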
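One way to sidestep positional drops altogether is to build a boolean mask once, on the untouched dataframe, and select rows with it. The following is a minimal sketch rather than a definitive fix; it assumes the same sample data and the same 20% threshold as the question, and the names nan_pct and is_outlier are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

data = {
    'apples': [3, 2, 0, np.nan, 2],
    'oranges': [0, 7, 7, 2, 7],
    'figs': [1, np.nan, 10, np.nan, 10],
}
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David', 'Bob'])

# Percentage of NaN values per row, computed once on the original frame.
# isna() gives a boolean frame; the row-wise mean is the fraction of NaN.
nan_pct = purchases.isna().mean(axis=1) * 100

# Boolean mask (hypothetical name): True where a row exceeds the 20% threshold.
is_outlier = nan_pct > 20

# Select by mask instead of dropping by position, so nothing shifts mid-loop.
purchases_normal = purchases[~is_outlier]
purchases_outliers = purchases[is_outlier]

print('purchases normal')
print(purchases_normal)
print('------')
print('purchases outliers')
print(purchases_outliers)
```

Because the mask is aligned on the index labels, the two results always partition the original rows, whatever order they are in. If the loop has to stay, dropping by label (purchases.index[j], taken from the original dataframe) instead of by the shrinking copy's own index should also avoid the shift.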