KeyError when dropping rows from Pandas dataframe

Question

I'm trying to drop some rows from a Pandas dataframe because they'd be considered outliers in data. I'm getting a KeyError when trying to drop some rows using the method my professor taught me.

gdp_2019_outliers = np.where(df_gdp['2019'] > 6)
df_gdp.drop(gdp_2019_outliers[0], inplace=True)
gdp_2019_outliers_neg = np.where(df_gdp['2019'] < -3)
df_gdp.drop(gdp_2019_outliers_neg[0], inplace=True) # stacktrace points here as the cause

gdp_2020_outliers = np.where(df_gdp['2020'] > 3)
df_gdp.drop(gdp_2020_outliers[0], inplace=True)
gdp_2020_outliers_neg = np.where(df_gdp['2020'] < -15)
df_gdp.drop(gdp_2020_outliers_neg[0], inplace=True)

So, I find the outliers using np.where(), then pass the list of rows to drop(). It seems like it's trying to drop rows that are no longer in the dataframe, though -- like the first two lines of code dropped rows that were somehow refound.

Any ideas? Is there a better way to drop rows using a condition?

Stack trace:

Traceback (most recent call last):
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\data_wrangling_project.py", line 104, in <module>
    df_gdp.drop(gdp_2019_outliers_neg[0], inplace=True)
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\frame.py", line 4956, in drop
    return super().drop(
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\generic.py", line 4279, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\generic.py", line 4323, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\indexes\base.py", line 6644, in drop
    raise KeyError(f"{list(labels[mask])} not found in axis")
KeyError: '[152] not found in axis'

gdp_columns = ['Country Name', '1980', '1990', '2000', '2010', '2018', '2019', '2020']
df_gdp = pd.read_csv(gdp_file, usecols=gdp_columns)

Dataset: https://www.kaggle.com/zackerym/gdp-annual-growth-for-each-country-1960-2020

Hey, @richardec I added the traceback -- thanks for the help so far.
– cjames, Commented Feb 5, 2022 at 19:49
Hmm, this is confusing. Will you please provide more of your code? I want to be able to reproduce your dataframe (if it's not proprietary).
– user17242583, Commented Feb 5, 2022 at 19:58
That's actually all there is, really. I've added the data source and the remaining lines of code as an edit.
– cjames, Commented Feb 5, 2022 at 20:03

Valdi_Bo · Accepted Answer · 2022-02-05 20:10:33Z

Let's create the source DataFrame as:

   2019  2020
0     5     2
1     6     7
2     7   -15
3     8     8
4    -4     5
5    -3   -18
6    -2     7
7    -5    -3

So far the index contains consecutive integers, starting from 0.
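
For anyone who wants to follow along, here is a minimal sketch that builds this sample frame. The values are copied from the table above; they are illustrative only and are not taken from the Kaggle file.

```python
import pandas as pd

# Sample values copied from the table above (not the real GDP data).
df_gdp = pd.DataFrame({
    '2019': [5, 6, 7, 8, -4, -3, -2, -5],
    '2020': [2, 7, -15, 8, 5, -18, 7, -3],
})

# No explicit index was given, so pandas assigns a RangeIndex 0..7:
# row labels and row positions coincide at this point.
print(df_gdp.index.tolist())
```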

When you compute gdp_2019_outliers, the result is:

(array([2, 3], dtype=int64),)

And after the first drop df_gdp contains:

   2019  2020
0     5     2
1     6     7
4    -4     5
5    -3   -18
6    -2     7
7    -5    -3

Up to this point your code succeeds, because the integer positions returned by np.where happen to coincide with the labels in the index of df_gdp.

Then, when you compute gdp_2019_outliers_neg, the result is:

(array([2, 5], dtype=int64),)

Now, when you attempt to run:

df_gdp.drop(gdp_2019_outliers_neg[0], inplace=True)

an exception is thrown:

KeyError: '[2] not found in axis'

Your code fails because:

np.where returns the integer positions of the matching rows, counted from 0 in the current (already shortened) DataFrame, so they no longer correspond to the labels in the index of df_gdp,
but drop looks up the values it is given as labels in the index, and the index no longer contains 2.
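
The mismatch can be seen directly in a small sketch on the sample frame above. It also shows one possible workaround if you want to keep np.where: translate positions into labels with df_gdp.index[...] before calling drop. The variable names positions_neg and labels_neg are made up for this illustration.

```python
import numpy as np
import pandas as pd

df_gdp = pd.DataFrame({
    '2019': [5, 6, 7, 8, -4, -3, -2, -5],
    '2020': [2, 7, -15, 8, 5, -18, 7, -3],
})

# First drop works: the index is still 0..7, so positions equal labels.
positions = np.where(df_gdp['2019'] > 6)[0]
df_gdp = df_gdp.drop(positions)

# Second search: positions are counted from 0 again in the shortened frame.
positions_neg = np.where(df_gdp['2019'] < -3)[0]  # positions 2 and 5
labels_neg = df_gdp.index[positions_neg]          # labels 4 and 7

try:
    df_gdp.drop(positions_neg)  # label 2 was dropped earlier
except KeyError as exc:
    error_message = str(exc)

# Translating positions to labels first makes drop() remove the right rows.
df_gdp = df_gdp.drop(labels_neg)
print(df_gdp.index.tolist())  # [0, 1, 5, 6]
```

Boolean indexing avoids the translation step entirely, which is why it is usually the simpler choice.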

The fix is to use boolean indexing:

gdp_2019_outliers = df_gdp['2019'] > 6
df_gdp = df_gdp[~gdp_2019_outliers]

Then, to drop negative outliers for 2019, run:

gdp_2019_outliers_neg = df_gdp['2019'] < -3
df_gdp = df_gdp[~gdp_2019_outliers_neg]

The result, after both drops, is:

   2019  2020
0     5     2
1     6     7
5    -3   -18
6    -2     7

Proceed the same way to drop other outliers.
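
If you prefer a single step, the four conditions can probably be combined into one mask. This is only a sketch on the sample frame, using the thresholds from the question; is_outlier is a name chosen for this example. Because each condition looks at one row at a time, applying them together should give the same rows as applying them one after another.

```python
import pandas as pd

df_gdp = pd.DataFrame({
    '2019': [5, 6, 7, 8, -4, -3, -2, -5],
    '2020': [2, 7, -15, 8, 5, -18, 7, -3],
})

# Thresholds taken from the question; a row is an outlier if any one applies.
is_outlier = (
    (df_gdp['2019'] > 6) | (df_gdp['2019'] < -3)
    | (df_gdp['2020'] > 3) | (df_gdp['2020'] < -15)
)

# Keep the rows that are not outliers. Comparisons with NaN are False,
# so rows with missing values are kept, as in the step-by-step version.
df_gdp = df_gdp[~is_outlier]
print(df_gdp)
```

On the sample data only row 0 survives, since every other row breaks at least one of the four thresholds.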

score 1 · Accepted Answer · 2022-02-05 19:40:14Z


When you call drop, you need to pass it row index labels or column names. You can't pass it a boolean mask, and the positions returned by np.where are not index labels either.

Build boolean masks and filter with them instead:

gdp_2019_outliers = df_gdp['2019'] > 6
# Keep only the rows where the mask is False:
df_gdp = df_gdp[~gdp_2019_outliers]
gdp_2019_outliers_neg = df_gdp['2019'] < -3
df_gdp = df_gdp[~gdp_2019_outliers_neg]

gdp_2020_outliers = df_gdp['2020'] > 3
df_gdp = df_gdp[~gdp_2020_outliers]
gdp_2020_outliers_neg = df_gdp['2020'] < -15
df_gdp = df_gdp[~gdp_2020_outliers_neg]

edited Feb 5, 2022 at 19:40

answered Feb 5, 2022 at 19:31

user17242583

1 Comment

cjames Over a year ago

I'm not very familiar with Python, but I get a "TypeError: bad operand type for unary ~: 'tuple'" error when using it this way.
