I'm trying to drop some rows from a pandas DataFrame because they'd be considered outliers in the data. I'm getting a KeyError when I drop them using the method my professor taught me.

gdp_2019_outliers = np.where(df_gdp['2019'] > 6)
df_gdp.drop(gdp_2019_outliers[0], inplace=True)
gdp_2019_outliers_neg = np.where(df_gdp['2019'] < -3)
df_gdp.drop(gdp_2019_outliers_neg[0], inplace=True) # stacktrace points here as the cause

gdp_2020_outliers = np.where(df_gdp['2020'] > 3)
df_gdp.drop(gdp_2020_outliers[0], inplace=True)
gdp_2020_outliers_neg = np.where(df_gdp['2020'] < -15)
df_gdp.drop(gdp_2020_outliers_neg[0], inplace=True)

So, I find the outliers using np.where(), then pass the resulting array of row positions to drop(). It seems like it's trying to drop rows that are no longer in the dataframe, though, as if the rows dropped by the first two lines of code were somehow found again.

Any ideas? Is there a better way to drop rows using a condition?

Stack trace:

Traceback (most recent call last):
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\data_wrangling_project.py", line 104, in <module>
    df_gdp.drop(gdp_2019_outliers_neg[0], inplace=True)
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\frame.py", line 4956, in drop
    return super().drop(
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\generic.py", line 4279, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\generic.py", line 4323, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\indexes\base.py", line 6644, in drop
    raise KeyError(f"{list(labels[mask])} not found in axis")
KeyError: '[152] not found in axis'

Remaining code (how the dataframe is loaded):

gdp_columns = ['Country Name', '1980', '1990', '2000', '2010', '2018', '2019', '2020']
df_gdp = pd.read_csv(gdp_file, usecols=gdp_columns)

Dataset: https://www.kaggle.com/zackerym/gdp-annual-growth-for-each-country-1960-2020

  • Will you please show the full traceback in the question? Commented Feb 5, 2022 at 19:47
  • Hey, @richardec I added the traceback -- thanks for the help so far. Commented Feb 5, 2022 at 19:49
  • Hmm, this is confusing. Will you please provide more of your code? I want to be able to reproduce your dataframe (if it's not proprietary). Commented Feb 5, 2022 at 19:58
  • That's actually all there is, really. I've added the data source and the remaining lines of code as an edit. Commented Feb 5, 2022 at 20:03

2 Answers

1

Let's create the source DataFrame as:

   2019  2020
0     5     2
1     6     7
2     7   -15
3     8     8
4    -4     5
5    -3   -18
6    -2     7
7    -5    -3

So far the index contains consecutive integers, starting from 0.
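
If you want to follow along, the sample frame above can be built like this. It is only a sketch: the values are the illustrative ones from the table, not rows from the Kaggle dataset.

```python
import pandas as pd

# Toy data matching the table above (illustrative values, not the real GDP figures)
df_gdp = pd.DataFrame({
    '2019': [5, 6, 7, 8, -4, -3, -2, -5],
    '2020': [2, 7, -15, 8, 5, -18, 7, -3],
})

# With no explicit index, pandas assigns a default RangeIndex: 0, 1, ..., 7
print(df_gdp)
```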

When you compute gdp_2019_outliers, the result is:

(array([2, 3], dtype=int64),)

And after the first drop df_gdp contains:

   2019  2020
0     5     2
1     6     7
4    -4     5
5    -3   -18
6    -2     7
7    -5    -3

So far your code succeeded, because the integer positions returned by np.where happened to match the labels in the index of df_gdp. After the drop, though, the index has gaps: labels 2 and 3 are gone.

Then, when you compute gdp_2019_outliers_neg, the result is:

(array([2, 5], dtype=int64),)

Now, when you attempt to run:

df_gdp.drop(gdp_2019_outliers_neg[0], inplace=True)

an exception is thrown:

KeyError: '[2] not found in axis'

The reason your code failed is that:

  • np.where returns the integer positions of the matching rows, again counted from 0 in the current frame, and these no longer correspond to the index labels of df_gdp,
  • but drop looks rows up by index label, and the index no longer contains 2.
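
The mismatch can be reproduced on the sample frame. This is a sketch using the toy values from above; `positions` and `labels` are names introduced here for illustration. It also shows that, if you would rather keep np.where, translating the positions into labels with `df_gdp.index[...]` should make drop work.

```python
import numpy as np
import pandas as pd

df_gdp = pd.DataFrame({
    '2019': [5, 6, 7, 8, -4, -3, -2, -5],
    '2020': [2, 7, -15, 8, 5, -18, 7, -3],
})

# First drop: positions and labels still coincide, so this works
df_gdp.drop(np.where(df_gdp['2019'] > 6)[0], inplace=True)

# Second round: np.where counts rows from 0 in the *current* frame
positions = np.where(df_gdp['2019'] < -3)[0]
print(positions)               # positions of the matching rows
print(df_gdp.index.tolist())   # labels that survived the first drop

try:
    df_gdp.drop(positions)     # drop() looks these values up as labels
except KeyError as exc:
    print('KeyError:', exc)

# Translating positions into index labels avoids the error
labels = df_gdp.index[positions]
df_gdp = df_gdp.drop(labels)
print(df_gdp)
```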

The proper approach is to use boolean indexing:

gdp_2019_outliers = df_gdp['2019'] > 6
df_gdp = df_gdp[~gdp_2019_outliers]

Then, to drop negative outliers for 2019, run:

gdp_2019_outliers_neg = df_gdp['2019'] < -3
df_gdp = df_gdp[~gdp_2019_outliers_neg]

The result, after both drops, is:

   2019  2020
0     5     2
1     6     7
5    -3   -18
6    -2     7

Proceed the same way to drop other outliers.
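
The four thresholds from the question can also be combined into a single mask, so the frame is filtered in one step. Again a sketch on the toy values, with the cut-offs taken from the question (6 and -3 for 2019, 3 and -15 for 2020); `is_outlier` is a name introduced here.

```python
import pandas as pd

df_gdp = pd.DataFrame({
    '2019': [5, 6, 7, 8, -4, -3, -2, -5],
    '2020': [2, 7, -15, 8, 5, -18, 7, -3],
})

# A row is an outlier if either year falls outside its allowed range
is_outlier = (
    (df_gdp['2019'] > 6) | (df_gdp['2019'] < -3)
    | (df_gdp['2020'] > 3) | (df_gdp['2020'] < -15)
)

# Keep only the rows that are not flagged
df_gdp = df_gdp[~is_outlier]
print(df_gdp)
```

Because every condition is evaluated per row, this gives the same rows as applying the four filters one after another.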

1

When you call drop, you need to pass it row index labels or column names. You can't pass it a mask or a list of row positions, which is essentially what you're doing.

Try this instead:

# Build boolean masks instead of np.where positions, then keep the non-outliers
gdp_2019_outliers = df_gdp['2019'] > 6
df_gdp = df_gdp[~gdp_2019_outliers]
gdp_2019_outliers_neg = df_gdp['2019'] < -3
# Use this line instead:
df_gdp = df_gdp[~gdp_2019_outliers_neg]


gdp_2020_outliers = df_gdp['2020'] > 3
df_gdp = df_gdp[~gdp_2020_outliers]
gdp_2020_outliers_neg = df_gdp['2020'] < -15
# Use this line instead as well:
df_gdp = df_gdp[~gdp_2020_outliers_neg]

1 Comment

I'm not very familiar with Python, but I get a "TypeError: bad operand type for unary ~: 'tuple'" error when using it this way.
