When I run this code in Jupyter Notebook:

columns = ['nkill', 'nkillus', 'nkillter','nwound', 'nwoundus', 'nwoundte', 'propvalue', 'nperps', 'nperpcap', 'iyear', 'imonth', 'iday']

for col in columns:
    # needed for any missing values set to '-99'
    df[col] = [np.nan if (x < 0) else x for x in df[col].tolist()]
    # calculate the mean of the column
    column_temp = [0 if math.isnan(x) else x for x in df[col].tolist()]
    mean = round(np.mean(column_temp))
    # then apply the mean to all NaNs
    df[col].fillna(mean, inplace=True)

I receive the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-56-f8a0a0f314e6> in <module>()
  3 for col in columns:
  4     # needed for any missing values set to '-99'
----> 5     df[col] = [np.nan if (x < 0) else x for x in df[col].tolist()]
  6     # calculate the mean of the column
  7     column_temp = [0 if math.isnan(x) else x for x in df[col].tolist()]

/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   4374             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   4375                 return self[name]
-> 4376             return object.__getattribute__(self, name)
   4377 
   4378     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'tolist'

The code works fine when I run it in PyCharm, and everything I have read suggests it should work in Jupyter as well. Am I missing something?

I've created a Minimal, Complete, and Verifiable example below:

import numpy as np
import pandas as pd
import os
import math

# get the path to the current working directory
cwd = os.getcwd()

# then add the name of the Excel file, including its extension to get its relative path
# Note: make sure the Excel file is stored inside the cwd
file_path = cwd + "/data.xlsx"

# Read the Excel file into a DataFrame
df = pd.read_excel(file_path)

columns = ['nkill', 'nkillus', 'nkillter', 'nwound', 'nwoundus', 'nwoundte', 'propvalue', 'nperps', 'nperpcap', 'iyear', 'imonth', 'iday']

for col in columns:
    # needed for any missing values set to '-99'
    df[col] = [np.nan if (x < 0) else x for x in df[col].tolist()]
    # calculate the mean of the column
    column_temp = [0 if math.isnan(x) else x for x in df[col].tolist()]
    mean = round(np.mean(column_temp))
    # then apply the mean to all NaNs
    df[col].fillna(mean, inplace=True)
  • Does this work? df[col].tolist() => df[col].values.tolist() Commented Dec 3, 2018 at 18:27
  • No sorry, it throws up a different error: TypeError: '<' not supported between instances of 'list' and 'int' Commented Dec 3, 2018 at 18:29
  • @Uncle_Timothy, See How to make good reproducible pandas examples. Mock up a minimal reproducible example, i.e. a minimal and reproducible example of the problem you observe. Commented Dec 3, 2018 at 18:39
  • I would start with a print(type(df)) and print(type(df['nkill'])), to verify that the objects are (are not) dataframe and series. Commented Dec 3, 2018 at 18:41
  • Without the xlsx file we can't copy-n-paste and run your code. It isn't Verifiable. Commented Dec 3, 2018 at 19:41
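
The type check suggested in the comments can be sketched as follows. This is only an illustration under an assumption that the question does not confirm: that the spreadsheet loaded in Jupyter contains a duplicated column label. In that case df[col] returns a DataFrame rather than a Series, which would produce the AttributeError shown above. The frame below is hypothetical, not the asker's data.

```python
import pandas as pd

# Hypothetical frame with the label 'nkill' appearing twice (an assumption,
# not something confirmed by the question).
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['nkill', 'nkill', 'iyear'])

# A unique label selects a Series; a duplicated label selects a DataFrame.
print(type(df['iyear']))
print(type(df['nkill']))

# List any labels that occur more than once.
duplicated = df.columns[df.columns.duplicated()].tolist()
print(duplicated)

# A DataFrame has no tolist method, so this is False for the duplicated label.
has_tolist = hasattr(df['nkill'], 'tolist')
print(has_tolist)
```

If the duplicated list comes back empty on the real data, the cause lies elsewhere and the types printed above are the next thing to compare between the two environments.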

1 Answer

You have an XY Problem. You've described what you are trying to achieve in your comments, but your approach is not appropriate for Pandas.

Avoid for loops and list

With Pandas, you should look to avoid explicit for loops and conversion to a Python list. Pandas builds on NumPy arrays, which support vectorised column-wise operations.
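
As a rough illustration of the difference, the sketch below replaces negative values with NaN in two ways: the list comprehension from the question, and a single vectorised call to Series.mask. The sample values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column where -99 marks a missing value.
s = pd.Series([5, -99, 3, -99], name='nkill')

# List-based approach from the question: convert to a list and loop.
looped = pd.Series([np.nan if x < 0 else x for x in s.tolist()], name='nkill')

# Vectorised approach: mask every value where the condition is True.
masked = s.mask(s < 0)

# Both approaches produce the same values.
same = looped.equals(masked)
print(same)
```

The vectorised form avoids the round trip through a Python list and keeps the original index.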

So let's look at how you can rewrite:

for col in columns:
    # values less than 0 set to NaN
    # calculate the mean of the column with 0 for NaN
    # then apply the mean to all NaNs

You can now use Pandas methods to achieve the above.

apply + pd.to_numeric + mask + fillna

You can define a function mean_update and use pd.DataFrame.apply to apply it to each series:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3, np.nan],
                   'B': ['hello', 4, 5, np.nan],
                   'C': [-1.5, 3, np.nan, np.nan]})

def mean_update(s):
    s_num = pd.to_numeric(s, errors='coerce')  # convert to numeric; non-numeric values become NaN
    s_num = s_num.mask(s_num < 0)              # replace values less than 0 with NaN
    s_mean = s_num.fillna(0).mean()            # mean with NaN counted as 0, as in the question
    return s_num.fillna(s_mean)                # replace NaN with that mean

df = df.apply(mean_update)                     # apply to each series

print(df)

     A     B     C
0  1.0  2.25  0.75
1  1.0  4.00  3.00
2  3.0  5.00  0.75
3  1.0  2.25  0.75
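
To apply the same idea to only some columns, as the loop in the question does, one option is to select those columns before calling apply. The sketch below is a variant, not the exact function above: it rounds the mean, because the question's loop does, and it runs on a small made-up frame standing in for the Excel data.

```python
import numpy as np
import pandas as pd

def mean_update(s):
    s_num = pd.to_numeric(s, errors='coerce')  # non-numeric values become NaN
    s_num = s_num.mask(s_num < 0)              # negative values become NaN
    s_mean = round(s_num.fillna(0).mean())     # rounded mean, NaN counted as 0
    return s_num.fillna(s_mean)                # fill NaN with the rounded mean

# Hypothetical stand-in for the spreadsheet; 'city' is not to be touched.
df = pd.DataFrame({'nkill': [4, -99, 3, np.nan],
                   'nwound': [-99, 10, 0, 6],
                   'city': ['a', 'b', 'c', 'd']})

columns = ['nkill', 'nwound']

# Only the listed columns are rewritten; all other columns are left as-is.
df[columns] = df[columns].apply(mean_update)

print(df)
```

Selecting with df[columns] keeps text columns out of pd.to_numeric, which would otherwise coerce them to NaN and fill them with a mean.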