I have a pandas dataframe that looks like:

import pandas as pd

d = {'some_col' : ['A', 'B', 'C', 'D', 'E'],
     'alert_status' : [1, 2, 0, 0, 5]}
df = pd.DataFrame(d)

Many tasks at my job require the same operations in pandas, so I'm starting to write standardized functions that take a dataframe as a parameter and return something. Here's a simple one:

def alert_read_text(df, alert_status=None):
    if (alert_status is None):
        print('Warning: A column name with the alerts must be specified')
    alert_read_criteria = df[alert_status] >= 1
    df[alert_status].loc[alert_read_criteria] = 1
    alert_status_dict = {0 : 'Not Read',
                         1 : 'Read'}
    df[alert_status] = df[alert_status].map(alert_status_dict)
    return df[alert_status]

I'm looking to have the function return a series. This way, one could add a column to an existing data frame:

df['alert_status_text'] = alert_read_text(df, alert_status='alert_status')

Currently, the function correctly returns a series, but it also modifies the existing column. How can I keep the original column that is passed in from being modified?

  • You can take a copy, e.g. copy = df.copy(), in your function body. (Commented Jul 31, 2014 at 22:10)

2 Answers

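As you've discovered, the passed-in dataframe is modified. Python hands the function a reference to the same object rather than a copy, so any in-place change made inside the function is visible to the caller. This is how Python works in general and has nothing to do with pandas as such.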
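A minimal sketch of that behavior using a plain list, independent of pandas (the function and variable names here are made up for illustration):

```python
def append_in_place(items):
    # Mutates the very object the caller passed in; no copy is made.
    items.append(99)
    return items


def append_to_copy(items):
    # Rebinds the local name to a shallow copy, so the caller's list is left alone.
    items = list(items)
    items.append(99)
    return items


original = [1, 2, 3]

append_to_copy(original)
unchanged = list(original)   # snapshot taken after the copying version ran

append_in_place(original)
mutated = list(original)     # snapshot taken after the in-place version ran
```

Here `unchanged` still holds the three original values, while `mutated` has picked up the extra element, which mirrors what happens to the dataframe column in the question.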

So if you don't want to modify the passed df then take a copy:

def alert_read_text(df, alert_status=None):
    if (alert_status is None):
        print('Warning: A column name with the alerts must be specified')
    # Work on a copy so the caller's dataframe is left untouched
    copy = df.copy()
    alert_read_criteria = copy[alert_status] >= 1
    # A single .loc call avoids chained assignment, which may not write through
    copy.loc[alert_read_criteria, alert_status] = 1
    alert_status_dict = {0 : 'Not Read',
                         1 : 'Read'}
    copy[alert_status] = copy[alert_status].map(alert_status_dict)
    return copy[alert_status]
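
If copying the whole dataframe feels heavy, one possible variant is to copy only the column being transformed. This is a sketch rather than part of the answer above; the function name alert_read_text_from_column is hypothetical.

```python
import pandas as pd


def alert_read_text_from_column(df, alert_status):
    # Copy just the one column; df itself is never written to.
    status = df[alert_status].copy()
    # Collapse every value >= 1 to 1 on the copy only.
    status[status >= 1] = 1
    return status.map({0: 'Not Read', 1: 'Read'})


df = pd.DataFrame({'some_col': ['A', 'B', 'C', 'D', 'E'],
                   'alert_status': [1, 2, 0, 0, 5]})
df['alert_status_text'] = alert_read_text_from_column(df, alert_status='alert_status')
```

The original alert_status column keeps its numeric values, and the returned series shares the dataframe's index, so it can be assigned back as a new column.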

Also see related: pandas dataframe, copy by value

2 Comments

Thanks! That solved it. Is there a standardized variable name that is common to use as the copy within functions like this? Meaning, pandas dataframes are usually abbreviated as df, etc. Do people typically name it as 'copy'? Or is it typically whatever you come up with?
@DataSwede that was just a quick hack example, you could call it whatever you want to be honest, tmp would also do

You don't need to set any values on your DataFrame in your example.

def alert_read_text(df, alert_status):
    alert_read_criteria = df[alert_status] >= 1
    alert_status_dict = {False : 'Not Read',
                         True : 'Read'}
    return alert_read_criteria.map(alert_status_dict)

Since the alert_read_criteria Series has the same index as df, you can still do df['alert_status_text'] = alert_read_text(df, alert_status='alert_status') afterwards.
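
A short check of that claim, as a sketch using the sample data from the question (the name before is introduced here only for the comparison):

```python
import pandas as pd


def alert_read_text(df, alert_status):
    # Boolean series: True where the alert count is at least 1.
    alert_read_criteria = df[alert_status] >= 1
    # Map the booleans to labels; nothing is assigned back to df.
    return alert_read_criteria.map({False: 'Not Read', True: 'Read'})


df = pd.DataFrame({'some_col': ['A', 'B', 'C', 'D', 'E'],
                   'alert_status': [1, 2, 0, 0, 5]})

before = df['alert_status'].copy()
df['alert_status_text'] = alert_read_text(df, alert_status='alert_status')

# The input column is identical to its state before the call.
assert df['alert_status'].equals(before)
```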

In my experience, assigning columns to a DataFrame passed as a parameter, without intending to return that DataFrame, is generally a bad pattern, because it hides the function's side effects from the caller.
