I am trying to eliminate excessive if statements when modifying values in a pandas DataFrame. I will eventually need one branch for each state, which is a lot of code, and every if statement is evaluated for every row. When my data source was a list, I successfully used a dictionary of lambdas to make the code more efficient, as demonstrated in the first code block. I am trying to replicate that with the data in a DataFrame but am not sure how.

Efficient Code with Lists:

Projects = [['Project1', 'CT', 800], ['Project2', 'MA', 1000], ['Project3', 'CA', 20]]

for project in Projects:
    # Look up the lambda for this project's state and call it immediately.
    project[2] = {
        'CT': lambda: project[2] * 1.4,
        'MA': lambda: project[2] * 1.1,
        'CA': lambda: project[2] * 1.5
    }[project[1]]()

print(Projects)

Inefficient code with dataframe:

import pandas as pd
df = pd.DataFrame(data = [['Project1', 'CT', 800], ['Project2', 'MA', 1000], ['Project3', 'CA', 20]], columns=['Project ID', 'State', 'Cost'])

# Check every state for every row and scale the matching row's cost.
for project_index, project in df.iterrows():
    if project['State'] == 'CT':
        df.loc[project_index, 'Cost'] *= 1.4
    if project['State'] == 'MA':
        df.loc[project_index, 'Cost'] *= 1.1
    if project['State'] == 'CA':
        df.loc[project_index, 'Cost'] *= 1.5

print(df)
  • Instead of those lambdas, why not just create a dictionary of factors, {'CT': 1.4, ...}, and apply it with project[2] *= factors[project[1]]? Commented Jul 21, 2015 at 15:02
  • Why not do a many-to-one merge to create a column of constants (1.4, 1.1, 1.5) for each state (CT, MA, CA) and do the calculation column-wise? Iterating row by row is a bit slower. Commented Jul 21, 2015 at 15:08
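
Both comments can be sketched roughly as follows. This is only an illustration of the two suggestions, not code posted by the commenters: the names factors, factor_table and the 'Factor' column are made up for the example, and the factor values are the ones from the question.

```python
import pandas as pd

# Multiplication factor per state (values taken from the question).
factors = {'CT': 1.4, 'MA': 1.1, 'CA': 1.5}

# First suggestion: a plain dictionary lookup on the list data, no lambdas.
projects = [['Project1', 'CT', 800], ['Project2', 'MA', 1000], ['Project3', 'CA', 20]]
for project in projects:
    project[2] *= factors[project[1]]

# Second suggestion: many-to-one merge, then one column-wise multiplication.
df = pd.DataFrame(
    [['Project1', 'CT', 800], ['Project2', 'MA', 1000], ['Project3', 'CA', 20]],
    columns=['Project ID', 'State', 'Cost'],
)
# Hypothetical lookup table with one row per state ('Factor' is an assumed name).
factor_table = pd.DataFrame({'State': list(factors), 'Factor': list(factors.values())})
merged = df.merge(factor_table, on='State', how='left', validate='many_to_one')
merged['Cost'] = merged['Cost'] * merged['Factor']
merged = merged.drop(columns='Factor')

print(projects)
print(merged)
```

With how='left', a project whose state is missing from the lookup table would get NaN as its factor, so its cost would become NaN instead of raising an error.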

1 Answer

I'd construct a dict mapping each state to its multiplication factor and iterate over the dict to get each (state, factor) pair. For each pair, use loc with a boolean mask to multiply only the matching rows in your df:

In [185]:
d = {'CT':1.4, 'MA':1.1, 'CA':1.5}
for state, factor in d.items():
    df.loc[df['State'] == state, 'Cost'] *= factor
df

Out[185]:
  Project ID State  Cost
0   Project1    CT  1120
1   Project2    MA  1100
2   Project3    CA    30
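
If you would rather avoid the Python-level loop entirely, one possible variation (a sketch, not part of the answer as originally posted) is to translate the State column into factors with Series.map and multiply the whole column once. It reuses the same df and d as above; note that Cost ends up as a float column.

```python
import pandas as pd

df = pd.DataFrame(
    [['Project1', 'CT', 800], ['Project2', 'MA', 1000], ['Project3', 'CA', 20]],
    columns=['Project ID', 'State', 'Cost'],
)
d = {'CT': 1.4, 'MA': 1.1, 'CA': 1.5}

# Translate each state into its factor, then multiply the whole column at once.
df['Cost'] = df['Cost'] * df['State'].map(d)

print(df)
```

A state that is missing from d maps to NaN, so that row's cost would become NaN.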
1 Comment

Why would you loop through the dictionary and not the dataframe? What if the dictionary has all 50 states, but the dataframe only has 4 projects? That seems inefficient and may even cause errors.
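
To check the concern about errors, here is a small sketch that runs the loop from the answer with a dictionary containing states that have no projects. The 'NY' factor and the 'TX' project are invented for this test and are not from the question; Cost is stored as floats so the multiplication does not change the column's dtype.

```python
import pandas as pd

df = pd.DataFrame(
    [['Project1', 'CT', 800.0], ['Project2', 'MA', 1000.0], ['Project3', 'TX', 20.0]],
    columns=['Project ID', 'State', 'Cost'],
)
# 'CA' and 'NY' have no projects; 'TX' has a project but no factor (made-up cases).
d = {'CT': 1.4, 'MA': 1.1, 'CA': 1.5, 'NY': 1.3}

for state, factor in d.items():
    mask = df['State'] == state  # all False when the state has no projects
    df.loc[mask, 'Cost'] *= factor

print(df)
```

States without projects produce an all-False mask, so the assignment touches no rows and nothing is raised, and the 'TX' project keeps its original cost. The efficiency point does hold in one sense: the loop compares the whole State column once per dictionary key, so a 50-state dictionary means 50 passes however few projects there are.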
