
I have a table with 150,000 rows and 15 columns. The relevant columns for this example are COUNTRY, COSTCENTER and EXTENSION. I am reading a CSV into a pandas DataFrame, and all columns are of type object.
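For reference, a minimal sketch of the read step (the file name is only a placeholder); dtype=str keeps every column as object/string, matching the setup just described:

import pandas as pd

# Placeholder file name; dtype=str reads every column as object/string.
df = pd.read_csv("extensions.csv", dtype=str)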

What I want to do is:

  1. Search for a certain COUNTRY (e.g. "China")
  2. Filter those rows down to the ones where COSTCENTER is either 1000 or 2000, or where EXTENSION starts with "862"
  3. Once all filters have been applied, change the country name in COUNTRY to something new.

I had a solution, but it always raised a warning about chained indexing:

df.COUNTRY[df.COUNTRY.str.match("China") &
           (df.COSTCENTER.str.match("1000") |
            df.COSTCENTER.str.match("2000"))] = 'China_new_name'

I cannot say I fully understood why this causes problems, so I looked for an alternative. I tried lambda and apply, but kept getting all sorts of errors.
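(For reference, a row-wise apply that does run would look roughly like the sketch below; the column names are the ones from above, and this is usually far slower than the mask-based solutions in the answers.)

# Sketch only: inspect each row individually and rebuild COUNTRY.
# Avoids chained indexing, but row-wise apply is typically much slower
# than a vectorised boolean mask.
df["COUNTRY"] = df.apply(
    lambda row: "China_new_name"
    if row["COUNTRY"] == "China" and row["COSTCENTER"] in ("1000", "2000")
    else row["COUNTRY"],
    axis=1,
)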

My latest approach was:

filter_China = df.ix[(df["COUNTRY"] == "China") &
                     ((df["COSTCENTER"] == "1000") | (df["COSTCENTER"] == "2000"))]

and it seems to filter what I am looking for (I did not include the search on EXTENSION yet, as I first wanted this part to work).

But when I try to change a value based on my search criteria, I run into trouble:

df.ix[(df["COUNTRY"]=="China") & ((df["COSTCENTER"]=="1000") | 
(df["COSTCENTER"]=="2000")), df["COUNTRY"]] = "China_new_name"

I am getting this error: raise KeyError('%s not in index' % objarr[mask])

What am I missing here? Is this the right approach, or would I need to go a totally different route?


2 Answers


You need to read the section of the documentation on chained indexing and the SettingWithCopyWarning. Assign with a single .loc call instead:

df.loc[df.COUNTRY.str.match("China") &
       (df.COSTCENTER.str.match("1000") |
        df.COSTCENTER.str.match("2000")), "COUNTRY"] = 'China_new_name'
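To make the difference concrete, a minimal sketch with made-up data, contrasting chained indexing with a single .loc call:

import pandas as pd

df = pd.DataFrame({"COUNTRY": ["China", "USA"], "COSTCENTER": ["1000", "3000"]})
mask = (df["COUNTRY"] == "China") & (df["COSTCENTER"] == "1000")

# Chained indexing: df["COUNTRY"] first returns a Series, and the assignment
# then happens on that intermediate object, which may be a copy of the
# original data. That is what the SettingWithCopyWarning is about.
df["COUNTRY"][mask] = "China_new_name"

# Single .loc call: row mask and column label go into one indexing operation
# on df itself, so the assignment is guaranteed to hit the original frame.
df.loc[mask, "COUNTRY"] = "China_new_name"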

5 Comments

Thank you for your answer. I have been reading the documentation, but my knowledge of Python / pandas is still too low to understand it completely; I only started a couple of weeks ago... Can you please give me an idea why you changed from df["COUNTRY"]=="China" to str.match? And overall, is this the right approach (e.g. speed-wise)?
I think str.match is not necessary. See my answer.
Also it is slower; comparing with == is faster, even without the str.startswith condition.
@jezrael In my use case == didn't work, but str.match worked well. I don't know the cause though.
@gneusch - The reason is clear: you need to match substrings here
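To illustrate the difference the last two comments are discussing, a small sketch: == is an exact comparison, while str.match anchors a regular expression at the start of each string, so it also matches longer values.

import pandas as pd

s = pd.Series(["1000", "10005", "2000"])

print(s == "1000")          # True, False, False  (exact equality)
print(s.str.match("1000"))  # True, True,  False  (regex match at start of string)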

I think you need to compare with == and, to check the start of a string, use str.startswith:

df = pd.DataFrame({'COUNTRY':['China','China','China', 'USA'],
                   'COSTCENTER':['1000','2000','6000','1000'],
                   'EXTENSION':['86212','11862','1000', '8555']})

print (df)
  COSTCENTER COUNTRY EXTENSION
0       1000   China     86212
1       2000   China     11862
2       6000   China      1000
3       1000     USA      8555

df.loc[(df.COUNTRY == "China") & ((df.COSTCENTER == "1000") | (df.COSTCENTER == "2000")) & 
       (df.EXTENSION.str.startswith('862')), "COUNTRY"] = 'China_new_name'

print (df)
  COSTCENTER         COUNTRY EXTENSION
0       1000  China_new_name     86212
1       2000           China     11862
2       6000           China      1000
3       1000             USA      8555

Another solution uses isin to compare against multiple values of a column:

df.loc[(df.COUNTRY == "China") & (df.COSTCENTER.isin(["1000", "2000"])) & 
       (df.EXTENSION.str.startswith('862')), "COUNTRY"] = 'China_new_name'

print (df)
  COSTCENTER         COUNTRY EXTENSION
0       1000  China_new_name     86212
1       2000           China     11862
2       6000           China      1000
3       1000             USA      8555
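One practical note, assuming EXTENSION could contain missing values in the real data: str.startswith returns NaN for them, which breaks the boolean mask, so passing na=False is the usual fix:

# Treat missing EXTENSION values as "does not start with 862"
mask = df.EXTENSION.str.startswith('862', na=False)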

Timings:

df = pd.DataFrame({'COUNTRY':['China','China','China', 'USA'],
                   'COSTCENTER':['1000','2000','6000','1000'],
                   'EXTENSION':['86212','11862','1000', '8555']})

#[400000 rows x 3 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
print (df)

In [330]: %timeit df.loc[(df.COUNTRY == "China") & (df.COSTCENTER.isin(["1000", "2000"])) & (df.EXTENSION.str.startswith('862')), "COUNTRY"] = 'China_new_name'
1 loop, best of 3: 198 ms per loop

In [331]: %timeit df.loc[(df.COUNTRY == "China") & ((df.COSTCENTER == "1000") | (df.COSTCENTER == "2000")) & (df.EXTENSION.str.startswith('862')), "COUNTRY"] = 'China_new_name'
1 loop, best of 3: 238 ms per loop

In [332]: %timeit df.loc[df.COUNTRY.str.match("China") & (df.COSTCENTER.str.match("1000") | df.COSTCENTER.str.match("2000")), "COUNTRY"] = 'China_new_name'
1 loop, best of 3: 745 ms per loop
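The gap is expected: == and isin are vectorised equality checks, while str.match compiles and applies a regular expression to every value, which costs more.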

1 Comment

Ah, this is pretty cool. Thanks for running the different tests. I'll have to look at this later so I can do it myself in the future.
