
I have a table with 150,000 rows and 15 columns. The relevant columns for this example are COUNTRY, COSTCENTER and EXTENSION. I am reading a CSV into a pandas DataFrame, and all columns are of type object.
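For reference, a minimal sketch of the read step (the file name is only a placeholder); dtype=str keeps every column as object/string, matching the setup just described:

import pandas as pd

# Placeholder file name; dtype=str reads every column as object/string.
df = pd.read_csv("extensions.csv", dtype=str)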

What I want to do is:

  1. Search for a certain COUNTRY (e.g. "China")
  2. Filter those rows down to the ones where COSTCENTER is either 1000 or 2000, or where EXTENSION starts with "862"
  3. Once all filters have been applied, change the country name in COUNTRY to something new.

I had a solution, but it always raised a warning about chained indexing:

df.COUNTRY[df.COUNTRY.str.match("China") &
           (df.COSTCENTER.str.match("1000") |
            df.COSTCENTER.str.match("2000"))] = 'China_new_name'

I cannot say I fully understood why this causes problems, so I looked for an alternative. I tried lambda and apply, but kept getting all sorts of errors.
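(For reference, a row-wise apply that does run would look roughly like the sketch below; the column names are the ones from above, and this is usually far slower than the mask-based solutions in the answers.)

# Sketch only: inspect each row individually and rebuild COUNTRY.
# Avoids chained indexing, but row-wise apply is typically much slower
# than a vectorised boolean mask.
df["COUNTRY"] = df.apply(
    lambda row: "China_new_name"
    if row["COUNTRY"] == "China" and row["COSTCENTER"] in ("1000", "2000")
    else row["COUNTRY"],
    axis=1,
)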

My latest approach was:

filter_China = df.ix[(df["COUNTRY"] == "China") &
                     ((df["COSTCENTER"] == "1000") | (df["COSTCENTER"] == "2000"))]

and it seems to filter what I am looking for (I did not include the search on EXTENSION yet, as I first wanted this part to work).

But when I try to change a value based on my search criteria, I run into trouble:

df.ix[(df["COUNTRY"]=="China") & ((df["COSTCENTER"]=="1000") | 
(df["COSTCENTER"]=="2000")), df["COUNTRY"]] = "China_new_name"

I am getting this error: raise KeyError('%s not in index' % objarr[mask])

What am I missing here? Is this the right approach, or would I need to go a totally different route?


2 Answers


You need to read the section of the documentation on chained indexing and the SettingWithCopyWarning. Assign with a single .loc call instead:

df.loc[df.COUNTRY.str.match("China") &
       (df.COSTCENTER.str.match("1000") |
        df.COSTCENTER.str.match("2000")), "COUNTRY"] = 'China_new_name'
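To make the difference concrete, a minimal sketch with made-up data, contrasting chained indexing with a single .loc call:

import pandas as pd

df = pd.DataFrame({"COUNTRY": ["China", "USA"], "COSTCENTER": ["1000", "3000"]})
mask = (df["COUNTRY"] == "China") & (df["COSTCENTER"] == "1000")

# Chained indexing: df["COUNTRY"] first returns a Series, and the assignment
# then happens on that intermediate object, which may be a copy of the
# original data. That is what the SettingWithCopyWarning is about.
df["COUNTRY"][mask] = "China_new_name"

# Single .loc call: row mask and column label go into one indexing operation
# on df itself, so the assignment is guaranteed to hit the original frame.
df.loc[mask, "COUNTRY"] = "China_new_name"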

5 Comments

Thank you for your answer. I have been reading the documentation, but my knowledge of Python / pandas is still too low to understand it completely; I only started a couple of weeks ago... Can you please give me an idea why you changed from df["COUNTRY"]=="China" to str.match? And overall, is this the right approach (e.g. speed-wise)?
I think str.match is not necessary. See my answer.
Also it is slower; comparing with == is faster, even without the str.startswith condition.
@jezrael In my use case == didn't work, but str.match worked well. I don't know the cause though.
@gneusch - The reason is clear: you need to match substrings here
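To illustrate the difference the last two comments are discussing, a small sketch: == is an exact comparison, while str.match anchors a regular expression at the start of each string, so it also matches longer values.

import pandas as pd

s = pd.Series(["1000", "10005", "2000"])

print(s == "1000")          # True, False, False  (exact equality)
print(s.str.match("1000"))  # True, True,  False  (regex match at start of string)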

I think you need to compare with == and, to check the start of a string, use str.startswith:

df = pd.DataFrame({'COUNTRY':['China','China','China', 'USA'],
                   'COSTCENTER':['1000','2000','6000','1000'],
                   'EXTENSION':['86212','11862','1000', '8555']})

print (df)
  COSTCENTER COUNTRY EXTENSION
0       1000   China     86212
1       2000   China     11862
2       6000   China      1000
3       1000     USA      8555

df.loc[(df.COUNTRY == "China") & ((df.COSTCENTER == "1000") | (df.COSTCENTER == "2000")) & 
       (df.EXTENSION.str.startswith('862')), "COUNTRY"] = 'China_new_name'

print (df)
  COSTCENTER         COUNTRY EXTENSION
0       1000  China_new_name     86212
1       2000           China     11862
2       6000           China      1000
3       1000             USA      8555

Another solution uses isin to compare against multiple values of a column:

df.loc[(df.COUNTRY == "China") & (df.COSTCENTER.isin(["1000", "2000"])) & 
       (df.EXTENSION.str.startswith('862')), "COUNTRY"] = 'China_new_name'

print (df)
  COSTCENTER         COUNTRY EXTENSION
0       1000  China_new_name     86212
1       2000           China     11862
2       6000           China      1000
3       1000             USA      8555
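One practical note, assuming EXTENSION could contain missing values in the real data: str.startswith returns NaN for them, which breaks the boolean mask, so passing na=False is the usual fix:

# Treat missing EXTENSION values as "does not start with 862"
mask = df.EXTENSION.str.startswith('862', na=False)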

Timings:

df = pd.DataFrame({'COUNTRY':['China','China','China', 'USA'],
                   'COSTCENTER':['1000','2000','6000','1000'],
                   'EXTENSION':['86212','11862','1000', '8555']})

#[400000 rows x 3 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
print (df)

In [330]: %timeit df.loc[(df.COUNTRY == "China") & (df.COSTCENTER.isin(["1000", "2000"])) & (df.EXTENSION.str.startswith('862')), "COUNTRY"] = 'China_new_name'
1 loop, best of 3: 198 ms per loop

In [331]: %timeit df.loc[(df.COUNTRY == "China") & ((df.COSTCENTER == "1000") | (df.COSTCENTER == "2000")) & (df.EXTENSION.str.startswith('862')), "COUNTRY"] = 'China_new_name'
1 loop, best of 3: 238 ms per loop

In [332]: %timeit df.loc[df.COUNTRY.str.match("China") & (df.COSTCENTER.str.match("1000") | df.COSTCENTER.str.match("2000")), "COUNTRY"] = 'China_new_name'
1 loop, best of 3: 745 ms per loop
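The gap is expected: == and isin are vectorised equality checks, while str.match compiles and applies a regular expression to every value, which costs more.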

1 Comment

Ah, this is pretty cool. Thanks for running the different tests. I'll have to look at this later so I can do it myself in the future.
