Efficient replace value by value in other column

Question

I am trying to replace the values in one column with the values from another column wherever the first column equals a given string, "wo". Wherever "wo" shows up in column y, it should be replaced by the value of column x in the same row. Currently I use the following code:

df.y.replace("wo",df.x)

This runs for a very long time (with millions of observations, the calculation takes days).

Is there a more efficient way to do this replacement?

Just in case, the data looks as follows:

 y    x    other variables
 1    mo   something
 2    2    something
 3    3    something
 wo   >5   something
 4    4    something
 wo   7    something

It has to look like:

 y    x    other variables
 1    mo   something
 2    2    something
 3    3    something
 >5   >5   something
 4    4    something
 7    7    something

MaxU - stand with Ukraine · Accepted Answer · 2017-05-06 18:10:33Z

4

Try this:

df.loc[(df.y == 'wo'), 'y'] = df.x

This first selects only the rows where df.y == 'wo' and then assigns the x column's value to the y column in those rows. All other rows are left untouched.
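A minimal, self-contained sketch of this approach on the sample data from the question. Storing every value as a string is an assumption about how the real data is held; the name `mask` is only illustrative.

```python
import pandas as pd

# Sample data from the question, assumed to be stored as strings
df = pd.DataFrame({
    "y": ["1", "2", "3", "wo", "4", "wo"],
    "x": ["mo", "2", "3", ">5", "4", "7"],
})

# Boolean mask that is True only for the rows to be replaced
mask = df["y"] == "wo"

# Assign x to y on the masked rows; pandas aligns the right-hand side by index
df.loc[mask, "y"] = df["x"]

print(df)
```

Because the comparison and the assignment are both vectorized, the work is done on whole columns at once rather than value by value.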

Timeit report:

In [304]: %timeit df.y.replace("wo",df.x)
100 loops, best of 3: 13.9 ms per loop

In [305]: %timeit df.loc[(df.y == 'wo'), 'y'] = df.x
100 loops, best of 3: 3.31 ms per loop

In [306]: %timeit df.ix[(df.y == 'wo'), 'y'] = df.x
100 loops, best of 3: 3.31 ms per loop

UPDATE: the .ix indexer was deprecated in the pandas 0.20 series in favor of the stricter .iloc and .loc indexers, and it was removed in pandas 1.0, so use .loc.

edited May 6, 2017 at 18:10

answered Mar 21, 2016 at 14:00


6 Comments

Ami Tavory Over a year ago

Do you have an indication that this actually speeds something up? If it does, I think it should be filed as a performance bug for pd's replace method.

Peter Over a year ago

I will test speed now.

Ami Tavory Over a year ago

@MaxU Hmm, interesting.

Peter Over a year ago

Just tested it for speed. My original code is still running (and considering this is 99,99% of the code that has been running for over a week it probably will be for a while). Your code finished in a couple of minutes. Insane performance upgrade, thank you !!!

MaxU - stand with Ukraine Over a year ago

@Peter, i'm glad i could help


user25064 · Answer · 2016-03-21 14:02:41Z

2

First, pandas should be told that the string value "wo" represents a missing value (IEEE double NaN, i.e. numpy.nan). See for example the na_values parameter of the read_csv function here. This allows the entire column to be stored as double, which will increase efficiency. Then use something like this to replace the NaN values with the values from the other column.
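One way this could look, as a rough sketch: the inline CSV text and the variable names are assumptions made for illustration, and `fillna` is used here as one plausible way to fill the NaN values from the other column, not necessarily the method the answer linked to.

```python
import io
import pandas as pd

# Hypothetical CSV content mirroring the sample data in the question
csv_text = """y,x
1,mo
2,2
3,3
wo,>5
4,4
wo,7
"""

# Treat "wo" in column y as a missing value while parsing
df = pd.read_csv(io.StringIO(csv_text), na_values={"y": ["wo"]})

# y now holds only numbers and NaN, so it is stored as float64
y_dtype_after_read = df["y"].dtype

# Fill the missing entries in y with the value of x from the same row
filled = df["y"].fillna(df["x"])

print(y_dtype_after_read)
print(filled)
```

Note that when x contains strings such as ">5", filling from it brings non-numeric values back into the result, so the filled column is no longer purely numeric.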

answered Mar 21, 2016 at 14:02


2 Comments

Peter Over a year ago

I assume this means that other non-numerics also have to be converted to NA? There can be several symbols and letter combinations in there. Converting the letter combinations is not a problem (although I assume I have to list them all specifically?), but sometimes the value is ">5". Would this make it trickier? (Sorry for not fully specifying this in the question.)

user25064 Over a year ago

I have used the na_values extensively and IF these values really are irrelevant and should be replaced, then they can be specified in iterable form (think list) to that argument. If the values are not irrelevant then depending on the source, the information they provide should perhaps be put off into another categorical data column with N/A values fed into the original column. Edit: the ">5" is not a problem because that doesn't parse as a double.
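The idea of moving the non-numeric information into a separate categorical column could be sketched as follows. The column name `x_flag` and the use of `pd.to_numeric` with `errors="coerce"` are assumptions chosen for illustration; the comment itself does not prescribe a particular method.

```python
import pandas as pd

# Hypothetical raw column mixing numbers with labels such as "mo" and ">5"
raw = pd.Series(["1", "mo", ">5", "7", "wo"], name="x")

# Anything that does not parse as a number becomes NaN
numeric = pd.to_numeric(raw, errors="coerce")

# Keep the original label only where parsing failed, as a categorical column
flags = raw.where(numeric.isna()).astype("category")

out = pd.DataFrame({"x": numeric, "x_flag": flags})
print(out)
```

This keeps the numeric column as float64 while the labels remain available for later inspection.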
