I am trying to replace the values in one column with those from another column wherever the first column equals a particular string, "wo". Wherever "wo" appears in column y, it should be replaced by the value of column x in the same row. Currently I use the following code:

df.y.replace("wo",df.x) 

This runs for a very long time (with millions of observations, it amounts to days of computation).

Is there a more efficient way to do this replacement?

Just in case, the data looks as follows:

 y    x    other variables
 1    mo   something
 2    2    something
 3    3    something
 wo   >5   something
 4    4    something
 wo   7    something

It has to look like:

 y    x    other variables
 1    mo   something
 2    2    something
 3    3    something
 >5   >5   something
 4    4    something
 7    7    something

2 Answers

Try this:

df.loc[(df.y == 'wo'), 'y'] = df.x

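This first selects only the rows where df.y == 'wo' and then assigns the x column's value to the y column in those rows. The assignment aligns on the index, so each selected row receives the x value from the same row.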
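As a minimal sketch of the approach above, the following rebuilds the sample data from the question (storing every value as a string is an assumption, since the question does not state the dtypes) and applies the masked assignment. The name mask is introduced here only for readability.

```python
import pandas as pd

# Sample data from the question; storing all values as strings is an assumption.
df = pd.DataFrame({
    "y": ["1", "2", "3", "wo", "4", "wo"],
    "x": ["mo", "2", "3", ">5", "4", "7"],
})

# Boolean mask that is True only on the rows where y holds the placeholder "wo".
mask = df["y"] == "wo"

# Overwrite y on the masked rows with the x value from the same row.
df.loc[mask, "y"] = df.loc[mask, "x"]

print(df)
```

Selecting df.loc[mask, "x"] on the right-hand side is equivalent here to passing df.x, because the assignment aligns on the index either way; it just makes the intent explicit.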

Timeit report:

In [304]: %timeit df.y.replace("wo",df.x)
100 loops, best of 3: 13.9 ms per loop

In [305]: %timeit df.loc[(df.y == 'wo'), 'y'] = df.x
100 loops, best of 3: 3.31 ms per loop

In [306]: %timeit df.ix[(df.y == 'wo'), 'y'] = df.x
100 loops, best of 3: 3.31 ms per loop

UPDATE: starting from pandas 0.20 the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers, and it was removed entirely in pandas 1.0, so use the .loc version above.

6 Comments

Do you have any indication that this actually speeds things up? If it does, I think it should be filed as a performance bug against pandas' replace method.
I will test speed now.
@MaxU Hmm, interesting.
Just tested it for speed. My original code is still running (and considering that this step is 99.99% of code that has been running for over a week, it probably will be for a while). Your code finished in a couple of minutes. Insane performance upgrade, thank you!
@Peter, I'm glad I could help.

First, pandas should be told that the string value "wo" represents an IEEE double NaN (aka numpy nan). See for example the na_values parameter of the read_csv method. This allows the entire column to be stored as double, which increases efficiency. Then replace the NaN values with the values from the other column.
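A minimal sketch of this idea follows. The inline CSV text and its purely numeric x column are assumptions made for illustration (the question's real x column also contains non-numeric entries), and Series.fillna is one possible way to do the final replacement, not necessarily the one this answer had in mind.

```python
import io

import pandas as pd

# Hypothetical CSV input; x is kept purely numeric here for simplicity.
csv_text = """y,x
1,10
2,2
3,3
wo,5
4,4
wo,7
"""

# Treat "wo" as missing while parsing, so y is read as a float column.
df = pd.read_csv(io.StringIO(csv_text), na_values=["wo"])

# Fill the missing y values with the x value from the same row.
df["y"] = df["y"].fillna(df["x"])

print(df)
```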

2 Comments

I assume this means that other non-numeric values also have to be converted to NA? There can be several symbols and letter combinations in there. Converting the letter combinations is not a problem (although I assume I have to list them all specifically?), but sometimes the value is ">5". Would that make it trickier? (Sorry for not fully specifying this in the question.)
I have used the na_values extensively and IF these values really are irrelevant and should be replaced, then they can be specified in iterable form (think list) to that argument. If the values are not irrelevant then depending on the source, the information they provide should perhaps be put off into another categorical data column with N/A values fed into the original column. Edit: the ">5" is not a problem because that doesn't parse as a double.
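The suggestion in the last comment, moving non-numeric entries into a separate column and leaving N/A in the original, might look roughly like the sketch below. It uses pd.to_numeric with errors="coerce" as a stand-in technique that the comment itself does not mention, and the names raw, numeric and flag are hypothetical.

```python
import pandas as pd

# Hypothetical raw column mixing numbers with tokens such as "mo" and ">5".
raw = pd.Series(["1", "mo", ">5", "7"], name="x")

# Parse what can be parsed; anything non-numeric becomes NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Keep the original text only where parsing failed, NaN elsewhere.
flag = raw.where(numeric.isna())

print(pd.DataFrame({"x_numeric": numeric, "x_flag": flag}))
```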
