1

I am writting script in Python and I am looking for optimal solution of following problem:

I have big pandas dataframe (at least 100k rows) and if there are rows with the same value in col2 but different value in col3 then I want to change all values in col3 for A

For example:

----------------------
| col1 | col2 | col3 |
----------------------
|   a  |   1  |   A  |
----------------------
|   b  |   2  |   A  |
----------------------
|   c  |   2  |   B  |
----------------------
|   d  |   2  |   B  |
----------------------
|   e  |   3  |   B  |
----------------------
|   f  |   3  |   B  |
----------------------

should look like this:

----------------------
| col1 | col2 | col3 |
----------------------
|   a  |   1  |   A  |
----------------------
|   b  |   2  |   A  |
----------------------
|   c  |   2  |   A  |
----------------------
|   d  |   2  |   A  |
----------------------
|   e  |   3  |   B  |
----------------------
|   f  |   3  |   B  |
----------------------

I solved that problem by sorting dataframe over col2 and iterating over rows, whenever value in col2 changes and in "block" of the same col2 values are different values I change col3 value but this algorithm takes around 60s for 100k rows and I am looking for more sufficient answer.

2
  • the value would be updated to 'A' everytime or there is some rule to which value you want to update ? Commented Aug 8, 2019 at 7:59
  • you can try using pyspark Commented Aug 8, 2019 at 8:13

1 Answer 1

2

Use GroupBy.transform with DataFrameGroupBy.nunique for test number of unique values and set new values by condition in DataFrame.loc:

df.loc[df.groupby('col2')['col3'].transform('nunique') != 1, 'col3'] = 'A' 
print (df)
  col1  col2 col3
0    a     1    A
1    b     2    A
2    c     2    A
3    d     2    A
4    e     3    B
5    f     3    B

Details:

First check number of unique values per groups with transform for same size Series like original DataFrame:

print (df.groupby('col2')['col3'].transform('nunique'))
0    1
1    2
2    2
3    2
4    1
5    1
Name: col3, dtype: int64

And thentest for not equal:

print (df.groupby('col2')['col3'].transform('nunique') != 1)
0    False
1     True
2     True
3     True
4    False
5    False
Name: col3, dtype: bool

Last overwrite True rows by value A.

Sign up to request clarification or add additional context in comments.

1 Comment

beat me to it! :D

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.