I am writing a script in Python and I am looking for an optimal solution to the following problem:
I have a big pandas DataFrame (at least 100k rows). If there are rows with the same value in col2 but different values in col3, I want to change all col3 values in those rows to A.
For example:
----------------------
| col1 | col2 | col3 |
----------------------
| a | 1 | A |
----------------------
| b | 2 | A |
----------------------
| c | 2 | B |
----------------------
| d | 2 | B |
----------------------
| e | 3 | B |
----------------------
| f | 3 | B |
----------------------
should look like this:
----------------------
| col1 | col2 | col3 |
----------------------
| a | 1 | A |
----------------------
| b | 2 | A |
----------------------
| c | 2 | A |
----------------------
| d | 2 | A |
----------------------
| e | 3 | B |
----------------------
| f | 3 | B |
----------------------
I solved the problem by sorting the DataFrame by col2 and iterating over the rows: whenever the value in col2 changes, I check whether the preceding "block" of equal col2 values contains different col3 values, and if it does, I change the col3 values in that block to A. However, this algorithm takes around 60 s for 100k rows, and I am looking for a more efficient solution.
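
For reference, the rule described above could probably be expressed without an explicit loop. The following is only a minimal sketch, assuming the columns are literally named col1, col2 and col3 and that the replacement value is the string "A"; the variable name distinct_per_group is made up for illustration. It counts the distinct col3 values within each col2 group and overwrites col3 where that count exceeds one:

```python
import pandas as pd

# Example data from the tables above.
df = pd.DataFrame({
    "col1": ["a", "b", "c", "d", "e", "f"],
    "col2": [1, 2, 2, 2, 3, 3],
    "col3": ["A", "A", "B", "B", "B", "B"],
})

# For every row, the number of distinct col3 values in its col2 group.
# transform() returns a Series aligned with df's index, so no sorting is needed.
distinct_per_group = df.groupby("col2")["col3"].transform("nunique")

# Groups with more than one distinct col3 value get "A" in every row.
df.loc[distinct_per_group > 1, "col3"] = "A"

print(df)
```

Note that nunique ignores NaN by default, so a group holding one value plus missing entries would count as having a single distinct value; whether that is the desired behaviour depends on the data.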