4

I have a dataframe that looks something like this:

1   2  3  'String'
''  4  X  ''
''  5  X  ''
''  6  7  'String'
''  1  Y  ''

I want to replace the Xs and Ys (placeholders used here just for illustration) with the value from the same column in the most recent row whose last column is 'String'. So the Xs would become 3, and the Y would become 7:

1  2  3 'String'
'' 4  3 ''
'' 5  3 ''
'' 6  7 'String'
'' 1  7 ''

The reference value stays the same until another 'parent' row comes along. So the first 3 remains in effect until the next 'String' parent row.

I tried generating another dataframe containing where there's 'String' and filling from idx to idx+1 with the value, but it's too slow.

This is really similar to a forward fill (df.ffill()), but not exactly, and I don't know whether it's feasible to turn my problem into an ffill() problem.

3
  • are the values X and Y null or just X and Y? Commented Jul 30 at 15:33
  • They're random int values Commented Jul 30 at 16:23
  • I updated my solution, it now relies on df['D'] being 'String' Commented Jul 30 at 16:33

4 Answers

5

Updated solution:

This can be solved using .ffill(); you just have to replace the random int values with `NaN` first:

import numpy as np
df.loc[df['D'] != 'String', 'C'] = np.nan

This selects the rows where df['D'] is not 'String' and sets column C to NaN in those rows.

Now the last step is simple: just use .ffill()

df['C'] = df['C'].ffill()

Here is the final result:

>>> df
   C    D
0  3.0  String
1  3.0        
2  3.0        
3  7.0  String
4  7.0        
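If the float result is unwanted, the same two steps can be followed by a cast back to an integer dtype. The following is only a minimal sketch: it assumes the column names C and D used above, made-up values (40, 50, 10) standing in for the random ints, and a first row that is a 'String' row so that no NaN is left after the fill.

```python
import numpy as np
import pandas as pd

# Toy frame; 40, 50 and 10 are placeholder values that should be overwritten.
df = pd.DataFrame({
    "C": [3, 40, 50, 7, 10],
    "D": ["String", "", "", "String", ""],
})

# Make C a float column up front so that it can hold NaN.
df["C"] = df["C"].astype("float64")

# Blank out every value that is not on a 'String' row.
df.loc[df["D"] != "String", "C"] = np.nan

# Forward-fill from the 'String' rows, then restore the integer dtype.
df["C"] = df["C"].ffill().astype("int64")

print(df)
```

The final cast would fail if any NaN remained, which is why the sketch assumes the frame starts with a 'String' row.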

4 Comments

Works too, assuming floats are OK, since assigning NaN to an int column will promote it to float. Most of the time that should not be a problem, since float64 represents integers exactly up to 2**53 (about 9 quadrillion).
If it is a big problem, then it's possible to convert the floats back to int.
I used this one and it was really quick, thank you!
No problem! Glad to help!
3

Starting from this example:

import pandas as pd
df = pd.DataFrame({'C': [3, 'X', 'X', 7, 'Y'], 'D': ['String', '', '', 'String', '']})  # my own [mre]. You should have included that line in your question ;-)

So df is

   C       D
0  3  String
1  X        
2  X        
3  7  String
4  Y        

(don't worry, the X and Y have no influence on the result. I just included them to imitate your example)

What you are looking for is probably something like:

df['C'] = df.groupby((df['D']=='String').cumsum())['C'].transform('first')

Result:

   C       D
0  3  String
1  3        
2  3        
3  7  String
4  7        

To understand it, it is worth looking at what (df['D']=='String').cumsum() does. df['D']=='String' is just a boolean Series (True where the last column is 'String', False elsewhere). If you apply .cumsum() to such a Series, True is treated as 1 and False as 0. So what you get is a counter that is incremented each time there is a 'String' and stays unchanged otherwise:

>>> (df['D']=='String').cumsum()
0    1
1    1
2    1
3    2
4    2
Name: D, dtype: int64

Which is exactly what you need to group your rows by, to have one group for each row with 'String' and all following rows without (till the next 'String').
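To see the grouping concretely, here is a small sketch that lists which row labels fall into each group. It assumes the same toy df as above; the names key and groups are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "C": [3, "X", "X", 7, "Y"],
    "D": ["String", "", "", "String", ""],
})

# Running count of 'String' rows: True counts as 1, False as 0.
key = (df["D"] == "String").cumsum()

# Map each counter value to the row labels that share it.
groups = {int(k): idx.tolist() for k, idx in df.groupby(key).groups.items()}

print(groups)
```

Rows 0 to 2 share counter value 1 and rows 3 and 4 share counter value 2, so each group starts with its own 'String' row.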

Now, just transform C to take the first value of each group, and voila

df['C'] = df.groupby((df['D']=='String').cumsum())['C'].transform('first')

>>> df
   C       D
0  3  String
1  3        
2  3        
3  7  String
4  7        

Comments

2

Another possible solution:

df = df.assign(C = df['C'].where(df['D'].eq('String')).ffill().astype(int))

This creates a new version of df in which column C keeps only the values on 'String' rows and forward-fills them, leaving the other columns untouched. The expression df['D'].eq('String') identifies which entries in column D are 'String'. The where() method replaces the entries of C on all other rows with NaN, and then ffill() propagates the last kept value downward, effectively overwriting each non-'String' row with the value from the most recent 'String' row above it. Finally, astype(int) converts the result back to integers.

Output:

    A  B  C         D
0   1  2  3    String
1  ''  4  3        ''
2  ''  5  3        ''
3  ''  6  7    String
4  ''  1  7        ''
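The one-liner can also be unpacked into its intermediate steps, which may make it easier to follow. This is a sketch under assumptions: only columns C and D are included, the values 40, 50 and 10 are made up, and the names is_parent, kept and filled are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "C": [3, 40, 50, 7, 10],   # 40, 50, 10 are made-up values to overwrite
    "D": ["String", "", "", "String", ""],
})

is_parent = df["D"].eq("String")      # True on 'String' rows
kept = df["C"].where(is_parent)       # NaN wherever is_parent is False
filled = kept.ffill()                 # carry the last kept value downward
df = df.assign(C=filled.astype(int))  # cast back from float to int

print(kept)
print(df)
```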

3 Comments

+1. The downvote seems severe. Sure, it doesn't work if C is not a string column (though the OP's X and Y could have been string values in an otherwise numeric column). Since it is not, the solution could still work with df['D']=='String' instead of df['C'].str.isnumeric()
Thanks, I am updating my solution accordingly.
It works, but you should update the solution to assign the result back to df in your code.
1

You can select the wanted rows with boolean indexing and reindex with method='ffill':

df['C'] = df.loc[df['D'].eq('String'), 'C'].reindex(df.index, method='ffill')

Alternatively, for fun, and assuming you have an ordered index, you could select the rows to propagate with boolean indexing and combine them with the original input using a merge_asof on the index:

df['C'] = pd.merge_asof(df[[]], df.loc[df['D'].eq('String'), 'C'],
                        left_index=True, right_index=True)

Or as a new DataFrame:

out = (pd.merge_asof(df.drop(columns='C'),
                     df.loc[df['D'].eq('String'), 'C'],
                     left_index=True, right_index=True)
         .reindex_like(df)
      )

Output:

   A  B  C       D
0  1  2  3  String
1     4  3        
2     5  3        
3     6  7  String
4     1  7        
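For comparison, here is a self-contained sketch that runs the reindex and merge_asof variants side by side on a toy frame. It assumes a sorted integer index and made-up values (40, 50, 10) for the rows to overwrite; the names parents, via_reindex and via_asof are illustrative, and the 'String' rows are selected as a one-column DataFrame rather than a Series.

```python
import pandas as pd

df = pd.DataFrame({
    "C": [3, 40, 50, 7, 10],   # 40, 50, 10 are made-up values to overwrite
    "D": ["String", "", "", "String", ""],
})

# Values of C on the 'String' rows only, still carrying their original labels.
parents = df.loc[df["D"].eq("String"), ["C"]]

# Option 1: reindex onto the full index, forward-filling the missing labels.
via_reindex = parents["C"].reindex(df.index, method="ffill")

# Option 2: as-of merge on the index; the default direction is 'backward',
# so each row picks up the closest 'String' row at or before it.
via_asof = pd.merge_asof(
    df[["D"]], parents, left_index=True, right_index=True
)["C"]

print(via_reindex.tolist())
print(via_asof.tolist())
```

Both variants rely on the index being sorted, which is why the ordered-index assumption matters here.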

1 Comment

A very creative solution!
