
I'm writing a function that takes as input a dataframe and a "mask". The dataframe is assumed to have MultiIndex columns such as ("some string", 0.4): pairs whose second element is numeric. The mask is intended to be something like df < 2, df >= 4, etc.

The output should be a new table where every value that doesn't match the mask is left alone, and every value that does is replaced by the numeric part of its column's name.

NaNs in the input should be left alone (unless of course the mask is something like df.isna()).

This is what I've come up with (assume this is in a file called mytable.py):

import pandas as pd
import numpy as np


data = {
    ("A", 0.2): [4.0, 1.0, np.nan],
    ("B", 0.6): [0.0, np.nan, 4.0],
    ("C", 0.7): [0.0, 5.0, 1.0],
}
df = pd.DataFrame(data)


def replaced_with_colname(table, mask):
    series1 = (table[col][mask[col]] for col in table.columns)
    series2 = (s.apply(lambda x: s.name[1]) for s in series1)
    t2 = table.copy()
    for s in series2:
        t2.update(s)
    return t2

An example:

$ python3 -i mytable.py
>>> df
     A    B    C
   0.2  0.6  0.7
0  4.0  0.0  0.0
1  1.0  NaN  5.0
2  NaN  4.0  1.0
>>> replaced_with_colname(df, df>2)
     A    B    C
   0.2  0.6  0.7
0  0.2  0.0  0.0
1  1.0  NaN  0.7
2  NaN  0.6  1.0

It does the job, but it seems convoluted and probably slow, though I didn't benchmark it. My question is: is there a (more) "vectorized", idiomatic way of doing it, using more pandas methods and fewer for-loops?

2 Answers


It's a perfect use case for np.where: where the mask is True, take the second-level column values; otherwise keep the original value.

def replaced_with_colname(table, mask):
    # Use the function's parameters (not the global df), and use
    # get_level_values(1) so the level values stay aligned with the
    # actual column order (levels[1] is the unique, sorted set).
    data = np.where(mask, table.columns.get_level_values(1), table)
    return pd.DataFrame(data, index=table.index, columns=table.columns)
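Note that `levels[1]` returns the *unique, sorted* level values, which only happens to coincide with the column order when the columns are already sorted on that level (as in the question's df). A small sketch of the difference, with deliberately unsorted columns:

```python
import pandas as pd

# Columns deliberately out of sorted order in the second level
cols = pd.MultiIndex.from_tuples([("A", 0.7), ("B", 0.2)])

# .levels[1] is the unique level values in sorted order...
print(list(cols.levels[1]))            # [0.2, 0.7]

# ...while get_level_values(1) follows the actual column order
print(list(cols.get_level_values(1)))  # [0.7, 0.2]
```

With such columns, `np.where(mask, cols.levels[1], df)` would pair the wrong number with each column, so `get_level_values(1)` is the safer choice.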

Usage:

>>> replaced_with_colname(df, df>2)
     A    B    C
   0.2  0.6  0.7
0  0.2  0.0  0.0
1  1.0  NaN  0.7
2  NaN  0.6  1.0

>>> replaced_with_colname(df, df.isna())
     A    B    C
   0.2  0.6  0.7
0  4.0  0.0  0.0
1  1.0  0.6  5.0
2  0.2  4.0  1.0

>>> replaced_with_colname(df, (0<=df) & (df<=1) | df.isna())
     A    B    C
   0.2  0.6  0.7
0  4.0  0.6  0.7
1  0.2  0.6  5.0
2  0.2  4.0  0.7
2

You can approach this by using pandas.Index.get_level_values:

out = (
    df.gt(2)                                  # boolean mask
      .mul(df.columns.get_level_values(1))    # True -> level value, False -> 0
      .mask(lambda d: d.eq(0))                # drop the non-matches
      .combine_first(df)                      # restore the original values
)

(Note this assumes none of the second-level values is 0; a 0-valued column name would be masked out along with the non-matches.)

The comparison operators (eq, ne, le, lt, ge, gt) are equivalent to (==, !=, <=, <, >=, >).

Output:

print(out)

     A    B    C
   0.2  0.6  0.7
0  0.2  0.0  0.0
1  1.0  NaN  0.7
2  NaN  0.6  1.0

If you need a custom function:

def replace_with_colname(table, cond):
    out = (
        cond                                        # the boolean mask itself, not df[cond]
          .mul(table.columns.get_level_values(1))   # True -> level value, False -> 0
          .mask(~cond)                              # keep replacements only where cond holds
          .combine_first(table)                     # fill the rest (incl. NaNs) from the original
    )
    return out
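As a quick self-contained check (restating the question's df and the boolean-mask variant of the function above), NaNs in the input should survive a condition like `df <= 1`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    ("A", 0.2): [4.0, 1.0, np.nan],
    ("B", 0.6): [0.0, np.nan, 4.0],
    ("C", 0.7): [0.0, 5.0, 1.0],
})

def replace_with_colname(table, cond):
    return (
        cond                                        # boolean mask
          .mul(table.columns.get_level_values(1))   # True -> level value, False -> 0
          .mask(~cond)                              # NaN out everything cond didn't match
          .combine_first(table)                     # restore original values (incl. NaNs)
    )

out = replace_with_colname(df, df <= 1)
print(out)
# The NaNs at (2, "A") and (1, "B") are untouched,
# and every value <= 1 becomes its column's number.
```

Because `~cond` is False only where the condition matched, cells whose original value is NaN (where comparisons evaluate to False) are masked and then refilled from the original table, so they stay NaN.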

Comments

Hm, after changing df.gt(2) to mask (because the condition should be given by the function call, not hardcoded), it seems this usually works but sometimes replaces NaNs with the column names. For example, with replace_with_colname(df, df<=1) the NaN at iloc (2, 0) becomes 0.2.
I can't reproduce the issue. print(replace_with_colname(df, df<=1).iat[2,0]) gives nan. I updated my answer with the function used.
