
I'm writing a function that takes as input a dataframe and a "mask". The dataframe is assumed to have MultiIndex columns such as ("some string", 0.4): pairs whose second element is numeric. The mask is intended to be something like df < 2, df >= 4, etc.

The output should be a new table where every value that doesn't match the mask is left alone, and every value that does is replaced by the numeric part of its column's name.

NaNs in the input should be left alone (unless of course the mask is something like df.isna()).

This is what I've come up with (assume this is in a file called mytable.py):

import pandas as pd
import numpy as np


data = {
    ("A", 0.2): [4.0, 1.0, np.nan],
    ("B", 0.6): [0.0, np.nan, 4.0],
    ("C", 0.7): [0.0, 5.0, 1.0],
}
df = pd.DataFrame(data)


def replaced_with_colname(table, mask):
    series1 = (table[col][mask[col]] for col in table.columns)
    series2 = (s.apply(lambda x: s.name[1]) for s in series1)
    t2 = table.copy()
    for s in series2:
        t2.update(s)
    return t2

An example:

$ python3 -i mytable.py
>>> df
     A    B    C
   0.2  0.6  0.7
0  4.0  0.0  0.0
1  1.0  NaN  5.0
2  NaN  4.0  1.0
>>> replaced_with_colname(df, df>2)
     A    B    C
   0.2  0.6  0.7
0  0.2  0.0  0.0
1  1.0  NaN  0.7
2  NaN  0.6  1.0

It does the job, but it seems convoluted and probably slow, though I didn't benchmark it. My question is: is there a (more) "vectorized", idiomatic way of doing it, using more pandas methods and fewer for-loops?

2 Answers


It's a perfect use case for np.where: where the mask is True, take the second-level column values; otherwise keep the original value.

def replaced_with_colname(table, mask):
    # Use the function's parameters (not the global df), and use
    # get_level_values(1) so the level values stay aligned with the
    # actual column order (levels[1] is the unique, sorted set).
    data = np.where(mask, table.columns.get_level_values(1), table)
    return pd.DataFrame(data, index=table.index, columns=table.columns)
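Note that `levels[1]` returns the *unique, sorted* level values, which only happens to coincide with the column order when the columns are already sorted on that level (as in the question's df). A small sketch of the difference, with deliberately unsorted columns:

```python
import pandas as pd

# Columns deliberately out of sorted order in the second level
cols = pd.MultiIndex.from_tuples([("A", 0.7), ("B", 0.2)])

# .levels[1] is the unique level values in sorted order...
print(list(cols.levels[1]))            # [0.2, 0.7]

# ...while get_level_values(1) follows the actual column order
print(list(cols.get_level_values(1)))  # [0.7, 0.2]
```

With such columns, `np.where(mask, cols.levels[1], df)` would pair the wrong number with each column, so `get_level_values(1)` is the safer choice.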

Usage:

>>> replaced_with_colname(df, df>2)
     A    B    C
   0.2  0.6  0.7
0  0.2  0.0  0.0
1  1.0  NaN  0.7
2  NaN  0.6  1.0

>>> replaced_with_colname(df, df.isna())
     A    B    C
   0.2  0.6  0.7
0  4.0  0.0  0.0
1  1.0  0.6  5.0
2  0.2  4.0  1.0

>>> replaced_with_colname(df, (0<=df) & (df<=1) | df.isna())
     A    B    C
   0.2  0.6  0.7
0  4.0  0.6  0.7
1  0.2  0.6  5.0
2  0.2  4.0  0.7
2

You can approach this by using pandas.Index.get_level_values:

out = (
    df.gt(2)                                  # boolean mask
      .mul(df.columns.get_level_values(1))    # True -> level value, False -> 0
      .mask(lambda d: d.eq(0))                # drop the non-matches
      .combine_first(df)                      # restore the original values
)

(Note this assumes none of the second-level values is 0; a 0-valued column name would be masked out along with the non-matches.)

The comparison operators (eq, ne, le, lt, ge, gt) are equivalent to (==, !=, <=, <, >=, >).

Output:

print(out)

     A    B    C
   0.2  0.6  0.7
0  0.2  0.0  0.0
1  1.0  NaN  0.7
2  NaN  0.6  1.0

If you need a custom function:

def replace_with_colname(table, cond):
    out = (
        cond                                        # the boolean mask itself, not df[cond]
          .mul(table.columns.get_level_values(1))   # True -> level value, False -> 0
          .mask(~cond)                              # keep replacements only where cond holds
          .combine_first(table)                     # fill the rest (incl. NaNs) from the original
    )
    return out
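As a quick self-contained check (restating the question's df and the boolean-mask variant of the function above), NaNs in the input should survive a condition like `df <= 1`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    ("A", 0.2): [4.0, 1.0, np.nan],
    ("B", 0.6): [0.0, np.nan, 4.0],
    ("C", 0.7): [0.0, 5.0, 1.0],
})

def replace_with_colname(table, cond):
    return (
        cond                                        # boolean mask
          .mul(table.columns.get_level_values(1))   # True -> level value, False -> 0
          .mask(~cond)                              # NaN out everything cond didn't match
          .combine_first(table)                     # restore original values (incl. NaNs)
    )

out = replace_with_colname(df, df <= 1)
print(out)
# The NaNs at (2, "A") and (1, "B") are untouched,
# and every value <= 1 becomes its column's number.
```

Because `~cond` is False only where the condition matched, cells whose original value is NaN (where comparisons evaluate to False) are masked and then refilled from the original table, so they stay NaN.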

Comments

Hm, after changing df.gt(2) to mask (because the condition should be given by the function call, not hardcoded), it seems this usually works but sometimes replaces NaNs with the column names. For example, with replace_with_colname(df, df<=1) the NaN at iloc (2, 0) becomes 0.2.
I can't reproduce the issue. print(replace_with_colname(df, df<=1).iat[2,0]) gives nan. I updated my answer with the function used.
