Select subset of rows from pandas DataFrame using entries from a separate partial MultiIndex

Question

I have data in a pandas DataFrame with a MultiIndex. Let's call the labels of my MultiIndex "Run", "Trigger", and "Cluster". Separately, I have a list of pre-computed selection criteria that I get as a list of entries passing (these tend to be sparse, so listing passing indexes is most space efficient). The selection cuts may only be partially indexed, e.g. may only specify "Run" or ("Run", "Trigger") pairs.

How do I efficiently apply these cuts, ideally without having to inspect them to find their levels?

For example, consider the following data:

import numpy as np
import pandas
index = pandas.MultiIndex.from_product([[0,1,2],[0,1,2],[0,1]], names=['Run','Trigger','Cluster'])
df = pandas.DataFrame(np.random.rand(len(index),3), index=index, columns=['a','b','c'])
print(df)

                            a         b         c
Run Trigger Cluster                              
0   0       0        0.789090  0.776966  0.764152
            1        0.196648  0.635954  0.479195
    1       0        0.007268  0.675339  0.966958
            1        0.055030  0.794982  0.660357
    2       0        0.987798  0.907868  0.583545
            1        0.114886  0.839434  0.070730
1   0       0        0.520827  0.626102  0.088976
            1        0.377423  0.934224  0.404226
    1       0        0.081669  0.485830  0.442296
            1        0.620439  0.537927  0.406362
    2       0        0.155784  0.243656  0.830895
            1        0.734176  0.997579  0.226272
2   0       0        0.867951  0.353823  0.541483
            1        0.615694  0.202370  0.229423
    1       0        0.912423  0.239199  0.406443
            1        0.188609  0.053396  0.222914
    2       0        0.698515  0.493518  0.201951
            1        0.415195  0.975365  0.687365

Selection criteria may take any of the following forms:

set1:
Int64Index([0], dtype='int64', name='Run')

set2:
MultiIndex([(0, 1),
            (1, 2)],
           names=['Run', 'Trigger'])

set3:
MultiIndex([(0, 0, 1),
            (1, 0, 1),
            (2, 1, 0)],
           names=['Run', 'Trigger', 'Cluster'])

Application of these selection lists using a hypothetical select method would result in:

>>> print(df.select(set1))
                            a         b         c
Run Trigger Cluster                              
0   0       0        0.789090  0.776966  0.764152
            1        0.196648  0.635954  0.479195
    1       0        0.007268  0.675339  0.966958
            1        0.055030  0.794982  0.660357
    2       0        0.987798  0.907868  0.583545
            1        0.114886  0.839434  0.070730

>>> print(df.select(set2))
                            a         b         c
Run Trigger Cluster                              
0   1       0        0.007268  0.675339  0.966958
            1        0.055030  0.794982  0.660357
1   2       0        0.155784  0.243656  0.830895
            1        0.734176  0.997579  0.226272

>>> print(df.select(set3))
                            a         b         c
Run Trigger Cluster                              
0   0       1        0.196648  0.635954  0.479195
1   0       1        0.377423  0.934224  0.404226
2   1       0        0.912423  0.239199  0.406443

pandas can join these kinds of mixed-level indices easily, so it seems like this should be a straightforward operation, but I can't figure out the right calls. loc works for set3 because the indices are the same depth, but I need a general solution.

Is your final output a combination of the three dataframes? Could you post an expected output?
– sammywemmy, Commented Mar 19, 2020 at 1:07
@sammywemmy This is an example of the first stage of each of 3 completely decoupled analyses. We might take the output after set1 and fill a histogram, the output from set2 to train a BDT, etc. They aren't really related other than that they all share this common first step.
– thegreatemu, Commented Mar 19, 2020 at 15:54

Code Different · Accepted Answer · 2020-03-19 01:20:39Z

1

df.loc[set3] works because set3 has all 3 levels of the index. You can mimic this behavior for set1 and set2 by replacing the missing levels with slice(None):

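def select(df, index):
    # Build one indexer per level of df.index, in df's level order.
    slicer = []
    for name in df.index.names:
        if name in index.names:
            # Level present in the selection: use its listed values.
            values = index.get_level_values(name).values
        else:
            # Level missing from the selection: keep everything.
            values = slice(None)
        slicer.append(values)

    # Note: .loc treats the per-level lists independently, so rows are
    # matched level by level (a cross product), not tuple by tuple.
    return df.loc[tuple(slicer), :]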

Then you can use:

select(df, set1)
select(df, set2)
select(df, set3)

If you want it as a method on the dataframe:

import pandas as pd
pd.DataFrame.select = select
df.select(set1) # etc.

Note that this will ignore levels in index that do not exist in df.index:

# there's no level "FooBar" in df
set4 = pd.MultiIndex.from_tuples([(0, 42)], names=['Trigger', 'FooBar'])
df.select(set4) # works just fine

I haven't tested the performance; it is probably not fast if you call it in a tight loop.
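
One caveat worth checking before relying on select: when .loc receives a tuple of per-level lists, it selects every combination of those values, so a selection such as set2 = [(0, 1), (1, 2)] would also return the (0, 2) and (1, 1) groups. The sketch below is one possible tuple-wise alternative, not a pandas built-in. The name select_rows and the deterministic column "a" are assumptions made for illustration; the approach restricts both indexes to their shared level names and uses MultiIndex.isin as a row mask.

```python
import numpy as np
import pandas as pd


def select_rows(df, index):
    """Keep rows of df whose shared index levels match an entry of index.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Level names present in both indexes, in df's level order.
    shared = [name for name in df.index.names if name in index.names]
    if not shared:
        raise ValueError("no index level names in common")
    # Restrict both sides to the shared levels so entries compare as tuples.
    keys = pd.MultiIndex.from_arrays(
        [df.index.get_level_values(name) for name in shared]
    )
    wanted = pd.MultiIndex.from_arrays(
        [index.get_level_values(name) for name in shared]
    )
    # Boolean mask: True where the row's partial key is one of the wanted keys.
    return df[keys.isin(wanted)]


# Same index layout as the question, with a deterministic column (assumption).
index = pd.MultiIndex.from_product(
    [[0, 1, 2], [0, 1, 2], [0, 1]], names=["Run", "Trigger", "Cluster"]
)
df = pd.DataFrame({"a": np.arange(len(index))}, index=index)

set1 = pd.Index([0], name="Run")
set2 = pd.MultiIndex.from_tuples([(0, 1), (1, 2)], names=["Run", "Trigger"])
set3 = pd.MultiIndex.from_tuples(
    [(0, 0, 1), (1, 0, 1), (2, 1, 0)], names=["Run", "Trigger", "Cluster"]
)

# Per-level lists in .loc: Run in {0, 1} crossed with Trigger in {1, 2}.
cross = df.loc[([0, 1], [1, 2], slice(None)), :]
# Tuple-wise match: only the (0, 1) and (1, 2) groups.
exact = select_rows(df, set2)
print(len(cross), len(exact))
```

Like select, this sketch ignores level names in index that df.index does not have, and it keeps the rows in df's original order.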


+1, this is a nice approach. I still feel like there must be a built-in way to achieve this, since I can do joins with dataframes using mismatched indexes like this. But the fact that your approach will ignore extra names in index is a nice bonus.
– thegreatemu

thegreatemu · Accepted Answer · 2020-03-24 22:07:42Z

0

One way to achieve this using pure pandas is the following:

df.align(setN.to_series(), axis=0, join='inner')[0]

That is, convert the 'other' index to a Series and select the parts of each that would be kept during an inner join operation.

