1

I have data in a pandas DataFrame with a MultiIndex. Let's call the labels of my MultiIndex "Run", "Trigger", and "Cluster". Separately, I have a list of pre-computed selection criteria that I get as a list of entries passing (these tend to be sparse, so listing passing indexes is most space efficient). The selection cuts may only be partially indexed, e.g. may only specify "Run" or ("Run", "Trigger") pairs.

How do I efficiently apply these cuts, ideally without having to inspect them to find their levels?

For example, consider the following data:

import numpy as np
import pandas

index = pandas.MultiIndex.from_product([[0, 1, 2], [0, 1, 2], [0, 1]], names=['Run', 'Trigger', 'Cluster'])
df = pandas.DataFrame(np.random.rand(len(index), 3), index=index, columns=['a', 'b', 'c'])
print(df)

                            a         b         c
Run Trigger Cluster                              
0   0       0        0.789090  0.776966  0.764152
            1        0.196648  0.635954  0.479195
    1       0        0.007268  0.675339  0.966958
            1        0.055030  0.794982  0.660357
    2       0        0.987798  0.907868  0.583545
            1        0.114886  0.839434  0.070730
1   0       0        0.520827  0.626102  0.088976
            1        0.377423  0.934224  0.404226
    1       0        0.081669  0.485830  0.442296
            1        0.620439  0.537927  0.406362
    2       0        0.155784  0.243656  0.830895
            1        0.734176  0.997579  0.226272
2   0       0        0.867951  0.353823  0.541483
            1        0.615694  0.202370  0.229423
    1       0        0.912423  0.239199  0.406443
            1        0.188609  0.053396  0.222914
    2       0        0.698515  0.493518  0.201951
            1        0.415195  0.975365  0.687365

Selection criteria may take any of the following forms:

set1:
Int64Index([0], dtype='int64', name='Run')

set2:
MultiIndex([(0, 1),
            (1, 2)],
           names=['Run', 'Trigger'])

set3:
MultiIndex([(0, 0, 1),
            (1, 0, 1),
            (2, 1, 0)],
           names=['Run', 'Trigger', 'Cluster'])

Application of these selection lists using a hypothetical select method would result in:

>>> print(df.select(set1))
                            a         b         c
Run Trigger Cluster                              
0   0       0        0.789090  0.776966  0.764152
            1        0.196648  0.635954  0.479195
    1       0        0.007268  0.675339  0.966958
            1        0.055030  0.794982  0.660357
    2       0        0.987798  0.907868  0.583545
            1        0.114886  0.839434  0.070730

>>> print(df.select(set2))
                            a         b         c
Run Trigger Cluster                              
0   1       0        0.007268  0.675339  0.966958
            1        0.055030  0.794982  0.660357
1   2       0        0.155784  0.243656  0.830895
            1        0.734176  0.997579  0.226272

>>> print(df.select(set3))
                            a         b         c
Run Trigger Cluster                              
0   0       1        0.196648  0.635954  0.479195
1   0       1        0.377423  0.934224  0.404226
2   1       0        0.912423  0.239199  0.406443

pandas can join these kinds of mixed-level indices easily, so it seems like this should be a straightforward operation, but I can't figure out the right calls. loc works for set3 because the indices have the same depth, but I need a general solution.

2
  • is ur final output a combination of the three dataframes? could u post an expected output? Commented Mar 19, 2020 at 1:07
  • @sammywemmy This is an example of the first stage of each of 3 completely decoupled analyses. We might take the output after set1 and fill a histogram, use the output from set2 to train a BDT, etc. They aren't really related other than that they all share this common first step. Commented Mar 19, 2020 at 15:54

2 Answers

1

df.loc[set3] works because set3 has all 3 levels of the index. You can mimic this behavior for set1 and set2 by replacing the missing levels with slice(None):

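def select(df, index):
    # Build one indexer per level of df.index
    slicer = []
    for name in df.index.names:
        if name in index.names:
            # Level is present in the selection: use its labels
            values = index.get_level_values(name).values
        else:
            # Level is missing from the selection: keep everything
            values = slice(None)
        slicer.append(values)

    # .loc applies each entry as a filter on the corresponding level
    return df.loc[tuple(slicer), :]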

Then you can use:

select(df, set1)
select(df, set2)
select(df, set3)

If you want it as a method on the DataFrame:

import pandas as pd

pd.DataFrame.select = select
df.select(set1) # etc.

Note that this will ignore levels in index that do not exist in df.index:

# there's no level "FooBar" in df
set4 = pd.MultiIndex.from_tuples([(0, 42)], names=['Trigger', 'FooBar'])
df.select(set4) # works just fine

I haven't tested the performance; it is probably not very fast if you call it in a tight loop.
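One caveat worth checking with the select helper above: when .loc receives a tuple of per-level label arrays, it filters each level independently, so a selection that spans more than one level can return combinations that are not listed in index (for set2, the (0, 2) and (1, 1) Run/Trigger pairs would also pass). A minimal sketch of a variant that matches whole label tuples instead, assuming the level names from the question, could look like this. The name select_exact is hypothetical, pd.Index stands in for the Int64Index shown in the question, and performance is untested.

```python
import numpy as np
import pandas as pd

# Example data shaped like the frame in the question
index = pd.MultiIndex.from_product(
    [[0, 1, 2], [0, 1, 2], [0, 1]], names=["Run", "Trigger", "Cluster"]
)
df = pd.DataFrame(np.random.rand(len(index), 3), index=index, columns=["a", "b", "c"])

set1 = pd.Index([0], name="Run")
set2 = pd.MultiIndex.from_tuples([(0, 1), (1, 2)], names=["Run", "Trigger"])
set3 = pd.MultiIndex.from_tuples(
    [(0, 0, 1), (1, 0, 1), (2, 1, 0)], names=["Run", "Trigger", "Cluster"]
)


def select_exact(df, index):
    """Keep rows whose labels on the shared levels appear together in index.

    Hypothetical helper: levels of index that df.index lacks are ignored.
    """
    shared = [name for name in df.index.names if name in index.names]
    if not shared:
        return df  # no level in common, so there is nothing to filter on
    if len(shared) == 1:
        # Single shared level: compare plain label values
        keys = df.index.get_level_values(shared[0])
        wanted = index.get_level_values(shared[0])
    else:
        # Several shared levels: compare whole tuples, not each level separately
        keys = pd.MultiIndex.from_arrays(
            [df.index.get_level_values(name) for name in shared]
        )
        wanted = pd.MultiIndex.from_arrays(
            [index.get_level_values(name) for name in shared]
        )
    # Boolean mask over the rows of df; the original row order is preserved
    return df[keys.isin(wanted)]


for cut in (set1, set2, set3):
    print(select_exact(df, cut))
```

Because the filter is a boolean mask built with isin, the function does not need to know in advance which levels a given selection carries.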

1 Comment

+1 this is a nice approach. I still feel like there must be a built-in way to achieve this since I can do joins with dataframes using mismatched indexes like this. But the fact that your approach will ignore extra names in index is a nice bonus
0

One way to achieve this using pure pandas is the following:

df.align(setN.to_series(), axis=0, join='inner')[0]

That is, convert the 'other' index to a Series and select the parts of each that would be kept during an inner join operation.
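To make the one-liner concrete, here is a small self-contained sketch that wraps it in a helper and applies it to the three example selections from the question. The helper name select_by_align is hypothetical, and pd.Index stands in for the Int64Index shown in the question. The row order of the result is whatever the inner join produces, so sort the index afterwards if order matters.

```python
import numpy as np
import pandas as pd

# Example data shaped like the frame in the question
index = pd.MultiIndex.from_product(
    [[0, 1, 2], [0, 1, 2], [0, 1]], names=["Run", "Trigger", "Cluster"]
)
df = pd.DataFrame(np.random.rand(len(index), 3), index=index, columns=["a", "b", "c"])

set1 = pd.Index([0], name="Run")
set2 = pd.MultiIndex.from_tuples([(0, 1), (1, 2)], names=["Run", "Trigger"])
set3 = pd.MultiIndex.from_tuples(
    [(0, 0, 1), (1, 0, 1), (2, 1, 0)], names=["Run", "Trigger", "Cluster"]
)


def select_by_align(df, index):
    """Hypothetical wrapper around the align-based selection."""
    # align returns (aligned_df, aligned_series); only the frame is needed
    aligned_df, _ = df.align(index.to_series(), axis=0, join="inner")
    return aligned_df


for cut in (set1, set2, set3):
    print(select_by_align(df, cut))
```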
