I am trying to work with a pandas multiindex dataframe that looks like this:

                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
      3001131  3001132     G|A

I want to be able to do this:

df.loc[('chr1', slice(3000714, 3001110))]

That fails with the following error:

cannot do slice indexing on with these indexers [1204741] of

df.index.levels[1].dtype returns dtype('int64'), so it should work with integer slices, right?

Also, any comments on how to do this efficiently would be valuable, as the dataframe has 12 million rows and I need to query it with this kind of slice query ~70 million times.

1 Answer

I think you need to add ,: to the end. It tells loc that the tuple selects rows and that you want all columns:

print (df.loc[('chr1', slice(3000714, 3001110)),:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C

Another solution is to add axis=0 to loc:

print (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
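
For anyone who wants to try the two forms above without the original data, here is a minimal sketch. It rebuilds only the four rows shown in the question (the real frame has about 12 million), and the names rows, by_comma and by_axis are just illustrative. The sort_index() call is my addition: slicing a MultiIndex level generally needs a sorted index, and I am assuming the real data is sorted as well.

```python
import pandas as pd

# Rebuild the four sample rows from the question.
index = pd.MultiIndex.from_tuples(
    [("chr1", 3000714), ("chr1", 3001065), ("chr1", 3001110), ("chr1", 3001131)],
    names=["chrom", "start"],
)
df = pd.DataFrame(
    {
        "end": [3000715, 3001066, 3001111, 3001132],
        "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
    },
    index=index,
).sort_index()  # assumption: the index is sorted before slicing

# One row selector, applied in the two ways shown above.
rows = ("chr1", slice(3000714, 3001110))
by_comma = df.loc[rows, :]        # explicit "all columns"
by_axis = df.loc(axis=0)[rows]    # tell loc the key applies to the rows

print(by_comma)
print(by_comma.equals(by_axis))
```

Both forms make it unambiguous that the whole tuple indexes the rows, rather than leaving pandas to guess whether its second element is meant for the columns.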

But if you need only the rows for 3000714 and 3001110:

print (df.loc[('chr1', [3000714, 3001110]),:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001110  3001111     G|C

The same selection can be written with pd.IndexSlice:

idx = pd.IndexSlice
print (df.loc[idx['chr1', [3000714, 3001110]],:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001110  3001111     G|C
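
pd.IndexSlice is also handy for the range query from the question, because it lets you write start:stop instead of slice(start, stop). A small sketch, again on a rebuilt copy of the four sample rows (the variable names key and in_range are only for illustration):

```python
import pandas as pd

# Same four sample rows as in the question.
index = pd.MultiIndex.from_tuples(
    [("chr1", 3000714), ("chr1", 3001065), ("chr1", 3001110), ("chr1", 3001131)],
    names=["chrom", "start"],
)
df = pd.DataFrame(
    {
        "end": [3000715, 3001066, 3001111, 3001132],
        "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
    },
    index=index,
)

idx = pd.IndexSlice

# IndexSlice only builds the key; it is the same tuple you would write by hand.
key = idx["chr1", 3000714:3001110]
print(key == ("chr1", slice(3000714, 3001110)))

# The trailing ",:" is still needed so the key is applied to the rows.
in_range = df.loc[key, :]
print(in_range)
```

Label slices in loc include both endpoints, so the row for 3001110 is part of the result.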

Timings:

In [21]: %timeit (df.loc[('chr1', slice(3000714, 3001110)),:])
1000 loops, best of 3: 757 µs per loop

In [22]: %timeit (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
1000 loops, best of 3: 743 µs per loop

In [23]: %timeit (df.loc[('chr1', [3000714, 3001110]),:])
1000 loops, best of 3: 824 µs per loop

In [24]: %timeit (df.loc[pd.IndexSlice['chr1', [3000714, 3001110]],:])
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 826 µs per loop

2 Comments

Fantastic, that worked perfectly. Thanks for the great explanation. I also realized that for my case here, because my first level index was so much smaller than the second (there were 23 items in the level[0] index and 12.6 million in the level[1] index), I got a greater speed up by splitting the dataframe into a dictionary on the first index. On my full dataframe, the df.loc(axis=0)[('chr1', slice(3000714, 3001110))] method took 218 ms per loop, whereas making the dictionary and doing dfs['chr1'].loc[3000714:3001110] took only 95.7 µs per loop. Thanks again!
@jezrael, how would I select the rows of a dataframe from one index to another, within that range? I have a function that sets users.index=np.arange(0,len(users)), but users.loc[start:end:] returns an empty dataframe even though the users dataframe has content.
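
The dictionary split described in the first comment could look roughly like the sketch below. This is my reading of that comment, not the commenter's actual code: the groupby/droplevel construction and the extra chr2 row are assumptions added so the split has more than one key.

```python
import pandas as pd

# Sample rows from the question plus one made-up chr2 row (hypothetical data).
index = pd.MultiIndex.from_tuples(
    [
        ("chr1", 3000714),
        ("chr1", 3001065),
        ("chr1", 3001110),
        ("chr1", 3001131),
        ("chr2", 5000000),
    ],
    names=["chrom", "start"],
)
df = pd.DataFrame(
    {
        "end": [3000715, 3001066, 3001111, 3001132, 5000001],
        "ref|alt": ["T|G", "G|T", "G|C", "G|A", "A|C"],
    },
    index=index,
).sort_index()

# One sub-frame per chromosome, each indexed by "start" only.
dfs = {
    chrom: sub.droplevel("chrom")
    for chrom, sub in df.groupby(level="chrom")
}

# The range query is now a plain label slice on a single-level index.
result = dfs["chr1"].loc[3000714:3001110]
print(result)
```

The split is paid for once up front, and each query then only touches a single-level integer index. Whether that gives the speedup reported in the comment on other data is something to measure with %timeit rather than assume.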
