Python Pandas: cannot do slice indexing

Question

I am trying to work with a pandas multiindex dataframe that looks like this:

                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
      3001131  3001132     G|A

I want to be able to do this:

df.loc[('chr1', slice(3000714, 3001110))]

That fails with the following error:

cannot do slice indexing on with these indexers [1204741] of

df.index.levels[1].dtype returns dtype('int64'), so it should work with integer slices, right?

Also, any comments on how to do this efficiently would be valuable, as the dataframe has 12 million rows and I need to query it with this kind of slice query ~70 million times.

jezrael · Accepted Answer · 2016-06-09 05:19:26Z

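I think you need to add ,: at the end. It tells loc that the tuple selects rows and that you want all columns; without it, loc can interpret the tuple as (rows, columns) and try to apply the integer slice to the column labels: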

print (df.loc[('chr1', slice(3000714, 3001110)),:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C

Another solution is to pass axis=0 to loc:

print (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C

But if you need only the rows for 3000714 and 3001110, pass a list instead of a slice:

print (df.loc[('chr1', [3000714, 3001110]),:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001110  3001111     G|C

Or the same with pd.IndexSlice:

idx = pd.IndexSlice
print (df.loc[idx['chr1', [3000714, 3001110]],:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001110  3001111     G|C
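To try the selections above without the original data, here is a minimal sketch. The frame is a hypothetical reconstruction of the four rows printed in the question, and the variable names (rows, rows_axis0, rows_idx, picked) are made up for illustration. It sorts the index first, since label slicing on a MultiIndex expects a sorted index.

```python
import pandas as pd

# Hypothetical reconstruction of the four rows shown in the question.
df = (
    pd.DataFrame(
        {
            "chrom": ["chr1"] * 4,
            "start": [3000714, 3001065, 3001110, 3001131],
            "end": [3000715, 3001066, 3001111, 3001132],
            "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
        }
    )
    .set_index(["chrom", "start"])
    .sort_index()  # label slicing on a MultiIndex expects a sorted index
)

# Row tuple plus an explicit "all columns" indexer.
rows = df.loc[("chr1", slice(3000714, 3001110)), :]

# Same selection, telling loc up front that the key refers to the rows.
rows_axis0 = df.loc(axis=0)[("chr1", slice(3000714, 3001110))]

# pd.IndexSlice allows the start:stop notation inside the tuple.
idx = pd.IndexSlice
rows_idx = df.loc[idx["chr1", 3000714:3001110], :]

# A list selects exact labels instead of a range.
picked = df.loc[("chr1", [3000714, 3001110]), :]

print(rows)
print(picked)
```

Label slices in loc include both endpoints, which is why the row for 3001110 is part of the sliced result.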

Timings:

In [21]: %timeit (df.loc[('chr1', slice(3000714, 3001110)),:])
1000 loops, best of 3: 757 µs per loop

In [22]: %timeit (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
1000 loops, best of 3: 743 µs per loop

In [23]: %timeit (df.loc[('chr1', [3000714, 3001110]),:])
1000 loops, best of 3: 824 µs per loop

In [24]: %timeit (df.loc[pd.IndexSlice['chr1', [3000714, 3001110]],:])
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 826 µs per loop

edited Jun 9, 2016 at 5:19

answered Jun 9, 2016 at 5:06

jezrael


2 Comments

Mike D Over a year ago

Fantastic, that worked perfectly. Thanks for the great explanation. I also realized that for my case here, because my first level index was so much smaller than the second (there were 23 items in the level[0] index and 12.6 million in the level[1] index), I got a greater speed up by splitting the dataframe into a dictionary on the first index. On my full dataframe, the df.loc(axis=0)[('chr1', slice(3000714, 3001110))] method took 218 ms per loop, whereas making the dictionary and doing dfs['chr1'].loc[3000714:3001110] took only 95.7 µs per loop. Thanks again!
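A minimal sketch of the dictionary approach described in the comment above, assuming the frame is indexed by (chrom, start) and sorted. The chr2 rows are made up so that there is more than one key, and the comprehension is one possible way to build dfs; the comment does not show how it was actually constructed.

```python
import pandas as pd

# Hypothetical data: the chr1 rows from the question plus two made-up chr2 rows.
df = (
    pd.DataFrame(
        {
            "chrom": ["chr1"] * 4 + ["chr2"] * 2,
            "start": [3000714, 3001065, 3001110, 3001131, 3000500, 3000900],
            "end": [3000715, 3001066, 3001111, 3001132, 3000501, 3000901],
            "ref|alt": ["T|G", "G|T", "G|C", "G|A", "A|C", "C|T"],
        }
    )
    .set_index(["chrom", "start"])
    .sort_index()
)

# One sub-frame per chromosome, each indexed by start only.
dfs = {
    chrom: sub.droplevel("chrom")
    for chrom, sub in df.groupby(level="chrom")
}

# Each lookup is now a plain label slice on a single-level integer index.
hits = dfs["chr1"].loc[3000714:3001110]
print(hits)
```

The split is paid once up front; every later query skips the first index level entirely, which matches the situation in the comment where that level has only 23 distinct values.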

Eliethesaiyan Over a year ago

@jezrael, how would I select part of a dataframe from one index value to another, i.e. in that range? I have a function that sets users.index = np.arange(0, len(users)), but users.loc[start:end:] returns an empty dataframe even though the users dataframe has content.
