4

I have a pandas df that has one of the columns as a tuple. I would like to use query to subset the df using the first entry of the tuple. What's the best way to do this? I'm on pandas .23.3, Python 3.6.6

MWE:

import pandas as pd
df = pd.DataFrame({"val": list(zip(range(9), range(9)[::-1]))})
df.query("val[0] > 3") #this line does not work!

I know that I can split the column up and then subset but I don't want to split it up.

update: for anyone who decides to go the route of unpacking the tuple and having two separate columns, here is a simple way to do this:

df["a"], df["b"] = list(zip(*df.val.tolist()))

2 Answers 2

4

I assume your queries are more complicated than "val > 3". This is one easy way to get the first item from your column—with the .str accessor:

df.val.str[0].to_frame().query('val > 3')

   val
4    4
5    5
6    6
7    7
8    8

The reason this works is because .str will work with any object column (this includes columns of lists and tuples), not just strings (strings are considered objects, one of many possible types).
If query is not a necessity, this will be good enough:

v = df.val.str[0]
v[v > 3]

   val
4    4
5    5
6    6
7    7
8    8

There's also

pd.DataFrame({'val' : [v[0] for v in df['val']}).query('val > 3')

   val
4    4
5    5
6    6
7    7
8    8

Which uses a list comprehension to build a new single columned DataFrame from scratch. This should be the fastest, but I would prefer one of the approaches above.

Sign up to request clarification or add additional context in comments.

5 Comments

they are more complicated but this is interesting: how come df.val.str[0] returns int type rather than string?
@Alex df.str works on any object column (including lists and tuples), not just string (strings are considered objects too).
oh i see so .str will simply return a series of objects in the val column and then you can you any method available on that object? (ie __getitem__ in this case)
@Alex it returns a series, so all series methods are applicable here. For performance, you may want to do an astype(int) conversion because the output of .str[] is an object column.
ok cool, this is good to know. my df obviously contains other columns but i just wanted to filter by the column that contains tuples, hence the reason for doing query. i suppose then i'll just boolean select the df using your method. thanks for the help!
2

What's the best way to do this?

In my opinion, don't work with a series of tuples to begin with. This negates one of the main benefits of Pandas: vectorised computations with NumPy arrays.

Instead, you can split your series of tuples into two series of integers. Then use pd.DataFrame.query as usual:

df = pd.DataFrame(df['val'].values.tolist()).add_prefix('val')

print(df.query('val0 > 3'))

   val0  val1
4     4     4
5     5     3
6     6     2
7     7     1
8     8     0

6 Comments

yes, i think you may be right: my tuple contains (year, month) values but i think it may just be best to have them in separate columns
just remembered why i never split up year and month: if i want to select all elements less than a particular year and month i can't do "year < my_year & month < my_month", becomes very ugly when cols are split up
Well, you can df.query('year < 2010 & month < 5') or df[(df['year'] < 2010) & (df['month'] < 5)].
no that won't work: say the year is 2010 and month 5, and i want all sample points before. while year < 2010 will be true, the sample point in 2009-12 will fail since 12 > 5
in R i represent year month points as integers of the form YYYYMM. i suppose i can do the same in pandas
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.