using query with a tuple column in pandas

Question

I have a pandas DataFrame in which one of the columns holds tuples. I would like to use query to subset the DataFrame on the first entry of each tuple. What's the best way to do this? I'm on pandas 0.23.3, Python 3.6.6.

MWE:

import pandas as pd
df = pd.DataFrame({"val": list(zip(range(9), range(9)[::-1]))})
df.query("val[0] > 3")  # this line does not work!

I know that I can split the column up and then subset but I don't want to split it up.

Update: for anyone who decides to unpack the tuple into two separate columns, here is a simple way to do it:

df["a"], df["b"] = list(zip(*df.val.tolist()))

cs95 · Accepted Answer · 2018-08-04 22:12:24Z

4

I assume your queries are more complicated than "val > 3". One easy way to get the first item of each tuple in your column is the .str accessor:

df.val.str[0].to_frame().query('val > 3')

   val
4    4
5    5
6    6
7    7
8    8

This works because .str operates on any object column, including columns of lists and tuples, not just strings (strings are only one of many possible object types).
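
As a small illustration of that point, the sketch below indexes into a column that mixes tuples and lists. The series contents are made up for the example; only the .str[i] indexing itself comes from the answer.

```python
import pandas as pd

# Hypothetical object column mixing tuples and a list.
s = pd.Series([(1, "a"), [2, "b"], (3, "c")])

# .str[i] indexes into each element, whatever sequence type it is.
firsts = s.str[0].tolist()
seconds = s.str[1].tolist()
print(firsts, seconds)

# An index past the end of an element gives a missing value, not an error.
short = pd.Series([(1, 2), (3,)]).str[1]
print(short.isna().tolist())
```
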
If query is not a necessity, this will be good enough:

v = df.val.str[0]
v[v > 3]

   val
4    4
5    5
6    6
7    7
8    8

There's also

pd.DataFrame({'val': [v[0] for v in df['val']]}).query('val > 3')

   val
4    4
5    5
6    6
7    7
8    8

This uses a list comprehension to build a new single-column DataFrame from scratch. It should be the fastest option, but I would prefer one of the approaches above.
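
If the frame has other columns that should survive the filter, the same list comprehension can build a boolean mask instead of a new frame. This is only a sketch: the "other" column is a hypothetical addition to the frame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "val": list(zip(range(9), range(9)[::-1])),
    "other": list("abcdefghi"),  # hypothetical extra column
})

# One boolean per row, computed in plain Python and aligned to df's index.
mask = pd.Series([v[0] > 3 for v in df["val"]], index=df.index)

# Boolean selection keeps every column, including the tuple column.
subset = df[mask]
print(subset)
```
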

edited Aug 4, 2018 at 22:12

answered Aug 4, 2018 at 22:10


5 Comments

Alex Over a year ago

they are more complicated but this is interesting: how come df.val.str[0] returns int type rather than string?

cs95 Over a year ago

@Alex The .str accessor works on any object column (including lists and tuples), not just strings (strings are considered objects too).

Alex Over a year ago

oh i see so .str will simply return a series of objects in the val column and then you can use any method available on that object? (ie __getitem__ in this case)

cs95 Over a year ago

@Alex it returns a series, so all series methods are applicable here. For performance, you may want to do an astype(int) conversion because the output of .str[] is an object column.

Alex Over a year ago

ok cool, this is good to know. my df obviously contains other columns but i just wanted to filter by the column that contains tuples, hence the reason for doing query. i suppose then i'll just boolean select the df using your method. thanks for the help!
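
The boolean selection discussed in these comments might look like the sketch below. The "other" column and the helper column name "val0" are assumptions made for the example. The second variant shows one way to stay with query by adding a temporary column and dropping it afterwards.

```python
import pandas as pd

df = pd.DataFrame({
    "val": list(zip(range(9), range(9)[::-1])),
    "other": list("abcdefghi"),  # hypothetical extra column
})

# First tuple entry as an integer Series (astype(int) as suggested above).
first = df["val"].str[0].astype(int)

# Variant 1: plain boolean selection on the whole frame.
by_mask = df[first > 3]

# Variant 2: expose the value as a temporary column so query can see it,
# then drop the helper column again.
by_query = df.assign(val0=first).query("val0 > 3").drop(columns="val0")

print(by_mask)
```

Both variants select the same rows and leave the tuple column untouched.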

jpp · Answer · 2018-08-04 22:22:11Z

2

What's the best way to do this?

In my opinion, don't work with a series of tuples to begin with. This negates one of the main benefits of Pandas: vectorised computations with NumPy arrays.

Instead, you can split your series of tuples into two series of integers. Then use pd.DataFrame.query as usual:

df = pd.DataFrame(df['val'].values.tolist()).add_prefix('val')

print(df.query('val0 > 3'))

   val0  val1
4     4     4
5     5     3
6     6     2
7     7     1
8     8     0
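
When the frame has further columns, the split can be joined back instead of replacing df. A minimal sketch, assuming a hypothetical "other" column alongside the tuple column:

```python
import pandas as pd

df = pd.DataFrame({
    "val": list(zip(range(9), range(9)[::-1])),
    "other": list("abcdefghi"),  # hypothetical extra column
})

# Split the tuples into integer columns val0 and val1, aligned to df's index.
parts = pd.DataFrame(df["val"].tolist(), index=df.index).add_prefix("val")

# Replace the tuple column with the split columns, keeping everything else.
wide = df.drop(columns="val").join(parts)

# query can now refer to either tuple entry by name.
result = wide.query("val0 > 3 and val1 >= 2")
print(result)
```
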

answered Aug 4, 2018 at 22:22


6 Comments

Alex Over a year ago

yes, i think you may be right: my tuple contains (year, month) values but i think it may just be best to have them in separate columns

Alex Over a year ago

just remembered why i never split up year and month: if i want to select all elements less than a particular year and month i can't do "year < my_year & month < my_month", becomes very ugly when cols are split up

jpp Over a year ago

Well, you can use df.query('year < 2010 & month < 5') or df[(df['year'] < 2010) & (df['month'] < 5)].

Alex Over a year ago

no that won't work: say the year is 2010 and month 5, and i want all sample points before. while year < 2010 will be true, the sample point in 2009-12 will fail since 12 > 5

Alex Over a year ago

in R i represent year month points as integers of the form YYYYMM. i suppose i can do the same in pandas
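
The (year, month) comparison from these comments can be sketched as follows. The sample points, the cutoff and the column names "ym", "year", "month" and "yyyymm" are made up for illustration. The sketch contrasts the condition that misses 2009-12 with three ways of selecting everything before 2010-05.

```python
import pandas as pd

# Hypothetical (year, month) sample points stored as tuples.
points = [(2008, 6), (2009, 2), (2009, 12), (2010, 3), (2010, 5), (2010, 7)]
df = pd.DataFrame({"ym": points})
df["year"] = df["ym"].str[0].astype(int)
df["month"] = df["ym"].str[1].astype(int)

cutoff = (2010, 5)

# Requiring both parts to be smaller misses points such as 2009-12.
naive = df[(df["year"] < cutoff[0]) & (df["month"] < cutoff[1])]

# Correct ordering on split columns: earlier year, or same year and earlier month.
lexi = df[(df["year"] < cutoff[0])
          | ((df["year"] == cutoff[0]) & (df["month"] < cutoff[1]))]

# YYYYMM integer encoding, which also works inside query.
df["yyyymm"] = df["year"] * 100 + df["month"]
encoded = df.query("yyyymm < 201005")

# Python compares tuples element by element, so the tuple column can be used as is.
by_tuple = df[df["ym"].map(lambda t: t < cutoff)]

print(naive["ym"].tolist())
print(lexi["ym"].tolist())
```

The last three selections agree with each other. The tuple comparison runs row by row in Python, so the integer encoding is likely the better fit for large frames.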
