
The documentation on DataFrame.query() is very terse (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html), and I was also unable to find examples of projections by web search.

So I tried simply providing the column names: that gave a syntax error. Likewise for typing select followed by the column names. So, how do I do this?


5 Answers


After playing around with this for a while and reading through the source code for DataFrame.query, I can't figure out a way to do it.

If it's not impossible, it's apparently at least strongly discouraged. When this question came up on GitHub, prolific Pandas dev/maintainer jreback suggested using df.eval() for selecting columns and df.query() for filtering on rows.
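To make that suggestion concrete, here is a minimal sketch of the two methods used together (the DataFrame here is a toy, purely for illustration):

import pandas as pd
import numpy as np

# Toy DataFrame with numeric columns A and B
df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])

rows = df.query('A > 0')   # query filters rows
col = rows.eval('A')       # eval evaluates an expression over columns; a Series here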


UPDATE:

javadba points out that the return value of eval is not a DataFrame. For example, to flesh out jreback's example a bit more:

df.eval('A')

returns a Pandas Series, but

df.eval(['A', 'B'])

does not return a DataFrame; it returns a list (of Pandas Series).
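If you do need a DataFrame back, one workaround (just a sketch, not something eval offers directly) is to concatenate the evaluated Series yourself:

import pandas as pd

# Rebuild a DataFrame from individually evaluated columns
df2 = pd.concat([df.eval('A'), df.eval('B')], axis=1)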

So it seems that, ultimately, the best way to keep the flexibility to filter on both rows and columns is to use iloc/loc, e.g.

df.loc[0:4, ['A', 'C']]

Output:

          A         C
0 -0.497163 -0.046484
1  1.331614  0.741711
2  1.046903 -2.511548
3  0.314644 -0.526187
4 -0.061883 -0.615978

13 Comments

But eval does not return a DataFrame: the docs say ret : ndarray, scalar, or pandas object. In any case, upvoted for the effort.
Hm, good point. I just tried iris2 = iris.eval(['sepal_length', 'species']), but the iris2 I got back was a list, with each element a Pandas Series. Weird.
Looks like we're back to iloc/loc. Maybe play with that a bit and I can award here.
Updated my answer. I don't consider this a particularly satisfying answer, but I think it's the answer.
I use it when performance is not important, and I prefer the postgres backend since it supports analytics/windowing functions. Same as years back, actually. The other way I do it more often is using Spark SQL.

DataFrame.query is more like the WHERE clause in a SQL statement than the SELECT part.

import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

To select a column or columns you can use the following:

df['A'] or df.loc[:,'A']

or

df[['A','B']] or df.loc[:,['A','B']]

To use the .query method you do something like

df.query('A > B')

which returns all the rows where the value in column A is greater than the value in column B.

                   A         B         C         D
2000-01-03  1.265936 -0.866740 -0.678886 -0.094709
2000-01-04  1.491390 -0.638902 -0.443982 -0.434351
2000-01-05  2.205930  2.186786  1.004054  0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589

Which is more readable, in my opinion, than boolean index selection with

df[df['A'] > df['B']]
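Since query only covers the row side, a common pattern is to chain the column projection onto it, sketched here with the same df as above:

# Rows where A > B, projected onto columns A and C
subset = df.query('A > B')[['A', 'C']]

# Equivalent, and more explicit about rows vs. columns
subset = df.loc[df['A'] > df['B'], ['A', 'C']]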



How about

df_new = df.query('col1==1 & col2=="x" ')[['col1', 'col3']]

This filters rows where col1 equals 1 and col2 equals "x", and returns only col1 and col3.

But you do need to filter rows, otherwise it doesn't work.

For selecting columns only, it is better to use .loc or .iloc, as sketched below.
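A minimal sketch of column-only selection, assuming the same hypothetical df with columns col1, col2, and col3:

# Select by column label
cols = df[['col1', 'col3']]
cols = df.loc[:, ['col1', 'col3']]

# Or by integer position
cols = df.iloc[:, [0, 2]]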

1 Comment

This is the best answer for me. Simple and very close to any DB query syntax.

pandasql

https://pypi.python.org/pypi/pandasql/0.1.0

Here is an example from the following blog post: http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html. The inputs are two DataFrames, meat and births, and this approach gives the projection, filtering, aggregation, and sorting expected from SQL.

@MaxPower did mention this package is buggy, so let's see. At least the code from the blog, shown below, works fine.

from pandasql import sqldf, load_meat, load_births

meat = load_meat()
births = load_births()

pysqldf = lambda q: sqldf(q, globals())

q = """
SELECT
  m.date
  , m.beef
  , b.births
FROM
  meat m
LEFT JOIN
  births b
    ON m.date = b.date
WHERE
    m.date > '1974-12-31';
"""

df = pysqldf(q)

The output is a pandas DataFrame as desired.
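For a self-contained taste of the projection the question asked about, here is a minimal sketch on a throwaway DataFrame (assuming only that pandasql is installed):

import pandas as pd
from pandasql import sqldf

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z'], 'C': [0.1, 0.2, 0.3]})

# SELECT (projection) and WHERE (filtering) in one statement
out = sqldf("SELECT A, C FROM df WHERE A > 1", locals())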

It is working great for my particular use case (evaluating US crime data):

# scols is a string of column names built elsewhere; p is a small print helper
odf = pysqldf("select %s from df where sweapons > 10 order by sweapons desc limit 10" % scols)
p('odf\n', odf)

odf:
   SMURDER  SRAPE  SROBBERY  SAGASSLT  SOTHASLT  SVANDLSM  SWEAPONS
0        0      0         0         1         1        10        54
1        0      0         0         0         1         0        52
2        0      0         0         0         1         0        46
3        0      0         0         0         1         0        43
4        0      0         0         0         1         0        33
5        1      0         2        16        28         4        32
6        0      0         0         7        17         4        30
7        0      0         0         0         1         0        29
8        0      0         0         7        16         3        29
9        0      0         0         1         0         5        28

Update: I have done a bunch of stuff with pandasql now (calculated fields, limits, aliases, cascaded DataFrames). It is just so productive.

Another update (3 years later): This works, but be warned, it is very slow (seconds vs. milliseconds).

2 Comments

Glad this is working so well for your case. I have been frustrated by it a couple of times, but maybe that's on me for not contributing a fix. Here's the last bug I ran into: SELECT from multiple tables doesn't work. Which is a shame, because that's the kind of operation that reads so much nicer in SQL than in base pandas. I also worry that since this issue has been open for ~18 months, with no person or even labels assigned to it, the library is probably not well maintained.
@MaxPower I am now using the postgresql dialect with pandasql, as opposed to the default and limited sqlite. It is working better so far.

Just a simpler example solution (using get):

My goal:

I want the lat and lon columns out of the result of the query.

My table details:

df_city.columns

Index(['name', 'city_id', 'lat', 'lon', 'CountryName', 'ContinentName'], dtype='object')

# All columns
city_continent = df_city.get(df_city['ContinentName']=='Oceania')

# Only lat and lon
city_continent[['lat', 'lon']]
              lat        lon
113883  -19.12753 -169.84623
113884  -19.11667 -169.90000
113885  -19.10000 -169.91667
113886  -46.33333  168.85000
113887  -46.36667  168.55000
...           ...        ...
347956  -23.14083  113.77630
347957  -31.48023  131.84242
347958  -28.29967  153.30142
347959  -35.60358  138.10548
347960  -35.02852  117.83416

3712 rows × 2 columns
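The same result can be had in one step with loc, sketched here against the same df_city:

# Boolean row mask and column projection in a single indexing call
city_continent = df_city.loc[df_city['ContinentName'] == 'Oceania', ['lat', 'lon']]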

