Executing an SQL query on a Pandas dataset

Question

I have a Pandas dataset called df. How can I do:

df.query("select * from df")

Sorry, our imaginations are poor. Please provide some concrete data and what you actually want.
– cs95, Commented Aug 24, 2017 at 15:33
What you want is not possible. DataFrames are not SQL databases and cannot be queried like one.
– Deb, Commented Aug 24, 2017 at 15:39
The closest thing to what you want is this: pandas.pydata.org/pandas-docs/stable/generated/… and it's not SQL.
– Mohamed Ali JAMAOUI, Commented Aug 24, 2017 at 15:45

qwr · Accepted Answer · 2024-06-06 22:40:05Z

154

This is not what DataFrame.query is meant to do. Take a look at the pandasql package (similar to sqldf in R).

Update: Note pandasql hasn't been maintained since 2017. Use another library from an answer below.

import numpy as np
import pandas as pd
import pandasql as ps

df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan],
                   [1234, 'Customer A', np.nan, '333 Street'],
                   [1233, 'Customer B', '444 Street', '333 Street'],
                   [1233, 'Customer B', '444 Street', '666 Street']],
                  columns=['ID', 'Customer', 'Billing Address', 'Shipping Address'])

q1 = """SELECT ID FROM df """

print(ps.sqldf(q1, locals()))

     ID
0  1234
1  1234
2  1233
3  1233

Update 2020-07-10

After updating pandasql, you can omit the environment argument:

ps.sqldf("select * from df")

edited Jun 6, 2024 at 22:40

qwr

answered Aug 24, 2017 at 16:03

BENY

6 Comments

Jas Over a year ago

I get this error due to numpy not being imported: Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'np' is not defined

BENY Over a year ago

@Jas that was just for building the sample data. If you change np.nan to 1000, the error will go away.

Matt Sosna Over a year ago

FYI, it doesn't look like this works anymore. I get the error AttributeError: 'Connection' object has no attribute 'cursor'. It might work on older versions of pandas; I'm using v1.3.4.

Pramit Over a year ago

Not sure if pandasql is maintained anymore. DuckDB might be a better option. Performance benchmark: duckdb.org/2021/05/14/sql-on-pandas.html

qwr Over a year ago

@Pramit pandasql's last commit was 8 years ago. I think it's safe to say it's not being maintained

Leo Liu · Accepted Answer · 2025-03-28 01:09:30Z

A much better solution is to use duckdb. It is much faster than sqldf because it does not have to load the entire dataset into SQLite and then load the result back into pandas.

Update: duckdb is also faster than polars (which is also a very good solution, and people are moving to polars for its performance); see https://benchmark.clickhouse.com/

pip install duckdb

import pandas as pd
import duckdb
test_df = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})

duckdb.query("SELECT * FROM test_df where i>2").df() # returns a result dataframe

Performance improvement over pandasql, using the NYC yellow cab trip data (~120 MB of CSV) as test data:

nyc = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv',low_memory=False)

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

pysqldf("SELECT * FROM nyc where trip_distance>10")
# wall time 16.1s

duckdb.query("SELECT * FROM nyc where trip_distance>10").df()
# wall time 183ms

That is a speedup of roughly 90x (16.1 s vs. 183 ms).

This article gives good details and claims 1000x improvement over pandasql: https://duckdb.org/2021/05/14/sql-on-pandas.html

Miguel Santos · Accepted Answer · 2020-07-10 11:12:52Z

30

After using this for a while, I realised the easiest way is simply:

from pandasql import sqldf

output = sqldf("select * from df")

This works like a charm, where df is a pandas DataFrame. You can install pandasql from https://pypi.org/project/pandasql/

edited Jul 10, 2020 at 11:12

answered Jul 10, 2020 at 11:07

Miguel Santos

user1717828 · Accepted Answer · 2017-08-24 15:47:22Z

5

You can use DataFrame.query(condition) to return a subset of the data frame matching condition like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('ABC'))
df
   A  B  C
0  0  1  2
1  3  4  5
2  6  7  8

df.query('C < 6')
   A  B  C
0  0  1  2
1  3  4  5


df.query('2*B <= C')
   A  B  C
0  0  1  2


df.query('A % 2 == 0')
   A  B  C
0  0  1  2
2  6  7  8

This has basically the same effect as an SQL statement, except that the SELECT * FROM df WHERE part is implied.
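To make the SQL correspondence a bit more concrete, here is a minimal sketch that pairs a few SQL statements with pandas equivalents. The `threshold` variable is made up for illustration; the `@` prefix lets a query string refer to a Python variable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("ABC"))

threshold = 6  # hypothetical cutoff used in the filter below

# SELECT * FROM df WHERE C < 6
subset = df.query("C < @threshold")

# The same rows using plain boolean indexing
same = df[df["C"] < threshold]

# SELECT A, C FROM df WHERE A % 2 = 0 ORDER BY C DESC
projected = (
    df.query("A % 2 == 0")[["A", "C"]]
      .sort_values("C", ascending=False)
)

print(subset)
print(projected)
```

Column selection and ordering are not part of `query()` itself, so they are chained on afterwards with ordinary pandas calls.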

answered Aug 24, 2017 at 15:47

user1717828

3 Comments

BENY Over a year ago

You could also mention df.eval: pandas.pydata.org/pandas-docs/stable/generated/…

Dobedani Over a year ago

In the 3 examples given above, it's easy to understand the 3 different conditions. With normal subsetting, the same can be achieved with very similar code: df[df['C'] < 6], df[2*df['B'] <= df['C']] and df[df['A'] % 2 == 0]. I don't see why one would want to use SQL anyway. Pandas even has methods like 'groupby' that can be applied to a dataframe to achieve the same result as, e.g., a GROUP BY query.

user1717828 Over a year ago

@Dobedani -- yup, agree that's the preferred syntax. The reason I go with df.query() is for cases where I don't want to rewrite the dataframe name. This is common during exploratory data analysis, when I might have lots of dataframes I want to run the same stuff on, and sticking to method chaining like .query() lets me simply swap the variable at the beginning of the chain.
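The two points in these comments (groupby as a stand-in for GROUP BY, and method chains that do not repeat the dataframe name) can be sketched together. The two sales frames below are hypothetical sample data.

```python
import pandas as pd

# Two hypothetical frames with the same schema.
sales_2023 = pd.DataFrame({"region": ["N", "S", "N"], "amount": [10, 25, 40]})
sales_2024 = pd.DataFrame({"region": ["N", "S", "S"], "amount": [5, 30, 50]})

results = {}
for name, frame in {"2023": sales_2023, "2024": sales_2024}.items():
    # SELECT region, SUM(amount) AS total
    # FROM frame WHERE amount > 8 GROUP BY region
    results[name] = (
        frame
        .query("amount > 8")                 # WHERE
        .groupby("region", as_index=False)   # GROUP BY
        .agg(total=("amount", "sum"))        # SUM(amount) AS total
    )

print(results["2023"])
print(results["2024"])
```

Because the chain never mentions the frame by name after its first line, the same pipeline runs unchanged on every frame in the loop.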

user459872 · Accepted Answer · 2025-03-25 08:41:07Z

Starting from polars 1.0, you can use the SQL interface. It supports polars, pandas, and pyarrow objects.

>>> import pandas as pd
>>> 
>>> pandas_df = pd.DataFrame({"id": [1, 2, 3], "Name": ["foo", "bar", "foo bar"]})
>>> pandas_df
   id     Name
0   1      foo
1   2      bar
2   3  foo bar
>>> 
>>> from polars import SQLContext
>>> 
>>> ctx = SQLContext(df=pandas_df)
>>> 
>>> ctx.execute("select id from df", eager=True).to_pandas()
   id
0   1
1   2
2   3
>>> ctx.execute("select * from df", eager=True).to_pandas()
   id     Name
0   1      foo
1   2      bar
2   3  foo bar
>>> ctx.execute("select id, LENGTH(Name) as length_of_name from df", eager=True).to_pandas()
   id  length_of_name
0   1               3
1   2               3
2   3               7
>>>

With the latest version of polars, you can execute SQL at the DataFrame level.

>>> import polars as pl
>>> import pandas as pd
>>> 
>>> 
>>> pandas_df = pd.DataFrame({"id": [1, 2, 3]})
>>> 
>>> polars_df = pl.from_pandas(pandas_df)
>>> polars_df.sql("SELECT COUNT(*) from self")
shape: (1, 1)
┌─────┐
│ len │
│ --- │
│ u32 │
╞═════╡
│ 3   │
└─────┘
>>> # You can then convert back to pandas DF by calling `to_pandas()`

SchemeSonic · Accepted Answer · 2022-01-15 16:28:04Z

There is also FugueSQL

pip install fugue[sql]

import pandas as pd
from fugue_sql import fsql

comics_df = pd.DataFrame({'book': ['Secret Wars 8',
                                   'Tomb of Dracula 10',
                                   'Amazing Spider-Man 252',
                                   'New Mutants 98',
                                   'Eternals 1',
                                   'Amazing Spider-Man 300',
                                   'Department of Truth 1'],
                          'publisher': ['Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Image'],
                          'grade': [9.6, 5.0, 7.5, 8.0, 9.2, 6.5, 9.8],
                          'value': [400, 2500, 300, 600, 400, 750, 175]})

# which of my books are graded above 8.0?
query = """
SELECT book, publisher, grade, value FROM comics_df
WHERE grade > 8.0
PRINT
"""

fsql(query).run()

Output

PandasDataFrame
book:str                                                      |publisher:str|grade:double|value:long
--------------------------------------------------------------+-------------+------------+----------
Secret Wars 8                                                 |Marvel       |9.6         |400       
Eternals 1                                                    |Marvel       |9.2         |400       
Department of Truth 1                                         |Image        |9.8         |175       
Total count: 3

References

https://fugue-tutorials.readthedocs.io/tutorials/beginner/beginner_sql.html

https://www.kdnuggets.com/2021/10/query-pandas-dataframes-sql.html

alphacrash · Accepted Answer · 2020-11-07 12:49:52Z

1

Or, you can use the tools that do what they do best:

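Install PostgreSQL.
Connect to the database:

import urllib.parse
from sqlalchemy import create_engine

# Placeholder connection settings (hypothetical values); replace with your own.
dialect, host, port, dbname = "postgresql", "localhost", 5432, "mydb"
user_uenc = urllib.parse.quote_plus("myuser")    # URL-encode the credentials
pw_uenc = urllib.parse.quote_plus("mypassword")
engconnect = "{0}://{1}:{2}@{3}:{4}/{5}".format(dialect, user_uenc, pw_uenc, host, port, dbname)
dbengine = create_engine(engconnect)
database = dbengine.connect()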

Dump the dataframe into postgres

df.to_sql('mytablename', database, if_exists='replace')

Write your query with all the SQL nesting your brain can handle.

myquery = "select distinct * from mytablename"

Create a dataframe by running the query:

newdf = pd.read_sql(myquery, database)
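The same dump, query, and read-back pattern can be tried without a database server. The sketch below swaps PostgreSQL for an in-memory SQLite database from the Python standard library; this is a substitution for illustration, not the setup described above, and the sample data is made up.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data containing one duplicate row.
df = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]})

# In-memory SQLite database: nothing to install, nothing left on disk.
conn = sqlite3.connect(":memory:")
try:
    # Dump the DataFrame into a table (without the pandas index column).
    df.to_sql("mytablename", conn, index=False, if_exists="replace")

    # Run the SQL and read the result back into a new DataFrame.
    myquery = "SELECT DISTINCT * FROM mytablename ORDER BY id"
    newdf = pd.read_sql_query(myquery, conn)
finally:
    conn.close()

print(newdf)
```

This round trip copies the data into the database and back, which is the overhead the duckdb answer above refers to.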

answered Nov 7, 2020 at 12:49

alphacrash

2 Comments

Sebastian Wozny Over a year ago

oh the horror, please don't do this in real life

qwr Over a year ago

That's basically what pandasql does with sqlalchemy, except it's a sqlite db by default.

mechatroner · Accepted Answer · 2022-01-26 01:31:28Z

0

Another solution is RBQL, which provides an SQL-like query language that allows Python expressions inside SELECT and WHERE statements. It also provides a convenient %rbql magic command for use in Jupyter/IPython:

# Get some test data:
!pip install vega_datasets
from vega_datasets import data
my_cars_df = data.cars()
# Install and use RBQL:
!pip install rbql
%load_ext rbql
%rbql SELECT * FROM my_cars_df WHERE a.Horsepower > 100 ORDER BY a.Weight_in_lbs DESC

In this example my_cars_df is a Pandas Dataframe.

You can try it in this demo Google Colab notebook.

answered Jan 26, 2022 at 1:31

mechatroner
