
I am loading data from various sources (CSV, XLS, JSON, etc.) into pandas DataFrames, and I would like to generate statements to create and fill a SQL database with this data. Does anyone know of a way to do this?

I know pandas has a to_sql function, but that only works on a database connection; it cannot generate a string.

Example

What I would like is to take a dataframe like so:

import pandas as pd
import numpy as np

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

And a function that would generate this (this example is PostgreSQL but any would be fine):

CREATE TABLE data
(
  index timestamp with time zone,
  "A" double precision,
  "B" double precision,
  "C" double precision,
  "D" double precision
)

9 Answers


If you only want the 'CREATE TABLE' SQL code (and not the insert of the data), you can use the get_schema function of the pandas.io.sql module:

In [10]: print(pd.io.sql.get_schema(df.reset_index(), 'data'))
CREATE TABLE "data" (
  "index" TIMESTAMP,
  "A" REAL,
  "B" REAL,
  "C" REAL,
  "D" REAL
)

Some notes:

  • I had to use reset_index because otherwise the index was not included
  • If you provide an SQLAlchemy engine of a certain database flavor, the result will be adjusted to that flavor (e.g. the data type names).
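To illustrate the second note, a minimal sketch: you can pass a live connection via the con= keyword so get_schema emits that flavor's type names. This uses the stdlib sqlite3 module for simplicity; a SQLAlchemy engine pointed at PostgreSQL etc. works the same way.

```python
# Sketch: get_schema adjusts type names to the connection's flavor.
import sqlite3

import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# An in-memory SQLite connection stands in for a real database here
con = sqlite3.connect(':memory:')
ddl = pd.io.sql.get_schema(df.reset_index(), 'data', con=con)
print(ddl)
```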

3 Comments

Is it possible to get the insert data, in order to do an update/insert?
@joris how does one specify the sqlalchemy engine?
I see RustyShackleford has already found it, but here is how you specify the engine: stackoverflow.com/a/51294670/3279262

GENERATE SQL CREATE STATEMENT FROM DATAFRAME

SOURCE = df
TARGET = data

def SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    # SOURCE: source dataframe
    # TARGET: name of the table to be created in the database
    import pandas as pd
    sql_text = pd.io.sql.get_schema(SOURCE.reset_index(), TARGET)
    return sql_text

Check the SQL CREATE TABLE statement string:

print(sql_text)

GENERATE SQL INSERT STATEMENT FROM DATAFRAME

def SQL_INSERT_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    sql_texts = []
    for index, row in SOURCE.iterrows():       
        sql_texts.append('INSERT INTO '+TARGET+' ('+ str(', '.join(SOURCE.columns))+ ') VALUES '+ str(tuple(row.values)))        
    return sql_texts

Check the SQL INSERT INTO statement string:

print('\n\n'.join(sql_texts))
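The per-row function above emits nan literally and one statement per row. A hedged sketch (the helper name sql_insert_batches is mine, not from the answer) that instead chunks rows into multi-row INSERT statements and maps NaN/None to SQL NULL:

```python
# Sketch: batched multi-row INSERTs with NaN/None -> NULL handling.
import math

import pandas as pd

def sql_insert_batches(df, target, batch_size=1000):
    """Yield multi-row INSERT statements, at most batch_size rows each."""
    cols = ', '.join(df.columns)

    def render(value):
        # NaN/None become NULL; strings are quoted with '' escaping;
        # numbers pass through unquoted
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return 'NULL'
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)

    rows = ['(' + ', '.join(render(v) for v in row) + ')'
            for row in df.itertuples(index=False, name=None)]
    for i in range(0, len(rows), batch_size):
        chunk = ', '.join(rows[i:i + batch_size])
        yield f'INSERT INTO {target} ({cols}) VALUES {chunk};'
```

Usage would look like `for stmt in sql_insert_batches(df, 'data'): ...`; the quoting here is deliberately simple and not a substitute for driver-side parameter binding.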

3 Comments

Works just fine. What is the best way to create an update statement based on this solution?
I tried this insert statement, but it seems it does not handle NULLs. The generated insert statement had nan for NULL values, and my query tried to insert nan into these fields.
Probably would be way more efficient to batch this up to the max sql query size, right?

Insert Statement Solution

Not sure if this is the absolute best way to do it, but it is more efficient than using df.iterrows(), which is very slow. It also takes care of nan values with the help of regular expressions.

import re

def get_insert_query_from_df(df, dest_table):

    insert = """
    INSERT INTO `{dest_table}` (
        """.format(dest_table=dest_table)

    columns_string = str(list(df.columns))[1:-1]
    columns_string = re.sub(r' ', '\n        ', columns_string)
    columns_string = re.sub(r'\'', '', columns_string)

    values_string = ''

    for row in df.itertuples(index=False,name=None):
        values_string += re.sub(r'nan', 'null', str(row))
        values_string += ',\n'

    return insert + columns_string + ')\n     VALUES\n' + values_string[:-2] + ';'

4 Comments

what is re supposed to be?
@Rainb re stands for "Regular Expression". You can import Python's regular expression library with the command import re
@eitanlees shouldn't that be included in the answer?
@Rainb I think it would be good to include that in the answer. I will edit hunterm's answer :)

If you're just looking to generate a string with inserts based on a pandas.DataFrame, I'd suggest using the bulk SQL insert syntax, as suggested by @rup.

Here's an example of a function I wrote for that purpose:

import pandas as pd
import re


def df_to_sql_bulk_insert(df: pd.DataFrame, table: str, **kwargs) -> str:
    """Converts DataFrame to bulk INSERT sql query
    >>> data = [(1, "_suffixnan", 1), (2, "Noneprefix", 0), (3, "fooNULLbar", 1, 2.34)]
    >>> df = pd.DataFrame(data, columns=["id", "name", "is_deleted", "balance"])
    >>> df
       id        name  is_deleted  balance
    0   1  _suffixnan           1      NaN
    1   2  Noneprefix           0      NaN
    2   3  fooNULLbar           1     2.34
    >>> query = df_to_sql_bulk_insert(df, "users", status="APPROVED", address=None)
    >>> print(query)
    INSERT INTO users (id, name, is_deleted, balance, status, address)
    VALUES (1, '_suffixnan', 1, NULL, 'APPROVED', NULL),
           (2, 'Noneprefix', 0, NULL, 'APPROVED', NULL),
           (3, 'fooNULLbar', 1, 2.34, 'APPROVED', NULL);
    """
    df = df.copy().assign(**kwargs)
    columns = ", ".join(df.columns)
    tuples = map(str, df.itertuples(index=False, name=None))
    values = re.sub(r"(?<=\W)(nan|None)(?=\W)", "NULL", (",\n" + " " * 7).join(tuples))
    return f"INSERT INTO {table} ({columns})\nVALUES {values};"

By the way, it converts nan/None entries to NULL and it's possible to pass constant column=value pairs as keyword arguments (see status="APPROVED" and address=None arguments in docstring example).

Generally, it works faster, since any database does a lot of work for a single insert: checking constraints, building indices, flushing, writing to the log, etc. These complex operations can be optimized by the database when doing a several-in-one operation, rather than calling the engine row by row.

2 Comments

The regex seems really slow for large queries.
Not sure if spaces count towards max SQL statement length, but seems like adding the 7 blank spaces is not needed if we're not going to be looking at the query.

SINGLE INSERT QUERY SOLUTION

The above answers didn't suit my needs. I wanted to create one single insert statement for a dataframe, with each row as the values. This can be achieved as below:

import re 
import pandas as pd 

table = 'your_table_name_here'

# You can read from CSV file here... just using read_sql_query as an example

df = pd.read_sql_query(f'select * from {table}', con=db_connection)


cols = ', '.join(df.columns.to_list()) 
vals = []

for index, r in df.iterrows():
    row = []
    for x in r:
        row.append(f"'{str(x)}'")

    row_str = ', '.join(row)
    vals.append(row_str)

f_values = [] 
for v in vals:
    f_values.append(f'({v})')

# Handle inputting NULL values
f_values = ', '.join(f_values) 
f_values = re.sub(r"('None')", "NULL", f_values)

sql = f"insert into {table} ({cols}) values {f_values};" 

print(sql)

db_connection.dispose()  # dispose of the engine when done (assumes db_connection is a SQLAlchemy engine)



If you want to write the file yourself, you can also retrieve the column names and dtypes and build a dictionary to convert pandas data types to SQL data types.

As an example:

import pandas as pd
import numpy as np

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

tableName = 'table'
columnNames = df.columns.values.tolist()
columnTypes = list(map(lambda x: x.name, df.dtypes.values))  # list() needed on Python 3, where map returns an iterator

# Storing column names and dtypes in a dataframe

tableDef = pd.DataFrame(index = range(len(df.columns) + 1), columns=['cols', 'dtypes'])

tableDef.iloc[0]           = ['index', df.index.dtype.name]
tableDef.loc[1:, 'cols']   = columnNames
tableDef.loc[1:, 'dtypes'] = columnTypes

# Defining a dictionary to convert dtypes

conversion = {'datetime64[ns]':'timestamp with time zone', 'float64':'double precision'}

# Writing sql in a file

f = open(r'yourdir\%s.sql' % tableName, 'w')

f.write('CREATE TABLE %s\n' % tableName)
f.write('(\n')

for i, row in tableDef.iterrows():
    sep = ",\n" if i < tableDef.index[-1] else "\n"
    f.write('\t\"%s\" %s%s' % (row['cols'], conversion[row['dtypes']], sep))

f.write(')')

f.close()

You can do the same thing to populate your table with INSERT INTO statements.
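Following that same pattern, a minimal sketch of the INSERT INTO side (quoting rules kept deliberately simple: the timestamp index is quoted, the floats are left bare):

```python
# Sketch: emit one INSERT per row to a .sql file, mirroring the CREATE step.
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
tableName = 'table'

insert_lines = []
for idx, row in df.iterrows():
    # Quote the timestamp index; leave the float values unquoted
    values = ["'%s'" % idx] + [repr(float(v)) for v in row]
    insert_lines.append('INSERT INTO %s VALUES (%s);' % (tableName, ', '.join(values)))

with open('%s_inserts.sql' % tableName, 'w') as f:
    f.write('\n'.join(insert_lines))
```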

1 Comment

I used your code and got error looks like as "Traceback (most recent call last): File "xxx.py", line 58, in <module> f.write('\t\"%s\" %s%s' % (row['cols'], conversion[row['dtypes']], sep)) KeyError: <map object at 0x000002540ACBA5C0>". Then i wrapped the map() inside the list() and this works like a charm. Thank you for your script.

The solution I used was to send the dataframe to an in-memory database, using SQLite3.

After that, I dump the DB, writing the statements to a .sql file.

... just for the demo, I created an example file:

from datetime import datetime

import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine


# Load Dataset
dataset_name = 'iris'
df = sns.load_dataset(dataset_name)

# Add Name to Index
df.index.name = 'Id'

# Results
df.head()

We create an engine using SQLAlchemy. This connection will be used by pandas to send the data to the temporary in-memory database, and also by SQLite3 to dump the contents of the database.

# Create Engine with SQL Alchemy (used by pandas)
engine = create_engine(f'sqlite://', echo=False)

# Send data to temporary SQLite3
df.to_sql(name=dataset_name, index=True, con=engine, if_exists='replace')

Finally, we indicate the path to the output file and do the iterdump.

# Output file
output_file = f'sql - {dataset_name}.sql'

# Write the header and each dumped statement to the file
with open(output_file, 'w') as f:
    # Date
    data_agora = datetime.today().strftime('%Y.%m.%d %H:%M:%S')

    f.write(
        '/****** Query to create the DB and insert the records ******/\n'
    )
    f.write('/*\n')
    f.write(f'{len(df)} records\n')
    f.write(f'Taken from the "{dataset_name}" table\n')
    f.write('\n')
    f.write(f'Query generated by Michel Metran on {data_agora}\n')
    f.write('*/\n')
    f.write('\r\n')

    # The with block closes the connection for us when it exits
    with engine.connect() as conn:
        for line in conn.connection.iterdump():
            f.write(f'{line}\n')
            print(line)

To make life easier, I created a function inside a package that I maintain, called "traquitanas"; you just need to install the package and use the function:

#!pip3 install traquitanas --upgrade
from traquitanas.data import convert_to_sql

convert_to_sql.convert_dataframe_to_sql(df, output_file, dataset_name)



Taking user @Jaris's post to get the CREATE statement, I extended it further to work for any CSV:

import sqlite3
import pandas as pd

db = './database.db'
csv = './data.csv'
table_name = 'data'

# create db and setup schema
df = pd.read_csv(csv)
create_table_sql = pd.io.sql.get_schema(df.reset_index(), table_name)
conn = sqlite3.connect(db)
c = conn.cursor()
c.execute(create_table_sql)
conn.commit()


# now we can insert data
def insert_data(row, c):
    values = str(row.name)+','+','.join([str('"'+str(v)+'"') for v in row])
    sql_insert=f"INSERT INTO {table_name} VALUES ({values})"

    try:
        c.execute(sql_insert)
    except Exception as e:
        print(f"SQL:{sql_insert} \n failed with Error:{e}")



# use apply to loop over dataframe and call insert_data on each row
df.apply(lambda row: insert_data(row, c), axis=1)

# finally commit all those inserts into the database
conn.commit()

Hopefully this is simpler than the alternative answers, and more Pythonic!



Depending on whether you can forgo generating an intermediate representation of the SQL statement, you can just execute the insert statement outright:

con.executemany("INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)", list(df_.values))

This worked a little better, as there is less messing around with string generation.
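A self-contained sketch of that approach with the stdlib sqlite3 module (the table and data here are invented for the demo). Because the driver binds the parameters, quoting and escaping are not your problem:

```python
# Sketch: let the DB-API driver bind values instead of building SQL strings.
import sqlite3

import pandas as pd

df_ = pd.DataFrame({'A': [1.0, 2.0], 'B': [0.5, 1.5],
                    'C': ['x', 'y'], 'D': [3.0, 4.0]})

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE data (A, B, C, D)')
# itertuples(name=None) yields plain tuples, one per row, for executemany
con.executemany('INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)',
                list(df_.itertuples(index=False, name=None)))
con.commit()
```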
