
I am loading data from various sources (CSV, XLS, JSON, etc.) into pandas DataFrames, and I would like to generate statements to create and fill a SQL database with this data. Does anyone know of a way to do this?

I know pandas has a to_sql function, but that only works on a database connection; it cannot generate a string.

Example

What I would like is to take a dataframe like so:

import pandas as pd
import numpy as np

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

And a function that would generate this (this example is PostgreSQL but any would be fine):

CREATE TABLE data
(
  index timestamp with time zone,
  "A" double precision,
  "B" double precision,
  "C" double precision,
  "D" double precision
)

9 Answers


If you only want the 'CREATE TABLE' SQL code (and not the insert of the data), you can use the get_schema function of the pandas.io.sql module:

In [10]: print(pd.io.sql.get_schema(df.reset_index(), 'data'))
CREATE TABLE "data" (
  "index" TIMESTAMP,
  "A" REAL,
  "B" REAL,
  "C" REAL,
  "D" REAL
)

Some notes:

  • I had to use reset_index because otherwise the index was not included
  • If you provide an SQLAlchemy engine of a certain database flavor, the result will be adjusted to that flavor (e.g. the data type names).
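To illustrate the second note, a minimal sketch: you can pass a live connection via the con= keyword so get_schema emits that flavor's type names. This uses the stdlib sqlite3 module for simplicity; a SQLAlchemy engine pointed at PostgreSQL etc. works the same way.

```python
# Sketch: get_schema adjusts type names to the connection's flavor.
import sqlite3

import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# An in-memory SQLite connection stands in for a real database here
con = sqlite3.connect(':memory:')
ddl = pd.io.sql.get_schema(df.reset_index(), 'data', con=con)
print(ddl)
```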

3 Comments

Is it possible to get the insert data, in order to do an update/insert?
@joris how does one specify the sqlalchemy engine?
I see RustyShackleford has already found it, but here is how you specify the engine: stackoverflow.com/a/51294670/3279262

GENERATE SQL CREATE STATEMENT FROM DATAFRAME

SOURCE = df
TARGET = data

def SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    # SOURCE: source dataframe
    # TARGET: name of the table to be created in the database
    import pandas as pd
    sql_text = pd.io.sql.get_schema(SOURCE.reset_index(), TARGET)
    return sql_text

Check the SQL CREATE TABLE statement string:

print(sql_text)

GENERATE SQL INSERT STATEMENT FROM DATAFRAME

def SQL_INSERT_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    sql_texts = []
    for index, row in SOURCE.iterrows():       
        sql_texts.append('INSERT INTO '+TARGET+' ('+ str(', '.join(SOURCE.columns))+ ') VALUES '+ str(tuple(row.values)))        
    return sql_texts

Check the SQL INSERT INTO statement string:

print('\n\n'.join(sql_texts))
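The per-row function above emits nan literally and one statement per row. A hedged sketch (the helper name sql_insert_batches is mine, not from the answer) that instead chunks rows into multi-row INSERT statements and maps NaN/None to SQL NULL:

```python
# Sketch: batched multi-row INSERTs with NaN/None -> NULL handling.
import math

import pandas as pd

def sql_insert_batches(df, target, batch_size=1000):
    """Yield multi-row INSERT statements, at most batch_size rows each."""
    cols = ', '.join(df.columns)

    def render(value):
        # NaN/None become NULL; strings are quoted with '' escaping;
        # numbers pass through unquoted
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return 'NULL'
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)

    rows = ['(' + ', '.join(render(v) for v in row) + ')'
            for row in df.itertuples(index=False, name=None)]
    for i in range(0, len(rows), batch_size):
        chunk = ', '.join(rows[i:i + batch_size])
        yield f'INSERT INTO {target} ({cols}) VALUES {chunk};'
```

Usage would look like `for stmt in sql_insert_batches(df, 'data'): ...`; the quoting here is deliberately simple and not a substitute for driver-side parameter binding.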

3 Comments

Works just fine. What is the best way to create an update statement based on this solution?
I tried this insert statement, but it seems it does not handle NULLs. The generated insert statement had nan for NULL values, and my query tried to insert nan into these fields.
Probably would be way more efficient to batch this up to the max sql query size, right?

Insert Statement Solution

Not sure if this is the absolute best way to do it, but it is more efficient than using df.iterrows(), which is very slow. It also takes care of nan values with the help of regular expressions.

import re

def get_insert_query_from_df(df, dest_table):

    insert = """
    INSERT INTO `{dest_table}` (
        """.format(dest_table=dest_table)

    columns_string = str(list(df.columns))[1:-1]
    columns_string = re.sub(r' ', '\n        ', columns_string)
    columns_string = re.sub(r'\'', '', columns_string)

    values_string = ''

    for row in df.itertuples(index=False,name=None):
        values_string += re.sub(r'nan', 'null', str(row))
        values_string += ',\n'

    return insert + columns_string + ')\n     VALUES\n' + values_string[:-2] + ';'

4 Comments

what is re supposed to be?
@Rainb re stands for "Regular Expression". You can import Python's regular expression library with the command import re
@eitanlees shouldn't that be included in the answer?
@Rainb I think it would be good to include that in the answer. I will edit hunterm's answer :)

If you're just looking to generate a string with inserts based on a pandas.DataFrame, I'd suggest using the bulk SQL insert syntax, as suggested by @rup.

Here's an example of a function I wrote for that purpose:

import pandas as pd
import re


def df_to_sql_bulk_insert(df: pd.DataFrame, table: str, **kwargs) -> str:
    """Converts DataFrame to bulk INSERT sql query
    >>> data = [(1, "_suffixnan", 1), (2, "Noneprefix", 0), (3, "fooNULLbar", 1, 2.34)]
    >>> df = pd.DataFrame(data, columns=["id", "name", "is_deleted", "balance"])
    >>> df
       id        name  is_deleted  balance
    0   1  _suffixnan           1      NaN
    1   2  Noneprefix           0      NaN
    2   3  fooNULLbar           1     2.34
    >>> query = df_to_sql_bulk_insert(df, "users", status="APPROVED", address=None)
    >>> print(query)
    INSERT INTO users (id, name, is_deleted, balance, status, address)
    VALUES (1, '_suffixnan', 1, NULL, 'APPROVED', NULL),
           (2, 'Noneprefix', 0, NULL, 'APPROVED', NULL),
           (3, 'fooNULLbar', 1, 2.34, 'APPROVED', NULL);
    """
    df = df.copy().assign(**kwargs)
    columns = ", ".join(df.columns)
    tuples = map(str, df.itertuples(index=False, name=None))
    values = re.sub(r"(?<=\W)(nan|None)(?=\W)", "NULL", (",\n" + " " * 7).join(tuples))
    return f"INSERT INTO {table} ({columns})\nVALUES {values};"

By the way, it converts nan/None entries to NULL and it's possible to pass constant column=value pairs as keyword arguments (see status="APPROVED" and address=None arguments in docstring example).

Generally, it works faster, since any database does a lot of work for a single insert: checking constraints, building indices, flushing, writing to the log, etc. These complex operations can be optimized by the database when doing a several-in-one operation, rather than calling the engine row by row.

2 Comments

The regex seems really slow for large queries.
Not sure if spaces count towards max SQL statement length, but seems like adding the 7 blank spaces is not needed if we're not going to be looking at the query.

SINGLE INSERT QUERY SOLUTION

The above answers didn't suit my needs. I wanted to create one single insert statement for a dataframe, with each row as the values. This can be achieved as below:

import re 
import pandas as pd 

table = 'your_table_name_here'

# You can read from CSV file here... just using read_sql_query as an example

df = pd.read_sql_query(f'select * from {table}', con=db_connection)


cols = ', '.join(df.columns.to_list()) 
vals = []

for index, r in df.iterrows():
    row = []
    for x in r:
        row.append(f"'{str(x)}'")

    row_str = ', '.join(row)
    vals.append(row_str)

f_values = [] 
for v in vals:
    f_values.append(f'({v})')

# Handle inputting NULL values
f_values = ', '.join(f_values) 
f_values = re.sub(r"('None')", "NULL", f_values)

sql = f"insert into {table} ({cols}) values {f_values};" 

print(sql)

db_connection.dispose()  # dispose of the engine when done (assumes db_connection is a SQLAlchemy engine)



If you want to write the file yourself, you can also retrieve the column names and dtypes and build a dictionary to convert pandas data types to SQL data types.

As an example:

import pandas as pd
import numpy as np

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

tableName = 'table'
columnNames = df.columns.values.tolist()
columnTypes = list(map(lambda x: x.name, df.dtypes.values))  # list() needed on Python 3, where map returns an iterator

# Storing column names and dtypes in a dataframe

tableDef = pd.DataFrame(index = range(len(df.columns) + 1), columns=['cols', 'dtypes'])

tableDef.iloc[0]           = ['index', df.index.dtype.name]
tableDef.loc[1:, 'cols']   = columnNames
tableDef.loc[1:, 'dtypes'] = columnTypes

# Defining a dictionary to convert dtypes

conversion = {'datetime64[ns]':'timestamp with time zone', 'float64':'double precision'}

# Writing sql in a file

f = open(r'yourdir\%s.sql' % tableName, 'w')

f.write('CREATE TABLE %s\n' % tableName)
f.write('(\n')

for i, row in tableDef.iterrows():
    sep = ",\n" if i < tableDef.index[-1] else "\n"
    f.write('\t\"%s\" %s%s' % (row['cols'], conversion[row['dtypes']], sep))

f.write(')')

f.close()

You can do the same thing to populate your table with INSERT INTO statements.
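Following that same pattern, a minimal sketch of the INSERT INTO side (quoting rules kept deliberately simple: the timestamp index is quoted, the floats are left bare):

```python
# Sketch: emit one INSERT per row to a .sql file, mirroring the CREATE step.
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
tableName = 'table'

insert_lines = []
for idx, row in df.iterrows():
    # Quote the timestamp index; leave the float values unquoted
    values = ["'%s'" % idx] + [repr(float(v)) for v in row]
    insert_lines.append('INSERT INTO %s VALUES (%s);' % (tableName, ', '.join(values)))

with open('%s_inserts.sql' % tableName, 'w') as f:
    f.write('\n'.join(insert_lines))
```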

1 Comment

I used your code and got error looks like as "Traceback (most recent call last): File "xxx.py", line 58, in <module> f.write('\t\"%s\" %s%s' % (row['cols'], conversion[row['dtypes']], sep)) KeyError: <map object at 0x000002540ACBA5C0>". Then i wrapped the map() inside the list() and this works like a charm. Thank you for your script.

The solution I used was to send the dataframe to an in-memory database, using SQLite3.

After that, I dump the DB, writing the statements to a .sql file.

... just for the demo, I created an example file:

from datetime import datetime

import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine


# Load Dataset
dataset_name = 'iris'
df = sns.load_dataset(dataset_name)

# Add Name to Index
df.index.name = 'Id'

# Results
df.head()

We create an engine using SQLAlchemy. This connection will be used by pandas to send the data to the temporary in-memory database, and also by SQLite3 to dump the contents of the database.

# Create Engine with SQL Alchemy (used by pandas)
engine = create_engine(f'sqlite://', echo=False)

# Send data to temporary SQLite3
df.to_sql(name=dataset_name, index=True, con=engine, if_exists='replace')

Finally, we indicate the path to the output file and do the iterdump.

# Output file
output_file = f'sql - {dataset_name}.sql'

# Write the header and each dumped statement to the file
with open(output_file, 'w') as f:
    # Date
    data_agora = datetime.today().strftime('%Y.%m.%d %H:%M:%S')

    f.write(
        '/****** Query to create the DB and insert the records ******/\n'
    )
    f.write('/*\n')
    f.write(f'{len(df)} records\n')
    f.write(f'Taken from the "{dataset_name}" table\n')
    f.write('\n')
    f.write(f'Query generated by Michel Metran on {data_agora}\n')
    f.write('*/\n')
    f.write('\r\n')

    # The with block closes the connection for us when it exits
    with engine.connect() as conn:
        for line in conn.connection.iterdump():
            f.write(f'{line}\n')
            print(line)

To make life easier, I created a function inside a package that I maintain, called "traquitanas"; you just need to install the package and use the function:

#!pip3 install traquitanas --upgrade
from traquitanas.data import convert_to_sql

convert_to_sql.convert_dataframe_to_sql(df, output_file, dataset_name)



Taking user @Jaris's post to get the CREATE statement, I extended it further to work for any CSV:

import sqlite3
import pandas as pd

db = './database.db'
csv = './data.csv'
table_name = 'data'

# create db and setup schema
df = pd.read_csv(csv)
create_table_sql = pd.io.sql.get_schema(df.reset_index(), table_name)
conn = sqlite3.connect(db)
c = conn.cursor()
c.execute(create_table_sql)
conn.commit()


# now we can insert data
def insert_data(row, c):
    values = str(row.name)+','+','.join([str('"'+str(v)+'"') for v in row])
    sql_insert=f"INSERT INTO {table_name} VALUES ({values})"

    try:
        c.execute(sql_insert)
    except Exception as e:
        print(f"SQL:{sql_insert} \n failed with Error:{e}")



# use apply to loop over dataframe and call insert_data on each row
df.apply(lambda row: insert_data(row, c), axis=1)

# finally commit all those inserts into the database
conn.commit()

Hopefully this is simpler than the alternative answers, and more Pythonic!



Depending on whether you can forgo generating an intermediate representation of the SQL statement, you can just execute the insert statement outright:

con.executemany("INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)", list(df_.values))

This worked a little better, as there is less messing around with string generation.
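A self-contained sketch of that approach with the stdlib sqlite3 module (the table and data here are invented for the demo). Because the driver binds the parameters, quoting and escaping are not your problem:

```python
# Sketch: let the DB-API driver bind values instead of building SQL strings.
import sqlite3

import pandas as pd

df_ = pd.DataFrame({'A': [1.0, 2.0], 'B': [0.5, 1.5],
                    'C': ['x', 'y'], 'D': [3.0, 4.0]})

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE data (A, B, C, D)')
# itertuples(name=None) yields plain tuples, one per row, for executemany
con.executemany('INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)',
                list(df_.itertuples(index=False, name=None)))
con.commit()
```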
