Executing an SQL update statement from a pandas dataframe

Question

Context: I am using MSSQL, pandas, and pyodbc.

Steps:

Obtain dataframe from query using pyodbc (no problemo)
Process columns to generate the context of a new (but already existing) column
Fill an auxilliary column with UPDATE statements (i.e. UPDATE t SET t.value = df.value FROM dbo.table t where t.ID = df.ID)

Now how do I execute the sql code in the auxilliary column, without looping through each row?

sample data

The first two columns are obtained by querying dbo.table, the third columns exists but is empty in the database. The fourth column only exists in the dataframe to prepare the SQL statement that would correspond to updating dbo.table

ID	raw	processed	strSQL
1	[email protected]	lorum ipsum	UPDATE t SET t.processed = 'lorum ipsum' FROM dbo.table t WHERE t.ID = 1
2	[email protected]	rumlo sumip	UPDATE t SET t.processed = 'rumlo sumip' FROM dbo.table t WHERE t.ID = 2
3	...	...	...

I would like to execute the SQL script in each row in an efficient manner.

can you porvide sample data. your question will be easy to understand — Vivs
– Vivs, Commented May 16, 2021 at 14:29
Looping through the DataFrame rows and issuing individual UPDATE statements for each table row will be very inefficient. It would be much faster to dump the two columns to a list of tuples as [(processed, id),(processed, id),…] (note the order of the columns) and then use crsr.executemany("UPDATE t SET t.processed=? WHERE t.ID=?", my_list_of_tuples). Be sure to call crsr.fast_executemany = True first for best performance. — Gord Thompson
– Gord Thompson, Commented May 16, 2021 at 14:58
For even better performance, use a Table valued Parameter, or create a temporary table and Bulk Copy into it. Then do a single joined update — Charlieface
– Charlieface, Commented May 16, 2021 at 15:20
See stackoverflow.com/questions/51930062/… for example. A TVP is basically a readonly table variable that you can use to do a joined update. There is no need to loop anything, just a single update t ... from yourtable t join @yourTVP p on ... — Charlieface
– Charlieface, Commented May 16, 2021 at 15:44

Gord Thompson · Accepted Answer · 2021-05-18 10:51:17Z

After I recommended .executemany() in a comment to the question, a subsequent comment from @Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.

For an existing table named MillionRows

ID  TextField
--  ---------
 1  foo
 2  bar
 3  baz
…

and example data of the form

num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]

my test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True

crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)

took about 180 seconds (3 minutes).

However, by creating a user-defined table type

CREATE TYPE dbo.TextField_ID AS TABLE 
(
    TextField nvarchar(255) NULL, 
    ID int NOT NULL, 
    PRIMARY KEY (ID)
)

and a stored procedure

CREATE PROCEDURE [dbo].[mr_update]
@tbl dbo.TextField_ID READONLY
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE MillionRows SET TextField = t.TextField
    FROM MillionRows mr INNER JOIN @tbl t ON mr.ID = t.ID
END

when I used

crsr.execute("{CALL mr_update (?)}", (rows,))

it did the same update in approximately 80 seconds (less than half the time).

Collectives™ on Stack Overflow

Executing an SQL update statement from a pandas dataframe

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related