
I have queried a table in a database, stored the values in a dataframe and then manipulated them. This is my code for querying the data:

#Setup Connection
con1 <- dbConnect(odbc::odbc(), "XXXX", database="XXXX")
r1 <- dbSendQuery(con1, "
select pcd, oseast1m, osnrth1m from onspd as ons where ons.pcd like 'bt%' and oseast1m != ''
")
result <- dbFetch(r1)

I now want to write the values back from the dataframe to the database with something like:

dbClearResult(r1)
sql <- "
update ons
set ons.oseast1m=?east, os.osnrth1m=?west
from ONS_TEST as ons where ons.Postcode=?post
"
r1 <- dbSendQuery(con1, sqlInterpolate(ANSI(), sql, east = result$oseast1m, west = result$osnrth1m, post = result$pcd))

This gives me an error of "values must be length 1", which obviously won't do what I want, since I need to update many rows.

What is the syntax to run an update? Or do I need to write a for loop to achieve the same thing?

thanks

mike

2 Comments
  • (comment from @saae): Could there be a typo in set ons.oseast1m=?east, os.osnrth1m=?west? Should it say set ons.oseast1m=?east, ons.osnrth1m=?west? Commented Jan 28, 2020 at 16:13
  • sqlInterpolate is meant for simple (length 1) binds. I suggest what you are trying to do is an upsert operation in SQL. It's typically best to upload to a temporary table and then upsert into the original table from that temp table. Each DBMS often has a slightly different mechanism for upserts (whether ON CONFLICT or NOT EXISTS), so it'd be useful to know your DBMS. Commented Jan 28, 2020 at 16:16

1 Answer


I think you have two options: (1) UPSERT, or (2) parameterized queries.

The first has the advantage of speed (often, depending on the DBMS) at the expense of using a DBMS-specific dialect of SQL and a little extra complexity. The second has the advantage of simplicity, but may take longer if you have many rows.

1. UPSERT

Steps: create a temporary table (see notes); upload the data; do an update operation with conflict resolution.

I use temp_table_997 as a temporary table here, but there are many ways to deal with temp tables so that you don't accidentally leave it around. I've found the success of this varies with DBMS, so I'll leave it up to the reader.

DBI::dbExecute(con, "
  CREATE TABLE temp_table_997 AS
  SELECT oseast1m, osnrth1m, Postcode FROM ons LIMIT 0")              # [1,2]
upload <- setNames(result[, c("oseast1m", "osnrth1m", "pcd")],
                   c("oseast1m", "osnrth1m", "Postcode"))             # rename pcd to match the target table
DBI::dbWriteTable(con, "temp_table_997", upload, append = TRUE)       # [3,4]
DBI::dbExecute(con, "
  INSERT INTO ons (oseast1m, osnrth1m, Postcode)
    SELECT oseast1m, osnrth1m, Postcode
    FROM temp_table_997
  ON CONFLICT ( Postcode ) DO
    UPDATE SET oseast1m=EXCLUDED.oseast1m, osnrth1m=EXCLUDED.osnrth1m
")                                                                    # [5]

Notes:

  1. Other answers/articles that employ this technique might use select * ..., though best practice discourages this. It's typically better to be explicit and name just the fields necessary.

  2. I use create table ... as select ... so that column types are preserved. With all of the various numeric types (float, integer, bigint, smallint, even "bit") and other field types, and given that R does not track types at that level of granularity, I find it best to be explicit when uploading data. Using this technique ensures that the types defined in the target table are what actually get used. It might not be necessary in some DBMSes, but I don't think it hurts.

  3. Similar to note 1, you should likely upload only the columns that are needed: the identification field and the fields being updated. If there are fields that never change, there is no reason to waste the bandwidth, and on larger datasets this can have a sizable impact on upload time. (E.g., result[, c("pcd", ...)].)

  4. While the tools and databases I use are smart enough to deal with columns out of order, I don't know that that's the case with every DBMS out there, so it's safest (and easy enough) to keep the order of columns the same.

  5. I'm inferring that Postcode uniquely identifies rows in ons. It doesn't necessarily need to be the primary key (that's a separate discussion), but note that both SQLite and Postgres require a unique constraint or index on the conflict target for ON CONFLICT to work; see the sketch after these notes. If the field is not unique, the query above may affect many more rows than intended.
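
If the ons table does not yet have a uniqueness guarantee on Postcode, a minimal sketch of adding one (assuming the ons/Postcode names above and that the existing values really are unique; the index name idx_ons_postcode is arbitrary) is:

# Sketch: give ON CONFLICT ( Postcode ) a unique index to match against;
# this will error if duplicate Postcode values already exist in ons.
DBI::dbExecute(con, "CREATE UNIQUE INDEX IF NOT EXISTS idx_ons_postcode ON ons (Postcode)")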

This works (for me) on SQLite and Postgres, but the parlance for other DBMS may be the same or very similar.

For SQL Server, you need a slightly different query. (Also realize that SQL Server's dialect has no CREATE TABLE ... AS SELECT; the empty working table above is created with SELECT TOP 0 ... INTO instead, sketched after the MERGE statement below.)

DBI::dbExecute(con, "
  DECLARE @dummy int;
  MERGE ons WITH (HOLDLOCK) as tgt
  USING (SELECT oseast1m, osnrth1m, Postcode FROM temp_table_997) as src
    ON ( tgt.Postcode = src.Postcode )
  WHEN MATCHED THEN
    UPDATE SET tgt.oseast1m=src.oseast1m, tgt.osnrth1m=src.osnrth1m
  WHEN NOT MATCHED THEN
    INSERT (oseast1m, osnrth1m, Postcode) VALUES (src.oseast1m, src.osnrth1m, src.Postcode);
")

If you're using this method, don't forget to clean up:

DBI::dbExecute(con, "drop table temp_table_997")
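
If you wrap the whole operation in a function, one way (a sketch only; upsert_ons is a made-up name) to make sure the working table is dropped even when an intermediate step fails is on.exit():

upsert_ons <- function(con, result) {
  # drop the working table when the function exits, even if an earlier step errors
  on.exit(try(DBI::dbExecute(con, "DROP TABLE temp_table_997"), silent = TRUE), add = TRUE)
  # ... CREATE TABLE, dbWriteTable(), and the INSERT ... ON CONFLICT steps from above ...
}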

2. Binding (parameterized queries)

If it's only a handful of rows or you really don't see time penalties doing it this way, then try this.

res <- DBI::dbSendStatement(con, "
  UPDATE ons
  SET oseast1m=?, osnrth1m=?
  WHERE Postcode=?")
DBI::dbBind(res, result[, c("oseast1m", "osnrth1m", "pcd")]) # order and number of columns must match the placeholders
DBI::dbGetRowsAffected(res)
DBI::dbClearResult(res)

The method of indicating parameters (? above) is solely dependent on the DBMS, not the DBI or odbc packages; you'll need to find the specific method for yours. This might be ?, ?name, $name, or :name; perhaps others exist.
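
For example, a sketch of the same statement with named placeholders (assuming a backend that accepts the :name style; the names east, west, and post are arbitrary):

# Sketch: named placeholders bind from a named list instead of by position.
res <- DBI::dbSendStatement(con, "
  UPDATE ons
  SET oseast1m = :east, osnrth1m = :west
  WHERE Postcode = :post")
DBI::dbBind(res, list(east = result$oseast1m,
                      west = result$osnrth1m,
                      post = result$pcd))
DBI::dbGetRowsAffected(res)
DBI::dbClearResult(res)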

(I admit that this might be just as efficient. I had tried several methods a couple of years ago, and whether due to the driver in use or one version of DBI or even my misunderstanding of things ... it's possible that this is just as efficient as an upsert. I'm not going to test it now, as the difference might only be relevant with larger datasets. YMMV.)


2 Comments

Thanks for the detailed answer and explanation. That was really really helpful. I used the dbBind method which worked exactly as expected. A great result. Thanks for making it so clear!
P.S. the dbBind was very slow... using dbExecute to upload the subset of rows and then doing an UPDATE on the table in question was very fast indeed. Two great solutions. Thanks
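
A sketch of what that comment describes (assuming the temp_table_997 working table from the answer and Postgres-style UPDATE ... FROM; other DBMSes phrase the join slightly differently):

# Sketch: upload the changed rows once, then update the target with a single joined UPDATE.
DBI::dbWriteTable(con, "temp_table_997",
                  setNames(result[, c("oseast1m", "osnrth1m", "pcd")],
                           c("oseast1m", "osnrth1m", "Postcode")),
                  append = TRUE)
DBI::dbExecute(con, "
  UPDATE ons
  SET oseast1m = src.oseast1m, osnrth1m = src.osnrth1m
  FROM temp_table_997 AS src
  WHERE ons.Postcode = src.Postcode")
DBI::dbExecute(con, "DROP TABLE temp_table_997")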
