I have a large dataframe which I need to upload to SQL Server. Due to the volume of data, my code does the insert in batches. However, inserts fail whenever a batch has more than one record in it. The insert works flawlessly with a batch size of 1, but with a batch size of 2 or more, some batches fail.
I have narrowed the problem down to the Null values in some of the columns. I am not sure how to handle the null values in some of the float and integer columns.
Following is a toy dataset that recreates the problem:
import numpy as np
import pandas as pd

data = {
    "unique_id": [
        "String_3", "String_5", "String_10", "String_9", "String_4",
        "String_7", "String_2", "String_6", "String_1", "String_8"
    ],
    "entity_name": [
        "Alice", None, "Eve", "Alice", "Bob",
        "Alice", "Alice", None, "Charlie", "Eve"
    ],
    "entity_address": [
        "456 Elm St", "789 Oak St", "123 Main St", "456 Elm St", None,
        "456 Elm St", "123 Main St", None, "789 Oak St", "789 Oak St"
    ],
    "entity_age_code": [
        763.0, 349.0, np.nan, np.nan, 888.0,
        999.0, 711.0, 574.0, 963.0, 300.0
    ],
    "entity_height": [
        93.357616, 48.408745, 79.978718, 94.953377, 11.094891,
        np.nan, 33.282917, np.nan, 82.714043, np.nan
    ],
    "entity_weight": [
        19.158688, np.nan, 73.853124, 54.005774, 70.846664,
        70.996657, np.nan, 26.325328, 11.360588, 51.324372
    ]
}
df = pd.DataFrame(data)
df['entity_age_code'] = change_dtype(df['entity_age_code'], 'Int64')
np.nan is used for Null values as that's how the data comes in. entity_age_code comes in as a float but is supposed to be an integer.
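(change_dtype is our internal helper; assuming it behaves like Series.astype, the Int64 conversion replaces np.nan with pd.NA, so the frame ends up holding two different missing-value sentinels — nan in the float columns and pd.NA in the integer column:)

```python
import numpy as np
import pandas as pd

s = pd.Series([763.0, 349.0, np.nan])
s_int = s.astype("Int64")      # stand-in for change_dtype(s, 'Int64')
print(s_int.dtype)             # Int64
print(s_int.iloc[2])           # <NA>
print(s_int.iloc[2] is pd.NA)  # True
```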
Here is the code that can be used to recreate the error:
import sys

import pandas as pd
import pyodbc

entity_key = ['unique_id']
entity_table_name = 'DscEntityTable'
batch_size = 3
conn_str = "xxxxxxx"
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
df = df.drop_duplicates()
dtype = infer_sql_server_dtype(df)
cursor.execute(f"SELECT OBJECT_ID('{entity_table_name}', 'U')")
table_exists = cursor.fetchone()[0] is not None
if table_exists:
    key_tuples = list(df[entity_key].dropna().itertuples(index=False, name=None))
    existing_keys = get_existing_keys_in_batches(cursor, entity_table_name, entity_key, key_tuples)
    existing_keys = pd.DataFrame(existing_keys, columns=entity_key)
    for col in entity_key:
        dtype_str = str(df[col].dtype)
        existing_keys[col] = change_dtype(existing_keys[col], dtype_str)
    df = pd.merge(df, existing_keys, how='outer', indicator=True)
    df = df[df['_merge'] == 'left_only'].drop(columns='_merge')
    if df.empty:
        logger.info("No new records to insert.")
        sys.exit("No new records to insert.")
else:
    columns_sql = ", ".join([
        f"[{col}] {dtype[col]} COLLATE Latin1_General_100_CI_AS_SC_UTF8"
        if dtype[col].startswith("VARCHAR") else f"[{col}] {dtype[col]}"
        for col in df.columns
    ])
    pk_sql = ", ".join(f"[{col}]" for col in entity_key)
    create_sql = f"CREATE TABLE {entity_table_name} ({columns_sql}, PRIMARY KEY ({pk_sql}))"
    logger.debug(f"Create table query: {create_sql}")
    cursor.execute(create_sql)
    conn.commit()
    logger.info(f"Table {entity_table_name} created.")
columns = list(df.columns)
placeholders = ", ".join(["?"] * len(columns))
insert_sql = f"INSERT INTO {entity_table_name} ({', '.join(columns)}) VALUES ({placeholders})"
logger.debug(f"Insert data query: {insert_sql}")
written_count = 0
cursor.fast_executemany = True
for batch_num, start in enumerate(range(0, len(df), batch_size)):
    batch = df.iloc[start:start + batch_size]
    values = batch.map(lambda x: None if pd.isna(x) else x).values.tolist()
    logger.debug(f"Batch {batch_num} values: {values[:5]}... (total {len(values)})")
    try:
        cursor.executemany(insert_sql, values)
        conn.commit()
        logger.info(f"Written batch number: {batch_num} (rows {start} to {start + len(batch) - 1})")
        written_count += len(batch)
    except Exception as e:
        logger.error(f"Error writing batch {batch_num} (rows {start} to {start + len(batch) - 1}): {e}")
        continue  # Skip this batch and try the next one
cursor.close()
conn.close()
The code checks whether the table exists. If it already exists, it checks which primary keys from the dataframe are already in the table and excludes those records. If the table doesn't exist, it maps the dataframe's column datatypes to the corresponding SQL Server datatypes and creates the table.
Once the table has been created, it builds the insert statement and inserts the records in batches. cursor.fast_executemany = True and cursor.executemany(insert_sql, values) are used for faster inserts.
I am using values = batch.map(lambda x: None if pd.isna(x) else x).values.tolist() to check for NaN and replace it with None, but it doesn't handle np.nan. I have also tried math.isnan() and np.isnan(), but they throw an error when x is a string.
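(One workaround I am experimenting with, as a sketch: build the parameter rows via itertuples so the values never pass through a NumPy array — .values on a float column coerces None straight back to nan. pd.isna handles np.nan, pd.NA, and None on scalars, and returns False for strings. batch_to_params is a hypothetical helper name:)

```python
import numpy as np
import pandas as pd

def batch_to_params(batch: pd.DataFrame) -> list:
    """Convert a DataFrame slice to pyodbc parameter rows,
    mapping every missing value (np.nan, pd.NA, None) to None.
    itertuples yields per-cell scalars, so None is not coerced
    back to nan the way DataFrame.values does for float columns."""
    return [
        [None if pd.isna(v) else v for v in row]
        for row in batch.itertuples(index=False, name=None)
    ]

# Tiny demo frame mixing a float nan and a nullable-Int64 pd.NA.
demo = pd.DataFrame({
    "unique_id": ["String_5"],
    "entity_age_code": pd.array([pd.NA], dtype="Int64"),
    "entity_weight": [np.nan],
})
print(batch_to_params(demo))  # [['String_5', None, None]]
```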
With a batch size of 3, the following is the error:
DEBUG - sql server dtypes:
unique_id VARCHAR(20)
entity_name VARCHAR(20)
entity_address VARCHAR(20)
entity_age_code BIGINT
entity_height FLOAT
entity_weight FLOAT
DEBUG - Create table query: CREATE TABLE DscEntityTable ([unique_id] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_name] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_address] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_age_code] BIGINT, [entity_height] FLOAT, [entity_weight] FLOAT, PRIMARY KEY ([unique_id]))
INFO - Table DscEntityTable created.
DEBUG - Insert data query: INSERT INTO DscEntityTable (unique_id, entity_name, entity_address, entity_age_code, entity_height, entity_weight) VALUES (?, ?, ?, ?, ?, ?)
DEBUG - Batch 0 values: [['String_3', 'Alice', '456 Elm St', 763.0, 93.357616, 19.158688], ['String_5', None, '789 Oak St', 349.0, 48.408745, nan], ['String_10', 'Eve', '123 Main St', nan, 79.978718, 73.853124]]... (total 3)
ERROR - Error writing batch 0 (rows 0 to 2): ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 6 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values. An example of an invalid value is data of numeric type with scale greater than precision. (8023) (SQLExecute); [42000] [Microsoft][ODBC Driver 17 for SQL Server]Fractional truncation (0)')
DEBUG - Batch 1 values: [['String_9', 'Alice', '456 Elm St', nan, 94.953377, 54.005774], ['String_4', 'Bob', None, 888.0, 11.094891, 70.846664], ['String_7', 'Alice', '456 Elm St', 999.0, nan, 70.996657]]... (total 3)
ERROR - Error writing batch 1 (rows 3 to 5): ('01S07', '[01S07] [Microsoft][ODBC Driver 17 for SQL Server]Fractional truncation (0) (SQLExecute); [01S07] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 5 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values. An example of an invalid value is data of numeric type with scale greater than precision. (8023)')
DEBUG - Batch 2 values: [['String_2', 'Alice', '123 Main St', 711, 33.282917, nan], ['String_6', None, None, 574, nan, 26.325328], ['String_1', 'Charlie', '789 Oak St', 963, 82.714043, 11.360588]]... (total 3)
ERROR - Error writing batch 2 (rows 6 to 8): ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 6 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values. An example of an invalid value is data of numeric type with scale greater than precision. (8023) (SQLExecute)')
DEBUG - Batch 3 values: [['String_8', 'Eve', '789 Oak St', 300, None, 51.324372]]... (total 1)
INFO - Written batch number: 3 (rows 9 to 9, batch size: 1)
If I change the batch size to 1, the records are inserted successfully:
DEBUG - sql server dtypes:
unique_id VARCHAR(20)
entity_name VARCHAR(20)
entity_address VARCHAR(20)
entity_age_code BIGINT
entity_height FLOAT
entity_weight FLOAT
DEBUG - Create table query: CREATE TABLE DscEntityTable ([unique_id] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_name] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_address] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_age_code] BIGINT, [entity_height] FLOAT, [entity_weight] FLOAT, PRIMARY KEY ([unique_id]))
INFO - Table DscEntityTable created.
DEBUG - Insert data query: INSERT INTO DscEntityTable (unique_id, entity_name, entity_address, entity_age_code, entity_height, entity_weight) VALUES (?, ?, ?, ?, ?, ?)
DEBUG - Batch 0 values: [['String_3', 'Alice', '456 Elm St', 763, 93.357616, 19.158688]]... (total 1)
INFO - Written batch number: 0 (rows 0 to 0, batch size: 1)
DEBUG - Batch 1 values: [['String_5', None, '789 Oak St', 349, 48.408745, None]]... (total 1)
INFO - Written batch number: 1 (rows 1 to 1, batch size: 1)
DEBUG - Batch 2 values: [['String_10', 'Eve', '123 Main St', None, 79.978718, 73.853124]]... (total 1)
INFO - Written batch number: 2 (rows 2 to 2, batch size: 1)
DEBUG - Batch 3 values: [['String_9', 'Alice', '456 Elm St', None, 94.953377, 54.005774]]... (total 1)
INFO - Written batch number: 3 (rows 3 to 3, batch size: 1)
DEBUG - Batch 4 values: [['String_4', 'Bob', None, 888, 11.094891, 70.846664]]... (total 1)
INFO - Written batch number: 4 (rows 4 to 4, batch size: 1)
DEBUG - Batch 5 values: [['String_7', 'Alice', '456 Elm St', 999, None, 70.996657]]... (total 1)
INFO - Written batch number: 5 (rows 5 to 5, batch size: 1)
DEBUG - Batch 6 values: [['String_2', 'Alice', '123 Main St', 711, 33.282917, None]]... (total 1)
INFO - Written batch number: 6 (rows 6 to 6, batch size: 1)
DEBUG - Batch 7 values: [['String_6', None, None, 574, None, 26.325328]]... (total 1)
INFO - Written batch number: 7 (rows 7 to 7, batch size: 1)
DEBUG - Batch 8 values: [['String_1', 'Charlie', '789 Oak St', 963, 82.714043, 11.360588]]... (total 1)
INFO - Written batch number: 8 (rows 8 to 8, batch size: 1)
DEBUG - Batch 9 values: [['String_8', 'Eve', '789 Oak St', 300, None, 51.324372]]... (total 1)
INFO - Written batch number: 9 (rows 9 to 9, batch size: 1)
Why would the number of records being inserted make a difference?
My actual data has millions of rows, so inserting records 1 at a time won't work.
Thanks in advance for any help.
The supplied value is not a valid instance of data type float. I don't think it has anything to do with the number of rows; I think it's just that your first row (the one that succeeds on its own) happens to not have nan or None in it. SQL Server and the drivers don't understand keywords like nan... if there isn't a valid float value to pass, pass NULL.
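(A vectorized sketch of this advice, assuming an object cast is acceptable: cast the batch to object first so None survives, then replace every missing value with None before calling executemany — on a float column, assigning None without the cast would be coerced straight back to nan:)

```python
import numpy as np
import pandas as pd

batch = pd.DataFrame({
    "unique_id": ["String_5"],
    "entity_age_code": pd.array([pd.NA], dtype="Int64"),
    "entity_weight": [np.nan],
})

# Object dtype holds arbitrary Python objects, so None is preserved
# where a numeric dtype would silently turn it back into nan.
clean = batch.astype(object).where(batch.notna(), None)
values = clean.values.tolist()
print(values)  # [['String_5', None, None]]
```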