
I have a large dataframe that I need to upload to SQL Server. Due to the volume of data, my code does the insert in batches. But I am facing insert failures whenever a batch has more than one record in it. The insert happens flawlessly with a batch size of 1, but with a batch size of 2 or more it fails for some batches.

I have narrowed the problem down to null values in some of the columns. I am not sure how to handle the null values in the float and integer columns.

Following is a toy dataset that recreates the problem:

import numpy as np
import pandas as pd

data = {
    "unique_id": [
        "String_3", "String_5", "String_10", "String_9", "String_4",
        "String_7", "String_2", "String_6", "String_1", "String_8"
    ],
    "entity_name": [
        "Alice", None, "Eve", "Alice", "Bob",
        "Alice", "Alice", None, "Charlie", "Eve"
    ],
    "entity_address": [
        "456 Elm St", "789 Oak St", "123 Main St", "456 Elm St", None,
        "456 Elm St", "123 Main St", None, "789 Oak St", "789 Oak St"
    ],
    "entity_age_code": [
        763.0, 349.0, np.nan, np.nan, 888.0,
        999.0, 711.0, 574.0, 963.0, 300.0
    ],
    "entity_height": [
        93.357616, 48.408745, 79.978718, 94.953377, 11.094891,
        np.nan, 33.282917, np.nan, 82.714043, np.nan
    ],
    "entity_weight": [
        19.158688, np.nan, 73.853124, 54.005774, 70.846664,
        70.996657, np.nan, 26.325328, 11.360588, 51.324372
    ]
}

df = pd.DataFrame(data)
df['entity_age_code'] = change_dtype(df['entity_age_code'], 'Int64')

np.nan is used for null values, as that's how the data comes in. entity_age_code comes in as a float but is supposed to be an integer.
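(change_dtype is a helper from my codebase; to reproduce without it, the conversion can be sketched with pandas' nullable Int64 dtype. This is a hypothetical stand-in, not the original helper.)

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the change_dtype helper: cast the float
# column to pandas' nullable Int64 dtype, which stores missing
# integers as pd.NA instead of np.nan.
toy = pd.DataFrame({"entity_age_code": [763.0, np.nan, 349.0]})
toy["entity_age_code"] = toy["entity_age_code"].astype("Int64")
```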

Here is the code that can be used to recreate the error:

entity_key = ['unique_id']
entity_table_name = 'DscEntityTable'
batch_size = 3
conn_str = "xxxxxxx"
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

df = df.drop_duplicates()
dtype = infer_sql_server_dtype(df)

cursor.execute(f"SELECT OBJECT_ID('{entity_table_name}', 'U')")
table_exists = cursor.fetchone()[0] is not None

if table_exists:
    key_tuples = list(df[entity_key].dropna().itertuples(index=False, name=None))
    existing_keys = get_existing_keys_in_batches(cursor, entity_table_name, entity_key, key_tuples)
    existing_keys = pd.DataFrame(existing_keys, columns=entity_key)
    for col in entity_key:
        dtype_str = str(df[col].dtype)
        existing_keys[col] = change_dtype(existing_keys[col], dtype_str)
    df = pd.merge(df, existing_keys, how='outer', indicator=True)
    df = df[df['_merge'] == 'left_only'].drop(columns='_merge')
    if df.empty:
        logger.info("No new records to insert.")
        sys.exit("No new records to insert.")
else:
    columns_sql = ", ".join([
        f"[{col}] {dtype[col]} COLLATE Latin1_General_100_CI_AS_SC_UTF8" if dtype[col].startswith("VARCHAR") else f"[{col}] {dtype[col]}"
        for col in df.columns
        ])
    pk_sql = ", ".join(f"[{col}]" for col in entity_key)
    create_sql = f"CREATE TABLE {entity_table_name} ({columns_sql}, PRIMARY KEY ({pk_sql}))"
    logger.debug(f"Create table query: {create_sql}")
    cursor.execute(create_sql)
    conn.commit()
    logger.info(f"Table {entity_table_name} created.")

columns = list(df.columns)
placeholders = ", ".join(["?"] * len(columns))
insert_sql = f"INSERT INTO {entity_table_name} ({', '.join(columns)}) VALUES ({placeholders})"
logger.debug(f"Insert data query: {insert_sql}")

written_count = 0
for batch_num, start in enumerate(range(0, len(df), batch_size)):
    batch = df.iloc[start:start + batch_size]
    cursor.fast_executemany = True

    values = batch.map(lambda x: None if (pd.isna(x) ) else x).values.tolist()
    logger.debug(f"Batch {batch_num} values: {values[:5]}... (total {len(values)})")
    try:
        cursor.executemany(insert_sql, values)
        conn.commit()
        logger.info(f"Written batch number: {batch_num} (rows {start} to {start+len(batch)-1})")
        written_count += len(batch)
    except Exception as e:
        logger.error(f"Error writing batch {batch_num} (rows {start} to {start+len(batch)-1}): {e}")
        continue  # Skip this batch and try the next one
cursor.close()
conn.close()

The code checks if the table exists. If it does, it checks whether the primary keys of the dataframe are already in the table and excludes those records. If it doesn't, it maps the dataframe column dtypes to the corresponding SQL Server datatypes and creates the table.

Once the table has been created, the code builds the insert statement and inserts the records in batches. cursor.fast_executemany = True and cursor.executemany(insert_sql, values) are used for faster inserts.

I am using values = batch.map(lambda x: None if (pd.isna(x) ) else x).values.tolist() to check for NaN values and replace them with None, but it doesn't handle np.nan. I have tried math.isnan() and np.isnan, but they throw an error when x is a string.
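To illustrate what I suspect is happening (a minimal sketch, separate from my actual code): a float64 column simply cannot hold a Python None, so pandas coerces it straight back to NaN.

```python
import numpy as np
import pandas as pd

# A float64 Series cannot store a true Python None:
# pandas coerces the None back to NaN on assignment.
s = pd.Series([93.357616, np.nan])
s.iloc[1] = None
second = s.tolist()[1]          # still nan, not None

# Only an object-dtype column can hold a real None.
o = s.astype(object).where(s.notna(), None)
```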

With a batch size of 3 records, the following is the error output:

DEBUG - sql server dtypes:
unique_id   VARCHAR(20)
entity_name VARCHAR(20)
entity_address  VARCHAR(20)
entity_age_code BIGINT
entity_height   FLOAT
entity_weight   FLOAT
DEBUG - Create table query: CREATE TABLE DscEntityTable ([unique_id] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_name] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_address] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_age_code] BIGINT, [entity_height] FLOAT, [entity_weight] FLOAT, PRIMARY KEY ([unique_id]))
INFO - Table DscEntityTable created.
DEBUG - Insert data query: INSERT INTO DscEntityTable (unique_id, entity_name, entity_address, entity_age_code, entity_height, entity_weight) VALUES (?, ?, ?, ?, ?, ?)
DEBUG - Batch 0 values: [['String_3', 'Alice', '456 Elm St', 763.0, 93.357616, 19.158688], ['String_5', None, '789 Oak St', 349.0, 48.408745, nan], ['String_10', 'Eve', '123 Main St', nan, 79.978718, 73.853124]]... (total 3)
ERROR - Error writing batch 0 (rows 0 to 2): ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 6 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values. An example of an invalid value is data of numeric type with scale greater than precision. (8023) (SQLExecute); [42000] [Microsoft][ODBC Driver 17 for SQL Server]Fractional truncation (0)')
DEBUG - Batch 1 values: [['String_9', 'Alice', '456 Elm St', nan, 94.953377, 54.005774], ['String_4', 'Bob', None, 888.0, 11.094891, 70.846664], ['String_7', 'Alice', '456 Elm St', 999.0, nan, 70.996657]]... (total 3)
ERROR - Error writing batch 1 (rows 3 to 5): ('01S07', '[01S07] [Microsoft][ODBC Driver 17 for SQL Server]Fractional truncation (0) (SQLExecute); [01S07] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 5 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values. An example of an invalid value is data of numeric type with scale greater than precision. (8023)')
DEBUG - Batch 2 values: [['String_2', 'Alice', '123 Main St', 711, 33.282917, nan], ['String_6', None, None, 574, nan, 26.325328], ['String_1', 'Charlie', '789 Oak St', 963, 82.714043, 11.360588]]... (total 3)
ERROR - Error writing batch 2 (rows 6 to 8): ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 6 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values. An example of an invalid value is data of numeric type with scale greater than precision. (8023) (SQLExecute)')
DEBUG - Batch 3 values: [['String_8', 'Eve', '789 Oak St', 300, None, 51.324372]]... (total 1)
INFO - Written batch number: 3 (rows 9 to 9, batch size: 1)

If I change the batch size to 1, the records are successfully inserted:

DEBUG - sql server dtypes:
unique_id   VARCHAR(20)
entity_name VARCHAR(20)
entity_address  VARCHAR(20)
entity_age_code BIGINT
entity_height   FLOAT
entity_weight   FLOAT
DEBUG - Create table query: CREATE TABLE DscEntityTable ([unique_id] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_name] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_address] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8, [entity_age_code] BIGINT, [entity_height] FLOAT, [entity_weight] FLOAT, PRIMARY KEY ([unique_id]))
INFO - Table DscEntityTable created.
DEBUG - Insert data query: INSERT INTO DscEntityTable (unique_id, entity_name, entity_address, entity_age_code, entity_height, entity_weight) VALUES (?, ?, ?, ?, ?, ?)
DEBUG - Batch 0 values: [['String_3', 'Alice', '456 Elm St', 763, 93.357616, 19.158688]]... (total 1)
INFO - Written batch number: 0 (rows 0 to 0, batch size: 1)
DEBUG - Batch 1 values: [['String_5', None, '789 Oak St', 349, 48.408745, None]]... (total 1)
INFO - Written batch number: 1 (rows 1 to 1, batch size: 1)
DEBUG - Batch 2 values: [['String_10', 'Eve', '123 Main St', None, 79.978718, 73.853124]]... (total 1)
INFO - Written batch number: 2 (rows 2 to 2, batch size: 1)
DEBUG - Batch 3 values: [['String_9', 'Alice', '456 Elm St', None, 94.953377, 54.005774]]... (total 1)
INFO - Written batch number: 3 (rows 3 to 3, batch size: 1)
DEBUG - Batch 4 values: [['String_4', 'Bob', None, 888, 11.094891, 70.846664]]... (total 1)
INFO - Written batch number: 4 (rows 4 to 4, batch size: 1)
DEBUG - Batch 5 values: [['String_7', 'Alice', '456 Elm St', 999, None, 70.996657]]... (total 1)
INFO - Written batch number: 5 (rows 5 to 5, batch size: 1)
DEBUG - Batch 6 values: [['String_2', 'Alice', '123 Main St', 711, 33.282917, None]]... (total 1)
INFO - Written batch number: 6 (rows 6 to 6, batch size: 1)
DEBUG - Batch 7 values: [['String_6', None, None, 574, None, 26.325328]]... (total 1)
INFO - Written batch number: 7 (rows 7 to 7, batch size: 1)
DEBUG - Batch 8 values: [['String_1', 'Charlie', '789 Oak St', 963, 82.714043, 11.360588]]... (total 1)
INFO - Written batch number: 8 (rows 8 to 8, batch size: 1)
DEBUG - Batch 9 values: [['String_8', 'Eve', '789 Oak St', 300, None, 51.324372]]... (total 1)
INFO - Written batch number: 9 (rows 9 to 9, batch size: 1)

Why would the number of records being inserted make a difference?

My actual data has millions of rows, so inserting records 1 at a time won't work.

Thanks in advance for any help.

  • Did you read the error message? The supplied value is not a valid instance of data type float. I don't think it has anything to do with the number of rows; I think it's just that your first row (the one that succeeds on its own) happens to not have nan or None in it. SQL Server and the drivers don't understand keywords like nan... if there isn't a valid float value to pass, pass NULL. Commented Jul 25 at 13:27
  • @AaronBertrand all the records are inserted without issue if I use a batch size of 1, but if I change the batch size to 2, it gives an error for the same records. Commented Jul 28 at 15:32
  • But surely you can read the error message and see this isn't some magic about how many rows you are inserting, but a problem with the data in those rows. Maybe the former leads to the latter because of your code, but the latter is what you need to solve. Commented Jul 28 at 15:36

1 Answer


While pd.isna() correctly identifies np.nan as a missing value, the way batch.map(lambda x: ...) interacts with mixed data types in a DataFrame, followed by .values.tolist(), does not consistently convert all np.nan instances into Python's None object: when a mapped float column is rebuilt, pandas coerces any None right back to NaN, because a float64 column cannot hold None. pyodbc's fast_executemany is very strict: it expects None for SQL NULL, not np.nan.

For Int64 columns, pandas uses pd.NA for missing values, not np.nan. Your current lambda function should handle pd.NA correctly, but the issue with np.nan in float columns is separate.

The solution, I think, is to ensure that all np.nan (from float columns) and pd.NA (from nullable integer/object columns) are explicitly converted to Python's None object before pyodbc attempts to process the batch.

The expression

batch.replace({np.nan: None, pd.NA: None}).values.tolist() 

should be more robust because it directly targets np.nan values (which are floats) and replaces them with Python's None. This is crucial for your entity_height and entity_weight columns. The pd.NA: None mapping explicitly handles pd.NA values, which are used by pandas' nullable dtypes like Int64.

By performing this replacement directly on the DataFrame batch before converting it with values.tolist(), we ensure that pyodbc receives a list of lists in which every missing value is the standard Python None object, which it can then correctly map to SQL NULL.
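Here is a minimal, self-contained sketch of the fix on a toy batch (column names borrowed from the question; this is illustrative, not the original pipeline):

```python
import numpy as np
import pandas as pd

# Toy batch mirroring the question's columns: a string column, a
# nullable Int64 column with pd.NA, and a float column with np.nan.
batch = pd.DataFrame({
    "unique_id": ["String_5", "String_10"],
    "entity_age_code": pd.array([349, pd.NA], dtype="Int64"),
    "entity_weight": [np.nan, 73.853124],
})

# The suggested fix: replace both flavors of missing value with None.
values = batch.replace({np.nan: None, pd.NA: None}).values.tolist()

# An equivalent alternative: cast to object first, then mask every
# missing cell at once with where().
values2 = batch.astype(object).where(batch.notna(), None).values.tolist()
```

Either way, every missing cell reaches pyodbc as a true Python None, which fast_executemany binds as SQL NULL.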


3 Comments

But then how come the same rows with np.nan are correctly inserted when batch size is 1?
In the example with batch size=1 you do not have the value nan in the data. Try both batch size=1 and batch size=3 with the same data. See your data: ['String_2', 'Alice', '123 Main St', 711, 33.282917, nan] with batch size 3 vs ['String_2', 'Alice', '123 Main St', 711, 33.282917, None] with batch size 1.
This solved my problem, but my question still stands: why is the cursor able to insert a single row even when it has a nan value, but can't do it when the data has multiple rows?
