
Hello, I am trying to read Dataverse data into Databricks. Generally, getting the data over the API works fine.

But converting the data into a PySpark DataFrame throws errors if the Dataverse data includes DecimalType or MoneyType (both being decimal numbers).

I try to read this into the DataFrame as DecimalType(), but I get an error. Below is a made-up example of the data the API returns and my attempt to read it as a DataFrame.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", DecimalType(6, 2), True)
])

df = spark.createDataFrame(data=sample_data, schema=schema)

This throws the following error:

DecimalType(6,2) can not accept object 1234.56 in type float.

Is there any way to make it work? As a workaround, I was able to read it as a StringType and cast it to Decimal after it is in the table:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType
import pyspark.sql.functions as psf

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

# Define the schema, reading Salary as a string for now
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    #StructField("Salary", DecimalType(6, 2), True)
    StructField("Salary", StringType(), True)
])

# Create the DataFrame
df = spark.createDataFrame(data=sample_data, schema=schema)

# Cast the string column to the target decimal type
df = df.withColumn("Salary", psf.col("Salary").cast(DecimalType(6, 2)))

That way works, but it seems like an unneeded step.

And on top of that, the Dataverse API does not return precision and scale for its fields. I would go with the maximum(?) of 18 precision and 10 or 4 scale (Decimal and Money respectively).
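A minimal sketch of that fallback, assuming those maxima: a small lookup keyed by the Dataverse column type. The (18, 10) and (18, 4) pairs below come from my assumption above, not from Dataverse metadata, so adjust them if your environment differs.

```python
# Hypothetical helper: pick (precision, scale) for a Dataverse column type.
# The maxima are assumptions (18 digits total; scale 10 for Decimal,
# 4 for Money) -- verify against your own environment.
DATAVERSE_DECIMAL_PARAMS = {
    "Decimal": (18, 10),
    "Money": (18, 4),
}

def decimal_params(attr_type):
    """Return a (precision, scale) pair to feed into DecimalType(*params)."""
    return DATAVERSE_DECIMAL_PARAMS[attr_type]

print(decimal_params("Money"))  # (18, 4)
```

The schema could then use `DecimalType(*decimal_params("Money"))` instead of hard-coded numbers.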

1 Answer


Yes, DecimalType(6, 2) cannot accept float literals (1234.56) directly, because pyspark.sql.types.DecimalType expects values of type decimal.Decimal, not float.

In your workaround, you read the Salary field as StringType and then cast it to DecimalType, which works because Spark parses the string into a decimal internally.

So you can try the approach below to fix the error.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", DecimalType(6, 2), True)
])

If your data comes from JSON or API responses and is still in float format, convert the floats to Decimal before creating the DataFrame. A small helper function can do the conversion dynamically:

from decimal import Decimal

def convert_floats_to_decimal(records):
    # Convert via str so the Decimal carries the displayed value,
    # not the float's full binary expansion
    return [
        (name, age, Decimal(str(salary)))
        for name, age, salary in records
    ]

df = spark.createDataFrame(data=convert_floats_to_decimal(sample_data), schema=schema)
display(df)

Output:

Name   Age  Salary
Alice  34   1234.56
Bob    45   2345.67
Cathy  29   3456.78
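One detail worth noting: building a Decimal directly from a float preserves the float's binary representation error, while going through str keeps the value as displayed. A quick check outside Spark:

```python
from decimal import Decimal

# A float cannot represent 1234.56 exactly, and Decimal(float) exposes that:
print(Decimal(1234.56))       # long binary expansion, not exactly 1234.56
print(Decimal(str(1234.56)))  # 1234.56
```

That is why the helper above converts via `Decimal(str(salary))` rather than `Decimal(salary)`.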


2 Comments

Is there a performance difference between these two workarounds? Is parsing a string in PySpark or converting the values to Decimal using pure-Python list iteration quicker/more performant?
"converting the values to Decimal using pure-Python list iteration quicker/more performant" -- yes, typically faster overall, because Spark does not need to infer or cast at runtime.
