
Hello, I am trying to read Dataverse data into Databricks. Generally, getting the data over the API works fine.

But converting the data into a PySpark DataFrame throws errors if the Dataverse data includes DecimalType or MoneyType (both being decimal numbers).

I try to read this into the DataFrame as DecimalType(), but I get an error. Below is a made-up example of the data the API returns and my attempt to read it as a DataFrame.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", DecimalType(6, 2), True)
])

df = spark.createDataFrame(data=sample_data, schema=schema)

This throws the following error:

DecimalType(6,2) can not accept object 1234.56 in type float.

Is there any way to make it work? As a workaround, I was able to read it as a StringType and cast it to Decimal after it is in the table:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType
import pyspark.sql.functions as psf

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

# Define the schema, reading Salary as a string for now
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    #StructField("Salary", DecimalType(6, 2), True)
    StructField("Salary", StringType(), True)
])

# Create the DataFrame
df = spark.createDataFrame(data=sample_data, schema=schema)

# Cast the string column to the target decimal type
df = df.withColumn("Salary", psf.col("Salary").cast(DecimalType(6, 2)))

That way works, but it seems like an unneeded step.

And on top of that, the Dataverse API does not return precision and scale for its fields. I would go with the maximum(?) of 18 precision and 10 or 4 scale (Decimal and Money respectively).
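A minimal sketch of that fallback, assuming those maxima: a small lookup keyed by the Dataverse column type. The (18, 10) and (18, 4) pairs below come from my assumption above, not from Dataverse metadata, so adjust them if your environment differs.

```python
# Hypothetical helper: pick (precision, scale) for a Dataverse column type.
# The maxima are assumptions (18 digits total; scale 10 for Decimal,
# 4 for Money) -- verify against your own environment.
DATAVERSE_DECIMAL_PARAMS = {
    "Decimal": (18, 10),
    "Money": (18, 4),
}

def decimal_params(attr_type):
    """Return a (precision, scale) pair to feed into DecimalType(*params)."""
    return DATAVERSE_DECIMAL_PARAMS[attr_type]

print(decimal_params("Money"))  # (18, 4)
```

The schema could then use `DecimalType(*decimal_params("Money"))` instead of hard-coded numbers.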

1 Answer


Yes, DecimalType(6, 2) cannot accept float literals (1234.56) directly, because pyspark.sql.types.DecimalType expects values of type decimal.Decimal, not float.

In your workaround, you read the Salary field as StringType and then cast it to DecimalType, which works because Spark parses the string into a decimal internally.

So you can try the approach below to fix the error.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", DecimalType(6, 2), True)
])

If your data comes from JSON or API responses and is still in float format, convert the floats to Decimal before creating the DataFrame. A small helper function can do the conversion dynamically:

from decimal import Decimal

def convert_floats_to_decimal(records):
    # Convert via str so the Decimal carries the displayed value,
    # not the float's full binary expansion
    return [
        (name, age, Decimal(str(salary)))
        for name, age, salary in records
    ]

df = spark.createDataFrame(data=convert_floats_to_decimal(sample_data), schema=schema)
display(df)

Output:

Name   Age  Salary
Alice  34   1234.56
Bob    45   2345.67
Cathy  29   3456.78
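One detail worth noting: building a Decimal directly from a float preserves the float's binary representation error, while going through str keeps the value as displayed. A quick check outside Spark:

```python
from decimal import Decimal

# A float cannot represent 1234.56 exactly, and Decimal(float) exposes that:
print(Decimal(1234.56))       # long binary expansion, not exactly 1234.56
print(Decimal(str(1234.56)))  # 1234.56
```

That is why the helper above converts via `Decimal(str(salary))` rather than `Decimal(salary)`.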


2 Comments

Is there a performance difference between these two workarounds? Is parsing a string in PySpark or converting the values to Decimal using pure-Python list iteration quicker/more performant?
"converting the values to Decimal using pure-Python list iteration quicker/more performant" -- yes, typically faster overall, because Spark does not need to infer or cast at runtime.
