Hello, I am trying to read Dataverse data into Databricks. Getting the data over the API generally works fine.
But converting the data into a PySpark DataFrame throws errors whenever the Dataverse data includes DecimalType or MoneyType columns (both are decimal numbers).
I try to read these into the DataFrame as DecimalType(), but I get an error. Below is a made-up example of the data the API returns and my attempt to read it into a DataFrame:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

sample_data = [
    ("Alice", 34, 1234.56),
    ("Bob", 45, 2345.67),
    ("Cathy", 29, 3456.78)
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", DecimalType(6, 2), True)
])

df = spark.createDataFrame(data=sample_data, schema=schema)
This throws the following error:
DecimalType(6,2) can not accept object 1234.56 in type float.
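From the error message, it seems createDataFrame only accepts Python decimal.Decimal objects (not floats) for a DecimalType column. Presumably I could convert every value up front, something like this (untested sketch against the same made-up data):

from decimal import Decimal

# Re-wrap the float values as decimal.Decimal (via str to avoid float artifacts)
decimal_data = [(name, age, Decimal(str(salary))) for name, age, salary in sample_data]

df = spark.createDataFrame(data=decimal_data, schema=schema)  # same schema as above

But that means touching every row in Python before Spark ever sees the data, which feels wrong for larger result sets.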
Is there a cleaner way to make this work? As a workaround, I was able to read the column as a StringType and cast it to Decimal once it is in the table:
import pyspark.sql.functions as psf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

# Salary kept as strings here so the StringType schema accepts the values
sample_data = [
    ("Alice", 34, "1234.56"),
    ("Bob", 45, "2345.67"),
    ("Cathy", 29, "3456.78")
]

# Define the schema, reading the decimal column as a string for now
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    # StructField("Salary", DecimalType(6, 2), True)  # what I actually want
    StructField("Salary", StringType(), True)
])

# Create the DataFrame, then cast the string column to the decimal type
df = spark.createDataFrame(data=sample_data, schema=schema)
df = df.withColumn("Salary", psf.col("Salary").cast(DecimalType(6, 2)))
That works, but it seems like an unnecessary extra step.
On top of that, the Dataverse API does not return precision and scale for its fields, so I would presumably have to go with the maximum(?): precision 18, with scale 10 for Decimal and 4 for Money respectively.
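In code, that guess would look something like this (the column names are hypothetical, and the 18/10 and 18/4 figures are my assumption, not anything the API reports):

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Assumed maximums, since the API reports neither precision nor scale
DATAVERSE_DECIMAL = DecimalType(18, 10)  # for Dataverse Decimal columns
DATAVERSE_MONEY = DecimalType(18, 4)     # for Dataverse Money columns

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Discount", DATAVERSE_DECIMAL, True),  # hypothetical Decimal field
    StructField("Salary", DATAVERSE_MONEY, True)       # hypothetical Money field
])

Is that a sensible default, or is there a way to get the real precision and scale per column?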
