
I am new to PySpark in Databricks, which is why I am struggling with the following: I have a dataframe with columns of datatype string. One example (see image below) is the column netto_resultaat.

The requirement is to change this column and other finance-related columns from string to decimal, because I want to export the schema and data to an Azure SQL database.

I tried the following:

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

newouterjoindffinal = newouterjoindf.withColumn("netto_resultaat", col("netto_resultaat").cast(DecimalType()))
newouterjoindffinal.printSchema()

The result is that the numbers in column netto_resultaat are all converted to null. There are records with values like -100,880, 35,312,000, or 118,579,525.

Any suggestions?

[screenshot of the dataframe showing the netto_resultaat column]

1 Answer

It is because you have commas in your numbers; the cast expects a dot as the decimal separator.

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

df = pd.DataFrame({"netto_resultaat": ["100,800", "10,20", "20,342"]})
df = spark.createDataFrame(df)

df.show()

# output
+---------------+
|netto_resultaat|
+---------------+
|        100,800|
|          10,20|
|         20,342|
+---------------+

You first have to replace the commas with dots and then apply the cast:

(
    df
    .withColumn("netto_resultaat", 
                F.regexp_replace("netto_resultaat", ",", ".")
                .cast(DecimalType())
               )
    .show()
)

# output
+---------------+
|netto_resultaat|
+---------------+
|            101|
|             10|
|             20|
+---------------+

Or replace with an empty string if the commas are thousands separators (I don't know your case exactly):

(
    df
    .withColumn("netto_resultaat", 
                F.regexp_replace("netto_resultaat", ",", "")
                .cast(DecimalType())
               )
    .show()
)

# output
+---------------+
|netto_resultaat|
+---------------+
|         100800|
|           1020|
|          20342|
+---------------+

2 Comments

I got a NameError with description: name 'F' is not defined. What does the "F" before regexp_replace mean?
from pyspark.sql import functions as F. I updated my answer.