7

I am having a dataframe, with numbers in European format, which I imported as a String. Comma as decimal and vice versa -

from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import FloatType
df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
|  revenue|
+---------+
|-1.269,75|
+---------+
df.printSchema()
root
 |-- revenue: string (nullable = true)

Output desired: df.show()

+---------+
|  revenue|
+---------+
|-1269.75|
+---------+
df.printSchema()
root
 |-- revenue: float (nullable = true)

I am using function regexp_replace to first replace dot with empty space - then replace comma with empty dot and finally cast into floatType.

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', df['revenue'].cast("float"))

But, when I attempt replacing below, I get empty string. Why?? I was expecting -1269,75.

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
+-------+
|revenue|
+-------+
|       |
+-------+
0

1 Answer 1

16

You need to escape . to match it literally, as . is a special character that matches almost any character in regex:

df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.