Replace string in PySpark

Question

I have a DataFrame with numbers in European format, which I imported as strings: the comma is the decimal separator and the dot is the thousands separator.

from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import FloatType
df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
|  revenue|
+---------+
|-1.269,75|
+---------+
df.printSchema()
root
 |-- revenue: string (nullable = true)

Desired output:

df.show()
+---------+
|  revenue|
+---------+
| -1269.75|
+---------+
df.printSchema()
root
 |-- revenue: float (nullable = true)

I am using regexp_replace to first replace the dot with an empty string, then replace the comma with a dot, and finally cast the column to float.

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', df['revenue'].cast("float"))

But when I run only the first replacement below, I get an empty string. Why? I was expecting -1269,75.

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
+-------+
|revenue|
+-------+
|       |
+-------+

akuiper · Accepted Answer · 2018-10-31 16:54:27Z

16

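You need to escape . to match it literally, because an unescaped . is a regex metacharacter that matches almost any character. That is why your pattern replaced every character and left an empty string. Escape it like this: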

df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
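To see the difference between the two patterns without starting a Spark session, here is a minimal sketch that uses Python's standard re module as a stand-in for the Java regex engine behind regexp_replace. That substitution is an assumption made for illustration; the two engines treat . and \. the same way for this input. The variable names are hypothetical.

```python
import re

# Sample value from the question (European format, stored as a string).
value = "-1.269,75"

# Unescaped dot: matches every character, so the whole string is removed.
unescaped = re.sub(r".", "", value)

# Escaped dot: matches only the literal thousands separator.
no_thousands = re.sub(r"\.", "", value)

# Swap the decimal comma for a dot, then parse as a float.
normalized = no_thousands.replace(",", ".")
revenue = float(normalized)

print(repr(unescaped))  # ''
print(no_thousands)     # -1269,75
print(revenue)          # -1269.75
```

In the PySpark call, the pattern is written as "\\." in an ordinary Python string (or r"\." as a raw string) so that the regex engine receives a backslash followed by a dot.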
