Pyspark replace characters in DF column and cast as float

Question

Any ideas on this one in Pyspark?

I have salaries like the ones below in the Salary column. I've tried to remove the $ with regexp_replace:

df = df.withColumn('clean_salary', regexp_replace(col("Salary"), '$', ''))
df.show()

It doesn't do anything, as you can see - any ideas why?

Thanks

+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+------------+
| id|first_name| last_name|gender|           City|           Job Title|   Salary|  Latitude|  Longitude|clean_salary|
+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+------------+
|  1|   Melinde| Shilburne|Female|      Nowa Ruda| Assistant Professor|$57438.18|50.5774075| 16.4967184|   $57438.18|
|  2|  Kimberly|Von Welden|Female|         Bulgan|       Programmer II|$62846.60|48.8231572|103.5218199|   $62846.60|
|  3|    Alvera|  Di Boldi|Female|           null|                null|$57576.52|39.9947462|116.3397725|   $57576.52|
|  4|   Shannon| O'Griffin|  Male|  Divnomorskoye|Budget/Accounting...|$61489.23|44.5047212| 38.1300171|   $61489.23|
|  5|  Sherwood|   Macieja|  Male|      Mytishchi|            VP Sales|$63863.09|      null| 37.6489954|   $63863.09|
|  6|     Maris|      Folk|Female|Kinsealy-Drinan|      Civil Engineer|$30101.16|53.4266145| -6.1644997|   $30101.16|
|  7|     Masha|    Divers|Female|         Dachun|                null|$25090.87| 24.879416| 118.930111|   $25090.87|
|  8|   Goddart|     Flear|  Male|      Trélissac|Desktop Support T...|$46116.36|45.1905186|  0.7423124|   $46116.36|
|  9|      Roth|O'Cannavan|  Male|         Heitan|VP Product Manage...|$73697.10| 32.027934| 106.657113|   $73697.10|

Bala · Accepted Answer · 2020-04-12 21:48:45Z

1

Rather than using a regex, it's easier to just drop the first character, as long as every value in the salary column starts with a single $ and is otherwise a plain number.

>>> df = sc.parallelize([('$123',),('$873',)]).toDF(['salary'])
>>> df.show()
+------+
|salary|
+------+
|  $123|
|  $873|
+------+

>>> df.select(df.salary.substr(2,100).cast('float').alias('salary')).show() #Float
+------+
|salary|
+------+
| 123.0|
| 873.0|
+------+

>>> df.select(df.salary.substr(2,100).cast('decimal(10,2)').alias('salary')).show() #Decimal
+------+
|salary|
+------+
|123.00|
|873.00|
+------+

edited Apr 12, 2020 at 21:48

answered Apr 12, 2020 at 16:48

Bala


kikee1222 Over a year ago

Thanks for all your help!

dassum · Accepted Answer · 2020-04-12 15:06:10Z

0

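In a regex, an unescaped $ is an end-of-string anchor, so the pattern '$' matches an empty position rather than the dollar sign, and nothing gets replaced. Escape it or put it in a character class so it is treated literally:

from pyspark.sql.functions import col, regexp_replace

# Raw string so Python passes the backslash through; [\$] matches a literal dollar sign
updatedDF = df.withColumn('clean_salary', regexp_replace(col("Salary"), r"[\$]", ""))
updatedDF.show()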
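The difference between an unescaped and an escaped $ can be checked without Spark. The sketch below uses Python's re module as a stand-in for the Java regex engine that Spark uses; treating the two as equivalent is an assumption, though for this simple pattern both treat a bare $ as an anchor.

```python
import re

salary = "$57438.18"

# Unescaped: `$` anchors at the end of the input, so it matches an empty
# position and the replacement removes nothing.
unescaped = re.sub("$", "", salary)

# Escaped: r"\$" matches a literal dollar sign.
escaped = re.sub(r"\$", "", salary)

# A character class also treats `$` as a literal.
in_class = re.sub("[$]", "", salary)

print(unescaped)
print(escaped)
print(float(in_class))
```

The first result is the unchanged input, which is the behaviour seen in the question's clean_salary column.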

answered Apr 12, 2020 at 15:06

dassum
