Any ideas on this one in PySpark?

I have salaries like the ones below in the Salary column. I've tried to remove the $ with regexp_replace:

df = df.withColumn('clean_salary', regexp_replace(col("Salary"), '$', ''))
df.show()

It doesn't do anything, as you can see in the output below. Any ideas why?

Thanks

+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+------------+
| id|first_name| last_name|gender|           City|           Job Title|   Salary|  Latitude|  Longitude|clean_salary|
+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+------------+
|  1|   Melinde| Shilburne|Female|      Nowa Ruda| Assistant Professor|$57438.18|50.5774075| 16.4967184|   $57438.18|
|  2|  Kimberly|Von Welden|Female|         Bulgan|       Programmer II|$62846.60|48.8231572|103.5218199|   $62846.60|
|  3|    Alvera|  Di Boldi|Female|           null|                null|$57576.52|39.9947462|116.3397725|   $57576.52|
|  4|   Shannon| O'Griffin|  Male|  Divnomorskoye|Budget/Accounting...|$61489.23|44.5047212| 38.1300171|   $61489.23|
|  5|  Sherwood|   Macieja|  Male|      Mytishchi|            VP Sales|$63863.09|      null| 37.6489954|   $63863.09|
|  6|     Maris|      Folk|Female|Kinsealy-Drinan|      Civil Engineer|$30101.16|53.4266145| -6.1644997|   $30101.16|
|  7|     Masha|    Divers|Female|         Dachun|                null|$25090.87| 24.879416| 118.930111|   $25090.87|
|  8|   Goddart|     Flear|  Male|      Trélissac|Desktop Support T...|$46116.36|45.1905186|  0.7423124|   $46116.36|
|  9|      Roth|O'Cannavan|  Male|         Heitan|VP Product Manage...|$73697.10| 32.027934| 106.657113|   $73697.10|

2 Answers

Rather than using a regex, it's easier to just drop the first character (as long as every value in the salary column starts with a single $ and is otherwise a plain number).

>>> df = sc.parallelize([('$123',),('$873',)]).toDF(['salary'])
>>> df.show()
+------+
|salary|
+------+
|  $123|
|  $873|
+------+

>>> df.select(df.salary.substr(2,100).cast('float').alias('salary')).show() #Float
+------+
|salary|
+------+
| 123.0|
| 873.0|
+------+

>>> df.select(df.salary.substr(2,100).cast('decimal(10,2)').alias('salary')).show() #Decimal
+------+
|salary|
+------+
|123.00|
|873.00|
+------+
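If the values are less uniform than a single leading $ (say, thousands separators or a currency code), dropping the first character is not enough. Below is a minimal sketch of the alternative, which strips everything except digits and the decimal point. It is written in plain Python so it runs without a Spark session, and the sample strings and the clean_salary helper are made up for illustration. In PySpark the same pattern would go through regexp_replace followed by a cast.

```python
import re
from decimal import Decimal

# Hypothetical sample values; only the "$57438.18" shape appears in the question.
samples = ["$57438.18", "$62,846.60", "USD 30101.16"]

def clean_salary(raw: str) -> Decimal:
    """Keep digits and the decimal point, drop everything else."""
    return Decimal(re.sub(r"[^0-9.]", "", raw))

cleaned = [clean_salary(s) for s in samples]
print(cleaned)
```

Note that this pattern also drops a minus sign, so it only suits non-negative amounts.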
1 Comment

Thanks for all your help!
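Try the regexp_replace call below. An unescaped $ in a regex is an end-of-line anchor, not a literal dollar sign, so it has to be escaped or put inside a character class.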

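from pyspark.sql.functions import col, regexp_replace
# [$] matches a literal dollar sign; a bare $ would be the end-of-line anchor
updatedDF = df.withColumn('clean_salary', regexp_replace(col("Salary"), "[$]", ""))
updatedDF.show()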
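The difference is easy to check outside Spark. Here is a small sketch using Python's re module as a stand-in for the Java regex engine behind regexp_replace. For this particular pattern the two behave the same way, but treat it as an illustration rather than a Spark test.

```python
import re

value = "$57438.18"

# Unescaped $ is an end-of-string anchor: it matches an empty position,
# so replacing it with "" leaves the text unchanged.
unchanged = re.sub("$", "", value)

# Escaped, or inside a character class, $ is a literal dollar sign.
escaped = re.sub(r"\$", "", value)
in_class = re.sub("[$]", "", value)

print(unchanged, escaped, in_class)
```

Either the escaped form or the character class works. With the escaped form, a raw string keeps Python from interpreting the backslash itself.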
