I would like to clean up the data in a dataframe column City. It can have the following values:
Venice® VeniceÆ Venice? Venice Venice® Venice
I would like to remove all the non-ASCII characters as well as the ? character. How can I achieve this?
You can clean up the strings with a regex that keeps only letters:
# create example dataframe
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

date_data = [
    (1, "Venice®"),
    (2, "VeniceÆ"),
    (3, "Venice?"),
    (4, "Venice"),
]
schema = ["id", "name"]
df_raw = spark.createDataFrame(data=date_data, schema=schema)
df_raw.show()
+---+--------+
| id|    name|
+---+--------+
|  1| Venice®|
|  2| VeniceÆ|
|  3| Venice?|
|  4|  Venice|
+---+--------+
# apply regular expression: keep only ASCII letters
df_clean = df_raw.withColumn("clean_name", f.regexp_replace(f.col("name"), "[^a-zA-Z]", ""))
df_clean.show()
+---+--------+----------+
| id|    name|clean_name|
+---+--------+----------+
|  1| Venice®|    Venice|
|  2| VeniceÆ|    Venice|
|  3| Venice?|    Venice|
|  4|  Venice|    Venice|
+---+--------+----------+
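Note that [^a-zA-Z] also strips spaces and digits, which would mangle multi-word names such as "New York". If you only want to drop non-ASCII characters plus the literal ?, a narrower pattern works; the following is a sketch, not part of the original answer:
# alternative sketch: remove only non-ASCII characters and the literal "?",
# keeping spaces and digits intact (so names like "New York" survive)
df_clean2 = df_raw.withColumn(
    "clean_name",
    f.regexp_replace(f.col("name"), r"[^\x00-\x7F]|\?", ""),
)
df_clean2.show()
Here \x00-\x7F is the ASCII range, so the negated character class matches every non-ASCII character.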
PS: I doubt that you would see such characters after a correct import into Spark. A superscript, for example, is ignored.
Venice is 6 Unicode characters. What you posted is the result of reading a UTF-8 file using the wrong encoding. To fix this, use the correct encoding. Post the code you used to load this file, and post an actual example of the correct text.
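For example, if the file is a CSV (an assumption, since the question does not show the load code), Spark's CSV reader accepts an explicit encoding option:
# sketch assuming a CSV source; the path and header option are hypothetical
df_raw = (
    spark.read
    .option("header", True)
    .option("encoding", "UTF-8")  # set to the file's real encoding, e.g. "ISO-8859-1"
    .csv("cities.csv")
)
Reading with the encoding the file was actually written in avoids the mojibake in the first place.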