
I would like to clean up data in a DataFrame column City. It can have the following values:

Venice® VeniceÆ Venice? Venice Venice® Venice

I would like to remove all the non-ASCII characters as well as ? and . characters. How can I achieve this?

  • Python 3 strings are Unicode; there are no "non-ASCII" characters. Even Venice is 6 Unicode characters. What you posted is the result of reading a UTF-8 file using the wrong encoding. To fix this, use the correct encoding. Post the code you used to load this file, and post an actual example of the correct text. Commented Sep 29, 2022 at 9:40
  • In UTF-8, characters outside the 7-bit US-ASCII range are represented as two or more bytes. If you try to read UTF-8 text using e.g. Latin-1, the extra bytes will appear as extra characters. Commented Sep 29, 2022 at 9:45
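The encoding mismatch the comments describe is easy to reproduce in plain Python. This is a minimal sketch (the value "Venice®" is a hypothetical example, not taken from the asker's actual file): a UTF-8 encoded string decoded with Latin-1 sprouts an extra character, and the round trip in reverse repairs it.

```python
original = "Venice®"

# What the file actually contains: UTF-8 bytes.
raw = original.encode("utf-8")          # b"Venice\xc2\xae"

# Reading those bytes with the wrong codec turns the
# two-byte sequence for "®" into two separate characters.
mojibake = raw.decode("latin-1")        # "VeniceÂ®"

# Re-encoding with the wrong codec and decoding with the
# right one recovers the original text.
fixed = mojibake.encode("latin-1").decode("utf-8")

print(mojibake)  # -> VeniceÂ®
print(fixed)     # -> Venice®
```

This is why the comments suggest fixing the load step rather than scrubbing characters afterwards: the "junk" is real data, just mis-decoded.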

1 Answer


You can clean up the strings with a regex by keeping only letters:

# create dataframe
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

date_data = [
    (1, "Venice®"),
    (2, "VeniceÆ"),
    (3, "Venice?"),
    (4, "Venice")]

schema = ["id", "name"]
df_raw = spark.createDataFrame(data=date_data, schema=schema)
df_raw.show()

+---+--------+
| id|    name|
+---+--------+
|  1| Venice®|
|  2| VeniceÆ|
|  3| Venice?|
|  4|  Venice|
+---+--------+

# apply regular expression: drop every character that is not a letter
df_clean = df_raw.withColumn("clean_name", f.regexp_replace(f.col("name"), "[^a-zA-Z]", ""))
df_clean.show()

+---+--------+----------+
| id|    name|clean_name|
+---+--------+----------+
|  1|Venice®|    Venice|
|  2| VeniceÆ|    Venice|
|  3| Venice?|    Venice|
|  4|  Venice|    Venice|
+---+--------+----------+
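The same pattern works outside Spark. A minimal plain-Python sketch of the letters-only filter used in regexp_replace above (the helper name clean_name is made up for illustration):

```python
import re

def clean_name(s: str) -> str:
    # Remove every character that is not an ASCII letter,
    # mirroring the "[^a-zA-Z]" pattern passed to regexp_replace.
    return re.sub(r"[^a-zA-Z]", "", s)

for raw in ["Venice®", "VeniceÆ", "Venice?", "Venice"]:
    print(clean_name(raw))  # -> Venice (for each value)
```

Note this also strips digits and spaces; widen the character class (e.g. "[^a-zA-Z0-9 ]") if those should survive.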

PS: But I doubt you would see such characters after a correct import into Spark. The superscript, for example, is ignored.
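If the goal really is "remove all non-ASCII characters plus ?" as worded in the question, rather than keeping only letters, one hedged alternative sketch (the helper strip_non_ascii is a made-up name) uses the ASCII codec with errors="ignore", which keeps digits and spaces intact:

```python
def strip_non_ascii(s: str) -> str:
    # Drop every byte that cannot be encoded as 7-bit ASCII,
    # then remove the literal question marks.
    ascii_only = s.encode("ascii", "ignore").decode("ascii")
    return ascii_only.replace("?", "")

print(strip_non_ascii("Venice®"))   # -> Venice
print(strip_non_ascii("Venice?"))   # -> Venice
print(strip_non_ascii("Venice 12"))  # -> Venice 12
```

As the comments point out, though, fixing the file's encoding at load time is the better fix; this only masks the symptom.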
