
I would like to clean up data in a DataFrame column City. It can have the following values:

Venice® VeniceÆ Venice? Venice Venice® Venice

I would like to remove all the non-ASCII characters as well as ? and . characters. How can I achieve this?

  • Python 3 strings are Unicode; there are no "non-ASCII" characters. Even Venice is 6 Unicode characters. What you posted is the result of reading a UTF-8 file using the wrong encoding. To fix this, use the correct encoding. Post the code you used to load this file, and post an actual example of the correct text. Commented Sep 29, 2022 at 9:40
  • In UTF-8, characters outside the 7-bit US-ASCII range are represented as two or more bytes. If you try to read UTF-8 text using e.g. Latin-1, the extra bytes will appear as extra characters. Commented Sep 29, 2022 at 9:45
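The encoding mismatch the comments describe is easy to reproduce in plain Python. This is a minimal sketch (the value "Venice®" is a hypothetical example, not taken from the asker's actual file): a UTF-8 encoded string decoded with Latin-1 sprouts an extra character, and the round trip in reverse repairs it.

```python
original = "Venice®"

# What the file actually contains: UTF-8 bytes.
raw = original.encode("utf-8")          # b"Venice\xc2\xae"

# Reading those bytes with the wrong codec turns the
# two-byte sequence for "®" into two separate characters.
mojibake = raw.decode("latin-1")        # "VeniceÂ®"

# Re-encoding with the wrong codec and decoding with the
# right one recovers the original text.
fixed = mojibake.encode("latin-1").decode("utf-8")

print(mojibake)  # -> VeniceÂ®
print(fixed)     # -> Venice®
```

This is why the comments suggest fixing the load step rather than scrubbing characters afterwards: the "junk" is real data, just mis-decoded.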

1 Answer


You can clean up the strings with a regex by keeping only letters:

# create dataframe
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

date_data = [
    (1, "Venice®"),
    (2, "VeniceÆ"),
    (3, "Venice?"),
    (4, "Venice")]

schema = ["id", "name"]
df_raw = spark.createDataFrame(data=date_data, schema=schema)
df_raw.show()

+---+--------+
| id|    name|
+---+--------+
|  1| Venice®|
|  2| VeniceÆ|
|  3| Venice?|
|  4|  Venice|
+---+--------+

# apply regular expression: drop every character that is not a letter
df_clean = df_raw.withColumn("clean_name", f.regexp_replace(f.col("name"), "[^a-zA-Z]", ""))
df_clean.show()

+---+--------+----------+
| id|    name|clean_name|
+---+--------+----------+
|  1|Venice®|    Venice|
|  2| VeniceÆ|    Venice|
|  3| Venice?|    Venice|
|  4|  Venice|    Venice|
+---+--------+----------+
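The same pattern works outside Spark. A minimal plain-Python sketch of the letters-only filter used in regexp_replace above (the helper name clean_name is made up for illustration):

```python
import re

def clean_name(s: str) -> str:
    # Remove every character that is not an ASCII letter,
    # mirroring the "[^a-zA-Z]" pattern passed to regexp_replace.
    return re.sub(r"[^a-zA-Z]", "", s)

for raw in ["Venice®", "VeniceÆ", "Venice?", "Venice"]:
    print(clean_name(raw))  # -> Venice (for each value)
```

Note this also strips digits and spaces; widen the character class (e.g. "[^a-zA-Z0-9 ]") if those should survive.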

PS: But I doubt you would see such characters after a correct import into Spark. The superscript, for example, is ignored.
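If the goal really is "remove all non-ASCII characters plus ?" as worded in the question, rather than keeping only letters, one hedged alternative sketch (the helper strip_non_ascii is a made-up name) uses the ASCII codec with errors="ignore", which keeps digits and spaces intact:

```python
def strip_non_ascii(s: str) -> str:
    # Drop every byte that cannot be encoded as 7-bit ASCII,
    # then remove the literal question marks.
    ascii_only = s.encode("ascii", "ignore").decode("ascii")
    return ascii_only.replace("?", "")

print(strip_non_ascii("Venice®"))   # -> Venice
print(strip_non_ascii("Venice?"))   # -> Venice
print(strip_non_ascii("Venice 12"))  # -> Venice 12
```

As the comments point out, though, fixing the file's encoding at load time is the better fix; this only masks the symptom.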
