
I have a table that looks like this:

+--------+-------------+
| Time   | Locations   |
+--------+-------------+
| 1/1/22 | A300-abc    |
+--------+-------------+
| 1/2/22 | A300-FFF    |
+--------+-------------+
| 1/3/22 | A300-ABC123 |
+--------+-------------+
| 1/4/22 | B700-abc    |
+--------+-------------+
| 1/5/22 | B750-EEE    |
+--------+-------------+
| 1/6/22 | M-200-68    |
+--------+-------------+
| 1/7/22 | ABC-abc     |
+--------+-------------+

I would like to derive a table that looks like this:

+--------+-------------+-----------------+
| Time   | Locations   | Locations_Clean |
+--------+-------------+-----------------+
| 1/1/22 | A300-abc    | A300            |
+--------+-------------+-----------------+
| 1/2/22 | A300-FFF    | A300            |
+--------+-------------+-----------------+
| 1/3/22 | A300-ABC123 | A300            |
+--------+-------------+-----------------+
| 1/4/22 | B700-abc    | B700            |
+--------+-------------+-----------------+
| 1/5/22 | B750-EEE    | B750            |
+--------+-------------+-----------------+
| 1/6/22 | M-200-68    | M-200           |
+--------+-------------+-----------------+
| 1/7/22 | ABC-abc     | "not_listed"    |
+--------+-------------+-----------------+

Essentially I have a list of what the location code should be, e.g. ["A300","B700","B750","M-200"], but currently the Locations column is very messy, with other random strings mixed in. I want to create a new column that shows the "cleaned" version of the location code, and anything that is not in that list should be marked as "not_listed".
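For reference, a minimal sketch that builds the example above as a PySpark DataFrame (assuming Spark, which the answer below uses; Time is kept as a plain string here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the table above.
df = spark.createDataFrame(
    [("1/1/22", "A300-abc"),
     ("1/2/22", "A300-FFF"),
     ("1/3/22", "A300-ABC123"),
     ("1/4/22", "B700-abc"),
     ("1/5/22", "B750-EEE"),
     ("1/6/22", "M-200-68"),
     ("1/7/22", "ABC-abc")],
    ["Time", "Locations"],
)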

1 Answer


Use a regex with a when condition. In this case I check whether the string begins with a digit (^[0-9]) and, if so, extract the leading digits; if it doesn't, mark it as not_listed. Code below:

df=df.withColumn('Locations_Clean', when(col("Locations").rlike("^[0-9]"),regexp_extract('Locations','^[0-9]+',0)).otherwise(lit('not_listed'))).show()

+--------------------+---------+---------------+
|                Time|Locations|Locations_Clean|
+--------------------+---------+---------------+
|0.045454545454545456|   300abc|            300|
|0.022727272727272728|   300FFF|            300|
| 0.01515151515151515|   300ABC|            300|
|0.011363636363636364|   700abc|            700|
|0.009090909090909092|   750EEE|            750|
|0.007575757575757575|   ABCabc|     not_listed|
+--------------------+---------+---------------+

With your updated question, use regexp_replace:

df=df.withColumn('Locations_Clean', when(col("Locations").rlike("\d"),regexp_replace('Locations','\-\w+$','')).otherwise(lit('not_listed')))

+------+-----------+---------------+
|  Time|  Locations|Locations_Clean|
+------+-----------+---------------+
|1/1/22|   A300-abc|           A300|
|1/2/22|   A300-FFF|           A300|
|1/3/22|A300-ABC123|           A300|
|1/4/22|   B700-abc|           B700|
|1/5/22|   B750-EEE|           B750|
|1/6/22|   M-200-68|          M-200|
|1/7/22|    ABC-abc|     not_listed|
+------+-----------+---------------+
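If you need to validate strictly against the explicit list from the question (["A300", "B700", "B750", "M-200"]) rather than only checking for a digit, a minimal sketch (the valid_codes name and the prefix regex here are illustrative assumptions, not part of the answer above) could extract the prefix and compare it with isin:

from pyspark.sql.functions import lit, regexp_extract, when

# Known "clean" codes from the question.
valid_codes = ["A300", "B700", "B750", "M-200"]

# Take everything before the trailing "-<alphanumeric>" suffix as the candidate code,
# then keep it only if it is one of the known codes; otherwise mark it as not_listed.
prefix = regexp_extract('Locations', r'^(.*?)-\w+$', 1)
df = df.withColumn('Locations_Clean', when(prefix.isin(valid_codes), prefix).otherwise(lit('not_listed')))
df.show()

With this variant, a value such as a hypothetical "C999-xyz", which contains digits but is not in the list, would also come out as not_listed.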

4 Comments

Thanks for your answer! I updated the example above to show what the actual data looks like (i.e. the desired string can include more than just numbers [0-9]; it can also include letters and some punctuation). The idea is to check against that list of "clean" names. Apologies for the confusion!
This is a completely different question from what you had asked. Courtesy is to acknowledge the effort if the initial question was answered as expected, before asking for further help.
Apologies for that! Yes, your answer above did answer my original question. It was my mistake for not stating clearly in the example that the desired string could include more than just numbers.
Edited. Please accept and upvote; if there are any concerns, let me know.
