
I have a table that looks like this:

+--------+-------------+
| Time   | Locations   |
+--------+-------------+
| 1/1/22 | A300-abc    |
+--------+-------------+
| 1/2/22 | A300-FFF    |
+--------+-------------+
| 1/3/22 | A300-ABC123 |
+--------+-------------+
| 1/4/22 | B700-abc    |
+--------+-------------+
| 1/5/22 | B750-EEE    |
+--------+-------------+
| 1/6/22 | M-200-68    |
+--------+-------------+
| 1/7/22 | ABC-abc     |
+--------+-------------+

I would like to derive a table that looks like this:

+--------+-------------+-----------------+
| Time   | Locations   | Locations_Clean |
+--------+-------------+-----------------+
| 1/1/22 | A300-abc    | A300            |
+--------+-------------+-----------------+
| 1/2/22 | A300-FFF    | A300            |
+--------+-------------+-----------------+
| 1/3/22 | A300-ABC123 | A300            |
+--------+-------------+-----------------+
| 1/4/22 | B700-abc    | B700            |
+--------+-------------+-----------------+
| 1/5/22 | B750-EEE    | B750            |
+--------+-------------+-----------------+
| 1/6/22 | M-200-68    | M-200           |
+--------+-------------+-----------------+
| 1/7/22 | ABC-abc     | "not_listed"    |
+--------+-------------+-----------------+

Essentially I have a list of what the location code should be, e.g. ["A300","B700","B750","M-200"], but currently the Locations column is very messy, with other random strings mixed in. I want to create a new column that shows the "cleaned" version of the location code, and anything that is not in that list should be marked as "not_listed".
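For reference, a minimal sketch that builds the example above as a PySpark DataFrame (assuming Spark, which the answer below uses; Time is kept as a plain string here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the table above.
df = spark.createDataFrame(
    [("1/1/22", "A300-abc"),
     ("1/2/22", "A300-FFF"),
     ("1/3/22", "A300-ABC123"),
     ("1/4/22", "B700-abc"),
     ("1/5/22", "B750-EEE"),
     ("1/6/22", "M-200-68"),
     ("1/7/22", "ABC-abc")],
    ["Time", "Locations"],
)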

1 Answer


Use a regex with a when condition. In this case I check whether the string begins with a digit (^[0-9]) and, if so, extract the leading digits; if it doesn't, mark it as not_listed. Code below:

df=df.withColumn('Locations_Clean', when(col("Locations").rlike("^[0-9]"),regexp_extract('Locations','^[0-9]+',0)).otherwise(lit('not_listed'))).show()

+--------------------+---------+---------------+
|                Time|Locations|Locations_Clean|
+--------------------+---------+---------------+
|0.045454545454545456|   300abc|            300|
|0.022727272727272728|   300FFF|            300|
| 0.01515151515151515|   300ABC|            300|
|0.011363636363636364|   700abc|            700|
|0.009090909090909092|   750EEE|            750|
|0.007575757575757575|   ABCabc|     not_listed|
+--------------------+---------+---------------+

With your updated question, use regexp_replace:

df=df.withColumn('Locations_Clean', when(col("Locations").rlike("\d"),regexp_replace('Locations','\-\w+$','')).otherwise(lit('not_listed')))

+------+-----------+---------------+
|  Time|  Locations|Locations_Clean|
+------+-----------+---------------+
|1/1/22|   A300-abc|           A300|
|1/2/22|   A300-FFF|           A300|
|1/3/22|A300-ABC123|           A300|
|1/4/22|   B700-abc|           B700|
|1/5/22|   B750-EEE|           B750|
|1/6/22|   M-200-68|          M-200|
|1/7/22|    ABC-abc|     not_listed|
+------+-----------+---------------+
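If you need to validate strictly against the explicit list from the question (["A300", "B700", "B750", "M-200"]) rather than only checking for a digit, a minimal sketch (the valid_codes name and the prefix regex here are illustrative assumptions, not part of the answer above) could extract the prefix and compare it with isin:

from pyspark.sql.functions import lit, regexp_extract, when

# Known "clean" codes from the question.
valid_codes = ["A300", "B700", "B750", "M-200"]

# Take everything before the trailing "-<alphanumeric>" suffix as the candidate code,
# then keep it only if it is one of the known codes; otherwise mark it as not_listed.
prefix = regexp_extract('Locations', r'^(.*?)-\w+$', 1)
df = df.withColumn('Locations_Clean', when(prefix.isin(valid_codes), prefix).otherwise(lit('not_listed')))
df.show()

With this variant, a value such as a hypothetical "C999-xyz", which contains digits but is not in the list, would also come out as not_listed.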

4 Comments

Thanks for your answer! I updated the example above to show what the actual data looks like (i.e. the desired string can include more than just numbers [0-9]; it can also include letters and some punctuation). The idea is to check against that list of "clean" names. Apologies for the confusion!
This is a completely different question from what you had asked. Courtesy is to acknowledge the effort if the initial question was answered as expected, before asking for further help.
Apologies for that! Yes, your answer above did answer my original question. It was my mistake for not stating clearly in the example that the desired string could include more than just numbers.
Edited. Please accept and upvote; if there are any concerns, let me know.
