I have a PySpark dataframe like this:

A B
1 abc_value
2 abc_value
3 some_other_value
4 anything_else
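
For reference, this DataFrame can be reproduced with something like the following (assuming an active SparkSession named spark):

df = spark.createDataFrame(
    [(1, 'abc_value'), (2, 'abc_value'), (3, 'some_other_value'), (4, 'anything_else')],
    ['A', 'B']
)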

I have a mapping dictionary:

d = {
    "abc": "X",
    "some_other": "Y",
    "anything": "Z"
}

I need to create a new column in my original DataFrame that should look like this:

A B C
1 abc_value X
2 abc_value X
3 some_other_value Y
4 anything_else Z

I tried mapping like this:

from itertools import chain
import pyspark.sql.functions as f

mapping_expr = f.create_map([f.lit(x) for x in chain(*d.items())])

and then applied it with withColumn. However, that does exact matching of the whole value, while I need partial (regex) matching, as you can see above.
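
Applied as a lookup, for example like this (I believe this is the usual pattern for map columns), it returns null for every row, because the map keys ('abc', 'some_other', 'anything') never equal the full strings in B:

df.withColumn('C', mapping_expr[f.col('B')]).show()  # C is null everywhere: 'abc_value' != 'abc'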

How can I accomplish this?

1 Answer

I'm afraid PySpark has no built-in function that extracts substrings according to a defined dictionary, so you'll probably need to resort to a workaround.

In this case, you can first create a search string which includes all your dictionary keys to be searched:

keys = list(d.keys())
keys_expr = '|'.join(keys)  # join the keys with the regex alternation operator

keys_expr
# 'abc|some_other|anything'
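
A side note: if any of the dictionary keys could contain regex metacharacters (., +, etc.), it's safer to escape them while building the pattern, e.g. with Python's re.escape:

import re

keys_expr = '|'.join(re.escape(k) for k in d.keys())  # escape metacharacters in each key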

Then you can use regexp_extract to extract from column B the first key in keys_expr that it contains, if any (that is what the regex alternation operator | is for).
Finally, you can use the dictionary d to replace the extracted keys with their target values in the new column.

import pyspark.sql.functions as F

df = (
    df
    .withColumn('C', F.regexp_extract('B', keys_expr, 0))  # extract the first matching key into C
    .replace(d, subset=['C'])                               # map each extracted key to its value
)

df.show()

+---+----------------+---+
|  A|               B|  C|
+---+----------------+---+
|  1|       abc_value|  X|
|  2|       abc_value|  X|
|  3|some_other_value|  Y|
|  4|   anything_else|  Z|
+---+----------------+---+
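
One caveat (not triggered by this sample data): regexp_extract returns an empty string when none of the keys match, so unmatched rows end up with '' in C. If you'd rather have nulls there, a small follow-up on the same df would be:

df = df.withColumn('C', F.when(F.col('C') == '', F.lit(None)).otherwise(F.col('C')))  # turn non-matches into nulls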