How to Traverse Dataframe particular column in the loop

Question

Hi, I have the following DataFrame:

df.show()

which outputs:

+----------+--------+-------+--------------------+-------+--------------------+
|      date|    time|from_to|      expression_col
+----------+--------+-------+--------------------+-------+--------------------+
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey1=BN zzzTemporary59 0
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey2=BN zzzTemporary59 0
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey3=BN zzzTemporary59 0

I am trying to traverse expression_col and, for each row, pick out the last comma-separated key, i.e. the text between the last comma and the equals sign (=) that follows it. For the rows above, the values are:

lastkey1
lastkey2
lastkey3

The category depends on this key: lastkey1 falls under category 1, lastkey2 under category 2, and so on. The final DataFrame should be:

+----------+--------+-------+--------------------+-------+--------------------+
|      date|    time|from_to|      expression_col                                 | category
+----------+--------+-------+--------------------+-------+--------------------+
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey1=BN zzzTemporary59 0  | category-1
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey2=BN zzzTemporary59 0  | category-2  
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey3=BN zzzTemporary59 0  | category-3

I can get the required result with a regular expression like:

.*,(.*)=.*$

But how can I get the same thing using a custom function?

ELinda · Accepted Answer · 2020-01-28 18:14:05Z

4

Assuming expression_col is a regular String:

scala> val df = Seq((100,"test=LN,x23=test,x5=66,lastkey1=BN zzzTemporary59"), (200,"test=LN,x23=test,x5=66,lastkey2=BN zzzTemporary59"), (300, "test=LN,x23=test,x5=66,lastkey3=BN zzzTemporary59 0")).toDF("id", "expression_col")
df: org.apache.spark.sql.DataFrame = [id: int, expression_col: string]

scala> df.withColumn("category", concat(lit("category-"), regexp_extract(df.col("expression_col"), "lastkey(\\d+)=", 1))).show()
+---+--------------------+----------+
| id|      expression_col|  category|
+---+--------------------+----------+
|100|test=LN,x23=test,...|category-1|
|200|test=LN,x23=test,...|category-2|
|300|test=LN,x23=test,...|category-3|
+---+--------------------+----------+

Use a regexp that extracts one or more digits (\d+) immediately following "lastkey" in the input string.

Use concat to add "category-" as a prefix.

Note that df above is a simplified version of yours.
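
If the last key is not always of the form lastkeyN, the same idea could be generalized by capturing whatever sits between the last comma and the equals sign that follows it. The following is only a sketch in plain Scala, without Spark; the names LastKey and lastKey are illustrative and not part of the answer above. It assumes that values contain neither commas nor equals signs.

```scala
object LastKey {
  // Greedy ".*," runs up to the last comma that is still followed by "key=value".
  // The capture group holds the key; the value may not contain ',' or '='.
  private val LastKeyPattern = """.*,([^=,]+)=[^=,]*$""".r

  // Returns the last key, or None when the input is null or does not match.
  def lastKey(expression: String): Option[String] =
    Option(expression).collect { case LastKeyPattern(key) => key }

  def main(args: Array[String]): Unit = {
    println(lastKey("test=LN,x23=test,x5=66,lastkey1=BN zzzTemporary59 0")) // Some(lastkey1)
    println(lastKey("no keys here"))                                        // None
  }
}
```

In Spark, a function like this could be wrapped with udf from org.apache.spark.sql.functions, or the same pattern could be passed to regexp_extract with group index 1. The built-in function is usually the better choice where it suffices, since Spark cannot optimize a UDF in the same way.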


5 Comments

jackson smith Over a year ago

lastkey1 was only for test purposes. I need to extract the text after the last comma but before the equals sign (=). For a string like test=123,test4=32432,test12=test test 12, the value after the last comma and before the equals sign is test4.

jackson smith Over a year ago

please check my comment

ELinda Over a year ago

Hm, wouldn't it be test12? In any case, you can simply change the regexp to your specification. E.g. ".*,[a-z]+(\\d+)=[^=,]+$" would extract the digits from the last key (the text between the last comma and the equals sign), matching from the end of the string (as denoted by the dollar sign in the regexp). I.e. it would be 12 for the input in your first comment.
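
To check what the pattern ".*,[a-z]+(\\d+)=[^=,]+$" captures, a plain Scala sketch (no Spark; the object and method names are made up for illustration) could look like this:

```scala
object CommentRegexCheck {
  // Same pattern as quoted above; triple quotes avoid doubling the backslash.
  private val DigitsOfLastKey = """.*,[a-z]+(\d+)=[^=,]+$""".r

  // Returns the digits that end the last key, if the input matches.
  def digitsOfLastKey(s: String): Option[String] =
    DigitsOfLastKey.findFirstMatchIn(s).map(_.group(1))

  def main(args: Array[String]): Unit =
    println(digitsOfLastKey("test=123,test4=32432,test12=test test 12")) // Some(12)
}
```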

jackson smith Over a year ago

Yes, got it. I need to extract the whole text between the last comma and the equals sign using the regular expression (starting from the end).

abiratsis Over a year ago

Hi @jacksonsmith, please consider accepting the answer as correct if it is the one you were looking for.
