I have the following DataFrame:

df.show()

which prints:

+----------+--------+-------+---------------------------------------------------+
|      date|    time|from_to|                                     expression_col|
+----------+--------+-------+---------------------------------------------------+
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey1=BN zzzTemporary59 0|
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey2=BN zzzTemporary59 0|
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey3=BN zzzTemporary59 0|
+----------+--------+-------+---------------------------------------------------+

I am trying to parse expression_col and pick out the last comma-separated key, that is, the text between the last comma and the equals sign (=) that follows it. For the rows above, the values are:

lastkey1
lastkey2
lastkey3

Each row should be assigned a category based on this key: lastkey1 maps to category 1, lastkey2 to category 2, and so on. The final DataFrame should be:

+----------+--------+-------+---------------------------------------------------+----------+
|      date|    time|from_to|                                     expression_col|  category|
+----------+--------+-------+---------------------------------------------------+----------+
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey1=BN zzzTemporary59 0|category-1|
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey2=BN zzzTemporary59 0|category-2|
|2019-11-08|05:55:41|   MO-N|test=LN,x23=test,x5=66,lastkey3=BN zzzTemporary59 0|category-3|
+----------+--------+-------+---------------------------------------------------+----------+

I can get the required result with a regular expression such as

.*,(.*)=.*$

but how can I get the same result using a custom function?

1 Answer

Assuming expression_col is a regular String:

scala> val df = Seq((100,"test=LN,x23=test,x5=66,lastkey1=BN zzzTemporary59"), (200,"test=LN,x23=test,x5=66,lastkey2=BN zzzTemporary59"), (300, "test=LN,x23=test,x5=66,lastkey3=BN zzzTemporary59 0")).toDF("id", "expression_col")
df: org.apache.spark.sql.DataFrame = [id: int, expression_col: string]

scala> df.withColumn("category", concat(lit("category-"), regexp_extract(df.col("expression_col"), "lastkey(\\d+)=", 1))).show()
+---+--------------------+----------+
| id|      expression_col|  category|
+---+--------------------+----------+
|100|test=LN,x23=test,...|category-1|
|200|test=LN,x23=test,...|category-2|
|300|test=LN,x23=test,...|category-3|
+---+--------------------+----------+

Use a regexp that extracts one or more digits (\d+) following "lastkey" in the input string.

Use concat to add "category-" as a prefix.

Note that df above is a simplified version of yours.
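
The question also asks about a custom function. The following is a minimal sketch of that route in plain Scala, not part of the original answer: `LastKeyCategory`, `lastKey` and `category` are hypothetical names, and the rule "reuse the trailing digits of the key as the category number" is an assumption drawn from the sample rows. The Spark wiring appears only in comments so the block runs without a cluster.

```scala
object LastKeyCategory {
  // Hypothetical helper: returns the key between the last comma and the
  // '=' that follows it, or None for null input or when no '=' is found.
  def lastKey(expression: String): Option[String] =
    Option(expression).flatMap { s =>
      // With no comma, lastIndexOf returns -1, so the whole string is used.
      val lastPair = s.substring(s.lastIndexOf(',') + 1)
      val eq = lastPair.indexOf('=')
      if (eq > 0) Some(lastPair.substring(0, eq).trim) else None
    }

  // Hypothetical helper: "lastkey1" -> "category-1", using the digits
  // at the end of the key (assumed mapping, based on the sample data).
  def category(expression: String): Option[String] =
    lastKey(expression).flatMap { key =>
      val digits = key.reverse.takeWhile(_.isDigit).reverse
      if (digits.nonEmpty) Some(s"category-$digits") else None
    }
}

// In spark-shell this could be wrapped as a UDF (untested assumption):
//   import org.apache.spark.sql.functions.udf
//   val categoryUdf = udf((s: String) => LastKeyCategory.category(s))
//   df.withColumn("category", categoryUdf(df.col("expression_col")))
```

Built-in functions such as regexp_extract are usually preferable to a UDF where they suffice, since Spark can optimize them; the function above is mainly useful when the mapping logic outgrows a single regexp.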

5 Comments

lastkey1 was only for testing. I need to extract the text after the last comma but before the equals sign (=). For example, for a string like test=123,test4=32432,test12=test test 12, the value after the last comma and before the equals sign is test4.
please check my comment
Hm, wouldn't it be test12? In any case, you can simply change the regexp to your specification. E.g. ".*,[a-z]+(\\d+)=[^=,]+$" would extract the digits from the last key (between the comma and the equals sign), anchored at the end of the string (as denoted by the dollar sign in the regexp). That is, it would return 12 for the input in your first comment.
Yes, got it. I need to extract the whole text between the last comma and the equals sign with the regular expression (starting from the end).
Hi @jacksonsmith, please consider accepting the answer if it is the one you were looking for.
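
The patterns discussed in these comments can be checked outside Spark, since regexp_extract uses Java regular expressions, the same engine behind Scala's Regex. The sketch below is illustrative only: `LastKeyRegex`, `digits` and `key` are hypothetical names, and the second pattern (capturing the whole key rather than its digits) is an assumed variant, not one given in the thread.

```scala
object LastKeyRegex {
  // Pattern quoted in the comment above: captures only the digits of the last key.
  private val lastKeyDigits = """.*,[a-z]+(\d+)=[^=,]+$""".r

  // Assumed variant: captures the whole key between the last comma and '='.
  private val wholeLastKey = """.*,([^,=]+)=[^=,]*$""".r

  // Returns the first capture group, or None when the pattern does not match.
  def digits(s: String): Option[String] =
    lastKeyDigits.findFirstMatchIn(s).map(_.group(1))

  def key(s: String): Option[String] =
    wholeLastKey.findFirstMatchIn(s).map(_.group(1))
}
```

For the input test=123,test4=32432,test12=test test 12, `digits` yields 12 and `key` yields test12. In Spark, the second pattern could be passed to regexp_extract with group index 1 in the same way as in the answer.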
