I have a PySpark DataFrame with a string column (URL), and all records look like this:

ID                                   URL
1          https://app.xyz.com/inboxes/136636/conversations/2686735685
2          https://app.xyz.com/inboxes/136636/conversations/2938415796
3          https://app.drift.com/inboxes/136636/conversations/2938419189

I want to extract the number after conversations/ from the URL column into another column using a regex.

I tried the following code, but it doesn't give me the expected result.

df1 = df.withColumn('CONV_ID', split(convo_influ_new['URL'], '(?<=conversations/).*').getItem(0))

Expected:

ID                                   URL                                         CONV_ID
1          https://app.xyz.com/inboxes/136636/conversations/2686735685         2686735685
2          https://app.xyz.com/inboxes/136636/conversations/2938415796         2938415796     
3          https://app.drift.com/inboxes/136636/conversations/2938419189       2938419189

Result:

ID                                   URL                                         CONV_ID
1          https://app.xyz.com/inboxes/136636/conversations/2686735685         https://app.xyz.com/inboxes/136636/conversations/2686735685
2          https://app.xyz.com/inboxes/136636/conversations/2938415796         https://app.xyz.com/inboxes/136636/conversations/2938415796     
3          https://app.drift.com/inboxes/136636/conversations/2938419189       https://app.drift.com/inboxes/136636/conversations/2938419189

I'm not sure what's happening here. I tested the regex in several online regex testers and it highlights the part I want, but it never works in PySpark. I tried different PySpark functions such as f.split, regexp_extract, and regexp_replace, but none of them work.

2 Answers

If your URLs always have that form, you can simply use substring_index to get the last path element:

import pyspark.sql.functions as F

df1 = df.withColumn("CONV_ID", F.substring_index("URL", "/", -1))

df1.show(truncate=False)

#+---+-------------------------------------------------------------+----------+
#|ID |URL                                                          |CONV_ID   |
#+---+-------------------------------------------------------------+----------+
#|1  |https://app.xyz.com/inboxes/136636/conversations/2686735685  |2686735685|
#|2  |https://app.xyz.com/inboxes/136636/conversations/2938415796  |2938415796|
#|3  |https://app.drift.com/inboxes/136636/conversations/2938419189|2938419189|
#+---+-------------------------------------------------------------+----------+
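
For intuition, substring_index("URL", "/", -1) keeps whatever follows the last "/". Below is a rough plain-Python analog (not PySpark; last_path_element is a hypothetical helper name used only for illustration). It also shows one edge case worth checking in your data: a URL with a trailing slash would give an empty string.

```python
def last_path_element(url: str) -> str:
    """Rough plain-Python analog of substring_index(url, "/", -1)."""
    # Split once from the right; the last piece is whatever follows the final "/".
    return url.rsplit("/", 1)[-1]


urls = [
    "https://app.xyz.com/inboxes/136636/conversations/2686735685",
    "https://app.xyz.com/inboxes/136636/conversations/2938415796",
    "https://app.drift.com/inboxes/136636/conversations/2938419189",
]
conv_ids = [last_path_element(u) for u in urls]

# Hypothetical edge case: a trailing slash leaves nothing after the last "/".
trailing = last_path_element(urls[0] + "/")
```

If trailing slashes or query strings can occur, a regex-based extraction is probably the safer choice.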

You can use regexp_extract instead:

import pyspark.sql.functions as F

df1 = df.withColumn(
    'CONV_ID',
    F.regexp_extract('URL', 'conversations/(.*)', 1)
)

df1.show()
+---+--------------------+----------+
| ID|                 URL|   CONV_ID|
+---+--------------------+----------+
|  1|https://app.xyz.c...|2686735685|
|  2|https://app.xyz.c...|2938415796|
|  3|https://app.drift...|2938419189|
+---+--------------------+----------+
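
Note that 'conversations/(.*)' captures everything after conversations/, so a query string or extra path segment would end up in CONV_ID too. If the ID is always numeric, a tighter pattern such as 'conversations/(\d+)' may be safer. Here is a plain-Python sketch of that idea using the re module (Spark uses Java regex, but a pattern this simple should behave the same way; extract_conv_id and the "?tab=notes" URL are made up for illustration). It returns an empty string when nothing matches, which is also what regexp_extract does.

```python
import re

# Capture only the run of digits that follows "conversations/".
CONV_ID_RE = re.compile(r"conversations/(\d+)")


def extract_conv_id(url: str) -> str:
    """Return the numeric conversation ID, or "" if the URL has none."""
    match = CONV_ID_RE.search(url)
    return match.group(1) if match else ""


plain = extract_conv_id("https://app.xyz.com/inboxes/136636/conversations/2686735685")

# Hypothetical URL with a query string: the digits-only group ignores the suffix.
with_query = extract_conv_id(
    "https://app.xyz.com/inboxes/136636/conversations/2686735685?tab=notes"
)

# No "conversations/<digits>" segment at all.
missing = extract_conv_id("https://app.xyz.com/inboxes/136636")
```

In PySpark the same pattern would be passed as F.regexp_extract('URL', r'conversations/(\d+)', 1).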

Or, if you want to use split, you don't need to include .* in the pattern. Whatever the pattern matches is treated as the delimiter and discarded, so you only need to specify the pattern to split on.

import pyspark.sql.functions as F

df1 = df.withColumn(
    'CONV_ID',
    F.split('URL', '(?<=conversations/)')[1]    # just using 'conversations/' should also be enough
)

df1.show()
+---+--------------------+----------+
| ID|                 URL|   CONV_ID|
+---+--------------------+----------+
|  1|https://app.xyz.c...|2686735685|
|  2|https://app.xyz.c...|2938415796|
|  3|https://app.drift...|2938419189|
+---+--------------------+----------+
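
To see why the pattern from the question cannot work with split, here is a small plain-Python sketch using re.split (requires Python 3.7+ for splitting on zero-width matches). Spark uses Java's regex engine, which may differ in details, but the key point should carry over: the text matched by the pattern is the delimiter, so it never appears in the resulting pieces.

```python
import re

url = "https://app.xyz.com/inboxes/136636/conversations/2686735685"

# Pattern from the question: ".*" matches the ID itself, so the ID becomes
# the delimiter and is thrown away. Only the prefix (and an empty tail) remain.
question_parts = re.split(r"(?<=conversations/).*", url)

# Lookbehind only: a zero-width split point right after "conversations/".
# Nothing is consumed, so the ID survives as the second piece.
lookbehind_parts = re.split(r"(?<=conversations/)", url)

# Literal delimiter: "conversations/" is consumed, the ID is still piece [1].
literal_parts = re.split(r"conversations/", url)
```

In both working variants the ID is at index 1, which is why the answer uses [1] rather than getItem(0).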
