PySpark split using regex doesn't work on a dataframe column with string type

Question

I have a PySpark DataFrame with a string column (URL), and all records look like this:

ID                                   URL
1          https://app.xyz.com/inboxes/136636/conversations/2686735685
2          https://app.xyz.com/inboxes/136636/conversations/2938415796
3          https://app.drift.com/inboxes/136636/conversations/2938419189

I want to extract the number after conversations/ from the URL column into another column, using a regex.

I tried the following code but it doesn't give me any results.

df1 = df.withColumn('CONV_ID', split(df['URL'], '(?<=conversations/).*').getItem(0))

Expected:

ID                                   URL                                         CONV_ID
1          https://app.xyz.com/inboxes/136636/conversations/2686735685         2686735685
2          https://app.xyz.com/inboxes/136636/conversations/2938415796         2938415796     
3          https://app.drift.com/inboxes/136636/conversations/2938419189       2938419189

Result:

ID                                   URL                                         CONV_ID
1          https://app.xyz.com/inboxes/136636/conversations/2686735685         https://app.xyz.com/inboxes/136636/conversations/2686735685
2          https://app.xyz.com/inboxes/136636/conversations/2938415796         https://app.xyz.com/inboxes/136636/conversations/2938415796     
3          https://app.drift.com/inboxes/136636/conversations/2938419189       https://app.drift.com/inboxes/136636/conversations/2938419189

I'm not sure what's happening here. I tried the regex in several online regex testers and it highlights the part I want, but it never works in PySpark. I tried different PySpark functions such as f.split, regexp_extract, and regexp_replace, but none of them worked.

blackbishop · Accepted Answer · 2021-02-12 14:51:54Z

If your URLs always have that form, you can simply use substring_index to get the last path element:

import pyspark.sql.functions as F

df1 = df.withColumn("CONV_ID", F.substring_index("URL", "/", -1))

df1.show(truncate=False)

#+---+-------------------------------------------------------------+----------+
#|ID |URL                                                          |CONV_ID   |
#+---+-------------------------------------------------------------+----------+
#|1  |https://app.xyz.com/inboxes/136636/conversations/2686735685  |2686735685|
#|2  |https://app.xyz.com/inboxes/136636/conversations/2938415796  |2938415796|
#|3  |https://app.drift.com/inboxes/136636/conversations/2938419189|2938419189|
#+---+-------------------------------------------------------------+----------+
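
As a rough illustration of what substring_index returns with a count of -1 (everything to the right of the last delimiter), here is a plain-Python sketch that runs without Spark. The helper name last_path_element and the trailing-slash URL are made up for this example; it mimics the behaviour for a single string and is not how Spark implements the function.

```python
def last_path_element(url: str, delim: str = "/") -> str:
    """Hypothetical helper: return the text after the final delimiter,
    or the whole string if the delimiter does not occur."""
    return url.rsplit(delim, 1)[-1]


urls = [
    "https://app.xyz.com/inboxes/136636/conversations/2686735685",
    "https://app.xyz.com/inboxes/136636/conversations/2938415796",
    "https://app.drift.com/inboxes/136636/conversations/2938419189",
]

conv_ids = [last_path_element(u) for u in urls]
print(conv_ids)

# Caveat (made-up URL): with a trailing slash the last element is empty,
# which is why this approach relies on the URLs always having the same form.
trailing = last_path_element("https://app.xyz.com/inboxes/136636/conversations/2686735685/")
print(repr(trailing))
```

If the URLs might carry a trailing slash or a query string, a regex-based extraction is probably the safer choice.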

mck · Accepted Answer · 2021-02-12 12:31:42Z

You can use regexp_extract instead:

import pyspark.sql.functions as F

df1 = df.withColumn(
    'CONV_ID',
    F.regexp_extract('URL', 'conversations/(.*)', 1)
)

df1.show()
+---+--------------------+----------+
| ID|                 URL|   CONV_ID|
+---+--------------------+----------+
|  1|https://app.xyz.c...|2686735685|
|  2|https://app.xyz.c...|2938415796|
|  3|https://app.drift...|2938419189|
+---+--------------------+----------+
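
The capture-group idea can be tried outside Spark with Python's re module. The sketch below is only an approximation of regexp_extract for a single string: the helper name extract_conv_id is hypothetical, and the tighter pattern (\d+) assumes the IDs are always numeric, which the sample data suggests but does not guarantee. Like regexp_extract, it falls back to an empty string when the pattern does not match.

```python
import re


def extract_conv_id(url: str) -> str:
    """Hypothetical helper: return capture group 1, or "" when there is no match."""
    match = re.search(r"conversations/(\d+)", url)
    return match.group(1) if match else ""


urls = [
    "https://app.xyz.com/inboxes/136636/conversations/2686735685",
    "https://app.xyz.com/inboxes/136636/conversations/2938415796",
    "https://app.drift.com/inboxes/136636/conversations/2938419189",
]

conv_ids = [extract_conv_id(u) for u in urls]
print(conv_ids)

# A URL without a conversations/ segment (made up for this example) yields "".
no_match = extract_conv_id("https://app.xyz.com/inboxes/136636")
print(repr(no_match))
```

The same pattern string should work as the second argument to F.regexp_extract, since \d is also supported by the Java regex engine that Spark uses.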

Or, if you want to use split, leave .* out of the pattern: split discards whatever the pattern matches, so adding .* makes the ID itself part of the delimiter. You only need to specify the pattern to split on.

import pyspark.sql.functions as F

df1 = df.withColumn(
    'CONV_ID',
    F.split('URL', '(?<=conversations/)')[1]    # just using 'conversations/' should also be enough
)

df1.show()
+---+--------------------+----------+
| ID|                 URL|   CONV_ID|
+---+--------------------+----------+
|  1|https://app.xyz.c...|2686735685|
|  2|https://app.xyz.c...|2938415796|
|  3|https://app.drift...|2938419189|
+---+--------------------+----------+
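
To see how the three candidate patterns behave as split delimiters, Python's re.split can serve as a stand-in. This is an illustration under assumptions, not Spark code: Spark uses Java regular expressions, and while these simple patterns should behave the same way there, that is not verified here. Splitting on a zero-width match requires Python 3.7 or later.

```python
import re

url = "https://app.xyz.com/inboxes/136636/conversations/2686735685"

# The question's pattern: ".*" consumes the ID, so the ID becomes the
# delimiter and is discarded. Neither piece contains it.
with_dot_star = re.split(r"(?<=conversations/).*", url)

# Lookbehind only: splits at the zero-width position right after
# "conversations/", so the ID survives as the second piece.
lookbehind_only = re.split(r"(?<=conversations/)", url)

# Literal delimiter: also keeps the ID; "conversations/" is dropped
# from the first piece instead.
literal = re.split(r"conversations/", url)

print(with_dot_star)
print(lookbehind_only)
print(literal)
```

In every working variant the ID is at index 1, not index 0, which matches the [1] used in the split example above.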
