I have a PySpark DataFrame with a string column (URL), and all records look like this:
ID URL
1 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189
I want to extract the number after conversations/ from the URL column into another column using a regex.
I tried the following code, but it doesn't give me the result I expect.
df1 = df.withColumn('CONV_ID', split(convo_influ_new['URL'], '(?<=conversations/).*').getItem(0))
Expected:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 2938419189
Result:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 https://app.drift.com/inboxes/136636/conversations/2938419189
I'm not sure what's happening here. I tested the regex in several online regex testers, and it highlights the part I want, but it never works in PySpark. I also tried other PySpark functions such as f.split, regexp_extract, and regexp_replace, but none of them worked.
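For reference, here is a minimal sketch of one way the extraction could be written. It relies on the fact that split cuts the string at each regex match and returns the pieces around it, so the matched text itself is discarded, whereas regexp_extract returns a capture group. The helper names add_conv_id and extract_conv_id, and the pattern constant, are hypothetical and only for illustration; the sketch assumes the column is named URL and that the ID is always a run of digits directly after conversations/.

```python
import re

# Pattern shared by both helpers: capture the digits that follow "conversations/".
CONV_ID_PATTERN = r"conversations/(\d+)"


def add_conv_id(df, url_col="URL", out_col="CONV_ID"):
    """Return df with out_col holding the digits after 'conversations/'."""
    # Imported lazily so the plain-Python helper below runs without Spark installed.
    from pyspark.sql import functions as F

    # Group index 1 selects the capture group; rows without a match get "".
    return df.withColumn(out_col, F.regexp_extract(F.col(url_col), CONV_ID_PATTERN, 1))


def extract_conv_id(url):
    """Plain-Python equivalent, handy for checking the pattern on sample strings."""
    match = re.search(CONV_ID_PATTERN, url)
    return match.group(1) if match else ""
```

With a DataFrame like the one above, this would be called as df1 = add_conv_id(df). The plain-Python helper is only there to sanity-check the pattern on a few sample URLs before running it in Spark.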