I have a PySpark DataFrame that needs to be joined with another DataFrame based on a string column. For example,
(bob1, "a.b.*.c") (bob2, "a.b.c")
when joined with
(tom1, "a.b.d.c") (tom2, "a.b.c")
on the second column (the pattern), should give: (bob1, tom1) (bob2, tom2). I understand this can be done using rlike, but for that I need to transform the pattern column into an actual regex. So:
- a.b.*.c becomes ^a.b.(\w+).c$
- a.b.c becomes ^a.b.c$
I'm having trouble doing this conversion. I tried using regexp_replace(), but because the output contains a \, it inserts \ twice instead of once.
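A minimal sketch of the conversion in plain Python, in case it helps to pin down the target regex before moving it into Spark. The helper name pattern_to_regex and the sample lists are hypothetical, and the sketch makes one assumption that differs from the bullets above: literal dots are escaped (\.), so that a.b.c cannot match a string like aXbXc. The nested loop only imitates the join on the four sample rows.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Turn a dotted pattern such as 'a.b.*.c' into an anchored regex.

    Hypothetical helper: '*' stands for exactly one segment of word
    characters; every other segment is matched literally.
    """
    parts = []
    for segment in pattern.split("."):
        if segment == "*":
            parts.append(r"(\w+)")            # wildcard segment
        else:
            parts.append(re.escape(segment))  # literal segment
    # Join the segments with an escaped dot and anchor both ends.
    return "^" + r"\.".join(parts) + "$"

# Sample rows taken from the question.
left = [("bob1", "a.b.*.c"), ("bob2", "a.b.c")]
right = [("tom1", "a.b.d.c"), ("tom2", "a.b.c")]

# Imitate the rlike join: keep every pair whose value matches the pattern.
pairs = [
    (l_name, r_name)
    for l_name, l_pattern in left
    for r_name, r_value in right
    if re.search(pattern_to_regex(l_pattern), r_value)
]
print(pairs)
```

In Spark, a function like this could be wrapped in a UDF to build the regex column, and the join condition could then use rlike between the value column and that regex column. If regexp_replace is preferred, the replacement string treats \ as an escape character (it follows Java's replacement syntax), so the backslash most likely needs to be doubled there, for example r"(\\w+)" as the replacement, to end up with a single \ in the output.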
The * may appear anywhere, not only in the 3rd position, right? df1.collect). Apply the patterns to df2 by replacing \w with * and finally join them with an inner join. But that has the drawback that you add complexity and an extra action, of course.