
I am trying to split the string into different columns using a regular expression.

Below is my data

decodeData = [('M|C705|Exx05','2'),
              ('M|Exx05','4'),
              ('M|C705 P|Exx05','6'),
              ('M|C705 P|8960 L|Exx05','7'),
              ('M|C705 P|78|8960','9')]

df = sc.parallelize(decodeData).toDF(['Decode',''])

dfNew = df.withColumn(
    'Exx05', regexp_extract(col('Decode'), '(M|P|M)(\\|Exx05)', 1)
).withColumn(
    'C705', regexp_extract(col('Decode'), '(M|P|M)(\\|C705)', 1)
).withColumn(
    '8960', regexp_extract(col('Decode'), '(M|P|M)(\\|8960)', 1)
)
dfNew.show()

Result
+---------------------+---+-----+----+----+
|               Decode|   |Exx05|C705|8960|
+---------------------+---+-----+----+----+
|         M|C705|Exx05|  2|     |   M|    |
|              M|Exx05|  4|    M|    |    |
|       M|C705 P|Exx05|  6|    P|   M|    |
|M|C705 P|8960 L|Exx05|  7|    M|   M|   P|
|     M|C705 P|78|8960|  9|     |   M|    |
+---------------------+---+-----+----+----+

Here I am trying to extract the code for the strings Exx05, C705 and 8960, each of which can fall under the M/P/L codes. For example, when decoding 'M|C705 P|8960 L|Exx05' I expect the results L, M, P in the respective columns. However, I am missing some logic here, which I am finding difficult to crack.

Expected results

+---------------------+---+-----+----+----+
|               Decode|   |Exx05|C705|8960|
+---------------------+---+-----+----+----+
|         M|C705|Exx05|   |    M|   M|    |
|              M|Exx05|   |    M|    |    |
|       M|C705 P|Exx05|   |    P|   M|    |
|M|C705 P|8960 L|Exx05|   |    L|   M|   P|
|     M|C705 P|78|8960|   |     |   M|   P|
+---------------------+---+-----+----+----+

When I try to change the regular expression accordingly, it works for some cases but not for others, and this is just a subset of the actual data I am working on.

For example:

1. Exx05 can fall under any code (M/L/P), and it can appear at any position: beginning, middle or end.

2. One Decode can belong to only one code (M, L or P) per entry/ID, i.e. M|Exx05 P|8960 L|Exx05 (where Exx05 falls under both M and L) will not occur.
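To pin down the output I expect, the mapping can also be written in plain Python without any regex (a hypothetical helper just to make the expected behaviour concrete, not my actual approach):

```python
def decode_codes(decode, targets=('Exx05', 'C705', '8960')):
    """Map each target token to the M/P/L code of the space-separated
    group it appears in; '' when the token is absent."""
    result = {t: '' for t in targets}
    for group in decode.split(' '):          # e.g. 'L|Exx05'
        code, *tokens = group.split('|')     # code='L', tokens=['Exx05']
        for t in targets:
            if t in tokens:
                result[t] = code
    return result

print(decode_codes('M|C705 P|8960 L|Exx05'))
# {'Exx05': 'L', 'C705': 'M', '8960': 'P'}
```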

2 Answers


You can add ([^ ])* to the regex to extend it, so that it also matches any run of consecutive non-space characters between the code letter and the target token:

from pyspark.sql.functions import col, regexp_extract

dfNew = df.withColumn(
    'Exx05',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|Exx05)', 1)
).withColumn(
    'C705',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|C705)', 1)
).withColumn(
    '8960',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|8960)', 1)
)

dfNew.show(truncate=False)
+---------------------+---+-----+----+----+
|Decode               |   |Exx05|C705|8960|
+---------------------+---+-----+----+----+
|M|C705|Exx05         |2  |M    |M   |    |
|M|Exx05              |4  |M    |    |    |
|M|C705 P|Exx05       |6  |P    |M   |    |
|M|C705 P|8960 L|Exx05|7  |L    |M   |P   |
|M|C705 P|78|8960     |9  |     |M   |P   |
+---------------------+---+-----+----+----+
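Outside Spark, you can verify the same pattern with Python's built-in re module; the ([^ ])* part is what lets the code letter sit any number of |-separated tokens before the target, as long as no space intervenes (a quick sketch using the Exx05 pattern):

```python
import re

# (M|P|L) captures the code letter; ([^ ])* bridges extra |-separated
# tokens (e.g. '|C705') without crossing a space; \|Exx05 anchors the target.
pattern = r'(M|P|L)([^ ])*(\|Exx05)'

for s in ['M|C705|Exx05', 'M|C705 P|Exx05', 'M|C705 P|8960 L|Exx05']:
    m = re.search(pattern, s)
    print(s, '->', m.group(1) if m else '')
```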



What about using X(?=Y), also known as a lookahead assertion? This ensures we match X only if it is followed by Y.

from pyspark.sql.functions import *

dfNew = df.withColumn(
    'Exx05', regexp_extract(col('Decode'), r'([A-Z](?=\|Exx05))', 1)
).withColumn(
    'C705', regexp_extract(col('Decode'), r'([A-Z](?=\|C705))', 1)
).withColumn(
    '8960', regexp_extract(col('Decode'), r'([A-Z]+(?=\|[0-9]|8960))', 1)
)
dfNew.show()

+--------------------+---+-----+----+----+
|              Decode|   |Exx05|C705|8960|
+--------------------+---+-----+----+----+
|        M|C705|Exx05|  2|     |   M|    |
|             M|Exx05|  4|    M|    |    |
|      M|C705 P|Exx05|  6|    P|   M|    |
|M|C705 P|8960 L|E...|  7|    L|   M|   P|
|    M|C705 P|78|8960|  9|     |   M|   P|
+--------------------+---+-----+----+----+
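The lookahead behaviour is easy to confirm with plain Python's re module before wiring it into Spark (a small sketch using the Exx05 pattern):

```python
import re

# [A-Z](?=\|Exx05): match one capital letter only when it is immediately
# followed by '|Exx05'; the lookahead checks but does not consume it.
pattern = r'[A-Z](?=\|Exx05)'

for s in ['M|Exx05', 'M|C705 P|Exx05', 'M|C705 P|8960 L|Exx05']:
    m = re.search(pattern, s)
    print(s, '->', m.group(0) if m else '')
```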

