
I am trying to split the string into different columns using a regular expression.

Below is my data

decodeData = [('M|C705|Exx05','2'),
              ('M|Exx05','4'),
              ('M|C705 P|Exx05','6'),
              ('M|C705 P|8960 L|Exx05','7'),
              ('M|C705 P|78|8960','9')]

df = sc.parallelize(decodeData).toDF(['Decode',''])

dfNew = df.withColumn(
    'Exx05', regexp_extract(col('Decode'), '(M|P|M)(\\|Exx05)', 1)
).withColumn(
    'C705', regexp_extract(col('Decode'), '(M|P|M)(\\|C705)', 1)
).withColumn(
    '8960', regexp_extract(col('Decode'), '(M|P|M)(\\|8960)', 1)
)
dfNew.show()

Result
+---------------------+---+-----+----+----+
|               Decode|   |Exx05|C705|8960|
+---------------------+---+-----+----+----+
|         M|C705|Exx05|  2|     |   M|    |
|              M|Exx05|  4|    M|    |    |
|       M|C705 P|Exx05|  6|    P|   M|    |
|M|C705 P|8960 L|Exx05|  7|    M|   M|   P|
|     M|C705 P|78|8960|  9|     |   M|    |
+---------------------+---+-----+----+----+

Here I am trying to extract the code for the strings Exx05, C705 and 8960, each of which can fall under the M/P/L codes. For example, when decoding 'M|C705 P|8960 L|Exx05' I expect the results L, M, P in the respective columns. However, I am missing some logic here, which I am finding difficult to crack.

Expected results

+---------------------+---+-----+----+----+
|               Decode|   |Exx05|C705|8960|
+---------------------+---+-----+----+----+
|         M|C705|Exx05|   |    M|   M|    |
|              M|Exx05|   |    M|    |    |
|       M|C705 P|Exx05|   |    P|   M|    |
|M|C705 P|8960 L|Exx05|   |    L|   M|   P|
|     M|C705 P|78|8960|   |     |   M|   P|
+---------------------+---+-----+----+----+

When I try to change the regular expression accordingly, it works for some cases but not for others, and this is just a subset of the actual data I am working on.

For example:

1. Exx05 can fall under any code (M/L/P), and it can appear at any position: beginning, middle or end.

2. One Decode can belong to only one code (M, L or P) per entry/ID, i.e. M|Exx05 P|8960 L|Exx05 (where Exx05 falls under both M and L) will not occur.
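To pin down the output I expect, the mapping can also be written in plain Python without any regex (a hypothetical helper just to make the expected behaviour concrete, not my actual approach):

```python
def decode_codes(decode, targets=('Exx05', 'C705', '8960')):
    """Map each target token to the M/P/L code of the space-separated
    group it appears in; '' when the token is absent."""
    result = {t: '' for t in targets}
    for group in decode.split(' '):          # e.g. 'L|Exx05'
        code, *tokens = group.split('|')     # code='L', tokens=['Exx05']
        for t in targets:
            if t in tokens:
                result[t] = code
    return result

print(decode_codes('M|C705 P|8960 L|Exx05'))
# {'Exx05': 'L', 'C705': 'M', '8960': 'P'}
```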

2 Answers


You can add ([^ ])* to the regex to extend it, so that it also matches any run of consecutive non-space characters between the code letter and the target token:

from pyspark.sql.functions import col, regexp_extract

dfNew = df.withColumn(
    'Exx05',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|Exx05)', 1)
).withColumn(
    'C705',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|C705)', 1)
).withColumn(
    '8960',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|8960)', 1)
)

dfNew.show(truncate=False)
+---------------------+---+-----+----+----+
|Decode               |   |Exx05|C705|8960|
+---------------------+---+-----+----+----+
|M|C705|Exx05         |2  |M    |M   |    |
|M|Exx05              |4  |M    |    |    |
|M|C705 P|Exx05       |6  |P    |M   |    |
|M|C705 P|8960 L|Exx05|7  |L    |M   |P   |
|M|C705 P|78|8960     |9  |     |M   |P   |
+---------------------+---+-----+----+----+
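Outside Spark, you can verify the same pattern with Python's built-in re module; the ([^ ])* part is what lets the code letter sit any number of |-separated tokens before the target, as long as no space intervenes (a quick sketch using the Exx05 pattern):

```python
import re

# (M|P|L) captures the code letter; ([^ ])* bridges extra |-separated
# tokens (e.g. '|C705') without crossing a space; \|Exx05 anchors the target.
pattern = r'(M|P|L)([^ ])*(\|Exx05)'

for s in ['M|C705|Exx05', 'M|C705 P|Exx05', 'M|C705 P|8960 L|Exx05']:
    m = re.search(pattern, s)
    print(s, '->', m.group(1) if m else '')
```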



What about using X(?=Y), also known as a lookahead assertion? This ensures we match X only if it is followed by Y.

from pyspark.sql.functions import *

dfNew = df.withColumn(
    'Exx05', regexp_extract(col('Decode'), r'([A-Z](?=\|Exx05))', 1)
).withColumn(
    'C705', regexp_extract(col('Decode'), r'([A-Z](?=\|C705))', 1)
).withColumn(
    '8960', regexp_extract(col('Decode'), r'([A-Z]+(?=\|[0-9]|8960))', 1)
)
dfNew.show()

+--------------------+---+-----+----+----+
|              Decode|   |Exx05|C705|8960|
+--------------------+---+-----+----+----+
|        M|C705|Exx05|  2|     |   M|    |
|             M|Exx05|  4|    M|    |    |
|      M|C705 P|Exx05|  6|    P|   M|    |
|M|C705 P|8960 L|E...|  7|    L|   M|   P|
|    M|C705 P|78|8960|  9|     |   M|   P|
+--------------------+---+-----+----+----+
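The lookahead behaviour is easy to confirm with plain Python's re module before wiring it into Spark (a small sketch using the Exx05 pattern):

```python
import re

# [A-Z](?=\|Exx05): match one capital letter only when it is immediately
# followed by '|Exx05'; the lookahead checks but does not consume it.
pattern = r'[A-Z](?=\|Exx05)'

for s in ['M|Exx05', 'M|C705 P|Exx05', 'M|C705 P|8960 L|Exx05']:
    m = re.search(pattern, s)
    print(s, '->', m.group(0) if m else '')
```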

