Extract values from spark dataframe column into new derived column

Question

I have a DataFrame with the following schema:

        root
         |-- SOURCE: string (nullable = true)
         |-- SYSTEM_NAME: string (nullable = true)
         |-- BUCKET_NAME: string (nullable = true)
         |-- LOCATION: string (nullable = true)
         |-- FILE_NAME: string (nullable = true)
         |-- LAST_MOD_DATE: string (nullable = true)
         |-- FILE_SIZE: string (nullable = true)

I would like to derive a new column from values extracted from existing columns. The data in the LOCATION column looks like this:

example 1: prod/docs/Folder1/AA160039/Folder2/XXX.pdf
example 2: prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx

Question 1: I would like to derive a new column called "folder_num" and extract the following into it:

1. Two characters followed by six digits between slashes. Output is "AA160039". This mask will not change: it is always two characters followed by six digits.
2. Digits only, when they sit between slashes. Output is "355" from the example above. The number can be a single digit such as "8", two digits "55", three digits "444", or up to five digits "12345". As long as the digits are between slashes, they need to be extracted into the new column.

How can I achieve this in Spark? I'm new to this technology, so your help is much appreciated.

from pyspark.sql.functions import col, trim

df1 = df0.withColumn("LOCATION", trim(col('LOCATION')))
# if LOCATION contains '/[A-Z]{2}[0-9]{6}/': extract the value and add it to the new derived column
# if LOCATION contains '/[0-9]{1,5}/': extract the value and add it to the new derived column

Thank you for the help.

Added Code:

df1 = df0.withColumn("LAST_MOD_DATE",(col("LAST_MOD_DATE").cast("timestamp")))\
                         .withColumn("FILE_SIZE",(col("FILE_SIZE").cast("integer")))\
                         .withColumn("LOCATION", trim(col('LOCATION')))\
                         .withColumn("FOLDER_NUM", when(regexp_extract(col("FILE_NAME"), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""), 
                                                     regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                                                .otherwise(regexp_extract(col("LOCATION"),".*/([0-9]{1,5})/.*" , 1)))



+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|SOURCE|SYSTEM_NAME|BUCKET_NAME|            LOCATION|          FILE_NAME|      LAST_MOD_DATE|FILE_SIZE|FOLDER_NUM|
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|    s3|        xxx|    bucket1|production/Notifi...|AA120068_Letter.pdf|2020-07-20 15:51:21|    13124|          |
|    s3|        xxx|    bucket1|production/Notifi...|ZZ120093_Letter.pdf|2020-07-20 15:51:21|    61290|          |
|    s3|        xxx|    bucket1|production/Notifi...|XC120101_Letter.pdf|2020-07-20 15:51:21|    61700|          |

Matt · Accepted Answer · 2020-10-30 08:12:42Z


Well, you are on the right track:

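from pyspark.sql.functions import regexp_extract, trim

# One-row sample frame; assumes an existing SparkSession named `spark`
df = spark.createDataFrame([{"old_column": "ex@mple trimed"}], 'old_column string')

# Trim the column, then keep capture group 1 of the regex match
df.withColumn('new_column', regexp_extract(trim('old_column'), '(e.*@)', 1)).show()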

This trims the column and extracts capture group 1 of the text that matches the regular expression.


SD3 · Accepted Answer · 2020-10-30 08:16:55Z


You can use regexp_extract and when. See the sample Scala Spark code below.

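  import org.apache.spark.sql.functions.{col, lit, regexp_extract, when}

  // Try the "two letters + six digits" folder first; fall back to a digits-only folder
  df.withColumn("folder_num",
    when(regexp_extract(col("LOCATION"), ".*/[A-Z]{2}([0-9]{6})/.*", 1) =!= lit(""),
      regexp_extract(col("LOCATION"), ".*/[A-Z]{2}([0-9]{6})/.*", 1))
      .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1))
  ).show(false)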

+------------------------------------------------------+----------+
|LOCATION                                              |folder_num|
+------------------------------------------------------+----------+
|prod/docs/Folder1/AA160039/Folder2/XXX.pdf            |160039    |
|prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx|355       |
+------------------------------------------------------+----------+

If you need the output of the first row to be AA160039, just change the capture group in the regex as below.

regexp_extract(col("LOCATION"),".*/([A-Z]{2}[0-9]{6})/.*" ,1)
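
To make the effect of moving the parentheses concrete, here is a minimal sketch using Python's standard `re` module rather than Spark. It assumes that `re.search(...).group(1)` behaves like `regexp_extract(..., 1)` for these two simple patterns; the variable names are only illustrative.

```python
import re

location = "prod/docs/Folder1/AA160039/Folder2/XXX.pdf"

# Group 1 wraps only the six digits, so the two letters are matched but not returned.
digits_only = re.search(r".*/[A-Z]{2}([0-9]{6})/.*", location).group(1)

# Group 1 wraps the two letters and the six digits together.
with_prefix = re.search(r".*/([A-Z]{2}[0-9]{6})/.*", location).group(1)

print(digits_only)   # 160039
print(with_prefix)   # AA160039
```

Only the position of the parentheses differs between the two patterns; the text that has to be present in the path is the same.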


6 Comments

AJR Over a year ago

What does "=!=" mean in this case? Or is this a typo?

SD3 Over a year ago

@AJR, "=!=" is the "not equals" operator for comparing columns in Scala Spark. You can replace it with the appropriate column "not equals" operator in Python. Basically, when the first regex does not match the pattern, regexp_extract returns an empty string. We just check the result: if it is not an empty string we use it, otherwise we use the next regex.

AJR Over a year ago

Thanks @SD3. Another question, so that I understand your code: why are you checking whether the value is "not equals" to the empty string for the first expression but not for the second one?

SD3 Over a year ago

@AJR, you have two expressions to match, and I assumed that either the first or the second will be present. If both could be present, one of them has to take priority, and I assumed the first. So this is just an if/else: if the first expression matches, extract it; otherwise (the first returns an empty string) match the second expression and extract it. If the second is not found either, we get an empty string column, since there is no third expression to check.
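
The if/else described in this comment can be sketched in plain Python. This is not Spark code: `extract_or_empty` is a hypothetical stand-in for `regexp_extract` (it returns an empty string when nothing matches), and `folder_num` is a hypothetical name for the when/otherwise chain. In PySpark itself the column comparison is written with `!=`, as in the question's added code.

```python
import re


def extract_or_empty(text, pattern):
    """Return capture group 1 of the first match, or "" when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


def folder_num(location):
    # First expression: two uppercase letters + six digits between slashes.
    first = extract_or_empty(location, r".*/([A-Z]{2}[0-9]{6})/.*")
    if first != "":
        return first
    # Fallback: one to five digits between slashes ("" if this fails as well).
    return extract_or_empty(location, r".*/([0-9]{1,5})/.*")


print(folder_num("prod/docs/Folder1/AA160039/Folder2/XXX.pdf"))              # AA160039
print(folder_num("prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"))  # 355
print(repr(folder_num("prod/docs/Folder1/readme.txt")))                      # ''
```

The last call shows the case mentioned at the end of the comment: neither expression is present, so the result is an empty string.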

AJR Over a year ago

Sorry to bother you again, @SD3, but my DataFrame is not displaying anything in FOLDER_NUM. Can you please take a look at it?

AJR · Accepted Answer · 2020-10-30 17:56:56Z


The information provided was really helpful. I appreciate everyone putting me on the right track. The final version of the code is below.

from pyspark.sql.functions import col, lit, regexp_extract, trim, when

# Priority: ID in FILE_NAME, then ID folder in LOCATION, then numeric folder in LOCATION, else "Unknown"
df1 = df0.withColumn("LAST_MOD_DATE",(col("LAST_MOD_DATE").cast("timestamp")))\
                         .withColumn("FILE_SIZE",(col("FILE_SIZE").cast("integer")))\
                         .withColumn("LOCATION", trim(col('LOCATION')))\
                         .withColumn("FOLDER_NUM", when(regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""), regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1))
                                                .when(regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1) != lit(""), regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                                                .when(regexp_extract(trim(col("LOCATION")),".*/([0-9]{1,5})/.*" , 1) != lit(""), regexp_extract(trim(col("LOCATION")),".*/([0-9]{1,5})/.*" , 1))
                                                .otherwise("Unknown"))

Thanks.
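
One plausible reading of why the code added to the question left FOLDER_NUM empty is that it tested FILE_NAME but then extracted from LOCATION, whereas the final version extracts from the same column it tests. The sketch below mirrors both versions in plain Python with the standard `re` module. The function names and the sample `location` value are hypothetical (the real LOCATION values are truncated in the output above), and `extract_or_empty` is only a stand-in for `regexp_extract`.

```python
import re

ID_IN_NAME = r"([A-Z]{2}[0-9]{6}).*"
ID_IN_PATH = r".*/([A-Z]{2}[0-9]{6})/.*"
NUM_IN_PATH = r".*/([0-9]{1,5})/.*"


def extract_or_empty(text, pattern):
    """Return capture group 1 of the first match, or "" when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


def folder_num_first_attempt(location, file_name):
    # Mirrors the question's added code: the test uses FILE_NAME,
    # but the value is taken from LOCATION.
    if extract_or_empty(file_name, ID_IN_NAME) != "":
        return extract_or_empty(location, ID_IN_PATH)
    return extract_or_empty(location, NUM_IN_PATH)


def folder_num_final(location, file_name):
    # Mirrors the final version: each branch extracts from the column it tests.
    candidates = (
        (file_name, ID_IN_NAME),
        (location, ID_IN_PATH),
        (location, NUM_IN_PATH),
    )
    for text, pattern in candidates:
        value = extract_or_empty(text.strip(), pattern)
        if value != "":
            return value
    return "Unknown"


location = "some/hypothetical/folder/"  # hypothetical path with no ID or numeric folder
file_name = "AA120068_Letter.pdf"

print(repr(folder_num_first_attempt(location, file_name)))  # ''
print(folder_num_final(location, file_name))                # AA120068
```

With a path that has no matching folder, the first version returns an empty string even though the file name carries the ID, which matches the blank FOLDER_NUM column shown in the question.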


1 Comment

AJR Over a year ago

Shout-out to @SD3.
