I have a dataframe with the following schema:
root
|-- SOURCE: string (nullable = true)
|-- SYSTEM_NAME: string (nullable = true)
|-- BUCKET_NAME: string (nullable = true)
|-- LOCATION: string (nullable = true)
|-- FILE_NAME: string (nullable = true)
|-- LAST_MOD_DATE: string (nullable = true)
|-- FILE_SIZE: string (nullable = true)
I would like to derive a new column from values extracted from certain columns. The data in the LOCATION column looks like this:
example 1: prod/docs/Folder1/AA160039/Folder2/XXX.pdf
example 2: prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx
Question 1: I would like to derive a new column called "folder_num" by extracting either of the following from LOCATION:
1. Two characters followed by six digits, between slashes. The output for example 1 is "AA160039". This mask never changes: it is always two characters followed by six digits.
2. Digits only, when they sit between slashes. The output for example 2 is "355". The number can be a single digit such as "8", two digits such as "55", three such as "444", or up to five such as "12345". As long as the digits are between slashes, they need to be extracted into the new column.
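To make the two rules concrete, here is a minimal sketch in plain Python using the `re` module. It rests on a few assumptions that are not stated above: "2 characters" is read as two uppercase letters A-Z, rule 1 takes priority over rule 2 when both could match, and the helper name `extract_folder_num` is made up for illustration.

```python
import re

# Rule 1: two letters followed by six digits, forming one whole path segment.
# Assumption: "2 characters" means uppercase A-Z.
ID_SEGMENT = re.compile(r"/([A-Z]{2}[0-9]{6})/")

# Rule 2: a path segment made up of 1 to 5 digits and nothing else.
NUM_SEGMENT = re.compile(r"/([0-9]{1,5})/")


def extract_folder_num(location):
    """Return the first rule-1 match, else the first rule-2 match, else None."""
    cleaned = location.strip()
    for pattern in (ID_SEGMENT, NUM_SEGMENT):
        match = pattern.search(cleaned)
        if match:
            return match.group(1)
    return None


print(extract_folder_num("prod/docs/Folder1/AA160039/Folder2/XXX.pdf"))
# AA160039
print(extract_folder_num("prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"))
# 355
```

Because both patterns require a slash on each side, a segment such as "Folder3" or a file name such as "355.pdf" is not picked up, and neither is a value in the very first segment of the path, which has no leading slash.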
How can I achieve this in Spark? I'm new to this technology, so your help is much appreciated.
What I have so far, with the logic I want written as pseudocode:
df1 = df0.withColumn("LOCATION", trim(col('LOCATION')))
# if LOCATION matches '/[A-Z]{2}[0-9]{6}/': extract the value and add it to the new derived column
# if LOCATION matches '/[0-9]{1,5}/': extract the value and add it to the new derived column
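One possible way to turn that pseudocode into PySpark is sketched below. It relies on `regexp_extract` returning an empty string when the pattern does not match, so the empty string can be used as the "no match" signal inside `when`. The function name `with_folder_num` and its parameters are hypothetical, and trying the letter-and-digit rule before the digits-only rule is an assumption about the intended priority.

```python
# Patterns without a leading or trailing ".*": regexp_extract searches
# anywhere in the string, so the slashes are enough to anchor a segment.
ID_PATTERN = r"/([A-Z]{2}[0-9]{6})/"   # assumes the two characters are A-Z
NUM_PATTERN = r"/([0-9]{1,5})/"


def with_folder_num(df, location_col="LOCATION", out_col="FOLDER_NUM"):
    """Add out_col holding the rule-1 match, else the rule-2 match, else ''."""
    # Imported here so the patterns above can be reused without Spark installed.
    from pyspark.sql import functions as F

    location = F.trim(F.col(location_col))
    id_match = F.regexp_extract(location, ID_PATTERN, 1)
    num_match = F.regexp_extract(location, NUM_PATTERN, 1)

    # regexp_extract yields "" on no match, so fall through to the digits rule.
    return df.withColumn(
        out_col,
        F.when(id_match != "", id_match).otherwise(num_match),
    )
```

Usage would presumably look like `df1 = with_folder_num(df0)`. Rows where neither rule matches end up with an empty string rather than null.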
Thank you for the help.
Added Code:
from pyspark.sql.functions import col, lit, regexp_extract, trim, when

df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
    .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
    .withColumn("LOCATION", trim(col("LOCATION")))\
    .withColumn("FOLDER_NUM",
        when(regexp_extract(col("FILE_NAME"), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""),
             regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
        .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1)))
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|SOURCE|SYSTEM_NAME|BUCKET_NAME|            LOCATION|          FILE_NAME|      LAST_MOD_DATE|FILE_SIZE|FOLDER_NUM|
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|    s3|        xxx|    bucket1|production/Notifi...|AA120068_Letter.pdf|2020-07-20 15:51:21|    13124|          |
|    s3|        xxx|    bucket1|production/Notifi...|ZZ120093_Letter.pdf|2020-07-20 15:51:21|    61290|          |
|    s3|        xxx|    bucket1|production/Notifi...|XC120101_Letter.pdf|2020-07-20 15:51:21|    61700|          |
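A likely reason FOLDER_NUM comes back empty is that the `when` condition tests FILE_NAME, while the value is extracted from LOCATION. When the file name matches but LOCATION has no such segment, the first branch returns an empty string and the digits-only rule is never tried. The sketch below reproduces that logic in plain Python; `regexp_extract` here is a rough stand-in for the Spark function, and the first LOCATION value is hypothetical because the real one is truncated in the output above.

```python
import re


def regexp_extract(value, pattern, idx):
    """Rough stand-in for Spark's regexp_extract: '' when nothing matches."""
    match = re.search(pattern, value)
    return match.group(idx) if match else ""


def folder_num_as_posted(location, file_name):
    # Same branching as the added code: the condition looks at FILE_NAME ...
    if regexp_extract(file_name, r"([A-Z]{2}[0-9]{6}).*", 1) != "":
        # ... but the value is taken from LOCATION.
        return regexp_extract(location, r".*/([A-Z]{2}[0-9]{6})/.*", 1)
    return regexp_extract(location, r".*/([0-9]{1,5})/.*", 1)


def folder_num_checking_location(location):
    # Variant where both the test and the extraction use LOCATION.
    id_match = regexp_extract(location, r"/([A-Z]{2}[0-9]{6})/", 1)
    return id_match if id_match != "" else regexp_extract(location, r"/([0-9]{1,5})/", 1)


# Hypothetical LOCATION with no matching segment.
print(repr(folder_num_as_posted("some/folder/without/ids/x.pdf", "AA120068_Letter.pdf")))
# ''

# LOCATION from example 2: the digits rule is skipped because FILE_NAME matched.
example_2 = "prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"
print(repr(folder_num_as_posted(example_2, "AA120068_Letter.pdf")))
# ''
print(repr(folder_num_checking_location(example_2)))
# '355'
```

If the value should also be picked up from FILE_NAME when LOCATION has none, that would need a third branch that extracts from FILE_NAME; the sketch does not assume that is the intent.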