0

I have a DataFrame with the following schema:

        root
         |-- SOURCE: string (nullable = true)
         |-- SYSTEM_NAME: string (nullable = true)
         |-- BUCKET_NAME: string (nullable = true)
         |-- LOCATION: string (nullable = true)
         |-- FILE_NAME: string (nullable = true)
         |-- LAST_MOD_DATE: string (nullable = true)
         |-- FILE_SIZE: string (nullable = true)

I would like to derive a new column by extracting values from certain existing columns. The data in the LOCATION column looks like this:

example 1: prod/docs/Folder1/AA160039/Folder2/XXX.pdf
example 2: prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx

Question 1: I would like to derive a new column called "folder_num" by extracting the following:

1. The 2 letters followed by 6 digits between the slashes. The output is "AA160039". This mask will not change: it is always 2 letters followed by 6 digits.
2. Digits, but only if they sit between slashes. The output is "355" from the example above. The number could be a single digit such as "8", two digits such as "55", three such as "444", or up to 5 digits such as "12345". As long as the digits are between slashes, they need to be extracted into the new column.

How can I achieve this in Spark? I'm new to this technology, so your help is much appreciated.

df1 = df0.withColumn("LOCATION", trim(col('LOCATION')))
# Pseudocode for what I want:
# if LOCATION matches '/[A-Z]{2}[0-9]{6}/', extract the value into the new derived column
# if LOCATION matches '/[0-9]{1,5}/', extract the value into the new derived column

Thank you for the help.

Added Code:

df1 = df0.withColumn("LAST_MOD_DATE",(col("LAST_MOD_DATE").cast("timestamp")))\
                         .withColumn("FILE_SIZE",(col("FILE_SIZE").cast("integer")))\
                         .withColumn("LOCATION", trim(col('LOCATION')))\
                         .withColumn("FOLDER_NUM", when(regexp_extract(col("FILE_NAME"), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""), 
                                                     regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                                                .otherwise(regexp_extract(col("LOCATION"),".*/([0-9]{1,5})/.*" , 1)))



+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|SOURCE|SYSTEM_NAME|BUCKET_NAME|            LOCATION|          FILE_NAME|      LAST_MOD_DATE|FILE_SIZE|FOLDER_NUM|
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|    s3|        xxx|    bucket1|production/Notifi...|AA120068_Letter.pdf|2020-07-20 15:51:21|    13124|          |
|    s3|        xxx|    bucket1|production/Notifi...|ZZ120093_Letter.pdf|2020-07-20 15:51:21|    61290|          |
|    s3|        xxx|    bucket1|production/Notifi...|XC120101_Letter.pdf|2020-07-20 15:51:21|    61700|          |
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+

3 Answers

0

You are on the right track:

from pyspark.sql.functions import regexp_extract, trim

df = spark.createDataFrame([{"old_column": "ex@mple trimed"}], 'old_column string')

# trim the column, then extract capture group 1 of the pattern
df.withColumn('new_column', regexp_extract(trim('old_column'), '(e.*@)', 1)).show()

This trims the column and then extracts capture group 1 of the first match of the regular expression.
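
To see what "group 1" means without starting a Spark session, here is a minimal plain-Python sketch. The helper `extract_group` is hypothetical (it is not a Spark function); it only imitates the behaviour relied on in this thread, namely that the extraction yields an empty string when the pattern does not match. It assumes Python's `re` handles these simple patterns the same way as the Java regex engine behind `regexp_extract`.

```python
import re


def extract_group(value: str, pattern: str, idx: int) -> str:
    """Hypothetical stand-in for regexp_extract: group `idx` of the
    first match, or an empty string when nothing matches."""
    match = re.search(pattern, value)
    return match.group(idx) if match else ""


raw = "  ex@mple trimed  "
trimmed = raw.strip()  # roughly what trim() does to the column value

print(repr(extract_group(trimmed, r"(e.*@)", 1)))  # 'ex@'
print(repr(extract_group(trimmed, r"(\d+)", 1)))   # '' because there are no digits
```

The empty-string result on a failed match is what makes the `when(... != "")` checks in the other answers possible.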

0

You can use regexp_extract together with when. See the sample Scala Spark code below.

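import org.apache.spark.sql.functions.{col, lit, regexp_extract, when}

// Try the letters-plus-digits pattern first; fall back to the digits-only pattern.
df.withColumn("folder_num",
  when(regexp_extract(col("LOCATION"), ".*/[A-Z]{2}([0-9]{6})/.*", 1) =!= lit(""),
    regexp_extract(col("LOCATION"), ".*/[A-Z]{2}([0-9]{6})/.*", 1))
    .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1))
).show(false)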

+------------------------------------------------------+----------+
|LOCATION                                              |folder_num|
+------------------------------------------------------+----------+
|prod/docs/Folder1/AA160039/Folder2/XXX.pdf            |160039    |
|prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx|355       |
+------------------------------------------------------+----------+

If you need the output of the first row to be AA160039, just move the capture group in the regex so that it includes the two letters:

regexp_extract(col("LOCATION"),".*/([A-Z]{2}[0-9]{6})/.*" ,1)
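
The effect of moving the parentheses can be checked on the two example paths from the question with a short plain-Python sketch. The names `DIGITS_ONLY`, `WITH_LETTERS` and `group1` are made up for this illustration, and the sketch assumes Python's `re` treats these patterns like the Java regex engine Spark uses.

```python
import re

PATHS = [
    "prod/docs/Folder1/AA160039/Folder2/XXX.pdf",
    "prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx",
]

# Same expression twice; only the position of the capture group differs.
DIGITS_ONLY = r".*/[A-Z]{2}([0-9]{6})/.*"   # group 1 = the 6 digits
WITH_LETTERS = r".*/([A-Z]{2}[0-9]{6})/.*"  # group 1 = 2 letters + 6 digits


def group1(pattern: str, text: str) -> str:
    """Return capture group 1 of the first match, or '' if there is none."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


for path in PATHS:
    print(repr(group1(DIGITS_ONLY, path)), repr(group1(WITH_LETTERS, path)))
```

For the first path this prints '160039' and 'AA160039'; for the second path both are empty, which is why the digits-only fallback pattern is still needed.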

6 Comments

What does "=!=" mean in this case? Or is it a typo?
@AJR, "=!=" is the "not equals" operator for comparing columns in Scala Spark. You can replace it with the corresponding column "not equals" operator in Python. Basically, when the first regex does not match the pattern, regexp_extract returns an empty string. We just check whether the result is non-empty: if so we use it, otherwise we use the next regex.
Thanks @SD3. Another question so I understand your code: why are you checking whether the value is "not equals" to the empty string for the first expression but not for the second one?
@AJR, you have two expressions to match, and I assumed that either the first or the second will be present. I also assumed that if both are present, the first expression takes priority. So this is just an if/else: if the first expression matches, extract it; otherwise (the first returns an empty string) match the second expression and extract it. If the second is not found either, we get an empty string, because there is no third expression to check.
Sorry to bother you again, @SD3, but my DataFrame is not showing anything in FOLDER_NUM. Could you please take a look?
0

The information provided was really helpful. I appreciate everyone putting me on the right track. The final version of the code is below.

from pyspark.sql.functions import col, lit, regexp_extract, trim, when

df1 = df0.withColumn("LAST_MOD_DATE",(col("LAST_MOD_DATE").cast("timestamp")))\
                         .withColumn("FILE_SIZE",(col("FILE_SIZE").cast("integer")))\
                         .withColumn("LOCATION", trim(col('LOCATION')))\
                         .withColumn("FOLDER_NUM", when(regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""), regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1))
                                                .when(regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1) != lit(""), regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                                                .when(regexp_extract(trim(col("LOCATION")),".*/([0-9]{1,5})/.*" , 1) != lit(""), regexp_extract(trim(col("LOCATION")),".*/([0-9]{1,5})/.*" , 1))
                                                .otherwise("Unknown"))
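
The order of checks in the chain above (FILE_NAME first, then the two LOCATION patterns, then "Unknown") can be tried out in plain Python before running it on a cluster. This is only a sketch: `folder_num` and `first_group` are hypothetical helper names, the sample paths other than the two from the question are invented, and Python's `re` is assumed to behave like Spark's regex engine for these patterns.

```python
import re

CODE = r"[A-Z]{2}[0-9]{6}"  # 2 uppercase letters followed by 6 digits


def first_group(pattern: str, text: str) -> str:
    """Capture group 1 of the first match, or '' when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


def folder_num(file_name: str, location: str) -> str:
    """Mirror the when/when/when/otherwise chain: first non-empty result wins."""
    candidates = [
        (r"(" + CODE + r").*", file_name.strip()),     # code inside FILE_NAME
        (r".*/(" + CODE + r")/.*", location.strip()),  # code as a LOCATION segment
        (r".*/([0-9]{1,5})/.*", location.strip()),     # 1 to 5 digits as a segment
    ]
    for pattern, text in candidates:
        value = first_group(pattern, text)
        if value != "":
            return value
    return "Unknown"


# Invented location for the first call; the other two come from the question.
print(folder_num("AA120068_Letter.pdf", "some/made/up/path"))
print(folder_num("XXX.pdf", "prod/docs/Folder1/AA160039/Folder2/XXX.pdf"))
print(folder_num("zzz.docx", "prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"))
```

Because each branch extracts from the same column it tests, a match in FILE_NAME can no longer produce an empty FOLDER_NUM.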

Thanks.

1 Comment

shout out to @SD3.
