0

i have a String column and need to extract values of it into multiple columns based on the name associated with it.

otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 

the columns need to be formed from above are

State     | Area      | Sub Area | ID | Name
DALLocate | SFO-4/3/9 | 8        | 8  | 7

any help is appreciated.

2 Answers 2

1

IF the pattern is always fixed you could use regexp_extract:

from pyspark.sql.functions import regexp_extract

df = spark.createDataFrame([{"raw": "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 "}], 'raw string') 

(df
 .select(regexp_extract('raw', 'State ([^_]*)', 1).alias('State'), 
         regexp_extract('raw', 'State ([a-zA-Z]*)_([^ ]*)', 2).alias('Area'), 
         regexp_extract('raw', 'Area=<(.*)>', 1).alias('Sub Area'), 
         regexp_extract('raw', 'ID ([^ ]*)', 1).alias('ID'),
         regexp_extract('raw', 'Name ([^ ]*)', 1).alias('Name')).show())

regexp_extract takes 3 arguments the first ist the column you want to match on. the second is the pattern and the third is the group you want to extract.

ref: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_extract

Sign up to request clarification or add additional context in comments.

3 Comments

thanks for the reply. I see this part regexp_extract('raw', 'State ([a-zA-Z]*)_([^ ]*)', 2).alias('Area'), is failing. my data will sometimes be of 'DALLocate_SFO-4/3/9' or 'DALLocate_SFO-4/3/9_DAX-10/3/3'.. in this case I need to take 'SFO-4/3/9_DAX-10/3/3' as its value. any guess on how to handle this
try ('State ([^_]*)_(.*) sub',2)
little help. I have a string with value like ' SFOLocate ID Expose Name 10 ID 3 Area 10 '. when I use your regex, it is considering ID=Expose. but it should be ID=3, if I extract with string and space. can you help. @Matt
0

Try this:

import org.apache.spark.sql.functions.udf
def myFunc: String => Array[String] = s => Array(/* TODO parse the string as you wish */)
val myUDF = udf(myFunc)

df.withColumn("parsedInput", myUDF(df("input")))
  .select(
    $"parsedInput"(0).as("State"),
    $"parsedInput"(1).as("Area"),
    $"parsedInput"(2).as("Sub Area"),
    $"parsedInput"(3).as("ID"),
    $"parsedInput"(4).as("Name"))

Where 'input' is your original input (e.g. "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 ").

Make sure your UDF returns a valid array (num of items and order)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.