Spark extract value to multiple columns based on name

Question

I have a string column and need to extract values from it into multiple columns, based on the name associated with each value.

otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7

The columns to be formed from the string above are:

State     | Area      | Sub Area | ID | Name
DALLocate | SFO-4/3/9 | 8        | 8  | 7

Any help is appreciated.

Matt · Accepted Answer · 2020-10-30 08:32:34Z

1

If the pattern is always fixed, you could use regexp_extract:

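from pyspark.sql.functions import regexp_extract

# One-row sample DataFrame with a single string column 'raw' (assumes an existing SparkSession named spark)
df = spark.createDataFrame([{"raw": "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 "}], 'raw string')

(df
 .select(regexp_extract('raw', 'State ([^_]*)', 1).alias('State'),              # text after 'State ' up to the first '_'
         regexp_extract('raw', 'State ([a-zA-Z]*)_([^ ]*)', 2).alias('Area'),   # group 2: text after '_' up to the next space
         regexp_extract('raw', 'Area=<(.*)>', 1).alias('Sub Area'),             # text between '<' and '>'
         regexp_extract('raw', 'ID ([^ ]*)', 1).alias('ID'),                    # token after 'ID '
         regexp_extract('raw', 'Name ([^ ]*)', 1).alias('Name'))                # token after 'Name '
 .show())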

regexp_extract takes three arguments: the first is the column you want to match on, the second is the pattern, and the third is the index of the capture group you want to extract.
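
To illustrate what the group index selects, here is a small sketch that uses Python's built-in re module rather than Spark. It assumes that patterns this simple behave the same in Python's re and in the Java regex engine that Spark uses. The helper extract is hypothetical (it is not part of PySpark); it only mimics the way regexp_extract returns an empty string when the pattern does not match.

```python
import re

def extract(text, pattern, idx):
    """Hypothetical stand-in for regexp_extract applied to a single string."""
    m = re.search(pattern, text)
    if m is None or m.group(idx) is None:
        return ""  # mirror regexp_extract: empty string when nothing matched
    return m.group(idx)

raw = "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 "
pattern = "State ([a-zA-Z]*)_([^ ]*)"

whole = extract(raw, pattern, 0)            # group 0: the entire match
state = extract(raw, pattern, 1)            # group 1: first parenthesised group
area = extract(raw, pattern, 2)             # group 2: second parenthesised group
missing = extract(raw, "Zip ([0-9]*)", 1)   # pattern absent from the string

print(repr(whole), repr(state), repr(area), repr(missing))
```

Group 0 is the whole matched text, while groups 1 and 2 are the parenthesised parts, which is why the 'Area' column above asks for group 2.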

ref: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_extract


3 Comments

marc Over a year ago

Thanks for the reply. I see that this part is failing: regexp_extract('raw', 'State ([a-zA-Z]*)_([^ ]*)', 2).alias('Area'). My data is sometimes 'DALLocate_SFO-4/3/9' and sometimes 'DALLocate_SFO-4/3/9_DAX-10/3/3'. In the latter case I need 'SFO-4/3/9_DAX-10/3/3' as the value. Any idea how to handle this?

Matt Over a year ago

Try ('State ([^_]*)_(.*) sub', 2)
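
A quick way to check this suggested pattern against both shapes of the data is sketched below with Python's re module. This assumes the pattern behaves the same in Spark's Java regex engine; the second sample string is made up from the value given in the comment above.

```python
import re

pattern = r"State ([^_]*)_(.*) sub"

single = "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 "
# Hypothetical row built from the 'DALLocate_SFO-4/3/9_DAX-10/3/3' case in the comment
double = "otherPartofString State DALLocate_SFO-4/3/9_DAX-10/3/3 sub Area=<8> ID 8 Name 7 "

# Group 1 stops at the first '_'; group 2 takes everything up to ' sub'
area_single = re.search(pattern, single).group(2)
area_double = re.search(pattern, double).group(2)

print(area_single)
print(area_double)
```

Because (.*) is greedy, group 2 runs up to the last ' sub' in the string, so this relies on ' sub' appearing only once after the State value.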

marc Over a year ago

A little more help, please. I have a string with a value like ' SFOLocate ID Expose Name 10 ID 3 Area 10 '. When I use your regex, it takes ID=Expose, but it should be ID=3 when I extract on the name followed by a space. Can you help? @Matt
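
One possible approach to the case in this last comment, sketched with Python's re module and assuming that real ID values are always numeric (an assumption, not something stated in the thread): require digits after 'ID ' so that 'ID Expose' is skipped.

```python
import re

text = " SFOLocate ID Expose Name 10 ID 3 Area 10 "

# Original pattern: the first 'ID ' in the string wins, whatever follows it
loose = re.search(r"ID ([^ ]*)", text).group(1)

# Stricter pattern: only accept 'ID ' when it is followed by digits
strict = re.search(r"ID (\d+)", text).group(1)

print(loose, strict)
```

In PySpark the stricter pattern could be passed as regexp_extract('raw', r'ID (\d+)', 1). If IDs can contain non-digit characters, the character class would need to be adjusted.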

Nir Hedvat · Accepted Answer · 2020-10-30 08:29:47Z

0

Try this:

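import org.apache.spark.sql.functions.udf
import spark.implicits._ // enables the $"..." column syntax (assumes a SparkSession named spark)

// Parses the raw string into an array of values, in the order State, Area, Sub Area, ID, Name
def myFunc: String => Array[String] = s => Array(/* TODO parse the string as you wish */)
val myUDF = udf(myFunc)

df.withColumn("parsedInput", myUDF(df("input")))
  .select(
    $"parsedInput"(0).as("State"),
    $"parsedInput"(1).as("Area"),
    $"parsedInput"(2).as("Sub Area"),
    $"parsedInput"(3).as("ID"),
    $"parsedInput"(4).as("Name"))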

Where 'input' is your original input (e.g. "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 ").

Make sure your UDF returns a valid array, with the expected number of items in the expected order.
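
As an illustration of the "valid array" requirement, here is a minimal plain-Python sketch of a parser that always returns five values in a fixed order. The names FIELDS and parse_row, and the patterns themselves, are hypothetical choices for this example and are not taken from the answer above.

```python
import re

# Hypothetical field order and patterns; adjust both to the real data
FIELDS = [
    ("State", r"State ([^_]*)"),
    ("Area", r"State [^_]*_(.*) sub"),
    ("Sub Area", r"Area=<(.*)>"),
    ("ID", r"ID ([^ ]*)"),
    ("Name", r"Name ([^ ]*)"),
]

def parse_row(s):
    """Return one value per field, in FIELDS order; '' when a field is absent."""
    values = []
    for _name, pattern in FIELDS:
        m = re.search(pattern, s)
        values.append(m.group(1) if m else "")
    return values

parsed = parse_row("otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 ")
empty = parse_row("")

print(parsed)
print(empty)
```

Returning an empty string for a missing field keeps the array length constant, so positional access such as parsedInput(3) always refers to the same field. In PySpark, a function like this could be wrapped with pyspark.sql.functions.udf using an ArrayType(StringType()) return type.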

