
I have a data frame with a string column and I want to create multiple columns out of it.

Here is my input data; pagename is my string column:

+---+-----------------+
|id |pagename         |
+---+-----------------+
|1  |a:100 b:500 c:200|
|2  |a:101 b:501 c:201|
+---+-----------------+

I want to create multiple columns from it. The format of the string is always the same: col1:value1 col2:value2 col3:value3 ... colN:valueN. In the output I need multiple columns, col1 through colN, with the values as rows for each column. Here is the expected output:

+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
|  1|100|500|200|
|  2|101|501|201|
+---+---+---+---+

How can I do this in Spark? Scala or Python are both fine for me. The code below creates the input DataFrame:

scala> val df = spark.sql(s"""select 1 as id, "a:100 b:500 c:200" as pagename union select 2 as id, "a:101 b:501 c:201" as pagename """)
df: org.apache.spark.sql.DataFrame = [id: int, pagename: string]

scala> df.show(false)
+---+-----------------+
|id |pagename         |
+---+-----------------+
|2  |a:101 b:501 c:201|
|1  |a:100 b:500 c:200|
+---+-----------------+

scala> df.printSchema
root
 |-- id: integer (nullable = false)
 |-- pagename: string (nullable = false)

Note: the example shows only 3 columns here, but in general I expect to deal with more than 100 columns.

2 Answers


You can use str_to_map, explode the resulting map and pivot:

import org.apache.spark.sql.functions.{col, expr, first}

// str_to_map parses the space-separated "key:value" pairs into a map; explode
// turns the map into (key, value) rows, and pivot spreads the keys into columns.
val df2 = df.select(
    col("id"),
    expr("explode(str_to_map(pagename, ' ', ':'))")
).groupBy("id").pivot("key").agg(first("value"))

df2.show
+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
|  1|100|500|200|
|  2|101|501|201|
+---+---+---+---+
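
Since the question allows Python as well, here is a PySpark equivalent as a sketch. Passing the key list to pivot is optional, but with the 100-plus keys mentioned in the question it saves Spark an extra pass to discover the distinct key values:

from pyspark.sql import functions as F

keys = ["a", "b", "c"]  # in practice, the full list of expected keys
df2 = (
    df.select("id", F.explode(F.expr("str_to_map(pagename, ' ', ':')")))
      .groupBy("id")
      .pivot("key", keys)  # explicit values skip the distinct-keys scan
      .agg(F.first("value"))
)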

5 Comments

Can we convert the numbers in columns a, b, c to any data type we want on the fly? For instance, right now the output shows them as strings, but I might need those columns as int, float, etc.
You can cast as needed in agg, e.g. agg(first("value").cast("int")).
Is str_to_map() a built-in function?
It's available in the SQL API, but not as a Scala/Python function, hence the expr call.
What is the best way to apply this when multiple columns hold such delimited data? For instance, if both pagename and pagename1 contain such data, the output columns should be id, a, b, c, a1, b1, c1, where a1, b1, c1 come from pagename1. The number of rows stays the same, but the columns expand out. (See the sketch below.)
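
For that last comment, a minimal PySpark sketch; the pagename1 column and the suffix convention are assumptions taken from the comment itself. It tags the keys from each source column with a suffix before a single pivot, and also applies the cast("int") mentioned above:

from pyspark.sql import functions as F

pair_cols = ["pagename", "pagename1"]  # hypothetical: all delimited columns

exploded = None
for c in pair_cols:
    suffix = c.removeprefix("pagename")  # "" for pagename, "1" for pagename1
    part = df.select(
        "id",
        F.explode(F.expr(f"str_to_map({c}, ' ', ':')")).alias("key", "value")
    ).withColumn("key", F.concat(F.col("key"), F.lit(suffix)))
    exploded = part if exploded is None else exploded.unionByName(part)

# One pivot over the suffixed keys yields id, a, b, c, a1, b1, c1
result = exploded.groupBy("id").pivot("key").agg(F.first("value").cast("int"))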

So, two options immediately come to mind.

Delimiters

You've got some obvious delimiters that you can split on. For this, use the split function:

    from pyspark.sql import functions as F

    delimiter = ":"

    df = df.withColumn(
        "split_column",
        F.split(F.col("pagename"), delimiter)
    )

    # "split_column" is now an array, so we need to pull items out of the array;
    # element 1 is "100 b", which still carries the next key after the whitespace
    df = df.withColumn(
        "a",
        F.col("split_column").getItem(1)
    )

Not ideal, as you'll still need to do some string manipulation to remove the whitespace and then convert to int, but this is easily applied to multiple columns.
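
A sketch of how that follow-up could look, assuming the keys always appear in a fixed order: split on spaces first so each element holds exactly one key:value pair, then split each pair on the colon.

    from pyspark.sql import functions as F

    # Each element of "pairs" is one "key:value" chunk, e.g. "a:100"
    df = df.withColumn("pairs", F.split(F.col("pagename"), " "))

    for i, name in enumerate(["a", "b", "c"]):  # assumed fixed key order
        df = df.withColumn(
            name,
            F.split(F.col("pairs").getItem(i), ":").getItem(1).cast("int")
        )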

Regex

As the format is pretty fixed, you can do the same thing with a regex.

    import re

    from pyspark.sql import functions as F

    # One capture group per expected key; group i + 1 holds the value for match_groups[i]
    regex_pattern = r"a:(\d+) b:(\d+) c:(\d+)"
    match_groups = ["a", "b", "c"]

    for i in range(re.compile(regex_pattern).groups):
        df = df.withColumn(
            match_groups[i],
            F.regexp_extract(F.col("pagename"), regex_pattern, i + 1),
        )

CAVEAT: Check that Regex before you try and run anything (as I don't have an editor handy)
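
Taking that caveat up, a quick sanity check of the pattern against one sample value:

    import re

    sample = "a:100 b:500 c:200"
    print(re.match(regex_pattern, sample).groups())  # -> ('100', '500', '200')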

1 Comment

You've tagged pyspark, so I've replied in Python, but the gist would be the same in Scala.
