
I have a data frame with a string column and I want to create multiple columns out of it.

Here is my input data; pagename is my string column:

+---+-----------------+
|id |pagename         |
+---+-----------------+
|1  |a:100 b:500 c:200|
|2  |a:101 b:501 c:201|
+---+-----------------+

I want to create multiple columns from it. The format of the string is always the same: col1:value1 col2:value2 col3:value3 ... colN:valueN. In the output I need multiple columns, col1 through colN, with the values as rows for each column. Here is the expected output:

+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
|  1|100|500|200|
|  2|101|501|201|
+---+---+---+---+

How can I do this in Spark? Scala or Python are both fine for me. The code below creates the input DataFrame:

scala> val df = spark.sql(s"""select 1 as id, "a:100 b:500 c:200" as pagename union select 2 as id, "a:101 b:501 c:201" as pagename """)
df: org.apache.spark.sql.DataFrame = [id: int, pagename: string]

scala> df.show(false)
+---+-----------------+
|id |pagename         |
+---+-----------------+
|2  |a:101 b:501 c:201|
|1  |a:100 b:500 c:200|
+---+-----------------+

scala> df.printSchema
root
 |-- id: integer (nullable = false)
 |-- pagename: string (nullable = false)

Note: the example shows only 3 columns here, but in general I expect to deal with more than 100 columns.

2 Answers


You can use str_to_map, explode the resulting map and pivot:

import org.apache.spark.sql.functions.{col, expr, first}

// str_to_map parses the space-separated "key:value" pairs into a map; explode
// turns the map into (key, value) rows, and pivot spreads the keys into columns.
val df2 = df.select(
    col("id"),
    expr("explode(str_to_map(pagename, ' ', ':'))")
).groupBy("id").pivot("key").agg(first("value"))

df2.show
+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
|  1|100|500|200|
|  2|101|501|201|
+---+---+---+---+
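
Since the question allows Python as well, here is a PySpark equivalent as a sketch. Passing the key list to pivot is optional, but with the 100-plus keys mentioned in the question it saves Spark an extra pass to discover the distinct key values:

from pyspark.sql import functions as F

keys = ["a", "b", "c"]  # in practice, the full list of expected keys
df2 = (
    df.select("id", F.explode(F.expr("str_to_map(pagename, ' ', ':')")))
      .groupBy("id")
      .pivot("key", keys)  # explicit values skip the distinct-keys scan
      .agg(F.first("value"))
)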

5 Comments

Can we convert the numbers in columns a, b, c to any data type we want on the fly? For instance, right now the output shows them as strings, but I might need those columns as int, float, etc.
You can cast as needed in agg, e.g. agg(first("value").cast("int")).
Is str_to_map() a built-in function?
It's available in the SQL API, but not as a Scala/Python function, hence the expr call.
What is the best way to apply this when multiple columns hold such delimited data? For instance, if both pagename and pagename1 contain such data, the output columns should be id, a, b, c, a1, b1, c1, where a1, b1, c1 come from pagename1. The number of rows stays the same, but the columns expand out. (See the sketch below.)
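
For that last comment, a minimal PySpark sketch; the pagename1 column and the suffix convention are assumptions taken from the comment itself. It tags the keys from each source column with a suffix before a single pivot, and also applies the cast("int") mentioned above:

from pyspark.sql import functions as F

pair_cols = ["pagename", "pagename1"]  # hypothetical: all delimited columns

exploded = None
for c in pair_cols:
    suffix = c.removeprefix("pagename")  # "" for pagename, "1" for pagename1
    part = df.select(
        "id",
        F.explode(F.expr(f"str_to_map({c}, ' ', ':')")).alias("key", "value")
    ).withColumn("key", F.concat(F.col("key"), F.lit(suffix)))
    exploded = part if exploded is None else exploded.unionByName(part)

# One pivot over the suffixed keys yields id, a, b, c, a1, b1, c1
result = exploded.groupBy("id").pivot("key").agg(F.first("value").cast("int"))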

So, two options immediately come to mind.

Delimiters

You've got some obvious delimiters that you can split on. For this, use the split function:

    from pyspark.sql import functions as F

    delimiter = ":"

    df = df.withColumn(
        "split_column",
        F.split(F.col("pagename"), delimiter)
    )

    # "split_column" is now an array, so we need to pull items out of the array;
    # element 1 is "100 b", which still carries the next key after the whitespace
    df = df.withColumn(
        "a",
        F.col("split_column").getItem(1)
    )

Not ideal, as you'll still need to do some string manipulation to remove the whitespace and then convert to int, but this is easily applied to multiple columns.
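
A sketch of how that follow-up could look, assuming the keys always appear in a fixed order: split on spaces first so each element holds exactly one key:value pair, then split each pair on the colon.

    from pyspark.sql import functions as F

    # Each element of "pairs" is one "key:value" chunk, e.g. "a:100"
    df = df.withColumn("pairs", F.split(F.col("pagename"), " "))

    for i, name in enumerate(["a", "b", "c"]):  # assumed fixed key order
        df = df.withColumn(
            name,
            F.split(F.col("pairs").getItem(i), ":").getItem(1).cast("int")
        )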

Regex

As the format is pretty fixed, you can do the same thing with a regex.

    import re

    from pyspark.sql import functions as F

    # One capture group per expected key; group i + 1 holds the value for match_groups[i]
    regex_pattern = r"a:(\d+) b:(\d+) c:(\d+)"
    match_groups = ["a", "b", "c"]

    for i in range(re.compile(regex_pattern).groups):
        df = df.withColumn(
            match_groups[i],
            F.regexp_extract(F.col("pagename"), regex_pattern, i + 1),
        )

CAVEAT: Check that Regex before you try and run anything (as I don't have an editor handy)
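
Taking that caveat up, a quick sanity check of the pattern against one sample value:

    import re

    sample = "a:100 b:500 c:200"
    print(re.match(regex_pattern, sample).groups())  # -> ('100', '500', '200')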

1 Comment

You've tagged pyspark, so I've replied in Python, but the gist would be the same in Scala.
