
I have a column of my dataframe data that contains date and value information in a long string. The column, which we will call x for this purpose, is formatted as follows:

x = "{date1:val1, date2:val2, date3:val3, ...}" 

I want to ultimately explode this data such that I create two new columns, one for date and one for val. In order to utilize the explode function, I understand that the column must be formatted as an array, not a string. So far, to handle this issue, I have removed the braces at the start and end of the string:

from pyspark.sql import functions as F

data = data.withColumn('x_1', F.regexp_replace('x', r'\{', ''))
data = data.withColumn('x_1', F.regexp_replace('x_1', r'\}', ''))

I then created a list variable:

data = data.withColumn('x_list', F.split('x_1', ', '))

I now have that x_list = [date1:val1, date2:val2, date3:val3, ...]

What I now need is to add quotes around each list element such that I ultimately get ['date1':'val1', 'date2':'val2', 'date3':'val3', ...]

I believe that it may be possible to iterate through the list and use regex to add quotes using the colon (:) as a split point, but I am struggling with how to do that. I believe that it would look something like:

for l in x_list:
   #some regex expression

Alternatively, I have considered creating a sublist of each list element, but I am not sure how I would then use those sublists to create two new columns.
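For illustration, the per-element logic described above can be sketched in plain Python before translating it to Spark (the helper name here is hypothetical, not part of the Spark job):

```python
# Hypothetical helper sketching the parsing logic: strip the braces,
# split on ", " to get "date:val" elements, then split each element
# on the first ":" so the two halves can become separate columns.
def split_pairs(x):
    items = x.strip("{}").split(", ")
    return [tuple(item.split(":", 1)) for item in items]

pairs = split_pairs("{date1:val1, date2:val2, date3:val3}")
# pairs == [("date1", "val1"), ("date2", "val2"), ("date3", "val3")]
```

Each tuple's first element would feed the date column and the second the val column.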

  • Use ast.literal_eval() to parse it as a dictionary. Then you can call pd.DataFrame() with the result. Commented Dec 13, 2024 at 20:13
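Note that ast.literal_eval() only accepts a valid Python literal, so the bare tokens would need quoting first. A hedged sketch of that preprocessing (the simple \w+ pattern assumes dates and values contain only word characters; real date formats with - or / would need a wider character class):

```python
import ast
import re

s = "{date1:val1, date2:val2, date3:val3}"
# Wrap each bare token in quotes so the string becomes a valid dict literal
quoted = re.sub(r"(\w+)", r"'\1'", s)
# quoted == "{'date1':'val1', 'date2':'val2', 'date3':'val3'}"
d = ast.literal_eval(quoted)
```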

2 Answers


This way you can avoid using a UDF :)

import pandas as pd
from pyspark.sql import functions as f

date_val_string = "{date1:val1, date2:val2, date3:val3}"
(
    spark.createDataFrame(pd.DataFrame({"col": [date_val_string]}))
    # strip the braces, then split on ", " to get "date:val" elements
    .withColumn("array1", f.expr("split(regexp_replace(col, '[{}]', ''), ', ')"))
    # split each element on ":" into a two-element array
    .withColumn("array2", f.expr("transform(array1, x -> split(x, ':'))"))
    .selectExpr("explode(array2) as date_val")
    .selectExpr("date_val[0] as date", "date_val[1] as val")
    .show(truncate=False)
)


+-----+----+
|date |val |
+-----+----+
|date1|val1|
|date2|val2|
|date3|val3|
+-----+----+

1 Comment

Thank you! This is exactly what I was looking for!

Your string is not valid JSON, nor a valid Python dict literal. You could go with this:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType, ArrayType

@F.udf(returnType=ArrayType(StringType()))
def parse_(s):
    # strip the braces, split on "," into "date:val" pairs,
    # and keep the value from each pair
    if s is None: return None
    return [item.split(":")[1] for item in s.strip("{}").split(",")]

df = spark.createDataFrame([[1, "{date1:val1, date2:val2, date3:val3}"]], schema=["col1", "col2"])
display(df.withColumn("val", F.explode(parse_("col2"))))

2 Comments

In this line of code: df = spark.createDataFrame([[1, "{date1:val1, date2:val2, date3:val3}"]], schema=["col1", "col2"]) , what is the part in quotes? Is that the name of the column that I want to explode?
"col1" and "col2" are column names in my test data.
