
A simplified version of my problem is this:

I have a Spark DataFrame ("my_df") with one column ("col1") and values 'a','b','c','d'

and a dictionary ("my_dict") like this: {'a': 5, 'b': 7, 'c': 2, 'd': 4}

I would like to combine these to create a DataFrame with an additional column containing the corresponding values from my_dict.

At the moment I am using the following method, which works for a small dataset, but it is very inefficient and causes a StackOverflowError on my full dataset:

import pyspark.sql.functions as F

# start with an arbitrary df containing "col1"
# initialise new column with zeros
my_df = my_df.withColumn('dict_data', F.lit(0))

for k,v in my_dict.items():
    my_df = my_df.withColumn('dict_data',
                             F.when((my_df['col1']==k),
                                     v).otherwise(my_df['dict_data'])
                             )

Is there a better way to do this? I've tried using Window functions, but I've had difficulty applying them in this context.

2 Answers


You just need to map your dictionary values into a new column based on the values of your first column. You can refer to:

pyspark create new column with mapping from a dict


You can do it with an intermediate dataframe and a join:

rows = [{'col1': key, 'dict_data': value} for key, value in my_dict.items()]
my_dict_df = spark.createDataFrame(rows)

result_df = my_df.join(my_dict_df, 'col1', 'left')
