Converting dataframe to dictionary in pyspark without using pandas

Question

Following up on this question and its dataframes, I am trying to convert a dataframe into a dictionary. In pandas I was using this:

dictionary = df_2.unstack().to_dict(orient='index')

However, I need to convert this code to PySpark. Can anyone help me with this? As I understand from previous questions such as this one, I would need to go through pandas, but the dataframe is far too big for that. How can I solve this?

EDIT:

I have now tried the following approach:

dictionary_list = map(lambda row: row.asDict(), df_2.collect())
dictionary  = {age['age']: age for age in dictionary_list}

(reference) but it is not yielding what it is supposed to.

In pandas, what I was obtaining was the following:

@mck the original code I had in pandas for the whole process was this: dictionary = (value/value.groupby(level=0).sum()).unstack().to_dict(orient='index'), referring to the dataframe in this question: stackoverflow.com/questions/65707148/…
– Johanna, Commented Jan 14, 2021 at 11:50

mck · Accepted Answer · 2021-01-14 12:55:29Z

Here df2 is the dataframe from the previous post. You can pivot first and then convert the result to a dictionary as described in your linked post.

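import pyspark.sql.functions as F

# One row per age, one column per siblings value; each cell holds the count (null if absent).
df3 = df2.groupBy('age').pivot('siblings').agg(F.first('count'))
# collect() brings every pivoted row to the driver as a Row; asDict() turns it into a dict.
list_persons = [row.asDict() for row in df3.collect()]
# Key each row dictionary by its age.
dict_persons = {person['age']: person for person in list_persons}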

{15: {'age': 15, '0': 1.0, '1': None, '3': None}, 10: {'age': 10, '0': None, '1': None, '3': 1.0}, 14: {'age': 14, '0': None, '1': 1.0, '3': None}}
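If only the nested mapping is needed, the same shape can also be built from the long-format rows in plain Python, without the pivot. The following is a minimal sketch, not part of the original answer: the helper name rows_to_nested_dict and the sample rows are hypothetical, and the column names age, siblings and count are assumed from the pivot code above.

```python
from collections import defaultdict


def rows_to_nested_dict(rows, outer="age", inner="siblings", value="count"):
    """Build {outer: {inner: value}} from long-format rows (one dict per row)."""
    nested = defaultdict(dict)
    for row in rows:
        # The inner key is stringified to match the pivoted column names ('0', '1', '3').
        nested[row[outer]][str(row[inner])] = row[value]
    return dict(nested)


# Hypothetical sample rows, shaped like the data behind the output above.
sample_rows = [
    {"age": 15, "siblings": 0, "count": 1.0},
    {"age": 10, "siblings": 3, "count": 1.0},
    {"age": 14, "siblings": 1, "count": 1.0},
]
result = rows_to_nested_dict(sample_rows)
print(result)  # {15: {'0': 1.0}, 10: {'3': 1.0}, 14: {'1': 1.0}}
```

With Spark, the rows could be streamed to the driver one partition at a time, for example rows_to_nested_dict(r.asDict() for r in df2.toLocalIterator()). Unlike the pivoted version, combinations that do not occur are simply absent rather than None, and the final dictionary still has to fit in driver memory.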

Or another way:

import json  # json.loads parses the JSON string without executing it, unlike eval
df4 = df3.fillna(float('nan')).groupBy().pivot('age').agg(F.first(F.struct(*df3.columns[1:])))
result_dict = json.loads(df4.select(F.to_json(F.struct(*df4.columns))).head()[0])

{'10': {'0': 'NaN', '1': 'NaN', '3': 1.0}, '14': {'0': 'NaN', '1': 1.0, '3': 'NaN'}, '15': {'0': 1.0, '1': 'NaN', '3': 'NaN'}}
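Note that the output above keeps missing cells as the string 'NaN' and uses string keys. The following is a minimal sketch, assuming the JSON text has the shape shown above, of parsing it with json.loads and turning those strings back into float nan. The helper name parse_spark_json and the sample JSON text are hypothetical.

```python
import json
import math


def parse_spark_json(json_text):
    """Parse a JSON string and replace the literal string 'NaN' with float nan."""
    def fix(value):
        if isinstance(value, dict):
            return {key: fix(inner) for key, inner in value.items()}
        return float("nan") if value == "NaN" else value

    return fix(json.loads(json_text))


# Hypothetical JSON text mirroring the shape of the output above.
json_text = '{"10": {"0": "NaN", "1": "NaN", "3": 1.0}, "15": {"0": 1.0, "1": "NaN", "3": "NaN"}}'
parsed = parse_spark_json(json_text)
print(parsed["10"]["3"])              # 1.0
print(math.isnan(parsed["10"]["0"]))  # True
```

The keys stay strings ('10', '15') because JSON object keys are always strings; converting them with int() is a separate step if integer ages are needed.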

edited Jan 14, 2021 at 12:55

answered Jan 14, 2021 at 12:01

3 Comments

Johanna Over a year ago

Unfortunately it's not working :( "TypeError: 'map' object is not callable"

Johanna Over a year ago

I am using the edited version; unfortunately the error is still there for me :(

mck Over a year ago

@Johanna I removed that annoying function, could you please try again?
