I cannot create the entire PySpark DataFrame that I need. My current dictionary is in this format:
d = {0:
        {'Key Features': ['Obese', 'Exercise'],
         'Properties': {'Balding': True, 'Tall': False, 'Obese': True, 'Exercise': False}},
     1:
        {'Key Features': [None],
         'Properties': {'Balding': True, 'Tall': False, 'Obese': False, 'Exercise': True}},
     ...}
I want to create a dataframe in this format:
+-------+-----+-----+--------+-----------------+
|Balding| Tall|Obese|Exercise|     Key Features|
+-------+-----+-----+--------+-----------------+
|   true|false| true|   false|[Obese, Exercise]|
|   true|false|false|    true|           [null]|
+-------+-----+-----+--------+-----------------+
I was able to create a DataFrame for the 'Properties' with this code:
df = spark.createDataFrame([d[i]['Properties'] for i in d])
df.show()
Which outputs this dataframe:
+-------+-----+-----+--------+
|Balding| Tall|Obese|Exercise|
+-------+-----+-----+--------+
|   true|false| true|   false|
|   true|false|false|    true|
+-------+-----+-----+--------+
I tried to add a column like this, and it failed:
df = df.withColumn('Key Features', array(lit([d[i]['Key Features'] for i in d])))
It simply fails and does not add the list as a column. I also tried to create a DataFrame like this, and it did not work either:
spark.createDataFrame([d[i]['Key Features'] for i in d]).show()
which raises:
Input row doesn't have expected number of values required by the schema. 4 fields are required while 1 values was provided.
How would I go about adding 'Key Features' as a column holding the list from the dictionary, either by including it in the createDataFrame call from the start or by using withColumn?