I am trying to convert the following Python dict into a PySpark DataFrame, but I am not getting the expected output.

dict_lst = {'letters': ['a', 'b', 'c'], 
             'numbers': [10, 20, 30]}
df_dict = sc.parallelize([dict_lst]).toDF()  # Result not as expected
df_dict.show()

Is there a way to do this without using Pandas?

6 Answers

Quoting myself:

I find it's useful to think of the argument to createDataFrame() as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.

So the easiest thing is to convert your dictionary into this format. You can easily do this using zip():

column_names, data = zip(*dict_lst.items())
spark.createDataFrame(zip(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#|      a|     10|
#|      b|     20|
#|      c|     30|
#+-------+-------+

The above assumes that all of the lists are the same length. If this is not the case, you would have to use itertools.izip_longest (Python 2) or itertools.zip_longest (Python 3).

from itertools import izip_longest as zip_longest # use this for python2
#from itertools import zip_longest # use this for python3

dict_lst = {'letters': ['a', 'b', 'c'], 
             'numbers': [10, 20, 30, 40]}

column_names, data = zip(*dict_lst.items())

spark.createDataFrame(zip_longest(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#|      a|     10|
#|      b|     20|
#|      c|     30|
#|   null|     40|
#+-------+-------+

Your dict_lst is not really the format you want to use to create a DataFrame. It would be better if you had a list of dicts instead of a dict of lists.

This code creates a DataFrame from your dict of lists:

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

dict_lst = {'letters': ['a', 'b', 'c'], 
             'numbers': [10, 20, 30]}

values_lst = dict_lst.values()
nb_rows = [len(lst) for lst in values_lst]
assert min(nb_rows) == max(nb_rows)  # All lists must contain the same number of elements

row_lst = []
columns = dict_lst.keys()

for i in range(nb_rows[0]):
    row_values = [lst[i] for lst in values_lst]
    row_dict = {column: value for column, value in zip(columns, row_values)}
    row = Row(**row_dict)
    row_lst.append(row)

df = sqlContext.createDataFrame(row_lst)
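
A more compact sketch of the same conversion (a dict of lists into Row objects) uses a comprehension; this assumes a SparkSession named spark and equal-length lists:

from pyspark.sql import Row

dict_lst = {'letters': ['a', 'b', 'c'],
            'numbers': [10, 20, 30]}

columns = list(dict_lst.keys())
# Transpose the value lists into per-row tuples and build one Row per tuple
rows = [Row(**dict(zip(columns, vals))) for vals in zip(*dict_lst.values())]
df = spark.createDataFrame(rows)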


Using pault's answer above, I imposed a specific schema on my DataFrame as follows:

import pyspark
from pyspark.sql import SparkSession, functions

spark = SparkSession.builder.appName('dictToDF').getOrCreate()

Get the data:

dict_lst = {'letters': ['a', 'b', 'c'],'numbers': [10, 20, 30]}
data = dict_lst.values()

Create the schema:

from pyspark.sql.types import *
myschema = StructType([
    StructField("letters", StringType(), True),
    StructField("numbers", IntegerType(), True)
])

Create the DataFrame from the dictionary, using the schema:

df = spark.createDataFrame(zip(*data), schema=myschema)
df.show()
+-------+-------+
|letters|numbers|
+-------+-------+
|      a|     10|
|      b|     20|
|      c|     30|
+-------+-------+

Show the DataFrame schema:

df.printSchema()

root
 |-- letters: string (nullable = true)
 |-- numbers: integer (nullable = true)
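
As a side note, Spark 2.3 and later also accept a DDL-formatted string in place of a StructType, which gives the same schema with less code. A sketch of that variant, assuming the same data and spark session as above:

# Equivalent sketch using a DDL schema string instead of a StructType
df = spark.createDataFrame(zip(*data), schema="letters string, numbers int")
df.printSchema()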


You can also use a Python list to quickly prototype a DataFrame. The idea is based on a Databricks tutorial.

df = spark.createDataFrame(
    [(1, "a"), 
     (1, "a"), 
     (1, "b")],
    ("id", "value"))
df.show()
+---+-----+
| id|value|
+---+-----+
|  1|    a|
|  1|    a|
|  1|    b|
+---+-----+
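
Applied to the data from the question, the same prototyping pattern would look roughly like this (a sketch, assuming an existing SparkSession named spark):

df = spark.createDataFrame(
    [("a", 10),
     ("b", 20),
     ("c", 30)],
    ("letters", "numbers"))
df.show()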


Try this out:

dict_lst = [{'letters': 'a', 'numbers': 10}, 
            {'letters': 'b', 'numbers': 20}, 
            {'letters': 'c', 'numbers': 30}]
df_dict = sc.parallelize(dict_lst).toDF()  # Result as expected

Output:

>>> df_dict.show()
+-------+-------+
|letters|numbers|
+-------+-------+
|      a|     10|
|      b|     20|
|      c|     30|
+-------+-------+

1 Comment

This is not really scalable if his dict_lst doesn't come in this format.
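
If the data does start out as the question's dict of lists, a small pre-processing step can produce this list-of-dicts format first. A sketch, assuming an active SparkContext named sc:

dict_lst = {'letters': ['a', 'b', 'c'],
            'numbers': [10, 20, 30]}

keys = list(dict_lst.keys())
# Transpose the value lists and rebuild one dict per row
records = [dict(zip(keys, vals)) for vals in zip(*dict_lst.values())]
df_dict = sc.parallelize(records).toDF()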

The most efficient approach is to use Pandas

import pandas as pd

spark.createDataFrame(pd.DataFrame(dict_lst))

1 Comment

It says 'without using Pandas' in the question.
