I am trying to convert the following Python dict into a PySpark DataFrame, but I am not getting the expected output.

dict_lst = {'letters': ['a', 'b', 'c'], 
             'numbers': [10, 20, 30]}
df_dict = sc.parallelize([dict_lst]).toDF()  # Result not as expected
df_dict.show()

Is there a way to do this without using Pandas?

6 Answers

Quoting myself:

I find it's useful to think of the argument to createDataFrame() as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.

So the easiest thing is to convert your dictionary into this format. You can easily do this using zip():

column_names, data = zip(*dict_lst.items())
spark.createDataFrame(zip(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#|      a|     10|
#|      b|     20|
#|      c|     30|
#+-------+-------+

The above assumes that all of the lists are the same length. If this is not the case, you would have to use itertools.izip_longest (Python 2) or itertools.zip_longest (Python 3).

from itertools import izip_longest as zip_longest # use this for python2
#from itertools import zip_longest # use this for python3

dict_lst = {'letters': ['a', 'b', 'c'], 
             'numbers': [10, 20, 30, 40]}

column_names, data = zip(*dict_lst.items())

spark.createDataFrame(zip_longest(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#|      a|     10|
#|      b|     20|
#|      c|     30|
#|   null|     40|
#+-------+-------+

Your dict_lst is not really the format you want to use to create a DataFrame. It would be better if you had a list of dicts instead of a dict of lists.

This code creates a DataFrame from your dict of lists:

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

dict_lst = {'letters': ['a', 'b', 'c'], 
             'numbers': [10, 20, 30]}

values_lst = dict_lst.values()
nb_rows = [len(lst) for lst in values_lst]
assert min(nb_rows) == max(nb_rows)  # All lists must contain the same number of elements

row_lst = []
columns = dict_lst.keys()

for i in range(nb_rows[0]):
    row_values = [lst[i] for lst in values_lst]
    row_dict = {column: value for column, value in zip(columns, row_values)}
    row = Row(**row_dict)
    row_lst.append(row)

df = sqlContext.createDataFrame(row_lst)
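
A more compact sketch of the same conversion (a dict of lists into Row objects) uses a comprehension; this assumes a SparkSession named spark and equal-length lists:

from pyspark.sql import Row

dict_lst = {'letters': ['a', 'b', 'c'],
            'numbers': [10, 20, 30]}

columns = list(dict_lst.keys())
# Transpose the value lists into per-row tuples and build one Row per tuple
rows = [Row(**dict(zip(columns, vals))) for vals in zip(*dict_lst.values())]
df = spark.createDataFrame(rows)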


Using pault's answer above, I imposed a specific schema on my DataFrame as follows:

import pyspark
from pyspark.sql import SparkSession, functions

spark = SparkSession.builder.appName('dictToDF').getOrCreate()

Get the data:

dict_lst = {'letters': ['a', 'b', 'c'],'numbers': [10, 20, 30]}
data = dict_lst.values()

Create the schema:

from pyspark.sql.types import *
myschema = StructType([
    StructField("letters", StringType(), True),
    StructField("numbers", IntegerType(), True)
])

Create the DataFrame from the dictionary, using the schema:

df = spark.createDataFrame(zip(*data), schema=myschema)
df.show()
+-------+-------+
|letters|numbers|
+-------+-------+
|      a|     10|
|      b|     20|
|      c|     30|
+-------+-------+

Show the DataFrame schema:

df.printSchema()

root
 |-- letters: string (nullable = true)
 |-- numbers: integer (nullable = true)
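
As a side note, Spark 2.3 and later also accept a DDL-formatted string in place of a StructType, which gives the same schema with less code. A sketch of that variant, assuming the same data and spark session as above:

# Equivalent sketch using a DDL schema string instead of a StructType
df = spark.createDataFrame(zip(*data), schema="letters string, numbers int")
df.printSchema()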


You can also use a Python list to quickly prototype a DataFrame. The idea is based on a Databricks tutorial.

df = spark.createDataFrame(
    [(1, "a"), 
     (1, "a"), 
     (1, "b")],
    ("id", "value"))
df.show()
+---+-----+
| id|value|
+---+-----+
|  1|    a|
|  1|    a|
|  1|    b|
+---+-----+
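
Applied to the data from the question, the same prototyping pattern would look roughly like this (a sketch, assuming an existing SparkSession named spark):

df = spark.createDataFrame(
    [("a", 10),
     ("b", 20),
     ("c", 30)],
    ("letters", "numbers"))
df.show()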


Try this out:

dict_lst = [{'letters': 'a', 'numbers': 10}, 
            {'letters': 'b', 'numbers': 20}, 
            {'letters': 'c', 'numbers': 30}]
df_dict = sc.parallelize(dict_lst).toDF()  # Result as expected

Output:

>>> df_dict.show()
+-------+-------+
|letters|numbers|
+-------+-------+
|      a|     10|
|      b|     20|
|      c|     30|
+-------+-------+

1 Comment

This is not really scalable if his dict_lst doesn't come in this format.
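
If the data does start out as the question's dict of lists, a small pre-processing step can produce this list-of-dicts format first. A sketch, assuming an active SparkContext named sc:

dict_lst = {'letters': ['a', 'b', 'c'],
            'numbers': [10, 20, 30]}

keys = list(dict_lst.keys())
# Transpose the value lists and rebuild one dict per row
records = [dict(zip(keys, vals)) for vals in zip(*dict_lst.values())]
df_dict = sc.parallelize(records).toDF()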

The most efficient approach is to use Pandas

import pandas as pd

spark.createDataFrame(pd.DataFrame(dict_lst))

1 Comment

It says 'without using Pandas' in the question.
