23

Consider that I have a list of Python dictionaries, where each key corresponds to a column name of a table. For the list below, how do I convert it into a PySpark DataFrame with two columns, arg1 and arg2?

 [{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]

How can I use the following construct to do it?

df = sc.parallelize([
    ...
]).toDF()

Where do arg1 and arg2 go in the above code (the ...)?

2
  • You should edit your question: instead of "...", please show us where "arg1" and "arg2" should go. Commented Jun 2, 2016 at 6:26
  • @betterworld OK, done. How do I do it? Commented Jun 2, 2016 at 6:28

5 Answers

31

Old way:

sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]).toDF()

New way:

from pyspark.sql import Row
from collections import OrderedDict

def convert_to_row(d: dict) -> Row:
    # Sort the keys so every Row has its fields in the same order.
    return Row(**OrderedDict(sorted(d.items())))

sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]) \
    .map(convert_to_row) \
    .toDF()
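
For reference, here is a minimal end-to-end sketch of the "new way"; the SparkSession setup and the data variable name are assumptions added for illustration, not part of the original answer:

from collections import OrderedDict
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def convert_to_row(d: dict) -> Row:
    # Sort the keys so every Row has the same field order.
    return Row(**OrderedDict(sorted(d.items())))

data = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]
df = sc.parallelize(data).map(convert_to_row).toDF()
df.show()
# +----+----+
# |arg1|arg2|
# +----+----+
# |    |    |
# |    |    |
# |    |    |
# +----+----+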

10 Comments

Thanks, can you please answer the related question: stackoverflow.com/questions/37584185/…
Isn't this scala? def convert_to_row(d: dict) -> Row:
@rado That is a Python 3 function annotation.
@Andre85 I think it's because the order of keys in each dictionary may differ, which is why we need the sort.
What happens if a key is missing, do we get null values or an error?
22

For anyone looking for the solution to something slightly different, this worked for me: I have a single dictionary of key-value pairs, and I wanted to convert it into a PySpark DataFrame with two columns:

So

{k1:v1, k2:v2 ...}

Becomes

 ---------------- 
| col1   |  col2 |
|----------------|
| k1     |  v1   |
| k2     |  v2   |
 ----------------

# Turn the dict into a list of [key, value] pairs, one per row.
lol = list(map(list, mydict.items()))
df = spark.createDataFrame(lol, ["col1", "col2"])
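
As a quick sanity check, this is what the snippet produces; the sample dict below is invented for illustration, and spark is assumed to be an existing SparkSession:

mydict = {"k1": "v1", "k2": "v2"}
lol = list(map(list, mydict.items()))
df = spark.createDataFrame(lol, ["col1", "col2"])
df.show()
# +----+----+
# |col1|col2|
# +----+----+
# |  k1|  v1|
# |  k2|  v2|
# +----+----+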

1 Comment

Even simpler: df = spark.createDataFrame(mydict.items(), ["col1", "col2"])
5

The other answers work, but here's one more one-liner that works well with nested data. It may not be the most efficient, but if you're making a DataFrame from an in-memory dictionary, you're either working with small data sets like test data or using Spark wrong, so efficiency should really not be a concern:

import json

d = {...}  # any JSON-compatible dict
spark.read.json(sc.parallelize([json.dumps(d)]))
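
A concrete, hedged illustration with nested data; the dict below is made up purely for the example, and spark and sc are assumed to exist already:

import json

d = {"arg1": "x", "arg2": {"nested": [1, 2, 3]}}
df = spark.read.json(sc.parallelize([json.dumps(d)]))
df.printSchema()
# root
#  |-- arg1: string (nullable = true)
#  |-- arg2: struct (nullable = true)
#  |    |-- nested: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)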


3

I had to modify the accepted answer in order for it to work for me in Python 2.7 running Spark 2.0.

from collections import OrderedDict
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType

spark = (SparkSession
        .builder
        .getOrCreate()
    )

schema = StructType([
    StructField('arg1', StringType(), True),
    StructField('arg2', StringType(), True)
])

dta = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]

dtaRDD = spark.sparkContext.parallelize(dta) \
    .map(lambda x: Row(**OrderedDict(sorted(x.items()))))

dtaDF = spark.createDataFrame(dtaRDD, schema)
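
For completeness, a quick check of the result; the output below is what I'd expect for the two sample rows and is shown only as an illustration:

dtaDF.show()
# +----+----+
# |arg1|arg2|
# +----+----+
# |    |    |
# |    |    |
# +----+----+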


0

Assuming your data is a struct and not a string dictionary, you can just do

newdf = df.select(df.arg1, df.arg2)
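
If the values really do sit inside a struct column, a hedged sketch of that selection could look like this; the column name data is an assumption for the example, not something from the question:

from pyspark.sql import functions as F

# Assume df has a struct column named `data` with fields arg1 and arg2
# (the name `data` is illustrative only).
newdf = df.select(F.col("data.arg1").alias("arg1"),
                  F.col("data.arg2").alias("arg2"))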

