23

Consider that I have a list of Python dictionaries, where each key corresponds to a column name of a table. For the list below, how do I convert it into a PySpark DataFrame with two columns, arg1 and arg2?

 [{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]

How can I use the following construct to do it?

df = sc.parallelize([
    ...
]).toDF()

Where do arg1 and arg2 go in the above code (the ...)?

2
  • You should edit your question: instead of "...", please show us where "arg1" and "arg2" should go. Commented Jun 2, 2016 at 6:26
  • @betterworld OK, done. How do I do it? Commented Jun 2, 2016 at 6:28

5 Answers

31

Old way:

sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]).toDF()

New way:

from pyspark.sql import Row
from collections import OrderedDict

def convert_to_row(d: dict) -> Row:
    # Sort the keys so every Row has its fields in the same order.
    return Row(**OrderedDict(sorted(d.items())))

sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]) \
    .map(convert_to_row) \
    .toDF()
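
For reference, here is a minimal end-to-end sketch of the "new way"; the SparkSession setup and the data variable name are assumptions added for illustration, not part of the original answer:

from collections import OrderedDict
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def convert_to_row(d: dict) -> Row:
    # Sort the keys so every Row has the same field order.
    return Row(**OrderedDict(sorted(d.items())))

data = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]
df = sc.parallelize(data).map(convert_to_row).toDF()
df.show()
# +----+----+
# |arg1|arg2|
# +----+----+
# |    |    |
# |    |    |
# |    |    |
# +----+----+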

10 Comments

Thanks, can you please answer the related question: stackoverflow.com/questions/37584185/…
Isn't this scala? def convert_to_row(d: dict) -> Row:
@rado That is a Python 3 function annotation.
@Andre85 I think it's because the order of keys in each dictionary may differ, which is why we need the sort.
What happens if a key is missing, do we get null values or an error?
22

For anyone looking for the solution to something slightly different, this worked for me: I have a single dictionary of key-value pairs, and I wanted to convert it into a PySpark DataFrame with two columns:

So

{k1:v1, k2:v2 ...}

Becomes

 ---------------- 
| col1   |  col2 |
|----------------|
| k1     |  v1   |
| k2     |  v2   |
 ----------------

# Turn the dict into a list of [key, value] pairs, one per row.
lol = list(map(list, mydict.items()))
df = spark.createDataFrame(lol, ["col1", "col2"])
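
As a quick sanity check, this is what the snippet produces; the sample dict below is invented for illustration, and spark is assumed to be an existing SparkSession:

mydict = {"k1": "v1", "k2": "v2"}
lol = list(map(list, mydict.items()))
df = spark.createDataFrame(lol, ["col1", "col2"])
df.show()
# +----+----+
# |col1|col2|
# +----+----+
# |  k1|  v1|
# |  k2|  v2|
# +----+----+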

1 Comment

Even simpler: df = spark.createDataFrame(mydict.items(), ["col1", "col2"])
5

The other answers work, but here's one more one-liner that works well with nested data. It may not be the most efficient, but if you're making a DataFrame from an in-memory dictionary, you're either working with small data sets like test data or using Spark wrong, so efficiency should really not be a concern:

import json

d = {...}  # any JSON-compatible dict
spark.read.json(sc.parallelize([json.dumps(d)]))
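
A concrete, hedged illustration with nested data; the dict below is made up purely for the example, and spark and sc are assumed to exist already:

import json

d = {"arg1": "x", "arg2": {"nested": [1, 2, 3]}}
df = spark.read.json(sc.parallelize([json.dumps(d)]))
df.printSchema()
# root
#  |-- arg1: string (nullable = true)
#  |-- arg2: struct (nullable = true)
#  |    |-- nested: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)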


3

I had to modify the accepted answer in order for it to work for me in Python 2.7 running Spark 2.0.

from collections import OrderedDict
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType

spark = (SparkSession
        .builder
        .getOrCreate()
    )

schema = StructType([
    StructField('arg1', StringType(), True),
    StructField('arg2', StringType(), True)
])

dta = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]

dtaRDD = spark.sparkContext.parallelize(dta) \
    .map(lambda x: Row(**OrderedDict(sorted(x.items()))))

dtaDF = spark.createDataFrame(dtaRDD, schema)
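
For completeness, a quick check of the result; the output below is what I'd expect for the two sample rows and is shown only as an illustration:

dtaDF.show()
# +----+----+
# |arg1|arg2|
# +----+----+
# |    |    |
# |    |    |
# +----+----+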


0

Assuming your data is a struct and not a string dictionary, you can just do

newdf = df.select(df.arg1, df.arg2)
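
If the values really do sit inside a struct column, a hedged sketch of that selection could look like this; the column name data is an assumption for the example, not something from the question:

from pyspark.sql import functions as F

# Assume df has a struct column named `data` with fields arg1 and arg2
# (the name `data` is illustrative only).
newdf = df.select(F.col("data.arg1").alias("arg1"),
                  F.col("data.arg2").alias("arg2"))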

