15

I have an RDD (we can call it myrdd) where each record in the RDD is of the form:

[('column 1',value), ('column 2',value), ('column 3',value), ... , ('column 100',value)]

I would like to convert this into a DataFrame in pyspark - what is the easiest way to do this?

1
  • It's not exactly clear from your question where you're having trouble. Is it the fact that you have so many columns? Or just that records of your RDD are lists of tuples? Commented Apr 7, 2015 at 23:25

4 Answers

32

How about using the toDF method? You only need to add the field names.

df = rdd.toDF(['column', 'value'])
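
For the record layout in the question (each record a list of ('column N', value) tuples), a minimal sketch might first strip each record down to its values and then let toDF attach the names. This assumes a SparkSession/SQLContext already exists (toDF needs one); my_rdd and the generated name list are assumptions:

# Assumption: my_rdd holds records like
# [('column 1', v1), ('column 2', v2), ..., ('column 100', v100)]
column_names = ['column {}'.format(i) for i in range(1, 101)]

# Keep just the values, in order, then attach the field names with toDF
df = my_rdd.map(lambda record: [value for _, value in record]).toDF(column_names)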

2 Comments

This answer works, and the solution I posted below (based on it) converts an RDD as described above into a DataFrame.
What if you don't know the column names, or want to use the columns of some other DataFrame? Related question from me: stackoverflow.com/questions/70882076/…
15

The answer by @dapangmao got me to this solution:

from pyspark.sql import Row

my_df = my_rdd.map(lambda l: Row(**dict(l))).toDF()
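
A self-contained sketch of the same idea, with a toy two-column RDD standing in for the 100-column one. sc is assumed to be an existing SparkContext with a SQLContext/SparkSession set up, and the simplified column names are illustrative:

from pyspark.sql import Row

# Toy records in the same shape as the question's (names simplified)
sample = [
    [('column1', 1), ('column2', 'a')],
    [('column1', 2), ('column2', 'b')],
]
my_rdd = sc.parallelize(sample)

# dict(l) turns each list of (name, value) tuples into {name: value};
# Row(**...) makes a Row with named fields that toDF can infer a schema from
my_df = my_rdd.map(lambda l: Row(**dict(l))).toDF()
my_df.show()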


4

Take a look at the DataFrame documentation to adapt this example to your case, but the following should work. I'm assuming your RDD is called my_rdd.

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

# You have a ton of columns and each one should be an argument to Row
# Use a dictionary comprehension to make this easier
def record_to_row(record):
    # record is a list of ('column N', value) tuples, indexed 0..99;
    # keep the value from each tuple and key it by a generated column name
    schema = {'column{i:d}'.format(i = col_idx + 1): record[col_idx][1] for col_idx in range(100)}
    return Row(**schema)


row_rdd = my_rdd.map(record_to_row)

# Now infer the schema and you have a DataFrame
schema_my_rdd = sqlContext.inferSchema(row_rdd)

# Now you have a DataFrame you can register as a table
schema_my_rdd.registerTempTable("my_table")

I haven't worked much with DataFrames in Spark but this should do the trick
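
If you are on a newer Spark release where SQLContext.inferSchema is no longer available, the same step can be done with createDataFrame; a sketch under that assumption:

# On Spark 1.3+ inferSchema is deprecated (and later removed);
# createDataFrame does the same schema inference from an RDD of Rows
schema_my_rdd = sqlContext.createDataFrame(row_rdd)

# registerTempTable is likewise deprecated there; createOrReplaceTempView is the replacement
schema_my_rdd.createOrReplaceTempView("my_table")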

3 Comments

You might need to add a line after the sqlContext is created to load the implicits library: "import sqlContext.implicits._". See spark.apache.org/docs/1.3.0/sql-programming-guide.html
Isn't that a Scala-only thing? My answer is written in Python.
I got: AttributeError: 'SQLContext' object has no attribute 'inferSchema'
1

In PySpark, let's say you have a DataFrame named userDF.

>>> type(userDF)
<class 'pyspark.sql.dataframe.DataFrame'>

Let's convert it to an RDD:

userRDD = userDF.rdd
>>> type(userRDD)
<class 'pyspark.rdd.RDD'>

Now you can do some manipulation and call, for example, the map function:

newRDD = userRDD.map(lambda x:{"food":x['favorite_food'], "name":x['name']})

Finally, let's create a DataFrame from the resilient distributed dataset (RDD).

newDF = sqlContext.createDataFrame(newRDD, ["food", "name"])

>>> type(newDF)
<class 'pyspark.sql.dataframe.DataFrame'>

That's all.

I was hitting this warning message before, when I tried to call:

newDF = sc.parallelize(newRDD, ["food", "name"])

.../spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py:336: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. ")

So there's no need to do that anymore...
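
As the warning itself suggests, mapping to pyspark.sql.Row instead of a plain dict avoids the deprecated code path; a minimal sketch reusing the names from the answer above:

from pyspark.sql import Row

# Build Rows instead of dicts so createDataFrame does not fall back to
# the deprecated dict-based schema inference
newRDD = userRDD.map(lambda x: Row(food=x['favorite_food'], name=x['name']))
newDF = sqlContext.createDataFrame(newRDD)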

1 Comment

What should you do if each row has plenty of columns, and each row potentially differs in definition?
