
Problem:

a) Read a local file into a pandas DataFrame, say PD_DF. b) Manipulate/massage PD_DF and add columns to the DataFrame. c) Write PD_DF to HDFS using Spark. How do I do it?

  • Why don't you read the local file into a Spark dataframe directly? Commented Apr 14, 2015 at 19:28
  • As I said, I want to use a pandas DF to manipulate the data before writing it to HDFS using Spark. I'm not sure the Spark DataFrame supports all the features supported by the pandas DataFrame. Commented Apr 15, 2015 at 18:49

3 Answers


You can use the SQLContext object's createDataFrame method, whose data argument can be a pandas DataFrame object.


4 Comments

I am aware of that option. But I'm trying to see if there is a direct way to convert a DF to an RDD without creating a schemaRDD.
schemaRDD has been replaced by DataFrames in Spark 1.3. Call df.rdd.map(lambda x: [e for e in x]) if you don't want your RDD's elements to be Row instances. Although I don't really see why you'd want that. What format do you want to save to?
The plan is to read a CSV file from NFS and, after manipulation using a pandas DF, swap it to a Spark RDD and write it as an Avro/Parquet file in HDFS. Also, do Spark DFs support all the features currently supported by pandas DFs?
From the documentation at spark.apache.org/docs/latest/api/python/… When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of Row, or namedtuple, or dict.

Let's say the dataframe is of type pandas.core.frame.DataFrame; then in Spark 2.1 (PySpark) I did this:

rdd_data = spark.createDataFrame(dataframe)\
                .rdd

In case you want to rename any columns or select only a few columns, do that before the call to .rdd.

Hope it works for you too.

Comments


I use Spark 1.6.0. First transform the pandas DataFrame into a Spark DataFrame, then the Spark DataFrame into a Spark RDD:

sparkDF = sqlContext.createDataFrame(pandasDF)
sparkRDD = sparkDF.rdd.map(list)
type(sparkRDD)
# Output: pyspark.rdd.PipelinedRDD

Comments
