
Problem:

a) Read a local file into a pandas DataFrame, say PD_DF. b) Manipulate/massage PD_DF and add columns to the DataFrame. c) Write PD_DF to HDFS using Spark. How do I do it?

  • Why don't you read the local file into a Spark dataframe directly? Commented Apr 14, 2015 at 19:28
  • As I said, I want to use a pandas DF to manipulate the data before writing it to HDFS using Spark. I'm not sure the Spark DataFrame supports all the features supported by the pandas DataFrame. Commented Apr 15, 2015 at 18:49

3 Answers


You can use the SQLContext object's createDataFrame method, whose data argument can be a pandas DataFrame object.


4 Comments

I am aware of that option. But I'm trying to see if there is a direct way to convert a DF to an RDD without creating a schemaRDD.
schemaRDD has been replaced by DataFrames in Spark 1.3. Call df.rdd.map(lambda x: [e for e in x]) if you don't want your RDD's elements to be Row instances. Although I don't really see why you'd want that. What format do you want to save to?
The plan is to read a CSV file from NFS and, after manipulation using a pandas DF, swap it to a Spark RDD and write it as an Avro/Parquet file in HDFS. Also, do Spark DFs support all the features currently supported by pandas DFs?
From the documentation at spark.apache.org/docs/latest/api/python/… When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of Row, or namedtuple, or dict.

Let's say the dataframe is of type pandas.core.frame.DataFrame; then in Spark 2.1 (PySpark) I did this:

rdd_data = spark.createDataFrame(dataframe)\
                .rdd

In case you want to rename any columns or select only a few columns, do that before the call to .rdd.

Hope it works for you too.

Comments


I use Spark 1.6.0. First transform the pandas DataFrame into a Spark DataFrame, then the Spark DataFrame into a Spark RDD:

sparkDF = sqlContext.createDataFrame(pandasDF)
sparkRDD = sparkDF.rdd.map(list)
type(sparkRDD)
# Output: pyspark.rdd.PipelinedRDD

Comments
