I have a very big PySpark DataFrame, so I want to perform preprocessing on subsets of it and store each subset to HDFS. Later I want to read all of them back and merge them together. Thanks.
1 Answer
Writing a DataFrame to HDFS (Spark 1.6):
df.write.save('/target/path/', format='parquet', mode='append') ## df is an existing DataFrame object.
Some of the format options are csv, parquet, json, etc.
Reading a DataFrame from HDFS (Spark 1.6):
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.read.format('parquet').load('/path/to/file')
The format method takes arguments such as parquet, csv, json, etc.
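Putting the two together for the use case in the question, here is a minimal sketch; the target directory and the preprocessed_subsets iterable are assumptions, not part of the original answer. Each subset is appended to the same HDFS directory, and reading that directory later loads every part file, which effectively merges the subsets.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

target = '/tmp/preprocessed_subsets'  # hypothetical HDFS directory

# write each preprocessed subset into the same directory;
# mode='append' keeps the subsets written earlier instead of overwriting them
for subset_df in preprocessed_subsets:  # assumed iterable of DataFrames
    subset_df.write.save(target, format='parquet', mode='append')

# later (possibly in a new Spark application): loading the directory reads
# every file under it, i.e. all subsets merged into one DataFrame
merged_df = sqlContext.read.format('parquet').load(target)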
9 Comments
Ajg
Hey, I get AttributeError: 'DataFrameWriter' object has no attribute 'csv'. Also, I need to read that DataFrame later, I think in a new Spark session.
rogue-one
What is the version of your Spark installation?
Ajg
Spark version 1.6.1
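A side note on the AttributeError above: in Spark 1.6 DataFrameWriter has no csv method (it was added in Spark 2.0), so CSV output goes through the external spark-csv package. A minimal sketch, assuming that package is available on the cluster (the package version and output path below are assumptions):

# submit with: --packages com.databricks:spark-csv_2.10:1.5.0  (assumed version)
df.write.format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .save('/target/path_csv/')  # hypothetical output path

# parquet needs no extra package, so sticking with it avoids the issue
df.write.save('/target/path/', format='parquet', mode='append')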
Ajg
Thanks a lot. I have one doubt: while reading, what if there are multiple files in that location? How do I specify which file I want to read? Thanks.
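On the multiple-files doubt, a short sketch (the directory and file names below are hypothetical): load() on a directory reads every file under it, while a single file path or a glob pattern restricts the read to part of the data.

# read everything that was appended to the directory
all_df = sqlContext.read.format('parquet').load('/target/path/')

# read one specific part file, or a subset matching a glob pattern
one_df = sqlContext.read.format('parquet').load('/target/path/part-r-00000.gz.parquet')
some_df = sqlContext.read.format('parquet').load('/target/path/part-r-0000[0-4]*')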
rogue-one
To delete data from HDFS you can use HDFS shell commands like hdfs dfs -rm -r <path>. You can execute this from Python with the subprocess module, e.g. subprocess.call(["hdfs", "dfs", "-rm", "-r", path]).
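A minimal sketch of that cleanup step, assuming the hdfs binary is on the PATH and that path points at a directory written earlier (the path here is hypothetical):

import subprocess

path = '/target/path/'  # hypothetical HDFS directory to remove

# recursively delete the directory; call() returns the exit code (0 on success)
ret = subprocess.call(['hdfs', 'dfs', '-rm', '-r', path])
if ret != 0:
    print('failed to remove %s' % path)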