
I have a very big PySpark DataFrame. I want to preprocess subsets of it and store them to HDFS, then later read them all back and merge them together. Thanks.

1 Answer

  • Writing a DataFrame to HDFS (Spark 1.6):

    df.write.save('/target/path/', format='parquet', mode='append')  # df is an existing DataFrame object

Some of the format options are csv, parquet, json, etc. (A combined sketch of the full write-then-read workflow follows after the read example below.)

  • Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    df = sqlContext.read.format('parquet').load('/path/to/file')

The format method takes arguments such as parquet, csv, json, etc.
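Putting the two together for the question's workflow, here is a rough sketch. The subsets list and the preprocess function are placeholders for your own splitting and cleaning logic; since every write appends to the same directory, a single load later returns the merged result:

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)

    # Hypothetical split of the big DataFrame -- replace with your own logic.
    subsets = [big_df.filter(big_df.year == y) for y in [2014, 2015, 2016]]

    # Preprocess each subset and append it to the same HDFS directory.
    for subset in subsets:
        cleaned = preprocess(subset)   # placeholder preprocessing function
        cleaned.write.save('/target/path/', format='parquet', mode='append')

    # Later (even in a new Spark session): one load picks up all appended part files.
    merged = sqlContext.read.format('parquet').load('/target/path/')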


Comments

Hey, I get AttributeError: 'DataFrameWriter' object has no attribute 'csv'. Also, I need to read that DataFrame later, in what I think will be a new Spark session.
What is the version of your Spark installation?
Spark version 1.6.1
Thanks a lot. One doubt: while reading, what if there are multiple files in that location? How do I specify which file I want to read? Thanks.
To delete the data from HDFS you can use HDFS shell commands like hdfs dfs -rm -rf <path>. You can execute this from Python using subprocess, e.g. subprocess.call(["hdfs", "dfs", "-rm", "-rf", path]).
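Building on that last comment, a minimal sketch of calling the HDFS shell from Python; the path is a placeholder, and the recursive and force flags are passed separately here as -r and -f:

    import subprocess

    # Recursively and forcefully remove an HDFS directory; call() returns 0 on success.
    path = '/target/path/'   # placeholder path
    ret = subprocess.call(["hdfs", "dfs", "-rm", "-r", "-f", path])
    if ret != 0:
        print("failed to delete %s" % path)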