I have a very big PySpark DataFrame, so I want to perform preprocessing on subsets of it and store each subset to HDFS. Later I want to read all of them back and merge them together. Thanks.
1 Answer
Writing a DataFrame to HDFS (Spark 1.6):
df.write.save('/target/path/', format='parquet', mode='append') ## df is an existing DataFrame object.
Some of the format options are csv, parquet, json, etc.
Reading a DataFrame from HDFS (Spark 1.6):
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.read.format('parquet').load('/path/to/file')
The format method takes arguments such as parquet, csv, json, etc.
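Putting the two together for the use case in the question, here is a minimal sketch; the target directory and the preprocessed_subsets iterable are assumptions, not part of the original answer. Each subset is appended to the same HDFS directory, and reading that directory later loads every part file, which effectively merges the subsets.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

target = '/tmp/preprocessed_subsets'  # hypothetical HDFS directory

# write each preprocessed subset into the same directory;
# mode='append' keeps the subsets written earlier instead of overwriting them
for subset_df in preprocessed_subsets:  # assumed iterable of DataFrames
    subset_df.write.save(target, format='parquet', mode='append')

# later (possibly in a new Spark application): loading the directory reads
# every file under it, i.e. all subsets merged into one DataFrame
merged_df = sqlContext.read.format('parquet').load(target)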
9 Comments
Ajg
Hey, I get AttributeError: 'DataFrameWriter' object has no attribute 'csv'. Also, I need to read that DataFrame later, I think in a new Spark session.
rogue-one
What is the version of your Spark installation?
Ajg
Spark version 1.6.1
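A side note on the AttributeError above: in Spark 1.6 DataFrameWriter has no csv method (it was added in Spark 2.0), so CSV output goes through the external spark-csv package. A minimal sketch, assuming that package is available on the cluster (the package version and output path below are assumptions):

# submit with: --packages com.databricks:spark-csv_2.10:1.5.0  (assumed version)
df.write.format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .save('/target/path_csv/')  # hypothetical output path

# parquet needs no extra package, so sticking with it avoids the issue
df.write.save('/target/path/', format='parquet', mode='append')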
Ajg
Thanks a lot. I have one doubt: while reading, what if there are multiple files in that location? How do I specify which file I want to read? Thanks.
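On the multiple-files doubt, a short sketch (the directory and file names below are hypothetical): load() on a directory reads every file under it, while a single file path or a glob pattern restricts the read to part of the data.

# read everything that was appended to the directory
all_df = sqlContext.read.format('parquet').load('/target/path/')

# read one specific part file, or a subset matching a glob pattern
one_df = sqlContext.read.format('parquet').load('/target/path/part-r-00000.gz.parquet')
some_df = sqlContext.read.format('parquet').load('/target/path/part-r-0000[0-4]*')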
rogue-one
To delete data from HDFS you can use HDFS shell commands like hdfs dfs -rm -r <path>. You can execute this from Python with the subprocess module, e.g. subprocess.call(["hdfs", "dfs", "-rm", "-r", path]).
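A minimal sketch of that cleanup step, assuming the hdfs binary is on the PATH and that path points at a directory written earlier (the path here is hypothetical):

import subprocess

path = '/target/path/'  # hypothetical HDFS directory to remove

# recursively delete the directory; call() returns the exit code (0 on success)
ret = subprocess.call(['hdfs', 'dfs', '-rm', '-r', path])
if ret != 0:
    print('failed to remove %s' % path)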