
I am currently working on storing a Spark DataFrame as a .csv file in Azure Blob Storage. I am using the following code.

 smtRef2_DF.dropDuplicates().coalesce(1).write
  .mode("overwrite")
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save(csvBlobStorageMount + "/Output/Smt/SmtRef.csv")

This works, but it creates a SmtRef.csv folder in which the actual .csv file is stored as part-00000-tid.csv. How do I specify the name of the actual .csv file?

Thanks in Advance

  • I don't think this question should be closed — saving as a single file is not the same as renaming a file. Here is an option for renaming with pyarrow and pathlib (note it assumes `import pyarrow`, `import pathlib`, and `from pathlib import Path`):

     def rename_file_hdfs(hdfs_path):
         phc = pyarrow.hdfs.connect()
         fl = phc.ls(hdfs_path)
         fl = [f for f in fl if pathlib.Path(f).stem.startswith("part")]
         for i, f in enumerate(fl):
             pa = Path(fl[0]).parent
             nf = f"newf{i}.csv"
             tp = Path(pa, nf)
             tp = str(tp).replace("hdfs:/", "hdfs://")
             phc.mv(f"{f}", f"{tp}")

    Commented Apr 5, 2020 at 7:49

2 Answers


If the data is small enough to fit into memory, one workaround is to convert to a pandas DataFrame and save it as a CSV from there.

df_pd = df.toPandas()
df_pd.to_csv("path", index=False)  # index=False avoids writing the pandas row index
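A minimal local sketch of this approach (the DataFrame is constructed by hand here as a stand-in for `smtRef2_DF.toPandas()`, and the output directory is a temp folder rather than the real blob-storage mount):

    import tempfile
    from pathlib import Path

    import pandas as pd

    # Hypothetical stand-in for the Spark DataFrame after .toPandas().
    df_pd = pd.DataFrame({"id": [1, 2, 2], "name": ["a", "b", "b"]})

    out_dir = Path(tempfile.mkdtemp())
    out_file = out_dir / "SmtRef.csv"

    # drop_duplicates mirrors the dropDuplicates() call in the question;
    # to_csv writes exactly one file with exactly the name we chose.
    df_pd.drop_duplicates().to_csv(out_file, index=False)

Because pandas writes a single file directly, there is no part-00000 folder to clean up afterwards.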



It's not possible with the Spark API.

If you want to achieve this, use .repartition(1), which will generate a single part file, and then use the Hadoop FileSystem API to rename that file in HDFS:

 import org.apache.hadoop.fs._

 FileSystem.get(spark.sparkContext.hadoopConfiguration)
   .rename(new Path("oldpathtillpartfile"), new Path("newpath"))
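The same find-and-rename step can be sketched locally with Python's pathlib standing in for the Hadoop FileSystem API (all directory and file names here are made up for illustration; on Databricks you would operate on the blob-storage mount instead):

    import tempfile
    from pathlib import Path

    # Simulate the folder Spark creates: SmtRef.csv/part-00000-....csv
    out_dir = Path(tempfile.mkdtemp()) / "SmtRef.csv"
    out_dir.mkdir()
    (out_dir / "part-00000-tid.csv").write_text("id,name\n1,a\n")

    # Locate the single part file and rename it to the name we want.
    part_file = next(p for p in out_dir.iterdir() if p.name.startswith("part-"))
    renamed = part_file.rename(out_dir / "SmtRef_final.csv")

This works because .repartition(1) (or the question's .coalesce(1)) guarantees exactly one part file, so the first match is the whole output.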

