
I have started the shell with the Databricks CSV package:

#../spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0

Then I read a CSV file, did some groupby operations, and dumped the result to a CSV.

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('path.csv')  # it has columns and df.columns works fine
type(df)  # <class 'pyspark.sql.dataframe.DataFrame'>

# now trying to dump a csv
df.write.format('com.databricks.spark.csv').save('path+my.csv')
# it creates a directory my.csv with 2 partitions

# To create a single file I followed the line below
# df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("path+file_satya.csv")  # this creates one part file inside the directory of the csv name
# but in both cases there is no column information (how do I add column names to that csv file?)

# again I am trying to read that csv by
df_new = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("the file i just created.csv")
# I am not getting any column names; the first data row becomes the column names

Please don't answer with "add a schema to the DataFrame after reading" or "specify the column names while reading".

Question 1: When dumping to CSV, is there any way to include the column names?

Question 2: Is there a way to create a single CSV file (not a directory again) that can be opened by MS Office or Notepad++?

Note: I am currently not using a cluster, as it is too complex for a Spark beginner like me. If anyone can provide a link on how to write to a single CSV file in a clustered environment, that would be a great help.

5 Answers


Try

df.coalesce(1).write.format('com.databricks.spark.csv').save('path+my.csv', header='true')

Note that this may not be an issue on your current setup, but on extremely large datasets, you can run into memory problems on the driver. This will also take longer (in a cluster scenario) as everything has to push back to a single location.


2 Comments

@Mike - yes, I am using a huge dataset, but the dataset I want to output as a CSV (after running some functions over that large dataset) might have 1 million rows or less. I have 28 GB RAM on the master and on both of its slaves. I will definitely try it out to check whether it gives me a memory error or not. Just out of curiosity, can you suggest what the ideal configuration would be if I want to output a CSV of about 5 million rows?
@Satya - I've mostly done my re-combining of the files with other tools outside of Spark (i.e., cat, gzip, etc.) if I needed that format. Regarding the best configuration, it depends on what you're trying to read the file with. Most of my usage is preprocessing and then re-importing into a SQL database for live querying; running bulk imports hasn't required a single file.
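
As a minimal sketch of that recombining step in plain Python (assuming the part files were written to a local directory without headers; out_dir and merged.csv are placeholder names, not anything from the answer above):

import glob
import shutil

# out_dir is the directory Spark wrote (header disabled on write); merged.csv is the single target file
with open('merged.csv', 'w') as out:
    out.write(','.join(df.columns) + '\n')          # write the header once, taken from the DataFrame
    for part in sorted(glob.glob('out_dir/part-*')):
        with open(part) as f:
            shutil.copyfileobj(f, out)              # append each part file's rows

If the part files were written with header='true', you would also need to skip the first line of every part file except the first.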

Just in case: on Spark 2.1 (Scala) you can create a single CSV file with the following lines:

import org.apache.spark.sql.SaveMode

dataframe
  .coalesce(1)                                                          // so only a single part file is created
  .write.mode(SaveMode.Overwrite)
  .option("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")  // suppress the _SUCCESS marker file
  .option("header", "true")                                             // write the header
  .csv("csvFullPath")



The following should do the trick:

df \
  .write \
  .mode('overwrite') \
  .option('header', 'true') \
  .csv('output.csv')

Alternatively, if you want the results to be in a single partition, you can use coalesce(1):

df \
  .coalesce(1) \
  .write \
  .mode('overwrite') \
  .option('header', 'true') \
  .csv('output.csv')

Note however that this is an expensive operation and might not be feasible with extremely large datasets.
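
One caveat: even with coalesce(1), .csv('output.csv') still produces a directory named output.csv containing a single part file. A minimal sketch of pulling that file out on a local filesystem (final.csv is just a placeholder name) could be:

import glob
import shutil

part_file = glob.glob('output.csv/part-*')[0]   # the single part file produced by coalesce(1)
shutil.move(part_file, 'final.csv')             # rename it to the file name we actually want
shutil.rmtree('output.csv')                     # remove the Spark output directory (_SUCCESS/.crc files go with it)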



With Spark >= 2.0, we can do something like:

df = spark.read.csv('path+filename.csv', sep=',', header=True)   # set sep to your delimiter, if any
df.write.csv('path_filename of csv', header=True)                # yes, still written in partitions
df.toPandas().to_csv('path_filename of csv', index=False)        # single csv (pandas style)

1 Comment

It should be noted that you can force a single CSV by doing df.coalesce(1).write.csv(..., header=True). If you're partitioning your CSV, this will create one file for each partition. The name of the output file will be gobbledygook. A sketch illustrating this appears below.
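
To illustrate that comment, a short sketch with a hypothetical year column: coalescing to one partition and partitioning the write still yields one part file per distinct year, each under its own subdirectory.

(df.coalesce(1)
   .write.mode('overwrite')
   .partitionBy('year')              # hypothetical column; one subdirectory per distinct value
   .csv('out_dir', header=True))     # each subdirectory ends up with a single part file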

I got the answer to the 1st question: it was a matter of passing one extra parameter, header='true', along with the CSV write statement:

df.write.format('com.databricks.spark.csv').save('path+my.csv', header='true')

Alternative for the 2nd question:

Use toPandas().to_csv(). But again, I don't want to use pandas here, so please suggest if there is any other way around it.
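
If avoiding pandas is the point, one non-pandas route is to coalesce(1), write as usual, and then rename the single part file through the Hadoop FileSystem API so it also works against HDFS. This is only a sketch: it assumes Spark >= 2.0 for the spark session, relies on the internal _jvm/_jsc gateways, and tmp_dir / final.csv are placeholder paths.

df.coalesce(1).write.csv('tmp_dir', header=True, mode='overwrite')

hadoop = spark.sparkContext._jvm.org.apache.hadoop
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = hadoop.fs.FileSystem.get(conf)

part = fs.globStatus(hadoop.fs.Path('tmp_dir/part-*'))[0].getPath()  # the lone part file
fs.rename(part, hadoop.fs.Path('final.csv'))                         # move it to the desired name
fs.delete(hadoop.fs.Path('tmp_dir'), True)                           # clean up the temporary directory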

