I am trying to use PySpark to analyze my data in Databricks notebooks. Blob storage has been mounted on the Databricks cluster, and after analyzing the data I would like to write a CSV back to blob storage. Because PySpark works in a distributed fashion, the CSV is split into small part files when it is written to blob storage. How can I overcome this and write a single CSV file to blob storage when doing the analysis with PySpark? Thanks.

1 Answer

Do you really need a single file? If so, one way to get it is to merge all the small CSV part files into a single CSV file after the write. You can do the merge with a map function on the Databricks cluster, or with a background job that runs once the part files are in place.

Have a look here: https://forums.databricks.com/questions/14851/how-to-concat-lots-of-1mb-cvs-files-in-pyspark.html
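
A minimal sketch of the merge step, assuming the part files are reachable as ordinary files (on Databricks a mounted container is usually visible to the driver under a `/dbfs/mnt/...` path) and that each part file starts with the same header row. The function name `merge_csv_parts` and its parameters are made up for illustration; this uses only the Python standard library, not a Spark or Databricks API.

```python
import glob
import os
import shutil


def merge_csv_parts(parts_dir, merged_path, has_header=True):
    """Concatenate the part-*.csv files in parts_dir into one CSV file.

    Returns the number of part files that were merged.
    """
    # Spark names its output chunks part-00000-..., part-00001-..., and so on.
    # Sorting by name keeps the chunks in partition order.
    part_files = sorted(glob.glob(os.path.join(parts_dir, "part-*.csv")))
    if not part_files:
        raise FileNotFoundError(f"no part files found in {parts_dir}")

    with open(merged_path, "wb") as merged:
        for index, part in enumerate(part_files):
            with open(part, "rb") as chunk:
                if has_header and index > 0:
                    # Every part repeats the header; keep only the first one.
                    chunk.readline()
                shutil.copyfileobj(chunk, merged)
    return len(part_files)
```

Usage would look something like `merge_csv_parts("/dbfs/mnt/<mount-name>/output_dir", "/dbfs/mnt/<mount-name>/output.csv")`, where both paths are placeholders for your own mount. Pass `has_header=False` if the DataFrame was written without a header. The merge runs on a single machine, so it only makes sense when the combined file is small enough to handle there.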
