
I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset?

Details: the file is a tab-delimited CSV.

  • Possible dupe of many in SO. Some are: this and this Commented Mar 26, 2018 at 12:45
  • 2
    spark.read.csv works with gzip files Commented Mar 26, 2018 at 12:54

1 Answer


Reading a compressed CSV file is done in the same way as reading an uncompressed one. For Spark version 2.0+ it can be done as follows, using Scala (note the extra option for the tab delimiter):

val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration is that a gzip file is not splittable, so Spark must read the whole file on a single core, which slows things down. Once the read is done, the data can be shuffled (e.g. repartitioned) to increase parallelism.
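For reference, a small tab-delimited file.csv.gz like the one the snippets above read can be generated with plain Python's standard library; the rows here are made-up sample data, and the file name simply matches the examples above. Reading it back also illustrates why the file is not splittable: gzip offers no random access, so the archive must be decompressed as one sequential stream.

```python
import csv
import gzip
import os
import tempfile

# Made-up sample rows for a tab-delimited CSV.
rows = [
    ["id", "name", "score"],
    ["1", "alice", "3.5"],
    ["2", "bob", "4.0"],
]

# Write the rows as a gzip-compressed, tab-delimited file.
path = os.path.join(tempfile.mkdtemp(), "file.csv.gz")
with gzip.open(path, "wt", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)

# gzip has no random access: the whole archive is decompressed as one
# sequential stream, which is why Spark cannot split it across cores.
with gzip.open(path, "rt", newline="") as f:
    read_back = list(csv.reader(f, delimiter="\t"))

print(read_back == rows)  # True
```

After Spark has read such a file on a single core, `df.repartition(n)` redistributes the rows across `n` partitions so that downstream stages can run in parallel.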


Comments

Thanks, I did read the file directly using the read csv option, and I could observe the slowness. Is it best practice to read the whole file using a single core?
@prady Due to the file being a gzip it must be read using a single core. A work-around would be to first unzip the file and then use Spark to read the data. Or you could change the compression type, refer to this question: stackoverflow.com/questions/14820450/…
Thanks for the reference
can someone tell me how to read a csv.bz2 in to a dataframe?
@SithijaPiyumanThewaHettige: The same method as in this answer should apply, i.e.: spark.read.csv("file.csv.bz2") (you could try spark.read.textFile as well).
