
I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset?

Details: the file is a tab-delimited CSV.

  • Possible dupe of many in SO. Some are: this and this Commented Mar 26, 2018 at 12:45
  • 2
    spark.read.csv works with gzip files Commented Mar 26, 2018 at 12:54

1 Answer


Reading a compressed CSV file is done in the same way as reading an uncompressed one. For Spark version 2.0+ it can be done as follows, using Scala (note the extra option for the tab delimiter):

val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration is that a gzip file is not splittable, so Spark must read the whole file on a single core, which slows things down. Once the read is done, the data can be shuffled (e.g. repartitioned) to increase parallelism.
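For reference, a small tab-delimited file.csv.gz like the one the snippets above read can be generated with plain Python's standard library; the rows here are made-up sample data, and the file name simply matches the examples above. Reading it back also illustrates why the file is not splittable: gzip offers no random access, so the archive must be decompressed as one sequential stream.

```python
import csv
import gzip
import os
import tempfile

# Made-up sample rows for a tab-delimited CSV.
rows = [
    ["id", "name", "score"],
    ["1", "alice", "3.5"],
    ["2", "bob", "4.0"],
]

# Write the rows as a gzip-compressed, tab-delimited file.
path = os.path.join(tempfile.mkdtemp(), "file.csv.gz")
with gzip.open(path, "wt", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)

# gzip has no random access: the whole archive is decompressed as one
# sequential stream, which is why Spark cannot split it across cores.
with gzip.open(path, "rt", newline="") as f:
    read_back = list(csv.reader(f, delimiter="\t"))

print(read_back == rows)  # True
```

After Spark has read such a file on a single core, `df.repartition(n)` redistributes the rows across `n` partitions so that downstream stages can run in parallel.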


Comments

Thanks, I did read the file directly using the read csv option, and I could observe the slowness. Is it best practice to read the whole file using a single core?
@prady Due to the file being a gzip it must be read using a single core. A work-around would be to first unzip the file and then use Spark to read the data. Or you could change the compression type, refer to this question: stackoverflow.com/questions/14820450/…
Thanks for the reference
can someone tell me how to read a csv.bz2 in to a dataframe?
@SithijaPiyumanThewaHettige: The same method as in this answer should apply, i.e.: spark.read.csv("file.csv.bz2") (you could try spark.read.textFile as well).
