I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset?
Details: the file is a tab-delimited CSV.
Reading a compressed CSV is done in the same way as reading an uncompressed CSV file. For Spark 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):
val df = spark.read.option("sep", "\t").csv("file.csv.gz")
PySpark:
df = spark.read.csv("file.csv.gz", sep='\t')
The only extra consideration is that a gz file is not splittable, so Spark has to read the whole file on a single core, which slows things down. Once the read is done, the data can be shuffled to increase parallelism, as in the sketch below.
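A minimal Scala sketch of that approach, assuming an existing SparkSession named spark; the partition count 8 is an illustrative value, not a recommendation:
// Read the gzipped CSV on a single core (gzip is not splittable),
// then repartition so downstream stages run in parallel.
val df = spark.read
  .option("sep", "\t")
  .csv("file.csv.gz")
  .repartition(8) // triggers a shuffle; choose a count suited to your cluster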
Because the file is gzip-compressed, it must be read using a single core. A workaround is to first unzip the file and then use Spark to read the data. Or you could change the compression type to a splittable codec such as bzip2; see this question: stackoverflow.com/questions/14820450/… A bzip2 file can be read with spark.read.csv("file.csv.bz2") (you could try spark.read.textFile as well).
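For completeness, a sketch of the bzip2 route, again assuming a tab-delimited file and a SparkSession named spark; since bzip2 is splittable, Spark can read it in parallel without the workaround above:
// Reading a bzip2-compressed CSV; note the same tab-delimiter option.
val df = spark.read.option("sep", "\t").csv("file.csv.bz2")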
spark.read.csv works with gzip files