
I have about a thousand binary files in a compressed format, and each file needs to be decoded separately in a single pass. The maximum file size is 500 MB. Currently I can decode the files one by one in Python (with the struct package), but because the files are so numerous and large, decoding them sequentially is not practical.

I am thinking of processing this data in Spark, but I don't have much experience with it. Can you please suggest whether this task can be done in Spark? Many thanks in advance.

1 Answer


sc.textFile will not work here because you have binary files; you should use sc.binaryFiles instead.

Here is an example in Python; Scala and Java have the same binaryFiles API.

import zlib
from pyspark import SparkContext

sc = SparkContext()

# Each element of the RDD is a (file path, file contents as bytes) pair
raw_binary = sc.binaryFiles("/path/to/my/files/directory")

def decompress(val):
    try:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(val, 16 + zlib.MAX_WBITS)
    except zlib.error:
        # Not gzip data; return the raw bytes unchanged
        return val

raw_binary.mapValues(decompress).take(1)

You can use zlib to decompress the gzip data, since binaryFiles hands you the raw file bytes.
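Since the question decodes each file with struct, one way to chain that step is another mapValues after the decompression. This is only a sketch: decode_records and the record format string are hypothetical placeholders for whatever your existing struct-based parser does.

import struct

# Hypothetical decoder: swap in the format string and parsing logic
# your existing struct-based code already uses.
def decode_records(data):
    record_fmt = "<IHf"                       # placeholder fixed-size record layout
    record_size = struct.calcsize(record_fmt)
    return [struct.unpack_from(record_fmt, data, off)
            for off in range(0, len(data) - record_size + 1, record_size)]

decoded = raw_binary.mapValues(decompress).mapValues(decode_records)
decoded.take(1)

Each RDD element then becomes a (file path, list of decoded records) pair, which you can feed into further Spark transformations.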


2 Comments

Thank you. +1 for this. It works perfectly if the data is in uncompressed binary format, but it does not work with compressed binary files (*.dat.gz). Could you please guide me on how to handle those?
I have made changes so it can decompress the data; hope that helps. Be sure to upvote and accept the answer if it works for you.
