
I have about a thousand binary files in a compressed format, and each file needs to be decoded separately in a single pass. The maximum file size is 500 MB. Currently I can decode the files one by one in Python (with the struct package), but because the files are so numerous and large, decoding them sequentially is not practical.

I am thinking of processing this data in Spark, but I don't have much experience with it. Can you please suggest whether this task can be done in Spark? Many thanks in advance.

1 Answer


sc.textFile will not work here because you have binary files; you should use sc.binaryFiles instead.

Here is an example in Python; Scala and Java have the same binaryFiles API.

import zlib
from pyspark import SparkContext

sc = SparkContext()

# Each element of the RDD is a (file path, file contents as bytes) pair
raw_binary = sc.binaryFiles("/path/to/my/files/directory")

def decompress(val):
    try:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(val, 16 + zlib.MAX_WBITS)
    except zlib.error:
        # Not gzip data; return the raw bytes unchanged
        return val

raw_binary.mapValues(decompress).take(1)

You can use zlib to decompress the gzip data, since binaryFiles hands you the raw file bytes.
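Since the question decodes each file with struct, one way to chain that step is another mapValues after the decompression. This is only a sketch: decode_records and the record format string are hypothetical placeholders for whatever your existing struct-based parser does.

import struct

# Hypothetical decoder: swap in the format string and parsing logic
# your existing struct-based code already uses.
def decode_records(data):
    record_fmt = "<IHf"                       # placeholder fixed-size record layout
    record_size = struct.calcsize(record_fmt)
    return [struct.unpack_from(record_fmt, data, off)
            for off in range(0, len(data) - record_size + 1, record_size)]

decoded = raw_binary.mapValues(decompress).mapValues(decode_records)
decoded.take(1)

Each RDD element then becomes a (file path, list of decoded records) pair, which you can feed into further Spark transformations.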


2 Comments

Thank you. +1 for this. It works perfectly if the data is in uncompressed binary format, but it does not work with compressed binary files (*.dat.gz). Could you please guide me on how to handle those?
I have made changes so it can decompress the data; hope that helps. Be sure to upvote and accept the answer if it works for you.
