Spark Streaming - processing binary data file

Question

I'm using pyspark 1.6.0.

I have existing pyspark code to read binary data file from AWS S3 bucket. Other Spark/Python code will parse the bits in the data to convert into int, string, boolean and etc. Each binary file has one record of data.

In PYSPARK I read the binary file using: sc.binaryFiles("s3n://.......")

This is working great as it gives a tuple of (filename and the data) but I'm trying to find an equivalent PYSPARK streaming API to read binary file as a stream (hopefully the filename, too if can) .

I tried: binaryRecordsStream(directory, recordLength)

but I couldn't get this working...

Can anyone share some lights how PYSPARK streaming read binary data file?

JuJoDi · Accepted Answer · 2017-01-10 17:07:39Z

1

In Spark Streaming, the relevant concept is the fileStream API, which is available in Scala and Java, but not in Python - noted here in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the file you are reading can be read as a text file, you can use the textFileStream API

answered Jan 10, 2017 at 17:07

JuJoDi

15.1k23 gold badges87 silver badges129 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Marcus · Accepted Answer · 2018-08-08 20:07:23Z

I had a similar question for Java Spark where I wanted to stream updates from S3, and there was no trivial solution, since the binaryRecordsStream(<path>,<record length>) API was only for fixed byte length records, and couldn't find an obvious equivalent to JavaSparkContext.binaryFiles(<path>). The solution, after reading what binaryFiles() does under the covers was to do this:

JavaPairInputDStream<String, PortableDataStream> rawAuctions = 
        sc.fileStream("s3n://<bucket>/<folder>", 
                String.class, PortableDataStream.class, StreamInputFormat.class);

Then parse the individual byte messages from the PortableDataStream objects. I apologize for the Java context, but perhaps there is something similar you can do with PYSPARK.

Collectives™ on Stack Overflow

Spark Streaming - processing binary data file

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related