I'm using PySpark 1.6.0.

I have existing PySpark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code parses the bits in the data and converts them into ints, strings, booleans, and so on. Each binary file holds one record of data.
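To make the parsing step concrete, here is a minimal sketch of decoding one record with Python's struct module. The record layout (a 4-byte big-endian unsigned int, a 1-byte boolean, and an 8-byte NUL-padded ASCII string) and the names RECORD_FORMAT and parse_record are hypothetical, made up for illustration; the real layout is not shown in this question.

```python
import struct

# Hypothetical layout, for illustration only:
#   >   big-endian, no padding
#   I   4-byte unsigned int (record id)
#   ?   1-byte boolean flag
#   8s  8-byte ASCII string, NUL-padded
RECORD_FORMAT = ">I?8s"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)


def parse_record(raw):
    """Decode one binary record into (int, bool, str)."""
    if len(raw) != RECORD_SIZE:
        raise ValueError(
            "expected %d bytes, got %d" % (RECORD_SIZE, len(raw))
        )
    record_id, is_active, name = struct.unpack(RECORD_FORMAT, raw)
    # Strip the NUL padding before decoding the string field.
    return record_id, is_active, name.rstrip(b"\x00").decode("ascii")


# Build a sample record locally instead of reading it from S3.
sample = struct.pack(RECORD_FORMAT, 42, True, b"abc")
print(parse_record(sample))
```

A function like this could be applied to the data half of each (filename, data) tuple, for example with mapValues(parse_record) on a pair RDD.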

In PySpark I read the binary files using: sc.binaryFiles("s3n://.......")

This works great, as it gives a tuple of (filename, data), but I'm trying to find an equivalent PySpark Streaming API to read binary files as a stream (ideally with the filename too, if possible).

I tried: binaryRecordsStream(directory, recordLength)

but I couldn't get this working...
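For reference, a rough sketch of how binaryRecordsStream is typically wired up, assuming every file is made of records of one fixed length. The directory and record length are passed in as arguments because the actual values are not given here; run_stream and split_fixed_records are hypothetical names. The stream yields raw byte strings only (no filename), and as far as I know it only picks up files that appear in the directory after the stream has started, which may be one reason it seems not to work.

```python
def split_fixed_records(data, record_length):
    """Local model of fixed-length splitting (not Spark's own code).

    Cuts a byte string into chunks of record_length bytes. Rejecting
    a length that is not a whole multiple is a choice made for this
    sketch.
    """
    if record_length <= 0 or len(data) % record_length != 0:
        raise ValueError("data length must be a multiple of record_length")
    return [
        data[i:i + record_length]
        for i in range(0, len(data), record_length)
    ]


def run_stream(directory, record_length, batch_seconds=10):
    """Print the size of each fixed-length record seen in new files."""
    # Imported here so the helper above can be used without Spark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="BinaryRecordsSketch")
    ssc = StreamingContext(sc, batch_seconds)

    # Each element of the DStream is one record_length-byte record.
    records = ssc.binaryRecordsStream(directory, record_length)
    records.map(lambda rec: len(rec)).pprint()

    ssc.start()
    ssc.awaitTermination()
```

Calling run_stream(directory, record_length) with the monitored directory and the size of one record starts the job; split_fixed_records(b"abcdef", 3) shows locally what fixed-length splitting does to a single file's bytes.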

Can anyone shed some light on how to read binary data files with PySpark Streaming?

2 Answers

In Spark Streaming, the relevant API is fileStream, which is available in Scala and Java but not in Python, as noted in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the file you are reading can be read as a text file, you can use the textFileStream API instead.

I had a similar question for Java Spark, where I wanted to stream updates from S3. There was no trivial solution: the binaryRecordsStream(<path>, <record length>) API only handles fixed-byte-length records, and I couldn't find an obvious equivalent to JavaSparkContext.binaryFiles(<path>). The solution, after reading what binaryFiles() does under the covers, was to do this:

JavaPairInputDStream<String, PortableDataStream> rawAuctions = 
        sc.fileStream("s3n://<bucket>/<folder>", 
                String.class, PortableDataStream.class, StreamInputFormat.class);

Then parse the individual byte messages from the PortableDataStream objects. I apologize for the Java context, but perhaps you can do something similar with PySpark.
