1

I set up a kafka system with a producer and a consumer, streaming as messages the lines of a json file.

Using pyspark, I need to analyze the data for the different streaming windows. To do so, I need to have a look at the data as they are streamed by pyspark... How can I do it?

To run the code I used Yannael's Docker container. Here is my python code:

# Add dependencies and load modules
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.ui.port=4040 --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0,com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3 pyspark-shell'

from kafka import KafkaConsumer
from random import randint
from time import sleep

# Load modules and start SparkContext  
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
conf = SparkConf() \
    .setAppName("Streaming test") \
    .setMaster("local[2]") \
    .set("spark.cassandra.connection.host", "127.0.0.1")

try:
    sc.stop()
except:
    pass    

sc = SparkContext(conf=conf) 
sqlContext=SQLContext(sc)
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Create streaming task
ssc = StreamingContext(sc, 0.60)
kafkaStream = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "spark-streaming-consumer", {'test': 1})
ssc.start()
1

1 Answer 1

3

You can either call kafkaStream.pprint(), or learn more about structured streaming and you can print like so

query = kafkaStream \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

I see that you have endpoints, so assuming you're writing into Cassandra, you can use Kafka Connect rather than writing Spark code for this

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks @cricket_007! As a first test, I included kafkaStream.pprint() and as a result I got the current time... Do you have any suggestion on how to have the proper messages?
Not sure I understand what you mean by "proper"
Shouldn't I see as output the messages in the test1 topic? As far as I understand, I subscribed to it in kafkaStream
Yes, you should, but only as they are actively being produced into the topic. By default, it consumes from the latest offsets

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.