Pyspark - print messages from Kafka

Question

I set up a kafka system with a producer and a consumer, streaming as messages the lines of a json file.

Using pyspark, I need to analyze the data for the different streaming windows. To do so, I need to have a look at the data as they are streamed by pyspark... How can I do it?

To run the code I used Yannael's Docker container. Here is my python code:

# Add dependencies and load modules
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.ui.port=4040 --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0,com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3 pyspark-shell'

from kafka import KafkaConsumer
from random import randint
from time import sleep

# Load modules and start SparkContext  
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
conf = SparkConf() \
    .setAppName("Streaming test") \
    .setMaster("local[2]") \
    .set("spark.cassandra.connection.host", "127.0.0.1")

try:
    sc.stop()
except:
    pass    

sc = SparkContext(conf=conf) 
sqlContext=SQLContext(sc)
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Create streaming task
ssc = StreamingContext(sc, 0.60)
kafkaStream = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "spark-streaming-consumer", {'test': 1})
ssc.start()

DStream.pprint?

10465355
– 10465355

2018-11-18 19:52:52 +00:00
Commented Nov 18, 2018 at 19:52 — 10465355
– 10465355, Commented Nov 18, 2018 at 19:52

OneCricketeer · Accepted Answer · 2018-11-19 17:00:20Z

3

You can either call kafkaStream.pprint(), or learn more about structured streaming and you can print like so

query = kafkaStream \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

I see that you have cassandraendpoints, so assuming you're writing into Cassandra, you can use Kafka Connect rather than writing Spark code for this

edited Nov 19, 2018 at 17:00

answered Nov 18, 2018 at 21:18

OneCricketeer

193k20 gold badges146 silver badges276 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

albusdemens Over a year ago

Thanks @cricket_007! As a first test, I included kafkaStream.pprint() and as a result I got the current time... Do you have any suggestion on how to have the proper messages?

OneCricketeer Over a year ago

Not sure I understand what you mean by "proper"

albusdemens Over a year ago

Shouldn't I see as output the messages in the test1 topic? As far as I understand, I subscribed to it in kafkaStream

OneCricketeer Over a year ago

Yes, you should, but only as they are actively being produced into the topic. By default, it consumes from the latest offsets

Collectives™ on Stack Overflow

Pyspark - print messages from Kafka

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related