I set up a Kafka system with a producer and a consumer, streaming the lines of a JSON file as messages.
Using PySpark, I need to analyze the data over the different streaming windows. To do that, I need to look at the data as PySpark streams them in... How can I do it?
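For context, here is roughly what my producer side does (a minimal sketch; kafka-python, the broker address 127.0.0.1:9092, and the file name data.json stand in for my actual setup, while the topic 'test' matches the consumer code below):

from time import sleep
from kafka import KafkaProducer

# Send each line of the JSON file as a separate Kafka message on topic 'test'
producer = KafkaProducer(bootstrap_servers='127.0.0.1:9092')
with open('data.json') as f:  # 'data.json' is a placeholder name
    for line in f:
        producer.send('test', line.strip().encode('utf-8'))
        sleep(0.1)  # throttle the messages so they arrive as a stream
producer.flush()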
To run the code, I used Yannael's Docker container. Here is my Python code:
# Add the Kafka and Cassandra connector dependencies
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.ui.port=4040 --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0,com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3 pyspark-shell'
from kafka import KafkaConsumer
from random import randint
from time import sleep
# Load modules and start SparkContext
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
conf = SparkConf() \
    .setAppName("Streaming test") \
    .setMaster("local[2]") \
    .set("spark.cassandra.connection.host", "127.0.0.1")

# Stop any SparkContext left over from a previous run before creating a new one
try:
    sc.stop()
except:
    pass
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
# Create the streaming context with a 600 ms batch interval
ssc = StreamingContext(sc, 0.60)
kafkaStream = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "spark-streaming-consumer", {'test': 1})
# My attempt, based on DStream.pprint: print the first 10 records of every batch.
# pprint() is an output operation, so it has to be registered before ssc.start().
kafkaStream.pprint()
ssc.start()
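This is the kind of per-window inspection I am aiming for; a sketch under my own assumptions: the message values are individual JSON strings, and the 6 s window with a 3 s slide is arbitrary (both must be multiples of the 0.6 s batch interval):

import json

# kafkaStream yields (key, value) pairs; the value is the original JSON line
lines = kafkaStream.map(lambda kv: json.loads(kv[1]))

# Window and slide durations must be multiples of the 0.6 s batch interval
windowed = lines.window(6, 3)
windowed.count().pprint()  # number of records in each window
windowed.pprint()          # first 10 records of each window

# Like pprint() above, these would also have to be registered before ssc.start()

Is pprint() the right way to look at the windowed data, or is there a better approach?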