I am using the program below, running it in Anaconda (Spyder), to create a data pipeline from Kafka to Spark Streaming in Python.
import sys
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from uuid import uuid1
import os
## Step 1: Initialize the SparkContext
spark_context = SparkContext(appName="Transformation Application")
## Step 2: Initialize the StreamingContext
ssc = StreamingContext(spark_context, 5)
def utf8_decoder(s):
    """Decode bytes as UTF-8."""
    if s is None:
        return None
    return s.decode('utf-8')
message = KafkaUtils.createDirectStream(
    ssc,
    topics=['testtopic'],
    kafkaParams={
        "metadata.broker.list": "localhost:9092",
        "key.deserializer": "org.springframework.kafka.support.serializer.JsonDeserializer",
        "value.deserializer": "org.springframework.kafka.support.serializer.JsonDeserializer"
    },
    fromOffsets=None,
    messageHandler=None,
    keyDecoder=utf8_decoder,
    valueDecoder=utf8_decoder)
message
words = message.map(lambda x: x[1]).flatMap(lambda x: x.split(" "))
wordcount = words.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
wordcount.pprint()
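For reference, the map/flatMap/reduceByKey chain above is a standard word count. The same logic in plain Python (a sketch using a dict instead of Spark, with a made-up sample sentence) looks like this:

```python
# Plain-Python equivalent of the DStream word count above
# (illustration only; the real job applies these steps to each RDD batch).
sentence = "Hi Hi Hi how are you doing"  # sample message value

words = sentence.split(" ")              # flatMap step: message -> words
counts = {}
for w in words:                          # map + reduceByKey: (word, 1) summed per key
    counts[w] = counts.get(w, 0) + 1

print(counts)
```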
When I print message, words, or wordcount I get no proper results, only hexadecimal values.
message
Out[16]: <pyspark.streaming.kafka.KafkaDStream at 0x23f8b1f8248>
wordcount
Out[18]: <pyspark.streaming.dstream.TransformedDStream at 0x23f8b2324c8>
In my topic (testtopic) I produced the string "Hi Hi Hi how are you doing", so wordcount should give a count for each word, but it is giving some encoded hexadecimal values.
What are you expecting message to do? You're printing the Python object, not consuming the stream: a DStream's repr is just its type and memory address, which is where those "hexadecimal values" come from, and nothing is processed until you call ssc.start(). Also, Spark has its own JSON functions, so you shouldn't need to (try to) import Spring serializers. wordcount.pprint() is correct if you want to actually see the data.
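Concretely, the script in the question builds the DStream graph but never starts it. A minimal sketch of a corrected tail (assuming the same ssc, utf8_decoder, and KafkaUtils from the question, a broker on localhost:9092, and dropping the Spring deserializers, since the Python decoders already handle UTF-8) might look like:

```python
# Corrected tail of the script: plain Kafka params, then start the stream.
message = KafkaUtils.createDirectStream(
    ssc,
    topics=['testtopic'],
    kafkaParams={"metadata.broker.list": "localhost:9092"},
    keyDecoder=utf8_decoder,
    valueDecoder=utf8_decoder)

words = message.map(lambda x: x[1]).flatMap(lambda x: x.split(" "))
wordcount = words.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
wordcount.pprint()      # prints each batch's counts once the stream runs

ssc.start()             # nothing executes until start() is called
ssc.awaitTermination()  # block while the stream runs
```

This fragment only runs against a live Kafka broker, so it is shown as a sketch rather than a standalone script.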