There are many ways to read/write a Spark DataFrame from/to Kafka. I am trying to read messages from a Kafka topic and create a DataFrame out of them. I am able to pull the messages from the topic, but I am unable to convert them to a DataFrame. Any suggestion would be helpful.

import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.context import SparkContext
from kafka import KafkaConsumer

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

consumer = KafkaConsumer('Jim_Topic')

for message in consumer:
    data = message
    print(data) # the messages print correctly
    df = data.map # unable to convert the messages to a DataFrame

I tried the below approach as well:

df = spark \
  .read \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "Jim_Topic") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

I get the below error:

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;

1 Answer

Depending on your use case, you can

  1. either create a Kafka source for streaming queries,
  2. or create a Kafka source for batch queries.

For Streaming Queries

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "Jim_Topic") \
  .load()

# Kafka keys and values are binary, so cast them to strings
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
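A streaming DataFrame does nothing until a query is started against it. A minimal sketch of starting one, assuming the console sink is acceptable for inspecting the records:

# Start the streaming query; each micro-batch is printed to stdout
query = df \
  .writeStream \
  .format("console") \
  .outputMode("append") \
  .start()

query.awaitTermination()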

For Batch Queries

df = spark \
  .read \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "Jim_Topic") \
  .load()

# Kafka keys and values are binary, so cast them to strings
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
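A batch read returns an ordinary DataFrame, so you can inspect it directly, for example:

df.show(truncate=False)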

Make sure to add the required dependency as well:

org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.2

(replace the version with your own; the above refers to Spark 2.0.2 built for Scala 2.11)
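If you run the script from an IDE such as PyCharm instead of through spark-submit --packages, one way to pull the connector in from Python is the spark.jars.packages config. A sketch, assuming the same artifact version as above; the app name here is arbitrary, and the config must be set before any SparkContext has been created:

from pyspark.sql import SparkSession

# spark.jars.packages makes Spark download the connector (and its
# transitive dependencies) from Maven Central at session startup
spark = SparkSession.builder \
    .appName("kafka-example") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.2") \
    .getOrCreate()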

Comments

Thanks for the quick help. I tried this logic already; I am getting the below error: pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
@JimMacaulay How are you running your app? Is it through spark-submit?
I am running it from PyCharm directly, not with spark-submit.
@JimMacaulay You need to add the required dependency. See my updated answer.
Could you please help me add the dependency? I am not sure how to add it. In Java I would have added it as a Maven dependency; I am not sure how to do it in Python.