
I am working on Kafka streaming and trying to integrate it with Apache Spark. However, I am running into issues at runtime and getting the error below.

This is the command I am using.

df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "taxirides").load()

ERROR:

Py4JJavaError: An error occurred while calling o77.load.: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html

How can I resolve this?

NOTE: I am running this in Jupyter Notebook

import findspark
findspark.init('/home/karan/spark-2.1.0-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
Spark = SparkSession.builder.appName('KafkaStreaming').getOrCreate()
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

Everything runs fine up to this point (above code).

df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "taxirides").load()

This is where things go wrong (above code).

The blog which I am following: https://www.adaltas.com/en/2019/04/18/spark-streaming-data-pipelines-with-structured-streaming/

1 Answer


Edit

Using spark.jars.packages works better than PYSPARK_SUBMIT_ARGS.

Ref - PySpark - NoClassDefFoundError: kafka/common/TopicAndPartition
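A sketch of that spark.jars.packages approach, reusing the install path from the question; the builder chain below is my own illustration (not from the blog), and the coordinate assumes the prebuilt Spark 2.1.0 / Scala 2.11 distribution:

```python
import findspark
findspark.init('/home/karan/spark-2.1.0-bin-hadoop2.7')

from pyspark.sql import SparkSession

# Pull in the Kafka source at session creation instead of via submit args;
# the coordinate must match the Scala/Spark build (2.11 / 2.1.0 here).
spark = (SparkSession.builder
         .appName('KafkaStreaming')
         .config('spark.jars.packages',
                 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0')
         .getOrCreate())
```

With this in place, the dependency is resolved when the session starts, so no spark-submit flags or environment variables are needed.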


It's not clear how you ran the code, but if you keep reading the blog, you'll see:

spark-submit \
  ...
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
  sstreaming-spark-out.py

It seems you missed adding the --packages flag.

In Jupyter, you could add this

import os

# The Kafka source is an external package, so pass it via submit arguments.
# 'pyspark-shell' must be the last token, otherwise the args are ignored.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 pyspark-shell'

# initialize spark -- call findspark.init() *before* importing pyspark,
# since it is what puts pyspark on sys.path
import findspark
findspark.init()
import pyspark

Note: _2.11:2.4.0 needs to align with your Scala and Spark versions. Based on the question, yours should be Spark 2.1.0, i.e. org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
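To make the alignment rule concrete, here is a hypothetical helper (kafka_package is my name, not part of Spark) showing how the Maven coordinate is assembled from the two versions:

```python
def kafka_package(scala_version: str, spark_version: str) -> str:
    # Maven coordinate of the Structured Streaming Kafka source:
    # the suffix after '_' is the Scala version of your Spark build,
    # and the last field is the Spark version itself.
    return ("org.apache.spark:spark-sql-kafka-0-10_"
            f"{scala_version}:{spark_version}")

# For the asker's Spark 2.1.0 (prebuilt distributions use Scala 2.11):
print(kafka_package("2.11", "2.1.0"))
# -> org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
```

If either field disagrees with the running cluster, the jar either fails to resolve or loads classes compiled against the wrong Spark, which is exactly the follow-up error in the comments below.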


2 Comments

After adding the os environment variable, I am getting another error now: Py4JJavaError: An error occurred while calling o27.load. : java.lang.ClassNotFoundException: Failed to find data source: kafka.
@PKernel That is because the version of spark-sql-kafka does not match the Spark version you are currently running.
