Newest 'spark-kafka-integration' Questions

0 votes

0 answers

70 views

EMR Spark Job Fails to Connect to MSK with IAM Auth - Timeout Waiting for Node Assignment Error

I am running an Apache Spark job on Amazon EMR that needs to connect to an Amazon MSK cluster configured with IAM authentication. The EMR cluster has an IAM role with full MSK permissions, and I can ...

Vishwas Singh

1

asked Oct 1 at 11:20

0 votes

0 answers

49 views

[Spark-stream]: Stream Batches processing time reduce over time causing Kafka Lag

I have been using Spark v3.5 Spark Stream functionality for the below use case. I am observing the issue below on one of the environments with Spark Stream. Please if I can get some assistance with ...

Saurabh Agrawal

1

asked Feb 13 at 15:04

0 votes

1 answer

142 views

Unable to push the data from the written kafka topic to Postgres table

I am trying to load the data written into the Kafka topic into the Postgres table. I can see the topic is receiving new messages every second and also the data looks good. However, when I use the ...

RushHour

645

asked Oct 26, 2024 at 8:38

0 votes

1 answer

122 views

ERROR SparkContext: Failed to add spark-streaming-kafka-0-10_2.13-3.5.2.jar

ERROR SparkContext: Failed to add home/areaapache/software/spark-3.5.2-bin-hadoop3/jars/spark-streaming-kafka-0-10_2.13-3.5.2.jar \ to Spark environment import logging from pyspark.sql import ...

Lê Anh Tuấn 291N40

11

asked Sep 26, 2024 at 16:29

3 votes

0 answers

79 views

Is there options to send Spark streaming executor metrics directly instead via driver?

I have Spark Streaming application lives on Argo + K8S that reads Kafka topics by subscribe pattern then there are some transformations and writing to a target. Several different producers may write ...

Александр Трутнев

51

asked Sep 22, 2024 at 13:37

0 votes

1 answer

290 views

Save a file in Databricks Workspace using Scala/Java

My goal is to run a Spark job using Databricks, and my challenge is that I can't store files in the local filesystem since the file is saved in the driver, but when my executors tried to access the ...

John Doe

433

asked Sep 16, 2024 at 14:31

1 vote

0 answers

384 views

Spark : java.lang.NoClassDefFoundError: org/apache/spark/kafka010/KafkaConfigUpdater

I am working on spark streaming and reading data from kafka topic, but getting error java.lang.NoClassDefFoundError: org/apache/spark/kafka010/KafkaConfigUpdater. Running my code in K8s and provide ...

vivekdesai

45

asked May 28, 2024 at 14:08

2 votes

0 answers

130 views

Multiple Kafka source topic + Spark Structured streaming + multiple delta table sink

I have multiple topics in kafka that I need to sink in their respective delta table. A) 1 Streaming query for all topics If i use one streaming query, then the RDD/DF should contains data from ...

MaatDeamon

9,859

asked May 25, 2024 at 10:51

2 votes

1 answer

3k views

ClassNotFoundException for scala.$less$colon$less. Problem with different Scala versions?

When I try to run this .py: import logging from cassandra.cluster import Cluster from pyspark.sql import SparkSession from pyspark.sql.functions import from_json, col from pyspark.sql.types import ...

francollado99

31

asked May 12, 2024 at 17:00

2 votes

1 answer

548 views

"java.lang.NoSuchMethodError: 'scala.collection.JavaConverters$AsJava scala.collection" error when I stream kafka messages using Pyspark

I am in a bind here. I am trying to implement a very basic pipeline which reads data from kafka and process it in Spark. The problem I am facing is that apache spark shuts down abruptly giving the ...

Nanomachines Son

160

asked May 1, 2024 at 8:24

1 vote

0 answers

97 views

spark-connect with standalone spark cluster error

I'm trying to read stream from Kafka using pyspark. The Stack I'm working with: Kubernetes. Stand alone spark cluster with 2 workers. spark-connect connected to the cluster and has the dependencies ...

waseemoo1

11

asked Apr 22, 2024 at 13:25

0 votes

1 answer

167 views

Py4JJavaError An error occurred while calling javalangNoSuchMethodError org.apache.spark.sql.AnalysisException org.apache.spark.sql.kafka.KafkaWriter

I can't write to Kafka from Spark, Spark is reading but not writing, if I write to the console it doesn't give an error Traceback (most recent call last): File "f:\Sistema de Informação\TCC\...

Ingrid Iplinsky

21

asked Apr 5, 2024 at 14:50

0 votes

1 answer

78 views

Spark incoming JSON stream processing

I have been trying to complete a project in which I needed to send data stream using kafka to local Spark to process the incoming data. However I can not show and use the data frame in the right ...

AFORS

11

asked Jan 30, 2024 at 21:35

0 votes

0 answers

33 views

Failed to find data source: kafka. Please deploy the application as per the deployment section of Structured Streaming + Kafka Integration Guide Spark [duplicate]

Hello I am trying to use pyspark + kafka in order to do this I execute this command in order to set up the Spark application Spark version is 3.5.0 | spark-3.5.0-bin-hadoop3 Kafka version is - ...

Nícolas Farfán Cheneaux

1

asked Dec 17, 2023 at 0:20

0 votes

0 answers

268 views

Spark-Kafka Integration not working: Kafka broker with producer and consumer script are getting stuck as soon as we run spark script(consumer.py)

I'm trying to read data from kafka topic by using spark structured streaming on ec2(ubuntu) machine. If I try to read the data by using kafka stream only(kafka-console-consumer.sh) then there is no ...

Rushi

25

asked Oct 4, 2023 at 7:26

0 votes

1 answer

1k views

How to add Kafka dependencies for PySpark on a Jupyter notebook

I have setup kafka 2.1 on windows and able to successfully communicate a topic from producer to consumer over localhost:9092. I now want to consume this in a spark structured stream. For this I setup ...

rarpal

193

asked Aug 16, 2023 at 18:05

0 votes

0 answers

279 views

Spark Kafka: understanding offset management with enable.auto.commit

according to the Kafka documentation offset in Kafka can be managed using enable.offset.commit and auto.commit.interval.ms. I have difficulties understanding the concept. For example I have a Kafka ...

AzUser1

193

asked Jul 25, 2023 at 9:07

0 votes

1 answer

3k views

Running Kafka and Spark with docker-compose

My goal is to send/produce a txt from my Windows PC to a container running Kafka, to then be consume by pyspak (running in other container). I'm using docker-compose where I define a custom net and ...

yaviens

25

asked May 12, 2023 at 16:02

0 votes

1 answer

423 views

How to map a message to a object with `schema` and `payload` in Spark structured streaming correctly?

I am hoping to map a message to a object with schema and payload inside in during Spark structured streaming. This is my original code val input_schema = new StructType() .add("timestamp", ...

Hongbo Miao

50.7k

asked Apr 27, 2023 at 22:06

1 vote

1 answer

1k views

Difference between spark-streaming-kafka-0-10 vs spark-sql-kafka-0-10

I am hoping to read a parquet file and write to Kafka import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions.struct import org.apache.spark.sql.functions.to_json object ...

Hongbo Miao

50.7k

asked Apr 25, 2023 at 2:38

0 votes

1 answer

573 views

Getting Error for org.apache.spark.sql.Encoder and missing or invalid dependency find while loading class file SQLImplicits, LowPrioritySQLImplicits

I am running following code to read kafka stream with spark-3.2.2, and scala 2.12.0. Earlier same code was working fine with spark-2.2 and scala 2.11.8, import spark.implicits._ val kafkaStream = ...

Chandan Gawri

451

asked Apr 19, 2023 at 9:36

0 votes

0 answers

176 views

What is the best way to set up a kafka connection with apache spark

How to make the kafka stream more stable? as it it will run constantly without having us to start the run again after it fails (so far we are thinking about using the "continous" run mode to ...

Trodenn

37

asked Mar 25, 2023 at 2:28

2 votes

1 answer

191 views

I have more data in a kafka topic but when i extract data using my pyspark application, I am getting only 1 row extracted, how to fix?

I have more data in a kafka topic but when i extract data using my pyspark application (which I use to extract from different kafka topics), I am getting only 1 row extracted. Previously I had ...

rakk

57

asked Mar 1, 2023 at 7:37

0 votes

0 answers

99 views

How to read DF Column of struct type and add they key value pairs to kafka headers?

I have a new dataframe with 2 columns one is headers and other is a payload, facing issues in reading the headers column and assigning the values to kafka headers while publishing. earlier the ...

abhimanyu

123

asked Feb 9, 2023 at 21:20

1 vote

0 answers

396 views

spark streaming from kafka on spark operator(Kubernetes)

I have a spark structured streaming job in scala, reading from kafka and writing to S3 as hudi tables. Now I am trying to move this job to spark operator on EKS. When I give the option in the yaml ...

haripriya rajendran

21

asked Feb 9, 2023 at 15:19

0 votes

1 answer

74 views

Pyspark : Kafka consumer for multiple topics

I have a list of topics (for now it's 10) whose size can increase in future. I know we can spawn multiple threads (per topic) to consume from each topic, but in my case if the number of topics ...

erك

3

asked Jan 21, 2023 at 9:08

0 votes

0 answers

1k views

I am facing "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate" error while working with pyspark

I am using this tech stack: Spark version: 3.3.1 Scala Version: 2.12.15 Hadoop Version: 3.3.4 Kafka Version: 3.3.1 I am trying to get data from kafka topic through spark structure streaming, But I am ...

Muhammad Affan

155

asked Jan 16, 2023 at 8:27

0 votes

1 answer

1k views

java.lang.ClassNotFoundException: org.apache.spark.sql.kafka010.KafkaSourceRDDPartition

I am using Spark2.3.0 and kafka1.0.0.3. I have created a spark read stream df = spark.readStream. \ format("kafka"). \ option("kafka.bootstrap.servers", "...

Utpal Dutta

3

asked Jan 5, 2023 at 13:06

1 vote

1 answer

785 views

Error while pulling kafka jks certificates from hdfs (trying with s3 as well) in spark

I am Running spark in cluster mode which is giving error as ERROR SslEngineBuilder: Modification time of key store could not be obtained: hdfs://ip:port/user/hadoop/jks/kafka.client.truststore.jks ...

richa bharwal

29

asked Jan 4, 2023 at 15:13

0 votes

1 answer

2k views

How to process the dataframe which was read from Kafka Topic using Spark Streaming

I'm able to stream twitter data into my Kafka topic via a producer. When I try to consume through the default Kafka consumer I'm able to see the tweets as well. But when I try to use Spark Streaming ...

Kulasangar

9,532

asked Dec 15, 2022 at 11:15

1 vote

0 answers

222 views

problem with spark structured streaming after restart

I have a simple pyspark code,which reads data from kafka and write aggregated records to oracle with foreachbatch.I have set checkpointLocation on hdfs dir and it works well.when I kill application ...

Nestor

91

asked Nov 23, 2022 at 13:31

1 vote

0 answers

75 views

How to get commitedOffsets and availableOffsets from sparkstreaming

22/11/09 11:08:40 INFO MicroBatchExecution: Resuming at batch 206 with committed offsets {KafkaV2[Subscribe[test]]: {"test":{"0":3086,"1":3086,"2":3086,"3&...

qtcat

25

asked Nov 10, 2022 at 1:32

1 vote

0 answers

358 views

Inferring a schema of a kafka topic taking too much time in databricks

I'm trying to determine the schema of a json kafka topic. To achieve that, I lifted a code part from this blog(https://medium.com/wehkamp-techblog/streaming-kafka-topic-to-delta-table-s3-with-spark-...

Glimpse

23

asked Nov 9, 2022 at 14:19

0 votes

1 answer

346 views

How to writestream to specific kafka cluster in Azure databricks? " Topic mytopic not present in metadata after 60000 ms."

I am trying to write data to Kafka with the writestream method. I have been given the following properties by the source system. topic = 'mytopic' host = "myhost.us-west-1.aws.confluent.cloud:...

OrganicMustard

1,436

asked Oct 22, 2022 at 23:21

3 votes

0 answers

576 views

Structured Streaming - suddenly giving error while writing to (Strimzi)Kafka topic

i've a Structured Streaming code which reads data from a Kafka Topic (on a VM) & writes to another Kafka Topic on GKE (i should be using a Mirror Maker for this, but have not implemented that yet)....

Karan Alang

1,111

asked Oct 18, 2022 at 6:29

1 vote

2 answers

496 views

PySpark Structured Streaming with Kafka - Scaling Consumers for multiple topics with different loads

We subscribed to 7 topics with spark.readStream in 1 single running spark app. After transforming the event payloads, we save them with spark.writeStream to our database. For one of the topics, the ...

nikitira

86

asked Oct 13, 2022 at 13:29

0 votes

1 answer

715 views

spark-submit --packages returns Error: Missing application resource

I installed .NET for Apache Spark using the following guide: https://learn.microsoft.com/en-us/dotnet/spark/tutorials/get-started?WT.mc_id=dotnet-35129-website&tabs=windows The Hello World worked. ...

Kenci

4,902

asked Oct 6, 2022 at 16:43

0 votes

1 answer

247 views

Spark Structured Streaming - Kafka - Missing required configuration "partition.assignment.strategy

That is my code. import findspark findspark.init() import os os.environ[ "PYSPARK_SUBMIT_ARGS" ] = "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 pyspark-shell" ...

datadibi

5

asked Oct 2, 2022 at 12:35

0 votes

0 answers

262 views

Databricks Kafka Read Not connecting

I'm trying to read data from GCP kafka through azure databricks but getting below warning and notebook is simply not completing. Any suggestion please? WARN NetworkClient: Consumer groupId Bootstrap ...

Syed Mohammed Mehdi

313

asked Sep 17, 2022 at 11:15

2 votes

1 answer

2k views

PySpark Kafka - java.lang.NoClassDefFoundError: org/apache/kafka/common/security/JaasContext

I am encountering problem with printing the data to console from kafka topic. The error message I get is shown in below image. 22/09/06 10:14:02 ERROR MicroBatchExecution: Query [id = ba6cb0ca-a3b1-...

Ulrich Tedongmo Douanla

21

asked Sep 6, 2022 at 8:36

1 vote

0 answers

48 views

sparkstreaming kafka cunsumer auto close,

I don't want to use one consumer for all topics, I want to use this method to improve consumption efficiency val kafkaParams = Map( ConsumerConfig.GROUP_ID_CONFIG -> group, ...

1580923067

29

asked Aug 24, 2022 at 5:10

3 votes

3 answers

3k views

NoSuchMethodError: org.apache.spark.sql.kafka010.consumer

I am using Spark Structured Streaming to read messages from multiple topics in kafka. I am facing below error: java.lang.NoSuchMethodError: org.apache.spark.sql.kafka010.consumer....

Buzz97

53

asked Jul 27, 2022 at 13:55

1 vote

0 answers

112 views

Spark Structured Streaming inconsistent output to multiple sinks

I am using spark structured streaming to read data from Kafka and apply some udf to the dataset. The code as below : calludf = F.udf(lambda x: function_name(x)) dfraw = spark.readStream.format('kafka'...

Taimoor Abbasi

101

asked Jul 26, 2022 at 6:15

0 votes

2 answers

773 views

PySpark - NoClassDefFoundError: kafka/common/TopicAndPartition

I'm running Spark version 2.3.0.2.6.5.1175-1 with Python 3. 6.8 on Ambari. While submitting the application I get the following logs in stderr 22/06/15 12:29:31 INFO StateStoreCoordinatorRef: ...

Taimoor Abbasi

101

asked Jun 15, 2022 at 7:52

1 vote

2 answers

8k views

How to send pyspark dataframe to kafka topic?

pyspark version - 2.4.7 kafka version - 2.13_3.2.0 Hi, I am new to pyspark and streaming properties. I have come across few resources in the internet, but still I am not able to figure out how to send ...

subh

27

asked Jun 13, 2022 at 12:12

0 votes

1 answer

357 views

Spark Streaming - Kafka - Java -jar not working but running by Java application it works

I've a simple Java Spark script. That basically it's to return kafka data: import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.functions; import org.apache.spark.sql.Dataset; import ...

Pedro Alves

1,054

asked Jun 7, 2022 at 15:58

0 votes

1 answer

305 views

pyspark: how to perform structured streaming using KafkaUtils

I am doing a structured streaming using SparkSession.readStream and writing it to hive table, but seems it does not allow me to time-based micro-batches, i.e. I need a batch of 5 secs. All the ...

aiman

1,115

asked Apr 25, 2022 at 10:50

0 votes

1 answer

169 views

Get 2 different data from 1 kafka topic into 2 dataframes

I have a homework like this: Use python to read json files in 2 folders song_data and log_data. Use Python Kafka to publish a mixture of both song_data and log_data file types into a Kafka topic. Use ...

JustStartlDev

39

asked Apr 13, 2022 at 13:27

1 vote

2 answers

2k views

How to run Spark structured streaming using local JAR files

I'm using one of the Docker images of EMR on EKS (emr-6.5.0:20211119) and investigating how to work on Kafka with Spark Structured Programming (pyspark). As per the integration guide, I run a Python ...

Jaehyeon Kim

1,417

asked Mar 7, 2022 at 1:18

0 votes

1 answer

2k views

Spark structured stream with tumbling window delayed and duplicate data

I am attempting to read from a kafka topic, aggregate some data over a tumbling window and write that to a sink (I've been trying both with Kafka and console). The problems I'm seeing are a long ...

Barnesly

93

asked Feb 8, 2022 at 14:59

Collectives™ on Stack Overflow