108 questions
0
votes
0
answers
70
views
EMR Spark Job Fails to Connect to MSK with IAM Auth - Timeout Waiting for Node Assignment Error
I am running an Apache Spark job on Amazon EMR that needs to connect to an Amazon MSK cluster configured with IAM authentication. The EMR cluster has an IAM role with full MSK permissions, and I can ...
0
votes
0
answers
49
views
[Spark-stream]: Stream Batches processing time reduce over time causing Kafka Lag
I have been using Spark v3.5 Spark Stream functionality for the below use case. I am observing the issue below on one of the environments with Spark Stream. Please if I can get some assistance with ...
0
votes
1
answer
142
views
Unable to push the data from the written kafka topic to Postgres table
I am trying to load the data written into the Kafka topic into the Postgres table. I can see the topic is receiving new messages every second and also the data looks good.
However, when I use the ...
0
votes
1
answer
122
views
ERROR SparkContext: Failed to add spark-streaming-kafka-0-10_2.13-3.5.2.jar
ERROR SparkContext: Failed to add home/areaapache/software/spark-3.5.2-bin-hadoop3/jars/spark-streaming-kafka-0-10_2.13-3.5.2.jar \
to Spark environment
import logging
from pyspark.sql import ...
3
votes
0
answers
79
views
Is there options to send Spark streaming executor metrics directly instead via driver?
I have Spark Streaming application lives on Argo + K8S that reads Kafka topics by subscribe pattern then there are some transformations and writing to a target.
Several different producers may write ...
0
votes
1
answer
290
views
Save a file in Databricks Workspace using Scala/Java
My goal is to run a Spark job using Databricks, and my challenge is that I can't store files in the local filesystem since the file is saved in the driver, but when my executors tried to access the ...
1
vote
0
answers
384
views
Spark : java.lang.NoClassDefFoundError: org/apache/spark/kafka010/KafkaConfigUpdater
I am working on spark streaming and reading data from kafka topic, but getting error java.lang.NoClassDefFoundError: org/apache/spark/kafka010/KafkaConfigUpdater. Running my code in K8s and provide ...
2
votes
0
answers
130
views
Multiple Kafka source topic + Spark Structured streaming + multiple delta table sink
I have multiple topics in kafka that I need to sink in their respective delta table.
A) 1 Streaming query for all topics
If i use one streaming query, then the RDD/DF should contains data from ...
2
votes
1
answer
3k
views
ClassNotFoundException for scala.$less$colon$less. Problem with different Scala versions?
When I try to run this .py:
import logging
from cassandra.cluster import Cluster
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import ...
2
votes
1
answer
548
views
"java.lang.NoSuchMethodError: 'scala.collection.JavaConverters$AsJava scala.collection" error when I stream kafka messages using Pyspark
I am in a bind here. I am trying to implement a very basic pipeline which reads data from kafka and process it in Spark. The problem I am facing is that apache spark shuts down abruptly giving the ...
1
vote
0
answers
97
views
spark-connect with standalone spark cluster error
I'm trying to read stream from Kafka using pyspark.
The Stack I'm working with:
Kubernetes.
Stand alone spark cluster with 2 workers.
spark-connect connected to the cluster and has the dependencies ...
0
votes
1
answer
167
views
Py4JJavaError An error occurred while calling javalangNoSuchMethodError org.apache.spark.sql.AnalysisException org.apache.spark.sql.kafka.KafkaWriter
I can't write to Kafka from Spark, Spark is reading but not writing, if I write to the console it doesn't give an error
Traceback (most recent call last):
File "f:\Sistema de Informação\TCC\...
0
votes
1
answer
78
views
Spark incoming JSON stream processing
I have been trying to complete a project in which I needed to send data stream using kafka to local Spark to process the incoming data. However I can not show and use the data frame in the right ...
0
votes
0
answers
33
views
Failed to find data source: kafka. Please deploy the application as per the deployment section of Structured Streaming + Kafka Integration Guide Spark [duplicate]
Hello I am trying to use pyspark + kafka in order to do this I execute this command in order to set up the Spark application
Spark version is 3.5.0 | spark-3.5.0-bin-hadoop3
Kafka version is - ...
0
votes
0
answers
268
views
Spark-Kafka Integration not working: Kafka broker with producer and consumer script are getting stuck as soon as we run spark script(consumer.py)
I'm trying to read data from kafka topic by using spark structured streaming on ec2(ubuntu) machine.
If I try to read the data by using kafka stream only(kafka-console-consumer.sh) then there is no ...
0
votes
1
answer
1k
views
How to add Kafka dependencies for PySpark on a Jupyter notebook
I have setup kafka 2.1 on windows and able to successfully communicate a topic from producer to consumer over localhost:9092.
I now want to consume this in a spark structured stream.
For this I setup ...
0
votes
0
answers
279
views
Spark Kafka: understanding offset management with enable.auto.commit
according to the Kafka documentation offset in Kafka can be managed using enable.offset.commit and auto.commit.interval.ms. I have difficulties understanding the concept.
For example I have a Kafka ...
0
votes
1
answer
3k
views
Running Kafka and Spark with docker-compose
My goal is to send/produce a txt from my Windows PC to a container running Kafka, to then be consume by pyspak (running in other container).
I'm using docker-compose where I define a custom net and ...
0
votes
1
answer
423
views
How to map a message to a object with `schema` and `payload` in Spark structured streaming correctly?
I am hoping to map a message to a object with schema and payload inside in during Spark structured streaming.
This is my original code
val input_schema = new StructType()
.add("timestamp", ...
1
vote
1
answer
1k
views
Difference between spark-streaming-kafka-0-10 vs spark-sql-kafka-0-10
I am hoping to read a parquet file and write to Kafka
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.functions.to_json
object ...
0
votes
1
answer
573
views
Getting Error for org.apache.spark.sql.Encoder and missing or invalid dependency find while loading class file SQLImplicits, LowPrioritySQLImplicits
I am running following code to read kafka stream with spark-3.2.2, and scala 2.12.0.
Earlier same code was working fine with spark-2.2 and scala 2.11.8,
import spark.implicits._
val kafkaStream = ...
0
votes
0
answers
176
views
What is the best way to set up a kafka connection with apache spark
How to make the kafka stream more stable? as it it will run constantly without having us to start the run again after it fails (so far we are thinking about using the "continous" run mode to ...
2
votes
1
answer
191
views
I have more data in a kafka topic but when i extract data using my pyspark application, I am getting only 1 row extracted, how to fix?
I have more data in a kafka topic but when i extract data using my pyspark application (which I use to extract from different kafka topics), I am getting only 1 row extracted. Previously I had ...
0
votes
0
answers
99
views
How to read DF Column of struct type and add they key value pairs to kafka headers?
I have a new dataframe with 2 columns one is headers and other is a payload, facing issues in reading the headers column and assigning the values to kafka headers while publishing.
earlier the ...
1
vote
0
answers
396
views
spark streaming from kafka on spark operator(Kubernetes)
I have a spark structured streaming job in scala, reading from kafka and writing to S3 as hudi tables. Now I am trying to move this job to spark operator on EKS.
When I give the option in the yaml ...
0
votes
1
answer
74
views
Pyspark : Kafka consumer for multiple topics
I have a list of topics (for now it's 10) whose size can increase in future. I know we can spawn multiple threads (per topic) to consume from each topic, but in my case if the number of topics ...
0
votes
0
answers
1k
views
I am facing "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate" error while working with pyspark
I am using this tech stack:
Spark version: 3.3.1
Scala Version: 2.12.15
Hadoop Version: 3.3.4
Kafka Version: 3.3.1
I am trying to get data from kafka topic through spark structure streaming, But I am ...
0
votes
1
answer
1k
views
java.lang.ClassNotFoundException: org.apache.spark.sql.kafka010.KafkaSourceRDDPartition
I am using Spark2.3.0 and kafka1.0.0.3. I have created a spark read stream
df = spark.readStream. \
format("kafka"). \
option("kafka.bootstrap.servers", "...
1
vote
1
answer
785
views
Error while pulling kafka jks certificates from hdfs (trying with s3 as well) in spark
I am Running spark in cluster mode which is giving error as
ERROR SslEngineBuilder: Modification time of key store could not be obtained: hdfs://ip:port/user/hadoop/jks/kafka.client.truststore.jks
...
0
votes
1
answer
2k
views
How to process the dataframe which was read from Kafka Topic using Spark Streaming
I'm able to stream twitter data into my Kafka topic via a producer. When I try to consume through the default Kafka consumer I'm able to see the tweets as well.
But when I try to use Spark Streaming ...
1
vote
0
answers
222
views
problem with spark structured streaming after restart
I have a simple pyspark code,which reads data from kafka and write aggregated records to oracle with foreachbatch.I have set checkpointLocation on hdfs dir and it works well.when I kill application ...
1
vote
0
answers
75
views
How to get commitedOffsets and availableOffsets from sparkstreaming
22/11/09 11:08:40 INFO MicroBatchExecution: Resuming at batch 206 with committed offsets {KafkaV2[Subscribe[test]]:
{"test":{"0":3086,"1":3086,"2":3086,"3&...
1
vote
0
answers
358
views
Inferring a schema of a kafka topic taking too much time in databricks
I'm trying to determine the schema of a json kafka topic. To achieve that, I lifted a code part from this blog(https://medium.com/wehkamp-techblog/streaming-kafka-topic-to-delta-table-s3-with-spark-...
0
votes
1
answer
346
views
How to writestream to specific kafka cluster in Azure databricks? " Topic mytopic not present in metadata after 60000 ms."
I am trying to write data to Kafka with the writestream method.
I have been given the following properties by the source system.
topic = 'mytopic'
host = "myhost.us-west-1.aws.confluent.cloud:...
3
votes
0
answers
576
views
Structured Streaming - suddenly giving error while writing to (Strimzi)Kafka topic
i've a Structured Streaming code which reads data from a Kafka Topic (on a VM) & writes to another Kafka Topic on GKE (i should be using a Mirror Maker for this, but have not implemented that yet)....
1
vote
2
answers
496
views
PySpark Structured Streaming with Kafka - Scaling Consumers for multiple topics with different loads
We subscribed to 7 topics with spark.readStream in 1 single running spark app.
After transforming the event payloads, we save them with spark.writeStream to our database.
For one of the topics, the ...
0
votes
1
answer
715
views
spark-submit --packages returns Error: Missing application resource
I installed .NET for Apache Spark using the following guide:
https://learn.microsoft.com/en-us/dotnet/spark/tutorials/get-started?WT.mc_id=dotnet-35129-website&tabs=windows
The Hello World worked.
...
0
votes
1
answer
247
views
Spark Structured Streaming - Kafka - Missing required configuration "partition.assignment.strategy
That is my code.
import findspark
findspark.init()
import os
os.environ[
"PYSPARK_SUBMIT_ARGS"
] = "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 pyspark-shell"
...
0
votes
0
answers
262
views
Databricks Kafka Read Not connecting
I'm trying to read data from GCP kafka through azure databricks but getting below warning and notebook is simply not completing. Any suggestion please?
WARN NetworkClient: Consumer groupId Bootstrap ...
2
votes
1
answer
2k
views
PySpark Kafka - java.lang.NoClassDefFoundError: org/apache/kafka/common/security/JaasContext
I am encountering problem with printing the data to console from kafka topic. The error message I get is shown in below image.
22/09/06 10:14:02 ERROR MicroBatchExecution: Query [id = ba6cb0ca-a3b1-...
1
vote
0
answers
48
views
sparkstreaming kafka cunsumer auto close,
I don't want to use one consumer for all topics, I want to use this method to improve consumption efficiency
val kafkaParams = Map(
ConsumerConfig.GROUP_ID_CONFIG -> group,
...
3
votes
3
answers
3k
views
NoSuchMethodError: org.apache.spark.sql.kafka010.consumer
I am using Spark Structured Streaming to read messages from multiple topics in kafka.
I am facing below error:
java.lang.NoSuchMethodError: org.apache.spark.sql.kafka010.consumer....
1
vote
0
answers
112
views
Spark Structured Streaming inconsistent output to multiple sinks
I am using spark structured streaming to read data from Kafka and apply some udf to the dataset. The code as below :
calludf = F.udf(lambda x: function_name(x))
dfraw = spark.readStream.format('kafka'...
0
votes
2
answers
773
views
PySpark - NoClassDefFoundError: kafka/common/TopicAndPartition
I'm running Spark version 2.3.0.2.6.5.1175-1 with Python 3.
6.8 on Ambari. While submitting the application I get the following logs in stderr
22/06/15 12:29:31 INFO StateStoreCoordinatorRef: ...
1
vote
2
answers
8k
views
How to send pyspark dataframe to kafka topic?
pyspark version - 2.4.7
kafka version - 2.13_3.2.0
Hi, I am new to pyspark and streaming properties. I have come across few resources in the internet, but still I am not able to figure out how to send ...
0
votes
1
answer
357
views
Spark Streaming - Kafka - Java -jar not working but running by Java application it works
I've a simple Java Spark script. That basically it's to return kafka data:
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.Dataset;
import ...
0
votes
1
answer
305
views
pyspark: how to perform structured streaming using KafkaUtils
I am doing a structured streaming using SparkSession.readStream and writing it to hive table, but seems it does not allow me to time-based micro-batches, i.e. I need a batch of 5 secs. All the ...
0
votes
1
answer
169
views
Get 2 different data from 1 kafka topic into 2 dataframes
I have a homework like this:
Use python to read json files in 2 folders song_data and log_data.
Use Python Kafka to publish a mixture of both song_data and log_data file types into a Kafka topic.
Use ...
1
vote
2
answers
2k
views
How to run Spark structured streaming using local JAR files
I'm using one of the Docker images of EMR on EKS (emr-6.5.0:20211119) and investigating how to work on Kafka with Spark Structured Programming (pyspark). As per the integration guide, I run a Python ...
0
votes
1
answer
2k
views
Spark structured stream with tumbling window delayed and duplicate data
I am attempting to read from a kafka topic, aggregate some data over a tumbling window and write that to a sink (I've been trying both with Kafka and console).
The problems I'm seeing are
a long ...