All Questions
Tagged with elastic-map-reduce or amazon-emr
4,988 questions
-3
votes
1
answer
145
views
Flink Job Manager Direct Buffer Memory gets exhausted when checkpointing enabled
Issue:
The Flink application throws "Thread 'jobmanager-io-thread-25' produced an uncaught exception. java.lang.OutOfMemoryError: Direct buffer memory" and terminates after running for 2-3 days.
No matter ...
0
votes
0
answers
76
views
Unexpected Write Behavior when using MERGE INTO/INSERT INTO Iceberg Spark Queries
I am observing different write behaviors when executing queries on EMR Notebook (correct behavior) vs when using spark-submit to submit a spark application to EMR Cluster (incorrect behavior).
When I ...
0
votes
0
answers
70
views
EMR Spark Job Fails to Connect to MSK with IAM Auth - Timeout Waiting for Node Assignment Error
I am running an Apache Spark job on Amazon EMR that needs to connect to an Amazon MSK cluster configured with IAM authentication. The EMR cluster has an IAM role with full MSK permissions, and I can ...
1
vote
0
answers
67
views
Sagemaker Unified Studio overriding delta lake configuration to iceberg on EMR
I am connecting to an EMR cluster through SageMaker Unified Studio (JupyterLab).
My EMR cluster is configured with Delta Lake support, and I have the following Spark properties set on the cluster:
...
0
votes
0
answers
61
views
How do you expire snapshot from Iceberg Glue Table
I have one Iceberg table in the Glue Catalog. I am unable to run a select * because one of the metadata files is missing. I am trying to point to the latest metadata file. How can I do that? I am using EMR 7.7 with ...
2
votes
0
answers
166
views
Unable to connect to EMR cluster from SageMaker Unified Studio using runtime role – credentials are null
I'm trying to connect to an existing EMR cluster from SageMaker Unified Studio to run SQL queries via JupyterLab.
SageMaker requires that the EMR cluster be runtime role-enabled to integrate with ...
0
votes
1
answer
59
views
Unable to register database/table in aws glue when hudi job is submitted from emrserverless
I am using EMR 6.15 and Hudi 0.14.
I submitted the following Hudi job, which should create a database and a table in AWS Glue. The IAM role assigned to EMR Serverless has all necessary permissions for S3 and ...
1
vote
0
answers
56
views
Spark Dynamic Resource Allocation Configuration while using IBM S3 Shuffle Plugin on EMR on EKS
I have successfully implemented the IBM S3 Shuffle Plugin v0.9.6 (https://github.com/IBM/spark-s3-shuffle) on EMR on EKS (Spark 3.5.0) and the shuffle operations are working correctly with S3 storage. ...
0
votes
1
answer
141
views
Why Iceberg load is creating many folders in s3?
I am writing data into S3, and the table format is Iceberg in the Glue Catalog. I see the /data and /metadata folders getting created. However, when I write data, it creates folders named like 001/002. ...
0
votes
0
answers
40
views
Installing external python packages on EMR on EC2
I want to install external Python packages on EMR with an EC2 setup, but currently, apart from bootstrap actions, nothing else seems to be working. The problem with this setup is that if I want to ...
3
votes
1
answer
104
views
EMR on EKS: Dynamic Allocation + FSx Lustre -- Executors with shuffle data won't terminate despite idle timeout
Having trouble getting dynamic allocation to properly terminate idle executors when using FSx Lustre for shuffle persistence on EMR 7.8 (Spark 3.5.4) on EKS. Trying this strategy out to battle cost ...
0
votes
0
answers
41
views
Data write into Iceberg Glue Table (saveAsTable vs option("path", s3_output_path))
I am exploring writing data into a Glue table (Iceberg table format). I have been using the saveAsTable method, mentioned as option1. However, is there any difference between the two methods? Iceberg stores ...
0
votes
1
answer
104
views
Can not read from S3 with AssumedRoleCredentialProvider after upgrade from EMR serverless 6.9 to 7.5
I have a PySpark script that reads data from S3 in a different AWS account using AssumedRoleCredentialProvider. It works on EMR Serverless 6.9, but when I upgrade to EMR Serverless 7.5 it fails ...
0
votes
0
answers
33
views
Unable to access Livy after enabling IAM Identity Center (SSO) on my EMR cluster
I have an EMR cluster configured with the following SecurityConfiguration:
"AuthenticationConfiguration": {
"IdentityCenterConfiguration": {
"EnableIdentityCenter":...
0
votes
0
answers
59
views
How to extract a string which contains a digit followed by a letter?
Given the JSON below:
{
"environments": [
{"env": "dev", "description": "dev environment"},
{"env": "dev01", "...
0
votes
0
answers
50
views
How to best partition my data with a 32 core EMR instance and make sure I max out the parallelize feature?
I’m optimizing a PySpark pipeline that processes records with a heavily skewed categorical column (category). The data has:
A few high-frequency categories (e.g., 90% of records fall into 2-3 ...
0
votes
0
answers
61
views
FileNotFound Exception occurs when pyspark write after persist().count()
I encountered java.io.FileNotFoundException in AWS EMR batch.
My code processes data as below :
updateDF = spark.read.load(paths, ..)
userIDs = uniqueList(matchedUserIDs)
nones = [None for _ in range(...
0
votes
0
answers
23
views
How to sort time parser error when using EMR and pyspark script used as step
I am getting this error when running EMR with a notebook that passes some dates:
An error occurred: An error occurred while calling o236.showString.
: org.apache.spark.SparkException: Job aborted due ...
0
votes
1
answer
42
views
how to build pyspark emr app using python to spin and apply the steps?
I am trying to update apache-sedona to version 1.7.1, but it keeps failing when the cluster is spun up using Python functions.
If I spin up the cluster manually everything works fine and ...
0
votes
0
answers
44
views
Insert overwrite multiple partitions in an external Hive table
I am trying to overwrite multiple partitions in a large table. Basically I have my main external S3 table sandbox, partitioned by part:
scala> q("select * from sandbox")
+---+-------------...
0
votes
0
answers
58
views
Calling SQL Server stored procedure via AWS EMR PySpark notebook with JDBC driver
I'm trying to load data from SQL Server from a stored procedure in PySpark using the JDBC driver.
Calling a stored procedure is supposed to be possible with this driver according to this
I've tried the ...
0
votes
0
answers
63
views
AWS EMR Serverless Spark resources timeout
I have a Spark job on EMR Serverless that loads some shapefile data (geolocation data) from an S3 bucket; all components are deployed in eu-west-1.
This spark job is scheduled to run hourly (via Airflow), ...
0
votes
0
answers
43
views
Setting JAR in EMR workspace using EMR Serverless application
I have since switched from EMR on EC2 to EMR serverless. I used to use interactive notebooks with EMR on EC2.
I am trying to use the EMR studio workspace (notebooks) with EMR serverless application ...
0
votes
1
answer
152
views
How can I allow an AWS EMR Cluster to create service-linked roles
I'm trying to stand up a new cluster in AWS EMR, but it immediately fails with the following error:
Service-linked role 'AWSServiceRoleForEMRCleanup' for EMR is required.
Please create this role ...
2
votes
1
answer
67
views
Pyspark JDBC read with partitions
I'm reading data in pyspark from postgres using jdbc connection. The table being read is large, about 240 million rows. I'm attempting to read it into 16 partitions. The read is being performed like ...
1
vote
0
answers
58
views
unexplained awseditorssparkmonitoringwidget KeyError
I ran a Jupyter PySpark notebook on an EMR 7.3.0 cluster and encountered the error below after a simple df.count() call. This was not an issue with my code; the same dataframe (df) had already been ...
0
votes
1
answer
344
views
How do you get AWS EMR to access a Lake Formation Resource Link Table
I have a Lake Formation resource link database table, from another AWS account, which I can query in Athena just fine with permissions. But I cannot query this data in EMR. The permission access ...
-1
votes
1
answer
187
views
Incremental lakehouse update
I'm implementing a lakehouse (Apache Iceberg) with PySpark and I'm running into some issues. I come from a SQL background, so I was originally trying to implement this solution in the same way I ...
0
votes
0
answers
42
views
How to connect glue iceberg table from presto in EMR?
I am not able to access the iceberg table created using spark and present in glue catalog.
Error Message:
Query 20250113_163609_00011_ypmu9 failed: Not a Hive table 'search_iceberg.gweekly'
Cluster ...
0
votes
0
answers
67
views
S3 Access via EMR Serverless using PySpark
I am trying to access an S3 bucket on Account B from Account A using Python and PySpark from EMR Studio on Serverless. I can access the data using Python via a cross-account IAM role but get an ...
0
votes
0
answers
104
views
How to get the EMR Serverless Job URL with EMR Studio Information Missing from the Event?
I'm working with AWS EMR Serverless, and I need to construct a job URL for an EMR Serverless job to be sent in a message notification in case of state change. The desired URL includes the associated ...
0
votes
0
answers
203
views
Iceberg table load is failing in AWS EMR/Glue
I am working on a script to load data into an Iceberg table using AWS Glue/EMR (tried in both).
Error message:
pyspark.errors.exceptions.captured.AnalysisException: Cannot write into v1 table: ...
0
votes
0
answers
25
views
Increase number of concurrent tasks on EMR spark shell
I am trying to use EMR spark-shell to do some analysis etc on a ~5 TB dataset that lives in S3, so I have a 32 x i3.16xlarge cluster.
If I start spark shell with default configurations, I get exactly ...
0
votes
0
answers
76
views
pyspark drop_duplicates() unexpectedly increases count
I am getting somewhat unexpected results with df.drop_duplicates().
df2 = df.dropDuplicates()
print(df2.count())
# prints 424527
print(df.count())
# prints 424510
I do not understand why count is ...
0
votes
0
answers
18
views
EMR ResourceManager UI Physical Mem Used %
I'm using AWS EMR with Hadoop and YARN, and when I go to the UI of the RM I can see information like "Physical Mem Used %" and "Physical VCores Used %". I cannot find anything online (...
0
votes
1
answer
181
views
Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
When I try to run a pyspark step on my EMR cluster I get an error Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found. My understanding from AWS ...
1
vote
0
answers
26
views
BootstrapActions Failing to create a hdfs directory
I need to create an HDFS folder for my run_job_flow to work. Currently I am using this sh script command, sudo -u hdfs hdfs dfs -mkdir -p /apps/hudi/lib, but for some reason I am getting this error: ...
0
votes
0
answers
78
views
AWS Emr serverless spark: Live UI takes a few seconds to update due to its asynchronous nature. Please check again in a few seconds
Unable to see the live Spark UI for AWS EMR Serverless Spark jobs. Once a job is completed, the UI is available, but it is not available for running jobs.
Message:
Live UI takes a few seconds to update due to its ...
1
vote
0
answers
77
views
AWS EMR 7.3 not showing any logs from my java program
My use case is to create an EMR 7.3 cluster in AWS and invoke a Lambda that makes an HTTP request to pass the configs and payload over to initiate a Spark job.
These are my software settings:
[
{
"Classification&...
0
votes
1
answer
133
views
EMR: Pyspark conda environment error on AWS Graviton
The same PySpark code works on r7a but not on r7g or r8g on an EMR cluster (7.5).
I build the python environment with conda, and use it in pyspark:
conda create -n pyspark python=3.9 --show-channel-urls --...
0
votes
1
answer
62
views
apache-beam installation issue on AWS EMR-EC2 cluster
I started an AWS EMR-EC2 cluster, and I am having trouble getting the SparkRunner of apache-beam to work.
I have a python script that will use apache-beam. I have tried either aws emr add-steps or ssh ...
1
vote
0
answers
108
views
Config params are not propagated when using Spark Connect
I am trying to get Spark Connect working on Amazon EMR (Spark v3.5.1). I started the Connect server on EMR primary node, making sure the JARs required for S3 auth are present in the Classpath:
/usr/...
-1
votes
2
answers
388
views
Jackson Databind Conflicts in Apache Spark Project Using Maven Shade Plugin
I am working on a project that processes IMDb data using Apache Spark. My setup involves Spark Core and Spark SQL dependencies, along with Jackson for handling JSON serialization and deserialization. ...
0
votes
1
answer
73
views
Pyspark error: " Class org.apache.hadoop.fs.s3a.S3AFileSystem not found" in EMR 7.0.0
I am using EMR 7.0.0 version, which has python 3.9, spark 3.5.0, Hadoop 3.3.6 in AWS.
I got the error:
File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/...
0
votes
1
answer
76
views
Jobs failed in airflow, despite Spark History UI jobs stuck in running. AWS Serverless
Has anyone ever experienced jobs failing in Airflow while the Spark History UI shows the jobs still stuck in running? Also, after I added a line of code to write the data to S3 (without reading it ...
0
votes
1
answer
136
views
Flink Job Execution Fails with `NoClassDefFoundError` on AWS EMR with Python
I am trying to run a Flink job on an AWS EMR cluster (v7.3.0) using Python 3.9 and Apache Flink with PyFlink. My job reads from an AWS Kinesis stream and prints the stream data to console. However, ...
-2
votes
1
answer
82
views
spark-submit using --py-files option could not find path to modules
I am trying to submit a PySpark job to an EMR cluster. The code for the job lies in a zipped package that is placed in S3:
/bin/spark-submit \
--py-files s3://my-dev/scripts/job-launchers/dev/pipeline.zip ...
1
vote
0
answers
102
views
Spark EMR executor container failing due to Java heap space
One of my Spark jobs is failing because an executor container fails with "java.lang.OutOfMemoryError: Java heap space". Any recommendation is appreciated.
I am using emr 200 -r7g16xlarge ...
1
vote
0
answers
75
views
Optimizing PySpark Feature Engineering with Over a Billion Rows on EMR
I’m working with a large transaction dataset (~1 billion rows) in PySpark on AWS EMR. My goal is to perform feature engineering where I compute statistics like sum, mean, standard deviation, and ...
1
vote
0
answers
104
views
Spark Shuffle FetchFailedException for large dataset in emr
My job takes a 400 TB Parquet dataset from S3 as input. This job runs with 250 r716x large (each having 64 vCore, 488 GiB memory). The job fails with the error below: org.apache.spark.shuffle....