Newest 'elastic-map-reduce' Questions

-3 votes

1 answer

145 views

Flink Job Manager Direct Buffer Memory gets exhausted when checkpointing enabled

Issue: Flink application throws Thread 'jobmanager-io-thread-25' produced an uncaught exception. java.lang.OutOfMemoryError: Direct buffer memory and terminates after running for 2-3 days. No matter ...

Strange

1,514

asked Nov 12 at 18:14

0 votes

0 answers

76 views

Unexpected Write Behavior when using MERGE INTO/INSERT INTO Iceberg Spark Queries

I am observing different write behaviors when executing queries on EMR Notebook (correct behavior) vs when using spark-submit to submit a spark application to EMR Cluster (incorrect behavior). When I ...

shiva

2,781

asked Oct 21 at 20:58

0 votes

0 answers

70 views

EMR Spark Job Fails to Connect to MSK with IAM Auth - Timeout Waiting for Node Assignment Error

I am running an Apache Spark job on Amazon EMR that needs to connect to an Amazon MSK cluster configured with IAM authentication. The EMR cluster has an IAM role with full MSK permissions, and I can ...

Vishwas Singh

1

asked Oct 1 at 11:20

1 vote

0 answers

67 views

SageMaker Unified Studio overriding delta lake configuration to iceberg on EMR

I am connecting to an EMR cluster through SageMaker Unified Studio (JupyterLab). My EMR cluster is configured with Delta Lake support, and I have the following Spark properties set on the cluster: ...

sakshi

41

asked Sep 11 at 17:55

0 votes

0 answers

61 views

How do you expire snapshot from Iceberg Glue Table

I have one Iceberg table in Glue Catalog. I am unable to run a select * as one of the metadata files is missing. I am trying to point to the latest metadata file. How can I do that? I am using EMR 7.7 with ...

user3858193

1,558

asked Aug 18 at 16:41

2 votes

0 answers

166 views

Unable to connect to EMR cluster from SageMaker Unified Studio using runtime role – credentials are null

I'm trying to connect to an existing EMR cluster from SageMaker Unified Studio to run SQL queries via JupyterLab. SageMaker requires that the EMR cluster be runtime role-enabled to integrate with ...

valzor

315

asked Jul 30 at 19:00

0 votes

1 answer

59 views

Unable to register database/table in aws glue when hudi job is submitted from emrserverless

I am using emr 6.15 and hudi 0.14. I submitted the following hudi job, which should create a database and a table in aws glue. The IAM role assigned to EMR serverless has all necessary permissions of s3 and ...

Roobal Jindal

294

asked Jul 9 at 7:00

1 vote

0 answers

56 views

Spark Dynamic Resource Allocation Configuration while using IBM S3 Shuffle Plugin on EMR on EKS

I have successfully implemented the IBM S3 Shuffle Plugin v0.9.6 (https://github.com/IBM/spark-s3-shuffle) on EMR on EKS (Spark 3.5.0) and the shuffle operations are working correctly with S3 storage. ...

metersk

12.7k

asked Jul 1 at 16:26

0 votes

1 answer

141 views

Why Iceberg load is creating many folders in s3?

I am writing data into s3 and table format is Iceberg in Glue Catalog. I see the /data and /metadata folders are getting created. However when I am writing data, it's creating 001/002 kind of folders. ...

user3858193

1,558

asked Jun 28 at 11:19

0 votes

0 answers

40 views

Installing external python packages on EMR on EC2

I want to install external Python packages on EMR with an EC2 setup, but currently, apart from bootstrap actions, nothing else seems to be working. The problem with this setup is that if I want to ...

RushHour

645

asked Jun 27 at 6:23

3 votes

1 answer

104 views

EMR on EKS: Dynamic Allocation + FSx Lustre -- Executors with shuffle data won't terminate despite idle timeout

Having trouble getting dynamic allocation to properly terminate idle executors when using FSx Lustre for shuffle persistence on EMR 7.8 (Spark 3.5.4) on EKS. Trying this strategy out to battle cost ...

metersk

12.7k

asked Jun 26 at 18:59

0 votes

0 answers

41 views

Data write into Iceberg Glue Table (saveAsTable vs option("path", s3_output_path))

I am exploring data writes into a Glue table (Iceberg table format). I have been using the saveAsTable method, mentioned as option1. However, is there any difference between the two methods? Iceberg stores ...

user3858193

1,558

asked Jun 26 at 15:21

0 votes

1 answer

104 views

Can not read from S3 with AssumedRoleCredentialProvider after upgrade from EMR serverless 6.9 to 7.5

I have a pyspark script that reads data from S3 in a different AWS account using AssumedRoleCredentialProvider. It works on EMR Serverless 6.9, but when I upgrade to EMR Serverless 7.5 it fails ...

Sayed

11

asked Jun 14 at 16:00

0 votes

0 answers

33 views

Unable to access Livy after enabling IAM Identity Center (SSO) on my EMR cluster

I have an EMR cluster configured with the following SecurityConfiguration: "AuthenticationConfiguration": { "IdentityCenterConfiguration": { "EnableIdentityCenter":...

ExK

1

asked Jun 3 at 17:08

0 votes

0 answers

59 views

How to extract a string which contains a digit followed by a letter?

Given the below JSON: { "environments": [ {"env": "dev", "description": "dev environment"}, {"env": "dev01", "...

dplvs

43

asked Apr 30 at 5:00

0 votes

0 answers

50 views

How to best partition my data with a 32 core EMR instance and make sure I max out the parallelize feature?

I’m optimizing a PySpark pipeline that processes records with a heavily skewed categorical column (category). The data has: A few high-frequency categories (e.g., 90% of records fall into 2-3 ...

Bilal Jamil

27

asked Apr 30 at 2:51

0 votes

0 answers

61 views

FileNotFound Exception occurs when pyspark write after persist().count()

I encountered java.io.FileNotFoundException in AWS EMR batch. My code processes data as below: updateDF = spark.read.load(paths, ..) userIDs = uniqueList(matchedUserIDs) nones = [None for _ in range(...

Jisu Choi

1

asked Apr 11 at 5:46

0 votes

0 answers

23 views

How to sort time parser error when using EMR and pyspark script used as step

I am getting this error when running an EMR with a notebook passing some dates: An error occurred: An error occurred while calling o236.showString. : org.apache.spark.SparkException: Job aborted due ...

gcj

298

asked Apr 8 at 5:59

0 votes

1 answer

42 views

how to build pyspark emr app using python to spin and apply the steps?

I am trying to update apache-sedona to version 1.7.1, but it keeps failing when the cluster is spun up using python functions. If I spin up the cluster manually everything works fine and ...

gcj

298

asked Mar 27 at 14:06

0 votes

0 answers

44 views

Insert overwrite multiple partitions in an external Hive table

I am trying to overwrite multiple partitions in a large table. Basically I have my main external S3 table sandbox, partitioned by part: scala> q("select * from sandbox") +---+-------------...

kot

85

asked Mar 20 at 18:47

0 votes

0 answers

58 views

Calling SQL Server stored procedure via AWS EMR PySpark notebook with JDBC driver

I'm trying to load data from SQL Server from a stored procedure in PySpark using the JDBC driver. Calling a stored procedure is supposed to be possible with this driver according to this. I tried the ...

Vercinegetorix

119

asked Mar 14 at 17:35

0 votes

0 answers

63 views

AWS EMR Serverless Spark resources timeout

I have a Spark EMR Serverless job which loads some shapefile data (geolocation data) from an S3 bucket; all components are deployed on eu-west-1. This Spark job is scheduled to run hourly (via Airflow), ...

blackstorm

5

asked Mar 14 at 15:52

0 votes

0 answers

43 views

Setting JAR in EMR workspace using EMR Serverless application

I have since switched from EMR on EC2 to EMR serverless. I used to use interactive notebooks with EMR on EC2. I am trying to use the EMR studio workspace (notebooks) with EMR serverless application ...

DirtyDan

1

asked Feb 20 at 17:52

0 votes

1 answer

152 views

How can I allow an AWS EMR Cluster to create service-linked roles

I'm trying to stand up a new cluster in AWS EMR, but it immediately fails with the following error: Service-linked role 'AWSServiceRoleForEMRCleanup' for EMR is required. Please create this role ...

FoxMulder900

1,281

asked Feb 19 at 21:50

2 votes

1 answer

67 views

Pyspark JDBC read with partitions

I'm reading data in pyspark from postgres using jdbc connection. The table being read is large, about 240 million rows. I'm attempting to read it into 16 partitions. The read is being performed like ...

Kevin Smeeks

227

asked Jan 30 at 22:57

1 vote

0 answers

58 views

unexplained awseditorssparkmonitoringwidget KeyError

I ran a Jupyter PySpark notebook on an EMR 7.3.0 cluster and encountered the error below after a simple df.count() call. This was not an issue with my code; the same dataframe (df) had already been ...

mwarrior

589

asked Jan 21 at 20:02

0 votes

1 answer

344 views

How do you get AWS EMR to access a Lake Formation Resource Link Table

I have a Lake Formation resource link database table, from another AWS account, which I can query in Athena just fine with permissions. But I cannot query this data in EMR. The permission access ...

vfrank66

1,508

asked Jan 16 at 6:13

-1 votes

1 answer

187 views

Incremental lakehouse update

I'm implementing a lakehouse (Apache Iceberg) with Pyspark and I'm running into some issues. So I come from a SQL background so originally was trying to implement this solution in the same way I ...

user172839

1,075

asked Jan 15 at 0:40

0 votes

0 answers

42 views

How to connect glue iceberg table from presto in EMR?

I am not able to access the iceberg table created using spark and present in glue catalog. Error Message: Query 20250113_163609_00011_ypmu9 failed: Not a Hive table 'search_iceberg.gweekly' Cluster ...

user3858193

1,558

asked Jan 13 at 17:12

0 votes

0 answers

67 views

S3 Access via EMR Serverless using PySpark

I am trying to access an S3 bucket on Account B from Account A using Python and PySpark from EMR Studio on Serverless. I can access the data using Python via a cross-account IAM role but get an ...

Ashesh Aryak

1

asked Jan 13 at 11:40

0 votes

0 answers

104 views

How to get the EMR Serverless Job URL with EMR Studio Information Missing from the Event?

I'm working with AWS EMR Serverless, and I need to construct a job URL for an EMR Serverless job to be sent in a message notification in case of state change. The desired URL includes the associated ...

user27008283

35

asked Jan 12 at 13:28

0 votes

0 answers

203 views

Iceberg table load is failing in AWS EMR/Glue

I am working on a script to load data to an Iceberg table using AWS Glue/EMR (tried in both). Error message: pyspark.errors.exceptions.captured.AnalysisException: Cannot write into v1 table: ...

user3858193

1,558

asked Jan 8 at 15:40

0 votes

0 answers

25 views

Increase number of concurrent tasks on EMR spark shell

I am trying to use EMR spark-shell to do some analysis etc on a ~5 TB dataset that lives in S3, so I have a 32 x i3.16xlarge cluster. If I start spark shell with default configurations, I get exactly ...

kot

85

asked Jan 8 at 0:05

0 votes

0 answers

76 views

pyspark drop_duplicates() unexpectedly increases count

I am getting somewhat unexpected results with df.drop_duplicates(). df2 = df.dropDuplicates() print(df2.count()) # prints 424527 print(df.count()) # prints 424510 I do not understand why count is ...

Gaurav Singhal

1,126

asked Jan 7 at 15:30

0 votes

0 answers

18 views

EMR ResourceManager UI Physical Mem Used %

I'm using AWS EMR with Hadoop and Yarn and when I go to UI of the RM I can see information like "Physical Mem Used %" and "Physical VCores Used %". I cannot find anything online (...

salapura.stefan

53

asked Jan 3 at 12:46

0 votes

1 answer

181 views

Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found

When I try to run a pyspark step on my EMR cluster I get an error Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found. My understanding from AWS ...

Wev

295

asked Jan 2 at 15:24

1 vote

0 answers

26 views

BootstrapActions Failing to create a hdfs directory

I need to create a hdfs folder for my run_job_flow to work. Currently I am using this sh script command sudo -u hdfs hdfs dfs -mkdir -p /apps/hudi/lib but for some reason I am getting this error : ...

Lucas Vaz

11

asked Dec 28, 2024 at 2:42

0 votes

0 answers

78 views

AWS Emr serverless spark: Live UI takes a few seconds to update due to its asynchronous nature. Please check again in a few seconds

Unable to see the live Spark UI on AWS EMR Serverless Spark jobs. Once a job is completed, the UI is available, but it is not available for running jobs. Message: Live UI takes a few seconds to update due to its ...

Roobal Jindal

294

asked Dec 24, 2024 at 9:19

1 vote

0 answers

77 views

AWS EMR 7.3 not showing any logs from my java program

My use case is to create an EMR 7.3 cluster in AWS and invoke a Lambda that makes an HTTP request to pass the configs and payload over to initiate a Spark job. These are my software settings: [ { "Classification...

Voon see hong

11

asked Dec 17, 2024 at 15:57

0 votes

1 answer

133 views

EMR: Pyspark conda environment error on AWS Graviton

The same pyspark code works on r7a but not r7g or r8g on an EMR cluster (7.5). I build the python environment with conda, and use it in pyspark: conda create -n pyspark python=3.9 --show-channel-urls --...

Guillaume

3,081

asked Dec 6, 2024 at 14:35

0 votes

1 answer

62 views

apache-beam installation issue on AWS EMR-EC2 cluster

I started an AWS EMR-EC2 cluster, and I am having trouble getting the sparkrunner of apache-beam to work. I have a python script that will use apache-beam. I have tried either aws emr add-steps or ssh ...

Shiyi Yin

1

asked Dec 6, 2024 at 6:19

1 vote

0 answers

108 views

Config params are not propagated when using Spark Connect

I am trying to get Spark Connect working on Amazon EMR (Spark v3.5.1). I started the Connect server on EMR primary node, making sure the JARs required for S3 auth are present in the Classpath: /usr/...

Ninad

81

asked Nov 29, 2024 at 12:02

-1 votes

2 answers

388 views

Jackson Databind Conflicts in Apache Spark Project Using Maven Shade Plugin

I am working on a project that processes IMDb data using Apache Spark. My setup involves Spark Core and Spark SQL dependencies, along with Jackson for handling JSON serialization and deserialization. ...

prashantjerk

1

asked Nov 28, 2024 at 4:28

0 votes

1 answer

73 views

Pyspark error: " Class org.apache.hadoop.fs.s3a.S3AFileSystem not found" in EMR 7.0.0

I am using EMR 7.0.0 version, which has python 3.9, spark 3.5.0, Hadoop 3.3.6 in AWS. I got the error: File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/...

TripleH

489

asked Nov 13, 2024 at 3:23

0 votes

1 answer

76 views

Jobs failed in airflow, despite Spark History UI jobs stuck in running. AWS Serverless

Has anyone ever experienced jobs failing in Airflow while, in the Spark History UI, the jobs are still stuck in running? Also, after I added a line of code to write the data to S3 (without reading it ...

laggyPC

29

asked Nov 2, 2024 at 12:41

0 votes

1 answer

136 views

Flink Job Execution Fails with `NoClassDefFoundError` on AWS EMR with Python

I am trying to run a Flink job on an AWS EMR cluster (v7.3.0) using Python 3.9 and Apache Flink with PyFlink. My job reads from an AWS Kinesis stream and prints the stream data to console. However, ...

Mughees Asif

87

asked Oct 27, 2024 at 14:52

-2 votes

1 answer

82 views

spark-submit using --py-files option could not find path to modules

I am trying to submit a pyspark job to an EMR cluster. The code for the job lies in a zipped package that is placed in S3: /bin/spark-submit \ --py-files s3://my-dev/scripts/job-launchers/dev/pipeline.zip ...

Smruti Prakash Mohanty

53

asked Oct 11, 2024 at 7:22

1 vote

0 answers

102 views

Spark EMR executor container failing due to Java heap space

One of my Spark jobs is failing because the executor container fails with "java.lang.OutOfMemoryError: Java heap space". Any recommendation is appreciated. I am using emr 200 -r7g16xlarge ...

user3858193

1,558

asked Sep 25, 2024 at 16:13

1 vote

0 answers

75 views

Optimizing PySpark Feature Engineering with Over a Billion Rows on EMR

I’m working with a large transaction dataset (~1 billion rows) in PySpark on AWS EMR. My goal is to perform feature engineering where I compute statistics like sum, mean, standard deviation, and ...

Meriiiiii

11

asked Sep 21, 2024 at 21:36

1 vote

0 answers

104 views

Spark Shuffle FetchFailedException for large dataset in emr

My job takes as input a 400 TB parquet dataset from s3. This job runs with 250 r716x large (each having 64 vCore, 488 GiB memory). The job fails with the below error: org.apache.spark.shuffle....

user3858193

1,558

asked Sep 12, 2024 at 14:43
