
All Questions

-3 votes
1 answer
145 views

Issue: my Flink application throws "Thread 'jobmanager-io-thread-25' produced an uncaught exception: java.lang.OutOfMemoryError: Direct buffer memory" and terminates after running for 2-3 days. No matter ...
Strange • 1,514
0 votes
0 answers
76 views

I am observing different write behaviors when executing queries on EMR Notebook (correct behavior) vs when using spark-submit to submit a spark application to EMR Cluster (incorrect behavior). When I ...
shiva • 2,781
0 votes
0 answers
70 views

I am running an Apache Spark job on Amazon EMR that needs to connect to an Amazon MSK cluster configured with IAM authentication. The EMR cluster has an IAM role with full MSK permissions, and I can ...
Vishwas Singh
1 vote
0 answers
67 views

I am connecting to an EMR cluster through SageMaker Unified Studio (JupyterLab). My EMR cluster is configured with Delta Lake support, and I have the following Spark properties set on the cluster: ...
sakshi • 41
0 votes
0 answers
61 views

I have one Iceberg table in the Glue Catalog. I am unable to run a select * because one of the metadata files is missing. I am trying to point to the latest metadata file. How can I do that? I am using EMR 7.7 with ...
user3858193 • 1,558
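One way to point an Iceberg table at a known-good metadata file is Iceberg's `register_table` Spark procedure, which registers a table from an explicit `metadata.json`. A minimal sketch, assuming a Glue-backed catalog; the catalog name, table name, and S3 path below are placeholders, and on the cluster the statement would be run with `spark.sql(stmt)`:

```python
# Sketch: build the register_table CALL statement. All identifiers and the
# metadata path are hypothetical placeholders, not values from the question.
catalog = "glue_catalog"                  # assumed Spark catalog name
table = "db.events_recovered"             # assumed target table identifier
metadata_file = "s3://my-bucket/db/events/metadata/00042-abc.metadata.json"

stmt = (
    f"CALL {catalog}.system.register_table("
    f"table => '{table}', "
    f"metadata_file => '{metadata_file}')"
)
# On the cluster this would be executed as: spark.sql(stmt)
print(stmt)
```

Registering under a new table name avoids clobbering the broken table while you verify the recovered state.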
2 votes
0 answers
166 views

I'm trying to connect to an existing EMR cluster from SageMaker Unified Studio to run SQL queries via JupyterLab. SageMaker requires that the EMR cluster be runtime role-enabled to integrate with ...
valzor • 315
0 votes
1 answer
59 views

I am using EMR 6.15 and Hudi 0.14. I submitted the following Hudi job, which should create a database and a table in AWS Glue. The IAM role assigned to EMR Serverless has all necessary permissions for S3 and ...
Roobal Jindal
1 vote
0 answers
56 views

I have successfully implemented the IBM S3 Shuffle Plugin v0.9.6 (https://github.com/IBM/spark-s3-shuffle) on EMR on EKS (Spark 3.5.0) and the shuffle operations are working correctly with S3 storage. ...
metersk • 12.7k
0 votes
1 answer
141 views

I am writing data into s3 and table format is Iceberg in Glue Catalog. I see the /data and /metadata folders are getting created. However when I am writing data, it's creating 001/002 kind of folders. ...
user3858193 • 1,558
0 votes
0 answers
40 views

I want to install external Python packages on EMR with an EC2 setup, but currently, apart from bootstrap actions, nothing else seems to be working. The problem with this setup is that if I want to ...
RushHour • 645
3 votes
1 answer
104 views

Having trouble getting dynamic allocation to properly terminate idle executors when using FSx Lustre for shuffle persistence on EMR 7.8 (Spark 3.5.4) on EKS. Trying this strategy out to battle cost ...
metersk • 12.7k
0 votes
0 answers
41 views

I am exploring data writes into a Glue table (Iceberg table format). I have been using the saveAsTable method, mentioned as option 1. However, is there any difference between the two methods? Iceberg stores ...
user3858193 • 1,558
0 votes
1 answer
104 views

I have a pyspark script that reads data from S3 in a different AWS account, using AssumedRoleCredentialProvider , it is working on emr serverless 6.9 but when I upgrade to EMR Serverless 7.5 it fails ...
Sayed • 11
0 votes
0 answers
33 views

I have an EMR cluster configured with the following SecurityConfiguration: "AuthenticationConfiguration": { "IdentityCenterConfiguration": { "EnableIdentityCenter":...
ExK • 1
0 votes
0 answers
59 views

Given the below JSON: { "environments": [ {"env": "dev", "description": "dev environment"}, {"env": "dev01", "...
dplvs • 43
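The JSON in the question is truncated, but assuming the remaining entries follow the same shape, parsing it comes down to indexing into the "environments" array. A minimal sketch; the second entry's description is an assumed completion:

```python
import json

# Sketch: parse the "environments" document and pull out the env names.
# The sample extends the question's truncated JSON with an assumed shape.
doc = """{
  "environments": [
    {"env": "dev",   "description": "dev environment"},
    {"env": "dev01", "description": "dev01 environment"}
  ]
}"""

envs = json.loads(doc)["environments"]
names = [e["env"] for e in envs]
print(names)  # -> ['dev', 'dev01']
```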
0 votes
0 answers
50 views

I’m optimizing a PySpark pipeline that processes records with a heavily skewed categorical column (category). The data has: A few high-frequency categories (e.g., 90% of records fall into 2-3 ...
Bilal Jamil
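A common mitigation for this kind of categorical skew is key salting: split each hot category into N sub-keys so its rows spread across partitions, then re-aggregate. In PySpark the salt is typically built with `F.concat(col, F.lit("_"), (F.rand() * N).cast("int"))`; the pure-Python sketch below only illustrates the salting rule itself, and the category names and salt count are assumptions:

```python
import random

# Sketch: salt only the skewed keys so hot categories spread across several
# sub-keys, while the long tail keeps its plain key. Values are illustrative.
HOT = {"A", "B"}   # the 2-3 high-frequency categories (assumed)
N_SALTS = 8        # number of sub-keys per hot category (assumed)

def salted_key(category: str, rng: random.Random) -> str:
    # Hot keys get a random _<salt> suffix; later the partial aggregates per
    # sub-key are combined back into one result per original category.
    if category in HOT:
        return f"{category}_{rng.randrange(N_SALTS)}"
    return category

rng = random.Random(0)
keys = [salted_key(c, rng) for c in ["A", "A", "B", "C"]]
print(keys)  # hot keys carry a _<salt> suffix; 'C' stays as-is
```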
0 votes
0 answers
61 views

I encountered java.io.FileNotFoundException in AWS EMR batch. My code processes data as below : updateDF = spark.read.load(paths, ..) userIDs = uniqueList(matchedUserIDs) nones = [None for _ in range(...
Jisu Choi
0 votes
0 answers
23 views

I am having this error when running an EMR job from a notebook, passing some dates: An error occurred: An error occurred while calling o236.showString. : org.apache.spark.SparkException: Job aborted due ...
gcj • 298
0 votes
1 answer
42 views

I am trying to update the version of apache-sedona to start using version 1.7.1, but it keeps failing when the cluster is spun up using Python functions. If I spin up the cluster manually, everything works fine, and ...
gcj • 298
0 votes
0 answers
44 views

I am trying to overwrite multiple partitions in a large table. Basically I have my main external S3 table sandbox, partitioned by part: scala> q("select * from sandbox") +---+-------------...
kot • 85
0 votes
0 answers
58 views

I'm trying to load data from SQL Server via a stored procedure in PySpark using the JDBC driver. Calling a stored procedure is supposed to be possible with this driver according to this. I've tried the ...
Vercinegetorix
0 votes
0 answers
63 views

I have a Spark EMR Serverless application which loads some shapefile data (geolocation data) from an S3 bucket; all components are deployed in eu-west-1. This Spark job is scheduled to run hourly (via Airflow), ...
blackstorm
0 votes
0 answers
43 views

I have since switched from EMR on EC2 to EMR serverless. I used to use interactive notebooks with EMR on EC2. I am trying to use the EMR studio workspace (notebooks) with EMR serverless application ...
DirtyDan
0 votes
1 answer
152 views

I'm trying to stand up a new cluster in AWS EMR, but it immediately fails with the following error: Service-linked role 'AWSServiceRoleForEMRCleanup' for EMR is required. Please create this role ...
FoxMulder900 • 1,281
2 votes
1 answer
67 views

I'm reading data in pyspark from postgres using jdbc connection. The table being read is large, about 240 million rows. I'm attempting to read it into 16 partitions. The read is being performed like ...
Kevin Smeeks
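For Spark to split a JDBC read into parallel partitions, all four of `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` must be set; otherwise the whole table is read through a single connection. A hedged sketch of the option set (the URL, table name, and bounds are placeholders, not values from the question):

```python
# Sketch: the options Spark needs to parallelize a JDBC read. All values
# below are hypothetical; tune bounds to the real min/max of the column.
jdbc_options = {
    "url": "jdbc:postgresql://host:5432/mydb",  # placeholder endpoint
    "dbtable": "public.big_table",              # placeholder table
    "partitionColumn": "id",     # must be a numeric, date, or timestamp column
    "lowerBound": "1",
    "upperBound": "240000000",   # ~240 million rows per the question
    "numPartitions": "16",
}
# On the cluster:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
print(jdbc_options["numPartitions"])
```

Note that the bounds only shape the partition ranges; rows outside them are still read, just all by the first and last partitions, which is a common cause of stragglers.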
1 vote
0 answers
58 views

I ran a Jupyter PySpark notebook on an EMR 7.3.0 cluster and encountered the error below after a simple df.count() call. This was not an issue with my code; the same dataframe (df) had already been ...
mwarrior • 589
0 votes
1 answer
344 views

I have a Lake Formation resource-link database table from another AWS account, which I can query in Athena just fine with permissions. But I cannot query this data in EMR. The permission access ...
vfrank66 • 1,508
-1 votes
1 answer
187 views

I'm implementing a lakehouse (Apache Iceberg) with PySpark and I'm running into some issues. I come from a SQL background, so I was originally trying to implement this solution in the same way I ...
user172839 • 1,075
0 votes
0 answers
42 views

I am not able to access the iceberg table created using spark and present in glue catalog. Error Message: Query 20250113_163609_00011_ypmu9 failed: Not a Hive table 'search_iceberg.gweekly' Cluster ...
user3858193 • 1,558
0 votes
0 answers
67 views

I am trying to access an S3 bucket on Account B from Account A using Python and PySpark from EMR Studio on Serverless. I can access the data using Python via a cross-account IAM role but get an ...
Ashesh Aryak
0 votes
0 answers
104 views

I'm working with AWS EMR Serverless, and I need to construct a job URL for an EMR Serverless job to be sent in a message notification in case of state change. The desired URL includes the associated ...
user27008283
0 votes
0 answers
203 views

I am working on a script to load data into an Iceberg table using AWS Glue/EMR (tried both). Error message: pyspark.errors.exceptions.captured.AnalysisException: Cannot write into v1 table: ...
user3858193 • 1,558
0 votes
0 answers
25 views

I am trying to use EMR spark-shell to do some analysis etc on a ~5 TB dataset that lives in S3, so I have a 32 x i3.16xlarge cluster. If I start spark shell with default configurations, I get exactly ...
kot • 85
0 votes
0 answers
76 views

I am getting somewhat unexpected results with df.drop_duplicates(). df2 = df.dropDuplicates() print(df2.count()) # prints 424527 print(df.count()) # prints 424510 I do not understand why count is ...
Gaurav Singhal
0 votes
0 answers
18 views

I'm using AWS EMR with Hadoop and Yarn and when I go to UI of the RM I can see information like "Physical Mem Used %" and "Physical VCores Used %". I cannot find anything online (...
salapura.stefan
0 votes
1 answer
181 views

When I try to run a pyspark step on my EMR cluster I get an error Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found. My understanding from AWS ...
Wev • 295
1 vote
0 answers
26 views

I need to create a hdfs folder for my run_job_flow to work. Currently I am using this sh script command sudo -u hdfs hdfs dfs -mkdir -p /apps/hudi/lib but for some reason I am getting this error : ...
Lucas Vaz
0 votes
0 answers
78 views

Unable to see the live Spark UI for AWS EMR Serverless Spark jobs. Once a job is completed, the UI is available, but it is not available for running jobs. Message: Live UI takes a few seconds to update due to its ...
Roobal Jindal
1 vote
0 answers
77 views

My use case is to create an EMR 7.3 cluster in AWS and invoke a Lambda HTTP request to pass the configs and payload over to initiate a Spark job. This is my software settings: [ { "Classification"...
Voon see hong
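The software settings field expects EMR's configuration-classification JSON: an array of objects, each with a `Classification` and a `Properties` map. The question's JSON is truncated, so the fragment below is only an illustrative example of the shape, using the `spark-defaults` classification with made-up property values:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "4g",
      "spark.dynamicAllocation.enabled": "true"
    }
  }
]
```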
0 votes
1 answer
133 views

The same PySpark code works on r7a but not on r7g or r8g on an EMR cluster (7.5). I build the Python environment with conda and use it in PySpark: conda create -n pyspark python=3.9 --show-channel-urls --...
Guillaume • 3,081
0 votes
1 answer
62 views

I started an AWS EMR-EC2 cluster and am having trouble getting the Spark runner of Apache Beam to work. I have a Python script that uses Apache Beam. I have tried both aws emr add-steps and ssh ...
Shiyi Yin
1 vote
0 answers
108 views

I am trying to get Spark Connect working on Amazon EMR (Spark v3.5.1). I started the Connect server on EMR primary node, making sure the JARs required for S3 auth are present in the Classpath: /usr/...
Ninad • 81
-1 votes
2 answers
388 views

I am working on a project that processes IMDb data using Apache Spark. My setup involves Spark Core and Spark SQL dependencies, along with Jackson for handling JSON serialization and deserialization. ...
prashantjerk
0 votes
1 answer
73 views

I am using EMR 7.0.0 version, which has python 3.9, spark 3.5.0, Hadoop 3.3.6 in AWS. I got the error: File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/...
TripleH • 489
0 votes
1 answer
76 views

Has anyone ever experienced jobs failing in Airflow while, in the Spark History UI, the jobs are still stuck in running? Also, after I added a line of code to write the data to S3 (without reading it ...
laggyPC • 29
0 votes
1 answer
136 views

I am trying to run a Flink job on an AWS EMR cluster (v7.3.0) using Python 3.9 and Apache Flink with PyFlink. My job reads from an AWS Kinesis stream and prints the stream data to console. However, ...
Mughees Asif
-2 votes
1 answer
82 views

I am trying to submit a PySpark job to an EMR cluster. The code for the job lives in a zipped package placed in S3: /bin/spark-submit \ --py-files s3://my-dev/scripts/job-launchers/dev/pipeline.zip ...
Smruti Prakash Mohanty
1 vote
0 answers
102 views

One of my Spark jobs is failing because an executor container dies with "java.lang.OutOfMemoryError: Java heap space". Any recommendation is appreciated. I am using EMR with 200 r7g.16xlarge ...
user3858193 • 1,558
1 vote
0 answers
75 views

I’m working with a large transaction dataset (~1 billion rows) in PySpark on AWS EMR. My goal is to perform feature engineering where I compute statistics like sum, mean, standard deviation, and ...
Meriiiiii
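In PySpark this kind of per-key feature engineering is typically a single `df.groupBy("customer_id").agg(F.sum(...), F.mean(...), F.stddev(...))` pass (the key and column names here are assumptions, not from the question). A small pure-Python illustration of the same aggregation logic on toy data:

```python
from collections import defaultdict
from statistics import mean, stdev

# Sketch: per-key sum/mean/sample-stddev, i.e. what a PySpark
# groupBy(...).agg(...) would compute at scale. Data is made up.
rows = [("u1", 10.0), ("u1", 30.0), ("u2", 5.0), ("u2", 7.0)]

groups = defaultdict(list)
for key, amount in rows:
    groups[key].append(amount)

features = {
    key: {"sum": sum(v), "mean": mean(v), "std": stdev(v)}
    for key, v in groups.items()
}
print(features["u1"])  # {'sum': 40.0, 'mean': 20.0, 'std': 14.14...}
```

At a billion rows the key concern is shuffle volume, so computing all statistics in one `agg` call (one shuffle) rather than one pass per statistic is usually the first optimization.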
1 vote
0 answers
104 views

My job takes as input a 400 TB Parquet dataset from S3. This job runs with 250 r7g.16xlarge instances (each having 64 vCores, 488 GiB memory). The job fails with the below error: org.apache.spark.shuffle....
user3858193 • 1,558
