All Questions
Tagged with elastic-map-reduce or amazon-emr
4,988 questions
0
votes
1
answer
743
views
AWS EMR history server - ERROR 500 for large job
I'm using AWS EMR v6.7.0.
I can view the history server UI after the cluster has already terminated. However, when I try to go to large jobs, I get the following exception (for smaller jobs, ...
1
vote
0
answers
108
views
How do you get the editor ID from within a PySpark Jupyter notebook running on an EMR cluster?
In an effort to keep my code modular, I have Jupyter notebooks calling other notebooks. Unfortunately, I've had to hard-code the editor ID (for example, e-BKTM2DIHXBEDRU44ANWRKIU8N) into my notebook. ...
0
votes
1
answer
529
views
Querying Apache Hudi using PySpark on EMR by table name
While writing data to the Apache Hudi on EMR using PySpark, we can specify the configuration to save to a table name.
See
hudiOptions = {
'hoodie.table.name': 'tableName',
'hoodie.datasource.write....
0
votes
0
answers
218
views
Terraform emr module, core_instance_group appears in the plan even though it is not used
I'm using the Terraform EMR module to deploy an AWS EMR cluster. In the module, I have declared a core instance fleet:
module "emr_trino_cluster" {
source = "...
0
votes
1
answer
2k
views
Role of command-runner.jar and script-runner.jar in AWS EMR
When we execute a Spark job in an EMR cluster, we add a step as
'HadoopJarStep': {
'Args': [
'spark-submit',
's3://spark-test-bucket-pr/spark_job/...
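For context, a step of this shape is typically a full step definition whose `HadoopJarStep` names `command-runner.jar` as the jar and carries the `spark-submit` command in `Args`. A minimal sketch, assuming a hypothetical script location and step name (neither comes from the question, whose path is truncated):

```python
def build_spark_step(script_uri, name="spark-job"):
    """Return one EMR step definition that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,                        # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",     # runs the command given in Args on the cluster
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


# Hypothetical script location, used only for illustration.
step = build_spark_step("s3://example-bucket/jobs/job.py")
print(step["HadoopJarStep"]["Args"])
```

A dict like this would be passed in the `Steps` list of boto3's `add_job_flow_steps` call on an EMR client.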
1
vote
0
answers
127
views
Error creating FlinkKafkaConsumer in PyFlink: An error occurred while calling None.org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
Setup:
AWS EMR cluster for Flink
AWS MSK for Kafka
I'm trying to create a FlinkKafkaConsumer in PyFlink on an AWS EMR cluster for reading data from Kafka topics, but I'm encountering an error during ...
1
vote
1
answer
861
views
Install Package in PySpark running on AWS EMR
I need to install a package in AWS EMR PySpark kernel. I use the following code
sc.install_pypi_package("statsmodels")
On doing this I get the error
statsmodels/tsa/_stl.c:39:10: fatal ...
0
votes
1
answer
347
views
Spark: How to reduce the time to read files from S3?
I need to read the JSON files present in S3 and process them. There are roughly 120,000 JSONL files in a single S3 directory. Each file is about 1.6 MB in size.
My spark code is ...
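To put those figures in perspective, the arithmetic below works out the total data volume and how many partitions it would fill at a larger target size. This is a sketch only: the 128 MiB target is an assumption chosen for illustration, and 1.6 MB is taken as decimal megabytes.

```python
NUM_FILES = 120_000
FILE_BYTES = 1_600_000                      # ~1.6 MB per file (decimal megabytes)
TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # assumed 128 MiB target, not from the question

total_bytes = NUM_FILES * FILE_BYTES
# Ceiling division: number of target-sized partitions needed to hold everything.
partitions = -(-total_bytes // TARGET_PARTITION_BYTES)

print(f"total: {total_bytes / 1e9:.0f} GB")
print(f"files: {NUM_FILES}, partitions at target size: {partitions}")
```

The gap between 120,000 files and the much smaller partition count shows how much of the work may be per-file overhead (listing and opening objects) rather than reading bytes.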
0
votes
0
answers
81
views
EMR Spark job keeps running
I have a Spark script that I am trying to execute via EMR. My script works fine on EMR and completes successfully in 4 minutes, but sometimes the same script with no changes keeps running for hours ...
1
vote
1
answer
1k
views
EMR - Pyspark, No module named 'boto3'
I am running an EMR with the following creation statement:
$ aws emr create-cluster \
--name "my_cluster" \
--log-uri "s3n://somebucket/" \
--release-label "emr-6.8.0" ...
0
votes
1
answer
999
views
Terraform EMR Studio error: The service role does not have permission to access the <cluster name>
I'm trying to attach an EMR Studio and Workspace to an EMR cluster via Terraform, but I get an error saying:
Error: creating EMR Studio: InvalidRequestException: The service role does not have permission to ...
2
votes
1
answer
462
views
AWS EMR network connection
I am trying to install a package on an EMR cluster. Every time, I get the following error:
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken ...
0
votes
0
answers
226
views
EMR EC2 cluster Terraform creation fails due to insufficient EC2 permissions
I'm trying to build an EC2 EMR cluster but getting an IAM error. I ran a GUI build in a throwaway account with auto scaling set up and no other config, then copied the default roles it used with my ...
3
votes
1
answer
708
views
Terraform EMR on EKS virtual cluster error
I'm trying to add EMR on EKS via the Terraform blueprint.
I have added the following, which creates the blueprint side successfully:
module "emr-blueprint" {
source = "github.com/...
1
vote
1
answer
724
views
Hadoop gcs-connector throws Java heap space error
The issue is simple. I am using the Hadoop gcs-connector (https://github.com/GoogleCloudDataproc/hadoop-connectors) for writing data to Google Cloud Storage from a MapReduce job running in an EMR ...
2
votes
0
answers
768
views
Update or refresh AWS credentials in an active Pyspark session
So I'm creating and using a SparkSession on Amazon EMR as follows:
os.environ["AWS_ACCESS_KEY_ID"] = access_key_id
os.environ["AWS_SECRET_ACCESS_KEY"] = secret_access_key
os....
0
votes
0
answers
100
views
How to parallelize curl for a large file via AWS EMR
I am trying to pull a very large file (>1Tb) from the web into AWS S3. Normally I'd use Requests + multipart upload to do this, but given the size of the file this ends up being extremely slow. In ...
0
votes
1
answer
939
views
Which node type (Primary, Core, or Task) am I connected to in an Amazon EMR cluster?
I am trying to run a script as a bootstrap action on all the EMR nodes (Primary, Core or Task nodes). This script will be publishing metrics to AWS CloudWatch. When publishing metrics to AWS ...
0
votes
1
answer
1k
views
Why is the Flink core node not releasing JVM Metaspace memory?
I am running a Flink 1.13.1 cluster, where I execute batch jobs that run Athena queries and save the results in Athena tables.
I submit these jobs multiple times a day.
In every execution, ...
0
votes
1
answer
2k
views
How can I pass an environment variable to a project that runs on EMR Serverless?
In my PySpark project I'm using a Python package that uses Dynaconf, so I need to set the following environment variable: ENV_FOR_DYNACONF = platform.
The problem is I don't understand how I can pass ...
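One commonly used route is Spark configuration rather than the shell environment. The sketch below builds `--conf` flags for a job's Spark submit parameters; it assumes that EMR Serverless honors `spark.emr-serverless.driverEnv.<KEY>` for the driver and `spark.executorEnv.<KEY>` for executors, which should be verified against the current EMR Serverless documentation. The helper name `env_conf_args` is hypothetical.

```python
def env_conf_args(env):
    """Build --conf flags that set each variable on the driver and the executors.

    Assumes the property names spark.emr-serverless.driverEnv.<KEY> and
    spark.executorEnv.<KEY>; check them against the EMR Serverless docs.
    """
    args = []
    for key, value in env.items():
        args.append(f"--conf spark.emr-serverless.driverEnv.{key}={value}")
        args.append(f"--conf spark.executorEnv.{key}={value}")
    return " ".join(args)


params = env_conf_args({"ENV_FOR_DYNACONF": "platform"})
print(params)
```

The resulting string would go into the job's Spark submit parameters alongside any other `--conf` settings.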
0
votes
1
answer
974
views
Reading JSON files using a custom schema in Spark not returning results
I'm new to the EMR/HDFS/Hive/Spark world. I have a collection of large JSON files (>50GB per file) that I am attempting to load so as to query specific keys. There is a standard layout for the json ...
1
vote
1
answer
928
views
AWS EMR PySpark UDF fails with `Failed to run command /usr/bin/virtualenv (...)`
I have an EMR cluster running EMR version 6.10.0, and I'm trying to use a PySpark UDF in my code, but it keeps failing with the same error every time.
data = [("AAA",), ("BBB",), ("...
0
votes
0
answers
319
views
Spark Scala job in AWS EMR fails randomly with the error org.xml.sax.SAXParseException; Premature end of file
I have a Spark(2.4.6) Scala job running in AWS EMR(emr-5.31.0) that fails randomly with the error org.xml.sax.SAXParseException; Premature end of file. The job consistently overwrites parquet files in ...
0
votes
1
answer
2k
views
How to generate sentence embeddings with sentence transformers using pyspark in an optimized way?
I am trying to generate sentence embedding using hugging face sbert transformers. Currently, I am using all-MiniLM-L6-v2 pre-trained model to generate sentence embedding using pyspark on AWS EMR ...
0
votes
0
answers
548
views
java.io.FileNotFoundException: (Permission denied) while renaming the file within Spark application
Getting an exception when trying to rename a file within a Spark application: Permission denied - new file name. The same thing works fine in the spark-shell as the same user. P.S. The path is ...
0
votes
0
answers
233
views
AWS EMR HDFS Excluding datanode DatanodeInfoWithStorage Error (UnknownHostException: ip-172-31-23-85.ec2.internal<unresolved>:9866)
I am trying to create a file from Spring Boot in AWS EMR HDFS, but I get the error below: UnknownHostException: ip-172-31-23-85.ec2.internal/:9866
Abandoning BP-1515286748-172.31.29.184-1681364405694:...
0
votes
1
answer
563
views
Shuffle logs filling disk in EMR task nodes
I have a Spark 3 job running on EMR 6.9, and it is a continuously running job. I am noticing a gradual increase in disk usage on the task nodes over time. I have noticed errors like this on the task nodes:
2023-...
0
votes
1
answer
527
views
Hudi DeltaStreamer with AWS Glue Data Catalog syncs the database, but not the tables
This is similar to being unable to sync AWS Glue Data Catalog where you run a spark-submit with Hudi DeltaStreamer, except you only sync the database (and not the tables).
E.g. you submit:
spark-...
1
vote
1
answer
500
views
Spark read S3 path with "/" as prefix
I have source data at S3 path like:
s3://mybucket/prefix1/prefix2//prefixX/prefixY/partitionColumn=2023/
I need to create Data frame to read
s3://mybucket/prefix1/prefix2//prefixX/prefixY/ but I am ...
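The double slash is part of the object key: S3 treats keys as plain strings, so the empty segment between prefix2 and prefixX is significant. A small sketch using only Python's standard library shows how the URI breaks down; the helper name `split_s3_uri` is hypothetical. Path-handling code that normalizes `//` to `/` would point at a different key, which may be relevant here.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key) without normalizing the key."""
    parsed = urlparse(uri)
    bucket = parsed.netloc
    key = parsed.path[1:]  # drop only the single leading slash
    return bucket, key


bucket, key = split_s3_uri("s3://mybucket/prefix1/prefix2//prefixX/prefixY/")
segments = key.split("/")

print(bucket)
print(key)
print(segments)  # the empty string marks the empty-named "folder"
```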
0
votes
1
answer
277
views
Running Hudi DeltaStreamer on EMR succeeds, but does not sync to AWS Glue Data Catalog
When I run Hudi DeltaStreamer on EMR, I see the Hudi files get created in S3 (e.g. I see a .hoodie/ dir and the expected parquet files in S3). The command looks something like:
spark-submit \
--conf ...
0
votes
1
answer
1k
views
How to fetch the stdout of a Spark job on AWS EMR
I can submit a Spark task on AWS EMR with the following command.
How do I fetch the stdout of the Spark job?
aws emr add-steps --cluster-id ${CLUSTERID} \
--output json \
--steps Type=spark,Name=${...
0
votes
1
answer
43
views
Conditionally filtering parquets based on parameters provided to function
I have a set of partitioned parquets I'm attempting to read in Spark. To simplify filtering, I've written a wrapper function to optionally allow filtering based on the parquets' partition columns. The ...
1
vote
1
answer
311
views
Orchestration and EMR
I want to run multiple Spark jobs in an EMR cluster, with these jobs having some dependencies among each other, and once everything is complete the last step should trigger a Lambda which will start ...
3
votes
0
answers
287
views
Spark Executor hang on ShuffleBlockFetcherIterator remote fetches
I am running some Sedona geospatial queries on top of a Spark cluster hosted in the Amazon EMR environment. My query works for some input datasets, but would hang on the 'count()' method of Spark SQL ...
0
votes
0
answers
318
views
EMR on EKS | Spark job retries 5 times if failed
I am using the release: emr-5.33.0-latest
Whenever a Spark job fails, it tries to restart 5 times. I am not able to find where this behavior is configured or if it's a default, and how to ...
4
votes
1
answer
1k
views
Unable to read S3 files from within AWS EMR Studio notebooks or consoles
We have an EMR Studio that has an S3 default bucket set, i.e. s3://OurBucketName/Subdirectory/work, and within which we've created a Workspace that is attached to an EC2 cluster running emr-6.10.0 ...
0
votes
1
answer
51
views
How do I delete an EMR InstanceGroupConfig via the Java SDK?
We have a jar that runs on Jenkins to create and delete our EMR stacks based on some JSON files. The delete will fail because of an InstanceGroupConfig resource. I'm not an AWS guru and though I've ...
0
votes
1
answer
460
views
AWS EMR Resource Manager REST APIs
For Apache Hadoop installation, there are REST APIs available to get the status of an application or to know a list of running applications etc. Those are mentioned at https://hadoop.apache.org/docs/...
0
votes
1
answer
50
views
Python & Pyspark code traceability using EMR service
We need to integrate our EMR with an AWS service for one of our use cases, i.e., "Using EMR, the Python/PySpark code runs around 1 billion transactions & processing ...
0
votes
1
answer
266
views
Iterative algorithm with Spark
We have a use case where in a Spark job
We iterate over partitions of an external table
Load data of this partition (almost same data vol in each partition)
Do transformation(self joins, no udfs) on ...
1
vote
1
answer
106
views
Number of cores in AWS r4.16xlarge cluster
I have a 10-node AWS r4.16xlarge cluster. Under the Executors tab in the Spark UI, the number under "cores" is different every time I spin up the cluster; sometimes it shows 200, some ...
0
votes
1
answer
607
views
How to dump heap to s3 using HeapDumpOnOutOfMemoryError in spark?
I am trying to dump a heap file from Spark (EMR) to an S3 bucket using
new SparkConf().set("spark.driver.extraJavaOptions", "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=s3://my-bucket/...
0
votes
1
answer
304
views
Creating a 50 GB parquet file of random integers using PySpark fails
I've tried using different sizes of clusters (EMR on AWS) and it always fails due to YARN killing all the nodes:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/
I ...
2
votes
0
answers
307
views
Airflow EMR Hook failing while requesting to add a step
Hope everyone is doing well!
Here's the context of the issue I'm facing: I'm working at a company that is supporting a really old Airflow version. Here are the details of the version and some ...
-1
votes
1
answer
181
views
15 TB data ingestion from S3 to DynamoDB [closed]
I have to ingest 15 TB of data from S3 to DynamoDB. There isn't any transformation required except for adding a new column (insert date).
The data in S3 is in parquet format with snappy ...
2
votes
0
answers
243
views
Long delay between Spark driver + executor allocation and an initial stage starting
In the Spark UI for one of my applications, I reliably see a long delay (10 - 15 minutes) between allocation of a driver to the application and the first stage starting. What situations might cause a ...
0
votes
1
answer
193
views
Submitting Multiple Jobs in Sequence
I'm having some trouble understanding how Spark allows for scheduling of jobs. I have a series of jobs I'd like to run in sequence. From what I've read, I can submit any number of jobs to spark-submit ...
0
votes
0
answers
420
views
How to insert a large amount of data into an HBase table with bulk load?
I need to insert more than 50 million rows of data from S3 into an HBase table. I am using AWS EMR to run a cluster with Hadoop services like HBase. I've already managed to put the S3 data in the ...
0
votes
0
answers
542
views
Unable to create boto3 client on AWS EMR pyspark worker (botocore.exceptions.ProfileNotFound)
I have set up an AWS EMR cluster. I have included this script as the bootstrap script:
#!/bin/bash
# Install needed libraries
sudo pip3 install pandas==1.3.5 awswrangler==2.19.0 boto3==1.26.72
When ...
1
vote
0
answers
122
views
What is the difference between AWS EMR's "Elapsed Time" and Spark UI "Task Time"?
On EMR I see that my job took 12 minutes to run - according to the Elapsed Time column. However, when I go to the Spark UI > Executors tab the Task Time (GC Time) shows 1 hr (4 s).
I totalled up ...
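The two figures measure different things: elapsed time is wall-clock time, while the Task Time shown in the Executors tab is summed over tasks, many of which run concurrently. A rough sketch of the relationship, assuming the full hour of task time was accrued within the 12-minute run:

```python
elapsed_minutes = 12      # EMR "Elapsed Time" (wall clock)
task_time_minutes = 60    # Spark UI "Task Time", summed across tasks

# Average number of tasks running at the same moment during the job.
avg_parallelism = task_time_minutes / elapsed_minutes

print(f"average concurrent tasks: {avg_parallelism:.1f}")
```

Under that assumption, about five tasks were running at any given moment, which is how summed task time can exceed the elapsed time.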