Newest 'elastic-map-reduce' Questions - Page 5

0 votes

1 answer

743 views

AWS EMR history server - ERROR 500 for large job

I'm using AWS EMR v6.7.0. I can view the history server UI after the cluster has already terminated. However, when I try to go to large jobs, I get the following exception (for smaller jobs, ...

Golan Kiviti

4,295

asked May 22, 2023 at 15:26

1 vote

0 answers

108 views

How do you get the editor ID from within a PySpark Jupyter notebook running on an EMR cluster?

In an effort to keep my code modular, I have Jupyter notebooks calling other notebooks. Unfortunately, I've had to hard-code the editor ID (for example, e-BKTM2DIHXBEDRU44ANWRKIU8N) into my notebook. ...

mwarrior

589

asked May 19, 2023 at 1:13

0 votes

1 answer

529 views

Querying Apache Hudi using PySpark on EMR by table name

While writing data to the Apache Hudi on EMR using PySpark, we can specify the configuration to save to a table name. See hudiOptions = { 'hoodie.table.name': 'tableName', 'hoodie.datasource.write....

Anurag A S

751

asked May 18, 2023 at 14:23

0 votes

0 answers

218 views

Terraform emr module, core_instance_group appears in the plan even though it is not used

I'm using Terraform emr module to deploy an AWS EMR cluster. In the emr module I have declared core instance fleet : module "emr_trino_cluster" { source = "...

Marksman

23

asked May 18, 2023 at 8:32

0 votes

1 answer

2k views

Role of command-runner.jar and script-runner.jar in aws emr

When we execute a Spark job in an EMR cluster, we add a step as 'HadoopJarStep': { 'Args': [ 'spark-submit', 's3://spark-test-bucket-pr/spark_job/...

nayak0765

193

asked May 17, 2023 at 13:27

1 vote

0 answers

127 views

Error creating FlinkKafkaConsumer in PyFlink: An error occurred while calling None.org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

Setup: AWS EMR cluster for Flink, AWS MSK for Kafka. I'm trying to create a FlinkKafkaConsumer in PyFlink on an AWS EMR cluster for reading data from Kafka topics, but I'm encountering an error during ...

Kalaiarasu M

31

asked May 17, 2023 at 4:14

1 vote

1 answer

861 views

Install Package in PySpark running on AWS EMR

I need to install a package in AWS EMR PySpark kernel. I use the following code sc.install_pypi_package("statsmodels") On doing this I get the error statsmodels/tsa/_stl.c:39:10: fatal ...

Srinath

31

asked May 16, 2023 at 11:21

0 votes

1 answer

347 views

Spark: How to reduce the time to read files from S3?

I need to read the JSON files present in S3 and process them. There are roughly 120,000 JSONL files present in a single directory of S3. Each file is roughly around 1.6MB in size. My spark code is ...

shashank93rao

140

asked May 16, 2023 at 9:19

0 votes

0 answers

81 views

EMR Spark job keeps running

I have a Spark script that I am trying to execute via EMR. My script works fine on EMR and completes successfully in 4 minutes, but sometimes the same script, with no changes, keeps running for hours ...

seou1

506

asked May 12, 2023 at 21:48

1 vote

1 answer

1k views

EMR - Pyspark, No module named 'boto3'

I am running an EMR with the following creation statement: $ aws emr create-cluster \ --name "my_cluster" \ --log-uri "s3n://somebucket/" \ --release-label "emr-6.8.0" ...

Flo

485

asked May 12, 2023 at 15:57

0 votes

1 answer

999 views

Terraform EMR Studio error: The service role does not have permission to access the <cluster name>

I'm trying to attach an EMR Studio and workspace to an EMR cluster via Terraform, but I get an error saying: Error: creating EMR Studio: InvalidRequestException: The service role does not have permission to ...

Staggerlee011

1,053

asked May 12, 2023 at 8:25

2 votes

1 answer

462 views

AWS EMR network connection

I am trying to install a package into an EMR cluster. Every time, I get the following error: WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken ...

Srinath

31

asked May 11, 2023 at 8:55

0 votes

0 answers

226 views

EMR ec2 cluster terraform creation fails due to insufficient EC2 permissions

I'm trying to build an EC2 EMR cluster but am getting an IAM error. I ran a GUI build in a throwaway account with auto scaling set up and no other config, then copied the default roles it used with my ...

Staggerlee011

1,053

asked May 10, 2023 at 15:37

3 votes

1 answer

708 views

Terraform EMR on EKS virtual cluster error

I'm trying to add EMR on EKS via the Terraform blueprint. I have added the following, which adds the blueprint side successfully: module "emr-blueprint" { source = "github.com/...

Staggerlee011

1,053

asked May 9, 2023 at 14:02

1 vote

1 answer

724 views

Hadoop gcs-connector throws Java heap space error

The issue is simple. I am using the Hadoop gcs-connector (https://github.com/GoogleCloudDataproc/hadoop-connectors) for writing data to Google Cloud Storage from a MapReduce job running in an EMR ...

Dinesh Raj

313

asked May 4, 2023 at 13:24

2 votes

0 answers

768 views

Update or refresh AWS credentials in an active Pyspark session

So I'm creating and using a SparkSession on Amazon EMR as follows: os.environ["AWS_ACCESS_KEY_ID"] = access_key_id os.environ["AWS_SECRET_ACCESS_KEY"] = secret_access_key os....

flamefrost

31

asked May 4, 2023 at 8:17

0 votes

0 answers

100 views

How to parallelize curl for a large file via AWS EMR

I am trying to pull a very large file (>1Tb) from the web into AWS S3. Normally I'd use Requests + multipart upload to do this, but given the size of the file this ends up being extremely slow. In ...

Doug MacArthur

175

asked Apr 28, 2023 at 22:02

0 votes

1 answer

939 views

Which node type (Primary, Core, or Task) am I connected to in an Amazon EMR cluster?

I am trying to run a script as a bootstrap action on all the EMR nodes (Primary, Core or Task nodes). This script will be publishing metrics to AWS CloudWatch. When publishing metrics to AWS ...

contemplator

103

asked Apr 28, 2023 at 21:04

0 votes

1 answer

1k views

Why is the Flink core node not releasing JVM Metaspace memory?

I am running a Flink 1.13.1 cluster, where I execute batch jobs that run Athena queries and save the results in Athena tables. I submit these jobs multiple times a day. In every execution, ...

amitwdh

711

asked Apr 28, 2023 at 4:50

0 votes

1 answer

2k views

How can I pass an environment variable to a project that runs on EMR Serverless?

In my PySpark project I'm using a Python package that uses Dynaconf, so I need to set the following environment variable: ENV_FOR_DYNACONF = platform. The problem is I don't understand how I can pass ...

nirkov

829

asked Apr 23, 2023 at 14:28

0 votes

1 answer

974 views

Reading json files using custom schema in spark not returning results

I'm new to the EMR/HDFS/Hive/Spark world. I have a collection of large JSON files (>50GB per file) that I am attempting to load so as to query specific keys. There is a standard layout for the JSON ...

The Crusher

3

asked Apr 20, 2023 at 21:50

1 vote

1 answer

928 views

AWS EMR PySpark UDF fails with `Failed to run command /usr/bin/virtualenv (...)`

I have an EMR cluster with EMR version 6.10.0, and I'm trying to use a PySpark UDF within my code, but it keeps failing with the same error all the time. data = [("AAA",), ("BBB",), (...

JstFlip

23

asked Apr 20, 2023 at 11:08

0 votes

0 answers

319 views

Spark Scala job in AWS EMR fails randomly with the error org.xml.sax.SAXParseException; Premature end of file

I have a Spark(2.4.6) Scala job running in AWS EMR(emr-5.31.0) that fails randomly with the error org.xml.sax.SAXParseException; Premature end of file. The job consistently overwrites parquet files in ...

sgallagher

207

asked Apr 20, 2023 at 0:20

0 votes

1 answer

2k views

How to generate sentence embeddings with sentence transformers using pyspark in an optimized way?

I am trying to generate sentence embeddings using Hugging Face SBERT transformers. Currently, I am using the all-MiniLM-L6-v2 pre-trained model to generate sentence embeddings using PySpark on AWS EMR ...

cs_abhi

11

asked Apr 14, 2023 at 13:08

0 votes

0 answers

548 views

java.io.FileNotFoundException: (Permission denied) while renaming the file within Spark application

I'm getting an exception when trying to rename a file within a Spark application: Permission denied - new file name. The same thing works fine in spark-shell as the same user. P.S. The path is ...

user21555659

asked Apr 13, 2023 at 13:32

0 votes

0 answers

233 views

AWS EMR HDFS Excluding datanode DatanodeInfoWithStorage Error (UnknownHostException: ip-172-31-23-85.ec2.internal<unresolved>:9866)

I am trying to create a file from Spring Boot in AWS EMR HDFS, but I get the error below: UnknownHostException: ip-172-31-23-85.ec2.internal/:9866 Abandoning BP-1515286748-172.31.29.184-1681364405694:...

Karuppusamy Mani

1

asked Apr 13, 2023 at 9:33

0 votes

1 answer

563 views

Shuffle logs filling disk in EMR task nodes

I have a Spark 3 job running on EMR 6.9, and it is a continuously running job. I am noticing a gradual increase in disk usage on the task nodes over time. I have noticed errors like this on the task nodes - 2023-...

blue01

2,105

asked Apr 13, 2023 at 0:44

0 votes

1 answer

527 views

Hudi DeltaStreamer with AWS Glue Data Catalog syncs the database, but not the tables

This is similar to being unable to sync AWS Glue Data Catalog where you run a spark-submit with Hudi DeltaStreamer, except you only sync the database (and not the tables). E.g. you submit: spark-...

Will

11.5k

asked Apr 11, 2023 at 19:43

1 vote

1 answer

500 views

Spark read S3 path with "/" as prefix

I have source data at S3 path like: s3://mybucket/prefix1/prefix2//prefixX/prefixY/partitionColumn=2023/ I need to create Data frame to read s3://mybucket/prefix1/prefix2//prefixX/prefixY/ but I am ...

Aanchal Aron

33

asked Apr 11, 2023 at 10:08

0 votes

1 answer

277 views

Running Hudi DeltaStreamer on EMR succeeds, but does not sync to AWS Glue Data Catalog

When I run Hudi DeltaStreamer on EMR, I see the Hudi files get created in S3 (e.g. I see a .hoodie/ dir and the expected parquet files in S3). The command looks something like: spark-submit \ --conf ...

Will

11.5k

asked Apr 7, 2023 at 16:11

0 votes

1 answer

1k views

how to fetch the stdout of spark job on AWS EMR

I can submit a spark task on AWS EMR with the following command. How do I fetch the stdout of the Spark job? aws emr add-steps --cluster-id ${CLUSTERID} \ --output json \ --steps Type=spark,Name=${...

ndemir

1,931

asked Apr 5, 2023 at 3:03

0 votes

1 answer

43 views

Conditionally filtering parquets based on parameters provided to function

I have a set of partitioned parquets I'm attempting to read in Spark. To simplify filtering, I've written a wrapper function to optionally allow filtering based on the parquets' partition columns. The ...

maxwellray

131

asked Apr 4, 2023 at 14:47

1 vote

1 answer

311 views

Orchestration and EMR

I want to run multiple Spark jobs in an EMR cluster, with these jobs having some dependencies among each other, and once everything is complete the last step should trigger a Lambda which will start ...

dba

11

asked Mar 31, 2023 at 0:36

3 votes

0 answers

287 views

Spark Executor hang on ShuffleBlockFetcherIterator remote fetches

I am running some Sedona geospatial queries on top of a Spark cluster hosted in the Amazon EMR environment. My query works for some input datasets, but would hang on the 'count()' method of Spark SQL ...

View Delft

31

asked Mar 30, 2023 at 21:21

0 votes

0 answers

318 views

EMR on EKS | Spark job retries 5 times if failed

I am using the release emr-5.33.0-latest. Whenever a Spark job fails, it tries to restart 5 times. I am not able to find where this behavior is configured, or whether it is a default, and how to ...

hsnsd

1,823

asked Mar 21, 2023 at 14:21

4 votes

1 answer

1k views

unable to read s3 files from within aws emr studio notebooks or consoles

We have an EMR Studio that has an S3 default bucket set, i.e. s3://OurBucketName/Subdirectory/work, and within which we've created a Workspace that is attached to an EC2 cluster running emr-6.10.0 ...

dragonscience

41

asked Mar 16, 2023 at 20:14

0 votes

1 answer

51 views

How do I delete an EMR InstanceGroupConfig via the Java SDK?

We have a JAR that runs on Jenkins to create and delete our EMR stacks based on some JSON files. The delete will fail because of an InstanceGroupConfig resource. I'm not an AWS guru and though I've ...

lpayson

1

asked Mar 10, 2023 at 21:22

0 votes

1 answer

460 views

AWS EMR Resource Manager REST APIs

For Apache Hadoop installation, there are REST APIs available to get the status of an application or to know a list of running applications etc. Those are mentioned at https://hadoop.apache.org/docs/...

vinayakshukre

305

asked Mar 9, 2023 at 12:56

0 votes

1 answer

50 views

Python & Pyspark code traceability using EMR service

We need to integrate EMR with one of the AWS services for one of our use cases, i.e., "Using EMR, the Python/PySpark code is running around 1 billion transactions & processing ...

Somen Swain

31

asked Mar 6, 2023 at 11:11

0 votes

1 answer

266 views

Iterative algorithm with Spark

We have a use case where, in a Spark job, we: iterate over partitions of an external table; load the data of each partition (almost the same data volume in each partition); do transformations (self joins, no UDFs) on ...

Ankit Raj

1

asked Mar 4, 2023 at 16:14

1 vote

1 answer

106 views

Number of cores in AWS r4.16xlarge cluster

I have a 10-node AWS r4.16xlarge cluster. Under the Executors tab in the Spark UI, the number under "cores" is different every time I spin up the cluster; sometimes it shows 200, some ...

user7343922

306

asked Mar 4, 2023 at 13:52

0 votes

1 answer

607 views

How to dump heap to s3 using HeapDumpOnOutOfMemoryError in spark?

I am trying to dump a heap file from Spark (EMR) to an S3 bucket using new SparkConf().set("spark.driver.extraJavaOptions", "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=s3://my-bucket/...

Danniel_Lee

1

asked Mar 1, 2023 at 15:04

0 votes

1 answer

304 views

Creating a 50 GB parquet file of random integers using pyspark fails

I've tried using different sizes of clusters (EMR on AWS) and it always fails due to YARN killing all the nodes: https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/ I ...

Daniel

3

asked Mar 1, 2023 at 11:58

2 votes

0 answers

307 views

Airflow EMR Hook failing while requesting to add a step

Hope everyone is doing well! Here's the context of the issue I'm facing: I'm working at a company that supports a really old Airflow version. Here are the details of the version and some ...

nariver1

395

asked Feb 27, 2023 at 14:03

-1 votes

1 answer

181 views

15 TB data ingestion from S3 to DynamoDB [closed]

I have to ingest 15 TB of data from S3 to DynamoDB. There isn't any transformation required except for adding a new column (insert date). The data in S3 is in parquet format with snappy ...

dba

11

asked Feb 26, 2023 at 6:20

2 votes

0 answers

243 views

Long delay between Spark driver + executor allocation and an initial stage starting

In the Spark UI for one of my applications, I reliably see a long delay (10 - 15 minutes) between allocation of a driver to the application and the first stage starting. What situations might cause a ...

josh

21

asked Feb 26, 2023 at 4:06

0 votes

1 answer

193 views

Submitting Multiple Jobs in Sequence

I'm having some trouble understanding how Spark allows for scheduling of jobs. I have a series of jobs I'd like to run in sequence. From what I've read, I can submit any number of jobs to spark-submit ...

maxwellray

131

asked Feb 25, 2023 at 6:45

0 votes

0 answers

420 views

How to insert a large amount of data into an hbase table with bulk load?

I need to insert more than 50 million lines of data stored in S3 into an HBase table. I am using AWS EMR to run a cluster with Hadoop services like HBase. I've already managed to put the S3 data in the ...

Lucas Emanuel

1

asked Feb 23, 2023 at 23:12

0 votes

0 answers

542 views

Unable to create boto3 client on AWS EMR pyspark worker (botocore.exceptions.ProfileNotFound)

I have set up an AWS EMR cluster. I have included this script as the bootstrap script: #!/bin/bash # Install needed libraries sudo pip3 install pandas==1.3.5 awswrangler==2.19.0 boto3==1.26.72 When ...

Austin Wolff

635

asked Feb 22, 2023 at 21:29

1 vote

0 answers

122 views

What is the difference between AWS EMR's "Elapsed Time" and Spark UI "Task Time"?

On EMR I see that my job took 12 minutes to run - according to the Elapsed Time column. However, when I go to the Spark UI > Executors tab the Task Time (GC Time) shows 1 hr (4 s). I totalled up ...

tallwithknees

341

asked Feb 22, 2023 at 11:22
