
All Questions

0 votes · 1 answer · 743 views

I'm using AWS EMR v6.7.0. I can view the history server UI after the cluster has already terminated. However, when I try to go to large jobs, I get the following exception (for smaller jobs, ...
— Golan Kiviti (4,295)

1 vote · 0 answers · 108 views

In an effort to keep my code modular, I have Jupyter notebooks calling other notebooks. Unfortunately, I've had to hard-code the editor ID (for example, e-BKTM2DIHXBEDRU44ANWRKIU8N) into my notebook. ...
— mwarrior (589)

0 votes · 1 answer · 529 views

While writing data to Apache Hudi on EMR using PySpark, we can specify the configuration to save to a table name. See hudiOptions = { 'hoodie.table.name': 'tableName', 'hoodie.datasource.write....
— Anurag A S

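A minimal sketch of such a Hudi options dictionary; only `hoodie.table.name` is from the excerpt, and the record-key, precombine column, and upsert operation are illustrative assumptions:

```python
# Hudi write options: table name from the excerpt; the key/ordering
# columns and the upsert operation are illustrative assumptions.
hudi_options = {
    "hoodie.table.name": "tableName",
    "hoodie.datasource.write.recordkey.field": "id",           # hypothetical
    "hoodie.datasource.write.precombine.field": "updated_at",  # hypothetical
    "hoodie.datasource.write.operation": "upsert",
}

# On EMR you would then write a DataFrame with these options, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/path")
```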
0 votes · 0 answers · 218 views

I'm using the Terraform EMR module to deploy an AWS EMR cluster. In the EMR module I have declared a core instance fleet: module "emr_trino_cluster" { source = "...
— Marksman

0 votes · 1 answer · 2k views

When we execute a Spark job in an EMR cluster, we add a step as 'HadoopJarStep': { 'Args': [ 'spark-submit', 's3://spark-test-bucket-pr/spark_job/...
— nayak0765 (193)

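The step structure being described can be sketched with boto3; the step name, cluster ID, and full script path are placeholders (the path in the excerpt is truncated), only the shape of the spark-submit arguments comes from it:

```python
# Shape of an EMR step that runs spark-submit via command-runner.jar.
# The script S3 path is truncated in the excerpt, so a placeholder is used.
step = {
    "Name": "spark_job",            # hypothetical step name
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://spark-test-bucket-pr/spark_job/job.py"],  # placeholder
    },
}

# Submitting it would look like (cluster id is a placeholder):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```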
1 vote · 0 answers · 127 views

Setup: an AWS EMR cluster for Flink and AWS MSK for Kafka. I'm trying to create a FlinkKafkaConsumer in PyFlink on the AWS EMR cluster for reading data from Kafka topics, but I'm encountering an error during ...
— Kalaiarasu M

1 vote · 1 answer · 861 views

I need to install a package in the AWS EMR PySpark kernel. I use the following code: sc.install_pypi_package("statsmodels"). On doing this I get the error statsmodels/tsa/_stl.c:39:10: fatal ...
— Srinath (31)

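A fatal error in `_stl.c` suggests pip fell back to compiling statsmodels from source on the cluster. One hedged workaround is pinning a release that ships a prebuilt wheel for the cluster's Python; the exact version below is an assumption, not from the excerpt:

```python
# Pin a statsmodels release with a prebuilt wheel so no C compilation is
# needed on the cluster; the version is an assumption — check PyPI for a
# wheel matching the cluster's Python version.
requirement = "statsmodels==0.12.2"

# Inside the EMR PySpark kernel you would then run:
# sc.install_pypi_package(requirement)
```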
0 votes · 1 answer · 347 views

I need to read the JSON files present in S3 and process them. There are roughly 120,000 JSONL files present in a single directory of S3. Each file is roughly 1.6 MB in size. My Spark code is ...
— shashank93rao

0 votes · 0 answers · 81 views

I have a Spark script that I am trying to execute via EMR. My script works fine on EMR and completes successfully in 4 minutes, but sometimes the same script, with no change, keeps on running for hours ...
— seou1 (506)

1 vote · 1 answer · 1k views

I am running an EMR cluster with the following creation statement: $ aws emr create-cluster \ --name "my_cluster" \ --log-uri "s3n://somebucket/" \ --release-label "emr-6.8.0" ...
— Flo (485)

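The same cluster request can be expressed via boto3's `run_job_flow`; the name, log URI, and release label are from the excerpt, while the instance settings are illustrative assumptions:

```python
# Equivalent cluster request via boto3; name, log URI, and release label
# are from the excerpt, the Instances block is an illustrative assumption.
request = {
    "Name": "my_cluster",
    "LogUri": "s3n://somebucket/",
    "ReleaseLabel": "emr-6.8.0",
    "Instances": {
        "MasterInstanceType": "m5.xlarge",   # assumption
        "SlaveInstanceType": "m5.xlarge",    # assumption
        "InstanceCount": 3,                  # assumption
    },
}

# import boto3
# boto3.client("emr").run_job_flow(**request)
```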
0 votes · 1 answer · 999 views

Trying to attach an EMR Studio and Workspace to an EMR cluster via Terraform, but I get an error saying: Error: creating EMR Studio: InvalidRequestException: The service role does not have permission to ...
— Staggerlee011

2 votes · 1 answer · 462 views

I am trying to install a package into an EMR cluster. Every time I get the following error: WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken ...
— Srinath (31)

0 votes · 0 answers · 226 views

Trying to build an EC2 EMR cluster but getting an IAM error. I ran a GUI build in a throwaway account with auto scaling set up and no other config, then copied the default roles it used with my ...
— Staggerlee011

3 votes · 1 answer · 708 views

I'm trying to add EMR on EKS via the Terraform blueprint. I have added the following, which creates the blueprint side successfully: module "emr-blueprint" { source = "github.com/...
— Staggerlee011

1 vote · 1 answer · 724 views

The issue is simple. I am using the Hadoop gcs-connector (https://github.com/GoogleCloudDataproc/hadoop-connectors) for writing data to Google Cloud Storage from a MapReduce job running in an EMR ...
— Dinesh Raj

2 votes · 0 answers · 768 views

So I'm creating and using a SparkSession on Amazon EMR as follows: os.environ["AWS_ACCESS_KEY_ID"] = access_key_id os.environ["AWS_SECRET_ACCESS_KEY"] = secret_access_key os....
— flamefrost

0 votes · 0 answers · 100 views

I am trying to pull a very large file (>1 TB) from the web into AWS S3. Normally I'd use Requests + multipart upload to do this, but given the size of the file this ends up being extremely slow. In ...
— Doug MacArthur

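One common approach is to stream the remote file in ranged chunks and feed each chunk to an S3 multipart upload in parallel; the range arithmetic can be sketched as:

```python
# Split a large object of `total_size` bytes into inclusive byte ranges
# suitable for HTTP Range requests and S3 multipart parts.
def chunk_ranges(total_size: int, chunk_size: int):
    return [(start, min(start + chunk_size, total_size) - 1)
            for start in range(0, total_size, chunk_size)]

# Each (lo, hi) pair becomes a "Range: bytes=lo-hi" header on the GET and
# one UploadPart call on the S3 side (parts must be >= 5 MiB except the last).
```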
0 votes · 1 answer · 939 views

I am trying to run a script as a bootstrap action on all the EMR nodes (Primary, Core, or Task nodes). This script will be publishing metrics to AWS CloudWatch. When publishing metrics to AWS ...
— contemplator

0 votes · 1 answer · 1k views

I am running a Flink 1.13.1 cluster, where I execute batch jobs that run an Athena query and save the result in Athena tables. I submit these jobs multiple times a day. In every execution, ...
— amitwdh (711)

0 votes · 1 answer · 2k views

In my PySpark project I'm using a Python package that uses Dynaconf, so I need to set the following environment variable: ENV_FOR_DYNACONF = platform. The problem is I don't understand how I can pass ...
— nirkov (829)

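Spark's standard way to hand an environment variable to YARN containers is the `spark.yarn.appMasterEnv.*` and `spark.executorEnv.*` conf keys; a sketch using the variable named in the excerpt:

```python
# Conf entries that export ENV_FOR_DYNACONF=platform to the YARN
# application master (driver in cluster mode) and to the executors.
conf = {
    "spark.yarn.appMasterEnv.ENV_FOR_DYNACONF": "platform",
    "spark.executorEnv.ENV_FOR_DYNACONF": "platform",
}

# Equivalent on the command line:
# spark-submit --conf spark.yarn.appMasterEnv.ENV_FOR_DYNACONF=platform \
#              --conf spark.executorEnv.ENV_FOR_DYNACONF=platform ...
```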
0 votes · 1 answer · 974 views

I'm new to the EMR/HDFS/Hive/Spark world. I have a collection of large JSON files (>50GB per file) that I am attempting to load so as to query specific keys. There is a standard layout for the JSON ...
— The Crusher

1 vote · 1 answer · 928 views

I have an EMR cluster with EMR version 6.10.0, and I'm trying to use a PySpark UDF within my code, but it keeps failing with the same error all the time. data = [("AAA",), ("BBB",), (...
— JstFlip (23)

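A minimal shape for such a UDF, with the wrapped function kept as plain Python; the lowercasing transformation is illustrative, not from the excerpt:

```python
# Plain function that the UDF would wrap; None-safe because Spark
# columns can contain nulls.
def to_lower(s):
    return s.lower() if s is not None else None

# PySpark wiring (sketch):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# df = spark.createDataFrame([("AAA",), ("BBB",)], ["col"])
# df = df.withColumn("lower", udf(to_lower, StringType())("col"))
```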
0 votes · 0 answers · 319 views

I have a Spark (2.4.6) Scala job running in AWS EMR (emr-5.31.0) that fails randomly with the error org.xml.sax.SAXParseException; Premature end of file. The job consistently overwrites parquet files in ...
— sgallagher

0 votes · 1 answer · 2k views

I am trying to generate sentence embeddings using Hugging Face SBERT transformers. Currently, I am using the all-MiniLM-L6-v2 pre-trained model to generate sentence embeddings using PySpark on AWS EMR ...
— cs_abhi (11)

0 votes · 0 answers · 548 views

Getting an exception when trying to rename a file within a Spark application: Permission denied - new file name. The same thing works fine with the spark-shell run by the same user. P.S. The path is ...

0 votes · 0 answers · 233 views

I am trying to create a file from Spring Boot on AWS EMR HDFS, but I get the error below: UnknownHostException: ip-172-31-23-85.ec2.internal/:9866 Abandoning BP-1515286748-172.31.29.184-1681364405694:...
— Karuppusamy Mani

0 votes · 1 answer · 563 views

I have a Spark 3 job running on EMR 6.9, and it is a continuously running job. I am noticing a gradual increase in disk usage of task nodes over time. I have noticed errors like this on the task nodes: 2023-...
— blue01 (2,105)

0 votes · 1 answer · 527 views

This is similar to being unable to sync AWS Glue Data Catalog where you run a spark-submit with Hudi DeltaStreamer, except you only sync the database (and not the tables). E.g. you submit: spark-...
— Will (11.5k)

1 vote · 1 answer · 500 views

I have source data at an S3 path like: s3://mybucket/prefix1/prefix2//prefixX/prefixY/partitionColumn=2023/ I need to create a DataFrame to read s3://mybucket/prefix1/prefix2//prefixX/prefixY/ but I am ...
— Aanchal Aron

0 votes · 1 answer · 277 views

When I run Hudi DeltaStreamer on EMR, I see the Hudi files get created in S3 (e.g. I see a .hoodie/ dir and the expected parquet files in S3). The command looks something like: spark-submit \ --conf ...
— Will (11.5k)

0 votes · 1 answer · 1k views

I can submit a Spark task on AWS EMR with the following command. How do I fetch the stdout of the Spark job? aws emr add-steps --cluster-id ${CLUSTERID} \ --output json \ --steps Type=spark,Name=${...
— ndemir (1,931)

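When the cluster has a log URI configured, a step's stdout lands in S3 under EMR's documented log layout; a small helper that builds that key (bucket, cluster ID, and step ID are placeholders):

```python
# EMR writes step logs to <log-uri>/<cluster-id>/steps/<step-id>/stdout.gz
# (gzip-compressed, and typically delivered with a delay of a few minutes).
def step_stdout_key(log_uri: str, cluster_id: str, step_id: str) -> str:
    return f"{log_uri.rstrip('/')}/{cluster_id}/steps/{step_id}/stdout.gz"

key = step_stdout_key("s3://mybucket/logs", "j-ABC123", "s-XYZ789")
```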
0 votes · 1 answer · 43 views

I have a set of partitioned parquets I'm attempting to read in Spark. To simplify filtering, I've written a wrapper function to optionally allow filtering based on the parquets' partition columns. The ...
— maxwellray

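The optional-filter wrapper described can be sketched against a plain list so it runs anywhere, with the Spark variant shown as comments; all names here are hypothetical:

```python
# Keep only rows whose partition-column values match the given filters;
# calling with no filters passes all rows through unchanged.
def read_filtered(rows, **partition_filters):
    for col, val in partition_filters.items():
        rows = [r for r in rows if r.get(col) == val]
    return rows

# Spark variant (sketch):
# def read_filtered(spark, path, **partition_filters):
#     df = spark.read.parquet(path)
#     for col, val in partition_filters.items():
#         df = df.filter(df[col] == val)
#     return df
```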
1 vote · 1 answer · 311 views

I want to run multiple Spark jobs in an EMR cluster, with these jobs having some dependency among each other, and once everything is complete the last step should trigger a Lambda which will start ...
— dba (11)

3 votes · 0 answers · 287 views

I am running some Sedona geospatial queries on top of a Spark cluster hosted in the Amazon EMR environment. My query works for some input datasets, but hangs on the 'count()' method of Spark SQL ...
— View Delft

0 votes · 0 answers · 318 views

I am using the release emr-5.33.0-latest. Whenever a Spark job fails, it tries to restart 5 times. I am not able to find where this behavior is configured, or if it's a default, and how to ...
— hsnsd (1,823)

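On YARN, the restart count for a Spark application is typically governed by `spark.yarn.maxAppAttempts`, capped by the cluster-wide `yarn.resourcemanager.am.max-attempts`; a sketch of pinning the job to a single attempt:

```python
# Limit the Spark application to a single YARN attempt (no automatic
# restarts after failure); YARN's own am.max-attempts still caps the value.
conf = {"spark.yarn.maxAppAttempts": "1"}

# Equivalent on the command line:
# spark-submit --conf spark.yarn.maxAppAttempts=1 ...
```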
4 votes · 1 answer · 1k views

We have an EMR Studio that has an S3 default bucket set, i.e. s3://OurBucketName/Subdirectory/work, and within which we've created a Workspace that is attached to an EC2 cluster running emr-6.10.0 ...
— dragonscience

0 votes · 1 answer · 51 views

We have a jar that runs on Jenkins to create and delete our EMR stacks based on some JSON files. The delete will fail because of an InstanceGroupConfig resource. I'm not an AWS guru, and though I've ...
— lpayson

0 votes · 1 answer · 460 views

For an Apache Hadoop installation, there are REST APIs available to get the status of an application or to list running applications, etc. Those are mentioned at https://hadoop.apache.org/docs/...
— vinayakshukre

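On EMR the same YARN ResourceManager REST API is served from the primary node; a helper that builds the applications endpoint (the DNS name is a placeholder, and 8088 is YARN's default ResourceManager web port):

```python
# YARN ResourceManager REST endpoint for listing applications,
# optionally filtered by state (e.g. RUNNING).
def yarn_apps_url(master_dns: str, state: str = "RUNNING") -> str:
    return f"http://{master_dns}:8088/ws/v1/cluster/apps?states={state}"

url = yarn_apps_url("ip-10-0-0-1.ec2.internal")
# The JSON response could then be fetched with any HTTP client,
# e.g. requests.get(url).json()
```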
0 votes · 1 answer · 50 views

We need to integrate our EMR with one of the AWS services for a use case: "Using EMR, the Python/PySpark code is running around 1 billion transactions & processing ...
— Somen Swain

0 votes · 1 answer · 266 views

We have a use case where, in a Spark job, we iterate over partitions of an external table, load the data of each partition (almost the same data volume in each partition), and do transformations (self joins, no UDFs) on ...
— Ankit Raj

1 vote · 1 answer · 106 views

I have a 10-node AWS r4.16xlarge cluster. Under the Executors tab in the Spark UI, the number under "cores" is different every time I spin up the cluster; sometimes it shows 200, some ...
— user7343922

0 votes · 1 answer · 607 views

I am trying to dump a heap file from Spark (EMR) to an S3 bucket using new SparkConf().set("spark.driver.extraJavaOptions", "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=s3://my-bucket/...
— Danniel_Lee

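The JVM writes heap dumps only to a local filesystem path, so an s3:// value in `-XX:HeapDumpPath` will not work; a sketch that dumps locally and copies afterwards (the local path is an assumption, the bucket name is from the excerpt):

```python
# Dump to local disk on OOM, then ship the file to S3 afterwards;
# /mnt/tmp is an assumed writable path on the EMR node.
conf = {
    "spark.driver.extraJavaOptions":
        "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/tmp/driver.hprof",
}

# After the driver dies:
# aws s3 cp /mnt/tmp/driver.hprof s3://my-bucket/
```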
0 votes · 1 answer · 304 views

I've tried using different sizes of clusters (EMR on AWS), and it always fails due to YARN killing all the nodes: https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/ I ...
— Daniel (3)

2 votes · 0 answers · 307 views

Hope everyone is doing well! Here's the context of the issue I'm facing: I'm working at a company that is supporting a really old Airflow version; here are the details of the version and some ...
— nariver1 (395)

-1 votes · 1 answer · 181 views

I have to ingest 15 TB of data from S3 to DynamoDB. There isn't any transformation required except adding a new column (insert date). The data in S3 is in parquet format with snappy ...
— dba (11)

2 votes · 0 answers · 243 views

In the Spark UI for one of my applications, I reliably see a long delay (10 - 15 minutes) between allocation of a driver to the application and the first stage starting. What situations might cause a ...
— josh (21)

0 votes · 1 answer · 193 views

I'm having some trouble understanding how Spark allows for scheduling of jobs. I have a series of jobs I'd like to run in sequence. From what I've read, I can submit any number of jobs to spark-submit ...
— maxwellray

0 votes · 0 answers · 420 views

I need to insert data with more than 50 million lines from S3 into an HBase table. I am using AWS EMR to run a cluster with Hadoop services like HBase. I've already managed to put the S3 data in the ...
— Lucas Emanuel

0 votes · 0 answers · 542 views

I have set up an AWS EMR cluster. I have included this script as the bootstrap script: #!/bin/bash # Install needed libraries sudo pip3 install pandas==1.3.5 awswrangler==2.19.0 boto3==1.26.72 When ...
— Austin Wolff

1 vote · 0 answers · 122 views

On EMR I see that my job took 12 minutes to run, according to the Elapsed Time column. However, when I go to the Spark UI > Executors tab, the Task Time (GC Time) shows 1 hr (4 s). I totalled up ...
— tallwithknees

