We need to integrate our EMR workload with an AWS service for the following use case: Python/PySpark code running on EMR processes around 1 billion transactions and about 200K files per month. This is causing data traceability issues: we cannot easily tell how many files were processed successfully, how many failed and can be reprocessed, and so on. We need a service that can track these metrics alongside the other log files. Any inputs or pointers on how to achieve this, such as reference architecture docs or set-up docs, would really help. I was wondering whether DynamoDB could be used for this; any further input on this problem statement would also help.

1 Answer

I would use AWS CloudWatch for this. Specifically, I would install the CloudWatch agent as a bootstrap action or a step on the EMR cluster, following the CloudWatch agent guide: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html

Then you can query the log events with CloudWatch Logs Insights, along with your other log streams.
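For the per-file counts to be queryable, the job has to log one structured event per file. Here is a minimal sketch of that idea, assuming the driver writes JSON lines to a log file that the agent ships. The event name `file_processed`, the field names, and the helpers `build_file_event`, `process_files` and `summarize` are all hypothetical, not part of any AWS API.

```python
import json
import logging
from collections import Counter
from datetime import datetime, timezone


def build_file_event(path, status, error=None):
    """Return one JSON log line describing the outcome for a single input file."""
    record = {
        "event": "file_processed",  # hypothetical event name (assumption)
        "file": path,
        "status": status,  # "SUCCESS" or "FAILED" (assumed convention)
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    if error is not None:
        record["error"] = str(error)
    return json.dumps(record, sort_keys=True)


def process_files(paths, handler, logger):
    """Run handler(path) for each file and log one JSON event per file."""
    lines = []
    for path in paths:
        try:
            handler(path)
            line = build_file_event(path, "SUCCESS")
        except Exception as exc:  # record the failure so the file can be reprocessed
            line = build_file_event(path, "FAILED", exc)
        logger.info(line)
        lines.append(line)
    return lines


def summarize(lines):
    """Count file events by status; a local stand-in for the aggregation query."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip non-JSON log lines
        if isinstance(record, dict) and record.get("event") == "file_processed":
            counts[record.get("status")] += 1
    return dict(counts)
```

With events in this shape, a Logs Insights query along the lines of `filter event = "file_processed" | stats count(*) by status` should give the success and failure counts, and filtering on `status = "FAILED"` lists the files to reprocess. Treat the field names as placeholders for whatever your job actually logs.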

1 Comment

Thank you @Jakub Kaplan for your inputs
