We need to integrate our EMR cluster with an AWS service for the following use case: Python/PySpark code running on EMR processes around 1 billion transactions and some 200K files per month. This is causing data traceability issues: we cannot easily tell how many files were processed successfully, how many failed and can be reprocessed, and so on. We need a service that can track these metrics alongside other log files. Any inputs or pointers on how to achieve this, such as reference architecture docs or setup docs, would really help. I was wondering whether DynamoDB could be used for this; any further input on this problem statement would be appreciated.
1 Answer
I would use Amazon CloudWatch for this. Specifically, I would install the CloudWatch agent as a bootstrap action or as a step on the EMR cluster, following the CloudWatch agent guide: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html
Then you can query the log events with CloudWatch Logs Insights, along with your other log streams.
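To make per-file counts easy to query, one option is to have the job write one JSON line per file outcome to a log that the agent collects. Below is a minimal sketch, not a definitive implementation: the names `build_file_event` and `process_with_audit`, the logger name `file_audit`, and the field names (`event`, `file`, `status`, `records`, `error`) are all hypothetical, and it assumes the agent is configured to ship whatever this logger writes.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Assumption: the CloudWatch agent is configured to collect this output
# (stdout here; a file handler would work the same way).
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("file_audit")  # hypothetical logger name


def build_file_event(file_path, status, records=0, error=None):
    """Return one JSON line describing the outcome of processing a file."""
    event = {
        "event": "file_processed",  # hypothetical field names throughout
        "file": file_path,
        "status": status,
        "records": records,
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    if error is not None:
        event["error"] = str(error)
    return json.dumps(event)


def process_with_audit(file_path, process_fn):
    """Run process_fn(file_path) and log SUCCESS or FAILED as a JSON line.

    process_fn is a placeholder for the real per-file logic and is expected
    to return the number of records it processed.
    """
    try:
        records = process_fn(file_path)
        line = build_file_event(file_path, "SUCCESS", records)
    except Exception as exc:  # record the failure so the file can be retried
        line = build_file_event(file_path, "FAILED", error=exc)
    logger.info(line)
    return line
```

Because CloudWatch Logs Insights discovers fields in JSON log events, a query along these lines should then give the per-status counts (field names as assumed above): `filter event = "file_processed" | stats count(*) by status`. Filtering on `status = "FAILED"` and listing `file` would give the set of files to reprocess.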
1 Comment
Somen Swain
Thank you @Jakub Kaplan for your inputs