I have created a training job on AWS SageMaker and it runs well: it reads from an S3 location and stores model checkpoints in S3 as intended. Now I need to trigger this training job with specified parameters (e.g. the S3 location of the data) from a website (via API Gateway). My first idea was to write a Lambda function that is invoked by the API call and starts the training job using the SageMaker API:

from sagemaker.huggingface import HuggingFace

# role, hyperparameters, training_input_path and test_input_path are defined elsewhere
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters=hyperparameters,
)

# starting the training job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

However, AWS Lambda has a maximum runtime of 15 minutes, which is shorter than the training time required. Is there a serverless way of doing the same thing? Are AWS Step Functions any different from Lambda in this regard?

  • I'm not sure, but doesn't the Lambda function only need to trigger SageMaker? What does the Lambda do next? (I mean, SageMaker itself can upload to S3.) Commented Jan 30, 2022 at 3:39
  • The flow is: Lambda starts a SageMaker training job -> on completion, Lambda deploys the trained model and sends the deployed model link back to the API as a response. So Lambda, or any other backend (typically an EC2 instance), is supposed to be the central point of contact for external calls. Commented Jan 31, 2022 at 4:11
  • I think you can create two Lambdas: one for starting the SageMaker job, and another for deploying the model and sending back the link. The second Lambda should be invoked when the SageMaker job finishes, either with boto3 or via Step Functions. Commented Jan 31, 2022 at 20:37
  • This sounds like a nice hack. Thanks. Commented Feb 1, 2022 at 10:04
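
The second Lambda from the two-Lambda suggestion could look roughly like the sketch below. It assumes the function is invoked by an EventBridge rule matching "SageMaker Training Job State Change" events, and that the event's `detail` carries `TrainingJobName`, `TrainingJobStatus` and `ModelArtifacts.S3ModelArtifacts`, as in the `DescribeTrainingJob` response. The function names are hypothetical, and the deployment step itself is left as a comment because it depends on your model and endpoint setup.

```python
def summarize_training_event(event):
    """Extract the job name, status and model artifact location from the event."""
    detail = event.get("detail") or {}
    status = detail.get("TrainingJobStatus")
    artifacts = (detail.get("ModelArtifacts") or {}).get("S3ModelArtifacts")
    return {
        "job_name": detail.get("TrainingJobName"),
        "status": status,
        # Artifacts are only meaningful once the job has completed.
        "model_artifacts": artifacts if status == "Completed" else None,
    }


def on_training_state_change(event, context):
    """Hypothetical handler for the second Lambda (deploy after training)."""
    summary = summarize_training_event(event)
    if summary["status"] != "Completed":
        # Failed, Stopped or still InProgress: nothing to deploy.
        return {"ready_to_deploy": False, **summary}
    # Deployment (creating the model and endpoint) and notifying the
    # website would go here; both are omitted from this sketch.
    return {"ready_to_deploy": True, **summary}
```

Because the website's original API call has long since returned by the time training finishes, the deployed model link would have to reach the client some other way, for example through a status endpoint that the site polls.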

1 Answer

You can launch the training job asynchronously, either by passing wait=False to fit() or by calling boto3's create_training_job. That way the Lambda only starts the job and does not need to wait for it to complete.
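
As a minimal sketch of the boto3 route, the Lambda below builds a `create_training_job` request and returns as soon as SageMaker accepts it. Several things here are assumptions, not part of the question: the environment variable names (`TRAINING_IMAGE_URI`, `SAGEMAKER_ROLE_ARN`, `OUTPUT_S3_URI`), the request body keys (`train_s3_uri`, `test_s3_uri`), the volume size and the 24-hour stopping condition. The `sagemaker_client` parameter is only there so the handler can be exercised without AWS. Unlike the `HuggingFace` estimator, raw boto3 does not package `train.py` for you, so the training image and any script-mode hyperparameters must be supplied separately.

```python
import json
import os
import time


def _channel(name, s3_uri):
    """Describe one input channel that reads every object under an S3 prefix."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def build_training_request(train_s3_uri, test_s3_uri, job_name=None):
    """Build keyword arguments for the SageMaker create_training_job call."""
    # Training job names must be unique, so default to a timestamped name.
    job_name = job_name or "hf-train-" + time.strftime("%Y%m%d-%H%M%S", time.gmtime())
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # Assumed env var; placeholder is used when it is not set.
            "TrainingImage": os.environ.get("TRAINING_IMAGE_URI", "<training-image-uri>"),
            "TrainingInputMode": "File",
        },
        "RoleArn": os.environ.get("SAGEMAKER_ROLE_ARN", "<role-arn>"),
        "InputDataConfig": [_channel("train", train_s3_uri), _channel("test", test_s3_uri)],
        "OutputDataConfig": {"S3OutputPath": os.environ.get("OUTPUT_S3_URI", "s3://<bucket>/output")},
        "ResourceConfig": {
            "InstanceType": "ml.p3.2xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,  # assumed size
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 24 * 60 * 60},  # assumed limit
    }


def lambda_handler(event, context, sagemaker_client=None):
    """Start the training job and return immediately, without waiting for it."""
    if sagemaker_client is None:
        import boto3  # imported lazily so the module loads without boto3 installed
        sagemaker_client = boto3.client("sagemaker")
    # With API Gateway proxy integration the request body arrives as a JSON string.
    body = json.loads(event.get("body") or "{}")
    request = build_training_request(body["train_s3_uri"], body["test_s3_uri"])
    response = sagemaker_client.create_training_job(**request)
    return {
        "statusCode": 202,
        "body": json.dumps(
            {"job_name": request["TrainingJobName"], "job_arn": response["TrainingJobArn"]}
        ),
    }
```

The handler answers with 202 and the job name, which the website can use later to ask for the job's status. If you prefer to keep the estimator from the question, `huggingface_estimator.fit({...}, wait=False)` has the same non-blocking effect, provided the SageMaker Python SDK is packaged with the Lambda.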
