
I am using spaCy's en-core-web-sm model in a Python AWS Lambda function. I ran pip freeze > requirements.txt to capture all the dependencies in requirements.txt, and en-core-web-sm==2.1.0 is one of the lines in the file.

When I try a serverless deployment, I get: ERROR: Could not find a version that satisfies the requirement en-core-web-sm==2.1.0 (from versions: none) ERROR: No matching distribution found for en-core-web-sm==2.1.0

Even though I am not using Heroku, I followed the answer to "Heroku Deployment Error: No matching distribution found for en-core-web-sm" and added the line https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0 to my requirements.txt file. The model now installs, but the deployment fails with: Unzipped size must be smaller than 262144000 bytes (Service: AWSLambdaInternal; Status Code: 400; Error Code: InvalidParameterValueException; Request ID: XxX-XxX)
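
To see how close a package is to that limit, the unpacked size can be measured locally before deploying. This is a minimal sketch, assuming the dependencies have been installed into a local directory. The names unzipped_size and fits_in_lambda are made up for illustration, and the limit constant is copied from the error message above.

```python
import os

# Limit quoted in the error message above (250 * 1024 * 1024 bytes).
LAMBDA_UNZIPPED_LIMIT = 262144000


def unzipped_size(root):
    """Sum the sizes of all regular files under root, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_lambda(root, limit=LAMBDA_UNZIPPED_LIMIT):
    """Return True if the unpacked directory is within the limit."""
    return unzipped_size(root) <= limit
```

Calling fits_in_lambda on the directory that gets zipped for deployment gives a rough yes/no answer. It only counts file contents, so treat the result as an estimate.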

How can I wire up en-core-web-sm to my Lambda?

  • Hope this will help you. Commented Jul 22, 2019 at 9:43
  • @ChandanGupta: I get An error occurred while installing https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz#egg=en-core-web-sm! Will try again when I add this to my requirements.txt file. Commented Jul 22, 2019 at 18:41

1 Answer


I took advantage of the model being a separate component from the library and uploaded the model to an S3 bucket. Before initialising spaCy, I download the model from S3. This is accomplished by the function below.

import os

import boto3

_clients = {}


def get_boto3_client(name, factory):
    # Not part of boto3, and not shown in the original code. Assumed here to be
    # a small cache so clients are reused; this stand-in is only a guess at it.
    if name not in _clients:
        _clients[name] = factory(name)
    return _clients[name]


def download_dir(dist, local, bucket):
    """Recursively download every object under the S3 prefix dist into local."""
    client = get_boto3_client('s3', lambda n: boto3.client('s3'))
    resource = get_boto3_client('s3r', lambda n: boto3.resource('s3'))

    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        # Each common prefix is a "sub-directory": recurse into it.
        for subdir in result.get('CommonPrefixes') or []:
            download_dir(subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents') or []:
            dest_path = local + os.sep + file.get('Key')
            os.makedirs(os.path.dirname(dest_path), exist_ok=True)
            # Keys ending in '/' are folder placeholders, not files.
            if not dest_path.endswith('/'):
                resource.meta.client.download_file(bucket, file.get('Key'), dest_path)

And the code using spaCy looks like this:

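import os

import spacy

# lang (presumably the S3 key prefix of the model directory), mapping_bucket
# (the bucket name) and spacy_input (the text to parse) are defined elsewhere.
if not os.path.isdir('/tmp/en_core_web_sm-2.0.0'):
    download_dir(lang, '/tmp', mapping_bucket)
spacy.util.set_data_path('/tmp')

nlp = spacy.load('/tmp/en_core_web_sm-2.0.0')
doc = nlp(spacy_input)
for token in doc:
    # Token has no label_ attribute (that belongs to Span); dep_ is used instead.
    print(token.text, token.pos_, token.dep_)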
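
Module-level state can survive between invocations while a Lambda container stays warm, so the model only needs to be loaded once per container rather than once per request. This is a minimal sketch of that caching. get_model is a hypothetical helper, not part of spaCy, and loader stands for any callable that takes a path and returns a loaded model (spacy.load in real use).

```python
# Cache of loaded models, keyed by path; lives as long as the container does.
_MODEL_CACHE = {}


def get_model(path, loader):
    """Load the model at path once and reuse it on later calls.

    get_model is a hypothetical helper; loader is any callable that
    takes a path and returns a loaded model, for example spacy.load.
    """
    if path not in _MODEL_CACHE:
        _MODEL_CACHE[path] = loader(path)
    return _MODEL_CACHE[path]
```

Inside the handler this would be called as get_model('/tmp/en_core_web_sm-2.0.0', spacy.load) after the download check, so a cold start pays for the download and load while warm invocations skip both.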

3 Comments

@Nagarajan Shanmuganathan, Is this answer helpful for you?
I started using a third-party Lambda Layer that has spaCy bundled in it, since I wanted a quick deployment. That avoided all the overhead of managing the files on S3 and bootstrapping.
Where does the get_boto3_client bit come from? I get the error: NameError: name 'get_boto3_client' is not defined. I've tried importing it from boto3 but no luck.
