
I have a list of dictionaries that represents a CSV file, and I would like to write them to S3; however, I am getting a memory error. Here is my code:

import csv
import io
import boto3

s3 = boto3.client('s3')

dicts = [] # populated with about 1,000,000 dictionaries representing a CSV
f = io.StringIO()
writer = csv.DictWriter(f, fieldnames=dicts[0].keys())
writer.writeheader()
            
for k in dicts:
    writer.writerow(k)
            
print("Writing to S3...")
response = s3.upload_fileobj(Bucket='mybucket', Key=f"key.csv", Fileobj=f.getvalue())
f.close()

However, when I run this I get the following error:

[ERROR] MemoryError
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 85, in lambda_handler
    response = s3.upload_fileobj(Bucket='mybucket', Key=f"key.csv", Fileobj=f.getvalue())

How can I go about writing this to S3 in a more memory-efficient way? The CSV is about 400 MB and has around 1,000,000 rows.

EDIT:

I have the maximum amount of memory available; here is the report from Lambda:

REPORT RequestId: c8f651cf-9869-4217-921f-52edcf577234  
Duration: 123484.03 ms  
Billed Duration: 123485 ms  
Memory Size: 10240 MB   
Max Memory Used: 10043 MB   
Init Duration: 453.23 ms    

I have run a memory profiler, and unsurprisingly the vast majority of the memory is used writing to f and by f.getvalue().
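
A minimal sketch of line-by-line profiling with the memory_profiler package (the specific profiler isn't named above, so this is just one way to get per-line numbers; the sample data is a stand-in for the real rows):

import csv
import io

from memory_profiler import profile  # pip install memory-profiler

@profile  # prints per-line memory increments when the function runs
def build_csv(dicts):
    f = io.StringIO()
    writer = csv.DictWriter(f, fieldnames=dicts[0].keys())
    writer.writeheader()
    for k in dicts:
        writer.writerow(k)
    body = f.getvalue()  # the full-copy step that shows up as a large increment
    f.close()
    return body

build_csv([{"name": "a", "value": 1}] * 3)  # stand-in data; the real run had ~1M rows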

EDIT:

Here is the full Lambda function code:

# (Inside lambda_handler(event, context); csv, io, json, uuid are imported and
# s3 = boto3.client('s3') is created at module level - not shown here.)
for i in event['files']:
    try:
        file = s3.get_object(Bucket="incomingbucket", Key=i)
        print(file)
    except Exception as e:
        print(e, i)

    file_id = str(uuid.uuid4())
    jsonRootLs = i.split(".")
    if len(jsonRootLs) > 1:
        # strip the file extension to build the matching JSON mapper filename
        jsonRoot = '.'.join(jsonRootLs[:-1])
    else:
        jsonRoot = jsonRootLs[0]
    jsonFileName = f"{jsonRoot}.json"
        
    mapper = s3.get_object(Key=jsonFileName, Bucket='slm-addressfile-incoming')
    mapperJSON = json.loads(mapper['Body'].read().decode('utf-8'))

    dicts = customFile(file, mapperJSON)
    for j in dicts:
        j['mail_filename'] = i
        j['file_id'] = file_id
    dictsToSend.extend(dicts)
    print("Records added to list")
        
    f = io.StringIO()
    writer = csv.DictWriter(f, fieldnames=dicts[0].keys())
    writer.writeheader()
    
    for k in dicts:
        writer.writerow(k)
    
    print("Writing to S3...")
    response = s3.upload_fileobj(Bucket='slm-test-bucket-transactional', Key=f"{jsonRoot}.csv", Fileobj=f.getvalue())
    f.close()

# Function to re map columns
def customFile(file, mapperjson):
    NCOAFields = mapperjson['mappings']
    lines1 = []
    for line in file['Body'].iter_lines():
        lines1.append(line.decode('utf-8', errors='ignore'))

    fieldnames = lines1[0].replace('"','').split(',')
    jlist1 = (dict(row) for row in csv.DictReader(lines1[1:], fieldnames))
    
    dicts = []
    for i in jlist1:
        d = {}
        metadata = {}
        for k, v in i.items():
            if k in NCOAFields:
                d[NCOAFields[k]] = v
            else:
                metadata[k] = v
        if len(metadata) > 0:
            d['metadata'] = metadata
        d['individual_id'] = str(uuid.uuid4())
        dicts.append(d)
        
    del jlist1

    return dicts

Basically, it reads a CSV from S3 that also has a JSON mapping file to change the names of the columns to our destination schema.
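
To illustrate (with made-up column names, since the real mapping isn't shown), the remapping in customFile works roughly like this:

# Hypothetical mapping file: source column names -> destination schema names.
mapperJSON = {"mappings": {"Addr1": "address_line_1", "Zip": "zip_code"}}

# A source CSV row like this...
source_row = {"Addr1": "1 Main St", "Zip": "12345", "campaign": "spring"}

# ...comes out of customFile looking like this: mapped columns are renamed,
# unmapped columns are collected under 'metadata', and a UUID is attached.
# {"address_line_1": "1 Main St", "zip_code": "12345",
#  "metadata": {"campaign": "spring"}, "individual_id": "<uuid4>"}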

5 Comments

  • What are the memory settings on the Lambda function currently? Have you tried simply increasing the memory available? aws.amazon.com/about-aws/whats-new/2020/12/… Commented Feb 16, 2021 at 18:46
  • Yes, I have the maximum amount of memory. I will update the post. Commented Feb 16, 2021 at 19:02
  • Uhhh I'm skeptical that the file size is your problem. Your file is 400mb, your Lambda memory is 10gb... That means a 25x difference. In other words there are 9.6gb of RAM unaccounted for. That's a lot. This seems like a memory leak. Commented Feb 16, 2021 at 20:47
  • @MyStackRunnethOver I will update the post with the full function code Commented Feb 16, 2021 at 20:50
  • What is dictsToSend? It only appears once and you don't do anything with it Commented Feb 17, 2021 at 21:12

1 Answer


I can't find anything in the code that should obviously be taking up a ton of memory (particularly: holding on to memory across iterations of the for loop without releasing it in between). You're closing the StringIO virtual file, which would otherwise have been my prime suspect.

Given what you've said about memory profiling, here are possible solutions:

  1. Change

response = s3.upload_fileobj(..., Fileobj=f.getvalue())

to

response = s3.upload_fileobj(..., Fileobj=f)

This should avoid making a copy of the buffer (f) as a string in memory. This will take a single significant chunk out of memory usage - it may or may not be enough. (A rough sketch of this change is included after this list.)

  2. Refactor your code to stream your data - specifically, most of your collections are created, then iterated through once, then never used again. Instead, you could operate entry-by-entry across your data, doing all your transforms to each datapoint one by one. Unless you use multi-part upload you'll still need to hold all your data in memory before uploading it to S3, but this should still reduce memory usage. (The smart_open sketch under the comments below shows one way to stream a multi-part upload.)

  3. (This is a bit of a nuclear option) At the end of your main loop, set your variables to None and trigger garbage collection.

I would prefer 1 and/or 2 to 3. If 3 does work, I would be suspicious that something else is going wrong.
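
Here is a rough sketch of option 1, using the dicts list and the bucket/key names from the question. One adjustment beyond the wording above: upload_fileobj expects a binary file-like object, so the sketch writes the CSV through a TextIOWrapper into a BytesIO (rather than passing the StringIO directly) and rewinds the buffer before uploading.

import csv
import io

import boto3

s3 = boto3.client("s3")
dicts = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]  # stand-in for the ~1M rows

buf = io.BytesIO()
# csv.DictWriter needs a text stream while upload_fileobj needs a binary one,
# so wrap the BytesIO in a TextIOWrapper for writing.
text_buf = io.TextIOWrapper(buf, encoding="utf-8", newline="")

writer = csv.DictWriter(text_buf, fieldnames=dicts[0].keys())
writer.writeheader()
for row in dicts:
    writer.writerow(row)

text_buf.flush()  # push any buffered text through to the BytesIO
buf.seek(0)       # rewind so upload_fileobj reads from the start

# Pass the file object itself - no getvalue() copy of the whole CSV in memory.
s3.upload_fileobj(Fileobj=buf, Bucket="mybucket", Key="key.csv")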


2 Comments

Thanks for the suggestions, I will definitely try these.
(Great name btw.) The smart_open route did work for me; I'm only using about 5 GB of memory as opposed to blowing up at 10. The 400 MB file becomes 1.5 GB in size after transforming it, so I guess that is just the way it is.
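
For reference, a rough sketch of the smart_open route mentioned in the comment above (the bucket and key names are placeholders, dicts stands in for the question's list, and AWS credentials are assumed to come from the environment, e.g. the Lambda execution role). Writing through smart_open streams the CSV to S3 as a multipart upload, so the full file never sits in memory:

import csv

from smart_open import open as s3_open  # pip install smart_open[s3]

dicts = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]  # stand-in data

with s3_open("s3://mybucket/key.csv", "w") as fout:
    writer = csv.DictWriter(fout, fieldnames=dicts[0].keys())
    writer.writeheader()
    for row in dicts:
        writer.writerow(row)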
