How to convert S3 bucket content(.csv format) into a dataframe in AWS Lambda

Question

I am trying to ingest S3 data (a CSV file) into RDS (MSSQL) through Lambda. Sample code:

s3 = boto3.client('s3')
    if event:
        file_obj = event["Records"][0]
        bucketname = str(file_obj["s3"]["bucket"]["name"])
        csv_filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
        print("Filename: ", csv_filename)
        csv_fileObj = s3.get_object(Bucket=bucketname, Key=csv_filename)
        file_content = csv_fileObj["Body"].read().decode("utf-8").split()

I tried putting the CSV contents into a list, but it didn't work.

        results = []
        for row in csv.DictReader(file_content):
            results.append(row.values())
        print(results)
        print(file_content)
        return {
            'statusCode': 200,
            'body': json.dumps('S3 file processed')
        }

Is there any way I could convert "file_content" into a dataframe in Lambda? I have multiple columns to load.

Later, I would follow this approach to load the data into RDS:

import pyodbc
import pandas as pd
# insert data from csv file into dataframe(df).
server = 'yourservername' 
database = 'AdventureWorks' 
username = 'username' 
password = 'yourpassword' 
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
# Insert Dataframe into SQL Server:
for index, row in df.iterrows():
    cursor.execute("INSERT INTO HumanResources.DepartmentTest (DepartmentID,Name,GroupName) values(?,?,?)", row.DepartmentID, row.Name, row.GroupName)
cnxn.commit()
cursor.close()

Can anyone suggest how to go about it?

Side-note: Your code only processes the first record sent to the Lambda function (event["Records"][0]). Multiple event records can be sent to the Lambda function, so your code should loop through and process each record.
– John Rotenstein, Commented Feb 3, 2022 at 10:47
Why do you particularly want to use Dataframes? The AWS Lambda function can read the CSV file directly and generate the SQL commands.
– John Rotenstein, Commented Feb 4, 2022 at 0:32
I tried creating a list, but it didn't work (I have updated my question above), so I tried creating a dataframe instead. Could you please suggest anything else?
– adey27, Commented Feb 4, 2022 at 0:35
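
One likely reason the list attempt in the question fails is the call to `.split()`, which breaks the decoded text on every run of whitespace instead of on line endings, so any field containing a space is cut apart before `csv.DictReader` sees it. The sketch below illustrates this with a made-up sample string standing in for the decoded S3 object body (the sample data is an assumption, not taken from the question's file):

```python
import csv
import io

# Hypothetical sample standing in for the decoded S3 object body.
text = "DepartmentID,Name,GroupName\n1,Tool Design,Research and Development\n"

# str.split() breaks on every whitespace run, so fields with spaces are cut apart.
broken = list(csv.DictReader(text.split()))

# Wrapping the text in io.StringIO lets csv.DictReader iterate over real lines.
rows = list(csv.DictReader(io.StringIO(text)))

print(len(broken), len(rows))
print(rows[0])
```

In the Lambda code, this would correspond to dropping `.split()` and passing `io.StringIO(decoded_text)` to `csv.DictReader`.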

Simon Hawe · Accepted Answer · 2022-02-03 13:25:11Z

1

You can use io.BytesIO to load the bytes into memory and then use pandas read_csv to turn them into a dataframe. Note that Python's SSL internals have a limit that causes problems when downloading more than 2 GB at once. That is why I read the body in chunks in the code below.

import io
import pandas as pd

obj = s3.get_object(Bucket=bucketname, Key=csv_filename)
# Read the body in 1 GiB chunks to avoid the 2 GB download limit in Python's SSL internals
chunks = obj["Body"].iter_chunks(chunk_size=1024**3)
data = io.BytesIO(b"".join(chunks))  # This keeps everything fully in memory
df = pd.read_csv(data)  # you can also pass any necessary read_csv args and kwargs here
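
The chunk-joining step can be tried locally without S3. The following is a minimal sketch, assuming a hypothetical `FakeBody` class that mimics only the `iter_chunks` method of the S3 response body, and a hypothetical helper `body_to_buffer`. It parses the buffer with the standard `csv` module so that it runs without pandas; in the answer's code, the same buffer is what gets passed to `pd.read_csv`.

```python
import csv
import io


class FakeBody:
    """Hypothetical stand-in for the S3 response body; only iter_chunks is mimicked."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def iter_chunks(self, chunk_size=1024):
        # Yield the payload in consecutive slices of at most chunk_size bytes.
        for start in range(0, len(self._payload), chunk_size):
            yield self._payload[start:start + chunk_size]


def body_to_buffer(body, chunk_size=1024**3):
    """Join all chunks into one in-memory buffer (the whole file must fit in RAM)."""
    return io.BytesIO(b"".join(body.iter_chunks(chunk_size=chunk_size)))


payload = b"DepartmentID,Name\n1,Engineering\n2,Tool Design\n"
buffer = body_to_buffer(FakeBody(payload), chunk_size=8)

# Decode the bytes as UTF-8 text and parse them row by row.
rows = list(csv.reader(io.TextIOWrapper(buffer, encoding="utf-8", newline="")))
print(rows)
```

Because the joined bytes and the parsed result both live in memory, the function's memory setting has to leave room for more than the raw file size.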

2 Comments

adey27 Over a year ago

Hi @simon, I tried your solution above, but I am getting an error saying "[ERROR] MemoryError" on this line: "data = io.BytesIO(b"".join(chunks))". My CSV file is only 12 MB.

adey27 Over a year ago

It worked. I had to increase the Lambda function's memory.

John Rotenstein · Accepted Answer · 2022-02-04 00:37:56Z

1

It appears that your goal is to load the contents of a CSV file from Amazon S3 into SQL Server.

You could do this without using Dataframes:

- Loop through the Event Records (multiple can be passed in)
- For each object:
  - Download the object to /tmp/
  - Use Python's csv.reader to loop through the contents of the file
  - Generate INSERT statements to insert the data into the SQL Server table
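
The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names (`iter_s3_objects`, `rows_from_csv`, `load_event`) are hypothetical, the table and column names are borrowed from the question, `s3_client` is assumed to be a boto3 S3 client, and `cursor` is assumed to be a pyodbc cursor. It uses `csv.DictReader` and a parameterized INSERT rather than building SQL strings by hand.

```python
import csv
import os
from urllib.parse import unquote_plus

# Table and column names are taken from the question; adjust them to your schema.
INSERT_SQL = (
    "INSERT INTO HumanResources.DepartmentTest (DepartmentID, Name, GroupName) "
    "VALUES (?, ?, ?)"
)


def iter_s3_objects(event):
    """Yield (bucket, key) for every record in the S3 event, not just the first."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


def rows_from_csv(path):
    """Read a downloaded CSV file and return one parameter tuple per data row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return [(r["DepartmentID"], r["Name"], r["GroupName"]) for r in reader]


def load_event(event, s3_client, cursor, tmp_dir="/tmp"):
    """Download each object to tmp_dir and insert its rows through the cursor."""
    for bucket, key in iter_s3_objects(event):
        local_path = os.path.join(tmp_dir, os.path.basename(key))
        s3_client.download_file(bucket, key, local_path)
        rows = rows_from_csv(local_path)
        if rows:  # skip empty files so executemany is never called with no rows
            cursor.executemany(INSERT_SQL, rows)
```

In a handler, this might be called as `load_event(event, boto3.client("s3"), cnxn.cursor())` followed by `cnxn.commit()`, where `cnxn` is the pyodbc connection from the question.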

You might also consider using aws-data-wrangler: Pandas on AWS, which is available as a Lambda Layer.

3 Comments

adey27 Over a year ago

Hi @John, yes, my goal is to load the contents of a CSV file from Amazon S3 into RDS MSSQL Server. I am unable to perform the last two steps mentioned above and am not sure how to do them. Could you please assist? This is new to me.

John Rotenstein Over a year ago

You are welcome to create a new Question, show your code and provide details of the problem you are experiencing.

adey27 Over a year ago

The problem is fixed. I am able to load the S3 contents into RDS.
