
I have around 50 CSV files, each with a different structure and each with close to 1,000 columns. I am using DictReader to merge the files locally, but it is taking too much time. My approach was to merge 1.csv and 2.csv to create 12.csv, then merge 12.csv with 3.csv, and so on. This is not the right approach.

import csv

for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.DictReader(f_in)  # Uses the field names in this file
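
For reference, a single-pass alternative to the pairwise merging would be to collect the union of all column names first and then write everything once with csv.DictWriter. A rough sketch, where the output filename is a placeholder:

import csv

all_fields = []  # union of column names across all files, in first-seen order
for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        for name in csv.DictReader(f_in).fieldnames or []:
            if name not in all_fields:
                all_fields.append(name)

with open("merged.csv", "w", newline="") as f_out:  # placeholder output name
    writer = csv.DictWriter(f_out, fieldnames=all_fields, restval="")
    writer.writeheader()
    for filename in inputs:
        with open(filename, "r", newline="") as f_in:
            for row in csv.DictReader(f_in):
                writer.writerow(row)  # columns missing from this file stay empty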

Since I ultimately have to upload this huge single CSV to AWS, I was thinking about a better AWS-based solution. Any suggestions on how I can import these CSV files with different structures and merge them in AWS?

1 Answer

Launch an EMR cluster and merge the files with Apache Spark. This gives you complete control over the schema. This answer, for example, might help.
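
For illustration, a minimal PySpark sketch of that kind of merge might look like the following. The bucket and prefix names are placeholders, and unionByName(..., allowMissingColumns=True) needs Spark 3.1 or later; columns missing from a given file come through as nulls.

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-csvs").getOrCreate()

# Placeholder layout: 50 files under one S3 prefix.
paths = [f"s3://my-bucket/raw/{i}.csv" for i in range(1, 51)]
frames = [spark.read.csv(p, header=True, inferSchema=True) for p in paths]

# Union on column names, filling columns a file does not have with nulls.
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)

# Write a single CSV part file back to S3.
merged.coalesce(1).write.csv("s3://my-bucket/merged/", header=True, mode="overwrite")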

Alternatively, you can try your luck and see how AWS Glue handles the multiple schemas when you create a crawler.
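
If you want to script the Glue side instead of clicking through the console, a rough boto3 sketch might look like this. The crawler name, IAM role ARN, database name, and S3 path are all placeholders.

import boto3

glue = boto3.client("glue")

# Point a crawler at the S3 prefix holding the raw CSV files.
glue.create_crawler(
    Name="csv-merge-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    DatabaseName="csv_merge_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/"}]},
)
glue.start_crawler(Name="csv-merge-crawler")

# The crawler creates one catalog table per schema it detects, so check the
# resulting Glue Data Catalog tables to see how it handled the 50 layouts.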

In both cases, you should copy your data to S3 first.
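
The upload step itself can be scripted with boto3 (or the AWS CLI). A small sketch, assuming the CSV files sit in one local directory; the directory and bucket names are placeholders:

from pathlib import Path
import boto3

s3 = boto3.client("s3")
for path in Path("local_csvs").glob("*.csv"):
    # Upload each file under a common prefix so EMR/Glue can read them together.
    s3.upload_file(str(path), "my-bucket", f"raw/{path.name}")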


1 Comment

Thanks for this. I think this should work. I will try it and post the outcome in this thread.
