
I have around 50 CSV files, each with a different structure and each with close to 1,000 columns. I am using DictReader to merge the files locally, but it is taking too much time. My approach was to merge 1.csv and 2.csv to create 12.csv, then merge 12.csv with 3.csv, and so on. This is not the right approach.

import csv

for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.DictReader(f_in)  # Uses the field names in this file
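
For reference, a single-pass alternative to the pairwise merging would be to collect the union of all column names first and then write everything once with csv.DictWriter. A rough sketch, where the output filename is a placeholder:

import csv

all_fields = []  # union of column names across all files, in first-seen order
for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        for name in csv.DictReader(f_in).fieldnames or []:
            if name not in all_fields:
                all_fields.append(name)

with open("merged.csv", "w", newline="") as f_out:  # placeholder output name
    writer = csv.DictWriter(f_out, fieldnames=all_fields, restval="")
    writer.writeheader()
    for filename in inputs:
        with open(filename, "r", newline="") as f_in:
            for row in csv.DictReader(f_in):
                writer.writerow(row)  # columns missing from this file stay empty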

Since I ultimately have to upload this huge single CSV to AWS, I was thinking about a better AWS-based solution. Any suggestions on how I can import these CSV files with different structures and merge them in AWS?

1 Answer

Launch an EMR cluster and merge the files with Apache Spark. This gives you complete control over the schema. This answer, for example, might help.
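
For illustration, a minimal PySpark sketch of that kind of merge might look like the following. The bucket and prefix names are placeholders, and unionByName(..., allowMissingColumns=True) needs Spark 3.1 or later; columns missing from a given file come through as nulls.

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-csvs").getOrCreate()

# Placeholder layout: 50 files under one S3 prefix.
paths = [f"s3://my-bucket/raw/{i}.csv" for i in range(1, 51)]
frames = [spark.read.csv(p, header=True, inferSchema=True) for p in paths]

# Union on column names, filling columns a file does not have with nulls.
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)

# Write a single CSV part file back to S3.
merged.coalesce(1).write.csv("s3://my-bucket/merged/", header=True, mode="overwrite")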

Alternatively, you can try your luck and see how AWS Glue handles the multiple schemas when you create a crawler.
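
If you want to script the Glue side instead of clicking through the console, a rough boto3 sketch might look like this. The crawler name, IAM role ARN, database name, and S3 path are all placeholders.

import boto3

glue = boto3.client("glue")

# Point a crawler at the S3 prefix holding the raw CSV files.
glue.create_crawler(
    Name="csv-merge-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    DatabaseName="csv_merge_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/"}]},
)
glue.start_crawler(Name="csv-merge-crawler")

# The crawler creates one catalog table per schema it detects, so check the
# resulting Glue Data Catalog tables to see how it handled the 50 layouts.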

In both cases, you should copy your data to S3 first.
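
The upload step itself can be scripted with boto3 (or the AWS CLI). A small sketch, assuming the CSV files sit in one local directory; the directory and bucket names are placeholders:

from pathlib import Path
import boto3

s3 = boto3.client("s3")
for path in Path("local_csvs").glob("*.csv"):
    # Upload each file under a common prefix so EMR/Glue can read them together.
    s3.upload_file(str(path), "my-bucket", f"raw/{path.name}")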


1 Comment

Thanks for this. I think this should work. I will try it and post the outcome in this thread.
