1

I have multiple CSV files of time series data - each file is for each day of the month. Every file has one datetime column followed by multiple other columns. I want to merge (not pd.concat) all the files into one data frame. How can I loop to select the files from the folder and merge it?

Basically, what I am trying to do is, read the first 2 CSV files and merge it into a dataframe ('df_merged'), then pick the third csv file and merge it with the dataframe created and then pick the fourth csv file and merge it to the dataframe. Could you please tell me how I can do it using a loop? Note : The first merge should not be a outer merge.

What I have so far is without the for loop. Code for just 4 files

## Reading the csv files
df_file_one = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-01.csv")
df_file_two = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-02.csv")
df_file_three = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-03.csv")
df_file_four = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-04.csv")

#Merging the files
df_merged = pd.merge(df_file_one, df_file_two, left_index=True, right_index=True) # The first merge should not be outer merge.
df_merged = pd.merge(df_merged, df_file_three, left_index=True, right_index=True,how='outer')
df_merged = pd.merge(df_merged, df_file_four, left_index=True, right_index=True, how = 'outer')
3
  • You should first work through the Python tutorial if not done yet. Commented Oct 11, 2022 at 10:00
  • Does the order of files matter? Commented Oct 11, 2022 at 10:32
  • @Vitalizzare , yes, the order of the files matter since they are time series data. Commented Oct 11, 2022 at 10:36

1 Answer 1

1

If the order of files doesn't matter then we can loop through the files like this:

from glob import glob
import pandas as pd

files = glob("C:/Users/Desktop/2022/prices_*.csv")
df = pd.merge(
    pd.read_csv(files.pop()),
    pd.read_csv(files.pop()),
    left_index=True, 
    right_index=True
)

while files:
    df = pd.merge(
        df, 
        pd.read_csv(files.pop()), 
        left_index=True, 
        right_index=True,
        how='outer'
    )

If the order is meaningful but all you have is the files for a single month named like prices_Jan_2022-01-02.csv then we can use the natural lexicographic order like this:

files = sorted(glob(...), reverse=True)

If more then one month is presented, but the name pattern is still the same, then we can use the yyyy-mm-dd pattern at the end of names as a key to sort files:

files = sorted(glob(...), key=lambda f: f[-14, -4], reverse=True)

All the other code stay the same.

Sign up to request clarification or add additional context in comments.

2 Comments

Thank you but when I tried this, using pop(), starts by using the last csv from the folder. Is there a way to start by picking the first csv in the folder?
Thank you very much! This works perfectly. Although I noticed that I get a error on the 'read_csv()' saying that read_csv not a defined Pylance. Also, if I change it to -> pd.merge( pd.read_csv(files.pop() ), it gives an error.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.