Loop pandas directory

Question

I have many CSV files in a directory, each with two columns:

miRNA  read_counts  
miR1      10
miR1      5
miR2      2
miR2      3
miR3     100

I would like to sum read_counts if the miRNA id is the same.

Result:

miRNA  read_counts  
miR1      15
miR2      5
miR3     100

To do that I wrote a little script. However, I don't know how to loop it over all my CSV files so that I don't have to copy and paste the file names and outputs each time. Any help would be much appreciated. Thanks!

import pandas as pd

df = pd.read_csv("modified_LC1a_miRNA_expressed.csv")
df_new = df.groupby('miRNA')['read_count'].sum()
print(df_new)
df_new.to_csv('sum_LC1a_miRNA_expressed.csv')

Asif Ali · Accepted Answer · 2019-11-28 18:11:41Z

1

Try looking into the glob module.

from glob import glob
import os

import pandas as pd

path = "./your/path"
# Collect every .csv file in the directory
files = glob(os.path.join(path, "*.csv"))

dataframes = []
for file in files:
    df = pd.read_csv(file)
    # keep each DataFrame so they can be combined afterwards
    dataframes.append(df)

Then, use pd.concat to join the dataframes and perform the groupby operation.
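A minimal sketch of that combined approach, assuming every file has the miRNA and read_count columns used in the question's script. The function name sum_reads_across_files is made up for illustration:

```python
import os
from glob import glob

import pandas as pd


def sum_reads_across_files(path):
    """Concatenate every CSV in `path` and sum read_count per miRNA (hypothetical helper)."""
    files = sorted(glob(os.path.join(path, "*.csv")))
    if not files:
        raise FileNotFoundError(f"no .csv files found in {path!r}")
    # Stack all files into one long table, then aggregate once
    combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    return combined.groupby("miRNA")["read_count"].sum()
```

Note that this gives a single total across all files, so it only fits if one combined result is what you are after.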

EDIT 1: Based on the request mentioned in the comment:

results = {}
for file in files:
    df = pd.read_csv(file)
    # perform operation
    df_new = df.groupby('miRNA')['read_count'].sum()
    results[file] = df_new
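The dictionary only holds the results in memory. To get one output file per input, each entry still has to be written out. A sketch of that step, assuming results maps each input path to its summed Series as above; write_results and the sum_ prefix on the original file name are illustrative choices, not something from the question:

```python
import os

import pandas as pd


def write_results(results, prefix="sum_"):
    """Write each Series in `results` to a prefixed CSV beside its input file (hypothetical helper)."""
    written = []
    for file, df_new in results.items():
        # Keep the directory, put the prefix in front of the file name only
        folder, name = os.path.split(file)
        out_file = os.path.join(folder, prefix + name)
        df_new.to_csv(out_file, header=True)
        written.append(out_file)
    return written
```

Calling write_results(results) after the loop would then produce, for example, sum_modified_LC1a_miRNA_expressed.csv next to modified_LC1a_miRNA_expressed.csv.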

edited Nov 28, 2019 at 18:11

answered Nov 28, 2019 at 17:57

Asif Ali


6 Comments

Amaranta_Remedios Over a year ago

Thanks, However, I don't really want to get one unique file. I want to perform the same operation in each file and give me an individual output for each of them.

Asif Ali Over a year ago

Well, in that case you would want to perform the operation inside the loop and append the results to a list or keep them in a dictionary! That said, the question as asked was about reading multiple files, not about performing the operation.

Asif Ali Over a year ago

I have added the changes requested; I hope this answers your question.

Amaranta_Remedios Over a year ago

Somehow it's not working; I'm going through it. I get no output. I'll keep trying. Thanks a lot!

Amaranta_Remedios Over a year ago

results = {}
for file in files:
    df = pd.read_csv(file)
    # perform operation
    df_new = df.groupby('miRNA')['read_count'].sum()
    results[file] = df_new
    df_new.to.csv()


Haliaetus · Accepted Answer · 2019-11-28 18:58:31Z

0

Not trying to steal the answer. I would have put this in a comment under @Asif Ali's answer if I had enough rep.

Assuming all input .csv files follow the format: "modified_{rest_of_the_file_name}.csv"

And you want the outputs to be: "sum_{same_rest_of_the_file_name}.csv"

import os
import glob

import pandas as pd

path = "./your/path"
files = glob.glob(os.path.join(path, "*.csv"))

for file in files:
    df = pd.read_csv(file)
    df_new = df.groupby('miRNA')['read_count'].sum()
    print(df_new)
    # Swap the "modified" prefix of the file name for "sum", leaving the directory untouched
    folder, name = os.path.split(file)
    df_new.to_csv(os.path.join(folder, name.replace('modified', 'sum', 1)))

edited Nov 28, 2019 at 18:58

answered Nov 28, 2019 at 18:20

Haliaetus


6 Comments

Amaranta_Remedios Over a year ago

I am trying it, but it is not giving me any output either. I'll keep looking to see if I have made a mistake and update here. Thanks a lot!

Haliaetus Over a year ago

I adjusted the last line to be more universal to the file path. Try it now.

Amaranta_Remedios Over a year ago

Still not working, no output. Your code does seem good and understandable, so I don't know what the problem with my machine is. Even when I tried just printing the output I got nothing.

Amaranta_Remedios Over a year ago

This is what I got line by line:

import pandas as pd
import os
from glob import glob

path = "./Users/user/Desktop/2019.11.28_for_DESEQ2"
files = glob(os.path.join(path, "*.csv"))

for file in files:
    df = pd.read_csv(file)
    df_new = df.groupby('miRNA')['read_count'].sum()
    print(df_new)
    df_new.to_csv(file.split('modified')[0] + \
                  'sum' + \
                  '_'.join(file.split('modified')[1:]))

Haliaetus Over a year ago

Try removing the dot from the beginning of your file path.
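The likely reason for the silence is that a path starting with "./" is resolved relative to the current working directory, so glob matches nothing, returns an empty list without raising, and the loop body never runs. A small check along these lines would make that visible; find_csv_files is a hypothetical helper, not part of the glob module:

```python
import os
from glob import glob


def find_csv_files(path):
    """Return the CSV files in `path`, failing loudly if there are none (hypothetical helper)."""
    pattern = os.path.join(path, "*.csv")
    files = sorted(glob(pattern))
    if not files:
        # glob returns an empty list, not an error, when nothing matches
        raise FileNotFoundError(
            f"no files match {os.path.abspath(pattern)!r}; check the path"
        )
    return files
```

The error message shows the absolute pattern that was actually searched, which makes a wrong relative path easy to spot.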
