I have many CSV files in a directory, each with two columns:

miRNA  read_counts  
miR1      10
miR1      5
miR2      2
miR2      3
miR3     100

I would like to sum read_counts if the miRNA id is the same.

Result:

miRNA  read_counts  
miR1      15
miR2      5
miR3     100

To do that I wrote a small script. However, I don't know how to loop it over all my CSV files so that I don't have to copy and paste the input and output file names each time. Any help would be much appreciated. Thanks!

import pandas as pd

df = pd.read_csv("modified_LC1a_miRNA_expressed.csv")
df_new = df.groupby('miRNA')['read_count'].sum()
print(df_new)
df_new.to_csv('sum_LC1a_miRNA_expressed.csv')

2 Answers

Try looking into the glob module.

from glob import glob
import os

import pandas as pd

path = "./your/path"
files = glob(os.path.join(path, "*.csv"))

dataframes = []
for file in files:
    df = pd.read_csv(file)
    # collect each DataFrame so they can be combined afterwards
    dataframes.append(df)

Then, use pd.concat to join the dataframes and perform the groupby operation.
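
As a rough sketch of that concat-then-groupby step (the helper name `sum_counts_across_files` and the demo files are made up for illustration, and the column names `miRNA` and `read_count` are assumed to match the question's script), it could look something like this:

```python
import glob
import os
import tempfile

import pandas as pd


def sum_counts_across_files(path, key="miRNA", value="read_count"):
    """Concatenate every CSV in `path` and sum `value` per `key`."""
    files = sorted(glob.glob(os.path.join(path, "*.csv")))
    if not files:
        # glob returns an empty list for a wrong path instead of raising
        raise FileNotFoundError(f"no CSV files found in {path!r}")
    combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
    return combined.groupby(key)[value].sum()


# Hypothetical demo data written to a temporary directory
demo_dir = tempfile.mkdtemp()
pd.DataFrame({"miRNA": ["miR1", "miR1", "miR2"], "read_count": [10, 5, 2]}).to_csv(
    os.path.join(demo_dir, "a.csv"), index=False
)
pd.DataFrame({"miRNA": ["miR2", "miR3"], "read_count": [3, 100]}).to_csv(
    os.path.join(demo_dir, "b.csv"), index=False
)

totals = sum_counts_across_files(demo_dir)
print(totals)
```

Note that this produces one combined total across all files, which is only what you want if the files are meant to be pooled.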

EDIT 1: Based on the request mentioned in the comment:

results = {}
for file in files:
    df = pd.read_csv(file)
    # perform operation
    df_new = df.groupby('miRNA')['read_count'].sum()
    results[file] = df_new

6 Comments

Thanks. However, I don't really want to end up with a single file. I want to perform the same operation on each file and get an individual output for each of them.
Well, in that case you would want to perform the operation inside the loop and append the results or keep them in a dictionary! Nevertheless, the question as asked was about reading multiple files, not about performing the operation.
I added the requested changes; I hope this answers your question.
Somehow it's not working; I'm going through it. I get no output. I'll keep trying. Thanks a lot!
results = {}
for file in files:
    df = pd.read_csv(file)
    # perform operation
    df_new = df.groupby('miRNA')['read_count'].sum()
    results[file] = df_new
    df_new.to.csv()

Not trying to steal the answer. I would have put this in a comment under @Asif Ali's answer if I had enough rep.

Assuming all input .csv files follow the format: "modified_{rest_of_the_file_name}.csv"

And you want the outputs to be: "sum_{same_rest_of_the_file_name}.csv"

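import os
import glob

import pandas as pd

path = "./your/path"
# Match only the input files so earlier sum_*.csv outputs are not re-read
files = glob.glob(os.path.join(path, "modified_*.csv"))

for file in files:
    df = pd.read_csv(file)
    df_new = df.groupby('miRNA')['read_count'].sum()
    print(df_new)
    # Swap the "modified" prefix for "sum", keeping the directory unchanged
    folder, name = os.path.split(file)
    df_new.to_csv(os.path.join(folder, name.replace('modified', 'sum', 1)))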

6 Comments

I am trying it, but it is not giving me any output either. I'll keep checking whether I made a mistake and update here. Thanks a lot!
I adjusted the last line to be more universal to the file path. Try it now.
Still not working, no output. Your code seems good and understandable, so I don't know what the problem with my machine is. Even when I tried just printing the output I got nothing.
This is what I have, line by line:
import pandas as pd
import os
from glob import glob
path = "./Users/user/Desktop/2019.11.28_for_DESEQ2"
files = glob(os.path.join(path, "*.csv"))
for file in files:
    df = pd.read_csv(file)
    df_new = df.groupby('miRNA')['read_count'].sum()
    print(df_new)
    df_new.to_csv(file.split('modified')[0] + \
                  'sum' + \
                  '_'.join(file.split('modified')[1:]))
Try removing the dot from the beginning of your file path.
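
The symptom in this thread (no output at all, not even from print) is consistent with glob matching nothing: for a path that does not exist it returns an empty list rather than raising, so the loop body never runs. A minimal sketch of a check that fails loudly instead, where `list_csv_files` is a hypothetical helper name and not part of glob or pandas:

```python
import glob
import os


def list_csv_files(path):
    """Return the CSV files in `path`, raising if the path or files are missing."""
    full_path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(full_path):
        raise FileNotFoundError(f"directory does not exist: {full_path}")
    files = sorted(glob.glob(os.path.join(full_path, "*.csv")))
    if not files:
        raise FileNotFoundError(f"no .csv files in: {full_path}")
    return files


# A relative path such as "./Users/..." is resolved against the current
# working directory, so printing the absolute form shows where glob looks.
print(os.path.abspath("./Users/user/Desktop"))

# A non-existent directory gives an empty list, not an error.
print(glob.glob(os.path.join("./no/such/dir", "*.csv")))
```

Printing the absolute path makes it easy to see whether the leading dot is sending the search to the wrong place.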