Process multiple csv files on pandas [duplicate]

Question

I have three different .csv files which contain the grades for students in three different assignments. I would like to read them with pandas and calculate the average for each student. The template for each file is:

Student id, Mark, extra fields, ...
4358975489, 9,  ... ...
2345234523, 10,  ... ...
7634565323, 7,  ... ...
7653563366, 7,  ... ...
...         ...,  ... ...

For the second assignment:

Student id, Mark, extra fields, ...
4358975489, 6,  ... ...
2345234523, 8,  ... ...
7634565323, 4,  ... ...
7653563366, 5,  ... ...
...         ...,  ... ...

Desired output for these two files, for instance:

Student id, average, extra fields, ...
4358975489, 7.5,  ... ...
2345234523, 9,  ... ...
7634565323, 5.5,  ... ...
7653563366, 6,  ... ...
...         ...,  ... ...

The same applies to the last file. I want to read these files separately and, for each student id, average the Mark.

Now, my code for reading the files is the following:

import pdb
import pandas

i_df1 = pandas.read_csv('first.csv')
i_df2 = pandas.read_csv('second.csv')
i_df3 = pandas.read_csv('third.csv')

print(i_df1.keys())
for i, row in i_df1.iterrows():
    pdb.set_trace()  # stop at each row to inspect it

How can I process all three files simultaneously and extract the average grade?

pd.concat([i_df1, i_df2, i_df3]).groupby('Student id').mean()
– Quang Hoang, Commented Nov 9, 2020 at 14:30
Would be great if you can show some sample data with expected output.
– Mayank Porwal, Commented Nov 9, 2020 at 14:30

Mayank Porwal · Accepted Answer · 2020-11-09 14:51:33Z

1

If you want to process all dfs together, you can do this:

df = pd.concat([i_df1, i_df2, i_df3]).groupby('Student id', as_index=False)['Mark'].mean()
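As a quick check of the combined approach against the sample marks from the question, here is a minimal sketch. The two frames are built in memory instead of being read from first.csv and second.csv, and the names first, second, combined and averages are only illustrative.

```python
import pandas as pd

# Marks copied from the two sample files shown in the question
ids = [4358975489, 2345234523, 7634565323, 7653563366]
first = pd.DataFrame({"Student id": ids, "Mark": [9, 10, 7, 7]})
second = pd.DataFrame({"Student id": ids, "Mark": [6, 8, 4, 5]})

# Stack the rows of both assignments, then average Mark per student
combined = pd.concat([first, second], ignore_index=True)
averages = combined.groupby("Student id", as_index=False)["Mark"].mean()
print(averages)
```

The result has one row per student, sorted by Student id, and the averages match the desired output in the question (7.5, 9, 5.5 and 6).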

OR:

If you want to process each file separately instead, you can use a list comprehension with groupby and mean:

Below is the list of your dfs:

df_list = [i_df1, i_df2, i_df3]

You can find the average of each student within each file and store the outputs in another list:

df = [i.groupby('Student id', as_index=False)['Mark'].mean() for i in df_list]

edited Nov 9, 2020 at 14:51

answered Nov 9, 2020 at 14:42

Mayank Porwal


4 Comments

Jose Ramon Over a year ago

I want to average the Mark though and not the student id.

Mayank Porwal Over a year ago

Just to avoid confusion, can you please add some sample dataframes with the final expected output? It would really help; otherwise we will keep going back and forth.

Jose Ramon Over a year ago

Hey I just did so.

Mayank Porwal Over a year ago

@JoseRamon Great. My first answer is exactly what you need. Try it with just 2 data frames, like this: df = pd.concat([i_df1, i_df2]).groupby('Student id')['Mark'].mean()

Mehdi Golzadeh · Accepted Answer · 2020-11-09 14:45:05Z

1

Use pd.concat to concatenate the three DataFrames:

import pandas as pd

i_df1 = pd.read_csv('first.csv')
i_df2 = pd.read_csv('second.csv')
i_df3 = pd.read_csv('third.csv')

# Stack all rows, then average Mark per student
df = pd.concat([i_df1, i_df2, i_df3])
df.groupby('Student id').agg({'Mark': 'mean'})
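The dict form of agg can also carry an extra field through to the output, which the desired output in the question seems to want. A small sketch, assuming a hypothetical extra column called Name (the question does not say what the extra fields are) and in-memory frames in place of the CSV files:

```python
import pandas as pd

# "Name" is a hypothetical extra field; the marks come from the question
i_df1 = pd.DataFrame({"Student id": [4358975489, 2345234523],
                      "Mark": [9, 10],
                      "Name": ["A", "B"]})
i_df2 = pd.DataFrame({"Student id": [4358975489, 2345234523],
                      "Mark": [6, 8],
                      "Name": ["A", "B"]})

df = pd.concat([i_df1, i_df2], ignore_index=True)

# Average the marks; keep the first Name seen for each student
summary = df.groupby("Student id", as_index=False).agg({"Mark": "mean", "Name": "first"})
print(summary)
```

Taking "first" is only sensible if the extra field is the same in every file for a given student; otherwise a different aggregation would be needed.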

edited Nov 9, 2020 at 14:45

answered Nov 9, 2020 at 14:35

Mehdi Golzadeh


4 Comments

Quang Hoang Over a year ago

I don't think .agg({'Mark','mean'}) would work. Did you mean .agg({'Mark':'mean'})?

Jose Ramon Over a year ago

df = pd.concat([i_df1, i_df2, i_df3]) returns for df just i_df1

Jose Ramon Over a year ago

Still, after the groupby and the averaging, I am getting just the i_df1 result.

Mehdi Golzadeh Over a year ago

Are you sure? I've already tested it and it works fine.
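One way to settle whether the concatenated frame really contains all three inputs is to compare row counts and label each row with its source. This is only a sketch: the frames below are tiny made-up stand-ins (student ids 1 and 2), not the asker's data, and the labels "first", "second" and "third" are arbitrary.

```python
import pandas as pd

# Made-up stand-ins for the three files
i_df1 = pd.DataFrame({"Student id": [1, 2], "Mark": [9, 10]})
i_df2 = pd.DataFrame({"Student id": [1, 2], "Mark": [6, 8]})
i_df3 = pd.DataFrame({"Student id": [1, 2], "Mark": [3, 6]})

# keys= adds an outer index level recording which frame each row came from
df = pd.concat([i_df1, i_df2, i_df3], keys=["first", "second", "third"])

# The combined length should equal the sum of the input lengths
print(len(df), len(i_df1) + len(i_df2) + len(i_df3))

# Number of rows contributed by each source
print(df.index.get_level_values(0).value_counts())

result = df.groupby("Student id")["Mark"].mean()
print(result)
```

If the counts show rows from every source but the averages still look like the first file, the column names in the other files are worth checking, for example with print(i_df2.columns).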

KenHBS · Accepted Answer · 2020-11-09 14:45:58Z

1

You could also use the filenames and directly concatenate the dataframes together:

import pandas as pd

fnames = ["first.csv", "second.csv", "third.csv"]
# Read each file and stack the rows into one DataFrame
df = pd.concat(pd.read_csv(fname) for fname in fnames)

df.groupby("Student id")["Mark"].mean()

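Probably irrelevant in practice, but perhaps nice to know anyway: with this approach the per-file DataFrames are temporaries rather than named variables, so once the concatenation finishes only the combined frame remains in memory and the data is not kept twice.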
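One detail that may matter here: the sample rows in the question have a space after each comma. If the real files look like that too, read_csv will name the second column " Mark" with a leading space, and selecting "Mark" would fail. A minimal sketch of the same read-and-concatenate idea with skipinitialspace=True, using in-memory strings as stand-ins for first.csv and second.csv (the extra fields are left out):

```python
import io
import pandas as pd

# In-memory stand-ins for the files, with a space after each comma
first_csv = "Student id, Mark\n4358975489, 9\n2345234523, 10\n"
second_csv = "Student id, Mark\n4358975489, 6\n2345234523, 8\n"

# Default parsing keeps the leading space in the second column name
raw = pd.read_csv(io.StringIO(first_csv))
print(list(raw.columns))

# skipinitialspace=True drops the whitespace that follows each delimiter
sources = [first_csv, second_csv]
frames = [pd.read_csv(io.StringIO(text), skipinitialspace=True) for text in sources]
df = pd.concat(frames, ignore_index=True)

avg = df.groupby("Student id")["Mark"].mean()
print(avg)
```

With real files, io.StringIO(text) would simply be replaced by the filename.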

answered Nov 9, 2020 at 14:45

KenHBS


Kim Rop · Accepted Answer · 2020-11-09 14:48:37Z

1

Using the data you gave, and assuming the students get the same marks all three times:

import pandas as pd

# Sample rows from the question: [student id, mark]
rows = [
    [4358975489, 9],
    [2345234523, 10],
    [7634565323, 7],
]

# Pretend all three assignments have the same marks
data = pd.DataFrame(rows, columns=["student", "mark"])
data1 = data.copy()
data2 = data.copy()

# Stack the three frames and average the mark per student
std_marks = pd.concat([data, data1, data2]).groupby("student")
print(std_marks["mark"].mean())

answered Nov 9, 2020 at 14:48

Kim Rop
