
I have 60 HUGE csv files (around 2.5 GB each). Each covers data for one month and has a 'distance' column I am interested in. Each has around 14 million rows.

I need to find the average distance for each month.

This is what I have so far:

import pandas as pd
for x in range(1, 60):
    df=pd.read_csv(r'x.csv', error_bad_lines=False, chunksize=100000)
    for chunk in df:
        print df["distance"].mean()

First I know 'print' is not a good idea. I need to assign the mean to a variable I guess. Second, what I need is the average for the whole dataframe and not just each chunk.

But I don't know how to do that. I was thinking of getting the average of each chunk and taking the simple average of all the chunks. That should give me the average for the whole dataframe, as long as the chunksize is equal for all chunks.

Third, I need to do this for all 60 csv files. Is my loop in the code above correct for that? My files are named 1.csv to 60.csv.

  • Keep track of the aggregate sum of distances and line count; then divide. Also, if speed is an issue, consider looking at something like this: (stackoverflow.com/questions/3122442/…) Commented Dec 6, 2016 at 1:58
  • Do you want to do the job only in Python, or can you use GNU/Linux tools like sed and awk? Commented Dec 6, 2016 at 2:11
  • Sorry, not familiar with sed and awk. Would prefer Python, if possible. Commented Dec 6, 2016 at 2:39
  • Have a look at the 'usecols' and 'squeeze' arguments to pd.read_csv(). No sense loading columns that you are not using, right? Commented Dec 6, 2016 at 2:53
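Putting the comments' suggestions together (a running sum and count, plus `usecols` so only the needed column is loaded), a minimal sketch might look like this; the function name and chunk size are illustrative, not from the question:

```python
import pandas as pd

def monthly_mean(path, chunksize=100_000):
    """Mean of the 'distance' column via a running sum and row count,
    so the whole file never has to fit in memory at once."""
    total = 0.0
    count = 0
    # usecols skips every column except the one we need
    for chunk in pd.read_csv(path, usecols=["distance"], chunksize=chunksize):
        total += chunk["distance"].sum()
        count += len(chunk)
    return total / count

# One average per monthly file, named 1.csv ... 60.csv as in the question:
# means = {m: monthly_mean(str(m) + ".csv") for m in range(1, 61)}
```

Note that newer pandas versions replaced `error_bad_lines=False` with `on_bad_lines='skip'`, so that argument is omitted from the sketch.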

2 Answers


A few things I would fix, based on how your files are named. I presume your files are named like "1.csv", "2.csv". Also remember that range excludes the end value, so you need range(1, 61) to reach 60.csv.

import pandas as pd

distance_array = []
for x in range(1, 61):
    # chunksize returns an iterator of DataFrames, not a single DataFrame
    reader = pd.read_csv(str(x) + ".csv", error_bad_lines=False, chunksize=100000)
    for chunk in reader:
        distance_array.extend(chunk['distance'])
print(sum(distance_array) / len(distance_array))

I am presuming that the datasets are too large to load into memory as a pandas dataframe. If that is the case, consider using a generator on each csv file, something similar to this: Where to use yield in Python best?

As the overall result you are after is an average, you can accumulate the total sum of the distances and the row count incrementally as you read each row, then divide at the end.
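A minimal sketch of that generator idea, using the standard-library csv module rather than pandas (the names `distances` and `running_mean` are illustrative):

```python
import csv

def distances(path):
    """Yield the 'distance' value of each row, one row at a time."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield float(row["distance"])

def running_mean(values):
    """Accumulate a running sum and count; divide once at the end."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count

# e.g. one result per monthly file:
# means = {m: running_mean(distances(str(m) + ".csv")) for m in range(1, 61)}
```

Because the generator yields one row at a time, memory use stays constant no matter how large each monthly file is.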
