
Let's say I have a 10 GB CSV file and I want to get summary statistics for it using the DataFrame describe method.

In this case I first need to create a DataFrame for all 10 GB of CSV data.

import pandas as pd

text_csv = pd.read_csv("target.csv")
df = pd.DataFrame(text_csv)
df.describe()

Does this mean all 10 GB will get loaded into memory in order to calculate the statistics?

There are options to iterate over the CSV by setting chunksize=XX or setting iterator=True, but then you will need to do the aggregation of the statistics yourself. Commented Feb 23, 2016 at 6:56

2 Answers


Yes, I think you are right. And you can omit df = pd.DataFrame(text_csv), because the output of read_csv is already a DataFrame:

import pandas as pd

df = pd.read_csv("target.csv")
print(df.describe())

Or you can use dask:

import dask.dataframe as dd

df = dd.read_csv('target.csv')

print(df.describe().compute())  # dask is lazy, so compute() triggers the actual work
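For a 10 GB file this is attractive because dask reads the CSV in blocks and computes the statistics out of core, so the whole file never has to fit in memory at once; only the small result of .compute() is materialized. As far as I know, the percentiles in dask's describe are computed approximately.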

You can use the chunksize parameter of read_csv, but then you get a TextFileReader object rather than a DataFrame, so you need concat to rebuild one:

import pandas as pd
import io

temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
# after testing, replace io.StringIO(temp) with the filename
# chunksize=2 is only for testing; use a larger value for real data
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)
<pandas.io.parsers.TextFileReader object at 0x000000001995ADA0>
df = pd.concat(tp, ignore_index=True)
print(df.describe())
               a           b
count  15.000000   15.000000
mean    3.333333  527.600000
std     1.877181    5.082182
min     1.000000  519.000000
25%     2.000000  524.500000
50%     3.000000  528.000000
75%     5.000000  531.500000
max     6.000000  535.000000
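Note, however, that pd.concat rebuilds the complete DataFrame in memory, so this does not reduce peak memory usage for a 10 GB file; it only changes how the data is read, not how much of it ends up in RAM.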

You can also convert each chunk from the TextFileReader to a DataFrame and describe it separately, but aggregating these partial outputs can be difficult:

import pandas as pd
import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""

# after testing, replace io.StringIO(temp) with the filename
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)

dfs = []
for t in tp:
    df = pd.DataFrame(t)  # each chunk is already a DataFrame, so this step is optional
    df1 = df.describe()
    dfs.append(df1.T)

df2 = pd.concat(dfs)
print(df2)
   count   mean        std  min     25%    50%     75%  max
a      2    1.0   0.000000    1    1.00    1.0    1.00    1
b      2  525.5   0.707107  525  525.25  525.5  525.75  526
a      2    1.5   0.707107    1    1.25    1.5    1.75    2
b      2  530.0   4.242641  527  528.50  530.0  531.50  533
a      2    2.0   0.000000    2    2.00    2.0    2.00    2
b      2  530.0   2.828427  528  529.00  530.0  531.00  532
a      2    3.0   0.000000    3    3.00    3.0    3.00    3
b      2  526.5  10.606602  519  522.75  526.5  530.25  534
a      2    3.5   0.707107    3    3.25    3.5    3.75    4
b      2  532.5   3.535534  530  531.25  532.5  533.75  535
a      2    5.0   0.000000    5    5.00    5.0    5.00    5
b      2  530.0   1.414214  529  529.50  530.0  530.50  531
a      2    6.0   0.000000    6    6.00    6.0    6.00    6
b      2  520.5   0.707107  520  520.25  520.5  520.75  521
a      1    6.0        NaN    6    6.00    6.0    6.00    6
b      1  524.0        NaN  524  524.00  524.0  524.00  524
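When you do need to combine such partial results: count, min and max combine exactly across chunks, mean can be recombined as a count-weighted average, and std can be pooled from the per-chunk counts, means, and stds; the percentiles, however, cannot be recovered exactly from per-chunk summaries alone. A running-totals sketch is shown under the second answer below.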



It seems there is no file-size limitation for the pandas.read_csv method.

According to @fickludd's and @Sebastian Raschka's answers in Large, persistent DataFrame in pandas, you can use iterator=True and chunksize=xxx to load the giant CSV file in pieces and calculate the statistics you want:

import pandas as pd

reader = pd.read_csv('some_data.csv', iterator=True, chunksize=1000)  # gives a TextFileReader, iterable in chunks of 1000 rows
partial_descs = [chunk.describe() for chunk in reader]  # describe each chunk; the reader itself has no describe method

Then aggregate all the partial describe output yourself, for example as in the sketch below.
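Here is a minimal sketch of such an aggregation, assuming a numeric CSV (the file name 'some_data.csv' and the chunksize are placeholders): it keeps running counts, sums, sums of squares, minima, and maxima per column, then derives mean and std at the end. Percentiles are left out because they cannot be combined exactly this way.

import pandas as pd
import numpy as np

count = total = total_sq = col_min = col_max = None

for chunk in pd.read_csv('some_data.csv', chunksize=1000):
    num = chunk.select_dtypes(include=[np.number])  # describe only numeric columns
    c, s, sq = num.count(), num.sum(), (num ** 2).sum()
    mn, mx = num.min(), num.max()
    if count is None:
        count, total, total_sq, col_min, col_max = c, s, sq, mn, mx
    else:
        count += c
        total += s
        total_sq += sq
        col_min = np.minimum(col_min, mn)
        col_max = np.maximum(col_max, mx)

mean = total / count
# sample std from running sums; this form can lose precision on large values,
# Welford's online algorithm is the numerically stable alternative
std = ((total_sq - count * mean ** 2) / (count - 1)) ** 0.5

summary = pd.DataFrame({'count': count, 'mean': mean, 'std': std,
                        'min': col_min, 'max': col_max})
print(summary)

Only one chunk is in memory at a time, so peak memory stays bounded regardless of the file size.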

1 Comment

Hmmm, there is an error if you call describe on the TextFileReader itself: AttributeError: 'TextFileReader' object has no attribute 'describe'
