
I have CSV files like the ones below.

file1

A B
1 2
3 4

file2

A B
1 2

file3

A B
1 2
3 4
5 6

I would like to count the rows in all the CSV files.

I tried

f = pd.read_csv(file1)

f.shape

But when I have a lot of CSV files, it takes too much time.

I would like to get a result like below:

      rows
file1  2
file2  1
file3  3

How can I get this result?



You can create a dict of the line counts of all the files and then pass it to Series; for a DataFrame, add to_frame:

import glob
import pandas as pd

files = glob.glob('files/*.csv')

d = {f: sum(1 for line in open(f)) for f in files}

print(pd.Series(d))

print(pd.Series(d).rename('rows').rename_axis('filename').reset_index())

open does not guarantee that the file will be closed properly, so here is another solution:

def file_len(fname):
    # iterate over the file to count its lines; `with` guarantees it is closed
    i = -1  # so an empty file returns 0
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

d = {f: file_len(f) for f in files}
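
To get exactly the DataFrame shown in the question, you can combine either dict with to_frame, as mentioned above. A small sketch (note that the raw line counts include each file's header row, so subtract 1 if you only want the data rows):

import pandas as pd

d = {f: file_len(f) for f in files}
df = pd.Series(d).rename_axis('filename').to_frame('rows')

# the counts above include the header line of each CSV;
# subtract 1 to match the data-row counts shown in the question
df['rows'] = df['rows'] - 1
print(df)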

2 Comments

Better to use a for loop than open in a list comprehension :)
@Claudio - sure, I delete them too.

On *nix systems, and if you can do it outside of Python:

wc -l *.csv

Should do the trick.
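
If you need the result back in pandas, a minimal sketch that shells out to wc from Python (it assumes a *nix system and filenames without spaces):

import glob
import subprocess
import pandas as pd

files = glob.glob('files/*.csv')
# `wc -l FILE` prints "<count> <filename>"; keep only the first field
d = {f: int(subprocess.getoutput('wc -l ' + f).split()[0]) for f in files}
print(pd.Series(d, name='rows'))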

5 Comments

subprocess.getoutput("wc -l " + fileName).split()[0] is about three times faster than sed -n '$=', BUT ... it doesn't count the last line in the file if that line doesn't end with LF (line feed) ...
Do you know how to extract the last char from a file just as fast, so that it is possible to correct the line count from wc -l by +1 if the last char is not LF and +0 if it is?
The POSIX definition of a line is "A sequence of zero or more non-<newline> characters plus a terminating <newline> character." I don't have an immediate idea on how to efficiently treat files that don't end in a newline...
The md5deep tool I am using creates non-POSIX files whose significant last line is not terminated by a newline, so skipping the last line from consideration isn't an option; in such cases it is also necessary to check for EOF. Anyway, don't all the methods that read lines from a file also return the last, non-LF-terminated one, not sticking to the POSIX definition? Hmmm ... it would not be the first time the intuitive understanding of things has to be set aside in order to 'agree' with what is ...
If a CSV file has a field which allows multiple lines in it, this would give an incorrect result.
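
To address the trailing-newline caveat raised in the comments above, one possible sketch is to check only the last byte of the file and adjust the wc -l count (a hypothetical helper; it assumes filenames without spaces):

import os
import subprocess

def wc_l_exact(fname):
    # `wc -l` counts newline characters, so a final line without '\n' is missed
    n = int(subprocess.getoutput('wc -l ' + fname).split()[0])
    if os.path.getsize(fname) > 0:
        with open(fname, 'rb') as f:
            f.seek(-1, os.SEEK_END)   # jump to the very last byte
            if f.read(1) != b'\n':    # no terminating newline -> add the missed line
                n += 1
    return n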

For the sake of completeness, and as a kind of summary of all that was said about speed and proper opening/closing of files, here is a solution that works fast and doesn't need much fancy code. It is limited to *nix systems(?), but I think a similar technique can be used on other systems too.

The code below runs a tiny bit faster than rawincount() and also counts a last line which doesn't have a '\n' at the end (a problem rawincount() has):

import glob, subprocess, pandas

files = glob.glob('files/*.csv')
# `sed -n '$='` prints the number of the last line, i.e. the line count
d = {f: subprocess.getoutput("sed -n '$=' " + f) for f in files}
print(pandas.Series(d))
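
Note that subprocess.getoutput returns a string, so if you need numeric row counts (e.g. to sum or sort them) you may want to convert them; a small sketch (the `or 0` handles empty files, for which sed prints nothing):

rows = pandas.Series({f: int(subprocess.getoutput("sed -n '$=' " + f) or 0) for f in files})
print(rows)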

P.S. Here are some timings I ran on a set of large text files (39 files with a total size of 3.7 GB, Linux Mint 18.1, Python 3.6). Fascinating here is the timing of the proposed wc -l *.csv method:

    Results of TIMING functions for getting number of lines in a file:
    -----------------------------------------------------------------
            getNoOfLinesInFileUsing_bash_wc :  1.04  !!! doesn't count last non empty line
          getNoOfLinesInFileUsing_bash_grep :  1.59
  getNoOfLinesInFileUsing_mmapWhileReadline :  2.75
           getNoOfLinesInFileUsing_bash_sed :  3.42
 getNoOfLinesInFileUsing_bytearrayCountLF_B :  3.90  !!! doesn't count last non empty line
          getNoOfLinesInFileUsing_enumerate :  4.37
      getNoOfLinesInFileUsing_forLineInFile :  4.49
  getNoOfLinesInFileUsing_sum1ForLineInFile :  4.82      
 getNoOfLinesInFileUsing_bytearrayCountLF_A :  5.30  !!! doesn't count last non empty line
     getNoOfLinesInFileUsing_lenListFileObj :  6.02
           getNoOfLinesInFileUsing_bash_awk :  8.61



The solutions provided so far are not the quickest when working with very large CSVs. Also, using open() in a comprehension does not guarantee that the file will be closed properly, unlike when using with (see this question). So, combining that with the insights from this question for speed:

from itertools import takewhile, repeat

def rawincount(filename):
    # read the file in 1 MiB binary chunks and count the newline bytes
    with open(filename, 'rb') as f:
        bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)

And applying the solution provided by @jezrael:

import glob
import pandas as pd

files = glob.glob('files/*.csv')
d = {f: rawincount(f) for f in files}
df = pd.Series(d).to_frame('rows')

2 Comments

There are two issues with it: #1: it does not return the number of lines in the file (counting '\n's doesn't do that). #2: The speed is not really an issue. In my tests: 4.7 sec. for the method used in the question, 3.8 sec. for your function, and 4.3 sec. when using for i, l in enumerate(f): pass. Anyway, I am glad you provided this here. By the way: the mapcount method (in the link you provided) gives 2.7 sec. on my box (Python 3.6, Linux Mint 18.1).
:) Thanks for checking. On Linux I would wonder if calling subprocess with wc -l wouldn't be even quicker.

Try this,

it adds an entry with each file name and its number of rows, and the columns have appropriate labels:

import os
import pandas as pd

df = pd.DataFrame(columns=('file_name', 'rows'))
# only read the .csv files in the current directory
csv_files = [name for name in os.listdir('.') if name.endswith('.csv')]
for index, name in enumerate(csv_files):
    df.loc[index] = [name, len(pd.read_csv(name).index)]

