
I have CSV files like the ones below.

file1

A B
1 2
3 4

file2

A B
1 2

file3

A B
1 2
3 4
5 6

I would like to count the rows in all the CSV files.

I tried

f = pd.read_csv(file1)

f.shape

But when I have a lot of CSV files, it takes too much time.

I would like to get a result like below:

      rows
file1  2
file2  1
file3  3

How can I get this result?



You can create a dict of the line counts of all the files and then pass it to Series; for a DataFrame, add to_frame:

import glob
import pandas as pd

files = glob.glob('files/*.csv')

d = {f: sum(1 for line in open(f)) for f in files}

print(pd.Series(d))

print(pd.Series(d).rename('rows').rename_axis('filename').reset_index())

open does not guarantee that the file will be closed properly, so here is another solution:

def file_len(fname):
    # iterate over the file to count its lines; `with` guarantees it is closed
    i = -1  # so an empty file returns 0
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

d = {f: file_len(f) for f in files}
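
To get exactly the DataFrame shown in the question, you can combine either dict with to_frame, as mentioned above. A small sketch (note that the raw line counts include each file's header row, so subtract 1 if you only want the data rows):

import pandas as pd

d = {f: file_len(f) for f in files}
df = pd.Series(d).rename_axis('filename').to_frame('rows')

# the counts above include the header line of each CSV;
# subtract 1 to match the data-row counts shown in the question
df['rows'] = df['rows'] - 1
print(df)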

2 Comments

Better to use a for loop than open in a list comprehension :)
@Claudio - sure, I delete them too.

On *nix systems, and if you can do it outside of Python:

wc -l *.csv

Should do the trick.
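
If you need the result back in pandas, a minimal sketch that shells out to wc from Python (it assumes a *nix system and filenames without spaces):

import glob
import subprocess
import pandas as pd

files = glob.glob('files/*.csv')
# `wc -l FILE` prints "<count> <filename>"; keep only the first field
d = {f: int(subprocess.getoutput('wc -l ' + f).split()[0]) for f in files}
print(pd.Series(d, name='rows'))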

5 Comments

subprocess.getoutput("wc -l " + fileName).split()[0] is about three times faster than sed -n '$=', BUT ... it doesn't count the last line in the file if that line doesn't end with LF (line feed) ...
Do you know how to extract the last char from a file just as fast, so that it is possible to correct the line count from wc -l by +1 if the last char is not LF and +0 if it is?
The POSIX definition of a line is "A sequence of zero or more non-<newline> characters plus a terminating <newline> character." I don't have an immediate idea on how to efficiently treat files that don't end in a newline...
The md5deep tool I am using creates non-POSIX files whose significant last line is not terminated by a newline, so skipping the last line from consideration isn't an option; in such cases it is also necessary to check for EOF. Anyway, don't all the methods that read lines from a file also return the last, non-LF-terminated one, not sticking to the POSIX definition? Hmmm ... it would not be the first time the intuitive understanding of things has to be set aside in order to 'agree' with what is ...
If a CSV file has a field which allows multiple lines in it, this would give an incorrect result.
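
To address the trailing-newline caveat raised in the comments above, one possible sketch is to check only the last byte of the file and adjust the wc -l count (a hypothetical helper; it assumes filenames without spaces):

import os
import subprocess

def wc_l_exact(fname):
    # `wc -l` counts newline characters, so a final line without '\n' is missed
    n = int(subprocess.getoutput('wc -l ' + fname).split()[0])
    if os.path.getsize(fname) > 0:
        with open(fname, 'rb') as f:
            f.seek(-1, os.SEEK_END)   # jump to the very last byte
            if f.read(1) != b'\n':    # no terminating newline -> add the missed line
                n += 1
    return n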

For the sake of completeness, and as a kind of summary of all that was said about speed and proper opening/closing of files, here is a solution that works fast and doesn't need much fancy code. It is limited to *nix systems(?), but I think a similar technique can be used on other systems too.

The code below runs a tiny bit faster than rawincount() and also counts a last line which doesn't have a '\n' at the end (a problem rawincount() has):

import glob, subprocess, pandas

files = glob.glob('files/*.csv')
# `sed -n '$='` prints the number of the last line, i.e. the line count
d = {f: subprocess.getoutput("sed -n '$=' " + f) for f in files}
print(pandas.Series(d))
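
Note that subprocess.getoutput returns a string, so if you need numeric row counts (e.g. to sum or sort them) you may want to convert them; a small sketch (the `or 0` handles empty files, for which sed prints nothing):

rows = pandas.Series({f: int(subprocess.getoutput("sed -n '$=' " + f) or 0) for f in files})
print(rows)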

P.S. Here are some timings I ran on a set of large text files (39 files with a total size of 3.7 GB, Linux Mint 18.1, Python 3.6). Fascinating here is the timing of the proposed wc -l *.csv method:

    Results of TIMING functions for getting number of lines in a file:
    -----------------------------------------------------------------
            getNoOfLinesInFileUsing_bash_wc :  1.04  !!! doesn't count last non empty line
          getNoOfLinesInFileUsing_bash_grep :  1.59
  getNoOfLinesInFileUsing_mmapWhileReadline :  2.75
           getNoOfLinesInFileUsing_bash_sed :  3.42
 getNoOfLinesInFileUsing_bytearrayCountLF_B :  3.90  !!! doesn't count last non empty line
          getNoOfLinesInFileUsing_enumerate :  4.37
      getNoOfLinesInFileUsing_forLineInFile :  4.49
  getNoOfLinesInFileUsing_sum1ForLineInFile :  4.82      
 getNoOfLinesInFileUsing_bytearrayCountLF_A :  5.30  !!! doesn't count last non empty line
     getNoOfLinesInFileUsing_lenListFileObj :  6.02
           getNoOfLinesInFileUsing_bash_awk :  8.61



The solutions provided so far are not the quickest when working with very large CSVs. Also, using open() in a comprehension does not guarantee that the file will be closed properly, unlike when using with (see this question). So, combining that with the insights from this question for speed:

from itertools import takewhile, repeat

def rawincount(filename):
    # read the file in 1 MiB binary chunks and count the newline bytes
    with open(filename, 'rb') as f:
        bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)

And applying the solution provided by @jezrael:

import glob
import pandas as pd

files = glob.glob('files/*.csv')
d = {f: rawincount(f) for f in files}
df = pd.Series(d).to_frame('rows')

2 Comments

There are two issues with it: #1: it does not return the number of lines in the file (counting '\n's doesn't do that). #2: The speed is not really an issue. In my tests: 4.7 sec. for the method used in the question, 3.8 sec. for your function, and 4.3 sec. when using for i, l in enumerate(f): pass. Anyway, I am glad you provided this here. By the way: the mapcount method (in the link you provided) gives 2.7 sec. on my box (Python 3.6, Linux Mint 18.1).
:) Thanks for checking. On Linux I would wonder if calling subprocess with wc -l wouldn't be even quicker.

Try this,

it adds an entry with each file name and its number of rows, and the columns have appropriate labels:

import os
import pandas as pd

df = pd.DataFrame(columns=('file_name', 'rows'))
# only read the .csv files in the current directory
csv_files = [name for name in os.listdir('.') if name.endswith('.csv')]
for index, name in enumerate(csv_files):
    df.loc[index] = [name, len(pd.read_csv(name).index)]

