I have a CSV file that has only one column. I want to extract the number of rows. When I run the code below:

import pandas as pd
df = pd.read_csv('data.csv')
print(df)

I get the following output:

[65422771 rows x 1 columns]

But when I run the code below:

file = open("data.csv")
numline = len(file.readlines())
print (numline)

I get the following output:

130845543

What is the correct number of rows in my csv file? What is the difference between the two outputs?

  • What does df.shape[0] return? Commented Feb 18, 2021 at 21:09
  • df.shape[0] returns 65422771 Commented Feb 18, 2021 at 21:15
  • Given that the read_csv parameter skip_blank_lines is True by default, I'm guessing you have many blank lines in the CSV file, per @Giovannirison's answer below. An answer to this is going to need a sample of what is in the CSV. Commented Feb 18, 2021 at 21:29

1 Answer

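Is it possible that you have an empty line after each entry? The readlines count is almost exactly double the number of rows in the pandas DataFrame (130845543 = 2 × 65422771 + 1; the one extra line may be the header row). pandas skips empty lines by default, while readlines counts them.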

To check the number of empty lines, try:

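import sys
import csv

# Raise the per-field size limit so very long fields do not trigger _csv.Error
csv.field_size_limit(sys.maxsize)

empty_lines = 0
with open('data.csv', newline='') as data:
    for line in csv.reader(data):
        # csv.reader yields an empty list for a blank line
        if not line:
            empty_lines += 1

print(empty_lines)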
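
As a cross-check that does not depend on CSV parsing at all, here is a minimal sketch that streams the file and tallies total, blank, and non-blank lines, so the two counts from the question can be compared directly. The helper name `count_lines` and the small sample file are hypothetical and used only for illustration; for the real data, pass 'data.csv' instead.

```python
import os
import tempfile


def count_lines(path):
    """Return (total, blank, non_blank) line counts, reading one line at a time."""
    total = 0
    blank = 0
    with open(path) as handle:
        for line in handle:
            total += 1
            # Assumption: a line holding only whitespace counts as blank here
            if not line.strip():
                blank += 1
    return total, blank, total - blank


# Hypothetical sample: one header line, three values, two blank lines
sample = "value\n1\n\n2\n\n3\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)

total, blank, non_blank = count_lines(tmp.name)
os.remove(tmp.name)

print(total, blank, non_blank)
```

With the pandas defaults (first line treated as the header, skip_blank_lines=True), non_blank minus 1 should match df.shape[0], assuming no quoted field contains an embedded newline. Iterating over the file handle also avoids holding every line in memory, which readlines() does.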

4 Comments

There are empty rows but not after each entry.
@BNoor Try running the code above to check the number of empty lines in your CSV.
_csv.Error: field larger than field limit (131072)
@BNoor Updated, but that error means that your CSV is badly formatted.
