
So I've got about 5008 rows in a CSV file, 5009 in total with the header. I'm creating and writing this file all within the same script. But when I read it back at the end, with either pandas' pd.read_csv or Python 3's csv module, and print the length, it outputs 4967. I checked the file for any weird characters that might be confusing Python but don't see any. All the data is delimited by commas.

I also opened it in Sublime and it shows 5009 rows, not 4967.
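A quick way to see this symptom in isolation (a minimal sketch with made-up data, not the actual out.csv) is to compare the raw line count of a file with the number of rows the csv parser yields. An unclosed quote makes the parser merge the following physical lines into one logical row, so the parsed count drops:

```python
import csv

# Made-up sample: the second line opens a quote that never closes.
sample = 'a,b,c\n1,"oops,2\n3,4,5\n'
with open("sample.csv", "w", newline="") as f:
    f.write(sample)

with open("sample.csv", newline="") as f:
    raw_lines = sum(1 for _ in f)                 # physical lines: 3

with open("sample.csv", newline="") as f:
    parsed_rows = sum(1 for _ in csv.reader(f))   # parser merges lines 2-3: 2

print(raw_lines, parsed_rows)  # 3 2
```

If the two numbers differ on the real file, stray quote characters are the usual culprit.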

I could try other pandas methods like merge or concat, but if Python won't read the CSV correctly, that's no use.

This is one method I tried:

import csv
import pandas as pd

df1 = pd.read_csv('out.csv', quoting=csv.QUOTE_NONE, error_bad_lines=False)
df2 = pd.read_excel(xlsfile)

print(len(df1))  # 4967
print(len(df2))  # 5008

df2['Location'] = df1['Location']
df2['Sublocation'] = df1['Sublocation']
df2['Zone'] = df1['Zone']
df2['Subnet Type'] = df1['Subnet Type']
df2['Description'] = df1['Description']

newfile = input("Enter a name for the combined csv file: ")
print('Saving to new csv file...')
df2.to_csv(newfile, index=False)
print('Done.')

target.close()  # target is the file handle from the (omitted) writing code

Another way I tried:

import xlrd

dfcsv = pd.read_csv('out.csv')

wb = xlrd.open_workbook(xlsfile)
ws = wb.sheet_by_index(0)
xlsdata = []
for rx in range(ws.nrows):
    xlsdata.append(ws.row_values(rx))

print(len(dfcsv))  # 4967
print(len(xlsdata))  # 5009

df1 = pd.DataFrame(data=dfcsv)
df2 = pd.DataFrame(data=xlsdata)

df3 = pd.concat([df2, df1], axis=1)

newfile = input("Enter a name for the combined csv file: ")
print('Saving to new csv file...')
df3.to_csv(newfile, index=False)
print('Done.')

target.close()  # target is the file handle from the (omitted) writing code

But no matter what way I try, the CSV file is the actual issue; Python is writing it correctly but not reading it correctly.

Edit: The weirdest part is that I'm getting absolutely no encoding errors, or any errors at all, when running the code...

Edit 2: Tried testing it with the nrows param in the first code example; it works up to 4000 rows. As soon as I specify 5000 rows, it reads only 4967.

Edit 3: I manually saved a CSV file with my data instead of using the one written by the program, and it read 5008 rows. Why is Python not writing the CSV file correctly?
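The write-side suspicion in Edit 3 can be checked with a small round-trip. This is a hedged sketch with made-up rows, not the question's actual writing code: if the file is written by joining strings with commas by hand, a field containing a quote or comma corrupts the row, whereas csv.writer escapes such fields automatically:

```python
import csv
import io

# Made-up rows; the second one contains a literal quote character.
rows = [["id", "desc"], ["1", 'alpha-1" fragment'], ["2", "plain"]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # QUOTE_MINIMAL quotes and doubles the stray "

# Reading the result back recovers the original rows exactly.
parsed = list(csv.reader(io.StringIO(buf.getvalue())))
print(len(parsed))  # 3
```

If a hand-rolled `target.write(','.join(fields) + '\n')` is used instead, the stray quote survives unescaped and merges rows on read.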

  • Are you sure every line ends with a proper newline? Did you try error_bad_lines=True? Commented Aug 9, 2016 at 14:58
  • Are you sure that the source file does not contain any encoding errors? Can you open it with open() without any errors? Commented Aug 9, 2016 at 14:59
  • Are you sure there are no (quoted/escaped) newlines in the middle of a field? Commented Aug 9, 2016 at 14:59
  • @Tommy Yes, the way I'm writing the CSV in the script, each row of data ends in a newline. Commented Aug 9, 2016 at 15:09
  • @DaVinci what happens when error_bad_lines=True? Commented Aug 9, 2016 at 15:10

2 Answers


I ran into this issue also. I realized that some of my lines had open-ended quotes, which for some reason interfered with the reader.

So for example, some rows were written as:

GO:0000026  molecular_function  "alpha-1
GO:0000027  biological_process  ribosomal large subunit assembly
GO:0000033  molecular_function  "alpha-1

and this led to rows being read incorrectly. (Unfortunately, I don't know enough about how csv.reader works to tell you why. Hopefully someone can clarify the quote behavior!)

I just removed the quotes and it worked out.

Edited: This option works too, if you want to keep the quotes:

quotechar=None
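To illustrate why the open-ended quotes merge rows, and why disabling quote handling fixes it, here is a minimal sketch using rows shaped like the ones above (tab-delimited, made up for the demo). With quoting=csv.QUOTE_NONE the reader treats quote characters as ordinary data, so an unclosed quote no longer swallows the following lines:

```python
import csv
import io

# Sample modeled on the rows above: the first row's last field opens a
# quote that never closes.
data = ('GO:0000026\tmolecular_function\t"alpha-1\n'
        'GO:0000027\tbiological_process\tribosomal large subunit assembly\n')

# Default quoting: the unclosed quote consumes the rest of the input.
default_rows = list(csv.reader(io.StringIO(data), delimiter="\t"))

# QUOTE_NONE: quotes are kept as literal characters, one row per line.
raw_rows = list(csv.reader(io.StringIO(data), delimiter="\t",
                           quoting=csv.QUOTE_NONE))

print(len(default_rows), len(raw_rows))  # 1 2
```

With QUOTE_NONE the stray quote simply stays in the field value (`'"alpha-1'`) instead of changing how subsequent lines are parsed.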



My best guess without seeing the file is that some lines have too many or too few commas, perhaps due to values like foo,bar.

Please try setting error_bad_lines=True (see the pandas documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to see if it catches the lines with errors in them; my guess is that there will be 41 such lines.

error_bad_lines : boolean, default True. Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned. (Only valid with C parser)
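If pandas still reports nothing, you can locate suspect rows yourself with the csv module by flagging any row whose field count differs from the header's. This is a sketch against a made-up file, with the file name and layout assumed for illustration:

```python
import csv

# Made-up sample: line 3 has too few fields, line 4 has too many.
sample = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"
with open("check.csv", "w", newline="") as f:
    f.write(sample)

with open("check.csv", newline="") as f:
    rows = list(csv.reader(f))

width = len(rows[0])  # expected field count, taken from the header
bad = [(i, r) for i, r in enumerate(rows[1:], start=2) if len(r) != width]
print(bad)  # 1-based line numbers of rows with the wrong field count
```

On the real out.csv this prints the exact lines to inspect for stray delimiters or quotes.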

When writing, the csv.QUOTE_NONE option tells the writer not to quote fields and to escape the delimiter with escapechar instead (you didn't paste your writing code); on read, it makes the parser treat quote characters as ordinary data. See https://docs.python.org/3/library/csv.html#csv.Dialect

3 Comments

  • I did try setting that to True; however, the len of the csv still outputs as 4967.
  • @DaVinci do you have any values with your delimiter in them?
  • I checked for that and no, no data values with commas in them.
