
I am struggling with the csv module. I have a sample CSV file with 5000 lines (each line contains 7 values, 0 or 1) plus a header row. I want to iterate through the file in read mode while rewriting it in write mode with a new column of values (prediction), but the iteration stops after the 478th row, as in this sample code:

import csv
import random


def input_to_csv():

    prediction = [round(random.uniform(0, 1), 0) for _ in range(1, 5000)]

    combined_set = list(map(str, prediction))

    export_columns = ['COLUMN ' + str(n) for n in range(1, 8)] + ['OUTPUT'] 

    rr = 0
    with open('test.csv', 'r') as input_file:

        csv_input = csv.reader(input_file)
        next(csv_input)

        with open('test.csv', 'w', newline='') as csv_file:

            writer = csv.writer(csv_file)
            writer.writerow(export_columns)

            for row in csv_input:

                rr += 1

        print(rr)

I have checked the length of the file using row_count = sum(1 for _ in input_file), which gave me 5000 lines.

1 Answer

You're opening the same file twice, once for reading and once for writing.

Because you read some data from the file before reopening it (the next() call), the reader fills a read buffer (buffered reads are the default in Python) and iterates over that buffer just fine.

However, once it reaches the end of the read buffer it goes back to the file to fetch more data, but reopening the file in "w" mode has truncated it. So the reader gets no data, assumes it has reached end of file (which is not entirely wrong) and stops.

I expect the code appeared to work as long as you stayed below Python's default buffer size (io.DEFAULT_BUFFER_SIZE, which is 8 kB on my system).
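You can check the threshold on your own system; with rows of seven single-digit values the buffer holds a few hundred of them, which is the right order of magnitude for the ~478 rows observed (a sketch, assuming \r\n line endings):

```python
import io

# One data row: seven 0/1 values separated by commas, plus a CRLF line
# ending, is 15 bytes.
bytes_per_row = len('0,1,0,1,0,1,0\r\n')

print(io.DEFAULT_BUFFER_SIZE)                   # 8192 on most systems
print(io.DEFAULT_BUFFER_SIZE // bytes_per_row)  # a few hundred rows per buffer
```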

You should write to a different file than you're reading from. Either move the file before reading from it, or open a completely different file for writing (and possibly move it afterwards).
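A minimal sketch of that second option, assuming the same test.csv layout from the question (seven 0/1 columns plus a header) and a hypothetical temporary file name:

```python
import csv
import os
import random

def add_prediction_column(src='test.csv', tmp='test.csv.tmp'):
    # Read from the original and write to a temporary sibling, so the
    # reader never sees a truncated file.
    with open(src, 'r', newline='') as input_file, \
         open(tmp, 'w', newline='') as output_file:
        reader = csv.reader(input_file)
        writer = csv.writer(output_file)

        header = next(reader)
        writer.writerow(header + ['OUTPUT'])

        for row in reader:
            # round(random.uniform(0, 1)) stands in for a real prediction
            writer.writerow(row + [str(round(random.uniform(0, 1)))])

    # Atomically replace the original with the augmented copy.
    os.replace(tmp, src)
```

Because only one row is held in memory at a time, this also scales to files much larger than RAM.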


5 Comments

That's a good approach when you work with small files and memory is not an issue. My goal is to open a file with millions of lines and add 15-20 columns while an ML algorithm is learning and produces a prediction output after each iteration. When I allocate memory for the CSV file before the algorithm starts, it occupies 2 GB, so every GB counts ;)
That's a fine answer to the second suggestion, which leaves you with the first: write to a different file than you're reading from. Either move the file before reading from it, or open a completely different file for writing (and possibly move it afterwards). That's what tools like sed -i do.
Another alternative is to store your data in something more resilient and flexible, e.g. an SQLite database or something like that.
Could you add "write to a different file than you're reading from. Either move the file before reading from it, or open a completely different file for writing (and possibly move it afterwards)." as an answer and I will mark it as accepted? That resolved my issue.
Done, I've replaced the last paragraph of the answer with that.
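The SQLite alternative mentioned in the comments can be sketched as follows; the table name, column names, and sample rows are illustrative assumptions, not from the original question:

```python
import sqlite3

# Hypothetical schema: seven 0/1 input columns plus a nullable
# prediction column that can be filled in later, row by row.
conn = sqlite3.connect(':memory:')  # use a file path for persistence
cols = ', '.join('col%d INTEGER' % n for n in range(1, 8))
conn.execute(
    'CREATE TABLE samples (id INTEGER PRIMARY KEY, %s, output INTEGER)' % cols
)

# Load the existing rows once...
rows = [(0, 1, 0, 1, 0, 1, 0), (1, 1, 1, 0, 0, 0, 1)]
conn.executemany(
    'INSERT INTO samples (col1, col2, col3, col4, col5, col6, col7) '
    'VALUES (?, ?, ?, ?, ?, ?, ?)',
    rows,
)

# ...then update predictions in place as the model produces them,
# instead of rewriting a multi-GB CSV on every pass.
conn.execute('UPDATE samples SET output = ? WHERE id = ?', (1, 1))
conn.commit()
```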
