I have as input a potentially large, gzip-compressed CSV file with a known structure. I don't know the size of the file in advance, but assume it can't fit in memory. The rows in this CSV are ordered like the following:
key1, …other fields
key1,
key1,
key1,
key2,
key2,
key3,
key4,
key4,
⋮
They are ordered by the value in the first column (let's call it the key), but it is unknown how many rows there are for each distinct key. I need to scan the whole file and process only the first N rows matching each key (some keys may have more than N rows). These N rows per key can be processed in memory.
I came up with this code, but I don't like it very much. It is a bit messy:
import gzip

def process_rows(key, rows):
    print(f'Processed rows for key {key}')

def main(file_path, N=1000):
    with gzip.GzipFile(filename=file_path) as file:
        curr_key = None
        rows_to_process = []
        for line in file:
            line = line.decode().strip()
            if len(line) == 0:
                continue
            fields = line.split(',')
            [key, field2, field3] = fields
            if curr_key is not None:
                if curr_key != key or (len(rows_to_process) > 0 and len(rows_to_process) % N == 0):
                    process_rows(curr_key, rows_to_process)
                    # Skip ahead to the next key if needed
                    while curr_key == key:
                        line = next(file, None)
                        if line is None:
                            return  # End of file, exit
                        line = line.decode().strip()
                        if len(line) < 1:
                            continue
                        fields = line.split(',')
                        [key, field2, field3] = fields
                    print('Found next key', key)
                    # Reset rows to process
                    rows_to_process = []
            curr_key = key
            rows_to_process.append([key, field2, field3])
        # Flush trailing data
        if len(rows_to_process) > 0:
            process_rows(curr_key, rows_to_process)
Is there a cleaner way to do this?