Here is a python script you can use for splitting large files using subprocess:
"""
Splits the file into the same directory and
deletes the original file
"""
import subprocess
import sys
import os
SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2' # subprocess expects a string, i.e. 2 = aa, ab, ac etc..
if __name__ == "__main__":
file_path = sys.argv[1]
# i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
os.path.dirname(file_path) + '/'])
# Remove the original file once done splitting
try:
os.remove(file_path)
except OSError:
pass
You can call it externally:
import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))
You can also import subprocess and run it directly in your program.
The issue with this approach is high memory usage: subprocess creates a fork with a memory footprint same size as your process and if your process memory is already heavy, it doubles it for the time that it runs. The same thing with os.system.
Here is another pure python way of doing this, although I haven't tested it on huge files, it's going to be slower but be leaner on memory:
CHUNK_SIZE = 5000
def yield_csv_rows(reader, chunk_size):
"""
Opens file to ingest, reads each line to return list of rows
Expects the header is already removed
Replacement for ingest_csv
:param reader: dictReader
:param chunk_size: int, chunk size
"""
chunk = []
for i, row in enumerate(reader):
if i % chunk_size == 0 and i > 0:
yield chunk
del chunk[:]
chunk.append(row)
yield chunk
with open(local_file_path, 'rb') as f:
f.readline().strip().replace('"', '')
reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
chunks = yield_csv_rows(reader, CHUNK_SIZE)
for chunk in chunks:
if not chunk:
break
# Do something with your chunk here
Here is another example using readlines():
"""
Simple example using readlines()
where the 'file' is generated via:
seq 10000 > file
"""
CHUNK_SIZE = 5
def yield_rows(reader, chunk_size):
"""
Yield row chunks
"""
chunk = []
for i, row in enumerate(reader):
if i % chunk_size == 0 and i > 0:
yield chunk
del chunk[:]
chunk.append(row)
yield chunk
def batch_operation(data):
for item in data:
print(item)
with open('file', 'r') as f:
chunks = yield_rows(f.readlines(), CHUNK_SIZE)
for _chunk in chunks:
batch_operation(_chunk)
The readlines example demonstrates how to chunk your data to pass chunks to function that expects chunks. Unfortunately readlines opens the whole file in memory, its better to use the reader example for performance. Although if you can easily fit what you need into memory and need to process it in chunks this should suffice.