37

I have a huge text file (~1GB) and sadly the text editor I use won't read such a large file. However, if I can just split it into two or three parts I'll be fine, so, as an exercise I wanted to write a program in python to do it.

What I think I want the program to do is to find the size of a file, divide that number into parts, and for each part, read up to that point in chunks, writing to a filename.nnn output file, then read up to the next line-break and write that, then close the output file, etc. Obviously the last output file just copies to the end of the input file.

Can you help me with the key filesystem related parts: filesize, reading and writing in chunks and reading to a line-break?

I'll be writing this code test-first, so there's no need to give me a complete answer, unless it's a one-liner ;-)

  • Unwelcome suggestion: get a better text editor. :-) If you're on Windows, EmEditor is one I know of that will seamlessly edit files without having to load them completely into memory. Commented Nov 15, 2008 at 13:00
  • See my answer here on how to split large text files in Python without running any Linux commands. Commented Apr 8, 2023 at 14:18

17 Answers

46

Linux has a split command:

split -l 100000 file.txt

would split it into files of 100,000 lines each.


3 Comments

And if your base OS is Windows you can get Cygwin for access to basically all of the cool command-line utilities.
Unix tools for Windows also include the split tool: split.exe
I have a 120 GB file. While using this command it gets stuck after some 1928613 lines and does not proceed any further. I was trying to do what was said in stackoverflow.com/a/291759/6143004 but the same problem is occurring.
17

Check out os.stat() for file size and file.readlines([sizehint]). Those two functions should be all you need for the reading part, and hopefully you know how to do the writing :)
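
A minimal sketch of how those two pieces might fit together (the function name, file names, and part-numbering scheme here are just placeholders, not a finished solution):

import os

def split_by_size(path, num_parts):
    # rough byte target per part; readlines(hint) rounds up to whole lines
    part_size = os.stat(path).st_size // num_parts
    with open(path) as infile:
        part = 1
        while True:
            lines = infile.readlines(part_size)
            if not lines:
                break
            with open("%s.%03d" % (path, part), "w") as outfile:
                outfile.writelines(lines)
            part += 1

Calling split_by_size("big.txt", 3) would then write big.txt.001, big.txt.002 and so on, each ending on a line break.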

1 Comment

Thanks for the answer - your suggestions are working well so far for reading the file. When I've finished, I'll also try a binary version that doesn't read one line at a time.
10

As an alternative method, using the logging library:

>>> import logging.handlers
>>> log = logging.getLogger()
>>> fh = logging.handlers.RotatingFileHandler("D://filename.txt",
...     maxBytes=2**20*100, backupCount=100)  # 100 MB each, up to a maximum of 100 files
>>> log.addHandler(fh)
>>> log.setLevel(logging.INFO)
>>> f = open("D://biglog.txt")
>>> for line in f:
...     log.info(line.rstrip("\n"))

Your files will appear as follows:

filename.txt (end of file)
filename.txt.1
filename.txt.2
...
filename.txt.10 (start of file)

This is a quick and easy way to make a huge log file match your RotatingFileHandler implementation.

1 Comment

Since it splits line by line, how can it be done faster?
10

There is now a PyPI module available that you can use to split files of any size into chunks. Check this out:

https://pypi.org/project/filesplit/

1 Comment

Does this package support splitting by number of lines? I see that it does split by a given size.
7

This generator method is a (slow) way to get a slice of lines without blowing up your memory.

import itertools

def slicefile(filename, start, end):
    lines = open(filename)
    return itertools.islice(lines, start, end)

out = open("/blah.txt", "w")
for line in slicefile("/python27/readme.txt", 10, 15):
    out.write(line)


6

Don't forget seek() and mmap() for random access to files.

import mmap

def getSomeChunk(filename, start, length):
    fobj = open(filename, 'r+b')
    m = mmap.mmap(fobj.fileno(), 0)
    return m[start:start+length]
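
A possible way to use the getSomeChunk helper above to write the pieces out (split_with_mmap is a hypothetical name; the slices fall on byte boundaries, so a line can be cut in the middle, and each call re-maps the file, which is wasteful but keeps the helper unchanged):

import os

def split_with_mmap(filename, num_parts):
    part_size = os.path.getsize(filename) // num_parts + 1
    for i in range(num_parts):
        chunk = getSomeChunk(filename, i * part_size, part_size)
        if not chunk:
            break
        with open("%s.%03d" % (filename, i + 1), "wb") as out:
            out.write(chunk)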


6

While Ryan Ginstrom's answer is correct, it does take longer than it should (as he has already noted). Here's a way to circumvent the multiple calls to itertools.islice by successively iterating over the open file descriptor:

def splitfile(infilepath, chunksize):
    fname, ext = infilepath.rsplit('.',1)
    i = 0
    written = False
    with open(infilepath) as infile:
        while True:
            outfilepath = "{}{}.{}".format(fname, i, ext)
            with open(outfilepath, 'w') as outfile:
                for line in (infile.readline() for _ in range(chunksize)):
                    outfile.write(line)
                written = bool(line)
            if not written:
                break
            i += 1
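
For example, assuming an input file named big.txt, a call like the one below would write big0.txt, big1.txt, and so on, each with up to 100,000 lines:

splitfile("big.txt", 100000)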


2

You can use wc and split (see the respective manpages) to get the desired effect. In bash:

split -dl$((`wc -l 'filename'|sed 's/ .*$//'` / 3 + 1)) filename filename-chunk.

produces 3 parts with the same line count (give or take a rounding difference in the last, of course), named filename-chunk.00 to filename-chunk.02.

3 Comments

Yes, it is not Python, but why use a screwdriver to drive a nail?
Well it's not really a screwdriver vs. nail... python often is a great way to accomplish simple tasks such as this. And I don't want to bash bash (pun intended) but that is not really... readable :)
@chrisfs: Well, in hindsight I would perhaps use awk '{print $1}' rather than the sed construction. Still, you can see fairly directly what happens: wc counts the lines, sed pulls the bare number out of the output, that number is divided by three and incremented by 1; split then produces parts of that length from filename and names them filename-chunk. plus a running number. It would of course be nice if wc had an option to output just the number directly, but it works well enough as it is.
2

I've written the program and it seems to work fine, so thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here.)
Later I may get round to doing a version that does a binary read to see if it's any quicker.

def Split(inputFile,numParts,outputName):
    fileSize=os.stat(inputFile).st_size
    parts=FileSizeParts(fileSize,numParts)
    openInputFile = open(inputFile, 'r')
    outPart=1
    for part in parts:
        if openInputFile.tell()<fileSize:
            fullOutputName=outputName+os.extsep+str(outPart)
            outPart+=1
            openOutputFile=open(fullOutputName,'w')
            openOutputFile.writelines(openInputFile.readlines(part))
            openOutputFile.close()
    openInputFile.close()
    return outPart-1


2

usage - split.py filename splitsizeinkb

import os
import sys

def getfilesize(filename):
    with open(filename, "rb") as fr:
        fr.seek(0, 2)  # move to the end of the file
        size = fr.tell()
        print("getfilesize: size: %s" % size)
        return size

def splitfile(filename, splitsize):
    # Open the original file in read-only mode
    if not os.path.isfile(filename):
        print("No such file as: \"%s\"" % filename)
        return

    filesize = getfilesize(filename)
    with open(filename, "rb") as fr:
        counter = 1
        originalfilename = filename.split(".")
        readlimit = 5000  # read 5 KB at a time
        n_splits = filesize // splitsize
        print("splitfile: No of splits required: %s" % str(n_splits))
        for i in range(n_splits + 1):
            chunks_count = int(splitsize) // int(readlimit)
            data_5kb = fr.read(readlimit)  # read
            # Create the split files
            print("chunks_count: %d" % chunks_count)
            with open(originalfilename[0] + "_{id}.".format(id=str(counter)) + originalfilename[1], "ab") as fw:
                fw.seek(0)
                fw.truncate()  # truncate the split file if it already exists
                while data_5kb:
                    fw.write(data_5kb)
                    if chunks_count:
                        chunks_count -= 1
                        data_5kb = fr.read(readlimit)
                    else:
                        break
            counter += 1

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Filename or splitsize not provided. Usage: filesplit.py filename splitsizeinkb")
    else:
        filesize = int(sys.argv[2]) * 1000  # convert KB to bytes
        filename = sys.argv[1]
        splitfile(filename, filesize)

2 Comments

Worked for me perfectly in 2017! Thanks a lot @Mudit
Can you make this code extract line by line, not character by character? Is there a way to get the number of characters in the next line?
2

Here is a python script you can use for splitting large files using subprocess:

"""
Splits the file into the same directory and
deletes the original file
"""

import subprocess
import sys
import os

SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2'  # subprocess expects a string, i.e. 2 = aa, ab, ac etc..

if __name__ == "__main__":

    file_path = sys.argv[1]
    # i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
    subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
                     os.path.dirname(file_path) + '/'])

    # Remove the original file once done splitting
    try:
        os.remove(file_path)
    except OSError:
        pass

You can call it externally:

import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))

You can also import subprocess and run it directly in your program.

The issue with this approach is high memory usage: subprocess creates a fork with a memory footprint the same size as your process, and if your process's memory is already heavy, it doubles that for as long as it runs. The same goes for os.system.

Here is another, pure-Python way of doing this. I haven't tested it on huge files; it will be slower but leaner on memory:

CHUNK_SIZE = 5000

def yield_csv_rows(reader, chunk_size):
    """
    Opens file to ingest, reads each line to return list of rows
    Expects the header is already removed
    Replacement for ingest_csv
    :param reader: dictReader
    :param chunk_size: int, chunk size
    """
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk

import unicodecsv

with open(local_file_path, 'rb') as f:
    header = f.readline().strip().replace('"', '')
    reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
    chunks = yield_csv_rows(reader, CHUNK_SIZE)
    for chunk in chunks:
        if not chunk:
            break
        # Do something with your chunk here

Here is another example using readlines():

"""
Simple example using readlines()
where the 'file' is generated via:
seq 10000 > file
"""
CHUNK_SIZE = 5


def yield_rows(reader, chunk_size):
    """
    Yield row chunks
    """
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk


def batch_operation(data):
    for item in data:
        print(item)


with open('file', 'r') as f:
    chunks = yield_rows(f.readlines(), CHUNK_SIZE)
    for _chunk in chunks:
        batch_operation(_chunk)

The readlines example demonstrates how to chunk your data so you can pass the chunks to a function that expects chunks. Unfortunately, readlines reads the whole file into memory, so it's better to use the reader example for performance. That said, if what you need fits easily into memory and you just need to process it in chunks, this should suffice.

2 Comments

The first one just calls an external Linux command; I don't see the point... For the second, readlines will read the whole file, which consumes a lot of memory; besides, why do we need another chunk loop to do this?
Using the Linux split command is faster in many cases but uses more memory since it goes through subprocess; it's all explained in the answer. The readlines example demonstrates how to chunk your data to pass chunks to a function that expects chunks.
1

You can split any file into chunks like below; here CHUNK_SIZE is 500000 bytes (500 KB) and content is the file's contents read into memory:

def get_chunk(content, size):
    for i in range(0, len(content), size):
        yield content[i:i+size]

for idx, val in enumerate(get_chunk(content, CHUNK_SIZE)):
    data = val
    index = idx
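
A sketch of where content might come from and how the chunks could be written back out (big.txt is a placeholder name, and reading the whole file at once only works if it fits in memory):

with open("big.txt", "rb") as f:   # placeholder input file
    content = f.read()             # whole file in memory

for idx, chunk in enumerate(get_chunk(content, CHUNK_SIZE)):
    with open("big.txt.%03d" % idx, "wb") as out:
        out.write(chunk)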


0

This worked for me

import os

fil = "inputfile"
outfil = "outputfile"

f = open(fil,'r')

numbits = 1000000000

for i in range(0, os.stat(fil).st_size // numbits + 1):
    o = open(outfil + str(i), 'w')
    segment = f.readlines(numbits)  # readlines() keeps each line's trailing newline
    o.writelines(segment)
    o.close()

f.close()


0

I had a requirement to split csv files for import into Dynamics CRM, since the file size limit for import is 8 MB and the files we receive are much larger. This program lets the user enter FileNames and LinesPerFile, and then splits the specified files into chunks of the requested number of lines. I can't believe how fast it works!

# user input FileNames and LinesPerFile
FileCount = 1
FileNames = []
while True:
    FileName = raw_input('File Name ' + str(FileCount) + ' (enter "Done" after last File):')
    FileCount = FileCount + 1
    if FileName == 'Done':
        break
    else:
        FileNames.append(FileName)
LinesPerFile = raw_input('Lines Per File:')
LinesPerFile = int(LinesPerFile)

for FileName in FileNames:
    File = open(FileName)

    # get Header row
    for Line in File:
        Header = Line
        break

    FileCount = 0
    Linecount = 1
    for Line in File:

        #skip Header in File
        if Line == Header:
            continue

        #create NewFile with Header every [LinesPerFile] Lines
        if Linecount % LinesPerFile == 1:
            FileCount = FileCount + 1
            NewFileName = FileName[:FileName.find('.')] + '-Part' + str(FileCount) + FileName[FileName.find('.'):]
            NewFile = open(NewFileName,'w')
            NewFile.write(Header)

        NewFile.write(Line)
        Linecount = Linecount + 1

    NewFile.close()


0
import subprocess
subprocess.run('split -l number_of_lines file_path', shell = True)

For example, if you want 50,000 lines per file and the path is /home/data, then you can run the command below:

subprocess.run('split -l 50000 /home/data', shell = True)

If you are not sure how many lines to keep in each split file but know how many splits you want, then in a Jupyter Notebook or shell you can check the total number of lines using the command below and divide it by the number of splits you want (a Python sketch of the whole calculation follows the example).

! wc -l file_path

in this case

! wc -l /home/data
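
A rough Python sketch of that calculation (the path /home/data and the choice of 3 splits are placeholders):

import subprocess

file_path = "/home/data"   # placeholder path
num_splits = 3             # placeholder number of output files

# wc -l prints "<line count> <path>"; take the first field
total_lines = int(subprocess.check_output(["wc", "-l", file_path]).split()[0])
lines_per_file = total_lines // num_splits + 1

subprocess.run("split -l {} {}".format(lines_per_file, file_path), shell=True)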

And just so you know, the output files will not have a file extension, even though their content is the same as the input file's; you can rename them manually if you're on Windows.


0

You can use the filesplit package to split large files into multiple chunks based on size or line count.

pip install filesplit
from filesplit.split import Split
split = Split("inputfilename" , "outputfolderPath")

With the instance created above, you can split by size.

split.bysize(18000)

You can split by the number of lines with the instance created above.

split.bylinecount(1000)

For information on the parameters, check out https://pypi.org/project/filesplit/


-2

Or, a python version of wc and split:

lines = 0
for l in open(filename): lines += 1

Then some code to read the first lines/3 into one file, the next lines/3 into another, etc.; a minimal sketch of that second step is below.
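
A minimal sketch of that second step, assuming lines and filename from the snippet above and an output naming scheme of filename.0, filename.1, ... (the naming is just a placeholder):

per_file = lines // 3 + 1

part = 0
out = open("%s.%d" % (filename, part), "w")
for i, line in enumerate(open(filename)):
    if i and i % per_file == 0:
        out.close()
        part += 1
        out = open("%s.%d" % (filename, part), "w")
    out.write(line)
out.close()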

1 Comment

No need to keep the count manually, use enumerate: for l, line in enumerate(open(filename)):...
