37

I have a huge text file (~1GB) and sadly the text editor I use won't read such a large file. However, if I can just split it into two or three parts I'll be fine, so, as an exercise I wanted to write a program in python to do it.

What I think I want the program to do is to find the size of a file, divide that number into parts, and for each part, read up to that point in chunks, writing to a filename.nnn output file, then read up to the next line-break and write that, then close the output file, etc. Obviously the last output file just copies to the end of the input file.

Can you help me with the key filesystem related parts: filesize, reading and writing in chunks and reading to a line-break?

I'll be writing this code test-first, so there's no need to give me a complete answer, unless it's a one-liner ;-)

  • Unwelcome suggestion: get a better text editor. :-) If you're on Windows, EmEditor is one I know of that will seamlessly edit files without having to load them completely into memory. Commented Nov 15, 2008 at 13:00
  • See my answer here on how to split large text files in Python without running any Linux commands. Commented Apr 8, 2023 at 14:18

17 Answers

46

Linux has a split command:

split -l 100000 file.txt

would split it into files of 100,000 lines each.


3 Comments

And if your base OS is Windows you can get Cygwin for access to basically all of the cool command-line utilities.
Unix tools for Windows also include the split tool: split.exe
I have a 120 GB file. While using this command it gets stuck after some 1928613 lines and does not proceed any further. I was trying to do what was said in stackoverflow.com/a/291759/6143004 but the same problem is occurring.
17

Check out os.stat() for file size and file.readlines([sizehint]). Those two functions should be all you need for the reading part, and hopefully you know how to do the writing :)
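
A minimal sketch of how those two pieces might fit together (the function name, file names, and part-numbering scheme here are just placeholders, not a finished solution):

import os

def split_by_size(path, num_parts):
    # rough byte target per part; readlines(hint) rounds up to whole lines
    part_size = os.stat(path).st_size // num_parts
    with open(path) as infile:
        part = 1
        while True:
            lines = infile.readlines(part_size)
            if not lines:
                break
            with open("%s.%03d" % (path, part), "w") as outfile:
                outfile.writelines(lines)
            part += 1

Calling split_by_size("big.txt", 3) would then write big.txt.001, big.txt.002 and so on, each ending on a line break.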

1 Comment

Thanks for the answer - your suggestions are working well so far for reading the file. When I've finished, I'll also try a binary version that doesn't read one line at a time.
10

As an alternative method, using the logging library:

>>> import logging.handlers
>>> log = logging.getLogger()
>>> fh = logging.handlers.RotatingFileHandler("D://filename.txt",
...     maxBytes=2**20*100, backupCount=100)  # 100 MB each, up to a maximum of 100 files
>>> log.addHandler(fh)
>>> log.setLevel(logging.INFO)
>>> f = open("D://biglog.txt")
>>> for line in f:
...     log.info(line.rstrip("\n"))

Your files will appear as follows:

filename.txt (end of file)
filename.txt.1
filename.txt.2
...
filename.txt.10 (start of file)

This is a quick and easy way to make a huge log file match your RotatingFileHandler implementation.

1 Comment

Since it splits line by line, how can it be done faster?
10

There is now a PyPI module available that you can use to split files of any size into chunks. Check this out:

https://pypi.org/project/filesplit/

1 Comment

Does this package support splitting by number of lines? I see that it does split by a given size.
7

This generator method is a (slow) way to get a slice of lines without blowing up your memory.

import itertools

def slicefile(filename, start, end):
    lines = open(filename)
    return itertools.islice(lines, start, end)

out = open("/blah.txt", "w")
for line in slicefile("/python27/readme.txt", 10, 15):
    out.write(line)


6

Don't forget seek() and mmap() for random access to files.

import mmap

def getSomeChunk(filename, start, length):
    fobj = open(filename, 'r+b')
    m = mmap.mmap(fobj.fileno(), 0)
    return m[start:start+length]
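
A possible way to use the getSomeChunk helper above to write the pieces out (split_with_mmap is a hypothetical name; the slices fall on byte boundaries, so a line can be cut in the middle, and each call re-maps the file, which is wasteful but keeps the helper unchanged):

import os

def split_with_mmap(filename, num_parts):
    part_size = os.path.getsize(filename) // num_parts + 1
    for i in range(num_parts):
        chunk = getSomeChunk(filename, i * part_size, part_size)
        if not chunk:
            break
        with open("%s.%03d" % (filename, i + 1), "wb") as out:
            out.write(chunk)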


6

While Ryan Ginstrom's answer is correct, it does take longer than it should (as he has already noted). Here's a way to circumvent the multiple calls to itertools.islice by successively iterating over the open file descriptor:

def splitfile(infilepath, chunksize):
    fname, ext = infilepath.rsplit('.',1)
    i = 0
    written = False
    with open(infilepath) as infile:
        while True:
            outfilepath = "{}{}.{}".format(fname, i, ext)
            with open(outfilepath, 'w') as outfile:
                for line in (infile.readline() for _ in range(chunksize)):
                    outfile.write(line)
                written = bool(line)
            if not written:
                break
            i += 1
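
For example, assuming an input file named big.txt, a call like the one below would write big0.txt, big1.txt, and so on, each with up to 100,000 lines:

splitfile("big.txt", 100000)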


2

You can use wc and split (see the respective manpages) to get the desired effect. In bash:

split -dl$((`wc -l 'filename'|sed 's/ .*$//'` / 3 + 1)) filename filename-chunk.

produces 3 parts with the same line count (give or take a rounding difference in the last, of course), named filename-chunk.00 to filename-chunk.02.

3 Comments

Yes, it is not Python, but why use a screwdriver to drive a nail?
Well it's not really a screwdriver vs. nail... python often is a great way to accomplish simple tasks such as this. And I don't want to bash bash (pun intended) but that is not really... readable :)
@chrisfs: Well, in hindsight I would perhaps use awk '{print $1}' rather than the sed construction. Still, you can see fairly directly what happens: wc counts the lines, sed pulls the bare number out of the output, that number is divided by three and incremented by 1; split then produces parts of that length from filename and names them filename-chunk. plus a running number. It would of course be nice if wc had an option to output just the number directly, but it works well enough as it is.
2

I've written the program and it seems to work fine, so thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here.)
Later I may get round to doing a version that does a binary read to see if it's any quicker.

def Split(inputFile,numParts,outputName):
    fileSize=os.stat(inputFile).st_size
    parts=FileSizeParts(fileSize,numParts)
    openInputFile = open(inputFile, 'r')
    outPart=1
    for part in parts:
        if openInputFile.tell()<fileSize:
            fullOutputName=outputName+os.extsep+str(outPart)
            outPart+=1
            openOutputFile=open(fullOutputName,'w')
            openOutputFile.writelines(openInputFile.readlines(part))
            openOutputFile.close()
    openInputFile.close()
    return outPart-1


2

usage - split.py filename splitsizeinkb

import os
import sys

def getfilesize(filename):
    with open(filename, "rb") as fr:
        fr.seek(0, 2)  # move to the end of the file
        size = fr.tell()
        print("getfilesize: size: %s" % size)
        return size

def splitfile(filename, splitsize):
    # Open the original file in read-only mode
    if not os.path.isfile(filename):
        print("No such file as: \"%s\"" % filename)
        return

    filesize = getfilesize(filename)
    with open(filename, "rb") as fr:
        counter = 1
        originalfilename = filename.split(".")
        readlimit = 5000  # read 5 KB at a time
        n_splits = filesize // splitsize
        print("splitfile: No of splits required: %s" % str(n_splits))
        for i in range(n_splits + 1):
            chunks_count = int(splitsize) // int(readlimit)
            data_5kb = fr.read(readlimit)  # read
            # Create the split files
            print("chunks_count: %d" % chunks_count)
            with open(originalfilename[0] + "_{id}.".format(id=str(counter)) + originalfilename[1], "ab") as fw:
                fw.seek(0)
                fw.truncate()  # truncate the split file if it already exists
                while data_5kb:
                    fw.write(data_5kb)
                    if chunks_count:
                        chunks_count -= 1
                        data_5kb = fr.read(readlimit)
                    else:
                        break
            counter += 1

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Filename or splitsize not provided. Usage: filesplit.py filename splitsizeinkb")
    else:
        filesize = int(sys.argv[2]) * 1000  # convert KB to bytes
        filename = sys.argv[1]
        splitfile(filename, filesize)

2 Comments

Worked for me perfectly in 2017! Thanks a lot @Mudit
Can you make this code extract line by line, not character by character? Is there a way to get the number of characters in the next line?
2

Here is a python script you can use for splitting large files using subprocess:

"""
Splits the file into the same directory and
deletes the original file
"""

import subprocess
import sys
import os

SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2'  # subprocess expects a string, i.e. 2 = aa, ab, ac etc..

if __name__ == "__main__":

    file_path = sys.argv[1]
    # i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
    subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
                     os.path.dirname(file_path) + '/'])

    # Remove the original file once done splitting
    try:
        os.remove(file_path)
    except OSError:
        pass

You can call it externally:

import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))

You can also import subprocess and run it directly in your program.

The issue with this approach is high memory usage: subprocess creates a fork with a memory footprint the same size as your process, and if your process's memory is already heavy, it doubles that for as long as it runs. The same goes for os.system.

Here is another, pure-Python way of doing this. I haven't tested it on huge files; it will be slower but leaner on memory:

CHUNK_SIZE = 5000

def yield_csv_rows(reader, chunk_size):
    """
    Opens file to ingest, reads each line to return list of rows
    Expects the header is already removed
    Replacement for ingest_csv
    :param reader: dictReader
    :param chunk_size: int, chunk size
    """
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk

import unicodecsv

with open(local_file_path, 'rb') as f:
    header = f.readline().strip().replace('"', '')
    reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
    chunks = yield_csv_rows(reader, CHUNK_SIZE)
    for chunk in chunks:
        if not chunk:
            break
        # Do something with your chunk here

Here is another example using readlines():

"""
Simple example using readlines()
where the 'file' is generated via:
seq 10000 > file
"""
CHUNK_SIZE = 5


def yield_rows(reader, chunk_size):
    """
    Yield row chunks
    """
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk


def batch_operation(data):
    for item in data:
        print(item)


with open('file', 'r') as f:
    chunks = yield_rows(f.readlines(), CHUNK_SIZE)
    for _chunk in chunks:
        batch_operation(_chunk)

The readlines example demonstrates how to chunk your data so you can pass the chunks to a function that expects chunks. Unfortunately, readlines reads the whole file into memory, so it's better to use the reader example for performance. That said, if what you need fits easily into memory and you just need to process it in chunks, this should suffice.

2 Comments

The first one just calls an external Linux command; I don't see the point... For the second, readlines will read the whole file, which consumes a lot of memory; besides, why do we need another chunk loop to do this?
Using the Linux split command is faster in many cases but uses more memory since it goes through subprocess; it's all explained in the answer. The readlines example demonstrates how to chunk your data to pass chunks to a function that expects chunks.
1

You can split any file into chunks like below; here CHUNK_SIZE is 500000 bytes (500 KB) and content is the file's contents read into memory:

def get_chunk(content, size):
    for i in range(0, len(content), size):
        yield content[i:i+size]

for idx, val in enumerate(get_chunk(content, CHUNK_SIZE)):
    data = val
    index = idx
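
A sketch of where content might come from and how the chunks could be written back out (big.txt is a placeholder name, and reading the whole file at once only works if it fits in memory):

with open("big.txt", "rb") as f:   # placeholder input file
    content = f.read()             # whole file in memory

for idx, chunk in enumerate(get_chunk(content, CHUNK_SIZE)):
    with open("big.txt.%03d" % idx, "wb") as out:
        out.write(chunk)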


0

This worked for me

import os

fil = "inputfile"
outfil = "outputfile"

f = open(fil,'r')

numbits = 1000000000

for i in range(0, os.stat(fil).st_size // numbits + 1):
    o = open(outfil + str(i), 'w')
    segment = f.readlines(numbits)  # readlines() keeps each line's trailing newline
    o.writelines(segment)
    o.close()

f.close()


0

I had a requirement to split csv files for import into Dynamics CRM, since the file size limit for import is 8 MB and the files we receive are much larger. This program lets the user enter FileNames and LinesPerFile, and then splits the specified files into chunks of the requested number of lines. I can't believe how fast it works!

# user input FileNames and LinesPerFile
FileCount = 1
FileNames = []
while True:
    FileName = raw_input('File Name ' + str(FileCount) + ' (enter "Done" after last File):')
    FileCount = FileCount + 1
    if FileName == 'Done':
        break
    else:
        FileNames.append(FileName)
LinesPerFile = raw_input('Lines Per File:')
LinesPerFile = int(LinesPerFile)

for FileName in FileNames:
    File = open(FileName)

    # get Header row
    for Line in File:
        Header = Line
        break

    FileCount = 0
    Linecount = 1
    for Line in File:

        #skip Header in File
        if Line == Header:
            continue

        #create NewFile with Header every [LinesPerFile] Lines
        if Linecount % LinesPerFile == 1:
            FileCount = FileCount + 1
            NewFileName = FileName[:FileName.find('.')] + '-Part' + str(FileCount) + FileName[FileName.find('.'):]
            NewFile = open(NewFileName,'w')
            NewFile.write(Header)

        NewFile.write(Line)
        Linecount = Linecount + 1

    NewFile.close()


0
import subprocess
subprocess.run('split -l number_of_lines file_path', shell = True)

For example, if you want 50,000 lines per file and the path is /home/data, then you can run the command below:

subprocess.run('split -l 50000 /home/data', shell = True)

If you are not sure how many lines to keep in each split file but know how many splits you want, then in a Jupyter Notebook or shell you can check the total number of lines using the command below and divide it by the number of splits you want (a Python sketch of the whole calculation follows the example).

! wc -l file_path

in this case

! wc -l /home/data
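
A rough Python sketch of that calculation (the path /home/data and the choice of 3 splits are placeholders):

import subprocess

file_path = "/home/data"   # placeholder path
num_splits = 3             # placeholder number of output files

# wc -l prints "<line count> <path>"; take the first field
total_lines = int(subprocess.check_output(["wc", "-l", file_path]).split()[0])
lines_per_file = total_lines // num_splits + 1

subprocess.run("split -l {} {}".format(lines_per_file, file_path), shell=True)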

And just so you know, the output files will not have a file extension, even though their content is the same as the input file's; you can rename them manually if you're on Windows.


0

You can use the filesplit package to split large files into multiple chunks based on size or line count.

pip install filesplit
from filesplit.split import Split
split = Split("inputfilename" , "outputfolderPath")

With the instance created above, you can split by size.

split.bysize(18000)

You can split by the number of lines with the instance created above.

split.bylinecount(1000)

For information on the parameters, check out https://pypi.org/project/filesplit/


-2

Or, a python version of wc and split:

lines = 0
for l in open(filename): lines += 1

Then some code to read the first lines/3 into one file, the next lines/3 into another, etc.; a minimal sketch of that second step is below.
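
A minimal sketch of that second step, assuming lines and filename from the snippet above and an output naming scheme of filename.0, filename.1, ... (the naming is just a placeholder):

per_file = lines // 3 + 1

part = 0
out = open("%s.%d" % (filename, part), "w")
for i, line in enumerate(open(filename)):
    if i and i % per_file == 0:
        out.close()
        part += 1
        out = open("%s.%d" % (filename, part), "w")
    out.write(line)
out.close()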

1 Comment

No need to keep the count manually, use enumerate: for l, line in enumerate(open(filename)):...
