1

I need to read lines from a text file, but the 'end of line' character is not always \n or \r or a combination of them; it may be any combination of characters, like 'xyz' or '|'. However, the 'end of line' marker is always the same and known for each type of file.

As the text file may be a big one, and I have to keep performance and memory usage in mind, what seems to be the best solution? Today I use a combination of read(1000) and split(myendofline) or partition(myendofline), but I would like to know whether a more elegant and standard solution exists.
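For illustration, here is a minimal sketch of the chunked read-and-split approach described above, written in current Python (the generator name and the StringIO demo are mine, not part of the question). The key point is that the piece after the last delimiter in the buffer may be an incomplete line and must be carried over to the next chunk:

```python
import io

def read_custom_lines(f, eol, chunksize=1000):
    """Yield 'lines' from f, where eol is an arbitrary delimiter string."""
    buf = ''
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        buf += chunk
        # the last piece may be an incomplete line: keep it in the buffer
        *lines, buf = buf.split(eol)
        for line in lines:
            yield line
    if buf:
        yield buf

f = io.StringIO('one|two|three')
print(list(read_custom_lines(f, '|', chunksize=4)))  # ['one', 'two', 'three']
```

Because the leftover is always re-prepended before splitting, this also handles a multi-character delimiter that straddles a chunk boundary.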

2
  • if your delimiter were just 1 character, you could use the csv module. Commented Mar 11, 2011 at 19:06
  • alas, no: it 'may be any combination of characters like 'xyz''. Commented Mar 11, 2011 at 19:14

4 Answers

2

Here's a generator function that acts as an iterator over a file, cutting it into lines at an exotic newline that is identical throughout the file.

It reads the file in chunks of lenchunk characters and yields the lines found in each chunk, chunk after chunk.

Since the newline is 3 characters long in my example (':;:'), a chunk may end with a newline that has been cut in two: this generator function takes care of that possibility and still yields the correct lines.

If the newline were only one character, the function could be simplified; I wrote it only for the most delicate case.

Using this function allows you to read a file one line at a time, without reading the entire file into memory.

from random import randrange, choice


# this part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,40)))
                for i in xrange(50))
with open('fofo.txt','wb') as g:
    g.write(ch)


# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines

def liner(filename,eol,lenchunk,nl=0):
    # nl = 0 or 1 acts like the keepends argument of splitlines()
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename,'rb') as f:
        tail = ''
        chunk = f.read(lenchunk)
        while chunk:
            # prepend the leftover of the previous chunk before searching,
            # so a newline cut in two by the chunk boundary is reassembled
            chunk = tail + chunk
            last = chunk.rfind(eol)
            if last==-1:
                # no complete newline in this chunk: keep it all as tail
                tail = chunk
            else:
                tail = chunk[last+L:]    # incomplete line after the last newline
                chunk = chunk[0:last+L]  # here: L
                x = y = 0
                while y+1:
                    y = chunk.find(eol,x)
                    if y+1: yield chunk[x:y+NL] # here: NL
                    else: break
                    x = y+L # here: L
            chunk = f.read(lenchunk)
        if tail:
            yield tail
    


for line in liner('fofo.txt',':;:',100):
    print line

Here's the same function, with some printing here and there so you can follow the algorithm.

from random import randrange, choice


# this part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,40)))
                for i in xrange(50))
with open('fofo.txt','wb') as g:
    g.write(ch)


# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines

def liner(filename,eol,lenchunk,nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename,'rb') as f:
        ch = f.read()
        the_end = '\n\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'+\
                  '\nend of the file=='+ch[-50:]+\
                  '\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n'
        f.seek(0,0)
        tail = ''
        chunk = f.read(lenchunk)
        while chunk:
            if (chunk[-1]==':' and chunk[-3:]!=':;:') or chunk[-2:]==':;':
                wr = [' ##########---------- cut newline cut ----------##########'+\
                     '\nchunk== '+chunk+\
                     '\n---------------------------------------------------']
            else:
                wr = ['chunk== '+chunk+\
                     '\n---------------------------------------------------']
            # prepend the leftover of the previous chunk before searching,
            # so a newline cut in two by the chunk boundary is reassembled
            chunk = tail + chunk
            last = chunk.rfind(eol)
            if last==-1:
                tail = chunk
                wr.append('\nno complete newline in this chunk; all kept as tail== '+tail)
                print ''.join(wr)
            else:
                newtail = chunk[last+L:] # here: L
                chunk = chunk[0:last+L]  # here: L
                wr.append('\ntail + kept== '+chunk+\
                          '\n---------------------------------------------------'+\
                          '\nnewtail== '+newtail)
                print ''.join(wr)
                tail = newtail
                x = y = 0
                while y+1:
                    y = chunk.find(eol,x)
                    if y+1: yield chunk[x:y+NL] # here: NL
                    else: break
                    x = y+L # here: L
            print '\n\n==================================================='
            chunk = f.read(lenchunk)
        if tail:
            yield tail
        print the_end
    


for line in liner('fofo.txt',':;:',100):
    print 'line== '+line


EDIT

I compared the execution times of my code and chmullig's code.

With a 'fofo.txt' file of about 10 MB, created with

alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,60)))
                for i in xrange(324000))
with open('fofo.txt','wb') as g:
    g.write(ch)

and measuring times like this:

from time import clock

te = clock()
for line in liner('fofo.txt',':;:', 65536):
    pass
print clock()-te


fh = open('fofo.txt', 'rb')
zenBreaker = SpecialDelimiters(fh, ':;:', 65536)

te = clock()
for line in zenBreaker:
    pass
print clock()-te

I obtained the following minimum times over several runs:

my code: 0.7067 seconds

chmullig's code: 0.8373 seconds


EDIT 2

I changed my generator function: liner2() takes a file handle instead of the file's name, so the opening of the file can be excluded from the timing, as it is for chmullig's code.

def liner2(fh,eol,lenchunk,nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    tail = ''
    chunk = fh.read(lenchunk)
    while chunk:
        chunk = tail + chunk
        last = chunk.rfind(eol)
        if last==-1:
            tail = chunk
        else:
            tail = chunk[last+L:]    # here: L
            chunk = chunk[0:last+L]  # here: L
            x = y = 0
            while y+1:
                y = chunk.find(eol,x)
                if y+1: yield chunk[x:y+NL] # here: NL
                else: break
                x = y+L # here: L
        chunk = fh.read(lenchunk)
    if tail:
        yield tail

fh = open('fofo.txt', 'rb')
te = clock()
for line in liner2(fh,':;:', 65536):
    pass
print clock()-te

The results, after numerous runs to find the minimum times, are

with liner(): 0.7067 seconds

with liner2(): 0.7064 seconds

chmullig's code: 0.8373 seconds

In fact, the opening of the file accounts for an infinitesimal part of the total time.


3 Comments

@philnext @chmullig I would like to know whether other people observe the same ordering of the execution timings. I knew that the string method find() is extremely fast; that's why I used it. But I'd like confirmation of the timings on machines other than mine.
With another text file: with liner, 1.26600003242 seconds; with liner2, 1.23399996758 seconds.
@philnext Thank you. Many execution times must be measured for each function, because the time varies according to what else is happening on the computer. Out of many measurements, the minimum is the best estimate of the "absolute" speed of a piece of code. I think you didn't take your results over many runs, because in my opinion the liner and liner2 times are epsilon-different. But what I'd like verified on other machines is whether my code, with liner or liner2, is really faster than chmullig's code.
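As an aside, the take-the-minimum-over-many-runs methodology described in this comment is exactly what the standard timeit module implements; a sketch (the measured function is just a stand-in, not code from this thread):

```python
import timeit

def work():
    # stand-in for the code being timed
    return sum(range(1000))

# repeat() returns one total time per run; the minimum over the runs is
# the usual estimate of intrinsic speed, least disturbed by other activity
times = timeit.repeat(work, number=100, repeat=5)
print('%.6f seconds' % min(times))
```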
2

Obviously the simplest approach would be to read the whole thing and then call .split('|').

However, if that's undesirable because it requires reading the whole thing into memory, you might read arbitrary chunks and perform the split on them. You could write a class that grabs another chunk when the current one runs out, so that the rest of your application doesn't need to know about it.

Here's the input, zen.txt

The Zen of Python, by Tim Peters||Beautiful is better than ugly.|Explicit is better than implicit.|Simple is better than complex.|Complex is better than complicated.|Flat is better than nested.|Sparse is better than dense.|Readability counts.|Special cases aren't special enough to break the rules.|Although practicality beats purity.|Errors should never pass silently.|Unless explicitly silenced.|In the face of ambiguity, refuse the temptation to guess.|There should be one-- and preferably only one --obvious way to do it.|Although that way may not be obvious at first unless you're Dutch.|Now is better than never.|Although never is often better than *right* now.|If the implementation is hard to explain, it's a bad idea.|If the implementation is easy to explain, it may be a good idea.|Namespaces are one honking great idea -- let's do more of those!

Here's my little test case, which works for me. It doesn't handle a whole bunch of corner cases, nor is it particularly pretty, but it should get you started.

class SpecialDelimiters(object):
    def __init__(self, filehandle, terminator, chunksize=10):
        self.file = filehandle
        self.terminator = terminator
        self.chunksize = chunksize
        self.chunk = ''
        self.lines = []
        self.done = False

    def __iter__(self):
        return self

    def next(self):
        if self.done:
            raise StopIteration
        try:
            return self.lines.pop(0)
        except IndexError:
            #The lines list is empty, so let's read some more!
            while True:
                #Looping so even if our chunksize is smaller than one line we get at least one chunk
                newchunk = self.file.read(self.chunksize)
                self.chunk += newchunk
                rawlines = self.chunk.split(self.terminator)
                if len(rawlines) > 1 or not newchunk:
                    #we want to keep going until we have at least one block
                    #or reached the end of the file
                    break
            self.lines.extend(rawlines[:-1])
            self.chunk = rawlines[-1]
            try:
                return self.lines.pop(0)
            except IndexError:
                #The end of the road, return last remaining stuff
                self.done = True
                return self.chunk               

zenfh = open('zen.txt', 'rb')
zenBreaker = SpecialDelimiters(zenfh, '|')
for line in zenBreaker:
    print line  

5 Comments

+1, good that you made chunk size an option. however, you're not handling quoted lines, but I don't think you need to worry about that.
Ah, you mean if they include \| or something? Don't do that! ;)
Ah, I didn't see the edit when I started on the code. There's an old ticket about this for python which doesn't have many other, better solutions either.
@chmullig With | as the line separator, the beginning of your 'zen.txt' contains a blank line between The Zen of Python, by Tim Peters and Beautiful is better than ugly. But your code displays only two lines, without the blank line.
eyquem, good catch. It's actually worse than that, as originally posted! In transferring from ipython -> stackoverflow I left off the .pop(0) and just had .pop(), which reversed the order of certain subchunks of lines. Edited to correct.
1

Given your constraints, it may be best to convert the known unusual newlines to normal newlines first and then use the usual:

for line in file:
    ...
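A sketch of that conversion idea (function and variable names are illustrative, not from the answer): translate the custom delimiter to '\n' in one streaming pass, holding back a few characters at each chunk boundary in case the delimiter is cut in two, then iterate over the converted file normally:

```python
import io

def convert_newlines(src, dst, eol, chunksize=65536):
    """Copy src to dst, replacing the custom delimiter eol with a newline."""
    tail = ''
    while True:
        chunk = src.read(chunksize)
        if not chunk:
            # what remains is shorter than one delimiter or has none: flush it
            dst.write(tail.replace(eol, '\n'))
            break
        chunk = tail + chunk
        cut = chunk.rfind(eol)
        if cut == -1:
            # no complete delimiter: hold back len(eol)-1 characters in
            # case a delimiter straddles this chunk boundary
            cut = max(0, len(chunk) - (len(eol) - 1))
            tail = chunk[cut:]
            dst.write(chunk[:cut])
        else:
            cut += len(eol)
            tail = chunk[cut:]
            dst.write(chunk[:cut].replace(eol, '\n'))

src, dst = io.StringIO('aaa:;:bb:;:c'), io.StringIO()
convert_newlines(src, dst, ':;:', chunksize=4)
print(repr(dst.getvalue()))  # 'aaa\nbb\nc'
```

The held-back tail stays bounded by one partial line, so memory use stays small even for very large files.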

1 Comment

Good idea, but not for my problem with big (around 500 MB) files.
1

TextFileData.split(EndOfLine_char) seems to be your solution. If it's not working fast enough, then you should consider using a lower-level language.
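In code, that whole-file approach (fine only when the file fits comfortably in memory; the file name and delimiter here are example values) is simply:

```python
# create a small example file whose lines end with '|'
with open('data.txt', 'w') as f:
    f.write('alpha|beta|gamma')

# read everything at once and split on the custom delimiter
with open('data.txt') as f:
    lines = f.read().split('|')
print(lines)  # ['alpha', 'beta', 'gamma']
```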

