Python string manipulation -- performance problems

Question

I have the following piece of code that I execute around 2 million times in my application to parse that many records. This part seems to be the bottleneck and I was wondering if anyone could help me by suggesting some nifty tricks that could make these simple string manipulations faster.

try:
    data = []
    start = 0
    end = 0
    for info in self.Columns():
        end = start + (info.columnLength)
        slice = line[start:end]
        if slice == '' or len(slice) != info.columnLength:
            raise 'Wrong Input'
        if info.hasSignage:
            if(slice[0:1].strip() != '+' and slice[0:1].strip() != '-'):
                raise 'Wrong Input'
        if not info.skipColumn:
            data.append(slice)
        start = end 
    parsedLine = data
except:
    parsedLine = False

codereview.stackexchange.com is probably better suited for questions like this. — Wooble
– Wooble, Commented Sep 2, 2011 at 18:21
Perhaps you could post a full, runnable performance test for people to run against. — Mike Graham
– Mike Graham, Commented Sep 2, 2011 at 18:21
String exceptions have been deprecated for many, many years--always raise an Exception subclass instead. The only reason they appear to work anymore is that they're wrong and therefore raise an error. — Mike Graham
– Mike Graham, Commented Sep 2, 2011 at 18:22
Bare except: is almost always wrong can can lead to frustrating, untrackable bugs. If all you wanted to catch was raise "Wrong Input" then what you would do is define class InvalidInputError(Exception): pass and then except InvalidInputError:. — Mike Graham
– Mike Graham, Commented Sep 2, 2011 at 18:24
@Mike Graham: even better than a class body of pass is a docstring. — SingleNegationElimination
– SingleNegationElimination, Commented Sep 2, 2011 at 19:25

John Machin · Accepted Answer · 2011-09-04 23:13:38Z

3

def fubarise(data):
    try:
        if nasty(data):
            raise ValueError("Look, Ma, I'm doing a big fat GOTO ...") # sheesh #1
        more_of_the_same()
        parsed_line = data
    except ValueError:
        parsed_line = False
        # so it can be a "data" or False -- sheesh #2
    return parsed_line

There is no point in having different error messages in the raise statement; they are never seen. Sheesh #3.

Update: Here is a suggested improvement which uses struct.unpack to partition input lines rapidly. It also illustrates better exception handling, under the assumption that the writer of the code is also running it and stopping on the first error is acceptable. A robust implementation which logs all errors in all columns of all lines for a user audience is another matter. Note that typically the error checking for each column would be much more extensive e.g. checking for a leading sign but not checking whether the column contains a valid number seems a little odd.

import struct

def unpacked_records(self):
    cols = self.Columns()
    unpack_fmt = ""
    sign_checks = []
    start = 0
    for colx, info in enumerate(cols, 1):
        clen = info.columnLength
        if clen < 1:
            raise ValueError("Column %d: Bad columnLength %r" % (colx, clen))
        if info.skipColumn:
            unpack_fmt += str(clen) + "x"
        else:
            unpack_fmt += str(clen) + "s"
            if info.hasSignage:
                sign_checks.append(start)
        start += clen
    expected_len = start
    unpack = struct.Struct(unpack_fmt).unpack

    for linex, line in enumerate(self.whatever_the_list_of_lines_is, 1):
        if len(line) != expected_len:
            raise ValueError(
                "Line %d: Actual length %d, expected %d"
                % (linex, len(line), expected_len))
        if not all(line[i] in '+-' for i in sign_checks):
            raise ValueError("Line %d: At least one column fails sign check" % linex)
        yield unpack(line) # a tuple

edited Sep 4, 2011 at 23:13

answered Sep 3, 2011 at 11:31

John Machin

83.3k12 gold badges147 silver badges193 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

steveha Over a year ago

Wow, you really found ways to make your points in obnoxious ways. Is that how you make friends and influence people? This isn't an answer to his question at all, just a critique of his coding style, and made rather insulting just to minimize the chances he will actually pay attention to you.

steveha Over a year ago

Thank you for actually answering his question. Because you are using a builtin module, this should be faster than my answer as well.

Remi · Accepted Answer · 2011-09-02 19:50:36Z

what about (using some classes to have an executable example):

class Info(object):
    columnLength = 5
    hasSignage = True
    skipColumn = False

class Something(object):

    def Columns(self):
        return [Info()]*4

    def bottleneck(self):
        try:
            data = []
            start = 0
            end = 0
            line = '+this-is just a line for testing'
            for info in self.Columns():
                start = end
                collength = info.columnLength
                end = start + collength
                if info.skipColumn:  # start with this
                    continue

                elif collength == 0: 
                    raise ValueError('Wrong Input')

                slice = line[start:end] # only now slicing, because it
                                        # is probably most expensive part

                if len(slice) != collength: 
                    raise ValueError('Wrong Input')

                elif info.hasSignage and slice[0] not in '+-': # bit more compact
                    raise ValueError('Wrong Input')

                else:
                    data.append(slice)

            parsedLine = data
        except:
            parsedLine = False

Something().bottleneck()

edit: when length of slice is 0, slice[0] does not exist, so if collength == 0 has to be checked for first

edit2: You are using this bit of code for many many lines, but the column info does not change, right? That allows you, to

pre-calculate a list of start points of each colum (no more need to calculate start, end)
knowing start-end in advance, .Columns() only needs to return columns that are not skipped and have a columnlength >0 (or do you really need to raise an input for length==0 at each line??)
the manditory length of each line is known and equal or each line and can be checked before looping over the column infos

edit3: I wonder how you will know what data index belongs to which column if you use 'skipColumn'...

+1, I was in the process of suggesting moving the skipColumn check up and simplifying the signage check myself.

steveha · Accepted Answer · 2011-09-14 19:22:43Z

1

EDIT: I'm changing this answer a bit. I'll leave the original answer below.

In my other answer I commented that the best thing would be to find a built-in Python module that would do the unpacking. I couldn't think of one, but perhaps I should have Google searched for one. @John Machin provided an answer that showed how to do it: use the Python struct module. Since that is written in C, it should be faster than my pure Python solution. (I haven't actually measured anything so it is a guess.)

I do agree that the logic in the original code is "un-Pythonic". Returning a sentinel value isn't best; it's better to either return a valid value or raise an exception. The other way to do it is to return a list of valid values, plus another list of invalid values. Since @John Machin offered code to yield up valid values, I thought I'd write a version here that returns two lists.

NOTE: Perhaps the best possible answer would be to take @John Machin's answer and modify it to save the invalid values to a file for possible later review. His answer yields up answers one at a time, so there is no need to build a large list of parsed records; and saving the bad lines to disk means there is no need to build a possibly-large list of bad lines.

import struct

def parse_records(self):
    """
    returns a tuple: (good, bad)
    good is a list of valid records (as tuples)
    bad is a list of tuples: (line_num, line, err)
    """

    cols = self.Columns()
    unpack_fmt = ""
    sign_checks = []
    start = 0
    for colx, info in enumerate(cols, 1):
        clen = info.columnLength
        if clen < 1:
            raise ValueError("Column %d: Bad columnLength %r" % (colx, clen))
        if info.skipColumn:
            unpack_fmt += str(clen) + "x"
        else:
            unpack_fmt += str(clen) + "s"
            if info.hasSignage:
                sign_checks.append(start)
        start += clen
    expected_len = start
    unpack = struct.Struct(unpack_fmt).unpack

    good = []
    bad = []
    for line_num, line in enumerate(self.whatever_the_list_of_lines_is, 1):
        if len(line) != expected_len:
            bad.append((line_num, line, "bad length"))
            continue
        if not all(line[i] in '+-' for i in sign_checks):
            bad.append((line_num, line, "sign check failed"))
            continue
        good.append(unpack(line))

    return good, bad

ORIGINAL ANSWER TEXT: This answer should be a lot faster if the self.Columns() information is identical over all the records. We do the processing of the self.Columns() information one time, and build a couple of lists that contain just what we need to process a record.

This code shows how to compute parsedList but doesn't actually yield it up or return it or do anything with it. Obviously you would need to change that.

def parse_records(self):
    cols = self.Columns()

    slices = []
    sign_checks = []
    start = 0
    for info in cols:
        if info.columnLength < 1:
            raise ValueError, "bad columnLength"
        end = start + info.columnLength
        if not info.skipColumn:
            tup = (start, end)
            slices.append(tup)   
            if info.hasSignage:
                sign_checks.append(start)

    expected_len = end # or use (end - 1) to not count a newline

    try:
        for line in self.whatever_the_list_of_lines_is:
            if len(line) != expected_len:
                raise ValueError, "wrong length"
            if not all(line[i] in '+-' for i in sign_checks):
                raise ValueError, "wrong input"
            parsedLine = [line[s:e] for s, e in slices]

    except ValueError:
        parsedLine = False

edited Sep 14, 2011 at 19:22

answered Sep 2, 2011 at 19:33

steveha

77.1k21 gold badges94 silver badges119 bronze badges

12 Comments

Remi Over a year ago

That's what I meant in my commentary edits, very nice, and 'info.hasSignage' must be indented 4 spaces. bbekdemir is having a Royal treatment on his first post...

bbekdemir Over a year ago

And thanks everyone for your responses - especially @steveha. I am amazed by how quickly I got answers.

steveha Over a year ago

@Remi, I edited the code to indent as you suggested, on the assumption that if we are skipping the column, we don't need to check the sign. Without the indent, we check every sign whether skipping that column or not; that might actually be desired.

Chris Morgan Over a year ago

This would be made more efficient by using (a) for..else rather than try..except, or (b) parsedLine = None; break instead of raise and throw away the try..except wrapping.

steveha Over a year ago

I just added a note, suggesting that perhaps the best answer would be to take the @John Machin answer and modify it to write bad input lines to a file while yielding up the successfully parsed records. That would allow for parsing all input records even in the face of errors, and wouldn't require large lists.

|

S.Lott · Accepted Answer · 2011-09-02 18:27:02Z

1

Don't compute start and end every time through this loop.

Compute them exactly once prior to using self.Columns() (Whatever that is. If 'Columns` is class with static values, that's silly. If it's a function with a name that begins with a capital letter, that's confusing.)

if slice == '' or len(slice) != info.columnLength can only happen if line is too short compared to the total size required by Columns. Check once, outside the loop.

slice[0:1].strip() != '+' sure looks like .startswith().

if not info.skipColumn. Apply this filter before even starting the loop. Remove these from self.Columns().

answered Sep 2, 2011 at 18:27

S.Lott

393k83 gold badges521 silver badges791 bronze badges

4 Comments

steveha Over a year ago

Sorry, don't agree about start and end. Those are being used to take a different slice out of the input string each time; they depend on info.columnLength. Also, the check of info.columnLength makes sense if you are worried about possible garbage .columnLength values; I think the correction there is to try to handle it with an exception rather than inline code (ETAFTGP). Completely agree about .startswith(); and .startswith() should be called with a tuple argument of ('+', '-').

steveha Over a year ago

But wait... he says he is processing many lines at a time, and it's possible that the self.Columns() value is constant over all the lines. In which case, we could save a lot of time by precomputing a set of (start, end) tuples. If that is what you meant, I withdraw my disagreement.

S.Lott Over a year ago

@steveha: "and it's possible that the self.Columns() value is constant over all the lines". Not "possible". Generally, it's essential. That's what fixed -- positional -- file formats depend on. Fixed positions in each row.

steveha Over a year ago

I started off just looking at what the code is actually doing, and making no assumptions otherwise. Tunnel vision on the code itself. Then I took a step back and said, "hmm, he says he is processing records."

phimuemue · Accepted Answer · 2011-09-02 18:38:18Z

1

First thing I would consider is slice = line[start:end]. Slicing creates new instances; you could try to avoid explicitly constructing line [start:end] and examine its contents manually.

Why are you doing slice[0:1]? This should yield a subsequence containing a single item of slice (shouldn't it?), thus it can probably be checked more efficiently.

answered Sep 2, 2011 at 18:38

phimuemue

36.3k11 gold badges95 silver badges124 bronze badges

Comments

steveha · Accepted Answer · 2011-09-02 19:13:11Z

I want to tell you to use some sort of built-in Python feature to split the string, but I can't think of one. So I'm left with just trying to reduce the amount of code you have.

When we are done, end should be pointing at the end of the string; if this is the case, then all of the .columnLength values must have been okay. (Unless one was negative or something!)

Since this has a reference to self it must be a snip from a member function. So, instead of raising exceptions, you could just return False to exit the function early and return an error flag. But I like the debugging potential of changing the except clause to not catch the exception anymore, and getting a stack trace letting you identify where the problem came from.

@Remi used slice[0] in '+-' where I used slice.startswith(('+', '-)). I think I like @Remi's code better there, but I left mine unchanged just to show you a different way. The .startswith() way will work for strings longer than length 1, but since this is only a string of length 1 the terse solution works.

try:
    line = line.strip('\n')
    data = []
    start = 0
    for info in self.Columns():
        end = start + info.columnLength
        slice = line[start:end]
        if info.hasSignage and not slice.startswith(('+', '-')):
            raise ValueError, "wrong input"
        if not info.skipColumn:
            data.append(slice)
        start = end

    if end - 1 != len(line):
        raise ValueError, "bad .columnLength"

    parsedLine = data

except ValueError:
    parsedLine = False

Collectives™ on Stack Overflow

Python string manipulation -- performance problems

6 Answers 6

2 Comments

1 Comment

12 Comments

4 Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

2 Comments

1 Comment

12 Comments

4 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related