7

In the article linked below, the author compares the efficiency of different string concatenation methodologies in python: http://www.skymind.com/~ocrow/python_string/

One thing that I did not understand is, why does method 3 (Mutable Character Arrays) result in a significantly slower performance than method 4 (joining a list of strings)

both of them are mutable and I would think that they should have comparable performance.

7
  • Presumably array.fromstring hasn't been optimized the way str.join has. Commented Aug 29, 2013 at 0:38
  • 3
    Also, note that this article is from 2004. Naive str appending is much faster in newer versions of Python, and so is str.join Commented Aug 29, 2013 at 0:39
  • What @abarnert says is totally relevant. When that article was written, the latest version of Python was 2.3.3 (see python.org/download/releases and python.org/download/releases/2.3.4). Benchmarks done at that time are largely meaningless today. Commented Aug 29, 2013 at 0:45
  • Related: how does string concatenation compare between 2.7.5 and 3.3.2 -- unicode vs ascii? Commented Aug 29, 2013 at 0:47
  • 1
    @blakev: But from a very quick test of join with an already-constructed list: bytes takes 1.0x as long in 3.3, unicode with all ASCII 0.5x, unicode with all UCS-2 0.8x, unicode with non-UCS-2 1.0x, unicode with mixed strings 1.05x. Commented Aug 29, 2013 at 0:58

1 Answer 1

6

"Both of them are mutable" is misleading you a bit.

It's true that in the list-append method, the list is mutable. But building up the list isn't the slow part. If you have 1000 strings of average length 1000, you're doing 1000000 mutations to the array, but only 1000 mutations to the list (plus 1000 increfs to string objects).

In particular, that means the array will have to spend 1000x as much time expanding (allocating new storage and copying the whole thing so far).

The slow part for the list method is the str.join call at the end. But that isn't mutable, and doesn't require any expanding. It uses two passes, to first calculate the size needed, then copy everything into it.

Also, that code inside str.join has had (and has continued to have since that article was written 9 years ago) a lot of work to optimize it, because it's a very common, and recommended, idiom that many real programs depend on every day; array has barely been touched since it was first added to the language.

But if you really want to understand the differences, you have to look at the source. In 2.7, the main work for the array method is in array_fromstring, while the main work for the list method is in string_join. You can see how the latter takes advantage of the fact that we already know all of the strings we're going to be joining up at the start, while the former can't.

Sign up to request clarification or add additional context in comments.

4 Comments

"It uses two passes, to first calculate the size needed, then copy everything into it." - really? Is that a list-specific optimization I wasn't aware of? It can't work for general iterables.
Ah, it makes a tuple out of the input if it's not already a tuple or list.
Your bytes_join link is broken. Try string_join.
@user2357112: Thanks; I was looking at the 3.x code, and I thought I'd switched over to 2.7 and re-clicked the line, but obviously not… Anyway, str.join has had that optimization forever, which is why people are often surprised when they attempt to micro-optimize by passing it a genexpr instead of a listcomp and find that it's actually slightly slower…

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.