I have a pretty basic benchmark comparing the performance of a mutex vs. an atomic counter:

package counter

import (
    "sync"
    "sync/atomic"
    "testing"
)

const (
    numCalls = 1000
)

var (
    wg sync.WaitGroup
)

func BenchmarkCounter(b *testing.B) {
    var counterLock sync.Mutex
    var counter int
    var atomicCounter atomic.Int64

    b.Run("mutex", func(b *testing.B) {
        wg.Add(b.N)
        for i := 0; i < b.N; i++ {
            go func(wg *sync.WaitGroup) {
                for i := 0; i < numCalls; i++ {
                    counterLock.Lock()
                    counter++
                    counterLock.Unlock()
                }
                wg.Done()
            }(&wg)
        }
        wg.Wait()
    })

    b.Run("atomic", func(b *testing.B) {

        wg.Add(b.N)
        for i := 0; i < b.N; i++ {
            go func(wg *sync.WaitGroup) {
                for i := 0; i < numCalls; i++ {
                    atomicCounter.Add(1)
                }
                wg.Done()
            }(&wg)
        }
        wg.Wait()
    })
}

Typical output of go test -bench=. -benchmem looks as follows:

BenchmarkCounter/mutex-8        7680        188508 ns/op         618 B/op          3 allocs/op
BenchmarkCounter/atomic-8      38649         31006 ns/op          40 B/op          2 allocs/op

Running escape analysis with go test -gcflags '-m' shows that one allocation in each benchmark iteration (op) comes from starting a goroutine:

./counter_test.go:57:17: func literal escapes to heap
./counter_test.go:60:7: func literal escapes to heap
./counter_test.go:72:18: func literal escapes to heap
./counter_test.go:75:7: func literal escapes to heap

(Lines 57 and 72 are the b.Run() calls, and lines 60 and 75 are the go func() calls, so there is exactly one go statement within each of the b.N iterations.)

The same analysis shows that variables declared at the beginning of the benchmark function are also moved to heap:

./counter_test.go:21:6: moved to heap: counterLock
./counter_test.go:22:6: moved to heap: counter
./counter_test.go:23:6: moved to heap: atomicCounter

I'm just fine with that. What really bothers me is this: I expect allocs/op to measure memory allocations per iteration (b.N iterations in total). So, for example, the single allocation of, say, counterLock divided by b.N iterations (7680 in the benchmark output above) should add 1/7680 = 0 allocs/op (the result is an integer division). The same should apply to counter and atomicCounter.

However, this is not the case: I get 3 allocs/op instead of just 1 for the "mutex" benchmark (1 goroutine + counterLock + counter) and 2 instead of 1 for "atomic" (1 goroutine + atomicCounter). It thus seems that the benchmarking logic considers the function-scope variables (counterLock, counter, atomicCounter) to be allocated anew during each of the b.N iterations, not just once at the beginning of BenchmarkCounter(). Is this logic correct? Am I missing something here?
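To narrow down where per-iteration allocations can and cannot come from, here is a reduced standalone sketch of my own (not part of the original benchmark) that uses testing.Benchmark to read AllocsPerOp directly: a sequential increment loop reports 0 allocs/op, while launching a goroutine with a capturing closure on every iteration reports at least 1, because a fresh closure object is heap-allocated each time.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"testing"
)

// sequentialAllocs measures allocs/op for a plain increment loop: the
// counter lives for the whole benchmark, so whatever it costs is paid
// once and amortizes to 0 allocs/op.
func sequentialAllocs() int64 {
	r := testing.Benchmark(func(b *testing.B) {
		var counter int
		for i := 0; i < b.N; i++ {
			counter++
		}
		_ = counter
	})
	return r.AllocsPerOp()
}

// goroutineAllocs launches one goroutine per iteration; the closure
// captures i, sink and wg, so a new closure object is heap-allocated on
// every iteration, which shows up as at least 1 alloc/op.
func goroutineAllocs() int64 {
	var sink atomic.Int64
	r := testing.Benchmark(func(b *testing.B) {
		var wg sync.WaitGroup
		wg.Add(b.N)
		for i := 0; i < b.N; i++ {
			i := i
			go func() {
				sink.Add(int64(i))
				wg.Done()
			}()
		}
		wg.Wait()
	})
	return r.AllocsPerOp()
}

func main() {
	fmt.Println("sequential allocs/op:    ", sequentialAllocs())
	fmt.Println("per-goroutine allocs/op: ", goroutineAllocs())
}
```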

EDIT. Investigating the memory profile with pprof shows allocations for the go func() calls only (pprof screenshot omitted).

  • Running the benchmark with -memprofile and using pprof shows that only go func() causes allocations (I updated my question with the pprof output). Nothing about the function-scope variables, and nothing suggests where these 3 allocs/op and 2 allocs/op may come from. Commented Dec 18, 2024 at 20:58
  • The goroutines themselves require allocation, but that's not going to show up in escape analysis. Make a simpler version with no goroutines and you'll see 0 allocs. Commented Dec 18, 2024 at 21:15
  • Yes, removing the goroutines and incrementing the counters sequentially removes all allocations. However, my purpose is not to get rid of allocations; I want to understand which variables are allocated within the goroutine call(s), and why. Commented Dec 19, 2024 at 7:34

1 Answer

Starting a goroutine allocates memory for its stack, so your go func accounts for two allocations per iteration: one for the goroutine itself and one for the argument evaluation. I'd have to check what the memory layout is exactly, but remember that this is a go expression, so the function value and its parameters have to be evaluated first. One allocation goes away when you use the (global) wait group directly, go func() {...}(), instead of passing it as an argument, go func(wg *sync.WaitGroup) {...}(&wg).
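The difference between the two launch styles can be checked with a standalone sketch of my own using testing.Benchmark (exact counts depend on the Go version, since modern compilers wrap go statements that take arguments in an allocated closure):

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

var wg sync.WaitGroup // global wait group, as in the question

// withParam passes the wait group as an argument: the go statement has
// to package the argument up for the new goroutine, which costs an
// extra per-iteration allocation.
func withParam() int64 {
	r := testing.Benchmark(func(b *testing.B) {
		wg.Add(b.N)
		for i := 0; i < b.N; i++ {
			go func(wg *sync.WaitGroup) {
				wg.Done()
			}(&wg)
		}
		wg.Wait()
	})
	return r.AllocsPerOp()
}

// withCapture references the global directly: the func literal takes no
// arguments and captures no locals, so no per-iteration wrapper is needed.
func withCapture() int64 {
	r := testing.Benchmark(func(b *testing.B) {
		wg.Add(b.N)
		for i := 0; i < b.N; i++ {
			go func() {
				wg.Done()
			}()
		}
		wg.Wait()
	})
	return r.AllocsPerOp()
}

func main() {
	fmt.Println("param allocs/op:  ", withParam())
	fmt.Println("capture allocs/op:", withCapture())
}
```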

When you have 1,000 goroutines fighting for a mutex, you'll run into lockSlow/unlockSlow, which is also not allocation-free. You can easily test that with:

    b.Run("mutex", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            wg.Add(1)
            go func(wg *sync.WaitGroup) {
                for i := 0; i < numCalls; i++ {
                    counterLock.Lock()
                    counter++
                    counterLock.Unlock()
                }
                wg.Done()
            }(&wg)
            // Wait for each goroutine before starting the next one, so
            // the mutex is never contended and the slow path never runs.
            wg.Wait()
        }
    })

That would explain the three and two allocations per loop you are seeing.
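One way to probe the contention claim is a sketch of my own that compares an uncontended mutex (one worker) against a contended one (several workers). Whether the slow path actually shows up as extra allocs/op depends on the runtime's internal caching, so I only assert that the contended case never allocates less than the uncontended one:

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// mutexAllocs measures allocs/op for a counter protected by a mutex,
// with the given number of goroutines incrementing it concurrently.
// With one worker the Lock/Unlock fast path is taken and nothing
// allocates per op; with many workers the contended path is exercised.
func mutexAllocs(workers int) int64 {
	var mu sync.Mutex
	var counter int
	r := testing.Benchmark(func(b *testing.B) {
		var wg sync.WaitGroup
		wg.Add(workers)
		for w := 0; w < workers; w++ {
			go func() {
				for i := 0; i < b.N; i++ {
					mu.Lock()
					counter++
					mu.Unlock()
				}
				wg.Done()
			}()
		}
		wg.Wait()
	})
	return r.AllocsPerOp()
}

func main() {
	fmt.Println("uncontended allocs/op:", mutexAllocs(1))
	fmt.Println("contended allocs/op:  ", mutexAllocs(8))
}
```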

