
I wrote a very simple program in Go to test the performance of parallel code: it factorizes a big semiprime number by trial division. Since no communication is involved, I expected an almost perfect speedup. However, the program scales very badly.

I timed the program with 1, 2, 4, and 8 processes, running on an 8-core computer (real cores, not hyper-threading), using the system `time` command. The number I factorized is "28808539627864609". Here are my results:

cores   time (s)    speedup
  1     60.0153     1.00
  2     47.358      1.27
  4     34.459      1.75
  8     28.686      2.10

How can such bad speedups be explained? Is it a bug in my program, or a problem with the Go runtime? How could I get better performance? I'm not talking about the algorithm itself (I know there are better algorithms for factorizing semiprime numbers), but about the way I parallelized it.

Here is the source code of my program:

package main

import (
    "flag"
    "fmt"
    "math/big"
    "runtime"
)

func factorize(n *big.Int, start int, step int, c chan *big.Int) {

    var m big.Int
    i := big.NewInt(int64(start))
    s := big.NewInt(int64(step))
    z := big.NewInt(0)

    for {
        m.Mod(n, i)
        if m.Cmp(z) == 0 {
            c <- i
        }
        i.Add(i, s)
    }
}

func main() {
    np := flag.Int("n", 1, "Number of processes")
    flag.Parse()

    runtime.GOMAXPROCS(*np)

    var n big.Int
    n.SetString(flag.Arg(0), 10) // Uses number given on command line
    c := make(chan *big.Int)
    for i := 0; i < *np; i++ {
        go factorize(&n, 2+i, *np, c)
    }
    fmt.Println(<-c)
}

EDIT

The problem really seems to be related to the Mod function. Replacing it with Rem gives better but still imperfect performance and speedup. Replacing it with QuoRem makes the program 3 times faster, with a perfect speedup. Conclusion: it seems memory allocation kills parallel performance in Go. Why? Do you have any references about this?

3 Answers

Big.Int methods generally have to allocate memory, usually to hold the result of the computation. The problem is that there is just one heap and all memory operations are serialized. In this program, the numbers are fairly small and the (parallelizable) computation time needed for things like Mod and Add is small compared to the non-parallelizable operations of repeatedly allocating all the tiny little bits of memory.

As far as speeding it up goes, there is the obvious answer: don't use big.Int if you don't have to. Your example number happens to fit in 64 bits. If you plan on working with really big numbers, though, the problem will kind of go away on its own: you will spend much more time doing computations, and the time spent in the heap will be relatively much less.
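As a sketch of that suggestion (the name factorize64 and the i*i &lt;= n stopping bound are my additions, not code from the answer), the same stride-based search can be written with plain uint64 arithmetic, so the hot loop performs no heap allocation at all:

```go
package main

import "fmt"

// factorize64 does the same stride-based trial division as the
// question's goroutines, but with machine integers: worker k tests
// start, start+step, start+2*step, ... up to sqrt(n).
func factorize64(n, start, step uint64, c chan<- uint64) {
	for i := start; i*i <= n; i += step {
		if n%i == 0 {
			c <- i
			return
		}
	}
}

func main() {
	const n = uint64(28808539627864609) // the question's semiprime fits in 64 bits
	const np = 8
	c := make(chan uint64, np)
	for k := 0; k < np; k++ {
		go factorize64(n, uint64(2+k), np, c)
	}
	f := <-c
	fmt.Println(f, n/f)
}
```

Because every value is a stack-allocated machine word, the workers never touch the shared heap and the speedup should be limited only by the scheduler.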

There is a bug in your program, by the way, although it's not related to performance. When you find a factor and return the result on the channel, you send a pointer to the local variable i. This is fine, except that you don't break out of the loop afterwards. The loop in the goroutine keeps incrementing i, and by the time the main goroutine gets around to fishing the pointer out of the channel and following it, the value is almost certain to be wrong.


3 Comments

It isn't true that in the Go implementation "all memory (allocation) operations are serialized".
Do you have any reference about serial memory allocation in Go? It could make sense, but I'd like to see facts.
The source has relevant comments. So, I stand corrected. There is one heap, but it looks like goroutines can get a bunch of memory at once and then do repeated small allocations out of that. Regardless, allocations are pretty clearly the problem in your program, and as Atom correctly identified, allocations in the Mod method specifically. I tried the change he suggested of replacing Mod with QuoRem, and your program not only scaled well with cores, but ran much faster. (With the latest weekly, that is.)

After sending i through the channel, i should be replaced with a newly allocated big.Int:

if m.Cmp(z) == 0 {
    c <- i
    i = new(big.Int).Set(i)
}

This is necessary because there is no guarantee when fmt.Println will process the integer received on the line fmt.Println(<-c). It isn't unusual for fmt.Println to cause a goroutine switch, so if i isn't replaced with a newly allocated big.Int and the run-time switches back to executing the for-loop in function factorize, the for-loop will overwrite i before it is printed - in which case the program won't print the first integer sent through the channel.


The fact that fmt.Println can cause goroutine switching means that the for-loop in function factorize may potentially consume a lot of CPU time between the moment when the main goroutine receives from channel c and the moment when the main goroutine terminates. Something like this:

run factorize()
<-c in main()
call fmt.Println()
continue running factorize()   // Unnecessary CPU time consumed
return from fmt.Println()
return from main() and terminate program

Another reason for the small multi-core speedup is memory allocation. The function (*Int).Mod is internally using (*Int).QuoRem and will create a new big.Int each time it is called. To avoid the memory allocation, use QuoRem directly:

func factorize(n *big.Int, start int, step int, c chan *big.Int) {
    var q, r big.Int
    i := big.NewInt(int64(start))
    s := big.NewInt(int64(step))
    z := big.NewInt(0)

    for {
        q.QuoRem(n, i, &r)
        if r.Cmp(z) == 0 {
            c <- i
            i = new(big.Int).Set(i)
        }
        i.Add(i, s)
    }
}

Unfortunately, the goroutine scheduler in Go release r60.3 contains a bug which prevents this code from using all CPU cores. When the program is started with -n=2 (GOMAXPROCS=2), the run-time will utilize only 1 thread.

The Go weekly release has a better run-time and can utilize 2 threads when -n=2 is passed to the program. This gives a speedup of approximately 1.9 on my machine.


Another potential contributing factor to multi-core slowdown has been mentioned in the answer by user "High Performance Mark". If the program is splitting the work into multiple sub-tasks and the result comes only from 1 sub-task, it means that the other sub-tasks may do some "extra work". Running the program with n>=2 may in total consume more CPU time than running the program with n=1.

To learn how much extra work is being done, you may want to (somehow) print out the values of all the i's in all goroutines at the moment the program is exiting the function main().
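A minimal sketch of such instrumentation (the names worker and progress are my assumptions, and it uses uint64 arithmetic for brevity rather than the question's big.Int): each goroutine records the last candidate it tested, and main prints those values before exiting:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// worker tests candidates start, start+step, ... and records the last
// candidate tried in *progress so main can inspect it on exit.
func worker(n, start, step uint64, progress *uint64, c chan<- uint64) {
	for i := start; i*i <= n; i += step {
		atomic.StoreUint64(progress, i)
		if n%i == 0 {
			c <- i
			return
		}
	}
}

func main() {
	const n = uint64(28808539627864609)
	const np = 4
	progress := make([]uint64, np)
	c := make(chan uint64, np)
	for k := 0; k < np; k++ {
		go worker(n, uint64(2+k), np, &progress[k], c)
	}
	f := <-c
	// The spread between these values shows how much extra work the
	// losing workers did compared to the one that found the factor.
	for k := 0; k < np; k++ {
		fmt.Printf("worker %d reached %d\n", k, atomic.LoadUint64(&progress[k]))
	}
	fmt.Println("factor:", f)
}
```

The atomic store/load pair keeps the bookkeeping race-free without a mutex in the hot loop.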

2 Comments

Do you have any references about the runtime scheduler bug you mentioned?
I don't. I based my judgement on what I observed on my computer.

I don't read Go, so this is probably an answer to a question other than the one you asked. If so, downvote or delete as you wish.

If you were to plot 'time to factorize integer n' against n, you would get a curve that goes up and down somewhat randomly. For any n you choose, there will be an integer in the range 1..n that takes the longest to factorize on one processor. If your parallelization strategy is to distribute the n integers across p processors, one of those processors will take at least the time to factorize the hardest integer, plus the time to factorize the rest of its load.

Perhaps you've done something similar?

1 Comment

Not exactly. For a given semiprime n, there are exactly 2 factors, p and q. On 1 processor, I would try to divide n by a, with a going from 2 to p. On N processors, I start the processes with a = 2, a = 3, ..., a = N+1 respectively, then increment a by steps of N. One of the processes should then reach p N times faster.
