0

I am working on an SSH client application for configuring network devices concurrently, and I am running into issues implementing the concurrency. My program takes in a slice of hosts and a slice of config commands to send to each host. I am using a sync.WaitGroup to wait for all the hosts to finish being configured. This works fine for small batches of hosts, but soon the functions within my configuration goroutines start randomly failing. If I rerun the program on the failed hosts, some will succeed and again some will fail. I have to repeat this process until only the hosts with actual errors remain. It always fails either on the authentication saying authentication failed: auth methods tried [none password]... or the values from sysDescr don't get added to some Devices fields. It's as if when there are many hosts and goroutines running, they start returning early or something. I'm really not sure what is going on.

Here is a sample of my code:

package main

import (
    "fmt"
    "net"
    "os"
    "sync"
    "time"

    "golang.org/x/crypto/ssh"
)

func main() {
    // When there are many hosts, many calls to Dial and
    // sysDescr fail. If I rerun the program on the unsuccessful
    // hosts, nothing fails and the expected output is produced.
    var hosts []string
    cfg := &ssh.ClientConfig{
        User:            "user",
        Auth:            []ssh.AuthMethod{ssh.Password("pass")},
        HostKeyCallback: ssh.InsecureIgnoreHostKey(),
        Timeout:         10 * time.Second,
    }
    results := make(chan *result, len(hosts))

    var wg sync.WaitGroup
    wg.Add(len(hosts))
    for _, host := range hosts {
        go connect(host, cfg, results, &wg)
    }
    wg.Wait()
    close(results)

    for res := range results {
        if res.err != nil {
            fmt.Fprintln(os.Stderr, res.Err)
            continue
        }
        d := res.device
        fmt.Println(d.addr, d.hostname, d.vendor, d.os, d.model, d.version)
        d.Client.Close()
    }
}

// Device represents a network device.
type Device struct {
    *ssh.Client
    addr     string
    hostname string
    vendor   string
    os       string
    model    string
    version  string
}

// Dial establishes an ssh client connection to a remote host.
func Dial(host, port string, cfg *ssh.ClientConfig) (*Device, error) {
    // get host info in background, may take a second
    info := make(chan map[string]string)
    go func(c *Client) {
        info <- sysDescr(host)
        close(info)
    }(c)

    // establish ssh client connection to host
    client, err := ssh.Dial("tcp", net.JoinHostPort(host, addr), cfg)
    if err != nil {
        return nil, err
    }

    m := <-info
    d := &Device{
        Client: client,
        addr: m["addr"],
        hostname: m["hostname"],
        vendor: m["vendor"],
        os: m["os"],
        model: m["model"],
        version: m["version"],
    }
    return d, nil
}

// sysDescr attempts to gather information about a remote host.
func sysDescr(host string) map[string]string {
    // get host info
}

// result is the result of connecting to a device.
type result struct {
    device *Device
    err    error
}

// connect establishes an ssh client connection to a host.
func connect(host string, cfg *ssh.ClientConfig, results chan<- *result, wg *sync.WaitGroup) {
    defer wg.Done()
    device, err := Dial(host, "22", cfg)
    results <- &result{device, err}
}

Am I doing something wrong? Can I limit the number of goroutines being spawned instead of spawning a goroutine for each host?

11
  • 1
    Where does your code fail? And what does sysDescr function do? And cfg is accessed by multiple goroutines, I wonder adding a mutex to protect it helps. Just an idea. Commented Jul 29, 2018 at 16:01
  • @NeoWang it fails on ssh.Dial within my Dial function with the error referencing an authentication error as the reason for failure. So, I think you’re right about the cfg access. Is there another way than using mutexes? Maybe copying the cfg for each call to connect? Commented Jul 29, 2018 at 16:05
  • @NeoWang also, sysDescr uses SNMP to poll a network device’s sysDescr.0 OID to gather information about the device. Commented Jul 29, 2018 at 16:08
  • 2
    Yes, you can also try copying cfg for each goroutine. Commented Jul 29, 2018 at 16:17
  • Reusing the config for multiple concurrent connections is almost certainly the issue. Commented Jul 30, 2018 at 13:24

1 Answer 1

1

Yes! To answer your second question, there are many patterns in go to limit concurrency. Two of the largest are:

I personally prefer the worker pool variant because IMO it better helps to keep the business logic separate from the scheduling, but both are completely valid and are present out in the wild in many many projects.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for the examples! I will look into them further.
For the worker pool method, is there an easy way to find the optimal amount of workers and jobs to create? Also, in the worker pool example you gave, can you explain why they initialize the jobs and results channels with capacity 100 if they only create 3 workers and 5 jobs? Wouldn’t both of the channels only need capacity 5, or capacity len(jobs)?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.