I am working on an SSH client application that configures network devices concurrently, and I am running into problems with the concurrency. My program takes a slice of hosts and a slice of config commands to send to each host. I use a sync.WaitGroup to wait for all the hosts to finish being configured. This works fine for small batches of hosts, but with larger batches the functions inside my configuration goroutines start failing at random. If I rerun the program on the failed hosts, some succeed and some fail again. I have to repeat this until only the hosts with actual errors remain. It always fails in one of two ways: either authentication fails with "authentication failed: auth methods tried [none password]...", or the values from sysDescr are not added to some of the Device fields. It's as if, when there are many hosts and goroutines running, they start returning early or something. I'm really not sure what is going on.
Here is a sample of my code:
package main

import (
	"fmt"
	"net"
	"os"
	"sync"
	"time"

	"golang.org/x/crypto/ssh"
)
func main() {
	// When there are many hosts, many calls to Dial and
	// sysDescr fail. If I rerun the program on the unsuccessful
	// hosts, nothing fails and the expected output is produced.
	var hosts []string
	cfg := &ssh.ClientConfig{
		User:            "user",
		Auth:            []ssh.AuthMethod{ssh.Password("pass")},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		Timeout:         10 * time.Second,
	}
	results := make(chan *result, len(hosts))
	var wg sync.WaitGroup
	wg.Add(len(hosts))
	for _, host := range hosts {
		go connect(host, cfg, results, &wg)
	}
	wg.Wait()
	close(results)
	for res := range results {
		if res.err != nil {
			fmt.Fprintln(os.Stderr, res.err)
			continue
		}
		d := res.device
		fmt.Println(d.addr, d.hostname, d.vendor, d.os, d.model, d.version)
		d.Client.Close()
	}
}
// Device represents a network device.
type Device struct {
	*ssh.Client
	addr     string
	hostname string
	vendor   string
	os       string
	model    string
	version  string
}
// Dial establishes an ssh client connection to a remote host.
func Dial(host, port string, cfg *ssh.ClientConfig) (*Device, error) {
	// get host info in background, may take a second;
	// the channel is buffered so this goroutine can finish
	// even if ssh.Dial fails and nothing receives from it
	info := make(chan map[string]string, 1)
	go func() {
		info <- sysDescr(host)
		close(info)
	}()
	// establish ssh client connection to host
	client, err := ssh.Dial("tcp", net.JoinHostPort(host, port), cfg)
	if err != nil {
		return nil, err
	}
	m := <-info
	d := &Device{
		Client:   client,
		addr:     m["addr"],
		hostname: m["hostname"],
		vendor:   m["vendor"],
		os:       m["os"],
		model:    m["model"],
		version:  m["version"],
	}
	return d, nil
}
// sysDescr attempts to gather information about a remote host.
func sysDescr(host string) map[string]string {
	// get host info
}
// result is the result of connecting to a device.
type result struct {
	device *Device
	err    error
}
// connect establishes an ssh client connection to a host.
func connect(host string, cfg *ssh.ClientConfig, results chan<- *result, wg *sync.WaitGroup) {
	defer wg.Done()
	device, err := Dial(host, "22", cfg)
	results <- &result{device, err}
}
Am I doing something wrong? Can I limit the number of goroutines being spawned instead of spawning a goroutine for each host?
cfg is accessed by multiple goroutines; I wonder whether adding a mutex to protect it helps. Just an idea.
ssh.Dial within my Dial function with the error referencing an authentication error as the reason for failure. So, I think you’re right about the cfg access. Is there another way than using mutexes? Maybe copying the cfg for each call to connect?
sysDescr uses SNMP to poll a network device’s sysDescr.0 OID to gather information about the device.
cfg for each goroutine.