I have been struggling to find the best way to approach this problem in a bash script. I have a command that checks groups of servers for their uptime in minutes. I only want to continue on to the next group of reboots once all of the servers have been up for 5 minutes, but I also want to verify they haven't been up for over an hour in case the reboot didn't take.

I was originally trying to set up a while loop that would keep issuing the command to check uptimes and send the output into an array. I am trying to figure out how to loop through an array until all of its elements are greater than 5 and less than 60. I haven't even been successful with the first check of greater than 5. Is it even possible to continually write to an array and perform arithmetic checks against every value in it, so that all values must be greater than X, in a while loop? The number of servers putting their current uptime into the array varies per group, so the array won't always hold the same number of values.

Is an array even the proper way to do this? I'd provide examples of what I have tried so far but it's a huge mess and I think starting from scratch just asking for input might be best to start with.

Output of the command I am running to pull uptimes looks similar to the following:

1
2
1
4
3
2

Edit

Thanks to the help provided, I was able to get a functional proof of concept together for this, and I'm stoked. Here it is in case it helps anyone trying to do something similar in the future. The problem at hand was that we use AWS SSM for all of our Windows server patching, and many times when SSM tells servers to reboot after patching, the SSM Agent takes ages to check in. This slows down our entire process, which right now is fairly manual across dozens of patch groups. Many times we have to manually verify that a server did indeed reboot after we told it to from SSM, so that we know we can start the reboots for the next patch group. With this, we will be able to run a single script that issues reboots for our patch groups in the proper order and verifies that the servers have properly rebooted before continuing on to the next group.

#!/bin/bash

### The purpose of this script is to automate the execution of commands required to reboot groups of AWS Windows servers utilizing SSM while also verifying their uptime and only continuing on to the next group once the previous has reached X # of minutes. This solves the problems of AWS SSM Agents not properly checking in with SSM post-reboot.

patchGroups=(01 02 03)                      # array containing the values of the RebootGroup tag


for group in "${patchGroups[@]}"
do
    printf "Rebooting Patch Group %q\n" "$group"
    aws ec2 reboot-instances --instance-ids $(aws ec2 describe-instances --filters "Name=tag:RebootGroup,Values=$group" --query 'Reservations[].Instances[].InstanceId' --output text)

    sleep 2m

    unset      passed failed serverList                      # wipe arrays and the server list
    declare -A passed failed                                 # declare associative arrays (serverList stays a plain string)

    serverList=$(aws ec2 describe-instances --filters "Name=tag:RebootGroup,Values=$group" --query 'Reservations[*].Instances[*].[InstanceId]' --output text)

    for server in ${serverList}                  # loop through list of servers
    do
        failed["${server}"]=0                     # add to the failed[] array
    done

    while [[ "${#failed[@]}" -gt 0 ]]             # loop while number of servers in the failed[] array is greater than 0
    do
        for server in "${!failed[@]}"             # loop through servers in the failed[] array
        do
            ssmID=$(aws ssm send-command --document-name "AWS-RunPowerShellScript" --document-version "1" --targets "[{\"Key\":\"InstanceIds\",\"Values\":[\"$server\"]}]" --parameters '{"commands":["$wmi = Get-WmiObject -Class Win32_OperatingSystem ","$uptimeMinutes =    ($wmi.ConvertToDateTime($wmi.LocalDateTime)-$wmi.ConvertToDateTime($wmi.LastBootUpTime) | select-object -expandproperty \"TotalMinutes\")","[int]$uptimeMinutes"],"workingDirectory":[""],"executionTimeout":["3600"]}' --timeout-seconds 600 --max-concurrency    "50" --max-errors "0" --region us-west-2 --output text --query "Command.CommandId")

            sleep 5

            uptime=$(aws ssm list-command-invocations --command-id "$ssmID" --details --query 'CommandInvocations[].CommandPlugins[].Output' --output text | sed 's/\r$//')

            printf "Checking instance ID %q\n" "$server"
            printf "Value of uptime is = %q\n" "$uptime"

            # if uptime is within our 'success' window then move server to passed[] array

            if [[ "${uptime}" -ge 3 && "${uptime}" -lt 60 ]] 
            then
                passed["${server}"]="${uptime}"   # add to passed[] array
                printf "Server with instance ID %q has successfully rebooted.\n" "$server"
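                unset "failed[${server}]"         # remove from failed[] array (quoted to avoid globbing)
            else
                failed["${server}"]="${uptime}"   # record latest uptime for the status display
            fi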
        done

        # display current status (edit/remove as desired)

        printf "\n++++++++++++++ successful reboots\n"
        printf "%s\n" "${!passed[@]}" | sort -n

        printf "\n++++++++++++++ failed reboot\n"

        for server in "${!failed[@]}"
        do
            printf "%s - %s (mins)\n" "${server}" "${failed[${server}]}"
        done | sort -n

        printf "\n"

        sleep 60                            # adjust as necessary
    done
done
  • could you please provide the input data for your script? Commented Dec 18, 2020 at 19:41
  • unit of measurement for that list of numbers (1 2 1 4 3 2), seconds? minutes? how are you managing the list of servers ... stored in an array? stored in a file?; again, how big is huge mess and can you post a minimal version of your code that represents your activities? thinking about this some more ... an associative array where the index is the server name and the value is the latest 'uptime' (uptime[server1]=3 (min)); assuming a main while true type of loop, the inner loop would loop through the array indices/values ... and break out of main while loop when counter=0 Commented Dec 18, 2020 at 20:00
  • alternative ... all servers originally loaded into out-of-range array; as they pass the test, remove from the out-of-range array and into the in-range array; when no more entries in the out-of-range array ... move onto next set of servers ... Commented Dec 18, 2020 at 20:03
  • @markp-fuso thank you again. I updated the original post to show the proof of concept script which I owe you 98% of the credit for creating. Likely wouldn't have gotten there without your assistance. I appreciate the thoroughness of your answers and how quickly you were able to whip up something from scratch. Commented Dec 21, 2020 at 6:11
  • @ChrisSmith glad I could help and thanks for the detailed update on what you're trying to accomplish; not sure how to re-word it but ... you may want to see if you can change the subject of the question to something more befitting what you're trying to do (ie, something more descriptive than 'how to tell if a number is between 2 other numbers' ?) :-) Commented Dec 21, 2020 at 12:14

2 Answers

It sounds like you need to keep re-evaluating the output of uptime to get the data you need, so an array or other variable may just get you stuck. Think about this functionally (as in functions). You need a function that checks if the uptime is within the bounds you want, just once. Then, you need to run that function periodically. If it is successful, you trigger the reboot. If it fails, you let it try again later.

Consider this code:

uptime_in_bounds() {
    local min="$1"
    local max="$2"
    local uptime_float uptime_secs

    # The first value in /proc/uptime is the number of seconds the
    # system has been up. We have to truncate it to an integer…
    read -r uptime_float _ < /proc/uptime
    uptime_secs="${uptime_float%.*}"

    # A shell function reflects the exit status of its last command.
    # This function "succeeds" if the uptime_secs is between min and max.
    (( min < uptime_secs && max > uptime_secs ))
}

if uptime_in_bounds 300 3600; then
    sudo reboot  # or whatever
fi
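
To apply the same idea to a whole group, one option is a function that takes the bounds plus the list of reported uptimes and succeeds only when every value is inside the window. This is a minimal sketch, not a definitive implementation: `all_in_bounds` and the sample values are hypothetical names chosen for illustration, and the uptimes are assumed to be whole minutes, one per server.

```bash
#!/usr/bin/env bash

# Succeeds (exit status 0) only if every remaining argument is an
# integer strictly greater than min and strictly less than max.
all_in_bounds() {
    local min="$1" max="$2" value
    shift 2

    (( $# > 0 )) || return 1                      # no values reported yet: not ready

    for value in "$@"; do
        [[ "$value" =~ ^[0-9]+$ ]] || return 1    # reject empty or non-numeric output
        (( 10#$value > min && 10#$value < max )) || return 1
    done
}

# Hypothetical sample data standing in for the real uptime command's output.
uptimes=(6 7 6 9 8 7)

if all_in_bounds 5 60 "${uptimes[@]}"; then
    echo "all servers in range"
else
    echo "still waiting"
fi
```

Because the values are passed as arguments, the number of servers per group can vary freely. The array could be filled from a command that prints one number per line with `mapfile -t uptimes < <(your_uptime_command)`, where `your_uptime_command` is a placeholder.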

1 Comment

How would I go about utilizing this exactly? For example my first group of servers I am rebooting post patching will report something along the lines of ` 1 2 1 4 3 2 ` I would only like to proceed with rebooting my next group of servers once the uptime of each of those servers reports back a number greater than 5.

General idea ... will likely need some tweaking based on how OP is tracking servers, obtaining uptimes, etc ...

# for a given set of servers, and assuming stored in variable ${server_list} ...

unset      passed failed                      # wipe arrays
declare -A passed failed                      # declare associative arrays

for server in ${server_list}                  # loop through list of servers
do
    failed["${server}"]=0                     # add to the failed[] array
done

while [[ "${#failed[@]}" -gt 0 ]]             # loop while number of servers in the failed[] array is greater than 0
do
    for server in "${!failed[@]}"             # loop through servers in the failed[] array
    do
        uptime=$( some_command_to_get_uptime_for_server "${server}" )

        # if uptime is within our 'success' window then move server to passed[] array

        if [[ "${uptime}" -gt 5 && "${uptime}" -lt 60 ]] 
        then
            passed["${server}"]="${uptime}"   # add to passed[] array
        unset "failed[${server}]"         # remove from failed[] array (quoted to avoid globbing)
        else
            failed["${server}"]="${uptime}"
        fi
    done

    # display current status (edit/remove as desired)

    printf "\n++++++++++++++ successful reboots\n"
    printf "%s\n" "${!passed[@]}" | sort -n

    printf "\n++++++++++++++ failed reboot\n"

    for server in "${!failed[@]}"
    do
        printf "%s - %s (mins)\n" "${server}" "${failed[${server}]}"
    done | sort -n

    printf "\n"

    sleep 30                            # adjust as necessary
done

NOTES:

  • this code would likely be part of a larger looping construct based on sets of servers (ie, new ${server_list})
  • if list of servers is in another format (eg, file, another array, etc) will need to modify the for loop to properly populate the failed[] array
  • OP will need to edit to add code for finding uptime for a given ${server}
  • OP (obviously) free to rename variables/arrays as desired
  • OP will probably need to decide on what to do if the while loop continues 'too long'
  • if a new ${uptime} is not within the 5-60 min range, OP can add an else block to perform some other operation(s) for the problematic ${server}
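
For the 'too long' case in the notes above, one possible approach is to bound the polling loop with a deadline based on bash's built-in `SECONDS` counter. This is a sketch only: `wait_until`, `max_wait`, `poll_interval` and `check` are hypothetical names, and the command passed in stands in for whatever per-group test is actually used.

```bash
#!/usr/bin/env bash

# wait_until MAX_WAIT POLL_INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_WAIT seconds have elapsed.
# Returns 0 on success, 1 on timeout.
wait_until() {
    local max_wait="$1" poll_interval="$2"
    shift 2
    local deadline=$(( SECONDS + max_wait ))

    until "$@"; do
        (( SECONDS < deadline )) || return 1   # gave up: caller decides what to do
        sleep "$poll_interval"
    done
}

# Example: a stand-in check (hypothetical) that succeeds on its third call.
attempts=0
check() { (( ++attempts >= 3 )); }

if wait_until 10 1 check; then
    echo "ready after ${attempts} checks"
else
    echo "timed out" >&2
fi
```

The same deadline test could instead be added to the condition of the existing while loop, so that any servers still left in failed[] when time runs out can be reported or skipped before moving on to the next group.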

1 Comment

Markp-fuso, thank you. This is fantastic. Very well laid out and I think it will truly help me towards a final solution. This is extremely helpful, especially when I was just providing a general problem and no existing code to work with. I will report back on my results.
