Random, yet predictable cronjob schedule based upon arbitrary user data in Bash

Question

I'm trying to construct a random, yet predictable cronjob schedule for a monthly and daily cronjob based upon arbitrary user-provided data. The daily and monthly cronjobs should run with a sufficient delay of at least 30 minutes.

The goal is that if a user provides the same input repeatedly, the cronjobs will always run at the exact same time. Which data yields what schedule should "feel" random, but doesn't have to be cryptographically random. However, if the user just slightly changes the input, the cronjobs should run at totally different (i.e. not similar) times. If the user then changes the input back to the original time, the cronjobs should run at the exact same time as before.

There's no way of storing the resulting cronjob schedule persistently, i.e. it must be constructed strictly from the user data. The user data is arbitrary, i.e. we mustn't expect it to have a certain form or length - it can be an empty string, or a string with 1 gigabytes of random data.

The cronjob schedule should be constructed using Bash with just Busybox tools installed.

My idea was the following:

The schedule of the monthly cronjob might be represented by an integer representing the nth minute of the month. So, n=123 represents the 123rd minute of the month, or 2:03 on the 1st day of the month (cronjob schedule 3 2 1 * *). In contrast, n=12345 represents the 12,345th minute of the month, or 13:45 on the 9th day of the month (cronjob schedule 45 13 9 * *). Since the cronjob wouldn't run in February otherwise, we accept 23:59 on the 28th day max. Thus we need an integer between 0 and 40319 (= 28 days * 24 hours * 60 minutes - 1).

For this we might create a __crontab_monthly() function, accepting an arbitrary integer. Since the input integer isn't necessarily within the expected range, we first perform a modulo operation of 40320. We then perform modulo operations and divisions to get the respective day, hour, and minute. Lastly we concat the cronjob schedule.

The same principle also works for the daily cronjob, just limited to an integer between 0 and 1439 (= 24 hours * 60 minutes - 1). We might create a similar __crontab_daily() function for that.

I don't really have a idea yet about how to ensure that the monthly and daily cronjobs run with a sufficient delay - besides just offsetting the value by some magic value... Any ideas?
However, for this to work we first need to calculate a random, yet persistent integer from the user data to feed them into our __crontab_{monthly,daily}() functions. Since we must accept arbitrary user data, my idea was to first calculate the md5 hash of the user data. This ensures that the result is perceived as random (tiny changes in the input yield a totally different result), yet it is predictable and yields consistent results for the same data.

The reason why a md5 hash might be a good starting point is that a md5 hash is just the hexdecimal string representation of a 128-bit number. However, we can't do math with a 128-bit number in Bash. I thus had the idea of uniformly consolidating two unique 128-bit hashes to the same 64-bit integer. My approach was to first split the hash into two 64-bit slices, perform modulo operations of 2^32 to condense the slices down to 32 bit each and then concat the two binary numbers. This should yield a 64-bit signed integer which can then be fed into the __crontab_{monthly,daily}() functions.

I came up with the following solution so far. The user input is stored inside the $USER_DATA variable. I'm rather certain that the __crontab_{monthly,daily}() functions do what they are supposed to do, but the __crontab_reference() function is more tricky... I'm just not sure whether the calculations are correct, binary isn't "my thing". Can someone help please?

__crontab_reference() {
    local HASH="$(md5sum <<< "$1" | cut -d ' ' -f 1)"
    echo $(( 0x$(printf '%x\n' $(( 0x${HASH:0:16} % 2147483648 )))$(printf '%x\n' $(( 0x${HASH:16:16} % 2147483648 ))) ))
}

__crontab_daily() {
    local NUMBER=$(( "$1" % 1440 ))
    NUMBER=$(( NUMBER * ((NUMBER>0) - (NUMBER<0)) ))

    local HOUR=$(( NUMBER / 60 ))
    local MINUTE=$(( NUMBER % 60 ))
    echo "$MINUTE $HOUR * * *"
}

__crontab_monthly() {
    local NUMBER=$(( "$1" % 40320 ))
    NUMBER=$(( NUMBER * ((NUMBER>0) - (NUMBER<0)) ))

    local DAY=$(( NUMBER / 1440 + 1 ))
    local HOUR=$(( NUMBER % 1440 / 60 ))
    local MINUTE=$(( NUMBER % 1440 % 60 ))
    echo "$MINUTE $HOUR $DAY * *"
}

CRONTAB_REFERENCE="$(__crontab_reference "$USER_DATA")"
echo "Daily cronjob schedule: $(__crontab_daily "$CRONTAB_REFERENCE")"
echo "Monthly cronjob schedule: $(__crontab_monthly "$CRONTAB_REFERENCE")"

For USER_DATA="" this yields:

Daily cronjob schedule: 36 1 * * *
Monthly cronjob schedule: 36 1 20 * *

Last but not least, does anyone have ideas (please including PoC code) for alternative approaches?

#edit 2024-08-05: Amending the task because I just noticed that I asked for the wrong thing (the goal is sufficient delay of the cronjobs, different hours won't guarantee that)

Transforming some hash value is the way to go. But without the need for any cryptographic strength, you could also directly extract the different target values (day, hour, minute etc.) from different slices of the hash, eliminating the need to calculate between time scales (just one mod for each target range). — pmf
– pmf, Commented Aug 4, 2024 at 22:05

jhnc · Accepted Answer · 2024-08-05 10:45:56Z

3

bc can operate on hex directly.

Using pmf's idea from the comments:

hash=$(md5sum <<<"$data")

read md mh mm dh dm <<<$( echo $(
    echo "
        ibase=16; s=${hash^^}0; ibase=A
        k=2^8
              md= s%28+1
        s/=k; mh= s%24
        s/=k; mm= s%60
        s/=k; dh= (mh+(s%23)+1)%24
        s/=k; dm= s%60

        md;mh;mm; dh;dm
    " | bc
))

To ensure mh != dh, we offset it by a non-zero amount. The form is dh = (mh + (s%(MaxExtraOffset+1)) + MinOffset) %24. For this work properly, make sure that MaxExtraOffset + 2*MinOffset <= 24.

md5sum output is lowercase but bc needs hex specified in uppercase. tr can be used to do the conversion if the shell being used doesn't have ${var^^} parameter expansion.

Also, << heredoc syntax can be used if <<< is not available.

bash can also operate on hex directly, and won't overflow if the numbers are small enough.

If you replace s% with (s%k)% everywhere above, the result will be the same as this:

hash=$(md5sum <<<"$data")

md=$(( 16#${hash:30:2} %28+1 ))
mh=$(( 16#${hash:28:2} %24 ))
mm=$(( 16#${hash:26:2} %60 ))
dh=$(( (mh+(16#${hash:24:2} %23)+1)%24 ))
dm=$(( 16#${hash:22:2} %60 ))

:2 can be larger (and the offsets adjusted) to make use of more digits of the hash.

edited Aug 5, 2024 at 10:45

answered Aug 5, 2024 at 0:05

jhnc

18.7k2 gold badges14 silver badges33 bronze badges

Sign up to request clarification or add additional context in comments.

9 Comments

jhnc Over a year ago

@PhrozenByte using more bits of the hash just means less chance of different data resulting in same crontab values, so more "random". you can change the magic numbers: mh+(s%23)+1 can become mh+(s%22)+2 for example, to make hour differ by at least two (just make sure 23+1 / 22+2 are <= 24). Might be better to use mh+(s%20)+4, etc, to allow for timezone changes

jhnc Over a year ago

@PhrozenByte now I look, I think my bash answer is just Renaud's earlier approach but with magic numbers rather than parameters.

jhnc Over a year ago

(of course if you use :6 instead of :2 the offsets also have to be reduced, or you fall off the end of the string - it doesn't actually matter if the slices selected overlap, but the result will be less "random" if they do)

jhnc Over a year ago

@PhrozenByte I think I calculated wrong. possibly should be mh+(s%21)+2 and mh+(s%17)+4 to avoid wraparound making the hours too close the other way around. see edit.

PhrozenByte Over a year ago

Regarding (1): I feel like that :2 (= 8 bit = 256 values) isn't much randomness for a 0-59 range. I'm curious, does using :2 mean that it's significantly more likely that we see some minutes more often? I might be very wrong, but my thought is as follows: Unless the range (0-59) fits exactly into the base (2^8 values) it's expected that we see the values yielded by the remainder more often. Since 60 only fits 4.3 times into 256, this gets significant: minutes 0-16 are yielded in 33.2% of the cases; expected is just 17/60=28.3%. The effect never disappears, but is proportional, isn't it?

|

Renaud Pacalet · Accepted Answer · 2024-08-05 13:10:27Z

2

You could try the following that uses only bash and md5sum, as you apparently have this in your busybox. RANDOM is a seed-able pseudo random generator. We seed it with the 4 left hexadecimal digits of the md5sum of USER_DATA, that is, 16 bits. This should work on any architecture, even the quite rare with 16 bits integers, and even if your bash was compiled with the CHECK_OVERFLOW macro defined. If your target platform is a 32 (64) bits architecture, and you think 16 bits seeds are not enough, you can seed with the 8 (16) left hexadecimal digits (replace 0x${h:0:4} with 0x${h:0:8} or 0x${h:0:16}).

declare -a n=() m=( 1440 1381 28 ) o=( 0 30 1 )
read h dummy < <( md5sum <<< "$USER_DATA" )
for(( (RANDOM = 0x${h:0:4}) && (i = 0); i < 3; i++ )); do
  (( n[2 * i] = RANDOM % m[i] + o[i] ))
done
(( n[2] = (n[0] + n[2]) % 1440 ))
for(( i = 0; i < 3; i += 2 )); do
  (( n[i + 1] = n[i] / 60, n[i] = n[i] % 60 ))
done
printf 'Daily cronjob schedule: %d %d * * *\n' "${n[@]:0:2}"
printf 'Monthly cronjob schedule: %d %d %d * *\n' "${n[@]:2:3}"

Arrays m and o store the modulus and offsets to apply. After the first for loop, array n contains:

Index	Range	Description
0	0-1439	Daily minutes (full day offset)
1	?
2	30-1410	Offset of monthly minutes from daily minutes
3	?
4	1-28	Monthly day

n[2] = (n[0] + n[2]) % 1440 guarantees that daily and monthly minutes are at least 30 minutes apart because after this adjustment we have 30 <= |n[2] - n[0]| % 1440 <= 1410, that is, greater or equal 30 minutes and less or equal 23 hours and 30 minutes. We convert them to minutes and hours with the second for loop, after which array n contains:

Index	Range	Crontab field
0	0-59	Daily minutes
1	0-23	Daily hours
2	0-59	Monthly minutes
3	0-23	Monthly hours
4	1-28	Monthly day

With the guarantee that there is at least 30 minutes between the two dates.

edited Aug 5, 2024 at 13:10

answered Aug 5, 2024 at 5:58

Renaud Pacalet

30.7k3 gold badges42 silver badges60 bronze badges

10 Comments

PhrozenByte Over a year ago

I really like your approach, it feels very elegant to directly use a seed-able pseudo random generator. Thanks! I've got two thoughts though: 1. Is RANDOM=0x$h safe considering that $h is a 128-bit number? I'd assume yes (it just overflows), but I want to be sure 2. I'd like to amend my question regarding the different hours constraint. You perfectly solved what I asked for, but I just noticed that I actually need something else: My goal is to ensure that there's a sufficient (30+ minutes) delay between the cronjobs; just having different hours won't do that (e.g. 15:59 vs. 16:00)

Renaud Pacalet Over a year ago

@PhrozenByte 1. Yes, it is truncated to the 64 least significant bits. But as this is not documented you need to dive in the source code to see it. Anyway, I added an explicit truncation (to the 64 most significant bits) using bash substitutions. So, even if bash changes this should be OK. 2. See my updated answer. Instead of drawing minutes and hours we draw full day minutes and apply a similar strategy as before to guarantee that they are at least 30 minutes apart. Then we convert back to minutes and hours.

Renaud Pacalet Over a year ago

@PhrozenByte Looking again at the source code I see that, if your bash was compiled with the CHECK_OVERFLOW macro defined, and there is an overflow, RANDOM is assigned constant INTMAX_MAX. As this would ruin the randomness it is probably safer to use only the 4 leading bytes of md5sum. This leaves only 65536 different seed values but would work on any architecture, even the quite rare with 16 bits integers. I update my answer.

jhnc Over a year ago

I like the idea but I think you would have to include the random number generator to ensure you always get the same result, since the implementation sometimes changes. RANDOM=1234567890; echo $RANDOM gives different results on 3.2.57(1)-release (ubuntu); 4.0.33(0)-release (FreeBSD); 5.1.16(1)-release (ubuntu); I get: 3536, 8569, 14988

Renaud Pacalet Over a year ago

@jhnc If the random generator shall be stable across bash versions see SHELL COMPATIBILITY MODE and BASH_COMPAT in bash documentation.

|

PhrozenByte · Accepted Answer · 2024-08-05 20:59:29Z

1

Thanks to @renaud's (here) and @jnhc's (here) equally amazing solutions (♥) I came up with another solution that tries to combine the advantages of both solutions and my original (but incomplete) approach.

Here's what I came up with:

__crontabs() {
    local HASH="$(md5sum - <<< "$1")"

    # TIME represents the `n`th minute of the month (e.g. 12,345 is the 9th day at 13:45)
    # since the cronjob wouldn't run in February otherwise, we accept the 28th day max,
    # thus we perform an euclidean division by 28 days * 24 hours * 60 minutes = 40,320
    local TIME=$(( 0x${HASH:0:14} % 40320 ))

    # DAILY_TIME_OFFSET represents a `n` minute offset from TIME's time of day
    # the daily cronjob shall run between 00:30 and 23:30 hours later than the monthly cronjob,
    # thus we perform an euclidean division by 24 hours * 60 minutes - 2 * 30 minutes + 1 = 1,381
    # yielding some value between 00:00 and 23:00 hours, thus adding another 30 minutes
    local DAILY_TIME_OFFSET=$(( 0x${HASH:16:14} % 1381 + 30 ))

    local MONTHLY_DAY=$(( TIME / 1440 + 1 ))
    local MONTHLY_HOUR=$(( TIME % 1440 / 60 ))
    local MONTHLY_MINUTE=$(( TIME % 1440 % 60 ))

    local DAILY_TIME=$(( ( TIME % 1440 + DAILY_TIME_OFFSET ) % 1440 ))
    local DAILY_HOUR=$(( DAILY_TIME / 60 ))
    local DAILY_MINUTE=$(( DAILY_TIME % 60 ))

    printf '%d %d * * *\n' $DAILY_MINUTE $DAILY_HOUR
    printf '%d %d %d * *\n' $MONTHLY_MINUTE $MONTHLY_HOUR $MONTHLY_DAY
}

CRONTABS="$(__crontabs "$USER_DATA")"
echo "Daily cronjob schedule: $(head -n1 <<< "$CRONTABS")"
echo "Monthly cronjob schedule: $(tail -n1 <<< "$CRONTABS")"

For USER_DATA="" this yields:

Daily cronjob schedule: 27 5 * * *
Monthly cronjob schedule: 52 16 19 * *

For USER_DATA="Hello World" it yields:

Daily cronjob schedule: 51 1 * * *
Monthly cronjob schedule: 44 5 2 * *

The reasoning is as follows:

First I tried to adapt the simplicity of @jnhc's solution by avoiding loops and using @jnhc's easy-to-understand step-by-step approach.

However, I feel like that @jnhc's slicing of the hash into so many small pieces depends on a deeper understanding of md5 to comprehend whether this might not have unexpected side effects in regards to the final "randomness". I therefore wanted to use @renaud's elegant idea of utilizing $RANDOM as seed-able pseudo random number generator. That's why I included it in my solution at first. However, I then noticed that it actually doesn't provide enough randomness: $RANDOM only yields 32768 different values. In @renaud's solution this is no issue because the biggest modulo is 1440 there, but in my solution it's 40320.

Since 40320 is greater than 32768 I couldn't use $RANDOM and ended up using two 56-bit slices of the 128-bit hash. Why 56 bit and not 64 bit? Because Bash calculates with signed 64-bit integers (platform-dependant, see the discussion below @renaud's answer) and using just 56 bit guarantees that we work with positive integers only (again, works on 64-bit platforms only).

So, why am I working with modulo 40320 again? Because I wanted to keep my original "nth minute of the month" approach. The reason is that my amended "30 minutes offset" requirement caused @renaud to switch to a "nth minute of the day" approach, so I thought that my original approach isn't far and makes the whole solution a little easier to understand. To solve the requirement I chose @renaud's approach - @jnhc's would have worked too, @renaud's solution is just closer to the amended task description.

This gives us the presented solution above.

All credit for my merged solution goes to @renaud and @jnhc - they did the hard part, I just adapted their works ♥

I hope I didn't make any mistakes here; if I did, please let me know. Feedback is very much appreciated.

Last but not least I want to thank @renaud and @jnhc for their equally great solutions. Before writing my own merged solution I was thinking about whose solution I should choose as accepted answer, but couldn't really make a decision: @renaud's solution is amazingly elegant, @jnhc's solution is remarkably simple and easy to understand. They work equally well.

My merged solution is by no means "better" than the other two, just different. Thanks again! ♥

Since all answers provide equally well solutions I'm intentionally not marking any answer as accepted answer.

answered Aug 5, 2024 at 20:59

PhrozenByte

1314 bronze badges

5 Comments

jhnc Over a year ago

the original %40320 actually makes much more sense than my/pmf's slicing of the hash. the parts of the date/time are not independent.

Renaud Pacalet Over a year ago

Note that pseudo-random generators have 2 different important parameters: output range and period length. With RANDOM the period is 2^31 (imagine a hidden random number generator which output is 31 bits but what you see of this is only 15 bits). If 32768 is too short you can simply use RANDOM * 32768 + RANDOM. This provides 30 bits of pseudo-randomness, that is, a less than 0.004% bias with your modulo 40230.

PhrozenByte Over a year ago

@Renaud I'm curios, is RANDOM * <RANDOM_MAX> + RANDOM a generally meaningful approach to safely get more random data, or rather some Bash-specific solution? I'm especially curios about the * 32768 part. Can you provide some resources for me to read? Thanks!

Renaud Pacalet Over a year ago

@PhrozenByte It is a general approach when the period is much longer than the range. In our specific case the best resource is the bash source code itself. You'll find the whole story in lib/sh/random.c. RANDOM is produced by brand that itself calls intrand32 to get a 31 bits random value. But brand returns only 15 bits using masking (plus folding if shell_compatibility_level > 50). So, behind the scene, there is a 31 bits random generator.

Renaud Pacalet Over a year ago

@PhrozenByte Thanks to this, if you evaluate RANDOM repeatedly you will not walk a cycle of 32767 values, you will walk a cycle of 2^31 - 1 values, 15 bits each. This is why you can get 30 bits of randomness with RANDOM * 32768 + RANDOM. It would not work with a true 15 bits Linear Feedback Shift Register (LFSR), for instance.

Collectives™ on Stack Overflow

Random, yet predictable cronjob schedule based upon arbitrary user data in Bash

3 Answers 3

9 Comments

10 Comments

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

9 Comments

10 Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related