3

I'm trying to construct a random, yet predictable cronjob schedule for a monthly and daily cronjob based upon arbitrary user-provided data. The daily and monthly cronjobs should run with a sufficient delay of at least 30 minutes.

The goal is that if a user provides the same input repeatedly, the cronjobs will always run at the exact same time. Which data yields what schedule should "feel" random, but doesn't have to be cryptographically random. However, if the user just slightly changes the input, the cronjobs should run at totally different (i.e. not similar) times. If the user then changes the input back to the original time, the cronjobs should run at the exact same time as before.

There's no way of storing the resulting cronjob schedule persistently, i.e. it must be constructed strictly from the user data. The user data is arbitrary, i.e. we mustn't expect it to have a certain form or length - it can be an empty string, or a string with 1 gigabytes of random data.

The cronjob schedule should be constructed using Bash with just Busybox tools installed.


My idea was the following:

  1. The schedule of the monthly cronjob might be represented by an integer representing the nth minute of the month. So, n=123 represents the 123rd minute of the month, or 2:03 on the 1st day of the month (cronjob schedule 3 2 1 * *). In contrast, n=12345 represents the 12,345th minute of the month, or 13:45 on the 9th day of the month (cronjob schedule 45 13 9 * *). Since the cronjob wouldn't run in February otherwise, we accept 23:59 on the 28th day max. Thus we need an integer between 0 and 40319 (= 28 days * 24 hours * 60 minutes - 1).

    For this we might create a __crontab_monthly() function, accepting an arbitrary integer. Since the input integer isn't necessarily within the expected range, we first perform a modulo operation of 40320. We then perform modulo operations and divisions to get the respective day, hour, and minute. Lastly we concat the cronjob schedule.

    The same principle also works for the daily cronjob, just limited to an integer between 0 and 1439 (= 24 hours * 60 minutes - 1). We might create a similar __crontab_daily() function for that.

    I don't really have a idea yet about how to ensure that the monthly and daily cronjobs run with a sufficient delay - besides just offsetting the value by some magic value... Any ideas?

  2. However, for this to work we first need to calculate a random, yet persistent integer from the user data to feed them into our __crontab_{monthly,daily}() functions. Since we must accept arbitrary user data, my idea was to first calculate the md5 hash of the user data. This ensures that the result is perceived as random (tiny changes in the input yield a totally different result), yet it is predictable and yields consistent results for the same data.

    The reason why a md5 hash might be a good starting point is that a md5 hash is just the hexdecimal string representation of a 128-bit number. However, we can't do math with a 128-bit number in Bash. I thus had the idea of uniformly consolidating two unique 128-bit hashes to the same 64-bit integer. My approach was to first split the hash into two 64-bit slices, perform modulo operations of 2^32 to condense the slices down to 32 bit each and then concat the two binary numbers. This should yield a 64-bit signed integer which can then be fed into the __crontab_{monthly,daily}() functions.


I came up with the following solution so far. The user input is stored inside the $USER_DATA variable. I'm rather certain that the __crontab_{monthly,daily}() functions do what they are supposed to do, but the __crontab_reference() function is more tricky... I'm just not sure whether the calculations are correct, binary isn't "my thing". Can someone help please?

__crontab_reference() {
    local HASH="$(md5sum <<< "$1" | cut -d ' ' -f 1)"
    echo $(( 0x$(printf '%x\n' $(( 0x${HASH:0:16} % 2147483648 )))$(printf '%x\n' $(( 0x${HASH:16:16} % 2147483648 ))) ))
}

__crontab_daily() {
    local NUMBER=$(( "$1" % 1440 ))
    NUMBER=$(( NUMBER * ((NUMBER>0) - (NUMBER<0)) ))

    local HOUR=$(( NUMBER / 60 ))
    local MINUTE=$(( NUMBER % 60 ))
    echo "$MINUTE $HOUR * * *"
}

__crontab_monthly() {
    local NUMBER=$(( "$1" % 40320 ))
    NUMBER=$(( NUMBER * ((NUMBER>0) - (NUMBER<0)) ))

    local DAY=$(( NUMBER / 1440 + 1 ))
    local HOUR=$(( NUMBER % 1440 / 60 ))
    local MINUTE=$(( NUMBER % 1440 % 60 ))
    echo "$MINUTE $HOUR $DAY * *"
}

CRONTAB_REFERENCE="$(__crontab_reference "$USER_DATA")"
echo "Daily cronjob schedule: $(__crontab_daily "$CRONTAB_REFERENCE")"
echo "Monthly cronjob schedule: $(__crontab_monthly "$CRONTAB_REFERENCE")"

For USER_DATA="" this yields:

Daily cronjob schedule: 36 1 * * *
Monthly cronjob schedule: 36 1 20 * *

Last but not least, does anyone have ideas (please including PoC code) for alternative approaches?

#edit 2024-08-05: Amending the task because I just noticed that I asked for the wrong thing (the goal is sufficient delay of the cronjobs, different hours won't guarantee that)

1
  • 2
    Transforming some hash value is the way to go. But without the need for any cryptographic strength, you could also directly extract the different target values (day, hour, minute etc.) from different slices of the hash, eliminating the need to calculate between time scales (just one mod for each target range). Commented Aug 4, 2024 at 22:05

3 Answers 3

3

bc can operate on hex directly.

Using pmf's idea from the comments:

hash=$(md5sum <<<"$data")

read md mh mm dh dm <<<$( echo $(
    echo "
        ibase=16; s=${hash^^}0; ibase=A
        k=2^8
              md= s%28+1
        s/=k; mh= s%24
        s/=k; mm= s%60
        s/=k; dh= (mh+(s%23)+1)%24
        s/=k; dm= s%60

        md;mh;mm; dh;dm
    " | bc
))

To ensure mh != dh, we offset it by a non-zero amount. The form is dh = (mh + (s%(MaxExtraOffset+1)) + MinOffset) %24. For this work properly, make sure that MaxExtraOffset + 2*MinOffset <= 24.

md5sum output is lowercase but bc needs hex specified in uppercase. tr can be used to do the conversion if the shell being used doesn't have ${var^^} parameter expansion.

Also, << heredoc syntax can be used if <<< is not available.


bash can also operate on hex directly, and won't overflow if the numbers are small enough.

If you replace s% with (s%k)% everywhere above, the result will be the same as this:

hash=$(md5sum <<<"$data")

md=$(( 16#${hash:30:2} %28+1 ))
mh=$(( 16#${hash:28:2} %24 ))
mm=$(( 16#${hash:26:2} %60 ))
dh=$(( (mh+(16#${hash:24:2} %23)+1)%24 ))
dm=$(( 16#${hash:22:2} %60 ))

:2 can be larger (and the offsets adjusted) to make use of more digits of the hash.

Sign up to request clarification or add additional context in comments.

9 Comments

@PhrozenByte using more bits of the hash just means less chance of different data resulting in same crontab values, so more "random". you can change the magic numbers: mh+(s%23)+1 can become mh+(s%22)+2 for example, to make hour differ by at least two (just make sure 23+1 / 22+2 are <= 24). Might be better to use mh+(s%20)+4, etc, to allow for timezone changes
@PhrozenByte now I look, I think my bash answer is just Renaud's earlier approach but with magic numbers rather than parameters.
(of course if you use :6 instead of :2 the offsets also have to be reduced, or you fall off the end of the string - it doesn't actually matter if the slices selected overlap, but the result will be less "random" if they do)
@PhrozenByte I think I calculated wrong. possibly should be mh+(s%21)+2 and mh+(s%17)+4 to avoid wraparound making the hours too close the other way around. see edit.
Regarding (1): I feel like that :2 (= 8 bit = 256 values) isn't much randomness for a 0-59 range. I'm curious, does using :2 mean that it's significantly more likely that we see some minutes more often? I might be very wrong, but my thought is as follows: Unless the range (0-59) fits exactly into the base (2^8 values) it's expected that we see the values yielded by the remainder more often. Since 60 only fits 4.3 times into 256, this gets significant: minutes 0-16 are yielded in 33.2% of the cases; expected is just 17/60=28.3%. The effect never disappears, but is proportional, isn't it?
|
2

You could try the following that uses only bash and md5sum, as you apparently have this in your busybox. RANDOM is a seed-able pseudo random generator. We seed it with the 4 left hexadecimal digits of the md5sum of USER_DATA, that is, 16 bits. This should work on any architecture, even the quite rare with 16 bits integers, and even if your bash was compiled with the CHECK_OVERFLOW macro defined. If your target platform is a 32 (64) bits architecture, and you think 16 bits seeds are not enough, you can seed with the 8 (16) left hexadecimal digits (replace 0x${h:0:4} with 0x${h:0:8} or 0x${h:0:16}).

declare -a n=() m=( 1440 1381 28 ) o=( 0 30 1 )
read h dummy < <( md5sum <<< "$USER_DATA" )
for(( (RANDOM = 0x${h:0:4}) && (i = 0); i < 3; i++ )); do
  (( n[2 * i] = RANDOM % m[i] + o[i] ))
done
(( n[2] = (n[0] + n[2]) % 1440 ))
for(( i = 0; i < 3; i += 2 )); do
  (( n[i + 1] = n[i] / 60, n[i] = n[i] % 60 ))
done
printf 'Daily cronjob schedule: %d %d * * *\n' "${n[@]:0:2}"
printf 'Monthly cronjob schedule: %d %d %d * *\n' "${n[@]:2:3}"

Arrays m and o store the modulus and offsets to apply. After the first for loop, array n contains:

Index Range Description
0 0-1439 Daily minutes (full day offset)
1 ?
2 30-1410 Offset of monthly minutes from daily minutes
3 ?
4 1-28 Monthly day

n[2] = (n[0] + n[2]) % 1440 guarantees that daily and monthly minutes are at least 30 minutes apart because after this adjustment we have 30 <= |n[2] - n[0]| % 1440 <= 1410, that is, greater or equal 30 minutes and less or equal 23 hours and 30 minutes. We convert them to minutes and hours with the second for loop, after which array n contains:

Index Range Crontab field
0 0-59 Daily minutes
1 0-23 Daily hours
2 0-59 Monthly minutes
3 0-23 Monthly hours
4 1-28 Monthly day

With the guarantee that there is at least 30 minutes between the two dates.

10 Comments

I really like your approach, it feels very elegant to directly use a seed-able pseudo random generator. Thanks! I've got two thoughts though: 1. Is RANDOM=0x$h safe considering that $h is a 128-bit number? I'd assume yes (it just overflows), but I want to be sure 2. I'd like to amend my question regarding the different hours constraint. You perfectly solved what I asked for, but I just noticed that I actually need something else: My goal is to ensure that there's a sufficient (30+ minutes) delay between the cronjobs; just having different hours won't do that (e.g. 15:59 vs. 16:00)
@PhrozenByte 1. Yes, it is truncated to the 64 least significant bits. But as this is not documented you need to dive in the source code to see it. Anyway, I added an explicit truncation (to the 64 most significant bits) using bash substitutions. So, even if bash changes this should be OK. 2. See my updated answer. Instead of drawing minutes and hours we draw full day minutes and apply a similar strategy as before to guarantee that they are at least 30 minutes apart. Then we convert back to minutes and hours.
@PhrozenByte Looking again at the source code I see that, if your bash was compiled with the CHECK_OVERFLOW macro defined, and there is an overflow, RANDOM is assigned constant INTMAX_MAX. As this would ruin the randomness it is probably safer to use only the 4 leading bytes of md5sum. This leaves only 65536 different seed values but would work on any architecture, even the quite rare with 16 bits integers. I update my answer.
I like the idea but I think you would have to include the random number generator to ensure you always get the same result, since the implementation sometimes changes. RANDOM=1234567890; echo $RANDOM gives different results on 3.2.57(1)-release (ubuntu); 4.0.33(0)-release (FreeBSD); 5.1.16(1)-release (ubuntu); I get: 3536, 8569, 14988
@jhnc If the random generator shall be stable across bash versions see SHELL COMPATIBILITY MODE and BASH_COMPAT in bash documentation.
|
1

Thanks to @renaud's (here) and @jnhc's (here) equally amazing solutions (♥) I came up with another solution that tries to combine the advantages of both solutions and my original (but incomplete) approach.

Here's what I came up with:

__crontabs() {
    local HASH="$(md5sum - <<< "$1")"

    # TIME represents the `n`th minute of the month (e.g. 12,345 is the 9th day at 13:45)
    # since the cronjob wouldn't run in February otherwise, we accept the 28th day max,
    # thus we perform an euclidean division by 28 days * 24 hours * 60 minutes = 40,320
    local TIME=$(( 0x${HASH:0:14} % 40320 ))

    # DAILY_TIME_OFFSET represents a `n` minute offset from TIME's time of day
    # the daily cronjob shall run between 00:30 and 23:30 hours later than the monthly cronjob,
    # thus we perform an euclidean division by 24 hours * 60 minutes - 2 * 30 minutes + 1 = 1,381
    # yielding some value between 00:00 and 23:00 hours, thus adding another 30 minutes
    local DAILY_TIME_OFFSET=$(( 0x${HASH:16:14} % 1381 + 30 ))

    local MONTHLY_DAY=$(( TIME / 1440 + 1 ))
    local MONTHLY_HOUR=$(( TIME % 1440 / 60 ))
    local MONTHLY_MINUTE=$(( TIME % 1440 % 60 ))

    local DAILY_TIME=$(( ( TIME % 1440 + DAILY_TIME_OFFSET ) % 1440 ))
    local DAILY_HOUR=$(( DAILY_TIME / 60 ))
    local DAILY_MINUTE=$(( DAILY_TIME % 60 ))

    printf '%d %d * * *\n' $DAILY_MINUTE $DAILY_HOUR
    printf '%d %d %d * *\n' $MONTHLY_MINUTE $MONTHLY_HOUR $MONTHLY_DAY
}

CRONTABS="$(__crontabs "$USER_DATA")"
echo "Daily cronjob schedule: $(head -n1 <<< "$CRONTABS")"
echo "Monthly cronjob schedule: $(tail -n1 <<< "$CRONTABS")"

For USER_DATA="" this yields:

Daily cronjob schedule: 27 5 * * *
Monthly cronjob schedule: 52 16 19 * *

For USER_DATA="Hello World" it yields:

Daily cronjob schedule: 51 1 * * *
Monthly cronjob schedule: 44 5 2 * *

The reasoning is as follows:

First I tried to adapt the simplicity of @jnhc's solution by avoiding loops and using @jnhc's easy-to-understand step-by-step approach.

However, I feel like that @jnhc's slicing of the hash into so many small pieces depends on a deeper understanding of md5 to comprehend whether this might not have unexpected side effects in regards to the final "randomness". I therefore wanted to use @renaud's elegant idea of utilizing $RANDOM as seed-able pseudo random number generator. That's why I included it in my solution at first. However, I then noticed that it actually doesn't provide enough randomness: $RANDOM only yields 32768 different values. In @renaud's solution this is no issue because the biggest modulo is 1440 there, but in my solution it's 40320.

Since 40320 is greater than 32768 I couldn't use $RANDOM and ended up using two 56-bit slices of the 128-bit hash. Why 56 bit and not 64 bit? Because Bash calculates with signed 64-bit integers (platform-dependant, see the discussion below @renaud's answer) and using just 56 bit guarantees that we work with positive integers only (again, works on 64-bit platforms only).

So, why am I working with modulo 40320 again? Because I wanted to keep my original "nth minute of the month" approach. The reason is that my amended "30 minutes offset" requirement caused @renaud to switch to a "nth minute of the day" approach, so I thought that my original approach isn't far and makes the whole solution a little easier to understand. To solve the requirement I chose @renaud's approach - @jnhc's would have worked too, @renaud's solution is just closer to the amended task description.

This gives us the presented solution above.

All credit for my merged solution goes to @renaud and @jnhc - they did the hard part, I just adapted their works ♥

I hope I didn't make any mistakes here; if I did, please let me know. Feedback is very much appreciated.


Last but not least I want to thank @renaud and @jnhc for their equally great solutions. Before writing my own merged solution I was thinking about whose solution I should choose as accepted answer, but couldn't really make a decision: @renaud's solution is amazingly elegant, @jnhc's solution is remarkably simple and easy to understand. They work equally well.

My merged solution is by no means "better" than the other two, just different. Thanks again! ♥

Since all answers provide equally well solutions I'm intentionally not marking any answer as accepted answer.

5 Comments

the original %40320 actually makes much more sense than my/pmf's slicing of the hash. the parts of the date/time are not independent.
Note that pseudo-random generators have 2 different important parameters: output range and period length. With RANDOM the period is 2^31 (imagine a hidden random number generator which output is 31 bits but what you see of this is only 15 bits). If 32768 is too short you can simply use RANDOM * 32768 + RANDOM. This provides 30 bits of pseudo-randomness, that is, a less than 0.004% bias with your modulo 40230.
@Renaud I'm curios, is RANDOM * <RANDOM_MAX> + RANDOM a generally meaningful approach to safely get more random data, or rather some Bash-specific solution? I'm especially curios about the * 32768 part. Can you provide some resources for me to read? Thanks!
@PhrozenByte It is a general approach when the period is much longer than the range. In our specific case the best resource is the bash source code itself. You'll find the whole story in lib/sh/random.c. RANDOM is produced by brand that itself calls intrand32 to get a 31 bits random value. But brand returns only 15 bits using masking (plus folding if shell_compatibility_level > 50). So, behind the scene, there is a 31 bits random generator.
@PhrozenByte Thanks to this, if you evaluate RANDOM repeatedly you will not walk a cycle of 32767 values, you will walk a cycle of 2^31 - 1 values, 15 bits each. This is why you can get 30 bits of randomness with RANDOM * 32768 + RANDOM. It would not work with a true 15 bits Linear Feedback Shift Register (LFSR), for instance.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.