allocation: time compute rounds by start time over last convergence #137418

schase-es · 2025-10-31T05:08:13Z

In the DesiredBalanceAllocator, there are periodic log messages that warn of long-running allocation balancing rounds. These track the number of loop iterations and the amount of time passed since the last convergence, and log when either exceeds a limit. A previous effort (4c979aa, #100850) persisted these metrics across compute runs so that a sequence of cluster state changes did not disrupt these warnings. However, the modified time-based threshold compares the current time against the last convergence, instead of the round's start time. If enough time has passed between balancing rounds, this produces a warn-level message that misrepresents the round's compute time. This change continues to include the time since last convergence, but uses the time since compute began as the log threshold and as the time since resumption.

Fixes: ES-13327
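
For readers skimming the description, here is a minimal sketch of the difference between the two thresholds. The class, method names, and numbers are hypothetical and are not taken from DesiredBalanceComputer; the real code tracks a next-report time rather than calling helpers like these.

```java
/** Hypothetical sketch: two ways to decide whether a "still not converged" report is due. */
final class ReportThresholdSketch {

    // Rule described as problematic: measured from the last convergence, so idle time counts.
    static boolean dueSinceLastConvergence(long nowMillis, long lastConvergedMillis, long intervalMillis) {
        return nowMillis - lastConvergedMillis >= intervalMillis;
    }

    // Rule proposed in the description: measured from the start of the current compute round.
    static boolean dueSinceComputeStart(long nowMillis, long computeStartedMillis, long intervalMillis) {
        return nowMillis - computeStartedMillis >= intervalMillis;
    }

    public static void main(String[] args) {
        long intervalMillis = 60_000;         // assumed log interval of one minute
        long lastConvergedMillis = 0;         // previous round converged at t=0
        long computeStartedMillis = 600_000;  // next round starts after ten idle minutes
        long nowMillis = 600_050;             // 50ms into the new round

        // Idle time alone trips the first rule, even though the round has barely started.
        System.out.println(dueSinceLastConvergence(nowMillis, lastConvergedMillis, intervalMillis)); // true
        System.out.println(dueSinceComputeStart(nowMillis, computeStartedMillis, intervalMillis));   // false
    }
}
```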

elasticsearchmachine · 2025-10-31T05:08:39Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

schase-es · 2025-10-31T05:16:35Z

This is for:

The main change is that the time-based non-convergence log decision now compares against the time since compute started, instead of the time since last convergence.

Throughout both convergence and non-convergence logging, all phrases like "still not converged after" or "converged after" use the time since compute started, instead of the time since last convergence.

Similarly, the end of each message now reports the time since the last convergence, instead of saying "since the last resumption", which used the time since compute started.

DaveCTurner

Thanks Simon, I left a couple of comments mostly around testing.

DaveCTurner · 2025-10-31T08:47:08Z

...java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputerTests.java

+                "Desired balance computation for [*] is still not converged after [10s] and [1000] iterations, "
+                    + "last convergence was [10s] ago"


++ I think this is a useful addition to the message. Could we have some tests where the last-converged time differs from the computation-started time to make sure we don't get the numbers backwards?

DaveCTurner · 2025-10-31T09:00:29Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

        final long timeWarningInterval = progressLogInterval.millis();
        final long computationStartedTime = timeProvider.relativeTimeInMillis();
-        long nextReportTime = Math.max(lastNotConvergedLogMessageTimeMillis, lastConvergedTimeMillis) + timeWarningInterval;
+        long nextReportTime = computationStartedTime + timeWarningInterval;


Hmm this doesn't look right to me: if we keep on resuming the computation just before timeWarningInterval elapses without ever converging, I think this means we'll never log a report. While the computation hasn't converged we need the log behaviour to ignore resumptions and account for the overall computation time. Maybe a testing gap?
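
The concern above can be illustrated with a small simulation. This is only a sketch under assumed names and a simplified clock (one tick per millisecond); it is not the DesiredBalanceComputer loop. It counts how many reports would be due when the computation is resumed at a fixed cadence without ever converging, with and without resetting the report deadline on each resumption.

```java
/** Hypothetical simulation of progress reports across repeated resumptions. */
final class ResumptionReportSketch {

    /**
     * Counts the reports that become due while a computation is resumed every
     * resumeEveryMillis without converging.
     *
     * @param resetOnResume if true, each resumption pushes the deadline to now + intervalMillis
     */
    static int countReports(boolean resetOnResume, long intervalMillis, long resumeEveryMillis, long totalMillis) {
        long nextReportTime = intervalMillis; // the first computation starts at t=0
        int reports = 0;
        for (long now = 0; now <= totalMillis; now++) {
            boolean resumedNow = now > 0 && now % resumeEveryMillis == 0;
            if (resumedNow && resetOnResume) {
                nextReportTime = now + intervalMillis; // deadline restarts from the resumption
            }
            if (now >= nextReportTime) {
                reports++;
                nextReportTime = now + intervalMillis; // one report per interval
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        // Resume every 9ms with a 10ms interval, over 100ms of non-converging compute.
        System.out.println(countReports(true, 10, 9, 100));  // 0: the deadline is never reached
        System.out.println(countReports(false, 10, 9, 100)); // 10: one report per elapsed interval
    }
}
```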

This is an intentional change from me, and I adjusted the testing to match. (This is the setting change that shortens the log interval to 1 millisecond below.)

The issue I'm pointing out is subtle, and this code is hard to read. These messages have caused a few issues, so I want to get this right.

The progress log interval setting was initially built to log periodically when the compute loop was taking too long, so one report is logged per interval. In 4c979aa, this facility was extended so that its initial report is based on the time since the last convergence. So whenever compute starts again some time after a convergence, and more than the log interval has passed, we see a log message that it still hasn't converged.

The ticket is about fixing the log message so the time duration reported in this message reflects the amount of time spent in the compute loop. I could leave this time-since-convergence interval in place, but there will still be two issues:

when we restart compute after a long time away (which may be normal), we'll still see a log about allocation still not being converged at the end of the first compute loop, even though nothing is wrong. The duration will be accurately reported, which is a positive step.

an early exit caused by cluster state changing will always be logged at debug.

To fix both of these, we should:

start tracking the time since the first call into compute after a convergence, and use this as the base for the initial log period. This is more significant than the last convergence, because it's when allocation starts trying. This whole idea may be based on a misunderstanding of how and when allocation works: if it's constantly running, then there won't be any difference between the end of the last convergence and the start of the next balance.

whenever the cluster state changes and more than the log period has passed since allocation last started up, the early exit message should log at info

Let me draft something so you can see this in practice.

start tracking the time since the first call into compute after a convergence, and use this as the base for the initial log period

That sounds like what we want indeed, but I don't think that is what the code is doing as written (looking at 77e545a). At the moment it's resetting nextReportTime each time it resumes, based on the time of the resumption, regardless of whether the previous computation converged or not.

Yes! I have not included this in this PR -- I wanted to have a discussion about it first, but have taken a half-step in removing something old while waiting for feedback.

I will revert this, and add the other part separately.

I see ok, so are you saying that you're proposing to (effectively) revert #126008 in this PR and then address #137020 in a separate PR? I don't think we'd want to do that as two separate steps because we might cut a release in the intermediate, regressed, state.

I meant fixing the start time here, so that it's the time since the first effort at compute, and then deciding separately whether to address the yield-to-new-input log caused by cluster updates, which is logged at debug.

DaveCTurner · 2025-10-31T09:05:55Z

...java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputerTests.java

    public void testLoggingOfComputeCallsAndIterationsSinceConvergence() {
        final var clusterSettings = new ClusterSettings(
-            Settings.builder().put(DesiredBalanceComputer.PROGRESS_LOG_INTERVAL_SETTING.getKey(), TimeValue.timeValueMillis(5L)).build(),
+            Settings.builder().put(DesiredBalanceComputer.PROGRESS_LOG_INTERVAL_SETTING.getKey(), TimeValue.timeValueMillis(1L)).build(),


I don't think we should do this - at 5ms some time can pass without seeing another log message (and that's what we want) but at 1ms we expect a log message every tick of the clock.
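
To make the difference between the two settings concrete, here is a rough sketch that assumes the test clock advances one millisecond per tick. The helper is hypothetical and not part of the test suite.

```java
/** Hypothetical sketch: how many reports a given interval yields over a number of 1ms clock ticks. */
final class IntervalTickSketch {

    static int reportsOverTicks(long intervalMillis, int ticks) {
        long nextReportTime = intervalMillis; // computation assumed to start at t=0
        int reports = 0;
        for (long now = 1; now <= ticks; now++) {
            if (now >= nextReportTime) {
                reports++;
                nextReportTime = now + intervalMillis;
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        System.out.println(reportsOverTicks(1, 10)); // 10: every tick produces a report
        System.out.println(reportsOverTicks(5, 10)); // 2: most ticks pass silently
    }
}
```

With a 1ms interval a test cannot observe a tick that passes without a report, which is the behaviour the 5ms setting lets the test cover.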

schase-es · 2025-11-03T23:02:10Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

                        numComputeCallsSinceLastConverged,
                        iterations,
-                        TimeValue.timeValueMillis(currentTime - computationStartedTime).toString()
+                        TimeValue.timeValueMillis(currentTime - lastConvergedTimeMillis).toString()


This changes the message to report the time since the compute call began up front, and clarifies the time since last convergence.

There is now a mix of different markers, and I'm not sure the iterations are now represented quite right either: the "converged after [<duration this compute round>] and [<iterations since convergence>]" are mismatched.

How about this?

"still not converged after [%s] and [%d] iterations": the time and iteration count in this compute run. And maybe something about "this round"?

"resumed computation [%d] times with [%d] iterations since [%s]": the number of compute calls, iterations, and time since our first compute effort since last convergence (this message differs from the current template)

"since the last convergence [%s] ago": the time since last convergence

The debug log message above is one I was curious about -- this is "Desired balance computation for [{}] interrupted after [{}] and [{}] iterations as newer cluster state received. Publishing intermediate desired balance and restarting computation." on line 419/429.

I am wondering how frequent this message is, and if it should be logged at info if enough time has passed since compute restarted.

DaveCTurner · 2025-11-18T12:43:37Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

        final long timeWarningInterval = progressLogInterval.millis();
        final long computationStartedTime = timeProvider.relativeTimeInMillis();
-        long nextReportTime = Math.max(lastNotConvergedLogMessageTimeMillis, lastConvergedTimeMillis) + timeWarningInterval;
+        if (lastConvergedTimeMillis > firstComputeSinceConvergedTimeMillis) {


If the previous computation converged before the clock advanced (e.g. it took <1ms) then we would have lastConvergedTimeMillis == firstComputeSinceConvergedTimeMillis and hence wouldn't update firstComputeSinceConvergedTimeMillis here, so we'd still be counting the idle time in between that computation and the present one. Really we need to know if the previous computation converged or not regardless of how long it took.

Ah this is a great catch -- thanks for finding this.
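
A minimal sketch of the edge case, assuming the body of the `if` in the diff assigns the current time to firstComputeSinceConvergedTimeMillis (the hunk shows only the condition). The method names and the boolean flag are hypothetical; they illustrate one way to record whether the previous computation converged regardless of how long it took.

```java
/** Hypothetical sketch of choosing the "first compute since convergence" start time. */
final class FirstComputeStartSketch {

    // Variant from the diff under review: infer convergence by comparing timestamps.
    static long startByTimestamp(long previousFirstComputeMillis, long lastConvergedMillis, long nowMillis) {
        if (lastConvergedMillis > previousFirstComputeMillis) {
            return nowMillis; // previous computation is known to have converged
        }
        return previousFirstComputeMillis; // equal timestamps fall through here
    }

    // Alternative: track convergence explicitly, independent of the clock resolution.
    static long startByFlag(long previousFirstComputeMillis, boolean previousComputationConverged, long nowMillis) {
        return previousComputationConverged ? nowMillis : previousFirstComputeMillis;
    }

    public static void main(String[] args) {
        // The previous computation started and converged within the same millisecond (t=100),
        // and the next one starts at t=5000.
        System.out.println(startByTimestamp(100, 100, 5000)); // 100: idle time is still counted
        System.out.println(startByFlag(100, true, 5000));     // 5000: idle time is excluded
    }
}
```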

DaveCTurner · 2025-11-18T12:45:19Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

+                                Desired balance computation for [%d] converged after [%s] and [%d] iterations this round, \
+                                resumed computation [%s] ago with [%d] iterations over [%d] rounds since the last convergence \
+                                [%s] ago""",


I think these rewordings are basically a good idea but would much rather we split them out into a separate PR and keep this one focussed on fixing the bug that counts the idle time. It's just a bit much to keep track of which test changes relate to the cosmetics and which ones are genuine behaviour changes.

DaveCTurner · 2025-11-18T12:50:22Z

...java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputerTests.java

+                "no log messages",
+                DesiredBalanceComputer.class.getCanonicalName(),
+                Level.INFO,
+                "* still not converged after *"


👍 a pattern is required here because we do emit a log message, otherwise I'd suggest *. But "no log messages" is misleading: we do expect one log message, just not a "still not converged" one. Can we assert that we do see the one we expect to see here?

DaveCTurner · 2025-11-18T12:51:57Z

...java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputerTests.java

+            getLogExpectation.apply(new LogExpectationData(false, "1ms", 1, "17ms", 15, 3, "18ms")),
+            getLogExpectation.apply(new LogExpectationData(false, "6ms", 6, "22ms", 20, 3, "23ms")),
+            getLogExpectation.apply(new LogExpectationData(true, "10ms", 10, "26ms", 24, 3, "27ms"))


I'd like to go through these test changes in detail once we've removed the message rewordings from this PR.

DaveCTurner

I left two tiny comments but otherwise this looks like exactly what we need.

DaveCTurner · 2025-11-27T15:10:29Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

    private long numIterationsSinceLastConverged;
    private long lastConvergedTimeMillis;
    private long lastNotConvergedLogMessageTimeMillis;
+    private long firstComputeSinceConvergedTimeMillis;


naming nit: could we include the word "started" in this name somehow? We're tracking the start of the first computation since we converged.

DaveCTurner · 2025-11-27T15:38:18Z

...java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputerTests.java


    static final String TEST_INDEX = "test-index";

+    public void testShouldNotLogLongBalanceComputation() {


Style nit: test suites are easiest to read if they start from the simpler test cases and work down towards the more complex ones. There might be some other structure to the tests too, but in practice new test cases should often be put near the end. This one in particular is much less basic than testComputeBalance and testStopsComputingWhenStale and so on so I'd prefer it went lower down (probably somewhere near testLoggingOfComputeCallsAndIterationsSinceConvergence since they're both about logging)

schase-es requested review from DaveCTurner and JeremyDahlgren October 31, 2025 05:08

[CI] Auto commit changes from spotless

77e545a

schase-es removed the >bug label Oct 31, 2025

DaveCTurner reviewed Oct 31, 2025

schase-es added 3 commits November 3, 2025 00:03

Now introducing firstComputeSinceConvertedTimeMillis... in a can!

40b3895

Adding Yang's test

c16062c

Test changes

078db98

schase-es commented Nov 3, 2025

elasticsearchmachine added v9.1.8 v9.2.2 and removed v9.1.7 v9.2.1 labels Nov 6, 2025

schase-es added 2 commits November 13, 2025 00:38

Edited log messages to group duration and iterations/rounds logically…

f966381

…, into information about this past round, the time since compute began, and the time since convergence.

Merge branch 'main' into ES-12825_long-balance-computation-log-message

6b12b33

DaveCTurner reviewed Nov 18, 2025

schase-es added 3 commits November 19, 2025 18:00

Updated timing fix and reverted log message changes

377dc97

firstCompute time is now calculated whenever the last run converged.

Merge branch 'main' into ES-12825_long-balance-computation-log-message

33af652

Merge branch 'main' into ES-12825_long-balance-computation-log-message

14b15c9

Merge branch 'main' into ES-12825_long-balance-computation-log-message

840f788

elasticsearchmachine added v9.1.9 v9.2.3 and removed v9.1.8 v9.2.2 labels Nov 27, 2025

DaveCTurner reviewed Nov 27, 2025
