-
Notifications
You must be signed in to change notification settings - Fork 25.6k
allocation: time compute rounds by start time over last convergence #137418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
allocation: time compute rounds by start time over last convergence #137418
Conversation
In the DesiredBalanceAllocator, there are periodic log messages that warn of long-running allocation balancing rounds. These track the number of loop iterations and the amount of time passed since the last convergence, and log when either exceeds a limit. A previous effort (4c979aa, elastic#100850) persisted these metrics across compute runs so that a sequence cluster state changes did not disrupt these warnings. However, the modified time-based threshold compares the current time against the last convergence, instead of the round's start time. If enough time has passed between balancing rounds, this produces a warn-level message that misrepresents the round's compute time. This change continues to include the time since last convergence, but uses the time since compute began as the log threshold and as the time since resumption. Fixes: ES-13327
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
|
This is for: The main change is that when the time-based non-convergence log decision is done, it compares the time since compute started, instead of the time since last convergence. Throughout both convergence and non-convergence logging, all phrases like "still not converged after" or "converged after" uses the time since compute started, instead of the time since last convergence. Similarly, all messages end with "since the last convergence ago" have switched over to this, instead of saying "since the last resumption" which used the time since compute started. |
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Simon, I left a couple of comments mostly around testing.
| "Desired balance computation for [*] is still not converged after [10s] and [1000] iterations, " | ||
| + "last convergence was [10s] ago" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ I think this is a useful addition to the message. Could we have some tests where the last-converged time differs from the computation-started time to make sure we don't get the numbers backwards?
| final long timeWarningInterval = progressLogInterval.millis(); | ||
| final long computationStartedTime = timeProvider.relativeTimeInMillis(); | ||
| long nextReportTime = Math.max(lastNotConvergedLogMessageTimeMillis, lastConvergedTimeMillis) + timeWarningInterval; | ||
| long nextReportTime = computationStartedTime + timeWarningInterval; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this doesn't look right to me: if we keep on resuming the computation just before timeWarningInterval elapses without ever converging, I think this means we'll never log a report. While the computation hasn't converged we need the log behaviour to ignore resumptions and account for the overall computation time. Maybe a testing gap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an intentional change from me, and I adjusted the testing to match. (This is the setting change that shortens the log interval to 1 millisecond below.)
The issue I'm pointing out is subtle, and this code is hard to read. These messages have caused a few issues, so I want to get this right.
The period log interval setting was initially built to log periodically when the compute loop was taking too long, so one report is logged per interval. In 4c979aa, this facility was extended so that its initial report is based off the time since the last convergence. So whenever compute starts again some time after a convergence and more than the period interval has passed, we see a log message that it still hasn't converged.
The ticket is about fixing the log message so the time duration reported in this message reflects the amount of time spent in the compute loop. I could leave this time-since-convergence interval in place, but there will still be two issues:
-
when we restart compute after a long time away (which may be normal), we'll still see a log about allocation still not being converged at the end of the first compute loop, even though nothing is wrong. The duration will be accurately reported, which is a positive step.
-
an early exit caused by cluster state changing will always be logged at debug.
To fix both of these, we should:
-
start tracking the time since the first call into compute after a convergence, and use this as the base for the initial log period. This is more significant than the last convergence, because it's when allocation starts trying. This whole idea may be based on a misunderstanding of how and when allocation works: if it's constantly running, then there won't be any difference between the end of the last convergence and the start of the next balance.
-
whenever the cluster state changes and more than the log period has passed since allocation last started up, the early exit message should log at info
Let me draft something so you can see this in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start tracking the time since the first call into compute after a convergence, and use this as the base for the initial log period
That sounds like what we want indeed, but I don't think that is what the code is doing as written (looking at
77e545a). At the moment it's resetting nextReportTime each time it resumes, based on the time of the resumption, regardless of whether the previous computation converged or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! I have not included this in this PR -- I wanted to have a discussion about it first, but have taken a half-step in removing something old while waiting for feedback.
I will revert this, and add the other part separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant fixing the start time here, so that it's the time since the first effort at compute, and seeing if we want to address the yield to new input log because of cluster updates, which is logged at debug, separately..
| public void testLoggingOfComputeCallsAndIterationsSinceConvergence() { | ||
| final var clusterSettings = new ClusterSettings( | ||
| Settings.builder().put(DesiredBalanceComputer.PROGRESS_LOG_INTERVAL_SETTING.getKey(), TimeValue.timeValueMillis(5L)).build(), | ||
| Settings.builder().put(DesiredBalanceComputer.PROGRESS_LOG_INTERVAL_SETTING.getKey(), TimeValue.timeValueMillis(1L)).build(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should do this - at 5ms some time can pass without seeing another log message (and that's what we want) but at 1ms we expect a log message every tick of the clock.
| numComputeCallsSinceLastConverged, | ||
| iterations, | ||
| TimeValue.timeValueMillis(currentTime - computationStartedTime).toString() | ||
| TimeValue.timeValueMillis(currentTime - lastConvergedTimeMillis).toString() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the message to report the time since the compute call began up front, and clarifies the time since last convergence.
There is now a mix of different markers, and I'm not sure the iterations are now represented quite right either: the "converged after [<duration this compute round>] and [<iterations since convergence>]" are mismatched.
How about this?
"still not converged after [%s] and [%d] iterations": the time and iteration count in this compute run. And maybe something about "this round"?"resumed computation [%d] times with [%d] iterations since [%s]": the number of compute calls, iterations, and time since our first compute effort since last convergence (this message differs from the current template)"since the last convergence [%s] ago": the time since last convergence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The debug log message above is one I was curious about -- this is "Desired balance computation for [{}] interrupted after [{}] and [{}] iterations as newer cluster state received. Publishing intermediate desired balance and restarting computation." on line 419/429.
I am wondering how frequent this message is, and if it should be logged at info if enough time has passed since compute restarted.
…, into information about this past round, the time since compute began, and the time since convergence.
| final long timeWarningInterval = progressLogInterval.millis(); | ||
| final long computationStartedTime = timeProvider.relativeTimeInMillis(); | ||
| long nextReportTime = Math.max(lastNotConvergedLogMessageTimeMillis, lastConvergedTimeMillis) + timeWarningInterval; | ||
| if (lastConvergedTimeMillis > firstComputeSinceConvergedTimeMillis) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the previous computation converged before the clock advanced (e.g. it took <1ms) then we would have lastConvergedTimeMillis == firstComputeSinceConvergedTimeMillis and hence wouldn't update firstComputeSinceConvergedTimeMillis here, so we'd still be counting the idle time in between that computation and the present one. Really we need to know if the previous computation converged or not regardless of how long it took.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah this is a great catch -- thanks for finding this.
| Desired balance computation for [%d] converged after [%s] and [%d] iterations this round, \ | ||
| resumed computation [%s] ago with [%d] iterations over [%d] rounds since the last convergence \ | ||
| [%s] ago""", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these rewordings are basically a good idea but would much rather we split them out into a separate PR and keep this one focussed on fixing the bug that counts the idle time. It's just a bit much to keep track of which test changes relate to the cosmetics and which ones are genuine behaviour changes.
| "no log messages", | ||
| DesiredBalanceComputer.class.getCanonicalName(), | ||
| Level.INFO, | ||
| "* still not converged after *" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 a pattern is required here because we do emit a log message, otherwise I'd suggest *. But "no log messages" is misleading, we do expect one log message, just not a still not converged one. Can we assert that we do see the one we expect to see here?
| getLogExpectation.apply(new LogExpectationData(false, "1ms", 1, "17ms", 15, 3, "18ms")), | ||
| getLogExpectation.apply(new LogExpectationData(false, "6ms", 6, "22ms", 20, 3, "23ms")), | ||
| getLogExpectation.apply(new LogExpectationData(true, "10ms", 10, "26ms", 24, 3, "27ms")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to go through these test changes in detail once we've removed the message rewordings from this PR.
firstCompute time is now calculated whenever the last run converged.
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left two tiny tiny comments but otherwise this looks exactly what we need.
| private long numIterationsSinceLastConverged; | ||
| private long lastConvergedTimeMillis; | ||
| private long lastNotConvergedLogMessageTimeMillis; | ||
| private long firstComputeSinceConvergedTimeMillis; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming nit: could we include the word "started" in this name somehow? We're tracking the start of the first computation since we converged.
|
|
||
| static final String TEST_INDEX = "test-index"; | ||
|
|
||
| public void testShouldNotLogLongBalanceComputation() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nit: test suites are easiest to read if they start from the simpler test cases and work down towards the more complex ones. There might be some other structure to the tests too, but in practice new test cases should often be put near the end. This one in particular is much less basic than testComputeBalance and testStopsComputingWhenStale and so on so I'd prefer it went lower down (probably somewhere near testLoggingOfComputeCallsAndIterationsSinceConvergence since they're both about logging)
In the DesiredBalanceAllocator, there are periodic log messages that warn of long-running allocation balancing rounds. These track the number of loop iterations and the amount of time passed since the last convergence, and log when either exceeds a limit. A previous effort (4c979aa, #100850) persisted these metrics across compute runs so that a sequence of cluster state changes did not disrupt these warnings. However, the modified time-based threshold compares the current time against the last convergence, instead of the round's start time. If enough time has passed between balancing rounds, this produces a warn-level message that misrepresents the round's compute time. This change continues to include the time since last convergence, but uses the time since compute began as the log threshold and as the time since resumption.
Fixes: ES-13327