@catkins catkins commented Oct 12, 2025

Description

To give us more control over log chunking, I've added a field to the agent job representation that lets us dictate the log chunk interval remotely.

This PR accepts that field and uses it to determine the chunk ticker interval at the beginning of the job.

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go fmt ./...)

Disclosures / Credits

I wrote the app code and had amp generate the test. It's kind of a crap test, but this codepath didn't have great test plumbing around it, and I think it needs a bit of a refactor to make it easier to test. I'm keen to hear ideas from @buildkite/agent-stewards on how I can pin this down a bit more (mostly around the jitter and the first-chunk plumbing) so the tests are closer to the intent of the change.

@catkins catkins requested a review from a team October 12, 2025 23:33
@DrJosh9000 DrJosh9000 self-requested a review October 12, 2025 23:36
@catkins catkins force-pushed the catkins/configurable-chunks-interval branch from 5703a49 to 81c136c Compare October 12, 2025 23:46
mb := mockBootstrap(t)
mb.Expect().Once().AndCallFunc(func(c *bintest.Call) {
start := time.Now()
for time.Since(start) < 10*time.Second {
Contributor

TIL this construct. handy!

Contributor Author

i reckon

ratio := float64(avg2s) / float64(avg1s)
t.Logf("Ratio of 2s/1s intervals: %.2f", ratio)

if ratio < 1.4 {
Contributor

why 1.4? we're not jittering the interval internally, so shouldn't the interval on the 2s always be 2x longer than the interval on 1s? i suppose the chunk collector will sometimes collect more than its size limit and preempt the serverside timeout?

Contributor Author

yeah 1.4 is a bit of a weird number... I had a bit of a struggle with testing this tbh.

the code in streamJobLogsAfterProcessStart (run_job.go) has a few funky interactions:

  • the jitter sleep is up to the processInterval, using rand.N
  • the jitter is applied after the ticker fires
  • the ticker will keep ticking on its schedule irrespective of the jitter
  • we also have this first channel to immediately try and push out a chunk without waiting

this all means the intervals themselves are kinda all over the joint, depending on the dice-roll of the jitter, which can be pathologically lumpy

e.g. with a 1s interval:

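| Ticker | Jitter | Final Time | Time since last |
|--------|--------|------------|-----------------|
| 1.0s   | 0.9s   | 1.9s       | N/A             |
| 2.0s   | 0.1s   | 2.1s       | 0.2s            |
| 3.0s   | 0.9s   | 3.9s       | 1.8s            |
| 4.0s   | 0.1s   | 4.1s       | 0.2s            |
| 5.0s   | 0.5s   | 5.5s       | 1.4s            |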

Ideally it would be straightforward to control or disable the jitter in the tests too, but that also requires some surgery or exposing internals that aren't very integration test-ey which I don't love either. Sigh.

Comment on lines 438 to 439
avg1s := runTestWithInterval(t, 1)
avg2s := runTestWithInterval(t, 2)
Contributor

this test suite is currently pretty quick: 1.6 seconds on my machine. This test needs(?) to wait for 10s to complete, so i think we should do both of these runs concurrently to minimise how much time we add to the suite. something like:

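		// Each goroutine writes to its own variable. Writing to a shared map
		// from two goroutines without a lock would be a data race.
		// This assumes runTestWithInterval doesn't call t.FailNow/t.Fatal,
		// which must only be called from the test's own goroutine.
		var avg1s, avg2s time.Duration
		var wg sync.WaitGroup
		wg.Add(2)

		go func() {
			defer wg.Done()
			avg1s = runTestWithInterval(t, 1)
		}()
		go func() {
			defer wg.Done()
			avg2s = runTestWithInterval(t, 2)
		}()

		wg.Wait()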

Contributor Author

yeah good idea

@moskyb moskyb self-requested a review October 15, 2025 01:20
@moskyb moskyb left a comment

LGTM! exciting.

@catkins catkins merged commit 6f99559 into main Oct 15, 2025
1 check passed
@catkins catkins deleted the catkins/configurable-chunks-interval branch October 15, 2025 11:21
4 participants