Speedup CoT verification a bit by parallelizing network operations #786

Merged
Eijebong merged 4 commits into mozilla-releng:main from Eijebong:speedup-cot-verification
Apr 9, 2026

@Eijebong Eijebong commented Apr 8, 2026

This parallelizes a good chunk of the CoT verification requests to hg/taskcluster, which are the only slow part of the whole process.

Using crimes to monkeypatch the functions that use the network, I gave each service a constant load time so that I wasn't at the mercy of whether HG wanted to take 2 or 12 seconds to respond at any given moment. I then benchmarked CoT verification on both a signing and a beetmover task, with the hg response time set to 5s and all other requests set to 0.5s.

Beetmover

Benchmark 1: uv run python tests/bench_cot_verify.py (ref = HEAD~3)
  Time (mean ± σ):     60.294 s ±  0.006 s    [User: 0.978 s, System: 0.241 s]
  Range (min … max):   60.288 s … 60.300 s    3 runs

Benchmark 2: uv run python tests/bench_cot_verify.py (ref = HEAD~2)
  Time (mean ± σ):     40.290 s ±  0.004 s    [User: 0.984 s, System: 0.236 s]
  Range (min … max):   40.286 s … 40.294 s    3 runs

Benchmark 3: uv run python tests/bench_cot_verify.py (ref = HEAD~1)
  Time (mean ± σ):     20.268 s ±  0.004 s    [User: 0.982 s, System: 0.235 s]
  Range (min … max):   20.264 s … 20.271 s    3 runs

Benchmark 4: uv run python tests/bench_cot_verify.py (ref = HEAD)
  Time (mean ± σ):     16.763 s ±  0.009 s    [User: 0.981 s, System: 0.226 s]
  Range (min … max):   16.754 s … 16.773 s    3 runs

Summary
  uv run python tests/bench_cot_verify.py (ref = HEAD) ran
    1.21 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~1)
    2.40 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~2)
    3.60 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~3)

Signing

Benchmark 1: uv run python tests/bench_cot_verify.py (ref = HEAD~3)
  Time (mean ± σ):     42.025 s ±  0.007 s    [User: 0.445 s, System: 0.094 s]
  Range (min … max):   42.020 s … 42.033 s    3 runs

Benchmark 2: uv run python tests/bench_cot_verify.py (ref = HEAD~2)
  Time (mean ± σ):     27.021 s ±  0.005 s    [User: 0.453 s, System: 0.085 s]
  Range (min … max):   27.018 s … 27.027 s    3 runs

Benchmark 3: uv run python tests/bench_cot_verify.py (ref = HEAD~1)
  Time (mean ± σ):     11.998 s ±  0.008 s    [User: 0.441 s, System: 0.097 s]
  Range (min … max):   11.991 s … 12.006 s    3 runs

Benchmark 4: uv run python tests/bench_cot_verify.py (ref = HEAD)
  Time (mean ± σ):     10.999 s ±  0.002 s    [User: 0.446 s, System: 0.093 s]
  Range (min … max):   10.997 s … 11.001 s    3 runs

Summary
  uv run python tests/bench_cot_verify.py (ref = HEAD) ran
    1.09 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~1)
    2.46 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~2)
    3.82 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~3)

With constant times, both are almost 4x faster. In practice I doubt that the numbers will be as good, although in manual testing the signing task took between 25 and 44s before, and I've seen it take 12-13s after.

Eijebong added 2 commits April 8, 2026 11:35
The .taskcluster.yml fetch and the json-e context population (which
fetches the pushlog and scm level) are independent. Run them concurrently
with `asyncio.gather`.

With a mocked implementation for hg/tc that sets every hg request to
take 5s and every TC request to take 0.5s, this reduces the time to
verify CoT on a signing task from 42s to 27s.
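A minimal sketch of the pattern this commit describes, assuming two independent coroutines. `fetch_taskcluster_yml` and `populate_jsone_context` are hypothetical stand-ins, not scriptworker's actual functions; the point is only that `asyncio.gather` overlaps the two waits instead of summing them.

```python
import asyncio
import time


async def fetch_taskcluster_yml(delay):
    # Hypothetical stand-in for fetching .taskcluster.yml from hg.
    await asyncio.sleep(delay)
    return "template"


async def populate_jsone_context(delay):
    # Hypothetical stand-in for the pushlog / scm level lookups.
    await asyncio.sleep(delay)
    return "context"


async def sequential(delay):
    tmpl = await fetch_taskcluster_yml(delay)
    ctx = await populate_jsone_context(delay)
    return tmpl, ctx


async def concurrent(delay):
    # Both awaitables are scheduled up front; results keep argument order.
    tmpl, ctx = await asyncio.gather(
        fetch_taskcluster_yml(delay),
        populate_jsone_context(delay),
    )
    return tmpl, ctx


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start
```

Calling `timed(sequential(0.5))` and `timed(concurrent(0.5))` should return the same values, with the concurrent variant taking roughly the duration of the slowest request rather than the sum.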

Run all task type and worker impl verification functions concurrently
with asyncio.gather instead of sequentially.

Each verification function only mutates its own link object, never
another link's state, so concurrent execution is safe. The log output
from different links might interleave now, but given the difference in
performance I think that it's a worthy tradeoff.

With a mocked implementation for hg/tc that sets every hg request to
take 5s and every TC request to take 0.5s, this reduces the time to
verify CoT on a signing task from 27s to 11s. With a beetmover task, it
goes from 40s to 20s.
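The safety argument above (each verifier only mutates its own link) can be sketched as follows. `Link`, `verify_link` and `verify_all` are simplified, hypothetical names and do not reflect scriptworker's real classes or signatures.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Link:
    # Hypothetical, heavily simplified link object.
    name: str
    verified: bool = False


async def verify_link(link, delay=0.01):
    await asyncio.sleep(delay)  # stands in for network-bound checks
    if link.name.startswith("bad"):
        raise ValueError(f"{link.name} failed verification")
    link.verified = True  # only touches this link's own state


async def verify_all(links):
    # With the default return_exceptions=False, the first exception
    # raised by any verifier propagates to the caller of gather.
    await asyncio.gather(*(verify_link(link) for link in links))
```

Because no coroutine writes to another link, the order in which they resume after each await does not change the final state; only the order of side effects such as log lines can differ.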
@Eijebong Eijebong requested a review from a team as a code owner April 8, 2026 14:41
…ies`

For each task, fetch its direct dependencies in parallel instead of one
at a time. We can't really do much more without knowing about the graph
itself since we don't want duplicates and might have diamond shapes. And
we need the task definition (which is what we're trying to get in
the first place) to get that graph shape...

This changes `build_link` (renamed to `add_link`) so that it only
fetches a single task definition and adds it to the chain instead of
recursing directly. Recursing into children is now done in
`build_task_dependencies` instead, essentially transforming the
traversal from a DFS to a BFS.

With a mocked implementation for hg/tc that sets every hg request to
take 5s and every TC request to take 0.5s, this reduces the time to
verify CoT on a signing task from 11s to 10s. On a beetmover task, the
impact is much more visible since the graph is much deeper and I'm
seeing improvements from 20s to 15s.
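A rough sketch of the DFS-to-BFS idea, assuming a level-by-level traversal in which every task of the current frontier is fetched in parallel. The names `add_link` and `build_task_dependencies` come from the commit message, but their signatures, the `GRAPH` data and `fetch_task` are assumptions made for illustration; the actual change may structure the recursion differently.

```python
import asyncio

# Hypothetical graph: task id -> ids of its direct dependencies.
GRAPH = {
    "signing": ["build", "decision"],
    "build": ["docker-image", "decision"],
    "docker-image": ["decision"],
    "decision": [],
}

fetch_log = []


async def fetch_task(task_id):
    # Stand-in for a Taskcluster request for the task definition.
    fetch_log.append(task_id)
    await asyncio.sleep(0.01)
    return {"dependencies": GRAPH[task_id]}


async def add_link(chain, task_id):
    # Fetch exactly one definition and record it; no recursion here.
    chain[task_id] = await fetch_task(task_id)


async def build_task_dependencies(root):
    chain = {}
    frontier = [root]
    while frontier:
        # Fetch every task of the current level in parallel.
        await asyncio.gather(*(add_link(chain, t) for t in frontier))
        # Next level: dependencies not fetched yet, deduplicated.
        next_frontier = []
        for task_id in frontier:
            for dep in chain[task_id]["dependencies"]:
                if dep not in chain and dep not in next_frontier:
                    next_frontier.append(dep)
        frontier = next_frontier
    return chain
```

The dependencies of a task are only known once its definition has been fetched, so the traversal cannot parallelize across levels, which matches the limitation described in the commit message.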
@Eijebong Eijebong force-pushed the speedup-cot-verification branch from b0cbc62 to 03a0951 Compare April 8, 2026 14:49

@bhearsum bhearsum left a comment

The log output from different links might interleave now, but given the difference in
performance I think that it's a worthy tradeoff.

Is it different enough that it's going to make debugging more difficult? If so, we may need to consider following up with something that buffers the output...
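For reference, one possible shape for the buffering follow-up mentioned here is a logging handler that groups records per asyncio task and flushes them in one block. This is purely a sketch under stated assumptions: `TaskBufferHandler`, `buffered`, `verify` and `demo` are hypothetical names, and nothing like this is part of the PR.

```python
import asyncio
import contextvars
import logging

# Per-task buffer; None means "append straight to the sink".
_buffer = contextvars.ContextVar("log_buffer", default=None)


class TaskBufferHandler(logging.Handler):
    """Hypothetical handler that groups records by asyncio task."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink  # list receiving the final, grouped lines

    def emit(self, record):
        target = _buffer.get()
        (self.sink if target is None else target).append(record.getMessage())


async def buffered(coro_fn, *args, sink):
    # Each task created by gather runs in its own copy of the context,
    # so this buffer is not visible to sibling tasks.
    buf = []
    token = _buffer.set(buf)
    try:
        return await coro_fn(*args)
    finally:
        _buffer.reset(token)
        sink.extend(buf)  # flush this task's lines as one block


async def verify(name, log):
    log.info("%s start", name)
    await asyncio.sleep(0)  # other tasks may log here
    log.info("%s end", name)


def demo():
    sink = []
    log = logging.getLogger("cot_buffer_demo")
    log.setLevel(logging.INFO)
    log.propagate = False
    handler = TaskBufferHandler(sink)
    log.addHandler(handler)

    async def main():
        await asyncio.gather(
            *(buffered(verify, name, log, sink=sink) for name in ("a", "b"))
        )

    try:
        asyncio.run(main())
    finally:
        log.removeHandler(handler)
    return sink
```

The tradeoff is that buffered lines only appear once a task finishes, so a hang would show no output for that task until it completes or fails.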

Pass the `seen` set downwards in recursion to avoid grabbing the same
dependency twice (in the case of diamond-shaped graphs).

In the signing case it's a small change (-0.5s) but on the beetmover case
it's a 3.5s win.
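A small sketch of why sharing `seen` matters for diamond-shaped graphs. The graph and the helper names (`fetch_deps`, `walk`, `walk_no_sharing`) are hypothetical; the sketch only assumes that sibling branches run concurrently and that a dependency is claimed in `seen` before the next await.

```python
import asyncio

# Hypothetical diamond: a -> b, c; b -> d; c -> d
GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


async def fetch_deps(task_id, counter):
    # Count how many times each task is fetched.
    counter[task_id] = counter.get(task_id, 0) + 1
    await asyncio.sleep(0.01)
    return GRAPH[task_id]


async def walk(task_id, seen, counter):
    deps = await fetch_deps(task_id, counter)
    # Claim unseen deps before the next await so that sibling
    # branches running concurrently skip them.
    todo = [dep for dep in deps if dep not in seen]
    seen.update(todo)
    if todo:
        await asyncio.gather(*(walk(dep, seen, counter) for dep in todo))


async def walk_no_sharing(task_id, counter):
    deps = await fetch_deps(task_id, counter)
    if deps:
        await asyncio.gather(*(walk_no_sharing(dep, counter) for dep in deps))


def run_shared():
    counter = {}
    asyncio.run(walk("a", {"a"}, counter))
    return counter


def run_unshared():
    counter = {}
    asyncio.run(walk_no_sharing("a", counter))
    return counter
```

In this toy graph `run_shared()` fetches `d` once while `run_unshared()` fetches it twice; the deeper and more interconnected the graph, the more requests the shared set saves.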

Eijebong commented Apr 8, 2026

Is it different enough that it's going to make debugging more difficult

I don't think so since this is only about grabbing stuff, not verifying it? But then I've never really had to debug issues with CoT so I can't tell you if the differences are meaningful or not.

bhearsum commented Apr 8, 2026

Is it different enough that it's going to make debugging more difficult

I don't think so since this is only about grabbing stuff, not verifying it? But then I've never really had to debug issues with CoT so I can't tell you if the differences are meaningful or not.

Is there an example log you can point me at?

Eijebong commented Apr 8, 2026

uv run verify_cot QzPzZhpMRt2oGv4Ummz2lQ --task-type signing --cot-product firefox

https://gist.github.com/Eijebong/5a738795654c0dadc285153dea3714ca

bhearsum commented Apr 9, 2026

I want to scour the logs a bit more, but it doesn't seem too bad (or even too different) at a glance. If it's possible to show a log with a failure that would be useful as well. I suppose as long as individual log messages/prints don't get broken up, it's probably good enough.

One thing I did notice is a duplicated message:

INFO:scriptworker.cot.verify:Verifying signing QzPzZhpMRt2oGv4Ummz2lQ as a scriptworker task...
INFO:scriptworker.cot.verify:Verifying signing QzPzZhpMRt2oGv4Ummz2lQ as a scriptworker task...

...which suggests maybe something is getting called more than once? (I don't see it in a production beetmover chain of trust log.)

Eijebong commented Apr 9, 2026

I suppose as long as individual log messages/prints don't get broken up, it's probably good enough.

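Yeah, it's still Python, and single-threaded. The only place where logs might end up out of order is across await points.

e.g.:

import asyncio

async def foo():
    print(1)
    await asyncio.sleep(0)  # any await point
    print(2)

async def main():
    await asyncio.gather(foo(), foo())

asyncio.run(main())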

That would probably show 1 1 2 2 instead of 1 2 1 2.
Looking at the dupe right now.

Eijebong commented Apr 9, 2026

The dupe isn't new at all, see #787

bhearsum commented Apr 9, 2026

The dupe isn't new at all, see #787

Oh, hah - it was actually the log ordering change that made me notice it.

@Eijebong Eijebong merged commit a7a92a4 into mozilla-releng:main Apr 9, 2026
12 checks passed