Speedup CoT verification a bit by parallelizing network operations by Eijebong · Pull Request #786 · mozilla-releng/scriptworker

Eijebong · 2026-04-08T14:41:01Z

This parallelizes a good chunk of the CoT verification requests to hg/Taskcluster, which are the only slow part of the whole process.

I used crimes to monkeypatch the functions that hit the network so that I could introduce constant response times for services and not be at the mercy of whether HG wanted to take 2 or 12 seconds to respond at any given moment. With that in place, I benchmarked CoT verification on both a signing and a beetmover task, with the hg response time set to 5s and all other requests set to 0.5s.
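The constant-latency setup can be approximated along the following lines. This is only a sketch: `Client`, `fetch`, `make_fake_fetch`, and the hostname check are hypothetical stand-ins, not scriptworker's real functions, and the actual benchmark script is not shown in this PR. The delays are passed in as parameters so the example can run with much smaller values than the 5s / 0.5s used for the benchmark.

```python
import asyncio
import time
from unittest import mock


class Client:
    """Hypothetical stand-in for code that talks to hg or Taskcluster."""

    async def fetch(self, url: str) -> str:
        raise RuntimeError("no real network access during a benchmark")


def make_fake_fetch(hg_delay: float, other_delay: float):
    """Return a replacement for Client.fetch with a fixed delay per service."""

    async def fake_fetch(self, url: str) -> str:
        # Assumption: hg requests are recognised by their hostname.
        delay = hg_delay if "hg.mozilla.org" in url else other_delay
        await asyncio.sleep(delay)
        return f"stub response for {url}"

    return fake_fetch


async def timed_fetch(url: str) -> float:
    """Fetch one URL and return how long the call took, in seconds."""
    start = time.perf_counter()
    await Client().fetch(url)
    return time.perf_counter() - start


def run_with_constant_latency(url: str, hg_delay: float, other_delay: float) -> float:
    fake = make_fake_fetch(hg_delay, other_delay)
    # patch.object restores the original method when the block exits.
    with mock.patch.object(Client, "fetch", fake):
        return asyncio.run(timed_fetch(url))


if __name__ == "__main__":
    # Scaled-down delays; the path in the URL is made up for illustration.
    print(run_with_constant_latency("https://hg.mozilla.org/some/repo", 0.05, 0.005))
```

Because every stubbed request takes a fixed time, differences between runs come from how the requests are scheduled rather than from server variance, which is consistent with the very small σ in the results below.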

Beetmover

Benchmark 1: uv run python tests/bench_cot_verify.py (ref = HEAD~3)
  Time (mean ± σ):     60.294 s ±  0.006 s    [User: 0.978 s, System: 0.241 s]
  Range (min … max):   60.288 s … 60.300 s    3 runs

Benchmark 2: uv run python tests/bench_cot_verify.py (ref = HEAD~2)
  Time (mean ± σ):     40.290 s ±  0.004 s    [User: 0.984 s, System: 0.236 s]
  Range (min … max):   40.286 s … 40.294 s    3 runs

Benchmark 3: uv run python tests/bench_cot_verify.py (ref = HEAD~1)
  Time (mean ± σ):     20.268 s ±  0.004 s    [User: 0.982 s, System: 0.235 s]
  Range (min … max):   20.264 s … 20.271 s    3 runs

Benchmark 4: uv run python tests/bench_cot_verify.py (ref = HEAD)
  Time (mean ± σ):     16.763 s ±  0.009 s    [User: 0.981 s, System: 0.226 s]
  Range (min … max):   16.754 s … 16.773 s    3 runs

Summary
  uv run python tests/bench_cot_verify.py (ref = HEAD) ran
    1.21 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~1)
    2.40 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~2)
    3.60 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~3)

Signing

Benchmark 1: uv run python tests/bench_cot_verify.py (ref = HEAD~3)
  Time (mean ± σ):     42.025 s ±  0.007 s    [User: 0.445 s, System: 0.094 s]
  Range (min … max):   42.020 s … 42.033 s    3 runs

Benchmark 2: uv run python tests/bench_cot_verify.py (ref = HEAD~2)
  Time (mean ± σ):     27.021 s ±  0.005 s    [User: 0.453 s, System: 0.085 s]
  Range (min … max):   27.018 s … 27.027 s    3 runs

Benchmark 3: uv run python tests/bench_cot_verify.py (ref = HEAD~1)
  Time (mean ± σ):     11.998 s ±  0.008 s    [User: 0.441 s, System: 0.097 s]
  Range (min … max):   11.991 s … 12.006 s    3 runs

Benchmark 4: uv run python tests/bench_cot_verify.py (ref = HEAD)
  Time (mean ± σ):     10.999 s ±  0.002 s    [User: 0.446 s, System: 0.093 s]
  Range (min … max):   10.997 s … 11.001 s    3 runs

Summary
  uv run python tests/bench_cot_verify.py (ref = HEAD) ran
    1.09 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~1)
    2.46 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~2)
    3.82 ± 0.00 times faster than uv run python tests/bench_cot_verify.py (ref = HEAD~3)

With constant times, they're both almost 4x faster (3.6x for beetmover, 3.8x for signing). In practice I doubt that the numbers will be as good, although in manual testing the signing task took between 25 and 44s before and I've seen it take 12-13s after.

The .taskcluster.yml fetch and the json-e context population (which fetches the pushlog and the scm level) are independent. Run them concurrently with `asyncio.gather`. With a mocked implementation for hg/tc that sets every hg request to take 5s and every TC request to take 0.5s, this reduces the time to verify CoT on a signing task from 42s to 27s.
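As a rough illustration of why this helps, here is a minimal sketch of two independent awaits run sequentially and then with `asyncio.gather`. The function names, return values, and the scaled-down delay are assumptions for the example, not the code in `verify.py`.

```python
import asyncio
import time

HG_DELAY = 0.05  # stands in for the mocked 5s hg response time, scaled down


async def fetch_taskcluster_yml() -> str:
    # Hypothetical stand-in for the .taskcluster.yml fetch.
    await asyncio.sleep(HG_DELAY)
    return "placeholder yml"


async def populate_jsone_context() -> dict:
    # Hypothetical stand-in for the json-e context population.
    await asyncio.sleep(HG_DELAY)
    return {"placeholder": True}


async def sequential():
    yml = await fetch_taskcluster_yml()
    context = await populate_jsone_context()
    return yml, context


async def concurrent():
    # Both coroutines are started before either one is awaited to completion.
    yml, context = await asyncio.gather(
        fetch_taskcluster_yml(), populate_jsone_context()
    )
    return yml, context


def timed(coro_fn):
    """Run a coroutine function to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


if __name__ == "__main__":
    print("sequential:", timed(sequential)[1])
    print("concurrent:", timed(concurrent)[1])
```

The sequential version waits for roughly the sum of the two delays, while the concurrent one waits for roughly the longer of the two. `asyncio.gather` returns results in the order the awaitables were passed, so the unpacking stays the same.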

Run all task type and worker impl verification functions concurrently with `asyncio.gather` instead of sequentially. Each verification function only mutates its own link object, never another link's state, so concurrent execution is safe. The log output from different links might interleave now, but given the difference in performance I think that it's a worthy tradeoff. With a mocked implementation for hg/tc that sets every hg request to take 5s and every TC request to take 0.5s, this reduces the time to verify CoT on a signing task from 27s to 11s. With a beetmover task, it goes from 40s to 20s.
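A small sketch of the pattern described here, assuming a heavily simplified and hypothetical `Link` class rather than scriptworker's real link objects: each coroutine writes only to the link it was given, and a shared `events` list plays the role of the log to show where interleaving can appear.

```python
import asyncio
from dataclasses import dataclass

events = []  # plays the role of the shared log output


@dataclass
class Link:
    """Hypothetical, heavily simplified stand-in for a chain-of-trust link."""

    task_id: str
    status: str = "pending"


async def verify_link(link: Link) -> None:
    events.append((link.task_id, "start"))
    await asyncio.sleep(0.01)  # stands in for a network request
    # Only this coroutine's own link is mutated.
    link.status = "verified"
    events.append((link.task_id, "done"))


async def verify_all(links) -> None:
    await asyncio.gather(*(verify_link(link) for link in links))


links = [Link("task-a"), Link("task-b"), Link("task-c")]
asyncio.run(verify_all(links))
print([link.status for link in links])
print(events)
```

Every link ends up verified, and the three "start" events are recorded before any "done" event, which is the kind of reordering to expect in the logs.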

…ies` For each task, fetch its direct dependencies in parallel instead of one at a time. We can't really do much more without knowing the shape of the graph, since we don't want duplicates and might have diamond shapes, and we need the task definitions (which is what we're trying to get in the first place) to learn that shape. This changes `build_link` (renamed to `add_link`) so that it only fetches a single task definition and adds it to the chain instead of recursing directly. Recursing into children is now done in `build_task_dependencies`, essentially transforming the traversal from a DFS to a BFS. With a mocked implementation for hg/tc that sets every hg request to take 5s and every TC request to take 0.5s, this reduces the time to verify CoT on a signing task from 11s to 10s. On a beetmover task, the impact is much more visible since the graph is much deeper: I'm seeing improvements from 20s to 15s.
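The traversal idea can be sketched as a level-by-level walk in which each level's task definitions are fetched concurrently and a `seen` set prevents fetching a shared dependency twice. This is a simplification under stated assumptions: `GRAPH`, `fetch_definition`, and `build_chain` are hypothetical names, and the real `build_task_dependencies` recurses per task rather than iterating over levels.

```python
import asyncio

# Hypothetical dependency graph with a diamond: A -> B, C and B, C -> D.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
fetch_log = []  # records every simulated network fetch


async def fetch_definition(task_id: str) -> dict:
    """Simulate fetching one task definition over the network."""
    fetch_log.append(task_id)
    await asyncio.sleep(0.01)
    return {"taskId": task_id, "dependencies": GRAPH[task_id]}


async def build_chain(root: str) -> dict:
    chain = {}
    seen = {root}
    frontier = [root]
    while frontier:
        # Fetch every task of the current level concurrently.
        definitions = await asyncio.gather(
            *(fetch_definition(task_id) for task_id in frontier)
        )
        next_frontier = []
        for definition in definitions:
            chain[definition["taskId"]] = definition
            for dep in definition["dependencies"]:
                # Dependencies are only known once the definition has
                # arrived, so deduplication happens here.
                if dep not in seen:
                    seen.add(dep)
                    next_frontier.append(dep)
        frontier = next_frontier
    return chain


chain = asyncio.run(build_chain("A"))
print(sorted(chain), fetch_log.count("D"))
```

With sequential fetching this graph would cost four round trips back to back. Here it costs three (one per level), and `D` is fetched once even though both `B` and `C` depend on it.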

bhearsum

The log output from different links might interleave now, but given the difference in
performance I think that it's a worthy tradeoff.

Is it different enough that it's going to make debugging more difficult? If so, we may need to consider following up with something that buffers the output...

src/scriptworker/cot/verify.py

Pass the `seen` set downwards in recursion to avoid grabbing the same dependency twice (in case of diamond shaped graphs). In the signing case it's a small change (-0.5s) but on the beetmover case it's a 3.5s win.

Eijebong · 2026-04-08T17:17:46Z

Is it different enough that it's going to make debugging more difficult

I don't think so since this is only about grabbing stuff, not verifying it? But then I've never really had to debug issues with CoT so I can't tell you if the differences are meaningful or not.

bhearsum · 2026-04-08T17:21:52Z

Is it different enough that it's going to make debugging more difficult

I don't think so since this is only about grabbing stuff, not verifying it? But then I've never really had to debug issues with CoT so I can't tell you if the differences are meaningful or not.

Is there an example log you can point me at?

Eijebong · 2026-04-08T17:25:36Z

uv run verify_cot QzPzZhpMRt2oGv4Ummz2lQ --task-type signing --cot-product firefox

https://gist.github.com/Eijebong/5a738795654c0dadc285153dea3714ca

bhearsum · 2026-04-09T01:58:39Z

I want to scour the logs a bit more, but it doesn't seem too bad (or even too different) at a glance. If it's possible to show a log with a failure that would be useful as well. I suppose as long as individual log messages/prints don't get broken up, it's probably good enough.

One thing I did notice is a duplicated message:

INFO:scriptworker.cot.verify:Verifying signing QzPzZhpMRt2oGv4Ummz2lQ as a scriptworker task...
INFO:scriptworker.cot.verify:Verifying signing QzPzZhpMRt2oGv4Ummz2lQ as a scriptworker task...

...which suggests maybe something is getting called more than once? (I don't see it in a production beetmover chain of trust log.)

Eijebong · 2026-04-09T09:25:57Z

I suppose as long as individual log messages/prints don't get broken up, it's probably good enough.

Yeah, it's still Python, and single-threaded. The only place where logs might end up out of order is across await points.

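For example:

import asyncio

async def foo():
    print(1)
    await asyncio.sleep(0)  # any await point lets the other coroutine run
    print(2)

async def main():
    await asyncio.gather(foo(), foo())

asyncio.run(main())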

That would probably show 1 1 2 2 instead of 1 2 1 2.
Looking at the dupe right now.

Eijebong · 2026-04-09T10:58:37Z

The dupe isn't new at all, see #787

bhearsum · 2026-04-09T13:08:50Z

The dupe isn't new at all, see #787

Oh, hah - it was actually the log ordering change that made me notice it.

Eijebong added 2 commits April 8, 2026 11:35

Eijebong requested a review from a team as a code owner April 8, 2026 14:41

Eijebong force-pushed the speedup-cot-verification branch from b0cbc62 to 03a0951 Compare April 8, 2026 14:49

bhearsum reviewed Apr 8, 2026

Parallelize the recursion in build_task_dependencies

2497adc

Pass the `seen` set downwards in recursion to avoid grabbing the same dependency twice (in case of diamond shaped graphs). In the signing case it's a small change (-0.5s) but on the beetmover case it's a 3.5s win.

bhearsum approved these changes Apr 9, 2026

Eijebong merged commit a7a92a4 into mozilla-releng:main Apr 9, 2026
12 checks passed

Eijebong merged 4 commits into mozilla-releng:main from Eijebong:speedup-cot-verification
