Speed up CoT verification a bit by parallelizing network operations #786
Conversation
The .taskcluster.yml fetch and the json-e context population (which fetches the pushlog, scm level) are independent. Run them concurrently with `asyncio.gather`. With a mocked implementation for hg/tc that sets every hg request to take 5s and every TC request to take 0.5s, this reduces the time to verify CoT on a signing task from 42s to 27s.
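The idea above can be sketched as follows. This is an illustrative example, not scriptworker's actual API: the function names and return values are assumptions; only the use of `asyncio.gather` to overlap the two independent fetches reflects the change described.

```python
import asyncio

# Hypothetical stand-ins for the two independent network operations.
async def fetch_taskcluster_yml():
    await asyncio.sleep(0.05)  # stands in for fetching .taskcluster.yml
    return "tc_yml"

async def populate_jsone_context():
    await asyncio.sleep(0.05)  # stands in for pushlog + scm level fetches
    return {"scm_level": "3"}

async def main():
    # The two operations don't depend on each other, so gather()
    # overlaps their network I/O instead of running them sequentially.
    tc_yml, context = await asyncio.gather(
        fetch_taskcluster_yml(),
        populate_jsone_context(),
    )
    return tc_yml, context

result = asyncio.run(main())
print(result)
```

With sequential awaits the latencies add up; with `gather` the total is roughly the slower of the two.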
Run all task type and worker impl verification functions concurrently with `asyncio.gather` instead of sequentially. Each verification function only mutates its own link object, never another link's state, so concurrent execution is safe. The log output from different links might interleave now, but given the difference in performance I think that it's a worthy tradeoff. With a mocked implementation for hg/tc that sets every hg request to take 5s and every TC request to take 0.5s, this reduces the time to verify CoT on a signing task from 27s to 11s. With a beetmover task, it goes from 40s to 20s.
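A minimal sketch of this change, with illustrative names (the `Link` class and `verify_link` here are assumptions, not scriptworker's real types): concurrency is safe precisely because each coroutine touches only its own link.

```python
import asyncio

class Link:
    """Toy stand-in for a chain-of-trust link."""
    def __init__(self, name):
        self.name = name
        self.verified = False

async def verify_link(link):
    await asyncio.sleep(0.05)  # stands in for task-type/worker-impl checks
    link.verified = True       # mutates only this link's state

async def verify_chain(links):
    # All per-link verifications run concurrently; total wall time is
    # roughly the slowest single verification, not the sum.
    await asyncio.gather(*(verify_link(link) for link in links))

links = [Link("build"), Link("signing")]
asyncio.run(verify_chain(links))
print(all(link.verified for link in links))
```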
…ies` For each task, fetch its direct dependencies in parallel instead of one at a time. We can't really do much more without knowing about the graph itself, since we don't want duplicates and might have diamond shapes, and we need the task definition (which is what we're trying to get in the first place) to get that graph shape... This changes `build_link` (renamed to `add_link`) so that it only fetches a single task definition and adds it to the chain instead of recursing directly; recursing into children is now done in `build_task_dependencies` instead, essentially transforming the traversal from a DFS to a BFS. With a mocked implementation for hg/tc that sets every hg request to take 5s and every TC request to take 0.5s, this reduces the time to verify CoT on a signing task from 11s to 10s. On a beetmover task, the impact is much more visible since the graph is much deeper: I'm seeing improvements from 20s to 15s.
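The DFS-to-BFS transformation can be sketched like this. The toy graph and `fetch_task_definition` are assumptions for illustration; the real code fetches task definitions from the Taskcluster queue. The point is that each BFS level's fetches go out in one `gather`.

```python
import asyncio

# Toy dependency graph standing in for Taskcluster task definitions.
GRAPH = {
    "signing": ["build"],
    "build": ["decision"],
    "decision": [],
}

async def fetch_task_definition(task_id):
    await asyncio.sleep(0.01)  # stands in for the queue request
    return GRAPH[task_id]

async def build_task_dependencies(root):
    chain = [root]
    frontier = [root]
    while frontier:
        # One gather per BFS level: all of this level's direct
        # dependencies are fetched concurrently.
        deps_per_task = await asyncio.gather(
            *(fetch_task_definition(t) for t in frontier)
        )
        frontier = [dep for deps in deps_per_task for dep in deps]
        chain.extend(frontier)
    return chain

chain = asyncio.run(build_task_dependencies("signing"))
print(chain)
```

A DFS fetches one task at a time down each branch; the BFS version overlaps all fetches at the same depth, which pays off most on deep graphs like beetmover's.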
bhearsum left a comment
> The log output from different links might interleave now, but given the difference in performance I think that it's a worthy tradeoff.
Is it different enough that it's going to make debugging more difficult? If so, we may need to consider following up with something that buffers the output...
Pass the `seen` set downwards in recursion to avoid grabbing the same dependency twice (in case of diamond shaped graphs). In the signing case it's a small change (-0.5s) but on the beetmover case it's a 3.5s win.
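The deduplication described above can be sketched by extending the BFS with a `seen` set. The diamond-shaped toy graph and function names are illustrative assumptions; the effect shown is the real one: a shared dependency is fetched once instead of once per parent.

```python
import asyncio

# Diamond shape: "signing" and "repackage" both depend on "build".
GRAPH = {
    "release": ["signing", "repackage"],
    "signing": ["build"],
    "repackage": ["build"],
    "build": [],
}
fetch_count = 0

async def fetch_task_definition(task_id):
    global fetch_count
    fetch_count += 1
    await asyncio.sleep(0.01)  # stands in for the network request
    return GRAPH[task_id]

async def build_task_dependencies(root):
    seen = {root}
    frontier = [root]
    while frontier:
        deps_per_task = await asyncio.gather(
            *(fetch_task_definition(t) for t in frontier)
        )
        # `seen` is threaded through the traversal, so the diamond's
        # shared "build" node is only requested once.
        frontier = []
        for deps in deps_per_task:
            for dep in deps:
                if dep not in seen:
                    seen.add(dep)
                    frontier.append(dep)
    return seen

asyncio.run(build_task_dependencies("release"))
print(fetch_count)  # 4 fetches, not 5: "build" is fetched once
```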
I don't think so, since this is only about grabbing stuff, not verifying it? But then I've never really had to debug issues with CoT so I can't tell you if the differences are meaningful or not.
Is there an example log you can point me at?
https://gist.github.com/Eijebong/5a738795654c0dadc285153dea3714ca |
I want to scour the logs a bit more, but it doesn't seem too bad (or even too different) at a glance. If it's possible to show a log with a failure, that would be useful as well. I suppose as long as individual log messages/prints don't get broken up, it's probably good enough. One thing I did notice is a duplicated message: ...which suggests maybe something is getting called more than once? (I don't see it in a production beetmover chain of trust log.)
Yeah, it's still Python, and single-threaded. The only place where logs might be out of place is across await points, i.e.: That would probably show
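The point about await points can be demonstrated with a small sketch (coroutine names are illustrative): under single-threaded asyncio, output from different coroutines can only interleave where a coroutine yields control at an `await`; within a stretch of synchronous code, lines stay together.

```python
import asyncio

log = []

async def verify(name):
    log.append(f"{name}: start")
    await asyncio.sleep(0)      # await point: another coroutine may run here
    log.append(f"{name}: done")

async def main():
    await asyncio.gather(verify("build"), verify("signing"))

asyncio.run(main())
print(log)
```

Here "signing: start" can land between "build: start" and "build: done", because `verify("build")` yielded in between; nothing can split a single `log.append` or `logger.info` call.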
The dupe isn't new at all, see #787
Oh, hah - it was actually the log ordering change that made me notice it.
This parallelizes a good chunk of CoT verification related requests towards hg/taskcluster which are the only slow part of the whole process.
Using crimes to monkeypatch the functions that use the network, so I could impose constant load times on services instead of being at the mercy of whether hg wanted to take 2 or 12 seconds to respond at any given moment, I benchmarked CoT verification on both a signing and a beetmover task. I set the hg response time to 5s and all other requests to 0.5s.
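A hedged sketch of that benchmarking setup, with scaled-down latencies and illustrative names (nothing here is scriptworker's real API): replace the network-facing coroutine with a constant-latency fake, then time the concurrent calls.

```python
import asyncio
import time

async def hg_request(url):
    # The real thing would hit hg.mozilla.org with unpredictable latency.
    raise RuntimeError("network disabled in this sketch")

async def fake_hg_request(url):
    await asyncio.sleep(0.05)  # constant stand-in for the 5s hg latency
    return f"<pushlog for {url}>"

# The monkeypatch "crime": swap in the deterministic fake.
hg_request = fake_hg_request

async def main():
    start = time.monotonic()
    await asyncio.gather(hg_request("rev1"), hg_request("rev2"))
    return time.monotonic() - start

elapsed = asyncio.run(main())
# Two concurrent constant-latency calls take ~one latency, not two.
print(elapsed < 0.095)
```

Constant fake latencies make the sequential-vs-parallel comparison reproducible, which is exactly why the before/after numbers in the commits are stable.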
Beetmover
Signing
With constant times, they're both almost 4x faster. In practice I doubt that the numbers will be as good, though manual testing has shown the signing task to take between 25 and 44s before, and I've seen it take 12-13s after.