feat: Optimise sortBreadcrumbs()#183
Merged
Merged
Conversation
jessicamcinchak
approved these changes
Nov 1, 2023
Member
jessicamcinchak
left a comment
There was a problem hiding this comment.
Thanks for really detailed write up, local profiling looks really promising here 🤞
I don't have strong feelings about introducing this behind a query param to do comparisons, I'll leave that up to you. As I see it, we definitley want/need these changes, even if the magnitude of improved performance doesn't turn out to be quite as high on AWS as locally 👍
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What's the problem?
Recent submissions have been causing the API to crash, due to it running out of CPU. We've bumped this a few times but are still seeing huge spikes on submission.
Profiling
computeBOPSParams()which the the main function which reads a session and flow and converts to various formats points towardssearchNodeEdges()as being the leading cause.What's causing the high CPU usage?
searchNodeEdges()is currently doing a whole lot -breadcrumbIdsanswerDataorderedBreadcrumbOn the mock flow in tests, this is happening 420,000+ times 👀
What approach was taken?
I did a fair bit of reading up on ways we could improve this, and this is very much a first pass for now. This StackOverflow answer is a fair summary of the approach -
breadcumbIds)What I've aimed to do is simplify the flow traversal to avoid as many of the above as possible. We can then iterate over the
visitedarray and generate other data structures much more easily on a scale several orders of magnitude smaller.What are the outcomes?
Seems pretty good! Running locally it's approx 5 times faster so it's certainly an improvement. Quite what this converts to on the AWS infrastructure remains to be seen - see comments on testing below.
How can we test this?
A few ideas here - testing locally is fine and there's "real" mock data being used in tests, but we won't have a decent degree of certainty until we reach the staging environments. I'd suggest that post-review we try this on staging with real flows and customer data, then check Fargate CPU usage vs production on the
/admin/session/:sessionId/bopsendpoints.We could also take a slightly more cautious approach and add a query param to trigger the old vs new functions for a more like for like comparison if this seems wise? Open to suggestions here!
Next steps...
Let's see the outcomes of this, and hopefully reduce the container CPU and memory as a result 🤞
I've also been working on a "filtered" view of a flow which is just an intersection of a flow with a user's breadcrumbs (and answers). Initial testing shows (for the mock data) this would bring the total number of nodes in a flow from 176,000 → 280 which would be another significant change.
Some of the lessons learned here can certainly applied to other flow traversal methods. I've been reading up a little on OpenTelemetry as a more "generic" option to help us identify issues sooner, but honestly an off the shelf solution like Sentry might be a better use of time (and also a replacement for Airbrake?). Something for a spike!
Update: turns out Airbrake has some of this functionality as well, we're just not using it. I'll make a ticket to investigate this 👌