Skip to content

feat: Optimise sortBreadcrumbs()#183

Merged
DafyddLlyr merged 2 commits into
mainfrom
dp/sortBreadcrumbs-profiling
Nov 1, 2023
Merged

feat: Optimise sortBreadcrumbs()#183
DafyddLlyr merged 2 commits into
mainfrom
dp/sortBreadcrumbs-profiling

Conversation

@DafyddLlyr
Copy link
Copy Markdown
Contributor

@DafyddLlyr DafyddLlyr commented Oct 30, 2023

What's the problem?

Recent submissions have been causing the API to crash, due to it running out of CPU. We've bumped this a few times but are still seeing huge spikes on submission.

Profiling computeBOPSParams() which the the main function which reads a session and flow and converts to various formats points towards searchNodeEdges() as being the leading cause.

What's causing the high CPU usage?

searchNodeEdges() is currently doing a whole lot -

  • Recursively navigating through the entire flow
  • Iterating through breadcrumbIds
  • Reducing and generating answerData
  • Generating an orderedBreadcrumb

On the mock flow in tests, this is happening 420,000+ times 👀

What approach was taken?

I did a fair bit of reading up on ways we could improve this, and this is very much a first pass for now. This StackOverflow answer is a fair summary of the approach -

  • Don't compare strings (UUIDs)
  • Avoid extra iteration (looping over breadcumbIds)
  • Reduce scope of variables
  • Don't define new variables which can't be garbage collected until the "main" loop traversing the entire flow has been completed

What I've aimed to do is simplify the flow traversal to avoid as many of the above as possible. We can then iterate over the visited array and generate other data structures much more easily on a scale several orders of magnitude smaller.

What are the outcomes?

Seems pretty good! Running locally it's approx 5 times faster so it's certainly an improvement. Quite what this converts to on the AWS infrastructure remains to be seen - see comments on testing below.

Before After
image image

How can we test this?

A few ideas here - testing locally is fine and there's "real" mock data being used in tests, but we won't have a decent degree of certainty until we reach the staging environments. I'd suggest that post-review we try this on staging with real flows and customer data, then check Fargate CPU usage vs production on the /admin/session/:sessionId/bops endpoints.

We could also take a slightly more cautious approach and add a query param to trigger the old vs new functions for a more like for like comparison if this seems wise? Open to suggestions here!

Next steps...

Let's see the outcomes of this, and hopefully reduce the container CPU and memory as a result 🤞

I've also been working on a "filtered" view of a flow which is just an intersection of a flow with a user's breadcrumbs (and answers). Initial testing shows (for the mock data) this would bring the total number of nodes in a flow from 176,000 → 280 which would be another significant change.

Some of the lessons learned here can certainly applied to other flow traversal methods. I've been reading up a little on OpenTelemetry as a more "generic" option to help us identify issues sooner, but honestly an off the shelf solution like Sentry might be a better use of time (and also a replacement for Airbrake?). Something for a spike!

Update: turns out Airbrake has some of this functionality as well, we're just not using it. I'll make a ticket to investigate this 👌

@DafyddLlyr DafyddLlyr requested a review from a team October 31, 2023 11:31
Copy link
Copy Markdown
Member

@jessicamcinchak jessicamcinchak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for really detailed write up, local profiling looks really promising here 🤞

I don't have strong feelings about introducing this behind a query param to do comparisons, I'll leave that up to you. As I see it, we definitley want/need these changes, even if the magnitude of improved performance doesn't turn out to be quite as high on AWS as locally 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants