Description
DESIGN DOC ➡️
Proposal
Link to Slack here.
Deploy Sentry to help us identify and resolve user issues proactively, reducing time-to-resolution for user issues as well as fixing bugs that never got reported.
Background
We use a mix of tools and techniques for monitoring and investigating issues users encounter across NewDot environments in a reactive manner. This includes:
- Troubleshoot bundles (onyx, logs, profiling trace) sent to expert engineers for analysis
- Manual key=value timing metrics tracking on core actions with StatsD and Grafana
- Fullstory sessions that give us visual context on what happened during a user’s session
- Firebase Crashlytics for errors.
Here is a bit more detail about the tools and how they are used today:
- Troubleshooting bundles require someone to know about this tool, then be able to turn it on/off, capturing the problem. We ideally never want to ask our users to do this.
- Manual metrics lack the context needed to turn them into actionable items.
- Fullstory is focused on UX testing and studying the App from a product perspective, rather than from performance and engineering perspectives.
- Firebase Crashlytics focuses mainly on crash and error monitoring. These are its limitations for performance monitoring:
- It only allows a 0.1 sampling rate, which makes it unsuitable for our workflows, which require detailed insights into every session.
- Poor analytics, with no real-time data available and exports taking 48h.
- Very limited tagging (5 per event). Telemetry is based on tags, as they provide crucial context for every recorded event.
- Its web support is only in beta, whereas we ideally want full, stable, cross-platform support.
Problem
When users encounter issues that occur only in their unique live sessions, and these issues either go unreported or cannot be reliably reproduced by engineers locally or in Supportal, the bugs don’t get fixed, leading to frustration, lower conversion, higher churn, and a decrease in customer lifetime value.
Solution
In New Expensify, instead of our manual metrics collection and Firebase Crashlytics, use Sentry for performance monitoring, remote debugging, and error tracking.
This tool will provide comprehensive traces and detailed logs captured automatically during live user sessions. This will enable us to diagnose the root causes of user issues, including performance problems (app being slow/stuck), UI failures, and navigation errors, without the need to reproduce them manually. We’ll also be able to link them to FullStory sessions for a replay.
Specifically, we propose to:
- Deploy Sentry across New Expensify staging and production environments to automatically capture detailed information about each session (API calls, user interactions, navigation events, heavy rendering tasks, and more) in a single-time context.
- See the thread for specific examples of how the tool can help us fix issues we cannot fix right now
- Use Sentry’s Error Monitoring to correlate crashes with the exact conditions and user actions that preceded them (see what actions the user took before the crash happened, which should help us find reproduction steps)
- Use Sentry to create a proactive system of alerts, delivered via webhooks, for metrics (performance) or events (like users hitting the not-found page), so that we can investigate critical UX issues or noticeable performance degradation without users having to report them. More details will be defined in the predesign (below).
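To make the crash-correlation idea in the list above more concrete, here is a minimal sketch of a breadcrumb buffer: it keeps the last few user actions and attaches them to an error report. This is not Sentry's API. The `BreadcrumbBuffer` class, the `Breadcrumb` and `ErrorReport` types, and the sample action names are all hypothetical and exist only to illustrate the concept.

```typescript
// Hypothetical types for illustration; these are not Sentry's types.
type Breadcrumb = {
    category: 'api' | 'navigation' | 'ui';
    message: string;
    timestamp: number;
};

type ErrorReport = {
    error: string;
    breadcrumbs: Breadcrumb[];
};

// Keeps only the most recent `capacity` actions in memory.
class BreadcrumbBuffer {
    private readonly capacity: number;

    private readonly items: Breadcrumb[] = [];

    constructor(capacity: number) {
        this.capacity = capacity;
    }

    add(crumb: Breadcrumb): void {
        this.items.push(crumb);
        // Drop the oldest entry once the buffer exceeds its capacity.
        if (this.items.length > this.capacity) {
            this.items.shift();
        }
    }

    // Build a report that carries the recent actions, oldest first.
    report(error: Error): ErrorReport {
        return {error: error.message, breadcrumbs: [...this.items]};
    }
}

// Sample session with made-up actions.
const buffer = new BreadcrumbBuffer(3);
buffer.add({category: 'navigation', message: 'Open LHN', timestamp: 1});
buffer.add({category: 'ui', message: 'Tap report', timestamp: 2});
buffer.add({category: 'api', message: 'Fetch report', timestamp: 3});
buffer.add({category: 'ui', message: 'Render report screen', timestamp: 4});

const crashReport = buffer.report(new Error('Render failed'));
console.log(crashReport.breadcrumbs.map((crumb) => crumb.message));
```

With a capacity of 3, the oldest action ("Open LHN") is dropped, so the report contains the three actions closest to the failure. Those are the ones most likely to suggest reproduction steps.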
Before we continue, we will:
- Provide an analysis of the alternatives to Sentry in the form of a Slack meeting to clarify why this tool was picked over other solutions in the area. From our explorations, Sentry came out as the best option that provides everything we need; however, nothing locks us into using Sentry if strong enough arguments are made for alternatives.
- Host a predesign to define the metrics that we will want to use in our Weekly Performance monitoring. Discuss a process to improve the metrics and act upon them when they spike over a predefined threshold. Examples of such metrics include app startup time, switching between bottom tabs, or the time it takes to open a report from LHN.
- Host a predesign to determine how a weekly chore (similar to #stability) will ensure that we maintain a high-quality standard
- Combine the Slack discussions into a design doc and follow the full doc process
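As a rough illustration of acting on metrics that "spike over a predefined threshold", the sketch below computes a percentile for each metric and lists the ones that exceed their limit. Everything here is an assumption made for the example: the metric names, the sample durations, the thresholds, and the choice of the 90th percentile. The real metrics and thresholds are to be defined in the predesign.

```typescript
// Hypothetical shape for a week of timing samples for one metric.
type MetricSamples = {
    name: string;
    durationsMs: number[];
    thresholdMs: number;
};

// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(values: number[], p: number): number {
    if (values.length === 0) {
        throw new Error('percentile requires at least one sample');
    }
    const sorted = [...values].sort((a, b) => a - b);
    const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
    return sorted[Math.min(rank, sorted.length) - 1] as number;
}

// Return the names of metrics whose percentile exceeds the threshold.
function findBreaches(metrics: MetricSamples[], p = 90): string[] {
    return metrics
        .filter((metric) => percentile(metric.durationsMs, p) > metric.thresholdMs)
        .map((metric) => metric.name);
}

// Made-up sample data; thresholds are placeholders, not agreed values.
const weekly: MetricSamples[] = [
    {name: 'app_startup', durationsMs: [900, 1100, 1200, 1500, 4000], thresholdMs: 2000},
    {name: 'open_report_from_lhn', durationsMs: [200, 250, 300, 320, 400], thresholdMs: 500},
];

const breaches = findBreaches(weekly);
console.log(breaches);
```

A percentile is used here rather than an average so that slow outlier sessions can still trigger a breach. Whether that is the right aggregation is one of the questions the predesign would need to settle.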
Tasks
- Post Proposal (full Problem/Solution statement) in #whatsnext
- Wait at least one full business day, and until the post has a majority (2/3) of positive reactions (👍)
- Paste Proposal in the space above with a link to the Slack thread
- Email strategy@expensify.com and paste in the Proposal
- Host a pre-design meeting (example) in an appropriate slack channel to discuss any necessary details in public before filling out the High-level of proposed solution section.
- Fill out the High-level overview of the problem, Timeline, Terminology, and High-level of proposed solution sections of the Design Doc
- Email strategy@expensify.com (continue the same email chain as before - your last message should be the WN Proposal) with the link to your Design Doc containing your high-level problem and solution
- Add the DesignDocReview label to get the High-level overview of the problem and High-level of proposed solution section reviewed
- Respond to any questions or concerns and bring up blockers in Slack to get a consensus if necessary
- Confirm that the doc has the minimum necessary number of reviews before proceeding
- Host another pre-design meeting in the appropriate slack channel to ask for engineering feedback on the technical solution.
- Fill out the Detailed implementation of the solution and related sections.
- Re-add the DesignDocReview label to this issue
- Respond to any questions or concerns and bring up blockers in Slack to get consensus if necessary
- Confirm that the doc has the minimum necessary number of reviews before proceeding
- Email strategy@expensify.com one last time to let them know the Design Doc is moving into the implementation phase
- Implement the changes
- Add regression tests so that QA can test your feature with every deploy (instructions)
- Confirm any related help site updates have been published
- Send out a follow up email to strategy@expensify.com once everything has been implemented and do a Project Wrap-Up retrospective that provides:
  - Summary of what we accomplished with this project
  - What went well?
  - What could we have done better?
  - What did we learn?
Issue Owner
Current Issue Owner: @rlinoz