Replace weak signal device error notifications with daily fail ratio emails#1444
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughDaily per-device failure notifications were removed in favor of a new daily aggregation pipeline: a DB cron aggregates install/fail totals, enqueues per-app Changes
Sequence Diagram(s)sequenceDiagram
participant Device as Device/Runtime
participant Stats as Stats Plugin
participant DB as Postgres (daily_version, apps)
participant Proc as process_daily_fail_ratio_email()
participant Queue as Email Queue (pgmq)
participant CronHandler as cron_email HTTP handler
participant User as Org Member
Note over Device,Stats: Immediate per-device notification removed
Device->>Stats: record 'fail' stat
Note over DB,Proc: Daily aggregation flow
CronHandler->>DB: (scheduled) trigger process_daily_fail_ratio_email at 08:00 UTC
DB->>Proc: aggregate installs & fails per app (yesterday)
Proc->>Proc: filter by min installs & fail% threshold
Proc->>Queue: enqueue `daily_fail_ratio` jobs (payload: appId, orgId, totals, failPercentage, reportDate)
Queue->>CronHandler: POST job payload to `cron_email` endpoint
CronHandler->>CronHandler: validate appId & check `daily_fail_ratio` preference
CronHandler->>User: deliver email (if pref enabled) / track `app:daily_fail_ratio`
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8de866743a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| -- Every minute (at :00 seconds): Per-minute tasks | ||
| IF current_second = 0 THEN | ||
| BEGIN |
There was a problem hiding this comment.
Avoid exact-second gating in cron runner
The new process_all_cron_tasks only triggers the per-minute/hour/daily branches when current_second = 0 (see the IF current_second = 0 THEN guard). In this repo the cron runner is invoked every 10 seconds and prior migration 20260103030451_add_advisory_lock_to_cron.sql explicitly warns that pg_cron intervals are not clock‑aligned, so if the job starts at an offset (e.g., :05) this condition never becomes true and all scheduled tasks (including the new daily fail ratio email) will never run. Please restore the offset-safe check (e.g., current_second < 10) or the table-driven scheduler that handled this.
Useful? React with 👍 / 👎.
| -- Every 10 seconds: High-frequency queues (at :00, :10, :20, :30, :40, :50) | ||
| IF current_second % 10 = 0 THEN | ||
| -- Process high-frequency queues with default batch size (950) | ||
| BEGIN | ||
| PERFORM public.process_function_queue(ARRAY['on_channel_update', 'on_user_create', 'on_user_update', 'on_version_create', 'on_version_delete', 'on_version_update', 'on_app_delete', 'on_organization_create', 'on_user_delete', 'on_app_create', 'credit_usage_alerts']); | ||
| EXCEPTION WHEN OTHERS THEN |
There was a problem hiding this comment.
Keep cron_tasks table execution
This migration replaces the table-driven cron runner with a hard-coded list (process_function_queue(ARRAY['on_channel_update', ...])), so anything registered in cron_tasks will no longer execute. I checked supabase/migrations/20260112140000_cleanup_old_channel_devices.sql and 20260113160650_delete_old_deleted_versions.sql, both of which register maintenance jobs in cron_tasks with the comment that this table is the canonical scheduler. With this change those jobs stop running entirely, so stale data cleanup will silently stall. Please keep the cron_tasks loop or explicitly add those tasks to the hard-coded schedule.
Useful? React with 👍 / 👎.
580ae30 to
2f672dc
Compare
2f672dc to
f93b306
Compare
There was a problem hiding this comment.
Pull request overview
This PR replaces the per-device weak signal notification system with a smarter daily aggregate fail ratio monitoring approach. Instead of sending one email per device failure followed by a week-long silence, the system now analyzes daily aggregated statistics from the daily_version table and sends one email per app per day when the install failure rate exceeds 30% (with a minimum of 10 installs to reduce noise).
Changes:
- Added SQL migration with cron function to calculate daily fail ratios and queue emails for high-failure apps
- Added
daily_fail_ratioemail preference type to allow users to opt in/out - Implemented email handler for daily fail ratio notifications in cron_email endpoint
- Removed old per-device failure notification code from stats endpoint
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| supabase/migrations/20260115025158_add_daily_fail_ratio_email.sql | Creates SQL function to process daily fail ratios and registers cron task to run at 08:00 UTC daily |
| supabase/functions/_backend/utils/org_email_notifications.ts | Adds daily_fail_ratio to email preference types |
| supabase/functions/_backend/triggers/cron_email.ts | Implements handler for daily_fail_ratio email type with preference checking and event tracking |
| supabase/functions/_backend/plugins/stats.ts | Removes old per-device failure notification trigger and unused import |
| PERFORM pgmq.send('cron_email', | ||
| jsonb_build_object( | ||
| 'function_name', 'cron_email', | ||
| 'function_type', 'cloudflare', | ||
| 'payload', jsonb_build_object( | ||
| 'email', record.management_email, | ||
| 'appId', record.app_id, | ||
| 'orgId', record.owner_org, | ||
| 'type', 'daily_fail_ratio', | ||
| 'appName', record.app_name, | ||
| 'totalInstalls', record.total_installs, | ||
| 'totalFails', record.total_fails, | ||
| 'failPercentage', record.fail_percentage, | ||
| 'reportDate', (CURRENT_DATE - INTERVAL '1 day')::text | ||
| ) | ||
| ) | ||
| ); |
There was a problem hiding this comment.
The SQL function process_daily_fail_ratio_email() does not handle errors that might occur during message queuing. If pgmq.send() fails for any reason, the function will continue processing other records without logging or reporting the failure. Consider adding error handling and logging within the loop, or wrapping the operation in a BEGIN...EXCEPTION block to ensure failures are captured and reported appropriately.
| PERFORM pgmq.send('cron_email', | |
| jsonb_build_object( | |
| 'function_name', 'cron_email', | |
| 'function_type', 'cloudflare', | |
| 'payload', jsonb_build_object( | |
| 'email', record.management_email, | |
| 'appId', record.app_id, | |
| 'orgId', record.owner_org, | |
| 'type', 'daily_fail_ratio', | |
| 'appName', record.app_name, | |
| 'totalInstalls', record.total_installs, | |
| 'totalFails', record.total_fails, | |
| 'failPercentage', record.fail_percentage, | |
| 'reportDate', (CURRENT_DATE - INTERVAL '1 day')::text | |
| ) | |
| ) | |
| ); | |
| BEGIN | |
| PERFORM pgmq.send('cron_email', | |
| jsonb_build_object( | |
| 'function_name', 'cron_email', | |
| 'function_type', 'cloudflare', | |
| 'payload', jsonb_build_object( | |
| 'email', record.management_email, | |
| 'appId', record.app_id, | |
| 'orgId', record.owner_org, | |
| 'type', 'daily_fail_ratio', | |
| 'appName', record.app_name, | |
| 'totalInstalls', record.total_installs, | |
| 'totalFails', record.total_fails, | |
| 'failPercentage', record.fail_percentage, | |
| 'reportDate', (CURRENT_DATE - INTERVAL '1 day')::text | |
| ) | |
| ) | |
| ); | |
| EXCEPTION | |
| WHEN OTHERS THEN | |
| RAISE WARNING 'process_daily_fail_ratio_email: failed to queue email for app_id %, org_id %, email %: % (%)', | |
| record.app_id, | |
| record.owner_org, | |
| record.management_email, | |
| SQLERRM, | |
| SQLSTATE; | |
| END; |
| total_fails: (totalFails ?? 0).toString(), | ||
| fail_percentage: (failPercentage ?? 0).toString(), | ||
| report_date: reportDate ?? '', | ||
| success_percentage: (100 - (failPercentage ?? 0)).toFixed(2), |
There was a problem hiding this comment.
The success_percentage calculation at line 638 could produce incorrect results if failPercentage is undefined. When failPercentage is undefined, it defaults to 0, making success_percentage = 100.00, which may not accurately represent the actual state. Consider validating that failPercentage is provided before calculating success_percentage, or explicitly handle the undefined case to avoid misleading metrics in the email.
| const metadata = { | ||
| app_id: appId, | ||
| org_id: orgId ?? '', | ||
| app_name: appName ?? '', | ||
| total_installs: (totalInstalls ?? 0).toString(), | ||
| total_fails: (totalFails ?? 0).toString(), | ||
| fail_percentage: (failPercentage ?? 0).toString(), | ||
| report_date: reportDate ?? '', | ||
| success_percentage: (100 - (failPercentage ?? 0)).toFixed(2), |
There was a problem hiding this comment.
The metadata object converts numeric values to strings but doesn't validate that the values are actually numbers. If totalInstalls, totalFails, or failPercentage are undefined or invalid, calling toString() on them could produce misleading values like "0" or "undefined". Consider adding validation to ensure these values are present and valid before converting them to strings for the event tracking metadata.
| const metadata = { | |
| app_id: appId, | |
| org_id: orgId ?? '', | |
| app_name: appName ?? '', | |
| total_installs: (totalInstalls ?? 0).toString(), | |
| total_fails: (totalFails ?? 0).toString(), | |
| fail_percentage: (failPercentage ?? 0).toString(), | |
| report_date: reportDate ?? '', | |
| success_percentage: (100 - (failPercentage ?? 0)).toFixed(2), | |
| const safeTotalInstalls = | |
| typeof totalInstalls === 'number' && Number.isFinite(totalInstalls) ? totalInstalls : undefined | |
| const safeTotalFails = | |
| typeof totalFails === 'number' && Number.isFinite(totalFails) ? totalFails : undefined | |
| const safeFailPercentage = | |
| typeof failPercentage === 'number' && Number.isFinite(failPercentage) ? failPercentage : undefined | |
| const successPercentage = | |
| safeFailPercentage !== undefined ? (100 - safeFailPercentage).toFixed(2) : undefined | |
| const metadata = { | |
| app_id: appId, | |
| org_id: orgId ?? '', | |
| app_name: appName ?? '', | |
| total_installs: safeTotalInstalls?.toString() ?? '', | |
| total_fails: safeTotalFails?.toString() ?? '', | |
| fail_percentage: safeFailPercentage?.toString() ?? '', | |
| report_date: reportDate ?? '', | |
| success_percentage: successPercentage ?? '', |
| run_at_hour = EXCLUDED.run_at_hour, | ||
| run_at_minute = EXCLUDED.run_at_minute, | ||
| run_at_second = EXCLUDED.run_at_second, | ||
| updated_at = NOW(); |
There was a problem hiding this comment.
The migration does not update the default email_preferences values in the users and orgs tables to include the new daily_fail_ratio preference. Previous migrations (e.g., 20251231060433_add_billing_period_stats_email.sql) set defaults with all preference keys and updated the COMMENT on these columns. This migration should:
- Add the new preference to existing users/orgs:
UPDATE public.users SET email_preferences = email_preferences || '{"daily_fail_ratio": true}'::jsonb WHERE email_preferences IS NOT NULL AND NOT (email_preferences ? 'daily_fail_ratio'); - Update the default value on the columns to include
"daily_fail_ratio": true - Update the COMMENT on email_preferences columns to document the new key
Without these updates, new users/orgs won't have the preference initialized, and the column documentation will be incomplete.
| updated_at = NOW(); | |
| updated_at = NOW(); | |
| -- Backfill daily_fail_ratio preference for existing users | |
| UPDATE public.users | |
| SET email_preferences = email_preferences || '{"daily_fail_ratio": true}'::jsonb | |
| WHERE email_preferences IS NOT NULL | |
| AND NOT (email_preferences ? 'daily_fail_ratio'); | |
| -- Backfill daily_fail_ratio preference for existing orgs | |
| UPDATE public.orgs | |
| SET email_preferences = email_preferences || '{"daily_fail_ratio": true}'::jsonb | |
| WHERE email_preferences IS NOT NULL | |
| AND NOT (email_preferences ? 'daily_fail_ratio'); | |
| -- Ensure new users have daily_fail_ratio in their default email preferences | |
| ALTER TABLE public.users | |
| ALTER COLUMN email_preferences SET DEFAULT | |
| ('{}'::jsonb || '{"daily_fail_ratio": true}'::jsonb); | |
| -- Ensure new orgs have daily_fail_ratio in their default email preferences | |
| ALTER TABLE public.orgs | |
| ALTER COLUMN email_preferences SET DEFAULT | |
| ('{}'::jsonb || '{"daily_fail_ratio": true}'::jsonb); | |
| -- Update documentation for email_preferences columns to include daily_fail_ratio | |
| COMMENT ON COLUMN public.users.email_preferences IS | |
| 'Email notification preferences as JSONB. Known keys include: "daily_fail_ratio".'; | |
| COMMENT ON COLUMN public.orgs.email_preferences IS | |
| 'Email notification preferences as JSONB. Known keys include: "daily_fail_ratio".'; |
| async function handleDailyFailRatio( | ||
| c: Context, | ||
| payload: { | ||
| email: string | ||
| appId?: string | ||
| orgId?: string | ||
| appName?: string | ||
| totalInstalls?: number | ||
| totalFails?: number | ||
| failPercentage?: number | ||
| reportDate?: string | ||
| }, | ||
| ) { | ||
| const { | ||
| email, | ||
| appId, | ||
| orgId, | ||
| appName, | ||
| totalInstalls, | ||
| totalFails, | ||
| failPercentage, | ||
| reportDate, | ||
| } = payload | ||
|
|
||
| if (!appId) { | ||
| throw simpleError('missing_app_id', 'Missing appId for daily_fail_ratio', { email }) | ||
| } | ||
|
|
||
| // Check if user has daily_fail_ratio preference enabled | ||
| const isEnabled = await isEmailPreferenceEnabled(c, email, 'daily_fail_ratio') | ||
| if (!isEnabled) { | ||
| cloudlog({ requestId: c.get('requestId'), message: 'Daily fail ratio email disabled for user', email, appId }) | ||
| return c.json({ status: 'Email preference disabled' }, 200) | ||
| } | ||
|
|
||
| const metadata = { | ||
| app_id: appId, | ||
| org_id: orgId ?? '', | ||
| app_name: appName ?? '', | ||
| total_installs: (totalInstalls ?? 0).toString(), | ||
| total_fails: (totalFails ?? 0).toString(), | ||
| fail_percentage: (failPercentage ?? 0).toString(), | ||
| report_date: reportDate ?? '', | ||
| success_percentage: (100 - (failPercentage ?? 0)).toFixed(2), | ||
| } | ||
|
|
||
| await trackBentoEvent(c, email, metadata, 'app:daily_fail_ratio') | ||
|
|
||
| cloudlog({ | ||
| requestId: c.get('requestId'), | ||
| message: 'Daily fail ratio email sent', | ||
| email, | ||
| appId, | ||
| failPercentage, | ||
| }) | ||
|
|
||
| return c.json(BRES) | ||
| } |
There was a problem hiding this comment.
The handleDailyFailRatio function lacks test coverage. There are existing test suites for other email types in the cron_email endpoint (e.g., weekly_stats, monthly_stats, deploy_stats_24h in email-preferences.test.ts), but no tests for the new daily_fail_ratio type. Consider adding tests to verify:
- Email is sent when daily_fail_ratio preference is enabled
- Email is skipped when daily_fail_ratio preference is disabled
- Required parameters (appId, totalInstalls, totalFails, failPercentage) are handled correctly
- Event tracking is working properly
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
This is the final PR Bugbot will review for you during this billing cycle
Your free Bugbot reviews will reset on February 17
Details
You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.
To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.
f93b306 to
26e2cfa
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| 'function_name', 'cron_email', | ||
| 'function_type', 'cloudflare', | ||
| 'payload', jsonb_build_object( | ||
| 'email', record.management_email, |
There was a problem hiding this comment.
Admin users no longer receive failure notifications
Medium Severity
The SQL function only sends emails to record.management_email, but the previous implementation using sendNotifToOrgMembers would send to all admin/super_admin members who had the preference enabled, plus the management_email. Organizations with multiple admins will now have reduced visibility into failure events since only the management email receives notifications, not the admin team.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| if (!isEnabled) { | ||
| cloudlog({ requestId: c.get('requestId'), message: 'Daily fail ratio email disabled for user', email, appId }) | ||
| return c.json({ status: 'Email preference disabled' }, 200) | ||
| } |
There was a problem hiding this comment.
Org email preferences not checked for daily fail ratio
Medium Severity
The handleDailyFailRatio function only checks user email preferences via isEmailPreferenceEnabled, but the email is sent to the org's management_email. The org's own email_preferences.daily_fail_ratio setting is never checked, even though the migration explicitly adds this preference to the orgs table. If an org sets daily_fail_ratio: false, they'll still receive emails because only the users table is queried. The old system via sendNotifToOrgMembers checked org preferences using isOrgPreferenceEnabled, but the new implementation lacks this check.
Additional Locations (1)
…ratio emails - Removed per-device failure notifications that sent once then waited a week - Added daily cron job (08:00 UTC) to check install fail ratios by app - Sends email only when fail rate exceeds 30% with minimum 10 installs - Added daily_fail_ratio email preference for user control - One email per app per day maximum, based on aggregate daily stats Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Addresses PR review comments: - P1: No longer modifies process_all_cron_tasks, so offset-safe checks preserved - P2: Registers task in cron_tasks table (canonical scheduler), not hardcoded Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Move daily_fail_ratio handler before user existence check (fixes non-user emails) - Clamp fail_percentage to 0-100 range in both SQL and TypeScript - Add error handling with RAISE WARNING for pgmq.send() failures - Backfill daily_fail_ratio preference for existing users/orgs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
400ecfe to
72c3373
Compare
|



Summary (AI generated)
Replaced the weak per-device failure notification system with a smarter daily aggregate fail ratio check. Now sends one email per app per day when install failures exceed 30%, instead of one email per device followed by a week-long silence.
Motivation (AI generated)
The old system sent one email on first device failure, then ignored all subsequent failures for a week—a poor signal. The new system uses daily aggregate statistics from the
daily_versiontable to calculate real fail ratios and alert only when the situation is actually critical (30%+ failure rate).Business Impact (AI generated)
Improves app reliability monitoring by catching genuine issues faster. App owners now get daily insights into failure trends instead of sporadic one-off alerts. Minimum 10 installs prevents noise from small-volume apps.
Test Plan (AI generated)
Generated with AI
Summary by CodeRabbit
New Features
Changes
✏️ Tip: You can customize this high-level summary in your review settings.
Note
Introduces daily aggregate failure alerts and removes per-device failure emails to reduce noise and reflect real app health.
process_daily_fail_ratio_emailSQL function and schedules it daily (08:00 UTC) to enqueuedaily_fail_ratioemails usingdaily_versionstats; clamps percentages and includes installs/fails/datecron_emailto handledaily_fail_ratiowith email preference checks and safe metadata; skips user existence requirement for management emailsdaily_fail_ratio(users/orgs), plus user tag mapping for opt-outs; backfills default enabled in migrationdevice_errornotification trigger fromplugins/stats.ts(fail events now only record stats)Written by Cursor Bugbot for commit 72c3373. This will update automatically on new commits. Configure here.