Conversation

@ramonsmits ramonsmits commented Apr 6, 2021

  • Check for RavenDB index errors at start.
  • Run a custom check every 5 minutes that:
    • Writes a DEBUG log entry with index statistics.
    • Writes WARN/ERROR log entries for indexes where the IndexLag value is above 10,000/100,000.
    • Writes ERROR log entries for any index errors reported (a rough sketch follows below).
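
For illustration, here is a minimal sketch of such a periodic check. The `IndexInfo` shape, the statistics provider, and the logger wiring are assumptions made for this example; the actual PR reads these values from RavenDB's statistics.

```csharp
// Sketch only: IndexInfo, getStatistics, and log are hypothetical stand-ins,
// not the PR's actual types. Thresholds mirror the description above.
using System;
using System.Threading;

public class IndexInfo
{
    public string Name { get; set; }
    public long IndexLag { get; set; }   // documents the index still has to process
    public string Error { get; set; }    // null when the index reports no error
}

public class IndexHealthReporter : IDisposable
{
    readonly Func<IndexInfo[]> getStatistics;
    readonly Action<string, string> log;  // (level, message)
    Timer timer;

    public IndexHealthReporter(Func<IndexInfo[]> getStatistics, Action<string, string> log)
    {
        this.getStatistics = getStatistics;
        this.log = log;
    }

    public void Start() =>
        timer = new Timer(_ => Check(), null, TimeSpan.Zero, TimeSpan.FromMinutes(5));

    void Check()
    {
        foreach (var index in getStatistics())
        {
            log("DEBUG", $"Index '{index.Name}' lag: {index.IndexLag}");

            if (index.IndexLag > 100_000)
            {
                log("ERROR", $"Index '{index.Name}' is lagging by {index.IndexLag} documents.");
            }
            else if (index.IndexLag > 10_000)
            {
                log("WARN", $"Index '{index.Name}' is lagging by {index.IndexLag} documents.");
            }

            if (index.Error != null)
            {
                log("ERROR", $"Index '{index.Name}' reported an error: {index.Error}");
            }
        }
    }

    public void Dispose() => timer?.Dispose();
}
```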

@ramonsmits ramonsmits self-assigned this Apr 6, 2021
Base automatically changed from arg-maintenancemode-sigint to master April 7, 2021 20:24
@SzymonPobiega
Member

I am hesitant about waiting for non-stale indexes at start, or throwing, until we are sure that there are no customers who have stale indexes and a working ServiceControl. I would fear that we are going to blow up a perfectly fine instance of ServiceControl.

@ramonsmits
Member Author

I am hesitant about waiting for non-stale indexes at start, or throwing, until we are sure that there are no customers who have stale indexes and a working ServiceControl. I would fear that we are going to blow up a perfectly fine instance of ServiceControl.

A perfectly fine instance wouldn't have much work to do to get non-stale indexes at start. At most a few seconds if the instance was under load when it gracefully exited.

This guards against the problem that when the SC instance exits ungracefully (kill, crash, etc.) and has index issues, the indexes get rebuilt by RavenDB at start and startup will be delayed until they have caught up.
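
For reference, a startup wait of the kind being discussed could look roughly like the loop below. It assumes the RavenDB 3.x embedded client that ServiceControl 4.x uses; the polling interval and timeout are illustrative, not taken from the PR.

```csharp
// Rough sketch of a "wait for non-stale indexes at start" step, assuming the
// RavenDB 3.x client. Timeout and polling interval are illustrative only.
using System;
using System.Threading;
using Raven.Client;

public static class StartupIndexWait
{
    public static void WaitForNonStaleIndexes(IDocumentStore store, TimeSpan timeout)
    {
        var deadline = DateTime.UtcNow + timeout;

        while (DateTime.UtcNow < deadline)
        {
            var stats = store.DatabaseCommands.GetStatistics();

            if (stats.StaleIndexes.Length == 0)
            {
                return; // all indexes have caught up, continue startup
            }

            Thread.Sleep(TimeSpan.FromSeconds(5));
        }

        throw new TimeoutException("Indexes did not become non-stale within the allotted time.");
    }
}
```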

@tmasternak
Member

tmasternak commented Apr 15, 2021

@ramonsmits one of the arguments for not delaying the start was to protect clients who are already in a situation where the indexes are failing or significantly lagging behind.

That said, these clients will first face the delay when upgrading to the new minor version of ServiceControl. I would assume that this is a moment when they actually have some time set aside to take a closer look at this. Secondly, if we put the bare minimum alerting on the lagging indexes, they should be covered on an ongoing basis.

Finally, I would put in place a settings flag that enables skipping the checks when set, to make sure we can unblock any client that runs into problems in production.
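
Such a flag could be as simple as an app setting read once at startup; the setting name in this sketch is purely hypothetical.

```csharp
// Hypothetical setting name, shown only to illustrate the suggested escape hatch.
using System.Configuration;

static class IndexCheckSettings
{
    public static bool SkipIndexChecks()
    {
        var raw = ConfigurationManager.AppSettings["ServiceControl/SkipRavenIndexChecks"];
        return bool.TryParse(raw, out var skip) && skip;
    }
}
```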

@ramonsmits
Member Author

@tmasternak Then let's only keep the reporter?

@aleksandr-samila
Member

As for me, I like the idea of a flag (or flags):

... Finally, I would put in place a setting flag that would enable skipping the checks when set...

or the other way around; it doesn't matter, as long as we add the possibility.

@tmasternak
Member

Then let's only keep the reporter?

@ramonsmits and remove the startup delay?

@ramonsmits ramonsmits force-pushed the ravendb-health-reporter branch from 7aae330 to f973986 on April 20, 2021 10:46
@ramonsmits
Member Author

@tmasternak @SzymonPobiega I've removed the wait for indexes to become non-stale at start. Still throwing on index errors.
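
A startup check that throws on index errors might look roughly like this, again against the RavenDB 3.x statistics API; the exception type and message are assumptions for the example.

```csharp
// Sketch: abort startup when RavenDB reports index errors. Assumes the
// RavenDB 3.x client; the exception type and wording are illustrative.
using System;
using Raven.Client;

public static class StartupIndexErrorCheck
{
    public static void ThrowIfIndexErrors(IDocumentStore store)
    {
        var stats = store.DatabaseCommands.GetStatistics();

        if (stats.Errors != null && stats.Errors.Length > 0)
        {
            throw new Exception(
                $"RavenDB reported {stats.Errors.Length} index error(s). Check the RavenDB logs/studio for details.");
        }
    }
}
```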


@mikeminutillo mikeminutillo left a comment

Overall, I love this! Great job.

@ramonsmits
Member Author

@mikeminutillo @danielmarbach thanks for your great feedback. I've applied some refactorings. The index error check at start is removed. Only reporting remains. I'll cherry-pick the index error check on startup into a separate PR.

@danielmarbach danielmarbach merged commit e07550e into master Apr 22, 2021
@danielmarbach danielmarbach deleted the ravendb-health-reporter branch April 22, 2021 08:30
@danielmarbach danielmarbach added this to the 4.17.0 milestone Apr 22, 2021
@SzymonPobiega SzymonPobiega changed the title RavenDB index health reporter RavenDB index health monitoring May 6, 2021