Conversation

@ramonsmits ramonsmits commented Apr 6, 2021

  • Check for RavenDB index errors at start.
  • Run a custom check every 5 minutes that:
    • Writes a DEBUG log entry with index statistics.
    • Writes WARN/ERROR log entries for indexes where the IndexLag value is above 10,000/100,000.
    • Writes ERROR log entries for any index errors reported (a rough sketch follows below).
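
For illustration, here is a minimal sketch of such a periodic check. The `IndexInfo` shape, the statistics provider, and the logger wiring are assumptions made for this example; the actual PR reads these values from RavenDB's statistics.

```csharp
// Sketch only: IndexInfo, getStatistics, and log are hypothetical stand-ins,
// not the PR's actual types. Thresholds mirror the description above.
using System;
using System.Threading;

public class IndexInfo
{
    public string Name { get; set; }
    public long IndexLag { get; set; }   // documents the index still has to process
    public string Error { get; set; }    // null when the index reports no error
}

public class IndexHealthReporter : IDisposable
{
    readonly Func<IndexInfo[]> getStatistics;
    readonly Action<string, string> log;  // (level, message)
    Timer timer;

    public IndexHealthReporter(Func<IndexInfo[]> getStatistics, Action<string, string> log)
    {
        this.getStatistics = getStatistics;
        this.log = log;
    }

    public void Start() =>
        timer = new Timer(_ => Check(), null, TimeSpan.Zero, TimeSpan.FromMinutes(5));

    void Check()
    {
        foreach (var index in getStatistics())
        {
            log("DEBUG", $"Index '{index.Name}' lag: {index.IndexLag}");

            if (index.IndexLag > 100_000)
            {
                log("ERROR", $"Index '{index.Name}' is lagging by {index.IndexLag} documents.");
            }
            else if (index.IndexLag > 10_000)
            {
                log("WARN", $"Index '{index.Name}' is lagging by {index.IndexLag} documents.");
            }

            if (index.Error != null)
            {
                log("ERROR", $"Index '{index.Name}' reported an error: {index.Error}");
            }
        }
    }

    public void Dispose() => timer?.Dispose();
}
```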

@ramonsmits ramonsmits self-assigned this Apr 6, 2021
Base automatically changed from arg-maintenancemode-sigint to master April 7, 2021 20:24
@SzymonPobiega
Member

I am hesitant about waiting for non-stale indexes at start, or throwing, until we are sure that there are no customers who have stale indexes and a working ServiceControl. I would fear that we are going to blow up a perfectly fine instance of ServiceControl.

@ramonsmits
Member Author

I am hesitant about waiting for non-stale indexes at start, or throwing, until we are sure that there are no customers who have stale indexes and a working ServiceControl. I would fear that we are going to blow up a perfectly fine instance of ServiceControl.

A perfectly fine instance wouldn't have much work to do to get non-stale indexes at start. At most a few seconds if the instance was under load when it gracefully exited.

This guards against the problem that when the SC instance exits ungracefully (kill, crash, etc.) and has index issues, the indexes get rebuilt by RavenDB at start and startup will be delayed until they have caught up.
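
For reference, a startup wait of the kind being discussed could look roughly like the loop below. It assumes the RavenDB 3.x embedded client that ServiceControl 4.x uses; the polling interval and timeout are illustrative, not taken from the PR.

```csharp
// Rough sketch of a "wait for non-stale indexes at start" step, assuming the
// RavenDB 3.x client. Timeout and polling interval are illustrative only.
using System;
using System.Threading;
using Raven.Client;

public static class StartupIndexWait
{
    public static void WaitForNonStaleIndexes(IDocumentStore store, TimeSpan timeout)
    {
        var deadline = DateTime.UtcNow + timeout;

        while (DateTime.UtcNow < deadline)
        {
            var stats = store.DatabaseCommands.GetStatistics();

            if (stats.StaleIndexes.Length == 0)
            {
                return; // all indexes have caught up, continue startup
            }

            Thread.Sleep(TimeSpan.FromSeconds(5));
        }

        throw new TimeoutException("Indexes did not become non-stale within the allotted time.");
    }
}
```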

@tmasternak
Member

tmasternak commented Apr 15, 2021

@ramonsmits one of the arguments for not delaying the start was to protect clients who are already in a situation where the indexes are failing or significantly lagging behind.

That said, these clients will first face the delay when upgrading to the new minor version of ServiceControl. I would assume that this is a moment when they actually have some time set aside to take a closer look at this. Secondly, if we put the bare minimum alerting on the lagging indexes, they should be covered on an ongoing basis.

Finally, I would put in place a settings flag that enables skipping the checks when set, to make sure we can unblock any client that runs into problems in production.
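
Such a flag could be as simple as an app setting read once at startup; the setting name in this sketch is purely hypothetical.

```csharp
// Hypothetical setting name, shown only to illustrate the suggested escape hatch.
using System.Configuration;

static class IndexCheckSettings
{
    public static bool SkipIndexChecks()
    {
        var raw = ConfigurationManager.AppSettings["ServiceControl/SkipRavenIndexChecks"];
        return bool.TryParse(raw, out var skip) && skip;
    }
}
```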

@ramonsmits
Member Author

@tmasternak Then let's only keep the reporter?

@aleksandr-samila
Member

As for me, I like the idea of a flag (or flags):

... Finally, I would put in place a setting flag that would enable skipping the checks when set...

or the other way around; it doesn't matter, as long as we add the possibility.

@tmasternak
Member

Then let's only keep the reporter?

@ramonsmits and remove the startup delay?

@ramonsmits ramonsmits force-pushed the ravendb-health-reporter branch from 7aae330 to f973986 on April 20, 2021 10:46
@ramonsmits
Member Author

@tmasternak @SzymonPobiega I've removed the wait for indexes to become non-stale at start. Still throwing on index errors.
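
A startup check that throws on index errors might look roughly like this, again against the RavenDB 3.x statistics API; the exception type and message are assumptions for the example.

```csharp
// Sketch: abort startup when RavenDB reports index errors. Assumes the
// RavenDB 3.x client; the exception type and wording are illustrative.
using System;
using Raven.Client;

public static class StartupIndexErrorCheck
{
    public static void ThrowIfIndexErrors(IDocumentStore store)
    {
        var stats = store.DatabaseCommands.GetStatistics();

        if (stats.Errors != null && stats.Errors.Length > 0)
        {
            throw new Exception(
                $"RavenDB reported {stats.Errors.Length} index error(s). Check the RavenDB logs/studio for details.");
        }
    }
}
```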


@mikeminutillo mikeminutillo left a comment

Overall, I love this! Great job.

@ramonsmits
Member Author

@mikeminutillo @danielmarbach thanks for your great feedback. I've applied some refactorings. The index error check at start is removed. Only reporting remains. I'll cherry-pick the index error check on startup into a separate PR.

@danielmarbach danielmarbach merged commit e07550e into master Apr 22, 2021
@danielmarbach danielmarbach deleted the ravendb-health-reporter branch April 22, 2021 08:30
@danielmarbach danielmarbach added this to the 4.17.0 milestone Apr 22, 2021
@SzymonPobiega SzymonPobiega changed the title RavenDB index health reporter RavenDB index health monitoring May 6, 2021