This repository was archived by the owner on Sep 17, 2021. It is now read-only.

@zollman zollman commented Nov 16, 2016

Type: feature

Why:
When working with a large number of accounts, sometimes the run_reporter
is never run for some accounts.

There were two issues with the app scheduler configuration:

  1. Jobs are scheduled before the scheduler starts and when there are
    a lot of accounts, some of the scheduled times fall before the
    scheduler is started.
  2. There is a hardcoded misfire grace time of 30 seconds, meaning that
    if a job is scheduled and does not start within 30 seconds because
    of thread contention or other issues it will be cancelled.

This change addresses the need by:
Providing configurable times for the schedule start delay and the
misfire grace time

Potential Side Effects:
No known side effects
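
For illustration, the grace-time rule behind both issues can be sketched in a few lines. This is a minimal model, not the project's or the scheduler library's code: `job_outcome` and its parameters are hypothetical names, and it assumes a run is skipped once it is picked up later than its scheduled time plus the grace time.

```python
from datetime import datetime, timedelta


def job_outcome(scheduled_at, picked_up_at, misfire_grace_s):
    """Return "run" or "skipped" for one scheduled run (hypothetical helper).

    picked_up_at is when the scheduler actually gets to the job, either
    because it started late (issue 1) or because no thread was free (issue 2).
    """
    lateness_s = (picked_up_at - scheduled_at).total_seconds()
    if lateness_s <= misfire_grace_s:
        return "run"
    return "skipped"


t0 = datetime(2016, 11, 16, 12, 0, 0)
late = t0 + timedelta(seconds=45)  # job is picked up 45 seconds late

print(job_outcome(t0, late, 30))    # skipped
print(job_outcome(t0, late, 3600))  # run
```

Under this model, a larger start delay reduces how late the first runs are, and a larger grace time tolerates the lateness that remains.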

@scriptsrc

Do you have a recommendation for when we should modify these defaults?

What should they be when monitoring 30, 45, or 60 accounts?

# Apscheduler Configurations
# Length of time, in seconds, before a scheduled job is cancelled due to thread contention or other issues
MISFIRE_GRACE_TIME=30
# Delay, in seconds, until reporter starts
REPORTER_START_DELAY=10 

@aebie

aebie commented Nov 17, 2016

@MonkeySecurity For the reporter start delay, we have it set to the number of accounts + 2. That might be a bit of overkill, but we found that if the scheduler starts before all jobs are scheduled, the remaining ones don't run until the next interval. If the interval for some watchers is large, such as daily, this can be problematic.

The MISFIRE_GRACE_TIME is a little trickier because it depends on the number of threads and the average reporter run time. What we saw is that as the number of accounts grew and we added more threads, the reporter ran slower because of some inherent bottlenecks in boto when running multiple threads in the same process. By the time we got to 60 accounts it was basically just thrashing. We reduced the number of threads to about half the number of accounts, but then ran into the situation where some accounts never got run because they would time out waiting for a thread. We set MISFIRE_GRACE_TIME to an hour and found that accounts ran reasonably fast and none of them starved.
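
A rough way to see the starvation effect is to model the thread pool as fixed-size batches. The sketch below is a simplification under loud assumptions: all reporter jobs come due at the same moment, every run takes the same time, and the 120-second run time is a made-up figure, not a measurement. `starved_accounts` is a hypothetical helper.

```python
def starved_accounts(num_accounts, num_threads, run_time_s, misfire_grace_s):
    """Return indices of accounts that wait longer than the grace time."""
    starved = []
    for i in range(num_accounts):
        batches_ahead = i // num_threads      # full batches that run first
        wait_s = batches_ahead * run_time_s   # time spent waiting for a thread
        if wait_s > misfire_grace_s:
            starved.append(i)
    return starved


accounts = 60
threads = accounts // 2       # about half the number of accounts
start_delay = accounts + 2    # the start delay setting described above
run_time_s = 120              # hypothetical average reporter run time

print(len(starved_accounts(accounts, threads, run_time_s, 30)))    # 30
print(len(starved_accounts(accounts, threads, run_time_s, 3600)))  # 0
```

With a 30-second grace time, the second batch of 30 accounts waits 120 seconds and never runs. With an hour, every account gets a thread in time.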

I think this is only a temporary fix, given the Lambda-based architecture you described and another change we have coming at some point that staggers reporter runs across the interval.
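
As a sketch of what staggering could look like (one possible approach, not the change referred to above), start offsets can be spread evenly over the interval. `staggered_offsets` is a hypothetical name.

```python
def staggered_offsets(num_accounts, interval_s):
    """Evenly spaced start offsets, in seconds, across one interval."""
    step = interval_s / num_accounts
    return [round(i * step) for i in range(num_accounts)]


print(staggered_offsets(4, 3600))  # [0, 900, 1800, 2700]
```

Spreading the runs this way means fewer jobs compete for a thread at the same moment, so the grace time matters less.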

@scriptsrc

Glad I asked. That's excellent to know. I think I'll update my config accordingly.

@scriptsrc scriptsrc merged commit 99d5c9e into Netflix:develop Nov 18, 2016
@zollman zollman deleted the 7869_run_reporter_config branch November 30, 2016 20:48
@scriptsrc scriptsrc mentioned this pull request Dec 2, 2016