
Conversation

@m4dcoder
Contributor

@m4dcoder commented Feb 6, 2019

Move the lock for coordinating concurrency policies into the scheduler. With the current approach, when there is more than one scheduler, there is a race in scheduling that results in a failure to enforce the concurrency policy accurately.
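
For context, a minimal sketch of the idea, assuming a tooz-style coordination backend like the one wrapped by st2common's coordination service (the lock name and the helper functions below are illustrative, not the actual scheduler code):

```python
from tooz import coordination

# Every scheduler instance connects to the same coordination backend
# (e.g. Redis or ZooKeeper), so they all compete for a shared named lock.
coordinator = coordination.get_coordinator("redis://localhost:6379", b"scheduler-1")
coordinator.start()

def schedule(execution):
    # Serialize the check-and-dispatch step across all scheduler instances so
    # two schedulers cannot read the running-execution count at the same time
    # and both conclude the concurrency threshold has not been reached yet.
    lock = coordinator.get_lock(b"policy-" + execution.action_ref.encode("utf-8"))
    with lock:
        if under_concurrency_threshold(execution):  # illustrative helper
            dispatch(execution)                     # illustrative helper
        else:
            delay(execution)                        # illustrative helper
```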

@m4dcoder requested review from Kami and bigmstone February 6, 2019 01:03
@m4dcoder force-pushed the fix-scheduler-concurrency branch 2 times, most recently from fdf62a2 to 793c2aa February 6, 2019 01:10
@Kami added this to the 2.10.2 milestone Feb 6, 2019
CHANGELOG.rst Outdated
~~~~~~~

* Changed the ``inquiries`` API path from ``/exp`` to ``/api/v1`` #4495
* Moved the lock from concurrency policies into the scheduler. #4481 (bug fix)
Member

Please also clarify in the changelog entry what bug it fixes. Otherwise, if people go over the changelog they will have no idea what this change does and whether it affects them or not.

Contributor Author

Fixed


# Concurrency policies require the scheduler to acquire a distributed lock to prevent a race
# in scheduling when there are multiple scheduler instances.
POLICY_TYPES_REQUIRING_LOCK = [
Member

👍

if policy_types:
query_params['policy_type__in'] = policy_types

policy_dbs = pc_db_access.Policy.query(**query_params)
Member

Adding .count() to the end would probably be a bit more efficient, since the count is calculated and returned server side and we don't need to evaluate and load the whole result set into memory like we do with len().

Not a huge issue here since those documents are not large, but still an easy change :)
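
For illustration, a quick sketch of the difference, assuming a mongoengine-style queryset like the one st2common's persistence layer returns (the names mirror the diff above):

```python
# len() evaluates the queryset and loads every matching Policy document into
# memory just to count them.
policy_dbs = pc_db_access.Policy.query(**query_params)
if len(policy_dbs) > 0:
    ...

# count() asks MongoDB for the number of matching documents, so only an
# integer comes back over the wire.
if pc_db_access.Policy.query(**query_params).count() > 0:
    ...
```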

Contributor Author

Fixed

@Kami added the policies label Feb 6, 2019
@Kami
Member

Kami commented Feb 6, 2019

Thanks for working on this change, LGTM 👍

On a related note - would it somehow be possible for us to write end to end / integration tests which actually try to emulate the race and verify it's not there (probably quite hard to do end to end)?

Maybe spawn two scheduler processes as part of an integration test?
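
Purely as a sketch of what such a test could look like (the scheduler binary name, action ref, and helper functions here are hypothetical placeholders, not existing st2 test utilities):

```python
import subprocess
import time

# Start two scheduler processes so both compete for the same scheduling work.
schedulers = [
    subprocess.Popen(["st2scheduler", "--config-file", "conf/st2.tests.conf"])
    for _ in range(2)
]

try:
    # Fire a burst of executions for an action governed by a concurrency
    # policy of 1, then assert that at no point more than one of them is
    # in the running state (run_action / count_running are hypothetical).
    for _ in range(10):
        run_action("examples.concurrency-test")

    deadline = time.time() + 60
    while time.time() < deadline:
        assert count_running("examples.concurrency-test") <= 1
        time.sleep(0.5)
finally:
    for proc in schedulers:
        proc.terminate()
```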

COORDINATOR = coordinator_setup()

return COORDINATOR

Member

I assume this commented out code will be removed?

Contributor Author

Yes, it will be removed.

Move the lock for coordinating concurrency policies into the scheduler. With the current approach, when there is more than one scheduler, there is a race in scheduling that results in a failure to enforce the concurrency policy accurately.
Use the count method instead of len so the querying is done server side at MongoDB.
Update the changelog entry to be more descriptive about the fix for the scheduler race bug.
Clean up and remove commented out code from the coordination service.
@m4dcoder force-pushed the fix-scheduler-concurrency branch from 4338989 to a4f8b44 February 7, 2019 20:18
@m4dcoder merged commit 958eaeb into master Feb 7, 2019
@m4dcoder deleted the fix-scheduler-concurrency branch February 7, 2019 21:13