@MichaelBrim
Collaborator

Description

The request manager thread for each client now sends a periodic heartbeat RPC to that client to detect whether it has gone away unexpectedly. The period is currently set to 30 seconds. When a failure is detected, the request manager thread exits and the main server thread cleans up the client state.
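
The failure path described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `client_ctrl`, `heartbeat_fn`, `manager_loop`, and `fake_ping` are hypothetical names, and the stand-in ping function simply simulates a client that stops answering after three heartbeats.

```c
#include <stdbool.h>

/* Hypothetical per-client record; not the actual UnifyFS structure. */
typedef struct {
    int  client_id;
    bool failed;    /* set by the manager thread when a heartbeat fails */
} client_ctrl;

/* Hypothetical heartbeat RPC signature: returns 0 if the client answered. */
typedef int (*heartbeat_fn)(int client_id);

/* Stand-in for the real RPC: succeeds three times, then fails forever. */
static int calls_until_death = 3;
static int fake_ping(int client_id)
{
    (void)client_id;
    return (calls_until_death-- > 0) ? 0 : -1;
}

/* Manager loop: service requests, then check the client. On a failed
 * heartbeat, flag the client and leave the loop so that another thread
 * can clean up its state. Returns the number of completed iterations. */
static int manager_loop(client_ctrl* c, heartbeat_fn ping, int max_iters)
{
    int i;
    for (i = 0; i < max_iters; i++) {
        /* ... service pending client requests here ... */
        if (ping(c->client_id) != 0) {
            c->failed = true;   /* main thread checks this flag and cleans up */
            break;
        }
    }
    return i;
}
```

Calling `manager_loop(&c, fake_ping, 10)` on a fresh `client_ctrl` exits early once the simulated client stops responding and leaves `c.failed` set, which is the signal a main thread would use to release the client's resources.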

Motivation and Context

See issues: #567 #646

How Has This Been Tested?

Tested in an Ubuntu Docker container by killing one of the unit tests while it was still executing and verifying that the failure was detected and cleanup occurred.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Testing (addition of new tests or update to current tests)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the UnifyFS code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted.

thrd_ctrl->waiting_for_work = 0;
RM_UNLOCK(thrd_ctrl);

rc = rm_heartbeat(thrd_ctrl);
Collaborator

It's hard for me to tell from the code, but assuming the request manager thread has no work to do, how often are we pinging the client?

The comments above suggest the pthread_cond wait times out after 10 ms.

Collaborator Author

The code in rm_heartbeat() ensures that the RPC is only sent once every 30 seconds.
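
For readers following along, one way such a check could work is sketched below. It is an assumption-laden illustration rather than the body of rm_heartbeat(): `heartbeat_state`, `heartbeat_due`, and `HEARTBEAT_PERIOD_SECS` are hypothetical names, and the current time is passed in as a parameter so the logic can be exercised without waiting.

```c
#include <stdbool.h>
#include <time.h>

#define HEARTBEAT_PERIOD_SECS 30  /* period stated in the PR description */

/* Hypothetical per-client state; not the actual UnifyFS structure. */
typedef struct {
    time_t last_heartbeat;  /* time of the last heartbeat RPC, 0 if none yet */
} heartbeat_state;

/* Returns true when at least HEARTBEAT_PERIOD_SECS have elapsed since the
 * last heartbeat (and records `now` as the new heartbeat time). Returns
 * false otherwise, so the caller skips the RPC on this wakeup. */
static bool heartbeat_due(heartbeat_state* st, time_t now)
{
    if (st->last_heartbeat != 0 &&
        difftime(now, st->last_heartbeat) < HEARTBEAT_PERIOD_SECS) {
        return false;  /* too soon since the last heartbeat */
    }
    st->last_heartbeat = now;
    return true;
}
```

With a guard like this, the thread can wake up every 10 ms and call the heartbeat routine each time, while the RPC itself is only issued when `heartbeat_due(&st, time(NULL))` returns true.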

Collaborator

Ok, thanks. Now I see it.

@adammoody merged commit ea2f5f5 into llnl:dev on Aug 3, 2021
@MichaelBrim deleted the heartbeat-rpc branch on August 4, 2021 15:27