@MichaelBrim
Collaborator

Description

  • use timed versions of RPC forwarding functions, and add configuration settings for client-server and server-server RPC timeouts
  • for group RPCs, allow responses in any order
  • for RPCs that transfer extent metadata, avoid use of unifyfs_inode_extent_t and extent_tree_node structures, since they carry bytes the receiver does not need
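
To illustrate the last bullet, here is a minimal sketch of why shipping a tree-node structure wastes bytes compared to a compact wire record. Every name and field below (`demo_tree_node`, `demo_wire_extent`, `demo_pack_all`) is hypothetical and chosen for illustration only; none of it is the actual UnifyFS layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory tree node. The link pointers are only meaningful
 * in the address space of the process that owns the tree. */
struct demo_tree_node {
    struct demo_tree_node *left;
    struct demo_tree_node *right;
    struct demo_tree_node *parent;
    uint64_t start;    /* first logical file offset covered */
    uint64_t end;      /* last logical file offset covered */
    uint64_t log_pos;  /* position of the data in the owner's log */
    int32_t  svr_rank; /* server holding the data */
    int32_t  app_id;
    int32_t  cli_id;
};

/* Hypothetical wire record: only the fields a remote peer can use. */
struct demo_wire_extent {
    uint64_t start;
    uint64_t end;
    uint64_t log_pos;
    int32_t  svr_rank;
    int32_t  app_id;
    int32_t  cli_id;
};

/* Copy the transferable fields of n nodes into a flat array suitable for
 * a bulk transfer. Returns the number of bytes that would be sent. */
static size_t demo_pack_all(const struct demo_tree_node *nodes, size_t n,
                            struct demo_wire_extent *out)
{
    assert(nodes != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i].start    = nodes[i].start;
        out[i].end      = nodes[i].end;
        out[i].log_pos  = nodes[i].log_pos;
        out[i].svr_rank = nodes[i].svr_rank;
        out[i].app_id   = nodes[i].app_id;
        out[i].cli_id   = nodes[i].cli_id;
    }
    return n * sizeof(struct demo_wire_extent);
}
```

The saving per extent is the size of the link pointers plus any padding they force, which adds up when a sync transfers many extents at once.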

NOTE: due to a bug in earlier margo versions, margo 0.9.6 or later is required for the timed async forwarding

Motivation and Context

At scales of 256 nodes and larger on Summit, we occasionally observe runtime hangs during various operations, particularly those involving group RPCs. These changes reduce forced ordering of RPC responses and make it easier to diagnose responses that are never received. They also reduce the amount of data sent during synchronization of file extents.
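
As a rough illustration of the idea, the sketch below simulates collecting group RPC responses in arrival order under a timeout, then reporting which peers never answered. It makes no margo calls; `demo_req`, `demo_wait_any`, and `demo_collect` are hypothetical names, and arrival times are simulated values rather than real network events.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_PENDING = 0, DEMO_DONE = 1, DEMO_TIMED_OUT = 2 };

/* Hypothetical record for one child request of a group RPC. */
typedef struct {
    int    peer_rank;  /* server the request was sent to */
    double done_at_ms; /* simulated arrival time; negative = never arrives */
    int    state;      /* DEMO_PENDING, DEMO_DONE, or DEMO_TIMED_OUT */
} demo_req;

/* Return the index of the pending request whose response arrives first
 * within the timeout, or -1 if no pending request will arrive in time. */
static int demo_wait_any(const demo_req *reqs, size_t n, double timeout_ms)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (reqs[i].state != DEMO_PENDING)
            continue;
        if (reqs[i].done_at_ms < 0.0 || reqs[i].done_at_ms > timeout_ms)
            continue;
        if (best < 0 || reqs[i].done_at_ms < reqs[best].done_at_ms)
            best = (int)i;
    }
    return best;
}

/* Handle responses in arrival order instead of issue order. Peer ranks are
 * written to order[] as they complete; *n_done receives how many did.
 * Returns the number of requests marked as timed out. */
static size_t demo_collect(demo_req *reqs, size_t n, double timeout_ms,
                           int *order, size_t *n_done)
{
    assert(reqs != NULL && order != NULL && n_done != NULL);
    *n_done = 0;

    int idx;
    while ((idx = demo_wait_any(reqs, n, timeout_ms)) >= 0) {
        reqs[idx].state = DEMO_DONE;
        order[(*n_done)++] = reqs[idx].peer_rank;
    }

    /* Anything still pending exceeded the timeout; record it so the caller
     * can log exactly which peer never responded. */
    size_t n_timed_out = 0;
    for (size_t i = 0; i < n; i++) {
        if (reqs[i].state == DEMO_PENDING) {
            reqs[i].state = DEMO_TIMED_OUT;
            n_timed_out++;
        }
    }
    return n_timed_out;
}
```

With issue-order waiting and no timeout, a single unresponsive peer blocks the whole group operation and hides which peer was at fault; here the slow or missing peer is simply the one left in the timed-out state.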

How Has This Been Tested?

Tested on Summit at up to 1024 nodes (8 ppn) using the writeread example (with and without lamination); no hangs were observed.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Testing (addition of new tests or update to current tests)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the UnifyFS code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted.

Member

@CamStan CamStan left a comment

Thanks @MichaelBrim!

Would you mind updating the dependencies page in the docs to reflect the mochi-margo version requirement?
https://github.com/LLNL/UnifyFS/blob/dev/docs/dependencies.rst

Also, I'm not sure whether anyone uses bootstrap.sh, but the mochi-margo version there might need updating as well.

* use timed versions of RPC forwarding functions, and add configuration
  settings for client-server and server-server RPC timeouts
* for group RPCs, allow responses in any order
* for RPCs that transfer extent metadata, avoid use of
  unifyfs_inode_extent_t and extent_tree_node structures

NOTE: due to a bug in prior margo versions, we need to use 0.9.6
      or later to use the timed async forwarding

TEST_CHECKPATCH_SKIP_FILES="common/src/unifyfs_configurator.h"
@MichaelBrim
Collaborator Author

@CamStan I updated the dependencies doc and bootstrap.sh to reflect the current suggested dependencies.

@adammoody adammoody merged commit ab413eb into llnl:dev Feb 2, 2022
@adammoody
Collaborator

Thanks @MichaelBrim and @CamStan!

@MichaelBrim MichaelBrim deleted the margo-usage branch February 3, 2022 13:58

3 participants