Conversation
python/bindings.cpp
```cpp
m.def("make_collective_reader",
      [](py::object comm, bool collective_metadata, bool collective_transfer) {
#if SONATA_HAS_MPI == 1
```
When `SONATA_HAS_MPI` is false, I would rather we don't even expose the collective functionality, as it might be misleading to ask for a collective reader and get a normal one.
Summary: I think it's important that MPI is entirely optional in libsonata, even when writing code that wants to use collective MPI-IO. At most, I'd add a flag that prevents returning a suitable default.
The lines aren't precise because we'll likely move anything MPI-related to its own repo. The reason is that mpi4py is a build dependency, and one can't (nicely) have optional build dependencies.
Nevertheless, I think code that's written for collective IO should work even if the user fails to install mpi4py. HDF5 doesn't fail either if it can't do collective IO; instead it returns the correct values and sets an internal flag users can check. Functionally, collective IO is optional; it's only important for performance reasons under certain conditions.
The risk is that installing mpi4py, or libsonata with MPI support, will fail to build under certain conditions. I suspect many of our users will be unable or unwilling to debug these failures.

To me, there's a big difference between a large production run that requires MPI-IO to run reasonably fast, and something small (debugging) that runs fast simply because it's small. Large runs are always done on a cluster, and we can expect someone adept at these issues to install the required toolchain (including MPI-enabled libsonata); even if not, there's a real need, and therefore it makes sense to spend time debugging why something failed to build. Small runs are carried out anywhere, including laptops where MPI might not be installed or might not work correctly at that moment. In my opinion, I don't want to force users to figure these issues out just to run something that takes 2.3s with MPI-IO and 2.2s without. Similarly, I don't want every downstream project to have to remember to implement a graceful fallback.
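A graceful fallback of the kind argued for above could look like the following sketch. All names here (`make_reader`, `DefaultReader`, `CollectiveReader`, the `collective` flag) are hypothetical and for illustration only; this is not libsonata's actual API, just the HDF5-style "succeed either way, but set a checkable flag" behaviour:

```python
import importlib.util

class DefaultReader:
    """Stand-in for a default (serial) reader."""
    collective = False

class CollectiveReader:
    """Stand-in for a hypothetical MPI-IO collective reader."""
    collective = True

def make_reader(require_collective=False):
    # Mirror HDF5's behaviour: succeed either way, but expose a flag
    # (`reader.collective`) so callers can check what they actually got.
    if importlib.util.find_spec("mpi4py") is not None:
        return CollectiveReader()
    if require_collective:
        # The optional "no silent fallback" flag mentioned above.
        raise RuntimeError("collective IO requested, but mpi4py is not installed")
    return DefaultReader()

reader = make_reader()
print(type(reader).__name__)
```

With this shape, code written for collective IO runs unchanged on a laptop without mpi4py, and a production run can opt into a hard failure via `require_collective=True`.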
To detect whether the important applications are using MPI-IO we can:
- check GPFS waiters,
- run Darshan,
- use `ROMIO_PRINT_HINTS=1` and check the output, e.g., during module testing on the BB5.
Therefore, I think, on BB5, we're in a sufficiently strong position to notice if it's silently not using collective IO when it really should be.
Interesting design. As I understand it, you are implementing a collective reader already, so in Neurodamus we would basically need to call …
matz-e left a comment:
Nice! If I get this correctly, we can just pull the collective read interface out into its own library?
Yes, almost certainly, because mpi4py is a build dependency, and one can't (nicely) have optional build dependencies.
Force-pushed from 968b661 to 92321c7.
This PR has been heavily reworked since the previous round of discussions. Eventually it will only include the …
The questions for review are:
The next review would include questions like:
With the above, code written to use collective IO should automatically run even if the user can't install …
Last chance for comments @matz-e / @ferdonline; otherwise I'll approve this at the end of the day.
Force-pushed from c573ee4 to 858407a.
@mgeplf we're in a reasonably good place to hold off on merging this. With all the preliminary work out of the way, we won't suffer too much from merge conflicts, and we can work on multiple other things (like cleaning up a bit more after #319). We only need this in once we have it fully working and tested in neurodamus. Until then, retaining the freedom to change details might be nice.
This commit introduces the API for an `Hdf5Reader`. This reader abstracts the process of opening HDF5 files and reading a `libsonata.Selection` from a dataset. The default reader calls the existing `_readSelection`.
CI_BRANCHES:BLUECONFIGS_BRANCH=weji/libsonata_mpi
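The `Hdf5Reader` abstraction described above can be sketched roughly in Python (the real interface lives in C++; the `read_selection` name, its signature, and the list-based dataset are assumptions made here for illustration). A reader turns a Selection-like list of `[start, stop)` ranges into the concatenated values from those ranges:

```python
class Hdf5Reader:
    """Hypothetical sketch: reads a Selection-like list of half-open
    ranges from a dataset-like sequence."""

    def read_selection(self, dataset, selection):
        out = []
        for start, stop in selection:
            # Each range is half-open, like libsonata.Selection ranges.
            out.extend(dataset[start:stop])
        return out

reader = Hdf5Reader()
values = reader.read_selection(list(range(10)), [(1, 3), (7, 9)])
print(values)  # → [1, 2, 7, 8]
```

The point of the abstraction is that an MPI-IO reader could implement the same call with collective semantics, while callers stay unchanged.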
## Context

When using `WholeCell` load-balancing, the access pattern when reading parameters during synapse creation is extremely poor and is the main reason why we see long (10+ minute) periods of severe performance degradation of our parallel filesystem when running slightly larger simulations on BB5. Using Darshan and several PoCs, we established that the time required to read these parameters can be reduced by more than 8x, and IOps can be reduced by over 1000x, when using collective MPI-IO. Moreover, the "waiters" were reduced substantially as well. See BBPBGLIB-1070.

Following those findings, we concluded that neurodamus would need to use collective MPI-IO in the future. We've implemented most of the required changes directly in libsonata, allowing others to benefit from the same optimizations should the need arise. See BlueBrain/libsonata#309, BlueBrain/libsonata#307, and the preparatory work: BlueBrain/libsonata#315, BlueBrain/libsonata#314, BlueBrain/libsonata#298.

By instrumenting two simulations (SSCX and reduced MMB), we concluded that neurodamus was almost collective. However, certain attributes were read in a different order on different MPI ranks, possibly due to hashes being salted differently on different ranks.

## Scope

This PR enables neurodamus to use collective IO for the simulation described above.

## Testing

We successfully ran the reduced MMB simulation, but since SSCX hasn't been converted to SONATA, we can't run that simulation.

## Review

* [x] PR description is complete
* [x] Coding style (imports, function length, new functions, classes or files) is good
* [ ] Unit/scientific test added
* [ ] Updated README, in-code, developer documentation

Co-authored-by: Luc Grosheintz <luc.grosheintz@gmail.ch>
The idea is that one injects callbacks for reading selections from datasets. These callbacks implement:
The default calls `_readSelection`. Separate readers can implement variants suitable for their purpose, e.g. MPI-IO. The advantage is that the reader has control over both the required collective semantics and the HDF5 properties it sets. This allows some readers to have non-collective behaviour, such as short-circuiting for empty selections, while others implement strictly collective reading of datasets.

Additionally, we can create a suite of readers for different use cases: e.g. an MPI-IO reader for neurodamus, an aggregating reader for serial workloads on GPFS (and a separate one for Lustre if need be), and the default implementation, which minimizes the number of bytes read.