This repository was archived by the owner on Feb 26, 2025. It is now read-only.

Inject dataset reading via Hdf5Reader. #307

Merged

1uc merged 6 commits into master from 1uc/hdf5-reader on Dec 12, 2023
Conversation

1uc (Collaborator) commented Nov 7, 2023

The idea is that one injects callbacks for reading selections from datasets. These callbacks implement:

```cpp
template <class T>
std::vector<T> readSelection(const HighFive::DataSet& dset, Selection& selection);

HighFive::File open_file(const std::string& filename);
```

The default calls `_readSelection`. Separate readers can implement variants suited to their purpose, e.g. MPI-IO. The advantage is that the reader controls both the required collective semantics and the HDF5 properties. This allows some readers to have non-collective behaviour, such as short-circuiting for empty selections, while others implement strictly collective reading of datasets.

Additionally, we can create a suite of readers for different use cases, e.g. an MPI-IO reader for neurodamus, an aggregating reader for serial workloads on GPFS (and a separate one for Lustre if need be), and the default implementation, which minimizes the number of bytes read.
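To make the injection idea concrete, here is a hypothetical Python sketch (the class and function names are illustrative, not the actual libsonata API): every reader implements the same `read_selection` interface, and the storage layer is written against that interface only, so the caller can swap in whichever reader suits the workload.

```python
class DefaultReader:
    """Reads only the selected indices, minimising the bytes read."""
    def read_selection(self, dataset, selection):
        if not selection:              # short-circuit: nothing to read
            return []
        return [dataset[i] for i in selection]


class AggregatingReader:
    """Illustrative alternative: read the whole covering range once,
    then slice out the selection; a pattern that can suit GPFS-style
    filesystems where few large reads beat many small ones."""
    def read_selection(self, dataset, selection):
        if not selection:
            return []
        lo, hi = min(selection), max(selection) + 1
        block = dataset[lo:hi]         # one contiguous read
        return [block[i - lo] for i in selection]


def load(dataset, selection, reader=DefaultReader()):
    # The storage layer only depends on the reader interface.
    return reader.read_selection(dataset, selection)


data = list(range(100))
# Both readers return the same values; only the IO pattern differs.
assert load(data, [3, 7, 42]) == load(data, [3, 7, 42], AggregatingReader())
```

The point of the pattern is that correctness is identical across readers; they differ only in how the bytes reach memory.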

1uc force-pushed the 1uc/hdf5-reader branch from 6411979 to 2e25415 on November 7, 2023

```cpp
m.def("make_collective_reader",
      [](py::object comm, bool collective_metadata, bool collective_transfer) {
#if SONATA_HAS_MPI == 1
```
Contributor:

When SONATA_HAS_MPI is False, I would rather we not expose the collective functionality at all, as it might be misleading to ask for a collective reader and get a normal one.

1uc (Collaborator Author) replied:

Summary: I think it's important that MPI is entirely optional in libsonata, even when writing code that wants to use collective MPI-IO. At most, I'd add a flag that prevents returning a suitable default.

The details aren't final, because we'll likely move anything MPI-related to its own repository. The reason is that mpi4py is a build dependency, and one can't (nicely) have optional build dependencies.

Nevertheless, I think code that's written for collective IO should work even if the user fails to install mpi4py. HDF5 doesn't fail either if it can't do collective IO; instead it returns the correct values and sets an internal flag users can check. Functionally, collective IO is optional; it only matters for performance under certain conditions.
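The graceful-degradation behaviour described here can be sketched as follows (a hypothetical illustration, not libsonata or HDF5 code): a read that requests collective IO succeeds either way and simply records whether collective IO was actually used, mirroring how HDF5 sets an internal flag instead of raising an error.

```python
class ReadResult:
    """Holds the values read plus a flag recording whether the read
    was actually performed collectively."""
    def __init__(self, values, used_collective):
        self.values = values
        self.used_collective = used_collective


def read_selection(dataset, selection, mpi_available):
    """Read `selection` (a list of indices) from `dataset` (a list).

    If MPI isn't available, fall back to an ordinary read and record
    that collective IO was not used, rather than failing.
    """
    values = [dataset[i] for i in selection]
    return ReadResult(values, used_collective=mpi_available)


data = list(range(10))
result = read_selection(data, [1, 3, 5], mpi_available=False)
print(result.values, result.used_collective)  # [1, 3, 5] False
```

The caller always gets correct values; callers that care about performance can inspect the flag, just as HDF5 users can query whether collective IO actually happened.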

The risk is that installing mpi4py, or libsonata with MPI support, will fail to build under certain conditions. I suspect many of our users will be unable or unwilling to debug these failures. To me, there is a big difference between a large production run that requires MPI-IO to run reasonably fast, and something small (e.g. debugging) that runs fast simply because it's small.

Large runs are always done on a cluster, where we can expect someone adept at these issues to install the required toolchain (including an MPI-enabled libsonata); and even if not, the necessity justifies spending time to debug a build failure. Small runs are carried out anywhere, including laptops where MPI might not be installed or might not work correctly at that moment. I don't want to force users to figure these issues out just to run something that takes 2.3 s with MPI-IO and 2.2 s without. Similarly, I don't want every downstream project to have to remember to implement a graceful fallback.

To detect if the important applications are using MPI-IO we can:

  • check GPFS waiters,
  • run Darshan,
  • use ROMIO_PRINT_HINTS=1 and check the output, e.g. during module testing on BB5.

Therefore, on BB5, I think we're in a sufficiently strong position to notice if collective IO is silently not being used when it really should be.

ferdonline (Contributor) commented:

Interesting design. As I understand it, you are already implementing a collective reader, so in Neurodamus we would basically need to call make_collective_reader and pass it as an argument to the config calls that return storage. I think this is fine.

matz-e (Member) left a comment:

Nice! If I understand this correctly, we can just pull the collective read interface out into its own library?

1uc (Collaborator Author) commented Nov 8, 2023

> Nice! If I understand this correctly, we can just pull the collective read interface out into its own library?

Yes, almost certainly, because mpi4py is a build dependency, and one can't (nicely) have optional build dependencies.

1uc force-pushed the 1uc/hdf5-reader branch 11 times, most recently from 968b661 to 92321c7 on November 14, 2023
This was referenced Nov 14, 2023
1uc (Collaborator Author) commented Nov 14, 2023

This PR has been heavily reworked since the previous round of discussions. Eventually it will only include the Hdf5Reader API. The two commits that refactor libsonata have been split into their own PRs. The implementation of an MPI-IO reader has been moved to a separate PR (until it can be moved to its own repository).

1uc (Collaborator Author) commented Nov 29, 2023

The questions for review are:

  1. Are the rules regarding collectiveness of libsonata acceptable?
  2. Do we need the ability to pass options to the reader on getAttribute and similar? Currently, I can only think of collective, which would specify whether the call is collective. Things like block size, I feel, would go into the reader.
  3. Naming of Hdf5Reader.

The next review would include questions like:

  1. Can we have an optional dependency libsonata[mpi]? The advantage is that libsonata can control the version of libsonata_mpi it's compatible with, which is useful if the API of Hdf5Reader needs to change.
  2. Can we have wrappers for libsonata_mpi.make_collective_reader in libsonata? The advantage is that one could then always do libsonata.make_collective_reader which would always succeed. If libsonata_mpi is available it will return a collective reader, otherwise it'll return the default reader.

With the above, code written to use collective IO should automatically run even if the user can't install mpi4py or libsonata_mpi. Users can thus avoid the pain of installing the difficult dependencies, and developers don't need to worry about making this easy for their users.
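A minimal sketch of such a wrapper (the module and function names here are assumptions based on the discussion, not a published API): always return some reader, preferring the collective one only when the optional MPI package is importable.

```python
import importlib


class DefaultReader:
    """Plain reader used when MPI support isn't available."""
    def read_selection(self, dataset, selection):
        return [dataset[i] for i in selection]


def make_collective_reader(comm=None):
    """Return a collective reader if the (hypothetical) optional
    `libsonata_mpi` package is installed; otherwise fall back to
    the default reader instead of failing."""
    try:
        mpi = importlib.import_module("libsonata_mpi")
    except ImportError:
        return DefaultReader()          # graceful fallback: code still runs
    return mpi.make_collective_reader(comm)


reader = make_collective_reader()
print(type(reader).__name__)  # DefaultReader when libsonata_mpi isn't installed
```

This is the "always succeeds" property argued for above: downstream code calls one function unconditionally, and the choice between collective and default IO becomes a deployment detail rather than a code path.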

@1uc 1uc marked this pull request as ready for review November 29, 2023 08:58
mgeplf (Contributor) left a comment:

LGTM, let's get #319 merged, and then hopefully this

mgeplf (Contributor) commented Nov 30, 2023

Last chance for comments @matz-e / @ferdonline; otherwise I'll approve this at the end of the day.

1uc (Collaborator Author) commented Dec 1, 2023

@mgeplf we're in a reasonably good place to hold off on merging this. With all the preliminary work out of the way, we'll not suffer too much from merge conflicts and we can work on multiple other things (like cleaning up a bit more after #319).

We only need this merged once it's fully working and tested in neurodamus. Until then, retaining the freedom to change details might be nice.

@1uc 1uc requested a review from mgeplf December 12, 2023 07:14
1uc added 6 commits December 12, 2023 14:15
This commit introduces the API for an Hdf5Reader. This reader abstracts the process of opening HDF5 files and reading a `libsonata.Selection` from a dataset.

The default reader calls the existing `_readSelection`.
@1uc 1uc merged commit a764da5 into master Dec 12, 2023
@1uc 1uc deleted the 1uc/hdf5-reader branch December 12, 2023 13:53
1uc referenced this pull request in BlueBrain/neurodamus Dec 20, 2023
CI_BRANCHES:BLUECONFIGS_BRANCH=weji/libsonata_mpi
WeinaJi pushed a commit to BlueBrain/neurodamus that referenced this pull request Jan 29, 2024
## Context
When using `WholeCell` load-balancing, the access pattern when reading
parameters during synapse creation is extremely poor and is the main
reason why we see long (10+ minutes) periods of severe performance
degradation of our parallel filesystem when running slightly larger
simulations on BB5.

Using Darshan and several PoCs, we established that the time required to read these parameters can be reduced by more than 8x, and IOps can be reduced by over 1000x, when using collective MPI-IO. Moreover, the "waiters" were reduced substantially as well. See BBPBGLIB-1070.

Following those findings, we concluded that neurodamus would need to use collective MPI-IO in the future.

We've implemented most of the required changes directly in libsonata
allowing others to benefit from the same optimizations should the need
arise. See,
BlueBrain/libsonata#309
BlueBrain/libsonata#307

and preparatory work:
BlueBrain/libsonata#315
BlueBrain/libsonata#314
BlueBrain/libsonata#298 

By instrumenting two simulations (SSCX and reduced MMB), we concluded that neurodamus was almost collective. However, certain attributes were read in a different order on different MPI ranks, possibly because hashes are salted differently on different ranks.

## Scope
This PR enables neurodamus to use collective IO for the simulation
described above.

## Testing
We successfully ran the reduced MMB simulation, but since SSCX hasn't
been converted to SONATA, we can't run that simulation.

## Review
* [x] PR description is complete
* [x] Coding style (imports, function length, New functions, classes or
files) are good
* [ ] Unit/Scientific test added
* [ ] Updated Readme, in-code, developer documentation

---------

Co-authored-by: Luc Grosheintz <luc.grosheintz@gmail.ch>
WeinaJi pushed a commit to BlueBrain/neurodamus that referenced this pull request Oct 14, 2024
(Same commit message as above.)