bcfd6fb to 428ed55
|
@guj just quickly mentioned the list of collective HDF5 calls again, so if we do not want to re-implement the buffer functionality of ADIOS for HDF5, this backend might just not be as flexible: https://support.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html
Nevertheless, I seem to remember for ADIOS1 that calls such as
@RemiLehe also mentioned we could try |
|
It should work if one calls the adios_open/close methods on all ranks and writes variables on only some ranks. |
|
@guj Thanks! Is that true for both adios1 and adios2? |
|
Yes - Junmin
… On Oct 4, 2019, at 2:09 PM, Remi Lehe wrote:
@guj Thanks! Is that true for both adios1 and adios2?
|
428ed55 to 1853ceb
|
I am not sure why this is not deadlocking for BP3/ADIOS1 anymore, but that is good news. Potentially I mismatched it while running and designing the tests.
|
1853ceb to d0aa187
|
So, these are my own checks on my local machine: Ubuntu 18.04.3 LTS, GCC 7.4.0, CMake 3.15.4. ADIOS1 and ADIOS2 built via Spack, PHDF5 taken from the system.
(*) with
Now with ADIOS1, ADIOS2 and HDF5 all built via Spack for OpenMPI 3.1.4:
(*) with
But when executed on Cori-KNL (NERSC) with the following commands
module swap craype-haswell craype-mic-knl
module swap PrgEnv-intel PrgEnv-gnu # GCC 8.2.0
module load adios/1.13.1
export CC="$(which cc)"
export CXX="$(which CC)"
export CRAYPE_LINK_TYPE=dynamic
cd $SCRATCH
git clone https://github.com/openPMD/openPMD-api.git
cd openPMD-api
git remote add ax3l https://github.com/ax3l/openPMD-api.git
git fetch ax3l
git checkout ax3l/fix-storeChunkCalls
mkdir build
cd build
cmake .. -DopenPMD_USE_HDF5=OFF -DopenPMD_USE_ADIOS1=ON -DopenPMD_USE_ADIOS2=OFF -DopenPMD_USE_PYTHON=OFF -DMPIEXEC_EXECUTABLE=$(which srun)
make -j 8
salloc --time=1:00:00 -N 1 -C knl -q interactive
CTEST_OUTPUT_ON_FAILURE=1 ctest -V -R ParallelIO
# or manually
srun -n 2 bin/ParallelIOTestsADIOS1
does not deadlock:
Legend: ❌ = deadlock |
|
For PHDF5, I added a new environment variable. I am still puzzled why the ADIOS1 implementation deadlocks. Maybe @franzpoeschel can spot it quicker than I can. |
.travis.yml
Outdated
# run tests
# Independent I/O in PHDF5
# https://github.com/openPMD/openPMD-api/pull/554
- OPENPMD_HDF5_INDEPENDENT=ON
It would be better to add this for the specific test as a JSON option (#569), because setting it globally masks a good number of potential issues in CI, but we do not have this in place yet.
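For readers who want to see what an environment switch like `OPENPMD_HDF5_INDEPENDENT=ON` amounts to on the C++ side, here is a minimal sketch. It is not the openPMD-api implementation: the helper names `parseSwitch` and `hdf5IndependentRequested`, the accepted spellings, and the choice of collective I/O as the default are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of openPMD-api: interpret an ON/OFF style
// switch. The accepted spellings are an assumption for this sketch.
inline bool parseSwitch(char const* raw, bool fallback)
{
    if (raw == nullptr)
        return fallback; // variable is not set

    std::string value(raw);
    std::transform(value.begin(), value.end(), value.begin(),
        [](unsigned char c) { return static_cast<char>(std::toupper(c)); });

    if (value == "ON" || value == "1" || value == "TRUE")
        return true;
    if (value == "OFF" || value == "0" || value == "FALSE")
        return false;
    return fallback; // unknown spelling: keep the default
}

// Reads the variable named in the CI configuration; in this sketch the
// default (variable unset) is assumed to be collective I/O.
inline bool hdf5IndependentRequested()
{
    return parseSwitch(std::getenv("OPENPMD_HDF5_INDEPENDENT"), false);
}
```

Because the environment is process-wide, a switch like this applies to every test in the CI job, which is the concern raised in the review comment; a per-test option would scope it more tightly.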
Parallel test: allow to call storeChunk an independent number of times from various ranks. This currently deadlocks on `Series::flush()` for both ADIOS1 and HDF5.
1c3a70a to a089c92
Add this via a JSON option to the specific tests where we want this, otherwise we wildcard too much in our CI.
a089c92 to 5f235f8
|
For some reason, I see 3x
Looks like a rogue work-around inside
This block, which is also present in the serial backend, seems to be the culprit:
for( auto& f : m_filePaths )
    if( m_openWriteFileHandles.find(f.second) == m_openWriteFileHandles.end() )
        m_openWriteFileHandles[f.second] = open_write(f.first);
Note: The |
|
If a Series is constructed, there are two places where adios_open() is called. In your test, rank 1 called adios_open() in both places; rank 0 has no work, so it called adios_open() at (2) only. I am wondering why it is necessary at (2) to trigger file creation in a destructor. It might make more sense to create the file in the Series constructor. |
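The mismatch described in this comment can be illustrated with a toy model that involves neither MPI nor ADIOS. Each rank is represented by a list of call names, and the check verifies that every rank issues the same sequence of collective calls. Which calls count as collective is an assumption here, taken from the statement earlier in the thread that open/close must happen on all ranks while writes may happen on only some; `CallTrace`, `collectiveOnly` and `collectivesMatch` are hypothetical names.

```cpp
#include <set>
#include <string>
#include <vector>

// One rank's calls, in program order (toy model, no real I/O).
using CallTrace = std::vector<std::string>;

// Keep only the calls that all ranks must issue together.
inline CallTrace collectiveOnly(
    CallTrace const& trace, std::set<std::string> const& collective)
{
    CallTrace filtered;
    for (auto const& call : trace)
        if (collective.count(call) != 0)
            filtered.push_back(call);
    return filtered;
}

// True if every rank issues the same sequence of collective calls.
// A mismatch is the situation in which a real MPI run would hang.
inline bool collectivesMatch(
    std::vector<CallTrace> const& ranks,
    std::set<std::string> const& collective)
{
    if (ranks.empty())
        return true;
    CallTrace const reference = collectiveOnly(ranks.front(), collective);
    for (auto const& rank : ranks)
        if (collectiveOnly(rank, collective) != reference)
            return false;
    return true;
}
```

In this model, a rank that opens the file twice while its peer opens it once fails the check, whereas ranks that differ only in their number of `adios_write` calls pass it.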
|
Thank you for finding these details! I think although issuing an early
|
|
I think I might have now spotted the issue in |
13b22a0 to f6694e1
Files for writing should not be closed just to be opened again immediately after. Fix deadlock in `storeChunk()` when called in non-collective manner.
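The idea in this commit message, keeping a write handle open instead of closing and immediately reopening it, can be sketched with a small cache. This is an illustration under assumptions, not the code of the fix: `WriteHandleCache` is a hypothetical class, the integer handles stand in for real backend handles, and in a parallel backend the first `get()` for a file would still have to happen on all ranks, since the open itself is collective.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a backend's table of open write handles.
class WriteHandleCache
{
public:
    // Return the handle for `path`; "open" the file only on first request.
    std::int64_t get(std::string const& path)
    {
        auto const it = m_handles.find(path);
        if (it != m_handles.end())
            return it->second; // already open: reuse, do not reopen

        std::int64_t const handle = m_nextHandle++;
        ++m_openCalls; // a real backend would open the file here
        m_handles.emplace(path, handle);
        return handle;
    }

    // Close every handle exactly once, e.g. at the end of the run.
    void closeAll()
    {
        m_closeCalls += m_handles.size();
        m_handles.clear();
    }

    std::size_t openCalls() const { return m_openCalls; }
    std::size_t closeCalls() const { return m_closeCalls; }

private:
    std::map<std::string, std::int64_t> m_handles;
    std::int64_t m_nextHandle = 0;
    std::size_t m_openCalls = 0;
    std::size_t m_closeCalls = 0;
};
```

With such a cache, repeated flushes touch the same handle, so the number of open and close calls per file no longer depends on how often a given rank wrote data.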
f6694e1 to 15cc01e
Parallel test: allow to call storeChunk() an independent number of times from various ranks. This currently deadlocks on Series::flush() for both ADIOS1 and HDF5 (ok!).
@C0nsultant @franzpoeschel can you please take a look at this? At least for ADIOS1 this should be a totally fine workflow. I think it is a bug in our library to require a matched number of calls to storeChunk() from all participating ranks.
To Do
storeChunk, another does not (fails); but we also need storeChunk times, others <M> != <N> times
loadChunk