
Parallel Test: Flush fileBased Some #672

Merged
ax3l merged 1 commit into openPMD:dev from ax3l:topic-parallelSeriesPartialFlush
Jan 27, 2020

Parallel Test: Flush fileBased Some#672
ax3l merged 1 commit intoopenPMD:devfrom
ax3l:topic-parallelSeriesPartialFlush

Conversation

@ax3l (Member) commented Jan 26, 2020

Add an additional flush, which should be called implicitly in the destructor anyway, and which triggers MPI issues.

Essentially what was added in #668

MPICH 3.3.2 error, reproduced locally with the ADIOS1 backend:

Fatal error in PMPI_Comm_dup: Other MPI error, error stack:
PMPI_Comm_dup(179)..................: MPI_Comm_dup(comm=0x84000002, new_comm=0x55c0ca44be1c) failed
PMPI_Comm_dup(164)..................: 
MPIR_Comm_dup_impl(57)..............: 
MPII_Comm_copy(571).................: 
MPIR_Get_contextid_sparse_group(495): 
MPIR_Allreduce_impl(293)............: 
MPIR_Allreduce_intra_auto(178)......: 
MPIR_Allreduce_intra_auto(84).......: 
MPIR_Bcast_impl(310)................: 
MPIR_Bcast_intra_auto(223)..........: 
MPIR_Bcast_intra_binomial(123)......: message sizes do not match across processes in the collective routine: Received 4 but expected 260

Weird: the MPI communicator is duplicated (and freed) in the ADIOS1 backend, but not in the HDF5 and ADIOS2 backends.

@ax3l force-pushed the topic-parallelSeriesPartialFlush branch 3 times, most recently from d71d36f to 5917f9e on January 26, 2020 at 21:23
@ax3l assigned ax3l and guj Jan 26, 2020
@ax3l force-pushed the topic-parallelSeriesPartialFlush branch 7 times, most recently from 098293f to 2e1d5b7 on January 27, 2020 at 08:04
@ax3l (Member, Author) commented Jan 27, 2020

One should apply the same logic change to the serial ADIOS1 implementation. Doing so already exposes a problem that is likely also present in the parallel version: as soon as flush() is called, some default attributes are written with their default values if the user has not yet changed them. That means that after the first flush a user cannot change them anymore.
This is not too problematic if one writes meta-data (openPMD extension, record meta-data, etc.) before data, but it is not ideal either.

This is probably the reason for delaying attribute writes in this backend in the first place.

    e["position"]["x"].storeChunk(position_local, {offset}, {rank});
    e["positionOffset"]["x"].storeChunk(positionOffset_local, {offset}, {rank});
}
o.flush();
@ax3l (Member, Author) commented:

If one removes this flush now, this test hangs: one rank blocks in ParallelADIOS1IOHandlerImpl::open_write(), while the other rank blocks in the MPI_Barrier(m_mpiComm) call inside ~ParallelADIOS1IOHandlerImpl.
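The hang is the classic mismatched-collectives pattern: MPI collectives must be entered by every rank in the same order. A simple debugging aid (sketched below; not openPMD-api code, and the trace contents are illustrative) is to log each rank's sequence of collective calls and locate the first point where the sequences diverge.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch: record each rank's sequence of collective calls
// (e.g. "open_write", "MPI_Barrier") and compare the traces. A divergence
// corresponds to the deadlock described above, where one rank is opening
// the file while the other already sits in the destructor's barrier.
using Trace = std::vector<std::string>;

// Index of the first divergence; equals the common length if all shared
// entries agree.
std::size_t firstDivergence(Trace const &a, Trace const &b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i])
        ++i;
    return i;
}

// Traces match only if they have equal length and identical entries.
bool collectivesMatch(Trace const &a, Trace const &b) {
    return a.size() == b.size() && firstDivergence(a, b) == a.size();
}
```

With the flush removed, simplified traces for the scenario above would be rank 0 = {open_write, MPI_Barrier} and rank 1 = {MPI_Barrier}: collectivesMatch() reports false, diverging at index 0, matching the observed hang.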

@ax3l force-pushed the topic-parallelSeriesPartialFlush branch from 2e1d5b7 to bc3ab6e on January 27, 2020 at 19:30
@ax3l (Member, Author) commented Jan 27, 2020

Looks like #674 could also fix this one :)

A backup of the previous fix is stored in the branch topic-parallelSeriesPartialFlush-bakFix.

@ax3l force-pushed the topic-parallelSeriesPartialFlush branch from bc3ab6e to f907035 on January 27, 2020 at 19:33
Add an additional flush - which should be called implicitly anyway
in the destructor - and which causes MPI issues.
@ax3l force-pushed the topic-parallelSeriesPartialFlush branch from f907035 to ca0d59d on January 27, 2020 at 20:06
@ax3l removed the help wanted label Jan 27, 2020
@ax3l merged commit 2fa1707 into openPMD:dev on Jan 27, 2020
@ax3l deleted the topic-parallelSeriesPartialFlush branch on January 27, 2020 at 22:41
