[WIP] Parallel HDF5: 4MB Alignment & Buffer#898

Open
ax3l wants to merge 3 commits intoopenPMD:devfrom
ax3l:topic-4MBalignmentAndBuf

@ax3l ax3l commented Jan 13, 2021

FS blocksize:

stat -fc %s .

Tried those options on Cori (Scratch and CFS): 8_benchmark case with -w, KNL partition, WarpX-like MPI-rank placement.
modules: ... darshan/3.1.7 gcc/8.3.0 cray-mpich/7.7.10 cray-hdf5-parallel/1.10.5.2 ...

Scratch: 1MB recommended blocksize (confusingly, stat -fc %s <dir> reports 4KiB)
CFS: 16 MB blocksize (with 4MiB subblocks)
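The block size that `stat -fc %s` prints can also be queried from code. The following is a minimal sketch, assuming a POSIX system with `statvfs`; the helper names `reportedBlockSize` and `roundUpToBlock` are hypothetical and not part of openPMD-api. As noted above for Scratch, the reported value may differ from the size the site recommends, so treat it as a starting point only.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <sys/statvfs.h> // POSIX only

// Hypothetical helper: block size in bytes that the file system reports
// for `path` (the f_bsize field filled in by statvfs).
inline std::uint64_t reportedBlockSize( std::string const & path )
{
    struct statvfs info;
    if( statvfs( path.c_str(), &info ) != 0 )
        throw std::runtime_error( "statvfs failed for: " + path );
    return static_cast< std::uint64_t >( info.f_bsize );
}

// Hypothetical helper: round a candidate alignment up to the next
// multiple of the block size, so it is a whole number of blocks.
inline std::uint64_t roundUpToBlock(
    std::uint64_t bytes, std::uint64_t blockSize )
{
    assert( blockSize > 0 );
    return ( bytes + blockSize - 1 ) / blockSize * blockSize;
}
```

For example, `roundUpToBlock( 4194304, reportedBlockSize( "." ) )` leaves 4 MiB unchanged on a file system that reports 4 KiB blocks.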

Support quote:

blocksize is a quirky parameter for parallel file systems because between your compute node and the actual block devices are a bunch of network and RAID layers that have their own magic sizes. Some arcane knowledge is required

Sets medium striping.
Note: for proper ADIOS2 timings, keep the small default striping (it creates subfiles that should not be heavily striped); for proper HDF5 timings, enable striping (single output file that should be heavily striped).

For HDF5, we can also try T3PIO MPI_Info hints again.

cori.sbatch.txt

Cori: Darshan Logs

# MPICH statistics collection
export MPICH_MPIIO_STATS=1
export MPICH_MPIIO_HINTS_DISPLAY=1
export MPICH_MPIIO_TIMERS=1

# Darshan extended trace (dxt) logs
export DARSHAN_DISABLE_SHARED_REDUCTION=1
export DXT_ENABLE_IO_TRACE=4

# work-around needed
export LD_PRELOAD=/global/common/cori_cle7/software/darshan/3.1.7/lib/libdarshan.so

# srun ...

# disable work-around
unset LD_PRELOAD

VERIFY(status >= 0, "[HDF5] Internal error: Failed to set H5Pset_dxpl_mpio");

- auto const strByte = auxiliary::getEnvString( "OPENPMD_HDF5_ALIGNMENT", "1" );
+ auto const strByte = auxiliary::getEnvString( "OPENPMD_HDF5_ALIGNMENT", "4194304" );

@ax3l ax3l Jan 15, 2021

Sample & bin directory du -hs with:

| OPENPMD_HDF5_ALIGNMENT | size        |
|------------------------|-------------|
| 1                      | 7.8M + 1.4G |
| 4194304                | 7.8M + 1.4G |

on my laptop (4KiB blocksize).

ls output (less reliable) also ok for parallel files (not padded to multiples of 4MiB). So either this is cleverly compacted or has no influence...

double policy = 0.0;
status = H5Pget_cache(m_fileAccessProperty, &metaCacheElements, &rawCacheElements, &rawCacheSize, &policy);
VERIFY(status >= 0, "[HDF5] Internal error: Failed to set H5Pget_cache");
rawCacheSize = bytes * 4; // default: 1 MiB per dataset

This should be guarded so we don't accidentally set a cache of only a few bytes if a user provides an alignment of 1.
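One possible shape for such a guard, as a minimal sketch: keep the factor of four from the diff, but never return less than the 1 MiB default mentioned in the code comment. The helper name `guardedRawCacheSize` and the constant `defaultRawCacheBytes` are hypothetical, not openPMD-api identifiers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Assumption taken from the code comment in the diff:
// the default raw data cache is 1 MiB per dataset.
constexpr std::size_t defaultRawCacheBytes = 1024u * 1024u;

// Hypothetical helper: derive the raw cache size from the alignment
// (four times the alignment, as in the diff), but never go below the
// default. An alignment of 1 therefore keeps the 1 MiB cache.
inline std::size_t guardedRawCacheSize( std::size_t alignmentBytes )
{
    assert( alignmentBytes >= 1 );
    std::size_t const requested = alignmentBytes * 4u;
    return std::max( requested, defaultRawCacheBytes );
}
```

The result would then be passed as the raw cache size to `H5Pset_cache`. Very large alignments could overflow the multiplication; this sketch does not handle that case.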

@ax3l ax3l mentioned this pull request Jan 29, 2021

ax3l commented Jun 11, 2021

We can run these tests again after #916 is merged; maybe we will see some improvement when setting striping with chunked datasets.

@ax3l ax3l force-pushed the topic-4MBalignmentAndBuf branch 2 times, most recently from e7c4377 to 9163165 Compare June 24, 2021 06:35
  ===================================== ========= ====================================================================================
  ``OPENPMD_HDF5_INDEPENDENT``          ``ON``    Sets the MPI-parallel transfer mode to collective (``OFF``) or independent (``ON``).
- ``OPENPMD_HDF5_ALIGNMENT``            ``1``     Tuning parameter for parallel I/O, choose an alignment which is a multiple of the disk block size.
+ ``OPENPMD_HDF5_ALIGNMENT``            ``ABC``   Tuning parameter for parallel I/O, choose an alignment which is a multiple of the disk block size.

todo with final value

@ax3l ax3l force-pushed the topic-4MBalignmentAndBuf branch from 9163165 to a34d531 Compare June 24, 2021 06:51
@ax3l ax3l force-pushed the topic-4MBalignmentAndBuf branch from a34d531 to 617c465 Compare June 24, 2021 07:05

ax3l commented Jun 24, 2021

Next measurements we should try on Cori (Suren Byna):

Option 1

  • Set the alignment to 8 MB (in the H5Pset_alignment() call, threshold of 0 and alignment of 8 MB)
  • Set striping on the directory where the data is being written.
  • Stripe count: 40
  • Stripe size: 8 MB

Just in case, here’s the command to set the stripe on a directory.

lfs setstripe --stripe_count 40 --stripe_size 8m ./benchmarks

Option 2

  • Alignment of 16 MB
  • Stripe count: 40
  • Stripe size: 16 MB
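To make the threshold and alignment parameters in both options concrete, here is a simplified model of the rule documented for `H5Pset_alignment`: objects of at least `threshold` bytes are placed at an address that is a multiple of `alignment`. The function `placedOffset` is hypothetical, it ignores HDF5's actual free-space management, and it assumes "8 MB" and "16 MB" mean 8 MiB and 16 MiB.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t MiB = 1024ull * 1024ull;

// Simplified model (assumption, not HDF5 internals): return the file
// offset at which an object of `objectSize` bytes would be placed when
// the next free byte is `nextFreeOffset`.
inline std::uint64_t placedOffset(
    std::uint64_t nextFreeOffset,
    std::uint64_t objectSize,
    std::uint64_t threshold,
    std::uint64_t alignment )
{
    assert( alignment > 0 );
    // objects smaller than the threshold are not aligned
    if( objectSize < threshold )
        return nextFreeOffset;
    // otherwise round up to the next multiple of the alignment
    return ( nextFreeOffset + alignment - 1 ) / alignment * alignment;
}
```

With a threshold of 0, every object qualifies: `placedOffset( 100, 10, 0, 8 * MiB )` yields `8 * MiB`, which matches a stripe boundary when the stripe size is also 8 MiB. With a non-zero threshold, small objects stay where they are.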

Jan/Feb tests

We tried various sizes in Jan/February with the job script linked above in the PR description. We saw no improvement on Cori at the time.

Since then, we implemented chunking (#406) and changed the benchmark from 4D to 3D (#1010).
Also, we have new parallel benchmarks now (8a, 8b).

Comment on lines +116 to +118
// align all (no threshold) if only alignment is set
if( m_alignment > 1 && m_threshold == 1 )
    m_threshold = 0;

@ax3l ax3l Jun 25, 2021

This needs to be moved after the whole if( config.contains( "hdf5" ) ) block; otherwise the env var OPENPMD_HDF5_ALIGNMENT will not imply m_threshold = 0 for values >1.
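The ordering issue can be sketched as follows: collect the alignment from every source first, then apply the implied threshold once at the end. The struct `AlignmentSettings`, the function `resolveAlignment`, and the precedence of the config value over the environment value are assumptions for illustration, not the PR's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

struct AlignmentSettings
{
    std::size_t alignment = 1; // default: no alignment
    std::size_t threshold = 1; // default threshold
};

// Hypothetical inputs: envAlignment is the parsed OPENPMD_HDF5_ALIGNMENT
// value; configAlignment and configThreshold stand for values from an
// "hdf5" config block. Unset sources are std::nullopt.
inline AlignmentSettings resolveAlignment(
    std::optional< std::size_t > envAlignment,
    std::optional< std::size_t > configAlignment,
    std::optional< std::size_t > configThreshold )
{
    AlignmentSettings s;
    if( envAlignment )
        s.alignment = *envAlignment;
    // assumption: a config value overrides the environment variable
    if( configAlignment )
        s.alignment = *configAlignment;
    if( configThreshold )
        s.threshold = *configThreshold;
    assert( s.alignment >= 1 );

    // Applied last, so it covers the env var and the config alike:
    // align all (no threshold) if only the alignment is set.
    if( s.alignment > 1 && s.threshold == 1 )
        s.threshold = 0;
    return s;
}
```

Because the final check runs after all sources are read, `resolveAlignment( 4194304, std::nullopt, std::nullopt )` ends up with a threshold of 0 even though no config block is present.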
