Problem with loading XTC offsets in parallel #1988

@mkhoshle

Description

Hello all,

I have an MPI job with 48 trajectory segments in XTC format (newtraj_0.xtc, ..., newtraj_47.xtc). To obtain 48 segments, one can take a single small XTC file and copy it 48 times. I did not put all the segments in the repository because their total size is large (30 GB). I repeat the job several times to collect statistics.
In my MPI job I use 48 processes, and each process copies one of the segments (e.g. newtraj_6.xtc for rank 6) to another folder. The copy is there to avoid cache effects. After each rank finishes copying, it creates an index (offset) file for its own segment by calling print(len(mda.Universe(PSF, newtraj_rankID.xtc).trajectory)). A barrier then synchronizes all processes.

I also tested whether the error is caused by a corrupted offset file. Before creating the universe I run the following, and all ranks pass it successfully,

offsets_ = []
offsets = glob.glob(os.path.abspath(os.path.normpath(os.path.join(os.getcwd(),'files/.*.xtc_offsets.npz'))))
for offset in offsets:
    offsets_.append(mda.coordinates.XDR.read_numpy_offsets(offset))

but some of them fail when creating the universe, i.e. here:

u = mda.Universe(PSF, XTCs)
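
A check that does not depend on MDAnalysis could help narrow this down. The sketch below assumes the offsets files are ordinary .npz archives, i.e. zip files as written by numpy.savez, and uses only the standard library to test whether each one is intact. The helper name find_unreadable_offsets is hypothetical, and the code is Python 3 only.

```python
import zipfile


def find_unreadable_offsets(paths):
    """Return the paths that are not intact zip archives (hypothetical helper)."""
    bad = []
    for path in paths:
        try:
            with zipfile.ZipFile(path) as archive:
                # testzip() returns the name of the first member with a
                # bad CRC or header, or None if every member checks out.
                if archive.testzip() is not None:
                    bad.append(path)
        except (zipfile.BadZipFile, OSError):
            # Empty, truncated, missing, or otherwise unreadable file.
            bad.append(path)
    return bad
```

Each rank could call this on the offsets list from the snippet above right before creating the universe and print its rank together with the result, which would show whether the file is already incomplete at that moment on that node.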

The following avoids the error, but some ranks then take much longer than others because they fail to load the offset files and have to rebuild them.

try:
    u = mda.Universe(PSF, XTC)
except IOError:
    u = mda.Universe(PSF, XTC, refresh_offsets=True)
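
A possible refinement of this workaround, assuming the failure is transient (for example, an offsets file that is not yet fully visible on the shared filesystem, which is only a guess), is to retry the normal load a few times before paying for a full offset rebuild. The helper load_with_retry and its parameters are hypothetical and not part of MDAnalysis.

```python
import time


def load_with_retry(load, fallback, retries=3, delay=1.0, errors=(IOError,)):
    """Call load(); on failure wait and retry, then call fallback()."""
    for attempt in range(retries):
        try:
            return load()
        except errors:
            # Sleep only between attempts, not after the last failure.
            if attempt < retries - 1:
                time.sleep(delay)
    # Every attempt failed, so use the expensive path.
    return fallback()


# Usage with the names from the snippet above (not executed here):
# u = load_with_retry(
#     lambda: mda.Universe(PSF, XTC),
#     lambda: mda.Universe(PSF, XTC, refresh_offsets=True),
# )
```

If the retries never succeed, this behaves like the try/except above, only slower by the accumulated delay, so it is worth trying only if the corruption turns out to be temporary.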

Expected behavior
I expect each rank to create its index file and then to create the universe successfully through the chain reader.
Note:
This error happens only when I run the code on more than one node. On a single workstation (node) it works fine. Each rank writes its index file independently, so there should be no overlap. I also tested the code with only rank 0 creating the indices, but that does not work either. As far as I can tell, the offset files could only be corrupted by a race condition. Is there any operation in MDAnalysis, such as creating the universe or getting n_frames, that leads to rewriting the index file?
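
For illustration only, here is a minimal sketch of how a writer can keep readers from ever seeing a half-written file: write to a temporary file in the same directory, then rename it over the target. This is a generic pattern, not a description of how MDAnalysis writes its offsets; the helper write_atomically is hypothetical, and the atomicity of the rename on a network filesystem is an assumption that may not hold everywhere.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write bytes to path via a temporary file and a rename (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives next to the target so the rename does not
    # cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # On POSIX, os.replace swaps the file in as a single rename.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If several ranks wrote the same index file without a scheme like this, a reader on another node could open it while it is still incomplete, which would be consistent with a failure that appears only across nodes.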

Actual behavior
Some ranks fail due to the following error:

Traceback (most recent call last):
  File "/oasis/scratch/comet/mkhoshle/temp_project/MPI-traj-split/main1-ga4py-updated-rmsd-chain-reader.py", line 73, in <module>
    u = mda.Universe(PSF, XTCs)
  File "/home/mkhoshle/miniconda2/envs/ga4py/lib/python2.7/site-packages/MDAnalysis/core/universe.py", line 305, in __init__
    self.load_new(coordinatefile, **kwargs)
  File "/home/mkhoshle/miniconda2/envs/ga4py/lib/python2.7/site-packages/MDAnalysis/core/universe.py", line 535, in load_new
    self.trajectory = reader(filename, **kwargs)
  File "/home/mkhoshle/miniconda2/envs/ga4py/lib/python2.7/site-packages/MDAnalysis/coordinates/chain.py", line 123, in __init__
    for filename in self.filenames]
  File "/home/mkhoshle/miniconda2/envs/ga4py/lib/python2.7/site-packages/MDAnalysis/coordinates/core.py", line 83, in reader
    return Reader(filename, **kwargs)
  File "/home/mkhoshle/miniconda2/envs/ga4py/lib/python2.7/site-packages/MDAnalysis/coordinates/XDR.py", line 144, in __init__
    self._load_offsets()
  File "/home/mkhoshle/miniconda2/envs/ga4py/lib/python2.7/site-packages/MDAnalysis/coordinates/XDR.py", line 183, in _load_offsets
    data = read_numpy_offsets(fname)
  File "/home/mkhoshle/miniconda2/envs/ga4py/lib/python2.7/site-packages/MDAnalysis/coordinates/XDR.py", line 84, in read_numpy_offsets
    return {k: v for k, v in six.iteritems(np.load(filename))}
  File "/home/mkhoshle/miniconda2/envs/ga4py/lib/python2.7/site-packages/numpy/lib/npyio.py", line 429, in load
    "Failed to interpret file %s as a pickle" % repr(file))
IOError: Failed to interpret file '/oasis/scratch/comet/mkhoshle/temp_project/MPI-traj-split/files/.newtraj_41_1.xtc_offsets.npz' as a pickle

Code to reproduce the behavior
Here is the simple code to reproduce the problem:

import numpy as np
import MDAnalysis as mda
from shutil import copyfile
import glob, os
import mpi4py
from mpi4py import MPI

# mpi4py initializes MPI when mpi4py.MPI is imported, so no explicit MPI.Init() call is needed
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

j = 1    # j is the repeat number
PSF = os.path.abspath(os.path.normpath(os.path.join(os.getcwd(),'files/adk4AKE.psf')))
longXTC = os.path.abspath(os.path.normpath(os.path.join(os.getcwd(),'traj_data_{}'.format(size),'newtraj_{}.xtc'.format(rank))))
longXTC1 = os.path.abspath(os.path.normpath(os.path.join(os.getcwd(),'files/newtraj_{}_{}.xtc'.format(rank,j))))

copyfile(longXTC, longXTC1)
print(len(mda.Universe(PSF, longXTC1).trajectory))

comm.Barrier()
XTCs = glob.glob(os.path.abspath(os.path.normpath(os.path.join(os.getcwd(),'files/*.xtc'))))
u = mda.Universe(PSF, XTCs)

# mpi4py finalizes MPI automatically at interpreter exit, so no explicit MPI.Finalize() call is needed

Any feedback and comment will be appreciated.

Current version of MDAnalysis

  • Which version are you using? (0.17.0)
  • Which version of Python (Happens with both Python 2 & 3)
  • Which operating system? (CentOS release 6.7)
