Skip to content

Make H5MD serialize with pickle#2894

Merged
orbeckst merged 15 commits intoMDAnalysis:developfrom
edisj:issue-2890-h5md-pickle
Aug 12, 2020
Merged

Make H5MD serialize with pickle#2894
orbeckst merged 15 commits intoMDAnalysis:developfrom
edisj:issue-2890-h5md-pickle

Conversation

@edisj
Copy link
Member

@edisj edisj commented Aug 10, 2020

Fixes #2890

Changes made in this Pull Request:

  • Added H5PYPicklable class to H5MDReader

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@pep8speaks
Copy link

pep8speaks commented Aug 10, 2020

Hello @edisj! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 572:80: E501 line too long (82 > 79 characters)
Line 814:80: E501 line too long (87 > 79 characters)

Comment last updated at 2020-08-11 22:53:49 UTC

@edisj
Copy link
Member Author

edisj commented Aug 10, 2020

Hi @orbeckst, hopefully this fixes the pickling issues. I followed the code @yuxuanzhuang gave in #2890 which does allow the tests to pass. I modified it a bit to work with comm arguments as well and added documentation.

It seems to work with some testing:
image
as in the h5py.File object seems to have all the data in it that it should, but I'm not entirely confident with how pickling works. Please leave any comments/criticisms for changes

@codecov
Copy link

codecov bot commented Aug 10, 2020

Codecov Report

Merging #2894 into develop will increase coverage by 5.67%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2894      +/-   ##
===========================================
+ Coverage    87.20%   92.87%   +5.67%     
===========================================
  Files          167      187      +20     
  Lines        21744    24585    +2841     
  Branches      3186     3192       +6     
===========================================
+ Hits         18961    22834    +3873     
+ Misses        2258     1705     -553     
+ Partials       525       46     -479     
Impacted Files Coverage Δ
package/MDAnalysis/coordinates/H5MD.py 94.55% <100.00%> (+70.41%) ⬆️
package/MDAnalysis/topology/tpr/obj.py 96.96% <0.00%> (-3.04%) ⬇️
package/MDAnalysis/core/topology.py 100.00% <0.00%> (ø)
package/MDAnalysis/analysis/polymer.py 100.00% <0.00%> (ø)
package/MDAnalysis/coordinates/MMTF.py 100.00% <0.00%> (ø)
package/MDAnalysis/coordinates/NAMDBIN.py 100.00% <0.00%> (ø)
core/__init__.py 100.00% <0.00%> (ø)
package/MDAnalysis/analysis/legacy/__init__.py 0.00% <0.00%> (ø)
transformations/__init__.py 100.00% <0.00%> (ø)
__init__.py 92.50% <0.00%> (ø)
... and 116 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 563995c...1966ee7. Read the comment docs.

@yuxuanzhuang
Copy link
Contributor

The Travis error can be solved by adding mock, like in chemfiles:

try:
    import h5py
except ImportError:
    HAS_H5PY = False

    # Allow building documentation even if h5py is not installed
    import imp

    class MockH5pyFile:
        pass
    h5py = imp.new_module("h5py")
    h5py.File = MockH5pyFile

else:
    HAS_H5PY = True

Copy link
Contributor

@yuxuanzhuang yuxuanzhuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the file link to the end of package/doc/sphinx/source/documentation_pages/coordinates/pickle_readers.rst

driver = self._driver
comm = self._comm
except AttributeError: # is this error necessary?
driver = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will that be a problem if self._driver exists but self._comm not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right I think it would have been a problem. self._driver should have been self.driver as well, since it's pulling the driver from the file object, right? The pickled file wasn't returning the right driver with self._driver, but does with self.driver

@orbeckst
Copy link
Member

@yuxuanzhuang thank you for reviewing.

@edisj please ping me when all tests are passing and when you have addressed all of @yuxuanzhuang 's comments. I can then do a final review + approval.

@jbarnoud jbarnoud mentioned this pull request Aug 10, 2020
5 tasks
@orbeckst orbeckst self-assigned this Aug 10, 2020
@orbeckst orbeckst added the NSF REU NSF Research Experience for Undergraduates project label Aug 10, 2020
except AttributeError:
driver = None

return {'name': self.filename,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused: If the comm is not included in __getstate__ then how can it be used in __setstate__?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you had it included in b29e675 but then removed again.

Did you test that a MPI.COMM_WORLD could be pickled on its own, i.e.,

from mpi4py import MPI
import pickle

comm = MPI.COMM_WORLD
comm_p = pickle.loads(pickle.dumps(comm))

assert comm == comm_p

or something similar?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does not seem to work
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t have a working mpi4py in my desktop. But it seems that it cannot be pickled natively.
This might help meanwhile https://bitbucket.org/mpi4py/mpi4py/issues/104/pickling-of-mpi-comm

self.__init__(name=state['name'],
mode=state['mode'],
driver=state['driver'],
comm=state['comm'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where will state['comm'] come from?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was taking the state dict as argument from

def __getstate__(self):
state = self.__dict__.copy()
del state['_particle_group']
return state

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I don't understand how pickling works but I assumed that the goal is to make each object picklable by itself, i.e., ,H5PYPicklableFile by itself can be pickled. That's the first thing to check (and add a test for!!). Once this works, H5MDReader can just treat it as one item in its own class dict, as it would a numpy array or a basic Python object. It would be bad program design if H5PYPicklableFile needed the help of H5MDReader because then you could never use it independently from H5MDReader.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that makes sense, thanks for the clarification. I'm doing some testing with comm on the parallel build I have on spudda

@orbeckst
Copy link
Member

@edisj please see #2897 and switch imp to using importlib.

@edisj
Copy link
Member Author

edisj commented Aug 11, 2020

I'm making a lot of commits at the moment because I'm pulling them to my repository on workstation for testing with parallel h5py

@orbeckst
Copy link
Member

Don't worry, we can squash it all when merging.

@edisj
Copy link
Member Author

edisj commented Aug 11, 2020

@orbeckst The latest commit successfully pickles h5py.File objects with driver="mpio" and comm=MPI.COMM_WORLD, although I'm still unsure if the correct communicator is being unpickled. I started a conversation in the h5py group and will continue to poke around online in the meantime.

image

@yuxuanzhuang
Copy link
Contributor

I am trying to build a parallel h5py system. You could add a few tests (specifically for H5PYPicklable) inside testsuite/MDAnalysisTests/utils/test_pickleio.py to increase codecov. (add skipif marks for them)

@yuxuanzhuang
Copy link
Contributor

I am having trouble to install all that stuffs to make h5py work. But I assume if it is hard to find the "right" comm, you might overwrite the __init__ function to save the input comm with something like

def __init__(self, ...comm):
   self._comm = comm
   super().__init__(...)

@orbeckst
Copy link
Member

The test failures are pretty annoying for all other PRs. Can we make H5PYPicklable work for serial right now? It would be ok to raise a TypeError (or whatever else is normally raised) if a communicator is present. Add a test that checks that comm != None raises the error.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest you make this PR work in serial so that develop builds cleanly again.

  • Make H5PYPicklable fail to pickle for parallel (TypeError)
  • Document the limitation with a note in the docs.
  • Add tests
    • test for serial H5PYPicklable (see @yuxuanzhuang 's comment)
    • test that parallel pickle raises the TypeError (you don't need mpi4py for the tests, just mock it
       from unittest.mock import MagicMock
       MPI_COMM_WORLD = MagicMock()
       f = H5PYPicklable(fname, driver="mpio", comm=MPI_COMM_WORLD)
       with pytest.raises(TypeError):
              _ = pickle.dumps(f)
  • raise a new issue for pickling parallel h5py files

def __getstate__(self):
driver = self.driver
if driver == 'mpio':
comm = self.id.get_access_plist().get_fapl_mpio()[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If MPI.COMM_WORLD cannot be natively pickled then I don't think that this will work.


def __getstate__(self):
driver = self.driver
if driver == 'mpio':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to move the PR along, make __getstate__ fail (with TypeError) when you have driver mpio and a comm object.

Add a test that specifically checks for the exception being raised. Then open a new issue to make pickling of parallel h5py files work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test fails with ModuleNotFoundError because h5py tries to import mpi4py when driver='mpio'. Is there any way around this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which test are we talking about?

Copy link
Member Author

@edisj edisj Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like this:

@pytest.mark.skipif(not HAS_H5PY, reason="h5py not installed")
def test_H5MD_parallel_pickle():
    from unittest.mock import MagicMock
    MPI_COMM_WORLD = MagicMock()
    h5md_io = H5PYPicklable(H5MD_xvf, 'r',
                            driver="mpio",
                            comm=MPI_COMM_WORLD)
    h5md_io_pickled = pickle.loads(pickle.dumps(h5md_io))

will fail before the pickling

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try mocking the mpi4py module, too. The idea is to provide a thing (the mock) that just pretends to be there but doesn't really do anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardjgowers you have been dubbed in other PR "The King of Mocks" – any suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it cannot be fixed as it's checked inside h5py code. (https://github.com/h5py/h5py/blob/8cc83410995996330c85867d0e56cbe922b709f8/h5py/_hl/files.py#L43)

Maybe just excluding these lines from coverage for now?

Copy link
Member

@orbeckst orbeckst Aug 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be a way to trick h5py into believing that it was built with mpi support (one could patch the call to h5.get_config() https://github.com/h5py/h5py/blob/8cc83410995996330c85867d0e56cbe922b709f8/h5py/_hl/files.py#L24 or just the mpi variable) but that's not worth it for right now so I concur with @yuxuanzhuang to just skip the test (EDIT: i.e., not add a test and exclude lines from coverage). Add a comment summarizing the current issues and move forward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orbeckst just pushed commit with the lines excluded from coverage

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. We will continue the question about pickling MPI_COMMUNICATOR in #2865 .

@yuxuanzhuang could you please also review this PR?

@orbeckst
Copy link
Member

@yuxuanzhuang could you please review this PR within the next day? Sorry for asking for a quick turnaround but @edisj 's project is coming to an end and we would like to have it at a good stopping point, just in case he has other more urgent things to do.

You should be able to leave a approve/request changes review. I will wait with merging until I hear from you.

@orbeckst orbeckst requested a review from yuxuanzhuang August 12, 2020 00:52
Copy link
Contributor

@yuxuanzhuang yuxuanzhuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@orbeckst orbeckst merged commit a9e749c into MDAnalysis:develop Aug 12, 2020
@orbeckst
Copy link
Member

Thank you @edisj and @yuxuanzhuang !

@edisj edisj deleted the issue-2890-h5md-pickle branch August 12, 2020 18:49
PicoCentauri pushed a commit to PicoCentauri/mdanalysis that referenced this pull request Mar 30, 2021
* fix MDAnalysis#2890 
* added H5PYPicklable class (works in serial but not with driver="mpio" and MPI comm)
* mock h5py so that the docs build even in the absence of h5py
* tests (anything related to mpi is excluded from coverage)
* update CHANGELOG
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component-Readers defect NSF REU NSF Research Experience for Undergraduates project

Projects

None yet

Development

Successfully merging this pull request may close these issues.

H5MD fails to serialize with pickle

5 participants