WIP: Parallel trajectory analysis #617
mattiafelice-palermo wants to merge 2 commits into MDAnalysis:develop from
Conversation
…ent test analysis (electromagnetism.py), added parallel_jobs.py module, added test for the parallel_jobs.py module
This can be replaced by:

```python
charge_center = (positions * abscharges[:, None]).sum(axis=0) / charge_sum
```

where the weird `[:, None]` makes the smaller array broadcast against the larger one.
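For anyone following along, here is a minimal, self-contained sketch of what that broadcasting does, with toy data standing in for the real arrays (the variable names just mirror the snippet above):

```python
import numpy as np

# Toy data: 3 atoms in 3D with per-atom absolute charges.
positions = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0]])   # shape (3, 3)
abscharges = np.array([1.0, 1.0, 2.0])    # shape (3,)
charge_sum = abscharges.sum()

# abscharges[:, None] has shape (3, 1); NumPy broadcasts it across the
# coordinate axis, so each position row is weighted by its charge.
charge_center = (positions * abscharges[:, None]).sum(axis=0) / charge_sum
print(charge_center)  # [0.25 1.   0.  ]
```

This replaces an explicit Python loop over atoms with a single vectorized expression, which is both shorter and much faster for large arrays.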
Thanks for the tip! Unfortunately I'm quite new to both python and numpy (I mainly work with Fortran...) so I apologize for the awkward code here and there.
No worries, Fortran is my first language too
The concurrency problem isn't surprising as currently each trajectory reader fills the coordinate arrays in place and AtomGroups then just have a view onto this array. So for example...
When job1 does its second read of coordinates, the frame has already been changed by job2, so the data isn't what it should be for job1. I'm not sure what the easiest or best solution to this is, but it's something we want to get working. I'll try to get a quick hackish fix working for now. Apart from that, I'd prefer if the parallel run method was just a method attached to the analysis object:

```python
td = TotalDipole(etc etc etc)

td.run()
# OR
td.parallel_run()
# OR
td.run(parallel=True)
```
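The "views onto a shared buffer" problem described above can be illustrated without any MDAnalysis internals at all. This is only an analogy, not the actual reader code: a single coordinate buffer is filled in place frame by frame, and every consumer holds a view onto it, so one consumer advancing the frame silently changes what the others see:

```python
import numpy as np

# Three fake "frames" and one shared coordinate buffer that a reader
# fills in place, like the trajectory readers described above.
frames = [np.full((2, 3), float(i)) for i in range(3)]
coords = np.empty((2, 3))

def read_frame(i):
    coords[:] = frames[i]  # in-place fill; all views see the change

# Two "jobs" each hold a view onto the same buffer (like AtomGroups).
view_job1 = coords[:]
view_job2 = coords[:]

read_frame(0)
snapshot = view_job1.copy()   # what job1 believes frame 0 contains
read_frame(1)                 # job2 advances the reader...
print(np.array_equal(view_job1, snapshot))  # False: job1's data changed
```

With multiple processes racing to call the equivalent of `read_frame`, a job can end up computing on coordinates from whatever frame was loaded last, which matches the random failures reported below.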
Ok, I understand, that makes sense. So when it works (surprisingly very often), it is because the coordinate array is still there and wasn't modified by another child process. Let me know if you can come up with a hack for that; unfortunately I don't know the internals of the program well enough to be of help. As for attaching the method to AnalysisBase, I like the syntax you propose, but my approach of using a "Manager" class to manage the jobs was an attempt to optimize the computation. By batching different jobs together, each job could in theory access the coordinates of the frame from the array stored in memory without having to re-read them from disk each time. Access from an array in memory is definitely faster than reading coordinates from a file on disk for each job, and would make the analysis much faster. So my question is: during the following cycle:

```python
for i, ts in enumerate(self._trajectory[start:stop:step]):
    for job in job_list:
        job._single_frame()
```

when executing each `_single_frame()` method, are positions, charges, masses, etc. accessed from temporary arrays stored in memory, or are they re-read from file each time for each job in `job_list`? Anyway, it could be a good idea to have both the method on the analysis object and the "Manager" class, what do you think?
So in your code snippet above, you're correct that with your method each frame would be read only once. But I think reading is usually quite fast compared to the actual analysis (benchmark this and check!). Another problem is that you're assuming each analysis task takes a similar amount of time; if one takes much longer, it will slow down all the other tasks in the manager object.
The two last lines should be (with reduced indentation):

```python
self.filename = filename
self.time = []
```

and so on for the other lines (if needed).
Ah right, yeah, that does sound like it'll work nicely. If the code is an extra commit after this one, just push to your repo and it'll appear here; if it's a big rewrite, then just make a new PR.
@mattihappy by the way, we're putting together a release soon, so if you want to create a separate pull request for just the electromagnetism module in serial, we could include that.
@richardjgowers I'm not sure, it's still a very basic module and not really optimized; I wrote it just so that a test could be run. Truth be told, I was planning to use it to build an analysis class for the computation of the dielectric constant, so perhaps it's better to wait until the module is a bit more complete (and performs better). Probably this isn't the right place to discuss it, but shouldn't the dipole moment (and maybe also the total charge?) of a group of atoms be included among the basic properties/methods of an AtomGroup?
Yeah, it could and should be a method of an AtomGroup. That said, we're actually moving how things like this are loaded into a more flexible system: if an input file defines masses, then the corresponding property is attached to the AtomGroup automatically, and if you wanted to add dipole, it would plug into that same system.
With reference to the MDAnalysis Discussion --> link
Branch on my repository --> link
I added a new module, parallel_jobs.py, in MDAnalysis/analysis, which contains a class ParallelProcessor that should make it possible to distribute the computations of AnalysisBase objects across multiple processes on multicore machines.
A typical setup for a parallel analysis should look as follows:
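A sketch of what such a setup might look like, based on the classes named in this PR; the exact ParallelProcessor signature and the `conclude()` entry point are assumptions, since the module isn't part of a released MDAnalysis yet:

```python
# Hypothetical usage sketch; signatures are assumed from the PR description.
import MDAnalysis as mda
from MDAnalysis.analysis.parallel_jobs import ParallelProcessor
from MDAnalysis.analysis.electromagnetism import TotalDipole
from MDAnalysisTests.datafiles import PSF, DCD

universe = mda.Universe(PSF, DCD)
selection = universe.select_atoms("all")

# Analysis objects are created without a trajectory; the trajectory is
# supplied later, when the ParallelProcessor is set up.
jobs = [TotalDipole(selection=selection)]

processor = ParallelProcessor(jobs, universe.trajectory, threads=4)
processor.conclude()  # run all jobs in parallel and merge the results
```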
I made some minor modifications to the AnalysisBase class so that it works with ParallelProcessor.
The first one is to allow an analysis object to be initialized without specifying a trajectory, which is instead supplied later, when initializing the ParallelProcessor.
The second one is to prevent the trajectory object from being pickled when child processes are spawned via the Process class of the multiprocessing module:
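The standard way to keep an attribute out of the pickle is to drop it in `__getstate__`; this is only an illustration of the technique, not the PR's actual code (class names are invented, and a deliberately unpicklable stand-in plays the role of the trajectory reader):

```python
import pickle

class AnalysisSketch(object):
    """Illustrative stand-in for an analysis object holding a reader."""
    def __init__(self, trajectory=None):
        self._trajectory = trajectory
        self.results = []

    def __getstate__(self):
        # multiprocessing pickles objects sent to child processes, so
        # strip the unpicklable trajectory reader from the state.
        state = self.__dict__.copy()
        state['_trajectory'] = None
        return state

class FakeReader(object):
    """Simulates an unpicklable trajectory reader."""
    def __reduce__(self):
        raise TypeError("trajectory readers cannot be pickled")

job = AnalysisSketch(trajectory=FakeReader())
clone = pickle.loads(pickle.dumps(job))  # succeeds: reader was stripped
print(clone._trajectory)  # None
```

Each child process then has to reopen (or be handed) its own reader, which is also what avoids the shared-buffer problem discussed above.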
In order to work with the ParallelProcessor class, analysis objects should define an `__iadd__` method, as at the end of the analysis the results of the child processes are merged into a single object with the `+=` operator.
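A minimal sketch of that merge step, with invented class and attribute names (the real analysis objects in the PR would merge whatever per-frame quantities they accumulate):

```python
class DipoleResults(object):
    """Illustrative partial-results holder for one worker process."""
    def __init__(self):
        self.dipoles = []  # per-frame results accumulated by one process

    def __iadd__(self, other):
        # Merge another process's partial results into this object.
        self.dipoles.extend(other.dipoles)
        return self  # __iadd__ must return the merged object

part1, part2 = DipoleResults(), DipoleResults()
part1.dipoles = [0.1, 0.2]   # frames handled by process 1
part2.dipoles = [0.3]        # frames handled by process 2

part1 += part2               # merge as the ParallelProcessor would
print(part1.dipoles)         # [0.1, 0.2, 0.3]
```

Note that `__iadd__` must return `self` (or a replacement object), otherwise `+=` rebinds the left-hand name to `None`.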
The class is already functional, but it seems to be affected by problems when a high number of concurrent reads happen, so any input on this issue is welcome. To reproduce the issue, which happens randomly in my tests, you need a combination of running an analysis on a high number of processes and using a small selection of molecules (I guess because this leads to very short and thus frequent read operations on the file).
I created an example module with an analysis class TotalDipole which is compatible with the ParallelProcessor class. Running it on the DCD test data from MDAnalysisTests as a single analysis object and through the ParallelProcessor class (four processes, on my laptop) gives an execution time of 2.7 seconds for the former and 0.9 seconds for the latter. Of course, with more CPU-intensive jobs on larger trajectories, the performance gain is even higher.
Let me know if you think this could be added to the MDAnalysis package, if you have any suggestions to improve the code or the interface, or any ideas on how to fix the problem of concurrent reads.