replace res with two kinds of output, accumulator and timeseries#55
kain88-de merged 8 commits into MDAnalysis:master
Conversation
I do not understand the motivation for this change. In what respect is the size of results reduced? It looks to me like it would make it more complicated to write an analysis, because one has to understand and decide whether they want a timeseries or an accumulator.

Note: I'm not against this change and I appreciate your contribution. I would simply like to understand it before I merge.
pmda/parallel.py (outdated)

```python
if accum == []:
    accum = res_sf['accumulator']
else:
    accum += res_sf['accumulator']
```
This if-else clause is not necessary; `accum.extend(res_sf)` would do the same in both cases. I'm pretty sure just `+=` would also work because you create the empty list first.
Ah ok, so I misinterpreted your intent on this line. You are using the empty list as a zero or None value:

```python
if accum == []:
    accum = res   # convert accum from an empty list to a numpy array
else:
    accum += res  # here we add two numpy arrays
```

Is that correct?
Yes, that's correct.
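For the record, the pattern can be exercised standalone (a minimal sketch with made-up input data; `accumulate` is just an illustrative name, not PMDA API):

```python
import numpy as np

def accumulate(accum, res):
    """Fold one frame's result array into a running total.

    An empty list serves as the 'zero' sentinel, as in the code
    under review: the first frame replaces it with a real array.
    """
    if isinstance(accum, list) and len(accum) == 0:
        return res.copy()   # first frame: adopt the array
    return accum + res      # later frames: element-wise addition

accum = []
for frame_result in [np.array([1.0, 2.0]), np.array([3.0, 4.0])]:
    accum = accumulate(accum, frame_result)
print(accum)  # [4. 6.]
```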
@kain88-de Currently,
As @VOD555 says, there are two different ways in which one might want to collect data: as a timeseries or as a histogram (accumulator). In principle one could create the timeseries and then histogram it in `_conclude`.

If you have an alternative suggestion we're all ears – one reason for this PR is to discuss and prototype.
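To make the trade-off concrete (an illustrative sketch with random data, not PMDA code): a timeseries keeps one result per frame, while an accumulator folds every frame into a single fixed-size array, e.g. a histogram:

```python
import numpy as np

rng = np.random.default_rng(42)
frames = [rng.random(1000) for _ in range(5)]  # fake per-frame distances
bins = np.linspace(0.0, 1.0, 11)

# timeseries style: one histogram per frame, shape (n_frames, n_bins)
timeseries = np.stack([np.histogram(f, bins=bins)[0] for f in frames])

# accumulator style: a single running histogram, shape (n_bins,)
accum = np.zeros(len(bins) - 1, dtype=int)
for f in frames:
    accum += np.histogram(f, bins=bins)[0]

# for an RDF-like analysis the accumulator carries the same
# information at 1/n_frames of the memory
assert np.array_equal(accum, timeseries.sum(axis=0))
```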
OK, I do understand the intent better now. You are saying we have the wrong default; the current reduce step is the timeseries one. To implement this, line 325 in a5aa576 would change to

```python
res = self.reduce(res, self._single_frame(ts, agroups))
```

Note: I'm not sure about the stack command. One should check this!

edit: updated code example to make it clearer
Kind of. I would say that conceptually we do the reduce of map-reduce in `_conclude`.

I like your idea of defining a reduce function. I would actually be explicit and add a new API method:

```python
class ParallelAnalysis():
    ...
    @staticmethod
    def _reduce(*args):
        """implement the reduction of a single frame"""
        # timeseries:
        return np.stack(args)
        # accumulator
        return np.sum(args)
```

and then use

```python
res = self._reduce(res, self._single_frame(ts, agroups))
```

(I would make it a static method to make clear that it should only work on the input and not use anything in the class.)

If the reduce is defined this way, `_conclude` could become:

```python
def _conclude(self):
    self.result = self._reduce(*self._result)
```
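A standalone sketch of the two `_reduce` flavors from this proposal (the driver loop and the `None` sentinel are my additions, not PMDA API):

```python
import numpy as np

class TimeseriesReduce:
    @staticmethod
    def _reduce(res, result_single_frame):
        # append-style reduce: keep every frame's result
        res.append(result_single_frame)
        return res

class AccumulatorReduce:
    @staticmethod
    def _reduce(res, result_single_frame):
        # fold-style reduce: keep only a running sum
        if res is None:                  # first frame
            return result_single_frame.copy()
        return res + result_single_frame

frames = [np.array([i, i + 1], dtype=float) for i in range(3)]

ts, acc = [], None
for f in frames:
    ts = TimeseriesReduce._reduce(ts, f)
    acc = AccumulatorReduce._reduce(acc, f)

print(np.stack(ts).shape)  # (3, 2)
print(acc)                 # [3. 6.]
```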
Good idea!

Yes, being explicit is good. We should then better document the scheme we currently use for parallelization.
Maybe we just need to have a mixin which defines what sort of reduce happens...
Just to clarify: my idea was that when you write a class, you write your own `_single_frame` and `_reduce`, because both can be rather specific. We should have a default that maintains the old "timeseries" behavior. Maybe something like the following:

```python
class ParallelAnalysis():
    ...
    @staticmethod
    def _reduce(*args):
        """ 'append' action for a time series

        Note: all args *must* be of type list so that they create
        one bigger list. The alternative would be to do something
        like

            return args[0].extend(args[1:])
        """
        return sum(args)
```

(I think
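One caveat worth noting with `sum(args)` on lists (my observation, not from the thread): the built-in `sum` starts from `0`, so concatenating lists needs an explicit empty-list start value:

```python
args = ([1, 2], [3, 4], [5])

# sum(args) raises TypeError: int + list is not defined
try:
    sum(args)
    raised = False
except TypeError:
    raised = True

# with [] as the start value, sum concatenates the lists
combined = sum(args, [])
print(raised, combined)  # True [1, 2, 3, 4, 5]
```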
What `np.stack()` does is to join a sequence of arrays along a new axis. One example:

```python
>>> a = np.array([1, 2, 3])
>>> b = np.array([2, 3, 4])
>>> np.stack((a, b))
array([[1, 2, 3],
       [2, 3, 4]])
```

The input should be one array, and my suggestion is:

```python
class ParallelAnalysis():
    ...
    def _reduce(self, res, result_single_frame):
        res.append(result_single_frame)
        return res
```

I've tested this, and it works well with the functions we currently have.
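With this append-style `_reduce`, each parallel block ends up with its own list; here is a sketch (my illustration of the scheme, with fake two-block data) of how per-block lists can be recombined into one timeseries afterwards:

```python
import numpy as np

def _reduce(res, result_single_frame):
    res.append(result_single_frame)
    return res

# a fake 4-frame trajectory split into two work blocks
block1 = [np.array([0.0]), np.array([1.0])]
block2 = [np.array([2.0]), np.array([3.0])]

res1, res2 = [], []
for f in block1:
    res1 = _reduce(res1, f)
for f in block2:
    res2 = _reduce(res2, f)

# recombine the per-block results into one (n_frames, ...) array
timeseries = np.concatenate([np.stack(res1), np.stack(res2)])
print(timeseries.ravel())  # [0. 1. 2. 3.]
```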
Using staticmethods or inheritance through mixins should both work. For the accumulator case one could have

```python
@staticmethod
def _reduce(a, b):
    return parallel.accumulate(a, b)
```

The accumulate function can be what @VOD555 has currently written.
pmda/contacts.py (outdated)

```python
        y = y[0]
        return y

    def _reduce(self, res, result_single_frame):
```
This shouldn't be needed. It can take the default function from the base class.
```python
    def _reduce(self, res, result_single_frame):
        """ 'add' action for an accumulator"""
        if res == []:
```
The pythonic way is to declare the variable as `None`. My preferred way is to precalculate the size of `res` and initialize it as `res = np.zeros(shape)`. This way the if-else clause isn't needed.
Your problem is that `res` is initialized in `ParallelAnalysisBase`, right?
Yes. I'm thinking about some way like:

```python
class ParallelAnalysis():
    ...
    def _initial_res(shape):
        res = np.zeros(shape)
        return res
    ...
    self._res_settings = {'shape': shape}

def _dask_helper():
    res = self._initial_res(**self._res_settings)
    ...
```
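Filled out into something runnable (class and method names follow the sketch above; everything else, including the driver loop, is my assumption):

```python
import numpy as np

class Sketch:
    def __init__(self, shape):
        # record the settings so every worker can build its own res
        self._res_settings = {'shape': shape}

    @staticmethod
    def _initial_res(shape):
        # preallocate the accumulator: no if-else sentinel needed
        return np.zeros(shape)

    @staticmethod
    def _reduce(res, result_single_frame):
        res += result_single_frame
        return res

    def run(self, frames):
        # stands in for the per-block loop inside _dask_helper
        res = self._initial_res(**self._res_settings)
        for f in frames:
            res = self._reduce(res, f)
        return res

s = Sketch(shape=2)
print(s.run([np.array([1.0, 1.0]), np.array([2.0, 3.0])]))  # [3. 4.]
```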
pmda/parallel.py (outdated)

```python
        return np.asarray(res), np.asarray(times_io), np.asarray(
            times_compute), b_universe.elapsed

    def _reduce(self, res, result_single_frame):
```
This should be a static function.
Codecov Report

```
@@           Coverage Diff            @@
##           master      #55    +/-  ##
=========================================
- Coverage   99.25%   97.46%   -1.8%
=========================================
  Files           7        7
  Lines         270      276     +6
  Branches       28       27     -1
=========================================
+ Hits          268      269     +1
- Misses          1        4     +3
- Partials        1        3     +2
```

Continue to review full report at Codecov.
kain88-de left a comment:

Thanks for adding the documentation. I have some suggestions to improve it further.
```python
        else:
            # Add two numpy arrays
            res += result_single_frame
        return res
```
Thanks for adding the documentation. Here is my suggestion for an updated version. I use your structure but try to make the text read more fluently.

```python
# NOT REQUIRED
# Called for every frame. ``res`` contains all the results
# before the current time step, and ``result_single_frame`` is
# the result of self._single_frame for the current time step.
# The return value is the updated ``res``. The default is to
# append results to a python list. This approach is sufficient
# for time-series data.
res.append(result_single_frame)

# This is not suitable for every analysis. To add results over
# multiple frames this function can be overwritten. The default
# value for ``res`` is an empty list. Here we change the type to
# the return type of `self._single_frame`. Afterwards we can
# safely use addition to accumulate the results.
if res == []:
    res = result_single_frame
else:
    res += result_single_frame

# If you overwrite this function *always* return the updated
# ``res`` at the end.
return res
```
To parallelize the analysis ``ParallelAnalysisBase`` separates the trajectory into work blocks containing multiple frames. The number of blocks is equal to the number of available cores or dask workers. This minimizes the number of python processes that are started during a calculation. Accumulation of frames within a block happens in the `self._reduce` function. A consequence when using dask is that adding additional workers during a computation will not result in a reduction of run-time.

I think it would be beneficial to add this text at the beginning of the class documentation to better explain why `_reduce` is needed.
@VOD555 I made my suggested changes myself. You can have a last look over them before merging.
@kain88-de That's very good, thanks.
Changes made in this Pull Request:

- `timeseries` is the same as the old `res`: it just appends the result from each step.
- `accumulator` accumulates the results, so that we can reduce the size of the results of each parallelized task for functions such as RDF.
- The example for `timeseries` is `contacts.py`, and the one for `accumulator` is `rdf`.

PR Checklist