[DOC, MRG] fix/improve documentation of n_jobs for apply_function #11325
Conversation
Can someone with the permissions restart the CircleCI workflow? It looks like it does not restart on close/re-open as Azure does.

Just push an empty commit @mscheltienne to restart CI.

I'm using GitHub Desktop most of the time, which does not yet support empty commits 😞
mne/utils/docs.py (outdated suggestion):

```diff
 .. note:: If ``n_jobs`` > 1, more memory is required as
-          ``len(picks) * n_times`` additional time points need to
+          ``n_jobs * n_times`` additional time points need to
```
Are you certain that this change is correct? The original text implies that full copies of the array are made for each job (which would be unfortunate, but is plausible). I just want to confirm that you checked this before proposing the change.
Very good point, and with all the efforts to save memory left and right, I did not even think about it. Duplicating the entire data array `n_jobs` times would be very unfortunate.
But it looks like it was well designed:
Lines 950 to 965 in aeeb111:

```python
if channel_wise:
    parallel, p_fun, n_jobs = parallel_func(_check_fun, n_jobs)
    if n_jobs == 1:
        # modify data inplace to save memory
        for idx in picks:
            self._data[idx, :] = _check_fun(fun, data_in[idx, :],
                                            **kwargs)
    else:
        # use parallel function
        data_picks_new = parallel(
            p_fun(fun, data_in[p], **kwargs) for p in picks)
        for pp, p in enumerate(picks):
            self._data[p, :] = data_picks_new[pp]
else:
    self._data[picks, :] = _check_fun(
        fun, data_in[picks, :], **kwargs)
```
If I read this correctly, each job receives `data_in[p]`, which contains a single channel.
But I guess in the end `data_picks_new` is a duplicate of the entire array, so the docstring was in fact correct. My bad!
```diff
-          ``n_jobs * n_times`` additional time points need to
+          ``len(picks) * n_times`` additional time points need to
```
Wait... if I'm reading the code you quoted correctly, `data_picks_new` is one array that is the same size as `data_in`, but (crucially) it's outside the for-loop over `picks`. So that suggests that `n_jobs * n_times` (your original change) was in fact correct? But I'm not intimately familiar with the parallel code, maybe @larsoner can clarify for us.
I don't think so. `data_picks_new` is indeed the same shape as `data_in` (except it's a list of `len(picks)` arrays of shape `(n_times,)` instead of a 2D array).
So after the execution of `parallel()` we end up with two arrays of shape `(len(picks), n_times)`: `self._data` (also called `data_in`, but it's the same array) and `data_picks_new`, until garbage collection at the end of the function, when the latter is deleted.
Temporarily, we end up with a second array of shape `(len(picks), n_times)` even if each job receives only one channel (which initially led me to believe the docstring was false).
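For illustration, here is a minimal sketch of that memory pattern using plain joblib rather than MNE's `parallel_func` (the array sizes, the `square` function, and `n_jobs=2` are placeholders, not from the PR): each worker only sees one channel, but the collected results amount to a second `len(picks) * n_times` copy until they are written back and freed.

```python
import numpy as np
from joblib import Parallel, delayed

n_channels, n_times = 8, 10_000
data = np.random.randn(n_channels, n_times)  # stands in for self._data / data_in
picks = range(n_channels)

def square(x):
    # any function returning an array of the same shape as its input
    return x ** 2

# Each job receives a single channel (n_times values)...
results = Parallel(n_jobs=2)(delayed(square)(data[p]) for p in picks)

# ...but `results` now holds len(picks) * n_times extra values in total,
# mirroring `data_picks_new` above, until they are copied back and
# garbage-collected at the end of the function.
for pp, p in enumerate(picks):
    data[p, :] = results[pp]
```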
Ah, you are right. I was briefly thinking it was `n_jobs * len(picks) * n_times` (so much worse!) without realizing I was assuming that. I'll merge then. Thanks @mscheltienne!
* upstream/main:
  fix epochs.plot_image for EMG data (mne-tools#11322)
  fix fontawesome icon display (mne-tools#11328)
  [DOC, MRG] fix/improve documentation of n_jobs for apply_function (mne-tools#11325)
  [MAINT, MRG] fix is_mesa (mne-tools#11313)
  BUG: Fix bug with parallel progress bars (mne-tools#11311)
  BUG: Fix bug with report replacement (mne-tools#11318)
  MAINT: Fix docs (mne-tools#11317)
  Fix typo in changelog (mne-tools#11315)
Minor documentation fix for the `apply_function` methods, e.g. https://mne.tools/dev/generated/mne.io.Raw.html#mne.io.Raw.apply_function -> `n_jobs * n_times` and not `len(picks)`.
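For context, a minimal usage sketch of the method whose docstring note is being discussed (the file name is hypothetical; `apply_function`, `picks`, and `n_jobs` are the actual MNE parameters):

```python
import mne

# hypothetical file; any preloaded Raw object works
raw = mne.io.read_raw_fif("sample_raw.fif", preload=True)

def rectify(data):
    # applied channel-wise: receives and returns a 1D array of shape (n_times,)
    return abs(data)

# With n_jobs > 1, a temporary copy of the picked data
# (len(picks) * n_times values) is held until results are written back.
raw.apply_function(rectify, picks="eeg", n_jobs=2)
```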