
Conversation

@mscheltienne (Member)

Minor documentation fix for the apply_function methods, e.g. https://mne.tools/dev/generated/mne.io.Raw.html#mne.io.Raw.apply_function

If n_jobs > 1, more memory is required as len(picks) * n_times additional time points need to be temporarily stored in memory.

-> should be ``n_jobs * n_times``, not ``len(picks) * n_times``.
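
For context, a minimal usage sketch of the method in question (the file path is hypothetical; apply_function requires preloaded data):

import mne

# hypothetical file; preload=True is required for apply_function
raw = mne.io.read_raw_fif("sample_raw.fif", preload=True)
# with the default channel_wise=True, fun receives one channel at a time
raw.apply_function(lambda x: x * 1e-3, picks="eeg", n_jobs=2)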

@mscheltienne (Member, Author) commented Nov 14, 2022

Could not build wheels for pydata-sphinx-theme

Can someone with the permissions restart the CircleCI workflow? It looks like it does not restart on close/re-open the way Azure does.

@agramfort (Member)

just push an empty commit @mscheltienne to restart CI
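
(For reference, a sketch of the usual commands, assuming the PR branch is checked out locally:)

git commit --allow-empty -m "Empty commit to trigger CI"
git push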

@mscheltienne (Member, Author)

I'm using GitHub Desktop most of the time, which does not yet support empty commits 😞
I was sure I did not have the git CLI configured with my credentials on this computer... but apparently, I was wrong!

@mscheltienne changed the title from "[Doc] fix/improve documentation of n_jobs for apply_function" to "[DOC, MRG] fix/improve documentation of n_jobs for apply_function" on Nov 14, 2022

  .. note:: If ``n_jobs`` > 1, more memory is required as
-           ``len(picks) * n_times`` additional time points need to
+           ``n_jobs * n_times`` additional time points need to
Member
are you certain that this change is correct? The original text implies that full copies of the array are made for each job (which would be unfortunate, but is plausible). I just want to confirm that you checked this before proposing the change.

@mscheltienne (Member, Author)

Very good point, and with all the efforts to save memory left and right, I did not even think about it. Duplicating the entire data array `n_jobs` times would be very unfortunate.

But it looks like it was well designed:

mne-python/mne/io/base.py

Lines 950 to 965 in aeeb111

if channel_wise:
    parallel, p_fun, n_jobs = parallel_func(_check_fun, n_jobs)
    if n_jobs == 1:
        # modify data inplace to save memory
        for idx in picks:
            self._data[idx, :] = _check_fun(fun, data_in[idx, :],
                                            **kwargs)
    else:
        # use parallel function
        data_picks_new = parallel(
            p_fun(fun, data_in[p], **kwargs) for p in picks)
        for pp, p in enumerate(picks):
            self._data[p, :] = data_picks_new[pp]
else:
    self._data[picks, :] = _check_fun(
        fun, data_in[picks, :], **kwargs)

If I read this correctly, each job receives `data_in[p]`, which contains a single channel.
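
For intuition, here is a standalone sketch of the same pattern using joblib directly (the array shape and the doubling function are made up for illustration):

from joblib import Parallel, delayed
import numpy as np

def fun(x):
    # hypothetical channel-wise function
    return x * 2

data_in = np.random.randn(4, 1000)  # stands in for (len(picks), n_times)

# each delayed call receives a single row (one channel), so no worker ever
# holds the full array; the collected results, however, add up to a full
# second copy of the picked data
data_picks_new = Parallel(n_jobs=2)(
    delayed(fun)(data_in[p]) for p in range(len(data_in)))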

@mscheltienne (Member, Author) commented Nov 15, 2022

But I guess in the end `data_picks_new` is a duplicate of the entire array... so the docstring was in fact correct. My bad!

@mscheltienne (Member, Author)

Suggested change:
-           ``n_jobs * n_times`` additional time points need to
+           ``len(picks) * n_times`` additional time points need to

Member

wait... if I'm reading the code you quoted correctly, `data_picks_new` is one array that is the same size as `data_in`, but (crucially) it's outside the for-loop over `picks`. So that suggests that `n_jobs * n_times` (your original change) was in fact correct? But I'm not intimately familiar with the parallel code; maybe @larsoner can clarify for us.

@mscheltienne (Member, Author) commented Nov 15, 2022

I don't think so. `data_picks_new` is indeed the same shape as `data_in` (except it's a list of `len(picks)` arrays of shape `(n_times,)` instead of a 2D array).

So after the execution of `parallel()` we end up with two arrays of shape `(len(picks), n_times)`: `self._data` (also called `data_in`, but it's the same array) and `data_picks_new`, until garbage collection at the end of the function deletes the latter.

Temporarily, we end up with a second array of shape `(len(picks), n_times)`, even though each job receives only one channel (which initially led me to believe the docstring was wrong).
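
To put a rough number on that temporary copy (hypothetical recording size, assuming float64 samples):

n_picks, n_times = 64, 600_000  # e.g. 64 channels, 10 minutes at 1 kHz
extra_bytes = n_picks * n_times * 8  # 8 bytes per float64 sample
print(f"~{extra_bytes / 1e6:.0f} MB held until garbage collection")  # ~307 MB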

Member

ah, you are right. I was briefly thinking it was `n_jobs * len(picks) * n_times` (so much worse!) without realizing that I was assuming that. I'll merge then. Thanks @mscheltienne!

@drammock merged commit be6bee7 into mne-tools:main on Nov 15, 2022
@mscheltienne deleted the fix_doc branch on November 15, 2022 at 21:45
larsoner added a commit to ruuskas/mne-python that referenced this pull request Nov 17, 2022
* upstream/main:
  fix epochs.plot_image for EMG data (mne-tools#11322)
  fix fontawesome icon display (mne-tools#11328)
  [DOC, MRG] fix/improve documentation of n_jobs for apply_function (mne-tools#11325)
  [MAINT, MRG] fix is_mesa (mne-tools#11313)
  BUG: Fix bug with parallel progress bars (mne-tools#11311)
  BUG: Fix bug with report replacement (mne-tools#11318)
  MAINT: Fix docs (mne-tools#11317)
  Fix typo in changelog (mne-tools#11315)