
Conversation

@KristijanArmeni
Contributor

Reference issue

Work in progress for #11874

What does this implement/fix?

  • Implements a draft of the RawNeuralynx() class, subclassing BaseRaw(), and the read_raw_neuralynx() wrapper.
  • Relies on functionality in the NeuralynxIO class from the Neo package for reading header information.

To-Do

  • Testing data. We are missing testing data to test this implementation and move forward. I can ask if there's a way to get a sample recording on our end. I can also ask on the forums (MNE, Fieldtrip, etc.) for sample data.

@welcome

welcome bot commented Sep 8, 2023

Hello! 👋 Thanks for opening your first pull request here! ❤️ We will try to get back to you soon. 🚴

@larsoner
Member

larsoner commented Sep 8, 2023

Looks like a good start! I left some initial comments. Let me know if they don't make sense and I can clarify further. Or if you want me to try pushing some commits I can.

A good next step would be to get ~1 sec of raw data (can be junk!) added to mne-testing-data:

https://github.com/mne-tools/mne-testing-data#add-or-change-files-in-the-repo-for-use-with-mne

@larsoner larsoner changed the title from "initial commit, add module to support Neuralynx data" to "ENH: Add support for reading Neuralynx data" Sep 8, 2023
@KristijanArmeni
Contributor Author

Thanks for the detailed comments! All makes sense to me! I'll have a look and patch these in. In the meantime, I've asked around in the lab if there's a way to get the tiny sample recording done. I'll report back sometime next week once I hear back.

@KristijanArmeni
Contributor Author

Just pushed an update to my remote branch. Seeing the commit history chaos, I should probably stick to rebasing my local development branch onto mne:main as opposed to merging.

We now have the testing dataset (dummy recording) available and I am trying to implement mne.io.read_raw_neuralynx(neuralynx_testing_data) (via RawNeuralynx() and _read_segments_file()) and plot the outcome. Currently, it will obtain the right number of channels and samples, but the values appear faulty (and are the same across channels). See below for details on how I benchmark.

I don't have a good understanding of _read_segments_file() yet, but I guess something goes wrong before np.fromfile is called to obtain the file (.ncs) contents. Perhaps I am doing something wrong in my call to _read_segments_file?

Values read by read_raw_neuralynx() don't match those obtained by neo.NeuralynxIO().read()

As a first attempt, I tried using _read_segments_file and setting the n_channels argument to 1 (in neuralynx.py), because every .ncs file on disk is 1 channel. This does make the code run, but the outcome is faulty. For now, I benchmark against the neo.NeuralynxIO() reader (test_neuralynx.py; see figure below for 3 channels):

  • the numbers of samples/timepoints and channels match well
  • the actual sample values are different, and most samples obtained by _read_segments_file() are actually 0 (not shown in figure)
  • the values obtained by neo seem to make more sense to me (for an unconnected recording)

(figure: test_samples, comparing sample values from the two readers for 3 channels)
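For readers following along, the comparison above can be sketched with a small helper (hypothetical, not part of the PR; it assumes both readers return arrays of shape (n_channels, n_samples)):

```python
import numpy as np

# Hedged sketch: summarize agreement between two readers' outputs. The zero
# fraction is the statistic that flagged the bug described above.
def compare_readers(mne_data, neo_data):
    return dict(
        shapes_match=mne_data.shape == neo_data.shape,
        max_abs_diff=float(np.abs(mne_data - neo_data).max()),
        frac_zero_mne=float((mne_data == 0).mean()),
    )
```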

@larsoner
Member

Wasn't sure the easiest way to fix the history so I did:

git checkout main
git fetch upstream
git reset --hard upstream/main
git checkout KristijanArmeni/nlx mne/io/neuralynx
# hack in a few tweaks
git commit --amend -a
# modify commit message
git push origin -u nlx

This gave the commit credit to you but created this branch / commit:

If you think this commit is okay you can do locally:

git remote add larsoner git@github.com:larsoner/mne-python.git
git fetch larsoner
git checkout nlx
git checkout -b nlx-bak  # create a local backup!
git checkout nlx
git reset --hard larsoner/nlx
git commit --amend -a
git push origin --force nlx

and the one commit should be shown in this PR.

I'll look into the data structures soon.

@larsoner
Member

Okay, I got tests to pass here (still on my nlx branch):

larsoner@5cb32a4

(via RawNeuralynx() and _read_segments_file())

Keep in mind _read_segments_file is a very dumb / simple function -- it assumes data in the filename are stored starting at byte zero with no metadata. So it will only work if that's the case for Neuralynx files. Is it?

You can see in my commit I just use neo for reading the data currently. It uses a very inefficient algorithm -- read all data from disk then subselect -- that we should improve upon. But hopefully it makes conceptually clear how things should be structured, and what self._read_segment_file must do internally.
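For context on why the byte-zero assumption fails here: .ncs files begin with a 16 KiB ASCII header followed by fixed-size records, so a flat np.fromfile from byte zero reads header bytes as samples. The record layout below is an assumption based on publicly documented Neuralynx format descriptions, not code from this PR:

```python
import numpy as np

# Sketch of the .ncs layout: a 16 KiB ASCII header, then 1044-byte records.
# The field layout is an assumption based on public format descriptions.
NCS_HEADER_NBYTES = 16 * 1024
NCS_RECORD_DTYPE = np.dtype(
    [
        ("timestamp", "<u8"),        # record start time in microseconds
        ("channel_id", "<u4"),
        ("sample_rate", "<u4"),
        ("n_valid", "<u4"),          # number of valid samples in this record
        ("samples", "<i2", (512,)),  # 512 int16 samples per record
    ]
)

def read_ncs_samples(path):
    """Read all valid samples from one .ncs file (sketch, no gap handling)."""
    records = np.fromfile(path, dtype=NCS_RECORD_DTYPE, offset=NCS_HEADER_NBYTES)
    # keep only the valid portion of each 512-sample record
    return np.concatenate([rec["samples"][: rec["n_valid"]] for rec in records])
```

A correct reader would also use each record's timestamp to detect gaps between records; this sketch ignores that.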

@KristijanArmeni
Contributor Author

Thanks for suggesting the fix for the commit history! Will try this in a bit and see how it goes. The fixes in the last commit on your branch all make sense to me.

  • re: header info in .ncs. Right, good call: indeed, each .ncs file does have metadata, I think, so there'd need to be a separate reader for the header info, which is probably what causes the mess.
  • re: neo reading: I used neo to have something to benchmark against. I'm fine using it for reading in the data, and I'm almost certain it has a way of reading in data after samples and channels are specified. I'll have a look once I get this up and running.

@KristijanArmeni KristijanArmeni force-pushed the nlx branch 2 times, most recently from 7273499 to 5dbce2a Compare October 19, 2023 16:36
@KristijanArmeni
Contributor Author

Ok, so 5dbce2a is a first attempt at more efficient loading into memory. Tests are failing, but see below for an update and the main idea of how this could work via neo.

Test failing

Right now, test_neuralynx.py fails on my end at _test_raw_reader (l. 43). I tried stepping through it. It seems _mult_cal_one loops over _read_segment_file multiple times (?), and while on some loops idx looks fine (slice(0, 10, None), indexing 10 channels), on one iteration it looks off (array([7, 2, 1, 4, 8, 6, 3, 0, 5, 9])), data_view and one have inconsistent sample dimension sizes, and the assertion on l. 83 (assert data_view.shape[1] == one.shape[1]) fails.

I don't have good insight into _mult_cal_one, so I'm not sure what goes on there.

Update

In short, in 5dbce2a I defer reading data into memory and only fetch the header info and empty data structures (bl = nlx_reader.read(lazy=True)). Based on the start and stop samples given to _read_segment_file(), I find the data chunks in neo.block.segments that contain them and then call neo.Segment.AnalogSignalProxy.load() only on the selected chunks/segments between start and (including) stop. I tested some options with different picks and start/stop values, and it seems to be on the right path.

# quick and dirty plotting
raw = read_raw_neuralynx(fname=testing_path, preload=True)
d1, t1 = raw.get_data(start=0, stop=1000, return_times=True)
raw_orig = raw.copy()

raw3 = raw.pick(picks=[0, 1, 5])
labels3 = raw3.ch_names
d3, t3 = raw3.get_data(start=400, stop=1200, return_times=True)

raw4 = raw_orig.pick(picks=[1, 2, 8])
labels4 = raw4.ch_names
d4, t4 = raw4.get_data(start=200, stop=1000, return_times=True)
# then plot d1, d3, d4 etc.

(figure: test_get_data, plotting d1, d3, d4)
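The chunk selection described in the update can be sketched with plain NumPy (a hypothetical helper, not the PR's actual code): given per-segment sample counts, find the first and last segments overlapping the requested [start, stop) range.

```python
import numpy as np

def segments_to_read(seg_n_samples, start, stop):
    """Return (first, last) indices of segments overlapping [start, stop)."""
    bounds = np.concatenate([[0], np.cumsum(seg_n_samples)])  # segment edges
    first = int(np.searchsorted(bounds, start, side="right")) - 1
    last = int(np.searchsorted(bounds, stop, side="left")) - 1
    return first, last
```

Only the segments in this range would then be passed to AnalogSignalProxy.load(), keeping memory use bounded.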

Comment on line:

block = all_data  # shape = (len(idx), n_samples)
Member


Perhaps somewhat confusingly, whatever you pass to _mult_cal_one needs to be shape (n_channels, n_samples), not (len(idx), n_samples). I'll push a quick commit to show what I mean

@larsoner
Member

With my commit now I see:

mne/io/neuralynx/tests/test_neuralynx.py:43: in test_neuralynx
    _test_raw_reader(read_raw_neuralynx, fname=testing_path)
mne/io/tests/test_raw.py:168: in _test_raw_reader
    data2, times2 = other_raw[picks, sl_time]
mne/io/base.py:851: in __getitem__
    return self._getitem(item)
mne/io/base.py:858: in _getitem
    data = self._read_segment(start=start, stop=stop, sel=sel)
<decorator-gen-311>:12: in _read_segment
    ???
mne/io/base.py:479: in _read_segment
    _ReadSegmentFileProtector(self)._read_segment_file(
mne/io/base.py:2511: in _read_segment_file
    return self.__raw.__class__._read_segment_file(
mne/io/neuralynx/neuralynx.py:162: in _read_segment_file
    block[idx] = all_data  # shape = (n_channels, n_samples)
E   ValueError: shape mismatch: value array of shape (10,6783) could not be broadcast to indexing result of shape (10,2000)

So the all_data that you create does not have enough samples; there must be some bug in _find_first_last_segment or how it's used.
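The ValueError above is NumPy's fancy-indexing broadcast rule: assigning into block[idx] requires the value array to match the indexed shape exactly along the sample axis. A minimal reproduction (synthetic shapes taken from the traceback):

```python
import numpy as np

block = np.zeros((10, 2000))
idx = np.array([7, 2, 1, 4, 8, 6, 3, 0, 5, 9])
block[idx] = np.ones((10, 2000))  # fine: value matches indexed shape
try:
    block[idx] = np.ones((10, 6783))  # too many samples, as in the traceback
except ValueError:
    print("shape mismatch")  # raises ValueError, mirroring the error above
```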

Comment on lines 98 to 105
# shape = (n_segments, 2) where 2nd dim is (start_time, stop_time)
onst_offt = np.array(
    [
        (signal.t_start.item(), signal.t_stop.item())
        for segment in segments
        for signal in segment.analogsignals
    ]
)
Member


Does this mean that there can be gaps and/or overlaps?

For gaps there is a way to handle it (insert zeros) but for overlaps I'm not even sure what the right behavior would be

Contributor Author


Oh, good point. I'd assume no gaps/overlaps, as it should just be a chunked representation of a continuous recording by the acquisition system (for some reason). That is, each segment's stop time should be the next segment's start time, etc. Our test dataset only has two segments (over a total of ~3 s of recording). I can perhaps test this on my local datasets, which have on the order of tens of segments.

Member


If it's assumed that they are contiguous then I'd just construct an array of shape (n_samples,) of int type that gives you the mapping from sample numbers into segment numbers in __init__ and store it in _raw_extras[0] = dict(segment_map=segment_map) or so. Then use this to figure out which segments to read in _read_segments_file. If this isn't clear I can push a quick commit to show what I mean.
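The suggested mapping can be sketched in a few lines (segment sizes here are hypothetical; np.repeat builds the per-sample lookup):

```python
import numpy as np

# hypothetical samples per neo segment
seg_n_samples = np.array([1000, 1000, 500])
# one int per sample, giving that sample's segment index
segment_map = np.repeat(np.arange(len(seg_n_samples)), seg_n_samples)
# at __init__ time this would be stored once, e.g.:
#   self._raw_extras[0] = dict(segment_map=segment_map)
# and in _read_segments_file the segments covering [start, stop) are:
start, stop = 950, 1200
needed = np.unique(segment_map[start:stop])
```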

Contributor Author


Thanks! Makes total sense. Will implement.

@KristijanArmeni
Contributor Author

KristijanArmeni commented Oct 20, 2023

OK, added a smp2seg dict in _raw_extras[0] in df66e3.

I think this will work as a strategy to keep track of which samples belong to which segments. Seems like the tests are failing at the moment, but I can't see exactly where from the bot messages here (stylistic things?).

In any case, if I manually evaluate test_neuralynx() locally it now runs with no errors. But if I try reading one of my bigger datasets I have locally I get some inconsistent sample numbers between all_data and block variables in RawNeuralynx._read_segment_file().

I suspect something fishy is going on with chunk reading based on time slices via AnalogSignal.load(time_slice=(start_time, stop_time)). It seems the concatenated arrays it loads based on (start_time, stop_time) don't have the same number of samples as an array constructed with np.zeros((n_channels, stop - start)).

Will explore and report here.

@KristijanArmeni
Contributor Author

In e251d98 I implemented a way to read in segments via neo based on sample and channel indices (Segment.AnalogSignal.load(channel_indexes=idx).magnitude[:, segment_start:segment_stop]) as opposed to converting start/stop to time units and using t_start/t_stop in AnalogSignal.load() (which, for some reason, resulted in array inconsistencies on larger datasets). This code still only loads into memory the data segments between start and stop and the channels specified in idx. See figures below for extra info.

Filtering by channel type is needed in read_raw_neuralynx().
After testing my implementation on local datasets that have several GBs worth of real data (and multiple channel types, e.g. seeg, eog, etc.), I learned that channel types can have different numbers of segments, which causes NeuralynxIO.read() to throw a (meaningful) error. To deal with this, NeuralynxIO() has the exclude_filename param, a list of files to ignore, such that the user can initiate separate readers for seeg, eog, etc. (see this GitHub issue).

For now, my hack is to include an exclude_fnames_pattern param containing file patterns to ignore (e.g. "EOG*") in __init__. These patterns are passed to glob.glob, and the returned lists of files are stored in raw._raw_extras and passed to NeuralynxIO() as the exclude_filename input arg. Not sure if there's a way of doing this that's more aligned with the MNE API, but this worked for me locally.
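A minimal sketch of that glob-based exclusion (hypothetical helper name; the idea is that the result would be forwarded to NeuralynxIO(exclude_filename=...)):

```python
import glob
import os

def files_to_exclude(dirname, patterns):
    """Expand each pattern in dirname and collect matching file names."""
    excluded = []
    for pattern in patterns:
        excluded += sorted(
            os.path.basename(path)
            for path in glob.glob(os.path.join(dirname, pattern))
        )
    return excluded
```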

Extra information: raw.plot() on testing dataset

from mne.datasets.testing import data_path, requires_testing_data
from mne.io import read_raw_neuralynx
testing_path = data_path(download=False) / "neuralynx"
raw = read_raw_neuralynx(testing_path, preload=True)
raw.plot()
(figure: test_raw_plot1)

And an actual (non-dummy) dataset:

This is a larger dataset (about 5GB in size).
(figure: test_raw_plot2)

@KristijanArmeni
Contributor Author

Yes if you click blue "details" to the right of the failing Tests/Style to get to https://github.com/mne-tools/mne-python/actions/runs/6852318442/job/18630579970?pr=11969 you'll see:

Just commenting to point out that 0c552cb was pushed here yesterday to fix docstrings. This makes the Tests/Style check pass, though I see some CIs fail, but it looks like some conda-specific test on mne:main?

(No rush on my end; I forgot to comment yesterday and am just making sure it's seen. I can't tell if a pushed commit alone also sends a notification to everyone involved in the conversation.)

@larsoner
Member

can't tell if a pushed commit alone also sends a notification to all involved in the conversation

It can, but most maintainers I know have disabled this feature because it generates way too many emails, so yes, please do comment when it's time to look. The failure is unrelated, so you can ignore it; I'll merge main into this branch to fix it. Can you take it out of draft status if you're ready for it to be merged (assuming CIs are green)? Then I'll take one last look and merge if it's green and I have no further comments!

@KristijanArmeni KristijanArmeni marked this pull request as ready for review November 14, 2023 17:04
@KristijanArmeni
Contributor Author

@larsoner It looks all green now. Please see if there's anything missing or fishy in the reader.

It should be a minimally working reader. I guess other improvements can follow on top of this, like bookkeeping any additional info from the Neuralynx headers (meas_date etc.). I'll be using this on real data in the coming weeks, so I can open issues as they pop up.

Oh and we'll also need a doc/changes/devel.rst update to mention this improvement!

This can be done in a separate PR? Once it's merged, I'd like to also acknowledge the researcher who acquired the testing dataset, provided the .mat files, and helped troubleshoot Neuralynx specifics (if that's common practice).

@larsoner
Member

This can be done in a separate PR?

It should be part of this one if possible. Feel free to push.

Yes you can credit whoever you think should get co-author credit for the PR by adding names at the end of the changelog line (or multiple :newcontrib:s for people who are new contributors).

@larsoner larsoner added this to the 1.6 milestone Nov 14, 2023
@KristijanArmeni
Contributor Author

Thanks @larsoner! I updated doc/changes/devel.rst and names.ini, hopefully in good order. All checks seem to be passing.

@larsoner
Member

Thanks @KristijanArmeni !

@larsoner larsoner merged commit e2b3c03 into mne-tools:main Nov 14, 2023
@KristijanArmeni KristijanArmeni deleted the nlx branch November 14, 2023 21:29
hoechenberger added a commit to hoechenberger/mne-python that referenced this pull request Nov 22, 2023
An incorrect type hint sneaked in via mne-tools#11969, causing confusion for static type checkers.

Since we don't use type hints in MNE yet (except for some very
rare exceptions), I simply removed the faulty one
instead of fixing it.
@eduardosand

eduardosand commented Nov 28, 2023

@KristijanArmeni This is awesome! I was wondering how, in the end, you deal with missing samples, i.e. the array-size mismatch (when making an array using the first and last timepoints vs. the actual data present). For me, having to use the exclude-file parameter seems to disregard the power of having all the data in the directory at once; a warning is nice, but not if there isn't much I can do about it.

On my end, I have two opinions on the matter.

The sample rate itself is not exactly 32 kHz or whatever ncs_reader.get_signal_sampling_rate() spits out, but something slightly smaller. This could contribute to the slight mismatch and to why it increases on larger datasets.

Segments are generated in neo not just when the experimenter presses play/pause or an event annotation is made, but also when Neuralynx fails to record certain samples. I also think this is the reason for the difference in sample counts across channels, and as the GitHub issue above points out, there's also some weirdness with cutting off samples at the beginning and end of a file. Hence, Neuralynx will drop more samples in the 32 kHz channels than in the 8 kHz channels, and asymmetrically (e.g. Neuralynx drops 3-4 samples in the 32 kHz case while no samples are dropped in the 8 kHz case, so a segment is generated in the 32 kHz channels but not in the 8 kHz channels).

See below for more:
https://www.fieldtriptoolbox.org/faq/discontinuous_neuralynx/
fieldtrip/fieldtrip#1128

My workaround has thus been to assume uniform sampling within a segment and then to interpolate with cubic splines for any data 'in between' segments. I do this instead of Fieldtrip's filling with NaNs because the signal processing is easier downstream (without having to recast the NaN values to something suitable for scipy.signal.sosfiltfilt). Outside of spike sorting, I don't care about 32 kHz or 8 kHz sampling rates, so I then downsample to 2 kHz and save as .npy files for downstream analysis.
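The spline workaround can be sketched as follows (hypothetical times and values; assumes scipy is available):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# samples we do have, with a gap between t = 2 s and t = 5 s
t_have = np.array([0.0, 1.0, 2.0, 5.0, 6.0, 7.0])
x_have = np.sin(t_have)

# fit a cubic spline to the available samples and evaluate it in the gap
spline = CubicSpline(t_have, x_have)
t_missing = np.array([3.0, 4.0])
x_filled = spline(t_missing)
```

As mentioned above, the interpolated spans should be remembered so they can be discarded after filtering.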

@larsoner
Member

For me, having to use the exclude file parameter seems like it disregards the power of having all the data in the directory at once and a warning is nice, but not if there isn't much I can do about it.

Assuming you have two streams to align / resample (though it generalizes to more), naively my approach would be to use the exclude parameter to create two Raw instances (the first instance excluding the second file, and the second instance excluding the first file) and then resample them.

If we have gaps in some recordings we should make sure these are in there properly with zeros (or NaNs) and annotated with BAD_ACQ_SKIP so that time is continuous in the instance. Then we "just" need to work out #11408 to resample the data to the same sample rate.
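The gap-filling idea can be sketched like this (hypothetical helper: segments are (n_channels, n_samples) arrays, seg_start_samps their first sample index on the continuous time axis, and the returned spans are candidates for BAD_ACQ_SKIP annotations):

```python
import numpy as np

def fill_gaps(segments, seg_start_samps):
    """Concatenate segments on a continuous axis, zero-filling any gaps."""
    n_channels = segments[0].shape[0]
    total = seg_start_samps[-1] + segments[-1].shape[1]
    out = np.zeros((n_channels, total))
    skip_spans = []  # candidate BAD_ACQ_SKIP spans, in samples
    prev_end = 0
    for seg, s0 in zip(segments, seg_start_samps):
        if s0 > prev_end:
            skip_spans.append((prev_end, s0))
        out[:, s0 : s0 + seg.shape[1]] = seg
        prev_end = s0 + seg.shape[1]
    return out, skip_spans
```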

@eduardosand

If we have gaps in some recordings we should make sure these are in there properly with zeros (or NaNs) and annotated with BAD_ACQ_SKIP so that time is continuous in the instance. Then we "just" need to work out #11408 to resample the data to the same sample rate.

I agree with the annotation, but I was under the impression that scipy.signal doesn't work well with NaNs? I'm also worried that 0s might cause edge effects when computing power analyses, although maybe that's not as big a concern with wavelets?
Interpolation is nice because it allows us to use any filters we want, and as long as we remember that the values themselves are interpolated, we can discard them later after pre-processing.

@larsoner
Member

I agree with annotation, but I was under the impression that scipy.signal doesn't work well with NaNs? Also worried that 0s might cause edge effects when computing power analyses, although maybe not as big of a concern with wavelets?

Correct, it doesn't handle NaN well -- it will turn the whole signal NaN. But the idea isn't to operate on the whole signal at once. See the original issue -- the idea is to resample each non-NaN segment separately using skip_by_annotation to find segments that should be resampled. Similarly we can improve our tfr_* functions to also respect skip_by_annotation and thereby (again) process each segment separately.
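Locating the non-NaN stretches that would each be processed separately can be sketched as (hypothetical helper):

```python
import numpy as np

def nan_free_runs(x):
    """Yield (start, stop) index pairs of contiguous non-NaN samples."""
    good = ~np.isnan(x)
    edges = np.flatnonzero(np.diff(good.astype(int)))  # run boundaries
    bounds = np.concatenate([[0], edges + 1, [x.size]])
    for b0, b1 in zip(bounds[:-1], bounds[1:]):
        if good[b0]:
            yield int(b0), int(b1)
```

Each yielded slice could then be filtered or resampled on its own, mirroring the skip_by_annotation idea.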

But thinking about it more, getting the segment-by-segment resampling right seems like a difficult problem. Also, reading through your description of Neuralynx and samples just missing at the beginning and end of a recording (instead of someplace in the middle), I'm not totally sure you need all this. So maybe not the right option here 🤷

@KristijanArmeni
Contributor Author

Thanks for raising this @eduardosand! You mention two sources/forms of array-size mismatch:

  1. Across channels, due to different sampling rates and/or asymmetric Neuralynx sample trimming (e.g. 8 kHz vs. 32 kHz). This I was aware of, and, as @larsoner said, exclude_fname_pattern, propagated to the neo reader, should take care of it such that the user can construct two Raw instances if need be (setting aside whether the two can eventually be realigned).
  2. Gaps across segments, i.e. a discontinuous recording within a channel (the number of valid samples is smaller than the total number of samples). Indeed, the current code assumes the segments are continuous and temporally adjacent.

Regarding 2 and reading in the Fieldtrip doc you linked:

The low-level read_neuralynx_ncs function detects the presence of gaps in the .ncs file and issues a warning.

It sounds like mne.io.read_raw_neuralynx() should minimally check for temporal gaps between neo segments and issue a warning? For example, it should check that each neo.Segment[i] object starts when neo.Segment[i-1] ended and raise a warning if this is not the case (i.e. there's a temporal gap, assuming the information in neo is accurate)? And potentially also reconstruct/fill/mark missing samples such that the time axis (i.e. raw.times) is continuous and valid.

If this is on track, happy to open a separate issue and work on this.
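That check could look something like this (hypothetical helper; the per-segment start/stop times would come from neo, e.g. AnalogSignalProxy.t_start/t_stop):

```python
import numpy as np

def gap_boundaries(starts, stops, sfreq):
    """Return indices i where segment i+1 does not start when segment i ends.

    starts/stops are per-segment times in seconds; a one-sample tolerance
    absorbs rounding in the recorded timestamps.
    """
    tol = 1.0 / sfreq
    gaps = np.asarray(starts[1:]) - np.asarray(stops[:-1])
    return np.flatnonzero(np.abs(gaps) > tol).tolist()
```

read_raw_neuralynx() could then warn whenever the returned list is non-empty.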

@larsoner
Member

Sounds good, let's move discussion to #12247

snwnde pushed a commit to snwnde/mne-python that referenced this pull request Mar 20, 2024
Co-authored-by: Eric Larson <larson.eric.d@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
