Description of the problem
Great to see the progress on the eyetracking, @scott-huberty!
I discovered a bug in the code that leads to ignoring about 50% of the data in non-continuous/multi-block recordings.
The root cause is the behavior and current parameter choice of pd.merge_asof() (used to fill in missing time samples).
I'll post a PR to fix it.
Background (for others than Scott):
EyeLink doesn't store sample numbers but time in ms. If the sampling rate is below 1000 Hz (usually the case), it is very likely that later recording blocks start at a millisecond count that does not line up with the initial one (e.g. with sfreq=500 Hz: a later block sampled on odd ms while the initial block sampled on even ms).
So to merge these blocks on a unified timescale, the later block has to be shifted by half a sample or so.
pd.merge_asof is used to do that, but the current tolerance is too low to catch the samples that have to be shifted, so they are replaced by NaNs and lost.
In the 500 Hz example, the current tolerance is only 0.2 ms (step / 10), which is not enough to catch the offset of 1.0 ms.
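The arithmetic behind this can be sketched in a few lines. This is only an illustration using the toy timestamps from the example in this report; the names `phase` and `offset` are made up here and are not part of the reader code.

```python
import numpy as np

sfreq = 500
step = 1000 / sfreq  # 2.0 ms between samples at 500 Hz

# Toy timestamps: the middle block is shifted by 1 ms relative to the first one
times = np.array([2, 4, 6, 11, 13, 15, 20, 22, 24], dtype=float)

# Distance of every sample to the nearest point of the grid anchored at the first sample
phase = (times - times[0]) % step
offset = np.minimum(phase, step - phase)

print("offset from grid (ms):", offset)
print("within step / 10:", offset <= step / 10)  # samples the current tolerance can match
print("within step / 2: ", offset <= step / 2)   # samples the proposed tolerance can match
```

In this toy case the three samples of the shifted block sit 1.0 ms off the grid, so they fall outside a 0.2 ms tolerance but inside a 1.0 ms one.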
Below is a simplified example:
Steps to reproduce
# replicate bug
import numpy as np
import pandas as pd

sfreq = 500
time_col = "time"
df = pd.DataFrame({
    time_col: [2, 4, 6, 11, 13, 15, 20, 22, 24],
    "data": [2, 4, 6, 11, 13, 15, 20, 22, 24],
})

# mimic current _adjust_times function
first, last = df[time_col].iloc[[0, -1]]
step = 1000 / sfreq  # sample spacing in ms
df[time_col] = df[time_col].astype(float)
# regular time grid from the first to the last sample
new_times = pd.DataFrame(
    np.arange(first, last + step / 2, step), columns=[time_col]
)

# critical line below: tolerance=step / 10 (0.2 ms here) is smaller than the 1 ms block offset
return_current = pd.merge_asof(
    new_times, df, on=time_col, direction="nearest", tolerance=step / 10
)
print("current implementation:")
print(return_current)
print()

# fixed alternatives: widen the tolerance to half a sample
return_new = pd.merge_asof(
    new_times, df, on=time_col, direction="nearest", tolerance=step / 2
)
print("fixed (nearest):")
print(return_new)
print()

return_new = pd.merge_asof(
    new_times, df, on=time_col, direction="backward", tolerance=step / 2
)
print("fixed (backwards):")
print(return_new)
print()
Link to data
No response
Expected results
time data
0 2.0 2.0
1 4.0 4.0
2 6.0 6.0
3 8.0 NaN
4 10.0 NaN
5 12.0 11.0
6 14.0 13.0
7 16.0 15.0
8 18.0 NaN
9 20.0 20.0
10 22.0 22.0
11 24.0 24.0
Actual results
time data
0 2.0 2.0
1 4.0 4.0
2 6.0 6.0
3 8.0 NaN
4 10.0 NaN
5 12.0 NaN
6 14.0 NaN
7 16.0 NaN
8 18.0 NaN
9 20.0 20.0
10 22.0 22.0
11 24.0 24.0
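To put a number on the loss shown in the two tables, here is a minimal sketch that counts how many distinct original samples survive the merge. `count_kept_samples` and the `orig_idx` column are hypothetical names introduced only for this illustration; the merge itself mirrors the simplified example above, not necessarily the exact reader code.

```python
import numpy as np
import pandas as pd


def count_kept_samples(df, sfreq, tolerance, direction="nearest", time_col="time"):
    """Hypothetical helper: count distinct original samples present after the merge."""
    first, last = df[time_col].iloc[[0, -1]]
    step = 1000 / sfreq
    # Tag every original row so it can be recognised after the merge
    right = df.assign(
        **{time_col: df[time_col].astype(float), "orig_idx": np.arange(len(df))}
    )
    grid = pd.DataFrame({time_col: np.arange(first, last + step / 2, step)})
    merged = pd.merge_asof(
        grid, right, on=time_col, direction=direction, tolerance=tolerance
    )
    # nunique() skips NaN, i.e. grid points that found no match
    return merged["orig_idx"].nunique()


df = pd.DataFrame({
    "time": [2, 4, 6, 11, 13, 15, 20, 22, 24],
    "data": [2, 4, 6, 11, 13, 15, 20, 22, 24],
})
step = 1000 / 500

kept_current = count_kept_samples(df, 500, tolerance=step / 10)
kept_fixed = count_kept_samples(df, 500, tolerance=step / 2, direction="backward")
print(f"current tolerance: {kept_current} of {len(df)} samples kept")
print(f"step / 2, backward: {kept_fixed} of {len(df)} samples kept")
```

Counting distinct `orig_idx` values rather than non-NaN rows also guards against the same original sample being assigned to more than one grid point.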
Additional information
Doesn't matter