Skip to content

Gforce edge case creates segfault#5109

Merged
mattdowle merged 13 commits intomasterfrom
gforce_edge_case
Aug 24, 2021
Merged

Gforce edge case creates segfault#5109
mattdowle merged 13 commits intomasterfrom
gforce_edge_case

Conversation

@ben-schwen
Copy link
Copy Markdown
Member

Closes #1994.

Considered functions:

  • gsum
  • gmean
  • gvar
  • gsd
  • gmedian
  • gprod
  • gminmax
  • gfirst / ghead
  • glast / gtail

@codecov
Copy link
Copy Markdown

codecov bot commented Aug 22, 2021

Codecov Report

Merging #5109 (0725559) into master (897ac6d) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5109      +/-   ##
==========================================
- Coverage   99.54%   99.53%   -0.01%     
==========================================
  Files          76       76              
  Lines       14623    14468     -155     
==========================================
- Hits        14556    14401     -155     
  Misses         67       67              
Impacted Files Coverage Δ
R/test.data.table.R 100.00% <100.00%> (ø)
src/gsumm.c 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 897ac6d...0725559. Read the comment docs.

@ben-schwen ben-schwen added bug GForce issues relating to optimized grouping calculations (GForce) segfault labels Aug 22, 2021
@mattdowle mattdowle added this to the 1.14.1 milestone Aug 23, 2021
Comment thread src/gsumm.c Outdated
if (ansd[thisgrp]==NA_STRING) continue;
const int ix = (irowslen == -1) ? i : irows[i]-1;
if (xd[ix]==NA_STRING || (strcmp(CHAR(xd[ix]), CHAR(ansd[thisgrp]))<0)==min)
if (ix+1==NA_INTEGER) SET_STRING_ELT(ans, thisgrp, NA_STRING);
Copy link
Copy Markdown
Member

@mattdowle mattdowle Aug 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INT_MAX+1 overflows to INT_MIN (==NA_INTEGER) so if there's a way to check if irows[i]==NA_INTEGER directly like the other cases that would be neater; e.g. in the extreme edge case of nrow==INT_MAX and irowslen==-1. Actually, maybe it's not as extreme if folk have code that batches into INT_MAX sizes.
But who am I to comment when this NA-in-i segfault was here all along.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thought yesterday, but then also thought that if overflowing from INT_MIN-1 to INT_MAX works in the first place, then the return overflow for the if should also work.

The problem vectors reaching length INT_MAX is certainly a future problem but one that we have to keep in mind, e.g. switching to 64bit integer for indexes? For now I'm not convinced folks have that kind of big vectors since this would mean an ridiculous amount of RAM.

print(object.size(seq.int(2^31-1)), unit="MB")
8192 Mb

base itself seems to have problems with vectors being that long e.g. x = seq.int(2^32); print(x) leading to

[ reached getOption("max.print") -- omitted -99999 entries ]

Copy link
Copy Markdown
Member

@mattdowle mattdowle Aug 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes overflow/underflow is reliable in practice but, iirc, the C standard only guarantees it for unsigned not signed. So the concern would be i) weird and wonderful architectures and compilers, and ii) to catch unintentional overflow. UBSAN would catch it in CRAN_Release.cmd before release, if a test covered the overflow. UBSAN is also in CRAN extra tests (https://cran.r-project.org/web/checks/check_issue_kinds.html) so compliance is required by CRAN. I've seen overflow caught by UBSAN before and fixed it, it was probably on smaller types like int8_t or char where a test covered it.

That problem is just the printing mechanism. Base R can create and use 'big' vectors but the capability varies by function. R's news file is a good place to search to see how they've increased support over time. But regardless, I had in mind a regular data.table with INT_MAX rows.

@mattdowle mattdowle merged commit b33dee6 into master Aug 24, 2021
@mattdowle mattdowle deleted the gforce_edge_case branch August 24, 2021 00:02
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug GForce issues relating to optimized grouping calculations (GForce) segfault

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Edge case bug in GForce

3 participants