diff --git a/posts/2024-12-12-non-api-use/index.qmd b/posts/2024-12-12-non-api-use/index.qmd new file mode 100644 index 00000000..a775abd0 --- /dev/null +++ b/posts/2024-12-12-non-api-use/index.qmd @@ -0,0 +1,1331 @@ +--- +title: "Use of non-API entry points in `data.table`" +author: "Ivan Krylov" +date: "2024-12-12" +categories: [developer, guest post, performance] +# image: "image.jpg" +draft: true +bibliography: refs.bib +--- + +```{r} +#| echo: false +library(data.table) +library(tools) # format.check_details +load('precomputed.rda') +``` + +In the late 1970's, people at Bell Laboratories designed the S +programming language in order to facilitate interactive exploratory data +analysis [@Chambers2016]. Instead of writing, compiling, scheduling, and +interpreting the output of individual Fortran programs, the goal of S +was to conduct all the necessary steps of the analysis on the fly. S +achieved this not by replacing the extensive collection of Fortran +subroutines, but by providing a special interface language [@Becker1985] +through which S could communicate with compiled code. + +Fast forward more than four decades and an increase by more than three +orders of magnitude in storage and processing capability of computers +around us. The [dominant implementation of S is now R][is.R]. It is now +feasible to implement algorithms solely in R, recouping the potential +performance losses by reducing the programmer effort spent debugging and +maintaining the code [@Nash2024]. Still, the capability of R to be +extended by special-purpose compiled code is as important as ever. As of +`r when`, `r round(sum(needscomp)/length(needscomp)*100)`% of CRAN +packages use compiled code. Since the implementation language of R is C, +not Fortran, the application programming interface (API) for R is mainly +defined in terms of C. + +What's in an API? +================= + +[Writing R Extensions][WRE] ("WRE") is the definitive guide for R +package development. 
Together with the [CRAN policy][CRANpolicy] it +forms the "rules as written" that the maintainers of CRAN packages must +follow. A recent version of R exports `r nrow(symbols)` symbols, +including `r symbols[,sum(type=='function')]` functions ("entry points", +not counting C preprocessor macros) and +`r symbols[,sum(type!='function')]` variables. Not all of them are +intended to be used by packages. Even back in R-3.3.0, the oldest +version currently supported by `data.table`, [WRE chapter 6, "The R +API"][WRE33API] classified R's entry points into four categories: + +> * __API__ +> Entry points which are documented in this manual and declared in an +> installed header file. These can be used in distributed packages and +> will only be changed after deprecation. +> * __public__ +> Entry points declared in an installed header file that are exported +> on all R platforms but are not documented and subject to change +> without notice. +> * __private__ +> Entry points that are used when building R and exported on all R +> platforms but are not declared in the installed header files. Do not +> use these in distributed code. +> * __hidden__ +> Entry points that are where possible (Windows and some modern +> Unix-alike compilers/loaders when using R as a shared library) not +> exported. + +Although nobody objected to the use of the _API_ entry points, and there +was little point in trying to use the _hidden_ entry points in a package +that would fail to link almost everywhere, the _public_ and the +_private_ entry points ended up being a point of contention. Those +deemed too internal to use but not feasible to make _hidden_ were (and +still are) listed in the character vector `tools:::nonAPI`: ` R CMD +check ` looks at the functions imported by the package and signals a +`NOTE` if it finds any listed there. 
+ +The remaining _public_ functions, neither documented as API nor +explicitly forbidden by ` R CMD check `, sat there, tempting package +developers to use them. For example, the [serialization +interface][ltierney_serialize] has only been [documented in WRE since +R-4.5][WRE45serialize], but it has been powering part of the [digest] +CRAN package since 2019 (and other packages before it) without any +drastic changes. Some of the inclusions in `tools:::nonAPI` could have +been historical mistakes: although WRE was already saying [back in version +3.3.0][WRE33wilcox] that `wilcox_free` should be called after a call to +the (API) functions `dwilcox`, `pwilcox` or `qwilcox`, the function was +only [declared in the public headers][wilcox_declared] and [removed from +`tools:::nonAPI`][wilcox_api] in R-4.2.0. Still, between R-3.3.3 and +R-4.4.2, the `#define USE_RINTERNALS` escape hatch was finally closed, +`tools:::nonAPI` grew from `r length(nonAPI.3_3)` to +`r length(nonAPI.4_4)` entries, and package maintainers had to adapt +or face archival of their packages. + +A [recent question on R-devel][ALTREPnonAPI] (whether the [ALTREP] +interface should be considered "API" for the purpose of CRAN package +development) sparked a series of events and an extensive discussion +containing the highest count of occurrences of the word "API" per month +ever seen on R-devel (234), topping [October 2002][Rd200210] (package +versioning and API breakage, 150), [October 2005][Rd200510] (API for +graphical interfaces and console output, 124), and [May 2019][Rd201905] +(discussions of the ALTREP interface and multi-threading, 121). As a +result, Luke Tierney [started work][clarifyingAPI] on programmatically +describing the functions and other symbols exported by R (including +variables and preprocessor and enumeration constants), giving a +stronger definition to the interface. 
His changes add the currently +unexported function `tools:::funAPI()` that lists entry points and two +more of their categories: + +> * __experimental__ +> Entry points declared in an installed header file that are part of +> an experimental API, such as `R_ext/Altrep.h`. These are subject to +> change, so package authors wishing to use these should be prepared +> to adapt. +> * __embedding__ +> Entry points intended primarily for embedding and creating new +> front-ends. It is not clear that this needs to be a separate +> category but it may be useful to keep it separate for now. + +Additionally, WRE now spells out that entry points not explicitly +documented or at least listed in the output of `tools:::funAPI` (or +something that will replace it) are now off-limits, even if not +currently present in `tools:::nonAPI` (emphasis added): + +> * __public__ +> Entry points declared in an installed header file that are exported +> on all R platforms but are not documented and subject to change +> without notice. _Do not use these in distributed code. Their +> declarations will eventually be moved out of installed header +> files._ + +Correspondingly, the number of `tools:::nonAPI` entry points in the +current development version of R rose to `r length(nonAPI.trunk)`, +prompting the blog post you are currently reading. + + + + + + + +Non-API entry points marked by ` R CMD check ` +============================================== + +The first version of the `data.table` package in the CRAN archive dates +back to April 2006 (which corresponds to R version 2.3.0). 
It has been +evolving together with R and its API and thus has accumulated a number +of uses of R internals that are [now flagged by ` R CMD check ` as +non-API][remove_non_API]: + +`r gsub( + '(?m)^', '> ', perl = TRUE, + format(subset(dtchecks, grepl('API', Output))[1,]) +)` + + -- ` R CMD check --as-cran ` on a released version of `data.table` + +Operating on the S4 bit: `IS_S4_OBJECT`, `SET_S4_OBJECT`, `UNSET_S4_OBJECT` +--------------------------------------------------------------------------- + +In R's "S4" OOP system, objects can have a primitive base type (e.g. +`setClass("PrimitiveBaseType", contains = "numeric")`) or no base type at +all (e.g. `setClass("NoBaseType")`). In the former case, their +`SEXPTYPE` code is that of their base class (e.g. `REALSXP`). In the +latter case, their type code is `OBJSXP` (previously `S4SXP`, which is +now an alias for `OBJSXP`). To make both cases work consistently, R uses +a [special "S4" bit][RI_S4rep] in the header of the object. + +The `data.table` class is [registered][setOldClass] with the S4 OOP +system, making it possible to create S4 classes containing `data.table`s +as members (`setClass(slots = c(mytable = 'data.table'))`) or even +inheriting from `data.table` (and, in turn, from `data.frame`: +`setClass(contains = 'data.table')`). Additionally, `data.table`s may +contain columns that are themselves S4 objects, and both of these cases +require care in the C code. + +The undocumented functions `IS_S4_OBJECT`, `SET_S4_OBJECT`, and +`UNSET_S4_OBJECT` exist as bare interfaces to [the internal +macros][IS_S4_OBJECT] of the same names and directly access the flag +inside their argument. Writing R Extensions +[documents][WRE_replacement_entrypoints] `Rf_isS4` and `Rf_asS4` as +their replacements. 
+ +The [`Rf_isS4`][isS4] function is a wrapper for `IS_S4_OBJECT` that +follows the usual naming convention for remapped functions, has been +part of the API for a long time, and could implement additional checks +if they are needed by R. The [`Rf_asS4`][asS4] function (experimental +API) is more involved, trying to "deconstruct" S4 objects into S3 +objects when this is both possible and requested. If the reference +count of its argument is _above_ 1, it will operate on, and return, +a shallow duplicate of the argument. + +`data.table` used to directly operate on the S4 bit in two places, the +[`shallow` function in `src/assign.c`][datatable_assign_shallow_S4] and +the [`keepattr` function in +`src/dogroups.c`][datatable_dogroups_keepattr_S4]. In both cases, this +was required after directly modifying the attribute list using the +undocumented function `SET_ATTRIB`. For +`shallow`, the solution was to replace the manual manipulation of +attributes with +[`SHALLOW_DUPLICATE_ATTRIB`][datatable_assign_SHALLOW_ATTRIB] (API, +available since 3.3.0), which itself takes care of invariants like the +object bit and the S4 bit. + +The `keepattr` function is only used in +[`growVector`][datatable_dogroups_grow_keepattr] to transplant all +attributes from a vector to its enlarged copy without duplicating them, +for which no API exists. The solution is to +[use `Rf_asS4` to control the S4 object bit][remove_set_s4_object], +knowing that the new vector is freshly allocated and thus cannot be +shared yet. + +**Status** in `data.table`: fixed in [#6183][remove_set_s4_object] and +[#6264]. + +Converting between calls and pairlists: `SET_TYPEOF` +---------------------------------------------------- + +In R, [function calls][call] are internally represented as Lisp-style +pairlists where the first pair is of the special type `LANGSXP` instead of +`LISTSXP`. 
For example, the following diagram illustrates the data +structure of the call `print(x = 42L)`: + +![](langsxp.svg){width=40em} + +Here, every list item is a separate R object, a "cons cell"; each cell +contains the value in its `CAR` field and a reference to the rest of the +list in its `CDR` field. Argument names, if provided, are stored in the +third field, `TAG`. The list is terminated by `R_NilValue`, which is of +type `NILSXP`. These structures must be constructed every time C code +wants to evaluate a function call ([e.g.][datatable_rbindlist_eval]). + +Previously, R API contained a function to allocate `LISTSXP` pairlists +of arbitrary length, `allocList()`, but not function calls, so it became +a somewhat common idiom to first allocate the list and then use +`SET_TYPEOF` to change the type of the head pair to `LANGSXP`. This +did not previously lead to problems, since the two types have the same +internal memory layout. + +The danger of `SET_TYPEOF` lies in the possibility to set the type of an +R value to one with an incompatible memory layout. (For example, vector +types `REALSXP` and `INTSXP` are built very differently from cons cells +`LISTSXP` and `LANGSXP`.) Starting with R-4.4.1, [R contains the +`allocLang` function in addition to the `allocList` function][WRE_call] +that directly allocates a function call object with a head pair of type +`LANGSXP`. In order to stay compatible with previous R versions, +packages may [allocate the `LISTSXP` tail first and then use `lcons()` +to construct the `LANGSXP` head pair of the call][remove_set_typeof]. 
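
The "allocate the tail, then cons a call head onto it" idiom can be sketched without R itself. The block below is a toy model in plain C, not R's implementation: the `toy_cell` struct and the `toy_cons`, `toy_alloc_list`, `toy_lcons` and `toy_alloc_call` functions are hypothetical stand-ins for cons cells, `allocList()` and `lcons()`. It only illustrates that the head cell can be created with the right type from the start, so no cell is ever retagged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy type tags standing in for LISTSXP and LANGSXP (hypothetical). */
enum toy_type { TOY_LISTSXP, TOY_LANGSXP };

/* A toy cons cell: a value (CAR) and the rest of the list (CDR). */
struct toy_cell {
    enum toy_type type;
    const char *car;
    struct toy_cell *cdr;   /* NULL plays the role of R_NilValue */
};

struct toy_cell *toy_cons(enum toy_type type, const char *car,
                          struct toy_cell *cdr) {
    struct toy_cell *c = malloc(sizeof *c);
    if (!c) abort();
    c->type = type;
    c->car = car;
    c->cdr = cdr;
    return c;
}

/* Like allocList(n): n cells, all tagged as ordinary list cells. */
struct toy_cell *toy_alloc_list(size_t n) {
    struct toy_cell *head = NULL;
    while (n--) head = toy_cons(TOY_LISTSXP, NULL, head);
    return head;
}

/* Like lcons(car, cdr): one new head cell, tagged as a call from birth. */
struct toy_cell *toy_lcons(const char *car, struct toy_cell *cdr) {
    return toy_cons(TOY_LANGSXP, car, cdr);
}

/* Build a call with n >= 1 elements; the type of a cell never changes. */
struct toy_cell *toy_alloc_call(size_t n) {
    if (n == 0) return NULL;
    return toy_lcons(NULL, toy_alloc_list(n - 1));
}

size_t toy_length(const struct toy_cell *c) {
    size_t n = 0;
    for (; c; c = c->cdr) ++n;
    return n;
}

void toy_free(struct toy_cell *c) {
    while (c) {
        struct toy_cell *next = c->cdr;
        free(c);
        c = next;
    }
}
```

In this model, `toy_alloc_call(3)` yields one `TOY_LANGSXP` head followed by two `TOY_LISTSXP` cells, which is the shape a two-argument call would have.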
+ + +Problem (the only instance in `data.table`): + +```c + SEXP s = PROTECT(allocList(2)); + SET_TYPEOF(s, LANGSXP); +// ^^^^^^^^^^ unsafe operation, could be used to corrupt objects + SETCAR(s, install("format.POSIXct")); + SETCAR(CDR(s), column); +``` + +Solutions: + +```c +// for fixed-size calls with contents known ahead of time +SEXP s = lang2(install("format.POSIXct"), column); +``` +or: +```c +// partially pre-populate +SEXP s = lang2(install("format.POSIXct"), R_NilValue); +// later, when 'column' is known: +SETCAR(CDR(s), column); +``` +or: +```c +// allocate a call with 'n' elements +SEXP call = lcons(R_NilValue, allocList(n - 1)); +``` +or: +```c +// in R >= 4.4.1 only: +SEXP call = allocLang(n); +``` + +Unfortunately, the `LCONS` macro didn't work with `#define R_NO_REMAP` +prior to R-4.4, because it expanded to `lcons()` instead of +`Rf_lcons()`. + +**Status** in `data.table`: fixed in [#6313][remove_set_typeof]. + +Strings as C arrays of `CHARSXP` values: `STRING_PTR` +----------------------------------------------------- + +From the point of view of R code, strings are very simple things, much +like numbers: they live in atomic vectors and can be directly compared +with other objects. It is only natural to desire to work with them as +easily from C code as it's possible with other atomic types, where +functions `REAL()`, `INTEGER()`, or `COMPLEX()` can be used to access +the buffer containing the numbers. + +The underlying reality of strings is more complicated: since they +internally manage memory buffers containing text in a given encoding, +they must be subject to garbage collection. Like other managed objects +in R, they are represented as `SEXP` values of special type `CHARSXP`. +R's garbage collector is [generational and requires the use of write +barrier][RI17] ([1][Tierney_gengc], [2][Tierney_writebr]) any time a +`SEXP` value (such as an `STRSXP` vector) references another `SEXP` +value (such as a `CHARSXP` string). 
In a generational garbage collector, +"younger" generations are marked and swept more frequently than +"older" ones, because in a typical R session, most objects are temporary +[@Jones2012, chapter 9]. If package C code manually writes a reference +to a "young" `CHARSXP` object into an "old" `STRSXP` vector without +taking generations into account, a subsequent collection of the "young" +pool of objects will miss the `CHARSXP` being referenced by the "old" +`STRSXP` and remove the `CHARSXP` as "garbage". This makes the `SEXP *` +pointers returned by `STRING_PTR` unsafe and requires the use of +the `STRING_PTR_RO` function, which returns a read-only `const SEXP *`. + +Thankfully, `data.table` has already been using read-only `const SEXP *` +pointers when working with `STRSXP` vectors, so the required changes to +the code were [not too substantial][remove_string_ptr], limited to +the name of the function. + +Example of the problem: + +```c +const SEXP *sourceD = STRING_PTR(source); +// ^^^^^^^^^^ +// returns a writable SEXP * pointer, therefore unsafe +``` + +Solution: + +```c +#if R_VERSION < R_Version(3, 5, 0) +// STRING_PTR_RO only appeared in R-3.5 +#define STRING_PTR_RO(x) (STRING_PTR(x)) +#endif + +// later: +const SEXP *sourceD = STRING_PTR_RO(source); +// ^^^^^^^^^^^^^ +// returns a const SEXP * pointer, which prevents accidental writes +``` + +**Status** in `data.table`: fixed in [#6312][remove_string_ptr]. +See also: [PR18775]. + +Reading the reference counts: `NAMED` {#NAMED} +------------------------------------- + +In plain R, all value types -- numbers, strings, lists -- have +pass-by-value semantics. 
Without dark and disturbing things in play, such +as non-standard evaluation or active bindings, R code can give a plain +value (`x <- 1:10`) to a function (`f(x)`) or store it in a variable (`y +<- x`), have the function modify its argument (`f <- \(x) { x[1] <- 0 +}`) or change the duplicate variable (`y[2] <- 3`), and still have the +original value intact (`stopifnot(identical(x, 1:10))`). Only the +inherently mutable types, such as environments, external pointers and +weak references, will stay shared between all assignments and function +arguments; the value types behave as if R copies them every time. + +And yet actually making these copies is wasteful when the code only +reads the variable and does not alter it. (In fact, one of the original +motivations of `data.table` was to reduce certain wasteful copying of +data that happens during normal R computations.) Until version 4.0.0, +`NAMED` was R's mechanism to save memory and CPU time instead of +creating and storing these copies. A temporary object such as the value +of `1:10` was not bound to a symbol and thus could be modified right +away. Assigning it to a variable, as in `x <- 1:10`, gave it a +`NAMED(x)` count of 1, for which R had an internal optimisation in +replacement function calls like `foo(x) <- 3`. Assigning the same value +to yet another symbol (by copying `y <- x` or calling a function +`foo(x)`) increased the `NAMED()` count to 2 or more, for which there +was no optimisation: in order to modify one of the symbols, R was +required to duplicate `x` first. `NAMED()` was not necessarily decreased +after the bindings disappeared, and decreasing it after having reached +`NAMEDMAX` was impossible. During the lifetime of R-3.x, `NAMEDMAX` was +increased from 2 to 3 and later to 7. + +Between R-3.1.0 and R-4.0.0, R [migrated from `NAMED` to reference +counting][Tierney_refcnt]. 
Reference counts are easier to properly +decrement than `NAMED`, thus preventing unneeded copies of objects that +became unreferenced. R-3.5.0 [documented the symbols][Rnews_setnamed] +`MAYBE_REFERENCED(.)` / `NO_REFERENCES(.)` for use instead of checking +`NAMED(.) == 0`, `MAYBE_SHARED(.)` / `NOT_SHARED(.)` instead of checking +`NAMED(.) > 1`, and `MARK_NOT_MUTABLE(.)` instead of setting `NAMED(.)` +to `NAMEDMAX`, which later became part of the API instead of the +`NAMED(.)` and `REFCNT(.)` functions. The hard rules are that a value is +safe to modify in place if it has `NO_REFERENCES()` (reference count of +0), definitely unsafe to modify in place (requiring a call to +`duplicate` or `shallow_duplicate`) if it is `MAYBE_SHARED()` (reference +count above 1), and almost certainly unsafe to modify in place if it is +`MAYBE_REFERENCED()` (reference count of 1). + +`data.table`'s only uses of `NAMED()` were in the [verbose output during +assignment][remove_named]: + +```c +if (verbose) { + Rprintf(_("RHS for item %d has been duplicated because NAMED==%d MAYBE_SHARED==%d, but then is being plonked. length(values)==%d; length(cols)==%d)\n"), + i+1, NAMED(thisvalue), MAYBE_SHARED(thisvalue), length(values), length(cols)); + // ^^^^^ non-API function +} +``` + +Since the correctness of the modification operation hinges on the +reference count being 0 (and it may be important whether it's exactly 1 +or above 1), the same amount of _useful_ information can be conveyed by +printing `MAYBE_REFERENCED()` and `MAYBE_SHARED()` instead of `NAMED()`: + +```c +if (verbose) { + Rprintf(_("RHS for item %d has been duplicated because MAYBE_REFERENCED==%d MAYBE_SHARED==%d, but then is being plonked. length(values)==%d; length(cols)==%d)\n"), + i+1, MAYBE_REFERENCED(thisvalue), MAYBE_SHARED(thisvalue), length(values), length(cols)); + // ^^^^^^^^^^^^^^^^ API function +} +``` + +**Status** in `data.table`: fixed in [#6420][remove_named]. 
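
The rules for in-place modification can be sketched as a conservative "copy unless unreferenced" helper. This is a toy model in plain C under stated assumptions, not R's implementation: `struct toy_vec` and every `toy_*` function are hypothetical, and the three predicates merely mimic the meaning of `NO_REFERENCES()`, `MAYBE_REFERENCED()` and `MAYBE_SHARED()` described in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A toy value with an explicit reference count (hypothetical). */
struct toy_vec {
    int refcnt;
    size_t len;
    int *data;
};

/* Analogues of the predicates named above. */
int toy_no_references(const struct toy_vec *v)    { return v->refcnt == 0; }
int toy_maybe_referenced(const struct toy_vec *v) { return v->refcnt > 0; }
int toy_maybe_shared(const struct toy_vec *v)     { return v->refcnt > 1; }

/* Allocate a zero-filled, so far unreferenced vector. */
struct toy_vec *toy_alloc(size_t len) {
    struct toy_vec *v = malloc(sizeof *v);
    if (!v) abort();
    v->refcnt = 0;
    v->len = len;
    v->data = calloc(len ? len : 1, sizeof *v->data);
    if (!v->data) abort();
    return v;
}

/* A fresh copy starts its life with no references to it. */
struct toy_vec *toy_duplicate(const struct toy_vec *v) {
    struct toy_vec *copy = toy_alloc(v->len);
    memcpy(copy->data, v->data, v->len * sizeof *v->data);
    return copy;
}

/* Conservative rule: write in place only when nothing refers to v. */
struct toy_vec *toy_writable(struct toy_vec *v) {
    return toy_no_references(v) ? v : toy_duplicate(v);
}

void toy_vec_free(struct toy_vec *v) {
    free(v->data);
    free(v);
}
```

A caller would write into the result of `toy_writable(x)` rather than into `x` itself. The helper copies even at a count of exactly 1, which errs on the side of the "almost certainly unsafe" rule above.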
+ +Encoding bits: `LEVELS` +----------------------- + +`LEVELS` is the name of the internal R [macro][LEVELS_macro] and an +exported non-API [function][LEVELS_function] accessing a [16-bit field +called `gp`][LEVELS_field] ([general-purpose][RI112]) that is present in +the header of every `SEXP` value. Not every access to this field is +done using the `LEVELS()` macro; there are bits of R code that access +`(sexp)->sxpinfo.gp` directly. R uses this field for many purposes: + + * matching given arguments against the formals of a function + ([1][gp_for_match1], [2][gp_for_match2], [3][gp_for_match3]) + * remembering the previous [type][gp_for_gc] of a garbage-collected value + * [finalizing][gp_for_finalize] the reference-semantics objects before + garbage-collecting them + * [marking][gp_for_calling] condition handlers as "calling" (executing + on top of where the condition was signalled in the call stack), as + opposed to "non-calling" (executing at the site of the `tryCatch` + call) + * [marking][gp_for_assignment] objects in complex assignment calls + * storing the [S4 object bit][gp_for_s4] + * [marking][gp_for_jit] functions as (un)suitable for bytecode + compilation + * [marking][gp_for_growable] vectors as growable + * [marking][gp_for_missing] provided ("actual") function arguments as + [missing][gp_for_missing2] + * [marking][gp_for_ddval] the `..1`, `..2`, etc symbols as + corresponding to the [given element of the `...` + argument][Rhelp_dots] + * [marking][gp_for_env] environments as [locked][envflags_locked], or + for [caching][envflags_global] the global variable lookup, or for + looking up values in the base environment or the special functions + ([1][gp_for_basesym], [2][basesym2], [3][gp_for_special], + [4][specialsym2]) + * [marking][gp_for_hashash] symbols naming environment contents for + [hash lookup][hashash2] + * [marking][gp_for_active] bindings inside environments as + [active][active_binding] + * [marking][gp_for_promsxp] promise objects as 
already evaluated + * [marking][gp_for_charsxp] `CHARSXP` values as present in the global + cache or being in a certain encoding + +Although the value of `gp` is directly stored in R's serialized data +stream, none of these uses is part of the API. Out of all possible uses +for this field, `data.table` is only interested in string encodings. From +the viewpoints of [plain R][R_Encoding] and the [C API][WRE_encoding], +an individual string (`CHARSXP` value) can be marked with the following +encodings: + +R-level encoding name | C-level encoding constant | Meaning +:----------------:|:----------------:|------------------------------ +`"latin1"` | `CE_LATIN1` | ISO/IEC 8859-1 or CP1252 +`"UTF-8"` | `CE_UTF8` | ISO/IEC 10646 +`"unknown"` | `CE_NATIVE` | Encoding of the current locale +`"bytes"` | `CE_BYTES` | Not necessarily text; `translateChar` will fail + +Internally, R also [marks strings as encoded in ASCII][R_SET_ASCII]: +since all three text encodings are ASCII-compatible, an ASCII string will +never need to be translated into a different encoding. Note that there +is a subtle difference between a string _marked_ as being in a certain encoding +and actually _being_ in a certain encoding: in an R session running with +a UTF-8 locale (which includes most modern Unix-alikes and Windows ≥ +10, November 2019 update) a string marked as `CE_NATIVE` will also be in +UTF-8. (Similarly, with an increasingly rare Latin-1 locale, a +`CE_NATIVE` string will be in Latin-1.) + +The `data.table` code is interested in knowing whether a string is +[marked as UTF-8, Latin-1, or ASCII][datatable_isencoded]. This is used +to [convert strings to UTF-8 when needed][datatable_needUTF8] (also: +[output to native encoding or UTF-8 in +`fwrite`][datatable_ENCODED_CHAR], [automatic conversion in +`forder`][datatable_anynotascii]). 
The `getCharCE` API function appeared +in R-2.7.0 together with the encoding support, so switching the +`IS_UTF8` and `IS_LATIN` macros from `LEVELS` to API calls [was +relatively straightforward][datatable_levels1]. + +R-4.5.0 is expected to introduce the `charIsASCII` "experimental" API +function that returns the value of the ASCII marker for a `CHARSXP` +value, which [will replace the use of `LEVELS` in the `IS_ASCII` +macro][remove_levels]. Curiously, while it looks like the code could +benefit from switching from the `getCharCE()` tests (which only look at +the value of the flags and so may needlessly translate strings from +`CE_NATIVE`) to the new experimental `charIs(UTF8|Latin1)` functions +that will also return `TRUE` for a matching native encoding, actually +making the change breaks a number of unit tests. + +**Status** in `data.table`: partially fixed in +[#6420][datatable_levels1], waiting for R-4.5.0 to be released with the +new API in [#6422][remove_levels]. + +`SETLENGTH`, `SET_GROWABLE_BIT`, `(SET_)TRUELENGTH` +--------------------------------------------------- + +### Growable vectors + +Since `data.frame`s and `data.table`s are lists, and lists in R are +value types with pass-by-value semantics, adding or +removing a column to one normally involves allocating a new list +referencing the rest of the columns (performing a "shallow duplicate"). +By contrast, the [over-allocated lists][datatable_overallocation] can be +resized in place by gradually increasing their `LENGTH` (remembering +their original length in the `TRUELENGTH` field), obviating the need for +shallow duplicates at the cost of making `data.table`s shared, +by-reference values. The change has been introduced in [v1.7.3, November +2011][news173], together with the `:=` operator for changing the columns +by reference (which has since become [the defining feature of +data.table][datatable_logo]). + +R's own use of `TRUELENGTH` is [varied][RI113]. 
The field itself +appeared in [R-0.63][R_truelength] together with the `VECSXP` lists (to +replace the Lisp-style linked pairlists). R [uses this +field][R_hashvalue] in `CHARSXP` strings to store the hash values +[belonging to symbols][R_install_truelen]. R's many `VECSXP`-based hash +tables use it to count the primary slots in use: hashes are used for +reference tracking during (un)serialization ([1][R_serialize_hash], +[2][R_saveload_hash]) and looking up environment contents +([1][R_envir_hashpri], [2][R_envir_hashval]). R-3.3 (May 2016) saw the +inclusion of [radix sort][R_radixsort] from `data.table` itself, which +uses `TRUELENGTH` to sort strings. R-3.4 +(April 2017) [introduced][R_growable] over-allocation when growing +vectors due to assignment outside their bounds. The [growable +bit][gp_for_growable] was added to prevent the mismanagement of the +allocated memory counter: without the bit set on the over-allocated +vectors, the garbage collector only counted `LENGTH(x)` instead of +`TRUELENGTH(x)` units as released when garbage-collecting the vector, +inflating the counter over time. [ALTREP] objects introduced in R-3.5 +(April 2018) don't have a `TRUELENGTH`: it [cannot be +set][R_altrep_set_truelen] and is [returned as 0][R_altrep_truelen]. In +very old versions of R, `TRUELENGTH` wasn't initialised, but it is +nowadays set to 0, which `data.table` [depends +upon][datatable_init_testtl]. 
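
The idea of growing an over-allocated list by reference can be sketched as follows. This is a toy model in plain C and not `data.table`'s or R's code: `struct toy_list` and the `toy_*` functions are hypothetical, columns are reduced to plain `int` values, and the two size fields only mimic the roles that `LENGTH` and `TRUELENGTH` play for an over-allocated `data.table`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A toy over-allocated list (hypothetical). */
struct toy_list {
    size_t length;      /* visible length, in the role of LENGTH */
    size_t truelength;  /* allocated slots, in the role of TRUELENGTH */
    int *cols;          /* column payloads, simplified to ints */
};

/* Allocate n visible slots plus `spare` hidden ones, all zero-filled. */
struct toy_list *toy_overalloc(size_t n, size_t spare) {
    size_t cap = n + spare;
    struct toy_list *l = malloc(sizeof *l);
    if (!l) abort();
    l->cols = calloc(cap ? cap : 1, sizeof *l->cols);
    if (!l->cols) abort();
    l->length = n;
    l->truelength = cap;
    return l;
}

/* Append by reference: no new list is allocated and the address of the
 * storage stays the same. Returns 1 on success, or 0 when no spare slot
 * is left and the caller would have to reallocate. */
int toy_append(struct toy_list *l, int col) {
    if (l->length >= l->truelength) return 0;
    l->cols[l->length++] = col;
    return 1;
}

void toy_list_free(struct toy_list *l) {
    free(l->cols);
    free(l);
}
```

Every holder of the same `struct toy_list *` sees the appended column, which is the shared, by-reference behaviour that over-allocation trades for avoiding shallow duplicates.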
+ +Nowadays, `data.table` uses vectors whose length is different from their +allocated size in many places: + +* `src/dogroups.c` + * reuses the same memory for the [`data.table` subset object + `.SD`][datatable_docols_SD] and for the [virtual row number column + `.I`][datatable_docols_I] by shortening the vectors to the size of + the current group + * later [restores their natural length][datatable_docols_restore] + * [extends the `data.table` for new columns][datatable_docols_extend] + as needed +* `src/freadR.c` + * works with an over-estimated line count and so can [truncate the + columns][datatable_freadR_truncate] after the value is known + precisely + * the columns are [prepared to be truncated][datatable_freadR_settl] + * may also [drop columns by reference][datatable_freadR_drop] +* `src/subset.c` + * the `subsetDT` function [prepares an over-allocated + `data.table`][datatable_subset_alloc] together with its names. +* `src/assign.c` + * the `shallow` function [prepares][datatable_assign_shallow] an + over-allocated `data.table` referencing the columns of an existing + `data.table` + * `assign` [creates][datatable_assign_create] or + [removes][datatable_assign_remove] columns by reference + * `finalizer` causes an `INTSXP` vector [with the fake + length][datatable_assign_finalizer] to be (eventually) + garbage-collected to fix up a discrepancy in R's vector size + accounting caused by the existence of the over-allocated + `data.table` + +`SETLENGTH` presents many opportunities to create inconsistencies within +R: + +* When copying shortened objects without the `GROWABLE_BIT` set, R + allocates and copies only `XLENGTH` elements and [duplicates the value + of `TRUELENGTH`][R_duplicate_truelength]. + * For this and other reasons, `data.table`s have a special + [`.internal.selfref` attribute][datatable_assign_selfref] attached + containing an `EXTPTR` back to itself. A copy of a table can be + detected because it will have a different address. 
+ * The `_selfrefok` function tries to [restore the correct + `TRUELENGTH`][datatable_assign_selfrefok] if it detects a copy. + * Setting the `GROWABLE_BIT` on the `data.table` would make R keep the + default `TRUELENGTH` (0) instead of copying it. +* When deallocating shortened objects without the `GROWABLE_BIT` set, R + [accounts only for the `XLENGTH` elements][R_memory_getVecSize] being + released, over-counting the total amount of allocated memory. + * `data.table` compensates for this using the + [`finalizer`][datatable_assign_finalizer] on the `.internal.selfref` + attribute. + * Setting the `GROWABLE_BIT` on the `data.table` would make R account + for `TRUELENGTH` elements instead of `XLENGTH` elements. + +Unfortunately, `GROWABLE_BIT` is not part of the API and was only +introduced in R-3.4, so it does not present a full solution to these +problems. Moreover, + +* Setting `LENGTH` larger than the allocated length may cause R to + access undefined or even unmapped memory. +* For vectors containing other `SEXP` values (of type `VECSXP`, + `EXPRSXP`, `STRSXP`): when reducing the `LENGTH`, having a + non-persistent value (anything other than the persistent values + `R_NilValue`, `R_BlankString`, or `R_NaString` provided by R itself) + in the newly inaccessible cells will also make that value unreachable from + the viewpoint of the garbage collector, potentially prompting it to + reuse or unmap the pointed-to memory. 
Increasing the `LENGTH` again + with invalid pointers in the newly accessible slots will produce an + invalid vector that cannot be safely altered or discarded: + + ```c + #include <R.h> + #include <Rinternals.h> + void foo(void) { + { + SEXP list = PROTECT(allocVector(VECSXP, 100)), elt; + SET_VECTOR_ELT(list, 99, elt = allocVector(REALSXP, 1000)); + + double * p = REAL(elt); // initialise the vector + for (R_xlen_t i = 0; i < xlength(elt); ++i) p[i] = i; + + SETLENGTH(list, 1); // elt is unreachable + R_gc(); // elt is collected + SETLENGTH(list, 100); // invalid elt is reachable again + R_gc(); // invalid elt is accessed + UNPROTECT(1); + } + R_gc(); // crash here if not above + } + ``` + +[Starting with R-4.3][R_PR17620], R packages can implement their own +`VECSXP`-like objects using the [ALTREP] framework; `STRSXP` objects +have been supported since R-3.5. An `ALTREP` object is defined by its +_class_ (a collection of methods) and two arbitrary R values, `data1` +and `data2`. (Attributes are not a part of the ALTREP representation and +exist the same way as on normal R objects.) A simple implementation of a +shortened, expandable vector could hold a full-length vector in the +`data1` slot and its pretend-length as a one-element `REALSXP` value in +the `data2` slot. (Currently, `R_xlen_t` values are limited to +$2^{52}$, so every valid length can be represented exactly in an IEEE +`double` value.) The over-allocated class will need to implement the +following methods: + +* [Common ALTREP methods][Rapi_altrep_methods]: + * `Length()`, returning the pretend-length of the vector. Required. + * `Duplicate(deep)`. If not implemented, R will create a copy as an + ordinary object using the length and the data pointer provided by + the class. 
+ * There is also `DuplicateEX(deep)`, which is responsible for + copying the value _and_ the attributes, but it may be hard to + implement within the API bounds (`ATTRIB` is not API), and R + provides a default implementation that calls `Duplicate` above. + * Shared mutable vectors [can cause problems][Tierney_mutable], so a + decision to let the `Duplicate()` return value share the original + vector will require a lot of thought and testing. + * `Serialized_state()`, `Unserialize(state)`. If not implemented, R + will serialize the value as an ordinary object, which is what + currently happens for `data.table`s. Once an R package implements an + ALTREP class with a `Serialized_state` method, the format is set in + stone; any changes will have to introduce a new class. + * Similarly, there is `UnserializeEX(state, attr, + objf, levs)` responsible for setting `LEVELS`, the object bit, and + the attributes; the default implementation should suffice. + * `Inspect(pre, deep, pvec, inspect_subtree)`. May `Rprintf` some + information from the ALTREP fields before returning `FALSE` to let R + continue `inspect`ing the object. +* [Common `altvec` methods][Rapi_altvec_methods] required for most code + to work with the class: + * `Dataptr(writable)`, returning the pointer to the start of the array + backing the underlying vector. For `VECSXP` or `STRSXP` vectors, + `writable` should always be `FALSE`, so `DATAPTR_RO` can be used. + * `Dataptr_or_null()`. May delegate to `Dataptr(FALSE)` above. + * `Extract_subset(indx, call)`. Must allocate and return `x[indx]` for + 1-based numeric `indx` that may be outside the bounds of `x`. +* Class-specific methods: + * [`altstring` methods][Rapi_altstring_methods]: + * `Elt(i)`. Must return `x[[i]]` for 0-based `i` or signal an error. + Required. + * `Set_elt(i, v)`. Must perform `x[[i]] <- v` for 0-based `i` or + signal an error. Required. + * `Is_sorted()`. If not implemented, will always return + `UNKNOWN_SORTEDNESS`. + * `No_NA()`. 
If not implemented, will always return 0 (unknown whether + the vector contains missing values). + * [`altlist` methods][Rapi_altlist_methods]: + * `Elt(i)` and `Set_elt(i, v)` like above. + * The rest of the atomic vector types ([integer][Rapi_altinteger], + [logical][Rapi_altlogical], [numeric][Rapi_altreal], + [complex][Rapi_altcomplex], [raw][Rapi_altraw]) will each need a + subset of the following methods: + * `Elt(i)`, `Is_sorted()`, `No_NA()`, as above. + * `Get_region(i, n, buf)` to populate the buffer `buf[n]` of the + corresponding C type with up to `n` elements starting at the + 0-based index `i`. The requested range is not guaranteed to be + within bounds; the number of actually copied elements must be + returned. If not implemented, R will use the `Elt(i)` method, + which is slower. + * `Sum(narm)`, `Min(narm)`, `Max(narm)` to compute a summary of the + vector, optionally ignoring the missing values. If not + implemented, R will use `Get_region` to compute the summaries. + +Additionally, `data.table` will need a function to [create new ALTREP +tables][Rapi_new_altrep] and to resize the vector in place. The resize +function will need to check whether the given value +[`R_altrep_inherits`][Rapi_altrep_inherits] from the over-allocated class +and then modify the ALTREP data slots as needed. The function may even +reallocate the payload to enlarge the vector in place past the original +allocation limit without requiring a `setDT` call from the user. Since a +reallocation will invalidate the data pointer, it must only be used from +inside `data.table`, not from the ALTREP methods. + +The original implementation that uses `SETLENGTH` can be kept behind +`#if R_VERSION < R_Version(4, 3, 0)` for backwards compatibility.
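The resize logic described above can be modelled without the R API. The following is a minimal sketch in plain C, not ALTREP code: the names `growable`, `growable_init`, `growable_resize` and `growable_free` are hypothetical, `payload` and `capacity` stand in for the full-length vector that would live in the `data1` slot, and `length` stands in for the pretend-length that would live in `data2`. It only illustrates when a resize stays within the existing allocation and when the payload has to be reallocated (which invalidates any previously obtained data pointer).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for the two ALTREP data slots.
typedef struct {
    double *payload;  // models the full-length vector in data1
    size_t capacity;  // allocated number of elements
    size_t length;    // models the pretend-length in data2
} growable;

// Allocate a zero-filled payload. Returns 0 on success, -1 on failure.
static int growable_init(growable *g, size_t length, size_t capacity) {
    if (capacity < length) capacity = length;
    if (capacity == 0) capacity = 1;
    g->payload = calloc(capacity, sizeof *g->payload);
    if (!g->payload) return -1;
    g->capacity = capacity;
    g->length = length;
    return 0;
}

// Change the pretend-length. Within the allocation this only updates
// `length` (cells hidden by an earlier shrink become visible again with
// their old contents). Past the allocation the payload is reallocated
// with a doubled capacity, so old data pointers must not be reused.
static int growable_resize(growable *g, size_t new_length) {
    assert(g->length <= g->capacity);
    if (new_length > g->capacity) {
        size_t max_elts = SIZE_MAX / sizeof *g->payload;
        if (new_length > max_elts) return -1;
        size_t new_capacity = g->capacity;
        while (new_capacity < new_length)
            new_capacity = new_capacity > max_elts / 2 ? max_elts
                                                       : new_capacity * 2;
        double *p = realloc(g->payload, new_capacity * sizeof *p);
        if (!p) return -1;
        // zero-fill the newly allocated tail
        memset(p + g->capacity, 0, (new_capacity - g->capacity) * sizeof *p);
        g->payload = p;
        g->capacity = new_capacity;
    }
    g->length = new_length;
    return 0;
}

static void growable_free(growable *g) {
    free(g->payload);
    g->payload = NULL;
    g->capacity = g->length = 0;
}
```

In an actual ALTREP class the same decision would be made by the package-internal resize function, with the `Length()` method merely reporting the stored pretend-length.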
+ +Replacing `TRUELENGTH`-based growable vectors with `ALTREP`-based ones +will conform to the API, allow growing the vector in place, and avoid +the various inconsistencies that happen when R duplicates or deallocates +these vectors, but also has the following downsides: + + * Every place in `data.table` that uses growable vectors will have to + be refactored to use the new abstraction layer (`SETLENGTH` in R < + 4.3, ALTREP in R ≥ 4.3). + * Both implementations will have to be maintained as long as + `data.table` supports R < 4.3. + * The current implementation in `data.table` re-creates ALTREP + objects as ordinary ones precisely because it's impossible to + `SET_TRUELENGTH` on ALTREP objects. This will also need to be + refactored. + * The data pointer access is slower for ALTREP vectors than for + ordinary vectors: having checked the ALTREP bit in the header, R will + have to access the method table and call the method instead of adding + a fixed offset to the original `SEXP` pointer. This shouldn't be + noticeable unless `data.table` puts data pointer access inside a + "hot" loop. + * For numeric ALTREP classes, ALTREP-aware operations that use + `*_GET_REGION` instead of the data pointer will become slower unless + the class implements a `Get_region` method. + +**Status** in `data.table`: not fixed yet. + +### Fast string matching {#TRUELENGTH-mark} + +`data.table`'s use of `TRUELENGTH` is not limited to growable buffers. A +common idiom is to set the `TRUELENGTH`s of `CHARSXP` values from a +vector to their negative 1-based indices in that vector, then look up +other `CHARSXP`s in the original vector using `-TRUELENGTH(s)`. This +technique relies on [R's `CHARSXP` cache][RI110]: for the given string +contents and encoding, only one copy of a string created by +`mkCharLenCE` (and related functions) will exist in the memory. 
As a +result, if a string does exist in the original vector, it will be the +_same_ `CHARSXP` object whose `TRUELENGTH` had been set to its negative +index. R does not currently set negative `TRUELENGTH`s by itself, so any +positive `TRUELENGTH`s can be safely discarded as non-matches. + +In the best case scenario, this lookup is very fast: for a table of size +$n$ and $k$ strings to look up in it, it takes $\mathrm{O}(1)$ memory +(the `TRUELENGTH` is already there, unused) and $\mathrm{O}(n)$ time for +overhead plus $\mathrm{O}(k)$ time for the actual lookups. + +Care must be taken for the technique to work properly: + +* The strings must be converted to UTF-8. Two copies of the same text in + different encodings will be stored in different objects at different + addresses, preventing the technique from working: + ```r + packageVersion('data.table') + # [1] ‘1.16.99’ + x <- data.table(factor(rep(enc2utf8('ø'), 3))) + # memrecycle() forgot to account for encodings + x[1,V1 := iconv('ø', to='latin1')] + as.numeric(x$V1) + # [1] 2 1 1 + levels(x$V1) # duplicated levels! + # [1] "ø" "ø" + identical(levels(x$V1)[[1]], levels(x$V1)[[2]]) + # [1] TRUE + levels(x$V1) <- levels(x$V1) + levels(x$V1) # R restores unique levels + # [1] "ø" + ``` +* Any non-zero `TRUELENGTH` values resulting from R-internal usage must + be [saved][datatable_assign_savetl] beforehand and restored + afterwards. +* The `TRUELENGTH`s are used to look up variables in hashed + environments, so R code should not run while the values are disturbed. + +The encoding conversions take $\mathrm{O}(n+k)$ time and space; the `TRUELENGTH` +bookkeeping takes $\mathrm{O}(n)$ space and time (thanks to the exponential +`realloc` trick). 
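The marking idiom can be illustrated outside of R. The sketch below is a simplified model in plain C, not `data.table` code: the `interned` struct is a hypothetical stand-in for a cached `CHARSXP` (one object per distinct string), its `truelength` field plays the role of `TRUELENGTH`, and the function names are made up for this example. Unlike `savetl`, which records only the non-zero values it is about to overwrite, the model saves every original value.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for a cached CHARSXP with a spare integer field.
typedef struct {
    const char *text;
    long truelength;  // assumed to be non-negative before marking
} interned;

// Save the existing field values into `saved` (room for n values), then
// mark every table entry with its negative 1-based index. Marking runs
// backwards so that the first occurrence of a duplicate wins.
static void mark_table(interned **table, size_t n, long *saved) {
    for (size_t i = 0; i < n; ++i) saved[i] = table[i]->truelength;
    for (size_t i = n; i-- > 0;) table[i]->truelength = -(long)(i + 1);
}

// Return the 1-based position of `s` in the marked table, or 0 if absent.
// This works only because equal strings are the same object.
static size_t lookup(const interned *s) {
    return s->truelength < 0 ? (size_t)(-s->truelength) : 0;
}

// Put the original field values back.
static void restore_table(interned **table, size_t n, const long *saved) {
    for (size_t i = 0; i < n; ++i) table[i]->truelength = saved[i];
}
```

Marking and restoring each take one pass over the table, and every lookup is a single field read, which matches the $\mathrm{O}(n)$ overhead and $\mathrm{O}(k)$ lookup time stated above.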
+ +The fast string lookup is used in the following places: + +* `src/assign.c`: [factor level merging in + `memrecycle`][datatable_assign_memrecycle], [`savetl` + helper][datatable_assign_savetl] +* `src/rbindlist.c`: [matching column + names][datatable_rbindlist_matchcolumns], [matching factor + levels][datatable_rbindlist_matchfactors] +* `src/forder.c`: (different purpose, same technique) [storing the + group numbers][datatable_forder_truelen], [looking them + up][datatable_forder_truelen], [restoring the original + values][datatable_forder_free_ustr] +* `src/chmatch.c`: [saving the original + `TRUELENGTH`s][datatable_chmatch_savetl], [remembering the positions + of `CHARSXP`s in the table][datatable_chmatch_settl], [cleaning up on + error][datatable_chmatch_cleanup1], [looking up strings in the + table][datatable_chmatch_lookup], [cleaning up before + exit][datatable_chmatch_cleanup2] +* `src/fmelt.c`: [combining factor levels by merging their `CHARSXP`s in + a common array with indices in `TRUELENGTH`][datatable_fmelt_truelen] + +Since there doesn't seem to be any intent to allow using R API to place +arbitrary integer values inside unused `SEXP` fields, `data.table` will +have to look up the `CHARSXP` values using the externally available +information. Performing $\mathrm{O}(nk)$ direct pointer comparisons would scale +poorly, so for an $\mathrm{O}(1)$ individual lookup `data.table` could build a +hash table of `SEXP` pointers. While pointer hashing [isn't strictly +guaranteed by the C standard to work][Wellons_hashptr], it has been used +[in R itself][R_unique_PTRHASH]. A hash table for $n$ `CHARSXP` pointers +would need $\mathrm{O}(n)$ memory, $\mathrm{O}(n)$ time to initialise, and average $\mathrm{O}(k)$ +time for $k$ lookups [@Cormen2009, chapter 11]. 
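A pointer-keyed table of this kind could look roughly like the following sketch. It is written in plain C with `const void *` keys standing in for `SEXP` pointers; the names `ptr_table`, `ptr_table_init`, `ptr_table_put`, `ptr_table_get` and `ptr_table_free` are hypothetical, the mixing function is ad hoc, and the choice of linear probing with at most half of the slots occupied is an assumption made for brevity rather than a tuned design. Values are 1-based positions, with 0 meaning "not found".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    const void *key;  // NULL marks an empty slot
    size_t value;     // 1-based position; 0 is reserved for "absent"
} slot;

typedef struct {
    slot *slots;
    size_t mask;      // slot count minus one; the count is a power of two
} ptr_table;

// Ad hoc mixing of the pointer bits. Converting a pointer to an integer
// for hashing is implementation-defined, as noted in the text.
static size_t hash_ptr(const void *p, size_t mask) {
    uint64_t h = (uint64_t)(uintptr_t)p;
    h ^= h >> 32;
    h *= 0x9E3779B97F4A7C15ULL;
    h ^= h >> 32;
    return (size_t)h & mask;
}

// Prepare a table for at most n distinct keys using a single allocation.
// At least half of the slots always stay empty, so probing terminates.
static int ptr_table_init(ptr_table *t, size_t n) {
    size_t count = 2;
    while (count / 2 < n) {
        if (count > SIZE_MAX / 2) return -1;
        count *= 2;
    }
    t->slots = calloc(count, sizeof *t->slots);
    if (!t->slots) return -1;
    t->mask = count - 1;
    return 0;
}

// Store key -> value unless the key is already present.
// Returns the value associated with the key after the call.
static size_t ptr_table_put(ptr_table *t, const void *key, size_t value) {
    assert(key != NULL);
    for (size_t i = hash_ptr(key, t->mask);; i = (i + 1) & t->mask) {
        slot *s = &t->slots[i];
        if (s->key == key) return s->value;
        if (s->key == NULL) {
            s->key = key;
            s->value = value;
            return value;
        }
    }
}

// Return the value stored for the key, or 0 if it is absent.
static size_t ptr_table_get(const ptr_table *t, const void *key) {
    for (size_t i = hash_ptr(key, t->mask);; i = (i + 1) & t->mask) {
        const slot *s = &t->slots[i];
        if (s->key == key) return s->value;
        if (s->key == NULL) return 0;
    }
}

static void ptr_table_free(ptr_table *t) {
    free(t->slots);
    t->slots = NULL;
    t->mask = 0;
}
```

Because existing keys are never overwritten, inserting the table entries in order keeps the first occurrence of a duplicate, which is the behaviour a `match`-like lookup needs. The caller is responsible for not inserting more than `n` distinct keys.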
+ +Taking the `savetl` bookkeeping into account, the _average asymptotic_ +performance of `TRUELENGTH` and hashing for string lookup is the same in +both time and space, but the constants are most likely lower for +`TRUELENGTH`. Transitioning to a hash table will probably involve a +performance hit. + +A truly lazy implementation could just use [`std::unordered_map`][cppreference_unordered_map] (at the cost of requiring C++11, +which was supported but far from required in R-3.3, and having to shield +R from the C++ exceptions) or the permissively-licensed [uthash]. Since +the upper bound on the size of the table is known ahead of time, a +custom-made open-addressing hash table [@Cormen2009, section 11.4] could +be implemented with a fixed load factor, requiring only one allocation +and no linked lists to walk. + +**Status** in `data.table`: not fixed yet. + +### Marking columns for copying + +The use of `TRUELENGTH` in `data.table` to mark objects is not limited +to `CHARSXP` strings. Individual columns are also marked in a similar +manner for later copying: + +* In `src/dogroups.c`, the vectors allocated for the special symbols + `.BY`, `.I`, `.N`, `.GRP` must not be returned by the grouping + operations evaluated with `dt[..., ..., by=...]`, so they are [marked + with a `TRUELENGTH` of -1][datatable_dogroups_setlen-1], and the + [marked columns][datatable_dogroups_anyspecialstatic] are later + re-created. +* In `src/utils.c`, columns that share memory or are ALTREP must be copied. + Memory sharing between columns may lead to confusing results when they + are altered by reference, and ALTREP columns cannot have `TRUELENGTH` + set. The code uses the same trick as with `CHARSXP` objects: if + `TRUELENGTH` is set on an object, accessing it through a + different pointer and seeing a non-zero value will prove that the + object had been previously visited.
The code first [prepares zero + `TRUELENGTH`s][datatable_copyShared1], then [marks ALTREP, special, + and already marked columns for copying][datatable_copyShared2], then + [marks columns not previously marked with their column + number][datatable_copyShared3], then finally [restores the + `TRUELENGTH`s for the columns that won't be + overwritten][datatable_copyShared4]. + * The `SET_TRUELENGTH` call in `copySharedColumns` would fail if it + ever got an ALTREP column, but the only use of `copySharedColumns` + in `reorder` guards against those. + +The same solution as above can be used +here, with the same downsides of having to allocate memory for the hash +table and the potential to have worst-case $\mathrm{O}(kn)$ time for $k$ lookups +fundamental to hash tables. + +**Status** in `data.table`: not fixed yet. + +But there's more +================ + +Using `tools:::funAPI` together with the lists of symbols exported from +R and imported by `data.table`, we can find a number of non-API entry +points which ` R CMD check ` doesn't complain about yet: +`r paste(paste0('', sort(DTnonAPI_yet), ''), collapse = ', ')`. + +`(SET_)ATTRIB`, `SET_OBJECT` {#ATTRIB-all} +---------------------------- + +`data.table` performs some direct operations on the attribute pairlists. +Accessing attributes directly requires manually maintaining the object +bit. + +> Use `getAttrib` for individual attributes. To test whether there are +> any attributes use `ANY_ATTRIB`, added in R 4.5.0. Use `setAttrib` for +> individual attributes, `DUPLICATE_ATTRIB` or +> `SHALLOW_DUPLICATE_ATTRIB` for copying attributes from one object to +> another. Use `CLEAR_ATTRIB` for removing all attributes, added in R +> 4.5.0. + +-- [WRE 6.21.1][WRE_replacement_entrypoints] + +### Testing for presence of attributes + +`src/nafill.c` [checks][datatable_nafill_ATTRIB] whether the source +object has any attributes before trying to copy them using +`copyMostAttrib`. 
+ +Problem: + +```c +if (!isNull(ATTRIB(VECTOR_ELT(x, i)))) + // ^^^^^^ non-API entry point +``` + +Solution: + +```c +#if R_VERSION < R_Version(4, 5, 0) +#define ANY_ATTRIB(x) (!isNull(ATTRIB(x))) +#endif + +if (ANY_ATTRIB(VECTOR_ELT(x, i))) + // ^^^^^^^^^^ introduced in R-4.5 +``` + +**Status** in `data.table`: not fixed yet. Will need to wait for R-4.5.0 +to be released with the new interface. + +### Iterating over all attributes + +* The code in `src/assign.c` needs to [iterate over all the attributes of +`attr(dt, 'index')`][datatable_assign_ATTRIB] in order to find indices +that use the given column. +* The code in `src/dogroups.c` needs to [iterate over all attributes of + a column][datatable_dogroups_ATTRIB] in case a reference to the value + of a special symbol has been stashed there and must be duplicated. + +Without `ATTRIB`, this will only be possible using an R-level call to +`attributes()`. While the indices could be changed to use a different data +structure (a named `VECSXP` list?), necessitating an update step for +`data.table`s loaded from storage, the code in `src/dogroups.c` cannot +avoid having to see all the attributes. + +**Status** in `data.table`: no idea how to fix yet. + +### Raw `c(NA, n)` row names + +The code in `src/dogroups.c` needs to [access the raw `rownames` +attribute][datatable_dogroups_rownames] of a `data.table`, even if it's +in the compact form as a 2-element integer vector starting with `NA`. +The `getAttrib` function has a special case for the `R_RowNamesSymbol`, +which returns an ALTREP representation of this attribute. + +`data.table` needs this access in order to [temporarily +overwrite][datatable_dogroups_rownames2] the `rownames` attribute for +the specially-prepared subset `data.table` named `.SD` (which has a +different number of rows and therefore needs different `rownames`). +Creating a full-sized `rownames` attribute instead of its compact form +would take more time than desirable. 
+ +**Status** in `data.table`: no idea how to fix yet. + +### Direct transplantation of attributes + +The code in `src/dogroups.c` needs to +[transplant][datatable_dogroups_SETATTR] the attributes from one object +to another without duplicating them, even shallowly. +`SHALLOW_DUPLICATE_ATTRIB` could work as a replacement, but with worse +performance because it would waste time copying attributes from an +object that is about to be discarded. + +**Status** in `data.table`: no idea how to fix yet. + +`findVar` +--------- + +[Used in `dogroups`][datatable_dogroups_findVar] to look up the +pre-created variables corresponding to the special symbols `.SDall`, +`.SD`, `.N`, `.GRP`, `.iSD`, `.xSD` in their environment. + +> The functions `findVar` and `findVarInFrame` have been used in a +> number of packages but are too low level to be part of the API. For +> most uses the functions `R_getVar` and `R_getVarEx` added in R 4.5.0 +> will be sufficient. These are analogous to the R functions `get` and +> `get0`. + +-- [WRE 6.21.7] + +The new function `R_getVar` is different in that it will never return a +`PROMSXP` (which are an internal implementation detail) or an +`R_UnboundValue`, but the current code doesn't try to care about either. + +Example of the problem: + +```c +SEXP SD = PROTECT(findVar(install(".SD"), env)); + // ^^^^^^^ non-API function +``` + +Solution: + + +```c +#if R_VERSION < R_Version(4, 5, 0) +#define R_getVar(sym, rho, inherits) \ + ((inherits) ? findVar((sym), (rho)) : findVarInFrame((rho), (sym))) +#endif + +SEXP SD = PROTECT(R_getVar(install(".SD"), env, TRUE)); + // ^^^^^^^^ introduced in R-4.5 +``` + +**Status** in `data.table`: not fixed yet. Will need to wait for R-4.5.0 +to be released with the new interface. 
+ +`GetOption` +----------- + +Used in `src/rbindlist.c` to read the +[`datatable.rbindlist.check`][datatable_rbindlist_getoption] option, +`src/freadR.c` to read the +[`datatable.old.fread.datetime.character`][datatable_freadR_getoption] +option, `src/init.c` to read the +[`datatable.verbose`][datatable_init_getoption] option, `src/forder.c` +to get the [`datatable.use.index` and +`datatable.forder.auto.index`][datatable_forder_getoption] options, and +`src/subset.c` to read the +[`datatable.alloccol`][datatable_subset_getoption] option. + +> Use `GetOption1`. + +-- [WRE 6.21.1][WRE_replacement_entrypoints] + +The difference is that `GetOption1` doesn't take a second argument +`rho`, which `GetOption` has been ignoring anyway. + +Example of the problem: + +```c +SEXP opt = GetOption(install("datatable.use.index"), R_NilValue); + // ^^^^^^^^^ non-API function +``` + +Solution: + +```c +SEXP opt = GetOption1(install("datatable.use.index")); + // ^^^^^^^^^^ API function introduced in R-2.13 +``` + +**Status** in `data.table`: not fixed yet. + +Testing for a `data.frame`: `isFrame` +------------------------------------- + +Back in 2012, Matt Dowle needed to quickly test an object for being a +`data.frame`, and the undocumented function `isFrame` seemed like it +[did the job][datatable_isframe_added]. Since `isFrame` was not part of +the documented interface, in 2024 Luke Tierney gave the function a +better-fitting name, [`isDataFrame`][R_isdataframe_added], and made it +an experimental entry point, while retaining the original function as a +wrapper. + +Use of `isFrame` [doesn't give a `NOTE` yet][remove_isframe], but when +R-4.5.0 is released together with the new name for the function, +`data.table` will be able to use it, falling back to `isFrame` on older +versions of R. `isDataFrame` is documented among other [replacement +entry point names][WRE_replacement_entrypoints] in Writing R Extensions. 
+ +Problem (the only instance in `data.table`): + +```c +if (!isVector(thiscol) || isFrame(thiscol)) + // ^^^^^^^ may disappear in a future R version +``` + +Solution: + +```c +#if R_VERSION < R_Version(4, 5, 0) +// R versions released before 4.5.0 use the old name of the function +#define isDataFrame(x) (isFrame(x)) +#endif + +// later: +if (!isVector(thiscol) || isDataFrame(thiscol)) + // ^^^^^^^^^^^ introduced in R-4.5 +``` + +**Status** in `data.table`: change reverted in [#6244][remove_isframe], +waiting for R-4.5.0 to be released with the new interface. + +`OBJECT` +-------- + +Used in `src/assign.c` to [test whether S3 dispatch is possible on an +object][datatable_assign_OBJECT] before spending CPU time on +constructing and evaluating an R-level call to `as.character` instead of +`coerceVector`. + +> Use `isObject`. + +-- [WRE 6.21.1][WRE_replacement_entrypoints] + +Problem: +```c +if (OBJECT(source) && getAttrib(source, R_ClassSymbol)!=R_NilValue) { + // ^^^^^^ non-API entry point +``` + +Solution: +```c +if (isObject(source)) { + // ^^^^^^^^ API entry point +``` + +Most likely, the check for `getAttrib(source, R_ClassSymbol)` is +superfluous, because when used correctly, the R API keeps the object bit +set only when the `class` attribute is non-empty. + +**Status** in `data.table`: not fixed yet. + +Conclusion +========== + +While `data.table` could get rid of most of its non-API use with +relative ease, either by using a different name for the function +(`STRING_PTR_RO`, `GetOption1`) or by adding a wrapper for R < 4.5 +(`ANY_ATTRIB`, `R_getVar`), two interfaces will require a significant +amount of work. + +Replacing the use of `TRUELENGTH` and related functions will require +implementing two features from scratch: a set of ALTREP classes for +growable vectors (with the previous implementation hidden in `#ifdef` +for R < 4.3) and pointer-keyed hash tables for string and column +marking. + +It is not currently clear how to replace the use of `ATTRIB`.
+ +References +========== + +[is.R]: https://developer.r-project.org/blosxom.cgi/R-devel/NEWS/2024/03/08#n2024-03-09 +[WRE]: https://cran.r-project.org/doc/manuals/R-exts.html +[CRANpolicy]: https://cran.r-project.org/web/packages/policies.html +[WRE33API]: https://web.archive.org/web/20160609093632/https://cran.r-project.org/doc/manuals/R-exts.html#The-R-API +[ltierney_serialize]: https://homepage.divms.uiowa.edu/~luke/R/serialize/serialize.html +[WRE45serialize]: https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Custom-serialization-input-and-output +[digest]: https://cran.r-project.org/package=digest +[WRE33wilcox]: https://web.archive.org/web/20160609093632/https://cran.r-project.org/doc/manuals/R-exts.html#Distribution-functions +[wilcox_declared]: https://github.com/r-devel/r-svn/commit/1638b0106279aa1944b17742054bc6882656596e +[wilcox_api]: https://github.com/r-devel/r-svn/commit/32ea1f67f842e3247f782a91684023b0b5eec6c5 +[ALTREPnonAPI]: https://stat.ethz.ch/pipermail/r-devel/2024-April/083339.html +[ALTREP]: https://svn.r-project.org/R/branches/ALTREP/ALTREP.html +[Rd200210]: https://stat.ethz.ch/pipermail/r-devel/2002-October/thread.html +[Rd200510]: https://stat.ethz.ch/pipermail/r-devel/2005-October/thread.html +[Rd201905]: https://stat.ethz.ch/pipermail/r-devel/2019-May/thread.html +[clarifyingAPI]: https://stat.ethz.ch/pipermail/r-devel/2024-June/083449.html +[remove_non_API]: https://github.com/Rdatatable/data.table/issues/6180 +[setOldClass]: https://search.r-project.org/R/refmans/methods/html/setOldClass.html +[RI_S4rep]: https://cran.r-project.org/doc/manuals/R-ints.html#Representation-of-S4-objects +[IS_S4_OBJECT]: https://github.com/r-devel/r-svn/blob/c20ebd2d417d9ebb915e32bfb0bfdad768f9a80a/src/main/memory.c#L4033-L4035 +[isS4]: https://github.com/r-devel/r-svn/blob/c20ebd2d417d9ebb915e32bfb0bfdad768f9a80a/src/main/objects.c#L1838-L1841 +[asS4]: 
https://github.com/r-devel/r-svn/blob/c20ebd2d417d9ebb915e32bfb0bfdad768f9a80a/src/main/objects.c#L1843 +[datatable_assign_shallow_S4]: https://github.com/Rdatatable/data.table/blob/a2e20d6cab0bc3cd00f8e47d10603e8c04c89759/src/assign.c#L156 +[datatable_dogroups_keepattr_S4]: https://github.com/Rdatatable/data.table/blob/a2213177283f0f15823e1ff823c1fdf63746da3d/src/dogroups.c#L485 +[datatable_assign_SHALLOW_ATTRIB]: https://github.com/Rdatatable/data.table/commit/f952062030e6657bef83de2748c65120990031c1 +[datatable_dogroups_grow_keepattr]: https://github.com/Rdatatable/data.table/blob/a2213177283f0f15823e1ff823c1fdf63746da3d/src/dogroups.c#L522 +[remove_set_s4_object]: https://github.com/Rdatatable/data.table/pull/6183 +[#6264]: https://github.com/Rdatatable/data.table/pull/6264 +[call]: https://search.r-project.org/R/refmans/base/html/call.html +[datatable_rbindlist_eval]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/rbindlist.c#L237 +[WRE_call]: https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Creating-call-expressions +[remove_set_typeof]: https://github.com/Rdatatable/data.table/pull/6313 +[RI17]: https://cran.r-project.org/doc/manuals/R-ints.html#The-write-barrier +[Tierney_gengc]: https://homepage.stat.uiowa.edu/~luke/R/gengcnotes.html +[Tierney_writebr]: https://homepage.stat.uiowa.edu/~luke/R/barrier.html +[remove_string_ptr]: https://github.com/Rdatatable/data.table/pull/6312 +[PR18775]: https://bugs.r-project.org/show_bug.cgi?id=18775 +[Tierney_refcnt]: https://developer.r-project.org/Refcnt.html +[Rnews_setnamed]: https://developer.r-project.org/blosxom.cgi/R-devel/NEWS/2017/09/02#n2017-09-03 +[remove_named]: https://github.com/Rdatatable/data.table/pull/6420/files#diff-22b103646a1efab9bbfc374791ccfc3fd1422eefc48918a3e126fc2f30d1f572L552 +[LEVELS_macro]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L228 
+[LEVELS_function]:https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/memory.c#L3902 +[LEVELS_field]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L132 +[RI112]: https://cran.r-project.org/doc/manuals/R-ints.html#Rest-of-header +[gp_for_match1]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/match.c#L175 +[gp_for_match2]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/match.c#L233-L236 +[gp_for_match3]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/unique.c#L53 +[gp_for_gc]:https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/memory.c#L151-L155 +[gp_for_finalize]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/memory.c#L1364-L1374 +[gp_for_calling]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/errors.c#L1660-L1665 +[gp_for_assignment]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L280-L324 +[gp_for_s4]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L359-L362 +[gp_for_jit]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L364-L371 +[gp_for_growable]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L373-L377 +[gp_for_missing]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L449-L456 +[gp_for_missing2]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/eval.c#L2260-L2281 +[gp_for_ddval]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L519-L523 +[Rhelp_dots]: 
https://search.r-project.org/R/refmans/base/html/dots.html +[gp_for_env]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L529-L530 +[envflags_locked]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/envir.c#L106-L108 +[envflags_global]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/envir.c#L613-L655 +[gp_for_hashash]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L1182-L1186 +[hashash2]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/envir.c#L517-L520 +[gp_for_active]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L1205-L1210 +[active_binding]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/envir.c#L3466-L3483 +[gp_for_basesym]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L1225-L1228 +[basesym2]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/main/envir.c#L754-L768 +[gp_for_special]: https://github.com/r-devel/r-svn/blob/2753df314f7d8e154bc42b5abd99daaf6472dbe1/src/include/Defn.h#L1230-L1236 +[specialsym2]: https://github.com/r-devel/r-svn/blob/2753df314f7d8e154bc42b5abd99daaf6472dbe1/src/main/names.c#L1019-L1046 +[gp_for_promsxp]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L1165-L1166 +[gp_for_charsxp]: https://github.com/r-devel/r-svn/blob/c9437a83b9677074fe01310caac6a2a66cc7f680/src/include/Defn.h#L843-L853 +[R_Encoding]: https://search.r-project.org/R/refmans/base/html/Encoding.html +[WRE_Encoding]: https://cran.r-project.org/doc/manuals/R-exts.html#Character-encoding-issues +[R_SET_ASCII]: https://github.com/r-devel/r-svn/blob/2753df314f7d8e154bc42b5abd99daaf6472dbe1/src/main/envir.c#L4312-L4375 
+[datatable_isencoded]: https://github.com/Rdatatable/data.table/blob/40ad2e6978202ecc626db9eaae3a18ed5e4df769/src/data.table.h#L36-L38 +[datatable_needUTF8]: https://github.com/Rdatatable/data.table/blob/40ad2e6978202ecc626db9eaae3a18ed5e4df769/src/data.table.h#L63-L73 +[datatable_ENCODED_CHAR]: https://github.com/Rdatatable/data.table/blob/40ad2e6978202ecc626db9eaae3a18ed5e4df769/src/fwriteR.c#L8-L12 +[datatable_anynotascii]: https://github.com/Rdatatable/data.table/blob/40ad2e6978202ecc626db9eaae3a18ed5e4df769/src/forder.c#L312-L331 +[datatable_levels1]: https://github.com/Rdatatable/data.table/pull/6420/commits/46dbfa93e72776c59dacb286de9831fa28c481b5#diff-3b83136e49e2df4f5df80b312d7d4199fed9e0d283401dbf7bd9159a5096bcaaL36 +[remove_levels]: https://github.com/Rdatatable/data.table/pull/6422/commits/72cbd170fd16844dd8094b8d049d2e56d0926d22 +[news173]: https://github.com/Rdatatable/data.table/blob/6a15f8617de121a406cee97b22e83e0c2c4bb034/NEWS.0.md#new-features-13 +[datatable_overallocation]: https://github.com/Rdatatable/data.table/commit/e09d91beccc862eebcd9497c27b422058320396b#diff-22b103646a1efab9bbfc374791ccfc3fd1422eefc48918a3e126fc2f30d1f572R262-R276 +[datatable_logo]: https://raw.githubusercontent.com/Rdatatable/data.table/master/.graphics/logo.png +[datatable_stretch_column]: https://github.com/Rdatatable/data.table/commit/b4e023df736fed8c4dc536ac0061e895a565b375#diff-697a3094ef3d287d25b94aa344f7ed0262aa3fdb97af9b7e04e3b0ef585b05bcR30-R56 +[RI113]: https://cran.r-project.org/doc/manuals/R-ints.html#The-_0027data_0027 +[R_truelength]: https://github.com/r-devel/r-svn/commit/2d4ae2c4bd593bc2aa2273076997b6e63bbcb782 +[R_hashvalue]: https://github.com/r-devel/r-svn/blob/04a3b015e7d20598f66954b88ae2d39068451494/src/include/Defn.h#L1184-L1187 +[R_install_truelen]: https://github.com/r-devel/r-svn/blob/04a3b015e7d20598f66954b88ae2d39068451494/src/main/names.c#L1256-L1272 +[R_serialize_hash]: 
https://github.com/r-devel/r-svn/blob/04a3b015e7d20598f66954b88ae2d39068451494/src/main/serialize.c#L617-L634 +[R_saveload_hash]: https://github.com/r-devel/r-svn/blob/04a3b015e7d20598f66954b88ae2d39068451494/src/main/saveload.c#L807-L834 +[R_envir_hashpri]: https://github.com/r-devel/r-svn/blob/04a3b015e7d20598f66954b88ae2d39068451494/src/main/envir.c#L193-L207 +[R_envir_hashval]: https://github.com/r-devel/r-svn/blob/04a3b015e7d20598f66954b88ae2d39068451494/src/main/envir.c#L497-L520 +[R_radixsort]: https://github.com/r-devel/r-svn/commit/4907092c953bd0b9c059474f77e40990ecf312b1 +[R_growable]: https://github.com/r-devel/r-svn/commit/287b8316232aea7c619d0cadcb515507b1e3ebfa +[R_altrep_set_truelen]: https://github.com/r-devel/r-svn/blob/04a3b015e7d20598f66954b88ae2d39068451494/src/include/Defn.h#L391 +[R_altrep_truelen]: https://github.com/r-devel/r-svn/blob/04a3b015e7d20598f66954b88ae2d39068451494/src/main/altrep.c#L345 +[datatable_init_testtl]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/init.c#L206 +[datatable_docols_SD]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/dogroups.c#L197 +[datatable_docols_I]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/dogroups.c#L230-L237 +[datatable_docols_restore]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/dogroups.c#L482-L485 +[datatable_docols_extend]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/dogroups.c#L318-L324 +[datatable_freadR_truncate]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/freadR.c#L536-L538 +[datatable_freadR_settl]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/freadR.c#L519 +[datatable_freadR_drop]: 
https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/freadR.c#L551-L552 +[datatable_subset_alloc]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/subset.c#L300-L334 +[datatable_assign_shallow]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L192-L196 +[datatable_assign_create]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L535-L536 +[datatable_assign_remove]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L733-L734 +[datatable_assign_finalizer]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L21 +[R_duplicate_truelength]: https://github.com/r-devel/r-svn/blob/04a3b015e7d20598f66954b88ae2d39068451494/src/main/duplicate.c#L43-L81 +[datatable_assign_selfref]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L27-L63 +[datatable_assign_selfrefok]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L99-L138 +[R_memory_getVecSize]: https://github.com/r-devel/r-svn/blob/04a3b015e7d20598f66954b88ae2d39068451494/src/main/memory.c#L1108-L1109 +[R_PR17620]: https://bugs.r-project.org/show_bug.cgi?id=17620 +[Rapi_altrep_methods]: https://aitap.codeberg.page/R-api/#R_005fset_005faltrep_005f_002e_002e_002e_005fmethod +[Tierney_mutable]: https://github.com/ALTREP-examples/Rpkg-mutable/blob/master/vignettes/mutable.Rmd +[Rapi_altvec_methods]: https://aitap.codeberg.page/R-api/#R_005fset_005faltvec_005f_002e_002e_002e_005fmethod +[Rapi_altstring_methods]: https://aitap.codeberg.page/R-api/#R_005fmake_005faltstring_005fclass +[Rapi_altlist_methods]: https://aitap.codeberg.page/R-api/#R_005fmake_005faltlist_005fclass +[Rapi_altinteger]: 
https://aitap.codeberg.page/R-api/#R_005fmake_005faltinteger_005fclass +[Rapi_altlogical]: https://aitap.codeberg.page/R-api/#R_005fmake_005faltlogical_005fclass +[Rapi_altreal]: https://aitap.codeberg.page/R-api/#R_005fmake_005faltreal_005fclass +[Rapi_altcomplex]: https://aitap.codeberg.page/R-api/#R_005fmake_005faltcomplex_005fclass +[Rapi_altraw]: https://aitap.codeberg.page/R-api/#R_005fmake_005faltraw_005fclass +[Rapi_new_altrep]: https://aitap.codeberg.page/R-api/#R_005fnew_005faltrep +[Rapi_altrep_inherits]: https://aitap.codeberg.page/R-api/#index-R_005faltrep_005finheritsaltrep_005finherits +[datatable_assign_savetl]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L1274-L1328 +[RI110]: https://cran.r-project.org/doc/manuals/R-ints.html#The-CHARSXP-cache +[datatable_assign_memrecycle]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L833-L867 +[datatable_rbindlist_matchcolumns]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/rbindlist.c#L70-L179 +[datatable_rbindlist_matchfactors]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/rbindlist.c#L367-L516 +[datatable_forder_range_str]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L295-L383 +[datatable_forder_truelen]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L769 +[datatable_forder_free_ustr]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L75 +[datatable_chmatch_savetl]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L58-L64 +[datatable_chmatch_settl]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L78-L80 +[datatable_chmatch_cleanup1]: 
https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L103 +[datatable_chmatch_lookup]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L108-L130 +[datatable_chmatch_cleanup2]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L135-L136 +[datatable_fmelt_truelen]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/utils.c#L273 +[Wellons_hashptr]: https://nullprogram.com/blog/2016/05/30/ +[R_unique_PTRHASH]: https://github.com/r-devel/r-svn/blob/3713345283787c928e563cdcdf01cc4a9dc1c708/src/main/unique.c#L185-L208 +[cppreference_unordered_map]: https://en.cppreference.com/w/cpp/container/unordered_map +[uthash]: https://troydhanson.github.io/uthash/ +[datatable_dogroups_setlen-1]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/dogroups.c#L105-L152 +[datatable_dogroups_anyspecialstatic]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/dogroups.c#L6-L64 +[datatable_copyShared1]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/utils.c#L260-L261 +[datatable_copyShared2]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/utils.c#L266-L267 +[datatable_copyShared3]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/utils.c#L273 +[datatable_copyShared4]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/utils.c#L273 +[datatable_nafill_ATTRIB]: https://github.com/Rdatatable/data.table/blob/546259ddaba0e8ab1506729113688f85ca2986fd/src/nafill.c#L216 +[datatable_assign_ATTRIB]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L618-L629 +[datatable_dogroups_ATTRIB]: 
https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/dogroups.c#L57-L58 +[datatable_dogroups_rownames]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/dogroups.c#L131-L134 +[datatable_dogroups_rownames2]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/dogroups.c#L195 +[datatable_dogroups_SETATTR]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/dogroups.c#L509-L515 +[datatable_dogroups_findVar]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/dogroups.c#L90-L118 +[WRE 6.21.7]: https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Working-with-variable-bindings +[datatable_rbindlist_getoption]: https://github.com/Rdatatable/data.table/blob/master/src/rbindlist.c#L231 +[datatable_freadR_getoption]: https://github.com/Rdatatable/data.table/blob/master/src/freadR.c#L132 +[datatable_init_getoption]: https://github.com/Rdatatable/data.table/blob/master/src/init.c#L331 +[datatable_forder_getoption]: https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L1619-L1637 +[datatable_subset_getoption]: https://github.com/Rdatatable/data.table/blob/master/src/subset.c#L299 +[datatable_isframe_added]:
https://github.com/Rdatatable/data.table/commit/87666e70ce1a69b28f0e92ec7504d80e3d53a824#diff-4fc47a9752ba4edfef0cabcc1958eda943545ad3859e48d498b0e3f87a9ae5aeR192 +[R_isdataframe_added]: https://github.com/r-devel/r-svn/commit/4ef83b9dc3c6874e774195d329cbb6c11a71c414 +[remove_isframe]: https://github.com/Rdatatable/data.table/issues/6244 +[WRE_replacement_entrypoints]: https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Some-API-replacements-for-non_002dAPI-entry-points +[datatable_assign_OBJECT]: https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L1158 diff --git a/posts/2024-12-12-non-api-use/langsxp.pikchr b/posts/2024-12-12-non-api-use/langsxp.pikchr new file mode 100644 index 00000000..14831a63 --- /dev/null +++ b/posts/2024-12-12-non-api-use/langsxp.pikchr @@ -0,0 +1,21 @@ +Head: ellipse "LANGSXP" fit +ellipse "SYMSXP" "print" fit with .ne at Head.sw + (-.3in, -.3in) +arrow <- from last ellipse.ne to Head.sw "CAR" above aligned + +ellipse "NILSXP" fit with .n at Head.s + (0,-.3in) +arrow from Head.s to last ellipse.n "TAG" above aligned + +Arg1: ellipse "LISTSXP" fit with .nw at Head.se + (.3in, -.3in) +arrow -> from Head.se to Arg1.nw "CDR" above aligned + +ellipse "INTSXP" "42" fit with .ne at Arg1.sw + (-.3in, -.3in) +arrow <- from last ellipse.ne to Arg1.sw "CAR" above aligned + +ellipse "SYMSXP" "x" fit with .n at Arg1.s + (0,-.42in) +arrow from Arg1.s to last ellipse.n "TAG" above aligned + +ellipse "NILSXP" fit with .nw at Arg1.se + (.3in, -.3in) +arrow -> from Arg1.se to last ellipse.nw "CDR" above aligned + +"SEXP call" mono with .s at Head.n + (0,.2in) +arrow <- from Head.n to last text.s diff --git a/posts/2024-12-12-non-api-use/langsxp.svg b/posts/2024-12-12-non-api-use/langsxp.svg new file mode 100644 index 00000000..292db794 --- /dev/null +++ b/posts/2024-12-12-non-api-use/langsxp.svg @@ -0,0 +1,41 @@ +[SVG rendering of langsxp.pikchr; markup not preserved. Text labels: LANGSXP, SYMSXP print, CAR, NILSXP, TAG, LISTSXP, CDR, INTSXP 42, SYMSXP x, SEXP call]
diff --git a/posts/2024-12-12-non-api-use/precomputed.R b/posts/2024-12-12-non-api-use/precomputed.R new file mode 100644 index 00000000..fa1bcecb --- /dev/null +++ b/posts/2024-12-12-non-api-use/precomputed.R @@ -0,0 +1,82 @@ +library(data.table) + +# The results are not reproducible because they depend on both the R-devel +# version and the data.table-git version, hence the pre-computation. + +symbols <- fread( + # most likely implies R on GNU/Linux built with --enable-R-shlib + paste('nm -gDP', file.path(R.home('lib'), 'libR.so')), + fill = TRUE, col.names = c('name', 'type', 'value', 'size') +)[ + type %in% c('B', 'D', 'R', 'T') # don't care about [weak] imports +][, + type := fcase( + type == 'B', 'variable', + type == 'D', 'data', + type == 'R', 'read-only data', + type == 'T', 'function' + ) +][] + +DTsymbols <- fread( + # again, only tested on GNU/Linux + paste('nm -gDP', system.file( + file.path('libs', 'data_table.so'), package = 'data.table' + )), + fill = TRUE, col.names = c('name', 'type', 'value', 'size') +)[type %in% c('U', 'w')][, + type := fcase( + type == 'U', 'undefined', + type == 'w', 'weak' + ) +][, + name := sub('@.*', '', name) +][] + +# this is entirely dependent on late-2024 tools:::{funAPI,nonAPI} +setdiff( + # symbols exported by R and imported by data.table... + intersect(symbols$name, DTsymbols$name) |> + tools:::unmap(), # renamed according to how R API entry points are named + # except those listed among API entry points + tools:::funAPI()$name |> tools:::unmap() +) |> setdiff( + # and also skip variables because they are omitted in funAPI + symbols[type == 'variable', name] +) -> DTnonAPI +# which ones does R CMD check _not_ complain about... yet?
+DTnonAPI_yet <- setdiff(DTnonAPI, tools:::nonAPI) + +# History of tools:::nonAPI +getNonAPI <- function(ver, + url = sprintf( + "https://svn.r-project.org/R/branches/R-%s-branch/src/library/tools/R/sotools.R", + ver + ) +) { + ee <- parse(text = readLines(url)) + for (e in ee) { + if ( + is.call(e) && length(e) == 3 && + identical(e[[1]], quote(`<-`)) && + identical(e[[2]], quote(`nonAPI`)) + ) + return(do.call(c, as.list(e[[3]])[-1])) + } +} +nonAPI.3_3 <- getNonAPI('3-3') +nonAPI.4_4 <- getNonAPI('4-4') +nonAPI.trunk <- getNonAPI(url = 'https://svn.r-project.org/R/trunk/src/library/tools/R/sotools.R') + +# CRAN package metadata and check results +cpdb <- tools::CRAN_package_db() +needscomp <- cpdb[,'NeedsCompilation'] == 'yes' +checks <- tools::CRAN_check_details() +dtchecks <- subset(checks, Package == 'data.table') + +when <- Sys.Date() +save( + needscomp, dtchecks, symbols, nonAPI.3_3, nonAPI.4_4, nonAPI.trunk, + DTnonAPI, DTnonAPI_yet, + when, file = 'precomputed.rda', compress = 'xz' +) diff --git a/posts/2024-12-12-non-api-use/precomputed.rda b/posts/2024-12-12-non-api-use/precomputed.rda new file mode 100644 index 00000000..9feb788e Binary files /dev/null and b/posts/2024-12-12-non-api-use/precomputed.rda differ diff --git a/posts/2024-12-12-non-api-use/refs.bib b/posts/2024-12-12-non-api-use/refs.bib new file mode 100644 index 00000000..b51c71ee --- /dev/null +++ b/posts/2024-12-12-non-api-use/refs.bib @@ -0,0 +1,55 @@ +@book{Becker1985, + address = {Monterey, Calif}, + series = {The {Wadsworth} statistics/probability series}, + title = {Extending the {S} system}, + isbn = {978-0-534-05016-0}, + language = {eng}, + publisher = {Wadsworth}, + author = {Becker, Richard A. 
and Chambers, John M.}, + year = {1985}, +} +@book{Chambers2016, + address = {Milton}, + series = {Chapman \& {Hall} / {CRC} {The} {R} {Series}}, + title = {Extending {R}}, + isbn = {978-1-4987-7572-4 978-1-4987-7571-7}, + language = {eng}, + publisher = {CRC Press}, + author = {Chambers, John M.}, + year = {2016}, +} +@article{Nash2024, + author = {Nash, John C. and Bhattacharjee, Arkajyoti}, + title = {A Comparison of {R} Tools for Nonlinear Least Squares Modeling}, + journal = {The R Journal}, + year = {2024}, + note = {https://doi.org/10.32614/RJ-2023-091}, + doi = {10.32614/RJ-2023-091}, + volume = {15}, + number = {4}, + issn = {2073-4859}, + pages = {198--215} +} +@book{Jones2012, + address = {Boca Raton, FL}, + series = {Applied algorithms and data structures series}, + title = {The garbage collection handbook: the art of automatic memory management}, + isbn = {978-1-4200-8279-1}, + shorttitle = {The garbage collection handbook}, + language = {eng}, + publisher = {CRC Press}, + author = {Jones, Richard and Hosking, Antony and Moss, Eliot}, + year = {2012}, + note = {OCLC: ocn212844102}, + keywords = {Memory management (Computer science)}, +} +@book{Cormen2009, + address = {Cambridge, Massachusetts London, England}, + edition = {Third edition}, + title = {Introduction to algorithms}, + isbn = {978-0-262-03384-8 978-0-262-27083-0}, + language = {eng}, + publisher = {MIT Press}, + author = {Cormen, Thomas H. and Leiserson, Charles Eric and Rivest, Ronald Linn and Stein, Clifford}, + year = {2009}, +}