Skip to content

Commit

Permalink
Merge pull request #71 from aitap/non-api-use-corrections
Browse files Browse the repository at this point in the history
Corrections for the "non-API use" post
  • Loading branch information
kbodwin authored Jan 20, 2025
2 parents 649e0cb + b2cba8a commit bc32dd6
Showing 1 changed file with 11 additions and 4 deletions.
15 changes: 11 additions & 4 deletions posts/2025-01-13-non-api-use/index.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ Fast forward more than four decades and an increase by more than three orders of
Although nobody objected to the use of the *API* entry points, and there was little point in trying to use the *hidden* entry points in a package that would fail to link almost everywhere, the *public* and the *private* entry points ended up being a point of contention. Those deemed too internal to use but not feasible to make *hidden* were (and still are) listed in the character vector `tools:::nonAPI`: `R CMD check` looks at the functions imported by the package and signals a `NOTE` if it finds any listed there.

The remaining *public* functions, neither documented as API nor explicitly forbidden by `R CMD check`, sat there, alluring the package developers with their offers. For example, the [serialization interface](https://homepage.divms.uiowa.edu/~luke/R/serialize/serialize.html) is only [documented in WRE since R-4.5](https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Custom-serialization-input-and-output), but it has been powering part of the [digest](https://cran.r-project.org/package=digest) CRAN package since 2019 (and other packages before it) without any drastic changes. Some of the inclusions in `tools:::nonAPI` could have been historical mistakes: while WRE has been saying [back in version 3.3.0](https://web.archive.org/web/20160609093632/https://cran.r-project.org/doc/manuals/R-exts.html#Distribution-functions) that `wilcox_free` should be called after a call to the (API) functions `dwilcox`, `pwilcox` or `qwilcox`, the function was only [declared in the public headers](https://github.com/r-devel/r-svn/commit/1638b0106279aa1944b17742054bc6882656596e) and [removed from `tools:::nonAPI`](https://github.com/r-devel/r-svn/commit/32ea1f67f842e3247f782a91684023b0b5eec6c5) in R-4.2.0. Still, between R-3.3.3 and R-4.4.2, the `#define USE_RINTERNALS` escape hatch finally closed, `tools:::nonAPI` grew from `r length(nonAPI.3_3)` to `r length(nonAPI.4_4)` entries, and the package maintainers had to adapt or face archival of their packages.
The remaining *public* functions, neither documented as API nor explicitly forbidden by `R CMD check`, sat there, alluring the package developers with their offers. For example, the [serialization interface](https://homepage.divms.uiowa.edu/~luke/R/serialize/serialize.html) is only [documented in WRE since R-4.5](https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Custom-serialization-input-and-output), but it has been powering the [RApiSerialize](https://cran.r-project.org/package=RApiSerialize) CRAN package since 2014 (building on ideas from the now-archived [Rhpc](https://CRAN.R-project.org/package=Rhpc), 2013) without any drastic changes. Some of the inclusions in `tools:::nonAPI` could have been historical mistakes: while WRE has been saying [back in version 3.3.0](https://web.archive.org/web/20160609093632/https://cran.r-project.org/doc/manuals/R-exts.html#Distribution-functions) that `wilcox_free` should be called after a call to the (API) functions `dwilcox`, `pwilcox` or `qwilcox`, the function was only [declared in the public headers](https://github.com/r-devel/r-svn/commit/1638b0106279aa1944b17742054bc6882656596e) and [removed from `tools:::nonAPI`](https://github.com/r-devel/r-svn/commit/32ea1f67f842e3247f782a91684023b0b5eec6c5) in R-4.2.0. Still, between R-3.3.3 and R-4.4.2, the `#define USE_RINTERNALS` escape hatch finally closed, `tools:::nonAPI` grew from `r length(nonAPI.3_3)` to `r length(nonAPI.4_4)` entries, and the package maintainers had to adapt or face archival of their packages.

A [recent question on R-devel](https://stat.ethz.ch/pipermail/r-devel/2024-April/083339.html) (whether the [ALTREP](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html) interface should be considered "API" for the purpose of CRAN package development) sparked a series of events and an extensive discussion containing the highest count of occurrences of the word "API" per month ever seen on R-devel (234), topping [October 2002](https://stat.ethz.ch/pipermail/r-devel/2002-October/thread.html) (package versioning and API breakage, 150), [October 2005](https://stat.ethz.ch/pipermail/r-devel/2005-October/thread.html) (API for graphical interfaces and console output, 124), and [May 2019](https://stat.ethz.ch/pipermail/r-devel/2019-May/thread.html) (discussions of the ALTREP interface and multi-threading, 121). As a result, Luke Tierney [started work](https://stat.ethz.ch/pipermail/r-devel/2024-June/083449.html) on programmatically describing the functions and other symbols exported by R (including variables and preprocessor and enumeration constants), giving a stronger definition to the interface. His changes add the currently unexported function `tools:::funAPI()` that lists entry points and two more of their categories:

Expand Down Expand Up @@ -339,7 +339,7 @@ Replacing `TRUELENGTH`-based growable vectors with `ALTREP`-based ones will conf

- Every place in `data.table` that uses growable vectors will have to be refactored to use the new abstraction layer (`SETLENGTH` in R \< 4.3, ALTREP in R ≥ 4.3).
- Both implementations will have to be maintained as long as `data.table` supports R \< 4.3.
- The current implementation in `data.table` re-creates ALTREP objects as ordinary ones precisely because it's impossible to `SET_TRUELENGTH` on ALTREP objects. This will also need to be refactored.
- The current implementation in `data.table` re-creates ALTREP objects as ordinary ones because it's impossible to `SET_TRUELENGTH` on ALTREP objects and because ALTREP objects may share other columns by reference. This will also need to be refactored.
- The data pointer access is slower for ALTREP vectors than for ordinary vectors: having checked the ALTREP bit in the header, R will have to access the method table and call the method instead of adding a fixed offset to the original `SEXP` pointer. This shouldn't be noticeable unless `data.table` puts data pointer access inside a "hot" loop.
- For numeric ALTREP classes, ALTREP-aware operations that use `*_GET_REGION` instead of the data pointer will become slower unless the class implements a `Get_region` method.

Expand Down Expand Up @@ -382,15 +382,15 @@ The fast string lookup is used in the following places:

- `src/assign.c`: [factor level merging in `memrecycle`](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L833-L867), [`savetl` helper](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L1274-L1328)
- `src/rbindlist.c`: [matching column names](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/rbindlist.c#L70-L179), [matching factor levels](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/rbindlist.c#L367-L516)
- `src/forder.c`: (different purpose, same technique) [storing the group numbers](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L769), [looking them up](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L769), [restoring the original values](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L75)
- `src/forder.c`: (different purpose, same technique) [storing the group numbers](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L295-L383), [looking them up](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L769), [restoring the original values](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L75)
- `src/chmatch.c`: [saving the original `TRUELENGTH`s](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L58-L64), [remembering the positions of `CHARSXP`s in the table](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L78-L80), [cleaning up on error](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L103), [looking up strings in the table](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L108-L130), [cleaning up before exit](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L135-L136)
- `src/fmelt.c`: [combining factor levels by merging their `CHARSXP`s in a common array with indices in `TRUELENGTH`](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/utils.c#L273)

Since there doesn't seem to be any intent to allow using R API to place arbitrary integer values inside unused `SEXP` fields, `data.table` will have to look up the `CHARSXP` values using the externally available information. Performing $\mathrm{O}(nk)$ direct pointer comparisons would scale poorly, so for an $\mathrm{O}(1)$ individual lookup `data.table` could build a hash table of `SEXP` pointers. While pointer hashing [isn't strictly guaranteed by the C standard to work](https://nullprogram.com/blog/2016/05/30/), it has been used [in R itself](https://github.com/r-devel/r-svn/blob/3713345283787c928e563cdcdf01cc4a9dc1c708/src/main/unique.c#L185-L208). A hash table for $n$ `CHARSXP` pointers would need $\mathrm{O}(n)$ memory, $\mathrm{O}(n)$ time to initialise, and average $\mathrm{O}(k)$ time for $k$ lookups [@Cormen2009, chapter 11].

Taking the `savetl` bookkeeping into account, the *average asymptotic* performance of `TRUELENGTH` and hashing for string lookup is the same in both time and space, but the constants are most likely lower for `TRUELENGTH`. Transitioning to a hash will probably involve a performance hit.

A truly lazy implementation could just use [`std::unordered_map<SEXP, int>`](https://en.cppreference.com/w/cpp/container/unordered_map) (at the cost of requiring C++11, which was supported but far from required in R-3.3, and having to shield R from the C++ exceptions) or the permissively-licensed [uthash](https://troydhanson.github.io/uthash/). Since the upper bound on the size of the table is known ahead of time, a custom-made open-addressing hash table [@Cormen2009, section 11.4] could be implemented with a fixed load factor, requiring only one allocation and no linked lists to walk.
A truly lazy implementation could just use [`std::unordered_map<SEXP, int>`](https://en.cppreference.com/w/cpp/container/unordered_map) (at the cost of requiring C++11, which was supported but far from required in R-3.3, and having to shield R from the C++ exceptions) or the permissively-licensed [uthash](https://troydhanson.github.io/uthash/). In cases where the upper bound on the size of the table is known ahead of time, a custom-made open-addressing hash table [@Cormen2009, section 11.4] could be implemented with a fixed load factor, requiring only one allocation and no linked lists to walk. In cases where the upper bound is unrealistic (`forder`, `rbindlist`), a dynamically-growing hash table will need appropriate locking that won't slow it down too much.
**Status** in `data.table`: not fixed yet.
Expand Down Expand Up @@ -583,3 +583,10 @@ While `data.table` could get rid of most of its non-API use with relative ease,
Replacing the use of `TRUELENGTH` and related functions will require implementing two features from scratch: a set of ALTREP classes for growable vectors (with the previous implementation hidden in `#ifdef` for R \< 4.3) and pointer-keyed hash tables for string and column marking.

It is not currently clear how to replace the use of `ATTRIB`.

## Corrections

- Instead of `digest`, the API introduction now uses the earlier `RApiSerialize` package as an example.
- A paragraph at the end of the [growable vectors](#growable-vectors) section used to say that `data.table` re-creates ALTREP columns because those cannot have `TRUELENGTH` set. That's not the only reason: ALTREP columns may also end up referencing special objects like `.SD`.
- A link in the [fast string matching](#TRUELENGTH-mark) section pointed to the wrong lines in the `data.table` source code.
- The [fast string matching](#TRUELENGTH-mark) section used to say that a good upper bound on the number of unique entries is always known; actually, there are two cases where a dynamically growing hash table is required.

0 comments on commit bc32dd6

Please sign in to comment.