diff --git a/posts/2025-01-13-non-api-use/index.qmd b/posts/2025-01-13-non-api-use/index.qmd index a9587cb..90ec351 100644 --- a/posts/2025-01-13-non-api-use/index.qmd +++ b/posts/2025-01-13-non-api-use/index.qmd @@ -34,7 +34,7 @@ Fast forward more than four decades and an increase by more than three orders of Although nobody objected to the use of the *API* entry points, and there was little point in trying to use the *hidden* entry points in a package that would fail to link almost everywhere, the *public* and the *private* entry points ended up being a point of contention. Those deemed too internal to use but not feasible to make *hidden* were (and still are) listed in the character vector `tools:::nonAPI`: `R CMD check` looks at the functions imported by the package and signals a `NOTE` if it finds any listed there. -The remaining *public* functions, neither documented as API nor explicitly forbidden by `R CMD check`, sat there, alluring the package developers with their offers. For example, the [serialization interface](https://homepage.divms.uiowa.edu/~luke/R/serialize/serialize.html) is only [documented in WRE since R-4.5](https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Custom-serialization-input-and-output), but it has been powering part of the [digest](https://cran.r-project.org/package=digest) CRAN package since 2019 (and other packages before it) without any drastic changes. Some of the inclusions in `tools:::nonAPI` could have been historical mistakes: while WRE has been saying [back in version 3.3.0](https://web.archive.org/web/20160609093632/https://cran.r-project.org/doc/manuals/R-exts.html#Distribution-functions) that `wilcox_free` should be called after a call to the (API) functions `dwilcox`, `pwilcox` or `qwilcox`, the function was only [declared in the public headers](https://github.com/r-devel/r-svn/commit/1638b0106279aa1944b17742054bc6882656596e) and [removed from `tools:::nonAPI`](https://github.com/r-devel/r-svn/commit/32ea1f67f842e3247f782a91684023b0b5eec6c5) in R-4.2.0. Still, between R-3.3.3 and R-4.4.2, the `#define USE_RINTERNALS` escape hatch finally closed, `tools:::nonAPI` grew from `r length(nonAPI.3_3)` to `r length(nonAPI.4_4)` entries, and the package maintainers had to adapt or face archival of their packages. +The remaining *public* functions, neither documented as API nor explicitly forbidden by `R CMD check`, sat there, alluring the package developers with their offers. For example, the [serialization interface](https://homepage.divms.uiowa.edu/~luke/R/serialize/serialize.html) is only [documented in WRE since R-4.5](https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Custom-serialization-input-and-output), but it has been powering the [RApiSerialize](https://cran.r-project.org/package=RApiSerialize) CRAN package since 2014 (building on ideas from the now-archived [Rhpc](https://CRAN.R-project.org/package=Rhpc), 2013) without any drastic changes. Some of the inclusions in `tools:::nonAPI` could have been historical mistakes: while WRE has been saying [back in version 3.3.0](https://web.archive.org/web/20160609093632/https://cran.r-project.org/doc/manuals/R-exts.html#Distribution-functions) that `wilcox_free` should be called after a call to the (API) functions `dwilcox`, `pwilcox` or `qwilcox`, the function was only [declared in the public headers](https://github.com/r-devel/r-svn/commit/1638b0106279aa1944b17742054bc6882656596e) and [removed from `tools:::nonAPI`](https://github.com/r-devel/r-svn/commit/32ea1f67f842e3247f782a91684023b0b5eec6c5) in R-4.2.0. Still, between R-3.3.3 and R-4.4.2, the `#define USE_RINTERNALS` escape hatch finally closed, `tools:::nonAPI` grew from `r length(nonAPI.3_3)` to `r length(nonAPI.4_4)` entries, and the package maintainers had to adapt or face archival of their packages. A [recent question on R-devel](https://stat.ethz.ch/pipermail/r-devel/2024-April/083339.html) (whether the [ALTREP](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html) interface should be considered "API" for the purpose of CRAN package development) sparked a series of events and an extensive discussion containing the highest count of occurrences of the word "API" per month ever seen on R-devel (234), topping [October 2002](https://stat.ethz.ch/pipermail/r-devel/2002-October/thread.html) (package versioning and API breakage, 150), [October 2005](https://stat.ethz.ch/pipermail/r-devel/2005-October/thread.html) (API for graphical interfaces and console output, 124), and [May 2019](https://stat.ethz.ch/pipermail/r-devel/2019-May/thread.html) (discussions of the ALTREP interface and multi-threading, 121). As a result, Luke Tierney [started work](https://stat.ethz.ch/pipermail/r-devel/2024-June/083449.html) on programmatically describing the functions and other symbols exported by R (including variables and preprocessor and enumeration constants), giving a stronger definition to the interface. His changes add the currently unexported function `tools:::funAPI()` that lists entry points and two more of their categories: @@ -339,7 +339,7 @@ Replacing `TRUELENGTH`-based growable vectors with `ALTREP`-based ones will conf - Every place in `data.table` that uses growable vectors will have to be refactored to use the new abstraction layer (`SETLENGTH` in R \< 4.3, ALTREP in R ≥ 4.3). - Both implementations will have to be maintained as long as `data.table` supports R \< 4.3. - - The current implementation in `data.table` re-creates ALTREP objects as ordinary ones precisely because it's impossible to `SET_TRUELENGTH` on ALTREP objects. This will also need to be refactored. + - The current implementation in `data.table` re-creates ALTREP objects as ordinary ones because it's impossible to `SET_TRUELENGTH` on ALTREP objects and because ALTREP objects may share other columns by reference. This will also need to be refactored. - The data pointer access is slower for ALTREP vectors than for ordinary vectors: having checked the ALTREP bit in the header, R will have to access the method table and call the method instead of adding a fixed offset to the original `SEXP` pointer. This shouldn't be noticeable unless `data.table` puts data pointer access inside a "hot" loop. - For numeric ALTREP classes, ALTREP-aware operations that use `*_GET_REGION` instead of the data pointer will become slower unless the class implements a `Get_region` method. @@ -382,7 +382,7 @@ The fast string lookup is used in the following places: - `src/assign.c`: [factor level merging in `memrecycle`](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L833-L867), [`savetl` helper](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/assign.c#L1274-L1328) - `src/rbindlist.c`: [matching column names](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/rbindlist.c#L70-L179), [matching factor levels](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/rbindlist.c#L367-L516) -- `src/forder.c`: (different purpose, same technique) [storing the group numbers](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L769), [looking them up](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L769), [restoring the original values](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L75) +- `src/forder.c`: (different purpose, same technique) [storing the group numbers](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L295-L383), [looking them up](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L769), [restoring the original values](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/forder.c#L75) - `src/chmatch.c`: [saving the original `TRUELENGTH`s](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L58-L64), [remembering the positions of `CHARSXP`s in the table](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L78-L80), [cleaning up on error](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L103), [looking up strings in the table](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L108-L130), [cleaning up before exit](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/chmatch.c#L135-L136) - `src/fmelt.c`: [combining factor levels by merging their `CHARSXP`s in a common array with indices in `TRUELENGTH`](https://github.com/Rdatatable/data.table/blob/03c647f9a44710aad834c0718e0b34e8c5341bf1/src/utils.c#L273) @@ -390,7 +390,7 @@ Since there doesn't seem to be any intent to allow using R API to place arbitrar Taking the `savetl` bookkeeping into account, the *average asymptotic* performance of `TRUELENGTH` and hashing for string lookup is the same in both time and space, but the constants are most likely lower for `TRUELENGTH`. Transitioning to a hash will probably involve a performance hit. -A truly lazy implementation could just use [`std::unordered_map`](https://en.cppreference.com/w/cpp/container/unordered_map) (at the cost of requiring C++11, which was supported but far from required in R-3.3, and having to shield R from the C++ exceptions) or the permissively-licensed [uthash](https://troydhanson.github.io/uthash/). Since the upper bound on the size of the table is known ahead of time, a custom-made open-addressing hash table [@Cormen2009, section 11.4] could be implemented with a fixed load factor, requiring only one allocation and no linked lists to walk. +A truly lazy implementation could just use [`std::unordered_map`](https://en.cppreference.com/w/cpp/container/unordered_map) (at the cost of requiring C++11, which was supported but far from required in R-3.3, and having to shield R from the C++ exceptions) or the permissively-licensed [uthash](https://troydhanson.github.io/uthash/). In cases where the upper bound on the size of the table is known ahead of time, a custom-made open-addressing hash table [@Cormen2009, section 11.4] could be implemented with a fixed load factor, requiring only one allocation and no linked lists to walk. In cases where the upper bound is unrealistic (`forder`, `rbindlist`), a dynamically-growing hash table will need appropriate locking that won't slow it down too much. **Status** in `data.table`: not fixed yet. @@ -583,3 +583,10 @@ While `data.table` could get rid of most of its non-API use with relative ease, Replacing the use of `TRUELENGTH` and related functions will require implementing two features from scratch: a set of ALTREP classes for growable vectors (with the previous implementation hidden in `#ifdef` for R \< 4.3) and pointer-keyed hash tables for string and column marking. It is not currently clear how to replace the use of `ATTRIB`. + +## Corrections + +- Instead of `digest`, the API introduction now uses the earlier `RApiSerialize` package as an example. +- A paragraph at the end of the [growable vectors](#growable-vectors) section used to say that `data.table` re-creates ALTREP columns because those cannot have `TRUELENGTH` set. That's not the only reason: ALTREP columns may also end up referencing special objects like `.SD`. +- A link in the [fast string matching](#TRUELENGTH-mark) section pointed to the wrong lines in the `data.table` source code. +- The [fast string matching](#TRUELENGTH-mark) section used to say that a good upper bound on the number of unique entries is always known; actually, there are two cases where a dynamically growing hash table is required.