Releases: SebKrantz/collapse
collapse version 2.0.9
- Added functions `na_locf()` and `na_focb()` for fast basic C implementations of these procedures (optionally by reference). `replace_na()` now also has a `type` argument which supports options `"locf"` and `"focb"` (default `"const"`), similar to `data.table::nafill`. The implementation also supports character data and list-columns (`NULL`/empty elements). Thanks @BenoitLondon for suggesting (#489). I note that `na_locf()` exists in some other packages (such as imputeTS) where it is implemented in R and has additional options. Users should utilize the flexible namespace, i.e. `set_collapse(remove = "na_locf")`, to deal with this.
- Fixed a bug in weighted quantile estimation (`fquantile()`) that could lead to wrong/out-of-range estimates in some cases. Thanks @zander-prinsloo for reporting (#523).
- Improved right join such that join column names of `x` instead of `y` are preserved. This is more consistent with the other joins when join columns in `x` and `y` have different names.
- More fluent and safe interplay of 'mask' and 'remove' options in `set_collapse()`: it is now seamlessly possible to switch from any combination of 'mask' and 'remove' to any other combination without the need of setting them to `NULL` first.
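For readers unfamiliar with the terms used in the `na_locf()` / `na_focb()` entry above, the following base R sketch shows what "last observation carried forward" and "first observation carried back" mean. The helper names `locf_sketch()` and `focb_sketch()` are hypothetical and exist only for this illustration; this is not the C implementation in collapse and it ignores the by-reference and list-column features.

```r
# Base R sketch of LOCF / FOCB semantics (illustrative only; not collapse's C code).
locf_sketch <- function(x) {
  # Position of the most recent non-missing element (0 if none seen so far)
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  idx[idx == 0L] <- NA   # leading NAs have nothing to carry forward
  x[idx]
}

# FOCB is LOCF applied to the reversed vector
focb_sketch <- function(x) rev(locf_sketch(rev(x)))

x <- c(NA, 1, NA, NA, 3, NA)
locf_sketch(x)  # NA  1  1  1  3  3
focb_sketch(x)  #  1  1  3  3  3 NA
```

Note that leading missing values stay missing under LOCF, and trailing ones stay missing under FOCB.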
collapse version 2.0.8
- In `pivot(..., values = [multiple columns], labels = "new_labels_column", how = "wider")`, if the columns selected through `values` already have variable labels, they are concatenated with the new labels provided through `"new_labels_column"` using `" - "` as a separator (similar to `names` where the separator is `"_"`).
- `whichv()` and operators `%==%`, `%!=%` now properly account for missing double values, e.g. `c(NA_real_, 1) %==% c(NA_real_, 1)` yields `c(1, 2)` rather than `2`. Thanks @eutwt for flagging this (#518).
- In `setv(X, v, R)`, if the type of `R` is greater than `X`, e.g. `setv(1:10, 1:3, 9.5)`, then a warning is issued that conversion of `R` to the lower type (real to integer in this case) may incur loss of information. Thanks @tony-aw for suggesting (#498).
- `frange()` has an option `finite = FALSE`, like `base::range`. Thanks @MLopez-Ibanez for suggesting (#511).
- `varying.pdata.frame(..., any_group = FALSE)` now unindexes the result (as should be the case).
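The `%==%` fix above concerns how missing values are compared. As a rough base R illustration (not the collapse implementation, and making no claim about how `NaN` is treated), an ordinary `==` comparison drops positions where both sides are `NA`, whereas an NA-aware comparison counts them as equal:

```r
x <- c(NA_real_, 1)
y <- c(NA_real_, 1)

# Plain comparison: NA == NA is NA, so position 1 is dropped by which()
base_idx <- which(x == y)                       # 2

# NA-aware comparison: two missing values count as a match
na_aware_idx <- which((!is.na(x) & !is.na(y) & x == y) |
                      (is.na(x) & is.na(y)))    # 1 2
```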
collapse version 2.0.7
- Fixed bug in full join if `verbose = 0`. Thanks @zander-prinsloo for reporting.
- Added argument `multiple = FALSE` to `join()`. Setting `multiple = TRUE` performs a multiple-matching join where a row in `x` is matched to all matching rows in `y`. The default `FALSE` just takes the first matching row in `y`.
- Improved recode/replace functions. Notably, `replace_outliers()` now supports option `value = "clip"` to replace outliers with the respective upper/lower bounds, and also has option `single.limit = "mad"` which removes outliers exceeding a certain number of median absolute deviations. Furthermore, all functions now have a `set` argument which fully applies the transformations by reference.
- Functions `replace_NA` and `replace_Inf` were renamed to `replace_na` and `replace_inf` to make the namespace a bit more consistent. The earlier versions remain available.
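The difference between first-match and multiple-match joins described for the `multiple` argument above can be sketched in base R. This is only an analogy under the assumption of a left join; the toy data frames and the use of `match()` and `merge()` are illustrative and say nothing about how `join()` is implemented.

```r
x <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"), stringsAsFactors = FALSE)
y <- data.frame(id = c(2, 2, 3), b = c("u", "v", "w"), stringsAsFactors = FALSE)

# First-match semantics (analogous to multiple = FALSE):
# match() returns only the first matching row of y for each row of x
first_match <- cbind(x, b = y$b[match(x$id, y$id)])

# Multiple-match semantics (analogous to multiple = TRUE):
# every matching row of y is kept, so id 2 appears twice
multi_match <- merge(x, y, by = "id", all.x = TRUE)

nrow(first_match)  # 3
nrow(multi_match)  # 4
```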
collapse version 2.0.6
- Fixed a serious bug in `qsu()` where higher order weighted statistics were erroneous, i.e. whenever `qsu(x, ..., w = weights, higher = TRUE)` was invoked, the 'SD', 'Skew' and 'Kurt' columns were wrong (if `higher = FALSE` the weighted 'SD' is correct). The reason is that there appears to be no straightforward generalization of Welford's Online Algorithm to higher-order weighted statistics. This was not detected earlier because the algorithm was only tested with unit weights. The fix involved replacing Welford's Algorithm for the higher-order weighted case by a 2-pass method that additionally uses long doubles for higher-order terms. Thanks @randrescastaneda for reporting.
- Fixed some unexpected behavior in `t_list()` where names 'V1', 'V2', etc. were assigned to unnamed inner lists. It now preserves the missing names. Thanks @orgadish for flagging this.
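To make the "2-pass method" mentioned in the `qsu()` fix above more concrete, here is a minimal base R sketch of two-pass weighted moments: the first pass computes the weighted mean, the second accumulates weighted central moments around it. The function name `weighted_moments_2pass()` is hypothetical, and the normalization shown (dividing by `sum(w)`) is an assumption made for illustration; the estimators and bias corrections actually used by `qsu()` may differ.

```r
# Two-pass weighted central moments (illustrative sketch, not qsu()'s C code).
weighted_moments_2pass <- function(x, w) {
  sw <- sum(w)
  m  <- sum(w * x) / sw        # pass 1: weighted mean
  d  <- x - m                  # pass 2: deviations from the mean
  m2 <- sum(w * d^2) / sw      # 2nd central moment
  m3 <- sum(w * d^3) / sw      # 3rd central moment
  m4 <- sum(w * d^4) / sw      # 4th central moment
  c(mean = m, skew = m3 / m2^1.5, kurt = m4 / m2^2)
}

# Sanity check: frequency weights should agree with repeating the observations
a <- weighted_moments_2pass(c(1, 2, 7), w = c(2, 1, 3))
b <- weighted_moments_2pass(c(1, 1, 2, 7, 7, 7), w = rep(1, 6))
all.equal(a, b)
```

Testing with non-unit weights in this way is exactly the kind of check that exposes errors which unit weights hide.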
collapse version 2.0.5
- In `join`, if `y` is an expression, e.g. `join(x = mtcars, y = subset(mtcars, mpg > 20))`, then its name is not extracted but just set to `"y"`. Before, the name of `y` would be captured as `as.character(substitute(y))[1] = "subset"` in this case. This is an improvement mainly for display purposes, but could also affect code if there are duplicate columns in both datasets and `suffix` was not provided in the `join` call: before, y-columns would be renamed using a (non-sensible) `"_subset"` suffix, but now using a `"_y"` suffix. Note that this only concerns cases where `y` is an expression rather than a single object.
- Small performance improvements to `%[!]in%` operators: `%!in%` now uses `is.na(fmatch(x, table))` rather than `fmatch(x, table, 0L) == 0L`, and `%in%`, if exported using `set_collapse(mask = "%in%"|"special"|"all")`, is `as.logical(fmatch(x, table, 0L))` instead of `fmatch(x, table, 0L) > 0L`. The new versions are faster because comparison operators `>`, `==` with integers additionally need to check for `NA`'s (= the smallest integer in C).
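The equivalences behind the `%[!]in%` change above can be checked with base R's `match()`, used here as a stand-in for `fmatch()` (an assumption made so the snippet runs without collapse; it demonstrates only that the results agree, not the speed difference):

```r
x     <- c(1L, 5L, 3L)
table <- c(3L, 4L, 5L)

m0 <- match(x, table, nomatch = 0L)   # 0 3 1

# "in": a non-zero match position coerces to TRUE
in_new <- as.logical(m0)
in_old <- m0 > 0L

# "not in": an unmatched element yields NA when no nomatch value is supplied
notin_new <- is.na(match(x, table))
notin_old <- m0 == 0L
```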
collapse version 2.0.4
- In `fnth()/fquantile()`, there has been a slight change to the weighted quantile algorithm. As outlined in the documentation, this algorithm gives weighted versions for all continuous quantile methods (type 7-9) in R by replacing sample quantities with their weighted counterparts. E.g., for the default quantile type 7, the continuous (lower) target element is `(n - 1) * p`. In the weighted algorithm, this became `(sum(w) - mean(w)) * p` and was compared to the cumulative sum of ordered (by `x`) weights, to preserve equivalence of the algorithms in cases where the weights are all equal. However, on second thought, the use of `mean(w)` does not really reflect a standard interpretation of the weights as frequencies. I have reasoned that using `min(w)` instead of `mean(w)` better reflects such an interpretation, as the minimum (non-zero) weight reflects the size of the smallest sampled unit. So the weighted quantile type 7 target is now `(sum(w) - min(w)) * p`, and also the other methods have been adjusted accordingly (note that zero weight observations are ignored in the algorithm).
- This is more a Note than a change to the package: there is an issue with vctrs that users can encounter using collapse together with the tidyverse (especially ggplot2), which is that collapse internally optimizes computations on factors by giving them an additional `"na.included"` class if they are known to not contain any missing values. For example `pivot(mtcars)` gives a `"variable"` factor which has class `c("factor", "na.included")`, such that grouping on `"variable"` in subsequent operations is faster. Unfortunately, `pivot(mtcars) |> ggplot(aes(y = value)) + geom_histogram() + facet_wrap( ~ variable)` currently gives an error produced by vctrs, because vctrs does not implement a standard S3 method dispatch and thus does not ignore the `"na.included"` class. It turns out that the only way for me to deal with this would be to swap the order of classes, i.e. `c("na.included", "factor")`, import vctrs, and implement `vec_ptype2` and `vec_cast` methods for `"na.included"` objects. This will never happen, as collapse is and will remain independent of the tidyverse. There are two ways you can deal with this: The first way is to remove the `"na.included"` class for ggplot2, e.g. `facet_wrap( ~ set_class(variable, "factor"))` or `facet_wrap( ~ factor(variable))` will both work. The second option is to define a function `vec_ptype2.factor.factor <- function(x, y, ...) x` in your global environment, which avoids vctrs performing extra checks on factor objects.
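The change to the type 7 target in the weighted quantile entry above can be followed with a little arithmetic. The weights below are hypothetical, and the snippet only computes the target quantities named in the note; it does not reproduce the full interpolation carried out by `fnth()`/`fquantile()`.

```r
p <- 0.25
w <- c(2, 1, 4, 0, 1, 2)     # hypothetical frequency weights, one of them zero
w <- w[w > 0]                # zero-weight observations are ignored

old_target <- (sum(w) - mean(w)) * p   # (10 - 2) * 0.25 = 2
new_target <- (sum(w) - min(w)) * p    # (10 - 1) * 0.25 = 2.25

# With equal unit weights both versions reduce to the unweighted target (n - 1) * p
w1 <- rep(1, 5)
unit_old <- (sum(w1) - mean(w1)) * p   # (5 - 1) * 0.25 = 1
unit_new <- (sum(w1) - min(w1)) * p    # (5 - 1) * 0.25 = 1
```

The two definitions differ only when the weights are unequal, which is why the equivalence with the unweighted algorithm is preserved.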
collapse version 2.0.3
- Fixed a signed integer overflow inside a hash function detected by CRAN checks (changing to unsigned int).
- Updated the cheatsheet (see README.md).
collapse version 2.0.2
- Added global option 'stub' (default `TRUE`) to `set_collapse`. It is passed to the `stub(s)` arguments of the statistical operators `B`, `W`, `STD`, `HDB`, `HDW`, `L`, `D`, `Dlog`, `G` (in `.OPERATOR_FUN`). By default these operators add a prefix/stub to matrix or data.frame columns transformed by them. Setting `set_collapse(stub = FALSE)` now allows switching off this behavior such that columns are not prepended with a prefix by default.
- `roworder[v]()` now also supports grouped data frames, but prints a message indicating that this is inefficient (also for indexed data). An additional argument `verbose` can be set to `0` to avoid such messages.
collapse version 2.0.1
- `%in%` with `set_collapse(mask = "%in%")` does not warn about overidentification when used with data frames.
- Fixed several typos in the documentation.
collapse version 2.0.0
collapse 2.0, released in mid-October 2023, introduces fast table joins and data reshaping capabilities alongside other convenience functions, and enhances the package's global configurability, including interactive namespace control.
Potentially breaking changes
- In a grouped setting, if `.data` is used inside `fsummarise()` and `fmutate()`, and `.cols = NULL`, `.data` will contain all columns except for grouping columns (in-line with the `.SD` syntax of data.table). Before, `.data` contained all columns. The selection in `.cols` still refers to all columns, thus it is still possible to select all columns using e.g. `grouped_data %>% fsummarise(some_expression_involving(.data), .cols = seq_col(.))`.
Other changes
- In `qsu()`, argument `vlabels` was renamed to `labels`. But `vlabels` will continue to work.
Bug Fixes
- Fixed a bug in the integer methods of `fsum()`, `fmean()` and `fprod()` that returned `NA` if and only if there was a single integer followed by `NA`'s, e.g. `fsum(c(1L, NA, NA))` erroneously gave `NA`. This was caused by a C-level shortcut that returned `NA` when the first element of the vector had been reached (moving from back to front) without encountering any non-NA values. The bug consisted in the content of the first element not being evaluated in this case. Note that this bug did not occur with real numbers, and also not in grouped execution. Thanks @blset for reporting (#432).
Additions
- Added `join()`: class-agnostic, vectorized, and (default) verbose joins for R, modeled after the polars API. Two different join algorithms are implemented: a hash-join (default, if `sort = FALSE`) and a sort-merge-join (if `sort = TRUE`).
- Added `pivot()`: fast and easy data reshaping! It supports longer, wider and recast pivoting, including handling of variable labels, through a uniform and parsimonious API. It does not perform data aggregation, and by default does not check if the data is uniquely identified by the supplied ids. Underidentification for 'wide' and 'recast' pivots results in the last value being taken within each group. Users can toggle a duplicates check by setting `check.dups = TRUE`.
- Added `rowbind()`: a fast class-agnostic alternative to `rbind.data.frame()` and `data.table::rbindlist()`.
- Added `fmatch()`: a fast `match()` function for vectors and data frames/lists. It is the workhorse function of `join()`, and also benefits `ckmatch()`, `%!in%`, and new operators `%iin%` and `%!iin%` (see below). It is also possible to `set_collapse(mask = "%in%")` to replace `base::"%in%"` using `fmatch()`. Thanks to `fmatch()`, these operators also all support data frames/lists of vectors, which are compared row-wise.
- Added operators `%iin%` and `%!iin%`: these directly return indices, i.e. `%[!]iin%` is equivalent to `which(x %[!]in% table)`. This is useful especially for subsetting where directly supplying indices is more efficient, e.g. `x[x %[!]iin% table]` is faster than `x[x %[!]in% table]`. Similarly `fsubset(wlddev, iso3c %iin% c("DEU", "ITA", "FRA"))` is very fast.
- Added `vec()`: efficiently turn matrices or data frames / lists into a single atomic vector. I am aware of multiple implementations in other packages, which are mostly inefficient. With atomic objects, `vec()` simply removes the attributes without copying the object, and with lists it directly calls `C_pivot_longer`.
Improvements
- `set_collapse()` now supports options 'mask' and 'remove', giving collapse a flexible namespace in the broadest sense that can be changed at any point within the active session:
  - 'mask' supports base R or dplyr functions that can be masked into the faster collapse versions. E.g. `library(collapse); set_collapse(mask = "unique")` (or, equivalently, `set_collapse(mask = "funique")`) will create `unique <- funique` in the collapse namespace, export `unique()` from the namespace, and detach and attach the namespace again so R can find it. The re-attaching also ensures that collapse comes right after the global environment, implying that all its functions will take priority over other libraries. Users can use `fastverse::fastverse_conflicts()` to check which functions are masked after using `set_collapse(mask = ...)`. The option can be changed at any time. Using `set_collapse(mask = NULL)` removes all masked functions from the namespace, and can also be called simply to ensure collapse is at the top of the search path.
  - 'remove' allows removing arbitrary functions from the collapse namespace. E.g. `set_collapse(remove = "D")` will remove the difference operator `D()`, which also exists in stats to calculate symbolic and algorithmic derivatives (this is a convenient example but not necessary since `collapse::D` is S3 generic and will call `stats::D()` on R calls, expressions or names). This is safe to do as it only modifies which objects are exported from the namespace (it does not truly remove objects from the namespace). This option can also be changed at any time. `set_collapse(remove = NULL)` will restore the exported namespace.

  For both options there exist a number of convenient keywords to bulk-mask / remove functions. For example `set_collapse(mask = "manip", remove = "shorthand")` will mask all data manipulation functions such as `mutate <- fmutate` and remove all function shorthands such as `mtt` (i.e. abbreviations for frequently used functions that collapse supplies for faster coding / prototyping).
- `set_collapse()` also supports options 'digits', 'verbose' and 'stable.algo', enhancing the global configurability of collapse.
- `qM()` now also has a `row.names.col` argument in the second position allowing generation of rownames when converting data frame-like objects to matrix, e.g. `qM(iris, "Species")` or `qM(GGDC10S, 1:5)` (interaction of id's).
- `as_factor_GRP()` and `finteraction()` now have an argument `sep = "."` denoting the separator used for compound factor labels.
- `alloc()` now has an additional argument `simplify = TRUE`. `FALSE` always returns list output.
- `frename()` supports both `new = old` (pandas, used so far) and `old = new` (dplyr) style renaming conventions.
- `across()` supports negative indices, also in grouped settings: these will select all variables apart from grouping variables.
- `TRA()` allows shorthands `"NA"` for `"replace_NA"` and `"fill"` for `"replace_fill"`.
- `group()` experienced a minor speedup with >= 2 vectors as the first two vectors are now hashed jointly.
- `fquantile()` with `names = TRUE` adds up to 1 digit after the comma in the percent-names, e.g. `fquantile(airmiles, probs = 0.001)` generates appropriate names (not 0% as in the previous version).