Releases: SebKrantz/collapse
collapse version 1.2.0
collapse 1.2.0, released mid May 2020, is a major update of the package - changes and additions:
Changes to Functionality
- `grouped_df` methods for fast statistical functions now always attach the grouping variables to the output in aggregations, unless the argument `keep.group_vars = FALSE` is set. (Formerly, grouping variables were only attached if also present in the data. Code that relies on the old behavior should be adjusted.)
- The `qF` `ordered` argument default was changed to `ordered = FALSE`, and the `NA` level is only added if `na.exclude = FALSE`. Thus `qF` now behaves exactly like `as.factor`.
- `Recode` is deprecated in favor of `recode_num` and `recode_char`; it will be removed soon. Similarly, `replace_non_finite` was renamed to `replace_Inf`.
- In `mrtl` and `mctl` the argument `ret` was renamed `return` and now takes descriptive character arguments (the previous version was a direct C++ export and unsafe; code written with these functions should be adjusted).
- The `GRP` argument `order` is deprecated in favor of the argument `decreasing`. `order` can still be used but will be removed at some point.
Bug Fixes
- Fixed a bug in `flag` where unused factor levels caused a group size error.
Improvements
- Faster grouping with `GRP` and faster factor generation with an added radix method + automatic dispatch between hash and radix method. `qF` is now ~5x faster than `as.factor` on character and around 30x faster on numeric data. `qG` was also enhanced.
- Further slight speed tweaks here and there.
- `collap` now provides more control for weighted aggregations with additional arguments `w`, `keep.w` and `wFUN` to aggregate the weights as well. The defaults are `keep.w = TRUE` and `wFUN = fsum`. A specialty of `collap` remains that `keep.by` and `keep.w` also work for external objects passed, so code of the form `collap(data, by, FUN, catFUN, w = data$weights)` will now have an aggregated `weights` vector in the first column.
- `qsu` now also allows weights to be passed in formula, i.e. `qsu(data, by = ~ group, pid = ~ panelid, w = ~ weights)`.
- `fdiff` now supports quasi-differences, i.e. $x_t - \rho x_{t-1}$, and quasi-log differences, i.e. $\log(x_t) - \rho \log(x_{t-1})$. An arbitrary $\rho$ can be supplied.
- Added a `Dlog` operator for faster access to log-differences.
- `fgrowth` has a `scale` argument; the default is `scale = 100`, which provides growth rates in percentage terms (as before), but this may now be changed.
- All statistical and transformation functions now have a hidden list method, so they can be applied to unclassed list-objects as well. An error is however provided in grouped operations with unequal-length columns.
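The quasi-difference formulas above can be illustrated with a small, language-neutral sketch. It is written in Python purely for illustration and is not how `fdiff` is implemented; the function name `quasi_diff` and its parameters are hypothetical, and `None` stands in for the missing value that the first observation receives.

```python
import math


def quasi_diff(x, rho=1.0, log=False):
    """Compute x_t - rho * x_{t-1}, or log(x_t) - rho * log(x_{t-1}) if log=True.

    Hypothetical helper for illustration only. The first element has no
    predecessor, so it is returned as None.
    """
    values = [math.log(v) for v in x] if log else list(x)
    if not values:
        return []
    out = [None]
    # Pair each observation with its predecessor and apply the formula.
    for prev, cur in zip(values, values[1:]):
        out.append(cur - rho * prev)
    return out


series = [1.0, 2.0, 4.0]
plain = quasi_diff(series, rho=1.0)    # ordinary first difference
quasi = quasi_diff(series, rho=0.5)    # quasi-difference with rho = 0.5
```

With `rho = 1` the result reduces to the ordinary first difference, which is why a quasi-difference can be read as a generalization of it.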
Additions
- Added a suite of functions for fast data manipulation:
  - `fselect` selects variables from a data frame and is equivalent to but much faster than `dplyr::select`.
  - `fsubset` is a much faster version of `base::subset` to subset vectors, matrices and data.frames. The function `ss` was also added as a faster alternative to `[.data.frame`.
  - `ftransform` is a much faster update of `base::transform`, to transform data frames by adding, modifying or deleting columns. The function `settransform` does all of that by reference.
  - `fcompute` is equivalent to `ftransform` but returns a new data frame containing only the columns computed from an existing one.
  - `na_omit` is a much faster and enhanced version of `base::na.omit`.
  - `replace_NA` efficiently replaces missing values in multi-type data.
- Added function `fgroup_by` as a much faster version of `dplyr::group_by` based on collapse grouping. It attaches a 'GRP' object to a data frame, but only works with collapse's fast functions. This allows dplyr-like manipulations that are fully collapse based and thus significantly faster, i.e. `data %>% fgroup_by(g1,g2) %>% fselect(cola,colb) %>% fmean`. Note that `data %>% dplyr::group_by(g1,g2) %>% dplyr::select(cola,colb) %>% fmean` still works, in which case the dplyr 'group' object is converted to 'GRP' as before. However, `data %>% fgroup_by(g1,g2) %>% dplyr::summarize(...)` does not work.
- Added function `varying` to efficiently check the variation of multi-type data over a dimension or within groups.
- Added function `radixorder`, same as `base::order(..., method = "radix")` but more accessible and with built-in grouping features.
- Added functions `seqid` and `groupid` for generalized run-length type id variable generation from grouping and time variables. `seqid` in particular strongly facilitates lagging / differencing irregularly spaced panels using `flag`, `fdiff` etc.
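As a rough illustration of what a run-length type id looks like, the sketch below builds two such ids in plain Python. It is not the code behind `seqid` or `groupid`; the names `seq_id` and `group_id`, the `step` and `start` parameters, and the exact rules (a new id whenever consecutive integers are not `step` apart, or whenever the value changes) are assumptions made for this example.

```python
def seq_id(x, step=1, start=1):
    """Hypothetical sketch: start a new id whenever x[i] - x[i-1] != step."""
    ids = []
    current = start
    for i, value in enumerate(x):
        if i > 0 and value - x[i - 1] != step:
            current += 1  # the sequence is broken, so a new run begins
        ids.append(current)
    return ids


def group_id(x, start=1):
    """Hypothetical sketch: start a new id whenever the value changes."""
    ids = []
    current = start
    for i, value in enumerate(x):
        if i > 0 and value != x[i - 1]:
            current += 1  # value changed, so a new run begins
        ids.append(current)
    return ids


years = [2001, 2002, 2003, 2005, 2006, 2008]   # irregularly spaced time variable
year_runs = seq_id(years)
label_runs = group_id(["a", "a", "b", "a"])
```

If an id like `year_runs` is used as the grouping variable, a lag or difference is only taken between observations that are actually consecutive in time, which is presumably why such ids help with irregularly spaced panels.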
collapse version 1.1.0
collapse 1.1.0 released 01.04.2020 - some small fixes and additions:
- Fixed remaining gcc10, LTO and valgrind issues in C/C++ code, and added some more tests (there are now ~5300 tests ensuring that collapse statistical functions perform as expected).
- Fixed the issue that supplying an unnamed list to `GRP()`, i.e. `GRP(list(v1, v2))`, would give an error. Unnamed lists are now automatically named 'Group.1', 'Group.2', etc.
- Fixed an issue where, when aggregating by a single id in `collap()` (i.e. `collap(data, ~ id1)`), the id would be coded as factor in the aggregated data.frame. All variables including ids now retain their class and attributes in the aggregated data.
- Added weights (`w`) argument to `fsum` and `fprod`. Note: `fmedian` will also support weights as soon as I am able to implement a sufficiently fast (i.e. linear time) algorithm. I also hope to introduce (weighted) quantiles. I am happy for any help with these features.
- Added an argument `mean = 0` to `fwithin / W`. This allows simple and grouped centering on an arbitrary mean, `0` being the default. For grouped centering, `mean = "overall.mean"` can be specified, which will center data on the overall mean of the data. The logical argument `add.global.mean = TRUE` used to toggle this in collapse 1.0.0 is therefore deprecated.
- Added arguments `mean = 0` (the default) and `sd = 1` (the default) to `fscale / STD`. These arguments now allow data to be (group) scaled and centered to an arbitrary mean and standard deviation. Setting `mean = FALSE` will just scale data while preserving the mean(s). Special options for grouped scaling are `mean = "overall.mean"` (same as `fwithin / W`), and `sd = "within.sd"`, which will scale the data such that the standard deviation of each group is equal to the within-standard deviation (= the standard deviation computed on the group-centered data). Thus group scaling a panel dataset with `mean = "overall.mean"` and `sd = "within.sd"` harmonizes the data across all groups in terms of both mean and variance. The fast algorithm for variance calculation toggled with `stable.algo = FALSE` was removed from `fscale`. Welford's numerically stable algorithm used by default is fast enough for all practical purposes. The fast algorithm is still available for `fvar` and `fsd`.
- Added the modulus (`%%`) and subtract modulus (`-%%`) operations to `TRA()`.
- Added the function `finteraction`, for fast interactions, and `as.character_factor` to coerce a factor, or all factors in a list, to character (analogous to `as.numeric_factor`). Also exported the function `ckmatch`, for matching with error message showing non-matched elements.
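The grouped scaling with `mean = "overall.mean"` and `sd = "within.sd"` described above can be sketched as follows. This is a plain Python illustration of the arithmetic, not the `fscale` implementation; the function name `harmonize` is hypothetical, and the use of the sample standard deviation (denominator n - 1) is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def harmonize(x, g):
    """Hypothetical sketch: give every group the overall mean and the within-sd.

    Assumes each group has at least two observations and a non-zero
    standard deviation.
    """
    groups = defaultdict(list)
    for value, key in zip(x, g):
        groups[key].append(value)

    group_mean = {k: mean(v) for k, v in groups.items()}
    group_sd = {k: stdev(v) for k, v in groups.items()}

    overall_mean = mean(x)
    # Within-sd: standard deviation of the group-centered data.
    centered = [value - group_mean[key] for value, key in zip(x, g)]
    within_sd = stdev(centered)

    # Standardize within each group, then rescale to the common targets.
    return [
        (value - group_mean[key]) / group_sd[key] * within_sd + overall_mean
        for value, key in zip(x, g)
    ]


x = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
g = ["a", "a", "a", "b", "b", "b"]
scaled = harmonize(x, g)
```

After this transformation both groups share the same mean and the same standard deviation, which is the sense in which the data are harmonized across groups.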
collapse 1.0.0 and earlier
- First version of the package featuring only the functions `collap` and `qsu`, based on code shared by Sebastian Martin Krantz on R-devel, February 2019.
- Major rework of the package using Rcpp and data.table internals, introduction of fast statistical functions and operators and expansion of the scope of the package to a broad set of data transformation and exploration tasks. Several iterations of enhancing speed of R code used. Seamless integration of collapse with dplyr, plm and data.table. CRAN release of collapse 1.0.0 on 19th March 2020.