Commit

Merge pull request #37 from mayer79/resubmission
Resubmission
mayer79 authored Jul 18, 2023
2 parents 9a3eb7d + 88c6418 commit 21d2252
Showing 18 changed files with 92 additions and 74 deletions.
3 changes: 2 additions & 1 deletion .Rbuildignore
@@ -8,4 +8,5 @@
^\.Rproj\.user$
^docu$
^test.R$
^backlog$
^backlog$
^CRAN-SUBMISSION$
3 changes: 3 additions & 0 deletions CRAN-SUBMISSION
@@ -0,0 +1,3 @@
Version: 0.1.0
Date: 2023-07-16 18:48:26 UTC
SHA: 9638974266fcc5f32a51870bccf0d08bde810551
19 changes: 10 additions & 9 deletions DESCRIPTION
@@ -3,15 +3,16 @@ Title: Interaction Statistics
Version: 0.1.0
Authors@R:
person("Michael", "Mayer", , "[email protected]", role = c("aut", "cre"))
Description: Fast, model-agnostic implementation of Friedman and Popescu's
H statistics of interaction strength <doi:10.1214/07-AOAS148>. These
statistics quantify interaction strength per feature, feature pair,
and feature triple. The package supports multi-output predictions and
can account for case weights. In addition, several variants of the
original statistics are provided. The shape of the interactions can
be explored through partial dependence plots or individual conditional
expectation plots. 'DALEX' explainers, meta learners ('mlr3', 'tidymodels',
'caret') and most other models work out-of-the-box.
Description: Fast, model-agnostic implementation of different H-statistics
introduced by Jerome H. Friedman and Bogdan E. Popescu (2008)
<doi:10.1214/07-AOAS148>. These statistics quantify interaction
strength per feature, feature pair, and feature triple. The package
supports multi-output predictions and can account for case weights.
In addition, several variants of the original statistics are provided.
The shape of the interactions can be explored through partial
dependence plots or individual conditional expectation plots. 'DALEX'
explainers, meta learners ('mlr3', 'tidymodels', 'caret') and most
other models work out-of-the-box.
License: GPL (>= 2)
Depends:
R (>= 3.2.0)
6 changes: 3 additions & 3 deletions R/H2_overall.R
@@ -1,6 +1,6 @@
#' Overall Interaction Strength
#'
#' Friedman and Popescu's \eqn{H^2_j} statistics of overall interaction strength per
#' Friedman and Popescu's statistic of overall interaction strength per
#' feature, see Details.
#' By default, the results are plotted as barplot. Set `plot = FALSE` to get numbers.
#'
@@ -13,10 +13,10 @@
#' \deqn{
#' F(\mathbf{x}) = F_j(x_j) + F_{\setminus j}(\mathbf{x}_{\setminus j}).
#' }
#' Correspondingly, Friedman and Popescu's \eqn{H^2_j} statistic of overall interaction
#' Correspondingly, Friedman and Popescu's statistic of overall interaction
#' strength is given by
#' \deqn{
#' H_{j}^2 = \frac{\frac{1}{n} \sum_{i = 1}^n\big[F(\mathbf{x}_i) -
#' H_j^2 = \frac{\frac{1}{n} \sum_{i = 1}^n\big[F(\mathbf{x}_i) -
#' \hat F_j(x_{ij}) - \hat F_{\setminus j}(\mathbf{x}_{i\setminus j})
#' \big]^2}{\frac{1}{n} \sum_{i = 1}^n\big[F(\mathbf{x}_i)\big]^2}
#' }
4 changes: 2 additions & 2 deletions R/H2_pairwise.R
@@ -1,6 +1,6 @@
#' Pairwise Interaction Strength
#'
#' Friedman and Popescu's statistics of pairwise interaction strength, see Details.
#' Friedman and Popescu's statistic of pairwise interaction strength, see Details.
#' By default, the results are plotted as barplot. Set `plot = FALSE` to get numbers.
#'
#' @details
@@ -11,7 +11,7 @@
#' \deqn{
#' F_{jk}(x_j, x_k) = F_j(x_j)+ F_k(x_k).
#' }
#' Correspondingly, Friedman and Popescu's \eqn{H_{jk}^2} statistic of pairwise
#' Correspondingly, Friedman and Popescu's statistic of pairwise
#' interaction strength is defined as
#' \deqn{
#' H_{jk}^2 = \frac{A_{jk}}{\frac{1}{n} \sum_{i = 1}^n\big[\hat F_{jk}(x_{ij}, x_{ik})\big]^2},
10 changes: 5 additions & 5 deletions R/hstats.R
@@ -2,14 +2,14 @@
#'
#' @description
#' This is the main function of the package. It does the expensive calculations behind
#' the following interaction statistics:
#' the following H-statistics:
#' - Total interaction strength \eqn{H^2}, a statistic measuring the proportion of
#' prediction variability unexplained by main effects of `v`, see [h2()] for details.
#' - Friedman and Popescu's \eqn{H^2_j} statistic of overall interaction strength per
#' - Friedman and Popescu's statistic \eqn{H^2_j} of overall interaction strength per
#' feature, see [h2_overall()] for details.
#' - Friedman and Popescu's \eqn{H^2_{jk}} statistic of pairwise interaction strength,
#' - Friedman and Popescu's statistic \eqn{H^2_{jk}} of pairwise interaction strength,
#' see [h2_pairwise()] for details.
#' - Friedman and Popescu's \eqn{H^2_{jkl}} statistic of three-way interaction strength,
#' - Friedman and Popescu's statistic \eqn{H^2_{jkl}} of three-way interaction strength,
#' see [h2_threeway()] for details.
#'
#' Furthermore, it allows to calculate an experimental partial dependence based
@@ -283,7 +283,7 @@ hstats.explainer <- function(object, v = colnames(object[["data"]]),

#' Print Method
#'
#' Print method for object of class "hstats". Shows \eqn{H^2} statistic.
#' Print method for object of class "hstats". Shows \eqn{H^2}.
#'
#' @param x An object of class "hstats".
#' @param ... Further arguments passed from other methods.
4 changes: 2 additions & 2 deletions R/pd_importance.R
@@ -2,8 +2,8 @@
#'
#' Experimental variable importance method based on partial dependence functions.
#' While related to Greenwell et al., our suggestion measures not only main effect
#' strength but also interaction effects. It is very closely related to the
#' \eqn{H^2_j} statistics, see Details. By default, the results are plotted as barplot.
#' strength but also interaction effects. It is very closely related to \eqn{H^2_j},
#' see Details. By default, the results are plotted as barplot.
#' Set `plot = FALSE` to get numbers.
#'
#' @details
17 changes: 8 additions & 9 deletions README.md
@@ -5,7 +5,6 @@
[![CRAN status](http://www.r-pkg.org/badges/version/hstats)](https://cran.r-project.org/package=hstats)
[![R-CMD-check](https://github.com/mayer79/hstats/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mayer79/hstats/actions)
[![Codecov test coverage](https://codecov.io/gh/mayer79/hstats/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mayer79/hstats?branch=main)
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#experimental)

[![](https://cranlogs.r-pkg.org/badges/hstats)](https://cran.r-project.org/package=hstats)
[![](https://cranlogs.r-pkg.org/badges/grand-total/hstats?color=orange)](https://cran.r-project.org/package=hstats)
@@ -16,7 +15,7 @@

**What makes a ML model black-box? It's the interactions!**

The first step in understanding interactions is to measure their strength. This is exactly what Friedman and Popescu's H statistics [1] do:
The first step in understanding interactions is to measure their strength. This is exactly what Friedman and Popescu's H-statistics [1] do:

| Statistic | Short description | How to read its value? |
|-------------|------------------------------------------|-------------------------------------------------------------------------------------------------------|
@@ -32,8 +31,8 @@ The core functions `hstats()`, `partial_dep()`, and `ice()` can directly be appl

## Limitations

1. H statistics are based on partial dependence estimates and are thus as good or bad as these. One of their problems is that the model is applied to unseen/impossible feature combinations. In extreme cases, H statistics intended to be in the range between 0 and 1 can become larger than 1. Accumulated local effects (ALE) [8] mend above problem of partial dependence estimates. They, however, depend on the notion of "closeness", which is highly non-trivial in higher dimension and for discrete features.
2. Due to their computational complexity, H statistics are usually evaluated on relatively small subsets of the training (or validation/test) data. Consequently, the estimates are typically not very robust. To get more robust results, increase the default `n_max = 300` of `hstats()`.
1. H-statistics are based on partial dependence estimates and are thus as good or bad as these. One of their problems is that the model is applied to unseen/impossible feature combinations. In extreme cases, H-statistics intended to be in the range between 0 and 1 can become larger than 1. Accumulated local effects (ALE) [8] mend above problem of partial dependence estimates. They, however, depend on the notion of "closeness", which is highly non-trivial in higher dimension and for discrete features.
2. Due to their computational complexity, H-statistics are usually evaluated on relatively small subsets of the training (or validation/test) data. Consequently, the estimates are typically not very robust. To get more robust results, increase the default `n_max = 300` of `hstats()`.

## Landscape

@@ -93,7 +92,7 @@ fit <- xgb.train(

### Interaction statistics

Let's calculate different H statistics via `hstats()`:
Let's calculate different H-statistics via `hstats()`:

```r
# 3 seconds on simple laptop - a random forest will take 1-2 minutes
@@ -122,7 +121,7 @@ plot(s) # Or summary(s) for numeric output
**Remarks**

1. Pairwise statistics $H^2_{jk}$ are calculated only for the features with strong overall interactions $H^2_j$.
2. H statistics need to repeatedly calculate predictions on up to $n^2$ rows. That is why {hstats} samples 300 rows by default. To get more robust results, increase this value at the price of slower run time.
2. H-statistics need to repeatedly calculate predictions on up to $n^2$ rows. That is why {hstats} samples 300 rows by default. To get more robust results, increase this value at the price of slower run time.
3. Pairwise statistics $H^2_{jk}$ measures interaction strength relative to the combined effect of the two features. This does not necessarily show which interactions are strongest in absolute numbers. To do so, we can study unnormalized statistics:

```r
@@ -261,7 +260,7 @@ $$
F(\boldsymbol x) = F_j(x_j) + F_{\setminus j}(\boldsymbol x_{\setminus j}).
$$

Correspondingly, Friedman and Popescu's $H^2_j$ statistic of overall interaction strength is given by
Correspondingly, Friedman and Popescu's statistic of overall interaction strength is given by

$$
H_{j}^2 = \frac{\frac{1}{n} \sum_{i = 1}^n\big[F(\boldsymbol x_i) - \hat F_j(x_{ij}) - \hat F_{\setminus j}(\boldsymbol x_{i\setminus j})\big]^2}{\frac{1}{n} \sum_{i = 1}^n\big[F(\boldsymbol x_i)\big]^2}.
@@ -285,7 +284,7 @@ $$
F_{jk}(x_j, x_k) = F_j(x_j) + F_k(x_k).
$$

Correspondingly, Friedman and Popescu's $H_{jk}^2$ statistic of pairwise interaction strength is defined as
Correspondingly, Friedman and Popescu's statistic of pairwise interaction strength is defined as

$$
H_{jk}^2 = \frac{A_{jk}}{\frac{1}{n} \sum_{i = 1}^n\big[\hat F_{jk}(x_{ij}, x_{ik})\big]^2}
@@ -371,7 +370,7 @@ In [5], $1 - H^2$ is called *additivity index*. A similar measure using accumula

#### Workflow

Calculation of all $H_j^2$ statistics requires $O(n^2 p)$ predictions, while calculating of all pairwise $H_{jk}$ requires $O(n^2 p^2)$ predictions. Therefore, we suggest to reduce the workflow in two important ways:
Calculation of all $H_j^2$ requires $O(n^2 p)$ predictions, while calculating of all pairwise $H_{jk}$ requires $O(n^2 p^2)$ predictions. Therefore, we suggest to reduce the workflow in two important ways:

1. Evaluate the statistics only on a subset of the data, e.g., on $n' = 300$ observations.
2. Calculate $H_j^2$ for all features. Then, select a small number $m = O(\sqrt{p})$ of features with highest $H^2_j$ and do pairwise calculations only on this subset.
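
The two steps above can be sketched in base R. This is only an illustrative sketch under stated assumptions, not the package's implementation: the toy `pred_fun`, the helper `pd_centered()`, and the values `n_sub = 50` and `m = 2` are all hypothetical and chosen to keep the example small.

```r
# Toy prediction function (assumption): x1 and x2 interact, x3 and x4 are additive
pred_fun <- function(X) X$x1 * X$x2 + X$x3 + 0.5 * X$x4

set.seed(1)
n <- 1000
X_full <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))

# Step 1: evaluate the statistics on a subset of n' rows only
n_sub <- 50
X <- X_full[sample(n, n_sub), ]

# Centered partial dependence of the feature set v, evaluated at every row of X.
# Each call predicts on n_sub * n_sub rows, hence the quadratic cost in n.
pd_centered <- function(v) {
  pd <- vapply(seq_len(nrow(X)), function(i) {
    Xi <- X
    for (col in v) Xi[[col]] <- X[[col]][i]  # fix features v at row i's values
    mean(pred_fun(Xi))
  }, numeric(1))
  pd - mean(pd)
}

f <- pred_fun(X)
f <- f - mean(f)
features <- names(X)

# Step 2a: overall statistic H^2_j for every feature
h2_j <- vapply(features, function(j) {
  resid <- f - pd_centered(j) - pd_centered(setdiff(features, j))
  mean(resid^2) / mean(f^2)
}, numeric(1))

# Step 2b: pairwise statistics only among the m features with highest H^2_j
m <- 2
top <- names(sort(h2_j, decreasing = TRUE))[seq_len(m)]
pairs <- combn(top, 2, simplify = FALSE)
h2_jk <- vapply(pairs, function(jk) {
  f_jk <- pd_centered(jk)
  a_jk <- mean((f_jk - pd_centered(jk[1]) - pd_centered(jk[2]))^2)
  a_jk / mean(f_jk^2)
}, numeric(1))
names(h2_jk) <- vapply(pairs, paste, character(1), collapse = ":")

print(round(h2_j, 3))
print(round(h2_jk, 3))
```

In this toy setup only `x1` and `x2` interact, so they are the two features selected for the pairwise step, while the additive features `x3` and `x4` get an overall statistic of numerically zero.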
11 changes: 7 additions & 4 deletions cran-comments.md
@@ -1,3 +1,10 @@
# Resubmission

- Fixing an indirect URL in the README
- Sticking to "authors (year) <doi>" reference in DESCRIPTION.

# Original message

Hello CRAN team

Trying to submit a new package that calculates Friedman and Popescu's H statistics in many variants.
@@ -10,9 +17,6 @@ Michael

## Local checks seem ok

❯ checking for future file timestamps ... NOTE
unable to verify current time

❯ checking HTML version of manual ... NOTE
Skipping checking HTML validation: no command 'tidy' found

@@ -21,7 +25,6 @@ Michael
New submission

Possibly misspelled words in DESCRIPTION:
Popescu's (6:66)
explainers (13:32)

## Winbuilder seems ok
51 changes: 31 additions & 20 deletions docu/document.log
@@ -1,4 +1,4 @@
This is pdfTeX, Version 3.141592653-2.6-1.40.24 (MiKTeX 22.3) (preloaded format=pdflatex 2022.5.15) 16 JUL 2023 20:32
This is pdfTeX, Version 3.141592653-2.6-1.40.24 (MiKTeX 22.3) (preloaded format=pdflatex 2022.5.15) 17 JUL 2023 21:26
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
@@ -49,33 +49,44 @@ LaTeX Font Info: External font `cmex10' loaded for size
(Font) <7> on input line 10.
LaTeX Font Info: External font `cmex10' loaded for size
(Font) <5> on input line 10.
[1

{C:/Users/Michael/AppData/Local/MiKTeX/2.9/fonts/map/pdftex/pdftex.map}] [2] [3
] [4] (document.bbl) [5] (document.aux) )
Overfull \hbox (1.00066pt too wide) in paragraph at lines 28--30
\OT1/cmr/m/n/10 Correspondingly, Fried-man and Popescu's statis-tic of over-all
in-ter-ac-tion strength
[]

[1

{C:/Users/Michael/AppData/Local/MiKTeX/2.9/fonts/map/pdftex/pdftex.map}]
Overfull \hbox (8.00066pt too wide) in paragraph at lines 48--50
\OT1/cmr/m/n/10 Correspondingly, Fried-man and Popescu's statis-tic of pair-wis
e in-ter-ac-tion strength
[]

[2] [3] [4] (document.bbl) [5] (document.aux) )
Here is how much of TeX's memory you used:
459 strings out of 478608
8568 string characters out of 2850693
306862 words of memory out of 3000000
18687 multiletter control sequences out of 15000+600000
472077 words of font info for 37 fonts, out of 8000000 for 9000
1141 hyphenation exceptions out of 8191
34i,6n,38p,423b,184s stack positions out of 10000i,1000n,20000p,200000b,80000s
<C:\Users\Michael\AppData\Local\MiKTe
X\2.9\fonts/pk/ljfour/jknappen/ec/dpi600\tcrm1000.pk><C:/Program Files/MiKTeX 2
.9/fonts/type1/public/amsfonts/cm/cmbx10.pfb><C:/Program Files/MiKTeX 2.9/fonts
/type1/public/amsfonts/cm/cmbx12.pfb><C:/Program Files/MiKTeX 2.9/fonts/type1/p
ublic/amsfonts/cm/cmbx7.pfb><C:/Program Files/MiKTeX 2.9/fonts/type1/public/ams
fonts/cm/cmex10.pfb><C:/Program Files/MiKTeX 2.9/fonts/type1/public/amsfonts/cm
/cmmi10.pfb><C:/Program Files/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmmi5.p
fb><C:/Program Files/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmmi7.pfb><C:/Pr
ogram Files/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmr10.pfb><C:/Program Fil
es/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmr7.pfb><C:/Program Files/MiKTeX
2.9/fonts/type1/public/amsfonts/cm/cmsy10.pfb><C:/Program Files/MiKTeX 2.9/font
s/type1/public/amsfonts/cm/cmsy5.pfb><C:/Program Files/MiKTeX 2.9/fonts/type1/p
ublic/amsfonts/cm/cmsy7.pfb><C:/Program Files/MiKTeX 2.9/fonts/type1/public/ams
fonts/cm/cmti10.pfb>
Output written on document.pdf (5 pages, 174976 bytes).
34i,6n,38p,423b,182s stack positions out of 10000i,1000n,20000p,200000b,80000s
<C:\Users\Michael\AppData\Local
\MiKTeX\2.9\fonts/pk/ljfour/jknappen/ec/dpi600\tcrm1000.pk><C:/Program Files/Mi
KTeX 2.9/fonts/type1/public/amsfonts/cm/cmbx10.pfb><C:/Program Files/MiKTeX 2.9
/fonts/type1/public/amsfonts/cm/cmbx12.pfb><C:/Program Files/MiKTeX 2.9/fonts/t
ype1/public/amsfonts/cm/cmbx7.pfb><C:/Program Files/MiKTeX 2.9/fonts/type1/publ
ic/amsfonts/cm/cmex10.pfb><C:/Program Files/MiKTeX 2.9/fonts/type1/public/amsfo
nts/cm/cmmi10.pfb><C:/Program Files/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/c
mmi5.pfb><C:/Program Files/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmmi7.pfb>
<C:/Program Files/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmr10.pfb><C:/Progr
am Files/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmr7.pfb><C:/Program Files/M
iKTeX 2.9/fonts/type1/public/amsfonts/cm/cmsy10.pfb><C:/Program Files/MiKTeX 2.
9/fonts/type1/public/amsfonts/cm/cmsy5.pfb><C:/Program Files/MiKTeX 2.9/fonts/t
ype1/public/amsfonts/cm/cmsy7.pfb><C:/Program Files/MiKTeX 2.9/fonts/type1/publ
ic/amsfonts/cm/cmti10.pfb>
Output written on document.pdf (5 pages, 174950 bytes).
PDF statistics:
88 PDF objects out of 1000 (max. 8388607)
0 named destinations out of 1000 (max. 500000)
Binary file modified docu/document.synctex.gz
Binary file not shown.
10 changes: 5 additions & 5 deletions docu/document.tex
@@ -26,7 +26,7 @@ \subsection{Overall interaction strength}
$$
F(\mathbf{x}) = F_j(x_j) + F_{\setminus j}(\mathbf{x}_{\setminus j}).
$$
Correspondingly, Friedman and Popescu's $H^2_j$ statistic of overall interaction strength is given by
Correspondingly, Friedman and Popescu's statistic of overall interaction strength is given by
$$
H_{j}^2 = \frac{\frac{1}{n} \sum_{i = 1}^n\big[F(\mathbf{x}_i) - \hat F_j(x_{ij}) - \hat F_{\setminus j}(\mathbf{x}_{i\setminus j})\big]^2}{\frac{1}{n} \sum_{i = 1}^n\big[F(\mathbf{x}_i)\big]^2}.
$$
@@ -46,7 +46,7 @@ \subsection{Pairwise interaction strength}
$$
F_{jk}(x_j, x_k) = F_j(x_j)+ F_k(x_k).
$$
Correspondingly, Friedman and Popescu's $H_{jk}^2$ statistic of pairwise interaction strength is defined as
Correspondingly, Friedman and Popescu's statistic of pairwise interaction strength is defined as
$$
H_{jk}^2 = \frac{A_{jk}}{\frac{1}{n} \sum_{i = 1}^n\big[\hat F_{jk}(x_{ij}, x_{ik})\big]^2}
$$
@@ -106,7 +106,7 @@ \subsection{Total interaction strength of all variables together}
In \cite{zolkowski2023}, $1 - H^2$ is called {\em additivity index}. A similar measure using accumulated local effects is discussed in \cite{molnar2020}.

\subsection{Workflow}
Calculation of all $H_j^2$ statistics requires $O(n^2p)$ predictions, while calculating of all pairwise $H_{jk}$ requires $O(n^2 p^2)$ predictions. Therefore, we suggest to reduce the workflow in two important ways:
Calculation of all $H_j^2$ requires $O(n^2p)$ predictions, while calculating of all pairwise $H_{jk}$ requires $O(n^2 p^2)$ predictions. Therefore, we suggest to reduce the workflow in two important ways:
\begin{itemize}
\item Evaluate the statistics only on a subset of the data, e.g., on $n' = 300$ observations.
\item Calculate $H_j^2$ for all features. Then, select a small number $m = O(\sqrt{p})$ of features with highest $H^2_j$ and do pairwise calculations only on this subset.
@@ -130,9 +130,9 @@ \section{Variable importance}
\section{Limitation}

\begin{enumerate}
\item H statistics are based on partial dependence estimates and are thus as good or bad as these. One of their problems is that the model is applied to unseen/impossible feature combinations. In extreme cases, H statistics intended to be in the range between 0 and 1 can become larger than 1.
\item H-statistics are based on partial dependence estimates and are thus as good or bad as these. One of their problems is that the model is applied to unseen/impossible feature combinations. In extreme cases, H-statistics intended to be in the range between 0 and 1 can become larger than 1.
Accumulated local effects (ALE) \cite{apley2016} mend above problem of partial dependence estimates. They, however, depend on the notion of ``closeness'', which is highly non-trivial in higher dimension and for discrete features.
\item Due to their computational complexity, H statistics are usually evaluated on relatively small subsets of the training (or validation/test) data. Consequently, the estimates are typically not very robust. To get more robust results, increase the sample size.
\item Due to their computational complexity, H-statistics are usually evaluated on relatively small subsets of the training (or validation/test) data. Consequently, the estimates are typically not very robust. To get more robust results, increase the sample size.
\end{enumerate}

\bibliographystyle{ieeetr}