POC bin breaks derived from scale breaks #6174

arcresu · 2024-11-01T06:14:59Z

This is a proof of concept of the minimal changes necessary to fix #6159. If you're willing to consider this approach I'll finish it off with documentation, tests, and the outstanding TODOs below.

The first part is essentially the same as the extension discussed in the issue: the follow.scale param on stat_bin causes it to inherit bins from the scale. As noted, that only works if the scale doesn't get new breaks during the final retraining, so you either have to provide fixed breaks, or disable scale expansion and hope other layers don't cause issues. In this example the bins don't align with the final breaks because the scale expands after the binning, causing the breaks to move.

(TODO: add follow.scale to the other binning stats. Suppress the default binning warning when follow.scale = TRUE. Add a value like follow.scale = "minor" to allow inheriting major and minor breaks?)

devtools::load_all("~/code/ggplot2")
#> ℹ Loading ggplot2

set.seed(2024)
df <- data.frame(
  date = as.Date("2024-01-01") + rnorm(100, 0, 5),
  z = sample(c("a", "b"), 100, replace = TRUE)
)

ggplot(df, aes(date)) +
  geom_histogram(follow.scale = TRUE)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The fix is to tell the scale that we want the breaks to be "frozen" before the stats are computed. Subsequent retraining is free to change the limits, which affects which breaks are shown, but once the breaks are frozen it acts as though they had been passed in as an explicit breaks vector.

(TODO: Add a param to the continuous scale constructor and scale_{x,y}_{continuous,date,datetime}. Maybe come up with a better name than freezing, like breaks_computation = c("auto", "before_stat"))

ggplot(df, aes(date)) +
  geom_histogram(follow.scale = TRUE) +
  ggproto(NULL, scale_x_date(), freeze_breaks = TRUE)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It also looks reasonable when there are multiple facets:

ggplot(df, aes(date)) +
  geom_histogram(follow.scale = TRUE) +
  facet_wrap(vars(z)) +
  ggproto(NULL, scale_x_date(), freeze_breaks = TRUE)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Adding a distant data point and setting the scales to free, we can see that the binning is done independently for different facets:

rbind(df, data.frame(date = as.Date("2025-03-01"), z = "a")) |>
  ggplot(aes(date)) +
  geom_histogram(follow.scale = TRUE) +
  facet_wrap(vars(z), scales = "free_x") +
  ggproto(NULL, scale_x_date(), freeze_breaks = TRUE, guide = guide_axis(angle = 90)) 
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2024-11-01 with reprex v2.1.1

This is probably the desirable behaviour, since we did explicitly request free scales here. Changing it to make the binning consistent across panels would also be a bit complicated because I think the facets clone the scales before the first time breaks are computed.

To make the combination of settings more discoverable, it's probably reasonable to add a warning when using follow.scale with a scale that doesn't have freeze_breaks = TRUE.
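A rough sketch of what such a check could look like, in base R only. The helper name `check_follow_scale()` is hypothetical, the scale is represented by a plain list rather than a real ggproto object, and `freeze_breaks` is the field from this POC rather than an existing ggplot2 feature:

```r
# Hypothetical helper: warn when bins follow a scale whose breaks can still move.
# `scale` is a plain list standing in for a ggproto Scale object.
check_follow_scale <- function(follow.scale, scale) {
  if (isTRUE(follow.scale) && !isTRUE(scale$freeze_breaks)) {
    warning(
      "`follow.scale = TRUE` is used with a scale whose breaks are not frozen; ",
      "bins may not align with the final breaks.",
      call. = FALSE
    )
  }
  invisible(scale)
}

# No warning: the breaks are frozen before the stat is computed
check_follow_scale(TRUE, list(freeze_breaks = TRUE))
```

In the real implementation the check would presumably live wherever the stat first has access to the trained scale.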

Please let me know if I've overlooked some way that these changes will cause problems with other parts of ggplot!

thomasp85 · 2024-11-01T07:57:31Z

I'm a bit wary of adding this kind of link between scales and stats tbh. It makes the code much harder to reason about. Is there anything in this you couldn't achieve with a binned scale and geom_bar()?

teunbrand · 2024-11-01T07:59:16Z

From the linked issue:

I can't for example add a geom_vline() to mark a specific date on the axis, since the vertical line would then be snapped into a bin by the scale transform.

arcresu · 2024-11-01T08:44:26Z

Thanks, yes it's exactly that. A typical use case I'm working with is epidemic curves that are annotated with contextual events. To pick a random example online, something like this (but sometimes the date ranges involved are shorter so snapping to the bin centre is a more significant change):

The main data is a histogram with date bins that are either epidemiological weeks (i.e. weeks with the boundary fixed to a specific day of the week depending on local conventions) or calendar months/years. Annotations refer to events on a specific day or are time series that might be binned differently. A binned scale would force all layers to use the same bins.

The follow.scale part of this change by itself could easily live in an extension, but the frozen breaks part is hard to do outside of ggplot.

thomasp85 · 2024-11-01T08:59:21Z

Ah, I see... Sorry for driving by with a half-informed suggestion 🙈

thomasp85 · 2024-11-01T09:01:09Z

I'm still very much against geoms/stats being able to control aspects of the scaling. This kind of flow of control could lead to incompatibility between layers etc (and make code harder to reason about)

arcresu · 2024-11-01T09:29:37Z

The change I proposed is the opposite: scales influence stats. I get that it stretches the abstractions a bit unnaturally, though.

The status quo solution is to separately tell the scale and the stat to use the same bin/break width and offset. It's certainly doable when you understand the issue, but it tends to be brittle.

So I suppose a compromise could be tweaks to the way breaks and bins are specified (especially with date scales) so that there's a more discoverable way to control them with the same syntax? I've seen a lot of cases where people worked out how to synchronise bin widths (binwidth=7) and break widths (date_breaks="week") but got stuck with the offsets.
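For reference, the status quo approach can be sketched roughly as follows: compute one vector of week boundaries with an explicit weekday offset and hand the same vector to both the stat and the scale. The helper `week_breaks()` is hypothetical (not part of ggplot2 or {scales}), and passing a Date vector to `geom_histogram(breaks = )` assumes a reasonably recent ggplot2:

```r
# Hypothetical helper: weekly boundaries covering `dates`, anchored to a weekday.
# week_start uses ISO weekday numbers: 1 = Monday, ..., 7 = Sunday.
week_breaks <- function(dates, week_start = 1) {
  # Truncate to whole days, since Date values may carry fractional parts
  first_day <- as.Date(format(min(dates)))
  last_day  <- as.Date(format(max(dates)))
  wday <- as.integer(format(first_day, "%u"))
  # Step back to the most recent occurrence of the chosen weekday
  first_break <- first_day - (wday - week_start) %% 7
  # Extend one week past the last day so the final bin is closed
  seq(first_break, last_day + 7, by = "week")
}

set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))
brks <- week_breaks(df$date, week_start = 1)

# The same vector is passed to both the stat and the scale
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(date)) +
    geom_histogram(breaks = brks) +
    scale_x_date(breaks = brks)
}
```

This keeps bins and breaks aligned, but the offset logic has to be written by hand and repeated for every plot, which is the brittle part.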

thomasp85 · 2024-11-01T09:33:28Z

My reading skills are obviously impaired this morning 🙈

arcresu · 2024-11-06T03:39:10Z

I've found a different way to get the result I'm after without needing any changes to the scale internals. Essentially, rather than the scales going out of their way to avoid updating breaks after the first time, the breaks function is memoised by wrapping it together with an environment that stores the first result. I previously thought that interior mutability in a function wouldn't be possible.

library(ggplot2)
#> 
#> Attaching package: 'ggplot2'
#> The following object is masked from 'package:base':
#> 
#>     is.element

set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

StatBin2 <- ggproto(
  "StatBin2", StatBin,
  compute_panel = function(self, data, scales, breaks = NULL, ...) {
    breaks <- breaks %||% scales$x$get_transformation()$inverse(scales$x$get_breaks())
    ggproto_parent(StatBin, self)$compute_panel(data, scales, breaks = breaks, ...)
  }
)

p <- ggplot(df, aes(date)) + geom_histogram(stat = StatBin2)

p + scale_x_date(breaks = scales::breaks_width("2 week"))
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.

(this bit as before, with breaks undesirably recomputed after scales are expanded)

breaks_cached <- function(breaks) {
  ggplot2::ggproto(
    "BreaksCached", NULL,
    fn = breaks,
    cached = NULL,
    get_breaks = function(self, limits) {
      if (is.null(self$cached)) self$cached <- self$fn(limits)
      self$cached
    }
  )$get_breaks
}

p + scale_x_date(breaks = breaks_cached(scales::breaks_width("2 week")))
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2024-11-06 with reprex v2.1.1

It's feasible for an extension to handle everything by providing its own version of each binning stat (which is a minor hassle to substitute in for the normal stats) along with breaks_cached().

Alternatively there could be a less invasive ggplot implementation with follow.scale on the binning stats and breaks_cached() in {scales}. It doesn't have to use ggproto - this also works:

breaks_cached <- function(f) {
  fn <- function(...) {
    if (is.null(.cache.env$cached)) {
      .cache.env$cached <- .cache.env$inner(...)
    }
    .cache.env$cached
  }
  e <- new.env()
  e$inner <- f
  e$cached <- NULL
  rlang::fn_env(fn)$.cache.env <- e
  fn
}
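The same memoisation can probably be written as a plain closure without rlang. The sketch below is only an illustration: `breaks_cached2()` and the stand-in `int_breaks()` are hypothetical names, and the demo uses numeric limits rather than a real scale.

```r
# Hypothetical variant of breaks_cached(): plain closure, base R only.
breaks_cached2 <- function(f) {
  cached <- NULL
  function(limits, ...) {
    # Compute the breaks on the first call only; later calls ignore `limits`
    if (is.null(cached)) {
      cached <<- f(limits, ...)
    }
    cached
  }
}

# Stand-in breaks function (not from {scales}): whole-number breaks over the limits
int_breaks <- function(limits) seq(floor(limits[1]), ceiling(limits[2]), by = 1)

b <- breaks_cached2(int_breaks)
first  <- b(c(0.2, 3.7))  # computes and stores the breaks
second <- b(c(-5, 10))    # limits have expanded, cached breaks are returned
identical(first, second)
```

One caveat with any version of this: the cache lives in the function object, so a fresh wrapper has to be created for each plot, otherwise the second plot would silently reuse the breaks from the first.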

Does this approach feel better for ggplot?

POC bin breaks derived from scale breaks

f9618ac

arcresu marked this pull request as draft November 1, 2024 06:17
