Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

madrat caching dependent on attached packages #219

Closed
0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q opened this issue Oct 11, 2024 · 16 comments
Closed
Assignees
Labels

Comments

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member

This bug does not affect input data generation currently, but has the potential to screw it up in intractable ways.

For context, see pik-piam/mrremind#573, pik-piam/GDPuc#38 and this discussion.

Two facts can conspire to break the madrat caching for input data preparation.

  1. Packages use
    .onAttach <- function(libname, pkgname) {
        madrat::madratAttach(pkgname)
    }
    
    to be discovered by madrat and have their functions included in the fingerprinting for cache invalidation.
  2. The RSE R Programming Standards state that

    in general package::fun() should be used instead of roxygen's @importFrom

Using plausible changes to the mrdriver package [↑] as an example, GDPuc is loaded, but never attached. (So this is wrong.) Consequently, madrat caching ingnores toolConvertGDP() in the fingerprinting, possibly leading to corrupted caches.

devtools::load_all()

fp <- madrat:::fingerprint('calcGDP', details = TRUE)
# Warning messages:
# …
# 6: In getMadratGraph(...) :
#   Following functions could not be found in the scope of packages to be checked.: 
#    calcEdgeTransportSA->readEDGETransport, toolConvertGDP->calcGDP, toolConvertGDP->calcGDPpc, toolConvertGDP->calcInternalGDPFutureSSP2EU, toolConvertGDP->calcInternalGDPFutureSSPs, toolConvertSingle->calcInternalGDPFutureSSPs, toolConvertGDP->calcInternalGDPMI, toolConvertGDP->calcInternalGDPPastEurostat, toolConvertGDP->calcInternalGDPPastWDI
#   Please make sure that they exist and adjust the scope of packages accordingly!

as.vector(fp)
# [1] "32785696"

attr(fp, 'details')[grepl('Convert', names(attr(fp, 'details')))]
#      UNKNOWN:::toolConvertGDP    UNKNOWN:::toolConvertSingle mrdrivers:::toolGeneralConvert 
#                            NA                             NA                     "5d6a17ee" 

Yes, there is a warning. But the warnings of the input data generation are ignored as a matter of policy, and truncated to 50 in any case.

Actually attaching, not only loading GDPuc yields different result:

library(GDPuc)

fp2 <- madrat:::fingerprint('calcGDP', details = TRUE)

as.vector(fp2)
# [1] "7c09e366"

attr(fp2, 'details')[grepl('Convert', names(attr(fp2, 'details')))]
#        GDPuc:::toolConvertGDP      GDPuc:::toolConvertSingle mrdrivers:::toolGeneralConvert 
#                    "84d0659f"                     "73e1e1bf"                     "5d6a17ee" 

This is not an issue currently for input data generation, as GDPuc is imported explicitly in several packages (mrcommons, mrfactors, mrfaocore, mrland, mrremind, mrvalidation, piamInterfaces). But should that be changed, caches will rot.

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member Author

0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q commented Oct 11, 2024

As @fbenke-pik discovered, the situation is somewhat different. madrat only fingerprints functions if their package is registered through madratAttach(), which all packages do during attachment, and packages are only attached when they are Depends dependencies of packages that are being attached themselves. Otherwise they are only loaded.
Two possible solutions present themselves:

  1. Make all "madrat-verse" packages Depends dependencies, with the ensuing bloated search paths.
  2. Have the packages call madratAttach() during .onLoad().

@tscheypidi
Copy link
Member

The first is actually the policy we currently have in place. I think the 2nd solution does not work as madratAttach() in some cases needs to happen before .onLoad is being run.

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member Author

The first is actually the policy we currently have in place.

As in "an undocumented policy in contradiction to the stated 'R Language Rules'"?

I think the 2nd solution does not work as madratAttach() in some cases needs to happen before .onLoad is being run.

.onLoad() runs before .onAttach().

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member Author

Options for @fbenke-pik

  1. Use .onLoad() instead of .onAttach() (and .onUnload() instead of .onDetach()) in all "madrat-verse" packages.
    Both Depends and Imports dependencies are loaded before the workspace of the current package is loaded and the .onLoad() hook is called.
  2. If this cannot be coordinated, packages can switch from Depends to Imports dependencies by registering these dependencies with madrat themselves. Using mrremind as an example:
     Depends:
    -    edgeTransport (>= 1.5.5),
    -    GDPuc (>= 1.3.0),
         madrat (>= 3.7.1),
         magclass (>= 6.16.1),
    -    mrcommons (>= 1.44.12),
    -    mrdrivers (>= 2.0.0),
         R (>= 2.10.0)
     Imports:
    +    edgeTransport (>= 1.5.5),
    +    GDPuc (>= 1.3.0),
    +    mrcommons (>= 1.44.12),
    +    mrdrivers (>= 2.0.0),
         assertr,
    -.onAttach <- function(libname, pkgname) {
    -  madrat::madratAttach(pkgname)
    +.onLoad <- function(libname, pkgname) {
    +  madrat::madratAttach(c(pkgname, 'edgeTransport', 'GDPuc', 'mrcommons',
    +                         'mrdrivers'))
     }
    
    -.onDetach <- function(libpath) {
    -  madrat::madratDetach(libpath)
    +.onUnload <- function(libpath) {
    +  madrat::madratDetach(c(libpath, 'edgeTransport', 'GDPuc', 'mrcommons',
    +                         'mrdrivers'))
     }
    This would break if any of the dependencies is registered with madrat independently of mrremind and mrremind is detached – and I would not know under which circumstances that should happen.

@fbenke-pik
Copy link
Contributor

fbenke-pik commented Oct 15, 2024

Some comments / questions

  • I certainly cannot implement a solution that must be applied to all madrat-packages to work. For this, we need Jan's approval and some sort of coordination. I would assume any solution that can be applied to mrremind individually without affecting other packages should be fine. I interpret Michaja's explanation in a way that option 1 won't happen because of that.
  • Concerning the policy currently in place: How bad are bloated search paths in practice and what is the worst that could happen? So, what is the actual benefit of moving away from the current policy and how is it affecting us negatively right now?
  • I personally don't mind switching from the policy currently in place to option 2 for mrremind.
  • @tscheypidi Would it make sense to have something stronger than a warning when madrat caching / fingerprinting is unable to collect all the info needed for making a caching decision? I have seen a few instances over the last months where caching was not working as expected because our implementation did not use the right setup - and it was unnoticed for a long time.

@fbenke-pik
Copy link
Contributor

Concerning the policy currently in place: How bad are bloated search paths in practice and what is the worst that could happen? So, what is the actual benefit of moving away from the current policy and how is it affecting us negatively right now?

I guess, the answer is: "Depends makes a package vulnerable to whatever other packages are loaded by the user."

@fbenke-pik
Copy link
Contributor

fbenke-pik commented Oct 15, 2024

Actually, option 2 does not work for me. I tested on this branch: https://github.com/fbenke-pik/mrremind/tree/poc

Starting a new R session, and loading the library via devtools::load_all(), I have NAs in the dependencies

> madrat::getDependencies("calcGDP", direction = "din")
                func type   package                           call     hash
1         calcDriver calc mrdrivers         mrdrivers:::calcDriver e1d570f3
2     toolConvertGDP tool   UNKNOWN       UNKNOWN:::toolConvertGDP     <NA>
3 toolCheckUserInput tool mrdrivers mrdrivers:::toolCheckUserInput 5d8931cf

@tscheypidi
Copy link
Member

  • @tscheypidi Would it make sense to have something stronger than a warning when madrat caching / fingerprinting is unable to collect all the info needed for making a caching decision? I have seen a few instances over the last months where caching was not working as expected because our implementation did not use the right setup - and it was unnoticed for a long time.

what kind of stronger warning you have in mind?

In general, I would be in favor of a solution which does not require depends in order to work. So far I just could not find any. If you have one which is working robustly I would be interested in it.

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member Author

Bloated search paths are mostly a nuisance, in my view.

I am not convinced of the performance penalty @pascal-sauer claimed. Or more precisely, that it is relevant in this case.

Name collisions between madrat-packages are a problem in any case—or not at all, depending on how you look at it. madrat completely ignores R namespaces and acts inconsistently on the functions in registered packages.

library(madrat)

setConfig(mainfolder = tempdir(), .verbose = FALSE)
dir.create(file.path(getConfig('sourcefolder'), 'Foo'))

library(Tom)
library(Dick)

getCalculations('read')
#   type package             call
# 1  Tau  madrat madrat:::readTau
# 2  Foo     Tom    Tom:::readFoo
# 3  Foo    Dick   Dick:::readFoo

madrat:::fingerprint('readFoo')
# [1] "b4a51d22"
# attr(,"call")
# [1] "Tom:::readFoo"
# There were 50 or more warnings (use warnings() to see the first 50)
   
warnings()
# Warning messages:
# 1: In getCode(packages = packages, globalenv = globalenv) :
#   Duplicate functions found: Dick:::readFoo, Tom:::readFoo! Laste entry will be used
#

readSource('Foo', convert = FALSE)
# An object of class "magpie"
# , , 1
# 
#       year
# region [,1]
#   Dick   NA
# 
# Warning messages:
# 1: In prepFunctionName(type = type, prefix = prefix) :
#   More than one function found for type "Foo" and prefix "read". Use last occurrence (package "Dick")
# 2: In getCode(packages = packages, globalenv = globalenv) :
#   Duplicate functions found: Tom:::readFoo, Dick:::readFoo! Laste entry will be used
# 3: In prepFunctionName(type = type, prefix = prefix, ignore = ignore) :
#   More than one function found for type "Foo" and prefix "read". Use last occurrence (package "Dick")

returning the hash of one but calling the other if several functions of the same name are found. And there is no way to call madrat on functions of a specific package, so the "madrat-verse" is one big namespace where all users have to get along. So it does not matter whether the namespaces of madrat-packages would collide, since that would wreak havoc anyhow.

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member Author

Actually, option 2 does not work for me. I tested on this branch: https://github.com/fbenke-pik/mrremind/tree/poc

👉 zzz.R Sneaky to have .onLoad() and .onAttach() in different files.

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member Author

0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q commented Oct 15, 2024

Concerning the policy currently in place: How bad are bloated search paths in practice and what is the worst that could happen? So, what is the actual benefit of moving away from the current policy and how is it affecting us negatively right now?

I guess, the answer is: "Depends makes a package vulnerable to whatever other packages are loaded by the user."

No. Packages have internal search paths that are determined by the NAMESPACE file and sealed when they are loaded. A package would only be affected by user-loaded packages when they call a function that has no binding in its local search path and is therefore searched in the global one. That is exactly why R CMD check complains about "no visible global function definition for x" mrremind#559.

The situation is reversed. Using Depends-dependencies makes the user's environment dependent on what the package is doing, since they are all re-exported to the global search path.

library(tidyverse)
# …
library(mrremind)
# Loading required package: edgeTransport
# Loading required package: data.table
# data.table 1.16.2 using 2 threads (see ?getDTthreads).  Latest news: r-datatable.com
# 
# Attaching package: ‘data.table’
# 
# The following objects are masked from ‘package:lubridate’:
# 
#     hour, isoweek, mday, minute, month, quarter, second, wday, week, yday, year
# 
# The following objects are masked from ‘package:dplyr’:
# 
#     between, first, last
# 
# The following object is masked from ‘package:purrr’:
# 
#     transpose
# 
# Loading required package: mrtransport
# Loading required package: GDPuc
# Loading required package: madrat
# Loading required package: magclass
# 
# Attaching package: ‘magclass’
# 
# The following object is masked from ‘package:dplyr’:
# 
#     where
# 
# Loading required package: mrdrivers
# Loading required package: mrcommons
# Loading required package: mrfaocore
# Loading required package: mstools
# Loading required package: mrlandcore
library(mrremind)
# …
library(tidyverse)
# ✖ dplyr::between()     masks data.table::between()
# ✖ dplyr::filter()      masks stats::filter()
# ✖ dplyr::first()       masks data.table::first()
# ✖ lubridate::hour()    masks data.table::hour()
# ✖ lubridate::isoweek() masks data.table::isoweek()
# ✖ dplyr::lag()         masks stats::lag()
# ✖ dplyr::last()        masks data.table::last()
# ✖ lubridate::mday()    masks data.table::mday()
# ✖ lubridate::minute()  masks data.table::minute()
# ✖ lubridate::month()   masks data.table::month()
# ✖ lubridate::quarter() masks data.table::quarter()
# ✖ lubridate::second()  masks data.table::second()
# ✖ purrr::transpose()   masks data.table::transpose()
# ✖ lubridate::wday()    masks data.table::wday()
# ✖ lubridate::week()    masks data.table::week()
# ✖ dplyr::where()       masks magclass::where()
# ✖ lubridate::yday()    masks data.table::yday()
# ✖ lubridate::year()    masks data.table::year()

So if packages keep re-exporting other packages, user scripts (and packages using them as Depends-dependencies) are vulnerable to newly introduced bugs and conflicts. This is the reason packages should use them with extreme prejudice.

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member Author

Would it make sense to have something stronger than a warning when madrat caching / fingerprinting is unable to collect all the info needed for making a caching decision? I have seen a few instances over the last months where caching was not working as expected because our implementation did not use the right setup - and it was unnoticed for a long time.

what kind of stronger warning you have in mind?

It should outright fail. If a component cannot be hashed, then there is no way of determining the validity of the cache. "Es ist besser, nicht zu berechnen, als falsch zu berechnen."

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member Author

  • Concerning the policy currently in place: How bad are bloated search paths in practice and what is the worst that could happen? So, what is the actual benefit of moving away from the current policy and how is it affecting us negatively right now?
  1. I did not find the current policy anywhere. I do not know to what extent you were aware of it. The madrat vignette states

    Besides [adding madrat as a dependency and adding the .onAttach()/.onDetach() functions] no further changes are required and functions in the new package will be visible to the madrat wrapper functions.

    which may lead users to not add required packages as Depends-dependencies, which will fail silently as the madrat-style functions (e.g. toolConvertGDP()) will simply not be hashed by madrat.

  2. Currently, mrremind attaches eleven packages the user did not ask for: edgeTransport, GDPuc, madrat, magclass, mrcommons, mrdrivers, data.table, mrtransport, mrfaocore, mrlandcore, and mstools. Any one of those starts exporting a function that clashes with something else the use is doing: things break for no apparent reason.

@fbenke-pik
Copy link
Contributor

fbenke-pik commented Oct 16, 2024

Thanks for the comments, Michaja. Very insightful.

👉 zzz.R Sneaky to have .onLoad() and .onAttach() in different files.

Right, works as expected once the duplicate functions are removed.

It should outright fail. If a component cannot be hashed, then there is no way of determining the validity of the cache. "Es ist besser, nicht zu berechnen, als falsch zu berechnen."

I agree, at least for REMIND mr packages. I want it to fail and cause people to halt. If there are cases where this should not fail, we could make it a madrat config setting that defaults to false. I could draft a PR if @tscheypidi is open to move into this direction.

I did not find the current policy anywhere. I do not know to what extent you were aware of it.

I just was made aware of it through our discussions here. But i find it in the madrat vignette as well: https://github.com/pik-piam/madrat/blob/master/vignettes/madrat.Rmd#L176-L180

The sentence you quoted might be misleading, but if you read the whole section Create madrat-based package, it should be clear what to do.

@fbenke-pik
Copy link
Contributor

Following Michaja's comments, I have created a PR for mrremind: pik-piam/mrremind#578
Seems to work as expected.

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member Author

I did not find the current policy anywhere. I do not know to what extent you were aware of it.

I just was made aware of it through our discussions here. But i find it in the madrat vignette as well: https://github.com/pik-piam/madrat/blob/master/vignettes/madrat.Rmd#L176-L180

The sentence you quoted might be misleading, but if you read the whole section Create madrat-based package, it should be clear what to do.

No. What one needs to do is "when creating a madrat-based package that uses functions of other madrat-based packages, these have to be declared as dependencies under Depends, not under Imports, in order for madrat to cache all functions correctly." There is no way anybody understands this from that section.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants