
LRP #2

Open
albermax opened this issue Feb 26, 2018 · 20 comments

Comments

@albermax
Owner

albermax commented Feb 26, 2018

  • Restructure code.
  • Add pooling layers. (Max pooling is covered by the gradient; sum pooling should work with the LRP rules, which means a reverse hook needs to be added.)
  • Add merge layers, i.e., add reverse hooks.
  • (Still needed?) Add an LNR layer to innvestigate.layers and create an LRP rule to handle it.
  • Add a check for every Keras layer whether it is compatible; if not, raise an exception.
  • Get LRP to work for all application networks.
  • Organize the module: which rules to keep, and how to divide DT and LRP?
@sebastian-lapuschkin
Contributor

So far, only the VGG* models seem to produce non-NaN analyses, even for lrp.alpha1beta0.
For VGG16, lrp.epsilon also produces NaN outputs, which should not happen. All other LRP variants are fine.

@albermax
Owner Author

Ok, thanks for letting me know. I did not test beyond ResNet-50.

Did you test alpha-beta or epsilon with these networks on Caffe?
There is a flag reverse_check_min_max_values that prints the min/max values of the tensors along the reversed graph. Could you please check whether they grow increasingly larger/smaller with the depth of the networks?

How are you getting on with LRP? Do you have any major issues?

@sebastian-lapuschkin
Contributor

sebastian-lapuschkin commented Mar 22, 2018

As far as I know, Alex Binder has applied LRP to ResNet-50 on Caffe, with the restriction to alpha-beta for the layers merging skip connections.

Unfortunately, progress is slower than expected, partly due to the rather thin documentation and the spread-out structure of the code, and partly due to TensorFlow being nigh unusable once numeric hiccups occur (several minutes per analyzed image as soon as NaNs pop up), which makes debugging a mess.

How can the flag reverse_check_min_max_values be activated?

@albermax
Owner Author

Sorry, I don't understand: what do you mean by "the restriction to alpha-beta for the layers merging skip connections"?

I see. Could you be more specific about which parts are unclear? Please do ask if you don't understand parts of the code; so far I have not heard from you...

You pass it to create_analyzer or to the constructor of the analyzer class.
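
For example, a minimal sketch of how the flag could be passed (assuming the create_analyzer interface mentioned above; model and x are placeholders for a Keras model and an input batch):

import innvestigate

analyzer = innvestigate.create_analyzer(
    "lrp.epsilon",
    model,  # Keras model, assumed to be defined elsewhere
    reverse_check_min_max_values=True,  # print min/max tensor values along the reversed graph
)
analysis = analyzer.analyze(x)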

@sebastian-lapuschkin
Contributor

sebastian-lapuschkin commented Mar 22, 2018

Min/max values definitely change by several orders of magnitude.

Here are the prints for images 4 and 5, VGG16, lrp.epsilon:
Minimum values in tensors: ((ReverseID, TensorID), Value) - [((-1, 0), 18.803598), ((0, 0), 0.0), ((1, 0), -0.04429253), ((2, 0), -0.28278047), ((3, 0), -5.2967877), ((4, 0), -5.2967877), ((5, 0), -5.2967877), ((6, 0), -27.4333), ((7, 0), -7.988126), ((8, 0), -6.816643), ((9, 0), -6.816643), ((10, 0), -4.0907025), ((11, 0), -9.793212), ((12, 0), -4.943315), ((13, 0), -4.943315), ((14,0), -2.82281), ((15, 0), -2.95577), ((16, 0), -2.8494782), ((17, 0), -2.8494782), ((18, 0), -2.1176488), ((19, 0), -2.9139009), ((20, 0), -2.9139009), ((21, 0), -2.0885756), ((22, 0), -943.41003)]
Maximum values in tensors: ((ReverseID, TensorID), Value) - [((-1, 0), 18.803598), ((0, 0), 18.803598), ((1, 0), 0.14031076), ((2, 0), 0.63318485), ((3, 0), 7.256505), ((4, 0), 7.256505), ((5, 0), 7.256505), ((6, 0), 36.63927), ((7, 0), 12.196029), ((8, 0), 16.556505), ((9, 0), 16.556505), ((10, 0), 9.502874), ((11, 0), 8.440442), ((12, 0), 20.109344), ((13, 0), 20.109344), ((14, 0), 6.629166), ((15, 0), 12.376229), ((16, 0), 5.6767926), ((17, 0), 5.6767926), ((18, 0), 2.1372676), ((19, 0), 4.7161083), ((20, 0), 4.7161083), ((21, 0), 2.1704943), ((22, 0), 1160.8564)]
Minimum values in tensors: ((ReverseID, TensorID), Value) - [((-1, 0), 17.283815), ((0, 0), 0.0), ((1, 0), -0.06721622), ((2, 0), -0.73970973), ((3, 0), -1.1157701), ((4, 0), -1.1157701), ((5, 0), -1.1157701), ((6, 0), -0.76313674), ((7, 0), -0.47384033), ((8, 0), -0.64085555), ((9, 0), -0.64085555), ((10, 0), -0.28913656), ((11, 0), -0.32694364), ((12, 0), -0.2691846), ((13, 0), -0.2691846), ((14, 0), -0.2412349), ((15, 0), -0.22283275), ((16, 0), -0.23346575), ((17, 0), -0.23346575), ((18, 0), -0.13281964), ((19, 0), -inf), ((20, 0), -inf), ((21, 0), -0.2580217), ((22, 0),-12.92273)]
Maximum values in tensors: ((ReverseID, TensorID), Value) - [((-1, 0), 17.283815), ((0, 0), 17.283815), ((1, 0), 0.22695385), ((2, 0), 0.9860331), ((3, 0), 0.858147), ((4, 0), 0.858147), ((5, 0), 0.858147), ((6, 0), 1.3015455), ((7, 0), 0.63290113), ((8, 0), 0.6716301), ((9, 0), 0.6716301), ((10, 0), 0.37879318), ((11, 0), 0.3874756), ((12, 0), 0.3756235), ((13, 0), 0.3756235), ((14, 0), 0.27613422), ((15, 0), 0.32623556), ((16, 0), 0.4887689), ((17, 0), 0.4887689), ((18, 0), 0.13896592), ((19, 0), inf), ((20, 0), inf), ((21, 0), 0.5129423), ((22, 0), 11.542535)]
Image 5, analysis of LRP-Epsilon not finite: nan True inf False

Parts which have been unclear from early on are, e.g., what decides whether _default_reverse_mapping is called, and when the actual rule objects are evaluated. I find myself searching for pieces of code rather frequently, since just jumping to a function's definition is often not possible via the IDE.
I can only apologize for not asking earlier.

Also, you might consider adding

elif isinstance(reverse_state['layer'], keras.layers.BatchNormalization):
    return ilayers.GradientWRT(len(Xs))(Xs + Ys + reversed_Ys)

as an immediate fix to LRP._default_reverse_mapping. I prefer not to push the current state of my relevance_based.py, and I am not 100% sure about this fix. It should definitely not be resolved as return reversed_Ys, though. The BatchNorm layer might need to be resolved manually in two steps (mean subtraction and the shift w.r.t. the learned shift parameter beta).

@albermax
Owner Author

Good that you noticed this! I am not sure how values can recover from +-inf. I guess there might be an issue with the division, even though exactly that should not happen with the epsilon rule.

Does this still happen if you set a larger keras.backend.epsilon value? (You can set it in the Keras config; the value is used in relevance_based.py, line 262.)


Basically, _default_reverse_mapping gets called when no "conditional mapping" applies: _conditional_mappings is a list that gets traversed, and when a condition applies, the corresponding reverse function is called. If none applies, _default_reverse_mapping is called.
For LRP this is the case when a layer has no kernel (see kchecks.contains_kernel, line 615; by "kernel" we typically mean a weight matrix, e.g., in dense and conv layers).

One of the points here is to add another condition that checks for Add and other merge layers, as they need to be treated by an LRP rule, too.

Does this make it more clear to you?

(Regarding the IDE issue: I guess this is due to the use of higher-order functions. I use Emacs, so I've never thought of that. :-))


I suggest treating BatchNorm explicitly by adding an entry to _conditional_mappings, and doing the same for Reshape layers. The way I implemented _default_reverse_mapping is not nice; I suggest adding an assert there so that it only applies to layers where the shape stays the same.

To wrap it up:

  • For BatchNorm and Reshape layers there will be an entry in _conditional_mappings.
    • The list entry can be (condition, kgraph.ReverseMappingBase) or (condition, function), with the function having the interface of apply in kgraph.ReverseMappingBase (see the sketch after this list).
  • A check in the default mapping that no layer ends up there that shouldn't.
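
A rough sketch of what such an entry might look like (assuming the interfaces named above; BatchNormalizationReverseLayer, LRPKernelMapping and the exact apply signature are illustrative, not the actual implementation):

class BatchNormalizationReverseLayer(kgraph.ReverseMappingBase):
    # Illustrative placeholder; the real mapping has to redistribute relevance properly.
    def __init__(self, layer, state):
        self._layer = layer

    def apply(self, Xs, Ys, reversed_Ys, reverse_state):
        # TODO: implement the BN relevance decomposition discussed below.
        return reversed_Ys

conditional_mappings = [
    # (condition, reverse mapping); conditions are checked in order.
    (kchecks.contains_kernel, LRPKernelMapping),  # dense/conv layers, handled by the LRP rules
    (lambda layer: isinstance(layer, keras.layers.BatchNormalization),
     BatchNormalizationReverseLayer),
]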

What do you think about that?


Do you know git flow? In general, I suggest you open a branch / git-flow feature and push it. Then I can see your changes/commits and be more helpful.

@sebastian-lapuschkin
Contributor

To be more specific regarding BatchNormalization:
given an input X, BN does the following at prediction time:

Xms = X - mu          # 1) mean shift based on the population mean
Xsscl = Xms / sigma   # 2) covariate scaling
Xgscl = Xsscl * gamma # 3) another scaling with the learned layer parameter gamma
Y = Xgscl + B         # 4) shift w.r.t. the learned parameter B

BN can thus be interpreted as a sequence of one Add layer, two component-wise linear activation layers, and another Add layer. Operations 2 and 3 have no effect on the LRP decomposition and need not be decomposed; operations 1 and 4 do, but 4 depends on 1, 2 and 3.

Decomposing a BatchNorm layer with LRP would (to my understanding) in any case require some footwork to achieve the following, given the top-layer relevance R:

R_xgscl = Xgscl * (R/Y)   # op 4
R_b     = B * (R/Y)

R_xms = R_sscl = R_xgscl  # ops 3 and 2

R_x  = X * (R_xms/Xms)    # op 1
R_mu = -mu * (R_xms/Xms)

where we can expect R_b + R_xgscl = R and R_x + R_mu = R_xms = R_xgscl.

Can the above decomposition be resolved efficiently using the GradientWRT function?
Thanks for pointing out self._conditional_mappings, that explains things.
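
(For reference, a minimal NumPy sketch of the stepwise decomposition above, just to check the conservation properties numerically; all values are made up and any epsilon stabilization is omitted:)

import numpy as np

np.random.seed(0)
X = np.random.randn(8)            # layer input
mu, sigma = 0.3, 1.2              # population mean and standard deviation
gamma, B = 1.5, -0.4              # learned scale and shift
R = np.random.randn(8)            # incoming relevance

Xms = X - mu                      # op 1
Xgscl = (Xms / sigma) * gamma     # ops 2 and 3
Y = Xgscl + B                     # op 4

R_xgscl = Xgscl * (R / Y)         # decompose op 4
R_b = B * (R / Y)
R_xms = R_xgscl                   # ops 3 and 2 pass relevance through
R_x = X * (R_xms / Xms)           # decompose op 1
R_mu = -mu * (R_xms / Xms)

assert np.allclose(R_xgscl + R_b, R)      # conservation at op 4
assert np.allclose(R_x + R_mu, R_xms)     # conservation at op 1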

@sebastian-lapuschkin
Contributor

Increasing keras.backend.epsilon was not the solution (I went up to eps=1e0).
The cause of the infs (and nans) was the epsilon rule not being applied to zero-valued denominators.
I replaced
K.sign(x)
with
(K.cast(K.greater_equal(x, 0), K.floatx()) * 2 - 1)
to also treat 0 as a positively signed number. Otherwise, epsilon is only added to non-zero denominators, defeating its purpose.
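
(In context, the stabilized denominator then looks roughly like this; a sketch assuming the Keras backend K, not the exact code in relevance_based.py:)

from keras import backend as K

def stabilize(x, eps=None):
    # Add a signed epsilon to every entry, treating 0 as positive,
    # so that a subsequent division can never hit an exact zero.
    eps = K.epsilon() if eps is None else eps
    sign = K.cast(K.greater_equal(x, 0), K.floatx()) * 2 - 1
    return x + sign * eps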

@albermax
Owner Author

Oh I see! Thanks!

I will follow up on BN.

@albermax
Owner Author

Coming back to BatchNorm. Is this intuition right:

  • Assume we have our original BN layer l and a second BN layer l_no_b in which bias and beta are set to zero and all other parameters are copied.
  • x_i+1 = l(x_i) # original forward step
  • x_i+1_no_b = l_no_b(x_i) # forward step without applying bias/beta
  • Then, given the incoming relevances r_i+1, we want to scale them with (x_i+1 / x_i+1_no_b)?
    • i.e., r_i = r_i+1 * (x_i+1 / x_i+1_no_b)

Assuming this is right, you can basically copy the BN layer and set the center/beta parameter to zero. You can see here how to modify the parameters (please mind that, depending on whether beta/gamma is used, the order of the weights changes, which is a bit messy): https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py#L16

Is my train of thought right? Let me know if you need help with the implementation.
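
(A hedged sketch of how such a copy could be built with the standard Keras layer API; bn_without_shift and the weight-index logic are illustrative and depend on the center/scale flags:)

import numpy as np
import keras

def bn_without_shift(layer):
    # Clone a BatchNormalization layer and zero its beta (center) weights,
    # keeping gamma and the moving mean/variance as they are.
    config = layer.get_config()
    clone = keras.layers.BatchNormalization.from_config(config)
    clone.build(layer.input_shape)
    weights = layer.get_weights()
    if config.get("center", True):
        # Weight order: [gamma, beta, moving_mean, moving_variance] when scale=True,
        # otherwise [beta, moving_mean, moving_variance].
        beta_idx = 1 if config.get("scale", True) else 0
        weights[beta_idx] = np.zeros_like(weights[beta_idx])
    clone.set_weights(weights)
    return clone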

@sebastian-lapuschkin
Contributor

sebastian-lapuschkin commented Mar 26, 2018

Hi Max,

Your thoughts are kind of right, but they do not capture everything that needs to be done. For decomposing BN, two of the above steps are required. I have read the implementation of the BatchNormalization layer behind the link you sent, which redirects to keras.backend.batch_normalization (Ctrl+F for its interface description in https://keras.io/backend/).

Based on the formula given there, let me elaborate below how to apply LRP to this layer type. First, let us formulate BN as a series of steps, with both scaling steps merged into one. Then we can attribute relevance towards the input for each step separately.

[image: batchnorm formulas with the stepwise relevance decomposition]

The first line of formulas shows the direct evaluation of the batch norm layer output y, as well as, in the right column, the corresponding relevance decomposition for the whole layer. Below that, the evaluation of the batch norm layer and the relevance decomposition are given as a series of steps, with the arrows indicating the workflow.

To summarize: to perform LRP for the batch norm layer, we need access to either the parameters (which we have) or the intermediate transformations (which we can get) of the input data, for the Z-rule.
On top of that, there needs to be a mechanism to handle the other decomposition rules.
A first suggestion would be to implement reverse classes for all non-standard layers such as BatchNorm and whatever other layers require similar treatment, which check the rule type via isinstance(obj, cls) and then do their thing. I think this will result in the least amount of code, since we need support for all layers in the end, but not all layers require an implementation of all decomposition rules; for some, the difference in code is minimal enough to be captured by an if/else.

Would you prefer to see the BN decomposition implemented "by hand" as in the table above, or transformed into a sequence of Add, linear activation and Add layers so that the GradientWRT pattern can be used?

Intuitively, I would suggest starting with the implementation by hand, which can later be refined into something more elegant, while using the manually coded transformations for debugging and comparison. Then, in the end, you can decide which segments of code to keep.

@sebastian-lapuschkin
Contributor

Another question is whether the moving mean and variance parameters are stored in the layers of the pretrained models. If not, then using the models to make predictions on individual examples would be problematic anyway.

@albermax
Owner Author

albermax commented Mar 27, 2018

Sorry if I keep it short: are there not only two cases for BN, one for rules that ignore biases and one for those that don't?

  • For those that ignore the bias, one can simply pass on the incoming relevance.
  • In the other case (thank you for pointing this out), one can re-weight the relevances with (x/(x - mu)) * ((y - beta)/y), which would not require any fancy code. So yes, I would just reuse the parameters. I don't think we need a linear activation; we can go backward from y (which is already given).

If BN and the Add layer are implemented, LRP would (as far as I know) work for all networks in the applications folder, which I think is already a lot.
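
(A minimal sketch of that re-weighting, following the convention of the decomposition above, i.e. r_i = r_i+1 * (x/(x - mu)) * ((y - beta)/y); purely illustrative and without any epsilon stabilization of the denominators:)

import numpy as np

def bn_lrp_reweight(x, y, r_out, mu, beta):
    # Relevance attributed to the BN input x, given the BN output y and the
    # relevance r_out arriving at y; mu is the population mean, beta the shift.
    return r_out * (x / (x - mu)) * ((y - beta) / y)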

@sebastian-lapuschkin
Contributor

If ignoring the bias means that both beta and the moving_mean are ignored, then yes, a return Ys should suffice.

If we consider the bias, I would at least add an option to add an epsilon to the denominator, e.g. in case y is close to zero.
(At some point) alpha-beta should then be added for consistency, i.e. to allow analyzers purely focused on activations (alpha=1, beta=0).

@albermax
Owner Author

albermax commented Apr 1, 2018

Yes, every division should be done with an epsilon to avoid running into issues. See the SafeDivision layer.

For alpha/beta, we would need to distinguish between positive and negative biases/betas?

@sebastian-lapuschkin
Contributor

Yes, regarding alpha/beta: positive and negative components of the respective inputs, betas and biases. When treating one case, simply setting the differently signed parts to zero would be an option.

@sebastian-lapuschkin
Contributor

sebastian-lapuschkin commented Apr 20, 2018

Remaining TODOs are:

  • LNR layer: postponed (indefinitely); not a basic or frequently used layer.
  • Weed out the LRP analyzer classes; coordinate with the next point. DTD should receive the ones without bias (verify!).
  • Organize the module: which rules to keep, divide DTD and LRP?
  • Merge AddReverseLayer and AveragePoolingReverseLayer into ReverseLayerWithoutKernel.
  • Add support for the Epsilon, Alpha-Beta, ... rules for layers covered by BatchNormalizationReverseLayer and ReverseLayerWithoutKernel.
  • Create decomposition-rule presets for the application networks, using node IDs to assign rules to individual layers.
  • Optimize the vanilla BN relevance propagation from (x*x''*R)/(x'*y) to (x*c*R)/y (see the sanity check below).
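
(If x' and x'' denote the mean-shifted and the fully scaled input from the earlier comments, the constant would presumably be c = gamma/sigma; a quick numeric sanity check under that assumption:)

import numpy as np

np.random.seed(1)
x = np.random.randn(8) + 2.0
mu, sigma, gamma, beta = 0.5, 1.3, 0.8, -0.2
R = np.random.randn(8)

x1 = x - mu                # x'  : mean-shifted input
x2 = gamma * x1 / sigma    # x'' : scaled input
y = x2 + beta              # BN output
c = gamma / sigma

assert np.allclose((x * x2 * R) / (x1 * y), (x * c * R) / y)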

@JaeDukSeo

thank you for the contributions!

@cae67

cae67 commented Dec 30, 2022

Hi all! Have these issues with getting NaNs in neural networks containing batch normalization layers been resolved?

I am currently getting NaN values for some samples with lrp.alpha1beta0 and am trying to figure out why.

@adrhill
Collaborator

adrhill commented Jan 6, 2023

Hi @cae67, the issue persists. Subscribing to issue #292 will notify you when we have resolved it.
