LRP #2
Comments
So far, only the vgg* models seem to produce non-NaN analyses, even for lrp.alpha1beta0 |
Ok, thanks for letting me know. I did not test beyond resnet50. Did you test alpha-beta or epsilon with these networks on Caffe? How are you getting on with LRP? Do you have any major issues? |
As far as I know, Alex Binder has applied LRP to ResNet-50 on Caffe, with the restriction to alpha-beta for the layers merging skip connections. Unfortunately, progress is slow(er than expected), partly due to the quite thin documentation and spread-out structure of the code, and partly due to tf being nigh unusable when numeric hiccups occur (several minutes for an image to be analyzed as soon as NaNs start popping up), which makes debugging a mess. How can the flag reverse_check_min_max_values be activated? |
Sorry, I don't understand: what do you mean by "with the restriction to alpha-beta for the layers merging skip connections"? I see. Can you please be more detailed about which parts are unclear? I invite you to ask if you don't understand parts of the code; so far I did not hear from you... You pass it to create_analyzer or the constructor of the class. |
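For illustration, a minimal sketch of how activating that flag might look, assuming extra keyword arguments to create_analyzer are forwarded to the analyzer's constructor; `model` and `input_batch` are placeholders for an already-built Keras model and data:

```python
import innvestigate

# Minimal sketch (assumption: keyword arguments passed to create_analyzer are
# forwarded to the analyzer's constructor). `model` and `input_batch` are placeholders.
analyzer = innvestigate.create_analyzer(
    "lrp.epsilon", model, reverse_check_min_max_values=True)
analysis = analyzer.analyze(input_batch)
```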
Min/max values definitely change by orders of magnitude several times. Here are the prints for images 4 and 5 (vgg16, lrp.epsilon). Parts which are unclear from early on are e.g. what decides whether _default_reverse_mapping is called and when the actual rule objects are evaluated. I find myself looking for pieces of code rather frequently, since just jumping to a function's definition is often not possible via the IDE. Also, you might consider adding elif isinstance(reverse_state['layer'], keras.layers.BatchNormalization): as an immediate fix to LRP._default_reverse_mapping. I prefer not pushing the state of my current relevance_based.py. Not 100% sure on that though. It definitely should not be resolved as return reversed_Ys. The BatchNorm layer might need to be manually resolved in two steps (mean subtraction and the shift w.r.t. the learned shift parameter beta). |
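As a rough illustration of that immediate fix (a sketch only: the method signature follows the names used in this thread, and _reverse_batchnorm is a hypothetical helper, not something that exists in relevance_based.py):

```python
import keras

def _default_reverse_mapping(self, Xs, Ys, reversed_Ys, reverse_state):
    # Sketch of the proposed branch; BN must not be treated as a simple
    # pass-through (i.e. not just `return reversed_Ys`).
    if isinstance(reverse_state['layer'], keras.layers.BatchNormalization):
        # hypothetical helper that resolves BN in two steps
        # (mean subtraction and the learned shift)
        return self._reverse_batchnorm(Xs, Ys, reversed_Ys, reverse_state)
    # existing behavior for shape-preserving layers
    return reversed_Ys
```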
Good that you noticed this! Not sure how values can recover from +-inf. I guess there might be an issue with the division, even though exactly that should not happen with the epsilon rule. Does this still happen if you set a larger keras.backend.epsilon value (you can set that in the Keras config; the value is used in relevance_based.py line 262)? Basically, _default_reverse_mapping gets called when no "conditional_mapping" applies: conditional_mappings is a list that gets traversed, and when a condition applies, the corresponding reverse function is called. If none applies, _default_reverse_mapping is called (see the sketch after this comment). One of the points here is to add another condition that checks for Add and other merge layers, as they need to be treated by an LRP rule too. Does this make it clearer to you? (IDE issue: I guess this is due to the use of higher-order functions. I use emacs, so I've never thought of that. :-)) I suggest treating BatchNorm explicitly by adding an entry to conditional_mappings, and doing the same for Reshape layers. The way I implemented _default_reverse_mapping is not nice; I suggest adding an assert there that it only applies to layers where the shape stays the same. To wrap it up:
What do you think about that? Do you know git flow? In general, I suggest you open a branch / git-flow feature and push it. Then I can see your changes/commits and be more helpful. |
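The conditional-mapping dispatch described above might look roughly like the following sketch; the rule function names are placeholders, not the actual objects in relevance_based.py:

```python
import keras

# Placeholder reverse functions; in relevance_based.py these are the actual
# LRP rule objects (names here are illustrative only).
def dense_rule(Xs, Ys, reversed_Ys, reverse_state): ...
def batchnorm_rule(Xs, Ys, reversed_Ys, reverse_state): ...
def add_rule(Xs, Ys, reversed_Ys, reverse_state): ...

# The dispatch described above: traverse the list and use the first matching
# reverse function; if none matches, _default_reverse_mapping is used instead.
conditional_mappings = [
    (lambda layer: isinstance(layer, keras.layers.Dense), dense_rule),
    # proposed additions from this discussion:
    (lambda layer: isinstance(layer, keras.layers.BatchNormalization), batchnorm_rule),
    (lambda layer: isinstance(layer, keras.layers.Add), add_rule),
]

def select_reverse_mapping(layer):
    for condition, reverse_fn in conditional_mappings:
        if condition(layer):
            return reverse_fn
    return None  # caller falls back to _default_reverse_mapping
```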
To be more specific regarding BatchNormalization: BN can be interpreted as a sequence of one ADD layer, two component-wise linear activation layers and another ADD layer, starting with

Xms = X - mu  # 1) mean shift based on population mean

Operations 2 and 3 have no effect on the LRP decomposition and need not be decomposed; however, 1 and 4 do, and 4 depends on 1, 2 and 3 (the full sequence is sketched below). Decomposing a BatchNorm layer with LRP would (to my understanding) in any case require some footwork to achieve the following, given top-layer relevances R,
where we can expect R_b + R_xgscl = R and R_x + R_mu = R_xms = R_xgscl. Can the above decomposition be resolved efficiently using the GradientWRT function? |
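A hedged reconstruction of the four steps referred to above; only step 1 is spelled out in the comment, the remaining assignments are assumptions chosen to match the relevance terms R_x, R_mu, R_xms, R_xgscl and R_b:

```python
import numpy as np

def bn_forward_steps(X, mu, var, gamma, b, eps=1e-3):
    # 1) mean shift based on the population mean (an ADD layer)
    Xms = X - mu
    # 2) scaling with the population variance (component-wise linear, no effect on LRP)
    Xstd = Xms / np.sqrt(var + eps)
    # 3) scaling with the learned gamma (component-wise linear, no effect on LRP)
    Xgscl = Xstd * gamma
    # 4) shift with the learned beta/b (an ADD layer)
    Y = Xgscl + b
    return Xms, Xgscl, Y
```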
Increasing keras.backend.epsilon was not the solution (I went up to eps=1e0). |
Oh I see! Thanks! I will follow up on BN. |
Coming back to batchnorm. Is this intuition right:
Assuming this is right, you can basically copy the BN layer and set the center/beta parameter to zero. You can see here how to modify the params (please mind that depending on whether beta/gamma is used, the order of the weights changes, which is a bit messy): https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py#L16 Is my train of thought right? Let me know if you need help with the implementation. |
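A small sketch of that idea, assuming a standard Keras BatchNormalization layer; copy_bn_without_beta is an illustrative name, not existing code:

```python
import numpy as np
import keras

def copy_bn_without_beta(bn_layer):
    # Clone the BN layer and zero out its center/beta parameter, so only the
    # scaling behavior remains. Mind the caveat above: the order and number of
    # weights depend on the layer's scale/center settings; with scale=True and
    # center=True the order is [gamma, beta, moving_mean, moving_variance].
    new_bn = keras.layers.BatchNormalization.from_config(bn_layer.get_config())
    new_bn.build(bn_layer.input_shape)
    weights = bn_layer.get_weights()
    if bn_layer.center:
        beta_index = 1 if bn_layer.scale else 0
        weights[beta_index] = np.zeros_like(weights[beta_index])
    new_bn.set_weights(weights)
    return new_bn
```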
Hi Max, your thoughts are kind of right, but they do not capture everything which needs to be done. For decomposing BN, two of the above steps are required. I have read the implementation of the BatchNormalization layer behind the link you sent, which redirects to the use of keras.backend.batch_normalization (Ctrl+F for its interface description at https://keras.io/backend/). Based on the formula given there, let me elaborate below on how to apply LRP to this layer type. First, let us formulate BN as a series of steps, with both scaling steps merged into one. Then, we can attribute relevance towards the input for each step separately. The first line of formulas shows the direct evaluation of the batch norm layer output y, as well as the corresponding relevance decomposition in the right column for the whole layer. Below that, the evaluation of the batchnorm layer and the relevance decomposition are given as a series of steps, with the arrows indicating the workflow. To summarize, to perform LRP for the batch norm layer we need access to either the parameters (which we have) or the intermediate transformations (which we can get) of the input data, for the Z-rule. Would you prefer to see the BN decomposition implemented "by hand" as in the table above, or to transform it into a sequence of Add, linear activation and Add layers to use the GradientWRT pattern? Intuitively, I would suggest starting with the implementation by hand, which can then be resolved into something more elegant, while using the manually coded transformations for debugging and comparison. Then, in the end, you can decide which segments of code to keep. |
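To make the "by hand" variant concrete, here is a minimal numpy sketch under the assumption of an epsilon-stabilized Z-rule per step and the inference-time BN formula; the function name and signature are illustrative, and Xms/Xgscl are the intermediate transformations from the step listing above:

```python
import numpy as np

def lrp_batchnorm_by_hand(R, X, mu, Xms, Xgscl, beta, eps=1e-7):
    # epsilon stabilizer for divisions (sign-aware, avoids division by ~0)
    stab = lambda z: z + np.where(z >= 0, eps, -eps)
    Y = Xgscl + beta
    # step 4 (Y = Xgscl + beta): split R between Xgscl and beta
    R_xgscl = R * Xgscl / stab(Y)
    R_b = R * beta / stab(Y)
    # steps 3 and 2 are component-wise scalings: relevance passes through unchanged
    R_xms = R_xgscl
    # step 1 (Xms = X - mu): split R_xms between X and -mu
    R_x = R_xms * X / stab(Xms)
    R_mu = R_xms * (-mu) / stab(Xms)
    # conservation (up to the stabilizer): R_b + R_xgscl ≈ R and R_x + R_mu ≈ R_xms
    return R_x, R_mu, R_b
```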
Another question is whether the moving mean and variance parameters are stored in the layers of the pretrained models. If not, then using the models for making predictions on individual examples would be problematic anyway. |
Sorry if I keep this short: are there not only two cases for BN, one for rules that ignore biases and one for those that don't?
If BN and the Add layer are implemented, LRP would (as far as I know) work for all networks in the applications folder, which I think is already a lot. |
If ignoring the bias means that both beta and the moving_mean are ignored, then yes, a return Ys should suffice. If we consider the bias, I would at least add an optional epsilon to the denominator, e.g. in case y is close to zero. |
Yes, every division should be done with an epsilon so we do not run into issues; see the SafeDivision layer. For alpha/beta, would we need to distinguish between positive and negative biases/betas? |
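For reference, an epsilon-stabilized division might look like the following sketch (the project's own safe-division layer may differ in detail):

```python
import keras.backend as K

def safe_divide(a, b, eps=None):
    # Shift the denominator away from zero while keeping its sign.
    if eps is None:
        eps = K.epsilon()
    sign = K.cast(K.greater_equal(b, 0), K.floatx()) * 2 - 1
    return a / (b + sign * eps)
```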
Yes. Regarding alpha/beta: positive and negative components of the respective inputs, betas and biases. When treating one case, just setting the differently-signed parts to zero would be an option. |
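A tiny sketch of that sign-splitting (illustrative only, not the project's implementation):

```python
import keras.backend as K

def positive_part(x):
    # keep positive components, set negative ones to zero
    return x * K.cast(K.greater(x, 0), K.floatx())

def negative_part(x):
    # keep negative components, set positive ones to zero
    return x * K.cast(K.less(x, 0), K.floatx())
```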
Remaining TODOs are:
|
thank you for the contributions! |
Hi all! Have these issues with getting NaNs in neural networks containing batch normalization layers been resolved? I am currently getting NaN values for some samples with lrp.alpha1beta0 and am trying to figure out why. |