Review #3 (#3)

Open
distillpub-reviewers opened this issue Oct 17, 2019 · 1 comment

@distillpub-reviewers

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Sara Hooker for taking the time to review this article.


General Comments

The question proposed by the authors is very interesting, namely that the choice of baseline is a crucial hyperparameter that determines all subsequent attributions. Given the wide use of integrated gradients in various sensitive domains, this is a valuable contribution. I enjoyed the visuals, and think these charts provide valuable insight for a newcomer to the field.
The key methodological limitation of the draft (as is) would appear to be the lack of a formal framework to articulate the differences in properties between the baselines introduced. While expected gradients may avoid the issue of ignoring certain pixel values, the same could be said of using a baseline of random noise. How do we compare the relative merits of these baselines?

Comments on writing exposition

This version of the article is a promising draft on an interesting question -- "What is a valid reference point for an attribution method such as integrated gradients?" "What are the implications of this choice?"

Below follow my comments related to writing exposition:

  1. Some sections of the text could benefit from repositioning -- for example, "This question may be critical when models are making high stakes decisions about..." speaks to the motivation of the work but is buried in the second section.

  2. Certain terms are introduced far too late in the draft, such as "path methods", which would have been good to introduce at the very beginning to clarify the scope of the contribution. Instead, phrases such as "Most require choosing a hyper-parameter known as the baseline input..." suggest that most saliency methods are path methods when this is not the case -- many estimate the contribution of pixels to a given prediction using different methodologies (raw gradients, perturbation-based methods such as Fong et al., 2017; Ribeiro et al., 2016).

  3. The section "Game Theory and Missingness" actually introduces the topic of interest -- why the choice of baseline matters and why the reader should care. Having this section so far into the article is disruptive for the reader; a reshuffling of sections could improve the flow.

  4. Diagrams: high latency for certain diagrams. For Figures 1 and 2, the true and predicted labels appear identical for all chosen images, which makes it less interesting to have the same images repeated.

Comments on methodology and exposition of contributions

The authors have made replication of the results very easy by releasing code, and have put together experiments that use a standard CV architecture (InceptionV4), an open-source dataset, and only limited compute.

Below follow additional comments related to the methodology and exposition of contributions (as well as suggested relevant work):

  1. This draft omits that Sundararajan et al. themselves discuss the difficulty of the choice of baseline, and propose that for computer vision tasks one possible choice is a black image (note that this is not always a grid of zeros, if the network normalizes the images as a pre-processing step). Note that in their work, Sundararajan et al. also mention that there may be multiple possible baselines, including one of random noise.
    The current draft does not appear to take into account that this was already surfaced as an acknowledged limitation by the authors, which feels like a large and unnecessary oversight. It is also important to note that Sundararajan et al. are careful to convey that a black image is not the only possible choice, but that the intent of the baseline is "to convey a complete absence of signal, so that the features that are apparent from the attributions are properties only of the input, and not of the baseline."

  2. In our recent work on evaluating saliency methods, we also dedicate discussion to the implications of the choice of baseline for integrated gradients (Kindermans et al., The (Un)reliability of Saliency Methods, section 3.2). We formally evaluate how the reliability of the explanation depends on the choice of reference point. We do in fact show that a black image reference point is reliable, but a zero grid is unreliable. This set of experiments may be useful as context for this work; in particular, it seems that part of what is missing right now is a formal measure to compare how choosing different baselines impacts the end properties of the explanation. Additional work in this direction includes Adebayo et al., 2018, Sanity Checks for Saliency Maps, and Hooker et al., 2019, Evaluating Feature Importance Estimates. Given the new baseline, is the method a more reliable feature importance estimator?

  3. Visualizations of saliency maps with post-processing steps such as taking the absolute value or capping the feature attributions at the 99th percentile are problematic. These post-processing steps are not well justified theoretically and appear to mainly be used to improve the perceptual quality of the saliency map visualization.

  4. How much more computational cost is incurred by expected gradients over integrated gradients? This would also appear to introduce another hyperparameter: the number of images to be interpolated over.


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue Score
How significant are these contributions? 3/5
Outstanding Communication Score
Article Structure 3/5
Writing Style 3/5
Diagram & Interface Style 3/5
Impact of diagrams / interfaces / tools for thought? 5/5
Readability 3/5
Scientific Correctness & Integrity Score
Are claims in the article well supported? 3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 1/5
How easy would it be to replicate (or falsify) the results? 5/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 5/5
@psturmfels
Contributor

psturmfels commented Nov 23, 2019

Thank you for the detailed comments! Based on your feedback, we’ve made some changes to the article and added several new sections. In particular:

“This draft omits that Sundararajan et al. themselves discuss the difficulty of the choice of baseline, and propose that for computer vision tasks one possible choice is a black image (note that this is not always a grid of zeros, if the network normalizes the images as a pre-processing step). Note that in their work, Sundararajan et al. also mention that there may be multiple possible baselines, including one of random noise.
The current draft does not appear to take into account that this was already surfaced as an acknowledged limitation by the authors, which feels like a large and unnecessary oversight. It is also important to note that Sundararajan et al. are careful to convey that a black image is not the only possible choice, but that the intent of the baseline is "to convey a complete absence of signal, so that the features that are apparent from the attributions are properties only of the input, and not of the baseline."”

I agree with this point. This wasn’t an intentional omission: we simply meant to keep the article brief and to the point. However, in doing so, we may have mis-characterized the original work. In our new draft, we acknowledge the existing discussion around the choice of baseline and significantly expand our discussion of the pros and cons of the various baseline choices throughout the article. I hope this to some degree addresses this point: our intention is not to criticize integrated gradients but rather to understand the assumptions that each baseline choice makes.
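
For concreteness, here is a minimal sketch (not taken from the article or from Sundararajan et al.) of the parenthetical point quoted above: a "black image" is only a grid of zeros before preprocessing. If the network normalizes its inputs, the black image maps to a non-zero tensor. The per-channel mean and standard deviation below are the common ImageNet constants, used purely for illustration.

```python
# Illustrative only: a "black image" baseline vs. a literal grid of zeros
# once standard per-channel normalization is applied.
import numpy as np

mean = np.array([0.485, 0.456, 0.406])  # assumed per-channel normalization constants
std = np.array([0.229, 0.224, 0.225])

black_image = np.zeros((224, 224, 3))             # black image in [0, 1] pixel space
normalized_baseline = (black_image - mean) / std  # what the network actually sees

print(normalized_baseline[0, 0])                  # ~ [-2.12, -2.04, -1.80], not zeros
zero_baseline = np.zeros_like(normalized_baseline)  # a literal grid of zeros is a different baseline
```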

“In our recent work on evaluating saliency methods, we also dedicate discussion to the implications of the choice of baseline for integrated gradients (Kindermans et al., The (Un)reliability of Saliency Methods, section 3.2). We formally evaluate how the reliability of the explanation depends on the choice of reference point. We do in fact show that a black image reference point is reliable, but a zero grid is unreliable. This set of experiments may be useful as context for this work; in particular, it seems that part of what is missing right now is a formal measure to compare how choosing different baselines impacts the end properties of the explanation. Additional work in this direction includes Adebayo et al., 2018, Sanity Checks for Saliency Maps, and Hooker et al., 2019, Evaluating Feature Importance Estimates. Given the new baseline, is the method a more reliable feature importance estimator?”

This is another good point. In our new section “Comparing Saliency Methods” we discuss various ways of comparing different saliency methods. In particular, we discuss why evaluation is a difficult problem. However, we don’t fully benchmark all of the baselines against all of the possible metrics: in fact, we only include one very limited set of quantitative results. I felt that a significant weakness of our first draft was that we were clearly pointing out our preference for a specific baseline. But the purpose of this article is not to conclude that a specific baseline is the “best” baseline. In that vein, too much quantitative evaluation might skew the focus toward being a results-driven paper when really the purpose is to understand what the baseline choice itself means. I do however think that a more comprehensive evaluation of each of the baselines presented here would be valuable future work.
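
As a point of reference, one generic way such comparisons are often framed is a top-k ablation check: mask the features an attribution method ranks highest and measure how much the model’s output drops. The sketch below is only an illustration of that idea, not the metric used in the article; `model`, `x`, and `attributions` are assumed inputs.

```python
# Illustrative sketch of a top-k ablation check (not the article's evaluation code).
import numpy as np

def topk_ablation_drop(model, x, attributions, k, fill_value=0.0):
    """Return the drop in model output after masking the k most-attributed features."""
    top_idx = np.argsort(np.abs(attributions).ravel())[-k:]  # indices of the top-k features
    ablated = x.ravel().copy()
    ablated[top_idx] = fill_value  # note: the fill value is itself a baseline choice,
                                   # which is part of why evaluation is difficult
    return model(x) - model(ablated.reshape(x.shape))
```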

“Visualizations of saliency maps with post-processing steps such as taking the absolute value or capping the feature attributions at the 99th percentile are problematic. These post-processing steps are not well justified theoretically and appear to mainly be used to improve the perceptual quality of the saliency map visualization.”

In terms of capping attributions at the 99th percentile, doing so doesn’t actually change the rank ordering of features, so it could be argued that it is analogous to a log-transform. It is true that we take these steps solely to improve the quality of the saliency map visualization; but in general, given that we are not trying to advocate for a method based on its visual appearance, I hope this concern is less pressing.
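
A small illustrative sketch (not the article’s code) of that post-processing step: values below the 99th-percentile cap are left untouched, so their relative ordering is preserved, while the few extreme values are tied at the cap.

```python
# Toy demonstration of 99th-percentile capping on heavy-tailed attribution values.
import numpy as np

rng = np.random.default_rng(0)
attributions = rng.standard_normal(10_000) ** 3  # heavy-tailed toy attribution values

cap = np.percentile(np.abs(attributions), 99)    # 99th percentile of magnitudes
clipped = np.clip(attributions, -cap, cap)

within = np.abs(attributions) <= cap
assert np.array_equal(clipped[within], attributions[within])  # values under the cap are untouched
assert np.abs(clipped).max() == cap                           # only the extremes are squashed to the cap
```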

“How much more computational cost is incurred by expected gradients over integrated gradients? This would also appear to introduce another hyperparameter: the number of images to be interpolated over.”

It definitely does introduce more computational cost if you want to get the same kind of convergence. Based on this feedback, we’ve added an expanded discussion in the ablation tests section about how a proper comparison would account for the fact that using a single reference is computationally cheaper than using a distribution as a baseline.
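
As a rough sketch of where that cost shows up (a toy model, not the article’s implementation): both methods spend one gradient call per sample, and the new hyperparameter for expected gradients is how many (reference, alpha) pairs are drawn, which plays the role that the number of interpolation steps plays for integrated gradients.

```python
# Toy comparison of the per-explanation cost of integrated vs. expected gradients.
import numpy as np

def model_grad(x, w):
    """Gradient w.r.t. x of a toy model f(x) = softplus(w . x)."""
    return w / (1.0 + np.exp(-w @ x))

def integrated_gradients(x, baseline, w, steps=64):
    alphas = np.linspace(0.0, 1.0, steps)
    grads = [model_grad(baseline + a * (x - baseline), w) for a in alphas]  # `steps` gradient calls
    return (x - baseline) * np.mean(grads, axis=0)

def expected_gradients(x, references, w, samples=64, seed=0):
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x)
    for _ in range(samples):                             # `samples` gradient calls
        ref = references[rng.integers(len(references))]  # baseline drawn from a distribution
        a = rng.uniform()
        total += (x - ref) * model_grad(ref + a * (x - ref), w)
    return total / samples

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])
print(integrated_gradients(x, np.zeros_like(x), w, steps=64))
print(expected_gradients(x, np.random.default_rng(1).normal(size=(100, 3)), w, samples=64))
```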

“Diagrams: high latency for certain diagrams. For Figures 1 and 2, the true and predicted labels appear identical for all chosen images, which makes it less interesting to have the same images repeated.”

Hopefully, I’ve managed to fix this issue. The sliders should be much smoother now!

“Certain terms are introduced far too late in the draft, such as "path methods", which would have been good to introduce at the very beginning to clarify the scope of the contribution. Instead, phrases such as "Most require choosing a hyper-parameter known as the baseline input..." suggest that most saliency methods are path methods when this is not the case -- many estimate the contribution of pixels to a given prediction using different methodologies (raw gradients, perturbation-based methods such as Fong et al., 2017; Ribeiro et al., 2016).”

This is definitely a fair point. Based on this feedback, the new draft clarifies early on that this article is specifically focused on path methods. We also make a better effort to cite existing methods that are not path methods. Finally, we attempt to connect the idea of a baseline to the broader idea of missingness, which is useful outside of the scope of path methods.
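
To make that scope concrete, the sketch below (illustration only, with toy shapes and a stand-in background dataset) shows how several of the baselines discussed in this thread are constructed and handed to a path method such as the `integrated_gradients` sketch above; each choice encodes a different notion of what a “missing” pixel looks like.

```python
# Illustration only: different baseline choices plugged into the same path method.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(224, 224, 3))                # toy input image in [0, 1]
background = rng.uniform(size=(10, 224, 224, 3))   # stand-in for a training distribution

baselines = {
    "black_image": np.zeros_like(x),               # black in pixel space (non-zero after normalization)
    "uniform_noise": rng.uniform(size=x.shape),    # random-noise baseline
    "training_sample": background[0],              # one draw from the data distribution
}
# attributions = {name: integrated_gradients(x, b, ...) for name, b in baselines.items()}
```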

My hope is that the new draft addresses some of your concerns, especially regarding our portrayal of integrated gradients as it was originally proposed. Our goal is to use integrated gradients as a way to open up discussion about missingness and baselines, not to criticize the original method. I hope we’ve managed to achieve this, and welcome further feedback in this direction.
