Update README.md
szq0214 committed Sep 18, 2020
1 parent 8a3d682 commit 89341fa
Showing 1 changed file with 1 addition and 1 deletion: README.md
@@ -8,7 +8,7 @@ This is the official pytorch implementation of our paper:
<img width=70% src="https://user-images.githubusercontent.com/3794909/92182326-6f78c400-ee19-11ea-80e4-2d6e4d73ce82.png"/>
</div>

- In this paper, we introduce a simple yet effective approach that can boost the vanilla ResNet-50 to 80%+ Top-1 accuracy on ImageNet without any tricks. Generally, ourmethod is based on the recently proposed [MEAL](https://arxiv.org/abs/1812.02425), i.e.,ensemble knowledge distillation via discriminators. We further simplify it through 1) adopting the similarity loss anddiscriminator only on the final outputs and 2) using the av-erage of softmax probabilities from all teacher ensemblesas the stronger supervision for distillation. One crucial perspective of our method is that the one-hot/hard label shouldnot be used in the distillation process. We show that such asimple framework can achieve state-of-the-art results with-out involving any commonly-used tricks, such as 1) archi-tecture modification; 2) outside training data beyond Im-ageNet; 3) autoaug/randaug; 4) cosine learning rate; 5) mixup/cutmix training; 6) label smoothing; etc.
+ In this paper, we introduce a simple yet effective approach that can boost the vanilla ResNet-50 to 80%+ Top-1 accuracy on ImageNet without any tricks. Generally, our method is based on the recently proposed [MEAL](https://arxiv.org/abs/1812.02425), i.e., ensemble knowledge distillation via discriminators. We further simplify it through 1) adopting the similarity loss and discriminator only on the final outputs and 2) using the average of softmax probabilities from all teacher ensembles as the stronger supervision for distillation. One crucial perspective of our method is that the one-hot/hard label should not be used in the distillation process. We show that such a simple framework can achieve state-of-the-art results without involving any commonly-used tricks, such as 1) architecture modification; 2) outside training data beyond ImageNet; 3) autoaug/randaug; 4) cosine learning rate; 5) mixup/cutmix training; 6) label smoothing; etc.
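To make the supervision described in the updated paragraph concrete, below is a minimal, hypothetical PyTorch sketch of the soft-label distillation it describes: the softmax probabilities of all teachers are averaged and the student is trained to match them with a KL-divergence (similarity) loss, with no one-hot/hard label involved. The function and argument names are placeholders and the discriminator term is omitted; this is a sketch of the idea, not the repository's actual implementation.

```python
# Hypothetical sketch of the soft-label distillation loss (not the repo's code).
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits_list):
    """KL divergence between the student's predictions and the averaged
    teacher softmax probabilities; no one-hot/hard label is used."""
    with torch.no_grad():
        # Average the softmax probabilities over all teacher ensembles.
        teacher_probs = torch.stack(
            [F.softmax(t, dim=1) for t in teacher_logits_list]
        ).mean(dim=0)
    # Similarity loss applied only on the final outputs.
    return F.kl_div(
        F.log_softmax(student_logits, dim=1),
        teacher_probs,
        reduction="batchmean",
    )
```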

## Citation

