Relational Representation Distillation (RRD)

This is a PyTorch implementation of the RRD paper:

@misc{giakoumoglou2024relational,
      title={Relational Representation Distillation}, 
      author={Nikolaos Giakoumoglou and Tania Stathaki},
      year={2024},
      eprint={2407.12073},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.12073}, 
}

CIFAR-100 Classification

Please refer to CIFAR-100 for more details.

ImageNet ILSVRC-2012 Classification

Please refer to ImageNet for more details. This setup also supports CIFAR-100 classification.

Abstract

Knowledge distillation transfers knowledge from large, high-capacity teacher models to more compact student networks. The standard approach minimizes the Kullback–Leibler (KL) divergence between the probabilistic outputs of the teacher and student, effectively aligning predictions but neglecting the structural relationships encoded within the teacher’s internal representations. Recent advances have adopted contrastive learning objectives to address this limitation; however, such instance-discrimination–based methods inevitably induce a “class collision problem”, in which semantically related samples are inappropriately pushed apart despite belonging to similar classes. To overcome this, we propose Relational Representation Distillation (RRD), which preserves the relative relationships among instances rather than enforcing absolute separation. Our method introduces separate temperature parameters for teacher and student distributions, with a sharper teacher (low $\tau_t$) emphasizing primary relationships and a softer student (high $\tau_s$) maintaining secondary similarities. This dual-temperature formulation creates an implicit information bottleneck that preserves fine-grained relational structure while avoiding the over-separation characteristic of contrastive losses. We establish theoretical connections showing that InfoNCE emerges as a limiting case of our objective when $\tau_t \rightarrow 0$, and empirically demonstrate that this relaxed formulation yields superior relational alignment and generalization across classification and detection tasks.
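
The dual-temperature objective described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the reference implementation from this repository: it assumes the loss is a cross-entropy between teacher and student pairwise similarity distributions computed within the current batch (the paper may instead use a memory bank of anchors), and the function name `rrd_loss` and the default temperature values are chosen here only for illustration.

```python
import torch
import torch.nn.functional as F


def rrd_loss(z_s, z_t, tau_s=0.1, tau_t=0.02):
    """Dual-temperature relational distillation loss (illustrative sketch).

    z_s: student features of shape (B, D_s)
    z_t: teacher features of shape (B, D_t)
    tau_t < tau_s: the teacher's similarity distribution is sharper (primary
    relationships), the student's is softer (keeps secondary similarities).
    """
    z_s = F.normalize(z_s, dim=1)
    z_t = F.normalize(z_t, dim=1)

    # Pairwise cosine similarities within the batch, each in its own feature space.
    sim_s = z_s @ z_s.t()
    sim_t = z_t @ z_t.t()

    # Mask self-similarity so each sample relates only to the other samples;
    # a large negative fill keeps softmax/log-softmax numerically stable.
    mask = torch.eye(z_s.size(0), dtype=torch.bool, device=z_s.device)
    sim_s = sim_s.masked_fill(mask, -1e9)
    sim_t = sim_t.masked_fill(mask, -1e9)

    # Teacher target: sharp distribution (low temperature), no gradient.
    p_t = F.softmax(sim_t.detach() / tau_t, dim=1)
    # Student: soft distribution (high temperature), in log space.
    log_p_s = F.log_softmax(sim_s / tau_s, dim=1)

    # Cross-entropy between teacher and student relational distributions.
    return -(p_t * log_p_s).sum(dim=1).mean()
```

As $\tau_t \rightarrow 0$ the teacher distribution collapses to a one-hot vector over the most similar anchor, which is how the InfoNCE limiting case mentioned in the abstract arises.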

Information Bottleneck Visualization

Figure 1. Visualization of the information bottleneck effect. The teacher produces a sharper similarity distribution $\mathbf{p}^T(\mathbf{x}_i;\tau_t)$ (solid black) highlighting primary relationships, while the student adopts a softer distribution $\mathbf{p}^S(\mathbf{x}_i;\tau_s)$ (dashed black) that retains secondary similarities. The gray-shaded overlap region illustrates the filtered information flow, where only essential relational cues are transferred from teacher to student, effectively bounding $I(\mathbf{z}^T;\mathbf{z}^S)$.
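
For reference, one plausible definition of these similarity distributions, consistent with the caption's notation (an assumption here; the paper may compute similarities against a memory bank of anchors rather than the batch), is

$$
p^T_j(\mathbf{x}_i;\tau_t) = \frac{\exp\!\left(\operatorname{sim}(\mathbf{z}^T_i, \mathbf{z}^T_j)/\tau_t\right)}{\sum_{k \neq i} \exp\!\left(\operatorname{sim}(\mathbf{z}^T_i, \mathbf{z}^T_k)/\tau_t\right)},
\qquad
p^S_j(\mathbf{x}_i;\tau_s) = \frac{\exp\!\left(\operatorname{sim}(\mathbf{z}^S_i, \mathbf{z}^S_j)/\tau_s\right)}{\sum_{k \neq i} \exp\!\left(\operatorname{sim}(\mathbf{z}^S_i, \mathbf{z}^S_k)/\tau_s\right)},
$$

where $\operatorname{sim}(\cdot,\cdot)$ denotes cosine similarity. The lower $\tau_t$ concentrates $\mathbf{p}^T$ on the most similar anchors, while the higher $\tau_s$ spreads $\mathbf{p}^S$ over secondary neighbours, producing the overlap illustrated in Figure 1.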

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
