```python
sublayers = [
    'self_attn.k_proj',
    'self_attn.v_proj',
    'self_attn.q_proj',
    'self_attn.out_proj',
    'fc1',
    'fc2'
]

def get_encoder_layer_names(num_layers, sublayers):  # number of pretrained layers, names of the sublayers to fine-tune
    names = []
    for i in range(num_layers):
        for sublayer in sublayers:
            name = f'encoder.layers.{i}.{sublayer}'  # build the desired module/layer name
            names.append(name)
    return names

# In my case I used the Whisper-base model (6 encoder layers).
get_encoder_layer_names(6, sublayers)
```
As shown above, pass the layer names you want into the get_encoder_layer_names helper.
In my case I only wanted to fine-tune the encoder, so I restricted the names to the `encoder.layers.*` prefix.
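With those names in hand, a common pattern is to freeze every parameter except the matching sublayers. The sketch below is a minimal illustration, assuming a Hugging Face `WhisperForConditionalGeneration` checkpoint (whose parameter names contain `model.encoder.layers.{i}.<sublayer>`); it is not the only way to wire this up.

```python
# Minimal sketch: unfreeze only the encoder sublayers selected above.
# Assumes the Hugging Face Whisper checkpoint "openai/whisper-base";
# its parameter names look like "model.encoder.layers.0.self_attn.k_proj.weight".
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
target_names = get_encoder_layer_names(6, sublayers)  # names built above

for name, param in model.named_parameters():
    # substring match: "encoder.layers.0.self_attn.k_proj" is contained in
    # "model.encoder.layers.0.self_attn.k_proj.weight"
    param.requires_grad = any(t in name for t in target_names)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```

The same name list could likely be reused as `target_modules` if you later switch to an adapter-based method such as LoRA.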
An RNN (Recurrent Neural Network) sends the activation output of each hidden-layer node toward the output layer and, at the same time, feeds it back as input to that hidden node's next computation.
Such an RNN is typically drawn in two representative ways. (bias omitted)
Each node is a vector.
As diagrams 1 and 2 show, there is a self-feedback loop. The hidden state can be viewed as a compressed summary of the states and inputs seen up to $x_t$. This hidden state is called a cell; because it acts as memory that retains previous values, it is also called a memory cell or RNN cell.
Visualized in a form we are more used to:
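To make the recurrence concrete, here is a minimal numpy sketch of a vanilla RNN cell unrolled over a short sequence (tanh activation, bias omitted as in the figures; all dimensions are illustrative assumptions, not values from the note).

```python
import numpy as np

# Illustrative dimensions (assumptions for this sketch)
input_dim, hidden_dim, seq_len = 4, 3, 5
rng = np.random.default_rng(0)

W_x = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden -> hidden (the self-feedback loop)

h = np.zeros(hidden_dim)                          # initial hidden state
xs = rng.normal(size=(seq_len, input_dim))        # a toy input sequence

for t, x_t in enumerate(xs):
    # The same weights are reused at every time step;
    # h is the compressed summary of everything seen up to step t.
    h = np.tanh(W_x @ x_t + W_h @ h)
    print(t, h)
```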
RNN: Problem Types
Because the input and output lengths of an RNN can be designed independently, RNN tasks are commonly grouped into the following three types.
Many-to-many (translation)
Many-to-one (prediction): sentiment classification (is the input document positive or negative?), spam detection
One-to-many (generation): image captioning (generating a caption for a photo)
RNN: Training
RNNs are trained with BPTT (Backpropagation Through Time), an extension of ordinary backpropagation.
Because the same weight matrix $W$ is used at every time step to compute the memory cell output ($y$), the gradient is computed at each time step from 0 to k and summed.
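Written out (the standard BPTT expansion, added here for reference; the note's 0..k indexing is kept):

$$
\frac{\partial \mathcal{L}}{\partial W}
= \sum_{t=0}^{k} \frac{\partial \mathcal{L}_t}{\partial W}
= \sum_{t=0}^{k} \sum_{j=0}^{t}
    \frac{\partial \mathcal{L}_t}{\partial h_t}
    \left( \prod_{m=j+1}^{t} \frac{\partial h_m}{\partial h_{m-1}} \right)
    \frac{\partial h_j}{\partial W}
$$

The repeated product of Jacobians $\partial h_m / \partial h_{m-1}$ over long spans is exactly what makes the gradient explode or vanish.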
Because of this training procedure, the longer the propagation path, the more prone training is to gradient exploding or gradient vanishing. (loss of information)
Gradient exploding => can be handled with gradient clipping (rescale the gradient whenever its norm exceeds a threshold); see the sketch below.
Gradient vanishing => hard to detect during training: if the loss is 0, you cannot tell whether training has simply finished or the gradients have vanished, so in practice it is easier to switch to a different network architecture.
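A typical way to apply gradient clipping in PyTorch is shown below; the tiny RNN, the dummy loss, and the `max_norm` value are placeholders for illustration only.

```python
import torch
import torch.nn as nn

# Tiny illustrative RNN (shapes and hyperparameters are assumptions)
model = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(2, 50, 4)            # (batch, time, features)
out, _ = model(x)
loss = out.pow(2).mean()             # dummy loss, just to produce gradients

loss.backward()
# Rescale the full gradient vector if its norm exceeds the threshold
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```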
As a result, RNNs struggle with long-range dependencies.
E.g., "The clouds are in the sky": to predict "sky", this sentence alone is enough. "I grew up in France ... I speak fluent French": to predict "French", you have to go back to the earlier context. In the latter case the gap between where the information appears and where it is needed becomes very large, so it is hard for the network to keep carrying that information along. ==> Network architectures that address this: **Gated RNNs: LSTM/GRU**
As the name "gate" suggests, the cell chooses which information to forget and which to remember, so that both long-term and short-term information are taken into account.
Cell state
Like the hidden state, the cell state of the previous time step is passed on to the next time step.
The main role of the cell state is to work together with the gates to use information selectively.
The cell state is updated by adding up the gate-weighted contributions (see step 3 below).
Gate
Of the steps below, 1) the forget gate, 2) the input gate, and 4) the output gate are the gates.
All three gates use the sigmoid as their activation function => $\sigma$
Together with the cell state, the gates allow information to be used selectively.
1) Forget gate layer
Decides how much of the past information to forget or to keep.
The previous hidden state $h_{t-1}$ and the current input $x_t$ are combined and passed through the $\sigma$ function.
A value close to 0 means the information is forgotten; a value close to 1 means it is kept.
The result $f_t$ therefore encodes how much of the past information is forgotten or remembered.
2) Input gate layer
Decides which parts of the new information to store in the cell state.
With the same mechanism as the forget gate, $i_t$ decides whether or not to keep the current information.
Then $h_{t-1}$ and $x_t$ are passed through a tanh layer to produce the new candidate values $\tilde{C}_t$; these are combined with $i_t$ via a Hadamard (element-wise) product and added to the cell state.
3) Cell state update
The old cell state $C_{t-1}$ is updated into the new cell state $C_t$: the parts that the forget gate marked to be forgotten are dropped from the previous state, and the gated contribution of the current input is added.
4) Output gate layer
A sigmoid layer takes $x_t$ and $h_{t-1}$ and outputs a value between 0 and 1 that decides which parts of the cell state to expose as output. The cell state is passed through tanh, multiplied by the output-gate value, and the result is $h_t$. This $h_t$ is emitted as the output of the current step and is also fed into the next time step.
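Putting the four steps together, below is a minimal numpy sketch of one LSTM time step (the standard formulation with bias terms; dimensions and parameter names are illustrative assumptions, not values from the note).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the
    forget (f), input (i), candidate (g) and output (o) branches."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # 1) forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # 2) input gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   #    candidate values C~_t
    c = f * c_prev + i * g                                  # 3) cell state update
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # 4) output gate
    h = o * np.tanh(c)                                      #    new hidden state
    return h, c

# Illustrative dimensions
d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_h, d_in)) for k in "figo"}
U = {k: rng.normal(size=(d_h, d_h)) for k in "figo"}
b = {k: np.zeros(d_h) for k in "figo"}

h, c = np.zeros(d_h), np.zeros(d_h)
x = rng.normal(size=d_in)
h, c = lstm_step(x, h, c, W, U, b)
```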
Gated Recurrent Unit: GRU
GRU is a model that simplifies the LSTM architecture.
It trains faster than LSTM, yet is reported to reach comparable performance in several evaluations.
With little data, the GRU, which has fewer parameters, is known to work better; with a lot of data, the LSTM is known to work better.
Instead of the LSTM's forget, input, and output gates, the GRU uses only two gates, a reset gate and an update gate, and it merges the cell state and hidden state into a single hidden state.
1) Reset gate ($r_t$): how much of the previous state to use.
The previous hidden state and the current input are passed through a sigmoid to decide how much of the previous hidden state to use: Eq. (2).
This value is reused in Eq. (3), where the previous time step's hidden state is multiplied by the reset gate.
2) Update gate ($z_t$): the ratio of how much past versus current information to keep ==> it plays the role of both the forget gate and the input gate.
The result $z_t$ of Eq. (1) controls how much of the current information is used, while $1 - z_t$ controls how much of the past information is used; the former corresponds to the LSTM's input gate, the latter to its forget gate.
Finally, the hidden state at the current time step is obtained via Eq. (4).
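For reference, the standard GRU equations (one common convention; matching them to the note's equation numbers (1)-(4) is an assumption based on the descriptions above):

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{(1) update gate} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{(2) reset gate} \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1})\big) && \text{(3) candidate state} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(4) new hidden state}
\end{aligned}
$$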
Because the GRU cell has no output gate, the full hidden vector $h_t$ is emitted at every time step; instead, the gate controller $r_t$ controls which parts of the previous state $h_{t-1}$ are exposed.
The lack of transparency of black-box models like neural networks hinders the wider adoption of machine learning, in particular for image classification. Humans want to understand and control the learning algorithm to ensure that it has captured the underlying concepts of the classes. This paper focuses on model-agnostic explanations that can be applied to any image classifier. Various methods have been proposed for generating counterfactual explanations, which construct a concrete input with a different classification in order to test whether the model has learned the correct features. Such explanations can be used to debug existing machine learning models.
The main reason why visual counterfactual explanations (VCEs) for image classification are not widely used is that generating VCEs is closely related to generating adversarial examples: for standard classifiers, even slight changes to an image can flip the prediction. Adversarially robust models have been shown to produce semantically meaningful VCEs, but such approaches can only generate VCEs for robust models, which are not competitive in prediction accuracy, and other approaches come with their own restrictions. This paper presents Diffusion Visual Counterfactual Explanations (DVCEs), which overcome these challenges and can generate VCEs for arbitrary ImageNet classifiers. The authors combine a distance regularization and an appropriate starting point of the diffusion process with an adaptive reparameterization, plus a cone regularization of the classifier gradient via an adversarially robust model, to produce realistic images of the target class to which the classifier assigns high confidence. Their approach achieves higher realism and produces more meaningful features than recent methods.
2. Diffusion models
Diffusion models are generative models that transform the data distribution into a prior distribution through a forward diffusion process and then transform it back into the data distribution through a reverse diffusion process; the reverse process has been studied in detail in prior work. In the discrete-time setting, a Markov chain is defined by adding noise to the data point at each time step.
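Concretely, in the discrete-time (DDPM) formulation the forward transitions and their closed-form marginal take the standard form below, with noise schedule $\beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$ (textbook DDPM, included here for reference):

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\big(\sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big),
\qquad
q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\, I\big)
$$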
For the details of the diffusion process, we recommend the corresponding blog post:
Diffusion models are generative models that consist of a forward and a reverse diffusion process, with the reverse process transforming the prior distribution back into the data distribution. For the experiments, a class-unconditional diffusion model is used, and a noise-aware classifier is introduced so that any classifier can be explained. The reverse transitions have the form $p_{\theta,\varphi}(x_{t-1} \mid x_t, y) = Z\, p_\theta(x_{t-1} \mid x_t)\, p_\varphi(y \mid x_t, t)$, where $Z$ is a normalizing constant, and the noise-free image $x_0$ is estimated using the mapping $f_{\mathrm{dn}}$. To sample from $p_{\theta,\varphi}(x_{t-1} \mid x_t, y)$ efficiently, the transition kernels are approximated with slightly shifted versions of $p_\theta(x_{t-1} \mid x_t)$: $p_{\theta,\varphi}(x_{t-1} \mid x_t, y) \approx \mathcal{N}(\mu_t, \Sigma_\theta(x_t, t))$, where $\mu_t$ is obtained using the noise-aware classifier.
The transition kernels are defined as follows:
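In the standard classifier-guidance form of Dhariwal and Nichol, on which the paper builds, the shifted mean reads as follows (our reconstruction; in the paper the classifier may instead be evaluated on the denoised estimate $f_{\mathrm{dn}}(x_t)$):

$$
p_{\theta,\varphi}(x_{t-1} \mid x_t, y) \approx \mathcal{N}\big(\mu_t,\ \Sigma_\theta(x_t, t)\big),
\qquad
\mu_t = \mu_\theta(x_t, t) + \Sigma_\theta(x_t, t)\, \nabla_{x_t} \log p_\varphi(y \mid x_t, t)
$$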
3. Diffusion Visual Counterfactual Explanations
Given a classifier pφ(y|·) and a target class y, a Visual Counterfactual Explanation (VCE) x for an input xˆ should meet the following criteria: i) validity: the VCE x should be classified by pφ(y|·) as the desired target class y with high predicted probability; ii) realism: the VCE should be as close as possible to a natural image; iii) minimality/closeness: the difference between the VCE x and the original image xˆ should be the minimal semantic modification necessary to change the class, while being valid and realistic. The paper introduces a new approach, Diffusion Visual Counterfactual Explanations (DVCEs), which works for any classifier and produces more realistic results thanks to the better generative properties of diffusion models.
3.1 Adaptive Parameterization
To generate the DVCE of the original image xˆ, the diffusion process needs to be conditioned on xˆ to ensure that the generated image x is both realistic and close to xˆ. To achieve this, the mean of the transition kernel is modified to include the gradient of the log-likelihood of the noise-aware classifier pφ.
However, even with this modified approach, it can still be difficult to generate semantically meaningful changes close to xˆ. To address this issue, the starting point of the diffusion process is varied and it is observed that starting from step T/2 of the forward diffusion process, together with an adaptive parameterization and using the L1-distance as the distance metric, provides sparse but semantically meaningful changes. In the experiments, T is set to 200.
3.2 Cone projection for classifier guidance
In summary, to generate DVCEs that can be applied to any image classifier, regardless of whether it is adversarially robust, the authors project the gradient of an adversarially robust classifier with parameters ψ onto a cone centered at the gradient of the non-robust classifier, using the denoising function fdn to preprocess the image. This cone projection ensures that the resulting direction is always an ascent direction for the probability of the non-robust classifier, which is the quantity to be maximized. The cone projection is a form of regularization that guides the diffusion process toward semantically meaningful changes to the image.
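Below is a minimal numpy sketch of the cone projection itself, meant only to illustrate the geometric operation described above (it is not the paper's code; the gradient vectors and the 30° half-angle are placeholders):

```python
import numpy as np

def project_onto_cone(v, axis, angle_deg):
    """Euclidean projection of v onto the convex cone of half-angle
    `angle_deg` centered at `axis` (both flattened gradient vectors)."""
    u = axis / np.linalg.norm(axis)
    a = float(v @ u)                      # component along the cone axis
    perp = v - a * u
    b = np.linalg.norm(perp)              # component orthogonal to the axis
    alpha = np.deg2rad(angle_deg)

    if b <= a * np.tan(alpha):            # already inside the cone: keep it
        return v
    if a * np.cos(alpha) + b * np.sin(alpha) <= 0:
        return np.zeros_like(v)           # closest point is the cone apex
    # otherwise, project onto the cone boundary
    boundary_dir = np.cos(alpha) * u + np.sin(alpha) * (perp / b)
    return (a * np.cos(alpha) + b * np.sin(alpha)) * boundary_dir

# Placeholder gradients (in practice: robust and non-robust classifier gradients)
rng = np.random.default_rng(0)
g_robust, g_plain = rng.normal(size=100), rng.normal(size=100)
g_guided = project_onto_cone(g_robust, axis=g_plain, angle_deg=30.0)
```

If the robust gradient already lies inside the cone it is kept unchanged; otherwise it is pulled back to the closest direction on the cone boundary, so the result always has a non-negative inner product with the non-robust classifier's gradient.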
3.3 Final scheme for Diffusion Visual Counterfactuals
The final scheme handles a non-adversarially robust classifier pφ(y|·) by replacing the update step with the cone-projected gradient described above:
4. Experiments
4.1. Comparison of methods for VCE Generation
Sec. 4.1 assesses the effectiveness of DVCEs by comparing them to existing methods. Sec. 4.2 then demonstrates how DVCEs can be employed to interpret differences between classifiers by comparing various state-of-the-art ImageNet models.
DVCEs are the only type of VCEs that fulfill all desired properties. BD, on the other hand, tends to create images that are significantly different from the original, as seen in the examples of leopard and tiger, and frequently produces artifacts in images such as pizza, potpie, timber wolf, white wolf, and alp. Although l1.5 SVCEs can produce images with lower quality and artifacts, they still offer some benefits, as seen with examples like white wolf, volcano, and night snake.
Quantitative analysis using FID scores is challenging for VCEs because methods that barely alter the original image obtain low FID scores. Therefore, a cross-over evaluation scheme was developed that partitions the classes into two sets and analyzes only cross-over VCEs. The results in Tab. 1 show that DVCEs stay less close to the original image than l1.5-SVCEs but are more realistic and have similar validity. In contrast, BDVCEs perform the worst in all categories. A user study with 20 participants also confirms that DVCEs generate more meaningful features of the target classes than l1.5-SVCEs and BDVCEs. While the quantitative evaluation may suggest otherwise, users considered the images generated by DVCEs to be more realistic regardless of whether they show the target class or not.
4.2. Model comparison
Non-robust ImageNet models were used in the experiments, including Swin-TF, ConvNeXt, and Noisy-Student EfficientNet. DVCEs were generated using the cone projection method with a 30° angle, with a robust model from the previous section. The resulting DVCEs provide insight into the most important features of each model and class, with the roof and tower structures being the most prominent for stupa and church classes. In addition, DVCEs of different robust models, including MNR-RN50, MNR-XCiT, and MNR-DeiT, were generated and evaluated, demonstrating the generative properties of adversarially robust models, particularly robust transformers.
5. Limitations
For future research, it would be interesting to explore alternative "denoising" procedures for the gradient of non-robust models, as training robust models can be challenging. However, it is important to note that approximating reverse transitions using shifted normal distributions can be difficult to verify, especially when adding a classifier and other terms, which can affect the diffusion process outcome.
Evaluating VCEs quantitatively is challenging: existing metrics such as FID or IM1/IM2 rely on well-trained (V)AEs for every class, while VCEs must produce meaningful changes in addition to realistic images and high confidence. Developing new metrics for VCEs is necessary, especially for datasets with high-resolution images and many classes. While DVCEs, and VCEs in general, help uncover biases in classifiers and thus have a positive impact, there is also the potential for unintended misuse, as with any conditional generative model.