Self-supervised learning (SSL) is a commonly used approach to learning and
encoding data representations. By using a pre-trained SSL image encoder and
training a downstream classifier on top of it, impressive performance can be
achieved on various tasks with very little labeled data. The increasing usage
of SSL has led to an uptick in security research related to SSL encoders and
the development of various Trojan attacks. The danger posed by Trojan attacks
inserted in SSL encoders lies in their ability to operate covertly and spread
widely among various users and devices. Backdoor behavior in Trojaned
encoders can be inadvertently inherited by downstream classifiers, making the
threat even more difficult to detect and mitigate. Although
current Trojan detection methods in supervised learning can potentially
safeguard SSL downstream classifiers, identifying and addressing triggers in
the SSL encoder before its widespread dissemination is a challenging task. This
is because, at the time of SSL encoder Trojan detection, downstream tasks are
not always known, dataset labels are unavailable, and even the original
training dataset is inaccessible. This paper presents an innovative technique
called SSL-Cleanse that is designed to detect and mitigate backdoor attacks in
SSL encoders. We evaluated SSL-Cleanse on various datasets using 300 models,
achieving an average detection success rate of 83.7% on ImageNet-100. After
mitigation, backdoored encoders achieve an average attack success rate of
0.24% without significant accuracy loss, demonstrating the effectiveness of
SSL-Cleanse.
June 3, 2023