In this paper we study data augmentation for opcode sequence based Android
malware detection. Data augmentation has been successfully used in many areas
of deep-learning to significantly improve model performance. Typically, data
augmentation simulates realistic variations in data to increase the apparent
diversity of the training-set. However, for opcode-based malware analysis it is
not immediately clear how to apply data augmentation. Hence we first study the
use of fixed transformations, then progress to adaptive methods. We propose a
novel data augmentation method — Self-Embedding Language Model Augmentation —
that uses a malware detection network’s own opcode embedding layer to measure
opcode similarity for adaptive augmentation. To the best of our knowledge this
is the first paper to carry out a systematic study of different augmentation
methods for opcode sequence based Android malware classification.
Related Stories
February 5, 2023