Audio source separation aims to extract individual sound sources from an audio mixture. Recent studies on source separation focus primarily on minimizing signal-level distance, typically measured by the source-to-distortion ratio (SDR). However, scant attention has been given to the perceptual quality of the separated tracks. In this paper, we propose MDX-GAN, an efficient and high-fidelity audio source separator based on MDX-Net for multiple sound classes. We leverage different training objectives to enhance the perceptual quality of audio source separation. Specifically, we adopt perceptually motivated loss functions on top of the waveform loss, including multi-resolution STFT and Mel-spectrogram losses, and employ the adversarial training paradigm with multi-domain and multi-scale discriminators to refine the perceptual quality of the separated audio. Additionally, we extend the model to support multiple sound classes within a single network via feature-wise linear modulation (FiLM). We conduct both objective and subjective experiments to evaluate MDX-GAN in real-world settings, and assess the impact of individual design components on perceptual quality and SDR scores. Results demonstrate that MDX-GAN accurately separates sound sources and achieves superior perceptual quality.
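The abstract mentions a multi-resolution STFT loss among the perceptually motivated objectives. Below is a minimal sketch, not the authors' implementation, of how such a loss is commonly formed: magnitude spectrograms of the estimated and reference waveforms are compared at several FFT resolutions, combining a spectral-convergence term with a log-magnitude L1 term. The specific resolution settings and weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def stft_mag(x, n_fft, hop):
    """Magnitude spectrogram of a (batch, time) waveform at one resolution."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True)
    return spec.abs()

def multi_resolution_stft_loss(estimate, reference,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average spectral-convergence + log-magnitude L1 loss over several
    FFT/hop settings (resolutions chosen here are assumptions)."""
    loss = 0.0
    for n_fft, hop in resolutions:
        est_mag = stft_mag(estimate, n_fft, hop)
        ref_mag = stft_mag(reference, n_fft, hop)
        # Spectral convergence: relative Frobenius-norm error of magnitudes
        sc = torch.norm(ref_mag - est_mag) / (torch.norm(ref_mag) + 1e-8)
        # Log-magnitude L1: emphasizes low-energy spectral detail
        log_l1 = F.l1_loss(torch.log(est_mag + 1e-8), torch.log(ref_mag + 1e-8))
        loss = loss + sc + log_l1
    return loss / len(resolutions)
```

In practice a loss of this form would be added to the waveform and adversarial terms with scalar weights; the exact combination used in MDX-GAN is described in the paper itself.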