Self-supervised learning (SSL) has revolutionized image processing, but extending its success to video understanding presents unique challenges due to increased data complexity and computational demands. We introduce ViDROP (Video Dense Representation thrOugh spatio-temporal sParsity), a novel and efficient SSL architecture for video understanding that combines token dropping and masking strategies. Our approach eliminates the need for a computationally expensive decoder and enables per-patch loss computation, overcoming limitations of previous video SSL methods and significantly reducing computational overhead. Moreover, we propose a simple yet effective video compression technique using k-means clustering in pixel space, which dramatically accelerates data loading and facilitates rapid experimentation. In most experiments, we continue pre-training ViT-Base, ViT-Large, and ViT-Huge models from existing SSL checkpoints (VideoMAE or V-JEPA), achieving significant performance gains while drastically reducing the computational cost typically associated with video SSL. Furthermore, we demonstrate the feasibility of training a ViT-Huge model from scratch using network expansion techniques and modest computational resources, achieving accuracy comparable to VideoMAE with a \textbf{25$\times$ reduction in training time}. This marks a significant breakthrough in efficient, large-scale video SSL, enabling the training of state-of-the-art models with limited resources. Extensive experiments show that ViDROP achieves state-of-the-art performance on various video understanding benchmarks, including Kinetics400, SSv2, UCF101, and HMDB51, as well as on temporal action detection (THUMOS14). These results highlight the effectiveness of our fine-grained token-level learning strategy for efficient video representation learning.
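The abstract states only that the compression uses k-means clustering in pixel space. As a rough illustration of that idea, the sketch below quantizes a clip to a small colour palette and stores one palette index per pixel. Everything beyond "k-means on pixels" is an assumption made for this example and is not taken from the paper: the per-clip RGB palette, the values of `k` and `sample`, the plain Lloyd iterations, and the hypothetical function names `kmeans_palette`, `compress_video`, and `decompress_video`.

```python
import numpy as np


def assign(pixels, centroids):
    """Return the index of the nearest centroid for every pixel."""
    # Squared Euclidean distances, shape (N, k). For long clips this should
    # be done in chunks; it is kept as one step here for brevity.
    dists = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans_palette(pixels, k=16, iters=10, seed=0):
    """Plain Lloyd's k-means over RGB pixels; returns a (k, 3) float palette."""
    rng = np.random.default_rng(seed)
    pixels = pixels.astype(np.float32)
    # Initialise centroids from k randomly chosen pixels (requires k <= N).
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(pixels, centroids)
        for j in range(k):
            members = pixels[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return centroids


def compress_video(video, k=16, sample=4096, seed=0):
    """Quantize a (T, H, W, 3) uint8 clip to k colours (assumed k <= 256).

    Returns (indices, palette): indices is (T, H, W) uint8, palette is (k, 3) uint8.
    """
    flat = video.reshape(-1, 3)
    rng = np.random.default_rng(seed)
    # Fit the palette on a random subset of pixels to keep clustering cheap.
    subset = rng.choice(len(flat), size=min(sample, len(flat)), replace=False)
    palette = kmeans_palette(flat[subset], k=k, seed=seed)
    labels = assign(flat.astype(np.float32), palette)
    indices = labels.reshape(video.shape[:-1]).astype(np.uint8)
    return indices, np.round(palette).astype(np.uint8)


def decompress_video(indices, palette):
    """Rebuild an approximate (T, H, W, 3) uint8 clip by palette lookup."""
    return palette[indices]
```

Under these assumptions, each pixel is stored as a single byte plus a shared palette instead of three bytes, and decoding is a table lookup, which is one plausible reason such a scheme could speed up data loading. The actual codec, palette size, and storage layout used by ViDROP may differ.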
 AI & Machine Learning
Computer Vision, Imaging & Video