In keypoint regression, models are trained to consume an image and produce the x, y coordinates of some entity, e.g., a person's nose or the corner of a document. Typical methods include linear regression over deep features to produce the x, y coordinates directly, and training networks to regress a dense heatmap around each keypoint. Recently, several works have independently proposed using Fully Convolutional Networks (FCNs) to predict latent heatmaps, from which the continuous coordinates of the keypoints can be computed using a fully differentiable layer. This allows end-to-end training of convolutional neural networks with dense output to learn keypoint locations. We show that three of these proposed methods are equivalent to interpreting the heatmap as an arrangement of point masses and computing the Center of Mass (CoM) of those masses. The CoM is heavily influenced by outliers, which can lead to imprecise predictions. To fix this problem, we propose a new differentiable layer that computes the spatial median point of the latent heatmap, which is more robust to outliers. We demonstrate that our approach outperforms both strong baselines and a state-of-the-art regression function in synthetic experiments and on five real-world keypoint regression datasets including faces, human pose, fashion, and document images.
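To make the point-mass interpretation concrete, the following is a minimal sketch of a CoM computation over a small heatmap. It assumes the heatmap is a grid of non-negative values. The function name `center_of_mass` and the toy 5x5 grids are illustrative assumptions and are not taken from the paper. The sketch shows only the CoM and its sensitivity to a distant outlier, not the proposed spatial median layer.

```python
def center_of_mass(heatmap):
    """Return the (x, y) center of mass of a 2D grid of non-negative masses.

    Each cell value is treated as a point mass located at its
    (column, row) index. Hypothetical helper for illustration only.
    """
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap must contain positive mass")
    # Mass-weighted average of column indices (x) and row indices (y).
    x = sum(m * col for row in heatmap for col, m in enumerate(row)) / total
    y = sum(m * r for r, row in enumerate(heatmap) for m in row) / total
    return x, y


# Toy 5x5 heatmap with all mass on the keypoint at (x=1, y=1).
clean = [[0.0] * 5 for _ in range(5)]
clean[1][1] = 1.0

# Same heatmap with a small spurious activation far from the keypoint.
noisy = [row[:] for row in clean]
noisy[4][4] = 0.25

print(center_of_mass(clean))  # (1.0, 1.0)
print(center_of_mass(noisy))  # (1.6, 1.6)
```

In this toy case, an outlier holding only a fifth of the total mass moves the CoM estimate by 0.6 pixels along each axis, which illustrates the sensitivity the abstract describes.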