Bottom-Up and Top-Down Reasoning with Hierarchical Rectified Gaussians
CVPR 2016
Robotics Institute
Carnegie Mellon University
Abstract
Convolutional neural nets (CNNs) have demonstrated remarkable performance in recent years. Such approaches
tend to work in a "unidirectional" bottom-up feed-forward fashion. However, practical experience and
biological evidence tell us that feedback plays a crucial role, particularly for detailed spatial
understanding tasks. This work explores "bidirectional" architectures that also reason with top-down
feedback: neural units are influenced by both lower and higher-level units.
We do so by treating units as rectified latent variables in a quadratic energy function, which can be seen
as a hierarchical Rectified Gaussian (RG) model. We show that RGs can be optimized with a quadratic program
(QP), which can in turn be optimized with a recurrent neural network with rectified linear units. This
allows RGs to be trained with GPU-optimized gradient descent. From a theoretical perspective, RGs help
establish a connection between CNNs and hierarchical probabilistic models. From a practical perspective,
RGs are well suited for detailed spatial tasks that can benefit from top-down reasoning. We illustrate them
on the challenging task of keypoint localization under occlusion, where local bottom-up evidence may be
misleading, and demonstrate state-of-the-art results on challenging benchmarks.
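To make the unrolled inference described above concrete, here is a minimal NumPy sketch, under assumed layer widths and illustrative random weights (not the released implementation): the first sweep is a plain bottom-up feed-forward pass, and a second sweep revisits each hidden layer with added top-down input from the layer above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
sizes = [64, 32, 16, 8]                      # assumed layer widths, input -> top
W = [rng.normal(scale=0.1, size=(sizes[i + 1], sizes[i]))
     for i in range(len(sizes) - 1)]         # weights between adjacent layers
b = [np.zeros(s) for s in sizes[1:]]         # biases for the hidden layers

def infer(x, n_passes=2):
    """Unrolled inference on the rectified quadratic energy (illustrative).

    The first sweep corresponds to a plain bottom-up pass, i.e. a feed-forward
    net; later sweeps revisit every hidden layer with both bottom-up input
    from below and top-down input from the layer above.
    """
    z = [np.asarray(x)] + [np.zeros(s) for s in sizes[1:]]
    for sweep in range(n_passes):
        for i in range(len(W)):
            bottom_up = W[i] @ z[i] + b[i]
            # Top-down feedback only exists once the layer above has been
            # computed, i.e. from the second sweep onward.
            top_down = (W[i + 1].T @ z[i + 2]
                        if (sweep > 0 and i + 1 < len(W)) else 0.0)
            z[i + 1] = relu(bottom_up + top_down)
    return z

z = infer(rng.normal(size=sizes[0]), n_passes=2)
print([layer.shape for layer in z[1:]])      # [(32,), (16,), (8,)]
```

Each rectified update mixes a bottom-up and a top-down term, which is the sense in which units are influenced by both lower- and higher-level units.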
Coarse-to-fine prediction
We visualize coarse-to-fine heatmap predictions, obtained from multi-scale classifiers defined on features
extracted from multiple layers of our probabilistic model. Predictions based only on the coarsest layer
involve no top-down reasoning and are therefore identical for the bottom-up and bidirectional models. As
features from additional lower layers are added, the top-down model does a better job of incorporating
contextual evidence, which cleans up localization around the right knee and disambiguates the left and
right ankles.
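The sketch below illustrates this kind of multi-scale readout; the layer shapes, per-layer linear classifiers, and nearest-neighbor upsampling are assumptions for illustration rather than our exact classifiers. Keypoint heatmaps are predicted at each scale, upsampled to a common resolution, and summed.

```python
import numpy as np

def upsample(h, factor):
    """Nearest-neighbor upsampling of a (K, H, W) heatmap stack."""
    return h.repeat(factor, axis=1).repeat(factor, axis=2)

def multiscale_heatmaps(features, classifiers, out_size=64):
    """features: list of (C_l, H_l, W_l) maps (coarse layers have small H_l).
    classifiers: list of (K, C_l) per-keypoint linear weights, one per layer."""
    total = None
    for feat, w in zip(features, classifiers):
        _, h, _ = feat.shape
        heat = np.einsum('kc,chw->khw', w, feat)     # K keypoint heatmaps
        heat = upsample(heat, out_size // h)         # bring to common resolution
        total = heat if total is None else total + heat
    return total

rng = np.random.default_rng(0)
feats = [rng.normal(size=(256, 8, 8)),    # coarse layer
         rng.normal(size=(128, 16, 16)),
         rng.normal(size=(64, 32, 32))]   # finer layer
clfs = [rng.normal(scale=0.01, size=(16, f.shape[0])) for f in feats]
print(multiscale_heatmaps(feats, clfs).shape)        # (16, 64, 64)
```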
Low-level activation
Inspired by neurological experiments, we compare low-level activations at 0.5 ms (during bottom-up processing) with those at 30 ms (during top-down feedback). The convolutional activations of the same neural units look different after top-down feedback: activations on facial parts (hair in the top row, facial skin in the bottom row) become stronger, while activations on the background (including clothing) are suppressed. This is most apparent in the average activation across images, shown in the right-most column.
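As a rough illustration of this analysis (the activation tensors below are random stand-ins, not our recorded responses), one can average a unit's activation maps across images after the bottom-up pass and after feedback, then inspect where feedback strengthened or suppressed the response.

```python
import numpy as np

def average_activation(act_maps):
    """act_maps: (N, H, W) activations of one unit over N images."""
    return act_maps.mean(axis=0)

rng = np.random.default_rng(0)
bottom_up = rng.random(size=(100, 28, 28))                      # stand-in data
feedback = bottom_up * rng.uniform(0.5, 1.5, size=bottom_up.shape)

diff = average_activation(feedback) - average_activation(bottom_up)
print("mean strengthening:", diff[diff > 0].mean())
print("mean suppression:  ", diff[diff < 0].mean())
```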
Human keypoint localization
On the left are human keypoint localization results produced by our model. The table at the top compares keypoint localization performance on the MPII Human Pose benchmark between our approach and the state of the art at the time of submission. QP1 is equivalent to a "unidirectional" bottom-up feed-forward model, while QP2 denotes a "bidirectional" model with two passes of inference.
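MPII performance is typically reported with the PCKh metric, in which a predicted keypoint counts as correct if it falls within a fraction of the annotated head-segment length of the ground truth. The sketch below is an illustrative re-implementation of that scoring, not the official evaluation code, and the data it runs on is synthetic.

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """pred, gt: (N, K, 2) keypoint coordinates; head_sizes: (N,) head lengths.
    Returns per-keypoint accuracy at threshold alpha * head size."""
    dists = np.linalg.norm(pred - gt, axis=-1)        # (N, K) pixel distances
    correct = dists <= alpha * head_sizes[:, None]
    return correct.mean(axis=0)

rng = np.random.default_rng(0)
gt = rng.uniform(0, 200, size=(50, 16, 2))            # synthetic ground truth
pred = gt + rng.normal(scale=5.0, size=gt.shape)      # synthetic predictions
heads = rng.uniform(30, 60, size=50)
print(pckh(pred, gt, heads).round(2))
```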
Facial landmark localization and visibility prediction
On the left are examples of facial landmark localization with visibility prediction (visible vs. invisible) from our model. The table at the top compares landmark localization performance on the COFW benchmark between our approach and prior art at the time of submission.
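One simple way to read out joint landmark and visibility predictions is sketched below: take the argmax of each landmark heatmap as its location and threshold a per-landmark visibility score. The threshold, the 29-landmark layout, and the decoding itself are illustrative assumptions, not our released code.

```python
import numpy as np

def decode(heatmaps, vis_scores, vis_thresh=0.5):
    """heatmaps: (K, H, W) landmark heatmaps; vis_scores: (K,) in [0, 1]."""
    k, _, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1).argmax(axis=1)
    coords = np.stack([flat % w, flat // w], axis=1)   # (x, y) per landmark
    visible = vis_scores >= vis_thresh                 # assumed threshold
    return coords, visible

rng = np.random.default_rng(0)
coords, visible = decode(rng.random((29, 64, 64)), rng.random(29))
print(coords.shape, int(visible.sum()), "of 29 landmarks predicted visible")
```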
Code
We have released code and models for both human keypoint and facial landmark localization on GitHub.
Acknowledgments
This research is supported by NSF Grant 0954083 and by the Office of the Director of National Intelligence
(ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No.
2014-14071600012.