Pose Guided Human Video Generation

Posted by Ceyuan on July 27, 2018

ECCV18: Pose Guided Human Video Generation

Ceyuan Yang$^{1}$, Zhe Wang$^{2}$, Xinge Zhu$^{1}$, Chen Huang$^{3}$, Jianping Shi$^{2}$ and Dahua Lin$^{1}$
$^{1}$CUHK-SenseTime Joint Lab
$^{2}$SenseTime Research
$^{3}$Carnegie Mellon University

Abstract

With the emergence of Generative Adversarial Networks, video synthesis has witnessed exceptional breakthroughs. However, existing methods lack a proper representation to explicitly control the dynamics in videos. Human pose, on the other hand, can represent motion patterns intrinsically and interpretably, and impose geometric constraints regardless of appearance. In this paper, we propose a pose guided method to synthesize human videos in a disentangled way: plausible motion prediction and coherent appearance generation. In the first stage, a Pose Sequence Generative Adversarial Network (PSGAN) learns in an adversarial manner to yield pose sequences conditioned on the class label. In the second stage, a Semantic Consistent Generative Adversarial Network (SCGAN) generates video frames from the poses while preserving coherent appearances in the input image. By enforcing semantic consistency between the generated and ground-truth poses at a high feature level, our SCGAN is robust to noisy or abnormal poses. Extensive experiments on both human action and human face datasets demonstrate the superiority of the proposed method over other state-of-the-art approaches.

Methodology

Our method consists of two stages. In the first stage, we extract the pose from the input image, and the Pose Sequence GAN (PSGAN) generates a temporally smooth pose sequence conditioned on this initial pose and the target action class. In the second stage, we focus on appearance modeling and propose a Semantic Consistent GAN (SCGAN) to generate realistic and coherent video frames conditioned on the input image and the pose sequence from stage one; a rough sketch of the full pipeline is given below the overview figure.

[Figure: overview of the proposed two-stage framework]
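To make the pipeline concrete, here is a minimal PyTorch-style sketch of inference through the two stages. The pose_estimator, psgan and scgan interfaces below are illustrative assumptions for exposition, not our released code.

import torch
import torch.nn as nn

class TwoStagePipeline(nn.Module):
    def __init__(self, pose_estimator, psgan: nn.Module, scgan: nn.Module):
        super().__init__()
        self.pose_estimator = pose_estimator  # e.g. an off-the-shelf keypoint detector
        self.psgan = psgan  # stage one: pose sequence generation
        self.scgan = scgan  # stage two: appearance generation

    def forward(self, image, action_class, num_frames=16):
        # Stage one: extract the input pose and unroll a pose sequence
        # conditioned on the initial pose and the target action class.
        pose0 = self.pose_estimator(image)                       # (B, C, H, W) pose maps
        pose_seq = self.psgan(pose0, action_class, num_frames)   # (B, T, C, H, W)

        # Stage two: render each frame from the input image and its pose.
        frames = [self.scgan(image, pose_seq[:, t]) for t in range(num_frames)]
        return torch.stack(frames, dim=1)                        # (B, T, 3, H, W)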

The First Stage

[Figure: stage one, pose sequence generation with PSGAN]

Abnormal Poses: We show some bad pose generation results where some key points appear bigger or smaller (b) than the ground truth (a), or seem missing (c) because of their weak responses. We call such cases abnormal poses. For human beings, however, abnormal poses might look weird at first glance, but would hardly prevent us from imagining what the “true” pose is. This requires our network to grasp the semantic implication of human pose and to alleviate the influence of small numerical differences.

[Figure: examples of abnormal poses, (a) ground truth, (b) scaled key points, (c) missing key points]
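One way to realize this robustness is a feature-level consistency loss: rather than penalizing per-pixel differences between generated and ground-truth poses, compare them in a high-level feature space where small numerical deviations matter less than semantic disagreement. The sketch below is only an illustrative form of such a loss; the exact feature layer and distance are design choices, not a specification of our implementation.

import torch.nn.functional as F

def semantic_consistency_loss(feature_extractor, pred_pose, gt_pose):
    # Map both pose maps into a high-level feature space, so that scaled
    # or weakly detected key points incur little penalty while semantic
    # disagreement between the poses remains costly.
    return F.l1_loss(feature_extractor(pred_pose), feature_extractor(gt_pose))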

The Second Stage

[Figure: stage two, video frame generation with SCGAN]

Implementation Details: In our detailed network architecture, all of the generators $(G, G_{1}, G_{2})$ apply 4 convolution layers with a kernel size of 4 and a stride of 2 for downsampling. In the decoding step of stage one, transposed convolution layers with a stride of 2 are adopted for upsampling, while in the second stage normal convolution layers together with interpolation operations take the place of transposed convolutions. The feature map of the red block in Fig.2 is extended with a time dimension ($C \times W \times H \Rightarrow C \times 1 \times W \times H$) for the decoder of PSGAN. The discriminators $(D, D_{1}, D_{2}, D_{which})$ are PatchGANs that classify whether local image patches are real or fake. ReLU serves as the activation function after each layer, and instance normalization is used in all networks. Several residual blocks are leveraged to jointly encode the concatenated feature representations.
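The following PyTorch sketch translates these building blocks into code. The channel widths and module boundaries are illustrative assumptions; only the kernel size, stride, normalization, activation, and the stage-one/stage-two upsampling difference are taken from the description above.

import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 4x4 convolution with stride 2 for downsampling, followed by
    # instance normalization and ReLU, as described above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def upsample_stage1(in_ch, out_ch):
    # Stage one decoder: transposed convolution with stride 2.
    return nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)

def upsample_stage2(in_ch, out_ch):
    # Stage two decoder: interpolation followed by a normal convolution,
    # taking the place of the transposed convolution.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
    )

class Encoder(nn.Module):
    # Four downsampling convolution layers shared by the generators
    # (channel widths here are illustrative, not from the paper).
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.layers = nn.Sequential(
            conv_block(in_ch, base),
            conv_block(base, base * 2),
            conv_block(base * 2, base * 4),
            conv_block(base * 4, base * 8),
        )

    def forward(self, x):
        feat = self.layers(x)   # (B, C, H, W)
        # For the PSGAN decoder, extend the bottleneck with a time axis:
        # C x W x H  =>  C x 1 x W x H.
        return feat.unsqueeze(2)

class PatchDiscriminator(nn.Module):
    # PatchGAN discriminator: outputs a grid of per-patch real/fake logits
    # rather than a single scalar for the whole image.
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.layers = nn.Sequential(
            conv_block(in_ch, base),
            conv_block(base, base * 2),
            conv_block(base * 2, base * 4),
            nn.Conv2d(base * 4, 1, kernel_size=4, stride=1, padding=1),
        )

    def forward(self, x):
        return self.layers(x)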

Results

[Figures: qualitative results on human action and human face datasets]

Failure Case Analysis

Empirically, we observed two main failure patterns of our method.

  • Instability in sequence generation from partially occluded images. Fig 1 (a) shows an input image where the human face is occluded by the right arm; as a result, blurry near-face regions are hallucinated due to the lack of appearance information.
  • Difficulty generalizing to unseen poses. Fig 1 (b) shows such an example, where the target pose differs significantly from our training data and the generation collapses.

In the future, we can explore various ways to address these failures, e.g., training with more diverse data or applying stronger regularization.
[Figure: failure cases, (a) occluded input image, (b) unseen target pose]


Citation

If our work is useful in your research, please consider citing our paper. Code and models will be made publicly available.

@inproceedings{ceyuan2018PGVIDEO,
  title={Pose Guided Human Video Generation},
  author={Ceyuan Yang and Zhe Wang and Xinge Zhu and Chen Huang and Jianping Shi and Dahua Lin},
  booktitle={ECCV},
  year={2018}
}