Input Clip. The input video clip (half a second).*
Pose Forecast. The pose forecast by Pose-VAE for 1 second into the future (the largest k-means cluster of 1000 samples).
Video Forecast. The video forecast by Pose-GAN for 1 second into the future.
Baseline Forecast. The video forecast by the baseline of Vondrick et al., NIPS 2016,* for 1 second into the future.

Input Clip	Pose Forecast	Video Forecast	Baseline Forecast

*Our model takes in 12 consecutive frames of past pose information and only the last frame of the input clip. To accommodate the structure of volumetric convolutions, the baseline is actually given more past information: 16 full video frames. What is shown here are all 16 past video frames. The baseline is identical to the image-conditioned GAN in "Generating Videos with Scene Dynamics", NIPS 2016, except that we condition on 16 input video frames with volumetric convolutions instead and change the aspect ratio to 80x64 pixels.
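
The "largest k-means cluster of 1000 samples" step used for the Pose Forecast column could be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' code: the number of clusters k, the use of Euclidean distance on flattened pose coordinates, and taking the cluster centroid as the displayed forecast are all assumptions, and the toy array stands in for real Pose-VAE samples (its shape of 4 frames, 3 joints and 2 coordinates is hypothetical).

```python
import numpy as np


def kmeans(samples, k, n_iters=50, seed=0):
    """Plain Lloyd's k-means on row vectors; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct samples.
    centroids = samples[rng.choice(len(samples), size=k, replace=False)].copy()
    labels = np.zeros(len(samples), dtype=int)
    for _ in range(n_iters):
        # Squared Euclidean distance from every sample to every centroid.
        dists = ((samples[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = samples[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels


def largest_cluster_forecast(samples, k=8):
    """Return the centroid of the most populated cluster and that cluster's size.

    k=8 is an arbitrary default; the page does not state the value used.
    """
    flat = samples.reshape(len(samples), -1)  # one row per sampled pose sequence
    centroids, labels = kmeans(flat, k)
    counts = np.bincount(labels, minlength=k)
    best = counts.argmax()
    return centroids[best].reshape(samples.shape[1:]), counts[best]


# Toy stand-in for 1000 sampled future pose sequences (assumed shape:
# samples x frames x joints x 2). Two artificial modes: 700 samples near
# 0 and 300 samples near 10.
rng = np.random.default_rng(0)
mode_a = rng.normal(0.0, 0.1, size=(700, 4, 3, 2))
mode_b = rng.normal(10.0, 0.1, size=(300, 4, 3, 2))
samples = np.concatenate([mode_a, mode_b])

forecast, size = largest_cluster_forecast(samples, k=2)
print(forecast.shape, size)
```

Showing the dominant cluster rather than a single random draw gives a forecast that reflects the most common outcome among the samples; a real pipeline would replace the toy array with draws from the trained model.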