| Input Clip | Pose Forecast | Video Forecast | Baseline Forecast |
*Our model takes as input 12 consecutive frames of past pose information and only the last frame of the input clip. To accommodate the structure of volumetric convolutions, the baseline is actually given more past information: 16 full video frames. What is shown here are all 16 past video frames. The baseline is identical to the image-conditioned GAN in "Generating Videos with Scene Dynamics" (NIPS 2016), except that we condition on 16 input video frames using volumetric convolutions rather than on a single image, and we change the aspect ratio to 80x64 pixels.