Controllable Human-centric Keyframe Interpolation with Generative Prior

1S-Lab, Nanyang Technological University, 2SenseTime Research
TL;DR: We introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI).

Abstract

Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, our PoseFuse3D, a 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.

Video

Method

Our PoseFuse3D-KI framework, as shown in (a), comprises a video diffusion model (VDM) and a novel control model, PoseFuse3D. PoseFuse3D extracts rich features from both 3D and 2D control signals and fuses them into a unified representation that guides the VDM. Its key component is the SMPL-X encoder illustrated in (b), which provides explicit 3D signal features. Specifically, the SMPL-X encoder first extracts 3D information from the SMPL-X model together with its 2D correspondences obtained via projection. The 3D and 2D information is then encoded in parallel. Guided by the features of the 2D correspondences, the 3D information is aggregated onto the 2D image plane via an attention mechanism, and the aggregated features are processed to produce the final feature \(S^{3D}\).
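To make the aggregation step concrete, here is a minimal sketch of cross-attention that pools projected 3D vertex features onto 2D pixel locations, as the SMPL-X encoder does conceptually. The shapes, projection weights, and function name are illustrative assumptions, not the paper's implementation (which operates on learned latent features inside the diffusion pipeline).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_3d_to_2d(feat3d, feat2d, d=32, seed=0):
    """Cross-attention sketch (assumed, simplified): each 2D pixel-location
    feature acts as a query and attends over the projected 3D vertex
    features (keys/values), pooling 3D information onto the image plane.

    feat3d: (num_vertices, c3d)   features of projected SMPL-X vertices
    feat2d: (num_pixels,   c2d)   features at 2D image locations
    returns: (num_pixels, d)      aggregated 3D features, i.e. a stand-in
                                  for the final feature S^3D
    """
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned linear layers.
    Wq = rng.standard_normal((feat2d.shape[1], d)) / np.sqrt(feat2d.shape[1])
    Wk = rng.standard_normal((feat3d.shape[1], d)) / np.sqrt(feat3d.shape[1])
    Wv = rng.standard_normal((feat3d.shape[1], d)) / np.sqrt(feat3d.shape[1])
    Q, K, V = feat2d @ Wq, feat3d @ Wk, feat3d @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (num_pixels, num_vertices)
    return attn @ V
```

In this reading, the 2D-correspondence features decide *where* on the image plane each piece of 3D information lands, which is why the 2D branch is encoded in parallel rather than discarded after projection.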

More interpolations

In-the-wild Interpolation

Our PoseFuse3D-KI framework can be directly applied to interpolate in-the-wild human-centric keyframes. Below, we illustrate a simple pipeline that linearly interpolates the SMPL-X parameters of the human body:
  • Step 1: Fit SMPL-X models to the humans in the input keyframes using a 3D human model estimator such as SMPLer-X.
  • Step 2: Linearly interpolate the SMPL-X parameters to obtain intermediate SMPL-X poses.
  • Step 3: Map SMPL-X joints to 2D DWPose keypoints.
  • Step 4: Get the intermediate frames using PoseFuse3D-KI!
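Steps 1–3 above can be sketched as follows. This is a minimal illustration, assuming SMPL-X parameters are given as plain numpy vectors keyed by name (e.g. `body_pose`, `betas`); note that pose parameters are axis-angle rotations, so plain linear interpolation is only a rough approximation (slerp on the rotations would be more faithful), and the mapping to DWPose keypoints and the diffusion inference are omitted.

```python
import numpy as np

def interpolate_smplx_params(params_start, params_end, num_frames):
    """Linearly interpolate each SMPL-X parameter vector between two
    keyframes, returning only the intermediate frames.

    params_start / params_end: dicts mapping parameter names
    (e.g. "body_pose", "betas") to numpy arrays of matching shape.
    """
    frames = []
    # Endpoints t=0 and t=1 are the keyframes themselves; skip them.
    for t in np.linspace(0.0, 1.0, num_frames + 2)[1:-1]:
        frames.append({k: (1.0 - t) * params_start[k] + t * params_end[k]
                       for k in params_start})
    return frames
```

Each interpolated parameter set would then be rendered to an SMPL-X mesh, its joints mapped to 2D DWPose keypoints, and both signals fed to PoseFuse3D-KI to generate the intermediate frames.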
Each example below shows the start keyframe, the end keyframe, the interpolated SMPL-X, the extracted DWPose, and the interpolated videos.