SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

Wenhao Yan1*†, Sheng Ye1*†, Zhuoyi Yang1,2‡, Jiayan Teng1,2, ZhenHui Dong1, Kairui Wen1, Xiaotao Gu2, Yong-Jin Liu, Jie Tang
1Tsinghua University, 2Z.ai
*Equal contribution. †Work done during internship at Z.ai. ‡Project leader. §Corresponding author.

SCAIL enables high-fidelity character animation under diverse and challenging conditions.

Abstract

Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in in-the-wild scenarios involving complex motion and cross-identity animation. In this work, we present SCAIL (Studio-grade Character Animation via In-context Learning), a framework designed to address these challenges through two key innovations. First, we propose a novel 3D pose representation that provides a robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline that ensures both diversity and quality, and establish a comprehensive benchmark for systematic evaluation. Experiments show that SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.

Method

3D-Consistent Pose Representation

Exploration of Different Injection Methods

Full-Context Pose Injection with P-RoPE within DiT Architecture

SCAIL builds on the Wan-I2V models and incorporates a 3D-consistent pose representation to learn precise, identity-agnostic motion. After comparing different injection methods, we adopt full-context pose injection so that the model learns spatio-temporal motion characteristics. We use Pose-shifted RoPE (P-RoPE) to help the model learn the spatio-temporal relations between video tokens and pose tokens.
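As a rough illustration of the idea, the sketch below builds 3D (frame, row, column) position indices for video tokens and for pose tokens that share one attention context. It is not the SCAIL implementation: the function names, the choice to shift pose tokens along the width axis, and the shift size are all assumptions made for this example, since the page does not specify how the shift is defined.

```python
import numpy as np


def grid_positions(frames, height, width, shift=(0, 0, 0)):
    """Return (frames*height*width, 3) integer (t, h, w) position ids.

    `shift` is added to every position. Hypothetical helper for illustration.
    """
    t, h, w = np.meshgrid(
        np.arange(frames), np.arange(height), np.arange(width), indexing="ij"
    )
    pos = np.stack([t, h, w], axis=-1).reshape(-1, 3)
    return pos + np.asarray(shift)


def full_context_positions(frames, height, width):
    """Position ids for video tokens followed by pose tokens in one sequence.

    Assumption: pose tokens keep the same frame and row index as the video
    token they correspond to, and are offset by `width` along the column axis
    so that the two token sets never share a rotary position.
    """
    video_pos = grid_positions(frames, height, width)
    pose_pos = grid_positions(frames, height, width, shift=(0, 0, width))
    # Full-context injection: both token sets are concatenated along the
    # sequence axis, so every video token can attend to every pose token.
    return np.concatenate([video_pos, pose_pos], axis=0)


if __name__ == "__main__":
    pos = full_context_positions(frames=2, height=3, width=4)
    print(pos.shape)
```

These indices would then feed a standard rotary embedding. Under this assumed layout, a pose token and its matching video token differ by a fixed offset on a single axis, which is one way the relative encoding could expose their correspondence to attention.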

Results Gallery

Comparison on Self-Driven Complex Motion

Ballet

Straddle

Comparison on Cross-Driven Complex Motion

Human Motion -> Human Character.

Acrobats

Expressive Body Movements

Occluded Postures

Fighting Scenes

Comparison on Cross-Driven Complex Motion

Human Motion -> Anime Character.

Motion of Nonstandard Figures

Anime Characters' Interactions

More Examples

Examples on Studio-Bench.


More in-the-wild examples.

BibTeX

@article{yan2025scail,
  title={SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations},
  author={Yan, Wenhao and Ye, Sheng and Yang, Zhuoyi and Teng, Jiayan and Dong, ZhenHui and Wen, Kairui and Gu, Xiaotao and Liu, Yong-Jin and Tang, Jie},
  journal={arXiv preprint arXiv:2512.05905},
  year={2025}
}