End-to-end character animation across diverse tasks, no skeleton intermediates required.
Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior work relies heavily on intermediate representations, such as pose skeletons to represent motion or masked backgrounds to represent the environment, and these inevitably lose information. Skeleton maps are inherently ambiguous in complex scenes; character masks limit body-shape flexibility; and depth-ambiguous overlapping skeletons cause misinterpretation in multi-character interactions.
To address this, we present SCAIL-2, a framework that bypasses those intermediates and achieves end-to-end character animation. By directly concatenating driving-video latents to the token sequence, the model obtains all required visual information from the input itself. To overcome the lack of end-to-end data, we unify the sub-tasks of character animation with decoupled conditions and build a pipeline to synthesize MotionPair-60K, a heterogeneous dataset of 60K motion pairs spanning animation, replacement, and multi-character tasks. We introduce in-context mask conditioning and mode-specific RoPE as unified soft guidance. To mitigate synthetic-data bias in detailed regions (e.g., fingers), we propose Bias-Aware DPO for post-training refinement. Extensive experiments demonstrate that SCAIL-2 substantially outperforms existing state-of-the-art approaches across all tasks, while unlocking emergent zero-shot capabilities such as animal-driven animation and mesh-based control.
SCAIL-2 builds a unified motion transfer interface on top of a latent video diffusion model, replacing brittle skeleton intermediates with direct visual conditioning.
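The in-context conditioning described above can be pictured as plain sequence concatenation: the driving-video latents are flattened into tokens and placed in the same sequence as the tokens being denoised, so attention can read motion directly from them. The following is a minimal NumPy sketch under stated assumptions, not the SCAIL-2 implementation: the function name `build_in_context_sequence`, the segment order, the `segment_ids` bookkeeping, and the toy shapes are all hypothetical.

```python
import numpy as np

def build_in_context_sequence(target_latents, driving_latents, reference_latents):
    """Flatten each latent clip to tokens and join them along the sequence axis.

    Each input has shape (frames, height, width, channels) with the same
    channel count. Returns (tokens, segment_ids); segment_ids marks where each
    token came from: 0 = noisy target, 1 = driving video, 2 = reference.
    The ordering and the ids are illustrative assumptions.
    """
    parts = [target_latents, driving_latents, reference_latents]
    channels = target_latents.shape[-1]
    tokens, segment_ids = [], []
    for seg, latents in enumerate(parts):
        if latents.shape[-1] != channels:
            raise ValueError("all latents must share the channel dimension")
        flat = latents.reshape(-1, channels)  # (frames * height * width, channels)
        tokens.append(flat)
        segment_ids.append(np.full(flat.shape[0], seg, dtype=np.int64))
    return np.concatenate(tokens, axis=0), np.concatenate(segment_ids, axis=0)

# Toy shapes only: 4 latent frames of 8x8 with 16 channels, one reference frame.
rng = np.random.default_rng(0)
target = rng.standard_normal((4, 8, 8, 16))
driving = rng.standard_normal((4, 8, 8, 16))
reference = rng.standard_normal((1, 8, 8, 16))

tokens, segment_ids = build_in_context_sequence(target, driving, reference)
print(tokens.shape)  # (576, 16): 256 target + 256 driving + 64 reference tokens
```

Because the driving clip enters as ordinary tokens rather than as a rendered skeleton, nothing about it is discarded before the model sees it; how those tokens are then told apart (masks, position encodings) is a separate design choice.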
Agentic data synthesis pipeline for MotionPair-60K
Unified in-context conditioning with mode-specific RoPE
The data pipeline uses SCAIL-Preview, Wan-Animate, and MoCha as generators to synthesize 60K heterogeneous motion pairs spanning animation, replacement, and multi-character tasks.
Two types of mask channel, an environment switch and character binding slots, unify the tasks under one conditioning scheme while giving each task appropriate guidance.
Dedicated RoPE position encodings for each task mode let a single model route spatio-temporal attention correctly across animation and replacement.
A post-training DPO scheme targets synthetic-data bias concentrated in fine-grained regions (especially hands/fingers), improving end-to-end motion fidelity in detailed areas.
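This page does not spell out the Bias-Aware DPO objective, so the sketch below should be read only as an illustration of the general idea: a Diffusion-DPO-style preference loss computed on denoising errors, where errors inside a detailed region such as the hands are up-weighted. The region weighting, the `region_gain` knob, and both function names are assumptions made for illustration and are not taken from SCAIL-2.

```python
import numpy as np

def region_weighted_error(pred_noise, true_noise, region_mask, region_gain=4.0):
    """Weighted mean squared denoising error.

    region_mask is 1 inside the detailed region (e.g. hands) and 0 elsewhere.
    region_gain is a hypothetical knob: how much more a masked element counts.
    """
    weights = 1.0 + (region_gain - 1.0) * region_mask
    return float(np.sum(weights * (pred_noise - true_noise) ** 2) / np.sum(weights))

def dpo_loss(err_win_policy, err_win_ref, err_lose_policy, err_lose_ref, beta=1.0):
    """Diffusion-DPO-style preference loss on denoising errors.

    The loss falls when the policy improves on the frozen reference model more
    for the preferred (win) sample than for the rejected (lose) sample.
    """
    margin = (err_win_policy - err_win_ref) - (err_lose_policy - err_lose_ref)
    # -log(sigmoid(-beta * margin)) written in a numerically stable form.
    return float(np.logaddexp(0.0, beta * margin))

# Toy example: the prediction is wrong only inside a 2x2 "hand" region.
true_noise = np.zeros((8, 8))
hand_mask = np.zeros((8, 8))
hand_mask[2:4, 2:4] = 1.0
bad_pred = hand_mask.copy()  # error of 1.0 on the four hand elements

plain = region_weighted_error(bad_pred, true_noise, hand_mask, region_gain=1.0)
weighted = region_weighted_error(bad_pred, true_noise, hand_mask, region_gain=4.0)

neutral = dpo_loss(0.5, 0.5, 0.5, 0.5)     # policy equals reference
improved = dpo_loss(0.3, 0.5, 0.5, 0.5)    # policy better on the preferred sample
```

With the weighting in place, a preference pair that differs mainly in finger quality produces a larger margin than it would under a uniform error, which is one plausible way to focus post-training on the regions where synthetic data is least reliable.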
SCAIL-2 compared with state-of-the-art pose-driven animation methods and proprietary services. Each video shows multiple methods side by side.
When multiple people interact closely, depth-ambiguous overlapping skeletons cause misinterpretation in existing methods. SCAIL-2 handles interactions correctly end-to-end.
Replace a character in an existing video with a reference identity. SCAIL-2 achieves seamless environment integration and accurate motion without background-inpainting masks, and surpasses MoCha, one of our data generators.
Because SCAIL-2 learns from visual context rather than skeleton semantics, it generalizes to driving sources that lie entirely outside the training data, including animal and egocentric videos.
@article{yan2025scail2,
title={SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning},
author={Yan, Wenhao and Guo, Fengjia and Yang, Zhuoyi and Tang, Jie},
year={2025}
}
@article{yan2025scail,
title={SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations},
author={Yan, Wenhao and Ye, Sheng and Yang, Zhuoyi and Teng, Jiayan and Dong, ZhenHui and Wen, Kairui and Gu, Xiaotao and Liu, Yong-Jin and Tang, Jie},
journal={arXiv preprint arXiv:2512.05905},
year={2025}
}