SCAIL-2

Unifying Controlled Character Animation with End-to-end In-Context Conditioning
¹Tsinghua University  ·  ²Z.ai    *Equal contribution  ·  Tech lead  ·  Corresponding author

SCAIL-2 in Action

End-to-end character animation across diverse tasks, no skeleton intermediates required.

Multi-character animation
Cross-identity character replacement
Cross-identity character animation
Character replacement
Multi-character replacement

Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior work relies heavily on intermediate representations, including pose skeletons to represent motion and masked backgrounds to represent the environment, which inevitably lose information. Skeleton maps are inherently ambiguous in complex scenes; character masks limit body-shape flexibility; and overlapping skeletons with ambiguous depth are misinterpreted in multi-character interactions.

To address this, we present SCAIL-2, a framework that bypasses these intermediates and achieves end-to-end character animation. By directly concatenating driving-video latents to the input sequence, the model obtains all required visual information from the input itself. To overcome the lack of end-to-end data, we unify the sub-tasks of character animation under decoupled conditions and build a pipeline that synthesizes MotionPair-60K, a heterogeneous dataset of 60K motion pairs spanning animation, replacement, and multi-character tasks. We introduce in-context mask conditioning and mode-specific RoPE as unified soft guidance. To mitigate synthetic-data bias in detailed regions (e.g., fingers), we propose Bias-Aware DPO for post-training refinement. Extensive experiments demonstrate that SCAIL-2 substantially outperforms existing state-of-the-art approaches across all tasks, while unlocking emergent zero-shot capabilities such as animal-driven animation and mesh-based control.


Unified End-to-End Architecture

SCAIL-2 builds a unified motion transfer interface on top of a latent video diffusion model, replacing brittle skeleton intermediates with direct visual conditioning.
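
As a rough illustration of what direct visual conditioning can look like, the sketch below appends reference and driving-video latent tokens to the noisy target tokens along the sequence axis, so that attention can read them in context. The token type, the `build_sequence` helper, the segment ordering, and the segment labels are all assumptions made for illustration; they are not taken from the SCAIL-2 implementation.

```python
from typing import List, Tuple

Token = List[float]  # one latent token (hypothetical feature vector)

def build_sequence(
    target: List[Token], reference: List[Token], driving: List[Token]
) -> Tuple[List[Token], List[str]]:
    """Concatenate the three token groups along the sequence axis.

    Returns the joint sequence plus a per-token segment label, so a later
    stage can tell which tokens are denoised and which are context only.
    """
    dim = len(target[0])
    for tok in target + reference + driving:
        if len(tok) != dim:
            raise ValueError("all tokens must share one feature dimension")
    sequence = target + reference + driving
    labels = (["target"] * len(target)
              + ["reference"] * len(reference)
              + ["driving"] * len(driving))
    return sequence, labels

# Toy example: 4 noisy target tokens, 1 reference token, 4 driving tokens.
target = [[0.0, 0.0] for _ in range(4)]
reference = [[1.0, 1.0]]
driving = [[0.5, -0.5] for _ in range(4)]
sequence, labels = build_sequence(target, reference, driving)
```

In this toy setup nothing is extracted from the driving video beforehand: its latents enter the sequence unchanged, which is the property the skeleton-free design relies on.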

SCAIL-2 Data Pipeline

Agentic data synthesis pipeline for MotionPair-60K

SCAIL-2 Network Architecture

Unified in-context conditioning with mode-specific RoPE

📦 MotionPair-60K Dataset

We use SCAIL-Preview, Wan-Animate, and MoCha as generators in our pipeline to synthesize 60K heterogeneous motion pairs spanning animation, replacement, and multi-character tasks.

🎭 In-Context Mask Conditioning

Two mask channel types, an environment switch and character binding slots, enable task unification with proper guidance.
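
One way to picture the two channel types is as extra per-position channels attached to the context. The sketch below assumes the environment switch is a single constant channel (1 when the environment should be kept from the driving video, as in replacement, and 0 otherwise) and that each binding slot is a binary channel marking one character. The channel layout, `MAX_SLOTS`, and the `mask_channels` helper are hypothetical and only meant to make the idea concrete.

```python
from typing import List, Sequence

MAX_SLOTS = 2  # hypothetical number of character binding slots

def mask_channels(
    character_ids: Sequence[int], keep_environment: bool
) -> List[List[int]]:
    """Build [environment switch, slot 0, ..., slot K-1] for each position.

    character_ids[i] is the slot index of the character covering
    position i, or -1 for background.
    """
    rows = []
    for cid in character_ids:
        if cid >= MAX_SLOTS:
            raise ValueError(f"character id {cid} exceeds {MAX_SLOTS} slots")
        # One binary channel per slot; background leaves every slot at 0.
        slots = [1 if cid == k else 0 for k in range(MAX_SLOTS)]
        rows.append([int(keep_environment)] + slots)
    return rows

# Two characters plus background, in a replacement-style mode.
channels = mask_channels([-1, 0, 0, 1, -1], keep_environment=True)
```

Under these assumptions, switching between animation and replacement only flips the first channel, while the slot channels tell the model which reference identity belongs to which region.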

🔄 Mode-Specific Context RoPE

Dedicated RoPE position encodings per task mode allow a single model to correctly route spatio-temporal attention across animation and replacement.
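
The sketch below shows standard one-dimensional rotary position embedding and one possible way to make it mode-specific: shifting the positions of context tokens by a per-mode offset. The offset values, the `MODE_OFFSET` table, and the `context_rope` helper are assumptions for illustration. Video models typically use a multi-axis RoPE over time, height, and width, and the page does not state how SCAIL-2 assigns positions per mode.

```python
import math
from typing import Dict, List

# Hypothetical per-mode offsets applied to context-token positions.
MODE_OFFSET: Dict[str, int] = {"animation": 1000, "replacement": 2000}

def rope(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Apply standard rotary position embedding to an even-length vector."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("RoPE needs an even feature dimension")
    out: List[float] = []
    for i in range(0, dim, 2):
        # Each feature pair rotates at its own frequency base^(-i/dim).
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * cos_a - y * sin_a, x * sin_a + y * cos_a]
    return out

def context_rope(vec: List[float], position: int, mode: str) -> List[float]:
    """Rotate a context token at a position shifted by its task mode."""
    return rope(vec, position + MODE_OFFSET[mode])

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, 0.5]
k = [0.2, -0.3, 0.7, 0.1]
anim = context_rope(q, 3, "animation")
repl = context_rope(q, 3, "replacement")
```

Because RoPE attention scores depend only on relative position, placing the context of each mode in its own position range changes how target tokens attend to it without adding any new parameters.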

✋ Bias-Aware DPO

A post-training DPO scheme targets synthetic-data bias concentrated in fine-grained regions (especially hands/fingers), improving end-to-end motion fidelity in detailed areas.
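
To make the idea concrete, here is a minimal sketch of a preference loss in the style of Diffusion-DPO, where the denoising error is up-weighted inside a hand mask before the preferred and rejected samples are compared. The region weighting, the `hand_weight` value, and both helper names are assumptions; the page does not specify how Bias-Aware DPO incorporates the bias, so this should be read as one plausible reading rather than the authors' formulation.

```python
import math
from typing import Sequence

def weighted_error(err: Sequence[float], hand_mask: Sequence[int],
                   hand_weight: float = 4.0) -> float:
    """Weighted mean of per-position errors, up-weighting the hand mask.

    hand_weight is a hypothetical knob, not a value from the paper.
    """
    weights = [hand_weight if m else 1.0 for m in hand_mask]
    return sum(w * e for w, e in zip(weights, err)) / sum(weights)

def dpo_loss(err_w: float, err_w_ref: float, err_l: float, err_l_ref: float,
             beta: float = 1.0) -> float:
    """Preference loss: -log(sigmoid(beta * margin)).

    Lower denoising error stands in for higher likelihood, so the margin is
    the reference-relative error reduction on the preferred sample (w) minus
    that on the rejected sample (l).
    """
    z = beta * ((err_w_ref - err_w) - (err_l_ref - err_l))
    # Numerically stable softplus(-z).
    return math.log1p(math.exp(-z)) if z > 0 else -z + math.log1p(math.exp(z))

# Toy per-position squared errors; the last two positions are hand pixels.
hand_mask = [0, 0, 1, 1]
policy_w, ref_w = [0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]
policy_l, ref_l = [0.1, 0.1, 0.5, 0.5], [0.1, 0.1, 0.4, 0.4]

def loss_with(hand_weight: float) -> float:
    errs = [weighted_error(e, hand_mask, hand_weight)
            for e in (policy_w, ref_w, policy_l, ref_l)]
    return dpo_loss(*errs)

biased_loss = loss_with(4.0)   # hands count four times as much
uniform_loss = loss_with(1.0)  # plain mean over all positions
```

In this toy case the two samples differ only in the hand region, so up-weighting that region enlarges the preference margin and strengthens the training signal exactly where the synthetic data is said to be weakest.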


Single-Character Animation

SCAIL-2 compared with state-of-the-art pose-driven animation methods and proprietary services. Each video shows multiple methods side by side.

Cross-identity complex motion on Studio-Bench
Dancing videos on X-Dance benchmark
Camera following with character movements
Detailed motion when arms overlap

Multi-Character Animation

When multiple people interact closely, depth-ambiguous overlapping skeletons cause misinterpretation in existing methods. SCAIL-2 handles interactions correctly end-to-end.

End-to-end with proper identity isolation
Precise interactions with identity isolation

Character Replacement

Replace a character in an existing video with a reference identity. SCAIL-2 achieves seamless environment integration and accurate motion without background-inpainting masks, and it surpasses MoCha, one of our data generators.

Detailed human-object interaction
Occluded characters
Human-object interaction + cross-identity
Human-object interaction + difficult tracking
Complex motion + cross-identity

Zero-Shot Capabilities

Because SCAIL-2 learns from visual context rather than skeleton semantics, it generalizes to driving sources that lie entirely outside the training data, including animals and egocentric videos.

Note: both videos are fully zero-shot and out-of-distribution, so artifacts may appear.
Zero-shot animal driving
Zero-shot egocentric driving

BibTeX

@article{yan2025scail2,
  title={SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning},
  author={Yan, Wenhao and Guo, Fengjia and Yang, Zhuoyi and Tang, Jie},
  year={2025}
}

@article{yan2025scail,
  title={SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations},
  author={Yan, Wenhao and Ye, Sheng and Yang, Zhuoyi and Teng, Jiayan and Dong, ZhenHui and Wen, Kairui and Gu, Xiaotao and Liu, Yong-Jin and Tang, Jie},
  journal={arXiv preprint arXiv:2512.05905},
  year={2025}
}