AvatarForcing teaser figure

We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step under constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks to ensure smooth transitions. Real-time one-step inference is enabled by two-stage streaming distillation with offline ODE backfill and distribution matching.
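The sliding-window schedule described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it tracks only integer noise levels rather than latents, it assumes the levels are staggered linearly from the oldest block (nearly clean) to the newest (pure noise), and the name `simulate_stream` is hypothetical. The learned denoiser, the anchors, and the audio conditioning are all omitted.

```python
from collections import deque


def simulate_stream(window_len: int, num_steps: int):
    """Simulate which block is emitted at each step of a sliding window.

    Assumption: the block in slot k (0 = oldest) starts k + 1 denoising
    passes away from clean, so levels are staggered linearly.
    """
    # Each entry is [block_id, remaining_noise_level].
    window = deque([block_id, block_id + 1] for block_id in range(window_len))
    next_id = window_len
    emitted = []
    for _ in range(num_steps):
        # One joint pass lowers every block in the window by one level.
        for entry in window:
            entry[1] -= 1
        # The oldest block is now clean and leaves the window.
        block_id, level = window.popleft()
        assert level == 0
        emitted.append(block_id)
        # A fresh pure-noise block enters at the far end of the window.
        window.append([next_id, window_len])
        next_id += 1
    return emitted, [tuple(entry) for entry in window]


emitted, window = simulate_stream(window_len=4, num_steps=3)
print(emitted)  # [0, 1, 2]
print(window)   # [(3, 1), (4, 2), (5, 3), (6, 4)]
```

In this toy model each step touches exactly `window_len` blocks and emits exactly one, which is the sense in which per-step cost stays constant however long the stream runs.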

Comparison to Contemporary Baselines

Direct comparisons with state-of-the-art methods show AvatarForcing's stronger lip synchronization and temporal consistency across a range of challenging scenarios.

Echo Mimic
Hallo3
Hunyuan Avatar
Fantasy Talking
Omni Avatar
Multitalk
Stable Avatar
Live Avatar
Wan-S2V
Ours

Demo Gallery

Phoneme-Level Lip Sync

Tight close-ups highlighting frame-accurate mouth articulation and viseme timing aligned to audio across diverse speakers.

Scene Variety & Global Coherence

Richer contexts (street, studio, home) with long-range identity stability and co-speech motion consistency.

Minute-Level Long Video Generation

Extended talking-avatar generation that maintains temporal consistency and visual quality over minute-long sequences using our sliding-window denoising approach.

Ablation Studies

Ablations diagnose what drives stable long-form streaming and real-time efficiency in AvatarForcing, and how allocating compute between local-future look-ahead (L) and per-step refinement (N) shapes the quality–latency trade-off.

Qualitative Ablations (Long Rollouts)

We compare long-rollout stability across anchor and alignment ablations: removing the style or temporal anchor increases drift/flicker (Tab. 4 / Fig. 6), removing anchor-audio zero padding causes mouth jitter/artifacts, and removing RoPE re-indexing leads to gradual appearance/color drift (Tab. 6 / Fig. 8). Against one-step baselines (Self-Forcing (1-step), Causal ODE), AvatarForcing preserves sharper motion with less drift/blur (Tab. 5 / Fig. 7).

L/N Decoupling Sweep

The sweep in Fig. 4 / Tab. 3 fixes the dual-anchor design and varies the window length L and the number of denoising steps N to track how latency trades off against stability. Larger L tends to improve long-horizon consistency more than simply stacking denoising passes, but an excessively large L can over-smooth motion, as the merged comparison below shows.

B=1, N=4, 28.19ms
B=2, N=1, 19.5ms
B=2, N=2, 33.19ms
B=4, N=1, 34.14ms
B=4, N=2, 69.45ms
B=8, N=1, 59.67ms
B=4, N=4, 166.42ms
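To make the latency figures listed above easier to compare, the following sketch computes how the reported latency changes when one setting is doubled and the other is held fixed. The numbers are copied from the labels above; the dictionary keys follow the labels' `(B, N)` notation as given, and the helper name `ratio` is hypothetical.

```python
# Reported latencies in milliseconds, keyed by (B, N) as labeled above.
latency_ms = {
    (1, 4): 28.19,
    (2, 1): 19.5,
    (2, 2): 33.19,
    (4, 1): 34.14,
    (4, 2): 69.45,
    (8, 1): 59.67,
    (4, 4): 166.42,
}


def ratio(cfg_a, cfg_b):
    """Return the latency of cfg_a relative to cfg_b."""
    return latency_ms[cfg_a] / latency_ms[cfg_b]


# Doubling B at fixed N=1 versus doubling N at fixed B=4.
double_window = ratio((8, 1), (4, 1))
double_steps = ratio((4, 2), (4, 1))
print(f"B 4->8 at N=1: {double_window:.2f}x")
print(f"N 1->2 at B=4: {double_steps:.2f}x")
```

For these figures, doubling B from 4 to 8 at N=1 raises latency by roughly 1.75x, while doubling N from 1 to 2 at B=4 raises it by roughly 2.03x. This covers only the listed configurations and says nothing about visual quality.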

BibTeX

@misc{cui2026avatarforcingonestepstreamingtalking,
      title={AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising}, 
      author={Liyuan Cui and Wentao Hu and Wenyuan Zhang and Zesong Yang and Fan Shi and Xiaoqiang Liu},
      year={2026},
      eprint={2603.14331},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.14331}, 
}