Alibaba's HappyHorse-1.0 became available on fal.ai on April 26, giving creators API access to the AI video model that currently ranks first on the Artificial Analysis Video Arena leaderboard for both text-to-video and image-to-video. It generates 1080p clips with synchronized native audio in a single pass.

For the broader landscape, see our complete guide to AI video generation in 2026.

What Happened

HappyHorse-1.0, built by Alibaba's ATH AI Innovation Unit (Taotian Group), launched on fal.ai on April 26 at 9 PM PST. Since its initial appearance the model had been available only through the third-party happyhorse.app, so this launch is the first time creators can access it through a standard API with per-second billing and no platform lock-in.

Alibaba revealed the model's origin on April 10, after it surfaced anonymously on the Artificial Analysis leaderboard around April 7 and quickly reached #1 in blind human preference voting.

Why It Matters

The leaderboard rank matters because Artificial Analysis uses blind voting -- voters do not see which model they are evaluating. HappyHorse-1.0 holds a 107-point Elo lead over the second-ranked model in text-to-video (without audio), which translates to users preferring its output roughly 65% of the time in head-to-head comparisons. That kind of gap in a blind test is a credible quality signal.
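The 65% figure follows from the standard Elo expected-score formula: a rating gap of Δ implies a win probability of 1 / (1 + 10^(-Δ/400)). A quick sketch of that arithmetic:

```python
# Convert an Elo rating gap into an expected head-to-head win rate
# using the standard Elo expected-score formula.
def elo_win_probability(rating_gap: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

# The 107-point T2V lead cited above:
print(f"{elo_win_probability(107):.1%}")  # ~64.9%, i.e. roughly 65%
```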

The architectural differentiator is joint audio-video generation. Most competing models run audio as a separate post-processing step, which introduces timing drift. HappyHorse generates dialogue, ambient sound, and Foley effects in the same forward pass as the video, which is why it achieves lower word error rates than LTX 2.3 and OVI 1.1 on multilingual lip-sync benchmarks.

Practically speaking, this matters for talking-head content, multilingual marketing, and any use case where audio-visual sync failure is a visible problem. It does not have the longest max duration (capped at 10 seconds in the UI) or the lowest per-second cost, but for output quality and audio sync, it currently leads the field.

Key Details

  • Leaderboard: #1 text-to-video and image-to-video on Artificial Analysis Video Arena (Elo 1,360 T2V, 1,403 I2V)
  • Architecture: 15B-parameter unified Transformer, joint audio-video single-pass generation
  • Resolution: 720p and 1080p; aspect ratios 16:9, 9:16, 1:1, 4:3
  • Duration: 3-15 seconds supported; UI currently limits to 5 or 10 seconds
  • Audio: Native dialogue, ambient, and Foley -- multilingual lip-sync in English, Mandarin, Cantonese, Japanese, Korean, German, French
  • fal.ai pricing: $0.14/second at 720p, $0.28/second at 1080p
  • API: Also available via Alibaba Cloud's Bailian platform as of April 27

What to Do Next

Try HappyHorse-1.0 at fal.ai/happyhorse-1.0 -- no subscription required, pay per second of generated video. For best results with audio, enable audio generation and include language and sound direction in your prompt (for example: "English dialogue, street ambient sound bed"). The model's strongest use cases are talking-head video, animated stills with synced speech, and multilingual localization where other models produce drift.
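A minimal request through fal.ai's Python client might look like the sketch below. The endpoint ID, argument names (`resolution`, `duration`, `generate_audio`), and response shape are assumptions inferred from the feature list above, not documented values; check the model page for the actual schema.

```python
# Sketch of a HappyHorse-1.0 request via the fal.ai Python client
# (pip install fal-client). Endpoint ID and argument names are
# assumptions -- verify against the model page before use.
import os

def build_arguments(prompt: str, resolution: str = "1080p",
                    duration_seconds: int = 5, audio: bool = True) -> dict:
    # Per the prompt guidance above, include language and sound
    # direction directly in the prompt text.
    return {
        "prompt": prompt,
        "resolution": resolution,      # "720p" or "1080p"
        "duration": duration_seconds,  # UI currently allows 5 or 10
        "generate_audio": audio,       # hypothetical flag name
    }

if __name__ == "__main__" and os.environ.get("RUN_FAL_DEMO"):
    import fal_client  # requires FAL_KEY in the environment
    result = fal_client.subscribe(
        "fal-ai/happyhorse-1.0",  # assumed endpoint ID
        arguments=build_arguments(
            "A street vendor greets customers. English dialogue, "
            "street ambient sound bed."
        ),
    )
    print(result)
```

Billing is per second of generated video, so shorter `duration` values are the cheap way to iterate on a prompt before committing to a full clip.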

At $0.28/second for 1080p, a 5-second clip costs $1.40. That is higher than Kling 3.0 Pro ($0.55-0.85 for 5 seconds) but on par with Seedance 2.0 for equivalent resolution. Run a 5-second test before committing to longer clips or batch production.
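The per-clip cost arithmetic is simple enough to script; a small helper using the fal.ai rates quoted above:

```python
# Estimate fal.ai generation cost from the published per-second rates.
RATE_PER_SECOND = {"720p": 0.14, "1080p": 0.28}

def clip_cost(seconds: float, resolution: str = "1080p") -> float:
    return round(seconds * RATE_PER_SECOND[resolution], 2)

print(clip_cost(5, "1080p"))   # 1.4 -- the $1.40 figure above
print(clip_cost(10, "720p"))   # 1.4 -- a 10s 720p clip costs the same
```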