Turn a Grimm fairy tale into an animated children's video, with no 3D studio and no animation team.
Characters come from an image model, motion from video models, voices from text-to-speech, and the final cut from
ffmpeg. This page explains what we built, shows the characters and videos,
compares the models, and honestly documents the challenges on the way to the first finished scene.
The Project
The same fairy tale, two depths — and consistently made child-friendly.
The goal is an animated video of "The Bremen Town Musicians" (Brothers Grimm) in two versions: a very simple one for ages 1–3 (~5–6 min, short sentences, lots of repetition, onomatopoeia) and an elaborated one for ages 4–6 (~10 min, real dialogue, character humour, slapstick).
In the original, the animals are to be killed — that is consistently defused: no threat, the animals are simply "old and no longer needed" and find a new purpose. The robbers aren't evil but clumsy and cowardly (slapstick instead of scares).
Current status: proof of concept passed. The entire pipeline is proven and a first complete, voiced, text-faithful scene exists. Next up: produce the screenplay scene by scene.
The Characters
Every character is created once as a still in a uniform picture-book watercolour style — the donkey is the style anchor, his image sets the look for all the others. All four animals are gently aged (greying muzzle, soft wrinkles), fitting the "too old" theme.
Per character: a pose library
The key to character stability (more below): every character gets several views. Identity = front · face close-up · profile · rear. Action = a matching pose per motion (e.g. walking, with the lute on its strap instead of in the hands). Just like an animation studio keeps a model sheet per action.

The Pipeline
None of this needs a web UI or a human clicking somewhere — everything runs headless as a Node
script that calls the APIs and assembles with ffmpeg.
scripts/generate.mjs: Batch generator via the nanobanana CLI (Gemini). Style-anchor, composition and edit modes. Archives predecessors before overwriting.
scripts/i2v.mjs: Image-/reference-to-video via Atlas Cloud. Picks the schema per model automatically (Kling vs. Seedance), multi-ref as a base64 array, robust polling.
scripts/build-scene.mjs: Reads a screenplay (JSON), generates N variants per shot, creates the ElevenLabs voices and cuts everything together audio-driven. Also writes a readable DREHBUCH.md.
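The "robust polling" mentioned for scripts/i2v.mjs is not shown on this page. The following is a minimal sketch of what such a loop could look like, not the actual script. The status values ("completed", "failed"), the response shape, the limits, and the caller-supplied `fetchStatus` function are all assumptions made for illustration; they are not taken from the Atlas Cloud API.

```javascript
// Sketch only: status values and response shape are assumptions for
// illustration, not the real Atlas Cloud schema.

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Parse a body that should be JSON but may be an HTML gateway page.
function parseJsonOrNull(bodyText) {
  try {
    return JSON.parse(bodyText);
  } catch {
    return null; // not JSON (e.g. an HTML error page): treat as transient
  }
}

// Poll until the job finishes. `fetchStatus` is a caller-supplied async
// function returning { httpStatus, bodyText }. `sleep` is injectable so the
// loop can be tested without waiting.
async function pollJob(
  fetchStatus,
  { intervalMs = 5000, maxPolls = 120, maxTransientErrors = 5, sleep = defaultSleep } = {}
) {
  let transientErrors = 0; // counts consecutive unusable responses
  for (let i = 0; i < maxPolls; i++) {
    const { httpStatus, bodyText } = await fetchStatus();
    const ok = httpStatus >= 200 && httpStatus < 300;
    const json = ok ? parseJsonOrNull(bodyText) : null;
    if (json === null) {
      transientErrors += 1;
      if (transientErrors > maxTransientErrors) {
        throw new Error(`giving up after ${transientErrors} unusable responses in a row`);
      }
    } else {
      transientErrors = 0;
      if (json.status === 'completed') return json;
      if (json.status === 'failed') throw new Error('job reported failure');
    }
    await sleep(intervalMs);
  }
  throw new Error('job did not finish within the polling budget');
}
```

The design choice here is to treat a non-JSON body or a non-2xx status as a temporary condition and only give up after several such responses in a row, while a job that reports failure itself is surfaced immediately.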
{
  "id": "fenster", "duration": 6, "variants": 2, "pick": 2,
  "refs": [
    "assets/characters/esel.jpg",
    "assets/characters/refs/esel_profile.jpg",
    "assets/characters/raeuber.jpg",
    "assets/characters/stube.jpg",
    "assets/characters/haus.jpg"
  ],
  "prompt": "Night exterior, locked static camera... the old grey donkey peers inside the lit window... THROUGH THE WINDOW three funny scruffy robbers sit around a big wooden table laden with food, feasting...",
  "narration": [
    { "speaker": "hund", "text": "Was siehst du denn, Grauschimmel?" },
    { "speaker": "esel", "text": "Einen Tisch voll mit herrlichem Essen — und drei Räuber, die sich's so richtig schmecken lassen!" }
  ]
}
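A shot entry like the one above can be checked before any paid generation starts. The sketch below is an illustration, not code from build-scene.mjs. The field names come from the entry; the specific checks, the reading of "pick" as a 1-based variant number, and the interpretation of "audio-driven" as "a shot lasts at least as long as its spoken lines" are assumptions. `validateShot`, `shotSeconds` and the `gapSeconds` pause are hypothetical names.

```javascript
// Sketch under assumptions: the checks and the 1-based "pick" are inferred
// from the example entry, not taken from build-scene.mjs.

// Return a list of readable problems; an empty list means the shot looks usable.
function validateShot(shot) {
  const problems = [];
  if (typeof shot.id !== 'string' || shot.id === '') {
    problems.push('id must be a non-empty string');
  }
  if (!(shot.duration > 0)) {
    problems.push('duration must be a positive number of seconds');
  }
  if (!Number.isInteger(shot.variants) || shot.variants < 1) {
    problems.push('variants must be an integer >= 1');
  }
  // "pick" may be absent until a variant has been chosen.
  if (shot.pick !== undefined &&
      (!Number.isInteger(shot.pick) || shot.pick < 1 || shot.pick > shot.variants)) {
    problems.push('pick must be between 1 and variants');
  }
  if (!Array.isArray(shot.refs) || shot.refs.length === 0) {
    problems.push('refs must list at least one image');
  }
  if (!Array.isArray(shot.narration)) {
    problems.push('narration must be an array');
  }
  return problems;
}

// One possible reading of "audio-driven": the shot runs for its nominal
// duration or for the length of its narration, whichever is longer.
// `lineSeconds` holds the measured length of each generated voice line.
function shotSeconds(shot, lineSeconds, gapSeconds = 0.5) {
  if (lineSeconds.length !== shot.narration.length) {
    throw new Error('need one measured duration per narration line');
  }
  const pauses = gapSeconds * Math.max(0, lineSeconds.length - 1);
  const speech = lineSeconds.reduce((sum, s) => sum + s, 0) + pauses;
  return Math.max(shot.duration, speech);
}
```

With two voice lines of 2 s and 5 s and a 0.5 s pause between them, the 6-second window shot would be stretched to 7.5 s under this assumption; with shorter lines it keeps its nominal 6 s.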
Models compared
Every model runs through the same Atlas Cloud API. The key insight: no single model wins — each shot type has its model.
| Model | API ID | Price/s | Resolution | Role here |
|---|---|---|---|---|
| Seedance 2.0 ref-to-video | bytedance/seedance-2.0/reference-to-video | ~$0.24 | 720p | Default for scenes: holds all characters (up to 9 refs), composes freely, on-model |
| Kling 3.0 Pro | kwaivgi/kling-v3.0-pro/image-to-video | $0.095 | 1080p | Default for single shots: faithfully animates a composed still, high-res, cheap |
| Seedance 2.0 (full) i2v | bytedance/seedance-2.0/image-to-video | ~$0.24 | 720p | holds identity with an anchor, but single image → no real multi-ref |
| Seedance 2.0 fast | …/seedance-2.0-fast/… | ~$0.022 | 720p | ❌ dropped — breaks the style completely (3D / wrong character) |
| Seedance v1.5 Pro | bytedance/seedance-v1.5-pro/… | $0.047 | — | researched (style-preservation leader), test failed on the content schema |
| Hailuo 2.3 · Vidu Q3 · Wan 2.6 | minimax / vidu / alibaba | $0.018–0.28 | — | researched as cheap / stylised alternatives |
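The per-second prices in the table translate into a cost per shot once duration and variant count are known. The sketch below shows that arithmetic for the two default models. It assumes every generated variant is billed at the full listed price for its full length, which the table does not state; `estimateShotCost` is a hypothetical helper.

```javascript
// Prices per second of generated video, copied from the comparison table.
const PRICE_PER_SECOND = {
  'bytedance/seedance-2.0/reference-to-video': 0.24, // listed as "~$0.24"
  'kwaivgi/kling-v3.0-pro/image-to-video': 0.095,
};

// Assumption: each variant is billed in full, so cost = price * seconds * variants.
// The result is rounded to whole cents.
function estimateShotCost(modelId, seconds, variants) {
  const price = PRICE_PER_SECOND[modelId];
  if (price === undefined) {
    throw new Error(`no price listed for ${modelId}`);
  }
  return Math.round(price * seconds * variants * 100) / 100;
}
```

For a 6-second shot with 2 variants, this gives about $2.88 with Seedance 2.0 ref-to-video and about $1.14 with Kling 3.0 Pro.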
Challenges & solutions
The honest part. Almost every step forward came out of a visible failure. Each card shows Problem → Cause → Solution, many with the before/after video evidence.
Cause: image-to-video + a realism-tuned model + no negative prompt. The model has to invent every unknown view → drift.
Solution: reference-to-video mode (identity lock from several refs) instead of image-to-video.
Bonus: the array schema reference_images:[base64…] also eliminated the 400 errors; the single image_url path can't do multi-ref at all.
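The difference between the two request shapes can be sketched as a small payload builder. Only the field names `reference_images` and `image_url`, the model ID suffixes and the limit of 9 references come from this page. The other fields (`model`, `prompt`, `duration`), the data-URI form of `image_url` and the function name `buildPayload` are assumptions for illustration, not the documented Atlas Cloud schema.

```javascript
// Sketch only: apart from reference_images / image_url, all field names and
// the data-URI encoding are assumptions, not the real API schema.

const MAX_REFS = 9; // "up to 9 refs" per the model comparison table

// `refs` is an array of Buffers holding the already loaded image bytes.
function buildPayload(modelId, { prompt, duration, refs }) {
  const base = { model: modelId, prompt, duration };

  if (modelId.endsWith('/reference-to-video')) {
    if (refs.length < 1 || refs.length > MAX_REFS) {
      throw new Error(`reference-to-video expects 1 to ${MAX_REFS} reference images`);
    }
    // Multi-ref: every reference image goes into one base64 array.
    return { ...base, reference_images: refs.map((buf) => buf.toString('base64')) };
  }

  if (modelId.endsWith('/image-to-video')) {
    if (refs.length !== 1) {
      throw new Error('image-to-video takes exactly one start image');
    }
    // Single image: no room for additional identity references.
    const dataUri = `data:image/jpeg;base64,${refs[0].toString('base64')}`;
    return { ...base, image_url: dataUri };
  }

  throw new Error(`no known request schema for model ${modelId}`);
}
```

Rejecting a multi-image request for an image-to-video model on the client side turns a remote 400 into a local, readable error before any request is sent.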

502/HTML gateway pages instead of JSON, sporadic "internal error" mid-render.
The finished scene
The real milestone: not "does the pipeline work?" but "we can produce scenes on an assembly line" — style-true, character-stable, with German character voices, faithful to the Grimm source.
| # | Shot | Sound |
|---|---|---|
| 1 | Tired rest in the dark woods | Narrator |
| 2 | Spotting the distant light | Rooster + narrator |
| 3 | The approach out of the woods | Narrator |
| 4 | Window: the donkey sees the robbers | Dog + donkey |
Write scenes/<name>.json → run build-scene.mjs. Several variants are generated per shot; pick the best with
"pick" and reassemble with --assemble. Variants are kept, so the choice can be changed later. Every further
scene is built the same way.
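The reassembly step can be pictured as turning the "pick" values into an input list for ffmpeg's concat demuxer. This is a sketch of the idea, not the logic of build-scene.mjs, which also has to handle the voices. The file naming scheme `out/<id>_v<n>.mp4` and the names `variantPath` and `concatList` are hypothetical.

```javascript
// Hypothetical naming scheme for generated variants: out/<id>_v<n>.mp4
function variantPath(shotId, variantNumber) {
  return `out/${shotId}_v${variantNumber}.mp4`;
}

// Build the text of a concat-demuxer list file from the picked variants.
// Each line has the form: file '<path>'
function concatList(shots, pathFor = variantPath) {
  const lines = shots.map((shot) => {
    if (!Number.isInteger(shot.pick) || shot.pick < 1) {
      throw new Error(`shot "${shot.id}" has no valid pick`);
    }
    const path = pathFor(shot.id, shot.pick);
    if (path.includes("'")) {
      // Kept simple on purpose: quoting rules for such paths are not handled here.
      throw new Error(`path contains a single quote: ${path}`);
    }
    return `file '${path}'`;
  });
  return lines.join('\n') + '\n';
}
```

Written to a file such as list.txt, the result can be joined with `ffmpeg -f concat -safe 0 -i list.txt -c copy scene.mp4`, provided all clips share the same codec parameters; otherwise a re-encode is needed instead of `-c copy`.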
Strategic approaches
A fundamental question ran in parallel: shouldn't this really be built in a proper 3D engine? The honest trade-off — and the chosen direction.
Technically yes — but the renderer was never the problem. You'd first need the other 90% (models, rigging, animation) and would be fighting your own picture-book style, which neither engine does natively. The wrong path for our flat watercolour look.
Fully open is conceivable: ComfyUI + SDXL/Flux (with a LoRA per character for real consistency), ControlNet for choreography, local video models (Wan/Hunyuan), Coqui/Piper/Kokoro TTS, rembg, ffmpeg. Catch: local video diffusion is sluggish on the Mac.
Cut-out / puppet animation: rig the cut-out figures as jointed puppets and animate the motion deterministically (Blender headless or HyperFrames). The music pyramid as a physics sim. Consistency is perfect by design, since it's the same art in every shot.
AI where it makes sense: stills (characters, backgrounds), voices, later music/SFX and inspiration. Motion can increasingly move to real animation. The AI video pipeline is what is proven and productive today; it delivers results immediately, while the cut-out track stands alongside as an expansion.