The Bremen Town Musicians

A children's fairy tale, told by an AI video pipeline — driven entirely from the terminal.

Turn Grimm's fairy tale into an animated children's video — with no 3D studio and no animation team. Characters come from an image model, motion from video models, voices from text-to-speech, the cut from ffmpeg. This page explains what we built, shows the characters and videos, compares the models, and honestly documents the challenges on the way to the first finished scene.

7 characters in the cast
5+ video models tested
2 age versions (1–3 / 4–6)
10 challenges solved
37 s: first finished scene

Image: nanobanana / Gemini (character references, backgrounds)
Motion: Atlas Cloud, with Seedance · Kling (i2v / ref-to-video)
Voice: ElevenLabs (native German voices)
Cut: ffmpeg (audio-driven assembly)

The Project

What it's about

The same fairy tale, two depths — and consistently made child-friendly.

The goal is an animated video of "The Bremen Town Musicians" (Brothers Grimm) in two versions: a very simple one for ages 1–3 (~5–6 min, short sentences, lots of repetition, onomatopoeia) and an elaborated one for ages 4–6 (~10 min, real dialogue, character humour, slapstick).

In the original, the animals are to be killed. This version defuses that throughout: there is no threat, the animals are simply "old and no longer needed" and find a new purpose. The robbers aren't evil but clumsy and cowardly (slapstick instead of scares).

Recurring musical motif: each animal is one note — Hee-Haw · Woof · Meow · Cock-a-doodle-doo — together "the band". Usable as a leitmotif across the whole series.

Current status: proof of concept passed. The entire pipeline is proven and a first complete, voiced, text-faithful scene exists. Next up: produce the screenplay scene by scene.

The full cast
Size reference: three human-sized robbers, the upright donkey as the largest animal, then the dog, the rooster, the cat.

The Characters

The cast

Every character is created once as a still in a uniform picture-book watercolour style — the donkey is the style anchor, his image sets the look for all the others. All four animals are gently aged (greying muzzle, soft wrinkles), fitting the "too old" theme.

Donkey (Grauschimmel)
Leader, optimist, has the Bremen idea. Carries a lute.
Voice: Crizz · deep, warm

Dog (Packan)
Loyal, clumsy, always hungry. Kettledrum as his anchor.
Voice: Adrian · strong, cheerful

Cat (Bartputzer)
Elegant, grumpy, sharp tongue.
Voice: still to be cast

Rooster (Rotkopf)
Loud, proud, the "early warner": spots the light.
Voice: Paul · bright, dramatic

The robber gang
Captain + 2 henchmen. Clumsy and cowardly, never scary.
Narrator: Bettina · warm

Robbers' house (night)
Setting background: a warmly lit window in the dark forest.

Robbers' parlour (interior)
Stage for the window peek: a table full of food, candlelight.

Size chart
Defines the relative standing heights of the four musicians.

Group reference
The four in natural poses; defines how the characters relate.

Per character: a pose library

Identity references + action references

The key to character stability (more below): every character gets several views. Identity = front · face close-up · profile · rear. Action = a matching pose per motion (e.g. walking, with the lute on its strap instead of in the hands). Just like an animation studio keeps a model sheet per action.

Identity: donkey, front (anchor)
Identity: donkey, face close-up
Identity: donkey, profile
Identity: donkey, rear
Action: donkey walking, profile, lute on the strap
Action: donkey walking, ¾ view
Action: dog walking, four-legged, drum as anchor
Action: cat walking, four-legged, tail up
Action: rooster strutting on two legs
Sheet: donkey, 4 poses in 1 image (2×2)
Sheet: cast, all 4 animals in 1 image
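
The pose library can be pictured as a small lookup table per character: a fixed set of identity views plus one action reference per motion. The sketch below is only one possible shape, not the project's actual code. The names `poseLibrary` and `refsFor` are hypothetical, and so is every file name except `esel.jpg` and `refs/esel_profile.jpg`, which appear in the screenplay example.

```javascript
// Hypothetical pose library: identity views plus one action reference per motion.
// Only esel.jpg and refs/esel_profile.jpg appear in the source; the rest are placeholders.
const poseLibrary = {
  esel: {
    identity: [
      'assets/characters/esel.jpg',              // front (style anchor)
      'assets/characters/refs/esel_face.jpg',    // hypothetical file name
      'assets/characters/refs/esel_profile.jpg',
      'assets/characters/refs/esel_rear.jpg',    // hypothetical file name
    ],
    actions: {
      walk: 'assets/characters/refs/esel_walk_profile.jpg', // hypothetical file name
    },
  },
};

// Return the identity refs, followed by the matching action ref when one exists.
function refsFor(library, character, action) {
  const entry = library[character];
  if (!entry) throw new Error(`unknown character: ${character}`);
  const actionRef = action ? entry.actions[action] : undefined;
  return actionRef ? [...entry.identity, actionRef] : [...entry.identity];
}
```

A walking shot would then request `refsFor(poseLibrary, 'esel', 'walk')`, so the model sees the lute-on-strap pose alongside the identity views.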

The Pipeline

Four building blocks, three small scripts

None of this needs a web UI or a human clicking somewhere — everything runs headless as a Node script that calls the APIs and assembles with ffmpeg.

① Characters
scripts/generate.mjs

Batch generator via the nanobanana CLI (Gemini). Style-anchor, composition and edit modes. Archives predecessors before overwriting.

② Video
scripts/i2v.mjs

Image-/reference-to-video via Atlas Cloud. Picks the schema per model automatically (Kling vs. Seedance), multi-ref as a base64 array, robust polling.

③ Scene
scripts/build-scene.mjs

Reads a screenplay (JSON), generates N variants per shot, creates the ElevenLabs voices and cuts everything together audio-driven. Also writes a readable DREHBUCH.md.

The proven chain: Atlas upload → Seedance/Kling → polling → download → ElevenLabs DE-TTS → ffmpeg mux. Each shot becomes a segment (video padded to speech length + voices mixed), then joined into the scene via concat.
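
The "padded to speech length, then concat" step could be sketched as follows. This is an assumption about how such an assembly might be done, not the contents of `build-scene.mjs`: the function names and parameters are hypothetical, and the sketch assumes the voices for a shot have already been mixed into one audio file. The ffmpeg pieces it relies on (the `tpad` and `apad` filters and the concat demuxer's `file '...'` list format) are standard ffmpeg features.

```javascript
// Build ffmpeg arguments for one segment: freeze the last video frame until the
// speech has finished, and pad the audio with silence if the video is longer.
function segmentArgs({ video, audio, videoSeconds, speechSeconds, out }) {
  const target = Math.max(videoSeconds, speechSeconds);
  const pad = Math.max(0, target - videoSeconds);
  const filter =
    `[0:v]tpad=stop_mode=clone:stop_duration=${pad.toFixed(2)}[v];[1:a]apad[a]`;
  return [
    '-y', '-i', video, '-i', audio,
    '-filter_complex', filter,
    '-map', '[v]', '-map', '[a]',
    '-t', target.toFixed(2), // cut both streams at the common target length
    out,
  ];
}

// Build the text of a concat-demuxer list file (used with: -f concat -safe 0 -i list.txt).
function concatList(segmentPaths) {
  return segmentPaths
    .map((p) => `file '${p.replace(/'/g, "'\\''")}'`)
    .join('\n') + '\n';
}
```

The argument array can be handed to `child_process.spawn('ffmpeg', args)`. Keeping the argument building separate from the process call makes the padding arithmetic easy to check without running ffmpeg.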
Example: one shot from the screenplay (scenes/annaeherung.json)
{
  "id": "fenster", "duration": 6, "variants": 2, "pick": 2,
  "refs": [
    "assets/characters/esel.jpg",
    "assets/characters/refs/esel_profile.jpg",
    "assets/characters/raeuber.jpg",
    "assets/characters/stube.jpg",
    "assets/characters/haus.jpg"
  ],
  "prompt": "Night exterior, locked static camera... the old grey donkey peers inside the lit window... THROUGH THE WINDOW three funny scruffy robbers sit around a big wooden table laden with food, feasting...",
  "narration": [
    { "speaker": "hund", "text": "Was siehst du denn, Grauschimmel?" },
    { "speaker": "esel", "text": "Einen Tisch voll mit herrlichem Essen — und drei Räuber, die sich's so richtig schmecken lassen!" }
  ]
}

Models compared

Which video model for which shot?

Every model runs through the same Atlas Cloud API. The key insight: no single model wins — each shot type has its model.

| Model | API ID | Price/s | Resolution | Role here |
|---|---|---|---|---|
| Seedance 2.0 ref-to-video | bytedance/seedance-2.0/reference-to-video | ~$0.24 | 720p | Default for scenes: holds all characters (up to 9 refs), composes freely, on-model |
| Kling 3.0 Pro | kwaivgi/kling-v3.0-pro/image-to-video | $0.095 | 1080p | Default for single shots: faithfully animates a composed still, high-res, cheap |
| Seedance 2.0 (full) i2v | bytedance/seedance-2.0/image-to-video | ~$0.24 | 720p | Holds identity with an anchor, but single image → no real multi-ref |
| Seedance 2.0 fast | …/seedance-2.0-fast/… | ~$0.022 | 720p | ❌ Dropped: breaks the style completely (3D / wrong character) |
| Seedance v1.5 Pro | bytedance/seedance-v1.5-pro/… | $0.047 | - | Researched (style-preservation leader), test failed on the content schema |
| Hailuo 2.3 · Vidu Q3 · Wan 2.6 | minimax / vidu / alibaba | $0.018–0.28 | - | Researched as cheap / stylised alternatives |
Model choice by shot type: multi-character scenes, free composition and locomotion → Seedance ref-to-video. Single figure, close-up, or "bring an already-composed still to life" → Kling i2v (1080p, cheaper). On multi-character work Kling is unreliable (drops figures, turns the upright donkey into a quadruped).
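
The routing rule above fits in a few lines. The sketch below is an illustration under assumptions: the shot fields `characters`, `locomotion` and `composedStill` are hypothetical names, and only the two model IDs are taken from the comparison table.

```javascript
// Model IDs as listed in the comparison table.
const MODELS = {
  seedanceRef: 'bytedance/seedance-2.0/reference-to-video',
  klingI2v: 'kwaivgi/kling-v3.0-pro/image-to-video',
};

// Choose a model per shot. The field names on `shot` are hypothetical.
function chooseModel({ characters = 1, locomotion = false, composedStill = false } = {}) {
  // An already-composed still is simply brought to life: Kling i2v.
  if (composedStill) return MODELS.klingI2v;
  // Several figures, or figures travelling through the frame: Seedance ref-to-video.
  if (characters > 1 || locomotion) return MODELS.seedanceRef;
  // Single figure or close-up: Kling i2v (1080p, cheaper).
  return MODELS.klingI2v;
}
```

Checking `composedStill` first reflects the two production paths described later: a group scene can still go to Kling if it has been composed as a real still beforehand.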

The same donkey, five approaches


Drift: Seedance fast, 1 image. Our flat donkey turns into a photorealistic 3D plush toy and lies down.
Drift: Fast + a "2D" prompt. Now flat, but a completely new character (a boy with antlers). Our donkey is gone.
Better: Seedance full + anchor. The donkey is preserved (standing, lute, 2D) but looks a bit "pencilly".
Breakthrough: Seedance ref-to-video, 4 refs + negative. Exactly our watercolour donkey, stable across all frames.
Strong: Kling 3.0 Pro, multi-ref. Holds the same donkey, in 1080p and ~2.5× cheaper.
Scene with everything: Seedance, 6 refs. All four animals on-model + the robbers' house in one coherently composed scene.

Challenges & solutions

What was hard — and how we solved it

The honest part. Almost every step forward came out of a visible failure. Each card lists Problem, Cause and Solution, many with before/after video evidence.

1 Style break: 2D turns into 3D

Problem: Seedance turns the flat picture-book donkey into a photorealistic 3D plush toy or, with a "2D" prompt, invents a brand-new character.
Cause: A single front image + image-to-video + a realism-tuned model + no negative prompt. The model has to invent every unknown view → drift.
Solution: The "yeti method", three ingredients used together (see box).
The 3 ingredients of character stability:
  • Multi-reference instead of a single image — front + face + profile + rear
  • The reference-to-video mode (identity lock from several refs) instead of image-to-video
  • A hard negative prompt in every call against "3D, photoreal, different animal, deer …"

Bonus: the array schema reference_images:[base64…] also eliminated the 400 errors — the single image_url path can't do multi-ref at all.

Before: 1 image → 3D plush
After: 4 refs + negative → our donkey
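
The three ingredients can be combined in one request body, sketched below. Only the `reference_images` array comes from the text above. The field names `prompt` and `negative_prompt`, the function name and the `family` parameter are assumptions and may not match the real Atlas Cloud schema. The negative prompt uses just the terms quoted above (the source list is elided), and the ref limits are the ones reported on this page (up to 9 for Seedance, 4 for Kling).

```javascript
// Terms quoted in the text; the original list continues beyond these.
const NEGATIVE = '3D, photoreal, different animal, deer';

// Reference limits as reported on this page.
const MAX_REFS = { seedance: 9, kling: 4 };

// Build a reference-to-video request body.
// Assumption: `prompt` and `negative_prompt` are hypothetical field names.
function buildRefPayload({ family, prompt, refsBase64, negative = NEGATIVE }) {
  const limit = MAX_REFS[family];
  if (limit === undefined) throw new Error(`unknown model family: ${family}`);
  if (refsBase64.length === 0) throw new Error('at least one reference image is required');
  if (refsBase64.length > limit) {
    throw new Error(`${family} accepts at most ${limit} reference images, got ${refsBase64.length}`);
  }
  return {
    prompt,
    negative_prompt: negative,       // sent with every call
    reference_images: refsBase64,    // array schema instead of a single image_url
  };
}
```

Checking the ref count locally turns a rejected request into an immediate, readable error before any upload happens.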

2 Multi-character scenes: Kling loses figures

Problem: In group scenes Kling drops a figure and turns the upright donkey into a quadruped like a real horse.
Cause: Kling i2v uses the (first) image as a start frame and isn't built to compose a coherent multi-figure scene from separate refs.
Solution: Route group/story scenes to Seedance ref-to-video, which holds all four (up to 9 refs) and composes freely.
Kling: donkey on all fours, dog missing
Seedance: all four, on-model

3 Kling's 4-ref limit & the literally-animated sheet

Problem: Kling accepts only 4 reference images (our 6 were rejected). And a composite pose sheet gets animated as a whole: the 2×2 grid wobbles instead of one figure being extracted from it.
Cause: Kling animates exactly the input image as a canvas. Seedance instead uses the refs as an identity pool and builds a new scene. Aspect ratio is sensitive too (a 4×1 strip → "aspect ratio invalid").
Solution: Two clean production paths: (A) Seedance composes directly from many refs. (B) For Kling, first compose the scene as a real still (nanobanana composition mode), then animate. Keep sheets square (2×2).
Evidence: Kling animates the entire 2×2 donkey sheet, giving four wobbling donkeys instead of one.

4 Extra arms while walking

Problem: When walking, the donkey grows extra arms that don't belong.
Cause: All identity refs show him upright with "arms" (playing the lute). The model mixes that arm pose with walking legs → anatomy errors.
Solution: Provide action references per motion: a walking profile with the lute on its strap (not in the hands) + a sharpened negative against "extra arm, holding the lute". Every character needs a pose library per action.
Action ref: upright gait, lute on the strap
After: clean walk, 2 arms, 2 legs

5 "The house runs away" — camera drift instead of locomotion

Problem: The animals don't really move: the background/house slides out of frame while the animals march in place.
Cause: The prompt said "the camera follows them" → Seedance moves the camera (a cheap background slide) instead of having the figures travel real distance. AI video is notoriously weak at "walk toward a distant goal + camera move".
Solution: Locked camera + lateral walk: the animals wander across the frame (left → right), house fixed. Or the dynamic depth variant: appear small at the back of the woods and walk forward, growing larger, which covers real distance.
Before: house drifts away, marching in place
After: locked camera, real travel
Hero shot: dynamic, 10 s, out of the woods, small → large (establishing + approach in one).

6 Geography & text fidelity

Problem: A single shot merged "spot the light from afar" and "arrive at the house", so the geography collapsed. And through the window you only saw food, not the robbers at the table (the Grimm core beat).
Cause: Too much action in one shot, and the robber reference was missing as the window content.
Solution: Split the beats into separate shots (spot from afar → approach → near window peek). Provide robber refs + the parlour as window content. Never drop story core beats.
Beat 1: a small, distant light in the woods
Beat 4: robbers at the laid table (faithful to the text)

7 Transient API errors & robust polling

Problem: Occasional 502/HTML gateway pages instead of JSON, sporadic "internal error" mid-render.
Cause: Cloud backend hiccups, not a schema problem.
Solution: Harden the poller (on non-JSON, just keep polling instead of crashing) and simply restart failed renders. Baked into the script.
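
A hardened poller along these lines might look like the sketch below. It is not the project's script. The response fields `status`, `url` and `error` and the status values are hypothetical, and `fetchBody` stands for any caller-supplied function that returns the raw response text.

```javascript
// Classify one raw polling response. Anything that is not a JSON object
// (for example an HTML 502 gateway page) counts as "retry", never as a crash.
// Assumption: the status values and field names are hypothetical.
function classifyPollBody(text) {
  let data;
  try {
    data = JSON.parse(text);
  } catch {
    return { state: 'retry' };
  }
  if (data === null || typeof data !== 'object') return { state: 'retry' };
  if (data.status === 'completed') return { state: 'done', url: data.url };
  if (data.status === 'failed') return { state: 'failed', error: data.error };
  return { state: 'pending' };
}

// Poll until the render is done. A failed render throws, so the caller can restart it.
async function pollUntilDone(fetchBody, { intervalMs = 5000, maxAttempts = 120 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = classifyPollBody(await fetchBody());
    if (result.state === 'done') return result.url;
    if (result.state === 'failed') throw new Error(`render failed: ${result.error}`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs)); // retry or pending
  }
  throw new Error('polling timed out');
}
```

Separating classification from the loop keeps the "non-JSON means keep polling" rule in one small function that can be checked without any network access.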

The finished scene

"Approaching the robbers' house" — 37 seconds, end to end

The real milestone: not "does the pipeline work?" but "we can produce scenes on an assembly line" — style-true, character-stable, with German character voices, faithful to the Grimm source.

With sound: 4 shots · narrator + character voices · audio-driven cut. (Turn the sound on: this clip has voices, in German.)

| # | Shot | Sound |
|---|---|---|
| 1 | Tired rest in the dark woods | Narrator |
| 2 | Spotting the distant light | Rooster + narrator |
| 3 | The approach out of the woods | Narrator |
| 4 | Window: the donkey sees the robbers | Dog + donkey |
The scene machine: write a screenplay as scenes/<name>.json → run build-scene.mjs. Several variants are generated per shot; pick the best with "pick" and reassemble with --assemble. Variants are kept. Every further scene is built the same way.
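
The variant selection behind "pick" could be resolved as sketched below. This is an assumption-laden illustration: the function name and the output file naming are hypothetical, and "pick" is treated as 1-based because the example shot has "variants": 2 with "pick": 2.

```javascript
// Hypothetical naming scheme for rendered variants.
const defaultVariantPath = (id, n) => `out/${id}_v${n}.mp4`;

// Resolve the chosen variant file for every shot, in screenplay order.
function pickedVariants(shots, variantPath = defaultVariantPath) {
  return shots.map(({ id, variants, pick }) => {
    // Assumption: pick is 1-based and must name one of the generated variants.
    if (!Number.isInteger(pick) || pick < 1 || pick > variants) {
      throw new Error(`shot "${id}": pick ${pick} is outside 1..${variants}`);
    }
    return variantPath(id, pick);
  });
}
```

Because unpicked variants are kept, changing "pick" in the screenplay and reassembling only changes which existing file is selected, with no new render needed.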

Strategic approaches

Which tools — and how much AI?

A fundamental question ran in parallel: shouldn't this really be built in a proper 3D engine? The honest trade-off — and the chosen direction.

RenderMan / Unreal?

Technically yes — but the renderer was never the problem. You'd first need the other 90% (models, rigging, animation) and would be fighting your own picture-book style, which neither engine does natively. The wrong path for our flat watercolour look.

Open-source stack

Fully open is conceivable: ComfyUI + SDXL/Flux (with a LoRA per character for real consistency), ControlNet for choreography, local video models (Wan/Hunyuan), Coqui/Piper/Kokoro TTS, rembg, ffmpeg. Catch: local video diffusion is sluggish on the Mac.

Less AI, real production

Cut-out / puppet animation: rig the cut-out figures as jointed puppets and animate the motion deterministically (Blender headless or HyperFrames). The music pyramid as a physics sim. Consistency is perfect by design, since it's the same art in every shot.

★ Chosen direction

AI where it makes sense: stills (characters, backgrounds), voices, and later music/SFX and inspiration. Motion can increasingly move to real animation. The AI video pipeline is what is proven and productive today and delivers results immediately, while the cut-out track stands alongside as an expansion.