The Bremen Town Musicians — an AI video pipeline, explained

The Project

What it's about

The same fairy tale, two depths — and consistently made child-friendly.

The goal is an animated video of "The Bremen Town Musicians" (Brothers Grimm) in two versions: a very simple one for ages 1–3 (~5–6 min, short sentences, lots of repetition, onomatopoeia) and an elaborated one for ages 4–6 (~10 min, real dialogue, character humour, slapstick).

In the original, the animals are to be killed. Here that is consistently defused: there is no threat, the animals are simply "old and no longer needed" and find a new purpose. The robbers aren't evil but clumsy and cowardly (slapstick instead of scares).

Recurring musical motif: each animal is one note — Hee-Haw · Woof · Meow · Cock-a-doodle-doo — together "the band". Usable as a leitmotif across the whole series.

Current status: proof of concept passed. The entire pipeline is proven and a first complete, voiced, text-faithful scene exists. Next up: produce the screenplay scene by scene.

The full cast

Size reference: three human-sized robbers, the upright donkey as the largest animal, then the dog, the rooster, the cat.

The Characters

The cast

Every character is created once as a still in a uniform picture-book watercolour style — the donkey is the style anchor, his image sets the look for all the others. All four animals are gently aged (greying muzzle, soft wrinkles), fitting the "too old" theme.

Donkey — Grauschimmel

Leader, optimist, has the Bremen idea. Carries a lute.

Voice: Crizz · deep, warm

Dog — Packan

Loyal, clumsy, always hungry. Kettledrum as his anchor.

Voice: Adrian · strong, cheerful

Cat — Bartputzer

Elegant, grumpy, sharp tongue.

Voice: still to be cast

Rooster — Rotkopf

Loud, proud, the "early warner" who spots the light.

Voice: Paul · bright, dramatic

The robber gang

Captain + 2 henchmen. Clumsy and cowardly, never scary.

Narrator: Bettina · warm

Robbers' house (night)

Setting background: a warmly lit window in the dark forest.

Robbers' parlour (interior)

Stage for the window peek: a table full of food, candlelight.

Size chart

Defines the relative standing heights of the four musicians.

Group reference

The four in natural poses — defines how the characters relate.

Per character: a pose library

Identity references + action references

The key to character stability (more below): every character gets several views. Identity = front · face close-up · profile · rear. Action = a matching pose per motion (e.g. walking, with the lute on its strap instead of in the hands). Just like an animation studio keeps a model sheet per action.

Identity: Front (anchor)

Identity: Face close-up

Identity: Profile

Identity: Rear

Action: Walking, profile, lute on the strap

Action: Walking, ¾ view

Action: Dog, four-legged, drum as anchor

Action: Cat, four-legged, tail up

Action: Rooster, strutting on two legs

Sheet: 4 poses in 1 image (2×2)

Sheet: All 4 animals in 1 image

The Pipeline

Four building blocks, three small scripts

None of this needs a web UI or a human clicking somewhere — everything runs headless as a Node script that calls the APIs and assembles with ffmpeg.

① Characters

scripts/generate.mjs

Batch generator via the nanobanana CLI (Gemini). Style-anchor, composition and edit modes. Archives predecessors before overwriting.

② Video

scripts/i2v.mjs

Image-/reference-to-video via Atlas Cloud. Picks the schema per model automatically (Kling vs. Seedance), multi-ref as a base64 array, robust polling.

③ Scene

scripts/build-scene.mjs

Reads a screenplay (JSON), generates N variants per shot, creates the ElevenLabs voices and cuts everything together audio-driven. Also writes a readable DREHBUCH.md.

The proven chain: Atlas upload → Seedance/Kling → polling → download → ElevenLabs DE-TTS → ffmpeg mux. Each shot becomes a segment (video padded to speech length + voices mixed), then joined into the scene via concat.
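The last two steps of the chain, padding each shot to its speech length and joining the segments, can be sketched as ffmpeg argument builders. This is a minimal sketch and not the code in build-scene.mjs: the function names, the single pre-mixed voice track per shot, and the codec choices are assumptions. The tpad filter and the concat demuxer are standard ffmpeg features.

```javascript
// Seconds of frozen last frame needed so the picture outlasts the speech.
function padSeconds(videoSec, speechSec) {
  return Math.max(0, speechSec - videoSec);
}

// Arguments for one segment: video padded by cloning its last frame,
// voice track muxed in. Assumes one already-mixed voice file per shot.
function segmentArgs({ video, voice, videoSec, speechSec, out }) {
  const pad = padSeconds(videoSec, speechSec).toFixed(2);
  return [
    '-y', '-i', video, '-i', voice,
    '-filter_complex', `[0:v]tpad=stop_mode=clone:stop_duration=${pad}[v]`,
    '-map', '[v]', '-map', '1:a',
    '-c:v', 'libx264', '-pix_fmt', 'yuv420p', '-c:a', 'aac',
    out,
  ];
}

// Contents of the list file read by ffmpeg's concat demuxer.
function concatList(segments) {
  return segments.map((file) => `file '${file}'`).join('\n') + '\n';
}

// Arguments that join the segments without re-encoding.
function concatArgs(listFile, out) {
  return ['-y', '-f', 'concat', '-safe', '0', '-i', listFile, '-c', 'copy', out];
}
```

Each array would be handed to ffmpeg, for example through child_process.spawn('ffmpeg', args). Stream copy in the concat step only works when all segments share codec settings and resolution, so a scene that mixes 720p and 1080p shots would need a scaling step first.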

Example: one shot from the screenplay (scenes/annaeherung.json)

{
  "id": "fenster", "duration": 6, "variants": 2, "pick": 2,
  "refs": [
    "assets/characters/esel.jpg",
    "assets/characters/refs/esel_profile.jpg",
    "assets/characters/raeuber.jpg",
    "assets/characters/stube.jpg",
    "assets/characters/haus.jpg"
  ],
  "prompt": "Night exterior, locked static camera... the old grey donkey peers inside the lit window... THROUGH THE WINDOW three funny scruffy robbers sit around a big wooden table laden with food, feasting...",
  "narration": [
    { "speaker": "hund", "text": "Was siehst du denn, Grauschimmel?" },
    { "speaker": "esel", "text": "Einen Tisch voll mit herrlichem Essen — und drei Räuber, die sich's so richtig schmecken lassen!" }
  ]
}
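A shot like the one above can be checked before any paid render starts. The following is a minimal sketch of such a check, not part of the actual scripts: validateShot and the knownSpeakers parameter are hypothetical, and reading "pick" as a 1-based index is an inference from the example, where "pick": 2 selects the second of 2 variants.

```javascript
// Returns a list of human-readable problems; an empty list means the shot looks usable.
function validateShot(shot, knownSpeakers) {
  const errors = [];
  if (typeof shot.id !== 'string' || shot.id === '') {
    errors.push('id must be a non-empty string');
  }
  if (!(shot.duration > 0)) {
    errors.push('duration must be a positive number of seconds');
  }
  if (!Number.isInteger(shot.variants) || shot.variants < 1) {
    errors.push('variants must be an integer >= 1');
  }
  // "pick" is optional until a variant has been chosen (assumed 1-based).
  if (shot.pick !== undefined &&
      (!Number.isInteger(shot.pick) || shot.pick < 1 || shot.pick > shot.variants)) {
    errors.push('pick must be between 1 and variants');
  }
  if (!Array.isArray(shot.refs) || shot.refs.length === 0) {
    errors.push('refs must list at least one reference image');
  }
  for (const line of shot.narration ?? []) {
    if (!knownSpeakers.includes(line.speaker)) {
      errors.push(`unknown speaker: ${line.speaker}`);
    }
  }
  return errors;
}
```

Collecting all problems instead of throwing on the first one keeps the feedback loop short when a whole screenplay is checked at once.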

Models compared

Which video model for which shot?

Every model runs through the same Atlas Cloud API. The key insight: no single model wins — each shot type has its model.

Model	API ID	Price/s	Resolution	Role here
Seedance 2.0 ref-to-video	bytedance/seedance-2.0/reference-to-video	~$0.24	720p	Default for scenes: holds all characters (up to 9 refs), composes freely, on-model
Kling 3.0 Pro	kwaivgi/kling-v3.0-pro/image-to-video	$0.095	1080p	Default for single shots: faithfully animates a composed still, high-res, cheap
Seedance 2.0 (full) i2v	bytedance/seedance-2.0/image-to-video	~$0.24	720p	holds identity with an anchor, but single image → no real multi-ref
Seedance 2.0 fast	…/seedance-2.0-fast/…	~$0.022	720p	❌ dropped — breaks the style completely (3D / wrong character)
Seedance v1.5 Pro	bytedance/seedance-v1.5-pro/…	$0.047	—	researched (style-preservation leader), test failed on the content schema
Hailuo 2.3 · Vidu Q3 · Wan 2.6	minimax / vidu / alibaba	$0.018–0.28	—	researched as cheap / stylised alternatives
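The per-second prices in the table translate into a rough cost per shot. The sketch below assumes that billing is simply price × duration × number of variants, which the source does not confirm; shotCost is a hypothetical helper and the Seedance price is the approximate figure from the table.

```javascript
// Per-second prices in USD, copied from the table above.
const PRICE_PER_SECOND = {
  'bytedance/seedance-2.0/reference-to-video': 0.24, // approximate
  'kwaivgi/kling-v3.0-pro/image-to-video': 0.095,
};

// Estimated cost of rendering every variant of one shot, rounded to cents.
function shotCost(model, durationSec, variants = 1) {
  const price = PRICE_PER_SECOND[model];
  if (price === undefined) throw new Error(`no price known for ${model}`);
  return Math.round(price * durationSec * variants * 100) / 100;
}
```

Under these assumptions, the 6-second "fenster" shot with 2 variants comes to about $2.88 on Seedance ref-to-video and about $1.14 on Kling.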

Model choice by shot type: multi-character scenes, free composition and locomotion → Seedance ref-to-video. Single figure, close-up, or "bring an already-composed still to life" → Kling i2v (1080p, cheaper). On multi-character work Kling is unreliable (drops figures, turns the upright donkey into a quadruped).
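This routing rule can be written down as a small function. It is one possible reading of the rule, not the logic inside i2v.mjs: the shot descriptor fields (characters, locomotion, composedStill) are hypothetical, while the two API IDs come from the table.

```javascript
const SEEDANCE_REF = 'bytedance/seedance-2.0/reference-to-video';
const KLING_I2V = 'kwaivgi/kling-v3.0-pro/image-to-video';

// Picks a model ID for a shot description.
// characters: number of figures in frame
// locomotion: figures travel through the scene
// composedStill: a finished still of the whole shot already exists
function pickModel({ characters, locomotion = false, composedStill = false }) {
  // Locomotion and free multi-character composition go to Seedance.
  if (locomotion) return SEEDANCE_REF;
  if (characters > 1 && !composedStill) return SEEDANCE_REF;
  // Single figures, close-ups and already-composed stills go to Kling.
  return KLING_I2V;
}
```

Treating a composed multi-character still as a Kling job is an assumption here; if Kling still drops figures in that case, the second condition can be reduced to characters > 1.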

The same donkey, five approaches


Drift: Seedance fast, 1 image. Our flat donkey turns into a photorealistic 3D plush toy and lies down.

Drift: Seedance fast + a "2D" prompt. Now flat, but a completely new character (a boy with antlers). Our donkey is gone.

Better: Seedance full + anchor. The donkey is preserved (standing, lute, 2D) but looks a bit "pencilly".

Breakthrough: Seedance ref-to-video, 4 refs + negative prompt. Exactly our watercolour donkey, stable across all frames.

Strong: Kling 3.0 Pro, multi-ref. Holds the same donkey, in 1080p and ~2.5× cheaper.

Scene with everything: Seedance, 6 refs. All four animals on-model plus the robbers' house, in one coherently composed scene.

Challenges & solutions

What was hard — and how we solved it

The honest part. Almost every step forward came out of a visible failure. Each card shows Problem → Cause → Solution, many with the before/after video evidence.

1 Style break: 2D turns into 3D

Problem: Seedance turns the flat picture-book donkey into a photorealistic 3D plush toy or, given a "2D" prompt, invents a brand-new character.

Cause: A single front image, image-to-video mode, a realism-tuned model, and no negative prompt. The model has to invent every unseen view, which produces drift.

Solution: The "yeti method", three ingredients combined (see box).

The 3 ingredients of character stability:

Multi-reference instead of a single image — front + face + profile + rear
The reference-to-video mode (identity lock from several refs) instead of image-to-video
A hard negative prompt in every call against "3D, photoreal, different animal, deer …"

Bonus: the array schema reference_images:[base64…] also eliminated the 400 errors — the single image_url path can't do multi-ref at all.
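The three ingredients come together in the request body. The sketch below only builds that body and makes no claim about the Atlas Cloud endpoint: the field name reference_images is taken from the text, while model, prompt and negative_prompt, and the function name itself, are assumptions that would need checking against the API documentation.

```javascript
// Negative terms quoted in the text above; the original list is longer.
const BASE_NEGATIVE = '3D, photoreal, different animal, deer';

// Builds a reference-to-video request body from image buffers.
// refBuffers: Buffers holding the front, face, profile and rear views.
function buildRefToVideoBody({ model, prompt, refBuffers, negative = BASE_NEGATIVE }) {
  if (!Array.isArray(refBuffers) || refBuffers.length < 2) {
    throw new Error('multi-reference needs at least two images');
  }
  return {
    model,
    prompt,
    negative_prompt: negative, // assumed field name
    reference_images: refBuffers.map((buf) => buf.toString('base64')),
  };
}
```

In a real script the buffers would come from fs.readFileSync on the reference files, and the body would be sent as JSON.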

Before: 1 image → 3D plush

After: 4 refs + negative → our donkey

2 Multi-character scenes: Kling loses figures

Problem: In group scenes Kling drops a figure and turns the upright donkey into a quadruped, like a real horse.

Cause: Kling i2v uses the (first) image as a start frame and isn't built to compose a coherent multi-figure scene from separate refs.

Solution: Route group/story scenes to Seedance ref-to-video, which holds all four (up to 9 refs) and composes freely.

Kling: donkey on all fours, dog missing

Seedance: all four, on-model

3 Kling's 4-ref limit & the literally-animated sheet

Problem: Kling accepts only 4 reference images (our 6 were rejected). And a composite pose sheet gets animated as a whole: the 2×2 grid wobbles instead of one figure being extracted from it.

Cause: Kling animates exactly the input image as a canvas. Seedance instead uses the refs as an identity pool and builds a new scene. Aspect ratio is sensitive too (a 4×1 strip → "aspect ratio invalid").

Solution: Two clean production paths: (A) Seedance composes directly from many refs. (B) For Kling, first compose the scene as a real still (nanobanana composition mode), then animate. Keep sheets square (2×2).
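The reference limits are cheap to check locally before a request is rejected by the API. A minimal sketch, assuming the limits observed in this project (4 for Kling, 9 for Seedance ref-to-video); the family keys and the function name are hypothetical.

```javascript
// Maximum number of reference images per model family, as observed above.
const MAX_REFS = { kling: 4, seedance: 9 };

// Reports whether a ref list fits the model and by how many images it overshoots.
function checkRefs(family, refs) {
  const limit = MAX_REFS[family];
  if (limit === undefined) throw new Error(`unknown model family: ${family}`);
  return {
    ok: refs.length <= limit,
    limit,
    excess: Math.max(0, refs.length - limit),
  };
}
```

For the six refs that Kling rejected, checkRefs('kling', refs) reports an excess of 2, while the same list passes for Seedance.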

Evidence: Kling animates the entire 2×2 donkey sheet, giving four wobbling donkeys instead of one.

4 Extra arms while walking

Problem: When walking, the donkey grows extra arms that don't belong.

Cause: All identity refs show him upright with "arms" (playing the lute). The model mixes that arm pose with walking legs, which causes anatomy errors.

Solution: Provide action references per motion: a walking profile with the lute on its strap (not in the hands), plus a sharpened negative prompt against "extra arm, holding the lute". Every character needs a pose library per action.
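A pose library per action can be as simple as a lookup table. The sketch below is illustrative only: the view names and the file-name pattern are hypothetical (modelled on assets/characters/refs/esel_profile.jpg from the screenplay example), and the source does not say whether action refs replace or supplement the identity refs. Here they replace them.

```javascript
// Views per character: identity views plus extra views for specific motions.
const POSE_LIBRARY = {
  esel: {
    identity: ['front', 'face', 'profile', 'rear'],
    actions: { walking: ['walk_profile', 'walk_34'] },
  },
};

const BASE_NEGATIVE = '3D, photoreal, different animal, deer';
// Extra negative terms per action; the walking terms are quoted in the text.
const ACTION_NEGATIVE = { walking: 'extra arm, holding the lute' };

// Hypothetical naming scheme for reference files.
function refPath(character, view) {
  return `assets/characters/refs/${character}_${view}.jpg`;
}

// Returns the reference files and negative prompt for a character in an action.
// Falls back to the identity views when no action refs exist.
function refsFor(character, action) {
  const entry = POSE_LIBRARY[character];
  if (!entry) throw new Error(`no pose library for ${character}`);
  const views = entry.actions[action] ?? entry.identity;
  const extra = ACTION_NEGATIVE[action];
  return {
    refs: views.map((view) => refPath(character, view)),
    negative: extra ? `${BASE_NEGATIVE}, ${extra}` : BASE_NEGATIVE,
  };
}
```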

Action ref (donkey walk reference): upright gait, lute on the strap

After: clean walk, 2 arms, 2 legs

5 "The house runs away" — camera drift instead of locomotion

Problem: The animals don't really move. The background/house slides out of frame while the animals march in place.

Cause: The prompt said "the camera follows them", so Seedance moves the camera (a cheap background slide) instead of having the figures travel real distance. AI video is notoriously weak at "walk toward a distant goal + camera move".

Solution: Locked camera + lateral walk: the animals wander across the frame (left → right) while the house stays fixed. Or the dynamic depth variant: they appear small at the back of the woods and walk forward, growing larger, which covers real distance.
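Since one phrase in the prompt caused this failure, a small prompt check can catch it before rendering. This is a hedged sketch, not something the scripts are known to do: only "camera follows" comes from the failed prompt, the lock phrases are modelled on "locked static camera" in the screenplay example, and the remaining patterns are illustrative guesses.

```javascript
// Phrases that ask the model to move the camera.
const CAMERA_MOVE = /camera\s+(follows|tracks|pans|moves)|tracking shot/i;
// Phrases that pin the camera in place.
const CAMERA_LOCK = /locked|static camera|fixed camera/i;

// Returns warnings for a locomotion prompt; an empty list means no known issue.
function lintPrompt(prompt) {
  const warnings = [];
  if (CAMERA_MOVE.test(prompt)) {
    warnings.push('camera movement requested: figures may march in place');
  }
  if (!CAMERA_LOCK.test(prompt)) {
    warnings.push('no camera lock phrase found');
  }
  return warnings;
}
```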

Before: house drifts away, marching in place

After: locked camera, real travel

Hero shot: dynamic, 10 s. Out of the woods, small → large (establishing + approach in one).

6 Geography & text fidelity

Problem: A single shot merged "spot the light from afar" and "arrive at the house", so the geography collapsed. And through the window you saw only food, not the robbers at the table (the Grimm core beat).

Cause: Too much action in one shot, and the robber reference was missing as window content.

Solution: Split the beats into separate shots (spot from afar → approach → near window peek). Provide robber refs + the parlour as window content. Never drop story core beats.

Beat 1: a small, distant light in the woods

Beat 4: robbers at the laid table (faithful to the text)

7 Transient API errors & robust polling

Problem: Occasional 502/HTML gateway pages instead of JSON, and sporadic "internal error" responses mid-render.

Cause: Cloud backend hiccups, not a schema problem.

Solution: Harden the poller (on non-JSON, just keep polling instead of crashing) and simply restart failed renders. Baked into the script.
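A hardened poller of this kind might look as follows. It is a sketch of the idea rather than the code in i2v.mjs: the status values "completed" and "failed", the error field, and both function names are assumptions, and fetchStatus stands for any function that returns a fetch-style response for the job's status URL.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls until the job completes. Non-JSON bodies (e.g. a 502 HTML page)
// and network errors count as transient and do not abort the loop.
async function pollUntilDone(fetchStatus, { intervalMs = 5000, maxAttempts = 120 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let body = null;
    try {
      const res = await fetchStatus();
      body = JSON.parse(await res.text());
    } catch {
      // Transient hiccup: leave body as null and poll again.
    }
    if (body && body.status === 'completed') return body;
    if (body && body.status === 'failed') {
      throw new Error(body.error || 'render failed');
    }
    await sleep(intervalMs);
  }
  throw new Error(`gave up after ${maxAttempts} polls`);
}

// Restarts a whole render (submit + poll) when it fails mid-way.
async function renderWithRestarts(render, maxRestarts = 2) {
  for (let restart = 0; ; restart++) {
    try {
      return await render();
    } catch (err) {
      if (restart >= maxRestarts) throw err;
    }
  }
}
```

Reading the body as text and parsing it by hand, instead of calling res.json(), is what lets the HTML gateway page fall through to the next poll.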

The finished scene

"Approaching the robbers' house" — 37 seconds, end to end

The real milestone: not "does the pipeline work?" but "we can produce scenes on an assembly line" — style-true, character-stable, with German character voices, faithful to the Grimm source.

With sound: 4 shots · narrator + character voices · audio-driven cut. (Turn the sound on: this clip has voices, in German.)

#	Shot	Sound
1	Tired rest in the dark woods	Narrator
2	Spotting the distant light	Rooster + narrator
3	The approach out of the woods	Narrator
4	Window: the donkey sees the robbers	Dog + donkey

The scene machine: write a screenplay as scenes/<name>.json → run build-scene.mjs. Several variants are generated per shot; pick the best with "pick" and reassemble with --assemble. Variants are kept. Every further scene is built the same way.

Strategic approaches

Which tools — and how much AI?

A fundamental question ran in parallel: shouldn't this really be built in a proper 3D engine? The honest trade-off — and the chosen direction.

RenderMan / Unreal?

Technically yes — but the renderer was never the problem. You'd first need the other 90% (models, rigging, animation) and would be fighting your own picture-book style, which neither engine does natively. The wrong path for our flat watercolour look.

Open-source stack

Fully open is conceivable: ComfyUI + SDXL/Flux (with a LoRA per character for real consistency), ControlNet for choreography, local video models (Wan/Hunyuan), Coqui/Piper/Kokoro TTS, rembg, ffmpeg. Catch: local video diffusion is sluggish on the Mac.

Less AI, real production

Cut-out / puppet animation: rig the cut-out figures as jointed puppets and animate the motion deterministically (Blender headless or HyperFrames). The music pyramid as a physics sim. Consistency is perfect by design, since it's the same art in every shot.

★ Chosen direction

AI where it makes sense — stills (characters, backgrounds), voices, later music/SFX and inspiration. Motion can increasingly move to real animation. Proven and productive today is the AI video pipeline; it delivers results immediately, while the cut-out track stands alongside as an expansion.