Production Diary: Using Generative Video to Complete a Creative Brief

[obviously: post is in my capacity as an academic, views do not reflect those of my employer etc]

There have been years of claims that video AI technology is making big chunks of video creation redundant. I’m putting this to the test by using the tooling in depth to see how it can help execute a creative brief in a commercial art setting in early 2026.

If you’re in a rush, I’ll save you time: outside of some very narrow jobs, the tooling is not ready for regular iterative creative work. Why this is the case is worth digging into – if you’re interested, read on.

Preface

In late 2023 I asserted that AI isn’t going to replace creative professionals. My hypothesis remains unchanged, and I’m working with some talented co-authors on revisiting this in detail – that’s another post, soon.

Also: I’m factoring out the reductive and dogmatic “the models are just going to keep getting better” argument; aside from its extreme laziness, it’s not applicable here. Model capacity or quality can’t fix the fundamental UX flaws of these tools, i.e. they don’t meet creators’ needs. This also means it doesn’t matter whether we’re evaluating a local model or a gigantic cloud-hosted one – the central problem lies outside of model scale and capability.

And briefly: the elephant in the room. Modern generative AI is rightly controversial, with unanswered questions around provenance, copyright, and rights. I’m not ignoring these but it’s been well-discussed (including by myself). What I am examining here are claims of generative AI’s capabilities – which must be rigorously tested. Understanding in clear terms what it can, but importantly, what it cannot do can help better ground the discussion in pragmatic reality instead of magical thinking.

The brief

Creative tools are magnifiers of intent; they take an idea that lives in your head and help you put it into a durable medium. The dream state of generative models is that this can be done in one shot – that is, a single stimulus like a prompt can generate a fully expressed, intricate work of art that executes the intent of the creator.

I’m going to find out where the reality stands by using a real creative brief to create a video. Rather than test the whole from-scratch video creation process, I pared it back. The brief is to edit an existing video, using generative tools to create new footage and re-interpret its meaning.

I came across this gem of a webcomic from @itsmegcomics. It was perfect – it was already storyboarded, it already featured scenes from the film, and it had an opinion about composition. The job then is to create the new footage, cut it in with the existing film, and it’s done, right? Easy, right? (Spoiler: it wasn’t.)

The source webcomic:

The end result:

Trying to put GenAI into a pipeline

Bleeding-edge ML tools aren’t really built for creatives doing commercial art (that is, commissioned or contract work). The tools have largely been built by hobbyists or academics and many have been generously open-sourced – that’s great. However: underlying architectures are focused on one-shot, single-user workflows – not great for commercial art at scale, which is based on iteration and multi-artist pipelines.

Low friction is important for creative work; when I was in the industry full-time, I remember the ease of starting work on a shot in VFX. A single command would check out the needed assets as well as blessed, validated versions of the applications. This made it easy to get to work, but moreover it provided predictability and stability for asset flow in a pipeline involving several disciplines.

In a typical commercial art pipeline, shared tools across a production should do these things:

  • They should stay pinned to a stable version for the duration of the production, for stability and repeatability (ML tools update often, and these changes can break both)
  • They should separate binaries/code, data needed to run the app, config files, and assets (ML tools will often lump these all together into one folder)
  • They should make it easy to start work on a shot with a selection of curated tools and assets

Reflecting on this, I had a go at creating a toy version of a pipeline like this to test these tools.


Rather than cobble together something temporary, I indulged once again in overthinking and came up with a somewhat first-principles-driven generic pipeline to test my assumptions. My hunch is that bleeding-edge ML tools can be made more pipeline-friendly by re-creating how applications are laid out on disk: versioning for tool code/binaries (with a virtual environment for each) and for tool models.

I built it and you can find the toy pipeline here: https://github.com/bhautikj/dcc-ml-env

When you make a workspace to iterate on a shot, you then get:

  • A selection of tools + models (versioned!)
  • Per-application-version virtual environments (I used conda – it’s more lightweight than Docker)
  • Per-workspace tool configs
  • Per-workspace user assets (both inputs and outputs)
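The scaffolding behind a workspace like this is mostly directory creation and symlinking. As a rough sketch – this is illustrative, not the actual dcc-ml-env code, and the function and field names are my own – a minimal version might look like:

```python
import os
from pathlib import Path

# Illustrative sketch only -- not the real dcc-ml-env implementation.
WORKSPACE_DIRS = ["bin", "config", "in", "out", "models", "tmp", "var"]

def make_workspace(root: str, tools: list) -> Path:
    """Create a shot workspace and symlink versioned tools/models into it."""
    ws = Path(root)
    for d in WORKSPACE_DIRS:
        (ws / d).mkdir(parents=True, exist_ok=True)
    for tool in tools:
        # Symlink the pinned tool version into bin/ so the workspace
        # always points at one known-good install.
        link = ws / "bin" / f"{tool['name']}-{tool['version']}"
        if not link.exists():
            os.symlink(tool["path"], link)
        for model in tool.get("models", []):
            mlink = ws / "models" / model["name"]
            if not mlink.exists():
                os.symlink(model["path"], mlink)
    return ws
```

The symlink approach means the workspace never copies tool binaries or multi-gigabyte model weights – it just pins which versions this shot sees.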

Example flows

Using dcc-ml-env requires listing the tools required and where the shot will live. An example YAML configuration:

working_directory: "/home/username/workspace/my_shot"

tools:
  - name: "comfyui"
    type: "comfyui"
    version: "20260219"
    path: "/home/username/source/ComfyUI-20260219"
    env: "comfyui"
    models:
      - name: "ckpts"
        subtype: "ckpts"
        path: "/home/username/models/models.comfy"

The tool is then invoked to set everything up:

dcc-ml-env my_shot.yaml

And it creates a directory structure like so:

my_shot/
├──bin            # symlink trees of tools and run_XYZ_APP.sh launchers
├──config         # config files for the tools
├──in             # where to place input assets (apps will point to this)
├──out            # where output assets are put by default
├──models         # symlinks to the ML models
├──tmp            # temp dirs for the tools; cleared up on exit
└──var            # var dirs for the tools; not cleared up on exit 

…and from here it switches to a familiar flow which hides all the environment configuration away and launches tools ready to iterate, with asset input/output directories pre-populated:

cd /home/username/workspace/my_shot
cp /shared/assets/assets_for_my_shot/* /home/username/workspace/my_shot/in
bin/run_comfyui_20260219.sh
#..
#..creative iteration in comfyui
#..

# output assets
ls /home/username/workspace/my_shot/out

Findings

It’s nothing more than a toy – definitely not for real work, but enough to scaffold the second part of this experiment in the next section. An early conclusion: a lot of ML tooling is immature not just because of poor creative iterability but also because of high UX friction. By default, inputs and outputs are unmanaged, and version control is poor. This is a pretty classic software deployment problem and can largely be fixed without having to re-architect the applications.

Centrally though: there’s a tension between the belief that ML tools are ‘one shot’ and represent whole pipelines on their own, versus the reality that they need to live inside harnesses with known interfaces for deployment and asset management so they can be used in iterative, commercial art pipelines.

Completing the brief

From a tooling point of view, I used dcc-ml-env (above) to set up the workspace and manage assets. I generated the videos and audio inside Wan2GP (a swiss-army-knife, no-frills tool for media creation). Within this I opted to use InfiniteTalk for creating videos (driven by a combination of a starting image and audio) and Chatterbox TTS for generating spoken dialog. I then moved everything over to Canva to edit the footage and do the final assembly.

Storyboard and animatic

The first stage was to block out the work needed. The first frame is from the movie itself; the second and third are scenes from the movie but require new audio and new motion; the fourth is a scene from the movie but with new offscreen dialog.

The voice generation was done with Chatterbox TTS; it took 5–6 takes for every one that I kept. Chatterbox TTS bills itself as expressive, but it was hard to direct emphasis to the right places, and longer pieces of dialog had more uneven performances. I ended up going with sentence-level generation, as this was more consistent and more likely to lead to a performance I could use.
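The sentence-level workflow amounts to splitting the dialog and generating several takes per sentence, then keeping the best. A sketch of the loop – where `synthesize` is a stand-in for whatever TTS call you’re using (Chatterbox in my case), not a real API:

```python
import re

def split_sentences(dialog: str) -> list:
    """Naive sentence splitter; good enough for short dialog lines."""
    parts = re.split(r"(?<=[.!?])\s+", dialog.strip())
    return [p for p in parts if p]

def generate_takes(dialog: str, synthesize, takes_per_sentence: int = 6) -> dict:
    """Generate several takes per sentence so the best one can be kept.

    `synthesize` is a placeholder for the actual TTS call; it should
    return an audio clip (or a path to one) for a given sentence.
    """
    return {
        sentence: [synthesize(sentence) for _ in range(takes_per_sentence)]
        for sentence in split_sentences(dialog)
    }
```

Generating per-sentence also means a bad take only costs you that one sentence, not a re-roll of the whole speech.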

Once I had audio I could use, I then created a quick animatic – that is, I simply have the storyboard frames in sequence with the audio already cut together. Critically, this audio and the timing forms the backbone for the edit – generated video will be cut against it.

Animatic to video

Next was replacing the animatic with movie scenes and generated footage. This required creating videos for panels two and three and then using these alongside footage from the original film to create a video based on the storyboard.

There are models like LTX2 that offer the ability to create both video and audio together, but it wasn’t a good choice here. An obvious reason is that the timing and audio were already generated; another is that joint generation of video and audio is a bad UX for working with video on a nonlinear editing timeline.

It’s difficult to get consistent audio (i.e. voices) between scenes, and a common artifact is that background music and even room tone is baked into the audio. Moreover, if you don’t suggest a video duration that is close to the length of the spoken dialog, you end up with strange generation artifacts.
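Matching the requested clip length to the dialog is simple to automate. A sketch, assuming a WAV file for the dialog and a frame-rate parameter (the function name and padding default are mine):

```python
import math
import wave

def frames_for_dialog(wav_path: str, fps: int = 25, padding_s: float = 0.5) -> int:
    """Frame count for a generated clip that matches the dialog length.

    Reads the audio duration from the WAV header and adds a little
    padding so the shot doesn't cut off mid-word.
    """
    with wave.open(wav_path, "rb") as wav:
        duration_s = wav.getnframes() / wav.getframerate()
    return math.ceil((duration_s + padding_s) * fps)
```

Feeding a count like this into the generator’s duration/frame parameter avoids the artifacts that show up when the video is much longer or shorter than the speech.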

The video generation for each shot was run through InfiniteTalk with an image of the first frame of the scene and the dialog. This proved surprising – the lip and mouth movement of the shark was very humanlike which was uncanny and unsettling. This is when I remembered – in the movie, the shark is actually Bruce, a fiberglass and wire confection of a model with a very rigid jaw. Once I adjusted the prompt to ask for a more rigid movement in the mouth, it read as more consistent.

I’d argue that this kind of consistency – picking up on deeply implied offscreen context and bringing it into the generation – is something that ML models may never catch up on. It’s another UX affordance decoupled from model capabilities, one that takes a human being with taste and discretion to bridge.

Finishing the work

I wasn’t surprised that the literal conversion of the storyboard into a video was far from complete. We’ll be touching on this in depth soon, but the translation of an idea from script to storyboard to screen is not a straightforward process. While the core idea must remain, the actual timeline edit needs revising to read correctly for the medium and context.

In this case, direct translation of the storyboard to screen led to a low-quality result. There were issues with the production itself: in the generated clips the audio didn’t continue the background music or the noises of the sea and boats, which was jarring.

The edit itself didn’t follow many of the filmmaking tropes that are the bread and butter of storytelling in video. The shark suddenly shows up with no explanation of how it got there; the one-sided back-and-forth still needed to follow the 180-degree rule. Moreover, the pacing was completely off – the whole was rushed, yet shots lingered too long on the shark or the intrepid sailors.

I had a go at fixing this, mostly in the edit itself. It also just felt right that if the shark was talking, he had to be on-screen, so I generated extra footage of him talking for the last panel. The end result is still rushed and not completely final, but it’s enough to once again demonstrate that taste (of which I have trace amounts) and discretion (of which I have enough) are needed to complete a creative task.

Closing thoughts

Human-in-the-loop authoring for video isn’t going anywhere, due to fundamental flaws in how generative AI works. The inability to work incrementally, infer hidden context, or understand emphasis will leave it with utility-tool status instead of one-and-done, for a while longer at least.

The idea of ‘one shot’ completion of creative tasks using generative AI has been consistently difficult to achieve in a general way. It looks like model capability improvements will continue apace, but significant UX friction and the ‘slot machine’ nature of generation means it will continue to be a challenge to use as part of intentional filmmaking.

I’d welcome your thoughts on this.