Prompting Grok Imagine for Dialogue That Lands

category: prompting
author: grokimagineapi editorial
published: 2026-04-19
read_time: 6 min read
Quoted dialogue in a video prompt works, but only if you give the model a speaker, a line, an emotion, a shot size, and an audio environment. Here is the structure that makes lip sync land on the first try.
────────────────────────────────────────────────────

Grok Imagine v1.0 shipped with real phoneme-aware audio, which is the reason dialogue in your prompts can now track lip movement instead of floating in as a voiceover. The catch is that the model interprets dialogue prompts literally. If you imply a line, you get mumbling. If you quote a line but skip the emotion, you get flat delivery.

This post is about writing dialogue prompts that survive the first generation. You will learn the five-slot structure and why quoted lines outperform indirect speech.

[Image: a script page with five labeled sections titled speaker, line, emotion, shot, and audio]

The five slots

Every dialogue prompt should fill five slots. Skip one and the model fills it for you, usually badly.

Speaker. Describe the person in enough detail that the model can commit to a face. Age range, one distinctive wardrobe item, one distinctive feature. "A woman in her forties with short silver hair wearing a navy blazer" gives the model something to lock onto. "A woman" does not.

Line. Write the exact words in quotes. No paraphrase, no summary. The model treats quoted strings as the audio target. If you write "she explains the situation," you get ambient room tone and mouth movements that match nothing. If you write "'We lost three servers in Frankfurt,' she says," you get lip sync.

Emotion. One adjective, placed near the quoted line. Frustrated, reassuring, hesitant, dry. The model reads emotion from adjacent tokens, so put the descriptor close to the quote.

Shot size. Close up, medium close up, medium shot, wide shot. Dialogue lands best on medium close up and tighter, because the mouth fills enough of the frame for the lip sync to track.

Audio environment. One phrase about the acoustic space. Small room with carpet, open warehouse with echo, outdoor street with traffic. This anchors the foley layer so the voice does not sound like it was recorded in a vacuum.
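The five slots above can be filled in mechanically. Below is a minimal sketch of a helper that refuses to build a prompt when a slot is empty and keeps the emotion next to the quoted line. The function name `buildDialoguePrompt`, its field names, and the sentence template are all hypothetical; they are not part of any Grok Imagine or fal API.

```javascript
// Hypothetical helper: assembles a dialogue prompt from the five slots.
// Field names and the sentence template are illustrative assumptions.
function buildDialoguePrompt({ speaker, line, emotion, shot, audio }) {
  const slots = { speaker, line, emotion, shot, audio };

  // Collect every slot that is missing or blank so the error names all of them.
  const missing = Object.keys(slots).filter(
    (name) => typeof slots[name] !== "string" || slots[name].trim() === ""
  );
  if (missing.length > 0) {
    throw new Error(`Missing dialogue slots: ${missing.join(", ")}`);
  }

  // The emotion sits directly before the quoted line, and the line is quoted verbatim.
  return (
    `${speaker.trim()}, ${shot.trim()}. ` +
    `They say, sounding ${emotion.trim()}, '${line.trim()}' ` +
    `${audio.trim()}.`
  );
}

const prompt = buildDialoguePrompt({
  speaker: "A woman in her forties with short silver hair wearing a navy blazer",
  line: "We lost three servers in Frankfurt overnight, and I need to know why.",
  emotion: "frustrated",
  shot: "medium close up",
  audio: "Small office room tone, low HVAC hum",
});
console.log(prompt);
```

Throwing on a missing slot is a design choice for this sketch: it surfaces the gap before you pay for a generation, instead of letting the model fill it in.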

A working example

JAVASCRIPT

import { fal } from "@fal-ai/client";

// Read the API key from the FAL_KEY environment variable.
fal.config({ credentials: process.env.FAL_KEY });

// subscribe() queues the request and resolves once the generation finishes.
const result = await fal.subscribe("xai/grok-imagine-video/text-to-video", {
  input: {
    // Speaker, shot size, emotion next to the quoted line, then audio environment.
    prompt: "A woman in her forties with short silver hair wearing a navy blazer sits at a glass conference table, medium close up. She leans forward and says in a frustrated tone, 'We lost three servers in Frankfurt overnight, and I need to know why.' Small office room tone, low HVAC hum.",
    resolution: "720p",
    duration: 8,
    audio: true,
    aspect_ratio: "16:9"
  },
  logs: true
});

console.log(result.data.video.url);

An 8-second 720p clip with audio runs $0.56. If the lip sync is off on the first pass, you reroll once and usually land it on the second try.

[Image: two stills from the same clip, one labeled implied dialogue with no visible speech, the other labeled quoted line with clear lip shape]

Why quoted lines beat implied dialogue

The model's audio track is driven by phoneme prediction. When you write "she explains the situation," there are no phonemes to predict, so the model generates generic speech shapes. When you write a quoted line, the model has a target sequence. The lip sync follows because the visual side is conditioned on the same audio predictions.

Paraphrases fail. "He says he is tired" produces a mouth that opens and closes without intent. "'I have been up since four,' he says" produces a mouth that shapes the vowels in those specific words.
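Since the difference comes down to whether the prompt contains a quoted string, it is easy to check before sending a request. The sketch below is a hypothetical pre-flight check, not something the model or the API provides. It looks for text wrapped in double quotes, curly double quotes, or single quotes, and its regular expressions are a rough heuristic that will not handle every nesting of quotes.

```javascript
// Hypothetical pre-flight check: returns the quoted lines found in a prompt.
function findQuotedLines(prompt) {
  const patterns = [
    /"(.+?)"/g, // straight double quotes
    /\u201C(.+?)\u201D/g, // curly double quotes
    // Single quotes: the opening quote must not follow a letter and the closing
    // quote must not precede one, so apostrophes inside words are skipped.
    /(?<![A-Za-z])'(.+?)'(?![A-Za-z])/g,
  ];

  const lines = [];
  for (const pattern of patterns) {
    for (const match of prompt.matchAll(pattern)) {
      lines.push(match[1].trim());
    }
  }
  return lines;
}

const implied = findQuotedLines("He says he is tired.");
const quoted = findQuotedLines("'I have been up since four,' he says.");
console.log(implied.length, quoted.length);
```

An empty result suggests the prompt only implies the dialogue and is worth rewriting before you generate.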

Small adjustments that help

Keep lines short. Eight to fifteen words fit comfortably in a 6 to 10 second clip. Longer lines get rushed or truncated.
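The eight-to-fifteen-word guideline can be checked with a simple word count. This is a minimal sketch with a hypothetical function name, `checkLineLength`. It splits on whitespace only, so contractions and hyphenated words each count as one word.

```javascript
// Hypothetical check against the eight-to-fifteen-word guideline.
function checkLineLength(line, min = 8, max = 15) {
  // Split on runs of whitespace; filter(Boolean) drops the empty string
  // that split() returns for a blank line.
  const words = line.trim().split(/\s+/).filter(Boolean).length;
  return { words, withinRange: words >= min && words <= max };
}

const lengthCheck = checkLineLength(
  "We lost three servers in Frankfurt overnight, and I need to know why."
);
console.log(lengthCheck);
```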

Name the delivery style. Calm, urgent, whispered, matter of fact. The model varies cadence based on these cues.

Avoid stacking two speakers in one clip unless you are willing to reroll. Two-person dialogue works, but the sync rate drops to about one in three first passes. Instead, generate each line as its own clip and cut them together.
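One way to apply the one-line-per-clip advice is to turn an exchange into a list of request inputs, one per spoken line, that share the same shot and audio environment. The sketch below only builds the inputs and does not call any API. The helper name `buildClipInputs` and the shape of its arguments are hypothetical; the shared fields mirror the input object used in the working example earlier in this article.

```javascript
// Hypothetical sketch: one request input per spoken line, so each clip
// contains a single speaker and a single quoted line.
function buildClipInputs(setting, exchange, shared) {
  return exchange.map(({ speaker, emotion, line }) => ({
    ...shared,
    prompt:
      `${speaker}, ${setting.shot}. ` +
      `They say, sounding ${emotion}, '${line}' ` +
      `${setting.audio}.`,
  }));
}

const clipInputs = buildClipInputs(
  { shot: "medium close up", audio: "Small office room tone, low HVAC hum" },
  [
    {
      speaker: "A woman in her forties with short silver hair wearing a navy blazer",
      emotion: "frustrated",
      line: "We lost three servers in Frankfurt overnight, and I need to know why.",
    },
    {
      speaker: "A man in his thirties with round glasses wearing a grey hoodie",
      emotion: "hesitant",
      line: "I think the cooling failed first, but I am still checking the logs.",
    },
  ],
  { resolution: "720p", duration: 8, audio: true, aspect_ratio: "16:9" }
);

console.log(clipInputs.map((input) => input.prompt));
```

Each element can then be passed as the `input` of its own generation request, and the resulting clips cut together in an editor.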

What breaks

Whispers at wide shot sizes do not sync well. Pull in to close up or switch to voiceover framing.

Accents land unevenly. A generic American accent is the safest choice.

Lines with numbers or proper nouns are slightly less reliable than conversational English. Keep it simple.
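Two of these failure modes are easy to flag before generating. The sketch below is a hypothetical lint: it warns on whispered delivery combined with a wide shot and on digits in the line. The keyword matching is a rough heuristic chosen for illustration, and it makes no attempt to detect accents or proper nouns.

```javascript
// Hypothetical lint for two of the failure modes described above.
function lintDialogue({ line, delivery, shot }) {
  const warnings = [];

  if (/whisper/i.test(delivery) && /wide/i.test(shot)) {
    warnings.push(
      "Whispered delivery at a wide shot: pull in to close up or use voiceover framing."
    );
  }
  if (/\d/.test(line)) {
    warnings.push("Line contains digits: numbers are slightly less reliable.");
  }
  return warnings;
}

const risky = lintDialogue({
  line: "Meet me at gate 42.",
  delivery: "whispered",
  shot: "wide shot",
});
const clean = lintDialogue({
  line: "I have been up since four.",
  delivery: "calm",
  shot: "medium close up",
});
console.log(risky, clean);
```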

Nail the five slots and the first generation lands more often than not.
