Audio Scenes

Audio scenes add sound to your composition: narration, music, sound effects, or any audio content. They play alongside visual scenes without affecting what's displayed.

How audio scenes work

Audio scenes are "invisible" layers. They have timing like visual scenes (start, duration) but don't render anything visible. Instead, they play audio at the specified time.

{
  type: "audio_only",
  start: 0,
  duration: 30,
  config: {
    audioUrl: "https://example.com/narration.mp3",
    fileId: "file_abc123"
  }
}
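As a rough illustration of the shape above, the sketch below types the scene object and checks whether it is playing at a given timeline position. The `AudioScene` interface and the `isPlayingAt` helper are hypothetical names, not part of the product's API. Treating `fileId` as optional and the end of the scene as exclusive are both assumptions.

```typescript
// Hypothetical type mirroring the fields shown in the example above.
interface AudioScene {
  type: "audio_only";
  start: number; // position on the timeline
  duration: number; // length on the timeline, same unit as start
  config: {
    audioUrl: string;
    fileId?: string; // assumed optional; the CLI example omits it
  };
}

// True while the scene is playing. Start is inclusive, end is exclusive (an assumption).
function isPlayingAt(scene: AudioScene, time: number): boolean {
  return time >= scene.start && time < scene.start + scene.duration;
}

const narration: AudioScene = {
  type: "audio_only",
  start: 0,
  duration: 30,
  config: {
    audioUrl: "https://example.com/narration.mp3",
    fileId: "file_abc123",
  },
};

console.log(isPlayingAt(narration, 12)); // true
console.log(isPlayingAt(narration, 30)); // false
```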

Adding audio scenes

Via AI

The most common way to add audio is text-to-speech:

Generate narration saying "Welcome to our product demo."

The assistant creates an audio scene with synthesized speech. Multiple voice options are available. Check listSpeechModels for the full list.

For existing audio files:

Add the audio file from https://example.com/background-music.mp3

Or drag an audio file directly into the chat panel.

Via CLI

Generate speech from the command line:

program tool exec generateSpeech \
  --text "Welcome to our product demo" \
  --model "eleven_labs/rachel"

Add an existing audio file:

program tool exec addScene \
  --title "Background Music" \
  --duration 30 \
  --type audio_only \
  --config '{"audioUrl": "https://example.com/music.mp3"}'

List available voice models:

program tool exec listSpeechModels

In the editor

Audio scenes appear in the timeline like any other scene. Click to select, drag to reposition, or resize to adjust duration. Audio scenes show a waveform visualization when selected.

Timeline positioning

To adjust an audio scene's position on the timeline:

Drag to move the audio to a different time
Resize to change the duration (audio is cut off if the scene is shorter than the file; a longer scene ends in silence, since audio doesn't loop)
Overlap with other scenes; audio and visuals are independent

Supported formats

Common audio formats work. MP3 is recommended for broad compatibility and reasonable file sizes.

Audio behavior

Playback

Audio scenes use Remotion's Html5Audio component:

Synchronized: audio timing matches the composition timeline exactly
Buffering: playback pauses if audio needs to load
Premounting: audio loads slightly before its start time for seamless playback
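To make the premounting idea concrete, here is a minimal sketch of when loading would begin for a scene. The helper `loadStartTime` and the `PREMOUNT_WINDOW` value are illustrative assumptions. This guide does not state the actual window, and the real logic lives inside the renderer.

```typescript
// Assumed premount window, in the same unit as a scene's start and duration.
// The real value is not documented here.
const PREMOUNT_WINDOW = 1;

// Returns the timeline position at which loading would begin,
// clamped so it never falls before the start of the composition.
function loadStartTime(
  sceneStart: number,
  premountWindow: number = PREMOUNT_WINDOW
): number {
  return Math.max(0, sceneStart - premountWindow);
}

console.log(loadStartTime(10)); // 9
console.log(loadStartTime(0.4)); // 0
```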

Volume

Audio plays at full volume (1.0) by default. Volume control is handled at the composition level or through post-processing.

Looping

Audio doesn't automatically loop. If your scene duration exceeds the audio length, there's silence after the audio ends. For background music, either:

Match the scene duration to the audio length
Use audio that's long enough for your composition
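The relationship between scene duration and audio length can be sketched as a small calculation. `planPlayback` is a hypothetical helper, not a product tool. It only restates the rule above: no looping, so a longer scene ends in silence and a shorter scene cuts the audio off.

```typescript
interface PlaybackPlan {
  audibleFor: number; // how long audio is actually heard
  silentTail: number; // silence after the audio ends
  cutOff: number; // audio that never plays because the scene ends first
}

// All values share one unit, the same one used for the scene's duration.
function planPlayback(sceneDuration: number, audioLength: number): PlaybackPlan {
  return {
    audibleFor: Math.min(sceneDuration, audioLength),
    silentTail: Math.max(0, sceneDuration - audioLength),
    cutOff: Math.max(0, audioLength - sceneDuration),
  };
}

console.log(planPlayback(30, 22)); // { audibleFor: 22, silentTail: 8, cutOff: 0 }
console.log(planPlayback(10, 22)); // { audibleFor: 10, silentTail: 0, cutOff: 12 }
```

A non-zero `silentTail` on a background music scene is a signal to shorten the scene or pick a longer track.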

Multiple audio layers

You can have multiple audio scenes playing simultaneously:

Background music (full duration)
Narration (specific segments)
Sound effects (short, timed)

They mix together in the final render.
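A layered setup like the one above might look as follows. The `Layer` type, the `activeAt` helper, and the timings are made up for illustration. The sketch only shows which layers overlap at a given moment, not how the renderer mixes them.

```typescript
// Minimal stand-in for an audio scene: just a title and its timing.
interface Layer {
  title: string;
  start: number;
  duration: number;
}

const layers: Layer[] = [
  { title: "Background music", start: 0, duration: 30 }, // full duration
  { title: "Narration", start: 2, duration: 8 }, // specific segment
  { title: "Sound effect", start: 9, duration: 1 }, // short, timed
];

// Titles of every layer playing at the given time (end treated as exclusive).
function activeAt(all: Layer[], time: number): string[] {
  return all
    .filter((l) => time >= l.start && time < l.start + l.duration)
    .map((l) => l.title);
}

console.log(activeAt(layers, 9.5)); // [ 'Background music', 'Narration', 'Sound effect' ]
console.log(activeAt(layers, 15)); // [ 'Background music' ]
```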

Speech generation

The generateSpeech tool converts text to audio:

Generate speech saying "This is the introduction to our demo"

Options:

Parameter	Description
`text`	The words to speak
`voice`	Voice model to use
`speed`	Playback speed adjustment

Available voices depend on your configured integrations. Use listSpeechModels to see options.

Best practices

Match timing to content. If your narration mentions something, time the visual scene to appear when those words are spoken.
Leave breathing room. Don't pack narration too tight. Brief pauses between sentences sound more natural.
Check levels. If combining music and narration, ensure the music doesn't overpower the voice.
Preview with audio. Always preview with audio enabled to catch timing issues.
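One way to leave breathing room is to compute start times for narration clips with a fixed pause between them. The sketch below assumes you already know each clip's length. `layoutWithPauses`, the clip texts, and the durations are hypothetical, and the resulting `start` values would still need to be applied to real audio scenes.

```typescript
interface Clip {
  text: string;
  duration: number; // length of the generated audio
}

interface PlacedClip extends Clip {
  start: number; // computed timeline position
}

// Places clips back to back, inserting `pause` between consecutive clips.
function layoutWithPauses(clips: Clip[], pause: number, firstStart = 0): PlacedClip[] {
  let cursor = firstStart;
  return clips.map((clip) => {
    const placed: PlacedClip = { ...clip, start: cursor };
    cursor += clip.duration + pause;
    return placed;
  });
}

const placed = layoutWithPauses(
  [
    { text: "Welcome to our product demo.", duration: 3 },
    { text: "Here is the dashboard.", duration: 4.5 },
    { text: "Thanks for watching.", duration: 2 },
  ],
  0.5
);

console.log(placed.map((c) => c.start)); // [ 0, 3.5, 8.5 ]
```

The same start values can then be used to time the visual scenes that each sentence refers to.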

Transcription

Have audio but need text? Use transcription:

Transcribe the audio from https://example.com/interview.mp3

The transcribeAudio tool converts speech to text, useful for creating captions or generating scenes from audio content.

For the full list of tools available for composition management, see the API Reference.