Serve real-time AI - Livepeer Docs

Real-time AI is Livepeer’s flagship workload — the live-video-to-video pipeline (also called Cascade) that powers live style transfer and generative video apps like Daydream. Unlike batch AI, your node doesn’t answer discrete requests: it continuously transforms a live WebRTC stream, frame by frame, with a ~33 ms budget per frame at 30 fps. This guide assumes a working orchestrator (see the tutorial) and familiarity with batch AI setup — the flags are the same; the runner, models, and hardware bar are not.
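
To make the per-frame budget concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration: `frame_budget_ms` and `sustains_fps` are hypothetical helper names, not part of go-livepeer or the AI runner.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, to sustain `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def sustains_fps(per_frame_ms: float, fps: float) -> bool:
    """True if a measured per-frame processing time fits the budget."""
    return per_frame_ms <= frame_budget_ms(fps)


if __name__ == "__main__":
    for fps in (15, 24, 30):
        print(f"{fps} fps: {frame_budget_ms(fps):.1f} ms per frame")
```

At 30 fps the helper gives roughly 33.3 ms, which matches the budget quoted above; any per-frame time above that means the node falls behind the stream.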

How it differs from batch AI

| | Batch AI | Real-time AI (Cascade) |
|---|---|---|
| Input / output | Discrete request → result | Continuous WebRTC stream in → transformed stream out |
| Latency target | Seconds per request | < 100 ms per frame |
| Runner image | `livepeer/ai-runner` | `livepeer/ai-runner:live-base` |
| Models | Standard diffusion, Whisper, BLIP | StreamDiffusion, ComfyUI DAGs, ControlNet |
| Min VRAM | 4–24 GB by pipeline | 24 GB recommended |

GPUs below 24 GB VRAM (RTX 3080 10 GB, RTX 3060 12 GB) are typically insufficient: model weights, ComfyStream overhead, and frame buffers together exhaust VRAM. An RTX 4090 is strongly recommended; an RTX 3090 works with less headroom; A100/H100 cards suit production multi-stream. You’ll also want 8+ CPU cores (frame encode/decode is CPU-bound) and a low-latency, low-jitter connection, because WebRTC punishes packet loss.
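
The VRAM reasoning above is a simple sum, and a rough sketch can help when sizing a card. Everything here is an assumption for illustration: `VramEstimate` and `fits` are hypothetical names, the component sizes are placeholder guesses rather than measurements, and the 10% headroom default is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class VramEstimate:
    """Rough VRAM components in GiB; every value is an operator-supplied guess."""
    weights_gib: float
    runtime_overhead_gib: float  # e.g. ComfyStream plus CUDA context
    frame_buffers_gib: float

    @property
    def total_gib(self) -> float:
        return self.weights_gib + self.runtime_overhead_gib + self.frame_buffers_gib


def fits(estimate: VramEstimate, vram_gib: float, headroom_fraction: float = 0.1) -> bool:
    """True if the estimate fits while leaving `headroom_fraction` of VRAM free."""
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    return estimate.total_gib <= vram_gib * (1 - headroom_fraction)


if __name__ == "__main__":
    # Placeholder numbers only; measure your own workflow before relying on them.
    estimate = VramEstimate(weights_gib=12.0, runtime_overhead_gib=3.0, frame_buffers_gib=2.0)
    for vram in (10, 12, 24):
        print(f"{vram} GB card fits: {fits(estimate, vram)}")
```

Replace the placeholder components with figures observed on your own node (for example from `nvidia-smi` while a stream is running) before drawing conclusions.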

The moving parts

Your orchestrator receives a WebRTC stream from a gateway, hands it to the AI runner, and streams processed frames back:

Gateway → go-livepeer AI worker → ai-runner:live-base (ComfyStream)
          → frame loop: receive → inference → emit → processed stream back

ComfyStream is the runtime inside the container — it wraps ComfyUI’s node-based workflows for continuous processing: WebRTC frame ingestion, an async frame queue, and warm-model management. StreamDiffusion is the primary model family, purpose-built for live inference (30+ fps on an RTX 4090) via frame batching, reduced-step sampling, and skipping near-identical frames.
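
The receive → inference → emit loop, together with the idea of skipping near-identical frames, can be sketched with the standard library alone. This is not ComfyStream's implementation: `frame_loop`, `mean_abs_diff`, and `skip_threshold` are hypothetical, frames are plain lists of integers standing in for pixel data, and mean absolute difference is just one simple stand-in for a similarity measure.

```python
import asyncio
from typing import Callable, List, Optional, Tuple

Frame = List[int]  # stand-in for decoded pixel data


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-value difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / max(len(a), 1)


async def frame_loop(
    inbound: asyncio.Queue,
    outbound: asyncio.Queue,
    infer: Callable[[Frame], Frame],
    skip_threshold: float = 1.0,
) -> int:
    """Receive -> inference -> emit until a None sentinel arrives.

    Returns how many times `infer` was actually called.
    """
    last_in: Optional[Frame] = None
    last_out: Optional[Frame] = None
    inferences = 0
    while True:
        frame = await inbound.get()
        if frame is None:  # end of stream
            await outbound.put(None)
            return inferences
        if last_in is not None and mean_abs_diff(frame, last_in) < skip_threshold:
            out = last_out  # near-identical to the last inferred frame: reuse its output
        else:
            out = infer(frame)
            inferences += 1
            last_in, last_out = frame, out
        await outbound.put(out)


async def demo() -> Tuple[int, List[Frame]]:
    inbound: asyncio.Queue = asyncio.Queue(maxsize=4)  # bounded async frame queue
    outbound: asyncio.Queue = asyncio.Queue()
    for frame in ([10, 10, 10], [10, 10, 10], [200, 200, 200]):
        await inbound.put(frame)
    await inbound.put(None)
    count = await frame_loop(inbound, outbound, infer=lambda f: [255 - v for v in f])
    outputs = []
    while (item := outbound.get_nowait()) is not None:
        outputs.append(item)
    return count, outputs


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo the second frame is identical to the first, so inference runs twice for three frames. A real loop would run GPU inference off the event loop and handle backpressure when the inbound queue fills; both are omitted here for brevity.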

Set it up

Verify GPU access

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

Both must show your GPU. CUDA 12.0+ and the NVIDIA Container Toolkit are required.
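
If you prefer a scripted check, `nvidia-smi` can emit CSV that is easy to parse. The sketch below is one possible approach, not an official tool: `parse_gpus`, `gpus_meeting_bar`, and the 24,000 MiB threshold are assumptions (the threshold sits slightly under 24 GiB because cards marketed as 24 GB report a little less than 24,576 MiB).

```python
import subprocess
from typing import List, Tuple

# Ask nvidia-smi for name and total memory (MiB) as headerless CSV.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"]


def parse_gpus(csv_text: str) -> List[Tuple[str, int]]:
    """Parse lines such as 'NVIDIA GeForce RTX 4090, 24564' into (name, MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def gpus_meeting_bar(gpus: List[Tuple[str, int]], min_vram_mib: int = 24_000) -> List[str]:
    """Names of GPUs whose total memory is at least `min_vram_mib`."""
    return [name for name, mem in gpus if mem >= min_vram_mib]


def query_gpus() -> List[Tuple[str, int]]:
    """Run nvidia-smi on this host; raises if the command is missing or fails."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpus(result.stdout)


if __name__ == "__main__":
    found = query_gpus()
    print("GPUs:", found)
    print("Meet the 24 GB bar:", gpus_meeting_bar(found))
```

The parsing functions are separate from `query_gpus` so they can be exercised on a machine without a GPU.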

Pull the live-base runner

docker pull livepeer/ai-runner:live-base

This is a separate image from the batch livepeer/ai-runner — it bundles ComfyStream and its dependencies.

Configure aiModels.json

Add a live-video-to-video entry:

[
  {
    "pipeline": "live-video-to-video",
    "model_id": "streamdiffusion",
    "price_per_unit": 500,
    "warm": true
  }
]

For live pipelines, model_id names the ComfyUI workflow/pipeline, not a Hugging Face model — the models load inside the ComfyStream container. Your price_per_unit must be at or below the gateway’s -maxPricePerCapability for this pipeline — a hard filter, regardless of your hardware.
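
A quick pre-flight check of the file can catch a cold entry or a price above the cap before the node starts. This is a hedged sketch: `check_live_entries` is a hypothetical helper, the set of required keys is an assumption based only on the example above, and `gateway_max_price` is a value you supply to stand in for a gateway's -maxPricePerCapability setting.

```python
import json
from typing import List

LIVE_PIPELINE = "live-video-to-video"
REQUIRED_KEYS = {"pipeline", "model_id", "price_per_unit"}  # assumed minimum, taken from the example


def check_live_entries(config_text: str, gateway_max_price: float) -> List[str]:
    """Return human-readable problems found in aiModels.json content."""
    entries = json.loads(config_text)
    if not isinstance(entries, list):
        return ["top level of aiModels.json must be a JSON array"]
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        if entry["pipeline"] != LIVE_PIPELINE:
            continue  # only live entries are checked here
        if entry.get("warm") is not True:
            problems.append(f'entry {i}: live pipelines should set "warm": true')
        if entry["price_per_unit"] > gateway_max_price:
            problems.append(
                f"entry {i}: price_per_unit {entry['price_per_unit']} "
                f"exceeds gateway cap {gateway_max_price}"
            )
    return problems


SAMPLE = """
[
  {"pipeline": "live-video-to-video", "model_id": "streamdiffusion",
   "price_per_unit": 500, "warm": true}
]
"""

if __name__ == "__main__":
    print(check_live_entries(SAMPLE, gateway_max_price=600))
```

An empty list means the entry passed these particular checks; it does not guarantee that a gateway will select the node.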

Download model weights

Weights must exist before the container starts:

git clone https://github.com/livepeer/comfystream
cd comfystream
pip install -r requirements.txt
python scripts/download_models.py

Download into the directory your node exposes via -aiModelsDir.

Start with the AI worker flags

The same three flags as batch AI — -aiWorker, -aiModels, -aiModelsDir — with your live entry in aiModels.json. Run with -v 6 initially to watch frame-loop activity in the logs.
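
If you launch the node from a script, the AI-related portion of the argument list can be assembled as below. Only the flags named in this guide are used; `ai_worker_flags` is a hypothetical helper, the paths in the demo are placeholders, and the rest of your orchestrator's flags (from your existing setup) still need to be added around them.

```python
from typing import List, Optional


def ai_worker_flags(models_json: str, models_dir: str, verbosity: Optional[int] = None) -> List[str]:
    """Build only the AI worker flags described in this guide."""
    flags = ["-aiWorker", "-aiModels", models_json, "-aiModelsDir", models_dir]
    if verbosity is not None:
        flags += ["-v", str(verbosity)]
    return flags


if __name__ == "__main__":
    # Placeholder paths; substitute the locations used on your node.
    print(" ".join(ai_worker_flags("/path/to/aiModels.json", "/path/to/models", verbosity=6)))
```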

Verify registration

Your node should appear under live-video-to-video with Warm status at tools.livepeer.cloud/ai/network-capabilities. Test end-to-end with the utilities in the ComfyStream repo.

Tuning for the frame budget

Everything about Cascade operation is a fight for milliseconds:

Keep models warm. A cold model load mid-stream is a dropped stream. "warm": true is not optional here.
Trade steps for latency. StreamDiffusion workflows can run with as few as 2 inference steps; the quality-versus-latency trade-off is tunable in the workflow definition.
Watch VRAM headroom. Frame buffers ride on top of model weights — a node that fits batch models exactly will OOM on live streams.
Dedicate the GPU. Sharing a Cascade GPU with transcoding or batch AI undermines the latency that gets you selected.
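
The steps-versus-latency trade can be reasoned about with a simple linear model: per-frame time is a fixed overhead plus a cost per inference step. That model, the helper name `max_steps_within_budget`, and the timings in the demo are all assumptions for illustration; measure your own workflow to get real numbers.

```python
import math
from typing import Optional


def max_steps_within_budget(
    budget_ms: float,
    per_step_ms: float,
    overhead_ms: float,
    min_steps: int = 1,
) -> Optional[int]:
    """Largest step count whose estimated frame time fits the budget.

    Assumes frame_time = overhead_ms + steps * per_step_ms.
    Returns None if even `min_steps` does not fit.
    """
    if per_step_ms <= 0:
        raise ValueError("per_step_ms must be positive")
    steps = math.floor((budget_ms - overhead_ms) / per_step_ms)
    return steps if steps >= min_steps else None


if __name__ == "__main__":
    budget = 1000.0 / 30  # about 33.3 ms per frame at 30 fps
    # Placeholder timings, not benchmarks.
    print(max_steps_within_budget(budget, per_step_ms=8.0, overhead_ms=10.0))
```

With these placeholder timings the answer is 2 steps; a slower per-step time or a higher fixed overhead pushes the result toward `None`, which signals that the target frame rate is out of reach on that hardware.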

Going further

Operators who want custom live pipelines (beyond ComfyUI workflows) can implement the AI runner’s Python Pipeline interface and package it as an image extending ai-runner:live-base — see the ai-runner docs and the Pipeline interface. Note that custom pipelines currently require small upstream additions (the pipeline registry is a static mapping); a dynamic plugin architecture is in progress.

Related

Add AI inference (batch)

The batch pipelines — start here if you haven’t run AI yet.

Hardware & GPU support

VRAM requirements across workloads.

Monitor your orchestrator

Watch the AI runner container and frame latency.

Set pricing

Price per capability, and gateway price caps.

