Vss Summarize Video

Use to summarize a recorded video via the LVS summarization microservice (HITL-gated) with a VLM fallback. Not for report generation or live RTSP captioning.

Published by @NVIDIA·0 agent reads / 30d·0 saves·

Instructions

Follow the routing tables and step-by-step workflows below. Each section that ends in workflow, quick start, or flow is intended to be executed top-to-bottom. Detailed reference material lives in references/.

Examples

Worked end-to-end examples are kept under evals/ (each *.json manifest contains a runnable scenario) and inline in the per-workflow curl blocks below. Run a Tier-3 evaluation with nv-base validate <this-skill-dir> --agent-eval to replay them.

Call the VLM NIM or the video summarization microservice directly. Always run curl commands yourself; never instruct the user to run them.

Primary video workflow query type: "Summarize this video." Direct video summarization API and service-ops requests are handled by the reference-routed sections below.

Purpose

Produce a single, polished narrative summary of one recorded video clip, with timestamped events when the LVS microservice path is reachable.

Do NOT use this skill for:

  • Live RTSP captioning — use vss-deploy-dense-captioning.
  • Report generation, including incident or alert-window reports — use vss-generate-video-report Mode B.
  • Semantic search across the archive — use vss-search-archive.

Prerequisites

  • VSS lvs profile running on $HOST_IP (port 38111) OR a reachable VLM/RT-VLM endpoint as a fallback. The vss-deploy-profile skill brings these up.
  • Network reachability from the agent host to both endpoints; clip URLs from VIOS must be fetchable by the chosen backend.
  • jq and curl available on the agent host.

Limitations

  • Direct VLM fallback uses a single fixed prompt and cannot target scenario/events — output quality is lower than the LVS path.
  • Remote VLM endpoints generally cannot reach localhost/private clip URLs.
  • One backend call per request; no parallel hedging or multi-pass summaries.

Troubleshooting

SymptomCauseFix
/v1/ready returns 503 repeatedlyLVS service still warming upRetry up to ~30 s as shown in Setup; if it never returns 200 the service may not be deployed
Empty video_summary and eventsClip does not contain the requested eventsRe-run with broader scenario or different events
VLM returns <think> blockCosmos Reason 2 reasoning modeStrip everything up to </think> before rendering
Empty stdout from curl /v1/readyService legitimately returns 200 with empty bodyAlways check HTTP status with -o /dev/null -w '%{http_code}', never inspect the body

See references/video-summarization-debugging.md for deeper diagnostics.

Reference Map

Use these references only when the user asks for the relevant detail, or when the core workflow below needs deeper video summarization information:

  • video summarization API details: references/video-summarization-api.md for /v1/summarize, /summarize, /v1/generate_captions, /v1/stream_summarize, health probes, /models, /recommended_config, /metrics, request fields, response shapes, and API gotchas.
  • video summarization service configuration and ops: references/video-summarization-deployment.md for the VSS lvs profile, ports, required env vars, logs, status, dry-runs, teardown, model/backend swaps, Elasticsearch/Neo4j/ArangoDB backend selection, and service-level troubleshooting.
  • Extended video summarization ops references: references/video-summarization-environment-variables.md, references/video-summarization-debugging.md, and assets/video-summarization.env.example.

Load video-summarization-api.md only when you need a request field, response shape, or endpoint that is not already covered by the Step 2 LVS or fallback VLM example below, or when handling a direct video summarization API request. Load video-summarization-deployment.md only for deployment, configuration, or service operations.

Video Summarization API And Service Ops Requests

If the user asks to call or debug video summarization endpoints directly, answer from references/video-summarization-api.md instead of running the end-to-end video summarization workflow. Examples: list video summarization models, check readiness, get recommended chunking config, inspect metrics, explain a 422 response, or build a /v1/summarize request body.

If the user asks to configure, deploy, restart, tear down, or troubleshoot the video summarization service, prefer the vss-deploy-profile skill for full VSS profile deployment and use references/video-summarization-deployment.md for video summarization-specific service details.

Routing

Decide purely from video summarization service availability (probed in Setup → Availability checks below). Duration does not drive routing.

/v1/readyBackendEndpoint
HTTP 200LVS microservice with HITLPOST ${LVS_BACKEND_URL}/v1/summarize
Anything elseVLM / RT-VLM with the default prompt + fallback notePOST ${VLM_BASE_URL}/v1/chat/completions

Fallback message when the LVS service is unreachable — copy verbatim above the summary:

Note: Input video <name> is <N>s long. The video summarization service is not deployed, so this summary was produced by the VLM alone with a generic default prompt. Deploy the lvs profile for higher-quality summaries with scenario/events targeting.

Deployment prerequisite

The VSS lvs profile on $HOST_IP is the primary backend. If the /v1/ready probe (see Setup → Availability checks) returns anything other than 200 after the warmup retries, ask the user:

"The VSS lvs profile isn't running on $HOST_IP. Shall I deploy it now using the /vss-deploy-profile skill with -p lvs? Reply no to summarize with the VLM-only fallback instead (lower quality, no scenario/events targeting)."

  • Yes → hand off to /vss-deploy-profile, then re-probe and continue with Step 2 (LVS + HITL).
  • No → go straight to Step 2 fallback (VLM with default prompt) and prepend the Routing fallback note. Do not ask again, and do not run scenario/events HITL.
  • Pre-authorized to deploy autonomously (caller said so explicitly) → skip the confirmation and invoke /vss-deploy-profile directly.
  • Pre-authorized to use VLM fallback ("skip lvs, just use the VLM") → go straight to Step 2 fallback without prompting.

Setup

Endpoints (defaults for a local VSS lvs deployment):

  • VLM / RT-VLM: ${VLM_BASE_URL} — default ${RTVI_VLM_BASE_URL:-http://${HOST_IP:-localhost}:8018}
  • LVS service: ${LVS_BACKEND_URL} — default http://${HOST_IP:-localhost}:38111
  • VIOS: owned by vss-manage-video-io-storage; refer there.

Use env vars when set (strip trailing /v1 from the VLM base — the skill appends it). Otherwise use the defaults. If neither works, ask the user — do not scan ports or read config files to guess.

Model name: read ${VLM_NAME} (default nim_nvidia_cosmos-reason2-8b_hf-1208). It must match the id RT-VLM /v1/models advertises; do not substitute the friendly nvidia/cosmos-reason2-8b.

For endpoint schemas, optional fields, response envelopes, and error handling, see references/video-summarization-api.md.

Availability checks (run both before routing). Readiness is determined by the HTTP status code only — the LVS /v1/ready may legitimately return 200 with an empty body, so do not inspect the body.

VLM="${VLM_BASE_URL:-${RTVI_VLM_BASE_URL:-http://${HOST_IP:-localhost}:8018}}"
VLM="${VLM%/v1}"

# VLM / RT-VLM: 200 on /v1/models
vlm_code=$(curl -s -o /dev/null -w '%{http_code}' --connect-timeout 3 --max-time 10 \
  "$VLM/v1/models")
[ "$vlm_code" = "200" ] && echo "VLM OK" || echo "VLM not reachable (HTTP $vlm_code)"

# Video summarization service: 200 on /v1/ready, with retry on 503 (warmup) for up to ~30s
VIDEO_SUMMARIZATION_URL=${LVS_BACKEND_URL:-http://${HOST_IP:-localhost}:38111}
video_sum_code=000
for i in $(seq 1 10); do
  video_sum_code=$(curl -s -o /dev/null -w '%{http_code}' --connect-timeout 3 --max-time 10 "$VIDEO_SUMMARIZATION_URL/v1/ready")
  case "$video_sum_code" in
    200) echo "video summarization OK"; break ;;
    503) sleep 3 ;;                 # warming up; keep polling
    *)   break ;;                   # any other code = not reachable, stop retrying
  esac
done
[ "$video_sum_code" = "200" ] || echo "video summarization service not reachable (HTTP $video_sum_code)"

How to interpret the results:

  • video_sum_code = 200Step 2 (LVS + HITL) for every video.
  • video_sum_code != 200, vlm_code = 200Step 2 fallback (VLM); prepend the Routing fallback note.
  • vlm_code != 200 → fail; at least one backend must be reachable.
  • A non-200 LVS code after the retry loop is the ONLY signal of unavailability. Empty stdout or missing JSON fields are NOT "unavailable."

Step 1 - Get the clip URL via vss-manage-video-io-storage (sub-task, NOT the final answer)

Use the vss-manage-video-io-storage skill for all VIOS interactions — it owns the canonical curl recipes, parameter defaults, and delete/upload flows. Do not fabricate URLs or hand-roll VIOS calls; they will drift.

This step is a sub-task — do NOT end your turn here; do NOT return the clip URL as the final answer. From VIOS collect three values:

  1. streamId (via sensor/listsensor/<id>/streams, or directly from an upload response).
  2. Timeline - {startTime, endTime} (ISO 8601 UTC). endTime - startTime is the duration; needed only for the user-facing header (routing is driven solely by /v1/ready).
  3. Temporary MP4 clip URL — the /storage/file/<streamId>/url variant with container=mp4. Response field: .videoUrl. Both backends need an HTTP(S) URL they can GET.

Everything else (auth, upload, disableAudio, expiry, etc.) lives in the vss-manage-video-io-storage skill — refer users there if VIOS fails.


Step 2 — Primary: video summarization microservice with HITL

Use this path whenever /v1/ready returned 200 in Setup. Duration is irrelevant.

For advanced fields (media_info, schema, structured output, stream captioning, metrics, recommended config) see references/video-summarization-api.md.

HITL: collect scenario and events first (REQUIRED — do not skip)

Full walk-through is in references/hitl-prompts.md. Always run HITL before calling the LVS service.

Autonomous-mode defaults. When the caller has bypassed HITL ("run autonomously without prompting") AND the original query asks for default/defaults (or gives none), use scenario="activity monitoring" and events=["notable activity"] verbatim — do not infer from filename or sensor name. Note the defaults in the final reply and offer a re-run with more specific parameters. This is the ONLY supported HITL bypass; "the video is short" or "the user seems in a hurry" are not valid reasons.

Prefer POST /v1/summarize (3.2 GA route); /summarize is a compatibility alias.

VIDEO_SUMMARIZATION_URL=${LVS_BACKEND_URL:-http://${HOST_IP:-localhost}:38111}

# From HITL reply:
SCENARIO='warehouse monitoring'
EVENTS_JSON='["notable activity"]'
OBJECTS_JSON=''  # '' to omit, else '["forklifts","pallets","workers"]'

curl -s --max-time 300 -X POST "$VIDEO_SUMMARIZATION_URL/v1/summarize" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg url "<clip_url_from_vss_manage_video_io_storage>" \
        --arg model "${VLM_NAME:-nim_nvidia_cosmos-reason2-8b_hf-1208}" \
        --arg scenario "$SCENARIO" \
        --argjson events "$EVENTS_JSON" \
        --argjson objects "${OBJECTS_JSON:-null}" '{
    url: $url,
    model: $model,
    scenario: $scenario,
    events: $events,
    chunk_duration: 10,
    num_frames_per_second_or_fixed_frames_chunk: 20,
    use_fps_for_chunking: false,
    seed: 1
  } + (if $objects == null then {} else {objects_of_interest: $objects} end)')" \
  | jq -r '.choices[0].message.content' \
  | jq '{video_summary, events}'

If both video_summary and events are empty, the clip probably doesn't contain the requested events — re-run with broader scenario/events, don't report "no content".

Tuning: chunk_duration (default 10s; 0 = single chunk), num_frames_per_second_or_fixed_frames_chunk (default 20; meaning depends on use_fps_for_chunking), seed (default 1). num_frames_per_chunk is deprecated.


Step 2 fallback — VLM direct with default prompt

Use this path only when /v1/ready did not return 200 after warmup. Do NOT run HITL — the user did not opt in; you fell back because the service was missing. Prepend the Routing fallback note to the response.

VLM="${VLM_BASE_URL:-${RTVI_VLM_BASE_URL:-http://${HOST_IP:-localhost}:8018}}"
VLM="${VLM%/v1}"
PROMPT='Describe in detail what is happening in this video,
including all visible people, vehicles, equipments, objects,
actions, and environmental conditions.
OUTPUT REQUIREMENTS:
[timestamp-timestamp] Description of what is happening.
EXAMPLE:
[0.0s-4.0s] <description of the first event>
[4.0s-12.0s] <description of the second event>'

curl -s --max-time 300 -X POST "$VLM/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$(jq -n \
        --arg model "${VLM_NAME:-nim_nvidia_cosmos-reason2-8b_hf-1208}" \
        --arg text "$PROMPT" \
        --arg url "<clip_url_from_vss_manage_video_io_storage>" \
        '{
          model: $model,
          temperature: 0.0,
          max_tokens: 1024,
          messages: [{
            role: "user",
            content: [
              {type: "text", text: $text},
              {type: "video_url", video_url: {url: $url}}
            ]
          }]
        }')" | jq -r '.choices[0].message.content'

Response: standard OpenAI chat-completion envelope. The summary is in choices[0].message.content.

Cosmos-model notes: Cosmos Reason 2 supports reasoning via <think>...</think><answer>...</answer> blocks. Omit the reasoning instructions if you want a plain summary. Frame sampling and pixel limits are applied server-side; no client-side prep is required when you pass a video_url.


End-to-end example

See references/end-to-end-example.md for the full LVS-or-VLM-fallback script that probes /v1/ready and runs the appropriate path.


Responses

  • VLM returns an OpenAI chat-completion envelope; summary is choices[0].message.content.
  • LVS service returns the same envelope but content is a JSON string — run jq -r '.choices[0].message.content' | jq to reach {video_summary, events}.
  • Errors surface as HTTP non-2xx plus JSON {error: ...}. LVS 503 usually means warmup — retry /v1/ready.

Presenting the output to the user

Surface backend output with minimal transformation — do not paraphrase, re-voice, add emojis, or reformat. One backend call → one rendering: no parallel hedging, no duplicate headers, never call both LVS and VLM for the same video.

Header line. Start with exactly one:

Summary of <video_name> (<duration>)

<duration> = Ns for < 60 s, else Mm Ss (e.g. 3m 30s).

LVS output: render video_summary verbatim (polished, tone-controlled report — rewriting loses fidelity). Render each events entry with its start_time, end_time, type, and full description verbatim (table when the client renders one cleanly, otherwise a per-event list). You MAY add a one-line header and a closing offer to re-run with different parameters.

VLM output: render choices[0].message.content verbatim. If the model produced <think>…</think><answer>…</answer> blocks, drop the <think> block and show the answer.

Fallback warning (when applicable) goes above the summary, never mixed into it.

Tips

  • Route by service availability, not by duration. Probe /v1/ready once in Setup; HTTP 200 → LVS+HITL for every clip; anything else → VLM fallback.
  • HITL is mandatory on the LVS path. The defaults opt-in is the only sanctioned bypass. The VLM fallback path is silent (no HITL).
  • Readiness = HTTP 200 on /v1/ready. Nothing else. Body may be empty. Always use curl -s -o /dev/null -w '%{http_code}' — never pipe through jq/grep/head.
  • Delegate VIOS to vss-manage-video-io-storage — it is a sub-task; the final answer is the Step 2 summary, not the clip URL.
  • jq twice for LVS output. First unwraps the OpenAI envelope, second parses the JSON string inside content.
  • Prefer /v1/summarize for 3.2 GA; /summarize is a compatibility alias.
  • Use the exact VLM model id advertised by the endpoint (default nim_nvidia_cosmos-reason2-8b_hf-1208).
  • Render output verbatim — no paraphrasing, no reformatting, no rewriting the video_summary or choices[0].message.content.
  • One call, one render. No parallel hedging, no double renderings.

Cross-reference

  • vss-deploy-profile — bring up the base (VLM only) or lvs (VLM + video summarization service) profile
  • vss-manage-video-io-storage (VIOS API) — upload videos, list streams, get clip URLs
  • vss-search-archive — semantic search across the archive (different profile)
  • vss-query-analytics — query incidents/events from Elasticsearch
  • video summarization API referencereferences/video-summarization-api.md
  • video summarization service ops referencereferences/video-summarization-deployment.md

bump:2

Bundled with this artifact

13 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.

More on the bench

SKILL0

Whisper

OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.

data-science-ml+2
0
SKILL0

Guidance

Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework

ai-prompt-engineering+2
0
SKILL0

Pinecone

Managed vector database for production AI applications. Fully managed, auto-scaling, with hybrid search (dense + sparse), metadata filtering, and namespaces. Low latency (<100ms p95). Use for production RAG, recommendation systems, or semantic search at scale. Best for serverless, managed infrastructure.

data-science-ml+2
0