NeMo Evaluator Plugin

Use when working on the Evaluator plugin CLI, jobs, SDK-backed specs, metric types, or plugin-owned Evaluator skills.

Published by @NVIDIA

Evaluator Plugin

Use this skill for evaluation tasks against a running NeMo Platform server. The plugin-backed CLI interface is nemo evaluator; the legacy generated nemo evaluation API command group is not the target surface for new guidance.

CLI Interface

Prerequisites

  • all commands in this file assume that the shell's working directory is the root of the Nvidia-NeMo/nemo-platform repo
  • activate the Python virtual environment before invoking the nemo CLI: source .venv/bin/activate

Check plugin status from the CLI:

nemo evaluator info

Metric Types

Explore Available Metrics

To view available metric names, run:

nemo evaluator metric-types

To view a specific metric schema, pass a metric name from the metric_types list above:

nemo evaluator metric-types <metric-name>

Inspect all the registered metric schema contracts:

nemo evaluator evaluate explain

Note: use nemo evaluator evaluate explain as the source of truth for the current plugin input schema. It returns a large JSON schema, so strongly prefer nemo evaluator metric-types when you only need metric names and their schemas.

Evaluation Spec

An evaluation spec is the payload passed to the CLI as input to run an evaluation.

At a high level, a spec describes:

  • metrics: bundled Evaluator SDK metric configurations
  • dataset: inline rows to evaluate, or a platform FilesetRef that contains the dataset
  • params: optional Evaluator SDK execution parameters
  • target: optional model or agent target for online evaluation

See the LLM-judge spec example at assets/specs/llm_as_judge.json.
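
As a rough illustration of the top-level shape described above, the following Python sketch checks only the section names of a spec dictionary. The helper name check_spec_sections and the placeholder values are hypothetical, and treating metrics and dataset as required is an assumption drawn from the fact that only params and target are described as optional. The real field contents come from nemo evaluator evaluate explain and the checked-in examples.

```python
# Hypothetical helper: validates only the top-level section names listed above.
# Assumption: metrics and dataset are required; params and target are optional.
REQUIRED_SECTIONS = {"metrics", "dataset"}
OPTIONAL_SECTIONS = {"params", "target"}


def check_spec_sections(spec: dict) -> list[str]:
    """Return a list of problems with the spec's top-level keys (empty if none)."""
    problems = []
    keys = set(spec)
    for name in sorted(REQUIRED_SECTIONS - keys):
        problems.append(f"missing section: {name}")
    for name in sorted(keys - REQUIRED_SECTIONS - OPTIONAL_SECTIONS):
        problems.append(f"unexpected section: {name}")
    return problems


# Placeholder values only; real metric payloads are generated bundles.
example_spec = {"metrics": [], "dataset": {}}
print(check_spec_sections(example_spec))
print(check_spec_sections({"metrics": [], "targets": {}}))
```

A check like this catches misspelled section names early, but it is not a substitute for the schema returned by the plugin.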

Metric Bundle Payloads

The checked-in spec examples use bundled SDK metrics. The fields under metrics[*].payload are generated by bundle_metric(metric, CloudpickleMetricBundlePackager()).

To see the pattern for configuring a pre-defined SDK metric, for example ExactMatchMetric, and converting it into bundled metric JSON, inspect build_metric_bundle_example() in generate_example_specs.py and run:

uv run --frozen python skills/nemo-evaluator-plugin/scripts/generate_example_specs.py

Run Evaluations

Run Using File Spec Reference

When using the nemo evaluator evaluate run command, results are saved into local temporary directories and the link is printed to stdout. Prefer the --spec-file named argument over inline shell JSON because metric bundles include serialized payloads. Examples of various specs are provided in the assets/specs directory.

Evaluate using exact-match metric

See the spec example at assets/specs/exact_match_metric.json.

nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json

Evaluate using a benchmark metric set

nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_benchmark.json

Evaluate using LLM-Judge metric

Uses an LLM to score responses. See the spec example at assets/specs/llm_as_judge.json.

nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/llm_as_judge.json
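
When a spec is produced by a script rather than checked in, the same file-based approach can be kept by writing the spec to disk and passing its path. The sketch below is one possible way to do that from Python. The helper names build_run_command and write_spec, the file name spec.json, and the placeholder spec are hypothetical; only the nemo evaluator evaluate run --spec-file invocation comes from this document.

```python
import json
import shlex
import tempfile
from pathlib import Path


def build_run_command(spec_path: Path) -> list[str]:
    """Build the argv list for `nemo evaluator evaluate run --spec-file <path>`."""
    return ["nemo", "evaluator", "evaluate", "run", "--spec-file", str(spec_path)]


def write_spec(spec: dict, directory: Path) -> Path:
    """Serialize a spec to a JSON file so it never has to be quoted inline in the shell."""
    path = directory / "spec.json"  # hypothetical file name
    path.write_text(json.dumps(spec, indent=2), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder spec; real specs live under assets/specs.
    spec_file = write_spec({"metrics": [], "dataset": {}}, Path(tmp))
    command = build_run_command(spec_file)
    print(shlex.join(command))
    # To actually run it (requires the nemo CLI and a running server):
    # subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting entirely, which matters because serialized metric payloads are awkward to escape by hand.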

Run Evaluation As A Durable Job

Use the nemo evaluator evaluate submit command to create a durable evaluation job. This command returns a job handler object instead of the evaluation result.

nemo evaluator evaluate submit \
  --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json

The submit response includes the generated job's name field, for example nemo-evaluator-zlhn1ecd. Wait for the job to complete, then list and download the job results.

nemo jobs get-status <job-name>
nemo jobs get <job-name>
nemo jobs results list <job-name>
nemo jobs results download aggregate-scores --job <job-name> --output-file aggregate-scores.json
nemo jobs results download row-scores --job <job-name> --output-file row-scores.jsonl
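
Once both files are downloaded, they can be inspected from Python. The sketch below assumes, based only on the file extensions used above, that aggregate-scores.json is a single JSON document and row-scores.jsonl holds one JSON value per line. The helper names are hypothetical, and the fields inside each file are not specified here, so the code loads the data without interpreting it.

```python
import json
from pathlib import Path


def load_aggregate_scores(path: Path):
    """Parse the downloaded aggregate-scores file as a single JSON document."""
    return json.loads(path.read_text(encoding="utf-8"))


def load_row_scores(path: Path) -> list:
    """Parse the downloaded row-scores file as JSON Lines (one JSON value per line)."""
    rows = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                rows.append(json.loads(line))
    return rows


# File names match the --output-file values used in the commands above.
aggregate_path = Path("aggregate-scores.json")
rows_path = Path("row-scores.jsonl")
if aggregate_path.exists() and rows_path.exists():
    print(load_aggregate_scores(aggregate_path))
    print(f"{len(load_row_scores(rows_path))} scored rows")
```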

Python SDK Interface

The Evaluator Python SDK client is exposed as the evaluator attribute on a NeMoPlatform instance:

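from nemo_platform import NeMoPlatform

# Connect to a running NeMo Platform server; adjust base_url for your deployment.
platform_client = NeMoPlatform(base_url="http://localhost:8080")
# The Evaluator client is reached through the evaluator attribute.
status = platform_client.evaluator.plugin_status()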

See examples of using the plugin SDK interface in plugin_sdk_examples.py.

Security

Do not print secrets to stdout, since stdout can be collected as logs.
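
One way to follow this rule in scripts is to mask sensitive values before anything is printed. The sketch below is a minimal illustration, not part of the plugin: the redact helper, the marker list, and the example configuration are all hypothetical, and matching on key names is a heuristic that will miss secrets stored under unrelated names.

```python
# Hypothetical marker list: key names containing any of these are treated as secrets.
SENSITIVE_MARKERS = ("key", "token", "secret", "password")


def redact(settings: dict) -> dict:
    """Return a copy safe to print: values of sensitive-looking keys are masked."""
    return {
        name: "***" if any(marker in name.lower() for marker in SENSITIVE_MARKERS) else value
        for name, value in settings.items()
    }


# Example values are placeholders, not real credentials.
config = {"base_url": "http://localhost:8080", "api_key": "not-a-real-key"}
print(redact(config))
```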

Additional Resources

For LLM-judge setup notes, see LLM Judge Notes.

For evaluator API key auth, see Evaluator API Auth.

For local and cluster troubleshooting, see Evaluation Troubleshooting.
