Evaluator Plugin

Use this skill for evaluation tasks against a running NeMo Platform server. The plugin-backed CLI interface is nemo evaluator; the legacy generated nemo evaluation API command group is not the target surface for new guidance.

CLI Interface

Prerequisites

All commands in this file assume the shell's working directory is the root of the Nvidia-NeMo/nemo-platform repo.
Activate the Python virtual environment before invoking the nemo CLI: source .venv/bin/activate

Check plugin status from the CLI:

nemo evaluator info

Metric Types

Explore Available Metrics

To view available metric names, run:

nemo evaluator metric-types

To view a specific metric schema, pass a metric name from the metric_types list above:

nemo evaluator metric-types <metric-name>

Inspect all the registered metric schema contracts:

nemo evaluator evaluate explain

Note: nemo evaluator evaluate explain is the source of truth for the current plugin input schema. It returns a large JSON schema response, so strongly prefer nemo evaluator metric-types when you only need metric names and their schemas.

Evaluation Spec

An evaluation spec is the payload passed to the CLI as input to run an evaluation.

At a high level, a spec describes:

metrics: bundled Evaluator SDK metric configurations
dataset: inline rows to evaluate, or a platform FilesetRef that contains the dataset
params: optional Evaluator SDK execution parameters
target: optional model or agent target for online evaluation
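
As a rough illustration of that shape, the sketch below checks a spec loaded from JSON for the sections listed above. It assumes those names are the literal top-level JSON keys, and that metrics and dataset are required while params and target are optional. summarize_spec is a hypothetical helper, not part of the plugin; nemo evaluator evaluate explain remains the authoritative schema.

```python
import json

# Assumption: these names are the literal top-level keys of a spec file.
REQUIRED_KEYS = ("metrics", "dataset")
OPTIONAL_KEYS = ("params", "target")


def summarize_spec(spec: dict) -> dict:
    """Report which high-level sections a spec contains.

    Raises ValueError if a section assumed to be required is missing.
    """
    missing = [key for key in REQUIRED_KEYS if key not in spec]
    if missing:
        raise ValueError(f"spec is missing required keys: {missing}")
    return {key: key in spec for key in REQUIRED_KEYS + OPTIONAL_KEYS}


if __name__ == "__main__":
    # Placeholder values only; real metric entries are bundled SDK payloads.
    example = json.loads('{"metrics": [], "dataset": {}, "params": {}}')
    print(summarize_spec(example))
```

A check like this only catches a missing section early; it says nothing about whether the contents of each section are valid.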

See the LLM-judge spec example at assets/specs/llm_as_judge.json.

Metric Bundle Payloads

The checked-in spec examples use bundled SDK metrics. The fields under metrics[*].payload are generated by bundle_metric(metric, CloudpickleMetricBundlePackager()).

To see the pattern for configuring a pre-defined SDK metric, for example ExactMatchMetric, and converting it into bundled metric JSON, inspect build_metric_bundle_example() in generate_example_specs.py and run:

uv run --frozen python skills/nemo-evaluator-plugin/scripts/generate_example_specs.py

Run Evaluations

Run Using File Spec Reference

The nemo evaluator evaluate run command saves results to local temporary directories and prints a link to them on stdout. Prefer the --spec-file named argument over inline shell JSON, because metric bundles include serialized payloads that are awkward to quote in a shell. Example specs are provided in the assets/specs directory.
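
When a spec is generated in Python rather than checked in, one way to follow the same advice is to write it to a file and pass the path. The sketch below is a minimal illustration under that assumption; write_spec_file and build_run_command are hypothetical helpers, and the placeholder spec is not a valid evaluation spec.

```python
import json
import tempfile
from pathlib import Path


def write_spec_file(spec: dict, directory: str) -> Path:
    """Serialize a spec dict to <directory>/spec.json and return the path."""
    path = Path(directory) / "spec.json"
    path.write_text(json.dumps(spec, indent=2), encoding="utf-8")
    return path


def build_run_command(spec_path: Path) -> list:
    """Build the argv for `nemo evaluator evaluate run --spec-file <path>`."""
    return ["nemo", "evaluator", "evaluate", "run", "--spec-file", str(spec_path)]


if __name__ == "__main__":
    # Placeholder spec; a real one carries bundled metric payloads.
    spec = {"metrics": [], "dataset": {}}
    with tempfile.TemporaryDirectory() as tmp:
        cmd = build_run_command(write_spec_file(spec, tmp))
        print(cmd)
        # To execute: subprocess.run(cmd, check=True), with the venv active.
```

Passing the command as an argument list rather than a single shell string means the serialized payload never goes through shell quoting.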

Evaluate using `exact-match` metric

See the spec example at assets/specs/exact_match_metric.json.

nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json

Evaluate using a benchmark metric set

nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_benchmark.json

Evaluate using `LLM-Judge` metric

Uses an LLM to score responses. See the spec example at assets/specs/llm_as_judge.json.

nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/llm_as_judge.json

Run Evaluation As A Durable Job

Use the nemo evaluator evaluate submit command to create a durable evaluation job. The command returns a job handler object instead of the evaluation result.

nemo evaluator evaluate submit \
  --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json

The submit response includes the generated job's name field, for example nemo-evaluator-zlhn1ecd. Wait for the job to complete, then list and download the job results.

nemo jobs get-status <job-name>
nemo jobs get <job-name>
nemo jobs results list <job-name>
nemo jobs results download aggregate-scores --job <job-name> --output-file aggregate-scores.json
nemo jobs results download row-scores --job <job-name> --output-file row-scores.jsonl
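
The "wait for the job to complete" step can be scripted as a polling loop. The sketch below is one possible shape, not the plugin's own mechanism: wait_for_job is a hypothetical helper, the status strings in TERMINAL_STATUSES are placeholders, and the status source is injected as a callable because the output format of nemo jobs get-status is not described here.

```python
import time
from typing import Callable

# Assumption: placeholder status names. Check the real output of
# `nemo jobs get-status <job-name>` and adjust this set.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status().strip().lower()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)


if __name__ == "__main__":
    # Simulated status sequence standing in for repeated CLI calls.
    fake = iter(["pending", "running", "completed"])
    print(wait_for_job(lambda: next(fake), poll_seconds=0.0))
```

In practice get_status would wrap a call to nemo jobs get-status and extract the status from its output; how to parse that output is left open here.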

Python SDK Interface

The Evaluator Python SDK client is exposed as the evaluator attribute on a NeMoPlatform instance:

from nemo_platform import NeMoPlatform

platform_client = NeMoPlatform(base_url="http://localhost:8080")
status = platform_client.evaluator.plugin_status()

See examples of using the plugin SDK interface in plugin_sdk_examples.py.

Security

Do not print secrets to stdout, since stdout may be collected as logs.
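
One way to reduce the risk when printing specs or responses for debugging is to mask secret-looking fields first. The sketch below is a generic illustration, not part of the plugin: redact is a hypothetical helper, and the key-name fragments it matches are assumptions that should be adapted to the fields actually in use.

```python
# Assumption: key names containing these fragments hold secrets; extend as needed.
SECRET_FRAGMENTS = ("key", "token", "secret", "password")


def redact(value):
    """Return a copy of a JSON-like value with secret-looking fields masked."""
    if isinstance(value, dict):
        return {
            k: "***" if any(f in str(k).lower() for f in SECRET_FRAGMENTS) else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


if __name__ == "__main__":
    # Illustrative payload; the field names are made up for this example.
    payload = {"model": "example-model", "api_key": "not-a-real-key"}
    print(redact(payload))
```

Matching on key names is a heuristic; it will not catch secrets embedded in free-text values or URLs.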

Additional Resources

For LLM-judge setup notes, see LLM Judge Notes.

For evaluator API key auth, see Evaluator API Auth.

For local and cluster troubleshooting, see Evaluation Troubleshooting.
