Mcore Split Pr

Split a PR into multiple PRs to reduce the number of required CODEOWNERS reviewer groups.

Published by @NVIDIA·0 agent reads / 30d·0 saves·

Split PR by CODEOWNERS Groups

Split a large pull request into multiple smaller PRs, where each PR touches the fewest possible CODEOWNERS reviewer groups. The goal is to reduce review burden: a PR that only touches megatron/core/ needs only the core reviewers, while a PR that also touches examples/, tools/, and megatron/training/ pulls in many additional groups.

Answer-First Constraints

For split-planning questions, lead with these constraints before the full workflow:

  • Minimize CODEOWNERS reviewer groups per PR, but each resulting PR must still be independently mergeable and reviewable.
  • Tests travel with the production code they validate; do not split tests into a separate PR just to reduce reviewer groups.
  • If PR B depends on symbols renamed in PR A, call out the dependency and put backward-compatible aliases, re-exports, or shims in PR A when needed.
  • Wait for user approval before execution.
  • Execution creates draft PRs from the right base, applies file-scoped diffs with git diff upstream/main..<source-branch> -- <paths> | git apply, pushes to the user's fork, and never pushes directly to upstream.

Workflow

1. Analyze the PR

  1. Fetch the PR details: gh pr view <number> --repo NVIDIA/Megatron-LM --json title,body,headRefName,author and gh pr diff <number> --repo NVIDIA/Megatron-LM --stat. Also determine the current GitHub user with gh api user --jq .login.
  2. Parse .github/CODEOWNERS to build a mapping from file path patterns to owner groups.
  3. For each changed file in the PR, determine which CODEOWNERS groups would be required to review it.
  4. Build a summary table grouped by CODEOWNERS group, showing which files pull in which groups.
  5. Count the total number of distinct reviewer groups the PR currently requires.

2. Propose a split that minimizes reviewer groups per PR

The primary optimization goal: minimize the number of CODEOWNERS reviewer groups required for each resulting PR.

Strategy:

  1. Cluster files by their CODEOWNERS groups. Files owned by the same set of groups naturally belong together.
  2. Identify the largest cluster — this becomes the first (and usually largest) PR.
  3. Remaining files form one or more additional PRs, each ideally requiring only one or two reviewer groups.
  4. If a split creates a dependency (e.g., PR B uses symbols renamed in PR A), the dependent PR must be merged after the first. Note this explicitly.
  5. Each PR must be independently mergeable to main — no broken imports, no missing symbols. Backward-compatible aliases and re-export stubs in the first PR can make this possible.

Present the proposed split as a table:

  • PR name/description
  • Files included
  • CODEOWNERS groups required
  • Dependencies on other PRs (if any)

Wait for user approval before proceeding.

3. Execute the split (after user approval)

For each new PR:

  1. Create a new branch from the appropriate base (main, or a dependency PR's branch).
  2. Extract the relevant changes: git diff upstream/main..<source-branch> -- <file paths> | git apply.
  3. Stage, commit with a clear message, and push to the user's fork.
  4. Create the PR as a draft (per repo contributing guidelines).
  5. If the original PR needs to be narrowed in scope, confirm with the user before force-pushing.
  6. Report all PR URLs when done.

Important guidelines

  • Always create PRs as drafts and push to the user's fork, never directly to upstream.
  • Backward-compatible changes (aliases, re-exports, deprecation shims) should go in the first PR so subsequent PRs can depend on them.
  • Test files should go with the production code they test, not in a separate PR.
  • Prefer a single clean commit per split PR over replaying the original commit history.
  • If a file is hard to categorize (e.g., it touches two groups), ask the user which PR it should go in.
  • If the current GitHub user is not the author of the original PR, each new PR's description must explicitly credit the original author (e.g., "Original changes by @ in #").

Bundled with this artifact

4 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.

More on the bench

SKILL0

Whisper

OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.

data-science-ml+2
0
SKILL0

Guidance

Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework

ai-prompt-engineering+2
0
SKILL0

Pinecone

Managed vector database for production AI applications. Fully managed, auto-scaling, with hybrid search (dense + sparse), metadata filtering, and namespaces. Low latency (<100ms p95). Use for production RAG, recommendation systems, or semantic search at scale. Best for serverless, managed infrastructure.

data-science-ml+2
0