Split PR by CODEOWNERS Groups
Split a large pull request into multiple smaller PRs, where each PR touches
the fewest possible CODEOWNERS reviewer groups. The goal is to reduce review
burden: a PR that only touches megatron/core/ needs only the core reviewers,
while a PR that also touches examples/, tools/, and megatron/training/
pulls in many additional groups.
Answer-First Constraints
For split-planning questions, lead with these constraints before the full workflow:
- Minimize CODEOWNERS reviewer groups per PR, but each resulting PR must still be independently mergeable and reviewable.
- Tests travel with the production code they validate; do not split tests into a separate PR just to reduce reviewer groups.
- If PR B depends on symbols renamed in PR A, call out the dependency and put backward-compatible aliases, re-exports, or shims in PR A when needed.
- Wait for user approval before execution.
- Execution creates draft PRs from the right base, applies file-scoped diffs with git diff upstream/main..<source-branch> -- <paths> | git apply, pushes to the user's fork, and never pushes directly to upstream.
Workflow
1. Analyze the PR
- Fetch the PR details: gh pr view <number> --repo NVIDIA/Megatron-LM --json title,body,headRefName,author and gh pr diff <number> --repo NVIDIA/Megatron-LM --stat. Also determine the current GitHub user with gh api user --jq .login.
- Parse .github/CODEOWNERS to build a mapping from file path patterns to owner groups.
- For each changed file in the PR, determine which CODEOWNERS groups would be required to review it.
- Build a summary table grouped by CODEOWNERS group, showing which files pull in which groups.
- Count the total number of distinct reviewer groups the PR currently requires.
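The analysis steps above can be sketched in Python. This is a minimal illustration, not the repository's tooling: the CODEOWNERS content and the @example-org/... group names are hypothetical, and the matcher is a simplification that handles directory prefixes and simple globs only (real CODEOWNERS patterns follow gitignore-style rules, where for example * does not cross directory separators). The one rule it does model faithfully is that the last matching pattern wins.

```python
from fnmatch import fnmatchcase

# Hypothetical CODEOWNERS content; the real groups live in .github/CODEOWNERS.
SAMPLE_CODEOWNERS = """\
# comment lines and blank lines are ignored
megatron/core/      @example-org/core
megatron/training/  @example-org/training
examples/           @example-org/examples
*.md                @example-org/docs
"""


def parse_codeowners(text):
    """Return (pattern, owners) rules in file order."""
    rules = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        pattern, *owners = line.split()
        rules.append((pattern, tuple(owners)))
    return rules


def pattern_matches(pattern, path):
    """Simplified matcher (assumption): prefixes and basic globs only."""
    pat = pattern.lstrip("/")
    if pat.endswith("/"):
        return path.startswith(pat)  # everything under the directory
    if path == pat or path.startswith(pat + "/"):
        return True
    if "/" not in pat:
        return fnmatchcase(path.rsplit("/", 1)[-1], pat)  # match the file name
    return fnmatchcase(path, pat)


def owners_for(path, rules):
    """Owner groups for one path; later rules override earlier ones."""
    owners = ()
    for pattern, rule_owners in rules:
        if pattern_matches(pattern, path):
            owners = rule_owners
    return frozenset(owners)


def summarize(files, rules):
    """Map each owner group to the changed files that pull it in."""
    by_group = {}
    for path in files:
        for group in owners_for(path, rules):
            by_group.setdefault(group, []).append(path)
    return by_group


rules = parse_codeowners(SAMPLE_CODEOWNERS)
changed = [
    "megatron/core/optimizer.py",
    "megatron/training/arguments.py",
    "examples/README.md",
    "setup.py",
]
summary = summarize(changed, rules)
for group in sorted(summary):
    print(group, summary[group])
print("distinct reviewer groups:", len(summary))
```

Files with no matching rule (setup.py in this sample) come back with an empty owner set, so they add no reviewer group to the count.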
2. Propose a split that minimizes reviewer groups per PR
The primary optimization goal: minimize the number of CODEOWNERS reviewer groups required for each resulting PR.
Strategy:
- Cluster files by their CODEOWNERS groups. Files owned by the same set of groups naturally belong together.
- Identify the largest cluster; this becomes the first (and usually largest) PR.
- Remaining files form one or more additional PRs, each ideally requiring only one or two reviewer groups.
- If a split creates a dependency (e.g., PR B uses symbols renamed in PR A), the dependent PR must be merged after the PR it depends on. Note this ordering explicitly in the proposal.
- Each PR must be independently mergeable to main — no broken imports, no missing symbols. Backward-compatible aliases and re-export stubs in the first PR can make this possible.
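The clustering step in this strategy can be sketched as follows. It assumes the per-file owner sets have already been computed; the paths and the @example-org/... groups are hypothetical. Files are grouped by their exact set of owner groups, and the clusters are returned largest first.

```python
from collections import defaultdict


def cluster_by_owner_set(file_owners):
    """Group files by their exact set of owner groups, largest cluster first.

    file_owners maps a file path to an iterable of owner groups.
    Returns a list of (owner_set, sorted_files) tuples.
    """
    clusters = defaultdict(list)
    for path, owners in file_owners.items():
        clusters[frozenset(owners)].append(path)
    return sorted(
        ((owners, sorted(paths)) for owners, paths in clusters.items()),
        # Largest cluster first; ties broken by owner names for stable output.
        key=lambda item: (-len(item[1]), sorted(item[0])),
    )


# Hypothetical input: three core files, one example, one shared file.
file_owners = {
    "megatron/core/a.py": {"@example-org/core"},
    "megatron/core/b.py": {"@example-org/core"},
    "megatron/core/c.py": {"@example-org/core"},
    "examples/run.sh": {"@example-org/examples"},
    "megatron/shared.py": {"@example-org/core", "@example-org/training"},
}

for owners, paths in cluster_by_owner_set(file_owners):
    print(sorted(owners), paths)
```

Clustering by ownership alone is only a starting point. A test file owned by a different group than the code it validates lands in a separate cluster here and has to be moved back by hand, and the dependency and mergeability checks still need to be applied to the result.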
Present the proposed split as a table:
- PR name/description
- Files included
- CODEOWNERS groups required
- Dependencies on other PRs (if any)
Wait for user approval before proceeding.
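One possible way to assemble the proposal is sketched below: it renders the four columns listed above as a plain-text table and derives a merge order from the declared dependencies using the standard-library graphlib module (Python 3.9+). The PR names, files, and groups are hypothetical.

```python
from graphlib import TopologicalSorter


def merge_order(proposals):
    """Order PR names so that every PR comes after the PRs it depends on."""
    graph = {p["name"]: set(p["depends_on"]) for p in proposals}
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(graph).static_order())


def render_proposal(proposals):
    """Render the proposal as a plain-text table, one row per PR."""
    rows = [("PR", "Files", "CODEOWNERS groups", "Depends on")]
    for p in proposals:
        rows.append((
            p["name"],
            ", ".join(p["files"]),
            ", ".join(sorted(p["groups"])),
            ", ".join(sorted(p["depends_on"])) or "none",
        ))
    widths = [max(len(row[i]) for row in rows) for i in range(4)]
    return "\n".join(
        " | ".join(cell.ljust(widths[i]) for i, cell in enumerate(row)).rstrip()
        for row in rows
    )


# Hypothetical two-way split where the second PR depends on the first.
proposals = [
    {
        "name": "core-rename",
        "files": ["megatron/core/a.py"],
        "groups": {"@example-org/core"},
        "depends_on": set(),
    },
    {
        "name": "examples-update",
        "files": ["examples/run.py"],
        "groups": {"@example-org/examples"},
        "depends_on": {"core-rename"},
    },
]

print(render_proposal(proposals))
print("merge order:", " -> ".join(merge_order(proposals)))
```

Nothing in this sketch executes the split; it only produces the text to show the user before asking for approval.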
3. Execute the split (after user approval)
For each new PR:
- Create a new branch from the appropriate base (main, or a dependency PR's branch).
- Extract the relevant changes: git diff upstream/main..<source-branch> -- <file paths> | git apply.
- Stage, commit with a clear message, and push to the user's fork.
- Create the PR as a draft (per repo contributing guidelines).
- If the original PR needs to be narrowed in scope, confirm with the user before force-pushing.
- Report all PR URLs when done.
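The per-PR steps above can be previewed as a dry run before anything is executed. The sketch below only builds the command strings; it does not run them. The branch names, the fork remote name origin, and the fork owner octocat are assumptions for illustration, and the function refuses a remote named upstream to reflect the rule about never pushing there.

```python
import shlex


def plan_split_commands(branch, base, source_branch, paths, title, body,
                        fork_owner, fork_remote="origin"):
    """Return the shell commands for one split PR, in order, without running them."""
    if fork_remote == "upstream":
        raise ValueError("push to the user's fork, never directly to upstream")
    q = shlex.quote
    path_args = " ".join(q(p) for p in paths)
    return [
        # New branch from main or from a dependency PR's branch.
        f"git checkout -b {q(branch)} {q(base)}",
        # File-scoped diff from the source branch, applied to the working tree.
        f"git diff upstream/main..{q(source_branch)} -- {path_args} | git apply",
        f"git add -- {path_args}",
        f"git commit -m {q(title)}",
        f"git push -u {q(fork_remote)} {q(branch)}",
        "gh pr create --draft --repo NVIDIA/Megatron-LM "
        f"--head {q(fork_owner + ':' + branch)} --title {q(title)} --body {q(body)}",
    ]


# Hypothetical values for one split PR.
commands = plan_split_commands(
    branch="split/core",
    base="upstream/main",
    source_branch="big-feature",
    paths=["megatron/core/a.py", "megatron/core/b.py"],
    title="Rename core symbols",
    body="Split out of the original PR.",
    fork_owner="octocat",
)
print("\n".join(commands))
```

The diff range is the one given in the steps above. A two-dot range compares the two branch tips directly, so if upstream/main has advanced since the source branch diverged, the patch also reverts those newer upstream changes in the selected files; rebasing the source branch first, or inspecting the patch before applying it, may be worth considering.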
Important guidelines
- Always create PRs as drafts and push to the user's fork, never directly to upstream.
- Backward-compatible changes (aliases, re-exports, deprecation shims) should go in the first PR so subsequent PRs can depend on them.
- Test files should go with the production code they test, not in a separate PR.
- Prefer a single clean commit per split PR over replaying the original commit history.
- If a file is hard to categorize (e.g., it is owned by two groups), ask the user which PR it should go in.
- If the current GitHub user is not the author of the original PR, each new PR's description must explicitly credit the original author (e.g., "Original changes by @<author> in #<number>").
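The attribution rule can be sketched as a small helper that builds the PR description. The function name, parameters, and sample values are hypothetical; the comparison ignores case because GitHub logins are case-insensitive.

```python
def split_pr_body(summary, current_user, original_author, original_pr):
    """Append a credit line when someone other than the author does the split."""
    if current_user.lower() == original_author.lower():
        return summary
    credit = f"Original changes by @{original_author} in #{original_pr}"
    return f"{summary}\n\n{credit}"


# Hypothetical values: the login would come from gh api user --jq .login,
# the author and PR number from gh pr view.
print(split_pr_body("Core-only part of the split.", "octocat", "hubot", 1234))
```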