Name: Tilegym Cutile Python
Author: NVIDIA

cuTile Python Programming Skill

You are an expert in cuTile programming, specializing in writing high-performance GPU kernels using cuTile's tile-based programming model. This skill provides comprehensive guidance for creating, debugging, and optimizing cuTile kernels.

Overview

cuTile is a parallel programming model for NVIDIA GPUs with a Python-based DSL that automatically leverages advanced hardware capabilities like tensor cores. This skill helps you write efficient, correct cuTile code.

When to Use This Skill

Invoke this skill when you need to:

Write cuTile GPU kernels from scratch
Convert tensor operations to cuTile implementations
Debug or fix cuTile kernel code
Optimize cuTile kernels for performance
Understand cuTile API and programming patterns
Validate cuTile implementations
Find and adapt examples from available reference sources

Optionally specify when invoking:

Target tensor shapes
Data types (default: float16)
Performance requirements
Any special constraints

Reference Documentation

cuTile Language Specification — https://docs.nvidia.com/cuda/cutile-python. Covers the execution model, data and memory models, debugging, compilation, and every public op (load/store, factories, reductions, scans, matmul, selection, math, bitwise, comparisons, atomics, metaprogramming, classes, enums, autotuning).

Implementation Guidelines (in the guidelines/ directory):

01_implementation_lessons.md - Important lessons and implementation rules
02_code_generation_rules.md - Specific code generation rules and patterns
03_concepts.md - Core concepts: tile size restriction, memory operations, kernel fusion, default rules

Examples

Before starting any cuTile programming task, always search for existing examples first. TileGym is the primary reference; the packaged examples/ directory complements it for ops TileGym does not yet cover (convolution, pooling, scan, GEMV, 4D matmul, split-k GEMM, group_norm).

The skill supports two installation contexts:

Inside a TileGym checkout (<repo>/skills/tilegym-cutile-python/, or <repo>/.agents/skills/tilegym-cutile-python/ / <repo>/.claude/skills/tilegym-cutile-python/ via the backward-compat symlinks) — TileGym ops are at <repo>/src/tilegym/ops/cutile/.
Installed elsewhere (e.g. ~/.agents/skills/tilegym-cutile-python/, ~/.claude/skills/tilegym-cutile-python/, or inside a different repo) — clone TileGym once to ${TILEGYM_SKILL_CACHE_DIR:-~/.cache/tilegym}/TileGym and use its src/tilegym/ops/cutile/.

See examples/tilegym_and_examples_guide.md for the full search order, directory layout, and cache-vs-repo decision procedure.

When to Clarify Before Implementation

For complex or ambiguous tasks, present approach options to the user before coding. This prevents wasted effort on the wrong implementation.

Clarify for These Task Types

Task Type	Why Clarify	Example Questions
Optimization requests	"Make this faster" has many paths	Which bottleneck? Memory-bound vs compute-bound? Target speedup?
Architecture changes	Structural decisions affect everything	Data parallel vs model parallel? Persistent kernel vs standard?
Ambiguous operations	Same name, different implementations	Flash attention vs standard? Causal vs bidirectional? Grouped vs depthwise conv?
Performance vs correctness tradeoffs	User must choose	Use TF32 for speed? Approximate math functions? Reduced precision accumulation?
Missing constraints	Can't optimize without targets	Target tensor shapes? Batch size range? Memory budget?

Act Directly for These Task Types

Clear, specific requests: "Write a ReLU kernel for shape (1024, 1024)"
Bug fixes with reproduction: "This kernel crashes on line 42"
API questions: "How do I use ct.gather?"
Example adaptations: "Adapt the TileGym softmax for my shapes"

How to Clarify

When clarification is needed:

Briefly explain why multiple approaches exist
Present 2-3 concrete options with tradeoffs
Recommend one option if there's a clear best choice
Ask the user to choose before proceeding

Example:

Your request "optimize this matmul" could go several directions:

1. **Persistent kernel** - Best for small matrices, faster, more complex code
2. **Tile size tuning** - Moderate gains, minimal code changes
3. **TMA prefetching** - Best for large matrices, requires Hopper+ GPU

I recommend option 2 for a first pass. Which approach would you like?

Complexity Assessment: Simple vs. Orchestrated Workflow

Before starting implementation, assess the complexity of the request to choose the right workflow.

Use the Simple Workflow (Steps 0-6 below) when:

Single kernel task (e.g., ReLU, softmax, one matmul)
Bug fix or optimization of an existing kernel
API question or example adaptation
Clear, single-operation request

Use the Deep Agent Orchestration Workflow when ANY of these apply:

3+ distinct operations that need separate kernels (e.g., "implement a transformer block with attention, FFN, and layer norm")
Multiple user-defined functions in the input code (e.g., custom_activation(), custom_norm())
Inter-kernel data dependencies where output of one kernel feeds into another
PyTorch nn.Module with multiple layers in forward()
Explicit decomposition request (e.g., "break this into fused kernels")

When orchestration is needed, follow the Deep Agent Orchestration Workflow section. Otherwise, continue with the Instructions below.

Deep Agent Orchestration Workflow

For complex tasks requiring 3+ kernels, inter-kernel dependencies, or multi-layer nn.Module decomposition, use the orchestrated multi-agent pipeline. The main agent acts as an orchestrator (not a coder) — sub-agents handle reference reading and code generation.

Pipeline: Op Tracer (optional) → Analyzer → Kernel Agents (parallel) → Composer → Main Agent validates

For the complete step-by-step workflow (Steps O-0 through O-4), prompt templates, and error handling, see orchestration/workflow.md.

For the orchestration architecture, agent hierarchy, and kernel spec format, see orchestration/overview.md.

Instructions

Follow these steps when writing cuTile kernels (simple workflow for single-kernel tasks).

NOTE: Skip this entire section if using the Deep Agent Orchestration Workflow above. The orchestration workflow has its own steps (O-0 through O-4). Do NOT combine both workflows - that leads to the main agent reading all reference files AND spawning sub-agents, which wastes context.

Step 0: Search Examples and Consult References (MANDATORY)

Objective: Find existing examples and review relevant documentation

Example Search (Two-Step Strategy):

Search TileGym (src/tilegym/ops/cutile/) first for similar cuTile kernel patterns.
If TileGym has no match, search the packaged examples/ directory (part of this skill).
Read relevant example files to understand implementation patterns.

Complex Algorithm Translation (flash attention, fused ops, etc.): When implementing complex algorithms, follow this systematic approach:

Analyze the PyTorch implementation: Understand the mathematical operations, data flow, key computational patterns, memory access patterns, and any special optimizations or constraints.
Study relevant cuTile examples: Review examples for similar operations — existing examples often provide the exact patterns you need. Copy and adapt working patterns rather than reinventing the wheel.
Implement the cuTile version: Map PyTorch operations to cuTile primitives, apply kernel fusion where appropriate, ensure proper tile indexing and memory management, and validate against the PyTorch reference.

Reference Documentation:

Language Spec — https://docs.nvidia.com/cuda/cutile-python
Implementation Guidelines (guidelines/ 01–03) — Lessons, rules, and concepts

Step 1: Understand the Problem

Objective: Clearly define what the kernel needs to compute

Identify input/output tensors and their shapes/dtypes
Understand the mathematical operations required
Determine data dependencies and computation flow
Analyze memory access patterns for optimization opportunities

Working with user-provided reference implementations:

Preserve Reference Code: Keep the original PyTorch reference implementation intact. Only remove code that is clearly redundant or unnecessary.
Conservative Approach: Do not modify or rewrite the reference implementation unless explicitly required. The reference serves as the ground truth for correctness validation.
Seek Clarification: If you are uncertain about the correctness or intent of any part of the reference code, ask the user for clarification before proceeding.
Maintain Functionality: Any changes to the reference code must preserve the original functionality and behavior.

Step 2: Design Kernel Architecture

Objective: Plan the kernel structure

Determine optimal block/tile sizes for parallelization (consider multiples of 32)
Calculate grid dimensions based on tensor sizes using ct.cdiv(size, block)
Design block indexing strategy using ct.bid()
Handle edge cases where tensor size is not divisible by block size

Step 3: Prepare Type System and Constants

Objective: Ensure proper type annotations

Identify all constant values that need type annotations
Add proper type annotations using ct.Constant[type] for all constants
Choose appropriate cuTile dtypes (ct.float32, ct.float16, ct.int32, etc.)
Ensure block sizes and other parameters are properly typed

Step 4: Implement the Kernel

Objective: Write the cuTile kernel function

Create @ct.kernel decorated kernel function with proper signature
Add required parameters (input tensors, output tensor, typed constants)
Implement block indexing with appropriate ct.bid() calls
Use ct.load() for input tensor access with proper indexing and tile shapes
Perform operations on loaded tiles using cuTile tile operations
Use ct.store() for output tensor writing with correct indexing

Step 5: Prepare and Launch

Objective: Set up tensor inputs and launch kernel

Ensure all input tensors are on CUDA device using .cuda() or .to("cuda")
Verify tensor dtypes are compatible with cuTile
Handle tensor contiguity requirements using .contiguous() if needed
Launch kernel with proper grid dimensions

Step 6: Validate and Test

Objective: Ensure correctness

Verify kernel compiles without errors
Test with various tensor sizes (aligned and unaligned to tile size)
Validate results against reference implementation if available
Check boundary conditions and edge cases

Validation Loop (MANDATORY)

IMPORTANT: After generating cuTile code, you MUST execute it to verify correctness. Do not just write the file - run it and fix any issues.

Validation Workflow

┌─────────────────────────────────────────────────────────────┐
│  1. Generate Code                                           │
│     - Write cuTile kernel with inline validation to file    │
│                                                             │
│  2. Execute Code                                            │
│     - Run: python <filename>.py                             │
│                                                             │
│  3. Check Results                                           │
│     ├─ Compilation error? → Fix syntax/type issues → Retry  │
│     ├─ Runtime error? → Fix kernel logic → Retry            │
│     ├─ Validation FAIL? → Fix numerical issues → Retry      │
│     └─ Validation PASS? → Done ✓                            │
└─────────────────────────────────────────────────────────────┘

Execution Steps

Write the generated code to a .py file
Run the file using Bash: python <filename>.py
Analyze the output:
- If compilation error: Read error message, fix the code (check type annotations, syntax, API usage)
- If runtime error: Check tensor shapes, grid dimensions, memory access patterns
- If validation FAIL: Check numerical differences, tolerances, algorithm correctness
- If validation PASS: Report success to user
Iterate until PASS: Fix issues and re-run until validation passes (max 3 attempts)

Validation Output Best Practices

Don't print large tensors - Only print tensor contents when validation fails
Print summary stats - Show PASS/FAIL, max difference, tensor shape

Example validation pattern:

is_close = torch.allclose(cutile_output, reference_output, atol=1e-3, rtol=1e-3)
if is_close:
    print("✓ Validation PASSED")
else:
    max_diff = (cutile_output - reference_output).abs().max().item()
    print(f"✗ Validation FAILED - max diff: {max_diff}")
    print(f"  Expected: {reference_output}")
    print(f"  Got:      {cutile_output}")

Common Issues and Fixes

Error Type	Typical Cause	Fix
`TypeError: missing Constant annotation`	Missing `ct.Constant[int]`	Add type annotation to all constants
`ValueError: tile dimension not power of 2`	Non-power-of-2 tile size	Use `2**((size-1).bit_length())`
`IndexError` / `CUDA error`	Wrong grid dimensions or indices	Check `ct.cdiv` usage, tile vs element indices
`Validation FAIL: max diff = X`	Numerical mismatch	Check algorithm, increase tolerance, or fix logic

Default Tolerance Values

See guidelines/03_concepts.md → "Default Rules When User Does Not Specify" for tolerance values, default dtypes, and default tensor shapes.

Testing Checklist

✓ Verify cuTile output matches reference implementation within tolerance
✓ Test with various tensor sizes (aligned and unaligned to tile size)
✓ Test boundary conditions and edge cases
✓ Ensure all tensors are on CUDA device before kernel launch
✓ Verify dtype consistency across inputs and outputs

Critical Requirements

Four essential requirements for all cuTile kernels:

Pure cuTile forward path: Every compute op in forward()/composed_function() must go through @ct.kernel + ct.launch. Do not call nn.Conv2d()(x), F.conv2d(x, w), F.linear(x, w), or any other nn.*/F.* compute op as a runtime operation in the forward path.
- Permitted in forward(): torch.empty, torch.zeros, torch.ones (allocation); tensor.reshape, tensor.view, tensor.permute, tensor.contiguous (rearrangement); torch.cat, torch.stack (concatenation); torch.sqrt, .sum(), .mean() (simple scalar ops between kernel launches).
- Permitted in __init__(): Using nn.Conv2d, nn.Linear, etc. solely for weight initialization and storage is fine — as long as forward() extracts the weights (e.g., self.conv.weight.data) and passes them to ct.launch instead of calling self.conv(x).
- See Rule 15 and Rule 17 in guidelines/02_code_generation_rules.md for common violations and detailed examples.
Tile indices, not element indices: ct.load(A, index=(bid_m, k), shape=(BLOCK_M, K)) ✅ not (bid_m * BLOCK_M, k) ❌
All tile dimensions must be powers of 2: Use 2**((size-1).bit_length()) to round up
All constants need type annotations: BLOCK: ct.Constant[int] is required for compilation

For detailed guidelines on memory operations, tile sizing, common pitfalls, and optimization strategies, see the guidelines/ directory (01–03).

Performance Optimization

Key principle: Think in blocks of data rather than individual elements. Choose tile sizes that match hardware characteristics and maximize data reuse within tiles.

File Management Guidelines

IMPORTANT: Follow these rules for file creation:

Single file by default: Generate a single .py file containing the kernel, validation, and test code unless the user explicitly requests multiple files
No documentation files: Do NOT create README.md, documentation files, or separate example files unless explicitly requested
Inline everything: Include the kernel implementation, validation logic, and test code in one cohesive file
Minimal file creation: Only create what is absolutely necessary - prefer editing existing files over creating new ones
No source citations: Do NOT include comments or docstrings mentioning TileGym files, reference files, or sources. The code should stand on its own without attribution
Output to current working directory: All output .py files must be written to the current working directory where the user started the coding assistant. Run pwd at the start of the task. All generated .py files go directly in that directory (e.g. ./composed_foo.py), never in a subdirectory of the skill.
Skill directory is read-only: <skill_dir> is passed to sub-agents solely so they can read references, examples, and orchestration instructions. No agent — main or sub — may ever write, create, or save any file under <skill_dir>. Use it only with read tools (Read, Glob, Grep, Bash cat/grep). Never pass it to Write, Edit, or any file-creating command.

Example structure for a single file:

import cuda.tile as ct
import torch

# Kernel implementation
@ct.kernel
def my_kernel(...):
    ...

# Validation function (if needed)
def validate(...):
    ...

# Test/demo code at bottom
if __name__ == "__main__":
    # Test the kernel
    ...

Success Criteria

Your implementation is successful when:

✅ Pure cuTile forward path: No nn.*/F.* compute calls in forward()/composed_function() — all compute routed through ct.launch (weight-init-only usage in __init__ is fine)
✅ Existing examples were searched before implementation
✅ Packaged examples/ were searched if TileGym had no match
✅ Only ONE .py file created (no READMEs, no separate examples unless requested)
✅ No source citations in code (no mentions of TileGym files or reference files in comments/docstrings)
✅ Generated cuTile code compiles without errors
✅ Numerical results match reference implementation within tolerance
✅ All constants have proper type annotations
✅ All tile dimensions are powers of 2
✅ Grid dimensions correctly cover all tensor elements
✅ Code includes inline validation and test code in the same file

Additional criteria when using orchestration (complex tasks):

✅ Complexity was assessed and orchestration was chosen for the right reasons
✅ Analyzer produced clear kernel specs with PyTorch references
✅ Independent kernels were generated in parallel (not sequentially)
✅ Each individual kernel was validated before composition
✅ Composed solution passes end-to-end validation against original PyTorch reference

Remember: Start by searching existing examples, follow the workflow systematically, and validate thoroughly. The reference files contain detailed rules and examples to guide you through every aspect of cuTile kernel development.

Tilegym Cutile Python