Tilegym Cutile Python

Expert cuTile programming assistant. Write high-performance GPU kernels using cuTile's tile-based programming model with proper validation and optimization. Supports deep agent orchestration for complex multi-kernel tasks.

Published by @NVIDIA·0 agent reads / 30d·0 saves·

cuTile Python Programming Skill

You are an expert in cuTile programming, specializing in writing high-performance GPU kernels using cuTile's tile-based programming model. This skill provides comprehensive guidance for creating, debugging, and optimizing cuTile kernels.

Overview

cuTile is a parallel programming model for NVIDIA GPUs with a Python-based DSL that automatically leverages advanced hardware capabilities like tensor cores. This skill helps you write efficient, correct cuTile code.

When to Use This Skill

Invoke this skill when you need to:

  • Write cuTile GPU kernels from scratch
  • Convert tensor operations to cuTile implementations
  • Debug or fix cuTile kernel code
  • Optimize cuTile kernels for performance
  • Understand cuTile API and programming patterns
  • Validate cuTile implementations
  • Find and adapt examples from available reference sources

Optionally specify when invoking:

  • Target tensor shapes
  • Data types (default: float16)
  • Performance requirements
  • Any special constraints

Reference Documentation

cuTile Language Specificationhttps://docs.nvidia.com/cuda/cutile-python. Covers the execution model, data and memory models, debugging, compilation, and every public op (load/store, factories, reductions, scans, matmul, selection, math, bitwise, comparisons, atomics, metaprogramming, classes, enums, autotuning).

Implementation Guidelines (in the guidelines/ directory):

  • 01_implementation_lessons.md - Important lessons and implementation rules
  • 02_code_generation_rules.md - Specific code generation rules and patterns
  • 03_concepts.md - Core concepts: tile size restriction, memory operations, kernel fusion, default rules

Examples

Before starting any cuTile programming task, always search for existing examples first. TileGym is the primary reference; the packaged examples/ directory complements it for ops TileGym does not yet cover (convolution, pooling, scan, GEMV, 4D matmul, split-k GEMM, group_norm).

The skill supports two installation contexts:

  • Inside a TileGym checkout (<repo>/skills/tilegym-cutile-python/, or <repo>/.agents/skills/tilegym-cutile-python/ / <repo>/.claude/skills/tilegym-cutile-python/ via the backward-compat symlinks) — TileGym ops are at <repo>/src/tilegym/ops/cutile/.
  • Installed elsewhere (e.g. ~/.agents/skills/tilegym-cutile-python/, ~/.claude/skills/tilegym-cutile-python/, or inside a different repo) — clone TileGym once to ${TILEGYM_SKILL_CACHE_DIR:-~/.cache/tilegym}/TileGym and use its src/tilegym/ops/cutile/.

See examples/tilegym_and_examples_guide.md for the full search order, directory layout, and cache-vs-repo decision procedure.

When to Clarify Before Implementation

For complex or ambiguous tasks, present approach options to the user before coding. This prevents wasted effort on the wrong implementation.

Clarify for These Task Types

Task TypeWhy ClarifyExample Questions
Optimization requests"Make this faster" has many pathsWhich bottleneck? Memory-bound vs compute-bound? Target speedup?
Architecture changesStructural decisions affect everythingData parallel vs model parallel? Persistent kernel vs standard?
Ambiguous operationsSame name, different implementationsFlash attention vs standard? Causal vs bidirectional? Grouped vs depthwise conv?
Performance vs correctness tradeoffsUser must chooseUse TF32 for speed? Approximate math functions? Reduced precision accumulation?
Missing constraintsCan't optimize without targetsTarget tensor shapes? Batch size range? Memory budget?

Act Directly for These Task Types

  • Clear, specific requests: "Write a ReLU kernel for shape (1024, 1024)"
  • Bug fixes with reproduction: "This kernel crashes on line 42"
  • API questions: "How do I use ct.gather?"
  • Example adaptations: "Adapt the TileGym softmax for my shapes"

How to Clarify

When clarification is needed:

  1. Briefly explain why multiple approaches exist
  2. Present 2-3 concrete options with tradeoffs
  3. Recommend one option if there's a clear best choice
  4. Ask the user to choose before proceeding

Example:

Your request "optimize this matmul" could go several directions:

1. **Persistent kernel** - Best for small matrices, faster, more complex code
2. **Tile size tuning** - Moderate gains, minimal code changes
3. **TMA prefetching** - Best for large matrices, requires Hopper+ GPU

I recommend option 2 for a first pass. Which approach would you like?

Complexity Assessment: Simple vs. Orchestrated Workflow

Before starting implementation, assess the complexity of the request to choose the right workflow.

Use the Simple Workflow (Steps 0-6 below) when:

  • Single kernel task (e.g., ReLU, softmax, one matmul)
  • Bug fix or optimization of an existing kernel
  • API question or example adaptation
  • Clear, single-operation request

Use the Deep Agent Orchestration Workflow when ANY of these apply:

  • 3+ distinct operations that need separate kernels (e.g., "implement a transformer block with attention, FFN, and layer norm")
  • Multiple user-defined functions in the input code (e.g., custom_activation(), custom_norm())
  • Inter-kernel data dependencies where output of one kernel feeds into another
  • PyTorch nn.Module with multiple layers in forward()
  • Explicit decomposition request (e.g., "break this into fused kernels")

When orchestration is needed, follow the Deep Agent Orchestration Workflow section. Otherwise, continue with the Instructions below.

Deep Agent Orchestration Workflow

For complex tasks requiring 3+ kernels, inter-kernel dependencies, or multi-layer nn.Module decomposition, use the orchestrated multi-agent pipeline. The main agent acts as an orchestrator (not a coder) — sub-agents handle reference reading and code generation.

Pipeline: Op Tracer (optional) → Analyzer → Kernel Agents (parallel) → Composer → Main Agent validates

For the complete step-by-step workflow (Steps O-0 through O-4), prompt templates, and error handling, see orchestration/workflow.md.

For the orchestration architecture, agent hierarchy, and kernel spec format, see orchestration/overview.md.


Instructions

Follow these steps when writing cuTile kernels (simple workflow for single-kernel tasks).

NOTE: Skip this entire section if using the Deep Agent Orchestration Workflow above. The orchestration workflow has its own steps (O-0 through O-4). Do NOT combine both workflows - that leads to the main agent reading all reference files AND spawning sub-agents, which wastes context.

Step 0: Search Examples and Consult References (MANDATORY)

Objective: Find existing examples and review relevant documentation

Example Search (Two-Step Strategy):

  1. Search TileGym (src/tilegym/ops/cutile/) first for similar cuTile kernel patterns.
  2. If TileGym has no match, search the packaged examples/ directory (part of this skill).
  3. Read relevant example files to understand implementation patterns.

Complex Algorithm Translation (flash attention, fused ops, etc.): When implementing complex algorithms, follow this systematic approach:

  1. Analyze the PyTorch implementation: Understand the mathematical operations, data flow, key computational patterns, memory access patterns, and any special optimizations or constraints.
  2. Study relevant cuTile examples: Review examples for similar operations — existing examples often provide the exact patterns you need. Copy and adapt working patterns rather than reinventing the wheel.
  3. Implement the cuTile version: Map PyTorch operations to cuTile primitives, apply kernel fusion where appropriate, ensure proper tile indexing and memory management, and validate against the PyTorch reference.

Reference Documentation:

  • Language Spechttps://docs.nvidia.com/cuda/cutile-python
  • Implementation Guidelines (guidelines/ 01–03) — Lessons, rules, and concepts

Step 1: Understand the Problem

Objective: Clearly define what the kernel needs to compute

  • Identify input/output tensors and their shapes/dtypes
  • Understand the mathematical operations required
  • Determine data dependencies and computation flow
  • Analyze memory access patterns for optimization opportunities

Working with user-provided reference implementations:

  1. Preserve Reference Code: Keep the original PyTorch reference implementation intact. Only remove code that is clearly redundant or unnecessary.
  2. Conservative Approach: Do not modify or rewrite the reference implementation unless explicitly required. The reference serves as the ground truth for correctness validation.
  3. Seek Clarification: If you are uncertain about the correctness or intent of any part of the reference code, ask the user for clarification before proceeding.
  4. Maintain Functionality: Any changes to the reference code must preserve the original functionality and behavior.

Step 2: Design Kernel Architecture

Objective: Plan the kernel structure

  • Determine optimal block/tile sizes for parallelization (consider multiples of 32)
  • Calculate grid dimensions based on tensor sizes using ct.cdiv(size, block)
  • Design block indexing strategy using ct.bid()
  • Handle edge cases where tensor size is not divisible by block size

Step 3: Prepare Type System and Constants

Objective: Ensure proper type annotations

  • Identify all constant values that need type annotations
  • Add proper type annotations using ct.Constant[type] for all constants
  • Choose appropriate cuTile dtypes (ct.float32, ct.float16, ct.int32, etc.)
  • Ensure block sizes and other parameters are properly typed

Step 4: Implement the Kernel

Objective: Write the cuTile kernel function

  • Create @ct.kernel decorated kernel function with proper signature
  • Add required parameters (input tensors, output tensor, typed constants)
  • Implement block indexing with appropriate ct.bid() calls
  • Use ct.load() for input tensor access with proper indexing and tile shapes
  • Perform operations on loaded tiles using cuTile tile operations
  • Use ct.store() for output tensor writing with correct indexing

Step 5: Prepare and Launch

Objective: Set up tensor inputs and launch kernel

  • Ensure all input tensors are on CUDA device using .cuda() or .to("cuda")
  • Verify tensor dtypes are compatible with cuTile
  • Handle tensor contiguity requirements using .contiguous() if needed
  • Launch kernel with proper grid dimensions

Step 6: Validate and Test

Objective: Ensure correctness

  • Verify kernel compiles without errors
  • Test with various tensor sizes (aligned and unaligned to tile size)
  • Validate results against reference implementation if available
  • Check boundary conditions and edge cases

Validation Loop (MANDATORY)

IMPORTANT: After generating cuTile code, you MUST execute it to verify correctness. Do not just write the file - run it and fix any issues.

Validation Workflow

┌─────────────────────────────────────────────────────────────┐
│  1. Generate Code                                           │
│     - Write cuTile kernel with inline validation to file    │
│                                                             │
│  2. Execute Code                                            │
│     - Run: python <filename>.py                             │
│                                                             │
│  3. Check Results                                           │
│     ├─ Compilation error? → Fix syntax/type issues → Retry  │
│     ├─ Runtime error? → Fix kernel logic → Retry            │
│     ├─ Validation FAIL? → Fix numerical issues → Retry      │
│     └─ Validation PASS? → Done ✓                            │
└─────────────────────────────────────────────────────────────┘

Execution Steps

  1. Write the generated code to a .py file
  2. Run the file using Bash: python <filename>.py
  3. Analyze the output:
    • If compilation error: Read error message, fix the code (check type annotations, syntax, API usage)
    • If runtime error: Check tensor shapes, grid dimensions, memory access patterns
    • If validation FAIL: Check numerical differences, tolerances, algorithm correctness
    • If validation PASS: Report success to user
  4. Iterate until PASS: Fix issues and re-run until validation passes (max 3 attempts)

Validation Output Best Practices

  • Don't print large tensors - Only print tensor contents when validation fails
  • Print summary stats - Show PASS/FAIL, max difference, tensor shape
  • Example validation pattern:
    is_close = torch.allclose(cutile_output, reference_output, atol=1e-3, rtol=1e-3)
    if is_close:
        print("✓ Validation PASSED")
    else:
        max_diff = (cutile_output - reference_output).abs().max().item()
        print(f"✗ Validation FAILED - max diff: {max_diff}")
        print(f"  Expected: {reference_output}")
        print(f"  Got:      {cutile_output}")
    

Common Issues and Fixes

Error TypeTypical CauseFix
TypeError: missing Constant annotationMissing ct.Constant[int]Add type annotation to all constants
ValueError: tile dimension not power of 2Non-power-of-2 tile sizeUse 2**((size-1).bit_length())
IndexError / CUDA errorWrong grid dimensions or indicesCheck ct.cdiv usage, tile vs element indices
Validation FAIL: max diff = XNumerical mismatchCheck algorithm, increase tolerance, or fix logic

Default Tolerance Values

See guidelines/03_concepts.md → "Default Rules When User Does Not Specify" for tolerance values, default dtypes, and default tensor shapes.

Testing Checklist

  • ✓ Verify cuTile output matches reference implementation within tolerance
  • ✓ Test with various tensor sizes (aligned and unaligned to tile size)
  • ✓ Test boundary conditions and edge cases
  • ✓ Ensure all tensors are on CUDA device before kernel launch
  • ✓ Verify dtype consistency across inputs and outputs

Critical Requirements

Four essential requirements for all cuTile kernels:

  1. Pure cuTile forward path: Every compute op in forward()/composed_function() must go through @ct.kernel + ct.launch. Do not call nn.Conv2d()(x), F.conv2d(x, w), F.linear(x, w), or any other nn.*/F.* compute op as a runtime operation in the forward path.
    • Permitted in forward(): torch.empty, torch.zeros, torch.ones (allocation); tensor.reshape, tensor.view, tensor.permute, tensor.contiguous (rearrangement); torch.cat, torch.stack (concatenation); torch.sqrt, .sum(), .mean() (simple scalar ops between kernel launches).
    • Permitted in __init__(): Using nn.Conv2d, nn.Linear, etc. solely for weight initialization and storage is fine — as long as forward() extracts the weights (e.g., self.conv.weight.data) and passes them to ct.launch instead of calling self.conv(x).
    • See Rule 15 and Rule 17 in guidelines/02_code_generation_rules.md for common violations and detailed examples.
  2. Tile indices, not element indices: ct.load(A, index=(bid_m, k), shape=(BLOCK_M, K)) ✅ not (bid_m * BLOCK_M, k)
  3. All tile dimensions must be powers of 2: Use 2**((size-1).bit_length()) to round up
  4. All constants need type annotations: BLOCK: ct.Constant[int] is required for compilation

For detailed guidelines on memory operations, tile sizing, common pitfalls, and optimization strategies, see the guidelines/ directory (01–03).

Performance Optimization

Key principle: Think in blocks of data rather than individual elements. Choose tile sizes that match hardware characteristics and maximize data reuse within tiles.

File Management Guidelines

IMPORTANT: Follow these rules for file creation:

  1. Single file by default: Generate a single .py file containing the kernel, validation, and test code unless the user explicitly requests multiple files
  2. No documentation files: Do NOT create README.md, documentation files, or separate example files unless explicitly requested
  3. Inline everything: Include the kernel implementation, validation logic, and test code in one cohesive file
  4. Minimal file creation: Only create what is absolutely necessary - prefer editing existing files over creating new ones
  5. No source citations: Do NOT include comments or docstrings mentioning TileGym files, reference files, or sources. The code should stand on its own without attribution
  6. Output to current working directory: All output .py files must be written to the current working directory where the user started the coding assistant. Run pwd at the start of the task. All generated .py files go directly in that directory (e.g. ./composed_foo.py), never in a subdirectory of the skill.
  7. Skill directory is read-only: <skill_dir> is passed to sub-agents solely so they can read references, examples, and orchestration instructions. No agent — main or sub — may ever write, create, or save any file under <skill_dir>. Use it only with read tools (Read, Glob, Grep, Bash cat/grep). Never pass it to Write, Edit, or any file-creating command.

Example structure for a single file:

import cuda.tile as ct
import torch

# Kernel implementation
@ct.kernel
def my_kernel(...):
    ...

# Validation function (if needed)
def validate(...):
    ...

# Test/demo code at bottom
if __name__ == "__main__":
    # Test the kernel
    ...

Success Criteria

Your implementation is successful when:

  1. Pure cuTile forward path: No nn.*/F.* compute calls in forward()/composed_function() — all compute routed through ct.launch (weight-init-only usage in __init__ is fine)
  2. ✅ Existing examples were searched before implementation
  3. ✅ Packaged examples/ were searched if TileGym had no match
  4. ✅ Only ONE .py file created (no READMEs, no separate examples unless requested)
  5. ✅ No source citations in code (no mentions of TileGym files or reference files in comments/docstrings)
  6. ✅ Generated cuTile code compiles without errors
  7. ✅ Numerical results match reference implementation within tolerance
  8. ✅ All constants have proper type annotations
  9. ✅ All tile dimensions are powers of 2
  10. ✅ Grid dimensions correctly cover all tensor elements
  11. ✅ Code includes inline validation and test code in the same file

Additional criteria when using orchestration (complex tasks):

  1. ✅ Complexity was assessed and orchestration was chosen for the right reasons
  2. ✅ Analyzer produced clear kernel specs with PyTorch references
  3. ✅ Independent kernels were generated in parallel (not sequentially)
  4. ✅ Each individual kernel was validated before composition
  5. ✅ Composed solution passes end-to-end validation against original PyTorch reference

Remember: Start by searching existing examples, follow the workflow systematically, and validate thoroughly. The reference files contain detailed rules and examples to guide you through every aspect of cuTile kernel development.

Bundled with this artifact

36 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.

More on the bench

SKILL0

Whisper

OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.

data-science-ml+2
0
SKILL0

Guidance

Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework

ai-prompt-engineering+2
0
SKILL0

Pinecone

Managed vector database for production AI applications. Fully managed, auto-scaling, with hybrid search (dense + sparse), metadata filtering, and namespaces. Low latency (<100ms p95). Use for production RAG, recommendation systems, or semantic search at scale. Best for serverless, managed infrastructure.

data-science-ml+2
0