Tilegym Converting Cutile To Julia

Converts cuTile Python GPU kernels (@ct.kernel) to cuTile.jl Julia equivalents. Handles kernel syntax translation, 0-indexed to 1-indexed conversion, broadcasting differences, memory layout (row-major to column-major), type system mapping, and launch API differences. Use when converting, porting, or translating cuTile Python kernels to Julia cuTile.jl, or debugging/optimizing existing Julia cuTile translations.

Published by @NVIDIA

cuTile Python → cuTile.jl (Julia) Conversion

Convert @ct.kernel Python kernels to Julia function ... end cuTile.jl kernels.
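Two of the conversions listed above, 0-indexed to 1-indexed and row-major to column-major, can be illustrated without a GPU. The sketch below uses plain Julia arrays only and makes no cuTile.jl calls; `to_julia_index` is a hypothetical helper introduced for illustration, not part of any library.

```julia
# Plain Julia, CPU only: no cuTile.jl calls are used here.

# Hypothetical helper: convert a 0-based Python index to a 1-based Julia index.
to_julia_index(i0::Integer) = i0 + 1

# A 2x3 matrix as a row-major buffer, i.e. Python's [[1, 2, 3], [4, 5, 6]].
row_major_buffer = [1, 2, 3, 4, 5, 6]
rows, cols = 2, 3

# Julia arrays are column-major, so reshape with the dims reversed and then
# swap the axes to recover the logical 2x3 matrix.
A = permutedims(reshape(row_major_buffer, cols, rows))

# Python's A[0][2] corresponds to Julia's A[1, 3].
value = A[to_julia_index(0), to_julia_index(2)]
println(value)
```

Whether a given kernel needs an explicit axis swap or only a reinterpretation of its shape depends on the kernel; the ground-truth files in julia/kernels/ remain the reference for how the released kernels handle layout.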

Workflow Selection

  • Standard conversion → Full workflow: translations/workflow.md
  • Errors (MethodError, IRError, numerical mismatch) → references/debugging.md
  • Quick reference → references/api-mapping.md + references/critical-rules.md
  • Test patterns → references/testing.md

Architecture

Julia kernels are standalone — no Python bridge, no pytest integration. The Julia sub-project lives in julia/ at the repo root with its own Project.toml for dependency management.

julia/                          # Self-contained Julia sub-project
├── Project.toml                # Dependencies: CUDA.jl, cuTile.jl, NNlib.jl, Test
├── kernels/                    # cuTile.jl kernel implementations
│   ├── add.jl                  # ← Ground-truth: 1D element-wise with alpha scaling (tensor+tensor, tensor+scalar)
│   ├── matmul.jl               # ← Ground-truth: 2D tiled MMA, standard Julia layout (M,K)×(K,N)→(M,N)
│   └── softmax.jl              # ← Ground-truth: 3 strategies (TMA, online, chunked) using ct.load/ct.store
└── test/                       # Julia-native tests (using Test stdlib)
    ├── runtests.jl             # Test runner entry point
    ├── test_add.jl
    ├── test_matmul.jl
    └── test_softmax.jl

Ground-truth reference: Always consult julia/kernels/*.jl and julia/test/*.jl for patterns that compile and pass tests. These are the canonical examples of working cuTile.jl code.

Instructions

  1. Analyze the Python kernel: identify patterns, shapes, dtypes, operations
  2. Write Julia kernel → julia/kernels/<op>.jl with cuTile.jl kernel + bridge function(s)
  3. Convert kernel signature (see translations/workflow.md Phase 2)
  4. Convert kernel body (apply references/api-mapping.md + references/critical-rules.md)
  5. Write Julia test → julia/test/test_<op>.jl using Test stdlib + NNlib.jl for reference
  6. Register test — add include(...) in julia/test/runtests.jl
  7. Validate — run the bundled validator: python <skill-dir>/scripts/validate_cutile_jl.py <file.jl>
  8. Test — run julia --project=julia/ julia/test/runtests.jl

Full conversion checklist with post-conversion verification → translations/workflow.md
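As a rough illustration of the test-writing step, a test file built on the Test stdlib might take the following shape. This is a sketch under stated assumptions: `softmax_under_test` is a hypothetical CPU stand-in for the bridge function a real kernel file would provide, and the hand-written `softmax_reference` replaces the NNlib.jl reference only so that the snippet runs with the standard library alone. The tolerances are illustrative, not the project's.

```julia
using Test

# Hypothetical CPU stand-in for the function under test. A real test would
# call the bridge function defined in julia/kernels/<op>.jl instead.
function softmax_under_test(x::AbstractMatrix; dims::Int = 1)
    shifted = x .- maximum(x; dims = dims)   # subtract the max for stability
    e = exp.(shifted)
    return e ./ sum(e; dims = dims)
end

# Straightforward reference without the stability shift (assumption: used
# here in place of an NNlib.jl reference to avoid extra dependencies).
softmax_reference(x; dims = 1) = exp.(x) ./ sum(exp.(x); dims = dims)

@testset "softmax (sketch)" begin
    x = Float32[1 2 3; 4 5 6]
    y = softmax_under_test(x; dims = 1)

    @test size(y) == size(x)
    # Each column should sum to one, up to Float32 rounding.
    @test all(isapprox.(sum(y; dims = 1), 1f0; atol = 1f-5))
    # Compare against the reference with a relative tolerance.
    @test isapprox(y, softmax_reference(x; dims = 1); rtol = 1f-5)
end
```

Registering such a file then amounts to adding an include(...) line for it in julia/test/runtests.jl, as the step list describes. See references/testing.md for the tolerances and patterns the project actually uses.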

⚠️ Top Pitfalls

The most dangerous translation errors. Full rules (17 total) in references/critical-rules.md.

| # | Pitfall | One-line fix |
|---|---------|--------------|
| 1 | ct.full() doesn't exist in Julia | Use fill(val, shape), zeros(T, dims...), or ones(T, dims...) |
| 2 | max(a, b) on tiles → IRError | Use max.(a, b) (broadcast dot) |
| 3 | IRError / MethodError mentioning IRStructurizer | Compiler bug: file upstream with a minimal reproducer |
| 4 | ct.launch arg order silently wrong | Args are positional: match the kernel signature exactly |
| 5 | ct.load with order: index positions wrong | order remaps BOTH shape AND index (Critical Rule 16) |
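Pitfalls 1, 2, and 4 have direct analogues in ordinary Julia, which makes them easy to try on the CPU. In the sketch below, plain arrays stand in for tiles and `scaled_add` is a hypothetical stand-in for a kernel signature; nothing here calls cuTile.jl, so it shows only the Julia idioms and not the tile compiler's behavior.

```julia
# Plain Julia arrays stand in for tiles here; this runs on the CPU.

# Pitfall 1: constant-filled arrays come from fill/zeros/ones.
acc   = zeros(Float32, 4)      # four zeros
scale = fill(0.5f0, (4,))      # four copies of 0.5
bias  = ones(Float32, 4)       # four ones

# Pitfall 2: element-wise max needs the broadcast dot.
a = Float32[1, 5, 2, 8]
b = Float32[4, 3, 6, 7]
elementwise_max = max.(a, b)   # Float32[4, 5, 6, 8]

# Pitfall 4: positional arguments bind by position, not by name.
# `scaled_add` is a hypothetical stand-in for a kernel signature.
scaled_add(x, y, alpha) = x .+ alpha .* y

right = scaled_add(a, b, 2f0)  # a + 2b
wrong = scaled_add(b, a, 2f0)  # b + 2a: runs without error, different result

println(elementwise_max)
println(right == wrong)
```

The swapped call produces no error, which is the positional-argument hazard in miniature: the only defense is to compare the call site against the signature.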

Worked Examples

Side-by-side Python → Julia conversions matching the released Julia kernels in julia/kernels/. Each directory contains cutile_python.py (before) and cutile_julia.jl (after).

| # | Example | Key Patterns | When to Reference |
|---|---------|--------------|-------------------|
| 01 | add | 1D ct.load/ct.store, alpha scaling, scalar broadcast, fill/zeros, keyword load/store | Starting point; basic TMA + element-wise patterns |
| 02 | matmul | muladd, TF32 conversion, K-loop with for, 2D swizzle, standard Julia layout, ct.@compiler_options | MMA / tensor core operations |
| 03 | softmax | Persistent scheduling, for loops, gather/scatter, padding_mode, multi-pass | Large-tensor reduction patterns |

The examples are simplified teaching versions of add.jl, matmul.jl, and softmax.jl. Always consult julia/kernels/*.jl for the canonical, tested implementations.

Reference Documents

| Category | Document | Content |
|----------|----------|---------|
| Workflows | translations/workflow.md | Full conversion workflow with todo list, validation loop, checklist |
| Rules | references/critical-rules.md | 17 Critical Rules for cuTile Python → Julia conversion |
| API | references/api-mapping.md | Python↔Julia bidirectional API mapping + kernel patterns |
| Testing | references/testing.md | Julia-native test patterns, tolerances, failure diagnosis |
| Debugging | references/debugging.md | Julia-specific error diagnosis + IR debug commands |
| Scripts | scripts/validate_cutile_jl.py | Static validation for Julia anti-patterns (run it) |
| Ground Truth | julia/kernels/*.jl + julia/test/*.jl | Actual working implementations in the codebase |

Environment Setup

Prerequisite — Julia: this skill requires the Julia version declared in julia/Project.toml under [compat] julia. If julia --version is missing or older than that, install from the official Julia site at https://julialang.org/install/ following the verified installer instructions for your OS. Resume below once julia --version is compatible.

Then, from the repo root:

# Install Julia dependencies declared in julia/Project.toml
julia --project=julia/ -e 'using Pkg; Pkg.instantiate()'

# Run tests
julia --project=julia/ julia/test/runtests.jl

Requirements:

  • Julia (minimum version declared in julia/Project.toml under [compat] julia)
  • CUDA 13.1+ driver
  • Blackwell GPU (compute capability 10+)
  • Dependencies managed via julia/Project.toml: CUDA.jl, cuTile.jl, NNlib.jl, Test
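The Julia version requirement can also be checked from Julia itself. The sketch below parses a stand-in for julia/Project.toml with the TOML stdlib and compares the [compat] julia entry against the running version. The "1.11" value is illustrative only and is not the project's real entry, `meets_minimum` is a hypothetical helper, and the comparison assumes the entry is a single plain version; richer compat specifiers are not handled.

```julia
using TOML

# Stand-in for the contents of julia/Project.toml. The "1.11" value is
# illustrative only, not the project's real compat entry.
project = TOML.parse("""
[compat]
julia = "1.11"
""")

# Hypothetical helper. Assumption: the compat entry is a single plain
# version such as "1.11"; other specifier forms are not parsed here.
function meets_minimum(compat_entry::AbstractString,
                       running::VersionNumber = VERSION)
    return running >= VersionNumber(compat_entry)
end

required = project["compat"]["julia"]
println("required: ", required,
        "  running: ", VERSION,
        "  ok: ", meets_minimum(required))
```

Against the real file, `TOML.parsefile("julia/Project.toml")` run from the repo root would replace the inline string. Pkg.instantiate() performs its own compatibility resolution, so this check is a convenience rather than a substitute.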

Bundled with this artifact

16 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.
