CUDA SIMT to cuTile Python Transpiler
Transform your CUDA kernels for NVIDIA Blackwell GPUs
Live Demo • Quick Start • Features • Patterns • Discord
RightNow Tile is a production-grade transpiler that converts traditional CUDA SIMT (Single Instruction, Multiple Threads) kernels into cuTile Python code — NVIDIA's new tile-based programming model optimized for Blackwell GPUs (compute capability 10.x+).
Part of the RightNow AI ecosystem — a code editor built for GPU kernel development.
NVIDIA's cuTile represents a paradigm shift in GPU programming:
| Traditional CUDA | cuTile |
|---|---|
| Thread-centric programming | Tile-centric programming |
| Manual memory coalescing | Automatic tile-based loads |
| Complex index calculations | Declarative tile operations |
| Low-level synchronization | High-level tile semantics |
RightNow Tile bridges the gap — take your existing CUDA kernels and transform them for next-gen hardware.
```bash
# Clone the repository
git clone https://github.com/RightNow-AI/RightNow-Tile.git
cd RightNow-Tile

# Install dependencies
npm install

# Start development server
npm run dev
```

Open http://localhost:3000 and start transpiling!
Automatically identifies 18 computational patterns with 60+ variant-specific optimizations:
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│    Your CUDA    │ ──► │  Pattern Match   │ ──► │    Optimized    │
│     Kernel      │     │    + Analysis    │     │   cuTile Code   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
```
```
CUDA Source
       │
       ▼
┌──────────────┐
│ 1. Extractor │  Parse kernel signatures, parameters, memory accesses
└──────┬───────┘
       ▼
┌──────────────┐
│ 2. Parser    │  Recognize 150+ CUDA intrinsics & index patterns
└──────┬───────┘
       ▼
┌──────────────┐
│ 3. Semantic  │  Detect reductions, dependencies, race conditions
└──────┬───────┘
       ▼
┌──────────────┐
│ 4. Memory    │  Analyze coalescing, bank conflicts, access patterns
└──────┬───────┘
       ▼
┌──────────────┐
│ 5. Pattern   │  Match against 18 patterns with confidence scoring
└──────┬───────┘
       ▼
┌──────────────┐
│ 6. IR Build  │  Generate intermediate representation with config
└──────┬───────┘
       ▼
┌──────────────┐
│ 7. Optimize  │  Select optimal tile sizes & configurations
└──────┬───────┘
       ▼
┌──────────────┐
│ 8. CodeGen   │  Apply variant-specific templates
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ 9. Validate  │  Verify correctness & generate diagnostics
└──────────────┘
       │
       ▼
cuTile Python
```
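The stages compose as a straight data pipeline. Here is a minimal TypeScript sketch of that shape, with identity stubs standing in for the real implementations under lib/ — the stage functions below are illustrative only, not the actual exported API:

```typescript
// Illustrative data flow only: each stage consumes the previous stage's
// output. The real implementations live in lib/ast, lib/parser,
// lib/patterns, lib/ir, lib/codegen and lib/validation.
const stages: Array<(input: unknown) => unknown> = [
  (x) => x, // 1. Extractor: kernel signatures, parameters, memory accesses
  (x) => x, // 2. Parser: CUDA intrinsics & index patterns
  (x) => x, // 3. Semantic: reductions, dependencies, race conditions
  (x) => x, // 4. Memory: coalescing, bank conflicts, access patterns
  (x) => x, // 5. Pattern: archetype match with confidence scoring
  (x) => x, // 6. IR Build: intermediate representation + config
  (x) => x, // 7. Optimize: tile sizes & configurations
  (x) => x, // 8. CodeGen: variant-specific templates
  (x) => x, // 9. Validate: correctness & diagnostics
];

const runPipeline = (cudaSource: string): unknown =>
  stages.reduce<unknown>((acc, stage) => stage(acc), cudaSource);
```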
- Monaco Editor — VS Code-quality editing with syntax highlighting
- Real-time Transpilation — See results instantly
- Dark/Light Themes — Easy on the eyes
- Expandable Output — Full-screen code view
- One-Click Copy — Get your code ready to deploy
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| GEMM | naive, tiled, register_blocked | Matrix multiplication, deep learning | High |
| Reduction | tree, warp_shuffle, multi_block, segmented | Sum, max, min, dot product | High |
| Scan | inclusive, exclusive, segmented | Prefix sum, stream compaction | High |
| Stencil | 1d_3pt, 1d_5pt, 2d_5pt, 2d_9pt, 3d | Image processing, PDE solvers | High |
| Elementwise | simple, vectorized | Point-wise operations | High |
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| Attention | flash_attention, flash_attention_v2, multi_head, causal, cross | Transformer models | High |
| Normalization | layernorm, rmsnorm, batchnorm, groupnorm, instancenorm | Neural network layers | High |
| Convolution | conv1d, conv2d, conv3d, depthwise, grouped, winograd, im2col | CNNs, signal processing | High |
| Pooling | max_pool_2d, avg_pool_2d, global_avg, global_max, adaptive | Feature downsampling | High |
| Embedding | lookup, embedding_bag, positional | NLP, recommender systems | Medium |
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| RoPE | standard, neox, cached | Rotary position embeddings | High |
| KV Cache | append, paged, prefix, gqa | LLM inference optimization | High |
| Quantization | int8, int4, fp8, dequantize | Model compression | Medium |
| Fused | matmul_activation, matmul_bias_activation, layernorm_residual | Kernel fusion | Medium |
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| FFT | radix2, radix4, radix8, inverse, real | Signal processing | High |
| Sparse | spmv_csr, spmv_csr_warp, spmv_coo, spmv_ell, spmm, sddmm | Sparse matrix operations | Medium |
| Histogram | atomic, privatized, multipass, weighted, 2d | Data distribution, statistics | Medium |
| Sorting | bitonic, bitonic_shared, radix, merge | Parallel sorting | Medium |
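Every match carries its archetype, variant, and confidence score, so callers can gate on them. A small sketch using the programmatic API shown further down — the 0.8 threshold here is an arbitrary, illustrative choice, not a library default:

```typescript
import { transpile } from './lib/transpiler';

// Report which archetype/variant a kernel matched and only accept the
// generated code above an (arbitrary, illustrative) confidence bar.
async function classify(cudaCode: string): Promise<string | null> {
  const result = await transpile(cudaCode);
  const { archetype, variant, confidence } = result.pattern;
  console.log(`${archetype}/${variant} at ${(confidence * 100).toFixed(0)}%`);
  return confidence >= 0.8 ? result.tileCode : null; // review low-confidence output by hand
}
```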
Input: CUDA SIMT Kernel

```cuda
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```

Output: cuTile Python

```python
import cuda_tile as ct
import cupy

TILE_SIZE = 256

@ct.kernel
def vector_add(a, b, c, n: ct.Constant[int], tile_size: ct.Constant[int]):
    """
    Elementwise kernel - auto-transpiled from CUDA
    Original: vectorAdd
    Confidence: 100%
    """
    pid = ct.bid(0)

    # Load input tiles
    a_tile = ct.load(a, index=(pid,), shape=(tile_size,))
    b_tile = ct.load(b, index=(pid,), shape=(tile_size,))

    # Compute
    result = a_tile + b_tile

    # Store result
    ct.store(c, index=(pid,), tile=result)

def launch_vector_add(a, b, c):
    """Launch the vector_add kernel"""
    n = a.shape[0]
    grid = (ct.cdiv(n, TILE_SIZE), 1, 1)
    stream = cupy.cuda.get_current_stream()
    ct.launch(stream, grid, vector_add, (a, b, c, n, TILE_SIZE))
```
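Round-tripping the kernel above through the programmatic API (documented further down) looks like this. The expected `'elementwise'` archetype string is an assumption consistent with the pattern table and the lowercase examples in the API section:

```typescript
import { transpile } from './lib/transpiler';

// Sketch: transpile the vectorAdd kernel shown above and print the
// generated cuTile Python.
const cudaSource = `
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) c[idx] = a[idx] + b[idx];
}`;

const result = await transpile(cudaSource);
console.log(result.pattern.archetype); // expected: 'elementwise'
console.log(result.tileCode);          // the Python shown above
```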
Input: Flash Attention CUDA Kernel

```cuda
__global__ void flash_attention_kernel(
    float* Q, float* K, float* V, float* O,
    int seq_len, int head_dim, float scale
) {
    // Complex multi-phase attention implementation
    // with online softmax and tiled matrix multiply
    ...
}
```

Output: cuTile Python (Flash Attention)

```python
import cuda_tile as ct
import cupy

BLOCK_Q = 64
BLOCK_KV = 64

@ct.kernel
def flash_attention(
    Q, K, V, O,
    seq_len_q: ct.Constant[int],
    seq_len_kv: ct.Constant[int],
    head_dim: ct.Constant[int],
    scale: ct.Constant[float],
    block_q: ct.Constant[int],
    block_kv: ct.Constant[int]
):
    """
    Flash Attention kernel - auto-transpiled from CUDA
    Confidence: 95%
    Variant: flash_attention_v2
    """
    block_q_idx = ct.bid(0)
    head_idx = ct.bid(1)

    # Initialize output accumulator and softmax stats
    acc = ct.zeros((block_q, head_dim), dtype=ct.float32)
    m_i = ct.full((block_q,), float('-inf'), dtype=ct.float32)
    l_i = ct.zeros((block_q,), dtype=ct.float32)

    # Load Q tile (stays in registers)
    q_tile = ct.load(Q, index=(head_idx, block_q_idx), shape=(block_q, head_dim))

    # Iterate over K,V blocks with online softmax
    for block_kv_idx in range(0, ct.cdiv(seq_len_kv, block_kv)):
        k_tile = ct.load(K, index=(head_idx, block_kv_idx), shape=(block_kv, head_dim))
        v_tile = ct.load(V, index=(head_idx, block_kv_idx), shape=(block_kv, head_dim))

        # QK^T with scaling
        qk = ct.tile_matmul(q_tile, ct.transpose(k_tile)) * scale

        # Online softmax update
        m_ij = ct.reduce(qk, op=ct.max, axis=1)
        m_new = ct.maximum(m_i, m_ij)
        alpha = ct.exp(m_i - m_new)
        acc = acc * alpha[:, None]
        l_i = l_i * alpha
        p = ct.exp(qk - m_new[:, None])
        l_ij = ct.reduce(p, op=ct.sum, axis=1)
        l_i = l_i + l_ij

        # Accumulate output
        acc = acc + ct.tile_matmul(p, v_tile)
        m_i = m_new

    # Normalize and store
    out = acc / l_i[:, None]
    ct.store(O, index=(head_idx, block_q_idx), tile=out)
```
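The loop body is the standard Flash Attention online-softmax recurrence. For reference (standard formulation, not emitted by the transpiler), with $\tau$ the scale factor, block $j$ updates the running row max $m$, normalizer $\ell$, and accumulator $A$ as:

$$
\begin{aligned}
S_j &= \tau\, Q K_j^{\top}, \qquad m^{\text{new}} = \max\big(m,\ \operatorname{rowmax}(S_j)\big), \qquad \alpha = e^{m - m^{\text{new}}},\\
P_j &= e^{S_j - m^{\text{new}}}, \qquad \ell \leftarrow \alpha\,\ell + \operatorname{rowsum}(P_j), \qquad A \leftarrow \alpha\,A + P_j V_j,
\end{aligned}
$$

with the final output $O = A/\ell$ after the last block, matching `out = acc / l_i[:, None]` above.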
Use the transpiler programmatically:

```typescript
import { transpile } from './lib/transpiler';
const result = await transpile(cudaCode);
// Access results
result.tileCode // Generated cuTile Python code
result.pattern.archetype // Detected pattern (e.g., 'attention', 'gemm')
result.pattern.confidence // Confidence score (0-1)
result.pattern.variant // Specific variant (e.g., 'flash_attention_v2')
result.validation.isValid // Validation status
result.diagnostics // Warnings and suggestions
result.memoryAnalysis // Memory access analysis
result.semanticAnalysis   // Semantic analysis results
```
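Continuing the snippet above, a minimal sketch of acting on a failed validation — diagnostic entries are assumed printable here; their exact shape isn't specified above:

```typescript
// Surface diagnostics when the generated code does not pass the
// validator, rather than shipping it blindly.
if (!result.validation.isValid) {
  for (const diagnostic of result.diagnostics) {
    console.warn(diagnostic);
  }
  throw new Error('Generated cuTile code failed validation');
}
```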
-H "Content-Type: application/json" \
-d '{"code": "__global__ void add(float* a, float* b, float* c, int n) { ... }"}'rightnow-tile/
```
rightnow-tile/
├── app/
│   ├── api/transpile/            # REST API endpoint
│   ├── components/               # React components
│   │   ├── ScientificVisualization.tsx
│   │   ├── ThemeProvider.tsx
│   │   └── ThemeToggle.tsx
│   ├── page.tsx                  # Main UI
│   └── globals.css               # Styling
├── lib/
│   ├── ast/                      # AST extraction & semantic analysis
│   │   ├── extractor.ts          # Kernel parsing
│   │   ├── semantic-analyzer.ts
│   │   ├── memory-analyzer.ts
│   │   ├── phase-analyzer.ts     # Multi-phase kernel detection
│   │   └── types.ts              # 18 archetypes, 60+ variants
│   ├── parser/
│   │   └── intrinsics.ts         # 150+ CUDA intrinsics
│   ├── patterns/                 # Pattern matchers (18 patterns)
│   │   └── matchers/
│   │       ├── attention.ts      # Flash Attention, MHA
│   │       ├── fused.ts          # Fused kernels
│   │       ├── fft.ts            # FFT variants
│   │       ├── gemm.ts           # Matrix multiply
│   │       ├── reduction.ts      # Reductions
│   │       ├── scan.ts           # Prefix sums
│   │       ├── stencil.ts        # Stencil patterns
│   │       ├── sparse.ts         # Sparse matrix ops
│   │       ├── histogram.ts      # Histogram
│   │       ├── convolution.ts    # CNN convolutions
│   │       ├── sorting.ts        # Sorting algorithms
│   │       ├── pooling.ts        # Pooling layers
│   │       ├── normalization.ts  # Norm layers
│   │       ├── embedding.ts      # Embeddings
│   │       ├── rope.ts           # Rotary embeddings
│   │       ├── kvcache.ts        # KV cache ops
│   │       ├── quantization.ts   # Quantization
│   │       └── elementwise.ts
│   ├── ir/                       # Intermediate representation
│   │   ├── builder.ts            # 11 specialized IR types
│   │   ├── optimizer.ts
│   │   └── types.ts
│   ├── codegen/                  # Code generation
│   │   ├── generator.ts          # Routes to all 18 archetypes
│   │   └── templates/            # 14 template files
│   │       ├── attention.ts
│   │       ├── fused.ts
│   │       ├── sparse.ts
│   │       ├── histogram.ts
│   │       ├── convolution.ts
│   │       ├── sorting.ts
│   │       ├── pooling.ts
│   │       ├── normalization.ts
│   │       ├── embedding.ts
│   │       ├── rope.ts
│   │       ├── kvcache.ts
│   │       ├── quantization.ts
│   │       ├── reduction.ts
│   │       └── stencil.ts
│   ├── validation/               # Validation & diagnostics
│   └── transpiler.ts             # Main entry point
├── docs/                         # Documentation
└── public/                       # Static assets
```
- Framework: Next.js 16 with Turbopack
- Language: TypeScript 5.9
- UI: React 19, Tailwind CSS, Framer Motion
- Editor: Monaco Editor
- Target: NVIDIA cuTile
Requirements:
- Node.js 18+
- npm or yarn
- For running generated code: NVIDIA Blackwell GPU (compute capability 10.x+)
```bash
# Build for production
npm run build

# Start production server
npm start
```

Deploy to Vercel, AWS, or any Node.js hosting platform.
We welcome contributions! Here's how to get started:
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
```bash
# Run development server
npm run dev

# Type checking
npx tsc --noEmit

# Build
npm run build
```

- Support for 18 CUDA patterns with 60+ variants
- Flash Attention and Transformer-specific patterns
- LLM inference patterns (RoPE, KV Cache, Quantization)
- Comprehensive convolution support (Winograd, im2col)
- Batch transpilation for multiple kernels
- Performance benchmarking comparisons
- VS Code extension integration
- CLI tool for CI/CD pipelines
- CUDA to Triton transpilation
This project is licensed under the MIT License — see the LICENSE file for details.
RightNow AI · GPU Kernel Code Editor
Live Demo • cuTile Docs • Discord • Issues
Made with ♥ by RightNow AI
