CUDA SIMT to cuTile Python Transpiler
Transform your CUDA kernels for NVIDIA Blackwell GPUs
Live Demo • Quick Start • Features • Patterns • Discord
RightNow Tile is a production-grade transpiler that converts traditional CUDA SIMT (Single Instruction, Multiple Threads) kernels into cuTile Python code — NVIDIA's new tile-based programming model optimized for Blackwell GPUs (compute capability 10.x+).
Part of the RightNow AI ecosystem — a code editor built for GPU kernel development.
NVIDIA's cuTile represents a paradigm shift in GPU programming:
| Traditional CUDA | cuTile |
|---|---|
| Thread-centric programming | Tile-centric programming |
| Manual memory coalescing | Automatic tile-based loads |
| Complex index calculations | Declarative tile operations |
| Low-level synchronization | High-level tile semantics |
RightNow Tile bridges the gap — take your existing CUDA kernels and transform them for next-gen hardware.
```bash
# Clone the repository
git clone https://github.com/RightNow-AI/RightNow-Tile.git
cd RightNow-Tile

# Install dependencies
npm install

# Start development server
npm run dev
```

Open http://localhost:3000 and start transpiling!
Automatically identifies 18 computational patterns with 60+ variant-specific optimizations:
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│    Your CUDA    │ ──► │  Pattern Match   │ ──► │    Optimized    │
│     Kernel      │     │    + Analysis    │     │   cuTile Code   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
```
```
CUDA Source
       │
       ▼
┌──────────────┐
│ 1. Extractor │  Parse kernel signatures, parameters, memory accesses
└──────┬───────┘
       ▼
┌──────────────┐
│ 2. Parser    │  Recognize 150+ CUDA intrinsics & index patterns
└──────┬───────┘
       ▼
┌──────────────┐
│ 3. Semantic  │  Detect reductions, dependencies, race conditions
└──────┬───────┘
       ▼
┌──────────────┐
│ 4. Memory    │  Analyze coalescing, bank conflicts, access patterns
└──────┬───────┘
       ▼
┌──────────────┐
│ 5. Pattern   │  Match against 18 patterns with confidence scoring
└──────┬───────┘
       ▼
┌──────────────┐
│ 6. IR Build  │  Generate intermediate representation with config
└──────┬───────┘
       ▼
┌──────────────┐
│ 7. Optimize  │  Select optimal tile sizes & configurations
└──────┬───────┘
       ▼
┌──────────────┐
│ 8. CodeGen   │  Apply variant-specific templates
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ 9. Validate  │  Verify correctness & generate diagnostics
└──────────────┘
       │
       ▼
cuTile Python
```
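The stages compose as a straight data pipeline. Here is a minimal TypeScript sketch of that shape, with identity stubs standing in for the real implementations under lib/ — the stage functions below are illustrative only, not the actual exported API:

```typescript
// Illustrative data flow only: each stage consumes the previous stage's
// output. The real implementations live in lib/ast, lib/parser,
// lib/patterns, lib/ir, lib/codegen and lib/validation.
const stages: Array<(input: unknown) => unknown> = [
  (x) => x, // 1. Extractor: kernel signatures, parameters, memory accesses
  (x) => x, // 2. Parser: CUDA intrinsics & index patterns
  (x) => x, // 3. Semantic: reductions, dependencies, race conditions
  (x) => x, // 4. Memory: coalescing, bank conflicts, access patterns
  (x) => x, // 5. Pattern: archetype match with confidence scoring
  (x) => x, // 6. IR Build: intermediate representation + config
  (x) => x, // 7. Optimize: tile sizes & configurations
  (x) => x, // 8. CodeGen: variant-specific templates
  (x) => x, // 9. Validate: correctness & diagnostics
];

const runPipeline = (cudaSource: string): unknown =>
  stages.reduce<unknown>((acc, stage) => stage(acc), cudaSource);
```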
- Monaco Editor — VS Code-quality editing with syntax highlighting
- Real-time Transpilation — See results instantly
- Dark/Light Themes — Easy on the eyes
- Expandable Output — Full-screen code view
- One-Click Copy — Get your code ready to deploy
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| GEMM | naive, tiled, register_blocked | Matrix multiplication, deep learning | High |
| Reduction | tree, warp_shuffle, multi_block, segmented | Sum, max, min, dot product | High |
| Scan | inclusive, exclusive, segmented | Prefix sum, stream compaction | High |
| Stencil | 1d_3pt, 1d_5pt, 2d_5pt, 2d_9pt, 3d | Image processing, PDE solvers | High |
| Elementwise | simple, vectorized | Point-wise operations | High |
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| Attention | flash_attention, flash_attention_v2, multi_head, causal, cross | Transformer models | High |
| Normalization | layernorm, rmsnorm, batchnorm, groupnorm, instancenorm | Neural network layers | High |
| Convolution | conv1d, conv2d, conv3d, depthwise, grouped, winograd, im2col | CNNs, signal processing | High |
| Pooling | max_pool_2d, avg_pool_2d, global_avg, global_max, adaptive | Feature downsampling | High |
| Embedding | lookup, embedding_bag, positional | NLP, recommender systems | Medium |
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| RoPE | standard, neox, cached | Rotary position embeddings | High |
| KV Cache | append, paged, prefix, gqa | LLM inference optimization | High |
| Quantization | int8, int4, fp8, dequantize | Model compression | Medium |
| Fused | matmul_activation, matmul_bias_activation, layernorm_residual | Kernel fusion | Medium |
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| FFT | radix2, radix4, radix8, inverse, real | Signal processing | High |
| Sparse | spmv_csr, spmv_csr_warp, spmv_coo, spmv_ell, spmm, sddmm | Sparse matrix operations | Medium |
| Histogram | atomic, privatized, multipass, weighted, 2d | Data distribution, statistics | Medium |
| Sorting | bitonic, bitonic_shared, radix, merge | Parallel sorting | Medium |
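Every match carries its archetype, variant, and confidence score, so callers can gate on them. A small sketch using the programmatic API shown further down — the 0.8 threshold here is an arbitrary, illustrative choice, not a library default:

```typescript
import { transpile } from './lib/transpiler';

// Report which archetype/variant a kernel matched and only accept the
// generated code above an (arbitrary, illustrative) confidence bar.
async function classify(cudaCode: string): Promise<string | null> {
  const result = await transpile(cudaCode);
  const { archetype, variant, confidence } = result.pattern;
  console.log(`${archetype}/${variant} at ${(confidence * 100).toFixed(0)}%`);
  return confidence >= 0.8 ? result.tileCode : null; // review low-confidence output by hand
}
```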
Input: CUDA SIMT Kernel

```cuda
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```

Output: cuTile Python

```python
import cuda_tile as ct
import cupy

TILE_SIZE = 256

@ct.kernel
def vector_add(a, b, c, n: ct.Constant[int], tile_size: ct.Constant[int]):
    """
    Elementwise kernel - auto-transpiled from CUDA
    Original: vectorAdd
    Confidence: 100%
    """
    pid = ct.bid(0)

    # Load input tiles
    a_tile = ct.load(a, index=(pid,), shape=(tile_size,))
    b_tile = ct.load(b, index=(pid,), shape=(tile_size,))

    # Compute
    result = a_tile + b_tile

    # Store result
    ct.store(c, index=(pid,), tile=result)

def launch_vector_add(a, b, c):
    """Launch the vector_add kernel"""
    n = a.shape[0]
    grid = (ct.cdiv(n, TILE_SIZE), 1, 1)
    stream = cupy.cuda.get_current_stream()
    ct.launch(stream, grid, vector_add, (a, b, c, n, TILE_SIZE))
```
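Round-tripping the kernel above through the programmatic API (documented further down) looks like this. The expected `'elementwise'` archetype string is an assumption consistent with the pattern table and the lowercase examples in the API section:

```typescript
import { transpile } from './lib/transpiler';

// Sketch: transpile the vectorAdd kernel shown above and print the
// generated cuTile Python.
const cudaSource = `
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) c[idx] = a[idx] + b[idx];
}`;

const result = await transpile(cudaSource);
console.log(result.pattern.archetype); // expected: 'elementwise'
console.log(result.tileCode);          // the Python shown above
```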
Input: Flash Attention CUDA Kernel

```cuda
__global__ void flash_attention_kernel(
    float* Q, float* K, float* V, float* O,
    int seq_len, int head_dim, float scale
) {
    // Complex multi-phase attention implementation
    // with online softmax and tiled matrix multiply
    ...
}
```

Output: cuTile Python (Flash Attention)

```python
import cuda_tile as ct
import cupy

BLOCK_Q = 64
BLOCK_KV = 64

@ct.kernel
def flash_attention(
    Q, K, V, O,
    seq_len_q: ct.Constant[int],
    seq_len_kv: ct.Constant[int],
    head_dim: ct.Constant[int],
    scale: ct.Constant[float],
    block_q: ct.Constant[int],
    block_kv: ct.Constant[int]
):
    """
    Flash Attention kernel - auto-transpiled from CUDA
    Confidence: 95%
    Variant: flash_attention_v2
    """
    block_q_idx = ct.bid(0)
    head_idx = ct.bid(1)

    # Initialize output accumulator and softmax stats
    acc = ct.zeros((block_q, head_dim), dtype=ct.float32)
    m_i = ct.full((block_q,), float('-inf'), dtype=ct.float32)
    l_i = ct.zeros((block_q,), dtype=ct.float32)

    # Load Q tile (stays in registers)
    q_tile = ct.load(Q, index=(head_idx, block_q_idx), shape=(block_q, head_dim))

    # Iterate over K,V blocks with online softmax
    for block_kv_idx in range(0, ct.cdiv(seq_len_kv, block_kv)):
        k_tile = ct.load(K, index=(head_idx, block_kv_idx), shape=(block_kv, head_dim))
        v_tile = ct.load(V, index=(head_idx, block_kv_idx), shape=(block_kv, head_dim))

        # QK^T with scaling
        qk = ct.tile_matmul(q_tile, ct.transpose(k_tile)) * scale

        # Online softmax update
        m_ij = ct.reduce(qk, op=ct.max, axis=1)
        m_new = ct.maximum(m_i, m_ij)
        alpha = ct.exp(m_i - m_new)
        acc = acc * alpha[:, None]
        l_i = l_i * alpha
        p = ct.exp(qk - m_new[:, None])
        l_ij = ct.reduce(p, op=ct.sum, axis=1)
        l_i = l_i + l_ij

        # Accumulate output
        acc = acc + ct.tile_matmul(p, v_tile)
        m_i = m_new

    # Normalize and store
    out = acc / l_i[:, None]
    ct.store(O, index=(head_idx, block_q_idx), tile=out)
```
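The loop body is the standard Flash Attention online-softmax recurrence. For reference (standard formulation, not emitted by the transpiler), with $\tau$ the scale factor, block $j$ updates the running row max $m$, normalizer $\ell$, and accumulator $A$ as:

$$
\begin{aligned}
S_j &= \tau\, Q K_j^{\top}, \qquad m^{\text{new}} = \max\big(m,\ \operatorname{rowmax}(S_j)\big), \qquad \alpha = e^{m - m^{\text{new}}},\\
P_j &= e^{S_j - m^{\text{new}}}, \qquad \ell \leftarrow \alpha\,\ell + \operatorname{rowsum}(P_j), \qquad A \leftarrow \alpha\,A + P_j V_j,
\end{aligned}
$$

with the final output $O = A/\ell$ after the last block, matching `out = acc / l_i[:, None]` above.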
Use the transpiler programmatically:

```typescript
import { transpile } from './lib/transpiler';
const result = await transpile(cudaCode);
// Access results
result.tileCode // Generated cuTile Python code
result.pattern.archetype // Detected pattern (e.g., 'attention', 'gemm')
result.pattern.confidence // Confidence score (0-1)
result.pattern.variant // Specific variant (e.g., 'flash_attention_v2')
result.validation.isValid // Validation status
result.diagnostics // Warnings and suggestions
result.memoryAnalysis // Memory access analysis
result.semanticAnalysis   // Semantic analysis results
```
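Continuing the snippet above, a minimal sketch of acting on a failed validation — diagnostic entries are assumed printable here; their exact shape isn't specified above:

```typescript
// Surface diagnostics when the generated code does not pass the
// validator, rather than shipping it blindly.
if (!result.validation.isValid) {
  for (const diagnostic of result.diagnostics) {
    console.warn(diagnostic);
  }
  throw new Error('Generated cuTile code failed validation');
}
```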
-H "Content-Type: application/json" \
-d '{"code": "__global__ void add(float* a, float* b, float* c, int n) { ... }"}'rightnow-tile/
```
rightnow-tile/
├── app/
│   ├── api/transpile/            # REST API endpoint
│   ├── components/               # React components
│   │   ├── ScientificVisualization.tsx
│   │   ├── ThemeProvider.tsx
│   │   └── ThemeToggle.tsx
│   ├── page.tsx                  # Main UI
│   └── globals.css               # Styling
├── lib/
│   ├── ast/                      # AST extraction & semantic analysis
│   │   ├── extractor.ts          # Kernel parsing
│   │   ├── semantic-analyzer.ts
│   │   ├── memory-analyzer.ts
│   │   ├── phase-analyzer.ts     # Multi-phase kernel detection
│   │   └── types.ts              # 18 archetypes, 60+ variants
│   ├── parser/
│   │   └── intrinsics.ts         # 150+ CUDA intrinsics
│   ├── patterns/                 # Pattern matchers (18 patterns)
│   │   └── matchers/
│   │       ├── attention.ts      # Flash Attention, MHA
│   │       ├── fused.ts          # Fused kernels
│   │       ├── fft.ts            # FFT variants
│   │       ├── gemm.ts           # Matrix multiply
│   │       ├── reduction.ts      # Reductions
│   │       ├── scan.ts           # Prefix sums
│   │       ├── stencil.ts        # Stencil patterns
│   │       ├── sparse.ts         # Sparse matrix ops
│   │       ├── histogram.ts      # Histogram
│   │       ├── convolution.ts    # CNN convolutions
│   │       ├── sorting.ts        # Sorting algorithms
│   │       ├── pooling.ts        # Pooling layers
│   │       ├── normalization.ts  # Norm layers
│   │       ├── embedding.ts      # Embeddings
│   │       ├── rope.ts           # Rotary embeddings
│   │       ├── kvcache.ts        # KV cache ops
│   │       ├── quantization.ts   # Quantization
│   │       └── elementwise.ts
│   ├── ir/                       # Intermediate representation
│   │   ├── builder.ts            # 11 specialized IR types
│   │   ├── optimizer.ts
│   │   └── types.ts
│   ├── codegen/                  # Code generation
│   │   ├── generator.ts          # Routes to all 18 archetypes
│   │   └── templates/            # 14 template files
│   │       ├── attention.ts
│   │       ├── fused.ts
│   │       ├── sparse.ts
│   │       ├── histogram.ts
│   │       ├── convolution.ts
│   │       ├── sorting.ts
│   │       ├── pooling.ts
│   │       ├── normalization.ts
│   │       ├── embedding.ts
│   │       ├── rope.ts
│   │       ├── kvcache.ts
│   │       ├── quantization.ts
│   │       ├── reduction.ts
│   │       └── stencil.ts
│   ├── validation/               # Validation & diagnostics
│   └── transpiler.ts             # Main entry point
├── docs/                         # Documentation
└── public/                       # Static assets
```
- Framework: Next.js 16 with Turbopack
- Language: TypeScript 5.9
- UI: React 19, Tailwind CSS, Framer Motion
- Editor: Monaco Editor
- Target: NVIDIA cuTile
Requirements:
- Node.js 18+
- npm or yarn
- For running generated code: NVIDIA Blackwell GPU (compute capability 10.x+)
```bash
# Build for production
npm run build

# Start production server
npm start
```

Deploy to Vercel, AWS, or any Node.js hosting platform.
We welcome contributions! Here's how to get started:
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
```bash
# Run development server
npm run dev

# Type checking
npx tsc --noEmit

# Build
npm run build
```

- Support for 18 CUDA patterns with 60+ variants
- Flash Attention and Transformer-specific patterns
- LLM inference patterns (RoPE, KV Cache, Quantization)
- Comprehensive convolution support (Winograd, im2col)
- Batch transpilation for multiple kernels
- Performance benchmarking comparisons
- VS Code extension integration
- CLI tool for CI/CD pipelines
- CUDA to Triton transpilation
This project is licensed under the MIT License — see the LICENSE file for details.
RightNow AI · GPU Kernel Code Editor
Live Demo • cuTile Docs • Discord • Issues
Made with ♥ by RightNow AI
