Skip to content

hw-native-sys/pypto-lib

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

213 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PyPTO-Lib

Tensor-level kernels and model implementations built on the pypto programming framework, targeting Ascend NPUs (910B/C, 950).

examples/        Self-contained kernels for learning the DSL
  beginner/        hello_world, matmul,etc.
  intermediate/    softmax, rms_norm, etc.
models/          End-to-end LLM kernels organized by family
  qwen3/14b/       Qwen3-14B prefill + decode
  qwen3/32b/       Qwen3-32B decode
  deepseek/v3_2/   DeepSeek V3.2-EXP
  deepseek/v4/     DeepSeek V4
llm/             Minimal inference engine (engine, scheduler, kv-cache, sampler)
                 with a Qwen3-14B example wiring decode + prefill + lm_head
golden/          Test harness — compile, run on device, validate against torch
tests/           Lint checks and golden-fn unit tests
docs/            Coding-style and workflow reference

Files ending in _draft.py are works-in-progress and excluded from CI.

Quick start

Install pypto + simpler + ptoas (see .claude/skills/setup_env/SKILL.md or use the /setup_env skill), then run any example:

python examples/beginner/hello_world.py -p a2a3sim   # simulator
python models/qwen3/14b/qwen3_14b_decode.py -p a2a3 -d 0   # real NPU, device 0

Every example accepts -p {a2a3, a2a3sim, a5, a5sim} and -d <device_id>, and exits non-zero on validation mismatch. See docs/compile-runtime-workflow.md for the full flow (compile → input gen → runtime → golden → validate).

Writing a kernel

Read docs/pypto-coding-style.md — it covers program structure (@pl.program + @pl.function + pl.at scopes), the four loop constructs (pl.range, pl.parallel, pl.pipeline, pl.spmd), and the vector / cube / mte op set.

Existing kernels under examples/intermediate/ are the best reference for single-stage patterns; models/qwen3/14b/qwen3_14b_decode.py shows a full-model fused kernel.

Dependencies

Repo Role
pypto Tile-based programming framework — lowers Tensor → Tile → Block → Execution graphs through multi-level IR and codegen
simpler PTO runtime — builds and executes task dependency graphs across AICPU + AICore on Ascend devices (submodule of pypto)
ptoas LLVM/MLIR-based assembler/optimizer for PTO Bytecode — parses .pto, runs Da Vinci-specific passes, lowers to C++
pto-isa PTO Tile Library — virtual tile-ISA implementations and headers shared across Ascend generations

Pinned versions live in .github/workflows/ci.yml.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors