# TorchSpec Roadmap 2026 Q2

## Model Support

## Training
- Packed sequence training: pack multiple shorter sequences into a single training sample to maximize GPU utilization and reduce padding waste, especially for datasets with variable-length inputs.
- Additional training methods: expand beyond Eagle3 to support DFlash, MTP, and other speculative decoding training approaches, broadening the range of draft model architectures TorchSpec can train.
- LK loss (PR #29): add LK^alpha and LK^lambda losses for direct acceptance-rate optimization, improving average acceptance length by 3-8% over Forward KL on Eagle3.
- Context parallel under DP ranks: support context parallelism within each data-parallel rank group, so long sequences can be sharded across the GPUs of a single data-parallel replica.
- FlexAttention native FA4 backend (Issue #30): adopt `BACKEND="FLASH"` in FlexAttention to unify the `flex_attention` and `fa_experimental` code paths, replacing manual CuTeDSL integration with a stable PyTorch API for FA4-level performance on Hopper/Blackwell GPUs.
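To make the packed-sequence item above concrete, here is a minimal, hypothetical sketch (not TorchSpec's actual API) of greedy first-fit packing. It also records the cumulative sequence-length offsets (`cu_seqlens`) that varlen attention kernels typically consume to keep packed sequences from attending across pack boundaries:

```python
# Hypothetical sketch of packed-sequence batching: greedily pack
# variable-length token sequences into fixed-size samples and record
# the cumulative sequence lengths (cu_seqlens) per pack. All names
# here are illustrative, not TorchSpec's real interface.

def pack_sequences(seqs, max_len):
    """Greedy first-fit packing of token sequences into samples of max_len."""
    packs = []  # each pack: {"tokens": [...], "cu_seqlens": [0, ...]}
    for seq in sorted(seqs, key=len, reverse=True):
        if len(seq) > max_len:
            raise ValueError("sequence longer than max_len; truncate or split first")
        for pack in packs:
            if len(pack["tokens"]) + len(seq) <= max_len:
                pack["tokens"].extend(seq)
                pack["cu_seqlens"].append(len(pack["tokens"]))
                break
        else:  # no existing pack had room: open a new one
            packs.append({"tokens": list(seq), "cu_seqlens": [0, len(seq)]})
    return packs

packs = pack_sequences([[1] * 300, [2] * 200, [3] * 150, [4] * 60], max_len=512)
# two packs: sequences of length [300, 200] and [150, 60]
```

With padding, these four sequences would need four 512-token samples (1338 padded tokens of waste); packed, they fit in two.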
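For the context-parallel item above, a common layout is to nest context-parallel (CP) groups inside data-parallel (DP) replicas. The sketch below is only an illustration of one possible rank layout (consecutive ranks form a CP group; ranks at the same CP position across replicas form a DP group); TorchSpec's actual device mesh may differ:

```python
# Hypothetical rank grouping for context parallelism nested inside
# data parallelism. Group layout only; no communication is set up here.

def build_groups(world_size, cp_size):
    assert world_size % cp_size == 0, "world size must be divisible by cp size"
    dp_size = world_size // cp_size
    # Each CP group holds cp_size consecutive ranks (one DP replica);
    # a long sequence is sharded across the ranks of one CP group.
    cp_groups = [list(range(i * cp_size, (i + 1) * cp_size)) for i in range(dp_size)]
    # Each DP group holds the ranks at the same CP position across replicas;
    # gradients are all-reduced within these groups.
    dp_groups = [list(range(j, world_size, cp_size)) for j in range(cp_size)]
    return cp_groups, dp_groups

cp_groups, dp_groups = build_groups(world_size=8, cp_size=2)
# cp_groups: [[0, 1], [2, 3], [4, 5], [6, 7]]
# dp_groups: [[0, 2, 4, 6], [1, 3, 5, 7]]
```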
## Inference
- TensorRT-LLM integration: add TensorRT-LLM as an inference backend alongside SGLang and vLLM, so users can plug in whichever engine best fits their deployment stack.
- Inference auto-expansion: automatically scale out inference engines when more nodes become available.
- Chunked prefill: support chunked prefill to allow longer contexts.
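The idea behind chunked prefill is that instead of running the whole prompt through one prefill pass (whose per-step memory grows with prompt length), the engine processes the prompt in fixed-size chunks, carrying the KV cache forward between chunks. A minimal sketch, where `model_forward` is a stand-in for the real engine step, not an actual API:

```python
# Hypothetical sketch of chunked prefill: process the prompt in
# fixed-size chunks so per-step work is bounded by chunk_size rather
# than the full prompt length. `model_forward` is illustrative only.

def chunked_prefill(tokens, chunk_size, model_forward):
    kv_cache = []  # grows as tokens are processed
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # Each chunk attends to itself and to everything already cached.
        kv_cache = model_forward(chunk, kv_cache)
    return kv_cache

# Toy model_forward that just appends the chunk's tokens to the cache.
cache = chunked_prefill(list(range(10)), chunk_size=4,
                        model_forward=lambda chunk, cache: cache + chunk)
# cache covers all 10 prompt tokens, built in steps of at most 4
```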
## Framework
- Placement group node pinning by IP: allow users to pin inference engines to specific nodes by IP address, with finer-grained control when multiple inference engines share a node.
- Automatic Mooncake config determination: derive the Mooncake transfer config from the batch size and max sampling pool size; auto-compute the max sampling pool size as `global_batch_size * delay_deletion_ratio`.
- Debugging mode: add a debugging mode for both the inference and training sides.
- Colocated training: design and implement a colocated training mode.
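The Mooncake item above gives one concrete formula, so a derivation helper is easy to sketch. The returned field names below are illustrative assumptions (the real Mooncake config schema may differ); only the `global_batch_size * delay_deletion_ratio` formula comes from the roadmap:

```python
# Hypothetical sketch of auto-deriving the Mooncake transfer config.
# Only the pool-size formula is from the roadmap; field names and the
# transfer-slot heuristic are illustrative assumptions.

import math

def derive_mooncake_config(global_batch_size, delay_deletion_ratio):
    # Round up so the pool never undershoots the number of samples that
    # may be alive at once under delayed deletion.
    max_pool = math.ceil(global_batch_size * delay_deletion_ratio)
    return {
        "max_sampling_pool_size": max_pool,
        # One transfer slot per batch element as a simple starting point.
        "transfer_slots": global_batch_size,
    }

cfg = derive_mooncake_config(global_batch_size=256, delay_deletion_ratio=1.5)
# cfg["max_sampling_pool_size"] == 384
```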