Continuous pretraining on Tensor Series models and custom base model variations. Point it at your data, get a production model back.
tensor-pretrain is part of the Tensor Framework by Netangular. It takes either a Tensor Series base model or any compatible HuggingFace pretrained base, and lets you inject new knowledge into it — from a local directory of files — with automatic dataset conversion, tokenization, and training handled for you.
No preprocessing pipelines. No data wrangling. Just your files and a production-ready model on the other side.
| Library | Purpose |
|---|---|
| `tensor-pretrain` | Continuous pretraining on Tensor or custom base models ← you are here |
| `tensor-adapt` | Teach behavior via low-rank adapters (instruct / chat) |
| `tensor-datagen` | Generate instruct fine-tune datasets from your files |
| `tensor-inference` | Run `.safetensors` output at high velocity (C++) |
tensor-pretrain supports two base model paths — the Tensor Series (managed, registry-backed) and Custom Upstream (any compatible HuggingFace pretrained base). Both go through the same full pipeline:
- Point it at a directory — raw files, code, documents, mixed formats. The library recursively walks the directory, converts everything to a training-ready format, and tokenizes it automatically against the model's tokenizer.
- Configure your run — set your token budget, devices, and output path. That's it.
- Get a production model — output is a clean `.safetensors` checkpoint, ready for `tensor-adapt` or `tensor-inference`.
Tensor Series base models ship without instruct fine-tuning — they are pure completion engines, which makes them ideal starting points for domain-specific pretraining. There's no pre-baked behavior to fight against and no alignment tax on your data.
All Tensor Series models come with strong math and coding capabilities baked in at the base training stage, giving you a high-quality, managed foundation to build from. The registry handles architecture updates automatically when using :latest.
Tip: Using `:latest` ensures you always pull the highest-performing architecture for that size class. If you need reproducibility, pin a version like `tensor-pro:1.0`.
If you want to build on a base model outside the Tensor registry — such as Qwen3, Mistral 3, or any other compatible open-weight model — pass the HuggingFace ID directly to Pretrain. Tensor still handles the full pipeline: dataset conversion, tokenization, distributed training, and .safetensors export.
Note: Instruction-tuned variants (e.g. `-it`, `-instruct`, `-Chat`) are not supported and will raise a `ModelVariantError` at load time. Always use the base variant.
Timing estimates assume NVIDIA A100 80GB GPUs with FlashAttention-3 enabled. Token counts below are recommended targets for effective domain transfer.
Compiler source trees, mathematics textbooks, or large scientific and systems programming corpora.
| Model | Tag | Params | Rec. Tokens | Cluster Size | Est. Wall Time |
|---|---|---|---|---|---|
| Tensor Nano | `tensor-nano:latest` | 0.6B | ~1B | 1× A100 | ~7 hrs |
| Tensor Micro | `tensor-micro:latest` | 1.5B | ~2B | 1× A100 | ~32 hrs |
| Tensor Mini | `tensor-mini:latest` | 3B | ~3B | 2× A100 | ~25 hrs |
| Tensor Pro | `tensor-pro:latest` | 7B | ~5B | 8× A100 | ~24 hrs |
| Tensor Matrix | `tensor-matrix:latest` | 14B | ~5B | 8× A100 | ~48 hrs |
| Tensor Ultra | `tensor-ultra:latest` | 32B+ | ~10B | 16× A100 | ~4–5 days |
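A quick back-of-envelope check on the table: dividing recommended tokens by GPU-hours gives the implied per-GPU throughput, which falls as parameter count grows. The arithmetic below uses the table's own numbers (taking ~4.5 days for Tensor Ultra) and is only a sanity-check sketch, since real throughput also depends on sequence length, batch size, and interconnect:

```python
# (recommended tokens, GPUs, wall-clock hours) per row of the table above
rows = {
    "tensor-nano":   (1e9,   1,   7),
    "tensor-micro":  (2e9,   1,  32),
    "tensor-mini":   (3e9,   2,  25),
    "tensor-pro":    (5e9,   8,  24),
    "tensor-matrix": (5e9,   8,  48),
    "tensor-ultra":  (10e9, 16, 108),  # ~4.5 days, midpoint of 4-5
}

def tokens_per_gpu_hour(tokens: float, gpus: int, hours: float) -> float:
    return tokens / (gpus * hours)

for name, (tokens, gpus, hours) in rows.items():
    rate = tokens_per_gpu_hour(tokens, gpus, hours)
    print(f"{name:14s} ~{rate / 1e6:.0f}M tokens/GPU-hr")
```

You can reuse `tokens_per_gpu_hour` in reverse to rough out wall time for your own token budget before running `estimate()`.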
A specific language's stdlib, a niche compiler spec, or a proprietary API surface.
| Model | Tag | Params | Rec. Tokens | Cluster Size | Est. Wall Time |
|---|---|---|---|---|---|
| Tensor Nano | `tensor-nano:latest` | 0.6B | ~10M | 1× A100 | ~5 min |
| Tensor Micro | `tensor-micro:latest` | 1.5B | ~20M | 1× A100 | ~20 min |
| Tensor Mini | `tensor-mini:latest` | 3B | ~50M | 1× A100 | ~1 hr |
| Tensor Pro | `tensor-pro:latest` | 7B | ~100M | 2× A100 | ~2 hrs |
| Tensor Matrix | `tensor-matrix:latest` | 14B | ~200M | 4× A100 | ~3 hrs |
| Tensor Ultra | `tensor-ultra:latest` | 32B+ | ~500M | 8× A100 | ~7 hrs |
```shell
pip install tensor-pretrain
```

```python
from tensor.pretrain import Pretrain

run = Pretrain(
    base="tensor-pro:latest",
    data="./my-knowledge-base",
    output="./output/my-model",
)
run.train()
```

Pass any compatible HuggingFace base ID directly — the library resolves the path automatically.
```python
from tensor.pretrain import Pretrain

run = Pretrain(
    base="Qwen/Qwen3-8B-Base",
    data="./my-knowledge-base",
    output="./output/my-qwen-base",
)
run.train()
```

`RunConfig` is optional. When omitted, tensor-pretrain infers sensible defaults from the model size and available hardware.
```python
from tensor.pretrain import Pretrain, RunConfig

run = Pretrain(
    base="tensor-pro:latest",
    data="./my-knowledge-base",
    output="./output/my-model",
    config=RunConfig(
        total_tokens=5_000_000_000,
        devices=8,
        dtype="bfloat16",
    ),
)
run.train()
```

`data=` accepts a path string, a single source, or a weighted `Mix`. All three are valid:
```python
data="./my-docs"                          # path string
data=LocalSource("./my-docs")             # explicit single source
data=Mix({ LocalSource(...): 0.7, ... })  # weighted mix
```

For multi-source runs:
```python
from tensor.pretrain import Pretrain
from tensor.data import Mix, LocalSource, HubSource

run = Pretrain(
    base="tensor-matrix:latest",
    data=Mix({
        LocalSource("./my-docs"): 0.7,
        HubSource("HuggingFaceTB/smollm-corpus", subset="python-edu"): 0.3,
    }),
    output="./output/my-matrix-base",
)
run.train()
```

Before committing to a full training run, use `validate()` and `estimate()` to catch problems early and review projected cost and time.
```python
run = Pretrain(
    base="tensor-pro:latest",
    data="./my-knowledge-base",
    output="./output/my-model",
)

run.validate()  # checks data, model, and hardware — raises early if anything is wrong
run.estimate()  # prints token count, wall time estimate, and cluster recommendation
run.train()
```

`run.estimate()` output:
```
base       tensor-pro:latest (7B)
data       ./my-knowledge-base → 4.2B tokens detected
devices    8× A100 80GB
est. time  ~20 hrs
output     ./output/my-model
```
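The ~20 hrs figure lines up with the Tensor Pro row of the deep-domain table above (~5B tokens in ~24 hrs on 8× A100). Scaling that rate to the detected token count, as a sanity-check sketch only:

```python
# Tensor Pro row of the sizing table: ~5B tokens in ~24 hrs on 8x A100.
table_tokens, table_hours = 5e9, 24

# Detected corpus size from the estimate() output above.
detected_tokens = 4.2e9

# Same cluster, so wall time scales linearly with tokens.
est_hours = detected_tokens / table_tokens * table_hours
print(f"~{est_hours:.0f} hrs")  # → ~20 hrs
```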
```python
from tensor.pretrain import Pretrain

run = Pretrain.resume("./output/my-model")
run.train()
```

`Pretrain.resume()` reads the run config and checkpoint state from the output directory. No need to reconstruct the original arguments.
After run.train() completes, the result is available on the instance:
```python
run.train()

print(run.result.checkpoint)      # path to .safetensors output
print(run.result.tokens_trained)  # actual tokens consumed
print(run.result.elapsed)         # wall time
```

The `.safetensors` checkpoint is your domain-trained base model, ready to take directly into tensor-adapt for instruct or chat fine-tuning.
Tensor models are designed specifically for agentic workflows.
Most chat models come with an alignment tax — they are fine-tuned to be polite assistants, which makes them unreliable for strict agentic tasks. They often refuse to output raw JSON, add conversational filler that breaks parsers, or reject valid tool calls due to over-safety tuning.
Pretrained base models are neutral. They have no personality, no refusal mechanisms, and no chat formatting.
- Pick your base — a Tensor Series model for a managed foundation, or a custom upstream for full control.
- Use `tensor-pretrain` to inject your domain knowledge.
- Use `tensor-adapt` to define behavior strictly on your terms — function calling, JSON output, or any task format you need.
Because the base is neutral, it learns exactly the behavior you define — nothing more, nothing less.
We treat the AI ecosystem exactly like the Linux ecosystem.
- Upstream (The Kernel): The raw, complex, bleeding-edge engine (e.g., Llama, Mistral, Qwen, Linux Kernel).
- Downstream (The Distro): The polished, stable, production-ready product (e.g., Tensor, Ubuntu, Android).
You choose Tensor for the same reason you choose Ubuntu: you want the power of the kernel without the headache of managing it. And just like Linux, you can also bring your own kernel if you know what you're doing.
| Ecosystem | The "Upstream" Kernel | The "Distro" Platform | What The User Actually Gets |
|---|---|---|---|
| Server OS | Debian / Linux Kernel | Ubuntu LTS | Security updates, apt-get, and a system that boots every time. |
| Mobile OS | Linux Kernel | Android | A seamless touch interface and apps. No need to know which kernel version is running. |
| Tensor AI | SOTA Foundation Weights | Your trained model | A model that knows your domain. Tensor handles tokenizer alignment, distributed training, and .safetensors export. |
| Component | Role |
|---|---|
| `tensor.pretrain.Pretrain` | Main entry point — configuration, validation, estimation, and training |
| `tensor.pretrain.Pretrain.resume` | Reconstructs a run from an existing output directory checkpoint |
| `tensor.pretrain.RunConfig` | Optional training parameters — tokens, devices, dtype |
| `tensor.data.LocalSource` | Recursive file ingestion, format conversion, and tokenization |
| `tensor.data.HubSource` | HuggingFace Hub dataset source |
| `tensor.data.Mix` | Multi-source weighted token mixing |
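The weighted mixing that `tensor.data.Mix` provides can be illustrated in a few lines of plain Python: draw each training batch from a source with probability proportional to its weight. This is a conceptual sketch of proportional sampling, not the library's implementation, and `mix_schedule` is an invented name:

```python
import random

def mix_schedule(weights: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Draw n batch-source labels with probability proportional to weight.
    Conceptual sketch only — not tensor.data.Mix internals."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

# With weights 0.7 / 0.3, roughly 70% of batches come from local docs.
schedule = mix_schedule({"local-docs": 0.7, "python-edu": 0.3}, n=1000)
print(schedule.count("local-docs") / len(schedule))  # close to 0.7
```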
- v1.0 (Current): Tensor Series support and custom upstream via HuggingFace ID, local and Hub data sources
- v1.1: Full Tensor model size support · automatic reasoning-retention mixing
- v1.2: Direct export to `tensor-inference` optimised graph format
Apache 2.0 — free for commercial and private use within the Tensor Framework ecosystem.
Part of the Tensor Framework by Netangular.