
tensor-pretrain

Continuous pretraining for Tensor Series models and custom upstream base models. Point it at your data, get a production model back.

tensor-pretrain is part of the Tensor Framework by Netangular. It takes either a Tensor Series base model or any compatible HuggingFace pretrained base, and lets you inject new knowledge into it — from a local directory of files — with automatic dataset conversion, tokenization, and training handled for you.

No preprocessing pipelines. No data wrangling. Just your files and a production-ready model on the other side.


The Tensor Framework

tensor-pretrain    Continuous pretraining on Tensor or custom base models   ← you are here
tensor-adapt       Teach behavior via low-rank adapters (instruct / chat)
tensor-datagen     Generate instruct fine-tune datasets from your files
tensor-inference   Run .safetensors output at high velocity (C++)

How It Works

tensor-pretrain supports two base model paths — the Tensor Series (managed, registry-backed) and Custom Upstream (any compatible HuggingFace pretrained base). Both go through the same full pipeline:

  1. Point it at a directory — raw files, code, documents, mixed formats. The library recursively walks the directory, converts everything to a training-ready format, and tokenizes it automatically against the model's tokenizer.
  2. Configure your run — set your token budget, devices, and output path. That's it.
  3. Get a production model — output is a clean .safetensors checkpoint, ready for tensor-adapt or tensor-inference.
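The recursive walk in step 1 can be illustrated in plain Python; `collect_files` below is a hypothetical stand-in for the library's ingester, not its actual implementation:

```python
from pathlib import Path

def collect_files(root: str, skip_hidden: bool = True) -> list[Path]:
    """Recursively gather every regular file under root, mirroring the
    kind of walk a local-directory ingester performs before conversion
    and tokenization."""
    files = []
    for path in sorted(Path(root).rglob("*")):
        rel = path.relative_to(root)
        # Skip dotfiles and dot-directories anywhere in the relative path.
        if skip_hidden and any(part.startswith(".") for part in rel.parts):
            continue
        if path.is_file():
            files.append(path)
    return files
```

Mixed formats are fine because everything is converted downstream; the walk itself only has to be exhaustive.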

Base Model Paths

Tensor Series

Tensor Series base models ship without instruct fine-tuning — they are pure completion engines, which makes them ideal starting points for domain-specific pretraining. There's no pre-baked behavior to fight against and no alignment tax on your data.

All Tensor Series models come with strong math and coding capabilities baked in at the base training stage, giving you a high-quality, managed foundation to build from. The registry handles architecture updates automatically when using :latest.

Tip: Using :latest ensures you always pull the highest-performing architecture for that size class. If you need reproducibility, pin a version like tensor-pro:1.0.

Custom Upstream

If you want to build on a base model outside the Tensor registry — such as Qwen3, Mistral 3, or any other compatible open-weight model — pass the HuggingFace ID directly to Pretrain. Tensor still handles the full pipeline: dataset conversion, tokenization, distributed training, and .safetensors export.

Note: Instruction-tuned variants (e.g. -it, -instruct, -Chat) are not supported and will raise a ModelVariantError at load time. Always use the base variant.
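The screen described in the note can be approximated with a simple suffix check. `looks_instruct_tuned` is a hypothetical illustration, not the library's actual validation logic:

```python
# Common naming markers for instruction-tuned variants on the Hub.
INSTRUCT_MARKERS = ("-it", "-instruct", "-chat")

def looks_instruct_tuned(model_id: str) -> bool:
    """Heuristic check for instruction-tuned variant names,
    e.g. 'google/gemma-2-9b-it' or 'Qwen/Qwen3-8B-Instruct'."""
    name = model_id.rsplit("/", 1)[-1].lower()
    return name.endswith(INSTRUCT_MARKERS)
```

A real loader would inspect the model config rather than the name alone, but the naming convention catches the common cases.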


Supported Tensor Series Models

Timing estimates assume NVIDIA A100 80GB GPUs with FlashAttention-3 enabled. Token counts below are recommended targets for effective domain transfer.

Large Domain

Compiler source trees, mathematics textbooks, or large scientific and systems programming corpora.

| Model | Tag | Params | Rec. Tokens | Cluster Size | Est. Wall Time |
| --- | --- | --- | --- | --- | --- |
| Tensor Nano | `tensor-nano:latest` | 0.6B | ~1B | 1× A100 | ~7 hrs |
| Tensor Micro | `tensor-micro:latest` | 1.5B | ~2B | 1× A100 | ~32 hrs |
| Tensor Mini | `tensor-mini:latest` | 3B | ~3B | 2× A100 | ~25 hrs |
| Tensor Pro | `tensor-pro:latest` | 7B | ~5B | 8× A100 | ~24 hrs |
| Tensor Matrix | `tensor-matrix:latest` | 14B | ~5B | 8× A100 | ~48 hrs |
| Tensor Ultra | `tensor-ultra:latest` | 32B+ | ~10B | 16× A100 | ~4–5 days |

Focused Domain

A specific language's stdlib, a niche compiler spec, or a proprietary API surface.

| Model | Tag | Params | Rec. Tokens | Cluster Size | Est. Wall Time |
| --- | --- | --- | --- | --- | --- |
| Tensor Nano | `tensor-nano:latest` | 0.6B | ~10M | 1× A100 | ~5 min |
| Tensor Micro | `tensor-micro:latest` | 1.5B | ~20M | 1× A100 | ~20 min |
| Tensor Mini | `tensor-mini:latest` | 3B | ~50M | 1× A100 | ~1 hr |
| Tensor Pro | `tensor-pro:latest` | 7B | ~100M | 2× A100 | ~2 hrs |
| Tensor Matrix | `tensor-matrix:latest` | 14B | ~200M | 4× A100 | ~3 hrs |
| Tensor Ultra | `tensor-ultra:latest` | 32B+ | ~500M | 8× A100 | ~7 hrs |
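The tables imply a rough per-GPU throughput you can use to sanity-check your own budget. A back-of-the-envelope calculation (illustrative only; real throughput varies with sequence length, interconnect, and batch size):

```python
def gpu_hours(tokens: float, tokens_per_gpu_hour: float) -> float:
    """Estimate total GPU-hours needed for a given token budget."""
    return tokens / tokens_per_gpu_hour

# Implied throughput from the Large Domain row for tensor-pro:
# ~5B tokens on 8x A100 in ~24 hrs -> roughly 26M tokens per GPU-hour.
implied = 5e9 / (8 * 24)
```

Scaling that figure to your own token count gives a first-order wall-time estimate before you ever call `estimate()`.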

Quick Start

Install

pip install tensor-pretrain

Tensor Series Path

from tensor.pretrain import Pretrain

run = Pretrain(
    base="tensor-pro:latest",
    data="./my-knowledge-base",
    output="./output/my-model",
)

run.train()

Custom Upstream Path

Pass any compatible HuggingFace base ID directly — the library resolves the path automatically.

from tensor.pretrain import Pretrain

run = Pretrain(
    base="Qwen/Qwen3-8B-Base",
    data="./my-knowledge-base",
    output="./output/my-qwen-base",
)

run.train()

Configuration

RunConfig is optional. When omitted, tensor-pretrain infers sensible defaults from the model size and available hardware.

from tensor.pretrain import Pretrain, RunConfig

run = Pretrain(
    base="tensor-pro:latest",
    data="./my-knowledge-base",
    output="./output/my-model",
    config=RunConfig(
        total_tokens=5_000_000_000,
        devices=8,
        dtype="bfloat16",
    ),
)

run.train()

Data Mixing

data= accepts a path string, a single source, or a weighted Mix. All three are valid:

data="./my-docs"                         # path string
data=LocalSource("./my-docs")            # explicit single source
data=Mix({ LocalSource(...): 0.7, ... }) # weighted mix

For multi-source runs:

from tensor.pretrain import Pretrain
from tensor.data import Mix, LocalSource, HubSource

run = Pretrain(
    base="tensor-matrix:latest",
    data=Mix({
        LocalSource("./my-docs"): 0.7,
        HubSource("HuggingFaceTB/smollm-corpus", subset="python-edu"): 0.3,
    }),
    output="./output/my-matrix-base",
)

run.train()
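Mix weights read naturally as fractions of the total token budget. A hypothetical sketch of that allocation (not the library's internals):

```python
def allocate_tokens(total: int, weights: dict[str, float]) -> dict[str, int]:
    """Split a token budget across sources proportionally to their weights.
    Weights are normalized, so they need not sum to exactly 1.0."""
    scale = sum(weights.values())
    return {name: round(total * w / scale) for name, w in weights.items()}

budget = allocate_tokens(1_000_000, {"./my-docs": 0.7, "smollm-corpus": 0.3})
```

With the 0.7/0.3 mix above, a 1M-token run draws roughly 700k tokens from the local docs and 300k from the Hub corpus.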

Validate and Estimate

Before committing to a full training run, use validate() and estimate() to catch problems early and review projected cost and time.

run = Pretrain(
    base="tensor-pro:latest",
    data="./my-knowledge-base",
    output="./output/my-model",
)

run.validate()   # checks data, model, and hardware — raises early if anything is wrong
run.estimate()   # prints token count, wall time estimate, and cluster recommendation
run.train()

run.estimate() output:

  base          tensor-pro:latest (7B)
  data          ./my-knowledge-base  →  4.2B tokens detected
  devices       8× A100 80GB
  est. time     ~20 hrs
  output        ./output/my-model

Resuming

from tensor.pretrain import Pretrain

run = Pretrain.resume("./output/my-model")
run.train()

Pretrain.resume() reads the run config and checkpoint state from the output directory. No need to reconstruct the original arguments.


Results

After run.train() completes, the result is available on the instance:

run.train()

print(run.result.checkpoint)       # path to .safetensors output
print(run.result.tokens_trained)   # actual tokens consumed
print(run.result.elapsed)          # wall time

The .safetensors checkpoint is your domain-trained base model, ready to take directly into tensor-adapt for instruct or chat fine-tuning.
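The .safetensors container itself is easy to inspect: per the safetensors format, the file starts with an 8-byte little-endian header length, followed by a JSON header mapping tensor names to dtype, shape, and byte offsets. A minimal header reader, requiring no ML framework:

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file: tensor names
    mapped to {"dtype", "shape", "data_offsets"}."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian header size.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))
```

This is handy for confirming tensor names and shapes in a checkpoint before handing it to a downstream tool.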


Building Production Agents

Tensor models are designed specifically for agentic workflows.

Most chat models carry an alignment tax: they are fine-tuned to be polite assistants, which makes them unreliable for strict agentic tasks. They often refuse to output raw JSON, add conversational filler that breaks parsers, or reject valid tool calls because of overly cautious safety tuning.

Pretrained base models are neutral. They have no personality, no refusal mechanisms, and no chat formatting.

  1. Pick your base — a Tensor Series model for a managed foundation, or a custom upstream for full control.
  2. Use tensor-pretrain to inject your domain knowledge.
  3. Use tensor-adapt to define behavior strictly on your terms — function calling, JSON output, or any task format you need.

Because the base is neutral, it learns exactly the behavior you define — nothing more, nothing less.


The "Distro" Philosophy

We treat the AI ecosystem exactly like the Linux ecosystem.

  • Upstream (The Kernel): The raw, complex, bleeding-edge engine (e.g., Llama, Mistral, Qwen, Linux Kernel).
  • Downstream (The Distro): The polished, stable, production-ready product (e.g., Tensor, Ubuntu, Android).

You choose Tensor for the same reason you choose Ubuntu: you want the power of the kernel without the headache of managing it. And just like Linux, you can also bring your own kernel if you know what you're doing.

| Ecosystem | The "Upstream" Kernel | The "Distro" Platform | What The User Actually Gets |
| --- | --- | --- | --- |
| Server OS | Debian / Linux Kernel | Ubuntu LTS | Security updates, apt-get, and a system that boots every time. |
| Mobile OS | Linux Kernel | Android | A seamless touch interface and apps. No need to know which kernel version is running. |
| Tensor AI | SOTA Foundation Weights | Your trained model | A model that knows your domain. Tensor handles tokenizer alignment, distributed training, and .safetensors export. |

Architecture

| Component | Role |
| --- | --- |
| `tensor.pretrain.Pretrain` | Main entry point — configuration, validation, estimation, and training |
| `tensor.pretrain.Pretrain.resume` | Reconstructs a run from an existing output directory checkpoint |
| `tensor.pretrain.RunConfig` | Optional training parameters — tokens, devices, dtype |
| `tensor.data.LocalSource` | Recursive file ingestion, format conversion, and tokenization |
| `tensor.data.HubSource` | HuggingFace Hub dataset source |
| `tensor.data.Mix` | Multi-source weighted token mixing |

Roadmap

  • v1.0 (Current): Tensor Series support and custom upstream via HuggingFace ID, local and Hub data sources
  • v1.1: Full Tensor model size support · automatic reasoning-retention mixing
  • v1.2: Direct export to tensor-inference optimised graph format

License

Apache 2.0 — free for commercial and private use.


Part of the Tensor Framework by Netangular.
