
tensor-pretrain

Continuous pretraining for Tensor Series models and custom upstream base models. Point it at your data, get a production model back.

tensor-pretrain is part of the Tensor Framework by Netangular. It takes either a Tensor Series base model or any compatible HuggingFace pretrained base, and lets you inject new knowledge into it — from a local directory of files — with automatic dataset conversion, tokenization, and training handled for you.

No preprocessing pipelines. No data wrangling. Just your files and a production-ready model on the other side.


The Tensor Framework

tensor-pretrain    Continuous pretraining on Tensor or custom base models   ← you are here
tensor-adapt       Teach behavior via low-rank adapters (instruct / chat)
tensor-datagen     Generate instruct fine-tune datasets from your files
tensor-inference   Run .safetensors output at high velocity (C++)

How It Works

tensor-pretrain supports two base model paths — the Tensor Series (managed, registry-backed) and Custom Upstream (any compatible HuggingFace pretrained base). Both go through the same full pipeline:

  1. Point it at a directory — raw files, code, documents, mixed formats. The library recursively walks the directory, converts everything to a training-ready format, and tokenizes it automatically against the model's tokenizer.
  2. Configure your run — set your token budget, devices, and output path. That's it.
  3. Get a production model — output is a clean .safetensors checkpoint, ready for tensor-adapt or tensor-inference.
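The recursive walk in step 1 can be illustrated in plain Python; `collect_files` below is a hypothetical stand-in for the library's ingester, not its actual implementation:

```python
from pathlib import Path

def collect_files(root: str, skip_hidden: bool = True) -> list[Path]:
    """Recursively gather every regular file under root, mirroring the
    kind of walk a local-directory ingester performs before conversion
    and tokenization."""
    files = []
    for path in sorted(Path(root).rglob("*")):
        rel = path.relative_to(root)
        # Skip dotfiles and dot-directories anywhere in the relative path.
        if skip_hidden and any(part.startswith(".") for part in rel.parts):
            continue
        if path.is_file():
            files.append(path)
    return files
```

Mixed formats are fine because everything is converted downstream; the walk itself only has to be exhaustive.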

Base Model Paths

Tensor Series

Tensor Series base models ship without instruct fine-tuning — they are pure completion engines, which makes them ideal starting points for domain-specific pretraining. There's no pre-baked behavior to fight against and no alignment tax on your data.

All Tensor Series models come with strong math and coding capabilities baked in at the base training stage, giving you a high-quality, managed foundation to build from. The registry handles architecture updates automatically when using :latest.

Tip: Using :latest ensures you always pull the highest-performing architecture for that size class. If you need reproducibility, pin a version like tensor-pro:1.0.

Custom Upstream

If you want to build on a base model outside the Tensor registry — such as Qwen3, Mistral 3, or any other compatible open-weight model — pass the HuggingFace ID directly to Pretrain. Tensor still handles the full pipeline: dataset conversion, tokenization, distributed training, and .safetensors export.

Note: Instruction-tuned variants (e.g. -it, -instruct, -Chat) are not supported and will raise a ModelVariantError at load time. Always use the base variant.
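The screen described in the note can be approximated with a simple suffix check. `looks_instruct_tuned` is a hypothetical illustration, not the library's actual validation logic:

```python
# Common naming markers for instruction-tuned variants on the Hub.
INSTRUCT_MARKERS = ("-it", "-instruct", "-chat")

def looks_instruct_tuned(model_id: str) -> bool:
    """Heuristic check for instruction-tuned variant names,
    e.g. 'google/gemma-2-9b-it' or 'Qwen/Qwen3-8B-Instruct'."""
    name = model_id.rsplit("/", 1)[-1].lower()
    return name.endswith(INSTRUCT_MARKERS)
```

A real loader would inspect the model config rather than the name alone, but the naming convention catches the common cases.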


Supported Tensor Series Models

Timing estimates assume NVIDIA A100 80GB GPUs with FlashAttention-3 enabled. Token counts below are recommended targets for effective domain transfer.

Large Domain

Compiler source trees, mathematics textbooks, or large scientific and systems programming corpora.

| Model | Tag | Params | Rec. Tokens | Cluster Size | Est. Wall Time |
| --- | --- | --- | --- | --- | --- |
| Tensor Nano | `tensor-nano:latest` | 0.6B | ~1B | 1× A100 | ~7 hrs |
| Tensor Micro | `tensor-micro:latest` | 1.5B | ~2B | 1× A100 | ~32 hrs |
| Tensor Mini | `tensor-mini:latest` | 3B | ~3B | 2× A100 | ~25 hrs |
| Tensor Pro | `tensor-pro:latest` | 7B | ~5B | 8× A100 | ~24 hrs |
| Tensor Matrix | `tensor-matrix:latest` | 14B | ~5B | 8× A100 | ~48 hrs |
| Tensor Ultra | `tensor-ultra:latest` | 32B+ | ~10B | 16× A100 | ~4–5 days |

Focused Domain

A specific language's stdlib, a niche compiler spec, or a proprietary API surface.

| Model | Tag | Params | Rec. Tokens | Cluster Size | Est. Wall Time |
| --- | --- | --- | --- | --- | --- |
| Tensor Nano | `tensor-nano:latest` | 0.6B | ~10M | 1× A100 | ~5 min |
| Tensor Micro | `tensor-micro:latest` | 1.5B | ~20M | 1× A100 | ~20 min |
| Tensor Mini | `tensor-mini:latest` | 3B | ~50M | 1× A100 | ~1 hr |
| Tensor Pro | `tensor-pro:latest` | 7B | ~100M | 2× A100 | ~2 hrs |
| Tensor Matrix | `tensor-matrix:latest` | 14B | ~200M | 4× A100 | ~3 hrs |
| Tensor Ultra | `tensor-ultra:latest` | 32B+ | ~500M | 8× A100 | ~7 hrs |
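The tables imply a rough per-GPU throughput you can use to sanity-check your own budget. A back-of-the-envelope calculation (illustrative only; real throughput varies with sequence length, interconnect, and batch size):

```python
def gpu_hours(tokens: float, tokens_per_gpu_hour: float) -> float:
    """Estimate total GPU-hours needed for a given token budget."""
    return tokens / tokens_per_gpu_hour

# Implied throughput from the Large Domain row for tensor-pro:
# ~5B tokens on 8x A100 in ~24 hrs -> roughly 26M tokens per GPU-hour.
implied = 5e9 / (8 * 24)
```

Scaling that figure to your own token count gives a first-order wall-time estimate before you ever call `estimate()`.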

Quick Start

Install

pip install tensor-pretrain

Tensor Series Path

from tensor.pretrain import Pretrain

run = Pretrain(
    base="tensor-pro:latest",
    data="./my-knowledge-base",
    output="./output/my-model",
)

run.train()

Custom Upstream Path

Pass any compatible HuggingFace base ID directly — the library resolves the path automatically.

from tensor.pretrain import Pretrain

run = Pretrain(
    base="Qwen/Qwen3-8B-Base",
    data="./my-knowledge-base",
    output="./output/my-qwen-base",
)

run.train()

Configuration

RunConfig is optional. When omitted, tensor-pretrain infers sensible defaults from the model size and available hardware.

from tensor.pretrain import Pretrain, RunConfig

run = Pretrain(
    base="tensor-pro:latest",
    data="./my-knowledge-base",
    output="./output/my-model",
    config=RunConfig(
        total_tokens=5_000_000_000,
        devices=8,
        dtype="bfloat16",
    ),
)

run.train()

Data Mixing

data= accepts a path string, a single source, or a weighted Mix. All three are valid:

data="./my-docs"                         # path string
data=LocalSource("./my-docs")            # explicit single source
data=Mix({ LocalSource(...): 0.7, ... }) # weighted mix

For multi-source runs:

from tensor.pretrain import Pretrain
from tensor.data import Mix, LocalSource, HubSource

run = Pretrain(
    base="tensor-matrix:latest",
    data=Mix({
        LocalSource("./my-docs"): 0.7,
        HubSource("HuggingFaceTB/smollm-corpus", subset="python-edu"): 0.3,
    }),
    output="./output/my-matrix-base",
)

run.train()
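Mix weights read naturally as fractions of the total token budget. A hypothetical sketch of that allocation (not the library's internals):

```python
def allocate_tokens(total: int, weights: dict[str, float]) -> dict[str, int]:
    """Split a token budget across sources proportionally to their weights.
    Weights are normalized, so they need not sum to exactly 1.0."""
    scale = sum(weights.values())
    return {name: round(total * w / scale) for name, w in weights.items()}

budget = allocate_tokens(1_000_000, {"./my-docs": 0.7, "smollm-corpus": 0.3})
```

With the 0.7/0.3 mix above, a 1M-token run draws roughly 700k tokens from the local docs and 300k from the Hub corpus.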

Validate and Estimate

Before committing to a full training run, use validate() and estimate() to catch problems early and review projected cost and time.

run = Pretrain(
    base="tensor-pro:latest",
    data="./my-knowledge-base",
    output="./output/my-model",
)

run.validate()   # checks data, model, and hardware — raises early if anything is wrong
run.estimate()   # prints token count, wall time estimate, and cluster recommendation
run.train()

run.estimate() output:

  base          tensor-pro:latest (7B)
  data          ./my-knowledge-base  →  4.2B tokens detected
  devices       8× A100 80GB
  est. time     ~20 hrs
  output        ./output/my-model

Resuming

from tensor.pretrain import Pretrain

run = Pretrain.resume("./output/my-model")
run.train()

Pretrain.resume() reads the run config and checkpoint state from the output directory. No need to reconstruct the original arguments.


Results

After run.train() completes, the result is available on the instance:

run.train()

print(run.result.checkpoint)       # path to .safetensors output
print(run.result.tokens_trained)   # actual tokens consumed
print(run.result.elapsed)          # wall time

The .safetensors checkpoint is your domain-trained base model, ready to take directly into tensor-adapt for instruct or chat fine-tuning.
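The .safetensors container itself is easy to inspect: per the safetensors format, the file starts with an 8-byte little-endian header length, followed by a JSON header mapping tensor names to dtype, shape, and byte offsets. A minimal header reader, requiring no ML framework:

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file: tensor names
    mapped to {"dtype", "shape", "data_offsets"}."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian header size.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))
```

This is handy for confirming tensor names and shapes in a checkpoint before handing it to a downstream tool.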


Building Production Agents

Tensor models are designed specifically for agentic workflows.

Most chat models carry an alignment tax: they are fine-tuned to be polite assistants, which makes them unreliable for strict agentic tasks. They often refuse to output raw JSON, add conversational filler that breaks parsers, or reject valid tool calls because of overly cautious safety tuning.

Pretrained base models are neutral. They have no personality, no refusal mechanisms, and no chat formatting.

  1. Pick your base — a Tensor Series model for a managed foundation, or a custom upstream for full control.
  2. Use tensor-pretrain to inject your domain knowledge.
  3. Use tensor-adapt to define behavior strictly on your terms — function calling, JSON output, or any task format you need.

Because the base is neutral, it learns exactly the behavior you define — nothing more, nothing less.


The "Distro" Philosophy

We treat the AI ecosystem exactly like the Linux ecosystem.

  • Upstream (The Kernel): The raw, complex, bleeding-edge engine (e.g., Llama, Mistral, Qwen, Linux Kernel).
  • Downstream (The Distro): The polished, stable, production-ready product (e.g., Tensor, Ubuntu, Android).

You choose Tensor for the same reason you choose Ubuntu: you want the power of the kernel without the headache of managing it. And just like Linux, you can also bring your own kernel if you know what you're doing.

| Ecosystem | The "Upstream" Kernel | The "Distro" Platform | What The User Actually Gets |
| --- | --- | --- | --- |
| Server OS | Debian / Linux Kernel | Ubuntu LTS | Security updates, apt-get, and a system that boots every time. |
| Mobile OS | Linux Kernel | Android | A seamless touch interface and apps. No need to know which kernel version is running. |
| Tensor AI | SOTA Foundation Weights | Your trained model | A model that knows your domain. Tensor handles tokenizer alignment, distributed training, and .safetensors export. |

Architecture

| Component | Role |
| --- | --- |
| `tensor.pretrain.Pretrain` | Main entry point — configuration, validation, estimation, and training |
| `tensor.pretrain.Pretrain.resume` | Reconstructs a run from an existing output directory checkpoint |
| `tensor.pretrain.RunConfig` | Optional training parameters — tokens, devices, dtype |
| `tensor.data.LocalSource` | Recursive file ingestion, format conversion, and tokenization |
| `tensor.data.HubSource` | HuggingFace Hub dataset source |
| `tensor.data.Mix` | Multi-source weighted token mixing |

Roadmap

  • v1.0 (Current): Tensor Series support and custom upstream via HuggingFace ID, local and Hub data sources
  • v1.1: Full Tensor model size support · automatic reasoning-retention mixing
  • v1.2: Direct export to tensor-inference optimised graph format

License

Apache 2.0 — free for commercial and private use.


Part of the Tensor Framework by Netangular.
