ITheClixs/micro-gpt

Micro-GPT Research Lab

This repository is being converted from a GPT-2 summarization fine-tuning demo into a research-grade laboratory for deep-learning algorithms and small language models. The project now has two roles:

  1. Build and visualize core learning algorithms from first principles with PyTorch tensor primitives.
  2. Develop a from-scratch micro-GPT track for controlled next-token language-modeling research.

The original GPT-2 summarization workflow remains as a Hugging Face baseline, not the primary research target.

Research Surfaces

  • Backpropagation: explicit two-layer MLP derivatives, autograd agreement checks, loss curves, and gradient norms.
  • Optimizers: SGD, momentum, RMSProp, AdamW, Lion, Muon-style matrix updates, and update-geometry payloads.
  • CNNs: manual convolution, pooling, normalization, kernels, activations, and feature maps.
  • RNNs: vanilla RNN, GRU, LSTM cells, hidden-state trajectories, and gradient-flow probes.
  • Alignment and RL: GridWorld, discounted returns, advantages, DPO loss, GRPO-style grouped advantages, and PPO-style clipped objectives.
  • Parameter-efficient adaptation: LoRA linear adapters with frozen base weights and trainable low-rank residuals.
  • Micro-GPT: decoder-only Transformer with RoPE, RMSNorm, SwiGLU, causal attention, tied embeddings, generation, dry-run training, and inspection hooks.
  • Visualization: local Streamlit app with deterministic payloads generated by repository code.
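
As an illustration of the backpropagation surface listed above, here is a minimal sketch of hand-written two-layer MLP derivatives checked against a reference gradient. It is not the repository's code: it uses plain Python lists instead of PyTorch tensors, and central finite differences stand in for the autograd agreement check. The names `forward`, `backward`, and `numeric_grad_W1` are hypothetical.

```python
import copy
import math
import random

def forward(params, x, y):
    """h = tanh(W1 x + b1), y_hat = w2 . h + b2, loss = 0.5 * (y_hat - y)^2."""
    W1, b1, w2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y_hat = sum(w * hi for w, hi in zip(w2, h)) + b2
    return 0.5 * (y_hat - y) ** 2, h, y_hat

def backward(params, x, y):
    """Chain rule written out by hand; returns gradients shaped like params."""
    W1, b1, w2, b2 = params
    _, h, y_hat = forward(params, x, y)
    d_out = y_hat - y                                   # dL/dy_hat
    g_w2 = [d_out * hi for hi in h]                     # dL/dw2
    # dL/d(pre-activation): tanh'(z) = 1 - tanh(z)^2
    d_pre = [d_out * w * (1.0 - hi * hi) for w, hi in zip(w2, h)]
    g_W1 = [[dp * xi for xi in x] for dp in d_pre]      # dL/dW1
    return [g_W1, list(d_pre), g_w2, d_out]

def numeric_grad_W1(params, x, y, i, j, eps=1e-6):
    """Central finite difference for a single entry of W1."""
    plus, minus = copy.deepcopy(params), copy.deepcopy(params)
    plus[0][i][j] += eps
    minus[0][i][j] -= eps
    return (forward(plus, x, y)[0] - forward(minus, x, y)[0]) / (2 * eps)

rng = random.Random(0)
hidden, inputs = 3, 2
params = [
    [[rng.uniform(-1, 1) for _ in range(inputs)] for _ in range(hidden)],  # W1
    [rng.uniform(-1, 1) for _ in range(hidden)],                           # b1
    [rng.uniform(-1, 1) for _ in range(hidden)],                           # w2
    rng.uniform(-1, 1),                                                    # b2
]
x, y = [0.5, -1.2], 0.7
g_W1 = backward(params, x, y)[0]
max_err = max(abs(g_W1[i][j] - numeric_grad_W1(params, x, y, i, j))
              for i in range(hidden) for j in range(inputs))
print(f"max |analytic - numeric| over W1: {max_err:.2e}")
```

With PyTorch available, the same comparison could be made against `loss.backward()` instead of finite differences.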

Install

python3 -m venv venv
./venv/bin/pip install -r requirements.txt

Run Tests

./venv/bin/python -m unittest

Micro-GPT Dry Run

This command exercises the training loop on a tiny in-memory corpus. It does not checkpoint or launch a long training job.

./venv/bin/python -m src.micro_gpt.train \
  --config configs/micro_gpt/tiny_debug.json \
  --dry-run
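
To make the next-token objective concrete, the sketch below shows one way a tiny in-memory corpus could be turned into (input, target) windows. The character-level vocabulary, the `block_size` parameter, and the function names are assumptions made for illustration; the tokenizer and batching in `src.micro_gpt.train` may differ.

```python
def build_vocab(text):
    """Map each distinct character to an integer id (character-level assumption)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}

def next_token_pairs(ids, block_size):
    """Slide a window over the ids; the target is the input shifted by one token."""
    pairs = []
    for start in range(len(ids) - block_size):
        x = ids[start:start + block_size]
        y = ids[start + 1:start + block_size + 1]
        pairs.append((x, y))
    return pairs

text = "micro gpt"
stoi = build_vocab(text)
ids = [stoi[ch] for ch in text]
pairs = next_token_pairs(ids, block_size=4)

for x, y in pairs[:2]:
    print(x, "->", y)
```

Each target sequence is the input advanced by one position, so every position in a window contributes a next-token prediction to the loss.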

Local Micro-GPT Training

This is a bounded local run that writes a checkpoint and JSON metrics. It does not use Hugging Face Jobs or any remote GPU.

./venv/bin/python -m src.micro_gpt.train \
  --config configs/micro_gpt/tiny_debug.json \
  --train \
  --text "micro gpt local training corpus for terminal verification" \
  --checkpoint-out /tmp/micro_gpt_local.pt \
  --metrics-out /tmp/micro_gpt_metrics.json \
  --run-name local-smoke

Generate from the local checkpoint:

./venv/bin/python -m src.micro_gpt.cli generate \
  --config configs/micro_gpt/tiny_debug.json \
  --checkpoint /tmp/micro_gpt_local.pt \
  --prompt "micro" \
  --max-new-tokens 8

Hugging Face CPU Dataset Smoke Run

Fetch a tiny text slice from the Hugging Face Dataset Viewer API:

./venv/bin/python scripts/fetch_hf_text_sample.py \
  --dataset roneneldan/TinyStories \
  --config default \
  --split train \
  --text-field text \
  --rows 32 \
  --output /tmp/tinystories_cpu_sample.txt
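
For orientation, the following sketch shows how a small slice could be requested from the Dataset Viewer `/rows` endpoint using only the standard library. It is not the contents of `scripts/fetch_hf_text_sample.py`; the helper names are hypothetical, and the response shape (a `rows` list whose items carry a `row` mapping) is the one the public API documents at the time of writing.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ROWS_ENDPOINT = "https://datasets-server.huggingface.co/rows"

def build_rows_url(dataset, config, split, offset, length):
    """Build the query URL for a contiguous slice of rows."""
    query = urlencode({
        "dataset": dataset,
        "config": config,
        "split": split,
        "offset": offset,
        "length": length,
    })
    return f"{ROWS_ENDPOINT}?{query}"

def extract_text(payload, text_field):
    """Pull one text field out of each returned row."""
    return [item["row"][text_field] for item in payload.get("rows", [])]

def fetch_text(dataset, config, split, text_field, rows):
    """Download the slice and return the list of text values (needs network)."""
    url = build_rows_url(dataset, config, split, offset=0, length=rows)
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return extract_text(payload, text_field)

# Example call (not executed here because it needs network access):
# stories = fetch_text("roneneldan/TinyStories", "default", "train", "text", 32)
print(build_rows_url("roneneldan/TinyStories", "default", "train", 0, 32))
```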

Train the repository-native micro-GPT on CPU:

./venv/bin/python -m src.micro_gpt.train \
  --config configs/micro_gpt/cpu_m4_smoke.json \
  --train \
  --text-file /tmp/tinystories_cpu_sample.txt \
  --checkpoint-out /tmp/micro_gpt_cpu_tinystories.pt \
  --metrics-out /tmp/micro_gpt_cpu_tinystories_metrics.json \
  --run-name macbook-m4-cpu-tinystories-smoke

Quant Research Micro-GPT Tuning

The repository includes a public-source quant research corpus for terminal-only domain tuning. It covers factor premia, statistical arbitrage, order-flow imbalance, inventory-aware market making, LOB deep learning, RL execution, labeling, backtesting, and risk sizing. It is for research vocabulary and formula grounding only, not trading advice.

./venv/bin/python -m src.micro_gpt.train \
  --config configs/micro_gpt/quant_cpu_smoke.json \
  --train \
  --text-file data/quant_research_corpus.md \
  --checkpoint-out /tmp/micro_gpt_quant_research.pt \
  --metrics-out /tmp/micro_gpt_quant_research_metrics.json \
  --run-name quant-research-cpu-smoke

Generate from the tuned checkpoint:

./venv/bin/python -m src.micro_gpt.cli generate \
  --config configs/micro_gpt/quant_cpu_smoke.json \
  --checkpoint /tmp/micro_gpt_quant_research.pt \
  --prompt "Order-flow imbalance" \
  --max-new-tokens 96

Quantlab Microstructure Spine

The src.quantlab package provides deterministic public-market research primitives for BTCUSDT-style order-flow work. It includes market-event schemas, feature formulas, direction and triple-barrier labels, cost-aware baselines, walk-forward backtesting, and dataset manifests for future venue adapters.
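
As an example of the kind of feature formula involved, here is a sketch of the widely used best-level order-flow imbalance (OFI) increment. Whether `src.quantlab` uses exactly this definition is an assumption; the tuple layout and the function name `ofi_increment` are hypothetical.

```python
def ofi_increment(prev, curr):
    """Best-level OFI between two consecutive quotes.

    Each quote is a (bid_price, bid_size, ask_price, ask_size) tuple.
    Bid-side growth counts as buying pressure, ask-side growth as selling pressure.
    """
    pb0, qb0, pa0, qa0 = prev
    pb1, qb1, pa1, qa1 = curr
    bid_flow = (qb1 if pb1 >= pb0 else 0.0) - (qb0 if pb1 <= pb0 else 0.0)
    ask_flow = (qa1 if pa1 <= pa0 else 0.0) - (qa0 if pa1 >= pa0 else 0.0)
    return bid_flow - ask_flow

quotes = [
    (100.0, 5.0, 100.5, 4.0),
    (100.0, 7.0, 100.5, 4.0),  # bid size grows at the same price
    (100.1, 3.0, 100.5, 2.0),  # bid price improves, ask size shrinks
]
increments = [ofi_increment(a, b) for a, b in zip(quotes, quotes[1:])]
print(increments, sum(increments))
```

Summing the increments over a window gives the OFI for that window; positive values indicate net buying pressure at the touch.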

Build a dataset manifest from the public MVP config:

./venv/bin/python -m src.quantlab.datasets build \
  --config configs/quantlab/btcusdt_public.json \
  --output /tmp/quantlab_btcusdt_manifest.json

Convert local market events into feature and label rows:

./venv/bin/python -m src.quantlab.features build \
  --input /tmp/btcusdt_events.jsonl \
  --output /tmp/btcusdt_features.jsonl

./venv/bin/python -m src.quantlab.labels build \
  --input /tmp/btcusdt_events.jsonl \
  --horizons 1s,5s,30s \
  --cost-threshold spread \
  --output /tmp/btcusdt_labels.jsonl
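
The triple-barrier idea behind the labels can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the horizon is counted in rows rather than seconds, the profit and loss barriers are symmetric, and `barrier` is a fractional return standing in for a cost threshold such as the spread.

```python
def triple_barrier_label(prices, start, horizon, barrier):
    """Return +1 if the upper barrier is hit first, -1 if the lower barrier
    is hit first, and 0 if the time barrier expires with neither touched."""
    entry = prices[start]
    end = min(start + horizon, len(prices) - 1)
    for t in range(start + 1, end + 1):
        ret = prices[t] / entry - 1.0
        if ret >= barrier:
            return 1
        if ret <= -barrier:
            return -1
    return 0

prices = [100.0, 100.02, 100.2, 99.5, 100.0]
labels = [
    triple_barrier_label(prices, start=0, horizon=3, barrier=0.001),
    triple_barrier_label(prices, start=2, horizon=2, barrier=0.001),
    triple_barrier_label(prices, start=0, horizon=1, barrier=0.001),
]
print(labels)
```

Setting the barrier at or above the expected round-trip cost keeps moves that would not cover costs in the neutral class.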

Train the deterministic baseline and run a bounded backtest:

./venv/bin/python -m src.quantlab.baselines train \
  --model ofi_logistic \
  --features /tmp/btcusdt_features.jsonl \
  --labels /tmp/btcusdt_labels.jsonl \
  --output /tmp/btcusdt_baseline.json

./venv/bin/python -m src.quantlab.backtest run \
  --predictions /tmp/btcusdt_predictions.jsonl \
  --labels /tmp/btcusdt_labels.jsonl \
  --output /tmp/btcusdt_backtest.json

Run the synthetic end-to-end quantlab demo. This generates BTCUSDT-like market events, builds aligned features and labels, trains the MLP, writes predictions, and runs the cost-aware backtest without launching long training:

./venv/bin/python -m src.quantlab.demo run \
  --output-dir /tmp/quantlab_demo \
  --rows 96 \
  --max-epochs 50 \
  --no-trade-threshold 0.05

Train the first CPU-safe supervised model, a compact MLP over stationary order-flow features:

./venv/bin/python -m src.quantlab.models train \
  --features /tmp/btcusdt_features.jsonl \
  --labels /tmp/btcusdt_labels.jsonl \
  --model-out /tmp/quantlab_mlp_direction.pt \
  --predictions-out /tmp/btcusdt_mlp_predictions.jsonl \
  --metrics-out /tmp/quantlab_mlp_direction_metrics.json \
  --hidden-dim 16 \
  --max-epochs 50 \
  --learning-rate 0.01 \
  --no-trade-threshold 0.05

./venv/bin/python -m src.quantlab.backtest run \
  --predictions /tmp/btcusdt_mlp_predictions.jsonl \
  --labels /tmp/btcusdt_labels.jsonl \
  --output /tmp/btcusdt_mlp_backtest.json \
  --no-trade-threshold 0.05
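
One plausible reading of a no-trade threshold in a cost-aware backtest is sketched below. The semantics are assumptions, not taken from `src.quantlab.backtest`: predictions are treated as probabilities of an upward move, the threshold is applied to the distance from 0.5, and a flat `cost_per_trade` is charged on every position taken.

```python
def position_from_probability(p_up, no_trade_threshold):
    """Go long or short only when the predicted edge clears the threshold."""
    edge = p_up - 0.5
    if abs(edge) <= no_trade_threshold:
        return 0
    return 1 if edge > 0 else -1

def backtest(probs, returns, no_trade_threshold, cost_per_trade):
    """Accumulate per-row PnL net of costs; returns (pnl, number_of_trades)."""
    pnl, trades = 0.0, 0
    for p_up, realized in zip(probs, returns):
        position = position_from_probability(p_up, no_trade_threshold)
        if position != 0:
            pnl += position * realized - cost_per_trade
            trades += 1
    return pnl, trades

probs = [0.52, 0.60, 0.40, 0.70]
returns = [0.001, 0.002, -0.001, -0.003]
pnl, trades = backtest(probs, returns, no_trade_threshold=0.05, cost_per_trade=0.0005)
print(f"pnl={pnl:.4f} trades={trades}")
```

Raising the threshold trades less often, which reduces cost drag at the price of skipping weaker signals.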

The src.micro_gpt.train path also supports BPE-aware configs such as configs/micro_gpt/quant_bpe_6m.json and configs/micro_gpt/quant_bpe_15m.json. These configs keep the from-scratch micro-GPT identity while allowing domain-specific token packing for quant research text.
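
For readers unfamiliar with byte-pair encoding, the sketch below shows the core merge loop on a toy string. It illustrates the general technique only; the tokenizer behind the BPE-aware configs may operate on bytes, use pre-tokenization, or store merges differently. All function names here are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Start from characters and greedily merge the most frequent pair."""
    tokens, merges = list(text), []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges

tokens, merges = train_bpe("aaabdaaabac", num_merges=1)
print(tokens, merges)
```

Each merge shortens the token sequence, which is what lets a fixed context window cover more domain text.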

Hugging Face Quant Finance Reasoning Tuning

Build a temporary corpus from public Hugging Face quantitative-finance reasoning datasets:

./venv/bin/python scripts/build_quant_hf_corpus.py \
  --output /tmp/quant_hf_reasoning_corpus.md \
  --metadata-out /tmp/quant_hf_reasoning_metadata.json

Train a bounded CPU checkpoint:

./venv/bin/python -m src.micro_gpt.train \
  --config configs/micro_gpt/quant_hf_cpu.json \
  --train \
  --text-file /tmp/quant_hf_reasoning_corpus.md \
  --checkpoint-out /tmp/micro_gpt_quant_hf.pt \
  --metrics-out /tmp/micro_gpt_quant_hf_metrics.json \
  --run-name quant-hf-reasoning-cpu

Sample conservatively from the tuned checkpoint:

./venv/bin/python -m src.micro_gpt.cli generate \
  --config configs/micro_gpt/quant_hf_cpu.json \
  --checkpoint /tmp/micro_gpt_quant_hf.pt \
  --prompt "question: Derive the Black-Scholes equation" \
  --max-new-tokens 160 \
  --temperature 0.55 \
  --top-k 12
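
The effect of the `--temperature` and `--top-k` flags can be illustrated with a small standalone sampler. This is a sketch of the standard technique in plain Python, not the repository's generation code; `sample_top_k` is a hypothetical name.

```python
import math
import random

def sample_top_k(logits, temperature, top_k, rng):
    """Sample a token id after temperature scaling and top-k truncation."""
    scaled = [value / temperature for value in logits]
    # Keep only the indices of the k largest scaled logits.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors, shifted by the max for numerical stability.
    peak = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.1, -1.0]
samples = [sample_top_k(logits, temperature=0.55, top_k=2, rng=rng) for _ in range(200)]
print({token: samples.count(token) for token in sorted(set(samples))})
```

A temperature below 1 sharpens the distribution toward the highest-scoring tokens, and a small top-k removes the long tail entirely, which is why this combination samples conservatively.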

Terminal Micro-GPT CLI

Inspect the architecture and parameter count:

./venv/bin/python -m src.micro_gpt.cli inspect \
  --config configs/micro_gpt/tiny_debug.json
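
A rough parameter count for a decoder-only Transformer with the components named earlier (RoPE, RMSNorm, SwiGLU, tied embeddings) can be estimated by hand. The sketch below assumes bias-free linear layers, full multi-head attention with four `d_model x d_model` projections, two RMSNorm layers per block plus a final one, and no learned RoPE parameters. The argument names are hypothetical and need not match the keys in the repository's config files.

```python
def estimate_parameters(vocab_size, d_model, n_layers, d_ff, tied_embeddings=True):
    """Estimate the trainable parameter count under the stated assumptions."""
    embedding = vocab_size * d_model          # token embedding table
    attention = 4 * d_model * d_model         # q, k, v, and output projections
    swiglu = 3 * d_model * d_ff               # gate, up, and down projections
    norms = 2 * d_model                       # two RMSNorm gains per block
    per_layer = attention + swiglu + norms
    total = embedding + n_layers * per_layer + d_model  # plus final RMSNorm
    if not tied_embeddings:
        total += vocab_size * d_model         # separate output head
    return total

tied = estimate_parameters(vocab_size=256, d_model=64, n_layers=2, d_ff=256)
untied = estimate_parameters(256, 64, 2, 256, tied_embeddings=False)
print(tied, untied)
```

Comparing such an estimate with the count reported by the `inspect` command is a quick way to confirm which layers carry biases or share weights.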

Run a one-step terminal smoke pass with generation, without saving files:

./venv/bin/python -m src.micro_gpt.cli smoke \
  --config configs/micro_gpt/tiny_debug.json \
  --text "terminal micro gpt smoke corpus" \
  --max-new-tokens 8

Save a tiny smoke checkpoint, then generate from it:

./venv/bin/python -m src.micro_gpt.cli smoke \
  --config configs/micro_gpt/tiny_debug.json \
  --text "checkpoint smoke corpus for terminal generation" \
  --max-new-tokens 4 \
  --save-checkpoint /tmp/micro_gpt_smoke.pt

./venv/bin/python -m src.micro_gpt.cli generate \
  --config configs/micro_gpt/tiny_debug.json \
  --checkpoint /tmp/micro_gpt_smoke.pt \
  --prompt "checkpoint" \
  --max-new-tokens 8

Generate from a randomly initialized model, for terminal plumbing checks only:

./venv/bin/python -m src.micro_gpt.cli generate \
  --config configs/micro_gpt/tiny_debug.json \
  --prompt "abc" \
  --max-new-tokens 5 \
  --random-init

Research Lab App

./venv/bin/python -m streamlit run src/research_lab/app.py

The app visualizes live demo tensors for backpropagation, CNNs, RNNs, reinforcement learning, optimizers, and micro-GPT internals.

Legacy Hugging Face Baseline

The previous summarization path is still available:

./venv/bin/python main.py \
  --train-file data/train.jsonl \
  --validation-file data/validation.jsonl \
  --article-column text \
  --summary-column summary \
  --model-checkpoint gpt2 \
  --train-size 100 \
  --eval-size 10 \
  --num-train-epochs 1

This path fine-tunes a GPT-2 style causal language model for summarization. It is retained for comparison and Hugging Face workflow continuity.

Documentation

Current Research Rule

Full-scale model training, Hugging Face Jobs submissions, and benchmark claims each require an explicit experiment request. Bounded CPU smoke training is allowed for terminal verification when the dataset slice, config, checkpoint path, and metrics path are recorded. The project was built entirely on a MacBook Air without cloud compute.
