This repository is being converted from a GPT-2 summarization fine-tuning demo into a research-grade laboratory for deep-learning algorithms and small language models. The project now has two roles:
- Build and visualize core learning algorithms from first principles with PyTorch tensor primitives.
- Develop a from-scratch micro-GPT track for controlled next-token language-modeling research.
The original GPT-2 summarization workflow remains as a Hugging Face baseline, not the primary research target.
- Backpropagation: explicit two-layer MLP derivatives, autograd agreement checks, loss curves, and gradient norms.
- Optimizers: SGD, momentum, RMSProp, AdamW, Lion, Muon-style matrix updates, and update-geometry payloads.
- CNNs: manual convolution, pooling, normalization, kernels, activations, and feature maps.
- RNNs: vanilla RNN, GRU, LSTM cells, hidden-state trajectories, and gradient-flow probes.
- Alignment and RL: GridWorld, discounted returns, advantages, DPO loss, GRPO-style grouped advantages, and PPO-style clipped objectives.
- Parameter-efficient adaptation: LoRA linear adapters with frozen base weights and trainable low-rank residuals.
- Micro-GPT: decoder-only Transformer with RoPE, RMSNorm, SwiGLU, causal attention, tied embeddings, generation, dry-run training, and inspection hooks.
- Visualization: local Streamlit app with deterministic payloads generated by repository code.
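The quantities named in the alignment and RL item above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names (`discounted_returns`, `grouped_advantages`, `ppo_clipped_objective`, `dpo_loss`) and default hyperparameters are hypothetical, and the repository's own versions presumably operate on PyTorch tensors rather than Python floats.

```python
import math
from statistics import mean, pstdev


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def grouped_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style surrogate for one sample: min(r * A, clip(r) * A)."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); use the stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
    print(grouped_advantages([1.0, 2.0, 3.0]))
    print(ppo_clipped_objective(ratio=1.5, advantage=1.0))  # 1.2
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0))
```

The clipped objective is the quantity to maximize for a single sample; a training loop would average it over a batch and negate it to obtain a loss.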
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
./venv/bin/python -m unittest
The following dry-run command exercises the training loop on a tiny in-memory corpus. It does not write a checkpoint or launch a long training job.
./venv/bin/python -m src.micro_gpt.train \
--config configs/micro_gpt/tiny_debug.json \
--dry-run
The following command is a bounded local run that writes a checkpoint and JSON metrics. It does not use Hugging Face Jobs or any remote GPU.
./venv/bin/python -m src.micro_gpt.train \
--config configs/micro_gpt/tiny_debug.json \
--train \
--text "micro gpt local training corpus for terminal verification" \
--checkpoint-out /tmp/micro_gpt_local.pt \
--metrics-out /tmp/micro_gpt_metrics.json \
--run-name local-smoke
Generate from the local checkpoint:
./venv/bin/python -m src.micro_gpt.cli generate \
--config configs/micro_gpt/tiny_debug.json \
--checkpoint /tmp/micro_gpt_local.pt \
--prompt "micro" \
--max-new-tokens 8
Fetch a tiny text slice from the Hugging Face Dataset Viewer API:
./venv/bin/python scripts/fetch_hf_text_sample.py \
--dataset roneneldan/TinyStories \
--config default \
--split train \
--text-field text \
--rows 32 \
--output /tmp/tinystories_cpu_sample.txt
Train the repository-native micro-GPT on CPU:
./venv/bin/python -m src.micro_gpt.train \
--config configs/micro_gpt/cpu_m4_smoke.json \
--train \
--text-file /tmp/tinystories_cpu_sample.txt \
--checkpoint-out /tmp/micro_gpt_cpu_tinystories.pt \
--metrics-out /tmp/micro_gpt_cpu_tinystories_metrics.json \
--run-name macbook-m4-cpu-tinystories-smoke
The repository includes a public-source quant research corpus for terminal-only domain tuning. It covers factor premia, statistical arbitrage, order-flow imbalance, inventory-aware market making, LOB deep learning, RL execution, labeling, backtesting, and risk sizing. It is for research vocabulary and formula grounding only, not trading advice.
./venv/bin/python -m src.micro_gpt.train \
--config configs/micro_gpt/quant_cpu_smoke.json \
--train \
--text-file data/quant_research_corpus.md \
--checkpoint-out /tmp/micro_gpt_quant_research.pt \
--metrics-out /tmp/micro_gpt_quant_research_metrics.json \
--run-name quant-research-cpu-smoke
Generate from the tuned checkpoint:
./venv/bin/python -m src.micro_gpt.cli generate \
--config configs/micro_gpt/quant_cpu_smoke.json \
--checkpoint /tmp/micro_gpt_quant_research.pt \
--prompt "Order-flow imbalance" \
--max-new-tokens 96
The src.quantlab package provides deterministic public-market research primitives for BTCUSDT-style order-flow work. It includes market-event schemas, feature formulas, direction and triple-barrier labels, cost-aware baselines, walk-forward backtesting, and dataset manifests for future venue adapters.
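As an illustration of the triple-barrier labels mentioned above, the sketch below labels one entry point by whichever barrier a price path touches first: an upper return barrier, a lower return barrier, or a time limit. The function name `triple_barrier_label`, its parameters, and the use of fractional returns are assumptions made for this example; the actual src.quantlab.labels code may define barriers differently, for instance in terms of the spread.

```python
def triple_barrier_label(prices, start, horizon, upper, lower):
    """Label one event: +1 if the upper barrier is hit first, -1 if the
    lower barrier is hit first, 0 if neither is hit within the horizon.

    prices  : list of prices indexed by time step
    start   : index of the entry observation
    horizon : maximum number of steps to look ahead (the vertical barrier)
    upper   : fractional gain that triggers +1, e.g. 0.001
    lower   : fractional loss (given as a positive number) that triggers -1
    """
    entry = prices[start]
    end = min(start + horizon, len(prices) - 1)
    for t in range(start + 1, end + 1):
        ret = prices[t] / entry - 1.0
        if ret >= upper:
            return 1
        if ret <= -lower:
            return -1
    return 0  # time barrier reached without touching either price barrier


if __name__ == "__main__":
    path = [100.0, 100.05, 100.2, 99.5]
    print(triple_barrier_label(path, start=0, horizon=3, upper=0.001, lower=0.001))  # 1
```

Setting `upper` and `lower` at least as wide as the round-trip trading cost is one way to make such labels cost-aware, so that a move too small to cover costs is labeled 0 rather than as a direction.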
Build a dataset manifest from the public MVP config:
./venv/bin/python -m src.quantlab.datasets build \
--config configs/quantlab/btcusdt_public.json \
--output /tmp/quantlab_btcusdt_manifest.json
Convert local market events into feature and label rows:
./venv/bin/python -m src.quantlab.features build \
--input /tmp/btcusdt_events.jsonl \
--output /tmp/btcusdt_features.jsonl
./venv/bin/python -m src.quantlab.labels build \
--input /tmp/btcusdt_events.jsonl \
--horizons 1s,5s,30s \
--cost-threshold spread \
--output /tmp/btcusdt_labels.jsonl
Train the deterministic baseline and run a bounded backtest:
./venv/bin/python -m src.quantlab.baselines train \
--model ofi_logistic \
--features /tmp/btcusdt_features.jsonl \
--labels /tmp/btcusdt_labels.jsonl \
--output /tmp/btcusdt_baseline.json
./venv/bin/python -m src.quantlab.backtest run \
--predictions /tmp/btcusdt_predictions.jsonl \
--labels /tmp/btcusdt_labels.jsonl \
--output /tmp/btcusdt_backtest.json
Run the synthetic end-to-end quantlab demo. This generates BTCUSDT-like market events, builds aligned features and labels, trains the MLP, writes predictions, and runs the cost-aware backtest without launching long training:
./venv/bin/python -m src.quantlab.demo run \
--output-dir /tmp/quantlab_demo \
--rows 96 \
--max-epochs 50 \
--no-trade-threshold 0.05
Train the first CPU-safe supervised model, a compact MLP over stationary order-flow features:
./venv/bin/python -m src.quantlab.models train \
--features /tmp/btcusdt_features.jsonl \
--labels /tmp/btcusdt_labels.jsonl \
--model-out /tmp/quantlab_mlp_direction.pt \
--predictions-out /tmp/btcusdt_mlp_predictions.jsonl \
--metrics-out /tmp/quantlab_mlp_direction_metrics.json \
--hidden-dim 16 \
--max-epochs 50 \
--learning-rate 0.01 \
--no-trade-threshold 0.05
./venv/bin/python -m src.quantlab.backtest run \
--predictions /tmp/btcusdt_mlp_predictions.jsonl \
--labels /tmp/btcusdt_labels.jsonl \
--output /tmp/btcusdt_mlp_backtest.json \
--no-trade-threshold 0.05
The src.micro_gpt.train path also supports BPE-aware configs such as configs/micro_gpt/quant_bpe_6m.json and configs/micro_gpt/quant_bpe_15m.json. These configs keep the from-scratch micro-GPT identity while allowing domain-specific token packing for quant research text.
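For readers unfamiliar with byte-pair encoding, the sketch below shows the core merge loop on whitespace-split words: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. It is a generic textbook-style illustration; the helper names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are hypothetical, and the tokenizer behind the BPE-aware configs may work at the byte level or differ in other ways.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most common adjacent symbol pair, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_token`."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; return (merges, tokenized words)."""
    seqs = [list(word) for word in text.split()]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break  # every word is already a single symbol
        new_token = pair[0] + pair[1]
        seqs = [merge_pair(s, pair, new_token) for s in seqs]
        merges.append(pair)
    return merges, seqs


if __name__ == "__main__":
    merges, tokens = train_bpe("low low lower", num_merges=2)
    print(merges)
    print(tokens)
```

Training merges on domain text lets frequent domain terms collapse into fewer tokens, which is one plausible reading of domain-specific token packing here.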
Build a temporary corpus from public Hugging Face quantitative-finance reasoning datasets:
./venv/bin/python scripts/build_quant_hf_corpus.py \
--output /tmp/quant_hf_reasoning_corpus.md \
--metadata-out /tmp/quant_hf_reasoning_metadata.json
Train a bounded CPU checkpoint:
./venv/bin/python -m src.micro_gpt.train \
--config configs/micro_gpt/quant_hf_cpu.json \
--train \
--text-file /tmp/quant_hf_reasoning_corpus.md \
--checkpoint-out /tmp/micro_gpt_quant_hf.pt \
--metrics-out /tmp/micro_gpt_quant_hf_metrics.json \
--run-name quant-hf-reasoning-cpu
Sample conservatively from the tuned checkpoint:
./venv/bin/python -m src.micro_gpt.cli generate \
--config configs/micro_gpt/quant_hf_cpu.json \
--checkpoint /tmp/micro_gpt_quant_hf.pt \
--prompt "question: Derive the Black-Scholes equation" \
--max-new-tokens 160 \
--temperature 0.55 \
--top-k 12
Inspect the architecture and parameter count:
./venv/bin/python -m src.micro_gpt.cli inspect \
--config configs/micro_gpt/tiny_debug.json
Run a one-step terminal smoke pass with generation, without saving files:
./venv/bin/python -m src.micro_gpt.cli smoke \
--config configs/micro_gpt/tiny_debug.json \
--text "terminal micro gpt smoke corpus" \
--max-new-tokens 8
Save a tiny smoke checkpoint, then generate from it:
./venv/bin/python -m src.micro_gpt.cli smoke \
--config configs/micro_gpt/tiny_debug.json \
--text "checkpoint smoke corpus for terminal generation" \
--max-new-tokens 4 \
--save-checkpoint /tmp/micro_gpt_smoke.pt
./venv/bin/python -m src.micro_gpt.cli generate \
--config configs/micro_gpt/tiny_debug.json \
--checkpoint /tmp/micro_gpt_smoke.pt \
--prompt "checkpoint" \
--max-new-tokens 8
Generate from a randomly initialized model, for terminal plumbing checks only:
./venv/bin/python -m src.micro_gpt.cli generate \
--config configs/micro_gpt/tiny_debug.json \
--prompt "abc" \
--max-new-tokens 5 \
--random-init
./venv/bin/python -m streamlit run src/research_lab/app.py
The app visualizes live demo tensors for backpropagation, CNNs, RNNs, reinforcement learning, optimizers, and micro-GPT internals.
The previous summarization path is still available:
./venv/bin/python main.py \
--train-file data/train.jsonl \
--validation-file data/validation.jsonl \
--article-column text \
--summary-column summary \
--model-checkpoint gpt2 \
--train-size 100 \
--eval-size 10 \
--num-train-epochs 1
This path fine-tunes a GPT-2 style causal language model for summarization. It is retained for comparison and Hugging Face workflow continuity.
- Research program
- Implementation plan
- Literature review
- Experiment protocol
- Hugging Face CPU training runbook
- Quant micro-GPT research article
- Agent instructions
Full-scale model training, Hugging Face Jobs submission, or benchmark claims require an explicit experiment request. Bounded CPU smoke training is allowed for terminal verification when the dataset slice, config, checkpoint path, and metrics path are recorded. The project was built entirely on a MacBook Air without cloud compute.