Quantize inactive expert weights to int8 for VRAM savings
experts_to_train
string
"all"
"all" or comma-separated expert indices (e.g., "0,1,2")
model.multimodal (Optional — VLM models)
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable vision-language model fine-tuning |
| image_column | string | "image" | Column name for image paths/URLs in dataset |
| text_column | string | "text" | Column name for text/captions |
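
A minimal sketch of enabling this block, assuming it nests under model: as the dotted heading suggests. The column names shown are the documented defaults, so they only need to be set when the dataset uses different ones.

```yaml
model:
  multimodal:
    enabled: true
    image_column: "image"   # documented default
    text_column: "text"     # documented default
```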
lora
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| r | int | 8 | LoRA rank. Higher rank means more trainable parameters |
| alpha | int | 16 | LoRA scaling factor |
| dropout | float | 0.1 | Dropout probability |
| bias | string | "none" | "none", "all", or "lora_only" |
| method | string | "lora" | PEFT method: "lora", "dora", "pissa", "rslora" |
| use_dora | bool | false | Enable DoRA (Weight-Decomposed LoRA) |
| use_rslora | bool | false | Rank-stabilized LoRA (recommended for r>64) |
| target_modules | list | ["q_proj", "v_proj"] | Model modules to apply LoRA to |
| task_type | string | "CAUSAL_LM" | Task type for PEFT |
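
As an illustration, a lora block built only from the fields above might look like the following. The rank and alpha values are illustrative, not a recommendation from this reference.

```yaml
lora:
  r: 16                 # illustrative; default is 8
  alpha: 32             # illustrative; default is 16
  dropout: 0.1
  bias: "none"
  target_modules: ["q_proj", "v_proj"]
  task_type: "CAUSAL_LM"
```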
training
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| output_dir | string | "./checkpoints" | Checkpoint save directory |
| final_model_dir | string | "final_model" | Subdirectory for final artifacts |
| merge_adapters | bool | false | Merge adapters into base model before saving |
| trainer_type | string | "sft" | "sft", "dpo", "simpo", "kto", "orpo", "grpo" |
| num_train_epochs | int | 3 | Number of training epochs |
| per_device_train_batch_size | int | 4 | Batch size per GPU |
| gradient_accumulation_steps | int | 2 | Number of micro-batches whose gradients are accumulated before each optimizer step |
| learning_rate | float | 2e-5 | Learning rate (use a lower value for alignment, e.g. 5e-6) |
| warmup_ratio | float | 0.1 | Warmup proportion |
| weight_decay | float | 0.01 | AdamW weight decay |
| eval_steps | int | 200 | Evaluate every N steps |
| save_steps | int | 200 | Save checkpoint every N steps |
| save_total_limit | int | 3 | Max checkpoints to keep |
| packing | bool | false | Sequence packing (SFT only) |
| report_to | string | "tensorboard" | "tensorboard", "wandb", "mlflow", "none" |
| run_name | string | null | W&B/MLflow run name (auto-generated if null) |
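
A sketch of a training block for an alignment run, using only the fields listed above. The choice of DPO and the reporting backend are illustrative; the learning rate follows the lower value this reference suggests for alignment.

```yaml
training:
  trainer_type: "dpo"
  learning_rate: 5.0e-6              # lower rate suggested for alignment
  num_train_epochs: 3
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 2     # effective batch per GPU: 4 x 2 = 8
  report_to: "wandb"
  run_name: null                     # auto-generated when null
```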
OOM Recovery
Automatically halves per_device_train_batch_size and doubles gradient_accumulation_steps
on CUDA out-of-memory errors, preserving the effective batch size. Retries until the minimum
batch size is reached.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| oom_recovery | bool | false | Retry training with smaller batch size on CUDA OOM |
| oom_recovery_min_batch_size | int | 1 | Stop retrying when batch size reaches this value |
Example:
training:
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 2
  oom_recovery: true
  oom_recovery_min_batch_size: 1  # try down to batch_size=1 before failing
Effective batch size (per_device_train_batch_size × gradient_accumulation_steps) is preserved
across retries. Each retry attempt is logged to the audit trail.
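
To make the arithmetic concrete, the sketch below traces the halving and doubling rule for a run with a minimum batch size of 2. The retry sequence in the comments is a reading of the rule described above, not captured log output.

```yaml
training:
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 2     # effective batch size: 8 x 2 = 16
  oom_recovery: true
  oom_recovery_min_batch_size: 2
# Expected sequence if every attempt hits CUDA OOM:
#   attempt 1: batch 8, accumulation 2   (8 x 2 = 16)
#   retry 1:   batch 4, accumulation 4   (4 x 4 = 16)
#   retry 2:   batch 2, accumulation 8   (2 x 8 = 16)
#   batch size has reached oom_recovery_min_batch_size, so there is no further retry
```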
Batched generation size for safety evaluation. 1 disables batching; raise for throughput on large VRAM, lower to reduce OOM risk on small VRAM.
evaluation.llm_judge (Optional)
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable LLM-as-Judge scoring |
| judge_model | string | "gpt-4o" | Judge model (API or local path) |
| judge_api_key_env | string | null | Env var name for API key (null = local) |
| judge_api_base | string | null | Override the judge API base URL (Azure OpenAI, self-hosted vLLM, OpenAI-compatible gateway, e.g. https://api.together.xyz/v1). When unset, the SDK default endpoint is used. |
| eval_dataset | string | "eval_prompts.jsonl" | Evaluation prompts file |
| min_score | float | 5.0 | Minimum average score (1-10) |
| batch_size | int | 8 | Number of (prompt, completion) pairs scored per LLM-judge round. 1 disables batching. |
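
A sketch of an API-backed judge, assuming the block nests under evaluation: as the dotted heading suggests. The environment variable name and the raised min_score are illustrative choices, not values taken from ForgeLM.

```yaml
evaluation:
  llm_judge:
    enabled: true
    judge_model: "gpt-4o"
    judge_api_key_env: "OPENAI_API_KEY"   # assumed env var name; holds the API key
    judge_api_base: null                  # null = SDK default endpoint
    eval_dataset: "eval_prompts.jsonl"
    min_score: 7.0                        # illustrative; default is 5.0
    batch_size: 8
```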
Deprecated: evaluation.staging_ttl_days is superseded by retention.staging_ttl_days. The legacy key is alias-forwarded with a DeprecationWarning during the v0.5.5 → v0.6.x window and removed in v0.7.0. See release.md.
Defines maximum retention horizons for compliance, training, and evaluation
artefacts. Horizons honour GDPR Article 5(1)(e) "storage limitation" and
Article 17 "right to erasure" deadlines. The enforce knob switches between
log-only, warning, and hard-block modes so a regulated CI gate cannot
silently extend the retention horizon by re-using a stale workspace.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| audit_log_retention_days | int | 1825 (~5 years) | Days to retain audit_log.jsonl before flagging it as overdue under Article 5(1)(e). Set to 0 to retain indefinitely (Article 17(3)(b) defence). |
| staging_ttl_days | int | 7 | Days to retain final_model.staging.<run_id>/ after a forgelm reject decision before scheduled cleanup. Set to 0 to retain indefinitely. Replaces the deprecated evaluation.staging_ttl_days; both keys accepted with identical values during the v0.5.5 → v0.6.x deprecation window. |
| ephemeral_artefact_retention_days | int | 90 | Days to retain compliance bundles, data audit reports, and other run-scoped derived artefacts. Set to 0 to retain indefinitely. |
| raw_documents_retention_days | int | 90 | Days to retain ingested raw documents (PDF / DOCX / EPUB / TXT / Markdown) under the operator's ingestion-output directory. Set to 0 to retain indefinitely. |
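
A sketch of the four horizons together, assuming the top-level key is retention: as the retention.staging_ttl_days path implies. The enforce knob is left out because its accepted values are not listed in this section.

```yaml
retention:
  audit_log_retention_days: 1825        # documented default (~5 years)
  staging_ttl_days: 7
  ephemeral_artefact_retention_days: 90
  raw_documents_retention_days: 0       # 0 = retain indefinitely
```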
Deprecation: evaluation.staging_ttl_days is deprecated as of v0.5.5 in favour of retention.staging_ttl_days. The legacy key is alias-forwarded with a DeprecationWarning until v0.7.0. See release.md for the full deprecation cadence policy.
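
Migrating is a matter of moving the key from one section to the other. The sketch below shows the legacy form commented out next to its replacement; the value 7 is the documented default.

```yaml
# Before (deprecated as of v0.5.5, removed in v0.7.0):
# evaluation:
#   staging_ttl_days: 7

# After:
retention:
  staging_ttl_days: 7
```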
webhook (Optional)
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | null | Webhook destination URL |
| url_env | string | null | Env var name containing URL |
| notify_on_start | bool | true | Notify on training start |
| notify_on_success | bool | true | Notify on success |
| notify_on_failure | bool | true | Notify on failure |
| timeout | int | 10 | HTTP request timeout (seconds). Clamped to ≥ 1s by the notifier. Default raised to 10s in v0.5.5 (was 5s): Slack/Teams gateway latency spikes regularly cross 5s in production, and a webhook timeout silently degrades the audit chain (webhook failure is best-effort). |
| allow_private_destinations | bool | false | Opt in to webhooks pointing at RFC1918 / loopback / link-local hosts (in-cluster Slack proxy, on-prem Teams gateway). Defaults to public-internet destinations only, as an SSRF guard. |
| tls_ca_bundle | string | null | Path to a custom CA bundle forwarded to requests as verify= (e.g. corporate MITM CA). When unset, certifi's bundled store is used. |
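
A sketch of a webhook block that reads the destination from the environment rather than storing it in the config file. The environment variable name is hypothetical; the remaining values are the documented defaults except notify_on_start.

```yaml
webhook:
  url_env: "FORGELM_WEBHOOK_URL"     # hypothetical env var name holding the URL
  notify_on_start: false             # illustrative; default is true
  notify_on_success: true
  notify_on_failure: true
  timeout: 10                        # seconds
  allow_private_destinations: false  # keep the SSRF guard on
  tls_ca_bundle: null                # null = certifi's bundled store
```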
distributed (Optional)
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| strategy | string | null | "deepspeed" or "fsdp" (null = single GPU) |
| deepspeed_config | string | null | Preset ("zero2", "zero3", "zero3_offload") or JSON path |
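
For example, a DeepSpeed run using one of the listed presets could be sketched as:

```yaml
distributed:
  strategy: "deepspeed"
  deepspeed_config: "zero2"   # preset name; a path to a JSON file is also accepted
```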
compliance (Optional — EU AI Act Art. 11 + Annex IV)
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| provider_name | string | "" | Organization name |
| provider_contact | string | "" | Contact email |
| system_name | string | "" | AI system name |
| intended_purpose | string | "" | What the model is for |
| known_limitations | string | "" | What it should not be used for |
| system_version | string | "" | Version identifier |
| risk_classification | string | "minimal-risk" | One of the 5 EU AI Act RiskTier values: "unknown" (pre-classification placeholder), "minimal-risk", "limited-risk", "high-risk" (Article 6; full Annex IV documentation), "unacceptable" (Article 5 prohibited practice; emits a startup banner). |
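
A sketch of a filled-in compliance block. Every value is a placeholder for illustration; only the field names and the risk tier strings come from the table above.

```yaml
compliance:
  provider_name: "Example Org"               # placeholder
  provider_contact: "ml-team@example.com"    # placeholder
  system_name: "support-assistant"           # placeholder
  intended_purpose: "Drafting replies to customer support tickets"
  known_limitations: "Not for legal, medical, or financial advice"
  system_version: "1.0.0"
  risk_classification: "limited-risk"
```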
risk_assessment (Optional — EU AI Act Art. 9)
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| intended_use | string | "" | Intended use description |
| foreseeable_misuse | list | [] | List of misuse scenarios |
| risk_category | string | "minimal-risk" | Same 5 RiskTier values as compliance.risk_classification: "unknown", "minimal-risk", "limited-risk", "high-risk", "unacceptable". Drives auto-revert thresholds and Annex IV gating. |
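
A sketch of a risk_assessment block with placeholder text. It uses the same tier as the compliance example would for consistency; whether ForgeLM requires the two to match is not stated in this section.

```yaml
risk_assessment:
  intended_use: "Drafting replies to customer support tickets"   # placeholder
  foreseeable_misuse:
    - "Generating legal or medical advice"                       # placeholder
    - "Impersonating a human agent without disclosure"           # placeholder
  risk_category: "limited-risk"
```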
pipeline (Optional — Multi-Stage Training Chains, Phase 14)
Chains 2+ training stages (typically SFT → DPO → GRPO) into one config-driven run with auto-chaining, per-stage gates, crash-safe resume, and a chain-level Annex IV manifest. When omitted, ForgeLM behaves byte-identically to a v0.6.0 single-stage run; the orchestrator module is not imported. Full operator walkthrough: Multi-Stage Pipelines guide.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| output_dir | string | "./pipeline_run" | Root directory for chain-level artefacts: pipeline_state.json, compliance/pipeline_manifest.json, and the pipeline-scoped audit_log.jsonl. Per-stage trainer artefacts continue to live under each stage's own training.output_dir. |
| stages | List[PipelineStage] | [] (required: ≥ 1) | Ordered list of stages. Each stage's model.name_or_path is auto-set to the previous stage's training.output_dir/final_model unless the stage supplies an explicit model: block. |
pipeline.stages[].* — PipelineStage fields
A PipelineStage is a per-stage override layered onto the root config. Inheritance is section-wholesale: omitting a block inherits the root block in full; supplying a block REPLACES the root block in full (no deep-merge).
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | (required) | Stage identifier matching ^[a-z0-9_]{1,32}$. Unique within the pipeline. Used as the identifier in --stage <name>, --resume-from <name>, audit-log payloads, and per-stage manifest entries. |
| model | Optional[ModelConfig] | null | Per-stage override of the root model: block. When null, auto-chains from the previous stage's final_model (or root for stage 0). When set, disables the auto-chain for that stage (operator escape hatch). |
| lora | Optional[LoraConfig] | null | Per-stage LoRA config. Inherits root wholesale when null. |
| training | Optional[TrainingConfig] | null | Per-stage training config. Inherits root wholesale when null. When supplied, trainer_type MUST be set explicitly, because every stage records its alignment paradigm in the manifest for audit clarity. |
| data | Optional[DataConfig] | null | Per-stage data config. Inherits root wholesale when null; per-stage override is the norm because each stage typically consumes a different dataset (SFT/DPO/preference/etc.). |
| evaluation | Optional[EvaluationConfig] | null | Per-stage gates (loss thresholds, auto_revert, safety, judge, human-approval). Each stage may independently configure its gate. |
Root-only sections — rejected at the stage level with EXIT_CONFIG_ERROR (1): distributed, webhook, compliance, risk_assessment, monitoring, retention, synthetic, merge, auth. These are pipeline-level concerns (distributed strategy stays consistent across the run; compliance metadata covers the whole chain; etc.).
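
Putting the pipeline and stage fields together, a two-stage SFT → DPO chain might be sketched as follows. Stage names and directory paths are illustrative, and the per-stage data: blocks are left out because DataConfig fields are not listed in this section. The comments on inheritance are a reading of the no-deep-merge rule above.

```yaml
training:
  trainer_type: "sft"
  output_dir: "./checkpoints"

pipeline:
  output_dir: "./pipeline_run"
  stages:
    - name: "sft"
      training:
        trainer_type: "sft"          # must be explicit whenever training: is supplied
        output_dir: "./stages/sft"
    - name: "dpo"
      # No model: block here, so the model auto-chains from ./stages/sft/final_model
      training:
        trainer_type: "dpo"
        output_dir: "./stages/dpo"
        learning_rate: 5.0e-6
        # This block replaces the root training: block wholesale, so any field
        # not set here is not inherited from root.
```

Giving each stage its own training.output_dir keeps the stages from colliding, which is one of the checks --dry-run performs.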
| Flag | Description |
| --- | --- |
| --stage <name> | Run only the named stage in isolation (audit / re-run scenarios). Auto-chains from the previous stage's on-disk output. |
| --resume-from <name> | Resume from the named stage onward; already-completed (or human-approved gated) stages with on-disk output are skipped. |
| --force-resume | Accept a pipeline_config_hash mismatch on resume (logged + audited via pipeline.force_resume). Stage topology mismatch (count / names / order) is refused even with this flag. |
| --input-model <path> | Operator escape hatch: overrides the auto-chained model for the --stage target. Audit-logged with input_source: cli_override. |
| --dry-run | Validates every stage's merged config + cross-stage chain integrity + training.output_dir collision check before any GPU is allocated; collects all errors before exiting. |
The --fit-check, --merge, --generate-data, --compliance-export, --benchmark-only flags are single-stage operations and are rejected at dispatch time when a pipeline: block is present — drop the pipeline: block or remove the flag.