Configuration Guide

ForgeLM uses YAML files for all configuration — declarative, version-controllable, and CI/CD-ready.

See config_template.yaml for a complete annotated example.


model

| Field | Type | Default | Description |
|---|---|---|---|
| `name_or_path` | string | required | HuggingFace model ID or local path |
| `max_length` | int | 2048 | Maximum context length |
| `load_in_4bit` | bool | true | Enable QLoRA 4-bit NF4 quantization |
| `backend` | string | "transformers" | "transformers" or "unsloth" (2-5x faster, Linux only) |
| `trust_remote_code` | bool | false | Allow custom code from model repos. Security risk: only enable for models that require it |
| `offline` | bool | false | Air-gapped mode: no HF Hub calls. Models/datasets must be local |
| `bnb_4bit_use_double_quant` | bool | true | Double quantization for extra VRAM savings |
| `bnb_4bit_quant_type` | string | "nf4" | Quantization type ("nf4" or "fp4") |
| `bnb_4bit_compute_dtype` | string | "auto" | Compute dtype: "auto", "bfloat16", "float16", "float32" |
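
As an illustration of how these fields combine, here is a minimal sketch of a QLoRA-style `model:` block. The model ID is the one used in the pipeline example later in this guide; choosing "bfloat16" as the compute dtype is an assumption about the target GPU, not a recommendation from this guide.

```yaml
# Sketch only: 4-bit NF4 loading with explicit compute dtype.
model:
  name_or_path: "meta-llama/Llama-3-8B"
  max_length: 2048
  load_in_4bit: true
  bnb_4bit_quant_type: "nf4"
  bnb_4bit_use_double_quant: true
  bnb_4bit_compute_dtype: "bfloat16"   # assumption: hardware with bf16 support
  trust_remote_code: false             # keep disabled unless the model requires it
```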

model.moe (Optional — MoE models)

| Field | Type | Default | Description |
|---|---|---|---|
| `quantize_experts` | bool | false | Quantize inactive expert weights to int8 for VRAM savings |
| `experts_to_train` | string | "all" | "all" or comma-separated expert indices (e.g., "0,1,2") |

model.multimodal (Optional — VLM models)

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | false | Enable vision-language model fine-tuning |
| `image_column` | string | "image" | Column name for image paths/URLs in dataset |
| `text_column` | string | "text" | Column name for text/captions |

lora

| Field | Type | Default | Description |
|---|---|---|---|
| `r` | int | 8 | LoRA rank. Higher = more parameters |
| `alpha` | int | 16 | LoRA scaling factor |
| `dropout` | float | 0.1 | Dropout probability |
| `bias` | string | "none" | "none", "all", or "lora_only" |
| `method` | string | "lora" | PEFT method: "lora", "dora", "pissa", "rslora" |
| `use_dora` | bool | false | Enable DoRA (Weight-Decomposed LoRA) |
| `use_rslora` | bool | false | Rank-stabilized LoRA (recommended for r>64) |
| `target_modules` | list | ["q_proj", "v_proj"] | Model modules to apply LoRA |
| `task_type` | string | "CAUSAL_LM" | Task type for PEFT |
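
A hedged sketch of a high-rank adapter configuration follows. The rank, alpha, and the extra projection names are illustrative assumptions (module names vary by architecture), and how `method` interacts with `use_rslora` is not specified in this guide, so only the boolean flag is set here.

```yaml
# Sketch only: high-rank LoRA with rank-stabilized scaling.
lora:
  r: 128                  # illustrative; above the r>64 threshold noted for use_rslora
  alpha: 256              # illustrative value
  dropout: 0.1
  use_rslora: true
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]  # assumption: Llama-style module names
```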

training

| Field | Type | Default | Description |
|---|---|---|---|
| `output_dir` | string | "./checkpoints" | Checkpoint save directory |
| `final_model_dir` | string | "final_model" | Subdirectory for final artifacts |
| `merge_adapters` | bool | false | Merge adapters into base model before saving |
| `trainer_type` | string | "sft" | "sft", "dpo", "simpo", "kto", "orpo", "grpo" |
| `num_train_epochs` | int | 3 | Number of training epochs |
| `per_device_train_batch_size` | int | 4 | Batch size per GPU |
| `gradient_accumulation_steps` | int | 2 | Number of micro-batches whose gradients are accumulated before each optimizer step |
| `learning_rate` | float | 2e-5 | Learning rate (lower for alignment: 5e-6) |
| `warmup_ratio` | float | 0.1 | Warmup proportion |
| `weight_decay` | float | 0.01 | AdamW weight decay |
| `eval_steps` | int | 200 | Evaluate every N steps |
| `save_steps` | int | 200 | Save checkpoint every N steps |
| `save_total_limit` | int | 3 | Max checkpoints to keep |
| `packing` | bool | false | Sequence packing (SFT only) |
| `report_to` | string | "tensorboard" | "tensorboard", "wandb", "mlflow", "none" |
| `run_name` | string | null | W&B/MLflow run name (auto-generated if null) |

OOM Recovery

When oom_recovery is enabled, ForgeLM automatically halves per_device_train_batch_size and doubles gradient_accumulation_steps after a CUDA out-of-memory error, preserving the effective batch size. It retries until per_device_train_batch_size reaches oom_recovery_min_batch_size.

| Field | Type | Default | Description |
|---|---|---|---|
| `oom_recovery` | bool | false | Retry training with smaller batch size on CUDA OOM |
| `oom_recovery_min_batch_size` | int | 1 | Stop retrying when batch size reaches this value |

Example:

```yaml
training:
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 2
  oom_recovery: true
  oom_recovery_min_batch_size: 1  # try down to batch_size=1 before failing
```

Effective batch size (per_device_train_batch_size × gradient_accumulation_steps) is preserved across retries. Each retry attempt is logged to the audit trail.
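
To make the halve/double rule concrete, the sketch below works through the retry schedule for a different starting point with a non-default floor. The schedule in the comments is derived from the rule stated above, not captured from a real run, and it assumes the attempt at the minimum batch size is the last one.

```yaml
# Retry schedule implied by the halve/double rule (illustrative):
#   attempt  per_device_train_batch_size  gradient_accumulation_steps  effective
#      1                 16                           1                   16
#      2                  8                           2                   16
#      3                  4                           4                   16   (floor reached; no further retries)
training:
  per_device_train_batch_size: 16
  gradient_accumulation_steps: 1
  oom_recovery: true
  oom_recovery_min_batch_size: 4
```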

GaLore (Optimizer-Level Memory Optimization)

| Field | Type | Default | Description |
|---|---|---|---|
| `galore_enabled` | bool | false | Enable GaLore gradient low-rank projection |
| `galore_optim` | string | "galore_adamw" | GaLore optimizer variant. One of: "galore_adamw", "galore_adamw_8bit", "galore_adafactor", "galore_adamw_layerwise", "galore_adamw_8bit_layerwise", "galore_adafactor_layerwise". The `_8bit` variants halve optimizer-state VRAM; the `_layerwise` variants cut peak VRAM by applying the optimizer update layer by layer during the backward pass. |
| `galore_rank` | int | 128 | Rank for gradient projection |
| `galore_update_proj_gap` | int | 200 | Steps between projection updates |
| `galore_scale` | float | 0.25 | GaLore scaling factor |
| `galore_proj_type` | string | "std" | Projection type: "std", "reverse_std", "right", "left", "full" |
| `galore_target_modules` | Optional[List[str]] | null | Module-name regex patterns GaLore is applied to. null falls back to `[r".*.attn.*", r".*.mlp.*"]` (attention + MLP layers). |
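
A minimal sketch of enabling GaLore with the 8-bit optimizer variant is shown here. It assumes the GaLore fields sit under `training:`, like `oom_recovery` and `dpo_beta` in the examples elsewhere in this guide; the explicit patterns simply restate the documented fallback.

```yaml
# Sketch only: GaLore with 8-bit optimizer state.
training:
  galore_enabled: true
  galore_optim: "galore_adamw_8bit"
  galore_rank: 128
  galore_update_proj_gap: 200
  galore_scale: 0.25
  galore_proj_type: "std"
  galore_target_modules:      # same patterns as the documented null fallback
    - ".*.attn.*"
    - ".*.mlp.*"
```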

Long-Context Training

| Field | Type | Default | Description |
|---|---|---|---|
| `rope_scaling` | Optional[Dict[str, Any]] | null | RoPE scaling method dict ({"type": "linear", "factor": 2.0} etc.). Supported types: "linear", "dynamic", "yarn", "longrope". |
| `neftune_noise_alpha` | float | null | NEFTune noise injection alpha (e.g., 5.0) |
| `sliding_window_attention` | int | null | Sliding window attention size in tokens |
| `sample_packing` | bool | false | Pack multiple short samples into full-length sequences |
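
The fragment below sketches a long-context setup. It assumes these fields live under `training:` (this guide lists them inside the training section), and the `max_length` value is purely illustrative.

```yaml
# Sketch only: linear RoPE scaling with sample packing.
model:
  max_length: 8192            # illustrative target context length
training:
  rope_scaling: { type: "linear", factor: 2.0 }
  neftune_noise_alpha: 5.0
  sample_packing: true
```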

GPU Cost Estimation

| Field | Type | Default | Description |
|---|---|---|---|
| `gpu_cost_per_hour` | float | null | Custom GPU cost rate (USD/hour). Auto-detected from GPU model if null |

Alignment Parameters

| Field | Type | Default | Description |
|---|---|---|---|
| `dpo_beta` | float | 0.1 | DPO temperature |
| `simpo_gamma` | float | 0.5 | SimPO margin term |
| `simpo_beta` | float | 2.0 | SimPO scaling |
| `kto_beta` | float | 0.1 | KTO loss parameter |
| `orpo_beta` | float | 0.1 | ORPO odds ratio weight |
| `grpo_num_generations` | int | 4 | GRPO: responses per prompt |
| `grpo_max_completion_length` | int | 512 | GRPO: max tokens per completion (legacy alias grpo_max_new_tokens accepted) |
| `grpo_reward_model` | string | null | GRPO: reward model path (HF or local) |
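
As a sketch of how a trainer type pairs with its parameters, the fragment below configures SimPO with the documented defaults and the lower alignment learning rate mentioned in the training table. The learning rate is written with a decimal point because YAML 1.1 parsers such as PyYAML read `5e-6` as a string; whether ForgeLM coerces such strings is not stated in this guide.

```yaml
# Sketch only: SimPO alignment run.
training:
  trainer_type: "simpo"
  learning_rate: 5.0e-6       # decimal point keeps this a float under YAML 1.1
  simpo_gamma: 0.5
  simpo_beta: 2.0
```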

data

| Field | Type | Default | Description |
|---|---|---|---|
| `dataset_name_or_path` | string | required | HF dataset ID or local JSONL path |
| `extra_datasets` | list | null | Additional datasets to mix in |
| `mix_ratio` | list | null | Weight per dataset (e.g., [0.7, 0.3]) |
| `shuffle` | bool | true | Shuffle training data |
| `clean_text` | bool | true | Strip extra whitespace |
| `add_eos` | bool | true | Add EOS token to sequences |
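
A hedged sketch of dataset mixing follows. The file paths are hypothetical, and two details are assumptions rather than documented behavior: that `extra_datasets` entries are plain paths or dataset IDs, and that `mix_ratio` lists the primary dataset's weight first.

```yaml
# Sketch only: mix a primary dataset with one extra dataset at 70/30.
data:
  dataset_name_or_path: "./data/primary.jsonl"   # hypothetical path
  extra_datasets:
    - "./data/secondary.jsonl"                   # hypothetical path
  mix_ratio: [0.7, 0.3]                          # assumption: primary dataset first
  shuffle: true
```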

data.governance (Optional — EU AI Act Art. 10)

| Field | Type | Default | Description |
|---|---|---|---|
| `collection_method` | string | "" | How data was collected |
| `annotation_process` | string | "" | Annotation methodology |
| `known_biases` | string | "" | Known dataset biases |
| `personal_data_included` | bool | false | Contains personal data |
| `dpia_completed` | bool | false | Data Protection Impact Assessment done |

evaluation (Optional)

| Field | Type | Default | Description |
|---|---|---|---|
| `auto_revert` | bool | false | Delete model if evaluation fails |
| `max_acceptable_loss` | float | null | Hard ceiling for eval_loss |
| `baseline_loss` | float | null | Computed automatically if null |
| `require_human_approval` | bool | false | Pause for human review (exit code 4) |

evaluation.benchmark (Optional)

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | false | Enable lm-eval-harness benchmarks |
| `tasks` | list | [] | Task names (e.g., ["arc_easy", "hellaswag"]) |
| `num_fewshot` | int | null | Few-shot examples (task default) |
| `batch_size` | string | "auto" | Evaluation batch size |
| `limit` | int | null | Samples per task (for quick checks) |
| `min_score` | float | null | Minimum average accuracy |

evaluation.safety (Optional)

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | false | Enable safety classifier evaluation |
| `classifier` | string | "meta-llama/Llama-Guard-3-8B" | Safety classifier model |
| `test_prompts` | string | "safety_prompts.jsonl" | Adversarial test prompts file. Built-in sets in configs/safety_prompts/ |
| `max_safety_regression` | float | 0.05 | Max allowed unsafe ratio (binary gate) |
| `scoring` | string | "binary" | Scoring mode: "binary" or "confidence_weighted" |
| `min_safety_score` | float | null | Weighted score threshold (0.0-1.0). Used when scoring="confidence_weighted" |
| `min_classifier_confidence` | float | 0.7 | Flag responses below this confidence for manual review |
| `track_categories` | bool | false | Parse Llama Guard S1-S14 harm categories |
| `severity_thresholds` | dict | null | Per-severity limits: {"critical": 0, "high": 0.01, "medium": 0.05} |
| `batch_size` | int | 8 | Batched generation size for safety evaluation. 1 disables batching; raise for throughput on large VRAM, lower to reduce OOM risk on small VRAM. |
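
The sketch below combines a loss gate with a confidence-weighted safety gate. It assumes the dotted section names map to nested YAML keys (`evaluation.safety` becomes `safety:` under `evaluation:`); the loss ceiling and score threshold are illustrative values, not recommendations.

```yaml
# Sketch only: loss ceiling plus confidence-weighted safety gate.
evaluation:
  auto_revert: true
  max_acceptable_loss: 2.0          # illustrative ceiling
  safety:
    enabled: true
    classifier: "meta-llama/Llama-Guard-3-8B"
    test_prompts: "safety_prompts.jsonl"
    scoring: "confidence_weighted"
    min_safety_score: 0.9           # illustrative threshold in the 0.0-1.0 range
    track_categories: true
    severity_thresholds: { critical: 0, high: 0.01, medium: 0.05 }
    batch_size: 4                   # lowered from 8 to reduce OOM risk
```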

evaluation.llm_judge (Optional)

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | false | Enable LLM-as-Judge scoring |
| `judge_model` | string | "gpt-4o" | Judge model (API or local path) |
| `judge_api_key_env` | string | null | Env var name for API key (null = local) |
| `judge_api_base` | string | null | Override the judge API base URL (Azure OpenAI, self-hosted vLLM, OpenAI-compatible gateway, e.g. https://api.together.xyz/v1). When unset, the SDK default endpoint is used. |
| `eval_dataset` | string | "eval_prompts.jsonl" | Evaluation prompts file |
| `min_score` | float | 5.0 | Minimum average score (1-10) |
| `batch_size` | int | 8 | Number of (prompt, completion) pairs scored per LLM-judge round. 1 disables batching. |

Deprecated: evaluation.staging_ttl_days is superseded by retention.staging_ttl_days. The legacy key is alias-forwarded with a DeprecationWarning during the v0.5.5 → v0.6.x window and removed in v0.7.0. See release.md.


retention (Optional — GDPR Article 17 erasure horizons)

Defines maximum retention horizons for compliance, training, and evaluation artefacts. Horizons honour GDPR Article 5(1)(e) "storage limitation" and Article 17 "right to erasure" deadlines. The enforce knob switches between log-only, warning, and hard-block modes so a regulated CI gate cannot silently extend the retention horizon by re-using a stale workspace.

| Field | Type | Default | Description |
|---|---|---|---|
| `audit_log_retention_days` | int | 1825 (~5 years) | Days to retain audit_log.jsonl before flagging it as overdue under Article 5(1)(e). Set to 0 to retain indefinitely (Article 17(3)(b) defence). |
| `staging_ttl_days` | int | 7 | Days to retain `final_model.staging.<run_id>/` after a forgelm reject decision before scheduled cleanup. Set to 0 to retain indefinitely. Replaces the deprecated evaluation.staging_ttl_days; both keys accepted with identical values during the v0.5.5 → v0.6.x deprecation window. |
| `ephemeral_artefact_retention_days` | int | 90 | Days to retain compliance bundles, data audit reports, and other run-scoped derived artefacts. Set to 0 to retain indefinitely. |
| `raw_documents_retention_days` | int | 90 | Days to retain ingested raw documents (PDF / DOCX / EPUB / TXT / Markdown) under the operator's ingestion-output directory. Set to 0 to retain indefinitely. |
| `enforce` | string | "log_only" | Policy enforcement mode: "log_only" (audit-log only), "warn_on_excess" (structured stderr warning), "block_on_excess" (abort trainer pre-flight with EXIT_EVAL_FAILURE = 3). |
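
For a regulated CI gate, a retention block might look like the sketch below. The 30-day raw-document horizon is an illustrative tightening of the default, not a value this guide prescribes.

```yaml
# Sketch only: hard-blocking retention policy for a CI gate.
retention:
  audit_log_retention_days: 1825
  staging_ttl_days: 7
  ephemeral_artefact_retention_days: 90
  raw_documents_retention_days: 30   # illustrative; default is 90
  enforce: "block_on_excess"         # aborts pre-flight with exit code 3 on excess
```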

Deprecation: evaluation.staging_ttl_days is deprecated as of v0.5.5 in favour of retention.staging_ttl_days. The legacy key is alias-forwarded with a DeprecationWarning until v0.7.0. See release.md for the full deprecation cadence policy.


webhook (Optional)

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | null | Webhook destination URL |
| `url_env` | string | null | Env var name containing URL |
| `notify_on_start` | bool | true | Notify on training start |
| `notify_on_success` | bool | true | Notify on success |
| `notify_on_failure` | bool | true | Notify on failure |
| `timeout` | int | 10 | HTTP request timeout (seconds). Clamped to ≥ 1s by the notifier. Default raised to 10s in v0.5.5 (was 5s): Slack/Teams gateway latency spikes regularly cross 5s in production, and a webhook timeout silently degrades the audit chain (webhook failure is best-effort). |
| `allow_private_destinations` | bool | false | Opt in to webhooks pointing at RFC1918 / loopback / link-local hosts (in-cluster Slack proxy, on-prem Teams gateway). Defaults to public-internet destinations only, as an SSRF guard |
| `tls_ca_bundle` | string | null | Path to a custom CA bundle forwarded to requests as verify= (e.g. corporate MITM CA). When unset, certifi's bundled store is used |

distributed (Optional)

| Field | Type | Default | Description |
|---|---|---|---|
| `strategy` | string | null | "deepspeed" or "fsdp" (null = single GPU) |
| `deepspeed_config` | string | null | Preset ("zero2", "zero3", "zero3_offload") or JSON path |
| `fsdp_strategy` | string | "full_shard" | "full_shard", "shard_grad_op", "hybrid_shard", "no_shard" |
| `fsdp_auto_wrap` | bool | true | Auto-wrap transformer layers |
| `fsdp_offload` | bool | false | Offload parameters to CPU |
| `fsdp_backward_prefetch` | string | "backward_pre" | "backward_pre" or "backward_post" |
| `fsdp_state_dict_type` | string | "FULL_STATE_DICT" | "FULL_STATE_DICT" or "SHARDED_STATE_DICT" |
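
The two strategies use disjoint field sets, which the sketch below illustrates with an FSDP block and a commented-out DeepSpeed alternative. It assumes the `fsdp_*` fields are ignored when `strategy` is "deepspeed", which this guide does not state explicitly.

```yaml
# Sketch only: FSDP with sharded checkpoints.
distributed:
  strategy: "fsdp"
  fsdp_strategy: "full_shard"
  fsdp_auto_wrap: true
  fsdp_offload: false
  fsdp_state_dict_type: "SHARDED_STATE_DICT"

# Alternative: DeepSpeed using a built-in preset.
# distributed:
#   strategy: "deepspeed"
#   deepspeed_config: "zero3"
```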

merge (Optional)

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | false | Enable model merging |
| `method` | string | "ties" | "ties", "dare", "slerp", "linear" |
| `models` | list | [] | List of {path, weight} dicts |
| `output_dir` | string | "./merged_model" | Output directory |
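
A minimal sketch of the `{path, weight}` list format follows. Both model paths are hypothetical, and whether weights must sum to 1.0 is not specified in this guide.

```yaml
# Sketch only: TIES merge of two fine-tuned checkpoints.
merge:
  enabled: true
  method: "ties"
  models:
    - { path: "./checkpoints/run_a/final_model", weight: 0.6 }   # hypothetical path
    - { path: "./checkpoints/run_b/final_model", weight: 0.4 }   # hypothetical path
  output_dir: "./merged_model"
```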

compliance (Optional — EU AI Act Art. 11 + Annex IV)

| Field | Type | Default | Description |
|---|---|---|---|
| `provider_name` | string | "" | Organization name |
| `provider_contact` | string | "" | Contact email |
| `system_name` | string | "" | AI system name |
| `intended_purpose` | string | "" | What the model is for |
| `known_limitations` | string | "" | What it should not be used for |
| `system_version` | string | "" | Version identifier |
| `risk_classification` | string | "minimal-risk" | One of the 5 EU AI Act RiskTier values: "unknown" (pre-classification placeholder), "minimal-risk", "limited-risk", "high-risk" (Article 6; full Annex IV documentation), "unacceptable" (Article 5 prohibited practice; emits a startup banner). |

risk_assessment (Optional — EU AI Act Art. 9)

| Field | Type | Default | Description |
|---|---|---|---|
| `intended_use` | string | "" | Intended use description |
| `foreseeable_misuse` | list | [] | List of misuse scenarios |
| `risk_category` | string | "minimal-risk" | Same 5 RiskTier values as compliance.risk_classification: "unknown", "minimal-risk", "limited-risk", "high-risk", "unacceptable". Drives auto-revert thresholds and Annex IV gating. |
| `mitigation_measures` | list | [] | Risk mitigation measures |
| `vulnerable_groups_considered` | bool | false | Impact on vulnerable groups assessed |

monitoring (Optional — EU AI Act Art. 12+17)

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | false | Enable monitoring hooks |
| `endpoint` | string | "" | Monitoring webhook URL |
| `endpoint_env` | string | null | Env var name for endpoint |
| `metrics_export` | string | "none" | "none", "prometheus", "datadog", "custom_webhook" |
| `alert_on_drift` | bool | true | Alert on model drift |
| `check_interval_hours` | int | 24 | Monitoring check interval |

synthetic (Optional — Synthetic Data Generation)

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | false | Enable teacher → student synthetic-data generation. |
| `teacher_model` | string | "" | HF Hub ID or API model name (e.g. gpt-4o, meta-llama/Llama-3-70B). |
| `teacher_backend` | string | "api" | One of "api" (OpenAI/Anthropic-compatible), "local" (HF in-process), "file" (read pre-generated JSONL). |
| `api_base` | string | "" | API endpoint, e.g. https://api.openai.com/v1 or self-hosted vLLM gateway. |
| `api_key` | Optional[str] | null | Inline API key. Prefer api_key_env to avoid committing secrets; when set inline, the value is `***REDACTED***` in serialized config. |
| `api_key_env` | Optional[str] | null | Env var name carrying the API key (e.g. OPENAI_API_KEY). |
| `api_delay` | float | 0.5 | Seconds between teacher calls (rate limiting). |
| `api_timeout` | int | 60 | Per-call API timeout in seconds. |
| `seed_file` | string | "" | Path to seed prompts file (JSONL or plain text, one prompt per line). |
| `seed_prompts` | List[str] | [] | Inline seed prompts (alternative to seed_file). |
| `system_prompt` | string | "" | System prompt prepended on every teacher call. |
| `max_new_tokens` | int | 1024 | Max tokens per teacher response. |
| `temperature` | float | 0.7 | Sampling temperature passed to the teacher. |
| `output_file` | string | "synthetic_data.jsonl" | Output JSONL file path. |
| `output_format` | string | "messages" | One of "messages" (chat-style array), "instruction" (Alpaca-style), "chatml", "prompt_response". |
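
The sketch below shows an API-backed teacher that reads its key from the environment rather than the config file. The seed file path and system prompt are hypothetical placeholders.

```yaml
# Sketch only: API teacher with the key supplied via environment variable.
synthetic:
  enabled: true
  teacher_model: "gpt-4o"
  teacher_backend: "api"
  api_base: "https://api.openai.com/v1"
  api_key_env: "OPENAI_API_KEY"      # keeps the secret out of the committed config
  api_delay: 0.5
  seed_file: "./seeds.jsonl"         # hypothetical path
  system_prompt: "You are a helpful assistant."   # hypothetical prompt
  max_new_tokens: 1024
  temperature: 0.7
  output_file: "synthetic_data.jsonl"
  output_format: "messages"
```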

auth (Optional)

| Field | Type | Default | Description |
|---|---|---|---|
| `hf_token` | string | null | HuggingFace token (prefer HUGGINGFACE_TOKEN env var) |

pipeline (Optional — Multi-Stage Training Chains, Phase 14)

Chains 2+ training stages (typically SFT → DPO → GRPO) into one config-driven run with auto-chaining, per-stage gates, crash-safe resume, and a chain-level Annex IV manifest. When omitted, ForgeLM behaves byte-identically to a v0.6.0 single-stage run; the orchestrator module is not imported. Full operator walkthrough: Multi-Stage Pipelines guide.

| Field | Type | Default | Description |
|---|---|---|---|
| `output_dir` | string | "./pipeline_run" | Root directory for chain-level artefacts: pipeline_state.json, compliance/pipeline_manifest.json, and the pipeline-scoped audit_log.jsonl. Per-stage trainer artefacts continue to live under each stage's own training.output_dir. |
| `stages` | List[PipelineStage] | [] (required: ≥ 1) | Ordered list of stages. Each stage's model.name_or_path is auto-set to the previous stage's training.output_dir/final_model unless the stage supplies an explicit model: block. |

pipeline.stages[].* — PipelineStage fields

A PipelineStage is a per-stage override layered onto the root config. Inheritance works at the level of whole sections: a stage that omits a block inherits the root block in full, and a stage that supplies a block replaces the root block in full (no deep merge).

| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | Stage identifier matching `^[a-z0-9_]{1,32}$`. Unique within the pipeline. Used as the identifier in `--stage <name>`, `--resume-from <name>`, audit-log payloads, and per-stage manifest entries. |
| `model` | Optional[ModelConfig] | null | Per-stage override of the root model: block. When null, auto-chains from the previous stage's final_model (or root for stage 0). When set, disables the auto-chain for that stage (operator escape hatch). |
| `lora` | Optional[LoraConfig] | null | Per-stage LoRA config. Inherits root wholesale when null. |
| `training` | Optional[TrainingConfig] | null | Per-stage training config. Inherits root wholesale when null. When supplied, trainer_type MUST be set explicitly; every stage records its alignment paradigm in the manifest for audit clarity. |
| `data` | Optional[DataConfig] | null | Per-stage data config. Inherits root wholesale when null; per-stage override is the norm because each stage typically consumes a different dataset (SFT/DPO/preference/etc.). |
| `evaluation` | Optional[EvaluationConfig] | null | Per-stage gates (loss thresholds, auto_revert, safety, judge, human-approval). Each stage may independently configure its gate. |
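
The replace-not-merge rule is easy to trip over, so here is a hedged sketch of its consequences. The root values are illustrative, and the comment about defaults assumes that fields missing from a stage block fall back to the schema defaults listed in the training table.

```yaml
# Sketch only: section-level inheritance in a two-stage pipeline.
model: { name_or_path: "meta-llama/Llama-3-8B" }
data: { dataset_name_or_path: "./placeholder.jsonl" }
training:
  trainer_type: "sft"
  output_dir: "./pipeline_run/stage1_sft"
  num_train_epochs: 1            # root-level choice
  learning_rate: 1.0e-5

pipeline:
  stages:
    - name: sft_stage
      # No training: block, so the root training block is inherited in full
      # (num_train_epochs: 1, learning_rate: 1.0e-5).
      data: { dataset_name_or_path: "./data/sft.jsonl" }
    - name: dpo_stage
      # training: block supplied, so it replaces the root block in full.
      # num_train_epochs: 1 is NOT carried over; repeat it here if needed.
      training:
        trainer_type: "dpo"      # must be set explicitly in a stage training block
        output_dir: "./pipeline_run/stage2_dpo"
        learning_rate: 5.0e-6
      data: { dataset_name_or_path: "./data/preferences.jsonl" }
```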

Root-only sections — rejected at the stage level with EXIT_CONFIG_ERROR (1): distributed, webhook, compliance, risk_assessment, monitoring, retention, synthetic, merge, auth. These are pipeline-level concerns (distributed strategy stays consistent across the run; compliance metadata covers the whole chain; etc.).

Example

```yaml
# Root defaults, inherited by stages that omit a block.
model: { name_or_path: "meta-llama/Llama-3-8B" }
lora: { r: 8, alpha: 16 }
training: { trainer_type: "sft", output_dir: "./placeholder" }
data: { dataset_name_or_path: "./placeholder.jsonl" }

pipeline:
  output_dir: "./pipeline_run"
  stages:
    - name: sft_stage
      training: { trainer_type: "sft", output_dir: "./pipeline_run/stage1_sft" }
      data: { dataset_name_or_path: "./data/sft.jsonl" }
    - name: dpo_stage
      training: { trainer_type: "dpo", output_dir: "./pipeline_run/stage2_dpo", dpo_beta: 0.1 }
      data: { dataset_name_or_path: "./data/preferences.jsonl" }
    - name: grpo_stage
      training: { trainer_type: "grpo", output_dir: "./pipeline_run/stage3_grpo" }
      data: { dataset_name_or_path: "./data/math_prompts.jsonl" }
```

CLI surface

| Flag | Effect |
|---|---|
| `--stage <name>` | Run only the named stage in isolation (audit / re-run scenarios). Auto-chains from the previous stage's on-disk output. |
| `--resume-from <name>` | Resume from the named stage onward; already-completed (or human-approved gated) stages with on-disk output are skipped. |
| `--force-resume` | Accept a pipeline_config_hash mismatch on resume (logged + audited via pipeline.force_resume). Stage topology mismatch (count / names / order) is refused even with this flag. |
| `--input-model <path>` | Operator escape hatch: overrides the auto-chained model for the --stage target. Audit-logged with input_source: cli_override. |
| `--dry-run` | Validates every stage's merged config + cross-stage chain integrity + training.output_dir collision check before any GPU is allocated; collects all errors before exiting. |

The --fit-check, --merge, --generate-data, --compliance-export, --benchmark-only flags are single-stage operations and are rejected at dispatch time when a pipeline: block is present — drop the pipeline: block or remove the flag.

Verifier

```shell
forgelm verify-annex-iv --pipeline <pipeline.output_dir>
```

Validates the chain-level manifest's structural fields, chain-integrity (every stage with input_source: chain matches its immediate predecessor's output_model), per-stage training_manifest.json existence, and stopped_at / running-status consistency. Exit 0 on clean manifest, 1 on config / chain violation, 2 on runtime I/O failure.