Scope: Everything under `tests/` + CI workflows under `.github/workflows/`. Enforced by: `.github/workflows/ci.yml` (required) + `.github/workflows/nightly.yml` (compatibility).
Current structure (post Wave 5 / Phase 12.6 closure cycle: ~70 test modules, one per feature area; the collected-test count grows over time, so run `pytest --collect-only -q` for the current number). The tree below is a representative subset; see `git ls-files tests/` for the full inventory:
tests/
├── conftest.py # Shared fixtures (minimal_config factory)
├── runtime_smoke.py # Full-pipeline smoke fixture generator
├── test_smoke.py # Basic imports + CLI invocation
├── test_integration.py # End-to-end dry-run across trainer types
├── test_cli.py # CLI argument parsing + exit codes
├── test_cli_subcommands.py # Subcommand dispatching
├── test_config.py # Pydantic schemas + validators
├── test_trainer.py # Trainer orchestration logic
├── test_alignment.py # DPO / SimPO / KTO / ORPO / GRPO
├── test_long_context.py # RoPE scaling, NEFTune, sample packing
├── test_galore.py # GaLore optimizer
├── test_moe_functions.py # MoE expert quantize + freeze
├── test_phase7.py # VLM + merging + PiSSA
├── test_merging_algos.py # TIES / DARE / SLERP
├── test_synthetic.py # Teacher → student distillation
├── test_benchmark.py # lm-eval-harness wrapper
├── test_safety_advanced.py # Llama Guard + severity + categories
├── test_judge_functions.py # LLM-as-judge evaluation
├── test_compliance.py # Audit log + manifests + provenance
├── test_eu_ai_act.py # Articles 9-15 + Annex IV
├── test_model_card.py # Model card generation
├── test_cost_estimation.py # GPU cost heuristics
├── test_webhook.py # Slack/Teams notifier
├── test_distributed.py # DeepSpeed / FSDP config
├── test_data_edge_cases.py # Malformed datasets, edge cases
├── test_supply_chain_security.py # Wave 4 / Phase 23: pip-audit + bandit + SBOM
├── test_check_anchor_resolution.py # Wave 4 / Phase 26: markdown anchor resolver
├── test_check_bilingual_parity.py # Bilingual EN/TR mirror parity
├── test_gdpr_erasure.py # GDPR Article 17 (forgelm purge)
└── …
Rules:
- One `test_<module>.py` per `forgelm/<module>.py` where practical.
- Cross-cutting features (EU AI Act, alignment) get their own file that may import multiple modules.
- `conftest.py` holds shared fixtures only. Domain-specific helpers live in the test files that need them.
- A new feature PR adds the matching `test_*.py` in the same PR. No "tests in next PR."
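
The one-test-file-per-module rule can be checked mechanically. The sketch below is a hypothetical helper, not a tool that exists in the repo: `modules_without_tests` and its skip rule for underscore-prefixed files are assumptions made for illustration.

```python
from pathlib import Path


def modules_without_tests(package_dir: Path, tests_dir: Path) -> list[str]:
    """Return module names in package_dir that lack a test_<module>.py file."""
    missing = []
    for source in sorted(package_dir.glob("*.py")):
        # Assumption: private modules and __init__.py are exempt from the rule.
        if source.stem.startswith("_"):
            continue
        if not (tests_dir / f"test_{source.stem}.py").exists():
            missing.append(source.stem)
    return missing
```

Calling `modules_without_tests(Path("forgelm"), Path("tests"))` from the repo root would list candidates for a missing test file. Cross-cutting test files do not map to a single module, so the result is a prompt for review rather than a hard gate.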
From tests/conftest.py:

    def minimal_config(**overrides):
        """Create a minimal valid ForgeConfig dict for testing."""
        data = {
            "model": {"name_or_path": "org/model"},
            "lora": {},
            "training": {},
            "data": {"dataset_name_or_path": "org/dataset"},
        }
        data.update(overrides)
        return data

Rules:
- Factory functions over static fixtures. `minimal_config(training={"trainer_type": "dpo"})` is better than 50 parametrized fixtures.
- No GPU in unit tests. Mock `torch.cuda.is_available()` or use CPU-only paths. Unit tests must run on a laptop with no GPU.
- No network in unit tests. Mock `requests.post`, `huggingface_hub.snapshot_download`, etc. Integration tests may hit `localhost` but never external services.
- Deterministic. `random.seed(42)`, `torch.manual_seed(42)` in any test that touches RNG. Flaky test = broken test.
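
A minimal sketch of the factory and determinism rules in use. It repeats the `minimal_config` factory from `conftest.py` so the block is self-contained; `sample_indices` is a hypothetical helper standing in for any code that touches the RNG.

```python
import random


def minimal_config(**overrides):
    """Same factory as above: a minimal valid config dict plus overrides."""
    data = {
        "model": {"name_or_path": "org/model"},
        "lora": {},
        "training": {},
        "data": {"dataset_name_or_path": "org/dataset"},
    }
    data.update(overrides)  # shallow: an override replaces the whole section
    return data


def sample_indices(n_rows: int, k: int) -> list[int]:
    """Hypothetical helper that draws k distinct row indices at random."""
    return random.sample(range(n_rows), k)


def test_factory_override_replaces_section():
    config = minimal_config(training={"trainer_type": "dpo"})
    assert config["training"] == {"trainer_type": "dpo"}
    # Sections that were not overridden keep their defaults.
    assert config["model"] == {"name_or_path": "org/model"}


def test_sampling_is_deterministic_when_seeded():
    random.seed(42)
    first = sample_indices(100, 5)
    random.seed(42)
    second = sample_indices(100, 5)
    assert first == second
```

Because `dict.update` is shallow, each keyword override swaps in a complete section; a test that needs two keys under `training` passes both in the same dict.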
| Category | Scope | Speed | Runs in CI? |
|---|---|---|---|
| Smoke (`test_smoke.py`) | Import + CLI `--help` works | < 5s | Every push |
| Unit (`test_<module>.py`) | One function / method at a time, heavy mocking | < 60s total | Every push |
| Integration smoke (`test_integration.py`) | Full pipeline dry-run, no GPU, mocked HF | < 5min | Every push |
| Distributed (`test_distributed.py`) | DeepSpeed/FSDP config generation (no actual multi-GPU) | < 30s | Every push |
| Compatibility (via `nightly.yml`) | Upstream dep upgrades: latest TRL, PEFT, Unsloth | ~10min | Nightly only |
| Cross-OS release-tag matrix (via `publish.yml`) | Wheel install + pytest on 3 OS × 4 Python = 12 combos. Linux-only extras (`qlora`, `unsloth`) gated to the Linux runners. | ~25-40 min | Release tag push only |
| Supply-chain (via `nightly.yml` + on-tag) | `pip-audit` (CVE feed) + `bandit` (Python SAST) + CycloneDX SBOM emission per combo. `[security]` extra. | ~5min | Nightly + every release tag |

Never write a test that requires an actual GPU. The fixture runtime_smoke.py exists so "full pipeline" checks are dry-runs. If you genuinely need GPU validation, document it as a manual release-gate check in release.md, not a CI test.
Preferred libraries: unittest.mock.patch, pytest.monkeypatch, requests_mock.
What to mock:
- Network: `requests.post`, `huggingface_hub.*` downloads, OpenAI/Anthropic API calls
- GPU: `torch.cuda.is_available`, `torch.cuda.get_device_name`
- Time: `time.sleep`, datetime patterns where tests need determinism
- Third-party heavy imports: `unsloth`, `bitsandbytes`, `deepspeed`, `lm_eval` (when unavailable)
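
The GPU, time, and heavy-import items above can be sketched with `unittest.mock` alone. `pick_device` and `retry` are hypothetical helpers written for this example, not ForgeLM functions; the stubbed `torch` module means the tests run even where PyTorch is not installed.

```python
import sys
import time
from unittest import mock


def pick_device() -> str:
    """Hypothetical helper: "cuda" when torch reports a GPU, else "cpu"."""
    import torch  # imported lazily so tests can substitute a stub

    return "cuda" if torch.cuda.is_available() else "cpu"


def retry(action, attempts: int = 3, delay: float = 2.0):
    """Hypothetical helper: call action until it succeeds, sleeping between tries."""
    for attempt in range(attempts):
        try:
            return action()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)


def test_pick_device_without_gpu():
    fake_torch = mock.MagicMock()
    fake_torch.cuda.is_available.return_value = False
    # patch.dict restores sys.modules when the with-block exits.
    with mock.patch.dict(sys.modules, {"torch": fake_torch}):
        assert pick_device() == "cpu"


def test_retry_does_not_really_sleep():
    action = mock.Mock(side_effect=[ConnectionError(), "ok"])
    with mock.patch("time.sleep") as fake_sleep:
        assert retry(action) == "ok"
    # One failure means exactly one (mocked) sleep before the retry.
    fake_sleep.assert_called_once_with(2.0)
```

Inside a pytest suite the same patches are usually written with the `monkeypatch` fixture; the `unittest.mock` form is shown here because it needs nothing beyond the standard library.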
What NOT to mock:
- Pydantic validation (use real config objects)
- File I/O under `tmp_path` (use the real filesystem via pytest fixture)
- YAML parsing
- Anything `forgelm`-internal that has a fast real implementation
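
As an illustration of the file I/O item, the sketch below writes to and reads from a real directory instead of patching `open`. `write_manifest` and `read_manifest` are hypothetical helpers, and JSON stands in for the project's real formats so the block needs only the standard library.

```python
import json
from pathlib import Path


def write_manifest(path: Path, entries: dict) -> None:
    """Hypothetical helper: persist a manifest as JSON."""
    path.write_text(json.dumps(entries, sort_keys=True), encoding="utf-8")


def read_manifest(path: Path) -> dict:
    """Hypothetical helper: load a manifest written by write_manifest."""
    return json.loads(path.read_text(encoding="utf-8"))


def test_manifest_roundtrip(tmp_path):
    # tmp_path is pytest's built-in per-test temporary directory (a pathlib.Path).
    target = tmp_path / "manifest.json"
    write_manifest(target, {"model": "org/model", "epochs": 1})
    assert target.exists()
    assert read_manifest(target) == {"model": "org/model", "epochs": 1}
```

The test exercises the actual encoding, path handling, and serialization, which a mocked `open` would silently skip.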
From pyproject.toml:

    [tool.coverage.report]
    fail_under = 40

40% is the floor, not the target. Current repo sits well above it. The floor was raised from 25 to 40 during Phase 11/11.5 review cycles once the audit / ingest module suite landed; the standard is now in lock-step with the toml. Rules:
- Every new module starts at or above the overall floor.
- Public API (non-underscore functions) has coverage.
- Error paths (every `raise` and every `sys.exit(!=0)`) must have at least one test that triggers them.
- `pragma: no cover` is allowed only for:
  - `if __name__ == "__main__":` blocks
  - `except ImportError:` fallbacks for optional deps
  - Explicit "not implemented on this platform" branches
If you need to exempt more, file an issue first.
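
A minimal sketch of covering a non-zero exit path. `main`, `exit_code_of`, and `EXIT_CONFIG_ERROR` are hypothetical names for this example and do not reflect ForgeLM's real CLI or exit-code table.

```python
import sys

EXIT_CONFIG_ERROR = 1  # hypothetical exit code used only in this sketch


def main(argv: list[str]) -> int:
    """Hypothetical entry point with one explicit failure path."""
    if "--config" not in argv:
        sys.exit(EXIT_CONFIG_ERROR)
    return 0


def exit_code_of(argv: list[str]) -> int:
    """Run main and return its exit code instead of terminating the process."""
    try:
        return main(argv)
    except SystemExit as exc:
        return exc.code


def test_missing_config_exits_nonzero():
    assert exit_code_of([]) == EXIT_CONFIG_ERROR


def test_config_present_succeeds():
    assert exit_code_of(["--config", "config.yaml"]) == 0
```

In a pytest suite, `with pytest.raises(SystemExit) as excinfo:` followed by an assertion on `excinfo.value.code` expresses the same check without the wrapper.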
From .github/workflows/ci.yml:
- Lint: `ruff check` + `ruff format --check` on entire repo. Failure = PR blocked.
- Test matrix: Python 3.10, 3.11, 3.12, 3.13 on ubuntu-latest.
- Coverage: `pytest --cov=forgelm --cov-fail-under=40` (enforced via `addopts` in `pyproject.toml`'s `[tool.pytest.ini_options]`, kept in lock-step with `[tool.coverage.report].fail_under`).
- Dry-run validation: `forgelm --config config_template.yaml --dry-run` must succeed.
- Doc CI guards (Wave 3 / Wave 4 / Wave 5):
  - `python3 tools/check_bilingual_parity.py --strict`: H2/H3/H4 spine sync between EN and TR mirrors (39/39 pairs today).
  - `python3 tools/check_anchor_resolution.py --strict`: every relative markdown link with a `#anchor` fragment resolves to a real heading.
  - `python3 tools/check_cli_help_consistency.py --strict`: CLI `--help` output ↔ `docs/usermanuals/{en,tr}/reference/cli.md` parity.
From .github/workflows/nightly.yml:
- Unbounded upstream versions (latest TRL/PEFT/Unsloth) to catch breaking changes early.
- Supply-chain pass: `pip-audit` (CVE feed) + `bandit` (Python SAST). Provided by the `[security]` optional extra.
- Failure does not block PRs but triggers an issue.
From .github/workflows/publish.yml (release-tag trigger only):
- Cross-OS matrix: 3 OS × 4 Python = 12 combos installing the packaged wheel + running pytest + emitting a per-combo CycloneDX 1.5 SBOM. Every combo must pass before PyPI publish runs. No `fail-fast`: all 12 combos run to completion so the failure surface is visible.
`|| true` discipline.
Bare `<command> || true` in CI is forbidden: it converts non-zero exits into success and creates a fake-green status. The only sanctioned exception is the scanner + severity-tiering helper pattern:

    - run: |
        pip-audit --format json > pip-audit.json || true
        python3 tools/check_pip_audit.py pip-audit.json

In this shape, `|| true` only swallows the scanner's non-zero exit code so the step reaches the helper, which reads the scanner's JSON output; the helper itself enforces the actual severity gate. Allowed only when:
- The `|| true` is on the immediately preceding scanner line (not on a `pytest`, `ruff`, or test command).
- The next line in the same `run:` block invokes a `tools/check_*.py` helper that does its own severity-tiered exit code.
- Both lines live in the same step (so the helper truly gates the scanner output).
The canonical examples are `tools/check_pip_audit.py` (CVE severity tier) and `tools/check_bandit.py` (issue-severity tier). Any other `|| true` in CI must be replaced with `continue-on-error: true` at the YAML step level, and that step must explicitly document the rationale in a YAML `#` comment.
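
To make the helper's role concrete, here is a rough sketch of a severity gate. It is not the contents of `tools/check_pip_audit.py` or `tools/check_bandit.py`: the flat list-of-findings input with a `severity` field is an assumed, simplified schema (not the scanners' real JSON), and `BLOCKING`, `gate`, and `main` are hypothetical names.

```python
import json
import sys
from pathlib import Path

# Hypothetical policy: severities that fail the build.
BLOCKING = {"high", "critical"}


def gate(findings: list[dict]) -> int:
    """Return 1 if any finding has a blocking severity, else 0."""
    blocking = [f for f in findings if f.get("severity", "").lower() in BLOCKING]
    for finding in blocking:
        print(f"BLOCKING {finding.get('id', '?')}: {finding['severity']}", file=sys.stderr)
    return 1 if blocking else 0


def main(argv: list[str]) -> int:
    """argv[1] is the path to the scanner's JSON report."""
    report = Path(argv[1])
    if not report.exists():
        # The scanner's own exit code was discarded, so a missing report
        # must fail closed rather than pass silently.
        print(f"{report} not found", file=sys.stderr)
        return 2
    return gate(json.loads(report.read_text(encoding="utf-8")))
```

A script built this way would end with `sys.exit(main(sys.argv))`. The fail-closed branch matters because `|| true` hides a crashed scanner; the helper is then the only line that can turn the step red.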
Template for a new module's test file:

    """Tests for forgelm.<module>."""
    import pytest

    # ConfigError and select_trainer below are placeholders for the module's
    # own exception type and function; import them alongside <public_api>.
    from forgelm.<module> import <public_api>


    class TestPublicFunction:
        def test_happy_path(self, tmp_path):
            result = <public_api>(...)
            assert result.status == "ok"

        def test_raises_on_invalid_input(self):
            with pytest.raises(ConfigError, match="trainer_type"):
                <public_api>(bad_input)

        @pytest.mark.parametrize("value,expected", [
            ("sft", "SFTTrainer"),
            ("dpo", "DPOTrainer"),
        ])
        def test_trainer_class_selection(self, value, expected):
            assert select_trainer(value).__name__ == expected

| Anti-pattern | Why rejected | Correct form |
|---|---|---|
| `pytest.skip("not yet implemented")` | Tracks zero actual behaviour | Use `pytest.mark.xfail(reason="...")` with an issue link |
| Test that just calls the function and asserts no exception | Proves nothing | Assert specific returns/side effects |
| `sleep(2); assert something` | Flaky by construction | Use `monkeypatch` for time, or event-based waits |
| `print()` to debug tests | Noisy output | Use `caplog` fixture for log assertions |
| `@pytest.fixture(scope="session")` for mutable state | Cross-test pollution | Scope to function unless immutable |

- `pytest tests/` passes locally
- `ruff check . && ruff format --check .` passes
- `forgelm --config config_template.yaml --dry-run` succeeds if you touched CLI or trainer
- If docs touched: `python3 tools/check_bilingual_parity.py --strict`, `python3 tools/check_anchor_resolution.py --strict`, `python3 tools/check_cli_help_consistency.py --strict`
- New public function or class has tests covering happy path + one error path
- Any new exit code or exception is tested
- No GPU or network required for new unit tests