Scope. This page covers the SFT-row schema. For the modern data pipeline (DPO/SimPO/KTO/GRPO row formats, audit, ingestion from raw docs, PII / secrets masking, multi-dataset mixing), see:
- Document Ingestion Guide: forgelm ingest (raw → JSONL).
- Dataset Audit Guide: forgelm audit pre-flight gate (PII, secrets, near-duplicate, leakage, quality).
- Dataset Formats: per-trainer format reference (SFT messages, DPO chosen/rejected, KTO completion + label, GRPO prompt only).
- forgelm.data.prepare_dataset: Python API.
ForgeLM uses the Hugging Face datasets library under the hood. It can load thousands of datasets from the HF Hub, but each dataset must follow a specific column structure so that ForgeLM can format it for supervised fine-tuning.
ForgeLM's data processor expects an Instruction/Response structured dataset.
If loading via the Hugging Face Hub (e.g., dataset_name_or_path: "HuggingFaceH4/ultrachat_200k"), or via a local JSONL file, ForgeLM attempts to parse the rows looking for conversational columns.
The processor attempts to map the following columns to conversation roles:
- System Context (Optional): If your dataset has a System column, it will be injected. Otherwise, it is left blank.
- User Prompt (Required): Looked for in the User, instruction, or text column.
- Assistant Response (Required): Looked for in the Assistant, output, or response column.
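The column lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not ForgeLM's actual implementation: the helper names (first_present, extract_turns), the lookup order within each group, and the decision to raise on a missing required field are all assumptions made for the example.

```python
import json

# Candidate column names, taken from the list above. The lookup order and the
# functions below are illustrative assumptions, not ForgeLM's actual code.
SYSTEM_COLUMNS = ("System",)
USER_COLUMNS = ("User", "instruction", "text")
ASSISTANT_COLUMNS = ("Assistant", "output", "response")


def first_present(row, candidates):
    """Return the first non-empty string found under any candidate column, else None."""
    for name in candidates:
        value = row.get(name)
        if isinstance(value, str) and value.strip():
            return value
    return None


def extract_turns(row):
    """Map one dataset row to (system, user, assistant); raise if a required field is missing."""
    user = first_present(row, USER_COLUMNS)
    assistant = first_present(row, ASSISTANT_COLUMNS)
    if user is None or assistant is None:
        raise ValueError(f"row lacks a user prompt or assistant response: {sorted(row)}")
    system = first_present(row, SYSTEM_COLUMNS) or ""  # optional, blank when absent
    return system, user, assistant


# One JSONL line parsed into a row, then mapped to the three roles.
line = '{"instruction": "What is a loop?", "output": "A loop repeats a block of code."}'
print(extract_turns(json.loads(line)))
```

Running a check like this over your own rows before training is a cheap way to catch rows that have neither a recognized prompt column nor a recognized response column.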
If you are bringing custom company data, format it into a .jsonl file where each line is a JSON object. Set the dataset_name_or_path in your config to the absolute path of this file.
{"System": "You are a helpful Python coding assistant.", "User": "How do I reverse a list?", "Assistant": "You can use `[::-1]` or the `.reverse()` method."}
{"System": "You are a helpful Python coding assistant.", "User": "What is a loop?", "Assistant": "A loop is used to iterate over a sequence."}
In 2026, modern conversational fine-tuning no longer relies on manual string formatting (e.g., [SYSTEM]...[USER]...).
Instead, ForgeLM uses Hugging Face's tokenizer.apply_chat_template(). The chat template ships with each model's tokenizer, so ForgeLM formats your data into the native conversational token structure of the model you are using, whether Llama-3, Mistral, Gemma, or Qwen (e.g., the ChatML-style <|im_start|>user\n...<|im_end|>).
Training on the format that the model's own tokenizer defines avoids prompt-format mismatches between fine-tuning and inference, and it requires no manual formatting on your part.
Ensure your base model supports chat templates. If it does not, ForgeLM will fall back to a generic bounding token structure.
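The templating step and its fallback can be sketched as follows. The function names (row_to_messages, render_example) and the fallback layout are hypothetical, since this page does not specify the format of ForgeLM's generic fallback. The only library call used is apply_chat_template(messages, tokenize=False) from Hugging Face tokenizers, and a small stand-in class replaces a real tokenizer so the example runs without downloading a model.

```python
def row_to_messages(system, user, assistant):
    """Build the role/content message list that chat templates consume."""
    messages = []
    if system:  # the system turn is optional and skipped when blank
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user})
    messages.append({"role": "assistant", "content": assistant})
    return messages


def render_example(tokenizer, messages):
    """Render messages to training text, using the tokenizer's chat template when it has one."""
    if getattr(tokenizer, "chat_template", None):
        # tokenize=False returns the formatted string instead of token ids.
        return tokenizer.apply_chat_template(messages, tokenize=False)
    # Hypothetical generic fallback; the real fallback format is not documented here.
    return "\n".join(f"<{m['role']}>\n{m['content']}" for m in messages)


class NoTemplateTokenizer:
    """Stand-in for a base-model tokenizer that defines no chat template."""
    chat_template = None


messages = row_to_messages("", "How do I reverse a list?", "Use [::-1] or .reverse().")
print(render_example(NoTemplateTokenizer(), messages))
```

With a real model you would pass a tokenizer loaded through transformers.AutoTokenizer.from_pretrained(...) in place of the stand-in. Checking tokenizer.chat_template up front shows whether your base model would take the templated path or the fallback.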