
feat: new release version #108

Merged
n0w0f merged 10 commits into main from development
Mar 4, 2026

Conversation

@n0w0f
Collaborator

@n0w0f n0w0f commented Mar 4, 2026

  • Updated the TRL and Accelerate APIs
  • Expanded DDP support
  • Updated xtal2txt

Summary by Sourcery

Add distributed data parallel (DDP)-aware training, inference, and logging across finetuning pipelines, update library versions and dependencies for a new release, and modernize installation, testing, and docs workflows.

New Features:

  • Support completion-only SFT training for LLaMA fine-tuning with prompt/completion columns and configurable context length.
  • Allow skipping specific folds during experiments via configuration.
  • Add pre-commit configuration and CI job to enforce formatting and linting.

Bug Fixes:

  • Prevent DDP checkpoint loading conflicts by disabling incompatible training arguments like load_best_model_at_end and save_on_each_node when running with multiple processes.
  • Avoid DistributedDataParallel misuse by letting Hugging Face trainers handle DDP instead of wrapping models manually.
  • Ensure result directories are created safely using os.makedirs with exist_ok=True.
  • Fix logging calls that previously passed multiple arguments incorrectly to the logger.
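
The checkpoint-conflict and directory fixes above can be sketched in a few lines of plain Python. `ddp_safe_training_args` and `ensure_result_dir` are hypothetical helper names for illustration, not code from the PR:

```python
import os


def ddp_safe_training_args(config_dict: dict, world_size: int) -> dict:
    """Drop options that conflict with DDP checkpointing (hypothetical helper).

    With multiple processes, each rank would otherwise race on the same
    checkpoint files when loading/saving the best model."""
    args = dict(config_dict)
    if world_size > 1:
        args["load_best_model_at_end"] = False
        args["save_on_each_node"] = False
    return args


def ensure_result_dir(path: str) -> str:
    # exist_ok=True makes this safe even if another rank created it first.
    os.makedirs(path, exist_ok=True)
    return path
```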

Enhancements:

  • Make all finetuning and benchmarking flows DDP-aware by centralizing rank detection, avoiding redundant WandB usage, and delegating synchronization to the trainer/accelerate.
  • Improve benchmarking to only run inference, logging, and result saving on the main process while synchronizing other ranks between folds.
  • Refine saving, evaluation, and cleanup logic so that only the main process writes checkpoints, predictions, and finishes WandB runs.
  • Switch project installation guidance and CI/docs workflows to use uv for dependency management and environment setup.
  • Document system requirements, optional OpenBabel installation methods, and development setup steps in README and getting-started docs.
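
The main-process gating described above can be approximated with a stdlib-only check; this is a simplified stand-in (based on the env vars torchrun-style launchers set) for accelerate's `PartialState().is_main_process` used in the PR:

```python
import os


def is_main_process() -> bool:
    """True for single-process runs (no RANK/LOCAL_RANK in the environment)
    and for global rank 0 under a distributed launcher.

    Simplified sketch; the PR itself relies on accelerate's PartialState."""
    return int(os.environ.get("RANK", os.environ.get("LOCAL_RANK", "0"))) == 0
```

In the PR, `wandb.init`, inference, and result saving are gated on this kind of check, with `state.wait_for_everyone()` acting as the barrier between folds.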

Build:

  • Bump project version to 0.0.2 and pin supported Python versions to 3.10–3.11 with numpy<2.0.
  • Point xtal2txt dependency to the maintained GitHub repository and expand optional decoder extras accordingly.

CI:

  • Update tests workflow to run on Python 3.10 and 3.11 using uv and system-level OpenBabel/FFTW, and disable fail-fast in the matrix.
  • Update docs workflow to use uv-based installs for docs extras.
  • Add a pre-commit workflow to run checks on pushes and pull requests to main and development branches.

Documentation:

  • Revise README and getting-started documentation to highlight uv-based installation, supported Python versions, dev setup, and OpenBabel installation options.

Summary by CodeRabbit

  • New Features

    • Add CI pre-commit workflow and uv-based install steps; explicit Python 3.10/3.11 support.
  • Improvements

    • Safer multi-GPU/distributed behavior and reduced duplicate telemetry; richer logging and main-process-only telemetry actions.
    • Modernized type hints and safer defaults to avoid mutable-default issues.
  • Documentation

    • Updated installation/getting-started, API docs, and tokenizer/representation docs.
  • Chores

    • Version bumped to 0.0.2 and dependency constraints updated; added pre-commit tooling.

@sourcery-ai
Contributor

sourcery-ai bot commented Mar 4, 2026

Reviewer's Guide

Updates MatText to version 0.0.2 with TRL/Accelerate-style SFT training, improved DDP handling (single-rank logging/checkpointing, proper synchronization), updated xtal2txt dependency, and a uv-based Python 3.10/3.11-only installation and CI setup.

Sequence diagram for DDP benchmarking and finetuning workflow

sequenceDiagram
    actor User
    participant CLI as CLI_main_py
    participant TaskRunner
    participant Benchmark as Matbenchmark
    participant BaseBench as BaseBenchmark
    participant Finetuner as FinetuneModel
    participant LlamaSFT as LlamaSFTTrainer
    participant Wandb as wandb
    participant HF as SFTTrainer
    participant Dist as PartialState

    User->>CLI: run python main.py -cn=benchmark ...
    CLI->>CLI: read cfg
    CLI->>CLI: read LOCAL_RANK env
    CLI->>CLI: set local_rank=None or int
    CLI->>TaskRunner: create TaskRunner()
    CLI->>TaskRunner: initialize_wandb() only if main process
    CLI->>TaskRunner: run_benchmarking(cfg.model, local_rank)

    TaskRunner->>Benchmark: Matbenchmark.run_benchmarking(local_rank)
    Benchmark->>Dist: create PartialState()
    loop folds
        Benchmark->>Benchmark: for each exp_name, test_name
        alt main process (state.is_main_process)
            Benchmark->>Wandb: wandb.init(config=finetune_cfg,..., name=exp_name)
        end
        Benchmark->>BaseBench: _run_experiment(task, i, exp_name, test_name, local_rank)

        BaseBench->>Dist: create PartialState()
        BaseBench->>BaseBench: log fold and dataset info
        BaseBench->>BaseBench: prepare exp_cfg (train and test paths)
        BaseBench->>Benchmark: _get_finetuner(exp_cfg, local_rank, fold_name)
        Benchmark-->>BaseBench: FinetuneModel instance

        BaseBench->>Finetuner: finetune()
        Finetuner->>LlamaSFT: construct finetune wrapper with cfg, local_rank
        LlamaSFT->>LlamaSFT: prepare_data, prepare_test_data, _setup_model_tokenizer
        LlamaSFT->>LlamaSFT: _add_prompt_completion_columns(train_dataset)
        LlamaSFT->>LlamaSFT: _add_prompt_completion_columns(eval_dataset)
        LlamaSFT->>LlamaSFT: build SFTConfig(config_dict, packing=False, completion_only_loss=True, max_length)
        LlamaSFT->>HF: create SFTTrainer(model, peft_config, train_dataset, eval_dataset, processing_class=tokenizer, args=SFTConfig, callbacks)

        alt main process (local_rank is None or 0)
            LlamaSFT->>Wandb: wandb.log(Training Arguments)
            LlamaSFT->>Wandb: wandb.log(model_summary)
        end

        LlamaSFT->>HF: trainer.train()
        HF-->>LlamaSFT: trained model

        alt main process
            LlamaSFT->>HF: trainer.save_state()
            LlamaSFT->>HF: trainer.save_model(output_dir)
            LlamaSFT->>LlamaSFT: tokenizer.save_pretrained(...)
            LlamaSFT->>HF: create pipeline(text-generation, model, tokenizer, completion_only output)
            LlamaSFT->>HF: pipe(formatting_tests_func(testdata))
            HF-->>LlamaSFT: predictions
            LlamaSFT->>LlamaSFT: write predictions to json
            LlamaSFT->>Wandb: wandb.finish()
        end

        LlamaSFT->>LlamaSFT: cleanup (del trainer, model, tokenizer, gc.collect)
        LlamaSFT-->>Finetuner: finetuned checkpoint path
        Finetuner-->>BaseBench: checkpoint path

        alt main process (state.is_main_process)
            BaseBench->>Wandb: wandb.init(config=inference_cfg,..., name=test_name)
            BaseBench->>BaseBench: set inference test_data and pretrained_checkpoint
            BaseBench->>BaseBench: _get_inference(exp_cfg, fold_name)
            BaseBench->>BaseBench: predict.predict()
            BaseBench-->>BaseBench: predictions, prediction_ids
            BaseBench->>BaseBench: _record_predictions(task, i, predictions, prediction_ids)
        else non main ranks
            BaseBench->>BaseBench: skip inference and logging
        end

        BaseBench->>Dist: state.wait_for_everyone() synchronize ranks

        alt main process
            BaseBench->>Wandb: wandb.finish()
        end
    end

    alt main process in Benchmark
        Benchmark->>Benchmark: _save_results(task)
    end
    Benchmark-->>TaskRunner: benchmarking complete
    TaskRunner-->>CLI: return
    CLI-->>User: benchmarking finished

Updated class diagram for benchmarking and training orchestration

classDiagram
    class TaskRunner {
        +_run_experiment(task_cfg, experiment_type, local_rank, use_folds, use_train_data_path)
        +run_benchmarking(task_cfg, local_rank)
        +initialize_wandb()
    }

    class BaseBenchmark {
        <<abstract>>
        -train_data
        -test_data
        -task_cfg
        -benchmark_save_path
        +_initialize_task()*
        +_run_experiment(task, i, exp_name, test_name, local_rank)
        +_get_finetuner(exp_cfg, local_rank, fold_name)*
        +_get_inference(exp_cfg, fold_name)*
        +_record_predictions(task, fold, predictions, prediction_ids)
        +_save_results(task)
    }

    class Matbenchmark {
        +run_benchmarking(local_rank)
        +_get_finetuner(exp_cfg, local_rank, fold_name)
        +_get_inference(exp_cfg, fold_name)
        +_initialize_task()
    }

    class MatbenchmarkClassification {
        +run_benchmarking(local_rank)
        +_get_finetuner(exp_cfg, local_rank, fold_name)
        +_get_inference(exp_cfg, fold_name)
        +_initialize_task()
    }

    class FinetuneModel {
        -cfg
        -local_rank
        -tokenized_dataset
        -_wrapped_tokenizer
        +finetune() str
        +_callbacks()
        +_compute_metrics(eval_pred)
        +get_best_metric()
        +is_greater_better()
    }

    class PotentialModel {
        -cfg
        -local_rank
        -tokenized_dataset
        -tokenized_testset
        -_wrapped_tokenizer
        +finetune() str
        +evaluate()
        +_callbacks()
        +_compute_metrics(eval_pred)
    }

    class LlamaSFTTrainer {
        -cfg
        -local_rank
        -dataset
        -testdata
        -model
        -tokenizer
        -peft_config
        -property_
        -material_
        -representation
        -test_sample_size
        +__init__(cfg, local_rank, fold, train_data, test_data, representation, test_sample_size)
        +prepare_data(train_subset)
        +prepare_test_data(test_subset)
        +_setup_model_tokenizer()
        +_callbacks()
        +_add_prompt_completion_columns(dataset)
        +formatting_prompts_func(example)
        +formatting_tests_func(example)
        +finetune() str
    }

    class SFTTrainer {
        +train()
        +evaluate(eval_dataset)
        +save_model(output_dir)
        +save_state()
    }

    class Trainer {
        +train()
        +evaluate(eval_dataset)
        +save_model(output_dir)
    }

    class TrainingArguments {
    }

    class SFTConfig {
    }

    class PartialState {
        +is_main_process bool
        +wait_for_everyone()
    }

    class wandb {
        +init(config, project, name)
        +log(data)
        +finish()
    }

    TaskRunner --> Matbenchmark : uses
    TaskRunner --> MatbenchmarkClassification : uses

    BaseBenchmark <|-- Matbenchmark
    BaseBenchmark <|-- MatbenchmarkClassification

    Matbenchmark --> FinetuneModel : _get_finetuner()
    MatbenchmarkClassification --> FinetuneModel : _get_finetuner()

    BaseBenchmark --> LlamaSFTTrainer : finetune for Llama models
    BaseBenchmark --> PotentialModel : finetune for potential models

    FinetuneModel --> Trainer : uses
    PotentialModel --> Trainer : uses
    LlamaSFTTrainer --> SFTTrainer : uses

    FinetuneModel --> TrainingArguments : builds
    PotentialModel --> TrainingArguments : builds
    LlamaSFTTrainer --> SFTConfig : builds

    BaseBenchmark --> PartialState : controls DDP flow
    Matbenchmark --> PartialState : controls DDP flow
    MatbenchmarkClassification --> PartialState : controls DDP flow

    BaseBenchmark --> wandb : experiment and inference logging
    Matbenchmark --> wandb : benchmarking run logging
    MatbenchmarkClassification --> wandb : benchmarking run logging
    LlamaSFTTrainer --> wandb : training logging
    FinetuneModel --> wandb : training logging
    PotentialModel --> wandb : training logging

File-Level Changes

Change Details Files
Switch llama SFT finetuning to TRL SFTTrainer with completion-only loss and DDP-safe logging/checkpointing.
  • Introduce helper to map datasets to prompt/completion columns and enable completion_only_loss-style training.
  • Replace TrainingArguments+DataCollatorForCompletionOnlyLM with SFTConfig and SFTTrainer, set max_length by representation, and pass tokenizer via processing_class instead of tokenizer/data_collator.
  • Guard wandb logging, model saving, and inference/prediction steps so they run only on rank 0, and adjust generation parameters (max_new_tokens, answer template).
  • Simplify teardown: rely on TRL/Trainer for DDP sync, delete trainer/model/tokenizer, and call garbage collection once.
src/mattext/models/llama_sft.py
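
The prompt/completion mapping for completion-only loss can be sketched as below; the format string, 3-decimal rounding, and "@@@" terminator mirror the helper shown in the review's code context, while the `dataset.map`/`remove_columns` plumbing is omitted:

```python
def to_prompt_completion(example: dict, property_name: str, representation: str) -> dict:
    """Shape one record into the prompt/completion columns that TRL's
    completion-only SFT loss consumes (format taken from the PR's helper)."""
    return {
        "prompt": (
            f"### What is the {property_name} of {example[representation]}\n ### Answer:"
        ),
        "completion": f" {example['labels']:.3f}@@@",
    }
```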
Make benchmarking and main entrypoint DDP-aware by only running logging/inference on main process and synchronizing workers.
  • Instantiate accelerate.PartialState in benchmark flows, run wandb.init/inference/_save_results only when state.is_main_process, and sync ranks after each fold with wait_for_everyone().
  • Fix logger usage when printing fold/checkpoint names to use f-strings instead of passing multiple args.
  • In main TaskRunner, support skipping specific folds via a model.skip_folds list and restrict wandb.init/finish to main rank.
  • Add LOCAL_RANK-based detection of DDP vs single-process in main(), print diagnostics, and only initialize wandb on main rank.
src/mattext/models/benchmark.py
src/mattext/main.py
Update potential and generic finetune trainers to be DDP-safe and align with newer Trainer APIs.
  • Wrap training_arguments in a dict, disable load_best_model_at_end and save_on_each_node in DDP mode to avoid checkpoint contention.
  • Stop manually wrapping models in DistributedDataParallel and instead rely on HF Trainer’s internal DDP support, only moving models to local_rank or CUDA.
  • Switch Trainer args from tokenizer= to processing_class= for compatibility with newer TRL/Transformers APIs.
  • Limit wandb logging and model saving to main rank while still calling trainer.evaluate on all ranks to satisfy distributed collectives.
src/mattext/models/potential.py
src/mattext/models/finetune.py
src/mattext/models/llama.py
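
The `tokenizer=` to `processing_class=` rename can be bridged with a small compatibility shim; this is a hedged sketch (not code from the PR) that inspects the trainer's signature to pick whichever keyword exists:

```python
import inspect


def tokenizer_kwarg_for(trainer_cls) -> str:
    """Newer Transformers/TRL releases renamed Trainer's `tokenizer` argument
    to `processing_class`; pick the one the installed version accepts."""
    params = inspect.signature(trainer_cls.__init__).parameters
    return "processing_class" if "processing_class" in params else "tokenizer"
```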
Modernize installation/docs/CI to target Python 3.10–3.11, use uv, and pin critical dependencies.
  • Document uv-based installation and development workflow in README and docs/getting_started, including optional pre-commit setup and OS-specific OpenBabel instructions.
  • Update GitHub Actions test workflow to use setup-python + uv, install system OpenBabel/FFTW, and test on Python 3.10 and 3.11 with fail-fast disabled.
  • Update docs workflow to use uv and a constrained Python 3.10.19.
  • Constrain supported Python versions in pyproject.toml to >=3.10,<3.12, drop 3.9/3.12 classifiers, pin numpy<2.0 globally and in decoder extra, and point xtal2txt dependency to the lamalab-org GitHub repo.
README.md
docs/getting_started.md
.github/workflows/tests.yml
.github/workflows/docs.yml
pyproject.toml
Introduce project tooling for formatting/linting and enforce it in CI, along with a version bump.
  • Add pre-commit configuration with basic hygiene hooks, Ruff (lint+format), and docformatter.
  • Create a pre-commit GitHub Actions workflow that runs on pushes/PRs against main and development.
  • Bump project version from 0.0.1 to 0.0.2 and synchronize bumpversion current_version to 0.0.2-dev.
.pre-commit-config.yaml
.github/workflows/pre-commit.yml
pyproject.toml


@coderabbitai

coderabbitai bot commented Mar 4, 2026

Warning

Rate limit exceeded

@n0w0f has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minute and 18 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 70ba357f-3151-41ab-a2e1-684214fe28fc

📥 Commits

Reviewing files that changed from the base of the PR and between 9fbea01 and 887fba3.

📒 Files selected for processing (1)
  • src/mattext/models/llama_sft.py
📝 Walkthrough

Walkthrough

Pin CI to Python 3.10.19, add uv-based installs and pre-commit CI; modernize typing and defaults; add DDP-aware control flow (LOCAL_RANK), main-process-gated WandB and synchronization barriers; migrate some training to SFTTrainer/SFTConfig; many doc, config, and formatting tweaks across repo.

Changes

Cohort / File(s) Summary
CI & Workflows
/.github/workflows/docs.yml, /.github/workflows/tests.yml, /.github/workflows/pre-commit.yml
Pin Python to 3.10.19, add astral-sh/setup-uv and uv-based install steps, broaden test matrix, replace Mambaforge setup, disable fail-fast, add separate pre-commit workflow.
Pre-commit & Linting
.pre-commit-config.yaml, pyproject.toml
Add pre-commit hooks (ruff, docformatter, pre-commit-hooks), add pre-commit to dev extras, update ruff/isort/tooling config.
Docs & README
README.md, docs/getting_started.md, docs/*, mkdocs.yml
Switch install/dev instructions to uv, add OpenBabel/system dependency guidance, update docs/examples and formatting.
Project Metadata
pyproject.toml
Bump version to 0.0.2, tighten Python requirement to >=3.10,<3.12, add numpy<2.0/protobuf constraints, add tensorflow/keras deps, replace xtal2txt with git URL, tooling updates.
Core CLI / Entry
src/mattext/main.py
Add LOCAL_RANK handling and DDP mode detection, change _run_experiment's local_rank typing to `int | None`.
Benchmark & Inference
src/mattext/models/benchmark.py, src/mattext/models/inference.py
Introduce PartialState synchronization, main-process-only WandB/logging/save, add state.wait_for_everyone barriers, use strict=False in zips, improve pathlib usage and exception logging.
Finetune / Trainer changes
src/mattext/models/finetune.py, src/mattext/models/llama.py, src/mattext/models/potential.py
Modernize type hints, remove manual DDP wrapping, build TrainingArguments from dict to toggle DDP-safe options, switch Trainer `tokenizer` → `processing_class`, restrict WandB and saving to main process.
SFT migration & LLM SFT
src/mattext/models/llama_sft.py
Migrate to SFTConfig/SFTTrainer, add test_sample_size and _add_prompt_completion_columns, use completion-only loss and packing settings, ensure main-process-only saving/inference and WandB lifecycle.
Dataprep & Timeouts
src/mattext/dataprep/*, revision-scripts/*
Replace mutable defaults with None, add TimeoutException, safer initialization, progress logging, and numerous typing modernizations.
Typing & Misc. Modernization
many scripts/*, src/mattext/*, tests/*
Replace typing generics with built-in generics (list/dict), remove unused imports, adjust docstrings and small formatting across many files.
Small/Formatting Changes
many conf/*, LICENSE, notebooks/*
Whitespace, EOF newline fixes, and minor YAML formatting across archived/experiment configs.
sequenceDiagram
  autonumber
  participant Launcher as Launcher
  participant Worker as Worker (rank N)
  participant Trainer as Trainer
  participant WandB as WandB (main)
  participant FS as FileSystem

  Launcher->>Worker: start process (set LOCAL_RANK)
  Worker->>Trainer: initialize Trainer/model (receive local_rank)
  alt main process (is_main)
    Trainer->>WandB: wandb.init() (gated)
    WandB-->>Trainer: run handle
  end
  Trainer->>Trainer: train / evaluate across folds
  Trainer->>Worker: state.wait_for_everyone() (barrier)
  alt main process (is_main)
    Trainer->>FS: save checkpoints & write predictions/results
    Trainer->>WandB: log metrics / artifacts
    WandB-->>Trainer: finish
  end
Loading

🎯 4 (Complex) | ⏱️ ~50 minutes

🐰 I hopped through configs, tidy and spry,
uv carried installs up to the sky,
Ranks queued neatly, waiting their turn,
WandB logs only where leaders return,
A rabbit applauds — code ready to fly!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 26.09%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Title check (❓ Inconclusive): the title "feat: new release version" is vague and generic. While it indicates a feature release, it does not convey the specific, substantial changes in the PR (DDP support, SFT training, version bump, Python 3.10-3.11, modernized workflows, etc.). Consider a more descriptive title that highlights the main changes, such as "feat: DDP support and LLaMA SFT training with modernized CI/workflows" or "feat: v0.0.2 release with distributed training and Python 3.10-3.11".

✅ Passed checks (1 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.


Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey, I've found 5 issues and left some high-level feedback:

  • In several places you gate logic on is_main or PartialState().is_main_process, but you still call collective operations (e.g., trainer.evaluate, state.wait_for_everyone) on all ranks—double‑check that every DDP code path consistently calls these in the same order and that PartialState is imported wherever it’s used to avoid runtime issues.
  • The new _add_prompt_completion_columns helper in llama_sft.py hardcodes the prompt format and numeric formatting ({example['labels']:.3f} and @@@ terminator); consider making these configurable via the config or class attributes so different tasks/properties can adjust formatting without modifying code.
  • You now infer skip_folds in main._run_experiment via task_cfg.model.get("skip_folds", []); it would be safer to normalize this value (e.g., always a list of ints) and validate it against range(task_cfg.model.fold) to avoid subtle misconfigurations or type mismatches in user configs.
## Individual Comments

### Comment 1
<location path="src/mattext/models/llama_sft.py" line_range="127-136" />
<code_context>
             output_texts.append(text)
         return output_texts

+    def _add_prompt_completion_columns(self, dataset):
+        """Add prompt and completion columns for completion-only loss."""
+        def _map_fn(example):
+            return {
+                "prompt": f"### What is the {self.property_} of {example[self.representation]}\n ### Answer:",
+                "completion": f" {example['labels']:.3f}@@@",
+            }
+        dataset = dataset.map(_map_fn)
+        # Remove original columns to avoid conflicts with the trainer's label handling
+        cols_to_remove = [c for c in dataset.column_names if c not in ("prompt", "completion")]
+        return dataset.remove_columns(cols_to_remove)
+
     def formatting_tests_func(self, example):
</code_context>
<issue_to_address>
**issue (bug_risk):** SFTTrainer will not find a `dataset_text_field` (`text` by default), so training is likely to fail or ignore your prompt/completion columns.

The new helper now leaves the dataset with only `prompt` and `completion` columns, but `SFTTrainer` is still being used in its default mode, which looks for a single `"text"` column (or whatever is passed as `dataset_text_field`). With this setup:

- The datasets only expose `"prompt"` and `"completion"`.
- `SFTTrainer` is not told how to consume these columns via `dataset_text_field` / `dataset_kwargs`.

That will usually either fail at construction or cause the trainer to ignore the intended fields. You’ll need to either:

- Go back to a single `"text"` column with `prompt + completion` and use `DataCollatorForCompletionOnlyLM`, or
- Explicitly configure `dataset_text_field` / `dataset_kwargs` so `SFTTrainer` can use separate prompt/completion fields (depending on your TRL version).
</issue_to_address>

### Comment 2
<location path="src/mattext/models/llama_sft.py" line_range="180-187" />
<code_context>
+                print(f"[Rank {self.local_rank}] WARNING: save_on_each_node should be False in DDP")
+                config_dict['save_on_each_node'] = False
+
+        max_length = 2048 if self.representation == "cif_p1" else 1024
+
+        training_args = SFTConfig(
+            **config_dict,
+            packing=False,
+            max_length=max_length,
+            completion_only_loss=True,
         )
+        callbacks = self._callbacks()
</code_context>
<issue_to_address>
**suggestion (bug_risk):** SFTConfig may ignore `max_length`; you likely want `max_seq_length` instead for SFTTrainer.

Here `max_length` is passed into `SFTConfig`, but `SFTTrainer` generally respects `max_seq_length` for truncation/packing. Depending on your TRL version, `max_length` may be ignored. To avoid silent misconfiguration, wire this through as `max_seq_length=max_length` (either in `SFTConfig` or the `SFTTrainer` call), matching the argument name `SFTTrainer` expects.

```suggestion
        max_length = 2048 if self.representation == "cif_p1" else 1024

        training_args = SFTConfig(
            **config_dict,
            packing=False,
            max_seq_length=max_length,
            completion_only_loss=True,
        )
```
</issue_to_address>

### Comment 3
<location path="src/mattext/models/llama.py" line_range="244-252" />
<code_context>

-        wandb.log({"Training Arguments": str(config_train_args)})
-        wandb.log({"model_summary": str(model)})
+        is_main = self.local_rank is None or self.local_rank == 0
+
+        if is_main:
</code_context>
<issue_to_address>
**issue (bug_risk):** In DDP, all ranks will call `save_model` to the same path, which can cause races or corrupted checkpoints.

You already use `is_main` to gate wandb logging, but `trainer.save_model` is still called on every rank. Please wrap the `save_model` call with the same `is_main` condition so only rank 0 writes the checkpoint:

```python
if is_main:
    trainer.save_model(
        f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
    )
```

The Trainer already synchronizes weights across ranks, so saving from a single process is enough.
</issue_to_address>

### Comment 4
<location path="README.md" line_range="23" />
<code_context>
-MatText is a framework for text-based materials modeling. It supports 
+MatText is a framework for text-based materials modeling. It supports

-- conversion of crystal structures in to text representations 
+- conversion of crystal structures in to text representations
 - transformations of crystal structures for sensitivity analyses
</code_context>
<issue_to_address>
**issue (typo):** Change "in to" to "into" for correct usage.

```suggestion
- conversion of crystal structures into text representations
```
</issue_to_address>

### Comment 5
<location path="README.md" line_range="125" />
<code_context>
 To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.


</code_context>
<issue_to_address>
**issue (typo):** Adjust "for e.g." to the standard "e.g.".

"For e.g." is nonstandard; use "e.g.," instead: `use ++, e.g., ++model.pretrain...`

```suggestion
To override the existing default configuration, use `++`, e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.
```
</issue_to_address>


Comment on lines +127 to +136
def _add_prompt_completion_columns(self, dataset):
"""Add prompt and completion columns for completion-only loss."""
def _map_fn(example):
return {
"prompt": f"### What is the {self.property_} of {example[self.representation]}\n ### Answer:",
"completion": f" {example['labels']:.3f}@@@",
}
dataset = dataset.map(_map_fn)
# Remove original columns to avoid conflicts with the trainer's label handling
cols_to_remove = [c for c in dataset.column_names if c not in ("prompt", "completion")]
issue (bug_risk): SFTTrainer will not find a dataset_text_field (text by default), so training is likely to fail or ignore your prompt/completion columns.

The new helper now leaves the dataset with only prompt and completion columns, but SFTTrainer is still being used in its default mode, which looks for a single "text" column (or whatever is passed as dataset_text_field). With this setup:

  • The datasets only expose "prompt" and "completion".
  • SFTTrainer is not told how to consume these columns via dataset_text_field / dataset_kwargs.

That will usually either fail at construction or cause the trainer to ignore the intended fields. You’ll need to either:

  • Go back to a single "text" column with prompt + completion and use DataCollatorForCompletionOnlyLM, or
  • Explicitly configure dataset_text_field / dataset_kwargs so SFTTrainer can use separate prompt/completion fields (depending on your TRL version).
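Whichever TRL mechanism ends up consuming the columns, the core idea of completion-only loss is that prompt positions are masked out of the labels. A minimal, library-free sketch of that masking (the whitespace "tokenizer" and the -100 ignore index are illustrative assumptions; real trainers use the model tokenizer and `DataCollatorForCompletionOnlyLM` handles this internally):

```python
IGNORE_INDEX = -100  # label positions with this value are skipped by the loss


def build_completion_only_labels(prompt: str, completion: str):
    """Toy whitespace 'tokenization' purely for illustration."""
    tokens = (prompt + completion).split()
    n_prompt = len(prompt.split())
    # Loss is computed only on completion tokens; prompt tokens are ignored.
    labels = [IGNORE_INDEX] * n_prompt + tokens[n_prompt:]
    return tokens, labels


tokens, labels = build_completion_only_labels(
    "### What is the band_gap of NaCl\n ### Answer:", " 1.234@@@"
)
```

Only the final `" 1.234@@@"` position carries a real label; everything before it is ignored by the loss.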

Comment on lines +244 to 252
is_main = self.local_rank is None or self.local_rank == 0

if is_main:
wandb.log({"Training Arguments": str(config_train_args)})
wandb.log({"model_summary": str(self.model)})

# Trainer handles all DDP synchronization automatically
trainer.train()
trainer.save_model(

issue (bug_risk): In DDP, all ranks will call save_model to the same path, which can cause races or corrupted checkpoints.

You already use is_main to gate wandb logging, but trainer.save_model is still called on every rank. Please wrap the save_model call with the same is_main condition so only rank 0 writes the checkpoint:

if is_main:
    trainer.save_model(
        f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
    )

The Trainer already synchronizes weights across ranks, so saving from a single process is enough.
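A minimal sketch of the rank check behind this kind of gating, assuming the launcher exports `LOCAL_RANK` (as torchrun and accelerate do); a plain single-process run, where the variable is absent, also counts as the main process:

```python
import os


def is_main_process() -> bool:
    # torchrun/accelerate set LOCAL_RANK per worker; absence means a plain
    # single-process run, which we also treat as the main process.
    local_rank = os.environ.get("LOCAL_RANK")
    return local_rank is None or int(local_rank) == 0


if is_main_process():
    pass  # trainer.save_model(...) and wandb logging would go here
```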


The `+` symbol before a configuration key indicates that you are adding a new key-value pair to the configuration. This is useful when you want to specify parameters that are not part of the default configuration.

To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.

issue (typo): Adjust "for e.g." to the standard "e.g.".

"For e.g." is nonstandard; use "e.g.," instead: use ++, e.g., ++model.pretrain...

Suggested change
To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.
To override the existing default configuration, use `++`, e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mattext/models/llama.py (1)

244-260: ⚠️ Potential issue | 🔴 Critical

Guard post-training logging and saving operations behind is_main.

In a distributed data parallel (DDP) setup, all ranks execute the training loop. Currently, lines 252–260 are executed by every rank, causing multiple processes to:

  • Write to the same output paths via trainer.save_model() and self.model.save_pretrained()
  • Call W&B logging and cleanup APIs via wandb.log() and wandb.finish(), which only the main process should manage

This results in file conflicts, W&B API errors, and potential artifact corruption. The trainer.evaluate() call is safe to execute from all ranks (Trainer handles DDP synchronization), but logging results and saving models must be restricted to the main process.

Proposed fix
        # Trainer handles all DDP synchronization automatically
        trainer.train()
-       trainer.save_model(
-           f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
-       )

        eval_result = trainer.evaluate(eval_dataset=self.tokenized_dataset["test"])
-       wandb.log(eval_result)
-
-       self.model.save_pretrained(self.cfg.path.finetuned_modelname)
-       wandb.finish()
+       if is_main:
+           wandb.log(eval_result)
+           trainer.save_model(
+               f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
+           )
+           self.model.save_pretrained(self.cfg.path.finetuned_modelname)
+           wandb.finish()
        return self.cfg.path.finetuned_modelname
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama.py` around lines 244 - 260, Guard the post-training
logging and saving so only the main rank performs them: keep trainer.train() and
trainer.evaluate(eval_dataset=self.tokenized_dataset["test"]) as-is (Trainer
handles DDP), but wrap calls to wandb.log({"Training Arguments": ...}),
wandb.log({"model_summary": ...}), trainer.save_model(...),
wandb.log(eval_result), self.model.save_pretrained(...), and wandb.finish() with
the existing is_main check (the is_main variable defined earlier) so only when
is_main is True these logging/saving/cleanup operations run.
🧹 Nitpick comments (4)
.github/workflows/tests.yml (1)

27-30: Consider pinning uv version for consistency.

Same as in docs.yml, using version: "latest" can cause reproducibility issues. Pin to a specific version for predictable CI runs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/tests.yml around lines 27 - 30, The workflow currently
pins the setup action to astral-sh/setup-uv@v3 but uses version: "latest" which
harms reproducibility; change the version field under the Install uv step to a
specific, tested semantic version (e.g., "1.2.3") instead of "latest" so CI runs
are deterministic and match docs.yml's pinned version.
pyproject.toml (1)

32-32: Consider pinning xtal2txt to a specific commit or tag.

Installing from git+https://github.com/lamalab-org/xtal2txt.git (HEAD) can lead to non-reproducible builds if the upstream repository changes. Consider pinning to a specific commit hash or release tag for stability.

Example with commit hash
-  "xtal2txt @ git+https://github.com/lamalab-org/xtal2txt.git",
+  "xtal2txt @ git+https://github.com/lamalab-org/xtal2txt.git@v1.0.0",  # or `@abc1234` for commit
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 32, The xtal2txt dependency is installed from the
repository HEAD which makes builds non-reproducible; update the dependency
string in pyproject.toml (the "xtal2txt @
git+https://github.com/lamalab-org/xtal2txt.git" entry) to pin to a specific
commit hash or release tag (e.g., append @<commit-hash> or @<tag> to the git
URL) so installations are deterministic, and update any related docs or
changelogs to note the chosen pinned ref.
.github/workflows/docs.yml (1)

21-24: Consider pinning uv version for reproducibility.

Using version: "latest" can lead to inconsistent CI behavior if a breaking change is introduced in a future uv release. Consider pinning to a specific version (e.g., version: "0.6.0") for more predictable builds.

Suggested change
       - name: Install uv
         uses: astral-sh/setup-uv@v3
         with:
-          version: "latest"
+          version: "0.6.0"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/docs.yml around lines 21 - 24, Replace the unpinned uv
installer version to a fixed release to avoid CI drift: in the GitHub Actions
step that uses astral-sh/setup-uv@v3 (the step with name "Install uv" and the
input key version), change version: "latest" to a specific pinned version string
(e.g., "0.6.0" or another tested semver) and update any related documentation or
workflow comments to note the chosen pinned version.
.pre-commit-config.yaml (1)

14-19: Consider updating ruff-pre-commit to a more recent version.

The pinned version v0.3.4 (March 2024) is significantly outdated. The latest version is v0.15.4 (February 2026), representing substantial development including new linting rules and bug fixes. Updating to the latest stable release will improve code quality checks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.pre-commit-config.yaml around lines 14 - 19, Update the pinned
ruff-pre-commit version by changing the rev value currently set to v0.3.4 to the
latest stable release (e.g., v0.15.4) in the pre-commit config where the
repository URL "https://github.com/astral-sh/ruff-pre-commit" and hooks "ruff"
and "ruff-format" are declared so the pre-commit hook uses the newer ruff rules
and fixes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/tests.yml:
- Around line 32-35: Remove openbabel from the system dependencies step in the
"Install system dependencies" job by deleting "openbabel" from the apt-get
install command (leave "libfftw3-dev" intact) so CI stops installing OpenBabel
for tests; if OpenBabel is needed later for integration tests, re-add it with a
clarifying comment near the "Install system dependencies" step.

In `@README.md`:
- Line 125: The README line contains the redundant phrase "for e.g." — replace
it with either "e.g." or "for example" so the sentence reads e.g.
`++model.pretrain.training_arguments.per_device_train_batch_size=32` or for
example `++model.pretrain.training_arguments.per_device_train_batch_size=32`;
update the text surrounding the example in the same sentence accordingly.
- Around line 150-159: Update the citation block for the
`@misc`{alampara2024mattextlanguagemodelsneed} entry: change the code fence to
specify the language (```bibtex), add a missing comma after the primaryClass
field value (primaryClass={cond-mat.mtrl-sci},) and ensure the url field is
punctuated correctly (url={https://arxiv.org/abs/2406.17295} without an extra
trailing comma), then close the fenced code block.

In `@src/mattext/main.py`:
- Line 178: Remove the unnecessary f-string prefix in the print call that emits
the LOCAL_RANK warning in mattext.main (the print statement currently:
print(f"WARNING: LOCAL_RANK is set but torch.distributed is not initialized!"));
change it to a normal string literal (remove the leading "f") so the message is
printed without using an f-string and eliminates the F541 lint violation.

In `@src/mattext/models/benchmark.py`:
- Around line 117-119: The loop pairing experiment and test names uses
zip(self.exp_names, self.test_exp_names) which silently truncates mismatched
sequences; update both occurrences (the for i, (exp_name, test_name) in
enumerate(...) loops at the shown spots) to use zip(self.exp_names,
self.test_exp_names, strict=True) so a length mismatch raises immediately; apply
the same change to the second occurrence around lines 151-153 and run tests
(ensure Python version supports zip(..., strict=True)).
- Around line 64-87: The wandb run started by wandb.init in the
state.is_main_process branch isn't closed; wrap the inference sequence that
calls self._get_inference and predict.predict in a try/except/finally (still
preserving the existing except behavior that re-raises ValueError/TypeError) and
call wandb.finish() in the finally block (only when state.is_main_process and
after wandb.init was invoked) so each fold's run is always closed even on
errors; ensure the finally runs before state.wait_for_everyone().

In `@src/mattext/models/llama_sft.py`:
- Around line 209-220: The output directory self.output_dir_ is currently set to
a static path and all folds write to the same location, risking overwrites;
update the path generation to include the fold (e.g., incorporate self.fold or a
fold-scoped subdirectory) before calling trainer.save_model(self.output_dir_)
and tokenizer.save_pretrained(...), and ensure trainer.save_state() and any
other artifact saves use the same fold-scoped directory so each fold writes to a
unique directory under self.cfg.path.finetuned_modelname.
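The fold-scoped path idea can be sketched as follows; the directory layout and the `fold_<name>` naming are assumptions for illustration, not the project's actual scheme:

```python
import os
import tempfile


def fold_output_dir(base_dir: str, fold: str) -> str:
    # One sub-directory per fold, so trainer.save_model / save_pretrained
    # calls from different folds never overwrite each other.
    path = os.path.join(base_dir, f"fold_{fold}")
    os.makedirs(path, exist_ok=True)  # idempotent, safe to call per rank
    return path


base = tempfile.mkdtemp()
dirs = [fold_output_dir(base, f) for f in ("0", "1", "2")]
```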


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 866a0e2 and e7b6a9a.

📒 Files selected for processing (13)
  • .github/workflows/docs.yml
  • .github/workflows/pre-commit.yml
  • .github/workflows/tests.yml
  • .pre-commit-config.yaml
  • README.md
  • docs/getting_started.md
  • pyproject.toml
  • src/mattext/main.py
  • src/mattext/models/benchmark.py
  • src/mattext/models/finetune.py
  • src/mattext/models/llama.py
  • src/mattext/models/llama_sft.py
  • src/mattext/models/potential.py

Comment on lines +32 to +35
- name: Install system dependencies
run: |
sudo apt-get update
sudo apt-get install -y openbabel libfftw3-dev

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if any test files use LocalEnv or openbabel
rg -i "local.?env|openbabel" tests/

Repository: lamalab-org/MatText

Length of output: 45


🏁 Script executed:

#!/bin/bash
# List test files to understand the test structure
find tests/ -name "*.py" -type f | head -20

Repository: lamalab-org/MatText

Length of output: 107


🏁 Script executed:

#!/bin/bash
# Read test files to understand their scope
echo "=== test_imports.py ==="
cat tests/test_imports.py
echo -e "\n=== test_xtal2pot.py ==="
cat tests/test_xtal2pot.py

Repository: lamalab-org/MatText

Length of output: 1109


🏁 Script executed:

#!/bin/bash
# Search the entire codebase for LocalEnv usage
echo "=== Searching for LocalEnv usage ==="
rg -i "local.?env" --type py
echo -e "\n=== Searching for openbabel imports ==="
rg -i "openbabel|import open" --type py

Repository: lamalab-org/MatText

Length of output: 1732


🏁 Script executed:

#!/bin/bash
# Check pyproject.toml for dependencies
echo "=== pyproject.toml dependencies ==="
cat pyproject.toml | grep -A 50 "dependencies\|optional"
echo -e "\n=== Check if LocalEnv/local_env is implemented in src ==="
rg -l "local_env|LocalEnv" src/

Repository: lamalab-org/MatText

Length of output: 1484


🏁 Script executed:

#!/bin/bash
# Check how local_env is used in the source
echo "=== src/mattext/models/utils.py ==="
cat src/mattext/models/utils.py

Repository: lamalab-org/MatText

Length of output: 8533


🏁 Script executed:

#!/bin/bash
# Check if SmilesTokenizer or local_env implementation uses openbabel
echo "=== Searching for SmilesTokenizer implementation ==="
find src -name "*.py" -type f | xargs grep -l "SmilesTokenizer"
echo -e "\n=== Check tokenizer implementations ==="
rg "class SmilesTokenizer|class.*Tokenizer" src/ -A 5

Repository: lamalab-org/MatText

Length of output: 2281


🏁 Script executed:

#!/bin/bash
# Find and check tokenizer implementations
echo "=== tokenizer/__init__.py ==="
cat src/mattext/tokenizer/__init__.py
echo -e "\n=== Check for tokenizer directory files ==="
ls -la src/mattext/tokenizer/

Repository: lamalab-org/MatText

Length of output: 666


🏁 Script executed:

#!/bin/bash
# Check README for OpenBabel mentions
echo "=== README.md mentions of OpenBabel ==="
rg -i "openbabel|local.*env" README.md -A 2 -B 2
echo -e "\n=== Check if xtal2txt is in dependencies and what it might need ==="
cat pyproject.toml | grep -i xtal2txt
echo -e "\n=== Search for any OpenBabel-related comments in codebase ==="
rg -i "openbabel" --type py --type yml

Repository: lamalab-org/MatText

Length of output: 577


🏁 Script executed:

#!/bin/bash
# Check if libfftw3-dev is needed for xtal2txt or other dependencies
echo "=== Searching for fftw references ==="
rg -i "fftw" src/ scripts/ tests/ revision-scripts/
echo -e "\n=== Check xtal2txt GitHub or documentation for dependencies ==="
echo "xtal2txt is a git dependency, checking if it's imported in tests"
rg "from xtal2txt|import xtal2txt" tests/

Repository: lamalab-org/MatText

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Final check: verify what the tests actually import and execute
echo "=== All imports in test files ==="
rg "^import |^from " tests/
echo -e "\n=== Check if pymatgen structures internally use OpenBabel ==="
echo "Checking if pymatgen or xtal2txt might require OpenBabel transitively"
rg -i "openbabel" . --type py | head -20

Repository: lamalab-org/MatText

Length of output: 681


Remove openbabel from system dependencies—it's not needed for the current test suite.

LocalEnv representation (which requires OpenBabel) is not covered by the tests (test_imports.py and test_xtal2pot.py only test basic tokenizers and structures). Since the test suite doesn't exercise this feature, installing OpenBabel adds unnecessary overhead to CI. If needed later for integration testing, add it back with a comment explaining its purpose.



The `+` symbol before a configuration key indicates that you are adding a new key-value pair to the configuration. This is useful when you want to specify parameters that are not part of the default configuration.

To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.

⚠️ Potential issue | 🟡 Minor

Fix redundant phrase "for e.g."

The phrase "for e.g." is redundant since "e.g." already means "for example". Use either "e.g." or "for example".

Suggested fix
-To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`.
+To override the existing default configuration, use `++`, e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`.
🧰 Tools
🪛 LanguageTool

[style] ~125-~125: The phrase ‘for e.g.’ is a tautology (‘e.g.’ means ‘for example’). Consider using just “e.g.” or “for example”.
Context: ...isting default configuration, use ++, for e.g., `++model.pretrain.training_arguments.p...

(FOR_EG_REDUNDANCY)


[uncategorized] ~125-~125: Do not mix variants of the same word (‘pretrain’ and ‘pre-train’) within a single text.
Context: ...ault configuration, use ++, for e.g., `++model.pretrain.training_arguments.per_device_train_bat...

(EN_WORD_COHERENCY)


Comment on lines 150 to 159
```
@misc{alampara2024mattextlanguagemodelsneed,
title={MatText: Do Language Models Need More than Text & Scale for Materials Modeling?},
author={Nawaf Alampara and Santiago Miret and Kevin Maik Jablonka},
year={2024},
eprint={2406.17295},
archivePrefix={arXiv},
primaryClass={cond-mat.mtrl-sci}
url={https://arxiv.org/abs/2406.17295},
}

⚠️ Potential issue | 🟡 Minor

Fix citation block formatting.

The code block should specify a language (e.g., bibtex), and the url field appears to be placed after primaryClass without a trailing comma, which would be invalid BibTeX syntax.

Suggested fix
-```
+```bibtex
 `@misc`{alampara2024mattextlanguagemodelsneed,
       title={MatText: Do Language Models Need More than Text & Scale for Materials Modeling?},
       author={Nawaf Alampara and Santiago Miret and Kevin Maik Jablonka},
       year={2024},
       eprint={2406.17295},
       archivePrefix={arXiv},
-      primaryClass={cond-mat.mtrl-sci}
-      url={https://arxiv.org/abs/2406.17295},
+      primaryClass={cond-mat.mtrl-sci},
+      url={https://arxiv.org/abs/2406.17295}
 }
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 150-150: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment on lines +64 to +87
if state.is_main_process:
wandb.init(
config=dict(self.task_cfg.model.inference),
project=self.task_cfg.model.logging.wandb_project,
name=test_name,
)

exp_cfg.model.inference.path.test_data = self.test_data
exp_cfg.model.inference.path.pretrained_checkpoint = ckpt

try:
predict = self._get_inference(exp_cfg, fold_name)
predictions, prediction_ids = predict.predict()
self._record_predictions(task, i, predictions, prediction_ids)
except Exception as e:
logger.error(
f"Error occurred during inference for finetuned checkpoint '{exp_name}': {str(e)}"
)
if isinstance(e, (ValueError, TypeError)):
raise
logger.error(traceback.format_exc())

# All ranks sync after inference before next fold
state.wait_for_everyone()

⚠️ Potential issue | 🟠 Major

Always close the inference W&B run (wandb.finish) in this method.

wandb.init at Line 65 is not paired with wandb.finish here, so inference runs can bleed across folds.

Proposed fix
         if state.is_main_process:
             wandb.init(
                 config=dict(self.task_cfg.model.inference),
                 project=self.task_cfg.model.logging.wandb_project,
                 name=test_name,
             )

             exp_cfg.model.inference.path.test_data = self.test_data
             exp_cfg.model.inference.path.pretrained_checkpoint = ckpt

             try:
                 predict = self._get_inference(exp_cfg, fold_name)
                 predictions, prediction_ids = predict.predict()
                 self._record_predictions(task, i, predictions, prediction_ids)
             except Exception as e:
                 logger.error(
                     f"Error occurred during inference for finetuned checkpoint '{exp_name}': {str(e)}"
                 )
                 if isinstance(e, (ValueError, TypeError)):
                         raise
                 logger.error(traceback.format_exc())
+            finally:
+                wandb.finish()
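The init/finish pairing can be sketched without the wandb package; `Run` here is a stand-in for the object `wandb.init` returns, and the error policy mirrors the existing except clause (re-raise configuration errors, swallow the rest and move to the next fold):

```python
class Run:
    """Stand-in for a wandb run; only tracks whether finish() was called."""

    def __init__(self):
        self.finished = False

    def finish(self):
        self.finished = True


def run_fold_inference(run, do_inference):
    try:
        return do_inference()
    except (ValueError, TypeError):
        raise              # config/programming errors should abort the job
    except Exception:
        return None        # other failures: log and continue to the next fold
    finally:
        if run is not None:
            run.finish()   # mirrors wandb.finish(); executes even on error
```

The `finally` block guarantees the run is closed on success, on swallowed errors, and on re-raised ones alike.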
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 80-80: Use explicit conversion flag

Replace with conversion flag

(RUF010)


Comment on lines 117 to 119
for i, (exp_name, test_name) in enumerate(
zip(self.exp_names, self.test_exp_names)
):

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/mattext/models/benchmark.py | head -200

Repository: lamalab-org/MatText

Length of output: 7757


🏁 Script executed:

fd -e "pyproject.toml" -o -e "setup.py" -o -e "setup.cfg" -o -e "requirements.txt" | head -10

Repository: lamalab-org/MatText

Length of output: 179


🏁 Script executed:

git ls-files | grep -E "(pyproject.toml|setup.py|setup.cfg|requirements)" | head -10

Repository: lamalab-org/MatText

Length of output: 77


🏁 Script executed:

cat pyproject.toml

Repository: lamalab-org/MatText

Length of output: 4003


🏁 Script executed:

rg -i "(exp_name|test_exp_names)" src/ --type py -A 2 -B 2 | head -80

Repository: lamalab-org/MatText

Length of output: 5612


🏁 Script executed:

rg "exp_name|exp_names" --type yaml --type toml --type json | head -30

Repository: lamalab-org/MatText

Length of output: 3410


Use zip(..., strict=True) to catch length mismatches between experiment and test names.

Without strict=True, if self.exp_names and self.test_exp_names differ in length, the loop silently discards unmatched pairs, so benchmark results lose data without any error being raised.

Proposed fix
         for i, (exp_name, test_name) in enumerate(
-            zip(self.exp_names, self.test_exp_names)
+            zip(self.exp_names, self.test_exp_names, strict=True)
         ):

Also applies to: 151-153

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 118-118: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/benchmark.py` around lines 117 - 119, The loop pairing
experiment and test names uses zip(self.exp_names, self.test_exp_names) which
silently truncates mismatched sequences; update both occurrences (the for i,
(exp_name, test_name) in enumerate(...) loops at the shown spots) to use
zip(self.exp_names, self.test_exp_names, strict=True) so a length mismatch
raises immediately; apply the same change to the second occurrence around lines
151-153 and run tests (ensure Python version supports zip(..., strict=True)).

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (14)
scripts/qmof_prepare_data.py (1)

18-26: ⚠️ Potential issue | 🟡 Minor

Keep return type docs consistent with the new annotation.

Line 18 returns list[dict], but Line 25 still documents List[Dict]. Please update the docstring to match.

Suggested fix
-        List[Dict]: A list of dictionaries containing the JSON data.
+        list[dict]: A list of dictionaries containing the JSON data.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/qmof_prepare_data.py` around lines 18 - 26, The docstring for
read_json currently uses capitalized typing (List[Dict]) that doesn't match the
function annotation list[dict]; update the "Returns:" section to use the
lowercase, modern annotation style (e.g., "list[dict]") and ensure the
descriptive text matches the return type; locate the read_json function and
change the docstring return type line to "list[dict]".
scripts/linear_potential.py (1)

16-24: ⚠️ Potential issue | 🟡 Minor

Update docstring return type to match the annotation.

Line 16 uses list[dict], while Line 23 still says List[Dict].

Suggested fix
-        List[Dict]: A list of dictionaries containing the JSON data.
+        list[dict]: A list of dictionaries containing the JSON data.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/linear_potential.py` around lines 16 - 24, The docstring for function
read_json has a mismatched return type description ("List[Dict]") that doesn't
match the type annotation (list[dict]); update the docstring return section to
exactly match the annotation (use "list[dict]" or the same canonical form you
choose) so the return type in the docstring and the function signature
(read_json) are consistent.
pyproject.toml (1)

124-135: ⚠️ Potential issue | 🟡 Minor

Bumpversion current_version format doesn't match the parse regex.

The current_version = "0.0.2-dev" uses a hyphen before "dev", but the parse regex only matches:

  • major.minor.patch (e.g., 0.0.2)
  • major.minor.patch.postN.devN (e.g., 0.0.2.post1.dev1)

The regex only partially matches "0.0.2-dev" (capturing just "0.0.2"), leaving the "-dev" suffix unmatched. This incomplete parsing will cause bumpversion commands to fail or behave unexpectedly.

Additionally, the project version = "0.0.2" (line 9) mismatches current_version = "0.0.2-dev" and should be synchronized.

Fix by either updating current_version to match the regex pattern:

-current_version = "0.0.2-dev"
+current_version = "0.0.2"

Or update the parse regex to support the hyphenated dev format:

-parse = "(?P<major>\\d+)\\.(?P<minor>\\d+)\\.(?P<patch>\\d+)(\\.(?P<dev>post)\\d+\\.dev\\d+)?"
+parse = "(?P<major>\\d+)\\.(?P<minor>\\d+)\\.(?P<patch>\\d+)(-(?P<dev>dev))?"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 124 - 135, The bumpversion config is
inconsistent: current_version = "0.0.2-dev" doesn't match the parse regex and
also disagrees with project version = "0.0.2"; either make the values consistent
or expand the regex. Fix option A: set current_version (and project version) to
a value that matches parse (e.g., "0.0.2") so both "current_version" and
"version" are identical; or Fix option B: update the parse pattern to accept the
hyphenated dev suffix (so the parse regex used by [tool.bumpversion] will match
"0.0.2-dev") and ensure serialize still produces the intended tags. Ensure you
update the unique keys current_version, parse, and version accordingly.
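To make the mismatch concrete, a quick check of the quoted parse regex against the two version strings (a sketch; the regex is copied from the comment above):

```python
import re

# The bumpversion parse regex as quoted in the comment above
parse = r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<dev>post)\d+\.dev\d+)?"

# "0.0.2-dev": the optional group requires a leading ".", so "-dev" is left unmatched
m = re.match(parse, "0.0.2-dev")
print(m.group(0))  # 0.0.2

# A post/dev version that the optional group does cover
m2 = re.match(parse, "0.0.2.post1.dev1")
print(m2.group(0))  # 0.0.2.post1.dev1
```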
src/mattext/models/inference.py (2)

20-35: ⚠️ Potential issue | 🔴 Critical

Initialize self.task_cfg before dereferencing it.

self.task_cfg is used on Line 28 onward but never assigned in __init__, which will raise immediately during object creation.

Suggested fix
 def __init__(self, task_cfg: DictConfig):
+    self.task_cfg = task_cfg
     self.representation = self.task_cfg.model.representation
     self.benchmark = self.task_cfg.model.inference.benchmark_dataset
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/inference.py` around lines 20 - 35, The __init__ method
dereferences self.task_cfg before it is assigned, causing an AttributeError; fix
by assigning the incoming task_cfg parameter to self.task_cfg at the start of
__init__ (e.g., self.task_cfg = task_cfg) before any access, then keep the
existing assignments to self.representation, self.benchmark, self.exp_names,
self.test_exp_names, self.train_data, self.test_data, self.wndb, and
self.benchmark_save_path so they reference the stored configuration object.

91-95: ⚠️ Potential issue | 🔴 Critical

Unpack predict.predict() before recording benchmark results.

In this block, predict.predict() is treated as a single object, but the benchmark flow in src/mattext/models/benchmark.py (Lines 74-84) expects unpacked values. Passing the raw return value to benchmark.record(...) can fail and then be silently skipped by the broad exception path.

Suggested fix
             try:
                 predict = Inference(exp_cfg)
-                predictions = predict.predict()
+                predictions, prediction_ids = predict.predict()
                 benchmark.record(i, predictions)
             except Exception:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/inference.py` around lines 91 - 95, The call to
predict.predict() returns multiple values that must be unpacked before passing
into benchmark.record; modify the try block around Inference(exp_cfg) so you
assign the returned tuple from Inference.predict() to separate variables (e.g.,
y_pred, timings or similar based on Inference.predict signature) and then call
benchmark.record(i, y_pred, timings) instead of benchmark.record(i,
predictions). Also avoid silently swallowing errors — catch Exception as e and
log or re-raise so failures in Inference.predict() are visible (referencing the
Inference class and its predict() method and the benchmark.record(...) call).
scripts/filterdataset.py (1)

25-34: ⚠️ Potential issue | 🔴 Critical

Bug: wycoff_rep mapped to None will cause TypeError at runtime.

The _TOKENIZER_MAP maps wycoff_rep to None, but _tokenize_without_truncation (line 60-61) directly calls tokenizer_class(...) without a None check. This will raise TypeError: 'NoneType' object is not callable if wycoff_rep is used.

This is inconsistent with src/mattext/models/utils.py (lines 19-32), which correctly maps wycoff_rep to RobocrysTokenizer.

Proposed fix - align with utils.py
 _TOKENIZER_MAP: dict[str, type] = {
     "slice": SliceTokenizer,
     "composition": CompositionTokenizer,
     "cif_symmetrized": CifTokenizer,
     "cif_p1": CifTokenizer,
     "cif_bonding": CifTokenizer,
     "crystal_llm_rep": CrysllmTokenizer,
     "robocrys_rep": RobocrysTokenizer,
-    "wycoff_rep": None,
+    "wycoff_rep": RobocrysTokenizer,
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/filterdataset.py` around lines 25 - 34, The _TOKENIZER_MAP currently
maps "wycoff_rep" to None which leads to a TypeError when
_tokenize_without_truncation calls tokenizer_class(...); update the
_TOKENIZER_MAP entry for "wycoff_rep" to reference RobocrysTokenizer (same as in
src/mattext/models/utils.py) so that _tokenize_without_truncation, which
instantiates tokenizer_class, receives a valid callable class; verify the map
uses the exact symbol RobocrysTokenizer and no None values for supported keys.
src/mattext/models/llama_sft.py (1)

90-99: ⚠️ Potential issue | 🔴 Critical

Bug: compute_dtype may be unbound when quantization is disabled.

If both use_4bit and use_8bit are False, control falls through to the else branch at line 90 and sets bnb_config = None, but compute_dtype (line 82) is never assigned. Line 94's check if compute_dtype == torch.float16 will raise UnboundLocalError.

Proposed fix - ensure compute_dtype is always defined
+        compute_dtype = getattr(torch, self.bnb_config.bnb_4bit_compute_dtype)
+
         if self.bnb_config.use_4bit and self.bnb_config.use_8bit:
             raise ValueError(
                 "You can't load the model in 8 bits and 4 bits at the same time"
             )

         elif self.bnb_config.use_4bit or self.bnb_config.use_8bit:
-            compute_dtype = getattr(torch, self.bnb_config.bnb_4bit_compute_dtype)
             bnb_config = BitsAndBytesConfig(
                 load_in_4bit=self.bnb_config.use_4bit,
                 load_in_8bit=self.bnb_config.use_8bit,

Or skip the GPU capability check when quantization is disabled:

         else:
             bnb_config = None
+            compute_dtype = None

         # Check GPU compatibility with bfloat16
-        if compute_dtype == torch.float16:
+        if compute_dtype is not None and compute_dtype == torch.float16:
             major, _ = torch.cuda.get_device_capability()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama_sft.py` around lines 90 - 99, The GPU bfloat16
compatibility check uses compute_dtype which can be unassigned when quantization
is disabled (use_4bit and use_8bit both False); to fix, ensure compute_dtype is
always defined or skip the check when bnb_config is None: either set a default
(e.g., compute_dtype = None or torch.float32) before the quantization branches
or wrap the check with a guard like "if bnb_config is not None and compute_dtype
== torch.float16:" so the call in torch.cuda.get_device_capability() only runs
when compute_dtype was set by the quantization branches (referencing
compute_dtype, bnb_config, use_4bit, use_8bit, and logger).
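The control-flow bug reproduces without torch; a minimal sketch with a hypothetical load function standing in for the quantization setup:

```python
# compute_dtype is bound only in the elif branch; when both flags are False,
# the later comparison hits an unassigned local variable.
def load(use_4bit: bool, use_8bit: bool) -> str:
    if use_4bit and use_8bit:
        raise ValueError("can't load in 8 bits and 4 bits at the same time")
    elif use_4bit or use_8bit:
        compute_dtype = "float16"  # stands in for getattr(torch, ...)
    else:
        bnb_config = None  # compute_dtype is never assigned on this path

    if compute_dtype == "float16":  # UnboundLocalError when both flags are False
        return "check GPU capability"
    return "no check"

print(load(True, False))  # check GPU capability
try:
    load(False, False)
except UnboundLocalError as e:
    print(f"bug reproduced: {e}")
```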
src/mattext/models/utils.py (1)

256-264: ⚠️ Potential issue | 🟠 Major

Bug: Duplicate code in on_evaluate logs predictions twice.

Lines 256-260 and 261-264 are identical: the method calls super().on_evaluate(...), generates a samples table, and logs it twice per evaluation. This doubles computation and writes duplicate sample_predictions entries to WandB.

Proposed fix - remove duplicate block
     def on_evaluate(self, args, state, control, **kwargs):
         """Log the wandb.Table after calling trainer.evaluate."""
         super().on_evaluate(args, state, control, **kwargs)
         records_table = self.samples_table(self.sample_dataset)
         self._wandb.log({"sample_predictions": records_table})
-        # "Log the wandb.Table after calling trainer.evaluate"
-        super().on_evaluate(args, state, control, **kwargs)
-        records_table = self.samples_table(self.sample_dataset)
-        self._wandb.log({"sample_predictions": records_table})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/utils.py` around lines 256 - 264, The on_evaluate method
currently executes the same block twice (calls super().on_evaluate(...), builds
records_table via samples_table(self.sample_dataset), and calls
self._wandb.log({"sample_predictions": records_table})), causing duplicate
computation and duplicate WandB entries; remove the duplicated second block so
on_evaluate only calls super().on_evaluate(args, state, control, **kwargs),
creates records_table = self.samples_table(self.sample_dataset) once, and logs
it once with self._wandb.log({"sample_predictions": records_table}).
src/mattext/models/llama.py (2)

248-264: ⚠️ Potential issue | 🟠 Major

Inconsistent main-process gating for WandB and model saving.

Unlike finetune.py, the wandb.log(eval_result), trainer.save_model(), model.save_pretrained(), and wandb.finish() calls at lines 256-264 are not gated to the main process. This will cause:

  1. All ranks attempting to save models (potential file conflicts)
  2. Duplicate WandB logging from non-main ranks
  3. wandb.finish() called on ranks where wandb.init() was never called (per src/mattext/main.py:75-79, only main process calls wandb.init())
Suggested fix to match finetune.py pattern
     # Trainer handles all DDP synchronization automatically
     trainer.train()
-    trainer.save_model(
-        f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
-    )

     eval_result = trainer.evaluate(eval_dataset=self.tokenized_dataset["test"])
-    wandb.log(eval_result)

-    self.model.save_pretrained(self.cfg.path.finetuned_modelname)
-    wandb.finish()
+    if is_main:
+        wandb.log(eval_result)
+        trainer.save_model(
+            f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
+        )
+        self.model.save_pretrained(self.cfg.path.finetuned_modelname)
+        wandb.finish()
+
     return self.cfg.path.finetuned_modelname
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama.py` around lines 248 - 264, The calls after training
(wandb.log(eval_result), trainer.save_model(...),
self.model.save_pretrained(...), and wandb.finish()) are not gated by the
main-process check and will run on all ranks; wrap these calls with the same
is_main condition used earlier (derived from self.local_rank is None or
self.local_rank == 0) so only the main process performs wandb logging, saves the
trainer/model, and calls wandb.finish; use the existing is_main variable and
mirror the finetune.py pattern to avoid duplicate logs, file conflicts, and
wandb.finish() being called on non-initialized ranks.

99-119: ⚠️ Potential issue | 🔴 Critical

Potential NameError: compute_dtype may be undefined.

compute_dtype is only assigned inside the elif block (line 100), but is referenced unconditionally at line 112. If both use_4bit and use_8bit are False, the code will raise a NameError.

Suggested fix
+        compute_dtype = None
         if self.bnb_config.use_4bit and self.bnb_config.use_8bit:
             raise ValueError(
                 "You can't load the model in 8 bits and 4 bits at the same time"
             )

         elif self.bnb_config.use_4bit or self.bnb_config.use_8bit:
             compute_dtype = getattr(torch, self.bnb_config.bnb_4bit_compute_dtype)
             bnb_config = BitsAndBytesConfig(
                 load_in_4bit=self.bnb_config.use_4bit,
                 load_in_8bit=self.bnb_config.use_8bit,
                 bnb_4bit_quant_type=self.bnb_config.bnb_4bit_quant_type,
                 bnb_4bit_compute_dtype=compute_dtype,
                 bnb_4bit_use_double_quant=self.bnb_config.use_nested_quant,
             )
         else:
             bnb_config = None

         # Check GPU compatibility with bfloat16
-        if compute_dtype == torch.float16:
+        if compute_dtype is not None and compute_dtype == torch.float16:
             major, _ = torch.cuda.get_device_capability()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama.py` around lines 99 - 119, The variable
compute_dtype can be referenced before assignment; set compute_dtype = None
before the BitsAndBytesConfig branch (or move the bfloat16 check into the same
branch) and guard the GPU bfloat16 check so it only runs when compute_dtype is
not None and equals torch.float16; update the code around the BitsAndBytesConfig
construction (the elif that checks bnb_config.use_4bit or bnb_config.use_8bit
and the subsequent bfloat16 GPU check) to avoid a NameError.
src/mattext/models/finetune.py (1)

117-126: ⚠️ Potential issue | 🟡 Minor

Specify minimum transformers version requirement in pyproject.toml.

The processing_class parameter (used in the Trainer initialization) was introduced in transformers v4.46.0. Since pyproject.toml currently specifies "transformers" without a version constraint, the code will fail on earlier versions. Add transformers>=4.46.0 to enforce compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/finetune.py` around lines 117 - 126, The Trainer
instantiation uses the processing_class parameter
(processing_class=self._wrapped_tokenizer) which requires transformers>=4.46.0;
update pyproject.toml to pin the transformers dependency to at least 4.46.0
(e.g. "transformers>=4.46.0") so the Trainer(..., processing_class=...) call
will work reliably across installs.
revision-scripts/text_rep.py (1)

140-146: ⚠️ Potential issue | 🟡 Minor

Always persist final resume progress at loop end.

If total batches are fewer than save_interval, no progress file is written. That breaks restart-from-log expectations.

Proposed fix
     for i, batch_data in enumerate(batch_iterator, start=1):
         ...
         if i % save_interval == 0:
             with open(log_file_path, "w") as log_file:
                 log_file.write(f"Last processed entry index: {last_processed_entry}\n")
                 log_file.write(f"Last processed batch number: {i}\n")
 
+    with open(log_file_path, "w") as log_file:
+        log_file.write(f"Last processed entry index: {last_processed_entry}\n")
+        log_file.write(f"Last processed batch number: {i if 'i' in locals() else 0}\n")
+
     print(f"Finished !!! logging at {log_file_path}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@revision-scripts/text_rep.py` around lines 140 - 146, The loop only writes
progress when i % save_interval == 0, so if total batches < save_interval no
progress file is created; after the loop in revision-scripts/text_rep.py add a
final unconditional write to log_file_path that persists last_processed_entry
and the final batch index (i or last batch number) so restart-from-log works;
use the same format as the periodic writes (f"Last processed entry index:
{last_processed_entry}\n" and f"Last processed batch number: {i}\n") to
overwrite/create the file and ensure resume state is always saved.
revision-scripts/prep_rep.py (1)

140-146: ⚠️ Potential issue | 🟡 Minor

Write final progress log after processing completes.

With current logic, the log may never be written when batch count is below save_interval.

Proposed fix
     for i, batch_data in enumerate(batch_iterator, start=1):
         ...
         if i % save_interval == 0:
             with open(log_file_path, "w") as log_file:
                 log_file.write(f"Last processed entry index: {last_processed_entry}\n")
                 log_file.write(f"Last processed batch number: {i}\n")
 
+    with open(log_file_path, "w") as log_file:
+        log_file.write(f"Last processed entry index: {last_processed_entry}\n")
+        log_file.write(f"Last processed batch number: {i if 'i' in locals() else 0}\n")
+
     print(f"Finished !!! logging at {log_file_path}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@revision-scripts/prep_rep.py` around lines 140 - 146, The final progress log
may not be written if the total batch count is less than save_interval; after
the batch-processing loop completes, ensure you write the final progress to
log_file_path using the same fields (last_processed_entry and the last batch
index i or a final batch counter) so the log is always present; update the code
around the loop that updates last_processed_entry and checks save_interval
(refer to last_processed_entry, save_interval, log_file_path, and i) to perform
one last write to the log after the loop finishes.
revision-scripts/prep_json.py (1)

39-42: ⚠️ Potential issue | 🟡 Minor

Use os.makedirs(..., exist_ok=True) instead of check-then-create.

The pattern at lines 39–40 (os.path.exists() + os.mkdir()) is race-prone and fails for nested paths. os.makedirs(..., exist_ok=True) is safer and handles parent directory creation automatically.

Proposed fix
-    if not os.path.exists(save_path):
-        os.mkdir(save_path)
-    else:
-        print(f"Directory '{save_path}' already exists.")
+    already_exists = os.path.isdir(save_path)
+    os.makedirs(save_path, exist_ok=True)
+    if already_exists:
+        print(f"Directory '{save_path}' already exists.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@revision-scripts/prep_json.py` around lines 39 - 42, Replace the race-prone
check-then-create block that uses os.path.exists() and os.mkdir() with a single
safe call to os.makedirs(save_path, exist_ok=True); remove the else print branch
since exist_ok=True will silently handle existing directories and will also
create any needed parent directories. Keep the surrounding logic intact and
ensure the os import remains available for prep_json.py.
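The safer pattern can be sketched in isolation (using a temporary directory so the example is self-contained):

```python
import os
import tempfile

# os.makedirs(..., exist_ok=True) creates nested parents and is idempotent,
# avoiding the check-then-create race of os.path.exists() + os.mkdir().
base = tempfile.mkdtemp()
save_path = os.path.join(base, "results", "fold_0")  # nested path, does not exist yet

os.makedirs(save_path, exist_ok=True)  # creates parents in one call
os.makedirs(save_path, exist_ok=True)  # second call is a no-op, no exception

print(os.path.isdir(save_path))  # True
```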
♻️ Duplicate comments (2)
src/mattext/models/benchmark.py (1)

64-87: ⚠️ Potential issue | 🟠 Major

Missing wandb.finish() for inference W&B run.

wandb.init at line 65 is not paired with wandb.finish(), so inference runs can bleed across folds. The W&B run started in the is_main_process block should be closed after inference completes.

Proposed fix - add wandb.finish in finally block
         if state.is_main_process:
             wandb.init(
                 config=dict(self.task_cfg.model.inference),
                 project=self.task_cfg.model.logging.wandb_project,
                 name=test_name,
             )

             exp_cfg.model.inference.path.test_data = self.test_data
             exp_cfg.model.inference.path.pretrained_checkpoint = ckpt

             try:
                 predict = self._get_inference(exp_cfg, fold_name)
                 predictions, prediction_ids = predict.predict()
                 self._record_predictions(task, i, predictions, prediction_ids)
             except Exception as e:
                 logger.error(
                     f"Error occurred during inference for finetuned checkpoint '{exp_name}': {e!s}"
                 )
                 if isinstance(e, ValueError | TypeError):
                     raise
                 logger.error(traceback.format_exc())
+            finally:
+                wandb.finish()

         # All ranks sync after inference before next fold
         state.wait_for_everyone()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/benchmark.py` around lines 64 - 87, The W&B run opened in
the state.is_main_process block via wandb.init is never closed; add a finally
that calls wandb.finish() after the inference code so the run is always
terminated (success or exception). Specifically, wrap the calls around
_get_inference(...), predict.predict(), and _record_predictions(...) in a
try/except/finally (or add a nested try/finally) such that wandb.finish() is
executed only when state.is_main_process opened the run, and preserve the
existing exception handling (re-raise ValueError/TypeError and log others)
before finishing the run.
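The lifecycle pattern the fix asks for can be sketched with a stub run object in place of wandb, so the control flow is checkable without the library:

```python
# A finally block guarantees the run is closed whether inference succeeds,
# logs a recoverable error, or re-raises.
class FakeRun:  # stand-in for the run behind wandb.init()/wandb.finish()
    def __init__(self):
        self.finished = False

    def finish(self):
        self.finished = True

def run_fold(run: FakeRun, fail: bool) -> str:
    try:
        if fail:
            raise RuntimeError("inference failed")
        return "recorded predictions"
    except RuntimeError:
        return "logged error"
    finally:
        run.finish()  # always executes, mirroring wandb.finish() in finally

ok, bad = FakeRun(), FakeRun()
print(run_fold(ok, fail=False), ok.finished)   # recorded predictions True
print(run_fold(bad, fail=True), bad.finished)  # logged error True
```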
src/mattext/models/llama_sft.py (1)

215-217: ⚠️ Potential issue | 🟠 Major

Output directory not fold-scoped, causing checkpoint overwrites.

Lines 215-217 write all folds to the same output_dir_ path, so later folds overwrite earlier checkpoints. Line 227 correctly uses {self.fold} for the tokenizer save, but trainer.save_model uses the non-scoped path.

Proposed fix - scope output_dir_ by fold
-        self.output_dir_ = (
-            f"{self.cfg.path.finetuned_modelname}/llamav3-8b-lora-fine-tune"
-        )
+        self.output_dir_ = (
+            f"{self.cfg.path.finetuned_modelname}_{self.fold}/llamav3-8b-lora-fine-tune"
+        )

Also applies to: 226-228

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama_sft.py` around lines 215 - 217, The output_dir_
currently is not fold-scoped so multiple folds overwrite checkpoints; change the
initialization of self.output_dir_ to include the fold (e.g. incorporate
{self.fold} into the f-string) so each fold writes to a unique directory, and
ensure any calls that save the model (e.g. trainer.save_model) use this
fold-scoped self.output_dir_ instead of a shared path—also update the related
save calls around the tokenizer/save_pretrained area so both tokenizer and
trainer use the same fold-scoped directory.
🧹 Nitpick comments (12)
src/mattext/analysis/xtal2pot.py (1)

32-42: Consider strict=True to catch data integrity issues early.

Using strict=False (which is also the default) silently truncates when the DataFrame columns have mismatched lengths. If the CSVs fetched from Dropbox ever have inconsistent row counts, element mappings would be silently lost.

If data integrity is a concern, strict=True would raise a ValueError on length mismatch, making issues visible immediately.

Suggested change
 _ATOM_PARAMETERS["uff_energy"] = dict(
-    zip(_df_uff_energy["Element"], _df_uff_energy["energy"], strict=False)
+    zip(_df_uff_energy["Element"], _df_uff_energy["energy"], strict=True)
 )

 _ATOM_PARAMETERS["uff_xi"] = dict(
-    zip(_df_uff_xi["Element"], _df_uff_xi["xi"], strict=False)
+    zip(_df_uff_xi["Element"], _df_uff_xi["xi"], strict=True)
 )

 _ATOM_PARAMETERS["uff_z"] = dict(
-    zip(_df_uff_z["Element"], _df_uff_z["zi"], strict=False)
+    zip(_df_uff_z["Element"], _df_uff_z["zi"], strict=True)
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/analysis/xtal2pot.py` around lines 32 - 42, The zip calls that
build _ATOM_PARAMETERS ("uff_energy", "uff_xi", "uff_z") use strict=False which
silently ignores mismatched lengths; change these zip(...) calls to use
strict=True so a ValueError is raised if the DataFrame columns
(_df_uff_energy["Element"]/["energy"], _df_uff_xi["Element"]/["xi"],
_df_uff_z["Element"]/["zi"]) have differing lengths, ensuring any CSV/data
integrity issues surface immediately (handle the ValueError where appropriate if
you want custom error messaging).
revision-scripts/5fold_split.py (1)

25-34: Consider de-duplicating the split-file write logic.

Lines 25-34 repeat the same pattern for train and test writing. A small helper would reduce duplication and future drift.

♻️ Suggested refactor
 def split_dataset(input_json, output_dir, n_splits=5, random_state=42):
@@
     # Ensure output directory exists
     os.makedirs(output_dir, exist_ok=True)
 
+    def _write_fold_json(split_name, fold, payload):
+        file_path = os.path.join(output_dir, f"{split_name}_mp_classification_fold_{fold}.json")
+        with open(file_path, "w") as f:
+            json.dump(payload, f, indent=2)
+        return file_path
+
     # Perform the split
     for fold, (train_index, test_index) in enumerate(kf.split(data), 1):
         train_data = [data[i] for i in train_index]
         test_data = [data[i] for i in test_index]
 
-        # Save train data
-        train_file = os.path.join(
-            output_dir, f"train_mp_classification_fold_{fold}.json"
-        )
-        with open(train_file, "w") as f:
-            json.dump(train_data, f, indent=2)
-
-        # Save test data
-        test_file = os.path.join(output_dir, f"test_mp_classification_fold_{fold}.json")
-        with open(test_file, "w") as f:
-            json.dump(test_data, f, indent=2)
+        train_file = _write_fold_json("train", fold, train_data)
+        test_file = _write_fold_json("test", fold, test_data)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@revision-scripts/5fold_split.py` around lines 25 - 34, The train/test file
write logic in the 5fold_split.py block duplicates opening and json.dump calls
for train_data/test_data; extract a small helper (e.g., write_split_file or
save_fold_split) that accepts parameters (data, split_name, output_dir, fold)
and performs os.path.join, open, and json.dump, then replace the duplicate
blocks that set train_file/test_file and write to disk with calls to that helper
for "train" and "test" using train_data and test_data respectively; ensure you
preserve indent=2 and the filename pattern
f"{split_name}_mp_classification_fold_{fold}.json".
src/mattext/dataprep/matbench_prepare_data.py (1)

33-43: Module-level signal handler registration is a side effect on import.

Registering the SIGALRM handler at module level (line 43) affects the entire process whenever this module is imported, even if the timeout functionality isn't used. Consider moving the signal registration inside the functions that need it, or using a context manager pattern.

Additionally, note that revision-scripts/text_rep.py catches TimeoutError from concurrent.futures, not this custom TimeoutException. If these modules are expected to share exception-handling patterns, this inconsistency could cause timeouts to go unhandled.

♻️ Consider lazy registration pattern
-# Change the behavior of SIGALRM
-signal.signal(signal.SIGALRM, timeout_handler)
+_signal_handler_registered = False
+
+def _ensure_signal_handler():
+    global _signal_handler_registered
+    if not _signal_handler_registered:
+        signal.signal(signal.SIGALRM, timeout_handler)
+        _signal_handler_registered = True

Then call _ensure_signal_handler() at the start of process_entry_train_matbench and process_entry_test_matbench.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/dataprep/matbench_prepare_data.py` around lines 33 - 43, The
module currently registers the SIGALRM handler at import time
(signal.signal(signal.SIGALRM, timeout_handler)), which is a global side effect;
move that registration into a lazy helper (e.g. _ensure_signal_handler) or a
context manager and call it from the functions that actually use timeouts
(process_entry_train_matbench and process_entry_test_matbench) so the handler is
only installed when needed; also reconcile the exception type by either
translating your custom TimeoutException to concurrent.futures.TimeoutError
where those callers expect it (or raise TimeoutError directly) so external code
that catches concurrent.futures.TimeoutError will correctly handle timeout
events.
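The context-manager alternative mentioned above could look like this (a sketch, Unix-only since it relies on SIGALRM; `alarm_timeout` and the example `process_entry` are illustrative names, not code from the module):

```python
import signal
import time
from contextlib import contextmanager


class TimeoutException(Exception):
    pass


@contextmanager
def alarm_timeout(seconds: int):
    """Install a SIGALRM handler only for the duration of the block."""
    def _handler(signum, frame):
        raise TimeoutException(f"timed out after {seconds}s")

    previous = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the prior handler


def process_entry(delay: float) -> str:
    """Illustrative worker: succeeds quickly or is interrupted by the alarm."""
    try:
        with alarm_timeout(1):
            time.sleep(delay)
        return "ok"
    except TimeoutException:
        return "timed out"
```

This keeps the handler registration scoped to the code that needs it, so importing the module has no process-wide side effect.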
pyproject.toml (1)

31-32: Consider pinning git dependencies to a specific commit or tag.

Both matbench and xtal2txt use git URLs without pinning to a specific ref. This can lead to non-reproducible builds if the main branch changes. Consider pinning:

"xtal2txt @ git+https://github.com/lamalab-org/xtal2txt.git@v1.0.0"  # or commit SHA

This is optional if the dependency is under active development and you want the latest changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 31 - 32, The git dependencies for matbench and
xtal2txt are unpinned which breaks reproducibility; update the dependency
strings "matbench @ git+https://github.com/materialsproject/matbench.git" and
"xtal2txt @ git+https://github.com/lamalab-org/xtal2txt.git" to include a
specific tag or commit SHA (e.g., append `@vX.Y.Z` or @<commit-sha>) so builds are
deterministic, or document why leaving them floating is intentional if you
prefer tracking latest changes.
src/mattext/models/pretrain.py (1)

112-112: Remove stale commented-out code.

Line 112 looks like leftover debug/experiment code; better remove it to keep the training path clean.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/pretrain.py` at line 112, Remove the stale commented-out
debug line that sets config_model_args["max_position_embeddings"] =
self.context_length in src/mattext/models/pretrain.py; delete this commented
code (the line referencing config_model_args, max_position_embeddings and
self.context_length) to keep the training code clean and avoid leaving legacy
debug artifacts.
src/mattext/models/llama_sft.py (1)

240-241: Replace deprecated torch.cuda.amp.autocast() with torch.amp.autocast(device_type='cuda').

torch.cuda.amp.autocast() is deprecated in favor of the unified AMP API. Use torch.amp.autocast(device_type='cuda') instead.

Proposed fix
-            with torch.cuda.amp.autocast():
+            with torch.amp.autocast(device_type='cuda'):
                 pred = pipe(self.formatting_tests_func(self.testdata))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama_sft.py` around lines 240 - 241, Replace the
deprecated torch.cuda.amp.autocast() context manager with the unified API: use
torch.amp.autocast(device_type='cuda') around the inference call (the block
using pipe(self.formatting_tests_func(self.testdata))). Update the single
context manager invocation in the method containing that line to
torch.amp.autocast(device_type='cuda') so AMP is used via the unified API while
keeping the existing call to pipe and formatting_tests_func/testdata unchanged.
src/mattext/models/llama.py (2)

115-119: Redundant "=" separators in warning messages.

Using logger.warning() with separator lines is noisy. Consider a single concise warning message.

Suggested fix
-            logger.warning("=" * 80)
-            logger.warning(
-                "Your GPU supports bfloat16: accelerate training with bf16=True"
-            )
-            logger.warning("=" * 80)
+            logger.warning(
+                "Your GPU supports bfloat16 - consider enabling bf16=True for faster training"
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama.py` around lines 115 - 119, The three consecutive
logger.warning calls that print separator lines and a message are noisy;
consolidate them into a single concise warning by removing the "=" separator
lines and calling logger.warning once with a clear message (e.g., "Your GPU
supports bfloat16; consider using bf16=True to accelerate training") where the
current logger.warning calls appear in src/mattext/models/llama.py.

216-228: Redundant "WARNING:" prefix in logger.warning() calls.

Same issue as in finetune.py - the logger.warning() method already indicates WARNING level.

Suggested fix
         if config_dict.get("load_best_model_at_end", False):
             logger.warning(
-                f"[Rank {self.local_rank}] WARNING: Disabling load_best_model_at_end in DDP mode"
+                f"[Rank {self.local_rank}] Disabling load_best_model_at_end in DDP mode"
             )
             config_dict["load_best_model_at_end"] = False
         if config_dict.get("save_on_each_node", False):
             logger.warning(
-                f"[Rank {self.local_rank}] WARNING: save_on_each_node should be False in DDP"
+                f"[Rank {self.local_rank}] save_on_each_node should be False in DDP"
             )
             config_dict["save_on_each_node"] = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama.py` around lines 216 - 228, The logger.warning
messages in the DDP handling block include a redundant "WARNING:" prefix; update
the two logger.warning calls in the method that manipulates
config_dict/local_rank (the block checking "load_best_model_at_end" and
"save_on_each_node") to remove the literal "WARNING:" prefix so the messages
read e.g. f"[Rank {self.local_rank}] Disabling load_best_model_at_end in DDP
mode" and f"[Rank {self.local_rank}] save_on_each_node should be False in DDP"
respectively, leaving the rest of the logic (setting config_dict[...] = False)
unchanged.
src/mattext/models/finetune.py (2)

86-98: Redundant "WARNING:" prefix in logger.warning() calls.

The logger.warning() method already outputs messages at WARNING level. Including "WARNING:" in the message string is redundant.

Suggested fix
         if config_dict.get("load_best_model_at_end", False):
             logger.warning(
-                f"[Rank {self.local_rank}] WARNING: Disabling load_best_model_at_end in DDP mode"
+                f"[Rank {self.local_rank}] Disabling load_best_model_at_end in DDP mode"
             )
             config_dict["load_best_model_at_end"] = False
         if config_dict.get("save_on_each_node", False):
             logger.warning(
-                f"[Rank {self.local_rank}] WARNING: save_on_each_node should be False in DDP"
+                f"[Rank {self.local_rank}] save_on_each_node should be False in DDP"
             )
             config_dict["save_on_each_node"] = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/finetune.py` around lines 86 - 98, The logger.warning
messages in the DDP handling block (inside finetune logic where config_dict is
created from config_train_args) redundantly include the literal "WARNING:"
prefix; update the two logger.warning(...) calls (the ones guarded by
self.local_rank and checking config_dict.get("load_best_model_at_end") and
config_dict.get("save_on_each_node")) to remove the "WARNING:" text so they read
e.g. f"[Rank {self.local_rank}] Disabling load_best_model_at_end in DDP mode"
and f"[Rank {self.local_rank}] save_on_each_node should be False in DDP" while
leaving the rest of the logic and assignments (setting the config_dict keys to
False) unchanged.

137-144: Consider adding a synchronization barrier before non-main ranks exit.

All ranks call evaluate(), but only the main process saves the model and calls wandb.finish(). If non-main ranks exit while the main process is still saving, this could cause issues in some distributed setups. Consider adding torch.distributed.barrier() after the gated block to ensure all ranks wait for the main process to complete saving before any rank terminates.

Suggested fix
     if is_main:
         wandb.log(eval_result)
         trainer.save_model(self.cfg.path.finetuned_modelname)
         wandb.finish()

+    # Ensure all ranks wait for main process to finish saving
+    if self.local_rank is not None:
+        torch.distributed.barrier()
+
     return self.cfg.path.finetuned_modelname
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/finetune.py` around lines 137 - 144, The code calls
trainer.evaluate(...) on all ranks but only the main rank runs wandb.log,
trainer.save_model, and wandb.finish, so add a synchronization point after that
gated block to ensure non-main ranks wait for the main rank to finish saving and
finalizing wandb; specifically, after the if is_main: ... block call
torch.distributed.barrier() when torch.distributed.is_initialized() is True (or
skip if not initialized) so all ranks block until the main process completes
save/finish operations.
revision-scripts/prep_rep.py (1)

26-28: Update process_entry_train_matbench and process_entry_test_matbench to accept and pass the transformation parameter.

These functions don't currently accept transformations as a parameter or pass it to TextRep.from_input(). The equivalent functions in src/mattext/dataprep/matbench_prepare_data.py include transformations: dict | None = None in their signatures and pass it via transformation=transformations to TextRep.from_input(). This divergence can cause inconsistent behavior between the two scripts.

Update both function signatures to include transformations: dict | None = None and pass it to the TextRep.from_input() calls at lines 26-28 and 58-60.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@revision-scripts/prep_rep.py` around lines 26 - 28, Add a new optional
parameter transformations: dict | None = None to both
process_entry_train_matbench and process_entry_test_matbench, and pass it into
TextRep.from_input by calling TextRep.from_input(entry["structure"],
transformation=transformations) so the transformation dict is forwarded; update
both function signatures and the corresponding TextRep.from_input invocations
(the calls around the existing get_requested_text_reps usages) to include
transformation=transformations.
revision-scripts/text_rep.py (1)

26-28: Add transformation parameter for consistency with matbench pipeline.

src/mattext/dataprep/matbench_prepare_data.py passes transformation=transformations to TextRep.from_input(), but this script omits it at lines 26-28 and 57-59. This inconsistency could cause behavior differences between the two data processing pipelines. Requires adding the parameter to function signatures and threading it through the call chain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@revision-scripts/text_rep.py` around lines 26 - 28, The TextRep.from_input
call is missing the transformation parameter (used in matbench pipeline) causing
inconsistent behavior; update TextRep.from_input invocations in this file (the
call at TextRep.from_input(entry["structure"]).get_requested_text_reps(...) and
the similar call later) to pass transformation=transformations, and then add a
transformation argument to any local function/method signatures that build or
forward the call so the parameter is threaded through; specifically locate and
update the function that calls TextRep.from_input as well as any helper
functions around get_requested_text_reps to accept and forward transformation to
ensure parity with src/mattext/dataprep/matbench_prepare_data.py.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a3148b49-3948-4061-a797-680cc7291c3f

📥 Commits

Reviewing files that changed from the base of the PR and between e7b6a9a and efe2737.

📒 Files selected for processing (138)
  • .pre-commit-config.yaml
  • LICENSE
  • conf/archived_experiments/config-hydra.yaml
  • conf/archived_experiments/config-potential.yaml
  • conf/archived_experiments/config-sft.yaml
  • conf/archived_experiments/config-smiles.yaml
  • conf/archived_experiments/config.yaml
  • conf/archived_experiments/santiago/cifp1.yaml
  • conf/archived_experiments/santiago_100k/cifp1.yaml
  • conf/archived_experiments/testing_perturb_100/cifp1.yaml
  • conf/archived_experiments/testing_perturb_100/cifsymmetrized.yaml
  • conf/archived_experiments/testing_perturb_100/crystal_llm.yaml
  • conf/archived_experiments/testing_perturb_300/cifp1.yaml
  • conf/archived_experiments/testing_perturb_300/cifsymmetrized.yaml
  • conf/archived_experiments/testing_perturb_300/crystal_llm.yaml
  • conf/archived_experiments/testing_translate_100/cifsymmetrized.yaml
  • conf/archived_experiments/testing_translate_100/crystal_llm.yaml
  • conf/archived_experiments/testing_translate_300/cifsymmetrized.yaml
  • conf/archived_experiments/testing_translate_300/crystal_llm.yaml
  • conf/bandgap.yaml
  • conf/benchmark.yaml
  • conf/bg/atoms.yaml
  • conf/bg/atoms_params.yaml
  • conf/bg/cifp1.yaml
  • conf/bg/cifpsym.yaml
  • conf/bg/composition.yaml
  • conf/bg/crystal_llm.yaml
  • conf/bg/local_env.yaml
  • conf/bg/slices.yaml
  • conf/bg/zmatrix.yaml
  • conf/classification.yaml
  • conf/form/atoms.yaml
  • conf/form/atoms_params.yaml
  • conf/form/cifp1.yaml
  • conf/form/cifpsym.yaml
  • conf/form/composition.yaml
  • conf/form/crystal_llm.yaml
  • conf/form/local_env.yaml
  • conf/form/slices.yaml
  • conf/form/zmatrix.yaml
  • conf/form_energy.yaml
  • conf/group-test/composition.yaml
  • conf/group-test/slices.yaml
  • conf/is_metal/atoms.yaml
  • conf/is_metal/atoms_params.yaml
  • conf/is_metal/cifp1.yaml
  • conf/is_metal/cifpsym.yaml
  • conf/is_metal/composition.yaml
  • conf/is_metal/crystal_llm.yaml
  • conf/is_metal/local_env.yaml
  • conf/is_metal/slices.yaml
  • conf/is_metal/zmatrix.yaml
  • conf/llama_8b_bg/atoms.yaml
  • conf/llama_8b_bg/atoms_params.yaml
  • conf/llama_8b_bg/cifp1.yaml
  • conf/llama_8b_bg/composition.yaml
  • conf/llama_8b_bg/crystal_llm.yaml
  • conf/llm_sft.yaml
  • conf/model/archived_base/archived/finetune_template.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh_atoms.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh_permuted copy.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh_permuted.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh_unfiltered.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh_zmatrix.yaml
  • conf/model/archived_base/archived/finetune_template_kvrh_atoms.yaml
  • conf/model/archived_base/archived/finetune_template_kvrh_permuted copy.yaml
  • conf/model/archived_base/archived/finetune_template_kvrh_permuted.yaml
  • conf/model/archived_base/archived/finetune_template_kvrh_unfiltered.yaml
  • conf/model/archived_base/archived/finetune_template_kvrh_zmatrix.yaml
  • conf/model/archived_base/finetune_template_gvrh.yaml
  • conf/model/archived_base/finetune_template_gvrh_perturb.yaml
  • conf/model/archived_base/finetune_template_gvrh_smiles.yaml
  • conf/model/archived_base/finetune_template_gvrh_translate.yaml
  • conf/model/archived_base/finetune_template_kvrh_perturb.yaml
  • conf/model/archived_base/finetune_template_kvrh_translate.yaml
  • conf/model/archived_base/llama_gvrh_sft_nb_fold_2.yaml
  • conf/model/benchmark_example.yaml
  • conf/model/classification_example.yaml
  • conf/model/formation_energy.yaml
  • conf/model/llama_8b.yaml
  • conf/model/llama_example.yaml
  • conf/model/pretrain_example.yaml
  • conf/model/pretrain_other_models.yaml
  • conf/model/pretrain_own_data_example.yaml
  • conf/pretrain.yaml
  • docs/api.md
  • docs/index.md
  • docs/representations.md
  • docs/tokenizers.md
  • mkdocs.yml
  • notebooks/dataprep.ipynb
  • pyproject.toml
  • revision-scripts/5fold_split.py
  • revision-scripts/mp_classification.py
  • revision-scripts/prep_json.py
  • revision-scripts/prep_rep.py
  • revision-scripts/text_rep.py
  • scripts/filterdataset.py
  • scripts/linear_potential.py
  • scripts/llama-eval-responses/llama_evals_matbench_dielectric_cif_p1.json
  • scripts/llama-eval-responses/llama_evals_matbench_dielectric_composition.json
  • scripts/llama-eval-responses/llama_evals_matbench_dielectric_crystal_llm_rep.json
  • scripts/llama-eval-responses/llama_evals_matbench_dielectric_slice.json
  • scripts/llama-eval-responses/llama_evals_matbench_log_gvrh_cif_p1.json
  • scripts/llama-eval-responses/llama_evals_matbench_log_gvrh_composition.json
  • scripts/llama-eval-responses/llama_evals_matbench_log_gvrh_crystal_llm_rep.json
  • scripts/llama-eval-responses/llama_evals_matbench_log_kvrh_cif_p1.json
  • scripts/llama-eval-responses/llama_evals_matbench_log_kvrh_composition.json
  • scripts/llama-eval-responses/llama_evals_matbench_log_kvrh_composition_old.json
  • scripts/llama-eval-responses/llama_evals_matbench_log_kvrh_crystal_llm_rep.json
  • scripts/llama-eval-responses/llama_evals_matbench_log_kvrh_slice.json
  • scripts/llama-eval-responses/llama_evals_matbench_perovskites_cif_p1.json
  • scripts/llama-eval-responses/llama_evals_matbench_perovskites_composition.json
  • scripts/llama-eval-responses/llama_evals_matbench_perovskites_crystal_llm_rep.json
  • scripts/llama-eval-responses/llama_evals_matbench_perovskites_slice.json
  • scripts/llama_sft_evals.py
  • scripts/modify_mb_json.py
  • scripts/nomad_postprocess_parallel.py
  • scripts/qmof_prepare_data.py
  • src/mattext/analysis/attention.py
  • src/mattext/analysis/xtal2pot.py
  • src/mattext/dataprep/matbench_prepare_data.py
  • src/mattext/dataprep/nomad_prepare_data.py
  • src/mattext/main.py
  • src/mattext/models/benchmark.py
  • src/mattext/models/finetune.py
  • src/mattext/models/helper.py
  • src/mattext/models/inference.py
  • src/mattext/models/llama.py
  • src/mattext/models/llama_sft.py
  • src/mattext/models/potential.py
  • src/mattext/models/predict.py
  • src/mattext/models/pretrain.py
  • src/mattext/models/score.py
  • src/mattext/models/utils.py
  • tests/test_imports.py
  • tests/test_xtal2pot.py
💤 Files with no reviewable changes (22)
  • conf/bg/crystal_llm.yaml
  • conf/archived_experiments/testing_perturb_300/cifp1.yaml
  • conf/archived_experiments/testing_perturb_100/cifp1.yaml
  • conf/group-test/slices.yaml
  • conf/is_metal/atoms.yaml
  • conf/llama_8b_bg/composition.yaml
  • conf/bg/atoms_params.yaml
  • conf/bg/composition.yaml
  • conf/is_metal/crystal_llm.yaml
  • conf/form/composition.yaml
  • conf/archived_experiments/santiago/cifp1.yaml
  • conf/llama_8b_bg/atoms_params.yaml
  • revision-scripts/mp_classification.py
  • conf/bg/atoms.yaml
  • conf/is_metal/composition.yaml
  • conf/llama_8b_bg/cifp1.yaml
  • conf/llama_8b_bg/atoms.yaml
  • conf/form/crystal_llm.yaml
  • conf/form/atoms_params.yaml
  • conf/is_metal/atoms_params.yaml
  • conf/form/atoms.yaml
  • conf/group-test/composition.yaml
✅ Files skipped from review due to trivial changes (62)
  • conf/archived_experiments/testing_translate_300/crystal_llm.yaml
  • conf/bg/slices.yaml
  • conf/archived_experiments/testing_perturb_100/cifsymmetrized.yaml
  • conf/archived_experiments/testing_translate_300/cifsymmetrized.yaml
  • conf/model/archived_base/llama_gvrh_sft_nb_fold_2.yaml
  • conf/is_metal/zmatrix.yaml
  • conf/model/archived_base/archived/finetune_template_kvrh_permuted.yaml
  • conf/archived_experiments/config-sft.yaml
  • conf/model/archived_base/archived/finetune_template_kvrh_permuted copy.yaml
  • conf/form/cifp1.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh_unfiltered.yaml
  • conf/model/archived_base/finetune_template_kvrh_perturb.yaml
  • conf/model/llama_example.yaml
  • mkdocs.yml
  • notebooks/dataprep.ipynb
  • conf/archived_experiments/testing_perturb_100/crystal_llm.yaml
  • conf/model/formation_energy.yaml
  • conf/bg/local_env.yaml
  • conf/is_metal/cifp1.yaml
  • conf/model/archived_base/finetune_template_gvrh.yaml
  • conf/model/classification_example.yaml
  • conf/form/cifpsym.yaml
  • conf/archived_experiments/testing_translate_100/crystal_llm.yaml
  • docs/representations.md
  • conf/model/pretrain_own_data_example.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh_permuted.yaml
  • conf/bg/cifp1.yaml
  • conf/bg/zmatrix.yaml
  • conf/model/archived_base/finetune_template_kvrh_translate.yaml
  • conf/model/archived_base/archived/finetune_template_kvrh_atoms.yaml
  • conf/form/slices.yaml
  • conf/model/archived_base/archived/finetune_template_kvrh_unfiltered.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh_atoms.yaml
  • conf/archived_experiments/config-smiles.yaml
  • conf/pretrain.yaml
  • LICENSE
  • conf/model/llama_8b.yaml
  • conf/form_energy.yaml
  • conf/benchmark.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh.yaml
  • conf/archived_experiments/config-hydra.yaml
  • conf/bg/cifpsym.yaml
  • conf/archived_experiments/testing_perturb_300/cifsymmetrized.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh_zmatrix.yaml
  • conf/is_metal/slices.yaml
  • conf/archived_experiments/config.yaml
  • conf/model/archived_base/archived/finetune_template_gvrh_permuted copy.yaml
  • conf/classification.yaml
  • conf/archived_experiments/testing_perturb_300/crystal_llm.yaml
  • conf/model/benchmark_example.yaml
  • conf/model/pretrain_other_models.yaml
  • conf/llama_8b_bg/crystal_llm.yaml
  • conf/is_metal/local_env.yaml
  • conf/bandgap.yaml
  • conf/model/archived_base/archived/finetune_template_kvrh_zmatrix.yaml
  • conf/form/zmatrix.yaml
  • conf/form/local_env.yaml
  • conf/model/archived_base/archived/finetune_template.yaml
  • tests/test_imports.py
  • conf/is_metal/cifpsym.yaml
  • tests/test_xtal2pot.py
  • docs/index.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • .pre-commit-config.yaml

???+ tip "tip"

You can access the `[CLS]` token using the `cls_token` attribute of the tokenizer.

⚠️ Potential issue | 🟡 Minor

Resolve markdownlint MD046 in this section.

The warning points to this changed block; please convert the affected code block style to fenced format so docs lint passes consistently.

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 35-35: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/tokenizers.md` at line 35, Convert the inline/backticked code mention
into a fenced code block to satisfy markdownlint MD046: replace the current
inline sentence "You can access the `[CLS]` token using the `cls_token`
attribute of the tokenizer." with a fenced code block (triple backticks)
containing a short example showing the [CLS] token and the cls_token attribute
(referencing cls_token and [CLS] so reviewers can find the spot), ensuring the
fenced block uses correct language marker (e.g., plaintext or python) and
preserves the explanatory text around it.
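For reference, a fenced version of the flagged snippet could look like this (a sketch only; the admonition wrapper and tokenizer example are illustrative, the actual docs content should be preserved):

````markdown
???+ tip "tip"

    You can access the `[CLS]` token using the `cls_token` attribute of the tokenizer:

    ```python
    tokenizer.cls_token
    ```
````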




def process_entry_train_matbench(entry: dict, timeout: int) -> dict:

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

git ls-files revision-scripts/ | head -20

Repository: lamalab-org/MatText

Length of output: 258


🏁 Script executed:

wc -l revision-scripts/prep_rep.py

Repository: lamalab-org/MatText

Length of output: 95


🏁 Script executed:

cat -n revision-scripts/prep_rep.py | head -100

Repository: lamalab-org/MatText

Length of output: 3939


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "=== Function definitions ==="
rg -nP 'def process_entry_(train|test)_matbench\(.*timeout' revision-scripts/prep_rep.py
echo ""
echo "=== executor.map calls ==="
rg -nP 'executor\.map\(' revision-scripts/prep_rep.py
echo ""
echo "=== future.result with timeout ==="
rg -nP 'future\.result\s*\(\s*timeout=' revision-scripts/prep_rep.py

Repository: lamalab-org/MatText

Length of output: 373


🏁 Script executed:

cat -n revision-scripts/prep_rep.py | tail -60

Repository: lamalab-org/MatText

Length of output: 2350


🏁 Script executed:

rg -l 'TextRep' revision-scripts/

Repository: lamalab-org/MatText

Length of output: 120


Enforce timeout at pool result retrieval—currently, long-running entries can stall batch completion.

The timeout parameter is passed through to entry functions but never enforced during pool execution. The executor.map() call (line 90) lacks a timeout at result retrieval, and the except TimeoutError handlers (lines 47–48, 78–79) in process_entry_train_matbench and process_entry_test_matbench are ineffective since no mechanism triggers a timeout exception within those functions.

Switch to executor.submit() with future.result(timeout=timeout) to enforce timeouts per entry:

Proposed fix
 def process_batch(num_workers, batch, timeout, process_entry_func):
-    process_entry_with_timeout = partial(process_entry_func, timeout=timeout)
-
-    with ProcessPoolExecutor(max_workers=num_workers) as executor:
-        results = list(executor.map(process_entry_with_timeout, batch))
-
-    return [result for result in results if result is not None]
+    with ProcessPoolExecutor(max_workers=num_workers) as executor:
+        futures = [executor.submit(process_entry_func, entry, timeout) for entry in batch]
+        results = []
+        for future in futures:
+            try:
+                result = future.result(timeout=timeout)
+            except TimeoutError:
+                future.cancel()
+                continue
+            if result is not None:
+                results.append(result)
+    return results

Also applies to: 47–48, 55, 78–79, 87–92

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 24-24: Unused function argument: timeout

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@revision-scripts/prep_rep.py` at line 24, The pool result retrieval never
enforces the passed timeout because executor.map is used; update
process_entry_train_matbench and process_entry_test_matbench to submit tasks
with executor.submit(...) instead of executor.map and then call
future.result(timeout=timeout) for each Future so the TimeoutError handlers
actually trigger; replace the iterator over map results with collecting Futures
(from submit), iterate them and call future.result(timeout=timeout) catching
concurrent.futures.TimeoutError to handle per-entry timeouts and cancel
timed-out futures as needed.
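As a self-contained illustration of the submit/result pattern proposed above (using ThreadPoolExecutor so the snippet runs without module-level pickling concerns; the pattern is identical for ProcessPoolExecutor, and the entry dicts and delays are invented for the demo):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def process_entry(entry: dict) -> str:
    time.sleep(entry["delay"])
    return entry["mbid"]


def process_batch(batch, timeout, num_workers=4):
    # submit() returns Futures, so the timeout is enforced when each
    # result is retrieved; executor.map() without a timeout offers no
    # per-entry enforcement.
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(process_entry, e) for e in batch]
        for future in futures:
            try:
                results.append(future.result(timeout=timeout))
            except FutureTimeout:
                # The worker itself is not killed; its result is skipped.
                continue
    return results


batch = [
    {"mbid": "mb-1", "delay": 0.0},
    {"mbid": "mb-2", "delay": 2.0},  # exceeds the timeout below
    {"mbid": "mb-3", "delay": 0.0},
]
results = process_batch(batch, timeout=0.5)
print(results)
```

Note that `future.result(timeout=...)` only stops waiting; with processes the stuck worker still occupies a pool slot until it finishes or the pool is torn down.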

Comment on lines +55 to 56
def process_entry_test_matbench(entry: list, timeout: int) -> dict:
# Ensure the give_slice function and necessary data are picklable

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if file exists and get line count
if [ -f "revision-scripts/prep_rep.py" ]; then
  echo "=== File found ==="
  wc -l revision-scripts/prep_rep.py
  echo ""
  
  # Show lines around 55 to see the function definition
  echo "=== Lines 50-65 (function definition and usage context) ==="
  sed -n '50,65p' revision-scripts/prep_rep.py | cat -n
  echo ""
  
  # Show lines around 76 to see other usage
  echo "=== Lines 70-85 (checking line 76 usage) ==="
  sed -n '70,85p' revision-scripts/prep_rep.py | cat -n
else
  echo "File not found"
  exit 1
fi

Repository: lamalab-org/MatText

Length of output: 1531


Change entry type annotation from list to dict.

The function signature at line 55 declares entry: list, but the implementation accesses it as a dictionary (lines 59 and 76 use entry["structure"] and entry["mbid"]).

-def process_entry_test_matbench(entry: list, timeout: int) -> dict:
+def process_entry_test_matbench(entry: dict, timeout: int) -> dict:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-def process_entry_test_matbench(entry: list, timeout: int) -> dict:
+def process_entry_test_matbench(entry: dict, timeout: int) -> dict:
 # Ensure the give_slice function and necessary data are picklable
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 55-55: Unused function argument: timeout

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@revision-scripts/prep_rep.py` around lines 55 - 56, The function
process_entry_test_matbench currently types its parameter as entry: list but
treats it like a dict (uses entry["structure"] and entry["mbid"]); update the
type annotation to entry: dict (or a more specific TypedDict/Mapping if
available) in the process_entry_test_matbench signature and adjust any related
type hints or imports so the code and type annotations match the actual usage of
entry["structure"] and entry["mbid"].




def process_entry_train_matbench(entry: dict, timeout: int) -> dict:

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# First, let's read the relevant sections of the file to understand the implementation
echo "=== Reading lines 20-100 of text_rep.py ==="
sed -n '20,100p' revision-scripts/text_rep.py | cat -n

echo ""
echo "=== Checking file size ==="
wc -l revision-scripts/text_rep.py

Repository: lamalab-org/MatText

Length of output: 3403


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Searching for executor.map usage ==="
rg -nP 'executor\.map\(' revision-scripts/text_rep.py -A 2 -B 2

echo ""
echo "=== Searching for future.result with timeout ==="
rg -nP 'future\.result\s*\(\s*timeout=' revision-scripts/text_rep.py -A 1 -B 1

echo ""
echo "=== Searching for TimeoutError except blocks ==="
rg -nP 'except\s+TimeoutError' revision-scripts/text_rep.py -A 2 -B 2

echo ""
echo "=== Searching for process_batch function ==="
rg -nP 'def process_batch' revision-scripts/text_rep.py -A 20

Repository: lamalab-org/MatText

Length of output: 373


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full process_batch function ==="
rg -nP 'def process_batch' revision-scripts/text_rep.py -A 10

echo ""
echo "=== Checking imports at top of file ==="
head -20 revision-scripts/text_rep.py

echo ""
echo "=== Full context of process_entry_train_matbench function ==="
rg -nP 'def process_entry_train_matbench' revision-scripts/text_rep.py -A 12

Repository: lamalab-org/MatText

Length of output: 1555


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify the actual behavior of executor.map() 
# Check Python documentation to confirm timeout enforcement
python3 << 'EOF'
from concurrent.futures import ProcessPoolExecutor
import inspect

# Check if map() has a timeout parameter
sig = inspect.signature(ProcessPoolExecutor.map)
print("ProcessPoolExecutor.map() signature:")
print(sig)
print("\nParameters:")
for param_name, param in sig.parameters.items():
    print(f"  - {param_name}: {param}")
EOF

Repository: lamalab-org/MatText

Length of output: 267


timeout parameter is not enforced at the executor level.

At line 89, executor.map() is called without passing the timeout parameter. The timeout is only forwarded to worker functions via partial(), which does not enforce executor-level timeouts. A stuck worker can block the batch indefinitely. The TimeoutError handlers at lines 27-29 and 58-60 would only catch errors raised within worker code, not executor timeouts.

Apply the proposed fix to pass timeout to executor.map() or use future.result(timeout=timeout) as shown.
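A hedged sketch of the caller-side deadline this comment recommends. It uses ThreadPoolExecutor so the demo is self-contained (the script under review uses ProcessPoolExecutor, whose workers must be picklable); slow_worker and process_with_deadline are illustrative names, not from the repository:

```python
import concurrent.futures
import time

def slow_worker(x):
    # Stands in for a give_slice-style call that can hang.
    time.sleep(0.5)
    return x * 2

def process_with_deadline(items, timeout):
    # submit() + future.result(timeout=...) enforces the budget on the
    # caller side; executor.map() without timeout= waits indefinitely.
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(slow_worker, item) for item in items]
        for future in futures:
            try:
                results.append(future.result(timeout=timeout))
            except concurrent.futures.TimeoutError:
                results.append(None)  # mark the stuck item instead of blocking
    return results
```

Note that the exception raised here is concurrent.futures.TimeoutError (a distinct class from the builtin TimeoutError before Python 3.11), which matches the comment's point that worker-level handlers would not see executor timeouts.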

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 24-24: Unused function argument: timeout

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@revision-scripts/text_rep.py` at line 24, The function
process_entry_train_matbench currently uses executor.map(...) without enforcing
the timeout, relying only on the partial-wrapped worker timeout; update the call
to enforce executor-level timeouts by either passing timeout=timeout to
executor.map(...) or by switching to executor.submit(...) and calling
future.result(timeout=timeout) for each future (and keep the partial-wrapped
worker as-is). Also ensure the existing TimeoutError handlers catch
concurrent.futures.TimeoutError from the executor, not just worker-raised errors
so stuck workers are properly handled.

Comment on lines +54 to 55
def process_entry_test_matbench(entry: list, timeout: int) -> dict:
# Ensure the give_slice function and necessary data are picklable

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# First, let's see the function definition and surrounding lines
echo "=== Function definition and context ==="
cat -n revision-scripts/text_rep.py | sed -n '50,80p'

echo -e "\n=== Search for entry annotation ==="
rg -nP 'def process_entry_test_matbench\(entry:\s*list' revision-scripts/text_rep.py

echo -e "\n=== Search for mapping access patterns ==="
rg -nP 'entry\["structure"\]|entry\["mbid"\]' revision-scripts/text_rep.py

Repository: lamalab-org/MatText

Length of output: 1738


Fix entry annotation in test processor (list → dict).

At line 54, the function annotates entry as list, but the code uses mapping access with string keys at lines 58 and 75 (entry["structure"] and entry["mbid"]). List indexing requires integers; this should be annotated as dict.

Proposed fix
-def process_entry_test_matbench(entry: list, timeout: int) -> dict:
+def process_entry_test_matbench(entry: dict, timeout: int) -> dict:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-def process_entry_test_matbench(entry: list, timeout: int) -> dict:
+def process_entry_test_matbench(entry: dict, timeout: int) -> dict:
     # Ensure the give_slice function and necessary data are picklable
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 54-54: Unused function argument: timeout

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@revision-scripts/text_rep.py` around lines 54 - 55, The type annotation for
the parameter in process_entry_test_matbench is incorrect: change the annotation
of entry from list to dict because the function accesses entry["structure"] and
entry["mbid"]; update the signature of process_entry_test_matbench(entry: dict,
timeout: int) -> dict to reflect this and ensure any related callers or imports
expecting a list are adjusted to pass a mapping with those keys.

"""
Return a datapoint
"""
"""Return a datapoint."""

⚠️ Potential issue | 🟡 Minor

Fix small docstring typo.

Line 38 has a double space ("a  datapoint"). Please normalize to "a datapoint" for doc quality consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/nomad_postprocess_parallel.py` at line 38, Fix the docstring typo by
replacing the double space in the string "Return a  datapoint." with a single
space so it reads "Return a datapoint."; update the triple-quoted docstring
containing that text (the function/method that returns a datapoint) to the
corrected wording to normalize doc quality.

@@ -1,5 +1,4 @@
-"""
-matbench_prepare_data.py
+"""matbench_prepare_data.py.

⚠️ Potential issue | 🟡 Minor

Fix stale module name in the file header docstring.

Line 1 says matbench_prepare_data.py, but this file is scripts/qmof_prepare_data.py. Please align it to avoid confusion in docs/search.

Suggested fix
-"""matbench_prepare_data.py.
+"""qmof_prepare_data.py.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-"""matbench_prepare_data.py.
+"""qmof_prepare_data.py.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/qmof_prepare_data.py` at line 1, The module-level docstring at the
top of the file incorrectly names the file "matbench_prepare_data.py"; update
that header docstring to the correct module name "qmof_prepare_data.py" so the
file header accurately reflects this script (check the top-of-file docstring in
scripts/qmof_prepare_data.py and replace the stale name).

Comment on lines 260 to 263
fig, axs = plt.subplots(len(unique_tokens), 1, figsize=(10, len(unique_tokens) * 5))

-for ax, token in zip(axs, unique_tokens):
+for ax, token in zip(axs, unique_tokens, strict=False):
# Extract the weights for the current token

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/mattext/analysis/attention.py | sed -n '250,275p'

Repository: lamalab-org/MatText

Length of output: 1236


🏁 Script executed:

head -30 src/mattext/analysis/attention.py

Repository: lamalab-org/MatText

Length of output: 750


🏁 Script executed:

web_search
matplotlib.pyplot.subplots return type for 1x1 grid

Repository: lamalab-org/MatText

Length of output: 178


🌐 Web query:

matplotlib pyplot subplots return type single subplot

💡 Result:

matplotlib.pyplot.subplots() returns (fig, ax) where ax is a single matplotlib.axes.Axes object (a scalar), not an array, when you create exactly one subplot (nrows=1, ncols=1) and squeeze=True (the default). [1][2]

If you want a consistent array return type even for 1×1, use:

fig, axs = plt.subplots(1, 1, squeeze=False)  # axs is a 2D array with shape (1, 1)
ax = axs[0, 0]

With the default squeeze=True, ax becomes:

  • scalar Axes for 1×1
  • 1D array of Axes for Nx1 or 1xM
  • 2D array of Axes for NxM (N>1 and M>1) [1]

Sources: [1] Matplotlib pyplot.subplots docs, [2] Matplotlib subplots demo gallery.


Fix iteration over matplotlib subplots when only one token exists.

When len(unique_tokens) == 1, plt.subplots(1, 1, ...) returns a single Axes object (not an array) due to the default squeeze=True behavior. Attempting to iterate over it at line 262 will raise TypeError. This must be handled to ensure the loop works for both single and multiple tokens.

Proposed fix
    fig, axs = plt.subplots(len(unique_tokens), 1, figsize=(10, len(unique_tokens) * 5))
+   axs = np.atleast_1d(axs).ravel()

    for ax, token in zip(axs, unique_tokens, strict=False):
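A minimal sketch of the proposed normalization (normalize_axes is an illustrative helper name; only numpy is needed to see the behavior, since the squeezed return is just a bare object versus an array):

```python
import numpy as np

def normalize_axes(axs):
    # plt.subplots squeezes a 1x1 grid down to a single Axes object;
    # atleast_1d + ravel yields a flat iterable for any grid shape.
    return np.atleast_1d(axs).ravel()

# A bare object (as returned for one subplot) becomes a length-1 array,
# while an N x M grid flattens to length N*M.
single = normalize_axes(object())
grid = normalize_axes(np.empty((2, 3), dtype=object))
```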
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 260-260: Unpacked variable fig is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/analysis/attention.py` around lines 260 - 263, The loop over
matplotlib axes fails when len(unique_tokens) == 1 because plt.subplots returns
a single Axes, not an iterable; modify the code after fig, axs =
plt.subplots(...) to normalize axs into an iterable: if axs is not a
list/ndarray wrap it in [axs], otherwise flatten it (e.g., axs =
np.atleast_1d(axs).ravel()) so the for ax, token in zip(axs, unique_tokens,
strict=False) loop over unique_tokens works for both single and multiple tokens;
update imports to include numpy if needed and reference fig, axs, unique_tokens
and the for loop in attention.py.

model.save_pretrained(self.cfg.path.finetuned_modelname)

# All ranks must call evaluate (distributed collectives)
eval_result = trainer.evaluate(eval_dataset=self.tokenized_testset)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and the specific lines mentioned
git ls-files src/mattext/models/potential.py

Repository: lamalab-org/MatText

Length of output: 94


🏁 Script executed:

# Read the file to understand the context around line 179
cat -n src/mattext/models/potential.py | head -200

Repository: lamalab-org/MatText

Length of output: 8241


🏁 Script executed:

# Look at the _prepare_datasets method to understand what it returns
ast-grep --pattern 'def _prepare_datasets' -A 20

Repository: lamalab-org/MatText

Length of output: 17418


🏁 Script executed:

# Check what train_test_split returns and how tokenized_testset is used
rg -n "tokenized_testset" src/mattext/models/potential.py

Repository: lamalab-org/MatText

Length of output: 231


🏁 Script executed:

# Check the imports to understand which trainer is being used
head -20 src/mattext/models/potential.py

Repository: lamalab-org/MatText

Length of output: 522


🏁 Script executed:

# Verify what Trainer.evaluate() expects by checking transformers documentation or usage patterns
# Also check if there are other evaluate() calls in the codebase
rg -n "\.evaluate\(" src/mattext/models/ -A 2

Repository: lamalab-org/MatText

Length of output: 695


🏁 Script executed:

# Check if there's any documentation or hints about what evaluate() accepts
rg -n "def evaluate" --type py -A 5 | head -100

Repository: lamalab-org/MatText

Length of output: 1350


Use self.tokenized_testset["test"] instead of passing the entire DatasetDict to evaluate().

At line 179, self.tokenized_testset is a DatasetDict containing both "train" and "test" splits (from train_test_split() at line 71). This is inconsistent with line 165 (which correctly uses self.tokenized_dataset["test"]) and with the same pattern used in llama.py:260 and finetune.py:138. The evaluate() method expects a single Dataset, not a DatasetDict.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/potential.py` at line 179, The call to trainer.evaluate is
passing the whole DatasetDict (self.tokenized_testset) instead of a single
Dataset split; change the argument to the test split
(self.tokenized_testset["test"]) so trainer.evaluate receives a Dataset; locate
the evaluate call in the method where eval_result = trainer.evaluate(...) and
replace the parameter accordingly to match the pattern used with
self.tokenized_dataset["test"] elsewhere.
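A plain-Python stand-in for the shape mismatch (no datasets dependency; count_eval_samples is a toy that just counts what it iterates, not the Hugging Face Trainer API):

```python
# train_test_split-style containers map split names to datasets.
tokenized_testset = {
    "train": ["sample_a", "sample_b", "sample_c"],
    "test": ["sample_d"],
}

def count_eval_samples(eval_dataset):
    # A trainer iterates examples; handed the dict of splits,
    # it would iterate the split *names* instead.
    return len(list(eval_dataset))

wrong = count_eval_samples(tokenized_testset)          # counts split names
right = count_eval_samples(tokenized_testset["test"])  # counts examples
```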

Comment on lines +45 to +47
def _prepare_datasets(
self, subset: str, local_file_path: str | None = None
) -> DatasetDict:

⚠️ Potential issue | 🟡 Minor

Fix _prepare_datasets return type annotation.

The function is annotated as returning DatasetDict (Line 47), but it returns a 2-tuple of datasets at Line 77-87. Please align the type hint with actual behavior to avoid typing/tooling drift.

Also applies to: 77-87

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/pretrain.py` around lines 45 - 47, The _prepare_datasets
function is annotated to return DatasetDict but actually returns a 2-tuple
(train_dataset, val_dataset); update its return type annotation to reflect that
(e.g., tuple[Dataset, Dataset] or Tuple[Dataset, Dataset]) and ensure Dataset is
imported from the datasets package (or add typing.Tuple if using the older
form). Change the signature of _prepare_datasets and any related type
hints/usages (and update the docstring if present) so callers and tooling match
the actual return of train_dataset and val_dataset.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
README.md (2)

125-125: ⚠️ Potential issue | 🟡 Minor

Fix redundant phrase "for e.g."

The phrase "for e.g." is redundant since "e.g." already means "for example".

📝 Proposed fix
-To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`.
+To override the existing default configuration, use `++`, e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 125, The README contains a redundant phrase "for e.g." in
the sentence explaining how to override the default config (example:
"++model.pretrain.training_arguments.per_device_train_batch_size=32"); remove
"for " so it reads "e.g.,
`++model.pretrain.training_arguments.per_device_train_batch_size=32`" (or
replace "e.g." with "for example") to avoid redundancy while preserving the
example and link to the docs.

150-159: ⚠️ Potential issue | 🟡 Minor

Fix citation block formatting.

The code block should specify a language (e.g., bibtex), and there's a missing comma after primaryClass which would be invalid BibTeX syntax.

📝 Proposed fix
-```
+```bibtex
 @misc{alampara2024mattextlanguagemodelsneed,
       title={MatText: Do Language Models Need More than Text & Scale for Materials Modeling?},
       author={Nawaf Alampara and Santiago Miret and Kevin Maik Jablonka},
       year={2024},
       eprint={2406.17295},
       archivePrefix={arXiv},
-      primaryClass={cond-mat.mtrl-sci}
-      url={https://arxiv.org/abs/2406.17295},
+      primaryClass={cond-mat.mtrl-sci},
+      url={https://arxiv.org/abs/2406.17295}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 150 - 159, The BibTeX citation block is missing a
language tag and has a syntax error: add a fenced code block language specifier
(e.g., ```bibtex) and fix the missing comma after the primaryClass field so the
entry reads primaryClass={cond-mat.mtrl-sci}, and also remove the trailing comma
after the final url field if required by your style (ensure the entry ends with
a closing brace). Locate the citation starting with the
`@misc`{alampara2024mattextlanguagemodelsneed and update the surrounding fence and
the primaryClass and url lines accordingly.
🧹 Nitpick comments (2)
docs/representations.md (1)

129-134: Specify language for fenced code block.

The code block at line 132 is missing a language specifier, which affects syntax highlighting and accessibility.

📝 Proposed fix
 ??? success "output"

-    ```bash
+    ```text
     Translated Crystal-text-LLM Representations:

Alternatively, if this is meant to be Python output:

-    ```bash
+    ```python
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/representations.md` around lines 129 - 134, The fenced code block
containing the sample output "Translated Crystal-text-LLM Representations:" is
missing a language specifier; update that triple-backtick fence (the block
starting with ``` and the text "Translated Crystal-text-LLM Representations:")
to include an appropriate language tag such as text, bash, or python (e.g.,
change ``` to ```text or ```python) to enable correct syntax highlighting and
accessibility.
pyproject.toml (1)

32-34: Use a tagged release or version constraint for xtal2txt instead of pinning to a commit hash.

The commit c6cf9593596ce99f054f1ebbdc4346d7dbec99fb is a stable point (xtal2txt v0.2.0, GPG-verified from maintainer Nawaf Alampara), but pinning to a bare commit hash is unconventional for production dependencies. If a release tag exists for this commit, reference that instead. If not, consider whether pinning to the version number (xtal2txt==0.2.0) would suffice, as it's more maintainable and allows for patch releases.

The exact pins for tensorflow==2.14.0 and keras==2.14.0 are stricter than xtal2txt's own constraints (keras<3). These exact versions may cause dependency conflicts with users' environments unless they're required for specific MatText features—document this if so.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 32 - 34, Replace the git-commit pinned
dependency "xtal2txt @
git+https://github.com/lamalab-org/xtal2txt.git@c6cf9593596ce99f054f1ebbdc4346d7dbec99fb"
with a tagged release or explicit version (e.g., use the release tag for v0.2.0
or change to "xtal2txt==0.2.0") to avoid a bare-commit pin; also relax or
justify the strict pins for "tensorflow==2.14.0" and "keras==2.14.0" (e.g., use
version ranges like "keras<3" or document why exact pins are required for the
MatText features) so dependency resolution is more maintainable and less likely
to conflict.
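A sketch of the suggested dependency entries (version numbers are taken from the comment above; verify that the 0.2.0 tag actually exists on PyPI or the repository before adopting this):

```toml
[project]
dependencies = [
    "xtal2txt==0.2.0",  # tagged release instead of a bare commit pin
    "keras<3",          # mirror xtal2txt's own upper bound
]
```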
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/representations.md`:
- Around line 24-35: The codebase uses outdated representation keys causing
KeyErrors: replace occurrences of the old names "slice" -> "slices",
"crystal_llm_rep" -> "crystal_text_llm", and "atoms_params" ->
"atom_sequences_plusplus" in the modules that map or look up representations
(e.g., the representation key dicts/lookup logic in attention.py and the
dataprep modules); update the representation-key mappings and any conditionals
in attention.py, matbench_prepare_data.py, and nomad_prepare_data.py to use the
new names from src/mattext/models/utils.py, or add a small alias map that
translates the old keys to the new ones to preserve backward compatibility.
Ensure all lookups reference the unified representation names and run tests to
confirm no KeyError remains.

---

Duplicate comments:
In `@README.md`:
- Line 125: The README contains a redundant phrase "for e.g." in the sentence
explaining how to override the default config (example:
"++model.pretrain.training_arguments.per_device_train_batch_size=32"); remove
"for " so it reads "e.g.,
`++model.pretrain.training_arguments.per_device_train_batch_size=32`" (or
replace "e.g." with "for example") to avoid redundancy while preserving the
example and link to the docs.
- Around line 150-159: The BibTeX citation block is missing a language tag and
has a syntax error: add a fenced code block language specifier (e.g., ```bibtex)
and fix the missing comma after the primaryClass field so the entry reads
primaryClass={cond-mat.mtrl-sci}, and also remove the trailing comma after the
final url field if required by your style (ensure the entry ends with a closing
brace). Locate the citation starting with the
@misc{alampara2024mattextlanguagemodelsneed and update the surrounding fence and
the primaryClass and url lines accordingly.

---

Nitpick comments:
In `@docs/representations.md`:
- Around line 129-134: The fenced code block containing the sample output
"Translated Crystal-text-LLM Representations:" is missing a language specifier;
update that triple-backtick fence (the block starting with ``` and the text
"Translated Crystal-text-LLM Representations:") to include an appropriate
language tag such as text, bash, or python (e.g., change ``` to ```text or
```python) to enable correct syntax highlighting and accessibility.

In `@pyproject.toml`:
- Around line 32-34: Replace the git-commit pinned dependency "xtal2txt @
git+https://github.com/lamalab-org/xtal2txt.git@c6cf9593596ce99f054f1ebbdc4346d7dbec99fb"
with a tagged release or explicit version (e.g., use the release tag for v0.2.0
or change to "xtal2txt==0.2.0") to avoid a bare-commit pin; also relax or
justify the strict pins for "tensorflow==2.14.0" and "keras==2.14.0" (e.g., use
version ranges like "keras<3" or document why exact pins are required for the
MatText features) so dependency resolution is more maintainable and less likely
to conflict.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8aa001de-853c-4b5a-8780-5cbb7f7069e2

📥 Commits

Reviewing files that changed from the base of the PR and between efe2737 and 4e60578.

📒 Files selected for processing (3)
  • README.md
  • docs/representations.md
  • pyproject.toml

Comment on lines +24 to 35
For example, to obtain the `cif_p1`, `slices`, `atom_sequences_plusplus`, `crystal_text_llm`, and `zmatrix` representations, use the following code:

```python
text_rep = TextRep(structure)

# Define a list of requested representations
requested_reps = [
"cif_p1",
"slice",
"atoms_params",
"crystal_llm_rep",
"slices",
"atom_sequences_plusplus",
"crystal_text_llm",
"zmatrix"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for old representation names still in use across the codebase
echo "=== Checking for OLD representation names ==="
echo "--- 'slice' (should be 'slices') ---"
rg -n --type py '\bslice\b' -g '!*test*' | head -30

echo ""
echo "--- 'crystal_llm_rep' (should be 'crystal_text_llm') ---"
rg -n --type py 'crystal_llm_rep' | head -30

echo ""
echo "--- 'atoms_params' (should be 'atom_sequences_plusplus') ---"
rg -n --type py 'atoms_params' | head -30

echo ""
echo "=== Checking what names ARE being used in these files ==="
echo "--- In attention.py ---"
rg -n 'representation|_rep|_param|slice' src/mattext/analysis/attention.py 2>/dev/null | head -20

echo ""
echo "--- In matbench_prepare_data.py ---"
rg -n 'representation|_rep|_param' src/mattext/dataprep/matbench_prepare_data.py 2>/dev/null | head -20

echo ""
echo "--- In nomad_prepare_data.py ---"
rg -n 'representation|_rep|_param' src/mattext/dataprep/nomad_prepare_data.py 2>/dev/null | head -20

Repository: lamalab-org/MatText

Length of output: 5497


🏁 Script executed:

#!/bin/bash
# Check the documentation file to see what names it advertises
echo "=== Lines 24-35 from docs/representations.md ==="
sed -n '24,35p' docs/representations.md

echo ""
echo "=== Checking for new representation names in the documentation ==="
rg -n 'slices|atom_sequences_plusplus|crystal_text_llm' docs/representations.md

echo ""
echo "=== Checking TextRep class for representation definitions ==="
find . -name '*.py' -type f | xargs grep -l 'class TextRep' | head -1 | xargs head -100

echo ""
echo "=== Checking for representation method names ==="
rg -n 'def get_.*text_rep|def get_.*representation' src/mattext/representations.py | head -20

Repository: lamalab-org/MatText

Length of output: 3996


🏁 Script executed:

#!/bin/bash
# Find the representations module
echo "=== Finding representations module ==="
find . -name '*representation*' -type f | grep -E '\.(py|md)$' | head -20

echo ""
echo "=== Searching for where new representation names are defined/used ==="
rg -n 'slices|atom_sequences_plusplus|crystal_text_llm' --type py | head -40

echo ""
echo "=== Checking if TextRep has methods for these representations ==="
find . -name '*.py' -type f | xargs grep -l 'TextRep' | head -3 | xargs grep -A 5 'def get_'

Repository: lamalab-org/MatText

Length of output: 604


🏁 Script executed:

#!/bin/bash
# Check src/mattext/models/utils.py to see the full definitions
echo "=== src/mattext/models/utils.py content ==="
head -50 src/mattext/models/utils.py

echo ""
echo "=== Check if attention.py imports from models/utils or defines its own ==="
head -50 src/mattext/analysis/attention.py

echo ""
echo "=== Verify: Does TextRep class have get_slices, get_atom_sequences_plusplus, get_crystal_text_llm methods? ==="
find . -path '*/mattext/*' -name '*.py' -type f | xargs grep -l 'def get_slices\|def get_atom_sequences_plusplus\|def get_crystal_text_llm' | head -5

echo ""
echo "=== Search for where these representation methods are defined ==="
rg -n 'def get_slices|def get_atom_sequences_plusplus|def get_crystal_text_llm|def get_slice|def get_crystal_llm_rep|def get_atoms_params' --type py

Repository: lamalab-org/MatText

Length of output: 2664


Update representation names in src/mattext/analysis/attention.py and data preparation modules to match documentation.

The documentation advertises new representation names (slices, crystal_text_llm, atom_sequences_plusplus) which are defined in src/mattext/models/utils.py, but src/mattext/analysis/attention.py still uses old names (slice, crystal_llm_rep, atoms_params). Additionally, src/mattext/dataprep/matbench_prepare_data.py uses the old crystal_llm_rep name and src/mattext/dataprep/nomad_prepare_data.py uses the old atoms_params name.

Users following the updated documentation will encounter KeyError exceptions when attempting to use these modules with the new representation names, creating a breaking inconsistency between documented and actual supported representation identifiers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/representations.md` around lines 24 - 35, The codebase uses outdated
representation keys causing KeyErrors: replace occurrences of the old names
"slice" -> "slices", "crystal_llm_rep" -> "crystal_text_llm", and "atoms_params"
-> "atom_sequences_plusplus" in the modules that map or look up representations
(e.g., the representation key dicts/lookup logic in attention.py and the
dataprep modules); update the representation-key mappings and any conditionals
in attention.py, matbench_prepare_data.py, and nomad_prepare_data.py to use the
new names from src/mattext/models/utils.py, or add a small alias map that
translates the old keys to the new ones to preserve backward compatibility.
Ensure all lookups reference the unified representation names and run tests to
confirm no KeyError remains.

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/mattext/models/llama_sft.py (3)

81-95: ⚠️ Potential issue | 🔴 Critical

compute_dtype can be undefined on non-quantized runs (runtime crash).

Line 94 reads compute_dtype even when Line 91 path is taken (bnb_config = None), causing UnboundLocalError.

Proposed fix
-        if self.bnb_config.use_4bit and self.bnb_config.use_8bit:
+        compute_dtype = None
+        if self.bnb_config.use_4bit and self.bnb_config.use_8bit:
             raise ValueError(
                 "You can't load the model in 8 bits and 4 bits at the same time"
             )
@@
-        if compute_dtype == torch.float16:
+        if compute_dtype == torch.float16 and torch.cuda.is_available():
             major, _ = torch.cuda.get_device_capability()
             if major >= 8:
                 logger.info(
                     "Your GPU supports bfloat16: accelerate training with bf16=True!"
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama_sft.py` around lines 81 - 95, The code reads
compute_dtype unconditionally but only sets it inside the quantized branch,
causing UnboundLocalError; to fix, initialize compute_dtype (e.g., compute_dtype
= None) before the if/elif where bnb_config is created or guard the later check,
and change the GPU compatibility check to only run when compute_dtype is not
None (and/or bnb_config is not None) before comparing to torch.float16 and
calling torch.cuda.get_device_capability(); ensure references to compute_dtype,
bnb_config, BitsAndBytesConfig, and the torch.float16 check are updated
accordingly.
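The initialize-before-branch pattern from this fix, as a minimal pure-Python sketch (resolve_compute_dtype is an illustrative name; the real code would compare against torch dtypes rather than strings):

```python
def resolve_compute_dtype(use_4bit, compute_dtype_name="float16"):
    # Bind the local up front so the non-quantized path never raises
    # UnboundLocalError when the dtype check below runs.
    compute_dtype = None
    if use_4bit:
        compute_dtype = compute_dtype_name
    # Guard the follow-up check on None instead of assuming the
    # quantized branch executed.
    warn_bf16 = compute_dtype is not None and compute_dtype == "float16"
    return compute_dtype, warn_bf16
```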

63-64: ⚠️ Potential issue | 🟡 Minor

test_sample_size handling has edge-case bugs.

Using a truthy check means 0 behaves like “unset”, and selecting range(self.test_sample_size) can exceed dataset length.

Proposed fix
-        if self.test_sample_size:
-            dataset = dataset.select(range(self.test_sample_size))
+        if self.test_sample_size is not None:
+            n = min(self.test_sample_size, len(dataset))
+            dataset = dataset.select(range(n))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama_sft.py` around lines 63 - 64, The current truthy
check misinterprets 0 as unset and may request more samples than exist; change
the conditional to check for None (e.g., if self.test_sample_size is not None),
coerce to an int and clamp to dataset length (n = max(0,
int(self.test_sample_size)); n = min(n, len(dataset))), then call
dataset.select(range(n)) so 0 yields an empty selection and values larger than
the dataset are safely capped; update the code around test_sample_size and
dataset.select accordingly.
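The None-versus-zero and clamping semantics from the proposed fix, sketched as a small helper (clamp_sample_size is an illustrative name, not from the repository):

```python
def clamp_sample_size(sample_size, dataset_len):
    # None means "use the full dataset"; 0 is a genuine request for an
    # empty subset, which a truthy check would silently treat as unset.
    if sample_size is None:
        return dataset_len
    n = max(0, int(sample_size))
    return min(n, dataset_len)
```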

111-115: ⚠️ Potential issue | 🟠 Major

Fix device_map="auto" for DDP-compatible model loading.

In DDP mode (which this code supports via local_rank), device_map="auto" attempts model sharding across devices, conflicting with DDP's expectation of one full model replica per process. With BitsAndBytesConfig quantization, this causes training failures or undefined behavior. Use rank-scoped placement instead:

+        device_map = {"": f"cuda:{self.local_rank}"} if self.local_rank is not None else "auto"
         model = AutoModelForCausalLM.from_pretrained(
             self.ckpt,
             quantization_config=bnb_config,
-            device_map="auto",
+            device_map=device_map,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama_sft.py` around lines 111 - 115, The model loading
uses AutoModelForCausalLM.from_pretrained with device_map="auto", which
conflicts with DDP: when a local rank is present you must place a full model
replica on that GPU rather than auto-shard it. Change the call that references
self.ckpt and bnb_config so that if a DDP/local rank is set (e.g.
self.local_rank or similar), pass a rank-scoped device_map like {"":
self.local_rank} (and ensure the process sets its CUDA device accordingly);
otherwise keep device_map="auto" for single-process loading.
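A sketch of the rank-scoped placement logic from the diff above (choose_device_map is an illustrative helper; the dict form maps the whole module tree, keyed by the empty string, onto a single device, which is what DDP expects):

```python
def choose_device_map(local_rank):
    # One full replica per process under DDP; "auto" sharding only when
    # no distributed rank is configured.
    if local_rank is not None:
        return {"": f"cuda:{local_rank}"}
    return "auto"
```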
♻️ Duplicate comments (1)
src/mattext/models/llama_sft.py (1)

215-228: ⚠️ Potential issue | 🟠 Major

Model checkpoint path is still not fold-scoped (artifact overwrite risk).

Line 216 uses a shared directory across folds, so later folds can overwrite earlier model artifacts. This is the same concern previously raised.

Proposed fix
-        self.output_dir_ = (
-            f"{self.cfg.path.finetuned_modelname}/llamav3-8b-lora-fine-tune"
-        )
+        self.output_dir_ = (
+            f"{self.cfg.path.finetuned_modelname}_{self.fold}/llamav3-8b-lora-fine-tune"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama_sft.py` around lines 215 - 228, The output directory
and saved artifacts are not fold-scoped and risk being overwritten; update the
path construction for self.output_dir_ (and any uses like
trainer.save_model(self.output_dir_)) to include the fold identifier (self.fold)
and/or fold-specific folder, and similarly change the tokenizer save path used
in self.tokenizer.save_pretrained to use the same fold-scoped directory (e.g.,
include f".../{self.fold}/..." or append _{self.fold}) so each fold writes to a
unique directory.
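The fold-scoped layout described above can be sketched as a helper (hypothetical names; the real code would build the path from `self.cfg.path.finetuned_modelname` and `self.fold`):

```python
import os

def fold_scoped_dir(base_name, fold, subdir="llamav3-8b-lora-fine-tune"):
    """Build a per-fold output directory so later folds never
    overwrite earlier model artifacts. Sketch, not the project's API."""
    path = os.path.join(f"{base_name}_{fold}", subdir)
    os.makedirs(path, exist_ok=True)
    return path
```

Using the same fold-scoped directory for `trainer.save_model` and `tokenizer.save_pretrained` keeps each fold's artifacts together.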
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mattext/models/llama_sft.py`:
- Around line 186-193: The code currently ignores the configurable context
length by hardcoding max_length based on self.representation; update the logic
where max_length is set (before building SFTConfig) to use self.context_length
when provided (e.g., if self.context_length is truthy/positive) and fall back to
the existing representation-based defaults (2048 for "cif_p1", 1024 otherwise);
then pass that computed max_length into SFTConfig so SFTConfig(...,
max_seq_length=max_length, ...) uses the configured value (ensure
self.context_length is validated/coerced to an int if necessary).
- Line 242: The debug log in llama_sft.py uses a printf-style placeholder
("Prediction: %s") which Loguru doesn't expand; update the logger.debug call in
the predict/logging area (the line that currently reads
logger.debug("Prediction: %s", pred)) to use Loguru's brace-style formatting
(e.g., "Prediction: {}") or an f-string so the pred value is interpolated
correctly.

---

Outside diff comments:
In `@src/mattext/models/llama_sft.py`:
- Around line 81-95: The code reads compute_dtype unconditionally but only sets
it inside the quantized branch, causing UnboundLocalError; to fix, initialize
compute_dtype (e.g., compute_dtype = None) before the if/elif where bnb_config
is created or guard the later check, and change the GPU compatibility check to
only run when compute_dtype is not None (and/or bnb_config is not None) before
comparing to torch.float16 and calling torch.cuda.get_device_capability();
ensure references to compute_dtype, bnb_config, BitsAndBytesConfig, and the
torch.float16 check are updated accordingly.
- Around line 63-64: The current truthy check misinterprets 0 as unset and may
request more samples than exist; change the conditional to check for None (e.g.,
if self.test_sample_size is not None), coerce to an int and clamp to dataset
length (n = max(0, int(self.test_sample_size)); n = min(n, len(dataset))), then
call dataset.select(range(n)) so 0 yields an empty selection and values larger
than the dataset are safely capped; update the code around test_sample_size and
dataset.select accordingly.
- Around line 111-115: The model loading uses
AutoModelForCausalLM.from_pretrained with device_map="auto", which conflicts
with DDP: when a local rank is present you must place a full model replica on
that GPU rather than auto-shard it. Change the call that references self.ckpt
and bnb_config so that if a DDP/local rank is set (e.g. self.local_rank or
similar), pass a rank-scoped device_map like {"": self.local_rank} (and ensure
the process sets its CUDA device accordingly); otherwise keep device_map="auto"
for single-process loading.
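The None-check-and-clamp logic for `test_sample_size` suggested above can be sketched as (hypothetical helper; the real code calls `dataset.select(range(n))` with the result):

```python
def resolve_sample_size(test_sample_size, dataset_len):
    """Interpret the configured sample size: None means 'use all',
    0 means an empty selection, and oversized values are capped.
    Sketch, not the project's API."""
    if test_sample_size is None:
        return dataset_len
    n = max(0, int(test_sample_size))
    return min(n, dataset_len)
```

This avoids the truthiness bug where a configured size of 0 is silently treated as "unset".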


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bb5dde5a-24e4-44e7-875c-9bdfb123c076

📥 Commits

Reviewing files that changed from the base of the PR and between 4e60578 and 9fbea01.

📒 Files selected for processing (1)
  • src/mattext/models/llama_sft.py

Comment on lines 186 to 193
max_length = 2048 if self.representation == "cif_p1" else 1024

training_args = SFTConfig(
**config_dict,
packing=False,
max_seq_length=max_length,
completion_only_loss=True,
)

⚠️ Potential issue | 🟠 Major

Configured context length is ignored.

Line 186 hardcodes sequence length by representation, but self.context_length (Line 49) is never used. This makes context length non-configurable in practice.

Proposed fix
-        max_length = 2048 if self.representation == "cif_p1" else 1024
+        max_length = self.context_length
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama_sft.py` around lines 186 - 193, The code currently
ignores the configurable context length by hardcoding max_length based on
self.representation; update the logic where max_length is set (before building
SFTConfig) to use self.context_length when provided (e.g., if
self.context_length is truthy/positive) and fall back to the existing
representation-based defaults (2048 for "cif_p1", 1024 otherwise); then pass
that computed max_length into SFTConfig so SFTConfig(...,
max_seq_length=max_length, ...) uses the configured value (ensure
self.context_length is validated/coerced to an int if necessary).
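The fallback logic described in the prompt can be sketched as (hypothetical helper; in the real code these values live on the model class):

```python
def resolve_max_length(context_length, representation):
    """Prefer an explicitly configured context length; otherwise fall
    back to the representation-based defaults. Sketch, not the project's API."""
    if context_length:  # truthy and positive -> use the configured value
        return int(context_length)
    return 2048 if representation == "cif_p1" else 1024
```

The result would then be passed as `max_seq_length` when building `SFTConfig`.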


with torch.cuda.amp.autocast():
pred = pipe(self.formatting_tests_func(self.testdata))
logger.debug("Prediction: %s", pred)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Loguru logger formatting style %s vs {} placeholders

💡 Result:

  • Loguru does not support %-style placeholders (%s, %d, …) in log messages. You must use brace-style {} formatting. Example: replace logger.debug("x=%s", x) with logger.debug("x={}", x). Loguru uses message.format(*args, **kwargs) internally. [1]
  • Passing arguments separately (instead of building an f-string) keeps formatting lazy-ish: if the log level is filtered out, the message won’t be formatted. [1]
  • If your message contains literal { or }, escape them as {{ and }}, otherwise Loguru will treat them as format fields and may raise formatting errors when args/kwargs are present. [2]
  • For expensive computations, use lazy callables via logger.opt(lazy=True) so functions are only evaluated if the message is actually emitted. [3]

Sources:
[1] Loguru docs (migration): “Replacing % style formatting of messages” — Loguru only supports {} and calls .format()
[2] Loguru docs (troubleshooting): brace formatting pitfalls / escaping and errors with braces
[3] Loguru docs (API): logger.opt(lazy=True) behavior


🏁 Script executed:

# First, let's check the actual file and line in question
cat -n src/mattext/models/llama_sft.py | sed -n '235,250p'

Repository: lamalab-org/MatText

Length of output: 677


🏁 Script executed:

# Search for all loguru usage in the codebase to see formatting patterns
rg "logger\.(debug|info|warning|error)" --context=1 -t py | head -60

Repository: lamalab-org/MatText

Length of output: 3187


🏁 Script executed:

# Check loguru import in the file
head -20 src/mattext/models/llama_sft.py | grep -i "loguru\|logger"

Repository: lamalab-org/MatText

Length of output: 88


Fix Loguru formatting placeholder: change %s to {}

Loguru does not support %-style placeholders (%s, %d, etc.). It uses {} brace-style formatting exclusively via .format() internally. The current code will log the literal string "Prediction: %s" without substituting the pred value.

Fix
-            logger.debug("Prediction: %s", pred)
+            logger.debug("Prediction: {}", pred)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mattext/models/llama_sft.py` at line 242, The debug log in llama_sft.py
uses a printf-style placeholder ("Prediction: %s") which Loguru doesn't expand;
update the logger.debug call in the predict/logging area (the line that
currently reads logger.debug("Prediction: %s", pred)) to use Loguru's
brace-style formatting (e.g., "Prediction: {}") or an f-string so the pred value
is interpolated correctly.
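Loguru formats messages by calling `message.format(*args)` internally, so the behaviour can be reproduced without the library itself (sketch of the mechanism, not Loguru's API):

```python
def render_loguru_message(message, *args):
    """Mimic Loguru's internal message.format(*args) call."""
    return message.format(*args)

# Brace placeholders are expanded...
print(render_loguru_message("Prediction: {}", 0.42))  # -> Prediction: 0.42
# ...but %-style placeholders pass through as literal text, since
# str.format ignores positional args that no field references.
print(render_loguru_message("Prediction: %s", 0.42))  # -> Prediction: %s
```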

@n0w0f n0w0f merged commit eaefa11 into main Mar 4, 2026
7 checks passed
