Conversation
Reviewer's Guide

Updates MatText to version 0.0.2 with TRL/Accelerate-style SFT training, improved DDP handling (single-rank logging/checkpointing, proper synchronization), an updated xtal2txt dependency, and a uv-based Python 3.10/3.11-only installation and CI setup.

Sequence diagram for DDP benchmarking and finetuning workflow

sequenceDiagram
actor User
participant CLI as CLI_main_py
participant TaskRunner
participant Benchmark as Matbenchmark
participant BaseBench as BaseBenchmark
participant Finetuner as FinetuneModel
participant LlamaSFT as LlamaSFTTrainer
participant Wandb as wandb
participant HF as SFTTrainer
participant Dist as PartialState
User->>CLI: run python main.py -cn=benchmark ...
CLI->>CLI: read cfg
CLI->>CLI: read LOCAL_RANK env
CLI->>CLI: set local_rank=None or int
CLI->>TaskRunner: create TaskRunner()
CLI->>TaskRunner: initialize_wandb() only if main process
CLI->>TaskRunner: run_benchmarking(cfg.model, local_rank)
TaskRunner->>Benchmark: Matbenchmark.run_benchmarking(local_rank)
Benchmark->>Dist: create PartialState()
loop folds
Benchmark->>Benchmark: for each exp_name, test_name
alt main process (state.is_main_process)
Benchmark->>Wandb: wandb.init(config=finetune_cfg,..., name=exp_name)
end
Benchmark->>BaseBench: _run_experiment(task, i, exp_name, test_name, local_rank)
BaseBench->>Dist: create PartialState()
BaseBench->>BaseBench: log fold and dataset info
BaseBench->>BaseBench: prepare exp_cfg (train and test paths)
BaseBench->>Benchmark: _get_finetuner(exp_cfg, local_rank, fold_name)
Benchmark-->>BaseBench: FinetuneModel instance
BaseBench->>Finetuner: finetune()
Finetuner->>LlamaSFT: construct finetune wrapper with cfg, local_rank
LlamaSFT->>LlamaSFT: prepare_data, prepare_test_data, _setup_model_tokenizer
LlamaSFT->>LlamaSFT: _add_prompt_completion_columns(train_dataset)
LlamaSFT->>LlamaSFT: _add_prompt_completion_columns(eval_dataset)
LlamaSFT->>LlamaSFT: build SFTConfig(config_dict, packing=False, completion_only_loss=True, max_length)
LlamaSFT->>HF: create SFTTrainer(model, peft_config, train_dataset, eval_dataset, processing_class=tokenizer, args=SFTConfig, callbacks)
alt main process (local_rank is None or 0)
LlamaSFT->>Wandb: wandb.log(Training Arguments)
LlamaSFT->>Wandb: wandb.log(model_summary)
end
LlamaSFT->>HF: trainer.train()
HF-->>LlamaSFT: trained model
alt main process
LlamaSFT->>HF: trainer.save_state()
LlamaSFT->>HF: trainer.save_model(output_dir)
LlamaSFT->>LlamaSFT: tokenizer.save_pretrained(...)
LlamaSFT->>HF: create pipeline(text-generation, model, tokenizer, completion_only output)
LlamaSFT->>HF: pipe(formatting_tests_func(testdata))
HF-->>LlamaSFT: predictions
LlamaSFT->>LlamaSFT: write predictions to json
LlamaSFT->>Wandb: wandb.finish()
end
LlamaSFT->>LlamaSFT: cleanup (del trainer, model, tokenizer, gc.collect)
LlamaSFT-->>Finetuner: finetuned checkpoint path
Finetuner-->>BaseBench: checkpoint path
alt main process (state.is_main_process)
BaseBench->>Wandb: wandb.init(config=inference_cfg,..., name=test_name)
BaseBench->>BaseBench: set inference test_data and pretrained_checkpoint
BaseBench->>BaseBench: _get_inference(exp_cfg, fold_name)
BaseBench->>BaseBench: predict.predict()
BaseBench-->>BaseBench: predictions, prediction_ids
BaseBench->>BaseBench: _record_predictions(task, i, predictions, prediction_ids)
else non main ranks
BaseBench->>BaseBench: skip inference and logging
end
BaseBench->>Dist: state.wait_for_everyone() synchronize ranks
alt main process
BaseBench->>Wandb: wandb.finish()
end
end
alt main process in Benchmark
Benchmark->>Benchmark: _save_results(task)
end
Benchmark-->>TaskRunner: benchmarking complete
TaskRunner-->>CLI: return
CLI-->>User: benchmarking finished
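The LOCAL_RANK reading and main-process gating at the top of this flow can be sketched in plain Python. This is an illustration of the pattern, not the project's exact code; the helper names are mine:

```python
import os

def read_local_rank():
    """Mimic main.py: read LOCAL_RANK as set by torchrun/accelerate, or None."""
    value = os.environ.get("LOCAL_RANK")
    return int(value) if value is not None else None

def is_main_process(local_rank):
    """Single-process runs (local_rank is None) and rank 0 are the main process."""
    return local_rank is None or local_rank == 0

# Single-process run: no LOCAL_RANK in the environment.
os.environ.pop("LOCAL_RANK", None)
print(is_main_process(read_local_rank()))  # → True

# Simulated DDP worker on rank 1: should skip wandb init, saving, etc.
os.environ["LOCAL_RANK"] = "1"
print(is_main_process(read_local_rank()))  # → False
```

Every rank runs the training loop, but only the rank where `is_main_process(...)` is true initializes wandb and writes checkpoints.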
Updated class diagram for benchmarking and training orchestration

classDiagram
class TaskRunner {
+_run_experiment(task_cfg, experiment_type, local_rank, use_folds, use_train_data_path)
+run_benchmarking(task_cfg, local_rank)
+initialize_wandb()
}
class BaseBenchmark {
<<abstract>>
-train_data
-test_data
-task_cfg
-benchmark_save_path
+_initialize_task()*
+_run_experiment(task, i, exp_name, test_name, local_rank)
+_get_finetuner(exp_cfg, local_rank, fold_name)*
+_get_inference(exp_cfg, fold_name)*
+_record_predictions(task, fold, predictions, prediction_ids)
+_save_results(task)
}
class Matbenchmark {
+run_benchmarking(local_rank)
+_get_finetuner(exp_cfg, local_rank, fold_name)
+_get_inference(exp_cfg, fold_name)
+_initialize_task()
}
class MatbenchmarkClassification {
+run_benchmarking(local_rank)
+_get_finetuner(exp_cfg, local_rank, fold_name)
+_get_inference(exp_cfg, fold_name)
+_initialize_task()
}
class FinetuneModel {
-cfg
-local_rank
-tokenized_dataset
-_wrapped_tokenizer
+finetune() str
+_callbacks()
+_compute_metrics(eval_pred)
+get_best_metric()
+is_greater_better()
}
class PotentialModel {
-cfg
-local_rank
-tokenized_dataset
-tokenized_testset
-_wrapped_tokenizer
+finetune() str
+evaluate()
+_callbacks()
+_compute_metrics(eval_pred)
}
class LlamaSFTTrainer {
-cfg
-local_rank
-dataset
-testdata
-model
-tokenizer
-peft_config
-property_
-material_
-representation
-test_sample_size
+__init__(cfg, local_rank, fold, train_data, test_data, representation, test_sample_size)
+prepare_data(train_subset)
+prepare_test_data(test_subset)
+_setup_model_tokenizer()
+_callbacks()
+_add_prompt_completion_columns(dataset)
+formatting_prompts_func(example)
+formatting_tests_func(example)
+finetune() str
}
class SFTTrainer {
+train()
+evaluate(eval_dataset)
+save_model(output_dir)
+save_state()
}
class Trainer {
+train()
+evaluate(eval_dataset)
+save_model(output_dir)
}
class TrainingArguments {
}
class SFTConfig {
}
class PartialState {
+is_main_process bool
+wait_for_everyone()
}
class wandb {
+init(config, project, name)
+log(data)
+finish()
}
TaskRunner --> Matbenchmark : uses
TaskRunner --> MatbenchmarkClassification : uses
BaseBenchmark <|-- Matbenchmark
BaseBenchmark <|-- MatbenchmarkClassification
Matbenchmark --> FinetuneModel : _get_finetuner()
MatbenchmarkClassification --> FinetuneModel : _get_finetuner()
BaseBenchmark --> LlamaSFTTrainer : finetune for Llama models
BaseBenchmark --> PotentialModel : finetune for potential models
FinetuneModel --> Trainer : uses
PotentialModel --> Trainer : uses
LlamaSFTTrainer --> SFTTrainer : uses
FinetuneModel --> TrainingArguments : builds
PotentialModel --> TrainingArguments : builds
LlamaSFTTrainer --> SFTConfig : builds
BaseBenchmark --> PartialState : controls DDP flow
Matbenchmark --> PartialState : controls DDP flow
MatbenchmarkClassification --> PartialState : controls DDP flow
BaseBenchmark --> wandb : experiment and inference logging
Matbenchmark --> wandb : benchmarking run logging
MatbenchmarkClassification --> wandb : benchmarking run logging
LlamaSFTTrainer --> wandb : training logging
FinetuneModel --> wandb : training logging
PotentialModel --> wandb : training logging
File-Level Changes
📝 Walkthrough

Pin CI to Python 3.10.19, add uv-based installs and pre-commit CI; modernize typing and defaults; add DDP-aware control flow (LOCAL_RANK), main-process-gated WandB and synchronization barriers; migrate some training to SFTTrainer/SFTConfig; many doc, config, and formatting tweaks across the repo.

Changes
sequenceDiagram
autonumber
participant Launcher as Launcher
participant Worker as Worker (rank N)
participant Trainer as Trainer
participant WandB as WandB (main)
participant FS as FileSystem
Launcher->>Worker: start process (set LOCAL_RANK)
Worker->>Trainer: initialize Trainer/model (receive local_rank)
alt main process (is_main)
Trainer->>WandB: wandb.init() (gated)
WandB-->>Trainer: run handle
end
Trainer->>Trainer: train / evaluate across folds
Trainer->>Worker: state.wait_for_everyone() (barrier)
alt main process (is_main)
Trainer->>FS: save checkpoints & write predictions/results
Trainer->>WandB: log metrics / artifacts
WandB-->>Trainer: finish
end
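The barrier-then-save step in this diagram (all ranks synchronize, then only the main process writes) can be mimicked with a small threading sketch; `threading.Barrier` stands in for `state.wait_for_everyone()` here, purely for illustration:

```python
import threading

saved = []  # stands in for the checkpoint directory

def worker(rank, barrier):
    # ... per-rank training/evaluation work would happen here ...
    barrier.wait()            # analogue of state.wait_for_everyone()
    if rank == 0:             # only the main process saves and logs
        saved.append(f"rank {rank} wrote checkpoint")

barrier = threading.Barrier(3)
threads = [threading.Thread(target=worker, args=(r, barrier)) for r in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(saved)  # → ['rank 0 wrote checkpoint']
```

The barrier guarantees no rank races ahead into checkpoint reading before rank 0 has finished writing.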
🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Hey - I've found 5 issues and left some high-level feedback:
- In several places you gate logic on `is_main` or `PartialState().is_main_process`, but you still call collective operations (e.g., `trainer.evaluate`, `state.wait_for_everyone`) on all ranks; double-check that every DDP code path consistently calls these in the same order and that `PartialState` is imported wherever it's used to avoid runtime issues.
- The new `_add_prompt_completion_columns` helper in `llama_sft.py` hardcodes the prompt format and numeric formatting (`{example['labels']:.3f}` and the `@@@` terminator); consider making these configurable via the config or class attributes so different tasks/properties can adjust formatting without modifying code.
- You now infer `skip_folds` in `main._run_experiment` via `task_cfg.model.get("skip_folds", [])`; it would be safer to normalize this value (e.g., always a list of ints) and validate it against `range(task_cfg.model.fold)` to avoid subtle misconfigurations or type mismatches in user configs.
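The suggested `skip_folds` normalization could look something like this (a sketch with a hypothetical helper name; not part of MatText):

```python
def normalize_skip_folds(raw, n_folds):
    """Coerce a skip_folds config value into a validated list of fold indices."""
    if raw is None:
        return []
    if isinstance(raw, (int, str)):  # allow a single scalar in the config
        raw = [raw]
    folds = [int(f) for f in raw]
    out_of_range = [f for f in folds if f not in range(n_folds)]
    if out_of_range:
        raise ValueError(f"skip_folds {out_of_range} outside range(0, {n_folds})")
    return folds

print(normalize_skip_folds(None, 5))      # → []
print(normalize_skip_folds("2", 5))       # → [2]
print(normalize_skip_folds([0, "3"], 5))  # → [0, 3]
```

Validating against `range(n_folds)` up front turns a silent misconfiguration into an immediate, readable error.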
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In several places you gate logic on `is_main` or `PartialState().is_main_process`, but you still call collective operations (e.g., `trainer.evaluate`, `state.wait_for_everyone`) on all ranks—double‑check that every DDP code path consistently calls these in the same order and that `PartialState` is imported wherever it’s used to avoid runtime issues.
- The new `_add_prompt_completion_columns` helper in `llama_sft.py` hardcodes the prompt format and numeric formatting (`{example['labels']:.3f}` and `@@@` terminator); consider making these configurable via the config or class attributes so different tasks/properties can adjust formatting without modifying code.
- You now infer `skip_folds` in `main._run_experiment` via `task_cfg.model.get("skip_folds", [])`; it would be safer to normalize this value (e.g., always a list of ints) and validate it against `range(task_cfg.model.fold)` to avoid subtle misconfigurations or type mismatches in user configs.
## Individual Comments
### Comment 1
<location path="src/mattext/models/llama_sft.py" line_range="127-136" />
<code_context>
output_texts.append(text)
return output_texts
+ def _add_prompt_completion_columns(self, dataset):
+ """Add prompt and completion columns for completion-only loss."""
+ def _map_fn(example):
+ return {
+ "prompt": f"### What is the {self.property_} of {example[self.representation]}\n ### Answer:",
+ "completion": f" {example['labels']:.3f}@@@",
+ }
+ dataset = dataset.map(_map_fn)
+ # Remove original columns to avoid conflicts with the trainer's label handling
+ cols_to_remove = [c for c in dataset.column_names if c not in ("prompt", "completion")]
+ return dataset.remove_columns(cols_to_remove)
+
def formatting_tests_func(self, example):
</code_context>
<issue_to_address>
**issue (bug_risk):** SFTTrainer will not find a `dataset_text_field` (`text` by default), so training is likely to fail or ignore your prompt/completion columns.
The new helper now leaves the dataset with only `prompt` and `completion` columns, but `SFTTrainer` is still being used in its default mode, which looks for a single `"text"` column (or whatever is passed as `dataset_text_field`). With this setup:
- The datasets only expose `"prompt"` and `"completion"`.
- `SFTTrainer` is not told how to consume these columns via `dataset_text_field` / `dataset_kwargs`.
That will usually either fail at construction or cause the trainer to ignore the intended fields. You’ll need to either:
- Go back to a single `"text"` column with `prompt + completion` and use `DataCollatorForCompletionOnlyLM`, or
- Explicitly configure `dataset_text_field` / `dataset_kwargs` so `SFTTrainer` can use separate prompt/completion fields (depending on your TRL version).
</issue_to_address>
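The first suggested fix (a single `text` column consumed by SFTTrainer's default `dataset_text_field`) reduces to string concatenation. The helper below mirrors the PR's prompt format but is only an illustrative sketch; `property_name` and `representation` are stand-in parameters:

```python
def to_text(example, property_name, representation):
    # Concatenate prompt and completion into the single "text" field that
    # SFTTrainer reads by default (dataset_text_field="text").
    prompt = f"### What is the {property_name} of {example[representation]}\n ### Answer:"
    completion = f" {example['labels']:.3f}@@@"
    return {"text": prompt + completion}

row = {"composition": "NaCl", "labels": 2.1039}
print(to_text(row, "band_gap", "composition")["text"])
# → ### What is the band_gap of NaCl
#    ### Answer: 2.104@@@
```

With this shape, `dataset.map(to_text, ...)` yields rows the trainer can consume without extra `dataset_kwargs`, at the cost of needing a completion-only collator if loss should be masked on the prompt.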
### Comment 2
<location path="src/mattext/models/llama_sft.py" line_range="180-187" />
<code_context>
+ print(f"[Rank {self.local_rank}] WARNING: save_on_each_node should be False in DDP")
+ config_dict['save_on_each_node'] = False
+
+ max_length = 2048 if self.representation == "cif_p1" else 1024
+
+ training_args = SFTConfig(
+ **config_dict,
+ packing=False,
+ max_length=max_length,
+ completion_only_loss=True,
)
+ callbacks = self._callbacks()
</code_context>
<issue_to_address>
**suggestion (bug_risk):** SFTConfig may ignore `max_length`; you likely want `max_seq_length` instead for SFTTrainer.
Here `max_length` is passed into `SFTConfig`, but `SFTTrainer` generally respects `max_seq_length` for truncation/packing. Depending on your TRL version, `max_length` may be ignored. To avoid silent misconfiguration, wire this through as `max_seq_length=max_length` (either in `SFTConfig` or the `SFTTrainer` call), matching the argument name `SFTTrainer` expects.
```suggestion
        max_length = 2048 if self.representation == "cif_p1" else 1024

        training_args = SFTConfig(
            **config_dict,
            packing=False,
            max_seq_length=max_length,
            completion_only_loss=True,
        )
```
</issue_to_address>
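The representation-dependent length choice above is simple enough to isolate (the cutoff values come from the PR diff; the helper name is mine):

```python
def pick_max_length(representation: str) -> int:
    # cif_p1 strings are much longer than other text representations,
    # so they get a larger sequence budget.
    return 2048 if representation == "cif_p1" else 1024

print(pick_max_length("cif_p1"))       # → 2048
print(pick_max_length("composition"))  # → 1024
```

Factoring it out also makes it trivial to wire the chosen value into whichever argument name (`max_length` vs `max_seq_length`) the installed TRL version expects.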
### Comment 3
<location path="src/mattext/models/llama.py" line_range="244-252" />
<code_context>
- wandb.log({"Training Arguments": str(config_train_args)})
- wandb.log({"model_summary": str(model)})
+ is_main = self.local_rank is None or self.local_rank == 0
+
+ if is_main:
</code_context>
<issue_to_address>
**issue (bug_risk):** In DDP, all ranks will call `save_model` to the same path, which can cause races or corrupted checkpoints.
You already use `is_main` to gate wandb logging, but `trainer.save_model` is still called on every rank. Please wrap the `save_model` call with the same `is_main` condition so only rank 0 writes the checkpoint:
```python
if is_main:
    trainer.save_model(
        f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
    )
```
The Trainer already synchronizes weights across ranks, so saving from a single process is enough.
</issue_to_address>
### Comment 4
<location path="README.md" line_range="23" />
<code_context>
-MatText is a framework for text-based materials modeling. It supports
+MatText is a framework for text-based materials modeling. It supports
-- conversion of crystal structures in to text representations
+- conversion of crystal structures in to text representations
- transformations of crystal structures for sensitivity analyses
</code_context>
<issue_to_address>
**issue (typo):** Change "in to" to "into" for correct usage.
```suggestion
- conversion of crystal structures into text representations
```
</issue_to_address>
### Comment 5
<location path="README.md" line_range="125" />
<code_context>
To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.
</code_context>
<issue_to_address>
**issue (typo):** Adjust "for e.g." to the standard "e.g.".
"For e.g." is nonstandard; use "e.g.," instead: `use ++, e.g., ++model.pretrain...`
```suggestion
To override the existing default configuration, use `++`, e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.
```
</issue_to_address>
src/mattext/models/llama_sft.py (outdated)

```python
def _add_prompt_completion_columns(self, dataset):
    """Add prompt and completion columns for completion-only loss."""
    def _map_fn(example):
        return {
            "prompt": f"### What is the {self.property_} of {example[self.representation]}\n ### Answer:",
            "completion": f" {example['labels']:.3f}@@@",
        }
    dataset = dataset.map(_map_fn)
    # Remove original columns to avoid conflicts with the trainer's label handling
    cols_to_remove = [c for c in dataset.column_names if c not in ("prompt", "completion")]
    return dataset.remove_columns(cols_to_remove)
```
issue (bug_risk): SFTTrainer will not find a `dataset_text_field` (`text` by default), so training is likely to fail or ignore your prompt/completion columns.

The new helper leaves the dataset with only `prompt` and `completion` columns, but `SFTTrainer` is still used in its default mode, which looks for a single `"text"` column (or whatever is passed as `dataset_text_field`). With this setup:

- The datasets only expose `"prompt"` and `"completion"`.
- `SFTTrainer` is not told how to consume these columns via `dataset_text_field` / `dataset_kwargs`.

That will usually either fail at construction or cause the trainer to ignore the intended fields. You'll need to either:

- Go back to a single `"text"` column with `prompt + completion` and use `DataCollatorForCompletionOnlyLM`, or
- Explicitly configure `dataset_text_field` / `dataset_kwargs` so `SFTTrainer` can use separate prompt/completion fields (depending on your TRL version).
src/mattext/models/llama.py

```python
is_main = self.local_rank is None or self.local_rank == 0

if is_main:
    wandb.log({"Training Arguments": str(config_train_args)})
    wandb.log({"model_summary": str(self.model)})

# Trainer handles all DDP synchronization automatically
trainer.train()
trainer.save_model(
    f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
)
```
issue (bug_risk): In DDP, all ranks will call `save_model` to the same path, which can cause races or corrupted checkpoints.

You already use `is_main` to gate wandb logging, but `trainer.save_model` is still called on every rank. Please wrap the `save_model` call with the same `is_main` condition so only rank 0 writes the checkpoint:

```python
if is_main:
    trainer.save_model(
        f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
    )
```

The Trainer already synchronizes weights across ranks, so saving from a single process is enough.
README.md

> The `+` symbol before a configuration key indicates that you are adding a new key-value pair to the configuration. This is useful when you want to specify parameters that are not part of the default configuration.
>
> To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.
issue (typo): Adjust "for e.g." to the standard "e.g.".

"For e.g." is nonstandard; use "e.g.," instead: `use ++, e.g., ++model.pretrain...`

Suggested change:

```markdown
To override the existing default configuration, use `++`, e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.
```
Actionable comments posted: 7
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mattext/models/llama.py (1)
244-260: ⚠️ Potential issue | 🔴 Critical

Guard post-training logging and saving operations behind `is_main`.

In a distributed data parallel (DDP) setup, all ranks execute the training loop. Currently, lines 252-260 are executed by every rank, causing multiple processes to:

- Write to the same output paths via `trainer.save_model()` and `self.model.save_pretrained()`
- Call W&B logging and cleanup APIs via `wandb.log()` and `wandb.finish()`, which only the main process should manage

This results in file conflicts, W&B API errors, and potential artifact corruption. The `trainer.evaluate()` call is safe to execute from all ranks (Trainer handles DDP synchronization), but logging results and saving models must be restricted to the main process.
Proposed fix:

```diff
 # Trainer handles all DDP synchronization automatically
 trainer.train()
-trainer.save_model(
-    f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
-)
 eval_result = trainer.evaluate(eval_dataset=self.tokenized_dataset["test"])
-wandb.log(eval_result)
-
-self.model.save_pretrained(self.cfg.path.finetuned_modelname)
-wandb.finish()
+if is_main:
+    wandb.log(eval_result)
+    trainer.save_model(
+        f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
+    )
+    self.model.save_pretrained(self.cfg.path.finetuned_modelname)
+    wandb.finish()
 return self.cfg.path.finetuned_modelname
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama.py` around lines 244 - 260, Guard the post-training logging and saving so only the main rank performs them: keep trainer.train() and trainer.evaluate(eval_dataset=self.tokenized_dataset["test"]) as-is (Trainer handles DDP), but wrap calls to wandb.log({"Training Arguments": ...}), wandb.log({"model_summary": ...}), trainer.save_model(...), wandb.log(eval_result), self.model.save_pretrained(...), and wandb.finish() with the existing is_main check (the is_main variable defined earlier) so only when is_main is True these logging/saving/cleanup operations run.
🧹 Nitpick comments (4)
.github/workflows/tests.yml (1)
27-30: Consider pinning `uv` version for consistency.

Same as in `docs.yml`, using `version: "latest"` can cause reproducibility issues. Pin to a specific version for predictable CI runs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/tests.yml around lines 27-30, the workflow currently pins the setup action to astral-sh/setup-uv@v3 but uses version: "latest", which harms reproducibility; change the version field under the Install uv step to a specific, tested semantic version (e.g., "1.2.3") instead of "latest" so CI runs are deterministic and match docs.yml's pinned version.

pyproject.toml (1)
32-32: Consider pinning `xtal2txt` to a specific commit or tag.

Installing from `git+https://github.com/lamalab-org/xtal2txt.git` (HEAD) can lead to non-reproducible builds if the upstream repository changes. Consider pinning to a specific commit hash or release tag for stability.

Example with commit hash:

```diff
-    "xtal2txt @ git+https://github.com/lamalab-org/xtal2txt.git",
+    "xtal2txt @ git+https://github.com/lamalab-org/xtal2txt.git@v1.0.0",  # or `@abc1234` for a commit
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `pyproject.toml` at line 32, the xtal2txt dependency is installed from the repository HEAD, which makes builds non-reproducible; update the dependency string in pyproject.toml (the "xtal2txt @ git+https://github.com/lamalab-org/xtal2txt.git" entry) to pin to a specific commit hash or release tag (e.g., append @<commit-hash> or @<tag> to the git URL) so installations are deterministic, and update any related docs or changelogs to note the chosen pinned ref.

.github/workflows/docs.yml (1)
21-24: Consider pinning `uv` version for reproducibility.

Using `version: "latest"` can lead to inconsistent CI behavior if a breaking change is introduced in a future `uv` release. Consider pinning to a specific version (e.g., `version: "0.6.0"`) for more predictable builds.

Suggested change:

```diff
 - name: Install uv
   uses: astral-sh/setup-uv@v3
   with:
-    version: "latest"
+    version: "0.6.0"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/docs.yml around lines 21-24, replace the unpinned uv installer version with a fixed release to avoid CI drift: in the GitHub Actions step that uses astral-sh/setup-uv@v3 (the step with name "Install uv" and the input key version), change version: "latest" to a specific pinned version string (e.g., "0.6.0" or another tested semver) and update any related documentation or workflow comments to note the chosen pinned version.

.pre-commit-config.yaml (1)
14-19: Consider updating `ruff-pre-commit` to a more recent version.

The pinned version `v0.3.4` (March 2024) is significantly outdated. The latest version is v0.15.4 (February 2026), representing substantial development including new linting rules and bug fixes. Updating to the latest stable release will improve code quality checks.
Verify each finding against the current code and only fix it if needed. In @.pre-commit-config.yaml around lines 14 - 19, Update the pinned ruff-pre-commit version by changing the rev value currently set to v0.3.4 to the latest stable release (e.g., v0.15.4) in the pre-commit config where the repository URL "https://github.com/astral-sh/ruff-pre-commit" and hooks "ruff" and "ruff-format" are declared so the pre-commit hook uses the newer ruff rules and fixes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/tests.yml:
- Around line 32-35: Remove openbabel from the system dependencies step in the
"Install system dependencies" job by deleting "openbabel" from the apt-get
install command (leave "libfftw3-dev" intact) so CI stops installing OpenBabel
for tests; if OpenBabel is needed later for integration tests, re-add it with a
clarifying comment near the "Install system dependencies" step.
In `@README.md`:
- Line 125: The README line contains the redundant phrase "for e.g." — replace
it with either "e.g." or "for example" so the sentence reads e.g.
`++model.pretrain.training_arguments.per_device_train_batch_size=32` or for
example `++model.pretrain.training_arguments.per_device_train_batch_size=32`;
update the text surrounding the example in the same sentence accordingly.
- Around line 150-159: Update the citation block for the
`@misc`{alampara2024mattextlanguagemodelsneed} entry: change the code fence to
specify the language (```bibtex), add a missing comma after the primaryClass
field value (primaryClass={cond-mat.mtrl-sci},) and ensure the url field is
punctuated correctly (url={https://arxiv.org/abs/2406.17295} without an extra
trailing comma), then close the fenced code block.
In `@src/mattext/main.py`:
- Line 178: Remove the unnecessary f-string prefix in the print call that emits
the LOCAL_RANK warning in mattext.main (the print statement currently:
print(f"WARNING: LOCAL_RANK is set but torch.distributed is not initialized!"));
change it to a normal string literal (remove the leading "f") so the message is
printed without using an f-string and eliminates the F541 lint violation.
In `@src/mattext/models/benchmark.py`:
- Around line 117-119: The loop pairing experiment and test names uses
zip(self.exp_names, self.test_exp_names) which silently truncates mismatched
sequences; update both occurrences (the for i, (exp_name, test_name) in
enumerate(...) loops at the shown spots) to use zip(self.exp_names,
self.test_exp_names, strict=True) so a length mismatch raises immediately; apply
the same change to the second occurrence around lines 151-153 and run tests
(ensure Python version supports zip(..., strict=True)).
- Around line 64-87: The wandb run started by wandb.init in the
state.is_main_process branch isn't closed; wrap the inference sequence that
calls self._get_inference and predict.predict in a try/except/finally (still
preserving the existing except behavior that re-raises ValueError/TypeError) and
call wandb.finish() in the finally block (only when state.is_main_process and
after wandb.init was invoked) so each fold's run is always closed even on
errors; ensure the finally runs before state.wait_for_everyone().
In `@src/mattext/models/llama_sft.py`:
- Around line 209-220: The output directory self.output_dir_ is currently set to
a static path and all folds write to the same location, risking overwrites;
update the path generation to include the fold (e.g., incorporate self.fold or a
fold-scoped subdirectory) before calling trainer.save_model(self.output_dir_)
and tokenizer.save_pretrained(...), and ensure trainer.save_state() and any
other artifact saves use the same fold-scoped directory so each fold writes to a
unique directory under self.cfg.path.finetuned_modelname.
---
Outside diff comments:
In `@src/mattext/models/llama.py`:
- Around line 244-260: Guard the post-training logging and saving so only the
main rank performs them: keep trainer.train() and
trainer.evaluate(eval_dataset=self.tokenized_dataset["test"]) as-is (Trainer
handles DDP), but wrap calls to wandb.log({"Training Arguments": ...}),
wandb.log({"model_summary": ...}), trainer.save_model(...),
wandb.log(eval_result), self.model.save_pretrained(...), and wandb.finish() with
the existing is_main check (the is_main variable defined earlier) so only when
is_main is True these logging/saving/cleanup operations run.
---
Nitpick comments:
In @.github/workflows/docs.yml:
- Around line 21-24: Replace the unpinned uv installer version to a fixed
release to avoid CI drift: in the GitHub Actions step that uses
astral-sh/setup-uv@v3 (the step with name "Install uv" and the input key
version), change version: "latest" to a specific pinned version string (e.g.,
"0.6.0" or another tested semver) and update any related documentation or
workflow comments to note the chosen pinned version.
In @.github/workflows/tests.yml:
- Around line 27-30: The workflow currently pins the setup action to
astral-sh/setup-uv@v3 but uses version: "latest" which harms reproducibility;
change the version field under the Install uv step to a specific, tested
semantic version (e.g., "1.2.3") instead of "latest" so CI runs are
deterministic and match docs.yml's pinned version.
In @.pre-commit-config.yaml:
- Around line 14-19: Update the pinned ruff-pre-commit version by changing the
rev value currently set to v0.3.4 to the latest stable release (e.g., v0.15.4)
in the pre-commit config where the repository URL
"https://github.com/astral-sh/ruff-pre-commit" and hooks "ruff" and
"ruff-format" are declared so the pre-commit hook uses the newer ruff rules and
fixes.
In `@pyproject.toml`:
- Line 32: The xtal2txt dependency is installed from the repository HEAD which
makes builds non-reproducible; update the dependency string in pyproject.toml
(the "xtal2txt @ git+https://github.com/lamalab-org/xtal2txt.git" entry) to pin
to a specific commit hash or release tag (e.g., append @<commit-hash> or @<tag>
to the git URL) so installations are deterministic, and update any related docs
or changelogs to note the chosen pinned ref.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
.github/workflows/docs.yml.github/workflows/pre-commit.yml.github/workflows/tests.yml.pre-commit-config.yamlREADME.mddocs/getting_started.mdpyproject.tomlsrc/mattext/main.pysrc/mattext/models/benchmark.pysrc/mattext/models/finetune.pysrc/mattext/models/llama.pysrc/mattext/models/llama_sft.pysrc/mattext/models/potential.py
```yaml
- name: Install system dependencies
  run: |
    sudo apt-get update
    sudo apt-get install -y openbabel libfftw3-dev
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if any test files use LocalEnv or openbabel
rg -i "local.?env|openbabel" tests/
🏁 Script executed:
#!/bin/bash
# List test files to understand the test structure
find tests/ -name "*.py" -type f | head -20
🏁 Script executed:
#!/bin/bash
# Read test files to understand their scope
echo "=== test_imports.py ==="
cat tests/test_imports.py
echo -e "\n=== test_xtal2pot.py ==="
cat tests/test_xtal2pot.py
🏁 Script executed:
#!/bin/bash
# Search the entire codebase for LocalEnv usage
echo "=== Searching for LocalEnv usage ==="
rg -i "local.?env" --type py
echo -e "\n=== Searching for openbabel imports ==="
rg -i "openbabel|import open" --type py
🏁 Script executed:
#!/bin/bash
# Check pyproject.toml for dependencies
echo "=== pyproject.toml dependencies ==="
cat pyproject.toml | grep -A 50 "dependencies\|optional"
echo -e "\n=== Check if LocalEnv/local_env is implemented in src ==="
rg -l "local_env|LocalEnv" src/
🏁 Script executed:
#!/bin/bash
# Check how local_env is used in the source
echo "=== src/mattext/models/utils.py ==="
cat src/mattext/models/utils.py
🏁 Script executed:
#!/bin/bash
# Check if SmilesTokenizer or local_env implementation uses openbabel
echo "=== Searching for SmilesTokenizer implementation ==="
find src -name "*.py" -type f | xargs grep -l "SmilesTokenizer"
echo -e "\n=== Check tokenizer implementations ==="
rg "class SmilesTokenizer|class.*Tokenizer" src/ -A 5
🏁 Script executed:
#!/bin/bash
# Find and check tokenizer implementations
echo "=== tokenizer/__init__.py ==="
cat src/mattext/tokenizer/__init__.py
echo -e "\n=== Check for tokenizer directory files ==="
ls -la src/mattext/tokenizer/
🏁 Script executed:
#!/bin/bash
# Check README for OpenBabel mentions
echo "=== README.md mentions of OpenBabel ==="
rg -i "openbabel|local.*env" README.md -A 2 -B 2
echo -e "\n=== Check if xtal2txt is in dependencies and what it might need ==="
cat pyproject.toml | grep -i xtal2txt
echo -e "\n=== Search for any OpenBabel-related comments in codebase ==="
rg -i "openbabel" --type py --type yml
🏁 Script executed:
#!/bin/bash
# Check if libfftw3-dev is needed for xtal2txt or other dependencies
echo "=== Searching for fftw references ==="
rg -i "fftw" src/ scripts/ tests/ revision-scripts/
echo -e "\n=== Check xtal2txt GitHub or documentation for dependencies ==="
echo "xtal2txt is a git dependency, checking if it's imported in tests"
rg "from xtal2txt|import xtal2txt" tests/
🏁 Script executed:
#!/bin/bash
# Final check: verify what the tests actually import and execute
echo "=== All imports in test files ==="
rg "^import |^from " tests/
echo -e "\n=== Check if pymatgen structures internally use OpenBabel ==="
echo "Checking if pymatgen or xtal2txt might require OpenBabel transitively"
rg -i "openbabel" . --type py | head -20
Remove openbabel from system dependencies—it's not needed for the current test suite.
LocalEnv representation (which requires OpenBabel) is not covered by the tests (test_imports.py and test_xtal2pot.py only test basic tokenizers and structures). Since the test suite doesn't exercise this feature, installing OpenBabel adds unnecessary overhead to CI. If needed later for integration testing, add it back with a comment explaining its purpose.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/tests.yml around lines 32 - 35, Remove openbabel from the
system dependencies step in the "Install system dependencies" job by deleting
"openbabel" from the apt-get install command (leave "libfftw3-dev" intact) so CI
stops installing OpenBabel for tests; if OpenBabel is needed later for
integration tests, re-add it with a clarifying comment near the "Install system
dependencies" step.
> The `+` symbol before a configuration key indicates that you are adding a new key-value pair to the configuration. This is useful when you want to specify parameters that are not part of the default configuration.
>
> To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.
Fix redundant phrase "for e.g."
The phrase "for e.g." is redundant since "e.g." already means "for example". Use either "e.g." or "for example".
Suggested fix

```diff
-To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`.
+To override the existing default configuration, use `++`, e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 LanguageTool
[style] ~125-~125: The phrase ‘for e.g.’ is an tautology (‘e.g.’ means ‘for example’). Consider using just “e.g.” or “for example”.
Context: ...isting default configuration, use ++, for e.g., `++model.pretrain.training_arguments.p...
(FOR_EG_REDUNDANCY)
[uncategorized] ~125-~125: Do not mix variants of the same word (‘pretrain’ and ‘pre-train’) within a single text.
Context: ...ault configuration, use ++, for e.g., `++model.pretrain.training_arguments.per_device_train_bat...
(EN_WORD_COHERENCY)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` at line 125, The README line contains the redundant phrase "for
e.g." — replace it with either "e.g." or "for example" so the sentence reads
e.g. `++model.pretrain.training_arguments.per_device_train_batch_size=32` or for
example `++model.pretrain.training_arguments.per_device_train_batch_size=32`;
update the text surrounding the example in the same sentence accordingly.
```
@misc{alampara2024mattextlanguagemodelsneed,
      title={MatText: Do Language Models Need More than Text & Scale for Materials Modeling?},
      author={Nawaf Alampara and Santiago Miret and Kevin Maik Jablonka},
      year={2024},
      eprint={2406.17295},
      archivePrefix={arXiv},
      primaryClass={cond-mat.mtrl-sci}
      url={https://arxiv.org/abs/2406.17295},
}
```
Fix citation block formatting.
The code block should specify a language (e.g., bibtex), and the url field appears to be placed after primaryClass without a trailing comma, which would be invalid BibTeX syntax.
Suggested fix

````diff
-```
+```bibtex
 @misc{alampara2024mattextlanguagemodelsneed,
       title={MatText: Do Language Models Need More than Text & Scale for Materials Modeling?},
       author={Nawaf Alampara and Santiago Miret and Kevin Maik Jablonka},
       year={2024},
       eprint={2406.17295},
       archivePrefix={arXiv},
-      primaryClass={cond-mat.mtrl-sci}
-      url={https://arxiv.org/abs/2406.17295},
+      primaryClass={cond-mat.mtrl-sci},
+      url={https://arxiv.org/abs/2406.17295}
 }
````

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 150-150: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 150 - 159, Update the citation block for the
`@misc`{alampara2024mattextlanguagemodelsneed} entry: change the code fence to
specify the language (```bibtex), add a missing comma after the primaryClass
field value (primaryClass={cond-mat.mtrl-sci},) and ensure the url field is
punctuated correctly (url={https://arxiv.org/abs/2406.17295} without an extra
trailing comma), then close the fenced code block.
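The missing-comma class of mistake is also easy to catch mechanically. Below is a minimal, hypothetical line-based checker (`missing_field_commas` is not part of MatText or of any BibTeX library) that flags a field line not ending in a comma when another field follows it:

```python
# Naive sketch: flag BibTeX field lines that are followed by another
# field line but do not end with a comma. Purely illustrative; a real
# BibTeX parser also handles braces, strings, and multi-line values.
def missing_field_commas(bibtex: str) -> list:
    lines = [ln.strip() for ln in bibtex.splitlines()]
    bad = []
    for cur, nxt in zip(lines, lines[1:]):
        if "=" in cur and "=" in nxt and not cur.endswith(","):
            bad.append(cur)
    return bad

# Reduced version of the entry under review, with the missing comma.
entry = """@misc{alampara2024mattextlanguagemodelsneed,
  primaryClass={cond-mat.mtrl-sci}
  url={https://arxiv.org/abs/2406.17295},
}"""
```

Running the checker on `entry` reports the `primaryClass` line as the one missing its trailing comma.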
```python
if state.is_main_process:
    wandb.init(
        config=dict(self.task_cfg.model.inference),
        project=self.task_cfg.model.logging.wandb_project,
        name=test_name,
    )

exp_cfg.model.inference.path.test_data = self.test_data
exp_cfg.model.inference.path.pretrained_checkpoint = ckpt

try:
    predict = self._get_inference(exp_cfg, fold_name)
    predictions, prediction_ids = predict.predict()
    self._record_predictions(task, i, predictions, prediction_ids)
except Exception as e:
    logger.error(
        f"Error occurred during inference for finetuned checkpoint '{exp_name}': {str(e)}"
    )
    if isinstance(e, (ValueError, TypeError)):
        raise
    logger.error(traceback.format_exc())

# All ranks sync after inference before next fold
state.wait_for_everyone()
```
Always close the inference W&B run (wandb.finish) in this method.
wandb.init at Line 65 is not paired with wandb.finish here, so inference runs can bleed across folds.
Proposed fix

```diff
     if state.is_main_process:
         wandb.init(
             config=dict(self.task_cfg.model.inference),
             project=self.task_cfg.model.logging.wandb_project,
             name=test_name,
         )
     exp_cfg.model.inference.path.test_data = self.test_data
     exp_cfg.model.inference.path.pretrained_checkpoint = ckpt
     try:
         predict = self._get_inference(exp_cfg, fold_name)
         predictions, prediction_ids = predict.predict()
         self._record_predictions(task, i, predictions, prediction_ids)
     except Exception as e:
         logger.error(
             f"Error occurred during inference for finetuned checkpoint '{exp_name}': {str(e)}"
         )
         if isinstance(e, (ValueError, TypeError)):
             raise
         logger.error(traceback.format_exc())
+    finally:
+        wandb.finish()
```
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 80-80: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mattext/models/benchmark.py` around lines 64 - 87, The wandb run started
by wandb.init in the state.is_main_process branch isn't closed; wrap the
inference sequence that calls self._get_inference and predict.predict in a
try/except/finally (still preserving the existing except behavior that re-raises
ValueError/TypeError) and call wandb.finish() in the finally block (only when
state.is_main_process and after wandb.init was invoked) so each fold's run is
always closed even on errors; ensure the finally runs before
state.wait_for_everyone().
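The pattern the comment asks for — init on the main process, work in `try`, `wandb.finish()` in `finally` — can be sketched standalone. `FakeWandb` below is a hypothetical stand-in for the `wandb` module so the shape is runnable without W&B; unlike the real method, this sketch lets every exception propagate, since the point is only the placement of the `finally` block:

```python
class FakeWandb:
    """Hypothetical stand-in for the wandb module; counts run lifecycle."""
    def __init__(self):
        self.open_runs = 0
        self.finished_runs = 0

    def init(self, **kwargs):
        self.open_runs += 1

    def finish(self):
        if self.open_runs:
            self.open_runs -= 1
            self.finished_runs += 1

wandb = FakeWandb()
is_main_process = True  # stands in for state.is_main_process

def run_fold(predict, fold_name):
    if is_main_process:
        wandb.init(name=fold_name)
    try:
        return predict()
    except (ValueError, TypeError):
        raise  # configuration errors still propagate
    finally:
        # Runs on success and on every exception path, so the next
        # fold never inherits a dangling W&B run.
        if is_main_process:
            wandb.finish()
```

After any call to `run_fold`, whether it returned or raised, no run is left open.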
```python
for i, (exp_name, test_name) in enumerate(
    zip(self.exp_names, self.test_exp_names)
):
```
🧩 Analysis chain
🏁 Script executed:
cat -n src/mattext/models/benchmark.py | head -200
🏁 Script executed:
fd -e "pyproject.toml" -o -e "setup.py" -o -e "setup.cfg" -o -e "requirements.txt" | head -10
🏁 Script executed:
git ls-files | grep -E "(pyproject.toml|setup.py|setup.cfg|requirements)" | head -10
🏁 Script executed:
cat pyproject.toml
🏁 Script executed:
rg -i "(exp_name|test_exp_names)" src/ --type py -A 2 -B 2 | head -80
🏁 Script executed:
rg "exp_name|exp_names" --type yaml --type toml --type json | head -30
Use zip(..., strict=True) to catch length mismatches between experiment and test names.
Without strict=True, if self.exp_names and self.test_exp_names differ in length, the loop silently discards unmatched pairs, causing silent data loss in benchmark results.
Proposed fix
```diff
 for i, (exp_name, test_name) in enumerate(
-    zip(self.exp_names, self.test_exp_names)
+    zip(self.exp_names, self.test_exp_names, strict=True)
 ):
```

Also applies to: 151-153
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 118-118: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mattext/models/benchmark.py` around lines 117 - 119, The loop pairing
experiment and test names uses zip(self.exp_names, self.test_exp_names) which
silently truncates mismatched sequences; update both occurrences (the for i,
(exp_name, test_name) in enumerate(...) loops at the shown spots) to use
zip(self.exp_names, self.test_exp_names, strict=True) so a length mismatch
raises immediately; apply the same change to the second occurrence around lines
151-153 and run tests (ensure Python version supports zip(..., strict=True)).
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (14)
scripts/qmof_prepare_data.py (1)
18-26: ⚠️ Potential issue | 🟡 Minor

Keep return type docs consistent with the new annotation.

Line 18 returns `list[dict]`, but Line 25 still documents `List[Dict]`. Please update the docstring to match.

Suggested fix

```diff
-        List[Dict]: A list of dictionaries containing the JSON data.
+        list[dict]: A list of dictionaries containing the JSON data.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/qmof_prepare_data.py` around lines 18 - 26, the docstring for read_json currently uses capitalized typing (List[Dict]) that doesn't match the function annotation list[dict]; update the "Returns:" section to use the lowercase, modern annotation style (e.g., "list[dict]") and ensure the descriptive text matches the return type; locate the read_json function and change the docstring return type line to "list[dict]".

scripts/linear_potential.py (1)
16-24: ⚠️ Potential issue | 🟡 Minor

Update docstring return type to match the annotation.

Line 16 uses `list[dict]`, while Line 23 still says `List[Dict]`.

Suggested fix

```diff
-        List[Dict]: A list of dictionaries containing the JSON data.
+        list[dict]: A list of dictionaries containing the JSON data.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/linear_potential.py` around lines 16 - 24, the docstring for function read_json has a mismatched return type description ("List[Dict]") that doesn't match the type annotation (list[dict]); update the docstring return section to exactly match the annotation (use "list[dict]" or the same canonical form you choose) so the return type in the docstring and the function signature (read_json) are consistent.

pyproject.toml (1)
124-135: ⚠️ Potential issue | 🟡 Minor

Bumpversion `current_version` format doesn't match the `parse` regex.

The `current_version = "0.0.2-dev"` uses a hyphen before "dev", but the `parse` regex only matches:

- `major.minor.patch` (e.g., `0.0.2`)
- `major.minor.patch.postN.devN` (e.g., `0.0.2.post1.dev1`)

The regex only partially matches "0.0.2-dev" (capturing just "0.0.2"), leaving the "-dev" suffix unmatched. This incomplete parsing will cause bumpversion commands to fail or behave unexpectedly.

Additionally, the project `version = "0.0.2"` (line 9) mismatches `current_version = "0.0.2-dev"` and should be synchronized.

Fix by either updating `current_version` to match the regex pattern:

```diff
-current_version = "0.0.2-dev"
+current_version = "0.0.2"
```

Or update the `parse` regex to support the hyphenated dev format:

```diff
-parse = "(?P<major>\\d+)\\.(?P<minor>\\d+)\\.(?P<patch>\\d+)(\\.(?P<dev>post)\\d+\\.dev\\d+)?"
+parse = "(?P<major>\\d+)\\.(?P<minor>\\d+)\\.(?P<patch>\\d+)(-(?P<dev>dev))?"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 124 - 135, the bumpversion config is inconsistent: current_version = "0.0.2-dev" doesn't match the parse regex and also disagrees with project version = "0.0.2"; either make the values consistent or expand the regex. Fix option A: set current_version (and project version) to a value that matches parse (e.g., "0.0.2") so both "current_version" and "version" are identical; or Fix option B: update the parse pattern to accept the hyphenated dev suffix (so the parse regex used by [tool.bumpversion] will match "0.0.2-dev") and ensure serialize still produces the intended tags. Ensure you update the unique keys current_version, parse, and version accordingly.

src/mattext/models/inference.py (2)
20-35: ⚠️ Potential issue | 🔴 Critical

Initialize `self.task_cfg` before dereferencing it.

`self.task_cfg` is used on Line 28 onward but never assigned in `__init__`, which will raise immediately during object creation.

Suggested fix

```diff
 def __init__(self, task_cfg: DictConfig):
+    self.task_cfg = task_cfg
     self.representation = self.task_cfg.model.representation
     self.benchmark = self.task_cfg.model.inference.benchmark_dataset
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/inference.py` around lines 20 - 35, the __init__ method dereferences self.task_cfg before it is assigned, causing an AttributeError; fix by assigning the incoming task_cfg parameter to self.task_cfg at the start of __init__ (e.g., self.task_cfg = task_cfg) before any access, then keep the existing assignments to self.representation, self.benchmark, self.exp_names, self.test_exp_names, self.train_data, self.test_data, self.wndb, and self.benchmark_save_path so they reference the stored configuration object.
91-95: ⚠️ Potential issue | 🔴 Critical

Unpack `predict.predict()` before recording benchmark results.

In this block, `predict.predict()` is treated as a single object, but the benchmark flow in `src/mattext/models/benchmark.py` (Lines 74-84) expects unpacked values. Passing the raw return value to `benchmark.record(...)` can fail and then be silently skipped by the broad exception path.

Suggested fix

```diff
 try:
     predict = Inference(exp_cfg)
-    predictions = predict.predict()
+    predictions, prediction_ids = predict.predict()
     benchmark.record(i, predictions)
 except Exception:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/inference.py` around lines 91 - 95, the call to predict.predict() returns multiple values that must be unpacked before passing into benchmark.record; modify the try block around Inference(exp_cfg) so you assign the returned tuple from Inference.predict() to separate variables (e.g., y_pred, timings or similar based on Inference.predict signature) and then call benchmark.record(i, y_pred, timings) instead of benchmark.record(i, predictions). Also avoid silently swallowing errors — catch Exception as e and log or re-raise so failures in Inference.predict() are visible (referencing the Inference class and its predict() method and the benchmark.record(...) call).

scripts/filterdataset.py (1)
25-34: ⚠️ Potential issue | 🔴 Critical

Bug: `wycoff_rep` mapped to `None` will cause `TypeError` at runtime.

The `_TOKENIZER_MAP` maps `wycoff_rep` to `None`, but `_tokenize_without_truncation` (line 60-61) directly calls `tokenizer_class(...)` without a None check. This will raise `TypeError: 'NoneType' object is not callable` if `wycoff_rep` is used.

This is inconsistent with `src/mattext/models/utils.py` (lines 19-32), which correctly maps `wycoff_rep` to `RobocrysTokenizer`.

Proposed fix - align with utils.py

```diff
 _TOKENIZER_MAP: dict[str, type] = {
     "slice": SliceTokenizer,
     "composition": CompositionTokenizer,
     "cif_symmetrized": CifTokenizer,
     "cif_p1": CifTokenizer,
     "cif_bonding": CifTokenizer,
     "crystal_llm_rep": CrysllmTokenizer,
     "robocrys_rep": RobocrysTokenizer,
-    "wycoff_rep": None,
+    "wycoff_rep": RobocrysTokenizer,
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/filterdataset.py` around lines 25 - 34, the _TOKENIZER_MAP currently maps "wycoff_rep" to None which leads to a TypeError when _tokenize_without_truncation calls tokenizer_class(...); update the _TOKENIZER_MAP entry for "wycoff_rep" to reference RobocrysTokenizer (same as in src/mattext/models/utils.py) so that _tokenize_without_truncation, which instantiates tokenizer_class, receives a valid callable class; verify the map uses the exact symbol RobocrysTokenizer and no None values for supported keys.

src/mattext/models/llama_sft.py (1)
90-99: ⚠️ Potential issue | 🔴 Critical

Bug: `compute_dtype` may be unbound when quantization is disabled.

If both `use_4bit` and `use_8bit` are `False`, the code jumps to `else` at line 90 and sets `bnb_config = None`, but `compute_dtype` (line 82) is never assigned. Line 94's check `if compute_dtype == torch.float16` will raise `UnboundLocalError`.

Proposed fix - ensure compute_dtype is always defined

```diff
+    compute_dtype = getattr(torch, self.bnb_config.bnb_4bit_compute_dtype)
+
     if self.bnb_config.use_4bit and self.bnb_config.use_8bit:
         raise ValueError(
             "You can't load the model in 8 bits and 4 bits at the same time"
         )
     elif self.bnb_config.use_4bit or self.bnb_config.use_8bit:
-        compute_dtype = getattr(torch, self.bnb_config.bnb_4bit_compute_dtype)
         bnb_config = BitsAndBytesConfig(
             load_in_4bit=self.bnb_config.use_4bit,
             load_in_8bit=self.bnb_config.use_8bit,
```

Or skip the GPU capability check when quantization is disabled:

```diff
     else:
         bnb_config = None
+        compute_dtype = None

     # Check GPU compatibility with bfloat16
-    if compute_dtype == torch.float16:
+    if compute_dtype is not None and compute_dtype == torch.float16:
         major, _ = torch.cuda.get_device_capability()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama_sft.py` around lines 90 - 99, the GPU bfloat16 compatibility check uses compute_dtype which can be unassigned when quantization is disabled (use_4bit and use_8bit both False); to fix, ensure compute_dtype is always defined or skip the check when bnb_config is None: either set a default (e.g., compute_dtype = None or torch.float32) before the quantization branches or wrap the check with a guard like "if bnb_config is not None and compute_dtype == torch.float16:" so the call in torch.cuda.get_device_capability() only runs when compute_dtype was set by the quantization branches (referencing compute_dtype, bnb_config, use_4bit, use_8bit, and logger).

src/mattext/models/utils.py (1)
256-264: ⚠️ Potential issue | 🟠 Major

Bug: Duplicate code in `on_evaluate` logs predictions twice.

Lines 256-260 and 261-264 are identical - the method calls `super().on_evaluate(...)`, generates a samples table, and logs it twice per evaluation. This doubles computation and logs duplicate `sample_predictions` entries to WandB.

Proposed fix - remove duplicate block

```diff
 def on_evaluate(self, args, state, control, **kwargs):
     """Log the wandb.Table after calling trainer.evaluate."""
     super().on_evaluate(args, state, control, **kwargs)
     records_table = self.samples_table(self.sample_dataset)
     self._wandb.log({"sample_predictions": records_table})
-    # "Log the wandb.Table after calling trainer.evaluate"
-    super().on_evaluate(args, state, control, **kwargs)
-    records_table = self.samples_table(self.sample_dataset)
-    self._wandb.log({"sample_predictions": records_table})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/utils.py` around lines 256 - 264, the on_evaluate method currently executes the same block twice (calls super().on_evaluate(...), builds records_table via samples_table(self.sample_dataset), and calls self._wandb.log({"sample_predictions": records_table})), causing duplicate computation and duplicate WandB entries; remove the duplicated second block so on_evaluate only calls super().on_evaluate(args, state, control, **kwargs), creates records_table = self.samples_table(self.sample_dataset) once, and logs it once with self._wandb.log({"sample_predictions": records_table}).

src/mattext/models/llama.py (2)
248-264: ⚠️ Potential issue | 🟠 Major

Inconsistent main-process gating for WandB and model saving.

Unlike `finetune.py`, the `wandb.log(eval_result)`, `trainer.save_model()`, `model.save_pretrained()`, and `wandb.finish()` calls at lines 256-264 are not gated to the main process. This will cause:

- All ranks attempting to save models (potential file conflicts)
- Duplicate WandB logging from non-main ranks
- `wandb.finish()` called on ranks where `wandb.init()` was never called (per `src/mattext/main.py:75-79`, only main process calls `wandb.init()`)

Suggested fix to match finetune.py pattern

```diff
     # Trainer handles all DDP synchronization automatically
     trainer.train()
-    trainer.save_model(
-        f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
-    )
     eval_result = trainer.evaluate(eval_dataset=self.tokenized_dataset["test"])
-    wandb.log(eval_result)
-    self.model.save_pretrained(self.cfg.path.finetuned_modelname)
-    wandb.finish()
+    if is_main:
+        wandb.log(eval_result)
+        trainer.save_model(
+            f"{self.cfg.path.finetuned_modelname}/llamav2-7b-lora-fine-tune"
+        )
+        self.model.save_pretrained(self.cfg.path.finetuned_modelname)
+        wandb.finish()
+    return self.cfg.path.finetuned_modelname
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama.py` around lines 248 - 264, the calls after training (wandb.log(eval_result), trainer.save_model(...), self.model.save_pretrained(...), and wandb.finish()) are not gated by the main-process check and will run on all ranks; wrap these calls with the same is_main condition used earlier (derived from self.local_rank is None or self.local_rank == 0) so only the main process performs wandb logging, saves the trainer/model, and calls wandb.finish; use the existing is_main variable and mirror the finetune.py pattern to avoid duplicate logs, file conflicts, and wandb.finish() being called on non-initialized ranks.
99-119: ⚠️ Potential issue | 🔴 Critical

Potential `NameError`: `compute_dtype` may be undefined.

`compute_dtype` is only assigned inside the `elif` block (line 100), but is referenced unconditionally at line 112. If both `use_4bit` and `use_8bit` are `False`, the code will raise a `NameError`.

Suggested fix

```diff
+    compute_dtype = None
     if self.bnb_config.use_4bit and self.bnb_config.use_8bit:
         raise ValueError(
             "You can't load the model in 8 bits and 4 bits at the same time"
         )
     elif self.bnb_config.use_4bit or self.bnb_config.use_8bit:
         compute_dtype = getattr(torch, self.bnb_config.bnb_4bit_compute_dtype)
         bnb_config = BitsAndBytesConfig(
             load_in_4bit=self.bnb_config.use_4bit,
             load_in_8bit=self.bnb_config.use_8bit,
             bnb_4bit_quant_type=self.bnb_config.bnb_4bit_quant_type,
             bnb_4bit_compute_dtype=compute_dtype,
             bnb_4bit_use_double_quant=self.bnb_config.use_nested_quant,
         )
     else:
         bnb_config = None

     # Check GPU compatibility with bfloat16
-    if compute_dtype == torch.float16:
+    if compute_dtype is not None and compute_dtype == torch.float16:
         major, _ = torch.cuda.get_device_capability()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama.py` around lines 99 - 119, the variable compute_dtype can be referenced before assignment; set compute_dtype = None before the BitsAndBytesConfig branch (or move the bfloat16 check into the same branch) and guard the GPU bfloat16 check so it only runs when compute_dtype is not None and equals torch.float16; update the code around the BitsAndBytesConfig construction (the elif that checks bnb_config.use_4bit or bnb_config.use_8bit and the subsequent bfloat16 GPU check) to avoid a NameError.

src/mattext/models/finetune.py (1)
117-126: ⚠️ Potential issue | 🟡 Minor

Specify minimum transformers version requirement in pyproject.toml.

The `processing_class` parameter (used in the Trainer initialization) was introduced in transformers v4.46.0. Since pyproject.toml currently specifies `"transformers"` without a version constraint, the code will fail on earlier versions. Add `transformers>=4.46.0` to enforce compatibility.
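As a quick illustration of the floor check, a stdlib-only comparison against the 4.46.0 boundary. The helper names are ours, and the naive parser assumes plain `X.Y.Z` strings (no dev/rc suffixes), so it is a sketch rather than a replacement for a real version parser:

```python
# Hypothetical helper: compare a transformers version string against the
# v4.46.0 floor that introduced Trainer's `processing_class` parameter.
def parse_version(version: str) -> tuple:
    # Assumes a plain "X.Y.Z" string; real packaging metadata can be messier.
    return tuple(int(part) for part in version.split(".")[:3])

def supports_processing_class(transformers_version: str) -> bool:
    return parse_version(transformers_version) >= (4, 46, 0)
```

Tuple comparison keeps `4.9.x` correctly below `4.46.x`, which naive string comparison would get wrong.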
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/finetune.py` around lines 117 - 126, The Trainer instantiation uses the processing_class parameter (processing_class=self._wrapped_tokenizer) which requires transformers>=4.46.0; update pyproject.toml to pin the transformers dependency to at least 4.46.0 (e.g. "transformers>=4.46.0") so the Trainer(..., processing_class=...) call will work reliably across installs.

revision-scripts/text_rep.py (1)
140-146: ⚠️ Potential issue | 🟡 Minor

Always persist final resume progress at loop end.

If total batches are fewer than `save_interval`, no progress file is written. That breaks restart-from-log expectations.

Proposed fix
```diff
 for i, batch_data in enumerate(batch_iterator, start=1):
     ...
     if i % save_interval == 0:
         with open(log_file_path, "w") as log_file:
             log_file.write(f"Last processed entry index: {last_processed_entry}\n")
             log_file.write(f"Last processed batch number: {i}\n")

+with open(log_file_path, "w") as log_file:
+    log_file.write(f"Last processed entry index: {last_processed_entry}\n")
+    log_file.write(f"Last processed batch number: {i if 'i' in locals() else 0}\n")
+print(f"Finished !!! logging at {log_file_path}")
```
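The shape of the fix, periodic checkpoints plus one unconditional final write, can be exercised as a self-contained sketch. The helper names and the temp-file layout are illustrative, not the script's actual API:

```python
import os
import tempfile

def _write_progress(path: str, entry, batch_no: int) -> None:
    # Same two-line format as the periodic writes in the script.
    with open(path, "w") as log_file:
        log_file.write(f"Last processed entry index: {entry}\n")
        log_file.write(f"Last processed batch number: {batch_no}\n")

def process_batches(batches, log_file_path: str, save_interval: int = 100) -> None:
    last_processed_entry = None
    i = 0  # bound even if `batches` is empty
    for i, batch in enumerate(batches, start=1):
        last_processed_entry = batch[-1]  # stand-in for real per-batch work
        if i % save_interval == 0:        # periodic checkpoint
            _write_progress(log_file_path, last_processed_entry, i)
    # Unconditional final write: resume state exists even for short runs.
    _write_progress(log_file_path, last_processed_entry, i)

log_path = os.path.join(tempfile.mkdtemp(), "progress.log")
# Three batches with save_interval=100: only the final write fires.
process_batches([[1, 2], [3, 4], [5, 6]], log_path, save_interval=100)
```

With the original logic this run would leave no log at all; here the final write guarantees one.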
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@revision-scripts/text_rep.py` around lines 140 - 146, The loop only writes progress when i % save_interval == 0, so if total batches < save_interval no progress file is created; after the loop in revision-scripts/text_rep.py add a final unconditional write to log_file_path that persists last_processed_entry and the final batch index (i or last batch number) so restart-from-log works; use the same format as the periodic writes (f"Last processed entry index: {last_processed_entry}\n" and f"Last processed batch number: {i}\n") to overwrite/create the file and ensure resume state is always saved.

revision-scripts/prep_rep.py (1)
140-146: ⚠️ Potential issue | 🟡 Minor

Write final progress log after processing completes.

With the current logic, the log may never be written when the batch count is below `save_interval`.

Proposed fix
```diff
 for i, batch_data in enumerate(batch_iterator, start=1):
     ...
     if i % save_interval == 0:
         with open(log_file_path, "w") as log_file:
             log_file.write(f"Last processed entry index: {last_processed_entry}\n")
             log_file.write(f"Last processed batch number: {i}\n")

+with open(log_file_path, "w") as log_file:
+    log_file.write(f"Last processed entry index: {last_processed_entry}\n")
+    log_file.write(f"Last processed batch number: {i if 'i' in locals() else 0}\n")
+print(f"Finished !!! logging at {log_file_path}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@revision-scripts/prep_rep.py` around lines 140 - 146, The final progress log may not be written if the total batch count is less than save_interval; after the batch-processing loop completes, ensure you write the final progress to log_file_path using the same fields (last_processed_entry and the last batch index i or a final batch counter) so the log is always present; update the code around the loop that updates last_processed_entry and checks save_interval (refer to last_processed_entry, save_interval, log_file_path, and i) to perform one last write to the log after the loop finishes.

revision-scripts/prep_json.py (1)
39-42: ⚠️ Potential issue | 🟡 Minor

Use `os.makedirs(..., exist_ok=True)` instead of check-then-create.

The pattern at lines 39–40 (`os.path.exists()` + `os.mkdir()`) is race-prone and fails for nested paths. `os.makedirs(..., exist_ok=True)` is safer and handles parent directory creation automatically.

Proposed fix
```diff
-if not os.path.exists(save_path):
-    os.mkdir(save_path)
-else:
-    print(f"Directory '{save_path}' already exists.")
+already_exists = os.path.isdir(save_path)
+os.makedirs(save_path, exist_ok=True)
+if already_exists:
+    print(f"Directory '{save_path}' already exists.")
```
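For illustration, the replacement call handles nested paths and repeated invocation without a pre-check. A temporary directory is used here only to keep the sketch self-contained:

```python
import os
import tempfile

base = tempfile.mkdtemp()
save_path = os.path.join(base, "nested", "out")

# Creates missing parent directories and is a no-op if the target already
# exists, so there is no check-then-create race window.
os.makedirs(save_path, exist_ok=True)
os.makedirs(save_path, exist_ok=True)  # idempotent: second call does not raise
```

A plain `os.mkdir(save_path)` would have failed here twice: once because `nested/` does not exist, and again on the repeat call.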
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@revision-scripts/prep_json.py` around lines 39 - 42, Replace the race-prone check-then-create block that uses os.path.exists() and os.mkdir() with a single safe call to os.makedirs(save_path, exist_ok=True); remove the else print branch since exist_ok=True will silently handle existing directories and will also create any needed parent directories. Keep the surrounding logic intact and ensure the os import remains available for prep_json.py.
♻️ Duplicate comments (2)
src/mattext/models/benchmark.py (1)
64-87: ⚠️ Potential issue | 🟠 Major

Missing `wandb.finish()` for inference W&B run.

The `wandb.init` at line 65 is not paired with `wandb.finish()`, so inference runs can bleed across folds. The W&B run started in the `is_main_process` block should be closed after inference completes.

Proposed fix - add wandb.finish in finally block
```diff
 if state.is_main_process:
     wandb.init(
         config=dict(self.task_cfg.model.inference),
         project=self.task_cfg.model.logging.wandb_project,
         name=test_name,
     )

 exp_cfg.model.inference.path.test_data = self.test_data
 exp_cfg.model.inference.path.pretrained_checkpoint = ckpt

 try:
     predict = self._get_inference(exp_cfg, fold_name)
     predictions, prediction_ids = predict.predict()
     self._record_predictions(task, i, predictions, prediction_ids)
 except Exception as e:
     logger.error(
         f"Error occurred during inference for finetuned checkpoint '{exp_name}': {e!s}"
     )
     if isinstance(e, ValueError | TypeError):
         raise
     logger.error(traceback.format_exc())
+finally:
+    wandb.finish()

 # All ranks sync after inference before next fold
 state.wait_for_everyone()
```
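The init/finish pairing can be exercised without W&B at all. This sketch uses a hypothetical `FakeRun` recorder (not part of the project) to show that a `finally` block closes the run even when inference raises:

```python
class FakeRun:
    """Hypothetical stand-in for the wandb module, recording lifecycle calls."""
    def __init__(self):
        self.events = []
    def init(self):
        self.events.append("init")
    def finish(self):
        self.events.append("finish")

def run_inference(run: FakeRun, is_main_process: bool, predict):
    if is_main_process:
        run.init()
    try:
        return predict()
    except Exception:
        pass  # the real code logs and selectively re-raises here
    finally:
        if is_main_process:
            run.finish()  # paired with init on the main process, even on errors

run = FakeRun()
def failing_predict():
    raise RuntimeError("inference failed")
run_inference(run, is_main_process=True, predict=failing_predict)
```

Even though the prediction raises, the recorded lifecycle is a clean `init`/`finish` pair, and non-main ranks never touch the run at all.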
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/benchmark.py` around lines 64 - 87, The W&B run opened in the state.is_main_process block via wandb.init is never closed; add a finally that calls wandb.finish() after the inference code so the run is always terminated (success or exception). Specifically, wrap the calls around _get_inference(...), predict.predict(), and _record_predictions(...) in a try/except/finally (or add a nested try/finally) such that wandb.finish() is executed only when state.is_main_process opened the run, and preserve the existing exception handling (re-raise ValueError/TypeError and log others) before finishing the run.

src/mattext/models/llama_sft.py (1)
215-217: ⚠️ Potential issue | 🟠 Major

Output directory not fold-scoped, causing checkpoint overwrites.

Lines 215-217 write all folds to the same `output_dir_` path, so later folds overwrite earlier checkpoints. Line 227 correctly uses `{self.fold}` for the tokenizer save, but `trainer.save_model` uses the non-scoped path.

Proposed fix - scope output_dir_ by fold
```diff
-self.output_dir_ = (
-    f"{self.cfg.path.finetuned_modelname}/llamav3-8b-lora-fine-tune"
-)
+self.output_dir_ = (
+    f"{self.cfg.path.finetuned_modelname}_{self.fold}/llamav3-8b-lora-fine-tune"
+)
```

Also applies to: 226-228
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama_sft.py` around lines 215 - 217, The output_dir_ currently is not fold-scoped so multiple folds overwrite checkpoints; change the initialization of self.output_dir_ to include the fold (e.g. incorporate {self.fold} into the f-string) so each fold writes to a unique directory, and ensure any calls that save the model (e.g. trainer.save_model) use this fold-scoped self.output_dir_ instead of a shared path—also update the related save calls around the tokenizer/save_pretrained area so both tokenizer and trainer use the same fold-scoped directory.
🧹 Nitpick comments (12)
src/mattext/analysis/xtal2pot.py (1)
32-42: Consider `strict=True` to catch data integrity issues early.

Using `strict=False` (which is also the default) silently truncates when the DataFrame columns have mismatched lengths. If the CSVs fetched from Dropbox ever have inconsistent row counts, element mappings would be silently lost. If data integrity is a concern, `strict=True` would raise a `ValueError` on a length mismatch, making issues visible immediately.

Suggested change
```diff
 _ATOM_PARAMETERS["uff_energy"] = dict(
-    zip(_df_uff_energy["Element"], _df_uff_energy["energy"], strict=False)
+    zip(_df_uff_energy["Element"], _df_uff_energy["energy"], strict=True)
 )
 _ATOM_PARAMETERS["uff_xi"] = dict(
-    zip(_df_uff_xi["Element"], _df_uff_xi["xi"], strict=False)
+    zip(_df_uff_xi["Element"], _df_uff_xi["xi"], strict=True)
 )
 _ATOM_PARAMETERS["uff_z"] = dict(
-    zip(_df_uff_z["Element"], _df_uff_z["zi"], strict=False)
+    zip(_df_uff_z["Element"], _df_uff_z["zi"], strict=True)
 )
```
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/mattext/analysis/xtal2pot.py` around lines 32 - 42, The zip calls that build _ATOM_PARAMETERS ("uff_energy", "uff_xi", "uff_z") use strict=False which silently ignores mismatched lengths; change these zip(...) calls to use strict=True so a ValueError is raised if the DataFrame columns (_df_uff_energy["Element"]/["energy"], _df_uff_xi["Element"]/["xi"], _df_uff_z["Element"]/["zi"]) have differing lengths, ensuring any CSV/data integrity issues surface immediately (handle the ValueError where appropriate if you want custom error messaging).

revision-scripts/5fold_split.py (1)
25-34: Consider de-duplicating the split-file write logic.

Lines 25-34 repeat the same pattern for train/test writing. A small helper would reduce duplication and future drift.
♻️ Suggested refactor
```diff
 def split_dataset(input_json, output_dir, n_splits=5, random_state=42):
 @@
     # Ensure output directory exists
     os.makedirs(output_dir, exist_ok=True)

+    def _write_fold_json(split_name, fold, payload):
+        file_path = os.path.join(
+            output_dir, f"{split_name}_mp_classification_fold_{fold}.json"
+        )
+        with open(file_path, "w") as f:
+            json.dump(payload, f, indent=2)
+        return file_path
+
     # Perform the split
     for fold, (train_index, test_index) in enumerate(kf.split(data), 1):
         train_data = [data[i] for i in train_index]
         test_data = [data[i] for i in test_index]

-        # Save train data
-        train_file = os.path.join(
-            output_dir, f"train_mp_classification_fold_{fold}.json"
-        )
-        with open(train_file, "w") as f:
-            json.dump(train_data, f, indent=2)
-
-        # Save test data
-        test_file = os.path.join(output_dir, f"test_mp_classification_fold_{fold}.json")
-        with open(test_file, "w") as f:
-            json.dump(test_data, f, indent=2)
+        train_file = _write_fold_json("train", fold, train_data)
+        test_file = _write_fold_json("test", fold, test_data)
```
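A runnable sketch of such a helper, extracted stand-alone. The filename pattern mirrors the script; the helper name and sample payloads are just suggestions:

```python
import json
import os
import tempfile

def write_fold_json(output_dir: str, split_name: str, fold: int, payload) -> str:
    """Write one split of one fold to JSON and return the file path."""
    file_path = os.path.join(
        output_dir, f"{split_name}_mp_classification_fold_{fold}.json"
    )
    with open(file_path, "w") as f:
        json.dump(payload, f, indent=2)
    return file_path

out_dir = tempfile.mkdtemp()
train_file = write_fold_json(out_dir, "train", 1, [{"id": "mp-1"}, {"id": "mp-2"}])
test_file = write_fold_json(out_dir, "test", 1, [{"id": "mp-3"}])
```

Because both call sites share one helper, a future change to the naming scheme or the JSON options happens in exactly one place.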
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@revision-scripts/5fold_split.py` around lines 25 - 34, The train/test file write logic in the 5fold_split.py block duplicates opening and json.dump calls for train_data/test_data; extract a small helper (e.g., write_split_file or save_fold_split) that accepts parameters (data, split_name, output_dir, fold) and performs os.path.join, open, and json.dump, then replace the duplicate blocks that set train_file/test_file and write to disk with calls to that helper for "train" and "test" using train_data and test_data respectively; ensure you preserve indent=2 and the filename pattern f"{split_name}_mp_classification_fold_{fold}.json".

src/mattext/dataprep/matbench_prepare_data.py (1)
33-43: Module-level signal handler registration is a side effect on import.

Registering the `SIGALRM` handler at module level (line 43) affects the entire process whenever this module is imported, even if the timeout functionality isn't used. Consider moving the signal registration inside the functions that need it, or using a context manager pattern.

Additionally, note that revision-scripts/text_rep.py catches `TimeoutError` from `concurrent.futures`, not this custom `TimeoutException`. If these modules are expected to share exception-handling patterns, this inconsistency could cause timeouts to go unhandled.

♻️ Consider lazy registration pattern
```diff
-# Change the behavior of SIGALRM
-signal.signal(signal.SIGALRM, timeout_handler)
+_signal_handler_registered = False
+
+def _ensure_signal_handler():
+    global _signal_handler_registered
+    if not _signal_handler_registered:
+        signal.signal(signal.SIGALRM, timeout_handler)
+        _signal_handler_registered = True
```

Then call `_ensure_signal_handler()` at the start of `process_entry_train_matbench` and `process_entry_test_matbench`.
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/mattext/dataprep/matbench_prepare_data.py` around lines 33 - 43, The module currently registers the SIGALRM handler at import time (signal.signal(signal.SIGALRM, timeout_handler)), which is a global side effect; move that registration into a lazy helper (e.g. _ensure_signal_handler) or a context manager and call it from the functions that actually use timeouts (process_entry_train_matbench and process_entry_test_matbench) so the handler is only installed when needed; also reconcile the exception type by either translating your custom TimeoutException to concurrent.futures.TimeoutError where those callers expect it (or raise TimeoutError directly) so external code that catches concurrent.futures.TimeoutError will correctly handle timeout events.

pyproject.toml (1)
31-32: Consider pinning git dependencies to a specific commit or tag.

Both `matbench` and `xtal2txt` use git URLs without pinning to a specific ref. This can lead to non-reproducible builds if the main branch changes. Consider pinning:

```
"xtal2txt @ git+https://github.com/lamalab-org/xtal2txt.git@v1.0.0"  # or commit SHA
```

This is optional if the dependency is under active development and you want the latest changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 31 - 32, The git dependencies for matbench and xtal2txt are unpinned which breaks reproducibility; update the dependency strings "matbench @ git+https://github.com/materialsproject/matbench.git" and "xtal2txt @ git+https://github.com/lamalab-org/xtal2txt.git" to include a specific tag or commit SHA (e.g., append `@vX.Y.Z` or @<commit-sha>) so builds are deterministic, or document why leaving them floating is intentional if you prefer tracking latest changes.

src/mattext/models/pretrain.py (1)
112-112: Remove stale commented-out code.

Line 112 looks like leftover debug/experiment code; better to remove it to keep the training path clean.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/pretrain.py` at line 112, Remove the stale commented-out debug line that sets config_model_args["max_position_embeddings"] = self.context_length in src/mattext/models/pretrain.py; delete this commented code (the line referencing config_model_args, max_position_embeddings and self.context_length) to keep the training code clean and avoid leaving legacy debug artifacts.

src/mattext/models/llama_sft.py (1)
240-241: Replace deprecated `torch.cuda.amp.autocast()` with `torch.amp.autocast(device_type='cuda')`.

`torch.cuda.amp.autocast()` is deprecated in favor of the unified AMP API. Use `torch.amp.autocast(device_type='cuda')` instead.

Proposed fix
```diff
-with torch.cuda.amp.autocast():
+with torch.amp.autocast(device_type='cuda'):
     pred = pipe(self.formatting_tests_func(self.testdata))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama_sft.py` around lines 240 - 241, Replace the deprecated torch.cuda.amp.autocast() context manager with the unified API: use torch.amp.autocast(device_type='cuda') around the inference call (the block using pipe(self.formatting_tests_func(self.testdata))). Update the single context manager invocation in the method containing that line to torch.amp.autocast(device_type='cuda') so AMP is used via the unified API while keeping the existing call to pipe and formatting_tests_func/testdata unchanged.

src/mattext/models/llama.py (2)
115-119: Redundant "=" separators in warning messages.

Using `logger.warning()` with separator lines is noisy. Consider a single concise warning message.

Suggested fix
```diff
-logger.warning("=" * 80)
-logger.warning(
-    "Your GPU supports bfloat16: accelerate training with bf16=True"
-)
-logger.warning("=" * 80)
+logger.warning(
+    "Your GPU supports bfloat16 - consider enabling bf16=True for faster training"
+)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama.py` around lines 115 - 119, The three consecutive logger.warning calls that print separator lines and a message are noisy; consolidate them into a single concise warning by removing the "=" separator lines and calling logger.warning once with a clear message (e.g., "Your GPU supports bfloat16; consider using bf16=True to accelerate training") where the current logger.warning calls appear in src/mattext/models/llama.py.
216-228: Redundant "WARNING:" prefix in logger.warning() calls.

Same issue as in finetune.py - the `logger.warning()` method already indicates WARNING level.

Suggested fix
```diff
 if config_dict.get("load_best_model_at_end", False):
     logger.warning(
-        f"[Rank {self.local_rank}] WARNING: Disabling load_best_model_at_end in DDP mode"
+        f"[Rank {self.local_rank}] Disabling load_best_model_at_end in DDP mode"
     )
     config_dict["load_best_model_at_end"] = False

 if config_dict.get("save_on_each_node", False):
     logger.warning(
-        f"[Rank {self.local_rank}] WARNING: save_on_each_node should be False in DDP"
+        f"[Rank {self.local_rank}] save_on_each_node should be False in DDP"
     )
     config_dict["save_on_each_node"] = False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama.py` around lines 216 - 228, The logger.warning messages in the DDP handling block include a redundant "WARNING:" prefix; update the two logger.warning calls in the method that manipulates config_dict/local_rank (the block checking "load_best_model_at_end" and "save_on_each_node") to remove the literal "WARNING:" prefix so the messages read e.g. f"[Rank {self.local_rank}] Disabling load_best_model_at_end in DDP mode" and f"[Rank {self.local_rank}] save_on_each_node should be False in DDP" respectively, leaving the rest of the logic (setting config_dict[...] = False) unchanged.

src/mattext/models/finetune.py (2)
86-98: Redundant "WARNING:" prefix in logger.warning() calls.

The `logger.warning()` method already outputs messages at WARNING level. Including "WARNING:" in the message string is redundant.

Suggested fix
```diff
 if config_dict.get("load_best_model_at_end", False):
     logger.warning(
-        f"[Rank {self.local_rank}] WARNING: Disabling load_best_model_at_end in DDP mode"
+        f"[Rank {self.local_rank}] Disabling load_best_model_at_end in DDP mode"
     )
     config_dict["load_best_model_at_end"] = False

 if config_dict.get("save_on_each_node", False):
     logger.warning(
-        f"[Rank {self.local_rank}] WARNING: save_on_each_node should be False in DDP"
+        f"[Rank {self.local_rank}] save_on_each_node should be False in DDP"
    )
     config_dict["save_on_each_node"] = False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/finetune.py` around lines 86 - 98, The logger.warning messages in the DDP handling block (inside finetune logic where config_dict is created from config_train_args) redundantly include the literal "WARNING:" prefix; update the two logger.warning(...) calls (the ones guarded by self.local_rank and checking config_dict.get("load_best_model_at_end") and config_dict.get("save_on_each_node")) to remove the "WARNING:" text so they read e.g. f"[Rank {self.local_rank}] Disabling load_best_model_at_end in DDP mode" and f"[Rank {self.local_rank}] save_on_each_node should be False in DDP" while leaving the rest of the logic and assignments (setting the config_dict keys to False) unchanged.
137-144: Consider adding a synchronization barrier before non-main ranks exit.

All ranks call `evaluate()`, but only the main process saves the model and calls `wandb.finish()`. If non-main ranks exit while the main process is still saving, this could cause issues in some distributed setups. Consider adding `torch.distributed.barrier()` after the gated block to ensure all ranks wait for the main process to complete saving before any rank terminates.

Suggested fix
```diff
 if is_main:
     wandb.log(eval_result)
     trainer.save_model(self.cfg.path.finetuned_modelname)
     wandb.finish()

+# Ensure all ranks wait for main process to finish saving
+if self.local_rank is not None:
+    torch.distributed.barrier()
+
 return self.cfg.path.finetuned_modelname
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/finetune.py` around lines 137 - 144, The code calls trainer.evaluate(...) on all ranks but only the main rank runs wandb.log, trainer.save_model, and wandb.finish, so add a synchronization point after that gated block to ensure non-main ranks wait for the main rank to finish saving and finalizing wandb; specifically, after the if is_main: ... block call torch.distributed.barrier() when torch.distributed.is_initialized() is True (or skip if not initialized) so all ranks block until the main process completes save/finish operations.

revision-scripts/prep_rep.py (1)
26-28: Update `process_entry_train_matbench` and `process_entry_test_matbench` to accept and pass the `transformation` parameter.

These functions don't currently accept `transformations` as a parameter or pass it to `TextRep.from_input()`. The equivalent functions in src/mattext/dataprep/matbench_prepare_data.py include `transformations: dict | None = None` in their signatures and pass it via `transformation=transformations` to `TextRep.from_input()`. This divergence can cause inconsistent behavior between the two scripts.

Update both function signatures to include `transformations: dict | None = None` and pass it to the `TextRep.from_input()` calls at lines 26-28 and 58-60.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@revision-scripts/prep_rep.py` around lines 26 - 28, Add a new optional parameter transformations: dict | None = None to both process_entry_train_matbench and process_entry_test_matbench, and pass it into TextRep.from_input by calling TextRep.from_input(entry["structure"], transformation=transformations) so the transformation dict is forwarded; update both function signatures and the corresponding TextRep.from_input invocations (the calls around the existing get_requested_text_reps usages) to include transformation=transformations.

revision-scripts/text_rep.py (1)
26-28: Add `transformation` parameter for consistency with the matbench pipeline.

src/mattext/dataprep/matbench_prepare_data.py passes `transformation=transformations` to `TextRep.from_input()`, but this script omits it at lines 26-28 and 57-59. This inconsistency could cause behavior differences between the two data processing pipelines. Requires adding the parameter to function signatures and threading it through the call chain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@revision-scripts/text_rep.py` around lines 26 - 28, The TextRep.from_input call is missing the transformation parameter (used in matbench pipeline) causing inconsistent behavior; update TextRep.from_input invocations in this file (the call at TextRep.from_input(entry["structure"]).get_requested_text_reps(...) and the similar call later) to pass transformation=transformations, and then add a transformation argument to any local function/method signatures that build or forward the call so the parameter is threaded through; specifically locate and update the function that calls TextRep.from_input as well as any helper functions around get_requested_text_reps to accept and forward transformation to ensure parity with src/mattext/dataprep/matbench_prepare_data.py.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a3148b49-3948-4061-a797-680cc7291c3f
📒 Files selected for processing (138)
.pre-commit-config.yaml, LICENSE, conf/archived_experiments/config-hydra.yaml, conf/archived_experiments/config-potential.yaml, conf/archived_experiments/config-sft.yaml, conf/archived_experiments/config-smiles.yaml, conf/archived_experiments/config.yaml, conf/archived_experiments/santiago/cifp1.yaml, conf/archived_experiments/santiago_100k/cifp1.yaml, conf/archived_experiments/testing_perturb_100/cifp1.yaml, conf/archived_experiments/testing_perturb_100/cifsymmetrized.yaml, conf/archived_experiments/testing_perturb_100/crystal_llm.yaml, conf/archived_experiments/testing_perturb_300/cifp1.yaml, conf/archived_experiments/testing_perturb_300/cifsymmetrized.yaml, conf/archived_experiments/testing_perturb_300/crystal_llm.yaml, conf/archived_experiments/testing_translate_100/cifsymmetrized.yaml, conf/archived_experiments/testing_translate_100/crystal_llm.yaml, conf/archived_experiments/testing_translate_300/cifsymmetrized.yaml, conf/archived_experiments/testing_translate_300/crystal_llm.yaml, conf/bandgap.yaml, conf/benchmark.yaml, conf/bg/atoms.yaml, conf/bg/atoms_params.yaml, conf/bg/cifp1.yaml, conf/bg/cifpsym.yaml, conf/bg/composition.yaml, conf/bg/crystal_llm.yaml, conf/bg/local_env.yaml, conf/bg/slices.yaml, conf/bg/zmatrix.yaml, conf/classification.yaml, conf/form/atoms.yaml, conf/form/atoms_params.yaml, conf/form/cifp1.yaml, conf/form/cifpsym.yaml, conf/form/composition.yaml, conf/form/crystal_llm.yaml, conf/form/local_env.yaml, conf/form/slices.yaml, conf/form/zmatrix.yaml, conf/form_energy.yaml, conf/group-test/composition.yaml, conf/group-test/slices.yaml, conf/is_metal/atoms.yaml, conf/is_metal/atoms_params.yaml, conf/is_metal/cifp1.yaml, conf/is_metal/cifpsym.yaml, conf/is_metal/composition.yaml, conf/is_metal/crystal_llm.yaml, conf/is_metal/local_env.yaml, conf/is_metal/slices.yaml, conf/is_metal/zmatrix.yaml, conf/llama_8b_bg/atoms.yaml, conf/llama_8b_bg/atoms_params.yaml, conf/llama_8b_bg/cifp1.yaml, conf/llama_8b_bg/composition.yaml, conf/llama_8b_bg/crystal_llm.yaml, conf/llm_sft.yaml, conf/model/archived_base/archived/finetune_template.yaml, conf/model/archived_base/archived/finetune_template_gvrh.yaml, conf/model/archived_base/archived/finetune_template_gvrh_atoms.yaml, conf/model/archived_base/archived/finetune_template_gvrh_permuted copy.yaml, conf/model/archived_base/archived/finetune_template_gvrh_permuted.yaml, conf/model/archived_base/archived/finetune_template_gvrh_unfiltered.yaml, conf/model/archived_base/archived/finetune_template_gvrh_zmatrix.yaml, conf/model/archived_base/archived/finetune_template_kvrh_atoms.yaml, conf/model/archived_base/archived/finetune_template_kvrh_permuted copy.yaml, conf/model/archived_base/archived/finetune_template_kvrh_permuted.yaml, conf/model/archived_base/archived/finetune_template_kvrh_unfiltered.yaml, conf/model/archived_base/archived/finetune_template_kvrh_zmatrix.yaml, conf/model/archived_base/finetune_template_gvrh.yaml, conf/model/archived_base/finetune_template_gvrh_perturb.yaml, conf/model/archived_base/finetune_template_gvrh_smiles.yaml, conf/model/archived_base/finetune_template_gvrh_translate.yaml, conf/model/archived_base/finetune_template_kvrh_perturb.yaml, conf/model/archived_base/finetune_template_kvrh_translate.yaml, conf/model/archived_base/llama_gvrh_sft_nb_fold_2.yaml, conf/model/benchmark_example.yaml, conf/model/classification_example.yaml, conf/model/formation_energy.yaml, conf/model/llama_8b.yaml, conf/model/llama_example.yaml, conf/model/pretrain_example.yaml, conf/model/pretrain_other_models.yaml, conf/model/pretrain_own_data_example.yaml, conf/pretrain.yaml, docs/api.md, docs/index.md, docs/representations.md, docs/tokenizers.md, mkdocs.yml, notebooks/dataprep.ipynb, pyproject.toml, revision-scripts/5fold_split.py, revision-scripts/mp_classification.py, revision-scripts/prep_json.py, revision-scripts/prep_rep.py, revision-scripts/text_rep.py, scripts/filterdataset.py, scripts/linear_potential.py, scripts/llama-eval-responses/llama_evals_matbench_dielectric_cif_p1.json, scripts/llama-eval-responses/llama_evals_matbench_dielectric_composition.json, scripts/llama-eval-responses/llama_evals_matbench_dielectric_crystal_llm_rep.json, scripts/llama-eval-responses/llama_evals_matbench_dielectric_slice.json, scripts/llama-eval-responses/llama_evals_matbench_log_gvrh_cif_p1.json, scripts/llama-eval-responses/llama_evals_matbench_log_gvrh_composition.json, scripts/llama-eval-responses/llama_evals_matbench_log_gvrh_crystal_llm_rep.json, scripts/llama-eval-responses/llama_evals_matbench_log_kvrh_cif_p1.json, scripts/llama-eval-responses/llama_evals_matbench_log_kvrh_composition.json, scripts/llama-eval-responses/llama_evals_matbench_log_kvrh_composition_old.json, scripts/llama-eval-responses/llama_evals_matbench_log_kvrh_crystal_llm_rep.json, scripts/llama-eval-responses/llama_evals_matbench_log_kvrh_slice.json, scripts/llama-eval-responses/llama_evals_matbench_perovskites_cif_p1.json, scripts/llama-eval-responses/llama_evals_matbench_perovskites_composition.json, scripts/llama-eval-responses/llama_evals_matbench_perovskites_crystal_llm_rep.json, scripts/llama-eval-responses/llama_evals_matbench_perovskites_slice.json, scripts/llama_sft_evals.py, scripts/modify_mb_json.py, scripts/nomad_postprocess_parallel.py, scripts/qmof_prepare_data.py, src/mattext/analysis/attention.py, src/mattext/analysis/xtal2pot.py, src/mattext/dataprep/matbench_prepare_data.py, src/mattext/dataprep/nomad_prepare_data.py, src/mattext/main.py, src/mattext/models/benchmark.py, src/mattext/models/finetune.py, src/mattext/models/helper.py, src/mattext/models/inference.py, src/mattext/models/llama.py, src/mattext/models/llama_sft.py, src/mattext/models/potential.py, src/mattext/models/predict.py, src/mattext/models/pretrain.py, src/mattext/models/score.py, src/mattext/models/utils.py, tests/test_imports.py, tests/test_xtal2pot.py
💤 Files with no reviewable changes (22)
- conf/bg/crystal_llm.yaml
- conf/archived_experiments/testing_perturb_300/cifp1.yaml
- conf/archived_experiments/testing_perturb_100/cifp1.yaml
- conf/group-test/slices.yaml
- conf/is_metal/atoms.yaml
- conf/llama_8b_bg/composition.yaml
- conf/bg/atoms_params.yaml
- conf/bg/composition.yaml
- conf/is_metal/crystal_llm.yaml
- conf/form/composition.yaml
- conf/archived_experiments/santiago/cifp1.yaml
- conf/llama_8b_bg/atoms_params.yaml
- revision-scripts/mp_classification.py
- conf/bg/atoms.yaml
- conf/is_metal/composition.yaml
- conf/llama_8b_bg/cifp1.yaml
- conf/llama_8b_bg/atoms.yaml
- conf/form/crystal_llm.yaml
- conf/form/atoms_params.yaml
- conf/is_metal/atoms_params.yaml
- conf/form/atoms.yaml
- conf/group-test/composition.yaml
✅ Files skipped from review due to trivial changes (62)
- conf/archived_experiments/testing_translate_300/crystal_llm.yaml
- conf/bg/slices.yaml
- conf/archived_experiments/testing_perturb_100/cifsymmetrized.yaml
- conf/archived_experiments/testing_translate_300/cifsymmetrized.yaml
- conf/model/archived_base/llama_gvrh_sft_nb_fold_2.yaml
- conf/is_metal/zmatrix.yaml
- conf/model/archived_base/archived/finetune_template_kvrh_permuted.yaml
- conf/archived_experiments/config-sft.yaml
- conf/model/archived_base/archived/finetune_template_kvrh_permuted copy.yaml
- conf/form/cifp1.yaml
- conf/model/archived_base/archived/finetune_template_gvrh_unfiltered.yaml
- conf/model/archived_base/finetune_template_kvrh_perturb.yaml
- conf/model/llama_example.yaml
- mkdocs.yml
- notebooks/dataprep.ipynb
- conf/archived_experiments/testing_perturb_100/crystal_llm.yaml
- conf/model/formation_energy.yaml
- conf/bg/local_env.yaml
- conf/is_metal/cifp1.yaml
- conf/model/archived_base/finetune_template_gvrh.yaml
- conf/model/classification_example.yaml
- conf/form/cifpsym.yaml
- conf/archived_experiments/testing_translate_100/crystal_llm.yaml
- docs/representations.md
- conf/model/pretrain_own_data_example.yaml
- conf/model/archived_base/archived/finetune_template_gvrh_permuted.yaml
- conf/bg/cifp1.yaml
- conf/bg/zmatrix.yaml
- conf/model/archived_base/finetune_template_kvrh_translate.yaml
- conf/model/archived_base/archived/finetune_template_kvrh_atoms.yaml
- conf/form/slices.yaml
- conf/model/archived_base/archived/finetune_template_kvrh_unfiltered.yaml
- conf/model/archived_base/archived/finetune_template_gvrh_atoms.yaml
- conf/archived_experiments/config-smiles.yaml
- conf/pretrain.yaml
- LICENSE
- conf/model/llama_8b.yaml
- conf/form_energy.yaml
- conf/benchmark.yaml
- conf/model/archived_base/archived/finetune_template_gvrh.yaml
- conf/archived_experiments/config-hydra.yaml
- conf/bg/cifpsym.yaml
- conf/archived_experiments/testing_perturb_300/cifsymmetrized.yaml
- conf/model/archived_base/archived/finetune_template_gvrh_zmatrix.yaml
- conf/is_metal/slices.yaml
- conf/archived_experiments/config.yaml
- conf/model/archived_base/archived/finetune_template_gvrh_permuted copy.yaml
- conf/classification.yaml
- conf/archived_experiments/testing_perturb_300/crystal_llm.yaml
- conf/model/benchmark_example.yaml
- conf/model/pretrain_other_models.yaml
- conf/llama_8b_bg/crystal_llm.yaml
- conf/is_metal/local_env.yaml
- conf/bandgap.yaml
- conf/model/archived_base/archived/finetune_template_kvrh_zmatrix.yaml
- conf/form/zmatrix.yaml
- conf/form/local_env.yaml
- conf/model/archived_base/archived/finetune_template.yaml
- tests/test_imports.py
- conf/is_metal/cifpsym.yaml
- tests/test_xtal2pot.py
- docs/index.md
🚧 Files skipped from review as they are similar to previous changes (1)
- .pre-commit-config.yaml
```markdown
???+ tip "tip"

    You can access the `[CLS]` token using the `cls_token` attribute of the tokenizer.
```
Resolve markdownlint MD046 in this section.
The warning points to this changed block; please convert the affected code block style to fenced format so docs lint passes consistently.
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 35-35: Code block style
Expected: fenced; Actual: indented
(MD046, code-block-style)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/tokenizers.md` at line 35, Convert the inline/backticked code mention
into a fenced code block to satisfy markdownlint MD046: replace the current
inline sentence "You can access the `[CLS]` token using the `cls_token`
attribute of the tokenizer." with a fenced code block (triple backticks)
containing a short example showing the [CLS] token and the cls_token attribute
(referencing cls_token and [CLS] so reviewers can find the spot), ensuring the
fenced block uses correct language marker (e.g., plaintext or python) and
preserves the explanatory text around it.
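A fenced replacement along the lines the prompt describes could look like the sketch below. The `Tokenizer` class here is a minimal stand-in, not the project's actual tokenizer; a real Hugging Face-style tokenizer exposes the same `cls_token` attribute.

```python
class Tokenizer:
    """Minimal stand-in for a Hugging Face-style tokenizer (illustrative only)."""

    cls_token = "[CLS]"


tokenizer = Tokenizer()
print(tokenizer.cls_token)  # [CLS]
```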
```python
def process_entry_train_matbench(entry: dict, timeout: int) -> dict:
```
🧩 Analysis chain
🏁 Script executed:
git ls-files revision-scripts/ | head -20
Repository: lamalab-org/MatText
Length of output: 258
🏁 Script executed:
wc -l revision-scripts/prep_rep.py
Repository: lamalab-org/MatText
Length of output: 95
🏁 Script executed:
cat -n revision-scripts/prep_rep.py | head -100
Repository: lamalab-org/MatText
Length of output: 3939
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Function definitions ==="
rg -nP 'def process_entry_(train|test)_matbench\(.*timeout' revision-scripts/prep_rep.py
echo ""
echo "=== executor.map calls ==="
rg -nP 'executor\.map\(' revision-scripts/prep_rep.py
echo ""
echo "=== future.result with timeout ==="
rg -nP 'future\.result\s*\(\s*timeout=' revision-scripts/prep_rep.py
Repository: lamalab-org/MatText
Length of output: 373
🏁 Script executed:
cat -n revision-scripts/prep_rep.py | tail -60
Repository: lamalab-org/MatText
Length of output: 2350
🏁 Script executed:
rg -l 'TextRep' revision-scripts/
Repository: lamalab-org/MatText
Length of output: 120
Enforce timeout at pool result retrieval—currently, long-running entries can stall batch completion.
The timeout parameter is passed through to entry functions but never enforced during pool execution. The executor.map() call (line 90) lacks a timeout at result retrieval, and the except TimeoutError handlers (lines 47–48, 78–79) in process_entry_train_matbench and process_entry_test_matbench are ineffective since no mechanism triggers a timeout exception within those functions.
Switch to executor.submit() with future.result(timeout=timeout) to enforce timeouts per entry:
Proposed fix

```diff
 def process_batch(num_workers, batch, timeout, process_entry_func):
-    process_entry_with_timeout = partial(process_entry_func, timeout=timeout)
-
-    with ProcessPoolExecutor(max_workers=num_workers) as executor:
-        results = list(executor.map(process_entry_with_timeout, batch))
-
-    return [result for result in results if result is not None]
+    with ProcessPoolExecutor(max_workers=num_workers) as executor:
+        futures = [executor.submit(process_entry_func, entry, timeout) for entry in batch]
+        results = []
+        for future in futures:
+            try:
+                result = future.result(timeout=timeout)
+            except TimeoutError:
+                future.cancel()
+                continue
+            if result is not None:
+                results.append(result)
+        return results
```

Also applies to: 47–48, 55, 78–79, 87–92
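The behavioral difference is easy to reproduce outside the repo. The sketch below is thread-based (avoiding multiprocessing pickling concerns) and uses a made-up `work` function; both are assumptions for illustration. With `submit()` plus `future.result(timeout=...)`, the stuck entry is dropped while the fast ones complete.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def work(x):
    # entry 2 simulates a stuck worker
    if x == 2:
        time.sleep(1.0)
    return x * x


def process_batch(batch, timeout):
    with ThreadPoolExecutor(max_workers=4) as executor:
        # submit() yields one Future per entry, so the timeout is enforced
        # at result retrieval rather than merely forwarded to the worker
        futures = [executor.submit(work, entry) for entry in batch]
        results = []
        for future in futures:
            try:
                result = future.result(timeout=timeout)
            except TimeoutError:
                future.cancel()
                continue
            if result is not None:
                results.append(result)
        return results


print(process_batch([1, 2, 3], timeout=0.2))  # [1, 9]; the stuck entry is dropped
```

Note that `cancel()` cannot stop an already-running worker; the stuck task still occupies a pool slot until it finishes, the batch just no longer waits on it.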
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 24-24: Unused function argument: timeout
(ARG001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@revision-scripts/prep_rep.py` at line 24, The pool result retrieval never
enforces the passed timeout because executor.map is used; update
process_entry_train_matbench and process_entry_test_matbench to submit tasks
with executor.submit(...) instead of executor.map and then call
future.result(timeout=timeout) for each Future so the TimeoutError handlers
actually trigger; replace the iterator over map results with collecting Futures
(from submit), iterate them and call future.result(timeout=timeout) catching
concurrent.futures.TimeoutError to handle per-entry timeouts and cancel
timed-out futures as needed.
```python
def process_entry_test_matbench(entry: list, timeout: int) -> dict:
    # Ensure the give_slice function and necessary data are picklable
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if file exists and get line count
if [ -f "revision-scripts/prep_rep.py" ]; then
echo "=== File found ==="
wc -l revision-scripts/prep_rep.py
echo ""
# Show lines around 55 to see the function definition
echo "=== Lines 50-65 (function definition and usage context) ==="
sed -n '50,65p' revision-scripts/prep_rep.py | cat -n
echo ""
# Show lines around 76 to see other usage
echo "=== Lines 70-85 (checking line 76 usage) ==="
sed -n '70,85p' revision-scripts/prep_rep.py | cat -n
else
echo "File not found"
exit 1
fi
Repository: lamalab-org/MatText
Length of output: 1531
Change entry type annotation from list to dict.
The function signature at line 55 declares entry: list, but the implementation accesses it as a dictionary (lines 59 and 76 use entry["structure"] and entry["mbid"]).
```diff
-def process_entry_test_matbench(entry: list, timeout: int) -> dict:
+def process_entry_test_matbench(entry: dict, timeout: int) -> dict:
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def process_entry_test_matbench(entry: dict, timeout: int) -> dict:
    # Ensure the give_slice function and necessary data are picklable
```
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 55-55: Unused function argument: timeout
(ARG001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@revision-scripts/prep_rep.py` around lines 55 - 56, The function
process_entry_test_matbench currently types its parameter as entry: list but
treats it like a dict (uses entry["structure"] and entry["mbid"]); update the
type annotation to entry: dict (or a more specific TypedDict/Mapping if
available) in the process_entry_test_matbench signature and adjust any related
type hints or imports so the code and type annotations match the actual usage of
entry["structure"] and entry["mbid"].
```python
def process_entry_train_matbench(entry: dict, timeout: int) -> dict:
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# First, let's read the relevant sections of the file to understand the implementation
echo "=== Reading lines 20-100 of text_rep.py ==="
sed -n '20,100p' revision-scripts/text_rep.py | cat -n
echo ""
echo "=== Checking file size ==="
wc -l revision-scripts/text_rep.py
Repository: lamalab-org/MatText
Length of output: 3403
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Searching for executor.map usage ==="
rg -nP 'executor\.map\(' revision-scripts/text_rep.py -A 2 -B 2
echo ""
echo "=== Searching for future.result with timeout ==="
rg -nP 'future\.result\s*\(\s*timeout=' revision-scripts/text_rep.py -A 1 -B 1
echo ""
echo "=== Searching for TimeoutError except blocks ==="
rg -nP 'except\s+TimeoutError' revision-scripts/text_rep.py -A 2 -B 2
echo ""
echo "=== Searching for process_batch function ==="
rg -nP 'def process_batch' revision-scripts/text_rep.py -A 20
Repository: lamalab-org/MatText
Length of output: 373
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Full process_batch function ==="
rg -nP 'def process_batch' revision-scripts/text_rep.py -A 10
echo ""
echo "=== Checking imports at top of file ==="
head -20 revision-scripts/text_rep.py
echo ""
echo "=== Full context of process_entry_train_matbench function ==="
rg -nP 'def process_entry_train_matbench' revision-scripts/text_rep.py -A 12
Repository: lamalab-org/MatText
Length of output: 1555
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify the actual behavior of executor.map()
# Check Python documentation to confirm timeout enforcement
python3 << 'EOF'
from concurrent.futures import ProcessPoolExecutor
import inspect
# Check if map() has a timeout parameter
sig = inspect.signature(ProcessPoolExecutor.map)
print("ProcessPoolExecutor.map() signature:")
print(sig)
print("\nParameters:")
for param_name, param in sig.parameters.items():
print(f" - {param_name}: {param}")
EOF
Repository: lamalab-org/MatText
Length of output: 267
The `timeout` parameter is not enforced at the executor level.
At line 89, executor.map() is called without passing the timeout parameter. The timeout is only forwarded to worker functions via partial(), which does not enforce executor-level timeouts. A stuck worker can block the batch indefinitely. The TimeoutError handlers at lines 27-29 and 58-60 would only catch errors raised within worker code, not executor timeouts.
Apply the proposed fix to pass timeout to executor.map() or use future.result(timeout=timeout) as shown.
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 24-24: Unused function argument: timeout
(ARG001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@revision-scripts/text_rep.py` at line 24, The function
process_entry_train_matbench currently uses executor.map(...) without enforcing
the timeout, relying only on the partial-wrapped worker timeout; update the call
to enforce executor-level timeouts by either passing timeout=timeout to
executor.map(...) or by switching to executor.submit(...) and calling
future.result(timeout=timeout) for each future (and keep the partial-wrapped
worker as-is). Also ensure the existing TimeoutError handlers catch
concurrent.futures.TimeoutError from the executor, not just worker-raised errors
so stuck workers are properly handled.
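For context, `Executor.map()` does accept a `timeout=` argument, but it is only enforced while the returned iterator is consumed, and the deadline is measured from the `map()` call itself, so a single slow entry aborts iteration over the whole batch instead of being skipped. A small thread-based sketch with made-up sleep durations, purely for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def slow(seconds):
    time.sleep(seconds)
    return seconds


with ThreadPoolExecutor(max_workers=2) as executor:
    # the deadline applies to the whole iteration, not per entry
    it = executor.map(slow, [0.0, 0.6], timeout=0.2)
    try:
        results = list(it)
    except TimeoutError:
        results = "timed out"

print(results)  # timed out
```

This is why the review prefers the `submit()` pattern for per-entry timeouts: with `map()`, the partial results retrieved before the deadline are also lost once the exception propagates.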
```python
def process_entry_test_matbench(entry: list, timeout: int) -> dict:
    # Ensure the give_slice function and necessary data are picklable
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# First, let's see the function definition and surrounding lines
echo "=== Function definition and context ==="
cat -n revision-scripts/text_rep.py | sed -n '50,80p'
echo -e "\n=== Search for entry annotation ==="
rg -nP 'def process_entry_test_matbench\(entry:\s*list' revision-scripts/text_rep.py
echo -e "\n=== Search for mapping access patterns ==="
rg -nP 'entry\["structure"\]|entry\["mbid"\]' revision-scripts/text_rep.py
Repository: lamalab-org/MatText
Length of output: 1738
Fix entry annotation in test processor (list → dict).
At line 54, the function annotates entry as list, but the code uses mapping access with string keys at lines 58 and 75 (entry["structure"] and entry["mbid"]). List indexing requires integers; this should be annotated as dict.
Proposed fix

```diff
-def process_entry_test_matbench(entry: list, timeout: int) -> dict:
+def process_entry_test_matbench(entry: dict, timeout: int) -> dict:
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def process_entry_test_matbench(entry: dict, timeout: int) -> dict:
    # Ensure the give_slice function and necessary data are picklable
```
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 54-54: Unused function argument: timeout
(ARG001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@revision-scripts/text_rep.py` around lines 54 - 55, The type annotation for
the parameter in process_entry_test_matbench is incorrect: change the annotation
of entry from list to dict because the function accesses entry["structure"] and
entry["mbid"]; update the signature of process_entry_test_matbench(entry: dict,
timeout: int) -> dict to reflect this and ensure any related callers or imports
expecting a list are adjusted to pass a mapping with those keys.
```diff
-"""
-Return a  datapoint
-"""
+"""Return a  datapoint."""
```
Fix small docstring typo.
Line 38 has a double space ("a  datapoint"). Please normalize it to "a datapoint" for doc quality consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/nomad_postprocess_parallel.py` at line 38, Fix the docstring typo by
replacing the double space in the string "Return a  datapoint." with a single
space so it reads "Return a datapoint."; update the triple-quoted docstring
containing that text (the function/method that returns a datapoint) to the
corrected wording to normalize doc quality.
```diff
@@ -1,5 +1,4 @@
-"""
-matbench_prepare_data.py
+"""matbench_prepare_data.py.
```
Fix stale module name in the file header docstring.
Line 1 says matbench_prepare_data.py, but this file is scripts/qmof_prepare_data.py. Please align it to avoid confusion in docs/search.
Suggested fix

```diff
-"""matbench_prepare_data.py.
+"""qmof_prepare_data.py.
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/qmof_prepare_data.py` at line 1, The module-level docstring at the
top of the file incorrectly names the file "matbench_prepare_data.py"; update
that header docstring to the correct module name "qmof_prepare_data.py" so the
file header accurately reflects this script (check the top-of-file docstring in
scripts/qmof_prepare_data.py and replace the stale name).
```diff
 fig, axs = plt.subplots(len(unique_tokens), 1, figsize=(10, len(unique_tokens) * 5))

-for ax, token in zip(axs, unique_tokens):
+for ax, token in zip(axs, unique_tokens, strict=False):
     # Extract the weights for the current token
```
🧩 Analysis chain
🏁 Script executed:
cat -n src/mattext/analysis/attention.py | sed -n '250,275p'
Repository: lamalab-org/MatText
Length of output: 1236
🏁 Script executed:
head -30 src/mattext/analysis/attention.py
Repository: lamalab-org/MatText
Length of output: 750
🏁 Script executed:
web_search
matplotlib.pyplot.subplots return type for 1x1 grid
Repository: lamalab-org/MatText
Length of output: 178
🌐 Web query:
matplotlib pyplot subplots return type single subplot
💡 Result:
matplotlib.pyplot.subplots() returns (fig, ax) where ax is a single matplotlib.axes.Axes object (a scalar), not an array, when you create exactly one subplot (nrows=1, ncols=1) and squeeze=True (the default). [1][2]
If you want a consistent array return type even for 1×1, use:
```python
fig, axs = plt.subplots(1, 1, squeeze=False)  # axs is a 2D array with shape (1, 1)
ax = axs[0, 0]
```

With the default squeeze=True, ax becomes:
- a scalar `Axes` for 1×1
- a 1D array of `Axes` for N×1 or 1×M
- a 2D array of `Axes` for N×M (N>1 and M>1) [1]
Sources: [1] Matplotlib pyplot.subplots docs, [2] Matplotlib subplots demo gallery.
Fix iteration over matplotlib subplots when only one token exists.
When len(unique_tokens) == 1, plt.subplots(1, 1, ...) returns a single Axes object (not an array) due to the default squeeze=True behavior. Attempting to iterate over it at line 262 will raise TypeError. This must be handled to ensure the loop works for both single and multiple tokens.
Proposed fix

```diff
 fig, axs = plt.subplots(len(unique_tokens), 1, figsize=(10, len(unique_tokens) * 5))
+axs = np.atleast_1d(axs).ravel()
 for ax, token in zip(axs, unique_tokens, strict=False):
```

🧰 Tools
🪛 Ruff (0.15.2)
[warning] 260-260: Unpacked variable fig is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mattext/analysis/attention.py` around lines 260 - 263, The loop over
matplotlib axes fails when len(unique_tokens) == 1 because plt.subplots returns
a single Axes, not an iterable; modify the code after fig, axs =
plt.subplots(...) to normalize axs into an iterable: if axs is not a
list/ndarray wrap it in [axs], otherwise flatten it (e.g., axs =
np.atleast_1d(axs).ravel()) so the for ax, token in zip(axs, unique_tokens,
strict=False) loop over unique_tokens works for both single and multiple tokens;
update imports to include numpy if needed and reference fig, axs, unique_tokens
and the for loop in attention.py.
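The normalization works because `np.atleast_1d` wraps a bare object in a length-1 array while leaving arrays alone, and `.ravel()` flattens any 2D grid. A dependency-free sketch; the `FakeAxes` class and `normalize_axes` helper are illustrative stand-ins, not code from the repo:

```python
import numpy as np


class FakeAxes:
    """Stand-in for matplotlib.axes.Axes (illustrative only)."""


def normalize_axes(axs):
    # scalar Axes -> length-1 array; (N, M) grid -> flat array of N*M axes
    return np.atleast_1d(axs).ravel()


single = FakeAxes()                          # what subplots(1, 1) returns
grid = np.array([[FakeAxes(), FakeAxes()]])  # shape (1, 2), as with squeeze=False

print(len(normalize_axes(single)))  # 1
print(len(normalize_axes(grid)))    # 2
```

Either branch of the `squeeze` behavior then feeds the same `for ax, token in zip(...)` loop without a special case.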
```python
model.save_pretrained(self.cfg.path.finetuned_modelname)

# All ranks must call evaluate (distributed collectives)
eval_result = trainer.evaluate(eval_dataset=self.tokenized_testset)
```
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the file structure and the specific lines mentioned
git ls-files src/mattext/models/potential.py
Repository: lamalab-org/MatText
Length of output: 94
🏁 Script executed:
# Read the file to understand the context around line 179
cat -n src/mattext/models/potential.py | head -200
Repository: lamalab-org/MatText
Length of output: 8241
🏁 Script executed:
# Look at the _prepare_datasets method to understand what it returns
ast-grep --pattern 'def _prepare_datasets' -A 20
Repository: lamalab-org/MatText
Length of output: 17418
🏁 Script executed:
# Check what train_test_split returns and how tokenized_testset is used
rg -n "tokenized_testset" src/mattext/models/potential.py
Repository: lamalab-org/MatText
Length of output: 231
🏁 Script executed:
# Check the imports to understand which trainer is being used
head -20 src/mattext/models/potential.py
Repository: lamalab-org/MatText
Length of output: 522
🏁 Script executed:
# Verify what Trainer.evaluate() expects by checking transformers documentation or usage patterns
# Also check if there are other evaluate() calls in the codebase
rg -n "\.evaluate\(" src/mattext/models/ -A 2
Repository: lamalab-org/MatText
Length of output: 695
🏁 Script executed:
# Check if there's any documentation or hints about what evaluate() accepts
rg -n "def evaluate" --type py -A 5 | head -100
Repository: lamalab-org/MatText
Length of output: 1350
Use self.tokenized_testset["test"] instead of passing the entire DatasetDict to evaluate().
At line 179, self.tokenized_testset is a DatasetDict containing both "train" and "test" splits (from train_test_split() at line 71). This is inconsistent with line 165 (which correctly uses self.tokenized_dataset["test"]) and with the same pattern used in llama.py:260 and finetune.py:138. The evaluate() method expects a single Dataset, not a DatasetDict.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mattext/models/potential.py` at line 179, The call to trainer.evaluate is
passing the whole DatasetDict (self.tokenized_testset) instead of a single
Dataset split; change the argument to the test split
(self.tokenized_testset["test"]) so trainer.evaluate receives a Dataset; locate
the evaluate call in the method where eval_result = trainer.evaluate(...) and
replace the parameter accordingly to match the pattern used with
self.tokenized_dataset["test"] elsewhere.
```python
def _prepare_datasets(
    self, subset: str, local_file_path: str | None = None
) -> DatasetDict:
```
Fix _prepare_datasets return type annotation.
The function is annotated as returning DatasetDict (Line 47), but it returns a 2-tuple of datasets at Line 77-87. Please align the type hint with actual behavior to avoid typing/tooling drift.
Also applies to: 77-87
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mattext/models/pretrain.py` around lines 45 - 47, The _prepare_datasets
function is annotated to return DatasetDict but actually returns a 2-tuple
(train_dataset, val_dataset); update its return type annotation to reflect that
(e.g., tuple[Dataset, Dataset] or Tuple[Dataset, Dataset]) and ensure Dataset is
imported from the datasets package (or add typing.Tuple if using the older
form). Change the signature of _prepare_datasets and any related type
hints/usages (and update the docstring if present) so callers and tooling match
the actual return of train_dataset and val_dataset.
Actionable comments posted: 1
♻️ Duplicate comments (2)
README.md (2)
125-125: ⚠️ Potential issue | 🟡 Minor: Fix redundant phrase "for e.g."
The phrase "for e.g." is redundant since "e.g." already means "for example".
📝 Proposed fix
```diff
-To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`.
+To override the existing default configuration, use `++`, e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` at line 125, The README contains a redundant phrase "for e.g." in the sentence explaining how to override the default config (example: "++model.pretrain.training_arguments.per_device_train_batch_size=32"); remove "for " so it reads "e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`" (or replace "e.g." with "for example") to avoid redundancy while preserving the example and link to the docs.
150-159: ⚠️ Potential issue | 🟡 Minor: Fix citation block formatting.
The code block should specify a language (e.g., `bibtex`), and there's a missing comma after `primaryClass`, which would be invalid BibTeX syntax.

📝 Proposed fix
````diff
-```
+```bibtex
 @misc{alampara2024mattextlanguagemodelsneed,
       title={MatText: Do Language Models Need More than Text & Scale for Materials Modeling?},
       author={Nawaf Alampara and Santiago Miret and Kevin Maik Jablonka},
       year={2024},
       eprint={2406.17295},
       archivePrefix={arXiv},
-      primaryClass={cond-mat.mtrl-sci}
-      url={https://arxiv.org/abs/2406.17295},
+      primaryClass={cond-mat.mtrl-sci},
+      url={https://arxiv.org/abs/2406.17295}
 }
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 150 - 159, The BibTeX citation block is missing a language tag and has a syntax error: add a fenced code block language specifier (e.g., ```bibtex) and fix the missing comma after the primaryClass field so the entry reads primaryClass={cond-mat.mtrl-sci}, and also remove the trailing comma after the final url field if required by your style (ensure the entry ends with a closing brace). Locate the citation starting with the `@misc`{alampara2024mattextlanguagemodelsneed and update the surrounding fence and the primaryClass and url lines accordingly.
🧹 Nitpick comments (2)
docs/representations.md (1)
129-134: Specify language for fenced code block. The code block at line 132 is missing a language specifier, which affects syntax highlighting and accessibility.
📝 Proposed fix
````diff
 ??? success "output"

-    ```bash
+    ```text
     Translated Crystal-text-LLM Representations:
````

Alternatively, if this is meant to be Python output:

````diff
-    ```bash
+    ```python
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/representations.md` around lines 129 - 134, The fenced code block containing the sample output "Translated Crystal-text-LLM Representations:" is missing a language specifier; update that triple-backtick fence (the block starting with ``` and the text "Translated Crystal-text-LLM Representations:") to include an appropriate language tag such as text, bash, or python (e.g., change ``` to ```text or ```python) to enable correct syntax highlighting and accessibility.
pyproject.toml (1)
32-34: Use a tagged release or version constraint for xtal2txt instead of pinning to a commit hash. The commit c6cf9593596ce99f054f1ebbdc4346d7dbec99fb is a stable point (xtal2txt v0.2.0, GPG-verified from maintainer Nawaf Alampara), but pinning to a bare commit hash is unconventional for production dependencies. If a release tag exists for this commit, reference that instead. If not, consider whether pinning to the version number (`xtal2txt==0.2.0`) would suffice, as it's more maintainable and allows for patch releases.

The exact pins for `tensorflow==2.14.0` and `keras==2.14.0` are stricter than xtal2txt's own constraints (`keras<3`). These exact versions may cause dependency conflicts with users' environments unless they're required for specific MatText features; document this if so.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 32 - 34, Replace the git-commit pinned dependency "xtal2txt @ git+https://github.com/lamalab-org/xtal2txt.git@c6cf9593596ce99f054f1ebbdc4346d7dbec99fb" with a tagged release or explicit version (e.g., use the release tag for v0.2.0 or change to "xtal2txt==0.2.0") to avoid a bare-commit pin; also relax or justify the strict pins for "tensorflow==2.14.0" and "keras==2.14.0" (e.g., use version ranges like "keras<3" or document why exact pins are required for the MatText features) so dependency resolution is more maintainable and less likely to conflict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/representations.md`:
- Around line 24-35: The codebase uses outdated representation keys causing
KeyErrors: replace occurrences of the old names "slice" -> "slices",
"crystal_llm_rep" -> "crystal_text_llm", and "atoms_params" ->
"atom_sequences_plusplus" in the modules that map or look up representations
(e.g., the representation key dicts/lookup logic in attention.py and the
dataprep modules); update the representation-key mappings and any conditionals
in attention.py, matbench_prepare_data.py, and nomad_prepare_data.py to use the
new names from src/mattext/models/utils.py, or add a small alias map that
translates the old keys to the new ones to preserve backward compatibility.
Ensure all lookups reference the unified representation names and run tests to
confirm no KeyError remains.
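The alias-map option the prompt mentions can be as small as a dictionary lookup. The old/new key pairs below are taken from the review; the `resolve_representation` helper name is hypothetical, not code from the repo:

```python
# Old representation keys mapped to the unified names (per the review;
# the helper name is illustrative, not from src/mattext/models/utils.py).
REPRESENTATION_ALIASES = {
    "slice": "slices",
    "crystal_llm_rep": "crystal_text_llm",
    "atoms_params": "atom_sequences_plusplus",
}


def resolve_representation(name: str) -> str:
    # old keys are translated; already-current keys pass through unchanged
    return REPRESENTATION_ALIASES.get(name, name)


print(resolve_representation("crystal_llm_rep"))  # crystal_text_llm
print(resolve_representation("slices"))           # slices
```

Calling this helper at every lookup site keeps old configs working while the codebase standardizes on the new names.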
---
Duplicate comments:
In `@README.md`:
- Line 125: The README contains a redundant phrase "for e.g." in the sentence
explaining how to override the default config (example:
"++model.pretrain.training_arguments.per_device_train_batch_size=32"); remove
"for " so it reads "e.g.,
`++model.pretrain.training_arguments.per_device_train_batch_size=32`" (or
replace "e.g." with "for example") to avoid redundancy while preserving the
example and link to the docs.
- Around line 150-159: The BibTeX citation block is missing a language tag and
has a syntax error: add a fenced code block language specifier (e.g., ```bibtex)
and fix the missing comma after the primaryClass field so the entry reads
primaryClass={cond-mat.mtrl-sci}, and also remove the trailing comma after the
final url field if required by your style (ensure the entry ends with a closing
brace). Locate the citation starting with the
`@misc`{alampara2024mattextlanguagemodelsneed and update the surrounding fence and
the primaryClass and url lines accordingly.
---
Nitpick comments:
In `@docs/representations.md`:
- Around line 129-134: The fenced code block containing the sample output
"Translated Crystal-text-LLM Representations:" is missing a language specifier;
update that triple-backtick fence (the block starting with ``` and the text
"Translated Crystal-text-LLM Representations:") to include an appropriate
language tag such as text, bash, or python (e.g., change ``` to ```text or
```python) to enable correct syntax highlighting and accessibility.
In `@pyproject.toml`:
- Around line 32-34: Replace the git-commit pinned dependency "xtal2txt @
git+https://github.com/lamalab-org/xtal2txt.git@c6cf9593596ce99f054f1ebbdc4346d7dbec99fb"
with a tagged release or explicit version (e.g., use the release tag for v0.2.0
or change to "xtal2txt==0.2.0") to avoid a bare-commit pin; also relax or
justify the strict pins for "tensorflow==2.14.0" and "keras==2.14.0" (e.g., use
version ranges like "keras<3" or document why exact pins are required for the
MatText features) so dependency resolution is more maintainable and less likely
to conflict.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8aa001de-853c-4b5a-8780-5cbb7f7069e2
📒 Files selected for processing (3)
- README.md
- docs/representations.md
- pyproject.toml
````diff
 For example, to obtain the `cif_p1`, `slices`, `atom_sequences_plusplus`, `crystal_text_llm`, and `zmatrix` representations, use the following code:

 ```python
 text_rep = TextRep(structure)

 # Define a list of requested representations
 requested_reps = [
     "cif_p1",
-    "slice",
-    "atoms_params",
-    "crystal_llm_rep",
+    "slices",
+    "atom_sequences_plusplus",
+    "crystal_text_llm",
     "zmatrix"
````
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check for old representation names still in use across the codebase
echo "=== Checking for OLD representation names ==="
echo "--- 'slice' (should be 'slices') ---"
rg -n --type py '\bslice\b' -g '!*test*' | head -30
echo ""
echo "--- 'crystal_llm_rep' (should be 'crystal_text_llm') ---"
rg -n --type py 'crystal_llm_rep' | head -30
echo ""
echo "--- 'atoms_params' (should be 'atom_sequences_plusplus') ---"
rg -n --type py 'atoms_params' | head -30
echo ""
echo "=== Checking what names ARE being used in these files ==="
echo "--- In attention.py ---"
rg -n 'representation|_rep|_param|slice' src/mattext/analysis/attention.py 2>/dev/null | head -20
echo ""
echo "--- In matbench_prepare_data.py ---"
rg -n 'representation|_rep|_param' src/mattext/dataprep/matbench_prepare_data.py 2>/dev/null | head -20
echo ""
echo "--- In nomad_prepare_data.py ---"
rg -n 'representation|_rep|_param' src/mattext/dataprep/nomad_prepare_data.py 2>/dev/null | head -20
Repository: lamalab-org/MatText
Length of output: 5497
🏁 Script executed:
#!/bin/bash
# Check the documentation file to see what names it advertises
echo "=== Lines 24-35 from docs/representations.md ==="
sed -n '24,35p' docs/representations.md
echo ""
echo "=== Checking for new representation names in the documentation ==="
rg -n 'slices|atom_sequences_plusplus|crystal_text_llm' docs/representations.md
echo ""
echo "=== Checking TextRep class for representation definitions ==="
find . -name '*.py' -type f | xargs grep -l 'class TextRep' | head -1 | xargs head -100
echo ""
echo "=== Checking for representation method names ==="
rg -n 'def get_.*text_rep|def get_.*representation' src/mattext/representations.py | head -20
Repository: lamalab-org/MatText
Length of output: 3996
🏁 Script executed:
#!/bin/bash
# Find the representations module
echo "=== Finding representations module ==="
find . -name '*representation*' -type f | grep -E '\.(py|md)$' | head -20
echo ""
echo "=== Searching for where new representation names are defined/used ==="
rg -n 'slices|atom_sequences_plusplus|crystal_text_llm' --type py | head -40
echo ""
echo "=== Checking if TextRep has methods for these representations ==="
find . -name '*.py' -type f | xargs grep -l 'TextRep' | head -3 | xargs grep -A 5 'def get_'
Repository: lamalab-org/MatText
Length of output: 604
🏁 Script executed:
#!/bin/bash
# Check src/mattext/models/utils.py to see the full definitions
echo "=== src/mattext/models/utils.py content ==="
head -50 src/mattext/models/utils.py
echo ""
echo "=== Check if attention.py imports from models/utils or defines its own ==="
head -50 src/mattext/analysis/attention.py
echo ""
echo "=== Verify: Does TextRep class have get_slices, get_atom_sequences_plusplus, get_crystal_text_llm methods? ==="
find . -path '*/mattext/*' -name '*.py' -type f | xargs grep -l 'def get_slices\|def get_atom_sequences_plusplus\|def get_crystal_text_llm' | head -5
echo ""
echo "=== Search for where these representation methods are defined ==="
rg -n 'def get_slices|def get_atom_sequences_plusplus|def get_crystal_text_llm|def get_slice|def get_crystal_llm_rep|def get_atoms_params' --type py
Repository: lamalab-org/MatText
Length of output: 2664
Update representation names in src/mattext/analysis/attention.py and data preparation modules to match documentation.
The documentation advertises new representation names (slices, crystal_text_llm, atom_sequences_plusplus) which are defined in src/mattext/models/utils.py, but src/mattext/analysis/attention.py still uses old names (slice, crystal_llm_rep, atoms_params). Additionally, src/mattext/dataprep/matbench_prepare_data.py uses the old crystal_llm_rep name and src/mattext/dataprep/nomad_prepare_data.py uses the old atoms_params name.
Users following the updated documentation will encounter KeyError exceptions when attempting to use these modules with the new representation names, creating a breaking inconsistency between documented and actual supported representation identifiers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/representations.md` around lines 24 - 35, The codebase uses outdated
representation keys causing KeyErrors: replace occurrences of the old names
"slice" -> "slices", "crystal_llm_rep" -> "crystal_text_llm", and "atoms_params"
-> "atom_sequences_plusplus" in the modules that map or look up representations
(e.g., the representation key dicts/lookup logic in attention.py and the
dataprep modules); update the representation-key mappings and any conditionals
in attention.py, matbench_prepare_data.py, and nomad_prepare_data.py to use the
new names from src/mattext/models/utils.py, or add a small alias map that
translates the old keys to the new ones to preserve backward compatibility.
Ensure all lookups reference the unified representation names and run tests to
confirm no KeyError remains.
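The alias-map option mentioned in the prompt can be sketched as follows; the dict and helper names (`REPRESENTATION_ALIASES`, `resolve_representation`) are hypothetical, not part of the MatText codebase:

```python
# Hypothetical alias map: translates the old representation keys still used in
# attention.py and the dataprep modules to the unified names defined in
# src/mattext/models/utils.py.
REPRESENTATION_ALIASES = {
    "slice": "slices",
    "crystal_llm_rep": "crystal_text_llm",
    "atoms_params": "atom_sequences_plusplus",
}

def resolve_representation(name: str) -> str:
    """Return the new representation key; new or unknown names pass through."""
    return REPRESENTATION_ALIASES.get(name, name)
```

Lookup sites would then call `resolve_representation(key)` before indexing their representation dicts, so both old and new identifiers keep working.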
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/mattext/models/llama_sft.py (3)
81-95: ⚠️ Potential issue | 🔴 Critical
`compute_dtype` can be undefined on non-quantized runs (runtime crash). Line 94 reads `compute_dtype` even when the Line 91 path is taken (`bnb_config = None`), causing `UnboundLocalError`.
Proposed fix
```diff
-        if self.bnb_config.use_4bit and self.bnb_config.use_8bit:
+        compute_dtype = None
+        if self.bnb_config.use_4bit and self.bnb_config.use_8bit:
             raise ValueError(
                 "You can't load the model in 8 bits and 4 bits at the same time"
             )
@@
-        if compute_dtype == torch.float16:
+        if compute_dtype == torch.float16 and torch.cuda.is_available():
             major, _ = torch.cuda.get_device_capability()
             if major >= 8:
                 logger.info(
                     "Your GPU supports bfloat16: accelerate training with bf16=True!"
                 )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama_sft.py` around lines 81 - 95, The code reads compute_dtype unconditionally but only sets it inside the quantized branch, causing UnboundLocalError; to fix, initialize compute_dtype (e.g., compute_dtype = None) before the if/elif where bnb_config is created or guard the later check, and change the GPU compatibility check to only run when compute_dtype is not None (and/or bnb_config is not None) before comparing to torch.float16 and calling torch.cuda.get_device_capability(); ensure references to compute_dtype, bnb_config, BitsAndBytesConfig, and the torch.float16 check are updated accordingly.
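The core of this fix, initializing `compute_dtype` before any branch, can be illustrated with a torch-free sketch (function name and arguments are hypothetical):

```python
def select_compute_dtype(use_4bit: bool, use_8bit: bool, dtype_name: str = "float16"):
    """Return the quantization compute dtype name, or None on non-quantized runs."""
    if use_4bit and use_8bit:
        raise ValueError("You can't load the model in 8 bits and 4 bits at the same time")
    compute_dtype = None  # defined up front, so no branch can leave it unbound
    if use_4bit or use_8bit:
        compute_dtype = dtype_name  # set only on the quantized path
    # The bf16 capability hint should be guarded the same way:
    if compute_dtype == "float16":
        pass  # torch.cuda.get_device_capability() check belongs here
    return compute_dtype
```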
63-64: ⚠️ Potential issue | 🟡 Minor
`test_sample_size` handling has edge-case bugs. Using a truthy check means `0` behaves like "unset", and selecting `range(self.test_sample_size)` can exceed the dataset length.
Proposed fix
```diff
-        if self.test_sample_size:
-            dataset = dataset.select(range(self.test_sample_size))
+        if self.test_sample_size is not None:
+            n = min(self.test_sample_size, len(dataset))
+            dataset = dataset.select(range(n))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama_sft.py` around lines 63 - 64, The current truthy check misinterprets 0 as unset and may request more samples than exist; change the conditional to check for None (e.g., if self.test_sample_size is not None), coerce to an int and clamp to dataset length (n = max(0, int(self.test_sample_size)); n = min(n, len(dataset))), then call dataset.select(range(n)) so 0 yields an empty selection and values larger than the dataset are safely capped; update the code around test_sample_size and dataset.select accordingly.
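The intended semantics (None means the full dataset, 0 means empty, oversized requests get capped) can be captured in a small helper; the name `clamp_sample_size` is made up for illustration:

```python
def clamp_sample_size(test_sample_size, dataset_len: int) -> int:
    """Number of test samples to select from a dataset of length dataset_len."""
    if test_sample_size is None:  # unset: use the whole dataset
        return dataset_len
    n = max(0, int(test_sample_size))
    return min(n, dataset_len)  # never request more rows than exist
```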
111-115: ⚠️ Potential issue | 🟠 Major
Fix `device_map="auto"` for DDP-compatible model loading. In DDP mode (which this code supports via `local_rank`), `device_map="auto"` attempts model sharding across devices, conflicting with DDP's expectation of one full model replica per process. With BitsAndBytesConfig quantization, this causes training failures or undefined behavior. Use rank-scoped placement instead:
```diff
+        device_map = {"": f"cuda:{self.local_rank}"} if self.local_rank is not None else "auto"
         model = AutoModelForCausalLM.from_pretrained(
             self.ckpt,
             quantization_config=bnb_config,
-            device_map="auto",
+            device_map=device_map,
         )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama_sft.py` around lines 111 - 115, The model loading uses AutoModelForCausalLM.from_pretrained with device_map="auto", which conflicts with DDP: when a local rank is present you must place a full model replica on that GPU rather than auto-shard it. Change the call that references self.ckpt and bnb_config so that if a DDP/local rank is set (e.g. self.local_rank or similar), pass a rank-scoped device_map like {"": self.local_rank} (and ensure the process sets its CUDA device accordingly); otherwise keep device_map="auto" for single-process loading.
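The rank-scoped placement logic amounts to a one-line branch, sketched here as a standalone helper (the function name is hypothetical):

```python
def pick_device_map(local_rank):
    """DDP wants one full replica per process, pinned to that rank's GPU;
    only single-process runs should let the loader shard with "auto"."""
    if local_rank is not None:
        return {"": f"cuda:{local_rank}"}
    return "auto"
```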
♻️ Duplicate comments (1)
src/mattext/models/llama_sft.py (1)
215-228: ⚠️ Potential issue | 🟠 Major
Model checkpoint path is still not fold-scoped (artifact overwrite risk).
Line 216 uses a shared directory across folds, so later folds can overwrite earlier model artifacts. This is the same concern previously raised.
Proposed fix
```diff
-        self.output_dir_ = (
-            f"{self.cfg.path.finetuned_modelname}/llamav3-8b-lora-fine-tune"
-        )
+        self.output_dir_ = (
+            f"{self.cfg.path.finetuned_modelname}_{self.fold}/llamav3-8b-lora-fine-tune"
+        )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mattext/models/llama_sft.py` around lines 215 - 228, The output directory and saved artifacts are not fold-scoped and risk being overwritten; update the path construction for self.output_dir_ (and any uses like trainer.save_model(self.output_dir_)) to include the fold identifier (self.fold) and/or fold-specific folder, and similarly change the tokenizer save path used in self.tokenizer.save_pretrained to use the same fold-scoped directory (e.g., include f".../{self.fold}/..." or append _{self.fold}) so each fold writes to a unique directory.
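The fold-scoping fix is just string construction; shown here as a helper (name hypothetical) so the uniqueness property is easy to check:

```python
def fold_scoped_output_dir(finetuned_modelname: str, fold) -> str:
    """Append the fold id so each fold writes artifacts to its own directory."""
    return f"{finetuned_modelname}_{fold}/llamav3-8b-lora-fine-tune"
```

The same fold-scoped path would also be passed to `tokenizer.save_pretrained`, so model and tokenizer artifacts for one fold stay together.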
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mattext/models/llama_sft.py`:
- Around line 186-193: The code currently ignores the configurable context
length by hardcoding max_length based on self.representation; update the logic
where max_length is set (before building SFTConfig) to use self.context_length
when provided (e.g., if self.context_length is truthy/positive) and fall back to
the existing representation-based defaults (2048 for "cif_p1", 1024 otherwise);
then pass that computed max_length into SFTConfig so SFTConfig(...,
max_seq_length=max_length, ...) uses the configured value (ensure
self.context_length is validated/coerced to an int if necessary).
- Line 242: The debug log in llama_sft.py uses a printf-style placeholder
("Prediction: %s") which Loguru doesn't expand; update the logger.debug call in
the predict/logging area (the line that currently reads
logger.debug("Prediction: %s", pred)) to use Loguru's brace-style formatting
(e.g., "Prediction: {}") or an f-string so the pred value is interpolated
correctly.
---
Outside diff comments:
In `@src/mattext/models/llama_sft.py`:
- Around line 81-95: The code reads compute_dtype unconditionally but only sets
it inside the quantized branch, causing UnboundLocalError; to fix, initialize
compute_dtype (e.g., compute_dtype = None) before the if/elif where bnb_config
is created or guard the later check, and change the GPU compatibility check to
only run when compute_dtype is not None (and/or bnb_config is not None) before
comparing to torch.float16 and calling torch.cuda.get_device_capability();
ensure references to compute_dtype, bnb_config, BitsAndBytesConfig, and the
torch.float16 check are updated accordingly.
- Around line 63-64: The current truthy check misinterprets 0 as unset and may
request more samples than exist; change the conditional to check for None (e.g.,
if self.test_sample_size is not None), coerce to an int and clamp to dataset
length (n = max(0, int(self.test_sample_size)); n = min(n, len(dataset))), then
call dataset.select(range(n)) so 0 yields an empty selection and values larger
than the dataset are safely capped; update the code around test_sample_size and
dataset.select accordingly.
- Around line 111-115: The model loading uses
AutoModelForCausalLM.from_pretrained with device_map="auto", which conflicts
with DDP: when a local rank is present you must place a full model replica on
that GPU rather than auto-shard it. Change the call that references self.ckpt
and bnb_config so that if a DDP/local rank is set (e.g. self.local_rank or
similar), pass a rank-scoped device_map like {"": self.local_rank} (and ensure
the process sets its CUDA device accordingly); otherwise keep device_map="auto"
for single-process loading.
---
Duplicate comments:
In `@src/mattext/models/llama_sft.py`:
- Around line 215-228: The output directory and saved artifacts are not
fold-scoped and risk being overwritten; update the path construction for
self.output_dir_ (and any uses like trainer.save_model(self.output_dir_)) to
include the fold identifier (self.fold) and/or fold-specific folder, and
similarly change the tokenizer save path used in self.tokenizer.save_pretrained
to use the same fold-scoped directory (e.g., include f".../{self.fold}/..." or
append _{self.fold}) so each fold writes to a unique directory.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bb5dde5a-24e4-44e7-875c-9bdfb123c076
📒 Files selected for processing (1)
src/mattext/models/llama_sft.py
src/mattext/models/llama_sft.py
Outdated
```python
max_length = 2048 if self.representation == "cif_p1" else 1024

training_args = SFTConfig(
    **config_dict,
    packing=False,
    max_seq_length=max_length,
    completion_only_loss=True,
)
```
Configured context length is ignored.
Line 186 hardcodes sequence length by representation, but self.context_length (Line 49) is never used. This makes context length non-configurable in practice.
Proposed fix
```diff
-        max_length = 2048 if self.representation == "cif_p1" else 1024
+        max_length = self.context_length
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mattext/models/llama_sft.py` around lines 186 - 193, The code currently
ignores the configurable context length by hardcoding max_length based on
self.representation; update the logic where max_length is set (before building
SFTConfig) to use self.context_length when provided (e.g., if
self.context_length is truthy/positive) and fall back to the existing
representation-based defaults (2048 for "cif_p1", 1024 otherwise); then pass
that computed max_length into SFTConfig so SFTConfig(...,
max_seq_length=max_length, ...) uses the configured value (ensure
self.context_length is validated/coerced to an int if necessary).
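The fallback behaviour the prompt describes could look like this (a sketch; it assumes `context_length` may be None or 0 when unset):

```python
def resolve_max_length(context_length, representation: str) -> int:
    """Prefer the configured context length; otherwise fall back by representation."""
    if context_length:  # treat None/0 as unset
        return int(context_length)
    return 2048 if representation == "cif_p1" else 1024
```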
```python
with torch.cuda.amp.autocast():
    pred = pipe(self.formatting_tests_func(self.testdata))
    logger.debug("Prediction: %s", pred)
```
🧩 Analysis chain
🌐 Web query:
Loguru logger formatting style %s vs {} placeholders
💡 Result:
- Loguru does not support %-style placeholders (`%s`, `%d`, …) in log messages. You must use brace-style `{}` formatting. Example: replace `logger.debug("x=%s", x)` with `logger.debug("x={}", x)`. Loguru uses `message.format(*args, **kwargs)` internally. [1] Passing arguments separately (instead of building an f-string) keeps formatting lazy-ish: if the log level is filtered out, the message won't be formatted. [1]
- If your message contains literal `{` or `}`, escape them as `{{` and `}}`, otherwise Loguru will treat them as format fields and may raise formatting errors when args/kwargs are present. [2]
- For expensive computations, use lazy callables via `logger.opt(lazy=True)` so functions are only evaluated if the message is actually emitted. [3]
Sources:
[1] Loguru docs (migration): “Replacing % style formatting of messages” — Loguru only supports {} and calls .format()
[2] Loguru docs (troubleshooting): brace formatting pitfalls / escaping and errors with braces
[3] Loguru docs (API): logger.opt(lazy=True) behavior
🏁 Script executed:
# First, let's check the actual file and line in question
cat -n src/mattext/models/llama_sft.py | sed -n '235,250p'
Repository: lamalab-org/MatText
Length of output: 677
🏁 Script executed:
# Search for all loguru usage in the codebase to see formatting patterns
rg "logger\.(debug|info|warning|error)" --context=1 -t py | head -60
Repository: lamalab-org/MatText
Length of output: 3187
🏁 Script executed:
# Check loguru import in the file
head -20 src/mattext/models/llama_sft.py | grep -i "loguru\|logger"
Repository: lamalab-org/MatText
Length of output: 88
Fix Loguru formatting placeholder: change %s to {}
Loguru does not support %-style placeholders (%s, %d, etc.). It uses {} brace-style formatting exclusively via .format() internally. The current code will log the literal string "Prediction: %s" without substituting the pred value.
Fix
```diff
-        logger.debug("Prediction: %s", pred)
+        logger.debug("Prediction: {}", pred)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-logger.debug("Prediction: %s", pred)
+logger.debug("Prediction: {}", pred)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/mattext/models/llama_sft.py` at line 242, The debug log in llama_sft.py
uses a printf-style placeholder ("Prediction: %s") which Loguru doesn't expand;
update the logger.debug call in the predict/logging area (the line that
currently reads logger.debug("Prediction: %s", pred)) to use Loguru's
brace-style formatting (e.g., "Prediction: {}") or an f-string so the pred value
is interpolated correctly.
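Because Loguru formats messages with `str.format`, the difference between the two placeholder styles can be demonstrated without Loguru installed:

```python
pred = [0.42]

# Loguru internally does message.format(*args), so a %-style placeholder
# survives untouched while a brace placeholder is substituted.
broken = "Prediction: %s".format(pred)
fixed = "Prediction: {}".format(pred)

print(broken)  # Prediction: %s   (pred is silently dropped)
print(fixed)   # Prediction: [0.42]
```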
Summary by Sourcery
Add distributed data parallel (DDP)–aware training, inference, and logging across finetuning pipelines, update library versions and dependencies for a new release, and modernize installation, testing, and docs workflows.
New Features:
Bug Fixes:
Enhancements:
Build:
CI:
Documentation:
Summary by CodeRabbit
New Features
Improvements
Documentation
Chores