Merged
14 changes: 7 additions & 7 deletions .github/workflows/docs.yml
Original file line number Diff line number Diff line change
@@ -16,16 +16,16 @@ jobs:

- uses: actions/setup-python@v5
with:
python-version: "3.10"
cache: pip
cache-dependency-path: pyproject.toml
python-version: "3.10.19"

- name: Install uv
uses: astral-sh/setup-uv@v3
with:
version: "latest"

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install wheel setuptools
pip install -e .
pip install mkdocs mkdocs-material "mkdocstrings[python]" mkdocs-autorefs
uv pip install --system -e ".[docs]"

- name: Build
run: mkdocs build
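The new install step builds from a `docs` extra instead of listing packages inline. A hypothetical sketch of what such an extra in `pyproject.toml` could look like, mirroring the packages the old pip-based step installed explicitly (the exact entry in the repo may differ):

```toml
# Hypothetical [project.optional-dependencies] entry backing `.[docs]`,
# reconstructed from the packages the removed pip step installed.
[project.optional-dependencies]
docs = [
    "mkdocs",
    "mkdocs-material",
    "mkdocstrings[python]",
    "mkdocs-autorefs",
]
```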
21 changes: 21 additions & 0 deletions .github/workflows/pre-commit.yml
@@ -0,0 +1,21 @@
name: Pre-commit

on:
pull_request:
push:
branches: [main, development]

jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- name: Check out repository
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10.19'

- name: Run pre-commit
uses: pre-commit/action@v3.0.1
41 changes: 17 additions & 24 deletions .github/workflows/tests.yml
@@ -11,39 +11,32 @@ jobs:
tests:
runs-on: ubuntu-latest
strategy:
fail-fast: true
fail-fast: false
matrix:
python-version: ["3.9"]
python-version: ["3.10", "3.11"]
timeout-minutes: 30
defaults:
run:
shell: bash -l {0}
steps:
- name: Check out repository
uses: actions/checkout@v4

# - uses: pdm-project/setup-pdm@v3
# name: Set up PDM
# with:
# python-version: ${{ matrix.python-version }}
# cache: true

- name: Setup Mambaforge
uses: conda-incubator/setup-miniconda@v3
- name: Set up Python
uses: actions/setup-python@v5
with:
miniforge-variant: Mambaforge
miniforge-version: latest
use-mamba: true
python-version: ${{ matrix.python-version }}
conda-channels: anaconda, conda-forge
activate-environment: test

- name: Install uv
uses: astral-sh/setup-uv@v3
with:
version: "latest"

- name: Install system dependencies
run: |
sudo apt-get update
sudo apt-get install -y openbabel libfftw3-dev
Comment on lines +32 to +35
Remove openbabel from system dependencies—it's not needed for the current test suite.

LocalEnv representation (which requires OpenBabel) is not covered by the tests (test_imports.py and test_xtal2pot.py only test basic tokenizers and structures). Since the test suite doesn't exercise this feature, installing OpenBabel adds unnecessary overhead to CI. If needed later for integration testing, add it back with a comment explaining its purpose.
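If the reviewer's suggestion is adopted, the step would shrink to something like the following sketch (keeping `libfftw3-dev`, which the review says to leave intact):

```yaml
# Hypothetical revision of the "Install system dependencies" step with
# openbabel dropped, per the review comment above; libfftw3-dev stays.
- name: Install system dependencies
  run: |
    sudo apt-get update
    sudo apt-get install -y libfftw3-dev
```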



- name: Install dependencies
run: |
mamba install -c conda-forge openbabel fftw -y
pip install -e ".[dev]"
pip install pyxtal
pip install "numpy<2.0"

uv pip install --system -e ".[dev]"

- name: Test
run: pytest tests
26 changes: 26 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,26 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
exclude: ^mkdocs\.yml$
- id: check-added-large-files
args: ['--maxkb=1000']
- id: check-merge-conflict
- id: check-toml
- id: debug-statements

- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.3.4
hooks:
- id: ruff
args: [--fix]
- id: ruff-format

- repo: https://github.com/PyCQA/docformatter
rev: v1.7.6
hooks:
- id: docformatter
args: [--in-place, --config, ./pyproject.toml]
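The `exclude: ^mkdocs\.yml$` line in the config above is a regex matched against each file's repo-relative path; mkdocs-material configs commonly contain custom YAML tags (e.g. `!!python/name:`) that a plain YAML loader rejects, which is presumably why the file is skipped. A small standalone sketch of the matching behaviour (`is_checked` is a hypothetical helper, not part of pre-commit):

```python
import re

# The exclude pattern from the check-yaml hook, anchored to the full
# repo-relative path, so only the top-level mkdocs.yml is skipped.
EXCLUDE = re.compile(r"^mkdocs\.yml$")

def is_checked(path: str) -> bool:
    """Return True if the check-yaml hook would lint this file (sketch)."""
    return path.endswith((".yml", ".yaml")) and not EXCLUDE.match(path)

print(is_checked("mkdocs.yml"))               # excluded
print(is_checked(".pre-commit-config.yaml"))  # linted
print(is_checked("docs/mkdocs.yml"))          # anchored regex: still linted
```

Note the anchors: without `^` and `$`, the pattern would also skip any nested file whose path contains `mkdocs.yml`.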
2 changes: 1 addition & 1 deletion LICENSE
@@ -18,4 +18,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
56 changes: 38 additions & 18 deletions README.md
@@ -18,27 +18,31 @@
</p>


MatText is a framework for text-based materials modeling. It supports

- conversion of crystal structures into text representations
sourcery-ai[bot] marked this conversation as resolved.
- transformations of crystal structures for sensitivity analyses
- decoding of text representations to crystal structures
- tokenization of text-representation of crystal structures
- pre-training, finetuning and testing of language models on text-representations of crystal structures
- analysis of language models trained on text-representations of crystal structures



## Local Installation

We recommend that you create a virtual conda environment on your computer in which you install the dependencies for this package. To do so head over to [Miniconda](https://docs.conda.io/en/latest/miniconda.html) and follow the installation instructions there.
**Requirements:**
- Python 3.10 or 3.11 (tested and supported)
- [uv](https://docs.astral.sh/uv/) package manager (recommended)

We recommend using [uv](https://docs.astral.sh/uv/) for fast and reliable Python package management. To install uv, follow the [installation instructions](https://docs.astral.sh/uv/getting-started/installation/).

<!-- ### Install latest release

### Install latest release

```bash
pip install mattext
``` -->
uv pip install git+https://github.com/lamalab-org/mattext.git
```

### Install development version

@@ -49,16 +49,32 @@ git clone https://github.com/lamalab-org/mattext.git
cd mattext
```

Create a virtual environment and install:

```bash
uv venv --python 3.10
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"
```

Install pre-commit hooks (optional, for development):

```bash
pip install -e .
pre-commit install
```

If you want to use the Local Env representation, you will also need to install OpenBabel, e.g. using
If you want to use the Local Env representation, you will also need to install OpenBabel. You can install it via conda/mamba:

```bash
conda install openbabel -c conda-forge
```

or on Ubuntu/Debian:

```bash
sudo apt-get install openbabel
```

## Getting started

### Converting crystals into text
@@ -94,18 +94,18 @@ requested_text_reps = text_rep.get_requested_text_reps(requested_reps)
python main.py -cn=pretrain model=pretrain_example +model.representation=composition +model.dataset_type=pretrain30k +model.context_length=32
```

### Running a benchmark

```bash
python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint
```

The `+` symbol before a configuration key indicates that you are adding a new key-value pair to the configuration. This is useful when you want to specify parameters that are not part of the default configuration.

To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.
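The `+`/`++` convention described above can be mimicked with a toy dict-based sketch. This is an illustration of the rule the paragraph states, not Hydra's actual override grammar or implementation:

```python
# Toy illustration of Hydra-style "+" (add) vs "++" (force) overrides.
# apply_override is a hypothetical helper for demonstration only.
def apply_override(cfg: dict, key: str, value):
    if key.startswith("++"):       # force: add the key or overwrite it
        cfg[key[2:]] = value
    elif key.startswith("+"):      # add-only: error if the key already exists
        k = key[1:]
        if k in cfg:
            raise KeyError(f"'{k}' already set; use '++{k}' to override")
        cfg[k] = value
    else:                          # plain override: the key must exist
        if key not in cfg:
            raise KeyError(f"'{key}' not in config; use '+{key}' to add it")
        cfg[key] = value
    return cfg

cfg = {"model.representation": "cif_p1"}
apply_override(cfg, "+model.dataset", "perovskites")          # new key
apply_override(cfg, "++model.representation", "composition")  # force override
```

In real Hydra, a plain override of a missing key fails with a similar error, which is what makes `+` and `++` necessary for keys outside the default config.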
issue (typo): Adjust "for e.g." to the standard "e.g.".

"For e.g." is nonstandard; use "e.g.," instead: use ++, e.g., ++model.pretrain...

Suggested change
To override the existing default configuration, use `++`, for e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.
To override the existing default configuration, use `++`, e.g., `++model.pretrain.training_arguments.per_device_train_batch_size=32`. Refer to the [docs](https://lamalab-org.github.io/MatText/) for more examples and advanced ways to use the configs with config groups.

### Using data

The MatText datasets can be easily obtained from [HuggingFace](https://huggingface.co/datasets/n0w0f/MatText), for example

@@ -123,19 +143,19 @@ Contributions, whether filing an issue, making a pull request, or forking, are a

## 👋 Attribution

### Citation

If you use MatText in your work, please cite

```
@misc{alampara2024mattextlanguagemodelsneed,
title={MatText: Do Language Models Need More than Text & Scale for Materials Modeling?},
author={Nawaf Alampara and Santiago Miret and Kevin Maik Jablonka},
year={2024},
eprint={2406.17295},
archivePrefix={arXiv},
primaryClass={cond-mat.mtrl-sci}
url={https://arxiv.org/abs/2406.17295},
}
```

Comment on lines 150 to 159 (CodeRabbit) — ⚠️ Potential issue | 🟡 Minor: the citation code block should declare a language (`bibtex`), and `primaryClass={cond-mat.mtrl-sci}` is missing a trailing comma before the `url` field, which is invalid BibTeX.

@@ -146,4 +166,4 @@ The code in this package is licensed under the MIT License.

### 💰 Funding

This project has been supported by the [Carl Zeiss Foundation](https://www.carl-zeiss-stiftung.de/en/) as well as Intel and Merck.
2 changes: 1 addition & 1 deletion conf/archived_experiments/config-hydra.yaml
@@ -6,7 +6,7 @@ hydra:
sweep:
dir: multirun/${now:%Y-%m-%d}/${now:%H-%M-%S}
subdir: ${hydra.job.override_dirname}

launcher:
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
submitit_folder: ${hydra.sweep.dir}/.submitit/%j
3 changes: 1 addition & 2 deletions conf/archived_experiments/config-potential.yaml
@@ -6,7 +6,7 @@
sweep:
dir: multirun/${now:%Y-%m-%d}/${now:%H-%M-%S}
subdir: ${hydra.job.override_dirname}

launcher:
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
submitit_folder: ${hydra.sweep.dir}/.submitit/%j
@@ -51,4 +51,3 @@

- name: potential_run
tasks: [potential]

3 changes: 1 addition & 2 deletions conf/archived_experiments/config-sft.yaml
@@ -6,7 +6,7 @@
sweep:
dir: multirun/${now:%Y-%m-%d}/${now:%H-%M-%S}
subdir: ${hydra.job.override_dirname}

launcher:
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
submitit_folder: ${hydra.sweep.dir}/.submitit/%j
@@ -51,4 +51,3 @@

# - name: potential_run
# tasks: [potential]

3 changes: 1 addition & 2 deletions conf/archived_experiments/config-smiles.yaml
@@ -6,7 +6,7 @@
sweep:
dir: multirun/${now:%Y-%m-%d}/${now:%H-%M-%S}
subdir: ${hydra.job.override_dirname}

launcher:
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
submitit_folder: ${hydra.sweep.dir}/.submitit/%j
@@ -51,4 +51,3 @@

# - name: potential_run
# tasks: [potential]

3 changes: 1 addition & 2 deletions conf/archived_experiments/config.yaml
@@ -6,7 +6,7 @@
sweep:
dir: multirun/${now:%Y-%m-%d}/${now:%H-%M-%S}
subdir: ${hydra.job.override_dirname}

launcher:
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
submitit_folder: ${hydra.sweep.dir}/.submitit/%j
@@ -51,4 +51,3 @@

# - name: potential_run
# tasks: [potential]

1 change: 0 additions & 1 deletion conf/archived_experiments/santiago/cifp1.yaml
@@ -10,4 +10,3 @@ model:
per_device_train_batch_size: 64
path:
pretrained_checkpoint: /home/so87pot/n0w0f/santiago_ckpt/cif_p1_pt_30k/checkpoint-23000

2 changes: 1 addition & 1 deletion conf/archived_experiments/santiago_100k/cifp1.yaml
@@ -9,4 +9,4 @@ model:
training_arguments:
per_device_train_batch_size: 64
path:
pretrained_checkpoint: /home/so87pot/n0w0f/santiago_ckpt/cif_p1_pt_100k/checkpoint-26000
1 change: 0 additions & 1 deletion conf/archived_experiments/testing_perturb_100/cifp1.yaml
@@ -10,4 +10,3 @@ model:
per_device_train_batch_size: 64
path:
pretrained_checkpoint: ft_100k_mb_small

@@ -4,11 +4,11 @@ model:
logging:
wandb_project: perturb_1


finetune:
model_name: ft_100k_mb_small
context_length: 1024
training_arguments:
per_device_train_batch_size: 64
path:
pretrained_checkpoint: ft_100k_mb_small
@@ -3,12 +3,11 @@ model:
representation: crystal_llm_rep
logging:
wandb_project: perturb_1

finetune:
model_name: ft_100k_mb_small
context_length: 512
training_arguments:
per_device_train_batch_size: 64
path:
pretrained_checkpoint: ft_100k_mb_small

1 change: 0 additions & 1 deletion conf/archived_experiments/testing_perturb_300/cifp1.yaml
@@ -10,4 +10,3 @@ model:
per_device_train_batch_size: 64
path:
pretrained_checkpoint: ft_300k_mb_small

@@ -4,11 +4,11 @@ model:
logging:
wandb_project: perturb_1


finetune:
model_name: ft_300k_mb_small
context_length: 1024
training_arguments:
per_device_train_batch_size: 64
path:
pretrained_checkpoint: ft_300k_mb_small