diff --git a/README.md b/README.md
index 8cc5b438..0095564e 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,28 @@
-# Electrolyte Foundation Model
-Benchmarking RoBERTa model pre-training on molecular datasets.
+# MIST: Molecular Insight SMILES Transformer
+
+
+
+![GitHub License](https://img.shields.io/github/license/BattModels/mist)
+![arXiv:2409.15370](https://img.shields.io/badge/cs.LG-2409.15370-b31b1b?style=flat&logo=arxiv&logoColor=red)
+[![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm.svg)](https://huggingface.co/mist-models)
+
+
+
+MIST is a family of molecular foundation models for molecular property prediction.
+The models were pre-trained on [smirk tokenized](https://github.com/BattModels/smirk) SMILES strings from the [Enamine REAL Space](https://enamine.net/compound-collections/real-compounds/real-space-navigator) dataset using the Masked Language Modeling (MLM) objective, then fine-tuned for downstream prediction tasks.
 
 # Installation
 
 The following provides installation instructions for the top-level package (`electrolyte_fm`), optional add-ons for our
-various additional analysis and downstream applications (See `opt/`) may require additional configuration.
+various additional analysis and downstream applications (See [`./opt`](./opt/)) may require additional configuration.
+
+1. Install [uv](https://docs.astral.sh/uv/getting-started/installation/) and [julia](https://julialang.org/downloads/) (only needed for `/opt` tasks)
+2. Instantiate the environment: `uv sync`
+3. Use [`submit/submit.py`](./submit/submit.py) to submit a training job or check out one of our applications in [`./opt`](./opt)
+
+> You may need to install [rust](https://www.rust-lang.org/tools/install) if pre-built wheels for [smirk](https://github.com/BattModels/smirk) are not available on [PyPI](https://pypi.org/project/smirk/).
+> Feel free to [open an issue](https://github.com/BattModels/smirk/issues) to request additional pre-built wheels.
 
 ## Polaris
 
@@ -34,10 +52,13 @@ Same as above except:
 1. Build the image `bash container/build.sh`; once built, relocate the image `mv /tmp/mist.sif ./mist.sif`
 2.
Run training within the image `apptainer run --nv mist.sif python train.py ...`
-> See `submit/dgx.j2` or `submit/delta.j2` for a more complete example of using the container
+> See [`submit/dgx.j2`](./submit/dgx.j2) or [`submit/delta.j2`](./submit/delta.j2) for a more complete example of using the container
 
 # Submitting Jobs
 
+We use a Python script ([`submit/submit.py`](./submit/submit.py)) to template training jobs for submission on HPC systems across multiple sites.
+Templates may need to be modified for your particular HPC cluster, but should provide a starting point.
+
 ```shell
 source ./activate # Activate Environment
 ./submit/submit.py ./submit/polaris.j2 --data ./submit/pretrain.yaml | qsub
 
@@ -45,6 +66,8 @@ source ./activate # Activate Environment
 
 See `submit/submit.py --help` for more info
 
+> Note: [./activate](./activate) is used to activate the Python virtual environment *and* set various environment variables.
+
 # Development
 
 ## Pre-commit
diff --git a/opt/BayesianScaling/Project.toml b/opt/BayesianScaling/Project.toml
index 67cfde76..7841c29b 100644
--- a/opt/BayesianScaling/Project.toml
+++ b/opt/BayesianScaling/Project.toml
@@ -21,6 +21,7 @@ OnlineStats = "a15396b6-48d5-5d58-9928-6d29437db91e"
 Optimization = "7f7a1694-90dd-40f0-9382-eb1efda571ba"
 OptimizationOptimJL = "36348300-93cb-4f02-beb5-3c3902f8871e"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
+ReTestItems = "817f1d60-ba6b-4fd5-9520-3cf149f6a823"
 Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
 StatsModels = "3eaba693-59b7-5ba5-a881-562e759f1c8d"
diff --git a/opt/BayesianScaling/README.md b/opt/BayesianScaling/README.md
new file mode 100644
index 00000000..c8d30acc
--- /dev/null
+++ b/opt/BayesianScaling/README.md
@@ -0,0 +1,20 @@
+# BayesianScaling
+
+A Julia package for fitting regression models using MCMC, used to fit the penalized neural scaling laws.
+To install:
+
+- Install Julia: https://julialang.org/downloads/
+- Instantiate the package: `julia --project -e 'using Pkg; Pkg.instantiate()'`
+- Download the wandb records or chains ([doi:10.5281/zenodo.17527149](https://doi.org/10.5281/zenodo.17527149))
+
+## Code Organization
+
+- `./scripts/`: scripts for fitting and analyzing the neural scaling laws
+- `./plots/`: plotting code for the paper and various conferences
+- `./src`: the MCMC regression and analysis package powering this work
+  - [ppl.jl](./src/ppl.jl): Defines a regression-first interface for fitting MCMC models,
+    plus single-pass algorithms for working with posterior samples
+  - [scaling.jl](./src/scaling.jl): functional forms for neural scaling laws and derived quantities
+  - [analysis.jl](./src/analysis.jl): Code for predicting the performance of models using fitted neural scaling laws
+- `./test/`: unit tests for the BayesianScaling.jl package
+- `./benchmark/`: benchmark suite for evaluating different AD backends using [PkgJogger.jl](https://github.com/awadell1/PkgJogger.jl)
diff --git a/opt/BayesianScaling/src/ppl.jl b/opt/BayesianScaling/src/ppl.jl
index 8bfe1e88..34608d79 100644
--- a/opt/BayesianScaling/src/ppl.jl
+++ b/opt/BayesianScaling/src/ppl.jl
@@ -216,7 +216,7 @@ function transform_samples(t::TransformVariables.AbstractTransform, x::Matrix{T}
 end
 
 function transform!(y::AbstractVector, tt::TransformVariables.TransformTuple, x::AbstractVector)
-    (; transformations) = tt
+    transformations = getfield(tt, :inner)
     @assert TransformVariables.dimension(tt) == length(y) == length(x)
     index = firstindex(y)
     for t in transformations
@@ -242,7 +242,7 @@ transform!(y::AbstractVector, t::TransformVariables.AbstractTransform, x::Abstra
 function transfrom_axis(tt::TransformVariables.TransformTuple{<:NamedTuple})
     ax_tt = []
     index = 1
-    for (k, t) in pairs(tt.transformations)
+    for (k, t) in pairs(getfield(tt, :inner))
         ax = transfrom_axis(t)
         n = TransformVariables.dimension(t)
         if ax isa
Union{ComponentArrays.ShapedAxis,ComponentArrays.Axis}
diff --git a/opt/FeatureMiner/README.md b/opt/FeatureMiner/README.md
new file mode 100644
index 00000000..7991a850
--- /dev/null
+++ b/opt/FeatureMiner/README.md
@@ -0,0 +1,13 @@
+# Feature Miner
+
+Code for evaluating fitted linear probes for their ability to predict various chemically meaningful features.
+
+# Replication
+
+1. Install [Julia](https://julialang.org/downloads/)
+2. Instantiate the project: `julia --project -e 'using Pkg; Pkg.instantiate()'`
+3. Train linear probes using [linear_probe.jsonnet](../../submit/linear_probe.jsonnet) and [submit/submit.py](../../submit/submit.py) on
+fine-tuned MIST models.
+4. Run `julia --project explore_probes.jl` to extract fitted probe weights from the checkpoints
+5. Instantiate the plotting code: `julia --project=plots -e 'using Pkg; Pkg.instantiate()'`
+6. Evaluate fitted probes: `julia --project=plots ./plots/lipinski_probes.jl`
diff --git a/opt/MISTStyle/README.md b/opt/MISTStyle/README.md
new file mode 100644
index 00000000..06c7ae16
--- /dev/null
+++ b/opt/MISTStyle/README.md
@@ -0,0 +1,3 @@
+# MISTStyle.jl
+
+A collection of plotting utilities and themes for [Makie.jl](https://docs.makie.org/stable/) used throughout the codebase to generate high-quality, publication-ready plots with a consistent visual theme.
diff --git a/opt/TokenizerStats/README.md b/opt/TokenizerStats/README.md
index 2c05d09d..783fe3ff 100644
--- a/opt/TokenizerStats/README.md
+++ b/opt/TokenizerStats/README.md
@@ -1,4 +1,13 @@
-# Analysis Code for "Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models"
+# Analysis Code for "Tokenization for Molecular Foundation Models"
+
+
+
+![GitHub License](https://img.shields.io/github/license/BattModels/smirk)
+![paper](https://img.shields.io/badge/paper-10.1021%2Facs.jcim.5c01856-blue)
+![data](https://img.shields.io/badge/data-10.5281%2Fzenodo.13761262-blue)
+![arXiv:2409.15370](https://img.shields.io/badge/cs.LG-2409.15370-b31b1b?style=flat&logo=arxiv&logoColor=red)
+
 ## Installation
diff --git a/opt/design/README.md b/opt/design/README.md
new file mode 100644
index 00000000..3ad4c6ff
--- /dev/null
+++ b/opt/design/README.md
@@ -0,0 +1,12 @@
+# Evaluating Chemical Trends with MIST
+
+Source code for querying the MIST models on hydrocarbons and other templatable organic molecules.
+
+## Installation
+
+> All commands run from this directory
+
+1. Install [Julia](https://julialang.org/downloads/) and [uv](https://docs.astral.sh/uv/getting-started/installation/)
+2. Instantiate the project: `uv run julia --project -e 'using Pkg; Pkg.instantiate()'`
+3. Download the MIST models to `models/`
+4. Recreate the plots: `uv run julia --project plots.jl`
diff --git a/opt/interp_embeddings/README.md b/opt/interp_embeddings/README.md
index abd99f88..2e5e6d4e 100644
--- a/opt/interp_embeddings/README.md
+++ b/opt/interp_embeddings/README.md
@@ -5,6 +5,6 @@ Scripts for exploring MIST's embeddings and generating relevant figures from the
 ## Reproducing Analysis
 
 1. Install [julia](https://julialang.org/downloads/) and the base project (See [Project README](../../README.md))
-2. Instantiate the environment `julia --project -e 'using Pkg; Pkg.instantiate()'`
+2. Instantiate the environment `uv run julia --project -e 'using Pkg; Pkg.instantiate()'`
 3. Obtain model files and place at the appropriate path (see `plots.jl`)
-4. Run the script: `julia --project plots.jl`
+4. Run the script: `uv run julia --project plots.jl`
diff --git a/opt/mixtures/README.md b/opt/mixtures/README.md
new file mode 100644
index 00000000..f3e79ea0
--- /dev/null
+++ b/opt/mixtures/README.md
@@ -0,0 +1,17 @@
+# Mixtures
+
+Code for evaluating the MIST mixture models, exploring mixture space, and optimizing mixture composition.
+
+## Installation
+
+> All commands run from this directory
+
+1. Install [Julia](https://julialang.org/downloads/) and [uv](https://docs.astral.sh/uv/getting-started/installation/)
+2.
Instantiate the project: `uv run julia --project -e 'using Pkg; Pkg.instantiate()'`
+3. Obtain the mixtures dataset from [doi:10.5281/zenodo.17527149](https://doi.org/10.5281/zenodo.17527149)
+
+## Reproducing Plots
+
+Once installed, most of the scripts in the current directory can be run with:
+- python: `uv run