diff --git a/README.md b/README.md
index 8cc5b438..0095564e 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,28 @@
-# Electrolyte Foundation Model
-Benchmarking RoBERTa model pre-training on molecular datasets.
+# MIST: Molecular Insight SMILES Transformer
+
+
+MIST is a family of molecular foundation models for molecular property prediction.
+The models were pre-trained on [smirk tokenized](https://github.com/BattModels/smirk) SMILES strings from the [Enamine REAL Space](https://enamine.net/compound-collections/real-compounds/real-space-navigator) dataset using the Masked Language Modeling (MLM) objective, then fine-tuned for downstream prediction tasks.
# Installation
The following provides installation instructions for the top-level package (`electrolyte_fm`), optional add-ons for our
-various additional analysis and downstream applications (See `opt/`) may require additional configuration.
+various analysis and downstream applications (see [`./opt`](./opt/)); these add-ons may require additional configuration.
+
+1. Install [uv](https://docs.astral.sh/uv/getting-started/installation/) and [julia](https://julialang.org/downloads/) (Julia is only needed for the [`./opt`](./opt) tasks)
+2. Instantiate the environment: `uv sync`
+3. Use [`submit/submit.py`](./submit/submit.py) to submit a training job or checkout one of our applications in [`./opt`](./opt)
+
+> You may need to install [rust](https://www.rust-lang.org/tools/install) if pre-built wheels for [smirk](https://github.com/BattModels/smirk) are not available on [PyPI](https://pypi.org/project/smirk/).
+> Feel free to [open an issue](https://github.com/BattModels/smirk/issues) to request additional pre-built wheels.
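+
+A minimal end-to-end sketch of the steps above (the final `--help` smoke test is illustrative, not a required step):
+
+```shell
+# Install uv, then create the environment from the lockfile
+curl -LsSf https://astral.sh/uv/install.sh | sh
+uv sync
+# Verify the environment resolves before submitting real jobs
+uv run python train.py --help
+```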
## Polaris
@@ -34,10 +52,13 @@ Same as above except:
1. Build the image with `bash container/build.sh`; once built, relocate it: `mv /tmp/mist.sif ./mist.sif`
2. Run training within the image `apptainer run --nv mist.sif python train.py ...`
-> See `submit/dgx.j2` or `submit/delta.j2` for a more complete example of using the container
+> See [`submit/dgx.j2`](./submit/dgx.j2) or [`submit/delta.j2`](./submit/delta.j2) for a more complete example of using the container
# Submitting Jobs
+We use a Python script ([`submit/submit.py`](./submit/submit.py)) to template training jobs for submission on HPC systems across multiple sites.
+Templates may need to be modified for your particular HPC cluster, but should provide a starting point.
+
```shell
source ./activate # Activate Environment
./submit/submit.py ./submit/polaris.j2 --data ./submit/pretrain.yaml | qsub
@@ -45,6 +66,8 @@ source ./activate # Activate Environment
See `submit/submit.py --help` for more info
+> Note: [`./activate`](./activate) is used to activate the Python virtual environment *and* set various environment variables.
+
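+The templating step can be sketched with Python's stdlib `string.Template` (illustrative only; the
+real `submit.py` renders Jinja2 `.j2` templates, and the field names below are hypothetical):
+
+```python
+from string import Template
+
+# Toy stand-in for submit.py: fill a job-script template with run
+# parameters; the rendered text is then piped to the scheduler (e.g. qsub).
+job = Template("#PBS -l select=$nodes\npython train.py --data $data\n")
+print(job.substitute(nodes=2, data="./submit/pretrain.yaml"))
+```
+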
# Development
## Pre-commit
diff --git a/opt/BayesianScaling/Project.toml b/opt/BayesianScaling/Project.toml
index 67cfde76..7841c29b 100644
--- a/opt/BayesianScaling/Project.toml
+++ b/opt/BayesianScaling/Project.toml
@@ -21,6 +21,7 @@ OnlineStats = "a15396b6-48d5-5d58-9928-6d29437db91e"
Optimization = "7f7a1694-90dd-40f0-9382-eb1efda571ba"
OptimizationOptimJL = "36348300-93cb-4f02-beb5-3c3902f8871e"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
+ReTestItems = "817f1d60-ba6b-4fd5-9520-3cf149f6a823"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
StatsModels = "3eaba693-59b7-5ba5-a881-562e759f1c8d"
diff --git a/opt/BayesianScaling/README.md b/opt/BayesianScaling/README.md
new file mode 100644
index 00000000..c8d30acc
--- /dev/null
+++ b/opt/BayesianScaling/README.md
@@ -0,0 +1,20 @@
+# BayesianScaling
+
+A Julia package for fitting regression models with MCMC; it was used to fit the penalized neural scaling laws.
+To install:
+
+- Install Julia: https://julialang.org/downloads/
+- Instantiate the package: `julia --project -e 'using Pkg; Pkg.instantiate()'`
+- Download the wandb records or chains ([doi:10.5281/zenodo.17527149](https://doi.org/10.5281/zenodo.17527149))
+
+## Code Organization
+
+- `./scripts/`: scripts for fitting and analyzing the neural scaling laws
+- `./plots/`: plotting code for the paper and various conferences
+- `./src/`: the MCMC regression and analysis package powering this work
+  - [ppl.jl](./src/ppl.jl): a regression-first interface for fitting MCMC models,
+    plus single-pass algorithms for working with posterior samples
+  - [scaling.jl](./src/scaling.jl): functional forms for neural scaling laws and derived quantities
+  - [analysis.jl](./src/analysis.jl): code for predicting the performance of models using fitted neural scaling laws
+- `./test/` has the unit tests for the BayesianScaling.jl package
+- `./benchmark/`: benchmark suite for evaluating different AD backends using [PkgJogger.jl](https://github.com/awadell1/PkgJogger.jl)
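+
+The "single-pass algorithms" in `ppl.jl` follow the same pattern as this Welford-style online
+accumulator (an illustrative sketch, not the package's actual API):
+
+```julia
+# Single-pass (Welford) mean/variance over a stream of posterior samples
+mutable struct OnlineMoments
+    n::Int
+    mean::Float64
+    m2::Float64
+end
+OnlineMoments() = OnlineMoments(0, 0.0, 0.0)
+function observe!(o::OnlineMoments, x::Real)
+    o.n += 1
+    d = x - o.mean
+    o.mean += d / o.n
+    o.m2 += d * (x - o.mean)
+    return o
+end
+posterior_var(o::OnlineMoments) = o.m2 / (o.n - 1)
+```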
diff --git a/opt/BayesianScaling/src/ppl.jl b/opt/BayesianScaling/src/ppl.jl
index 8bfe1e88..34608d79 100644
--- a/opt/BayesianScaling/src/ppl.jl
+++ b/opt/BayesianScaling/src/ppl.jl
@@ -216,7 +216,7 @@ function transform_samples(t::TransformVariables.AbstractTransform, x::Matrix{T}
end
function transform!(y::AbstractVector, tt::TransformVariables.TransformTuple, x::AbstractVector)
- (; transformations) = tt
+ transformations = getfield(tt, :inner)
@assert TransformVariables.dimension(tt) == length(y) == length(x)
index = firstindex(y)
for t in transformations
@@ -242,7 +242,7 @@ transform!(y::AbstractVector, t::TransformVariables.AbstractTransform, x::Abstra
function transfrom_axis(tt::TransformVariables.TransformTuple{<:NamedTuple})
ax_tt = []
index = 1
- for (k, t) in pairs(tt.transformations)
+ for (k, t) in pairs(getfield(tt, :inner))
ax = transfrom_axis(t)
n = TransformVariables.dimension(t)
if ax isa Union{ComponentArrays.ShapedAxis,ComponentArrays.Axis}
diff --git a/opt/FeatureMiner/README.md b/opt/FeatureMiner/README.md
new file mode 100644
index 00000000..7991a850
--- /dev/null
+++ b/opt/FeatureMiner/README.md
@@ -0,0 +1,13 @@
+# Feature Miner
+
+Code for evaluating fitted linear probes for their ability to predict various chemically meaningful features.
+
+## Replication
+
+1. Install [Julia](https://julialang.org/downloads/)
+2. Instantiate the project: `julia --project -e 'using Pkg; Pkg.instantiate()'`
+3. Train linear probes on fine-tuned MIST models using [linear_probe.jsonnet](../../submit/linear_probe.jsonnet) and [submit/submit.py](../../submit/submit.py)
+4. Run `julia --project explore_probes.jl` to extract fitted probe weights from the checkpoints
+5. Instantiate the plotting code: `julia --project=plots -e 'using Pkg; Pkg.instantiate()'`
+6. Evaluate fitted probes: `julia --project=plots ./plots/lipinski_probes.jl`
diff --git a/opt/MISTStyle/README.md b/opt/MISTStyle/README.md
new file mode 100644
index 00000000..06c7ae16
--- /dev/null
+++ b/opt/MISTStyle/README.md
@@ -0,0 +1,3 @@
+# MISTStyle.jl
+
+A collection of plotting utilities and themes for [Makie.jl](https://docs.makie.org/stable/) used throughout the codebase to generate high-quality plots for publication with a consistent visual theme.
diff --git a/opt/TokenizerStats/README.md b/opt/TokenizerStats/README.md
index 2c05d09d..783fe3ff 100644
--- a/opt/TokenizerStats/README.md
+++ b/opt/TokenizerStats/README.md
@@ -1,4 +1,13 @@
-# Analysis Code for "Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models"
+# Analysis Code for "Tokenization for Molecular Foundation Models"
+
## Installation
diff --git a/opt/design/README.md b/opt/design/README.md
new file mode 100644
index 00000000..3ad4c6ff
--- /dev/null
+++ b/opt/design/README.md
@@ -0,0 +1,12 @@
+# Evaluating Chemical Trends with MIST
+
+Source code for querying the MIST models on hydrocarbons and other templatable organic molecules.
+
+## Installation
+
+> All commands run from this directory
+
+1. Install [Julia](https://julialang.org/downloads/) and [uv](https://docs.astral.sh/uv/getting-started/installation/)
+2. Instantiate the project: `uv run julia --project -e 'using Pkg; Pkg.instantiate()'`
+3. Download the MIST models to `models/`
+4. Recreate the plots: `uv run julia --project plots.jl`
diff --git a/opt/interp_embeddings/README.md b/opt/interp_embeddings/README.md
index abd99f88..2e5e6d4e 100644
--- a/opt/interp_embeddings/README.md
+++ b/opt/interp_embeddings/README.md
@@ -5,6 +5,6 @@ Scripts for exploring MIST's embeddings and generating relevant figures from the
## Reproducing Analysis
1. Install [julia](https://julialang.org/downloads/) and the base project (See [Project README](../../README.md))
-2. Instantiate the environment `julia --project -e 'using Pkg; Pkg.instantiate()'`
+2. Instantiate the environment `uv run julia --project -e 'using Pkg; Pkg.instantiate()'`
3. Obtain model files and place at the appropriate path (see `plots.jl`)
-4. Run the script: `julia --project plots.jl`
+4. Run the script: `uv run julia --project plots.jl`
diff --git a/opt/mixtures/README.md b/opt/mixtures/README.md
new file mode 100644
index 00000000..f3e79ea0
--- /dev/null
+++ b/opt/mixtures/README.md
@@ -0,0 +1,17 @@
+# Mixtures
+
+Code for evaluating the MIST mixture models, exploring mixture space and optimizing mixture composition.
+
+## Installation
+
+> All commands run from this directory
+
+1. Install [Julia](https://julialang.org/downloads/) and [uv](https://docs.astral.sh/uv/getting-started/installation/)
+2. Instantiate the project: `uv run julia --project -e 'using Pkg; Pkg.instantiate()'`
+3. Obtain the mixtures dataset from [doi:10.5281/zenodo.17527149](https://doi.org/10.5281/zenodo.17527149)
+
+## Reproducing Plots
+
+Once installed, most of the scripts in the current directory can be run with:
+- python: `uv run