From 17e3c9fbb83368cfac5febcdeabfe4c2d3d94374 Mon Sep 17 00:00:00 2001
From: Alexius Wadell
Date: Thu, 2 Apr 2026 21:36:20 -0600
Subject: [PATCH 01/13] add opt READMEs

---
 opt/BayesianScaling/README.md   | 22 ++++++++++++++++++++++
 opt/FeatureMiner/README.md      | 14 ++++++++++++++
 opt/MISTStyle/README.md         |  5 +++++
 opt/TokenizerStats/README.md    | 11 ++++++++++-
 opt/design/README.md            | 12 ++++++++++++
 opt/interp_embeddings/README.md |  4 ++--
 opt/mixtures/README.md          | 17 +++++++++++++++++
 opt/pubchem-qc/README.md        |  3 +++
 opt/pubchem-qc/submit.sh        |  0
 opt/qmist/README.md             |  2 +-
 opt/run_logs/README.md          |  4 ++++
 opt/screening/README.md         | 16 ++++++++++++++++
 opt/sterochemistry/README.md    |  1 +
 opt/synth_access/README.md      | 10 ++++++++++
 14 files changed, 117 insertions(+), 4 deletions(-)
 create mode 100644 opt/BayesianScaling/README.md
 create mode 100644 opt/FeatureMiner/README.md
 create mode 100644 opt/MISTStyle/README.md
 create mode 100644 opt/design/README.md
 create mode 100644 opt/mixtures/README.md
 delete mode 100644 opt/pubchem-qc/submit.sh
 create mode 100644 opt/run_logs/README.md
 create mode 100644 opt/sterochemistry/README.md
 create mode 100644 opt/synth_access/README.md

diff --git a/opt/BayesianScaling/README.md b/opt/BayesianScaling/README.md
new file mode 100644
index 00000000..60e5de2a
--- /dev/null
+++ b/opt/BayesianScaling/README.md
@@ -0,0 +1,22 @@
+# BayesianScaling
+
+A Julia package for fitting regression models using MCMC; it was used to fit penalized neural scaling laws.
+To install:
+
+- Install Julia: https://julialang.org/downloads/
+- Instantiate the package: `julia --project -e 'using Pkg; Pkg.instantiate()'`
+- Download the wandb records or chains ([doi:10.5281/zenodo.17527149](https://doi.org/10.5281/zenodo.17527149))
+
+## Code Organization
+
+- `./scripts/` has the scripts used for fitting and analyzing the neural scaling laws.
+- `./plots/` has plotting code for the paper and various conferences
+- `./src` is the MCMC regression and analysis package powering this work
+  - [ppl.jl](./src/ppl.jl): Defines a regression-first interface for fitting MCMC models,
+    plus single-pass algorithms for working with posterior samples
+  - [scaling.jl](./src/scaling.jl): Functional forms for neural scaling laws and derived quantities
+  - [analysis.jl](./src/analysis.jl): Code for predicting the performance of models using fitted neural scaling laws
+- `./test/` has the unit tests for the BayesianScaling.jl package
+- `./benchmark/`: benchmark suite for evaluating different AD backends using [PkgJogger.jl](https://github.com/awadell1/PkgJogger.jl)
+
+
diff --git a/opt/FeatureMiner/README.md b/opt/FeatureMiner/README.md
new file mode 100644
index 00000000..c2e00059
--- /dev/null
+++ b/opt/FeatureMiner/README.md
@@ -0,0 +1,14 @@
+# Feature Miner
+
+Code for evaluating fitted linear probes for their ability to predict various chemically meaningful features.
+
+# Replication
+
+1. Install [Julia](https://julialang.org/downloads/)
+2. Instantiate the project: `julia --project -e 'using Pkg; Pkg.instantiate()'`
+3. Train linear probes using [linear_probe.jsonnet](../../submit/linear_probe.jsonnet) and [submit/submit.py](../../submit/submit.py) on
+MIST finetuned models.
+4. Run `julia --project explore_probes.jl` to extract fitted probe weights from the checkpoints
+5. Instantiate the plotting code: `julia --project=plots -e 'using Pkg; Pkg.instantiate()'`
+6. Evaluate fitted probes: `julia --project=plots ./plots/lipinski_probes.jl`
+
diff --git a/opt/MISTStyle/README.md b/opt/MISTStyle/README.md
new file mode 100644
index 00000000..c3976f10
--- /dev/null
+++ b/opt/MISTStyle/README.md
@@ -0,0 +1,5 @@
+# MISTStyle.jl
+
+A collection of plotting utilities and themes for [Makie.jl](https://docs.makie.org/stable/) used throughout the codebase to generate high-quality plots for publication with a consistent visual theme.
+
+
diff --git a/opt/TokenizerStats/README.md b/opt/TokenizerStats/README.md
index 2c05d09d..783fe3ff 100644
--- a/opt/TokenizerStats/README.md
+++ b/opt/TokenizerStats/README.md
@@ -1,4 +1,13 @@
-# Analysis Code for "Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models"
+# Analysis Code for "Tokenization for Molecular Foundation Models"
+
+
+
+![GitHub License](https://img.shields.io/github/license/BattModels/smirk)
+![paper](https://img.shields.io/badge/paper-10.1021%2Facs.jcim.5c01856-blue)
+![data](https://img.shields.io/badge/data-10.5281%2Fzenodo.13761262-blue)
+![arXiv:2409.15370](https://img.shields.io/badge/cs.LG-2409.15370-b31b1b?style=flat&logo=arxiv&logoColor=red)
+
 ## Installation
diff --git a/opt/design/README.md b/opt/design/README.md
new file mode 100644
index 00000000..3ad4c6ff
--- /dev/null
+++ b/opt/design/README.md
@@ -0,0 +1,12 @@
+# Evaluating Chemical Trends with MIST
+
+Source code for querying the MIST models on hydrocarbon and other templatable organic molecules.
+
+## Installation
+
+> All commands run from this directory
+
+1. Install [Julia](https://julialang.org/downloads/) and [uv](https://docs.astral.sh/uv/getting-started/installation/)
+2. Instantiate the project: `uv run julia --project -e 'using Pkg; Pkg.instantiate()'`
+3. Download the MIST models to `models/`
+4. Recreate the plots: `uv run julia --project plots.jl`
diff --git a/opt/interp_embeddings/README.md b/opt/interp_embeddings/README.md
index abd99f88..2e5e6d4e 100644
--- a/opt/interp_embeddings/README.md
+++ b/opt/interp_embeddings/README.md
@@ -5,6 +5,6 @@ Scripts for exploring MIST's embeddings and generating relevant figures from the
 ## Reproducing Analysis
 
 1. Install [julia](https://julialang.org/downloads/) and the base project (See [Project README](../../README.md))
-2. Instantiate the environment `julia --project -e 'using Pkg; Pkg.instantiate()'`
+2. Instantiate the environment `uv run julia --project -e 'using Pkg; Pkg.instantiate()'`
 3. Obtain model files and place at the appropriate path (see `plots.jl`)
-4. Run the script: `julia --project plots.jl`
+4. Run the script: `uv run julia --project plots.jl`
diff --git a/opt/mixtures/README.md b/opt/mixtures/README.md
new file mode 100644
index 00000000..f3e79ea0
--- /dev/null
+++ b/opt/mixtures/README.md
@@ -0,0 +1,17 @@
+# Mixtures
+
+Code for evaluating the MIST mixture models, exploring mixture space, and optimizing mixture composition.
+
+## Installation
+
+> All commands run from this directory
+
+1. Install [Julia](https://julialang.org/downloads/) and [uv](https://docs.astral.sh/uv/getting-started/installation/)
+2. Instantiate the project: `uv run julia --project -e 'using Pkg; Pkg.instantiate()'`
+3. Obtain the mixtures dataset from [doi:10.5281/zenodo.17527149](https://doi.org/10.5281/zenodo.17527149)
+
+## Reproducing Plots
+
+Once installed, most of the scripts in the current directory can be run with:
+- python: `uv run