This guide explains all the available parameters in the config.yaml file used to configure training in seq2squiggle. The parameters are grouped into different sections based on their purpose, such as logging, preprocessing, model architecture, and training.
-
log_name
The name of the logging directory where logs will be saved.
Example:"Human-R1041-4khz" -
wandb_logger_state
The state of the wandb logger, which is used for experiment tracking and visualization.
Options:disabled: Completely disable wandb logging.online: Enable online logging, which requires logging into wandb viawandb loginor using other authentication methods (API key, environment variable, etc.). More details can be found on the wandb website.offline: Store logs locally without syncing to the wandb cloud. Use this if you prefer not to log in or need offline access to logs.
A chunk consists of a DNA sequence of length max_dna_len and its corresponding mapped signal, determined by the segmentation tool (e.g., f5c or uncalled4). The chunk size is defined by max_dna_len (DNA sequence length) and max_signal_len (maximum signal length), both of which are set in the config file.
💡 Important: The values of max_dna_len and max_signal_len must be the same for both preprocessing and training. Using different values will cause mismatches and likely lead to errors.
-
max_chunks_train
The maximum number of chunks for training data.
Default:210000000
This limit applies during both preprocessing and training. If your dataset contains fewer chunks, only that number will be used. -
max_chunks_valid
The maximum number of chunks for validation data.
Default:100000 -
scaling_max_value
The value for scaling input signals.
Default:165.0
This is used to normalize pA (picoampere) values before feeding them into the model. -
train_valid_split
The fraction of data used for training. The remaining portion is used for validation.
Default:0.9
If you don't provide a separate validation dataset during preprocessing, seq2squiggle will automatically split your data based on this ratio. -
max_dna_len
The maximum length of the DNA sequence in each chunk.
Default:16 -
max_signal_len
The maximum signal length for each DNA sequence.
Default:250- If the corresponding signal is shorter, it will be zero-padded to reach this length.
- If it's longer, it will be excluded from training after preprocessing.
-
allowed_chars
The allowed characters in the DNA sequence.
Default:"_ACGT"- The
_represents an empty symbol used when splitting input sequences into chunks ofmax_dna_len. - Example:
"ACGTACGT__"(when padding is needed).
- The
-
seq_kmer
The k-mer size used in the model.
Default:9- For R10.4.1 data, set this to 9.
- For R9.4.1 data, set this to 6.
Using the wrong value for your dataset may cause unexpected behavior.
seq2squiggle uses a Feed-Forward Transformer (based on FastSpeech) to predict nanopore signals from DNA data.
-
pre_layers
The number of pre-processing layers before the main transformer model.
Default:1- Input DNA k-mers are one-hot encoded and passed through a dense layer.
- Then,
pre_layersspecifies the number of additional dense layers before the feed-forward transformer block.
-
dmodel
The dimension of the model (hidden layer size).
Default:64 -
dff
The size of the feed-forward layer.
Default:256 -
encoder_layers
The number of layers in the encoder.
Default:2 -
encoder_heads
The number of attention heads in the encoder.
Default:8 -
decoder_layers
The number of layers in the decoder.
Default:2 -
decoder_heads
The number of attention heads in the decoder.
Default:8 -
encoder_dropout
The dropout rate for the encoder layers.
Default:0.2 -
decoder_dropout
The dropout rate for the decoder layers.
Default:0.2 -
duration_dropout
The dropout rate for duration sampling (used in signal prediction).
Default:0.2
-
train_batch_size
The batch size for training.
Default:512 -
max_epochs
The maximum number of epochs for training.
Default:25 -
save_model
Whether to save the trained model.
Options:TrueorFalse -
optimizer
The optimizer used for training.
Options:AdamAdamWRAdamAdaFactorRMSProp
Default:"Adam"
-
warmup_ratio
The fraction of training steps used for learning rate warmup.
Default:0.01 -
lr
The learning rate.
Default:0.0005 -
weight_decay
The weight decay for regularization.
Default:0.0 -
lr_schedule
The learning rate schedule. Options:warmup_cosine: Cosine annealing with warmup.warmup_constant: Constant learning rate with warmup.constant: Fixed learning rate.warmup_cosine_restarts: Cosine annealing with periodic restarts.one_cycle: One-cycle learning rate schedule.
-
gradient_clip_val
The gradient clipping value to prevent exploding gradients.
Default:1.0- This is necessary for stabilizing training, especially for the Feed-Forward Transformer model.
- Always ensure your preprocessing settings match your training settings (especially
max_dna_len,max_signal_len, andseq_kmer) - If you encounter unexpected training behavior, check whether your optimizer and learning rate schedule are suitable for your dataset
- For logging and experiment tracking, wandb integration is recommended (but can be disabled if not needed)
- If you are working with different nanopore sequencing chemistries, adjust
seq_kmeraccordingly