rskit - RNA-seq Analysis Toolkit

English | 中文

A Python toolkit for RNA-seq analysis with a CLI and Python API for common workflows including read alignment, Salmon quantification, DESeq2 differential expression, and WGCNA.

Features

Quantification pipeline: STAR alignment + Salmon quantification
Gene-level expression export during quant
Differential expression analysis with DESeq2
Co-expression network analysis with WGCNA
End-to-end workflow with rskit all
Automatic CSV/TSV detection
Shared --coldata metadata format across subcommands
Early tx2gene.tsv and gene-level matrix export before DESeq2
Strict sample ID validation between metadata and count/expression matrices
Input preflight checks with rskit validate / rskit doctor
Example input templates with rskit template

Installation

Install from source

git clone https://github.com/WWz33/rskit.git
cd rskit
pip install -e .

Dependencies

Python >= 3.8
pandas
numpy
pydeseq2
pytximport
PyWGCNA
STAR
Salmon
fastp (optional)

User Scenarios

I need a valid input file to start from

Generate a small coldata template instead of guessing column names.

rskit template coldata -o coldata.csv
rskit template contrast -o contrast.tsv

I want to catch input problems before a long run

Validate metadata columns, read paths, and count or expression matrix sample IDs without running STAR, Salmon, DESeq2, or WGCNA.

rskit validate -S coldata.csv -r -gc counts.csv -d "~batch + condition"
rskit doctor -S coldata.csv -e expression.csv

I have FASTQ files and want the full RNA-seq workflow

Run alignment, Salmon quantification, gene-level export, and DESeq2. quant creates tx2gene.tsv, gene_counts.csv, gene_tpm.csv, and gene_log2_tpm_plus1.csv before DESeq2 needs gene-level counts.

rskit all -S coldata.csv -g genome.fa -gtf annotation.gtf -gf transcripts.fa -o results/
rskit all -S coldata.csv -g genome.fa -gtf annotation.gtf -gf transcripts.fa -o results/ -d "~batch + condition"

I only need quantification

Use single-sample mode for one library or --coldata for a batch.

rskit quant -s sample1 -1 sample1_R1.fq.gz -2 sample1_R2.fq.gz -g genome.fa -gtf annotation.gtf -gf transcripts.fa -o results/
rskit quant -S coldata.csv -g genome.fa -gtf annotation.gtf -gf transcripts.fa -o results/

I need to tune STAR, Salmon, or fastp

Pass advanced tool arguments as a quoted string. If an allowed argument conflicts with an rskit default, the user-provided value replaces the default. rskit rejects arguments that would change managed inputs, outputs, reports, indexes, library type, or thread counts.

rskit quant -S coldata.csv -g genome.fa -gtf annotation.gtf -gf transcripts.fa -o results/ --star-args "--outFilterMultimapNmax 8"
rskit quant -S coldata.csv -g genome.fa -gtf annotation.gtf -gf transcripts.fa -o results/ --salmon-args "--validateMappings --minScoreFraction 0.95"
rskit quant -S coldata.csv -g genome.fa -gtf annotation.gtf -gf transcripts.fa -o results/ -tr --fastp-args "--length_required 30 --cut_front"

I already have gene counts and want DESeq2

Use a matrix directly, or point at a quant output directory and let rskit reuse gene_counts.csv or gene_counts.tsv when present.

rskit deseq2 -gc counts.csv -S coldata.csv -c condition,treatment,control
rskit deseq2 -sd ./03_quant -S coldata.csv -gtf annotation.gtf -d "~batch + condition"

I want co-expression modules

Provide an expression matrix as rows = samples and columns = genes. Add coldata and gene metadata when available.

rskit wgcna -e expression.csv -o ./wgcna_results
rskit wgcna -e expression.csv -S coldata.csv -G gene_info.csv -o ./wgcna_results

Coldata Format

All subcommands use the same --coldata / -S parameter.

sample,id,condition,r1,r2
sample1,ctrl,control,sample1_R1.fq.gz,sample1_R2.fq.gz
sample2,ctrl,control,sample2_R1.fq.gz,sample2_R2.fq.gz
sample3,treat,treatment,sample3_R1.fq.gz,sample3_R2.fq.gz
sample4,treat,treatment,sample4_R1.fq.gz,sample4_R2.fq.gz

Each subcommand reads only the columns it needs:

quant: sample, r1, r2
deseq2: sample plus every metadata column referenced by --design; the default ~condition requires condition
wgcna: sample plus any metadata columns

Relative r1 and r2 paths are resolved relative to the coldata file. Count and expression matrices are expected as rows = samples and columns = genes. If a DESeq2 counts file is genes x samples, it is transposed only when the sample IDs match the coldata columns. After orientation, sample IDs must match coldata exactly; rskit does not silently drop or intersect samples.

Command Reference

`rskit all`

Complete pipeline: quantification + DESeq2 analysis.

Short	Long	Description
`-S`	`--coldata`	Required coldata file with `sample,id,condition,r1,r2`. Relative `r1`/`r2` paths are resolved from this file.
`-g`	`--genome-fasta`	Required genome FASTA used to build or check the STAR index.
`-gtf`	`--gtf-file`	Required GTF/GFF annotation used by STAR and to create `tx2gene.tsv` when `--tx2gene` is not provided.
`-gf`	`--transcript-fasta`	Required transcript FASTA used by Salmon quantification.
`-o`	`--output-dir`	Required workflow output directory; rskit creates `00_index/`, `02_bam/`, `03_quant/`, and `04_deseq2/` under it.
`-idx`	`--index-dir`	Optional existing STAR index directory; defaults to `<output-dir>/00_index`.
`-t2g`	`--tx2gene`	Optional transcript-to-gene mapping file; if omitted, rskit writes `03_quant/tx2gene.tsv` from `--gtf-file`.
`-t`	`--threads`	Threads per sample when running sequentially; also used for index and DESeq2 defaults.
`-p`	`--parallel`	Total cores for multi-sample parallel execution; rskit divides this across samples.
`-tr`	`--trim`	Run fastp before alignment and use trimmed FASTQ files.
`-fi`	`--force-index`	Rebuild the STAR index even when an index directory already exists.
`-se`	`--skip-existing`	Skip sample-level work when expected output files already exist.
n/a	`--star-args`	Advanced STAR arguments. Allowed conflicts replace rskit defaults; protected options include `--runThreadN`, `--genomeDir`, `--readFilesIn`, `--readFilesCommand`, `--outFileNamePrefix`, `--outSAMtype`, `--quantMode`, `--genomeFastaFiles`, and `--sjdbGTFfile`.
n/a	`--salmon-args`	Advanced `salmon quant` arguments. Allowed conflicts replace rskit defaults; protected options include `-t`/`--targets`, `-a`/`--alignments`, `-o`/`--output`, `-p`/`--threads`, and `-l`/`--libType`.
n/a	`--fastp-args`	Advanced fastp arguments used only with `--trim`. Allowed conflicts replace rskit defaults; protected options include `-i`/`--in1`, `-I`/`--in2`, `-o`/`--out1`, `-O`/`--out2`, `-w`/`--thread`, report paths, STDIN/STDOUT, and extra output-file options.
`-d`	`--design`	DESeq2 design formula; every referenced column must exist in coldata. Default: `~condition`.
`-c`	`--contrast`	DESeq2 contrast as `factor,level1,level2`; factor and levels are validated against coldata.
`-a`	`--alpha`	Adjusted p-value threshold used for significance summaries. Default: `0.05`.
`-l`	`--lfc`	Absolute log2 fold-change threshold used for significance summaries. Default: `2.0`.

`rskit quant`

Complete quantification pipeline: index -> align -> quant -> gene-level table export.

Short	Long	Description
`-s`	`--sample`	Sample name for single-sample mode; use with `-1` and `-2`.
`-S`	`--coldata`	Batch sample table with `sample,r1,r2`; replaces `--sample`, `--r1`, and `--r2`.
`-1`	`--r1`	First read file for single-sample mode.
`-2`	`--r2`	Second read file for single-sample mode.
`-g`	`--genome-fasta`	Required genome FASTA used to build or check the STAR index.
`-gtf`	`--gtf-file`	Required annotation used by STAR and `tx2gene.tsv` generation.
`-gf`	`--transcript-fasta`	Required transcript FASTA used by Salmon.
`-o`	`--output-dir`	Required output/work directory.
`-idx`	`--index-dir`	Optional existing STAR index directory; defaults to `<output-dir>/00_index`.
`-t2g`	`--tx2gene`	Optional transcript-to-gene mapping for gene-level export; otherwise generated from `--gtf-file`.
`-t`	`--threads`	Threads per sample. Default: `8`.
`-p`	`--parallel`	Total cores for multi-sample parallel execution.
`-tr`	`--trim`	Run fastp before alignment.
`-fi`	`--force-index`	Rebuild the STAR index even if it exists.
`-se`	`--skip-existing`	Skip sample work when expected output already exists.
n/a	`--star-args`	Advanced STAR arguments. Allowed conflicts replace rskit defaults; protected options include `--runThreadN`, `--genomeDir`, `--readFilesIn`, `--readFilesCommand`, `--outFileNamePrefix`, `--outSAMtype`, `--quantMode`, `--genomeFastaFiles`, and `--sjdbGTFfile`.
n/a	`--salmon-args`	Advanced `salmon quant` arguments. Allowed conflicts replace rskit defaults; protected options include `-t`/`--targets`, `-a`/`--alignments`, `-o`/`--output`, `-p`/`--threads`, and `-l`/`--libType`.
n/a	`--fastp-args`	Advanced fastp arguments used only with `--trim`. Allowed conflicts replace rskit defaults; protected options include `-i`/`--in1`, `-I`/`--in2`, `-o`/`--out1`, `-O`/`--out2`, `-w`/`--thread`, report paths, STDIN/STDOUT, and extra output-file options.

`rskit deseq2`

DESeq2 differential expression analysis.

Short	Long	Description
`-sd`	`--salmon-dir`	Directory containing Salmon `quant.sf` folders or precomputed `gene_counts.csv`/`gene_counts.tsv`. Mutually exclusive with `--gene-counts`.
`-gc`	`--gene-counts`	Gene counts matrix file. Mutually exclusive with `--salmon-dir`.
`-S`	`--coldata`	Required metadata file with `sample` and every column referenced by `--design`.
`-gtf`	`--gtf`	GTF/GFF annotation; required when importing `quant.sf` without `--tx2gene`.
`-t2g`	`--tx2gene`	Optional transcript-to-gene mapping used when importing from Salmon outputs.
`-w`	`--work-dir`	Work directory used to place the default `04_deseq2/` output directory. Default: current directory.
`-o`	`--output-dir`	Custom DESeq2 output directory; overrides `<work-dir>/04_deseq2`.
`-d`	`--design`	DESeq2 design formula. Default: `~condition`.
`-c`	`--contrast`	Contrast as `factor,level1,level2`; validated against coldata before counts are loaded.
`-a`	`--alpha`	Adjusted p-value threshold used for result summaries. Default: `0.05`.
`-l`	`--lfc`	Absolute log2 fold-change threshold used for result summaries. Default: `2.0`.
`-t`	`--threads`	Number of CPUs for PyDESeq2 inference.

When --salmon-dir points at a quant output directory, deseq2 reuses gene_counts.csv or gene_counts.tsv if present. It falls back to importing from quant.sf only when no precomputed gene counts are available. Metadata sample IDs and count matrix sample IDs must match exactly.

--contrast must use factor,level1,level2. The factor must be a coldata column, and both levels must exist in that column; rskit validates this before loading counts or running DESeq2.

`rskit validate` / `rskit doctor`

Validate input files without running analysis tools.

Short	Long	Description
`-S`	`--coldata`	Required coldata file with a `sample` column.
`-d`	`--design`	DESeq2 design formula used to check required metadata columns. Default: `~condition`.
`-r`	`--check-reads`	Require `r1`/`r2` columns and verify that read files exist.
`-gc`	`--gene-counts`	Optional gene counts matrix to validate against coldata sample IDs.
`-e`	`--expression`	Optional expression matrix to validate against coldata sample IDs.

`rskit template`

Write example input template files.

Short	Long	Description
n/a	`template_name`	Positional template type: `coldata` or `contrast`.
`-o`	`--output`	Required output path; `.csv` writes CSV and `.tsv`/`.txt` writes TSV.
`-f`	`--force`	Overwrite the output file if it already exists.

`rskit wgcna`

WGCNA co-expression network analysis.

Short	Long	Description
`-e`	`--expression`	Required expression matrix; rows must be samples and columns genes.
`-o`	`--output-dir`	Required WGCNA output directory.
`-S`	`--coldata`	Optional sample metadata file; sample IDs must match expression rows.
`-G`	`--gene-info`	Optional gene metadata file indexed by gene ID.
`-sp` (`-sep`)	`--sep`	Optional separator override; by default `.csv` uses comma and `.tsv`/`.txt` use tab.
`-n`	`--name`	Analysis name stored in the PyWGCNA object. Default: `WGCNA`.
`-s`	`--species`	Species label used by PyWGCNA enrichment analysis.
`-l`	`--level`	Data level passed to PyWGCNA: `gene` or `transcript`. Default: `gene`.
`-nw` (`-nt`)	`--network-type`	WGCNA network type: `unsigned`, `signed`, or `signed hybrid`.
`-tt` (`-tom`)	`--tom-type`	TOM type passed to PyWGCNA: `unsigned` or `signed`.
`-ms` (`-min`)	`--min-module-size`	Minimum module size for module detection. Default: `50`.
`-p`	`--power`	Soft-thresholding power; omit to let PyWGCNA auto-detect.
`-rs` (`-rsquared`)	`--rsquared-cut`	R-squared cutoff for power selection. Default: `0.9`.
`-mc` (`-mean`)	`--mean-cut`	Mean connectivity cutoff. Default: `100`.
`-md` (`-mediss`)	`--mediss-thresh`	Module eigengene dissimilarity threshold for merging modules. Default: `0.2`.
`-tc` (`-tpm`)	`--tpm-cutoff`	TPM cutoff used by PyWGCNA filtering. Default: `1`.

Python API

Quantification

from rskit import RNAseqPipeline, PipelineConfig

config = PipelineConfig()
pipeline = RNAseqPipeline(config)

samples = {
    "sample1": {
        "fq1": "data/sample1_R1.fq",
        "fq2": "data/sample1_R2.fq",
    }
}

results = pipeline.run(
    samples=samples,
    genome_fasta="genome.fa",
    gtf_file="annotation.gtf",
    transcript_fasta="transcripts.fa",
    index_dir="STAR_index",
    output_dir="results/02_bam",
    quant_output_dir="results/03_quant",
)

DESeq2

from rskit.core.deseq2 import Deseq2Analyzer
from rskit.config import DESeq2Config

config = DESeq2Config(alpha=0.05, lfc_threshold=2.0)
analyzer = Deseq2Analyzer(config)

metadata_df = analyzer.load_metadata("coldata.csv", required_columns=["condition"])
counts_df = analyzer.load_counts_from_file("counts.csv", metadata_df=metadata_df)

results_df = analyzer.analyze(
    counts_df=counts_df,
    metadata_df=metadata_df,
    contrast=["condition", "treatment", "control"],
)

summary = analyzer.get_summary()
print(f"Significant genes: {summary['significant_genes']}")

WGCNA

from rskit.core.wgcna import WGCNAAnalyzer

analyzer = WGCNAAnalyzer(
    output_dir="./wgcna_results",
    name="MyWGCNA",
    network_type="signed hybrid",
    min_module_size=50,
)

analyzer.load_data(
    expression_file="expression.csv",
    coldata="coldata.csv",
    gene_info_file="gene_info.csv",
)

wgcna_obj = analyzer.run_analysis()
analyzer.save_results()

Output Structure

Quant output

03_quant/
├── <sample>/quant.sf
├── gene_counts.csv
├── gene_tpm.csv
├── gene_log2_tpm_plus1.csv
└── tx2gene.tsv

Complete pipeline output

results/
├── 00_index/
├── 01_clean_data/
├── 02_bam/
├── 03_quant/
└── 04_deseq2/

DESeq2 output

04_deseq2/
├── deseq2_results.csv
├── gene_counts.csv
├── manifest.json
├── pca_plot.pdf
├── volcano_plot.pdf
└── ma_plot.pdf

manifest.json records the DESeq2 inputs, resolved counts file, sample IDs, design, contrast, summary, and key output files.

WGCNA output

wgcna_results/
├── figures/
├── WGCNA.p
└── module_info.csv

File Format Notes

CSV/TSV separators are detected from file extension
Gene expression matrices are stored as rows = samples, columns = genes
DESeq2 count matrices may be provided as genes x samples only when coldata sample IDs identify the count columns
DESeq2 metadata must contain sample plus every column referenced by --design
Matrix sample IDs must match coldata sample IDs exactly
Gene metadata should use the first column as the gene identifier

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
data		data
rskit		rskit
tests		tests
.gitignore		.gitignore
README.md		README.md
README.zh.md		README.zh.md
pyproject.toml		pyproject.toml
salmon.184023.out		salmon.184023.out
setup.py		setup.py
test.sh		test.sh

Folders and files

Latest commit

History

Repository files navigation

rskit - RNA-seq Analysis Toolkit

Features

Installation

Install from source

Dependencies

User Scenarios

I need a valid input file to start from

I want to catch input problems before a long run

I have FASTQ files and want the full RNA-seq workflow

I only need quantification

I need to tune STAR, Salmon, or fastp

I already have gene counts and want DESeq2

I want co-expression modules

Coldata Format

Command Reference

rskit all

rskit quant

rskit deseq2

rskit validate / rskit doctor

rskit template

rskit wgcna

Python API

Quantification

DESeq2

WGCNA

Output Structure

Quant output

Complete pipeline output

DESeq2 output

WGCNA output

File Format Notes

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`rskit all`

`rskit quant`

`rskit deseq2`

`rskit validate` / `rskit doctor`

`rskit template`

`rskit wgcna`

Packages