Design and annotate bacterial CRISPRi guide RNA libraries from any genome.
CRISPRi uses a catalytically dead Cas9 to silence genes by blocking transcription. Designing a good library means knowing not just where a guide targets, but how far upstream of the TSS it lands, which replichore it sits on, whether it shares a seed sequence with another guide, and whether it contains a restriction site that would break your cloning. crispio computes all of this in one pass and outputs annotated GFF3 that loads directly into any genome browser.
crispio generate --pam Spy -g genome.fasta -a genome.gff3 > guides.gff
- Quick start
- What you get
- Generating a new library
- Annotating guides from the literature
- Checking for off-targets
- Adding ML features
- Piping commands together
- Python API
- Installation
- PAMs and scaffolds
- Issues and documentation
You need two files, both available for any sequenced bacterium from NCBI:
- FASTA: the genome sequence (.fasta/.fa)
- GFF3: gene annotations (.gff/.gff3)
Try crispio on the first 100 guides straight away with --limit:
crispio generate \
--pam Spy \
--genome EcoMG1655-NC_000913.3.fasta \
--annotations EcoMG1655-NC_000913.3.gff3 \
--limit 100 \
> first100.gff
Convert to a spreadsheet-friendly table with bioino:
cat first100.gff | bioino gff2table > first100.tsv
Open first100.tsv in Excel. Each row is one guide. The most useful columns at a glance:
| Column | Example | What it means |
|---|---|---|
| Name | thrL-21-modest_saddle | gene-position-mnemonic |
| guide_sequence | GCTTTTCATTCTGACTGCAA | The 20 nt spacer to synthesise |
| pam_offset | -166 | Distance from PAM to gene start. Negative = upstream of TSS, the productive targeting window for CRISPRi |
| pam_replichore | R | Left or right replichore; matters for efficiency in fast-growing bacteria |
| ann_locus_tag | b0001 | Systematic gene ID for programmatic filtering |
| guide_re_sites | BbsI | Restriction sites in the spacer that would break Golden Gate cloning |
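If you would rather read these columns straight from the GFF3 in Python, the sketch below splits the attributes column (the ninth GFF3 column) into a dictionary. The record is hypothetical: the coordinates, source and feature type are placeholders, and only the attribute values are taken from the examples in the table above.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9: str) -> dict[str, str]:
    """Split a GFF3 attributes column (key=value;key=value) into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        # GFF3 percent-encodes reserved characters such as ';' and '='
        attributes[unquote(key)] = unquote(value)
    return attributes


# Hypothetical record: coordinates, source and type are placeholders;
# attribute values reuse the examples from the table above.
example_line = "\t".join([
    "NC_000913.3", "crispio", "guide", "1", "20", ".", "+", ".",
    "Name=thrL-21-modest_saddle;guide_sequence=GCTTTTCATTCTGACTGCAA;"
    "pam_offset=-166;pam_replichore=R;ann_locus_tag=b0001",
])

attrs = parse_gff3_attributes(example_line.split("\t")[8])
print(attrs["Name"], int(attrs["pam_offset"]))
```

Attribute values are strings, so numeric columns such as pam_offset need an explicit int() before comparison.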
Every guide gets a stable, human-readable mnemonic (for example modest_saddle or bouncy_sabine) that is a deterministic hash of the guide sequence, PAM, and position. The same guide always gets the same mnemonic regardless of when you run crispio or what else is in the library. Use it to refer to guides in lab notebooks and across collaborators without copying 20-character sequences.
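To illustrate why a hash-derived name is stable, here is a minimal sketch of the general idea. It is not crispio's implementation: the word lists, the choice of SHA-256 and the function name are all assumptions made for the example, so the names it prints will not match crispio's.

```python
import hashlib

# Hypothetical word lists; crispio's real vocabulary is not shown here.
ADJECTIVES = ["modest", "bouncy", "quiet", "brisk"]
NOUNS = ["saddle", "sabine", "lantern", "harbor"]


def sketch_mnemonic(guide_seq: str, pam_seq: str, position: int) -> str:
    """Derive an adjective_noun name from a hash of the guide's identity."""
    key = f"{guide_seq}|{pam_seq}|{position}".encode()
    digest = hashlib.sha256(key).digest()
    # Two bytes of the digest pick one word from each list.
    adjective = ADJECTIVES[digest[0] % len(ADJECTIVES)]
    noun = NOUNS[digest[1] % len(NOUNS)]
    return f"{adjective}_{noun}"


first = sketch_mnemonic("GCTTTTCATTCTGACTGCAA", "AGG", 21)
second = sketch_mnemonic("GCTTTTCATTCTGACTGCAA", "AGG", 21)
print(first, first == second)
```

Because the name depends only on the guide's own sequence, PAM and position, nothing else in the library can change it.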
The pam_offset is signed: negative means the PAM is upstream of the annotated gene start, which is the productive targeting window for bacterial CRISPRi. Positive values target inside the coding sequence. Filter on it directly:
cat guides.gff | bioino gff2table \
| awk -F'\t' 'NR==1 || ($NF+0 < 0 && $NF+0 > -300)' \
> upstream_guides.tsv
Output is standard GFF3 and loads as an annotation track in IGV and Artemis, which is useful for visually checking guide distribution across the chromosome before ordering.
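The awk filter above tests $NF, so it only works when pam_offset is the last column of the table. If the column order differs, selecting the column by name is safer. The sketch below does that with the standard csv module; the three-row table is hypothetical and stands in for real bioino gff2table output.

```python
import csv
import io


def upstream_rows(table, max_distance=300):
    """Yield rows whose PAM lies upstream of the gene start, within max_distance nt."""
    reader = csv.DictReader(table, delimiter="\t")
    for row in reader:
        offset = int(row["pam_offset"])
        if -max_distance < offset < 0:
            yield row


# Hypothetical three-row table standing in for `bioino gff2table` output.
example = io.StringIO(
    "Name\tpam_offset\tann_locus_tag\n"
    "guide_a\t-166\tb0001\n"
    "guide_b\t42\tb0001\n"
    "guide_c\t-950\tb0002\n"
)
kept = [row["Name"] for row in upstream_rows(example)]
print(kept)
```

To run it on a real table, pass an open file handle instead, for example open("first100.tsv", newline="").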
crispio generate finds every PAM site in the genome, extracts the adjacent spacer, and annotates everything in one pass.
crispio generate \
--pam Sth1 \
--max_length 20 \
--genome EcoMG1655-NC_000913.3.fasta \
--annotations EcoMG1655-NC_000913.3.gff3 \
--output guides.gff
For multi-chromosome genomes (chromosome + plasmids), pass a FASTA with multiple sequences. Each sequence is processed independently and guides are tagged with the correct chromosome identifier.
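The search step can be pictured with a short sketch: translate the IUPAC PAM into a regular expression, scan both strands, and take the bases immediately 5' of each PAM as the spacer. This is an illustration under stated assumptions (function names are hypothetical, the spacer length is fixed, and no annotation is done), not crispio's internal code. The test sequence is the one used in the Python API example.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_guides(genome: str, pam: str = "NGG", length: int = 20):
    """Yield (spacer, strand, pam_start) for every PAM with a full-length spacer.

    pam_start is a 0-based index along the strand that was searched.
    """
    # A lookahead lets overlapping PAM sites all be reported.
    pattern = re.compile("(?=(" + "".join(IUPAC[base] for base in pam) + "))")
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for match in pattern.finditer(seq):
            pam_start = match.start()
            if pam_start >= length:  # skip PAMs too close to the sequence start
                yield seq[pam_start - length:pam_start], strand, pam_start


genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
guides = list(find_guides(genome))
for spacer, strand, _ in guides:
    print(spacer, strand)
```

For a multi-sequence FASTA, the same function would simply be called once per record.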
Use --limit N for quick exploratory runs or to generate a capped sub-library:
crispio generate --pam Spy -g genome.fasta -a genome.gff3 --limit 500
One of the most useful things crispio does is take a published guide library and fully re-annotate it against your genome. It doesn't require matching coordinates or assemblies; it searches by sequence, so it works across strains.
If you have a TSV with a sequence column and a guide_name column:
cat published_library.tsv \
| bioino table2fasta --sequence sequence --name guide_name \
| crispio map \
--pam Spy \
--genome EcoMG1655-NC_000913.3.fasta \
--annotations EcoMG1655-NC_000913.3.gff3 \
> annotated_library.gff
Or from an existing FASTA of spacers:
crispio map \
published_spacers.fasta \
--pam Spy \
--genome EcoMG1655-NC_000913.3.fasta \
--annotations EcoMG1655-NC_000913.3.gff3 \
> annotated_library.gff
Guides not found in the genome are reported to stderr and skipped; they never appear silently with wrong coordinates.
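Mapping by sequence can be sketched as an exact search for each spacer on both strands, keeping only hits that are immediately followed by a PAM. The function name, the hard-coded NGG pattern and the exact-match requirement are assumptions for illustration; crispio map may behave differently in detail. The genome and the first spacer come from the Python API example; the second spacer is made up to show the not-found path.

```python
import re
import sys

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def map_spacer(spacer, genome, pam_regex="[ACGT]GG"):
    """Return (start, strand) for each exact spacer match followed by a PAM.

    start is the 0-based index of the match on the strand that was searched.
    """
    hits = []
    # The lookahead requires the PAM right after the spacer without consuming
    # it, and allows overlapping matches to be found.
    pattern = re.compile("(?=" + re.escape(spacer) + pam_regex + ")")
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        hits.extend((match.start(), strand) for match in pattern.finditer(seq))
    return hits


genome = "CCCCCCCCCCCTTTTTTTTTTAAAAAAAAAATGATCGATCGATCGAGGAAAAAAAAAACCCCCCCCCCC"
library = {"known": "ATGATCGATCGATCG", "absent": "ACGTACGTACGTACG"}
for name, spacer in library.items():
    hits = map_spacer(spacer, genome)
    if hits:
        print(name, hits)
    else:
        # Unmatched guides are reported, not given made-up coordinates.
        print(f"{name}: not found, skipped", file=sys.stderr)
```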
crispio offtarget flags pairs of guides that share a 4 nt PAM-proximal seed sequence and differ by ≤ 4 mismatches elsewhere. These are candidates for unintended cross-silencing.
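The rule described above can be sketched in a few lines. This is an illustration, not the crispio implementation: it assumes spacers are written 5' to 3' with the PAM-proximal seed as the last 4 nt, compares only equal-length spacers, and counts mismatches as a Hamming distance over the non-seed bases. The function name and the four example guides are hypothetical.

```python
from itertools import combinations


def crosstalk_pairs(guides, seed_length=4, max_mismatches=4):
    """Return (id_a, id_b, mismatches) for guide pairs with a shared seed.

    guides maps guide IDs to spacers of equal length, written 5'->3' with
    the PAM-proximal end last (an assumption of this sketch).
    """
    flagged = []
    for (id_a, seq_a), (id_b, seq_b) in combinations(guides.items(), 2):
        if len(seq_a) != len(seq_b):
            continue  # this sketch only compares equal-length spacers
        if seq_a[-seed_length:] != seq_b[-seed_length:]:
            continue  # seeds differ, so not a candidate
        mismatches = sum(
            a != b for a, b in zip(seq_a[:-seed_length], seq_b[:-seed_length])
        )
        if mismatches <= max_mismatches:
            flagged.append((id_a, id_b, mismatches))
    return flagged


# Hypothetical guides: g2 differs from g1 at two non-seed positions.
library = {
    "g1": "GCTTTTCATTCTGACTGCAA",
    "g2": "GCTTAACATTCTGACTGCAA",
    "g3": "ATATATATATATATATGCAA",  # same seed, too many mismatches
    "g4": "GCTTTTCATTCTGACTGCAT",  # different seed
}
flagged = crosstalk_pairs(library)
print(flagged)
```

The all-pairs comparison is quadratic in library size, which is fine for a sketch but would need seed-based grouping for genome-scale libraries.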
# Check a library against itself
crispio offtarget --gff2 guides.gff < guides.gff > checked.gff
Flagged guides get a crosstalk attribute listing the IDs and distances of matches. Check two libraries against each other, for example to confirm that guides from one experiment won't interfere with another:
crispio offtarget --gff2 library_b.gff < library_a.gff > crosstalk.gff
crispio featurize appends sequence-based features for downstream activity prediction, prefixed feat_ in the output.
cat guides.gff | crispio featurize --scaffold Sth1 > guides_featurized.gff
Available features:
>>> from crispio import get_features
>>> get_features()
['on_nontemplate_strand', 'context_up2', 'context_down2', 'context_up_autocorr',
'pam_n', 'pam_def', 'pam_gc', 'pam_autocorr', 'pam_scaff_corr',
'guide_purine', 'guide_gc', 'seed_seq', 'guide_start3', 'guide_end3',
 'guide_autocorr', 'guide_scaff_corr']
--scaffold takes a name (Sth1, PerturbSeq) or a raw scaffold sequence. Use the scaffold for the Cas9 you are working with, because the correlation-based features depend on it.
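To give a feel for what the simpler features measure, the sketch below computes two of them by hand. The definitions are assumptions (GC content as the fraction of G and C, purine content as the fraction of A and G, both formatted to three decimals); crispio's exact definitions are in its API reference, and the helper names here are hypothetical.

```python
def fraction(sequence, alphabet):
    """Fraction of bases in sequence that belong to alphabet, as a 3-decimal string."""
    sequence = sequence.upper()
    return f"{sum(base in alphabet for base in sequence) / len(sequence):.3f}"


def simple_features(guide_sequence):
    # Assumed definitions: GC content counts G and C; purine content counts A and G.
    return {
        "feat_guide_gc": fraction(guide_sequence, "GC"),
        "feat_guide_purine": fraction(guide_sequence, "AG"),
    }


features = simple_features("GCTTTTCATTCTGACTGCAA")
print(features)
```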
All subcommands read from stdin and write to stdout. Informational messages go to stderr only, so they never appear in your data stream. Full pipelines with no intermediate files:
# Generate -> featurize -> table
crispio generate --pam Spy -g genome.fasta -a genome.gff3 \
| crispio featurize --scaffold Sth1 \
| bioino gff2table \
> full_library.tsv
# Map a published library -> off-target check -> table
cat published_spacers.fasta \
| crispio map --pam Spy -g genome.fasta -a genome.gff3 \
| crispio offtarget -2 <(crispio generate --pam Spy -g genome.fasta -a genome.gff3) \
| bioino gff2table \
> mapped_checked.tsv
Generate guides de novo:
from crispio import GuideLibrary
genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
gl = GuideLibrary.from_generating(genome=genome, pam_search="NGG")
for match_collection in gl:
    for guide in match_collection:
        print(guide)
# ATACCGTTTTTTTAAAAAAA
# TATCCGTTTTTTTAAAAAAA
Map known sequences to a genome:
from crispio import GuideLibrary
genome = "CCCCCCCCCCCTTTTTTTTTTAAAAAAAAAATGATCGATCGATCGAGGAAAAAAAAAACCCCCCCCCCC"
gl = GuideLibrary.from_mapping(
    guide_seq=["ATGATCGATCGATCG"],
    genome=genome,
    pam_search="NGG",
)
for collection in gl:
    for match in collection:
        print(match.guide_seq, match.pam_start, match.reverse)
Calculate features:
from crispio import featurize
from crispio.utils import sequences
# gff_line is a bioino.GffLine with guide_sequence, pam_sequence, etc.
scaffold_seq = sequences.scaffolds["Sth1"]
features = featurize(gff_line, scaffold=scaffold_seq)
# {"feat_guide_gc": "0.500", "feat_seed_seq": "GATCG", ...}
Pass the scaffold sequence, not the name, to featurize. Use sequences.scaffolds["Sth1"] to retrieve it.
Full API reference: crispio.readthedocs.io
Requires Python ≥ 3.10.
pip install crispio
Verify:
crispio --help
From source:
git clone https://github.com/scbirlab/crispio.git
cd crispio
pip install -e .
Built-in PAM names for --pam:
| Name | IUPAC | Cas9 |
|---|---|---|
| Spy | NGGN | SpCas9 (S. pyogenes) |
| Sth1 | NNRGVAN | StCas9-1 (S. thermophilus) |
| Sau | NGRRT | SaCas9 (S. aureus) |
| Nme | NNNNGAT | NmeCas9 (N. meningitidis) |
Built-in scaffold names for --scaffold:
| Name | Description |
|---|---|
| Sth1 | StCas9-1 scaffold |
| PerturbSeq | Perturb-seq optimised scaffold |
Any IUPAC sequence can be passed directly to either argument.
- Bugs and feature requests: issue tracker
- Full API reference: crispio.readthedocs.io