Compression flag causes error with --cluster-mode 2 but not --cluster-mode 0 #1073

@chasemc

To reproduce, run the following on a fresh c6id.large instance:

wget https://dev.mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
tar -xvzf mmseqs-linux-avx2.tar.gz
export PATH=$PWD/mmseqs/bin:$PATH
rm mmseqs-linux-avx2.tar.gz

wget https://dl.secondarymetabolites.org/mibig/mibig_prot_seqs_4.0.fasta

mmseqs createdb  mibig_prot_seqs_4.0.fasta mibig_db

mkdir -p example/

 mmseqs cluster mibig_db "example/example" tmp \
        --compressed 1 \
        --cluster-mode 2 

This fails. The full log is under "Error output" below; in brief:

Clustering mode: Greedy
9036 ZSTD_decompressStream Corrupted block detected
Error: Pre-clustering step died
Error: linclust died

However, the same command with --cluster-mode 0:

 mmseqs cluster mibig_db "example/example" tmp \
        --compressed 1 \
        --cluster-mode 0

works.

This happens with both the prebuilt binary and mmseqs compiled from source.
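
For convenience, the two runs above can be wrapped in a small helper so the modes are compared side by side. This is only a sketch: the function name compare_cluster_modes, the MMSEQS override variable, and the output/log file names are hypothetical and not part of mmseqs. It assumes mibig_db was already created with mmseqs createdb as shown above, and it uses only the flags from the commands in this report.

```shell
# Hypothetical helper (not part of mmseqs): run `mmseqs cluster --compressed 1`
# once per requested cluster mode and print whether each run exited cleanly.
# MMSEQS is an assumed override for the binary name; it defaults to `mmseqs`.
compare_cluster_modes() {
    local db="$1" outdir="$2"
    shift 2
    local mode status
    mkdir -p "${outdir}"
    for mode in "$@"; do
        # Use a separate tmp directory per mode so runs do not share intermediates.
        rm -rf "${outdir}/tmp_mode${mode}"
        if "${MMSEQS:-mmseqs}" cluster "${db}" "${outdir}/clu_mode${mode}" \
                "${outdir}/tmp_mode${mode}" \
                --compressed 1 \
                --cluster-mode "${mode}" > "${outdir}/mode${mode}.log" 2>&1; then
            status="ok"
        else
            status="FAILED"
        fi
        echo "cluster-mode ${mode}: ${status}"
    done
}
```

Calling compare_cluster_modes mibig_db example 0 2 runs both modes and leaves one log per mode in example/ for inspection.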

Error output:

Create directory tmp
cluster mibig_db example/example tmp --compressed 1 --cluster-mode 2
MMseqs Version:                         bd01c2229f027d8d8e61947f44d11ef1a7669212
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix                aa:VTML80.out,nucl:nucleotide.out
Sensitivity                             4
k-mer length                            0
Target search mode                      0
k-score                                 seq:2147483647,prof:2147483647
Alphabet size                           aa:21,nucl:5
Max sequence length                     65535
Max results per query                   20
Split database                          0
Split mode                              2
Split memory limit                      0
Coverage threshold                      0.8
Coverage mode                           0
Compositional bias                      1
Compositional bias scale                1
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                0
Mask lower letter repeating N times     0
Minimum diagonal score                  15
Selected taxa
Include identical seq. id.              false
Spaced k-mers                           1
Preload mode                            0
Pseudo count a                          substitution:1.100,context:1.400
Pseudo count b                          substitution:4.100,context:5.800
Spaced k-mer pattern
Local temporary path
Threads                                 2
Compressed                              1
Verbosity                               3
Add backtrace                           false
Alignment mode                          3
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Max reject                              2147483647
Max accept                              2147483647
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Correlation score weight                0
Gap open cost                           aa:11,nucl:5
Gap extension cost                      aa:1,nucl:2
Zdrop                                   40
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Cluster mode                            2
Max connected component depth           1000
Similarity type                         2
Weight file name
Cluster Weight threshold                0.9
Set mode                                false
Single step clustering                  false
Cascaded clustering steps               3
Cluster reassign                        false
Remove temporary files                  false
Force restart with latest tmp           false
MPI runner
k-mers per sequence                     21
Scale k-mers per sequence               aa:0.000,nucl:0.200
Adjust k-mer length                     false
Shift hash                              67
Include only extendable                 false
Skip repeating k-mers                   false

Set cluster sensitivity to -s 6.000000
Set cluster iterations to 3
linclust mibig_db tmp/12627170530073326854/clu_redundancy tmp/12627170530073326854/linclust --cluster-mode 2 --max-iterations 1000 --similarity-type 2 --threads 2 --compressed 1 -v 3 --cluster-weight-threshold 0.9 --set-mode 0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0--pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --alph-size aa:13,nucl:5 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --mask-n-repeat 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0

kmermatcher mibig_db tmp/12627170530073326854/linclust/7507599336006465408/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --mask-n-repeat 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 2 --compressed 1 -v 3 --cluster-weight-threshold 0.9

kmermatcher mibig_db tmp/12627170530073326854/linclust/7507599336006465408/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --mask-n-repeat 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 2 --compressed 1 -v 3 --cluster-weight-threshold 0.9

Database size: 46987 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Generate k-mers list for 1 split
[=================================================================] 100.00% 46.99K 0s 621ms
Sort kmer 0h 0m 0s 97ms
Sort by rep. sequence 0h 0m 0s 18ms
Time for fill: 0h 0m 0s 11ms
Time for merging to pref: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 779ms
rescorediagonal mibig_db mibig_db tmp/12627170530073326854/linclust/7507599336006465408/pref tmp/12627170530073326854/linclust/7507599336006465408/pref_rescore1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 1 -v 3

[=================================================================] 100.00% 46.99K 0s 48ms
Time for merging to pref_rescore1: 0h 0m 0s 9ms
Time for processing: 0h 0m 0s 70ms
clust mibig_db tmp/12627170530073326854/linclust/7507599336006465408/pref_rescore1 tmp/12627170530073326854/linclust/7507599336006465408/pre_clust --cluster-mode 2 --max-iterations 1000 --similarity-type 2 --threads 2 --compressed 1 -v 3 --cluster-weight-threshold 0.9 --set-mode 0

Clustering mode: Greedy
9036 ZSTD_decompressStream Corrupted block detected
Error: Pre-clustering step died
Error: linclust died
