Segfault when running result2profile when converting clustering result to profile #1077

@prototaxites

I downloaded the BOLD (https://boldsystems.org/) FASTA release, filtered it to retain only COI-5P sequences longer than 300 bp, and then clustered it with mmseqs linclust:

mmseqs createdb bold.COI-5P.rmdup.m300.fa.gz BOLD.COI-5P.dedup/db
mmseqs linclust BOLD.COI-5P.dedup/db BOLD.COI-5P.clustered/db tmp --threads 32 --min-seq-id 0.99 -c 0.95
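
The filtering step itself is not shown above. For anyone reproducing it, here is a minimal sketch of one way to do such a filter. It assumes that the marker name appears literally in each FASTA header and that length is counted over every sequence character (gap characters included). The function name `filter_fasta` is hypothetical, and these are not necessarily the commands that were actually used.

```shell
# Hypothetical helper: keep FASTA records whose header contains a marker
# name and whose sequence is strictly longer than a minimum length.
# Assumption: the marker (e.g. COI-5P) appears literally in the header line.
# Reads FASTA on stdin, writes the retained records (one line per sequence) on stdout.
filter_fasta() {
    marker="$1"
    minlen="$2"
    awk -v marker="$marker" -v minlen="$minlen" '
        # Print the buffered record if it passes both filters.
        function emit() {
            if (header != "" && index(header, marker) > 0 && length(seq) > minlen)
                printf "%s\n%s\n", header, seq
        }
        # A new header closes the previous record and starts a new one.
        /^>/ { emit(); header = $0; seq = ""; next }
        # Sequence lines: strip whitespace and append to the current record.
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        END { emit() }
    '
}
```

Hypothetical usage, with placeholder file names: `gzip -dc input.fa.gz | filter_fasta COI-5P 300 | gzip > filtered.fa.gz`. Multi-line sequences are joined before the length check, so line wrapping in the input does not affect the result.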

Now I am trying to convert the cluster result to a profile format for searches, but I get a segfault:

mmseqs createsubdb BOLD.COI-5P.clustered/db BOLD.COI-5P.dedup/db BOLD.COI-5P.clustered/repSeqdb
mmseqs createsubdb BOLD.COI-5P.clustered/db BOLD.COI-5P.dedup/db_ BOLD.COI-5P.clustered/repSeqdb_h
mmseqs result2profile BOLD.COI-5P.clustered/repSeqdb BOLD.COI-5P.dedup/db BOLD.COI-5P.clustered/db BOLD.COI-5P.clustered.profile/db --threads 16

Which gives:

result2profile BOLD.COI-5P.clustered/repSeqdb BOLD.COI-5P.dedup/db BOLD.COI-5P.clustered/db BOLD.COI-5P.clustered.profile/db --threads 16

MMseqs Version:           	01683a607f83878e95436632d73e1d7d9ae30955
Substitution matrix       	aa:blosum62.out,nucl:nucleotide.out
E-value threshold         	0.001
Mask profile              	1
Profile E-value threshold 	0.001
Compositional bias        	1
Compositional bias scale  	1
Global sequence weighting 	false
Allow deletions           	false
Filter MSA                	1
Use filter only at N seqs 	0
Maximum seq. id. threshold	0.9
Minimum seq. id.          	0.0
Minimum score per column  	-20
Minimum coverage          	0
Select N most diverse seqs	1000
Pseudo count mode         	0
Pseudo count a            	substitution:1.100,context:1.400
Pseudo count b            	substitution:4.100,context:5.800
Preload mode              	0
Gap open cost             	aa:11,nucl:5
Gap extension cost        	aa:1,nucl:2
Threads                   	16
Compressed                	0
Verbosity                 	3
Profile output mode       	0

Query database size: 6957508 type: Nucleotide
Target database size: 9478899 type: Nucleotide
fish: Job 1, '~/MMseqs2/build/bin/mmseqs resu…' terminated by signal SIGSEGV (Address boundary error)
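
As a quick sanity check on the database sizes reported above, the entry counts can be compared directly. This is a sketch only, assuming each MMseqs2 database has a plain-text `.index` file with one line per entry; `db_size` is a hypothetical helper, not an mmseqs command.

```shell
# Hypothetical helper: count the entries of an MMseqs2 database by counting
# the lines of its index file.
# Assumption: "<db>.index" is plain text with one line per entry.
db_size() {
    awk 'END { print NR }' "$1.index"
}
```

Run against `BOLD.COI-5P.clustered/db` and `BOLD.COI-5P.clustered/repSeqdb`, I would expect both counts to equal the reported query database size, and `BOLD.COI-5P.dedup/db` to equal the target database size. A mismatch would suggest that one of the createsubdb steps did not produce what result2profile expects.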

I have tried both the latest version on Conda and a build of the latest commit from source, to no avail. Am I doing something wrong, or is this a bug that needs fixing?
