bio-local-blast

gptomics/bio-local-blast

Data & Analytics

178

About

SKILL.md

bio-local-blast

gptomics/bio-local-blast

Data & Analytics

178

About

Run local BLAST searches using BLAST+ command-line tools...

SKILL.md

Version Compatibility

Reference examples tested with: NCBI BLAST+ 2.15+

Before using code patterns, verify installed versions match. If versions differ:

CLI: blastn -version then blastn -help to confirm flags
CLI: makeblastdb -help to confirm database build options

If a flag is unrecognized or behavior changes, introspect with -help and adapt the example to match the installed version rather than retrying.

Local BLAST

"Run BLAST locally for speed and control" -> Build or download a BLAST+ database, run the appropriate program with carefully chosen -task, masking, and thread settings, parse tabular output. Local BLAST is the right tool when remote is rate-limited or when the database must be reproducible (frozen).

The biggest mistakes are (a) using nt/nr without realizing they're >250 GB and grow weekly, (b) not building with -parse_seqids and then being unable to extract hit sequences with blastdbcmd, (c) using default blastn for cross-species when dc-megablast is correct, and (d) thinking -num_threads 32 will scale -- past ~16 threads BLAST is I/O bound.

CLI: makeblastdb, blastn/blastp, blastdbcmd, update_blastdb.pl (NCBI BLAST+)
Python: subprocess wrapper (preferred); Bio.Blast.Applications was deprecated and removed -- do not use

Installation

# conda (preferred)
conda install -c bioconda blast

# macOS
brew install blast

# Ubuntu
sudo apt install ncbi-blast+

# Verify
blastn -version    # NCBI BLAST+ 2.15+ expected
update_blastdb.pl --showall pretty | head

Database format: v5 vs v4

NCBI introduced BLAST database v5 in BLAST+ 2.10 (2020). v5 includes taxonomy indexing directly in the database files, enabling -taxids and -taxidlist filtering without a companion file. v4 databases require taxonomy4blast.sqlite3 to be present and discoverable.

Feature	v4	v5
Default for prebuilt NCBI dbs	No (legacy)	Yes (since 2020)
`-taxids`, `-taxidlist` support	No	Yes
`blastdbcmd -taxids`	No	Yes
New `-info` output fields	No	Yes

update_blastdb.pl downloads v5 by default. When building a database manually with makeblastdb, v5 format requires -blastdb_version 5. Always pass -blastdb_version 5 and -parse_seqids when building from scratch.

`makeblastdb` flag taxonomy

Flag	Effect	When
`-dbtype nucl` or `-dbtype prot`	Required	Always
`-parse_seqids`	Indexes accessions so `blastdbcmd -entry <acc>` works	Almost always (downstream extraction)
`-hash_index`	Speeds up extraction by accession	Large dbs
`-blastdb_version 5`	Use v5 format	Always
`-taxid 9606`	Single taxid for all seqs	Single-species DB
`-taxid_map file.tsv`	Per-sequence taxid mapping (seqidtaxid)	Multi-species DB
`-mask_data masking.asnb`	Apply precomputed soft-masking	Production pipelines
`-title "..."`	Free-text label	Cosmetic
`-out path/prefix`	DB file path prefix	Always

makeblastdb -in reference.fasta -dbtype nucl \
            -blastdb_version 5 \
            -parse_seqids \
            -hash_index \
            -title "Custom reference 2026-05" \
            -out custom_db

`-task` taxonomy (the most-misused BLAST setting)

For blastn, the -task flag picks among heuristics with different word sizes and gap parameters.

`-task`	Word	Gapped	Use case	Mistake to avoid
`megablast` (default)	28	linear	>=95% identity, intra-species, primer hits, contamination check	Used for cross-species and misses everything
`dc-megablast`	11 (discontiguous)	yes	Cross-species mRNA homology	Underused -- this is what `blastn` "should" be for cross-species
`blastn`	11	yes	General sensitive DNA	Slower than dc-megablast for same job
`blastn-short`	7	yes	Queries <50 nt (primers, small RNAs)	Default megablast can't seed at length 7
`rmblastn`	11	yes	Repeat masking; bundled with RepeatModeler	Specialized

For blastp:

`-task`	Word	Use case
`blastp` (default)	3	General protein similarity
`blastp-fast`	6	Faster, less sensitive
`blastp-short`	2	Peptides <30 aa, with PAM30 + word_size=2 typical

Soft vs hard masking

Setting	Effect on seed	Effect on extension	Effect on score
`-soft_masking true` (default for several tasks)	Skip masked positions when seeding	Allow extension through masked	Score includes masked positions
`-soft_masking false` + `-dust yes` / `-seg yes`	Skip masked positions when seeding	Skip masked positions in extension	Score excludes masked positions
Hard-mask in input FASTA (N or X)	Hard exclusion everywhere	Hard exclusion	Treated as mismatches

Soft masking is correct for almost all cases. Hard masking creates artificial mismatches at masked boundaries and can split true alignments. The exception: searching against a database of repeats explicitly, where hard masking on the query is the right choice.

Thread scaling

BLAST+ parallelizes per-query (with -num_threads) but is I/O bound past ~16 threads on most hardware. For >100,000 query batches the better answer is splitting the input FASTA into N chunks and running N parallel blastn invocations -- this saturates CPUs better than -num_threads 64.

Threads	Typical speedup vs single	Notes
1-8	Near-linear	Default sweet spot
8-16	Sub-linear (1.5-2x over 8)	Useful on big SMP boxes
16-32	Diminishing returns	I/O bound for most DBs
32+	Often slower	Cache thrash + I/O contention

For massive workflows, prefer DIAMOND (Buchfink et al. 2021 Nat Methods 18:366) or MMseqs2 (Steinegger & Soding 2017 Nat Biotechnol 35:1026) -- 100-10,000x faster than BLASTP at comparable sensitivity. See remote-homology skill.

Output format reference (`-outfmt`)

`-outfmt`	Description	Use
0	Pairwise (default; human-readable)	Debugging, inspection
5	XML	Programmatic parsing (Bio.SearchIO)
6	Tabular (no header)	Most pipelines
7	Tabular with comment headers	Self-documenting
11	ASN.1 binary	Re-parse with later versions

Custom tabular fields:

blastn -query q.fa -db db -outfmt "6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore staxids sscinames stitle"

Field key fields for analysis:

pident = percent identity over the HSP (NOT the query); for query-level, use qcovhsp
qcovs = total query coverage by all HSPs of this subject (the "coverage" most users want)
qcovhsp = query coverage by best HSP alone (use when there's only one HSP per hit)
staxids = taxonomy IDs (v5 only); critical for any "what species" workflow

Prebuilt NCBI databases via `update_blastdb.pl`

# List available
update_blastdb.pl --showall pretty | grep -E 'refseq|swissprot|nt|nr'

# Download (with decompress)
update_blastdb.pl --decompress refseq_select_rna

# Download specific volume of split database
update_blastdb.pl --decompress refseq_protein

# Download with parallelism
update_blastdb.pl --decompress --num_threads 4 refseq_select_rna

Sizes (approximate, 2026):

refseq_select_rna: ~5 GB
refseq_protein: ~30 GB
swissprot: <1 GB
nt: ~250 GB
nr: ~300 GB

For most use cases, refseq_select_* is the right starting point. nt/nr are storage-heavy and reproducibility-hostile.

Code patterns

Build and search a custom protein database

Goal: Build a BLAST+ protein database from a custom FASTA and search against it.

Approach: makeblastdb with v5 + parse_seqids + hash_index; blastp with explicit outfmt.

Reference (NCBI BLAST+ 2.15+):

#!/bin/bash
# Reference: NCBI BLAST+ 2.15+ | Verify API if version differs

REF=reference_proteins.fasta
DB=ref_prot_db
QUERY=query.fasta
OUT=hits.tsv

makeblastdb -in "$REF" -dbtype prot \
            -blastdb_version 5 -parse_seqids -hash_index \
            -title "$REF $(date +%Y-%m-%d)" \
            -out "$DB"

blastp -query "$QUERY" -db "$DB" \
       -evalue 1e-10 \
       -num_threads 8 \
       -max_target_seqs 500 \
       -outfmt "6 qseqid sseqid pident length qcovs evalue bitscore stitle" \
       -out "$OUT"

# Top hit per query by bit-score (column 7)
sort -k1,1 -k7,7gr "$OUT" | awk '!seen[$1]++' > top_hit_per_query.tsv

Cross-species DNA with dc-megablast

blastn -query mouse_cdna.fa -db human_refseq_rna \
       -task dc-megablast \
       -word_size 11 \
       -evalue 1e-10 \
       -outfmt "6 qseqid sseqid pident length qcovs evalue bitscore" \
       -num_threads 8 \
       -out cross_species.tsv

Short primer search

blastn -query primers.fa -db genome_db \
       -task blastn-short \
       -word_size 7 \
       -evalue 1000 \
       -outfmt 6 \
       -out primer_hits.tsv

Taxonomy-filtered search (BLAST v5 only)

# Restrict to specific taxids
blastp -query query.fa -db nr \
       -taxids 9606,10090,10116 \
       -outfmt "6 qseqid sseqid staxids sscinames evalue bitscore" \
       -out mammalian_hits.tsv

# Or to a taxid subtree (NCBI BLAST+ 2.13+)
echo 9606 > human_only.txt
blastp -query query.fa -db nr -taxidlist human_only.txt -outfmt 6 -out human_hits.tsv

Extract subject sequences for top hits

# Requires database built with -parse_seqids
cut -f2 top_hit_per_query.tsv | sort -u > hit_accessions.txt
blastdbcmd -db ref_prot_db -entry_batch hit_accessions.txt -out hits.fasta

# Pull a range of a sequence
blastdbcmd -db genome_db -entry NC_000001.11 -range 1000000-1001000 -out region.fa

Reciprocal best hit (RBH) for ortholog candidates

See ortholog-inference skill for the principled treatment. Quick version:

blastp -query A.fa -db B_db -outfmt 6 -evalue 1e-5 -num_threads 8 \
       -max_target_seqs 5 -out A_vs_B.tsv
blastp -query B.fa -db A_db -outfmt 6 -evalue 1e-5 -num_threads 8 \
       -max_target_seqs 5 -out B_vs_A.tsv

# Best forward + reverse, intersect
awk '!seen[$1]++ {print $1"\t"$2}' A_vs_B.tsv | sort > A_best
awk '!seen[$1]++ {print $1"\t"$2}' B_vs_A.tsv | sort > B_best
awk 'NR==FNR{a[$1]=$2; next} a[$2]==$1' A_best B_best > rbh.tsv

This works but does NOT handle paralog mis-pairs from gene duplication; for that use OrthoFinder or OMA (in ortholog-inference).

Python wrapper with version pinning

import subprocess
import shutil


def require_tool(name, min_version=None):
    if not shutil.which(name):
        raise RuntimeError(f'{name} not on PATH')
    out = subprocess.run([name, '-version'], capture_output=True, text=True)
    print(f'  {out.stdout.strip().splitlines()[0]}')


def run_blast(query, db, out, program='blastp', evalue=1e-10, threads=8, hitlist=500):
    require_tool(program)
    cmd = [program, '-query', query, '-db', db, '-out', out,
           '-evalue', str(evalue),
           '-num_threads', str(threads),
           '-max_target_seqs', str(hitlist),
           '-outfmt', '6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore stitle']
    subprocess.run(cmd, check=True)


def parse_tabular(path):
    cols = ['qseqid', 'sseqid', 'pident', 'length', 'qcovs', 'qcovhsp', 'evalue', 'bitscore', 'stitle']
    rows = []
    with open(path) as f:
        for line in f:
            vals = line.rstrip('\n').split('\t')
            d = dict(zip(cols, vals))
            for k in ('pident', 'qcovs', 'qcovhsp', 'evalue', 'bitscore'):
                d[k] = float(d[k])
            d['length'] = int(d['length'])
            rows.append(d)
    return rows

Failure modes

`nt`/`nr` size shock

Trigger: update_blastdb.pl --decompress nt without realizing the size.
Mechanism: nt is ~250 GB compressed, ~1 TB indexed.
Symptom: Disk fills mid-download; partial DB unusable.
Fix: Use refseq_select for most workflows; only pull nt/nr with intent and >1 TB free.

Missing `-parse_seqids`

Trigger: Built DB without -parse_seqids; later try blastdbcmd -entry.
Mechanism: Without the parsed index, blastdbcmd can't look up by accession.
Symptom: Error: ... not found in database.
Fix: Rebuild with -parse_seqids (cheap if FASTA still on disk).

Wrong `-task` for the question

Trigger: Default blastn for cross-species mRNA (word=11 but ungapped seeding).
Mechanism: Discontiguous seed (dc-megablast) is much more sensitive across species.
Symptom: Far fewer hits than the question warrants.
Fix: Use -task dc-megablast for cross-species; -task megablast only for >=95% identity.

Thread saturation

Trigger: -num_threads 64 on a 32-core box.
Mechanism: I/O bound past ~16 threads; cache thrash hurts past CPU count.
Symptom: No speedup or slowdown.
Fix: Cap at 8-16; split FASTA and run parallel processes instead for very large batches.

v4 database, expecting v5 features

Trigger: Old prebuilt DB; -taxids flag returns "Taxonomy database not available".
Mechanism: v4 needs taxonomy4blast.sqlite3 companion; v5 has taxonomy indexed in DB.
Symptom: Taxonomy filtering silently no-ops or errors.
Fix: Re-download with update_blastdb.pl --decompress (gets v5); or use v5 explicitly when building.

Soft-masking confusion

Trigger: Hard-masking input (replacing repeats with N or X) instead of using -dust/-seg.
Mechanism: Hard-mask creates artificial mismatches at boundaries.
Symptom: True alignments split into multiple short HSPs.
Fix: Pass unmasked FASTA + soft-mask via -soft_masking true + -dust yes/-seg yes.

`max_target_seqs` truncation

Trigger: -max_target_seqs 10 (Shah et al. 2019 Bioinformatics 35:1613).
Mechanism: Early termination, not top-N filter.
Symptom: Different top-10 than -max_target_seqs 500 + post-filter.
Fix: Set -max_target_seqs large (500+); filter top N in awk/Python.

Common errors

Error / symptom	Cause	Solution
`BLAST Database error`	DB path wrong, or alias missing	`blastdbcmd -db <db> -info` to confirm
`Error: entry not found`	Built without `-parse_seqids`	Rebuild
Taxonomy filter no-op	v4 DB	Upgrade to v5
Threads >16 not faster	I/O bound	Split input + parallel invocations
`nt` download fills disk	Database is huge	Use refseq_select
`Sequence too short`	Query < word_size	Use `-task blastn-short` (word=7)
Out of memory	Single large query	Reduce `-num_threads`, split query

References

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402.
Shah N, Nute MG, Warnow T, Pop M. (2019) Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. Bioinformatics 35:1613-1614.
Boratyn GM, Camacho C, Cooper PS, et al. (2013) BLAST: a more efficient report with usability improvements. Nucleic Acids Res 41:W29-W33.

Related Skills

blast-searches - Remote BLAST against NCBI servers
remote-homology - PSI-BLAST, jackhmmer, HHblits, MMseqs2, DIAMOND, Foldseek for distant homology
ortholog-inference - Reciprocal best hit, OrthoFinder, OMA for ortholog calls
sequence-io/read-sequences - Load query/reference FASTA files
batch-downloads - Download large reference FASTA sets before makeblastdb

About

SKILL.md

About

Run local BLAST searches using BLAST+ command-line tools...

SKILL.md

Version Compatibility

Reference examples tested with: NCBI BLAST+ 2.15+

Before using code patterns, verify installed versions match. If versions differ:

CLI: blastn -version then blastn -help to confirm flags
CLI: makeblastdb -help to confirm database build options

If a flag is unrecognized or behavior changes, introspect with -help and adapt the example to match the installed version rather than retrying.

Local BLAST

CLI: makeblastdb, blastn/blastp, blastdbcmd, update_blastdb.pl (NCBI BLAST+)
Python: subprocess wrapper (preferred); Bio.Blast.Applications was deprecated and removed -- do not use

Installation

# conda (preferred)
conda install -c bioconda blast

# macOS
brew install blast

# Ubuntu
sudo apt install ncbi-blast+

# Verify
blastn -version    # NCBI BLAST+ 2.15+ expected
update_blastdb.pl --showall pretty | head

Database format: v5 vs v4

Feature	v4	v5
Default for prebuilt NCBI dbs	No (legacy)	Yes (since 2020)
`-taxids`, `-taxidlist` support	No	Yes
`blastdbcmd -taxids`	No	Yes
New `-info` output fields	No	Yes

`makeblastdb` flag taxonomy

Flag	Effect	When
`-dbtype nucl` or `-dbtype prot`	Required	Always
`-parse_seqids`	Indexes accessions so `blastdbcmd -entry <acc>` works	Almost always (downstream extraction)
`-hash_index`	Speeds up extraction by accession	Large dbs
`-blastdb_version 5`	Use v5 format	Always
`-taxid 9606`	Single taxid for all seqs	Single-species DB
`-taxid_map file.tsv`	Per-sequence taxid mapping (seqidtaxid)	Multi-species DB
`-mask_data masking.asnb`	Apply precomputed soft-masking	Production pipelines
`-title "..."`	Free-text label	Cosmetic
`-out path/prefix`	DB file path prefix	Always

makeblastdb -in reference.fasta -dbtype nucl \
            -blastdb_version 5 \
            -parse_seqids \
            -hash_index \
            -title "Custom reference 2026-05" \
            -out custom_db

`-task` taxonomy (the most-misused BLAST setting)

For blastn, the -task flag picks among heuristics with different word sizes and gap parameters.

`-task`	Word	Gapped	Use case	Mistake to avoid
`megablast` (default)	28	linear	>=95% identity, intra-species, primer hits, contamination check	Used for cross-species and misses everything
`dc-megablast`	11 (discontiguous)	yes	Cross-species mRNA homology	Underused -- this is what `blastn` "should" be for cross-species
`blastn`	11	yes	General sensitive DNA	Slower than dc-megablast for same job
`blastn-short`	7	yes	Queries <50 nt (primers, small RNAs)	Default megablast can't seed at length 7
`rmblastn`	11	yes	Repeat masking; bundled with RepeatModeler	Specialized

For blastp:

`-task`	Word	Use case
`blastp` (default)	3	General protein similarity
`blastp-fast`	6	Faster, less sensitive
`blastp-short`	2	Peptides <30 aa, with PAM30 + word_size=2 typical

Soft vs hard masking

Setting	Effect on seed	Effect on extension	Effect on score
`-soft_masking true` (default for several tasks)	Skip masked positions when seeding	Allow extension through masked	Score includes masked positions
`-soft_masking false` + `-dust yes` / `-seg yes`	Skip masked positions when seeding	Skip masked positions in extension	Score excludes masked positions
Hard-mask in input FASTA (N or X)	Hard exclusion everywhere	Hard exclusion	Treated as mismatches

Thread scaling

Threads	Typical speedup vs single	Notes
1-8	Near-linear	Default sweet spot
8-16	Sub-linear (1.5-2x over 8)	Useful on big SMP boxes
16-32	Diminishing returns	I/O bound for most DBs
32+	Often slower	Cache thrash + I/O contention

Output format reference (`-outfmt`)

`-outfmt`	Description	Use
0	Pairwise (default; human-readable)	Debugging, inspection
5	XML	Programmatic parsing (Bio.SearchIO)
6	Tabular (no header)	Most pipelines
7	Tabular with comment headers	Self-documenting
11	ASN.1 binary	Re-parse with later versions

Custom tabular fields:

blastn -query q.fa -db db -outfmt "6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore staxids sscinames stitle"

Field key fields for analysis:

pident = percent identity over the HSP (NOT the query); for query-level, use qcovhsp
qcovs = total query coverage by all HSPs of this subject (the "coverage" most users want)
qcovhsp = query coverage by best HSP alone (use when there's only one HSP per hit)
staxids = taxonomy IDs (v5 only); critical for any "what species" workflow

Prebuilt NCBI databases via `update_blastdb.pl`

# List available
update_blastdb.pl --showall pretty | grep -E 'refseq|swissprot|nt|nr'

# Download (with decompress)
update_blastdb.pl --decompress refseq_select_rna

# Download specific volume of split database
update_blastdb.pl --decompress refseq_protein

# Download with parallelism
update_blastdb.pl --decompress --num_threads 4 refseq_select_rna

Sizes (approximate, 2026):

refseq_select_rna: ~5 GB
refseq_protein: ~30 GB
swissprot: <1 GB
nt: ~250 GB
nr: ~300 GB

For most use cases, refseq_select_* is the right starting point. nt/nr are storage-heavy and reproducibility-hostile.

Code patterns

Build and search a custom protein database

Goal: Build a BLAST+ protein database from a custom FASTA and search against it.

Approach: makeblastdb with v5 + parse_seqids + hash_index; blastp with explicit outfmt.

Reference (NCBI BLAST+ 2.15+):

#!/bin/bash
# Reference: NCBI BLAST+ 2.15+ | Verify API if version differs

REF=reference_proteins.fasta
DB=ref_prot_db
QUERY=query.fasta
OUT=hits.tsv

makeblastdb -in "$REF" -dbtype prot \
            -blastdb_version 5 -parse_seqids -hash_index \
            -title "$REF $(date +%Y-%m-%d)" \
            -out "$DB"

blastp -query "$QUERY" -db "$DB" \
       -evalue 1e-10 \
       -num_threads 8 \
       -max_target_seqs 500 \
       -outfmt "6 qseqid sseqid pident length qcovs evalue bitscore stitle" \
       -out "$OUT"

# Top hit per query by bit-score (column 7)
sort -k1,1 -k7,7gr "$OUT" | awk '!seen[$1]++' > top_hit_per_query.tsv

Cross-species DNA with dc-megablast

blastn -query mouse_cdna.fa -db human_refseq_rna \
       -task dc-megablast \
       -word_size 11 \
       -evalue 1e-10 \
       -outfmt "6 qseqid sseqid pident length qcovs evalue bitscore" \
       -num_threads 8 \
       -out cross_species.tsv

Short primer search

blastn -query primers.fa -db genome_db \
       -task blastn-short \
       -word_size 7 \
       -evalue 1000 \
       -outfmt 6 \
       -out primer_hits.tsv

Taxonomy-filtered search (BLAST v5 only)

# Restrict to specific taxids
blastp -query query.fa -db nr \
       -taxids 9606,10090,10116 \
       -outfmt "6 qseqid sseqid staxids sscinames evalue bitscore" \
       -out mammalian_hits.tsv

# Or to a taxid subtree (NCBI BLAST+ 2.13+)
echo 9606 > human_only.txt
blastp -query query.fa -db nr -taxidlist human_only.txt -outfmt 6 -out human_hits.tsv

Extract subject sequences for top hits

# Requires database built with -parse_seqids
cut -f2 top_hit_per_query.tsv | sort -u > hit_accessions.txt
blastdbcmd -db ref_prot_db -entry_batch hit_accessions.txt -out hits.fasta

# Pull a range of a sequence
blastdbcmd -db genome_db -entry NC_000001.11 -range 1000000-1001000 -out region.fa

Reciprocal best hit (RBH) for ortholog candidates

See ortholog-inference skill for the principled treatment. Quick version:

blastp -query A.fa -db B_db -outfmt 6 -evalue 1e-5 -num_threads 8 \
       -max_target_seqs 5 -out A_vs_B.tsv
blastp -query B.fa -db A_db -outfmt 6 -evalue 1e-5 -num_threads 8 \
       -max_target_seqs 5 -out B_vs_A.tsv

# Best forward + reverse, intersect
awk '!seen[$1]++ {print $1"\t"$2}' A_vs_B.tsv | sort > A_best
awk '!seen[$1]++ {print $1"\t"$2}' B_vs_A.tsv | sort > B_best
awk 'NR==FNR{a[$1]=$2; next} a[$2]==$1' A_best B_best > rbh.tsv

This works but does NOT handle paralog mis-pairs from gene duplication; for that use OrthoFinder or OMA (in ortholog-inference).

Python wrapper with version pinning

import subprocess
import shutil


def require_tool(name, min_version=None):
    if not shutil.which(name):
        raise RuntimeError(f'{name} not on PATH')
    out = subprocess.run([name, '-version'], capture_output=True, text=True)
    print(f'  {out.stdout.strip().splitlines()[0]}')


def run_blast(query, db, out, program='blastp', evalue=1e-10, threads=8, hitlist=500):
    require_tool(program)
    cmd = [program, '-query', query, '-db', db, '-out', out,
           '-evalue', str(evalue),
           '-num_threads', str(threads),
           '-max_target_seqs', str(hitlist),
           '-outfmt', '6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore stitle']
    subprocess.run(cmd, check=True)


def parse_tabular(path):
    cols = ['qseqid', 'sseqid', 'pident', 'length', 'qcovs', 'qcovhsp', 'evalue', 'bitscore', 'stitle']
    rows = []
    with open(path) as f:
        for line in f:
            vals = line.rstrip('\n').split('\t')
            d = dict(zip(cols, vals))
            for k in ('pident', 'qcovs', 'qcovhsp', 'evalue', 'bitscore'):
                d[k] = float(d[k])
            d['length'] = int(d['length'])
            rows.append(d)
    return rows

Failure modes

`nt`/`nr` size shock

Trigger: update_blastdb.pl --decompress nt without realizing the size.
Mechanism: nt is ~250 GB compressed, ~1 TB indexed.
Symptom: Disk fills mid-download; partial DB unusable.
Fix: Use refseq_select for most workflows; only pull nt/nr with intent and >1 TB free.

Missing `-parse_seqids`

Trigger: Built DB without -parse_seqids; later try blastdbcmd -entry.
Mechanism: Without the parsed index, blastdbcmd can't look up by accession.
Symptom: Error: ... not found in database.
Fix: Rebuild with -parse_seqids (cheap if FASTA still on disk).

Wrong `-task` for the question

Trigger: Default blastn for cross-species mRNA (word=11 but ungapped seeding).
Mechanism: Discontiguous seed (dc-megablast) is much more sensitive across species.
Symptom: Far fewer hits than the question warrants.
Fix: Use -task dc-megablast for cross-species; -task megablast only for >=95% identity.

Thread saturation

Trigger: -num_threads 64 on a 32-core box.
Mechanism: I/O bound past ~16 threads; cache thrash hurts past CPU count.
Symptom: No speedup or slowdown.
Fix: Cap at 8-16; split FASTA and run parallel processes instead for very large batches.

v4 database, expecting v5 features

Trigger: Old prebuilt DB; -taxids flag returns "Taxonomy database not available".
Mechanism: v4 needs taxonomy4blast.sqlite3 companion; v5 has taxonomy indexed in DB.
Symptom: Taxonomy filtering silently no-ops or errors.
Fix: Re-download with update_blastdb.pl --decompress (gets v5); or use v5 explicitly when building.

Soft-masking confusion

Trigger: Hard-masking input (replacing repeats with N or X) instead of using -dust/-seg.
Mechanism: Hard-mask creates artificial mismatches at boundaries.
Symptom: True alignments split into multiple short HSPs.
Fix: Pass unmasked FASTA + soft-mask via -soft_masking true + -dust yes/-seg yes.

`max_target_seqs` truncation

Trigger: -max_target_seqs 10 (Shah et al. 2019 Bioinformatics 35:1613).
Mechanism: Early termination, not top-N filter.
Symptom: Different top-10 than -max_target_seqs 500 + post-filter.
Fix: Set -max_target_seqs large (500+); filter top N in awk/Python.

Common errors

Error / symptom	Cause	Solution
`BLAST Database error`	DB path wrong, or alias missing	`blastdbcmd -db <db> -info` to confirm
`Error: entry not found`	Built without `-parse_seqids`	Rebuild
Taxonomy filter no-op	v4 DB	Upgrade to v5
Threads >16 not faster	I/O bound	Split input + parallel invocations
`nt` download fills disk	Database is huge	Use refseq_select
`Sequence too short`	Query < word_size	Use `-task blastn-short` (word=7)
Out of memory	Single large query	Reduce `-num_threads`, split query

References

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402.
Shah N, Nute MG, Warnow T, Pop M. (2019) Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. Bioinformatics 35:1613-1614.
Boratyn GM, Camacho C, Cooper PS, et al. (2013) BLAST: a more efficient report with usability improvements. Nucleic Acids Res 41:W29-W33.

Related Skills

blast-searches - Remote BLAST against NCBI servers
remote-homology - PSI-BLAST, jackhmmer, HHblits, MMseqs2, DIAMOND, Foldseek for distant homology
ortholog-inference - Reciprocal best hit, OrthoFinder, OMA for ortholog calls
sequence-io/read-sequences - Load query/reference FASTA files
batch-downloads - Download large reference FASTA sets before makeblastdb

bio-local-blast

About

SKILL.md

bio-local-blast

About

SKILL.md

Version Compatibility

Local BLAST

Installation

Database format: v5 vs v4

makeblastdb flag taxonomy

-task taxonomy (the most-misused BLAST setting)

Soft vs hard masking

Thread scaling

Output format reference (-outfmt)

Prebuilt NCBI databases via update_blastdb.pl

Code patterns

Build and search a custom protein database

Cross-species DNA with dc-megablast

Short primer search

Taxonomy-filtered search (BLAST v5 only)

Extract subject sequences for top hits

Reciprocal best hit (RBH) for ortholog candidates

Python wrapper with version pinning

Failure modes

nt/nr size shock

Missing -parse_seqids

Wrong -task for the question

Thread saturation

v4 database, expecting v5 features

Soft-masking confusion

max_target_seqs truncation

Common errors

References

Related Skills

About

SKILL.md

About

SKILL.md

Version Compatibility

Local BLAST

Installation

Database format: v5 vs v4

makeblastdb flag taxonomy

-task taxonomy (the most-misused BLAST setting)

Soft vs hard masking

Thread scaling

Output format reference (-outfmt)

Prebuilt NCBI databases via update_blastdb.pl

Code patterns

Build and search a custom protein database

Cross-species DNA with dc-megablast

Short primer search

Taxonomy-filtered search (BLAST v5 only)

Extract subject sequences for top hits

Reciprocal best hit (RBH) for ortholog candidates

Python wrapper with version pinning

Failure modes

nt/nr size shock

Missing -parse_seqids

Wrong -task for the question

Thread saturation

v4 database, expecting v5 features

Soft-masking confusion

max_target_seqs truncation

Common errors

References

Related Skills

`makeblastdb` flag taxonomy

`-task` taxonomy (the most-misused BLAST setting)

Output format reference (`-outfmt`)

Prebuilt NCBI databases via `update_blastdb.pl`

`nt`/`nr` size shock

Missing `-parse_seqids`

Wrong `-task` for the question

`max_target_seqs` truncation

`makeblastdb` flag taxonomy

`-task` taxonomy (the most-misused BLAST setting)

Output format reference (`-outfmt`)

Prebuilt NCBI databases via `update_blastdb.pl`

`nt`/`nr` size shock

Missing `-parse_seqids`

Wrong `-task` for the question

`max_target_seqs` truncation