Run local BLAST searches using BLAST+ command-line tools...
Reference examples tested with: NCBI BLAST+ 2.15+
Before using code patterns, verify installed versions match. If versions differ:
blastn -version then blastn -help to confirm flagsmakeblastdb -help to confirm database build optionsIf a flag is unrecognized or behavior changes, introspect with -help and adapt the example to match the installed version rather than retrying.
"Run BLAST locally for speed and control" -> Build or download a BLAST+ database, run the appropriate program with carefully chosen -task, masking, and thread settings, parse tabular output. Local BLAST is the right tool when remote is rate-limited or when the database must be reproducible (frozen).
The biggest mistakes are (a) using nt/nr without realizing they're >250 GB and grow weekly, (b) not building with -parse_seqids and then being unable to extract hit sequences with blastdbcmd, (c) using default blastn for cross-species when dc-megablast is correct, and (d) thinking -num_threads 32 will scale -- past ~16 threads BLAST is I/O bound.
makeblastdb, blastn/blastp, blastdbcmd, update_blastdb.pl (NCBI BLAST+)subprocess wrapper (preferred); Bio.Blast.Applications was deprecated and removed -- do not use# conda (preferred)
conda install -c bioconda blast
# macOS
brew install blast
# Ubuntu
sudo apt install ncbi-blast+
# Verify
blastn -version # NCBI BLAST+ 2.15+ expected
update_blastdb.pl --showall pretty | head
NCBI introduced BLAST database v5 in BLAST+ 2.10 (2020). v5 includes taxonomy indexing directly in the database files, enabling -taxids and -taxidlist filtering without a companion file. v4 databases require taxonomy4blast.sqlite3 to be present and discoverable.
| Feature | v4 | v5 |
|---|---|---|
| Default for prebuilt NCBI dbs | No (legacy) | Yes (since 2020) |
-taxids, -taxidlist support |
No | Yes |
blastdbcmd -taxids |
No | Yes |
New -info output fields |
No | Yes |
update_blastdb.pl downloads v5 by default. When building a database manually with makeblastdb, v5 format requires -blastdb_version 5. Always pass -blastdb_version 5 and -parse_seqids when building from scratch.
makeblastdb flag taxonomy| Flag | Effect | When |
|---|---|---|
-dbtype nucl or -dbtype prot |
Required | Always |
-parse_seqids |
Indexes accessions so blastdbcmd -entry <acc> works |
Almost always (downstream extraction) |
-hash_index |
Speeds up extraction by accession | Large dbs |
-blastdb_version 5 |
Use v5 format | Always |
-taxid 9606 |
Single taxid for all seqs | Single-species DB |
-taxid_map file.tsv |
Per-sequence taxid mapping (seqid |
Multi-species DB |
-mask_data masking.asnb |
Apply precomputed soft-masking | Production pipelines |
-title "..." |
Free-text label | Cosmetic |
-out path/prefix |
DB file path prefix | Always |
makeblastdb -in reference.fasta -dbtype nucl \
-blastdb_version 5 \
-parse_seqids \
-hash_index \
-title "Custom reference 2026-05" \
-out custom_db
-task taxonomy (the most-misused BLAST setting)For blastn, the -task flag picks among heuristics with different word sizes and gap parameters.
-task |
Word | Gapped | Use case | Mistake to avoid |
|---|---|---|---|---|
megablast (default) |
28 | linear | >=95% identity, intra-species, primer hits, contamination check | Used for cross-species and misses everything |
dc-megablast |
11 (discontiguous) | yes | Cross-species mRNA homology | Underused -- this is what blastn "should" be for cross-species |
blastn |
11 | yes | General sensitive DNA | Slower than dc-megablast for same job |
blastn-short |
7 | yes | Queries <50 nt (primers, small RNAs) | Default megablast can't seed at length 7 |
rmblastn |
11 | yes | Repeat masking; bundled with RepeatModeler | Specialized |
For blastp:
-task |
Word | Use case |
|---|---|---|
blastp (default) |
3 | General protein similarity |
blastp-fast |
6 | Faster, less sensitive |
blastp-short |
2 | Peptides <30 aa, with PAM30 + word_size=2 typical |
| Setting | Effect on seed | Effect on extension | Effect on score |
|---|---|---|---|
-soft_masking true (default for several tasks) |
Skip masked positions when seeding | Allow extension through masked | Score includes masked positions |
-soft_masking false + -dust yes / -seg yes |
Skip masked positions when seeding | Skip masked positions in extension | Score excludes masked positions |
| Hard-mask in input FASTA (N or X) | Hard exclusion everywhere | Hard exclusion | Treated as mismatches |
Soft masking is correct for almost all cases. Hard masking creates artificial mismatches at masked boundaries and can split true alignments. The exception: searching against a database of repeats explicitly, where hard masking on the query is the right choice.
BLAST+ parallelizes per-query (with -num_threads) but is I/O bound past ~16 threads on most hardware. For >100,000 query batches the better answer is splitting the input FASTA into N chunks and running N parallel blastn invocations -- this saturates CPUs better than -num_threads 64.
| Threads | Typical speedup vs single | Notes |
|---|---|---|
| 1-8 | Near-linear | Default sweet spot |
| 8-16 | Sub-linear (1.5-2x over 8) | Useful on big SMP boxes |
| 16-32 | Diminishing returns | I/O bound for most DBs |
| 32+ | Often slower | Cache thrash + I/O contention |
For massive workflows, prefer DIAMOND (Buchfink et al. 2021 Nat Methods 18:366) or MMseqs2 (Steinegger & Soding 2017 Nat Biotechnol 35:1026) -- 100-10,000x faster than BLASTP at comparable sensitivity. See remote-homology skill.
-outfmt)-outfmt |
Description | Use |
|---|---|---|
| 0 | Pairwise (default; human-readable) | Debugging, inspection |
| 5 | XML | Programmatic parsing (Bio.SearchIO) |
| 6 | Tabular (no header) | Most pipelines |
| 7 | Tabular with comment headers | Self-documenting |
| 11 | ASN.1 binary | Re-parse with later versions |
Custom tabular fields:
blastn -query q.fa -db db -outfmt "6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore staxids sscinames stitle"
Field key fields for analysis:
pident = percent identity over the HSP (NOT the query); for query-level, use qcovhspqcovs = total query coverage by all HSPs of this subject (the "coverage" most users want)qcovhsp = query coverage by best HSP alone (use when there's only one HSP per hit)staxids = taxonomy IDs (v5 only); critical for any "what species" workflowupdate_blastdb.pl# List available
update_blastdb.pl --showall pretty | grep -E 'refseq|swissprot|nt|nr'
# Download (with decompress)
update_blastdb.pl --decompress refseq_select_rna
# Download specific volume of split database
update_blastdb.pl --decompress refseq_protein
# Download with parallelism
update_blastdb.pl --decompress --num_threads 4 refseq_select_rna
Sizes (approximate, 2026):
refseq_select_rna: ~5 GBrefseq_protein: ~30 GBswissprot: <1 GBnt: ~250 GBnr: ~300 GBFor most use cases, refseq_select_* is the right starting point. nt/nr are storage-heavy and reproducibility-hostile.
Goal: Build a BLAST+ protein database from a custom FASTA and search against it.
Approach: makeblastdb with v5 + parse_seqids + hash_index; blastp with explicit outfmt.
Reference (NCBI BLAST+ 2.15+):
#!/bin/bash
# Reference: NCBI BLAST+ 2.15+ | Verify API if version differs
REF=reference_proteins.fasta
DB=ref_prot_db
QUERY=query.fasta
OUT=hits.tsv
makeblastdb -in "$REF" -dbtype prot \
-blastdb_version 5 -parse_seqids -hash_index \
-title "$REF $(date +%Y-%m-%d)" \
-out "$DB"
blastp -query "$QUERY" -db "$DB" \
-evalue 1e-10 \
-num_threads 8 \
-max_target_seqs 500 \
-outfmt "6 qseqid sseqid pident length qcovs evalue bitscore stitle" \
-out "$OUT"
# Top hit per query by bit-score (column 7)
sort -k1,1 -k7,7gr "$OUT" | awk '!seen[$1]++' > top_hit_per_query.tsv
blastn -query mouse_cdna.fa -db human_refseq_rna \
-task dc-megablast \
-word_size 11 \
-evalue 1e-10 \
-outfmt "6 qseqid sseqid pident length qcovs evalue bitscore" \
-num_threads 8 \
-out cross_species.tsv
blastn -query primers.fa -db genome_db \
-task blastn-short \
-word_size 7 \
-evalue 1000 \
-outfmt 6 \
-out primer_hits.tsv
# Restrict to specific taxids
blastp -query query.fa -db nr \
-taxids 9606,10090,10116 \
-outfmt "6 qseqid sseqid staxids sscinames evalue bitscore" \
-out mammalian_hits.tsv
# Or to a taxid subtree (NCBI BLAST+ 2.13+)
echo 9606 > human_only.txt
blastp -query query.fa -db nr -taxidlist human_only.txt -outfmt 6 -out human_hits.tsv
# Requires database built with -parse_seqids
cut -f2 top_hit_per_query.tsv | sort -u > hit_accessions.txt
blastdbcmd -db ref_prot_db -entry_batch hit_accessions.txt -out hits.fasta
# Pull a range of a sequence
blastdbcmd -db genome_db -entry NC_000001.11 -range 1000000-1001000 -out region.fa
See ortholog-inference skill for the principled treatment. Quick version:
blastp -query A.fa -db B_db -outfmt 6 -evalue 1e-5 -num_threads 8 \
-max_target_seqs 5 -out A_vs_B.tsv
blastp -query B.fa -db A_db -outfmt 6 -evalue 1e-5 -num_threads 8 \
-max_target_seqs 5 -out B_vs_A.tsv
# Best forward + reverse, intersect
awk '!seen[$1]++ {print $1"\t"$2}' A_vs_B.tsv | sort > A_best
awk '!seen[$1]++ {print $1"\t"$2}' B_vs_A.tsv | sort > B_best
awk 'NR==FNR{a[$1]=$2; next} a[$2]==$1' A_best B_best > rbh.tsv
This works but does NOT handle paralog mis-pairs from gene duplication; for that use OrthoFinder or OMA (in ortholog-inference).
import subprocess
import shutil
def require_tool(name, min_version=None):
if not shutil.which(name):
raise RuntimeError(f'{name} not on PATH')
out = subprocess.run([name, '-version'], capture_output=True, text=True)
print(f' {out.stdout.strip().splitlines()[0]}')
def run_blast(query, db, out, program='blastp', evalue=1e-10, threads=8, hitlist=500):
require_tool(program)
cmd = [program, '-query', query, '-db', db, '-out', out,
'-evalue', str(evalue),
'-num_threads', str(threads),
'-max_target_seqs', str(hitlist),
'-outfmt', '6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore stitle']
subprocess.run(cmd, check=True)
def parse_tabular(path):
cols = ['qseqid', 'sseqid', 'pident', 'length', 'qcovs', 'qcovhsp', 'evalue', 'bitscore', 'stitle']
rows = []
with open(path) as f:
for line in f:
vals = line.rstrip('\n').split('\t')
d = dict(zip(cols, vals))
for k in ('pident', 'qcovs', 'qcovhsp', 'evalue', 'bitscore'):
d[k] = float(d[k])
d['length'] = int(d['length'])
rows.append(d)
return rows
nt/nr size shockupdate_blastdb.pl --decompress nt without realizing the size.nt is ~250 GB compressed, ~1 TB indexed.refseq_select for most workflows; only pull nt/nr with intent and >1 TB free.-parse_seqids-parse_seqids; later try blastdbcmd -entry.blastdbcmd can't look up by accession.Error: ... not found in database.-parse_seqids (cheap if FASTA still on disk).-task for the questionblastn for cross-species mRNA (word=11 but ungapped seeding).dc-megablast) is much more sensitive across species.-task dc-megablast for cross-species; -task megablast only for >=95% identity.-num_threads 64 on a 32-core box.-taxids flag returns "Taxonomy database not available".taxonomy4blast.sqlite3 companion; v5 has taxonomy indexed in DB.update_blastdb.pl --decompress (gets v5); or use v5 explicitly when building.-dust/-seg.-soft_masking true + -dust yes/-seg yes.max_target_seqs truncation-max_target_seqs 10 (Shah et al. 2019 Bioinformatics 35:1613).-max_target_seqs 500 + post-filter.-max_target_seqs large (500+); filter top N in awk/Python.| Error / symptom | Cause | Solution |
|---|---|---|
BLAST Database error |
DB path wrong, or alias missing | blastdbcmd -db <db> -info to confirm |
Error: entry not found |
Built without -parse_seqids |
Rebuild |
| Taxonomy filter no-op | v4 DB | Upgrade to v5 |
| Threads >16 not faster | I/O bound | Split input + parallel invocations |
nt download fills disk |
Database is huge | Use refseq_select |
Sequence too short |
Query < word_size | Use -task blastn-short (word=7) |
| Out of memory | Single large query | Reduce -num_threads, split query |