Assess genome assembly quality using QUAST for contiguity metrics and BUSCO for completeness. Essential for evaluating assembly success and comparing assemblers...
Reference examples tested with: QUAST 5.2+, BUSCO 5.5+ (and 6.x for odb12 lineages), compleasm 0.2.6+, Merqury 1.3+, meryl 1.4+, minimap2 2.26+, Inspector 1.2+, CRAQ 1.0+, merfin 1.0+, GenomeScope2 2.0+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function) to check signaturesResults depend on inputs that outlive the binary version - record them:
_odb10 (BUSCO 5) and _odb12 (BUSCO 6 default) gene sets are not comparable across the version boundary; a 99% on the shallow eukaryota_odb10 (~255 genes) is a different claim from 99% on a deep clade set (~5,500+).best_k.sh <genome_size>, not hardcoded) and the read set used for the k-mer DB (use accurate reads; see the circularity warning below).If code throws an error, introspect the installed tool and adapt rather than retrying.
"Is my genome assembly any good?" -> Measure all three orthogonal axes - contiguity, completeness, correctness - with reference-free methods, because no single number (least of all N50) is quality.
quast.py asm.fa --large --eukaryote -o out (contiguity + reference-based structure), busco -i asm.fa -m genome -l <lineage> or compleasm run -a asm.fa -l <lineage> (gene completeness), merqury.sh reads.meryl asm.fa out (reference-free QV + k-mer completeness), inspector.py -c asm.fa -r reads.fq (reference-free structural errors)Assembly quality is three genuinely orthogonal axes - contiguity, completeness, correctness - and a single number on any one is not quality. The axes do not predict each other, and the diagnostic failure modes prove it:
The field's historical sin is reporting contiguity alone because it is cheapest to compute and easiest to game. N50 is the most-gamed metric in genomics: it rises when sequence is thrown away (N50 is computed on what survives), when misjoins are not broken (a misjoined contig is a long contig), and when haplotigs are retained. A bigger N50 is louder, not better. Three load-bearing moves:
| Tool | Citation | Axis / Role | When |
|---|---|---|---|
| QUAST / calN50 | Gurevich 2013 Bioinformatics; auN = Li blog (no journal) | contiguity (auN/NGx, N50/L50) + reference-based structure (NA50, misassemblies) | always for contiguity; structure only vs a same-organism reference |
| BUSCO | Manni 2021 Mol Biol Evol; Simão 2015 Bioinformatics | gene-space completeness (C/S/D/F/M) | universal; conservative on good genomes |
| compleasm | Huang & Li 2023 Bioinformatics | gene-space completeness, miniprot-based | faster + more sensitive; default on HiFi/T2T-era genomes |
| Merqury | Rhie 2020 Genome Biol | reference-free QV + k-mer completeness + spectra-cn + phasing | always; the accuracy gold standard |
| merfin | Formenti 2022 Nat Methods | multiplicity-corrected QV / polishing | refine QV biased by k-mer multiplicity |
| Inspector | Chen 2021 Genome Biol | reference-free structural + base errors (long reads) | novel genomes; can also correct |
| CRAQ | Li 2023 Nat Commun | reference-free structural/regional errors (clipped alignments) | novel genomes; flags misjoins to split |
| asmgene | Li (minimap2, no separate journal) | gene collapse / false duplication | high-quality genomes where BUSCO saturates |
| GenomeScope2 | Ranallo-Benavidez 2020 Nat Commun | genome size / het / repeat % from k-mers | the size estimate NG50/auNG/spectra-cn need (-> genome-profiling) |
| Scenario | Recommended | Why |
|---|---|---|
| Novel genome, no trusted reference | Merqury QV + k-mer completeness + BUSCO/compleasm + auN/NGx + Inspector/CRAQ | reference-free triad; QUAST structure is uninterpretable here |
| Same-species (near-isogenic) reference available | add QUAST --large --eukaryote -r ref.fa (NA50, misassemblies) |
reference-based structure is trustworthy only vs the same organism |
| High-quality HiFi/T2T-era assembly, BUSCO looks low | compleasm | BUSCO under-reports good genomes (its predictor, not the assembly, misses genes) |
| Contiguity claim must resist gaming | auN/auNG via calN50.js -L <size> |
N50 is a single unstable order-statistic and is gameable |
| High BUSCO-Duplicated, size > expected | spectra-cn + asmgene + GenomeScope2 size -> purge_dups | distinguish uncollapsed haplotigs (purge) from real WGD (keep) |
| Phased diploid / trio assembly | Merqury hap-mers: switch/hamming error + blob plot | phasing accuracy is the extra axis |
| Need genome size / het before NG50 | -> genome-profiling (GenomeScope2) | NG/auN/spectra-cn all need a size estimate |
| Reads not yet QC'd | -> read-qc/quality-reports | garbage-in caps assembly quality |
| Bacterial isolate / MAG completeness+contamination | -> contamination-detection (CheckM2/GUNC/MIMAG) | marker-gene completeness/contamination is a different problem |
k8 calN50.js -L <genome_size> asm.fa # N50/L50 + NG50/NGx + auN/auNG; -L sets genome size for NG/auNG (ships with minimap2)
quast.py asm.fa --large --eukaryote -t 16 -o quast_out # N50/L50, GC, # contigs; NG50 only with -r or --est-ref-size; structure only if -r given
--large implies --eukaryote --min-contig 3000 --min-alignment 500 --extensive-mis-size 7000. Report contig AND scaffold N50; a scaffold N50 far above the contig N50 means the contiguity is scaffolding Ns, and every gap is a join hypothesis that could be a misassembly. NA50 (QUAST, contigs broken at misassemblies) far below N50 means the contiguity is partly fictional.
busco -i asm.fa -m genome -l vertebrata_odb10 -o busco_out -c 16 # metaeuk predictor (default)
busco -i asm.fa -m genome --auto-lineage -o busco_out -c 16 # auto-pick if clade unknown
compleasm run -a asm.fa -l vertebrata -o compleasm_out -t 16 # miniprot-based; faster + more sensitive
Reported as C:[S,D],F,M,n. Read C, F, and M together, never C alone: high Fragmented with high Complete signals a contiguity/base-quality problem hidden behind the headline. Use the deepest applicable clade dataset (a 99% on the shallow eukaryota_odb10 ~255-gene set is trivially easy and not comparable to a deep clade set), and record the lineage + OrthoDB generation. On a high-quality assembly, BUSCO reported ~95.7% complete where compleasm reported ~99.6% on the same human genome (Huang & Li 2023) - the missing ~4% was missing from BUSCO's predictor, not the genome - so prefer compleasm on good genomes. Both share gene-space blindness: they say nothing about intergenic/repeat/regulatory sequence. Merqury k-mer completeness scores the whole genome (reliable read k-mers found in the assembly / reliable read k-mers in the reads), catching missing sequence BUSCO cannot see; it is blind to structure (a scrambled-but-present genome scores 100%).
Goal: Get a reference-free per-base accuracy (QV) plus a copy-number/false-duplication picture, then structural errors without a reference.
Approach: Build a meryl k-mer DB at the best_k.sh-derived k from accurate reads, run Merqury for QV + completeness + spectra-cn, and map raw long reads back with Inspector/CRAQ for structural errors. Refine QV with merfin where multiplicity bias matters.
best_k.sh <genome_size> # prints recommended k (NOT hardcoded); ~18-21 for Gbp genomes
meryl count k=21 reads.fastq output reads.meryl # k from best_k.sh; use ACCURATE reads (HiFi/Illumina)
merqury.sh reads.meryl asm.fa out # -> out.qv (per-scaffold + overall), out.completeness.stats, spectra-cn
inspector.py -c asm.fa -r reads.fq -o insp_out --datatype hifi -t 16 # reference-free structural + base errors
craq -g asm.fa -sms long_reads.bam -ngs short_reads.bam -o craq_out # R-AQI/S-AQI; CRE (regional)/CSE (structural)
QV: with E = K_asm-only / K_total, per-base error P = 1 - (1 - E)^(1/k) and QV = -10*log10(P) (the ^(1/k) converts a k-mer error rate to per-base, since one wrong base breaks k overlapping k-mers). QV40 = 1 error/10 kb (the EBP/VGP floor), QV50 strong, ~QV60 = T2T-grade (1/Mb). The spectra-cn plot reads completeness and false duplication in one figure: a black "missing" peak at homozygous depth = real content absent; 2-copy k-mers under the 1-copy peak = uncollapsed haplotigs; error k-mers sit far left.
The EBP minimum is 6.C.Q40: x.y.z where x = log10 of contig NG50 (6 = 1 Mb), y = scaffold level (C = chromosome-scale), z = QV (40 = <1 error/10 kb). It is literally the triad turned into a label, and the bar moves - T2T pushed the achievable frontier to ~Q60/gapless, so bragging about QV40 in 2026 is hitting the floor. Match the bar to the organism (the relaxed "5" tier, >100 kb contig NG50, exists for low-input species). For phased diploid/trio assemblies, Merqury hap-mers (parental k-mers, or Hi-C) give the switch error (local haplotype flips within a block) and hamming error (global mis-assignment fraction) - report both, and read the hap-mer blob plot (cleanly phased contigs sit on one axis).
Trigger: reporting a single N50 as the quality verdict. Mechanism: N50 rises on thrown-away sequence, unbroken misjoins, and retained haplotigs; it is also a single unstable order-statistic. Symptom: big N50, unstated QV/completeness/NG50. Fix: report auN/NGx + BUSCO/compleasm + Merqury QV; treat N50-only claims as untrustworthy.
Trigger: quast.py -r congener.fa on a novel genome. Mechanism: real inversions/SVs and SNPs between organism and reference are scored as "misassemblies"/"mismatches"; the count scales with divergence, not error. Symptom: "hundreds of misassemblies" on a correct assembly. Fix: use reference-free correctness (Inspector/CRAQ/Merqury); reserve QUAST structure for a same-organism reference.
Trigger: QV from the exact reads used to polish. Mechanism: the polisher made the assembly agree with those reads by construction. Symptom: impressively high QV that rose after polishing with the QV reads. Fix: build the k-mer DB from accurate, ideally independent reads; consider merfin for multiplicity-corrected QV.
Trigger: treating high BUSCO-D as "extra coverage / more complete". Mechanism: uncollapsed haplotigs (both alleles kept as separate primary contigs) vs real WGD vs split models. Symptom: D in high single digits to tens, assembly size >> GenomeScope2 estimate. Fix: triangulate size + spectra-cn 2-copy peak + asmgene false-dup; if no WGD -> purge_dups, then re-QC (watch for over-purge: size dropping below the estimate deletes real segmental duplications).
Trigger: QV60 + BUSCO 99% taken as "done". Mechanism: QV is measured on what assembled; BUSCO scores the conserved easy core only. Symptom: high accuracy and gene-completeness while 8-15% of sequence (repeats/centromeres) is absent. Fix: add Merqury k-mer completeness (whole-genome) and inspect read mapping-rate/coverage uniformity.
| Threshold | Source | Rationale |
|---|---|---|
| Merqury QV >= 40 | EBP/VGP minimum (Rhie 2021) | 1 error/10 kb; QV50 strong, ~Q60 T2T-grade; report the actual value |
| QV from polishing reads | circularity trap | always biased high; use independent/accurate reads, prefer merfin |
| BUSCO Complete >= 95%, Fragmented < 5% | field convention | read F+M with C; high F = contiguity/base-quality problem behind a good C% |
| BUSCO Duplicated ~1-3% (clean haploid); >5-8% no WGD | assembly norm | uncollapsed haplotigs -> purge_dups; cross-check size + spectra-cn + asmgene |
| compleasm preferred on good genomes | Huang & Li 2023 | BUSCO under-reports (~95.7% vs ~99.6% on human) due to its predictor |
| k-mer completeness (Merqury) >= 95% | field convention | lower = sequence absent that the BUSCO gene set cannot see |
| Contig NG50 >= 1 Mb (EBP "6") | Rhie 2021 / EBP standards | the 6.C.Q40 contig bar; relaxed "5" (>100 kb) for low-input species |
| Report auN/NGx, not bare N50 | Li (auN blog) | N50 is gameable and a single unstable order-statistic |
| NG/auN need a genome-size estimate | by definition | GenomeScope2/flow cytometry; couples contiguity to completeness |
| Error / symptom | Cause | Solution |
|---|---|---|
| Big N50, no QV reported | leading with the most-gamed metric | add Merqury QV, k-mer completeness, auN/NGx |
| Hundreds of QUAST "misassemblies" on a novel genome | divergent reference; biology scored as error | reference-free (Inspector/CRAQ); QUAST only vs same organism |
| QV suspiciously high, rose after polishing | QV computed on the polishing reads (circular) | independent/accurate-read k-mer DB; merfin |
| Assembly ~1.5-2x expected size, high BUSCO-D | uncollapsed haplotigs (false duplication) | purge_dups; verify with spectra-cn + asmgene |
| Size drops below GenomeScope2 estimate after purging | over-purged real segmental duplications | back off purge stringency; check asmgene collapse direction |
| BUSCO low-90s on a HiFi/T2T assembly | BUSCO predictor misses present genes | re-run compleasm before concluding incompleteness |
| Scaffold N50 >> contig N50 reported as contiguity | contiguity is gap-Ns, not sequence | report contig N50 too; each gap is a join hypothesis |