    gptomics/bio-genome-assembly-assembly-qc
    About

    Assess genome assembly quality using QUAST for contiguity metrics and BUSCO for completeness. Essential for evaluating assembly success and comparing assemblers.

    SKILL.md

    Version Compatibility

    Reference examples tested with: BUSCO 5.5+, QUAST 5.2+, SPAdes 3.15+, pandas 2.2+

    Before using code patterns, verify installed versions match. If versions differ:

    • Python: pip show <package> then help(module.function) to check signatures
    • CLI: <tool> --version then <tool> --help to confirm flags

    If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
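
    The CLI version check above can be scripted. The following is a minimal sketch, assuming each tool prints a dotted version number somewhere in its --version output; parse_version, meets_minimum, and installed_version are hypothetical helper names, not part of BUSCO or QUAST.

```python
import re
import shutil
import subprocess

def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints, or None."""
    match = re.search(r'(\d+(?:\.\d+)+)', text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split('.'))

def meets_minimum(text, minimum):
    """True if the version found in text is at least `minimum` (a tuple of ints)."""
    version = parse_version(text)
    return version is not None and version >= minimum

def installed_version(tool):
    """Run `<tool> --version` and return its raw output, or None if the tool is not on PATH."""
    if shutil.which(tool) is None:
        return None
    proc = subprocess.run([tool, '--version'], capture_output=True, text=True)
    # Some tools print the version to stderr, so combine both streams.
    return (proc.stdout + proc.stderr).strip()

if __name__ == '__main__':
    # Minimum versions taken from the "tested with" line above.
    for tool, minimum in [('busco', (5, 5)), ('quast.py', (5, 2))]:
        output = installed_version(tool)
        if output is None:
            print(f'{tool}: not found on PATH')
        elif meets_minimum(output, minimum):
            print(f'{tool}: {output} (ok)')
        else:
            print(f'{tool}: {output} (older than tested, or version not recognized)')
```

    Comparing tuples of integers avoids the string-comparison trap where "5.10" sorts before "5.5".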

    Assembly QC

    "Assess my genome assembly quality" → Evaluate assembly contiguity (N50, total length, misassemblies) and gene completeness using conserved single-copy orthologs.

    • CLI: quast.py assembly.fa -r reference.fa (contiguity), busco -i assembly.fa -m genome -l <lineage> -o busco_output (completeness)

    Key Metrics

    Metric              Good Assembly
    N50                 High (relative to genome)
    L50                 Low
    Contigs             Few
    Misassemblies       0 (with reference)
    BUSCO Complete      >95%
    BUSCO Duplicated    <5% (unless polyploid)

    QUAST

    Installation

    conda install -c bioconda quast
    

    Basic Usage

    quast.py assembly.fasta -o quast_output
    

    With Reference Genome

    quast.py assembly.fasta -r reference.fasta -o quast_output
    

    Compare Multiple Assemblies

    quast.py assembly1.fa assembly2.fa assembly3.fa -o comparison
    

    Key Options

    Option                   Description
    -o                       Output directory
    -r                       Reference genome
    -g                       Gene/feature annotations for the reference (e.g. GFF)
    -t                       Threads
    -m                       Min contig length (default: 500)
    --large                  For large genomes (>100 Mbp)
    --fragmented             Reference genome is fragmented (e.g. a scaffold-level reference)
    -s, --split-scaffolds    Input is scaffolds; also evaluate them split at N-gaps

    With Gene Annotations

    quast.py assembly.fasta -r reference.fasta -g genes.gff -o quast_output
    

    For Large Genomes

    quast.py --large assembly.fasta -o quast_output -t 16
    

    Output Files

    quast_output/
    ├── report.txt        # Summary statistics
    ├── report.html       # Interactive report
    ├── report.tsv        # Tab-separated stats
    ├── icarus.html       # Contig viewer
    └── aligned_stats/    # If reference provided
    

    Key Output Metrics

    Metric                 Description
    Total length           Sum of contig lengths
    # contigs              Number of contigs (>= min length)
    Largest contig         Length of largest contig
    N50                    Contigs of at least this length contain 50% of the assembly
    N90                    Contigs of at least this length contain 90% of the assembly
    L50                    Fewest contigs whose combined length reaches 50% of the assembly
    GC (%)                 GC content
    # misassemblies        With reference: structural errors
    Genome fraction (%)    With reference: % of reference covered
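
    The N50, N90, and L50 definitions above can be checked by hand on a small list of contig lengths. This is a minimal sketch of the standard calculation, not QUAST's own code; n50_l50 is a hypothetical helper name, and unlike QUAST it applies no minimum contig length filter.

```python
def n50_l50(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths; fraction=0.5 gives (N50, L50)."""
    if not lengths:
        raise ValueError('no contig lengths given')
    ordered = sorted(lengths, reverse=True)   # longest contig first
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            # `length` is Nx, `count` is Lx
            return length, count

contigs = [100, 80, 60, 40, 20]      # total length 300
print(n50_l50(contigs))              # (80, 2): 100 + 80 = 180 >= 150
print(n50_l50(contigs, 0.9))         # (40, 4): 100 + 80 + 60 + 40 = 280 >= 270
```

    Because N50 depends on the total assembly length, compare it only between assemblies of the same genome.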

    BUSCO

    Installation

    conda install -c bioconda busco
    

    Basic Usage

    busco -i assembly.fasta -m genome -l bacteria_odb10 -o busco_output
    

    Key Options

    Option             Description
    -i                 Input assembly
    -m                 Mode: genome, proteins, transcriptome
    -l                 Lineage dataset
    -o                 Output name
    -c                 CPU threads
    --auto-lineage     Auto-detect lineage
    --offline          Use downloaded datasets only
    --list-datasets    List available lineages

    List Available Lineages

    busco --list-datasets
    

    Common Lineages

    Lineage                  Use For
    bacteria_odb10           Bacteria
    archaea_odb10            Archaea
    eukaryota_odb10          General eukaryote
    fungi_odb10              Fungi
    metazoa_odb10            Animals
    vertebrata_odb10         Vertebrates
    mammalia_odb10           Mammals
    viridiplantae_odb10      Plants
    saccharomycetes_odb10    Yeasts

    Auto-Lineage Detection

    busco -i assembly.fasta -m genome --auto-lineage -o busco_output
    

    Output Files

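    busco_output/
    ├── short_summary.specific.<lineage>.busco_output.txt   # Quick summary
    ├── logs/                           # Run logs
    └── run_<lineage>/
        ├── short_summary.txt           # Same summary, per lineage
        ├── full_table.tsv              # All BUSCO results
        ├── missing_busco_list.tsv      # Missing genes
        └── busco_sequences/            # BUSCO gene sequences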
    

    Interpret Results

    C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:4085
    
    C - Complete (S + D)
    S - Complete and single-copy
    D - Complete and duplicated
    F - Fragmented
    M - Missing
    n - Total BUSCO groups searched
    

    Quality Thresholds

    Quality       Complete    Missing
    Excellent     >95%        <2%
    Good          >90%        <5%
    Acceptable    >80%        <10%
    Poor          <80%        >10%
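
    The thresholds above can be applied programmatically. This is a minimal sketch, assuming that a tier requires both its Complete and Missing conditions to hold (the table does not say how to combine them); classify_busco is a hypothetical helper name.

```python
def classify_busco(complete, missing):
    """Map BUSCO complete/missing percentages to a quality tier."""
    # (tier, Complete must exceed, Missing must be below), best tier first.
    tiers = [
        ('Excellent', 95.0, 2.0),
        ('Good', 90.0, 5.0),
        ('Acceptable', 80.0, 10.0),
    ]
    for name, min_complete, max_missing in tiers:
        # Assumption: both conditions must hold for the tier to apply.
        if complete > min_complete and missing < max_missing:
            return name
    return 'Poor'

print(classify_busco(98.5, 1.0))    # Excellent
print(classify_busco(92.0, 4.0))    # Good
print(classify_busco(97.0, 6.0))    # Acceptable (Missing is too high for Good)
print(classify_busco(75.0, 20.0))   # Poor
```

    Treat the tier as a rough guide: expected completeness varies with lineage and genome biology.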

    Complete QC Workflow

    Goal: Run a comprehensive assembly quality assessment combining contiguity and completeness metrics.

    Approach: Execute QUAST for contiguity statistics and BUSCO for gene completeness, optionally with a reference genome.

    #!/bin/bash
    set -euo pipefail
    
    # Positional arguments: assembly (required), reference, lineage, output directory
    ASSEMBLY=${1:?usage: $0 <assembly.fasta> [reference.fasta] [lineage] [outdir]}
    REFERENCE=${2:-}
    LINEAGE=${3:-bacteria_odb10}
    OUTDIR=${4:-assembly_qc}
    
    mkdir -p "$OUTDIR"
    
    echo "=== Assembly QC ==="
    
    # QUAST (adds reference-based metrics only when a reference is given)
    echo "Running QUAST..."
    if [ -n "$REFERENCE" ]; then
        quast.py "$ASSEMBLY" -r "$REFERENCE" -o "${OUTDIR}/quast" -t 8
    else
        quast.py "$ASSEMBLY" -o "${OUTDIR}/quast" -t 8
    fi
    
    # BUSCO (--out_path writes the run directly to ${OUTDIR}/busco)
    echo "Running BUSCO..."
    busco -i "$ASSEMBLY" -m genome -l "$LINEAGE" -o busco --out_path "$OUTDIR" -c 8
    
    # Summary
    echo ""
    echo "=== QUAST Summary ==="
    cat "${OUTDIR}/quast/report.txt"
    
    echo ""
    echo "=== BUSCO Summary ==="
    cat "${OUTDIR}"/busco/short_summary*.txt
    
    echo ""
    echo "Reports saved to $OUTDIR"
    

    Compare Assemblies

    Goal: Evaluate multiple assemblies side-by-side to select the best one.

    Approach: Run QUAST with multiple input assemblies and labeled names, then generate BUSCO comparison plots.

    QUAST Comparison

    quast.py \
        spades_assembly.fa \
        flye_assembly.fa \
        canu_assembly.fa \
        -r reference.fa \
        -l "SPAdes,Flye,Canu" \
        -o assembly_comparison
    

    BUSCO Comparison

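    # Run BUSCO on each assembly
    for asm in spades.fa flye.fa canu.fa; do
        name=$(basename "$asm" .fa)
        busco -i "$asm" -m genome -l bacteria_odb10 -o "busco_${name}"
    done
    
    # Collect the short summaries in one directory, then plot them together
    # (generate_plot.py reads every short_summary.*.txt file in the -wd directory)
    mkdir -p BUSCO_summaries
    cp busco_*/short_summary.*.txt BUSCO_summaries/
    generate_plot.py -wd BUSCO_summaries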
    

    Python: Parse QUAST Output

    Goal: Programmatically extract assembly metrics from QUAST reports.

    Approach: Read the tab-separated report.tsv file and transpose it for easy metric access.

    import pandas as pd
    
    def parse_quast(report_tsv):
        '''Parse QUAST report.tsv file.'''
        df = pd.read_csv(report_tsv, sep='\t', index_col=0)
        return df.T
    
    stats = parse_quast('quast_output/report.tsv')
    print(f"N50: {stats['N50'].values[0]}")
    print(f"Total length: {stats['Total length'].values[0]}")
    print(f"# contigs: {stats['# contigs'].values[0]}")
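
    When QUAST is run on several assemblies, report.tsv has one column per assembly, so the same file can drive a side-by-side comparison. The sketch below uses only the standard library. read_quast_tsv and best_by are hypothetical helper names, the numbers are made up for illustration, and metrics reported as "-" would need extra handling.

```python
import csv
import io

def read_quast_tsv(handle):
    """Return {assembly_name: {metric: value_string}} from a QUAST report.tsv handle."""
    rows = list(csv.reader(handle, delimiter='\t'))
    names = rows[0][1:]              # header row: 'Assembly', then one column per assembly
    table = {name: {} for name in names}
    for row in rows[1:]:
        if not row:
            continue                 # skip blank lines
        metric, values = row[0], row[1:]
        for name, value in zip(names, values):
            table[name][metric] = value
    return table

def best_by(table, metric, higher_is_better=True):
    """Pick the assembly with the best numeric value for one metric."""
    pick = max if higher_is_better else min
    return pick(table, key=lambda name: float(table[name][metric]))

# Made-up numbers for illustration only, laid out like a multi-assembly report.tsv.
example = (
    'Assembly\tSPAdes\tFlye\tCanu\n'
    '# contigs\t120\t12\t18\n'
    'Total length\t4950000\t5010000\t5030000\n'
    'N50\t85000\t2400000\t1900000\n'
)
table = read_quast_tsv(io.StringIO(example))
print(best_by(table, 'N50'))                                 # Flye
print(best_by(table, '# contigs', higher_is_better=False))   # Flye
```

    For a real run, pass an open file instead of the in-memory example, e.g. with open('assembly_comparison/report.tsv') as handle.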
    

    Python: Parse BUSCO Output

    Goal: Programmatically extract BUSCO completeness metrics from summary files.

    Approach: Parse the short_summary.txt file using regex to capture completeness, duplication, fragmentation, and missing percentages.

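    import glob
    import re
    
    def parse_busco_summary(summary_file):
        '''Parse the one-line score from a BUSCO short summary file.'''
        with open(summary_file) as f:
            text = f.read()
    
        # Percentages may be printed with or without a decimal part.
        num = r'(\d+(?:\.\d+)?)'
        pattern = rf'C:{num}%\[S:{num}%,D:{num}%\],F:{num}%,M:{num}%,n:(\d+)'
        match = re.search(pattern, text)
        if match is None:
            return None
    
        keys = ('complete', 'single', 'duplicated', 'fragmented', 'missing')
        result = {key: float(value) for key, value in zip(keys, match.groups())}
        result['total'] = int(match.group(6))
        return result
    
    # BUSCO 5 names the top-level file short_summary.specific.<lineage>.<output>.txt,
    # so match it with a glob instead of hard-coding the name.
    for path in glob.glob('busco_output/short_summary*.txt'):
        result = parse_busco_summary(path)
        if result is None:
            print(f'{path}: no BUSCO score line found')
        else:
            print(f"{path}: Complete {result['complete']}%")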
    

    MetaQUAST (Metagenomes)

    Goal: Assess metagenome assembly quality accounting for multiple reference genomes.

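    Approach: Run MetaQUAST. When no references are supplied with -r, it identifies likely reference genomes from the assembly (via the SILVA 16S rRNA database, downloading matching genomes from NCBI, so it needs internet access) and reports per-genome metrics.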

    metaquast.py metagenome_assembly.fa -o metaquast_output -t 16
    

    Troubleshooting

    Low N50

    • Check coverage depth
    • Consider longer reads
    • Try a different assembler

    Low BUSCO Completeness

    • Check input read quality
    • Verify correct lineage dataset
    • May indicate real gene loss (compare to relatives)

    High Duplication in BUSCO

    • Normal for polyploids
    • May indicate contamination
    • Check for uncollapsed haplotypes (retained haplotigs) in heterozygous genomes
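
    To see where duplicated BUSCOs fall, the per-gene table can be tallied by contig. This is a minimal sketch, assuming the usual full_table.tsv layout (tab-separated, "#" comment lines, BUSCO ID, status, and sequence name in the first three columns); busco_status_by_sequence is a hypothetical helper name and the rows are illustrative only.

```python
from collections import Counter, defaultdict

def busco_status_by_sequence(lines):
    """Count BUSCO statuses per contig from the lines of a full_table.tsv file."""
    counts = defaultdict(Counter)
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue                      # skip header comments and blank lines
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 3 or fields[1] == 'Missing':
            continue                      # missing BUSCOs have no sequence column
        # Assumption: column 3 holds the contig name.
        counts[fields[2]][fields[1]] += 1
    return counts

# Illustrative rows only (hypothetical BUSCO IDs and contig names).
example = [
    '# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength\n',
    '1A\tComplete\tcontig_1\t100\t900\t+\t500.0\t266\n',
    '2A\tDuplicated\tcontig_1\t2000\t2900\t+\t450.0\t300\n',
    '2A\tDuplicated\tcontig_7\t50\t950\t-\t440.0\t300\n',
    '3A\tMissing\n',
]
for contig, tally in sorted(busco_status_by_sequence(example).items()):
    print(contig, dict(tally))
```

    Contigs that carry mostly duplicated copies are candidates for closer inspection, for example as redundant haplotype copies or contaminant sequence.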

    Related Skills

    • short-read-assembly - SPAdes assembly
    • long-read-assembly - Flye/Canu assembly
    • assembly-polishing - Improve accuracy
    • metagenomics - Metagenome analysis
    Repository
    gptomics/bioskills