    brunoasm/busco-phylogeny
    Research

    About

    Generate phylogenies from genome assemblies using BUSCO/compleasm-based single-copy orthologs with scheduler-aware workflow generation

    SKILL.md

    BUSCO-based Phylogenomics Workflow Generator

    This skill provides phylogenomics expertise for generating comprehensive, scheduler-aware workflows for phylogenetic inference from genome assemblies using single-copy orthologs.

    Purpose

    This skill helps users generate phylogenies from genome assemblies by:

    1. Handling mixed input (local files and NCBI accessions)
    2. Creating scheduler-specific scripts (SLURM, PBS, cloud, local)
    3. Setting up complete workflows from raw genomes to final trees
    4. Providing quality control and recommendations
    5. Supporting flexible software management (bioconda, Docker, custom)

    Available Resources

    The skill provides access to these bundled resources:

    Scripts (scripts/)

    • query_ncbi_assemblies.py - Query NCBI for available genome assemblies by taxon name (new!)
    • download_ncbi_genomes.py - Download genomes from NCBI using BioProjects or Assembly accessions
    • rename_genomes.py - Rename genome files with meaningful sample names (important!)
    • generate_qc_report.sh - Generate quality control reports from compleasm results
    • extract_orthologs.sh - Extract and reorganize single-copy orthologs
    • run_aliscore.sh - Wrapper for Aliscore to identify randomly similar sequences (RSS)
    • run_alicut.sh - Wrapper for ALICUT to remove RSS positions from alignments
    • run_aliscore_alicut_batch.sh - Batch process all alignments through Aliscore + ALICUT
    • convert_fasconcat_to_partition.py - Convert FASconCAT output to IQ-TREE partition format
    • predownloaded_aliscore_alicut/ - Pre-tested Aliscore and ALICUT Perl scripts

    Templates (templates/)

    • slurm/ - SLURM job scheduler templates
    • pbs/ - PBS/Torque job scheduler templates
    • local/ - Local machine templates (with GNU parallel)
    • README.md - Complete template documentation

    References (references/)

    • REFERENCE.md - Detailed technical reference including:
      • Sample naming best practices
      • BUSCO lineage datasets (complete list)
      • Resource recommendations (memory, CPUs, walltime)
      • Detailed step-by-step implementation guides
      • Quality control guidelines
      • Aliscore/ALICUT detailed guide
      • Tool citations and download links
      • Software installation guide
      • Common issues and troubleshooting

    Workflow Overview

    The complete phylogenomics pipeline follows this sequence:

    Input Preparation → Ortholog Identification → Quality Control → Ortholog Extraction → Alignment → Trimming → Concatenation → Phylogenetic Inference

    Initial User Questions

    When a user requests phylogeny generation, gather the following information systematically:

    Step 1: Detect Computing Environment

    Before asking questions, attempt to detect the local computing environment:

    # Check for job schedulers and parallel tooling, reporting what is found
    command -v sbatch >/dev/null 2>&1 && echo "SLURM detected"        # SLURM
    command -v qsub >/dev/null 2>&1 && echo "PBS/Torque detected"     # PBS/Torque
    command -v parallel >/dev/null 2>&1 && echo "GNU parallel detected"
    

    Report findings to the user, then confirm: "I detected [X] on this machine. Will you be running the scripts here or on a different system?"

    Required Information

    Ask these questions to gather essential workflow parameters:

    1. Computing Environment

      • Where will these scripts run? (SLURM cluster, PBS/Torque cluster, Cloud computing, Local machine)
    2. Input Data

      • Local genome files, NCBI accessions, or both?
      • If NCBI: Do you already have Assembly accessions (GCA_*/GCF_*) or BioProject accessions (PRJNA*/PRJEB*/PRJDA*)?
      • If user doesn't have accessions: Offer to help find assemblies using query_ncbi_assemblies.py (see "STEP 0A: Query NCBI for Assemblies" below)
      • If local files: What are the file paths?
    3. Taxonomic Scope & Dataset Details

      • What taxonomic group? (determines BUSCO lineage dataset)
      • How many taxa/genomes will be analyzed?
      • What is the approximate phylogenetic breadth? (species-level, genus-level, family-level, order-level, etc.)
      • See references/REFERENCE.md for complete lineage list
    4. Environment Management

      • Use unified conda environment (default, recommended), or separate environments per tool?
    5. Resource Constraints

      • How many CPU cores/threads to use in total? (Ask user to specify, do not auto-detect)
      • Available memory (RAM) per node/machine?
      • Maximum walltime for jobs?
      • See references/REFERENCE.md for resource recommendations
    6. Parallelization Strategy

      Ask the user how they want to handle parallel processing:

      • For job schedulers (SLURM/PBS):

        • Use array jobs for parallel steps? (Recommended: Yes)
        • Which steps to parallelize? (Steps 2, 5, 6, 8C recommended)
      • For local machines:

        • Use GNU parallel for parallel steps? (requires parallel installed)
        • How many concurrent jobs?
      • For all systems:

        • Optimize for maximum throughput or simplicity?
    7. Scheduler-Specific Configuration (if using SLURM or PBS)

      • Account/Username for compute time charges
      • Partition/Queue to submit jobs to
      • Email notifications? (address and when: START, END, FAIL, ALL)
      • Job dependencies? (Recommended: Yes for linear workflow)
      • Output log directory? (Default: logs/)
    8. Alignment Trimming Preference

      • Aliscore/ALICUT (traditional, thorough), trimAl (fast), BMGE (entropy-based), or ClipKit (modern)?
    9. Substitution Model Selection (for IQ-TREE phylogenetic inference)

      Context needed: Taxonomic breadth, number of taxa, evolutionary rates

      Action: Fetch IQ-TREE model documentation and suggest appropriate amino acid substitution models based on dataset characteristics.

      Use the substitution model recommendation system (see "Substitution Model Recommendation" section below).

    10. Educational Goals

    • Are you learning bioinformatics and would you like comprehensive explanations of each workflow step?
    • If yes: After completing each major workflow stage, offer to explain what the step accomplishes, why certain choices were made, and what best practices are being followed.
    • Store this preference to use throughout the workflow.

    Recommended Directory Structure

    Organize analyses with dedicated folders for each pipeline step:

    project_name/
    ├── logs/                          # All log files
    ├── 00_genomes/                    # Input genome assemblies
    ├── 01_busco_results/              # BUSCO/compleasm outputs
    ├── 02_qc/                         # Quality control reports
    ├── 03_extracted_orthologs/        # Extracted single-copy orthologs
    ├── 04_alignments/                 # Multiple sequence alignments
    ├── 05_trimmed/                    # Trimmed alignments
    ├── 06_concatenation/              # Supermatrix and partition files
    ├── 07_partition_search/           # Partition model selection
    ├── 08_concatenated_tree/          # Concatenated ML tree
    ├── 09_gene_trees/                 # Individual gene trees
    ├── 10_species_tree/               # ASTRAL species tree
    └── scripts/                       # All analysis scripts
    

    Benefits: Easy debugging, clear workflow progression, reproducibility, prevents root directory clutter.
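    The layout above can be created in one step. This is a sketch; `project_name` is a placeholder for the user's actual project directory:

```shell
# Sketch: create the recommended directory layout (project_name is a placeholder)
for d in logs 00_genomes 01_busco_results 02_qc 03_extracted_orthologs \
         04_alignments 05_trimmed 06_concatenation 07_partition_search \
         08_concatenated_tree 09_gene_trees 10_species_tree scripts; do
    mkdir -p "project_name/$d"
done
```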


    Template System

    This skill uses a template-based system to reduce token usage and improve maintainability. Script templates are stored in the templates/ directory and organized by computing environment.

    How to Use Templates

    When generating scripts for users:

    1. Read the appropriate template for their computing environment:

      Read("templates/slurm/02_compleasm_first.job")
      
    2. Replace placeholders with user-specific values:

      • TOTAL_THREADS → e.g., 64
      • THREADS_PER_JOB → e.g., 16
      • NUM_GENOMES → e.g., 20
      • NUM_LOCI → e.g., 2795
      • LINEAGE → e.g., insecta_odb10
      • MODEL_SET → e.g., LG,WAG,JTT,Q.pfam
    3. Present the customized script to the user with setup instructions
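    The placeholder-replacement step can be sketched with sed. The template content below is a stand-in, not the real 02_compleasm_first.job; the placeholder names are the ones listed above, and the substituted values are illustrative:

```shell
# Stand-in template containing two of the skill's placeholders (illustrative)
printf 'compleasm run -a "$GENOME" -l LINEAGE -t TOTAL_THREADS\n' > template.job

# Substitute user-specific values for each placeholder
TOTAL_THREADS=64
LINEAGE=insecta_odb10
sed -e "s/TOTAL_THREADS/${TOTAL_THREADS}/g" \
    -e "s/LINEAGE/${LINEAGE}/g" \
    template.job > 02_compleasm_first.job
```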

    Available Templates

    Key templates by workflow step:

    • Step 0 (setup): Environment setup script in references/REFERENCE.md
    • Step 2 (compleasm): 02_compleasm_first, 02_compleasm_parallel
    • Step 8A (partition search): 08a_partition_search
    • Step 8C (gene trees): 08c_gene_trees_array, 08c_gene_trees_parallel, 08c_gene_trees_serial

    See templates/README.md for complete template documentation.


    Substitution Model Recommendation

    When asked about substitution model selection (Question 9), use this systematic approach:

    Step 1: Fetch IQ-TREE Documentation

    Use WebFetch to retrieve current model information:

    WebFetch(url="https://iqtree.github.io/doc/Substitution-Models",
             prompt="Extract all amino acid substitution models with descriptions and usage guidelines")
    

    Step 2: Analyze Dataset Characteristics

    Consider these factors from user responses:

    • Taxonomic Scope: Species/genus (shallow) vs. family/order (moderate) vs. class/phylum+ (deep)
    • Number of Taxa: <20 (small), 20-50 (medium), >50 (large)
    • Evolutionary Rates: Fast-evolving, moderate, or slow-evolving
    • Sequence Type: Nuclear proteins, mitochondrial, or chloroplast

    Step 3: Recommend Models

    Provide 3-5 appropriate models based on dataset characteristics. For detailed model recommendation matrices and taxonomically-targeted models, see references/REFERENCE.md section "Substitution Model Recommendation".

    General recommendations:

    • Nuclear proteins (most common): LG, WAG, JTT, Q.pfam
    • Mitochondrial: mtREV, mtZOA, mtMAM, mtART, mtVer, mtInv
    • Chloroplast: cpREV
    • Taxonomically-targeted: Q.bird, Q.mammal, Q.insect, Q.plant, Q.yeast (when applicable)

    Step 4: Present Recommendations

    Format recommendations with justifications and explain how models will be used in IQ-TREE steps 8A and 8C.

    Step 5: Store Model Set

    Store the final comma-separated model list (e.g., "LG,WAG,JTT,Q.pfam") for use in Step 8 template placeholders.


    Workflow Implementation

    Once required information is gathered, guide the user through these steps. For each step, use templates where available and refer to references/REFERENCE.md for detailed implementation.

    STEP 0: Environment Setup

    ALWAYS start by generating a setup script for the user's environment.

    Use the unified conda environment setup script from references/REFERENCE.md (Section: "Software Installation Guide"). This creates a single conda environment with all necessary tools:

    • compleasm, MAFFT, trimming tools (trimAl, ClipKit, BMGE)
    • IQ-TREE, ASTRAL, Perl with BioPerl, GNU parallel
    • Downloads and installs Aliscore/ALICUT Perl scripts

    Key points:

    • Users choose between mamba (faster) or conda
    • Users choose between predownloaded Aliscore/ALICUT scripts (tested) or latest from GitHub
    • All subsequent steps use conda activate phylo (the unified environment)

    See references/REFERENCE.md for the complete setup script template.


    STEP 0A: Query NCBI for Assemblies (Optional)

    Use this step when: User wants to use NCBI data but doesn't have specific assembly accessions yet.

    This optional preliminary step helps users discover available genome assemblies by taxon name before proceeding with the main workflow.

    IMPORTANT: This step requires the NCBI datasets CLI tool (datasets) on the local machine where Claude is running. The query itself always runs locally, even when the resulting scripts target a remote HPC cluster, so datasets must be installed locally; without it, this step cannot be used.

    When to Offer This Step

    Offer this step when:

    • User wants to analyze genomes from NCBI
    • User doesn't have specific Assembly or BioProject accessions
    • User mentions a taxonomic group (e.g., "I want to build a phylogeny for beetles")
    • The datasets CLI tool is available locally (check with command -v datasets)

    If the datasets tool is not available locally, inform the user they can:

    1. Install it via conda: conda install -c conda-forge ncbi-datasets-cli
    2. Download it manually from https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/
    3. Use the NCBI website to manually search for assemblies

    Workflow

    1. Check for datasets CLI tool availability:

      command -v datasets >/dev/null 2>&1 && echo "datasets CLI found" || echo "datasets CLI not found"
      
    2. Ask for focal taxon: Request the taxonomic group of interest

      • Examples: "Coleoptera", "Drosophila", "Apis mellifera"
      • Can be at any taxonomic level (order, family, genus, species)
    3. Suggest quality filtering criteria: Before querying, ask the user about quality preferences to help select the best assemblies for phylogenomics:

      Assembly Quality Considerations:

      • Assembly Level: Chromosome-level assemblies are preferred over Scaffold over Contig

        • Chromosome: Most complete, best for phylogenomics
        • Scaffold: Medium quality, may have gaps between contigs
        • Contig: Lower quality, more fragmented
      • Contig N50: Higher N50 indicates better assembly contiguity

        • Excellent: N50 > 10 Mbp
        • Good: N50 > 1 Mbp
        • Poor: N50 < 100 kbp
      • Annotation Status: Annotated assemblies have gene predictions and BUSCO scores

        • Useful for assessing completeness
        • BUSCO > 95% is excellent for phylogenomics
      • RefSeq vs GenBank:

        • RefSeq (GCF_*): Curated, higher quality, recommended
        • GenBank (GCA_*): Submitted assemblies, variable quality

      Suggested Questions for Users:

      • "Do you prefer chromosome-level assemblies only, or are scaffold-level acceptable?"
      • "Should I filter for high-quality assemblies (e.g., contig N50 > 1 Mbp)?"
      • "Do you want to see RefSeq assemblies only (higher quality)?"
      • "Would you like to see quality metrics (N50, BUSCO scores) to help you decide?"
    4. Query NCBI using the script: Use scripts/query_ncbi_assemblies.py with appropriate filters based on user preferences:

      # Basic query with quality metrics
      python scripts/query_ncbi_assemblies.py --taxon "Coleoptera" --show-quality
      
      # Filter for chromosome-level assemblies
      python scripts/query_ncbi_assemblies.py --taxon "Drosophila" --assembly-level Chromosome --show-quality
      
      # Filter for high-quality assemblies (N50 > 1 Mbp)
      python scripts/query_ncbi_assemblies.py --taxon "Apis" --min-contig-n50 1000000 --show-quality
      
      # RefSeq annotated assemblies only (highest quality)
      python scripts/query_ncbi_assemblies.py --taxon "Felidae" --refseq-only --annotated --show-quality
      
      # Combined filters: Chromosome-level, high N50, annotated
      python scripts/query_ncbi_assemblies.py --taxon "Coleoptera" \
        --assembly-level Chromosome \
        --min-contig-n50 1000000 \
        --annotated \
        --show-quality \
        --save assembly_accessions.txt
      

      Available Filtering Options:

      • --assembly-level {Chromosome,Scaffold,Contig} - Filter by assembly level
      • --min-contig-n50 N - Minimum contig N50 in base pairs (e.g., 1000000 for 1 Mbp)
      • --annotated - Only return assemblies with gene annotations
      • --refseq-only - Only return RefSeq (GCF_*) assemblies
      • --show-quality - Display quality metrics (N50, BUSCO scores) in output
      • --max-results N - Maximum number of results (default: 20)
      • --save FILE - Save accessions to file for later download
    5. Present results to user: The script displays:

      • Assembly accession (GCA_* or GCF_*)
      • Organism name
      • Assembly level (Chromosome, Scaffold, Contig)
      • Assembly name
      • If --show-quality: Contig N50, BUSCO completeness %, Annotation status
      • Quality summary: Count of assemblies by level
    6. Help user select assemblies: Review results with user and recommend best choices:

      • Prioritize chromosome-level assemblies with high N50
      • Prefer RefSeq (GCF_*) over GenBank (GCA_*) when available
      • Look for BUSCO completeness > 95% when available
      • Aim for consistent quality across all taxa in the phylogeny
      • Consider phylogenetic breadth (species coverage)
    7. Collect selected accessions: Compile the list of chosen assembly accessions

    8. Proceed to STEP 1: Use the selected accessions with download_ncbi_genomes.py

    Tips for Assembly Selection

    • Assembly Level: Chromosome-level assemblies are most complete, followed by Scaffold, then Contig
    • RefSeq vs GenBank: RefSeq (GCF_*) assemblies undergo additional curation; GenBank (GCA_*) are submitter-provided
    • Taxonomic Sampling: For phylogenetics, aim for representative sampling across the taxonomic group
    • Quality over Quantity: Better to have 20 high-quality assemblies than 100 poor-quality ones

    STEP 1: Download NCBI Genomes (if applicable)

    If user provided NCBI accessions, use scripts/download_ncbi_genomes.py:

    For BioProjects:

    python scripts/download_ncbi_genomes.py --bioprojects PRJNA12345 -o genomes.zip
    unzip genomes.zip
    

    For Assembly Accessions:

    python scripts/download_ncbi_genomes.py --assemblies GCA_123456789.1 -o genomes.zip
    unzip genomes.zip
    

    IMPORTANT: After download, genomes must be renamed with meaningful sample names (format: [ACCESSION]_[SPECIES_NAME]). Sample names appear in final phylogenetic trees.

    Generate a script that:

    1. Finds all downloaded FASTA files in ncbi_dataset directory structure
    2. Moves/renames files to main genomes directory with meaningful names
    3. Includes any local genome files
    4. Creates final genome_list.txt with ALL genomes (local + downloaded)

    See references/REFERENCE.md section "Sample Naming Best Practices" for detailed guidelines.
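    The rename step can be sketched as below. The accession, species name, and mock file are illustrative stand-ins; real downloads unpack under ncbi_dataset/data/ with one subdirectory per accession:

```shell
# Illustrative mock of the structure 'unzip genomes.zip' produces
mkdir -p ncbi_dataset/data/GCA_000000000.1 00_genomes
touch ncbi_dataset/data/GCA_000000000.1/GCA_000000000.1_demo_genomic.fna

# Move each assembly into 00_genomes/ as [ACCESSION]_[SPECIES_NAME].fasta
for fna in ncbi_dataset/data/*/*.fna; do
    acc=$(basename "$(dirname "$fna")")
    mv "$fna" "00_genomes/${acc}_Species_name.fasta"   # substitute the real species
done

# Final list of all genomes (append any local genome files as well)
ls 00_genomes/*.fasta > genome_list.txt
```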


    STEP 2: Ortholog Identification with compleasm

    Activate the unified environment and run compleasm on all genomes to identify single-copy orthologs.

    Key considerations:

    • First genome must run alone to download lineage database
    • Remaining genomes can run in parallel
    • Thread allocation: Miniprot scales well up to ~16-32 threads per genome

    Threading guidelines: See references/REFERENCE.md for recommended thread allocation table.

    Generate scripts using templates:

    • SLURM: Read templates 02_compleasm_first.job and 02_compleasm_parallel.job
    • PBS: Read templates 02_compleasm_first.job and 02_compleasm_parallel.job
    • Local: Read templates 02_compleasm_first.sh and 02_compleasm_parallel.sh

    Replace placeholders: TOTAL_THREADS, THREADS_PER_JOB, NUM_GENOMES, LINEAGE

    For detailed implementation examples, see references/REFERENCE.md section "Ortholog Identification Implementation".
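    For reference, the core call the templates wrap looks roughly like this, written to a helper file rather than run. The lineage and thread count are illustrative placeholders; the -a/-o/-l/-t options are the ones this skill's validation step checks:

```shell
# Sketch of a per-genome compleasm runner; values are placeholders
cat > run_compleasm.sh <<'EOF'
#!/bin/bash
set -euo pipefail
genome="$1"
compleasm run -a "$genome" \
  -o 01_busco_results/"$(basename "$genome" .fasta)"_compleasm \
  -l insecta_odb10 \
  -t 16
EOF
chmod +x run_compleasm.sh
```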


    STEP 3: Quality Control

    After compleasm completes, generate QC report using scripts/generate_qc_report.sh:

    bash scripts/generate_qc_report.sh qc_report.csv
    

    Provide interpretation:

    • >95% complete: Excellent, retain
    • 90-95% complete: Good, retain
    • 85-90% complete: Acceptable, case-by-case
    • 70-85% complete: Questionable, consider excluding
    • <70% complete: Poor, recommend excluding

    See references/REFERENCE.md section "Quality Control Guidelines" for detailed assessment criteria.


    STEP 4: Ortholog Extraction

    Use scripts/extract_orthologs.sh to extract single-copy orthologs:

    bash scripts/extract_orthologs.sh LINEAGE_NAME
    

    This generates per-locus unaligned FASTA files in single_copy_orthologs/unaligned_aa/.


    STEP 5: Alignment with MAFFT

    Activate the unified environment (conda activate phylo) which contains MAFFT.

    Create locus list, then generate alignment scripts:

    cd single_copy_orthologs/unaligned_aa
    ls *.fas > locus_names.txt
    num_loci=$(wc -l < locus_names.txt)
    

    Generate scheduler-specific scripts:

    • SLURM/PBS: Array job with one task per locus
    • Local: Sequential processing or GNU parallel

    For detailed script templates, see references/REFERENCE.md section "Alignment Implementation".
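    As a sketch, the local GNU parallel variant might look like this (written to a file, not executed here; paths, job count, and thread count are illustrative, assuming the script runs from single_copy_orthologs/unaligned_aa):

```shell
# Sketch: align every locus listed in locus_names.txt with MAFFT via GNU parallel
cat > 05_align.sh <<'EOF'
#!/bin/bash
set -euo pipefail
mkdir -p ../../04_alignments
# {} is the input file, {/} its basename; 4 concurrent jobs x 4 threads each
parallel -j 4 'mafft --auto --thread 4 {} > ../../04_alignments/{/}' :::: locus_names.txt
EOF
```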


    STEP 6: Alignment Trimming

    Based on user's preference, provide appropriate trimming method. All tools are available in the unified conda environment.

    Options:

    • trimAl: Fast (-automated1), recommended for large datasets
    • ClipKit: Modern, fast (default smart-gap mode)
    • BMGE: Entropy-based (-t AA)
    • Aliscore/ALICUT: Traditional, thorough (recommended for phylogenomics)

    For Aliscore/ALICUT:

    • Perl scripts were installed in STEP 0
    • Use scripts/run_aliscore_alicut_batch.sh for batch processing
    • Or use array jobs with scripts/run_aliscore.sh and scripts/run_alicut.sh
    • Always use -N flag for amino acid sequences

    Generate scripts using scheduler-appropriate templates (array jobs for SLURM/PBS, parallel or serial for local).

    For detailed implementation of each trimming method, see references/REFERENCE.md section "Alignment Trimming Implementation".
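    For the trimAl option, a serial sketch (written to a file, not executed here; directory paths are illustrative, following the recommended layout):

```shell
# Sketch: trim every alignment with trimAl's automated heuristic
cat > 06_trim.sh <<'EOF'
#!/bin/bash
set -euo pipefail
mkdir -p 05_trimmed
for aln in 04_alignments/*.fas; do
    trimal -in "$aln" -out "05_trimmed/$(basename "$aln")" -automated1
done
EOF
```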


    STEP 7: Concatenation and Partition Definition

    Download FASconCAT-G (Perl script) and run concatenation:

    conda activate phylo  # Has Perl installed
    wget https://raw.githubusercontent.com/PatrickKueck/FASconCAT-G/master/FASconCAT-G_v1.06.1.pl -O FASconCAT-G.pl
    chmod +x FASconCAT-G.pl
    
    cd trimmed_aa
    perl ../FASconCAT-G.pl -s -i
    

    Convert to IQ-TREE format using scripts/convert_fasconcat_to_partition.py:

    python ../scripts/convert_fasconcat_to_partition.py FcC_info.xls partition_def.txt
    

    Outputs: FcC_supermatrix.fas, FcC_info.xls, partition_def.txt


    STEP 8: Phylogenetic Inference

    IQ-TREE is already installed in the unified environment. Activate with conda activate phylo.

    Part 8A: Partition Model Selection

    Use the substitution models selected during initial setup (Question 9).

    Generate script using templates:

    • Read appropriate template: templates/[slurm|pbs|local]/08a_partition_search.[job|sh]
    • Replace MODEL_SET placeholder with user's selected models (e.g., "LG,WAG,JTT,Q.pfam")

    For detailed implementation, see references/REFERENCE.md section "Partition Model Selection Implementation".

    Part 8B: Concatenated ML Tree

    Run IQ-TREE using the best partition scheme from Part 8A:

    iqtree -s FcC_supermatrix.fas -spp partition_search.best_scheme.nex \
      -nt 18 -safe -pre concatenated_ML_tree -bb 1000 -bnni
    

    Output: concatenated_ML_tree.treefile

    Part 8C: Individual Gene Trees

    Estimate gene trees for coalescent-based species tree inference.

    Generate scripts using templates:

    • SLURM/PBS: Read 08c_gene_trees_array.job template
    • Local: Read 08c_gene_trees_parallel.sh or 08c_gene_trees_serial.sh template
    • Replace NUM_LOCI placeholder

    For detailed implementation, see references/REFERENCE.md section "Gene Trees Implementation".
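    A per-locus gene-tree sketch, matching the MODEL_SET placeholder (written to a helper file, not run; the model set and thread count are illustrative):

```shell
# Sketch: one ML gene tree per trimmed alignment; -mset restricts model selection
cat > 08c_gene_tree.sh <<'EOF'
#!/bin/bash
set -euo pipefail
aln="$1"
iqtree -s "$aln" -m MFP -mset LG,WAG,JTT,Q.pfam -nt 2 -pre "${aln%.fas}"
EOF
chmod +x 08c_gene_tree.sh
```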

    Part 8D: ASTRAL Species Tree

    ASTRAL is already installed in the unified conda environment.

    conda activate phylo
    
    # Concatenate all gene trees
    cat trimmed_aa/*.treefile > all_gene_trees.tre
    
    # Run ASTRAL
    astral -i all_gene_trees.tre -o astral_species_tree.tre
    

    Output: astral_species_tree.tre


    STEP 9: Generate Methods Paragraph

    ALWAYS generate a methods paragraph to help users write their publication methods section.

    Create METHODS_PARAGRAPH.md file with:

    • Customized text based on tools and parameters used
    • Complete citations for all software
    • Placeholders for user-specific values (genome count, loci count, thresholds)
    • Instructions for adapting to journal requirements

    For the complete methods paragraph template, see references/REFERENCE.md section "Methods Paragraph Template".

    Pre-fill known values when possible:

    • Number of genomes
    • BUSCO lineage
    • Trimming method used
    • Substitution models tested

    Final Outputs Summary

    Provide users with a summary of outputs:

    Phylogenetic Results:

    1. concatenated_ML_tree.treefile - ML tree from concatenated supermatrix
    2. astral_species_tree.tre - Coalescent species tree
    3. *.treefile - Individual gene trees

    Data and Quality Control:

    4. qc_report.csv - Genome quality statistics
    5. FcC_supermatrix.fas - Concatenated alignment
    6. partition_search.best_scheme.nex - Selected partitioning scheme

    Publication Materials:

    7. METHODS_PARAGRAPH.md - Ready-to-use methods section with citations

    Visualization tools: FigTree, iTOL, ggtree (R), ete3/toytree (Python)


    Script Validation

    ALWAYS perform validation checks after generating scripts but before presenting them to the user. This ensures script accuracy, consistency, and proper resource allocation.

    Validation Workflow

    For each generated script, perform these validation checks in order:

    1. Program Option Verification

    Purpose: Detect hallucinated or incorrect command-line options that may cause scripts to fail.

    Procedure:

    1. Extract all command invocations from the generated script (e.g., compleasm run, iqtree -s, mafft --auto)
    2. Compare against reference sources:
      • First check: Compare against corresponding template in templates/ directory
      • Second check: Compare against examples in references/REFERENCE.md
      • Third check: If options differ significantly or are uncertain, perform web search for official documentation
    3. Common tools to validate:
      • compleasm run - Check -a, -o, -l, -t options
      • iqtree - Verify -s, -p, -m, -bb, -alrt, -nt, -safe options
      • mafft - Check --auto, --thread, --reorder options
      • astral - Verify -i, -o options
      • Trimming tools (trimal, clipkit, BMGE.jar) - Validate options

    Action on issues:

    • If incorrect options found: Inform user of the issue and ask if they want you to correct it
    • If uncertain: Ask user to verify with tool documentation before proceeding

    2. Pipeline Continuity Verification

    Purpose: Ensure outputs from one step correctly feed into inputs of subsequent steps.

    Procedure:

    1. Map input/output relationships:

      • Step 2 output (01_busco_results/*_compleasm/) → Step 3 input (QC script)
      • Step 4 output (single_copy_orthologs/unaligned_aa/) → Step 5 input (MAFFT)
      • Step 5 output (04_alignments/*.fas) → Step 6 input (trimming)
      • Step 6 output (05_trimmed/*.fas) → Step 7 input (FASconCAT-G)
      • Step 7 output (FcC_supermatrix.fas, partition file) → Step 8A input (IQ-TREE)
      • Step 8C output (*.treefile) → Step 8D input (ASTRAL)
    2. Check for consistency:

      • File path references match across scripts
      • Directory structure follows recommended layout
      • Glob patterns correctly match expected files
      • Required intermediate files are generated before being used

    Action on issues:

    • If path mismatches found: Inform user and ask if they want you to correct them
    • If directory structure inconsistent: Suggest corrections aligned with recommended structure

    3. Resource Compatibility Check

    Purpose: Ensure allocated computational resources are appropriate for the task.

    Procedure:

    1. Verify resource allocations against recommendations in references/REFERENCE.md:

      • Memory allocation: Check if memory per CPU (typically 6GB for compleasm, 2-4GB for others) is adequate
      • Thread allocation: Verify thread counts are reasonable for the number of genomes/loci
      • Walltime: Ensure walltime is sufficient based on dataset size guidelines
      • Parallelization: Check that threads per job × concurrent jobs ≤ total threads
    2. Common issues to check:

      • Compleasm: First job needs full thread allocation (downloads database)
      • IQ-TREE: -nt should match allocated CPUs
      • Gene trees: Ensure enough threads per tree × concurrent trees ≤ total available
      • Memory: Concatenated tree inference may need 8-16GB per CPU for large datasets
    3. Validate against user-specified constraints:

      • Total CPUs specified by user
      • Available memory per node
      • Maximum walltime limits
      • Scheduler-specific limits (if mentioned)

    Action on issues:

    • If resource allocation issues found: Inform user and suggest corrections with justification
    • If uncertain about adequacy: Ask user about typical job performance in their environment
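    The oversubscription constraint (threads per job × concurrent jobs ≤ total threads) can be checked mechanically; the values below are illustrative:

```shell
# Sanity check for the parallelization constraint (illustrative values)
TOTAL_THREADS=64
THREADS_PER_JOB=16
CONCURRENT_JOBS=4
if [ $((THREADS_PER_JOB * CONCURRENT_JOBS)) -le "$TOTAL_THREADS" ]; then
    echo "OK: ${CONCURRENT_JOBS} jobs x ${THREADS_PER_JOB} threads fit in ${TOTAL_THREADS}"
else
    echo "Oversubscribed: reduce concurrent jobs or threads per job" >&2
fi
```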

    Validation Reporting

    After completing all validation checks:

    1. If all checks pass: Inform user briefly: "Scripts validated successfully - options, pipeline flow, and resources verified."

    2. If issues found: Present a structured report:

      **Validation Results**
      
      ⚠️ Issues found during validation:
      
      1. [Issue category]: [Description]
         - Current: [What was generated]
         - Suggested: [Recommended fix]
         - Reason: [Why this is an issue]
      
      Would you like me to apply these corrections?
      
    3. Always ask before correcting: Never silently fix issues - always get user confirmation before applying changes.

    4. Document corrections: If corrections are applied, explain what was changed and why.


    Communication Guidelines

    • Always start with STEP 0: Generate the unified environment setup script
    • Always end with STEP 9: Generate the customized methods paragraph
    • Always validate scripts: Perform validation checks before presenting scripts to users
    • Use unified environment by default: All scripts should use conda activate phylo
    • Always ask about CPU allocation: Never auto-detect cores, always ask user
    • Recommend optimized workflows: For users with adequate resources, recommend optimized parallel approaches over simple serial approaches
    • Be clear and pedagogical: Explain why each step is necessary
    • Provide educational explanations when requested: If user answered yes to educational goals (question 10):
      • After completing each major workflow stage, ask: "Would you like me to explain this step?"
      • If yes, provide moderate-length explanation (1-2 paragraphs) covering:
        • What the step accomplishes biologically and computationally
        • Significant choices made and their rationale
        • Best practices being followed in the workflow
      • Examples of "major workflow stages": STEP 0 (setup), STEP 1 (download), STEP 2 (BUSCO), STEP 3 (QC), STEP 5 (alignment), STEP 6 (trimming), STEP 7 (concatenation), STEP 8 (phylogenetic inference)
    • Provide complete, ready-to-run scripts: Users should copy-paste and run
    • Adapt to user's environment: Always generate scheduler-specific scripts
    • Reference supporting files: Direct users to references/REFERENCE.md for details
    • Use helper scripts: Leverage provided scripts in scripts/ directory
    • Include error checking: Add file existence checks and informative error messages
    • Be encouraging: Phylogenomics is complex; maintain a supportive tone
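
    The error-checking guideline above can be sketched as a small reusable helper. The `require_file` function name and the file names are illustrative, not part of the bundled scripts:

    ```shell
    # Hypothetical helper for the "include error checking" guideline:
    # fail with an informative message if an expected output is missing or empty.
    require_file() {
      if [ ! -s "$1" ]; then
        echo "ERROR: expected file '$1' is missing or empty - did the previous step finish?" >&2
        return 1
      fi
    }

    echo ">seq1" > example.fasta        # stand-in for a real workflow output
    require_file example.fasta && echo "OK: example.fasta present"
    ```

    Calling a check like this after every major step turns silent failures into actionable messages.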

    Important Notes

    Mandatory Steps

    1. STEP 0 is mandatory: Always generate the environment setup script first
    2. STEP 9 is mandatory: Always generate the methods paragraph file at the end

    Template Usage (IMPORTANT!)

    1. Prefer templates over inline code: Use templates/ directory for major scripts
    2. Template workflow:
      • Read: Read("templates/slurm/02_compleasm_first.job")
      • Replace placeholders: TOTAL_THREADS, LINEAGE, NUM_GENOMES, MODEL_SET, etc.
      • Present customized script to user
    3. Available templates: See templates/README.md for complete list
    4. Benefits: Reduces token usage, easier maintenance, consistent structure
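
    Placeholder replacement can be done mechanically with `sed`. The sketch below uses a throwaway stand-in template; its body, the thread count, and the lineage value are hypothetical, not the bundled `02_compleasm_first.job`:

    ```shell
    # Write a tiny stand-in template using the same placeholder style
    cat > demo_template.job <<'EOF'
    #SBATCH --cpus-per-task=TOTAL_THREADS
    compleasm run -l LINEAGE -t TOTAL_THREADS
    EOF

    # Fill every placeholder before presenting the script to the user
    sed -e 's/TOTAL_THREADS/16/g' \
        -e 's/LINEAGE/endopterygota_odb10/g' \
        demo_template.job > 02_compleasm.job

    cat 02_compleasm.job
    ```

    After substitution, a quick grep for any leftover placeholder tokens is a cheap sanity check before handing the script to the user.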

    Script Generation

    1. Always adapt scripts to user's scheduler (SLURM/PBS/local)
    2. Replace all placeholders before presenting scripts
    3. Never auto-detect CPU cores: Always ask user to specify
    4. Provide parallelization options: For each parallelizable step, offer array job, parallel, and serial options
    5. Scheduler-specific configuration: For SLURM/PBS, always ask about account, partition, email, etc.
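
    A minimal sketch of how scheduler adaptation might branch; the resource values and the `PARTITION`/`QUEUE` placeholders are illustrative assumptions, to be replaced with the user's answers:

    ```shell
    scheduler=slurm    # would come from the user's answers to the initial questions
    case "$scheduler" in
      slurm)
        header='#SBATCH --cpus-per-task=8
    #SBATCH --time=04:00:00
    #SBATCH --partition=PARTITION'
        ;;
      pbs)
        header='#PBS -l ncpus=8
    #PBS -l walltime=04:00:00
    #PBS -q QUEUE'
        ;;
      *)
        header='# no scheduler: run directly with bash'
        ;;
    esac
    printf '%s\n' "$header"
    ```

    Keeping the scheduler directives in one variable makes it easy to prepend the right header to every generated script.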

    Parallelization Strategy

    1. Ask about preferences: Let user choose between throughput optimization vs. simplicity
    2. Compleasm optimization: For ≥2 genomes and ≥16 cores, recommend two-phase approach
    3. Use threading guidelines: Refer to references/REFERENCE.md for thread allocation recommendations
    4. Parallelizable steps: Steps 2 (compleasm), 5 (MAFFT), 6 (trimming), 8C (gene trees)
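
    The array-job option for these parallelizable steps typically maps one array task to one input file via the task ID. A sketch of that mapping follows; the file names are invented, and `SLURM_ARRAY_TASK_ID` is hard-coded only so the snippet runs outside SLURM, where the scheduler would set it:

    ```shell
    # In a real job this logic would sit under '#SBATCH --array=1-NUM_GENOMES'
    printf 'genomeA.fasta\ngenomeB.fasta\ngenomeC.fasta\n' > genome_list.txt

    SLURM_ARRAY_TASK_ID=2               # set by SLURM at runtime
    GENOME=$(sed -n "${SLURM_ARRAY_TASK_ID}p" genome_list.txt)

    echo "array task ${SLURM_ARRAY_TASK_ID} processes: $GENOME"
    # ...run compleasm/MAFFT/trimming on "$GENOME" here...
    ```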

    Substitution Model Selection

    1. Always recommend models: Use the systematic model recommendation process
    2. Fetch current documentation: Use WebFetch to get IQ-TREE model information
    3. Replace MODEL_SET placeholder: In Step 8A templates with comma-separated list
    4. Taxonomically-targeted models: Suggest Q.bird, Q.mammal, Q.insect, Q.plant when applicable
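
    Once a model set is chosen, the MODEL_SET placeholder becomes a comma-separated list that restricts ModelFinder's search via IQ-TREE's `-mset` flag. The specific models and resource values below are illustrative, and the command is echoed rather than run:

    ```shell
    MODEL_SET="Q.insect,LG,WAG,JTT"     # example recommendation for an insect dataset

    # Build (but do not execute) the inference command with the placeholder filled in
    CMD="iqtree2 -s concatenated.phy -m MFP -mset ${MODEL_SET} -B 1000 -T 8"
    echo "$CMD"
    ```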

    Reference Material

    1. Direct users to references/REFERENCE.md for:
      • Detailed implementation guides
      • BUSCO lineage datasets (complete list)
      • Resource recommendations (memory, CPUs, walltime tables)
      • Sample naming best practices
      • Quality control assessment criteria
      • Aliscore/ALICUT detailed guide and parameters
      • Tool citations with DOIs
      • Software installation instructions
      • Common issues and troubleshooting

    Attribution

    This skill was created by Bruno de Medeiros (Curator of Pollinating Insects, Field Museum) based on phylogenomics tutorials by Paul Frandsen (Brigham Young University).

    Workflow Entry Point

    When a user requests phylogeny generation:

    1. Gather required information using the "Initial User Questions" section
    2. Generate STEP 0 setup script from references/REFERENCE.md
    3. If user needs help finding NCBI assemblies, perform STEP 0A using query_ncbi_assemblies.py
    4. Proceed step-by-step through workflow (STEPS 1-8), using templates and referring to references/REFERENCE.md for detailed implementation
    5. All workflow scripts should use the unified conda environment (conda activate phylo)
    6. Validate all generated scripts before presenting to user (see "Script Validation" section)
    7. Generate STEP 9 methods paragraph from template in references/REFERENCE.md
    8. Provide final outputs summary
    Repository
    brunoasm/my_claude_skills