    brunoasm/busco-phylogeny
    Research

    About

    Generate phylogenies from genome assemblies using BUSCO/compleasm-based single-copy orthologs with scheduler-aware workflow generation

    SKILL.md

    BUSCO-based Phylogenomics Workflow Generator

    This skill provides phylogenomics expertise for generating comprehensive, scheduler-aware workflows for phylogenetic inference from genome assemblies using single-copy orthologs.

    Purpose

    This skill helps users generate phylogenies from genome assemblies by:

    1. Handling mixed input (local files and NCBI accessions)
    2. Creating scheduler-specific scripts (SLURM, PBS, cloud, local)
    3. Setting up complete workflows from raw genomes to final trees
    4. Providing quality control and recommendations
    5. Supporting flexible software management (bioconda, Docker, custom)

    Available Resources

    The skill provides access to these bundled resources:

    Scripts (scripts/)

    • query_ncbi_assemblies.py - Query NCBI for available genome assemblies by taxon name (new!)
    • download_ncbi_genomes.py - Download genomes from NCBI using BioProjects or Assembly accessions
    • rename_genomes.py - Rename genome files with meaningful sample names (important!)
    • generate_qc_report.sh - Generate quality control reports from compleasm results
    • extract_orthologs.sh - Extract and reorganize single-copy orthologs
    • run_aliscore.sh - Wrapper for Aliscore to identify randomly similar sequences (RSS)
    • run_alicut.sh - Wrapper for ALICUT to remove RSS positions from alignments
    • run_aliscore_alicut_batch.sh - Batch process all alignments through Aliscore + ALICUT
    • convert_fasconcat_to_partition.py - Convert FASconCAT output to IQ-TREE partition format
    • predownloaded_aliscore_alicut/ - Pre-tested Aliscore and ALICUT Perl scripts

    Templates (templates/)

    • slurm/ - SLURM job scheduler templates
    • pbs/ - PBS/Torque job scheduler templates
    • local/ - Local machine templates (with GNU parallel)
    • README.md - Complete template documentation

    References (references/)

    • REFERENCE.md - Detailed technical reference including:
      • Sample naming best practices
      • BUSCO lineage datasets (complete list)
      • Resource recommendations (memory, CPUs, walltime)
      • Detailed step-by-step implementation guides
      • Quality control guidelines
      • Aliscore/ALICUT detailed guide
      • Tool citations and download links
      • Software installation guide
      • Common issues and troubleshooting

    Workflow Overview

    The complete phylogenomics pipeline follows this sequence:

    Input Preparation → Ortholog Identification → Quality Control → Ortholog Extraction → Alignment → Trimming → Concatenation → Phylogenetic Inference

    Initial User Questions

    When a user requests phylogeny generation, gather the following information systematically:

    Step 1: Detect Computing Environment

    Before asking questions, attempt to detect the local computing environment:

    # Check for job schedulers and parallel tooling, reporting what is found
    command -v sbatch >/dev/null 2>&1 && echo "SLURM detected"        # SLURM
    command -v qsub >/dev/null 2>&1 && echo "PBS/Torque detected"     # PBS/Torque
    command -v parallel >/dev/null 2>&1 && echo "GNU parallel detected"
    

    Report findings to the user, then confirm: "I detected [X] on this machine. Will you be running the scripts here or on a different system?"

    Required Information

    Ask these questions to gather essential workflow parameters:

    1. Computing Environment

      • Where will these scripts run? (SLURM cluster, PBS/Torque cluster, Cloud computing, Local machine)
    2. Input Data

      • Local genome files, NCBI accessions, or both?
      • If NCBI: Do you already have Assembly accessions (GCA_*/GCF_*) or BioProject accessions (PRJNA*/PRJEB*/PRJDA*)?
      • If user doesn't have accessions: Offer to help find assemblies using query_ncbi_assemblies.py (see "STEP 0A: Query NCBI for Assemblies" below)
      • If local files: What are the file paths?
    3. Taxonomic Scope & Dataset Details

      • What taxonomic group? (determines BUSCO lineage dataset)
      • How many taxa/genomes will be analyzed?
      • What is the approximate phylogenetic breadth? (species-level, genus-level, family-level, order-level, etc.)
      • See references/REFERENCE.md for complete lineage list
    4. Environment Management

      • Use unified conda environment (default, recommended), or separate environments per tool?
    5. Resource Constraints

      • How many CPU cores/threads to use in total? (Ask user to specify, do not auto-detect)
      • Available memory (RAM) per node/machine?
      • Maximum walltime for jobs?
      • See references/REFERENCE.md for resource recommendations
    6. Parallelization Strategy

      Ask the user how they want to handle parallel processing:

      • For job schedulers (SLURM/PBS):

        • Use array jobs for parallel steps? (Recommended: Yes)
        • Which steps to parallelize? (Steps 2, 5, 6, 8C recommended)
      • For local machines:

        • Use GNU parallel for parallel steps? (requires parallel installed)
        • How many concurrent jobs?
      • For all systems:

        • Optimize for maximum throughput or simplicity?
    7. Scheduler-Specific Configuration (if using SLURM or PBS)

      • Account/Username for compute time charges
      • Partition/Queue to submit jobs to
      • Email notifications? (address and when: START, END, FAIL, ALL)
      • Job dependencies? (Recommended: Yes for linear workflow)
      • Output log directory? (Default: logs/)
    8. Alignment Trimming Preference

      • Aliscore/ALICUT (traditional, thorough), trimAl (fast), BMGE (entropy-based), or ClipKit (modern)?
    9. Substitution Model Selection (for IQ-TREE phylogenetic inference)

      Context needed: Taxonomic breadth, number of taxa, evolutionary rates

      Action: Fetch IQ-TREE model documentation and suggest appropriate amino acid substitution models based on dataset characteristics.

      Use the substitution model recommendation system (see "Substitution Model Recommendation" section below).

    10. Educational Goals

    • Are you learning bioinformatics and would you like comprehensive explanations of each workflow step?
    • If yes: After completing each major workflow stage, offer to explain what the step accomplishes, why certain choices were made, and what best practices are being followed.
    • Store this preference to use throughout the workflow.

    Recommended Directory Structure

    Organize analyses with dedicated folders for each pipeline step:

    project_name/
    ├── logs/                          # All log files
    ├── 00_genomes/                    # Input genome assemblies
    ├── 01_busco_results/              # BUSCO/compleasm outputs
    ├── 02_qc/                         # Quality control reports
    ├── 03_extracted_orthologs/        # Extracted single-copy orthologs
    ├── 04_alignments/                 # Multiple sequence alignments
    ├── 05_trimmed/                    # Trimmed alignments
    ├── 06_concatenation/              # Supermatrix and partition files
    ├── 07_partition_search/           # Partition model selection
    ├── 08_concatenated_tree/          # Concatenated ML tree
    ├── 09_gene_trees/                 # Individual gene trees
    ├── 10_species_tree/               # ASTRAL species tree
    └── scripts/                       # All analysis scripts
    

    Benefits: Easy debugging, clear workflow progression, reproducibility, prevents root directory clutter.
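    The layout above can be created in one step. This is a sketch; `project_name` is a placeholder for the user's actual project directory:

```shell
# Sketch: create the recommended directory layout (project_name is a placeholder)
for d in logs 00_genomes 01_busco_results 02_qc 03_extracted_orthologs \
         04_alignments 05_trimmed 06_concatenation 07_partition_search \
         08_concatenated_tree 09_gene_trees 10_species_tree scripts; do
    mkdir -p "project_name/$d"
done
```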


    Template System

    This skill uses a template-based system to reduce token usage and improve maintainability. Script templates are stored in the templates/ directory and organized by computing environment.

    How to Use Templates

    When generating scripts for users:

    1. Read the appropriate template for their computing environment:

      Read("templates/slurm/02_compleasm_first.job")
      
    2. Replace placeholders with user-specific values:

      • TOTAL_THREADS → e.g., 64
      • THREADS_PER_JOB → e.g., 16
      • NUM_GENOMES → e.g., 20
      • NUM_LOCI → e.g., 2795
      • LINEAGE → e.g., insecta_odb10
      • MODEL_SET → e.g., LG,WAG,JTT,Q.pfam
    3. Present the customized script to the user with setup instructions
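    The placeholder-replacement step can be sketched with sed. The template content below is a stand-in, not the real 02_compleasm_first.job; the placeholder names are the ones listed above, and the substituted values are illustrative:

```shell
# Stand-in template containing two of the skill's placeholders (illustrative)
printf 'compleasm run -a "$GENOME" -l LINEAGE -t TOTAL_THREADS\n' > template.job

# Substitute user-specific values for each placeholder
TOTAL_THREADS=64
LINEAGE=insecta_odb10
sed -e "s/TOTAL_THREADS/${TOTAL_THREADS}/g" \
    -e "s/LINEAGE/${LINEAGE}/g" \
    template.job > 02_compleasm_first.job
```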

    Available Templates

    Key templates by workflow step:

    • Step 0 (setup): Environment setup script in references/REFERENCE.md
    • Step 2 (compleasm): 02_compleasm_first, 02_compleasm_parallel
    • Step 8A (partition search): 08a_partition_search
    • Step 8C (gene trees): 08c_gene_trees_array, 08c_gene_trees_parallel, 08c_gene_trees_serial

    See templates/README.md for complete template documentation.


    Substitution Model Recommendation

    When asked about substitution model selection (Question 9), use this systematic approach:

    Step 1: Fetch IQ-TREE Documentation

    Use WebFetch to retrieve current model information:

    WebFetch(url="https://iqtree.github.io/doc/Substitution-Models",
             prompt="Extract all amino acid substitution models with descriptions and usage guidelines")
    

    Step 2: Analyze Dataset Characteristics

    Consider these factors from user responses:

    • Taxonomic Scope: Species/genus (shallow) vs. family/order (moderate) vs. class/phylum+ (deep)
    • Number of Taxa: <20 (small), 20-50 (medium), >50 (large)
    • Evolutionary Rates: Fast-evolving, moderate, or slow-evolving
    • Sequence Type: Nuclear proteins, mitochondrial, or chloroplast

    Step 3: Recommend Models

    Provide 3-5 appropriate models based on dataset characteristics. For detailed model recommendation matrices and taxonomically-targeted models, see references/REFERENCE.md section "Substitution Model Recommendation".

    General recommendations:

    • Nuclear proteins (most common): LG, WAG, JTT, Q.pfam
    • Mitochondrial: mtREV, mtZOA, mtMAM, mtART, mtVer, mtInv
    • Chloroplast: cpREV
    • Taxonomically-targeted: Q.bird, Q.mammal, Q.insect, Q.plant, Q.yeast (when applicable)

    Step 4: Present Recommendations

    Format recommendations with justifications and explain how models will be used in IQ-TREE steps 8A and 8C.

    Step 5: Store Model Set

    Store the final comma-separated model list (e.g., "LG,WAG,JTT,Q.pfam") for use in Step 8 template placeholders.


    Workflow Implementation

    Once required information is gathered, guide the user through these steps. For each step, use templates where available and refer to references/REFERENCE.md for detailed implementation.

    STEP 0: Environment Setup

    ALWAYS start by generating a setup script for the user's environment.

    Use the unified conda environment setup script from references/REFERENCE.md (Section: "Software Installation Guide"). This creates a single conda environment with all necessary tools:

    • compleasm, MAFFT, trimming tools (trimAl, ClipKit, BMGE)
    • IQ-TREE, ASTRAL, Perl with BioPerl, GNU parallel
    • Downloads and installs Aliscore/ALICUT Perl scripts

    Key points:

    • Users choose between mamba (faster) or conda
    • Users choose between predownloaded Aliscore/ALICUT scripts (tested) or latest from GitHub
    • All subsequent steps use conda activate phylo (the unified environment)

    See references/REFERENCE.md for the complete setup script template.


    STEP 0A: Query NCBI for Assemblies (Optional)

    Use this step when: User wants to use NCBI data but doesn't have specific assembly accessions yet.

    This optional preliminary step helps users discover available genome assemblies by taxon name before proceeding with the main workflow.

    IMPORTANT: This step requires the NCBI datasets CLI tool (datasets) on the local machine where Claude is running. The query itself always runs locally, even when the resulting scripts target a remote HPC cluster, so datasets must be installed locally; without it, this step cannot be used.

    When to Offer This Step

    Offer this step when:

    • User wants to analyze genomes from NCBI
    • User doesn't have specific Assembly or BioProject accessions
    • User mentions a taxonomic group (e.g., "I want to build a phylogeny for beetles")
    • The datasets CLI tool is available locally (check with command -v datasets)

    If the datasets tool is not available locally, inform the user they can:

    1. Install it via conda: conda install -c conda-forge ncbi-datasets-cli
    2. Download it manually from https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/
    3. Use the NCBI website to manually search for assemblies

    Workflow

    1. Check for datasets CLI tool availability:

      command -v datasets >/dev/null 2>&1 && echo "datasets CLI found" || echo "datasets CLI not found"
      
    2. Ask for focal taxon: Request the taxonomic group of interest

      • Examples: "Coleoptera", "Drosophila", "Apis mellifera"
      • Can be at any taxonomic level (order, family, genus, species)
    3. Suggest quality filtering criteria: Before querying, ask the user about quality preferences to help select the best assemblies for phylogenomics:

      Assembly Quality Considerations:

      • Assembly Level: Chromosome-level assemblies are preferred over Scaffold over Contig

        • Chromosome: Most complete, best for phylogenomics
        • Scaffold: Medium quality, may have gaps between contigs
        • Contig: Lower quality, more fragmented
      • Contig N50: Higher N50 indicates better assembly contiguity

        • Excellent: N50 > 10 Mbp
        • Good: N50 > 1 Mbp
        • Poor: N50 < 100 kbp
      • Annotation Status: Annotated assemblies have gene predictions and BUSCO scores

        • Useful for assessing completeness
        • BUSCO > 95% is excellent for phylogenomics
      • RefSeq vs GenBank:

        • RefSeq (GCF_*): Curated, higher quality, recommended
        • GenBank (GCA_*): Submitted assemblies, variable quality

      Suggested Questions for Users:

      • "Do you prefer chromosome-level assemblies only, or are scaffold-level acceptable?"
      • "Should I filter for high-quality assemblies (e.g., contig N50 > 1 Mbp)?"
      • "Do you want to see RefSeq assemblies only (higher quality)?"
      • "Would you like to see quality metrics (N50, BUSCO scores) to help you decide?"
    4. Query NCBI using the script: Use scripts/query_ncbi_assemblies.py with appropriate filters based on user preferences:

      # Basic query with quality metrics
      python scripts/query_ncbi_assemblies.py --taxon "Coleoptera" --show-quality
      
      # Filter for chromosome-level assemblies
      python scripts/query_ncbi_assemblies.py --taxon "Drosophila" --assembly-level Chromosome --show-quality
      
      # Filter for high-quality assemblies (N50 > 1 Mbp)
      python scripts/query_ncbi_assemblies.py --taxon "Apis" --min-contig-n50 1000000 --show-quality
      
      # RefSeq annotated assemblies only (highest quality)
      python scripts/query_ncbi_assemblies.py --taxon "Felidae" --refseq-only --annotated --show-quality
      
      # Combined filters: Chromosome-level, high N50, annotated
      python scripts/query_ncbi_assemblies.py --taxon "Coleoptera" \
        --assembly-level Chromosome \
        --min-contig-n50 1000000 \
        --annotated \
        --show-quality \
        --save assembly_accessions.txt
      

      Available Filtering Options:

      • --assembly-level {Chromosome,Scaffold,Contig} - Filter by assembly level
      • --min-contig-n50 N - Minimum contig N50 in base pairs (e.g., 1000000 for 1 Mbp)
      • --annotated - Only return assemblies with gene annotations
      • --refseq-only - Only return RefSeq (GCF_*) assemblies
      • --show-quality - Display quality metrics (N50, BUSCO scores) in output
      • --max-results N - Maximum number of results (default: 20)
      • --save FILE - Save accessions to file for later download
    5. Present results to user: The script displays:

      • Assembly accession (GCA_* or GCF_*)
      • Organism name
      • Assembly level (Chromosome, Scaffold, Contig)
      • Assembly name
      • If --show-quality: Contig N50, BUSCO completeness %, Annotation status
      • Quality summary: Count of assemblies by level
    6. Help user select assemblies: Review results with user and recommend best choices:

      • Prioritize chromosome-level assemblies with high N50
      • Prefer RefSeq (GCF_*) over GenBank (GCA_*) when available
      • Look for BUSCO completeness > 95% when available
      • Aim for consistent quality across all taxa in the phylogeny
      • Consider phylogenetic breadth (species coverage)
    7. Collect selected accessions: Compile the list of chosen assembly accessions

    8. Proceed to STEP 1: Use the selected accessions with download_ncbi_genomes.py

    Tips for Assembly Selection

    • Assembly Level: Chromosome-level assemblies are most complete, followed by Scaffold, then Contig
    • RefSeq vs GenBank: RefSeq (GCF_*) assemblies undergo additional curation; GenBank (GCA_*) are submitter-provided
    • Taxonomic Sampling: For phylogenetics, aim for representative sampling across the taxonomic group
    • Quality over Quantity: Better to have 20 high-quality assemblies than 100 poor-quality ones

    STEP 1: Download NCBI Genomes (if applicable)

    If user provided NCBI accessions, use scripts/download_ncbi_genomes.py:

    For BioProjects:

    python scripts/download_ncbi_genomes.py --bioprojects PRJNA12345 -o genomes.zip
    unzip genomes.zip
    

    For Assembly Accessions:

    python scripts/download_ncbi_genomes.py --assemblies GCA_123456789.1 -o genomes.zip
    unzip genomes.zip
    

    IMPORTANT: After download, genomes must be renamed with meaningful sample names (format: [ACCESSION]_[SPECIES_NAME]). Sample names appear in final phylogenetic trees.

    Generate a script that:

    1. Finds all downloaded FASTA files in ncbi_dataset directory structure
    2. Moves/renames files to main genomes directory with meaningful names
    3. Includes any local genome files
    4. Creates final genome_list.txt with ALL genomes (local + downloaded)

    See references/REFERENCE.md section "Sample Naming Best Practices" for detailed guidelines.
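    The rename step can be sketched as below. The accession, species name, and mock file are illustrative stand-ins; real downloads unpack under ncbi_dataset/data/ with one subdirectory per accession:

```shell
# Illustrative mock of the structure 'unzip genomes.zip' produces
mkdir -p ncbi_dataset/data/GCA_000000000.1 00_genomes
touch ncbi_dataset/data/GCA_000000000.1/GCA_000000000.1_demo_genomic.fna

# Move each assembly into 00_genomes/ as [ACCESSION]_[SPECIES_NAME].fasta
for fna in ncbi_dataset/data/*/*.fna; do
    acc=$(basename "$(dirname "$fna")")
    mv "$fna" "00_genomes/${acc}_Species_name.fasta"   # substitute the real species
done

# Final list of all genomes (append any local genome files as well)
ls 00_genomes/*.fasta > genome_list.txt
```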


    STEP 2: Ortholog Identification with compleasm

    Activate the unified environment and run compleasm on all genomes to identify single-copy orthologs.

    Key considerations:

    • First genome must run alone to download lineage database
    • Remaining genomes can run in parallel
    • Thread allocation: Miniprot scales well up to ~16-32 threads per genome

    Threading guidelines: See references/REFERENCE.md for recommended thread allocation table.

    Generate scripts using templates:

    • SLURM: Read templates 02_compleasm_first.job and 02_compleasm_parallel.job
    • PBS: Read templates 02_compleasm_first.job and 02_compleasm_parallel.job
    • Local: Read templates 02_compleasm_first.sh and 02_compleasm_parallel.sh

    Replace placeholders: TOTAL_THREADS, THREADS_PER_JOB, NUM_GENOMES, LINEAGE

    For detailed implementation examples, see references/REFERENCE.md section "Ortholog Identification Implementation".
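    For reference, the core call the templates wrap looks roughly like this, written to a helper file rather than run. The lineage and thread count are illustrative placeholders; the -a/-o/-l/-t options are the ones this skill's validation step checks:

```shell
# Sketch of a per-genome compleasm runner; values are placeholders
cat > run_compleasm.sh <<'EOF'
#!/bin/bash
set -euo pipefail
genome="$1"
compleasm run -a "$genome" \
  -o 01_busco_results/"$(basename "$genome" .fasta)"_compleasm \
  -l insecta_odb10 \
  -t 16
EOF
chmod +x run_compleasm.sh
```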


    STEP 3: Quality Control

    After compleasm completes, generate QC report using scripts/generate_qc_report.sh:

    bash scripts/generate_qc_report.sh qc_report.csv
    

    Provide interpretation:

    • >95% complete: Excellent, retain
    • 90-95% complete: Good, retain
    • 85-90% complete: Acceptable, case-by-case
    • 70-85% complete: Questionable, consider excluding
    • <70% complete: Poor, recommend excluding

    See references/REFERENCE.md section "Quality Control Guidelines" for detailed assessment criteria.


    STEP 4: Ortholog Extraction

    Use scripts/extract_orthologs.sh to extract single-copy orthologs:

    bash scripts/extract_orthologs.sh LINEAGE_NAME
    

    This generates per-locus unaligned FASTA files in single_copy_orthologs/unaligned_aa/.


    STEP 5: Alignment with MAFFT

    Activate the unified environment (conda activate phylo) which contains MAFFT.

    Create locus list, then generate alignment scripts:

    cd single_copy_orthologs/unaligned_aa
    ls *.fas > locus_names.txt
    num_loci=$(wc -l < locus_names.txt)
    

    Generate scheduler-specific scripts:

    • SLURM/PBS: Array job with one task per locus
    • Local: Sequential processing or GNU parallel

    For detailed script templates, see references/REFERENCE.md section "Alignment Implementation".
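    As a sketch, the local GNU parallel variant might look like this (written to a file, not executed here; paths, job count, and thread count are illustrative, assuming the script runs from single_copy_orthologs/unaligned_aa):

```shell
# Sketch: align every locus listed in locus_names.txt with MAFFT via GNU parallel
cat > 05_align.sh <<'EOF'
#!/bin/bash
set -euo pipefail
mkdir -p ../../04_alignments
# {} is the input file, {/} its basename; 4 concurrent jobs x 4 threads each
parallel -j 4 'mafft --auto --thread 4 {} > ../../04_alignments/{/}' :::: locus_names.txt
EOF
```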


    STEP 6: Alignment Trimming

    Based on user's preference, provide appropriate trimming method. All tools are available in the unified conda environment.

    Options:

    • trimAl: Fast (-automated1), recommended for large datasets
    • ClipKit: Modern, fast (default smart-gap mode)
    • BMGE: Entropy-based (-t AA)
    • Aliscore/ALICUT: Traditional, thorough (recommended for phylogenomics)

    For Aliscore/ALICUT:

    • Perl scripts were installed in STEP 0
    • Use scripts/run_aliscore_alicut_batch.sh for batch processing
    • Or use array jobs with scripts/run_aliscore.sh and scripts/run_alicut.sh
    • Always use -N flag for amino acid sequences

    Generate scripts using scheduler-appropriate templates (array jobs for SLURM/PBS, parallel or serial for local).

    For detailed implementation of each trimming method, see references/REFERENCE.md section "Alignment Trimming Implementation".
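    For the trimAl option, a serial sketch (written to a file, not executed here; directory paths are illustrative, following the recommended layout):

```shell
# Sketch: trim every alignment with trimAl's automated heuristic
cat > 06_trim.sh <<'EOF'
#!/bin/bash
set -euo pipefail
mkdir -p 05_trimmed
for aln in 04_alignments/*.fas; do
    trimal -in "$aln" -out "05_trimmed/$(basename "$aln")" -automated1
done
EOF
```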


    STEP 7: Concatenation and Partition Definition

    Download FASconCAT-G (Perl script) and run concatenation:

    conda activate phylo  # Has Perl installed
    wget https://raw.githubusercontent.com/PatrickKueck/FASconCAT-G/master/FASconCAT-G_v1.06.1.pl -O FASconCAT-G.pl
    chmod +x FASconCAT-G.pl
    
    cd trimmed_aa
    perl ../FASconCAT-G.pl -s -i
    

    Convert to IQ-TREE format using scripts/convert_fasconcat_to_partition.py:

    python ../scripts/convert_fasconcat_to_partition.py FcC_info.xls partition_def.txt
    

    Outputs: FcC_supermatrix.fas, FcC_info.xls, partition_def.txt


    STEP 8: Phylogenetic Inference

    IQ-TREE is already installed in the unified environment. Activate with conda activate phylo.

    Part 8A: Partition Model Selection

    Use the substitution models selected during initial setup (Question 9).

    Generate script using templates:

    • Read appropriate template: templates/[slurm|pbs|local]/08a_partition_search.[job|sh]
    • Replace MODEL_SET placeholder with user's selected models (e.g., "LG,WAG,JTT,Q.pfam")

    For detailed implementation, see references/REFERENCE.md section "Partition Model Selection Implementation".

    Part 8B: Concatenated ML Tree

    Run IQ-TREE using the best partition scheme from Part 8A:

    iqtree -s FcC_supermatrix.fas -spp partition_search.best_scheme.nex \
      -nt 18 -safe -pre concatenated_ML_tree -bb 1000 -bnni
    

    Output: concatenated_ML_tree.treefile

    Part 8C: Individual Gene Trees

    Estimate gene trees for coalescent-based species tree inference.

    Generate scripts using templates:

    • SLURM/PBS: Read 08c_gene_trees_array.job template
    • Local: Read 08c_gene_trees_parallel.sh or 08c_gene_trees_serial.sh template
    • Replace NUM_LOCI placeholder

    For detailed implementation, see references/REFERENCE.md section "Gene Trees Implementation".
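    A per-locus gene-tree sketch, matching the MODEL_SET placeholder (written to a helper file, not run; the model set and thread count are illustrative):

```shell
# Sketch: one ML gene tree per trimmed alignment; -mset restricts model selection
cat > 08c_gene_tree.sh <<'EOF'
#!/bin/bash
set -euo pipefail
aln="$1"
iqtree -s "$aln" -m MFP -mset LG,WAG,JTT,Q.pfam -nt 2 -pre "${aln%.fas}"
EOF
chmod +x 08c_gene_tree.sh
```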

    Part 8D: ASTRAL Species Tree

    ASTRAL is already installed in the unified conda environment.

    conda activate phylo
    
    # Concatenate all gene trees
    cat trimmed_aa/*.treefile > all_gene_trees.tre
    
    # Run ASTRAL
    astral -i all_gene_trees.tre -o astral_species_tree.tre
    

    Output: astral_species_tree.tre


    STEP 9: Generate Methods Paragraph

    ALWAYS generate a methods paragraph to help users write their publication methods section.

    Create METHODS_PARAGRAPH.md file with:

    • Customized text based on tools and parameters used
    • Complete citations for all software
    • Placeholders for user-specific values (genome count, loci count, thresholds)
    • Instructions for adapting to journal requirements

    For the complete methods paragraph template, see references/REFERENCE.md section "Methods Paragraph Template".

    Pre-fill known values when possible:

    • Number of genomes
    • BUSCO lineage
    • Trimming method used
    • Substitution models tested

    Final Outputs Summary

    Provide users with a summary of outputs:

    Phylogenetic Results:

    1. concatenated_ML_tree.treefile - ML tree from concatenated supermatrix
    2. astral_species_tree.tre - Coalescent species tree
    3. *.treefile - Individual gene trees

    Data and Quality Control:

    4. qc_report.csv - Genome quality statistics
    5. FcC_supermatrix.fas - Concatenated alignment
    6. partition_search.best_scheme.nex - Selected partitioning scheme

    Publication Materials:

    7. METHODS_PARAGRAPH.md - Ready-to-use methods section with citations

    Visualization tools: FigTree, iTOL, ggtree (R), ete3/toytree (Python)


    Script Validation

    ALWAYS perform validation checks after generating scripts but before presenting them to the user. This ensures script accuracy, consistency, and proper resource allocation.

    Validation Workflow

    For each generated script, perform these validation checks in order:

    1. Program Option Verification

    Purpose: Detect hallucinated or incorrect command-line options that may cause scripts to fail.

    Procedure:

    1. Extract all command invocations from the generated script (e.g., compleasm run, iqtree -s, mafft --auto)
    2. Compare against reference sources:
      • First check: Compare against corresponding template in templates/ directory
      • Second check: Compare against examples in references/REFERENCE.md
      • Third check: If options differ significantly or are uncertain, perform web search for official documentation
    3. Common tools to validate:
      • compleasm run - Check -a, -o, -l, -t options
      • iqtree - Verify -s, -p, -m, -bb, -alrt, -nt, -safe options
      • mafft - Check --auto, --thread, --reorder options
      • astral - Verify -i, -o options
      • Trimming tools (trimal, clipkit, BMGE.jar) - Validate options

    Action on issues:

    • If incorrect options found: Inform user of the issue and ask if they want you to correct it
    • If uncertain: Ask user to verify with tool documentation before proceeding

    2. Pipeline Continuity Verification

    Purpose: Ensure outputs from one step correctly feed into inputs of subsequent steps.

    Procedure:

    1. Map input/output relationships:

      • Step 2 output (01_busco_results/*_compleasm/) → Step 3 input (QC script)
      • Step 4 output (single_copy_orthologs/unaligned_aa/) → Step 5 input (MAFFT)
      • Step 5 output (04_alignments/*.fas) → Step 6 input (trimming)
      • Step 6 output (05_trimmed/*.fas) → Step 7 input (FASconCAT-G)
      • Step 7 output (FcC_supermatrix.fas, partition file) → Step 8A input (IQ-TREE)
      • Step 8C output (*.treefile) → Step 8D input (ASTRAL)
    2. Check for consistency:

      • File path references match across scripts
      • Directory structure follows recommended layout
      • Glob patterns correctly match expected files
      • Required intermediate files are generated before being used

    Action on issues:

    • If path mismatches found: Inform user and ask if they want you to correct them
    • If directory structure inconsistent: Suggest corrections aligned with recommended structure

    3. Resource Compatibility Check

    Purpose: Ensure allocated computational resources are appropriate for the task.

    Procedure:

    1. Verify resource allocations against recommendations in references/REFERENCE.md:

      • Memory allocation: Check if memory per CPU (typically 6GB for compleasm, 2-4GB for others) is adequate
      • Thread allocation: Verify thread counts are reasonable for the number of genomes/loci
      • Walltime: Ensure walltime is sufficient based on dataset size guidelines
      • Parallelization: Check that threads per job × concurrent jobs ≤ total threads
    2. Common issues to check:

      • Compleasm: First job needs full thread allocation (downloads database)
      • IQ-TREE: -nt should match allocated CPUs
      • Gene trees: Ensure enough threads per tree × concurrent trees ≤ total available
      • Memory: Concatenated tree inference may need 8-16GB per CPU for large datasets
    3. Validate against user-specified constraints:

      • Total CPUs specified by user
      • Available memory per node
      • Maximum walltime limits
      • Scheduler-specific limits (if mentioned)

    Action on issues:

    • If resource allocation issues found: Inform user and suggest corrections with justification
    • If uncertain about adequacy: Ask user about typical job performance in their environment
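    The oversubscription constraint (threads per job × concurrent jobs ≤ total threads) can be checked mechanically; the values below are illustrative:

```shell
# Sanity check for the parallelization constraint (illustrative values)
TOTAL_THREADS=64
THREADS_PER_JOB=16
CONCURRENT_JOBS=4
if [ $((THREADS_PER_JOB * CONCURRENT_JOBS)) -le "$TOTAL_THREADS" ]; then
    echo "OK: ${CONCURRENT_JOBS} jobs x ${THREADS_PER_JOB} threads fit in ${TOTAL_THREADS}"
else
    echo "Oversubscribed: reduce concurrent jobs or threads per job" >&2
fi
```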

    Validation Reporting

    After completing all validation checks:

    1. If all checks pass: Inform user briefly: "Scripts validated successfully - options, pipeline flow, and resources verified."

    2. If issues found: Present a structured report:

      **Validation Results**
      
      ⚠️ Issues found during validation:
      
      1. [Issue category]: [Description]
         - Current: [What was generated]
         - Suggested: [Recommended fix]
         - Reason: [Why this is an issue]
      
      Would you like me to apply these corrections?
      
    3. Always ask before correcting: Never silently fix issues - always get user confirmation before applying changes.

    4. Document corrections: If corrections are applied, explain what was changed and why.


    Communication Guidelines

    • Always start with STEP 0: Generate the unified environment setup script
    • Always end with STEP 9: Generate the customized methods paragraph
    • Always validate scripts: Perform validation checks before presenting scripts to users
    • Use unified environment by default: All scripts should use conda activate phylo
    • Always ask about CPU allocation: Never auto-detect cores, always ask user
    • Recommend optimized workflows: For users with adequate resources, recommend optimized parallel approaches over simple serial approaches
    • Be clear and pedagogical: Explain why each step is necessary
    • Provide educational explanations when requested: If user answered yes to educational goals (question 10):
      • After completing each major workflow stage, ask: "Would you like me to explain this step?"
      • If yes, provide moderate-length explanation (1-2 paragraphs) covering:
        • What the step accomplishes biologically and computationally
        • Significant choices made and their rationale
        • Best practices being followed in the workflow
      • Examples of "major workflow stages": STEP 0 (setup), STEP 1 (download), STEP 2 (BUSCO), STEP 3 (QC), STEP 5 (alignment), STEP 6 (trimming), STEP 7 (concatenation), STEP 8 (phylogenetic inference)
    • Provide complete, ready-to-run scripts: Users should copy-paste and run
    • Adapt to user's environment: Always generate scheduler-specific scripts
    • Reference supporting files: Direct users to references/REFERENCE.md for details
    • Use helper scripts: Leverage provided scripts in scripts/ directory
    • Include error checking: Add file existence checks and informative error messages
    • Be encouraging: Phylogenomics is complex; maintain a supportive tone
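
    The error-checking guideline above can be sketched as a small reusable helper. The `require_file` function name and the file names are illustrative, not part of the bundled scripts:

    ```shell
    # Hypothetical helper for the "include error checking" guideline:
    # fail with an informative message if an expected output is missing or empty.
    require_file() {
      if [ ! -s "$1" ]; then
        echo "ERROR: expected file '$1' is missing or empty - did the previous step finish?" >&2
        return 1
      fi
    }

    echo ">seq1" > example.fasta        # stand-in for a real workflow output
    require_file example.fasta && echo "OK: example.fasta present"
    ```

    Calling a check like this after every major step turns silent failures into actionable messages.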

    Important Notes

    Mandatory Steps

    1. STEP 0 is mandatory: Always generate the environment setup script first
    2. STEP 9 is mandatory: Always generate the methods paragraph file at the end

    Template Usage (IMPORTANT!)

    1. Prefer templates over inline code: Use templates/ directory for major scripts
    2. Template workflow:
      • Read: Read("templates/slurm/02_compleasm_first.job")
      • Replace placeholders: TOTAL_THREADS, LINEAGE, NUM_GENOMES, MODEL_SET, etc.
      • Present customized script to user
    3. Available templates: See templates/README.md for complete list
    4. Benefits: Reduces token usage, easier maintenance, consistent structure
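
    Placeholder replacement can be done mechanically with `sed`. The sketch below uses a throwaway stand-in template; its body, the thread count, and the lineage value are hypothetical, not the bundled `02_compleasm_first.job`:

    ```shell
    # Write a tiny stand-in template using the same placeholder style
    cat > demo_template.job <<'EOF'
    #SBATCH --cpus-per-task=TOTAL_THREADS
    compleasm run -l LINEAGE -t TOTAL_THREADS
    EOF

    # Fill every placeholder before presenting the script to the user
    sed -e 's/TOTAL_THREADS/16/g' \
        -e 's/LINEAGE/endopterygota_odb10/g' \
        demo_template.job > 02_compleasm.job

    cat 02_compleasm.job
    ```

    After substitution, a quick grep for any leftover placeholder tokens is a cheap sanity check before handing the script to the user.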

    Script Generation

    1. Always adapt scripts to user's scheduler (SLURM/PBS/local)
    2. Replace all placeholders before presenting scripts
    3. Never auto-detect CPU cores: Always ask user to specify
    4. Provide parallelization options: For each parallelizable step, offer array job, parallel, and serial options
    5. Scheduler-specific configuration: For SLURM/PBS, always ask about account, partition, email, etc.
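
    A minimal sketch of how scheduler adaptation might branch; the resource values and the `PARTITION`/`QUEUE` placeholders are illustrative assumptions, to be replaced with the user's answers:

    ```shell
    scheduler=slurm    # would come from the user's answers to the initial questions
    case "$scheduler" in
      slurm)
        header='#SBATCH --cpus-per-task=8
    #SBATCH --time=04:00:00
    #SBATCH --partition=PARTITION'
        ;;
      pbs)
        header='#PBS -l ncpus=8
    #PBS -l walltime=04:00:00
    #PBS -q QUEUE'
        ;;
      *)
        header='# no scheduler: run directly with bash'
        ;;
    esac
    printf '%s\n' "$header"
    ```

    Keeping the scheduler directives in one variable makes it easy to prepend the right header to every generated script.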

    Parallelization Strategy

    1. Ask about preferences: Let user choose between throughput optimization vs. simplicity
    2. Compleasm optimization: For ≥2 genomes and ≥16 cores, recommend two-phase approach
    3. Use threading guidelines: Refer to references/REFERENCE.md for thread allocation recommendations
    4. Parallelizable steps: Steps 2 (compleasm), 5 (MAFFT), 6 (trimming), 8C (gene trees)
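
    The array-job option for these parallelizable steps typically maps one array task to one input file via the task ID. A sketch of that mapping follows; the file names are invented, and `SLURM_ARRAY_TASK_ID` is hard-coded only so the snippet runs outside SLURM, where the scheduler would set it:

    ```shell
    # In a real job this logic would sit under '#SBATCH --array=1-NUM_GENOMES'
    printf 'genomeA.fasta\ngenomeB.fasta\ngenomeC.fasta\n' > genome_list.txt

    SLURM_ARRAY_TASK_ID=2               # set by SLURM at runtime
    GENOME=$(sed -n "${SLURM_ARRAY_TASK_ID}p" genome_list.txt)

    echo "array task ${SLURM_ARRAY_TASK_ID} processes: $GENOME"
    # ...run compleasm/MAFFT/trimming on "$GENOME" here...
    ```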

    Substitution Model Selection

    1. Always recommend models: Use the systematic model recommendation process
    2. Fetch current documentation: Use WebFetch to get IQ-TREE model information
    3. Replace MODEL_SET placeholder: In Step 8A templates with comma-separated list
    4. Taxonomically-targeted models: Suggest Q.bird, Q.mammal, Q.insect, Q.plant when applicable
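
    Once a model set is chosen, the MODEL_SET placeholder becomes a comma-separated list that restricts ModelFinder's search via IQ-TREE's `-mset` flag. The specific models and resource values below are illustrative, and the command is echoed rather than run:

    ```shell
    MODEL_SET="Q.insect,LG,WAG,JTT"     # example recommendation for an insect dataset

    # Build (but do not execute) the inference command with the placeholder filled in
    CMD="iqtree2 -s concatenated.phy -m MFP -mset ${MODEL_SET} -B 1000 -T 8"
    echo "$CMD"
    ```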

    Reference Material

    1. Direct users to references/REFERENCE.md for:
      • Detailed implementation guides
      • BUSCO lineage datasets (complete list)
      • Resource recommendations (memory, CPUs, walltime tables)
      • Sample naming best practices
      • Quality control assessment criteria
      • Aliscore/ALICUT detailed guide and parameters
      • Tool citations with DOIs
      • Software installation instructions
      • Common issues and troubleshooting

    Attribution

    This skill was created by Bruno de Medeiros (Curator of Pollinating Insects, Field Museum) based on phylogenomics tutorials by Paul Frandsen (Brigham Young University).

    Workflow Entry Point

    When a user requests phylogeny generation:

    1. Gather required information using the "Initial User Questions" section
    2. Generate STEP 0 setup script from references/REFERENCE.md
    3. If user needs help finding NCBI assemblies, perform STEP 0A using query_ncbi_assemblies.py
    4. Proceed step-by-step through workflow (STEPS 1-8), using templates and referring to references/REFERENCE.md for detailed implementation
    5. All workflow scripts should use the unified conda environment (conda activate phylo)
    6. Validate all generated scripts before presenting to user (see "Script Validation" section)
    7. Generate STEP 9 methods paragraph from template in references/REFERENCE.md
    8. Provide final outputs summary
    Repository
    brunoasm/my_claude_skills