    lifangda/scikit-bio
    About

    Biological data toolkit. Sequence analysis, alignments, phylogenetic trees, diversity metrics (alpha/beta, UniFrac), ordination (PCoA), PERMANOVA, FASTA/Newick I/O, for microbiome analysis.

    SKILL.md

    scikit-bio

    Overview

    scikit-bio is a comprehensive Python library for working with biological data. Apply this skill for bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics.

    When to Use This Skill

    This skill should be used when the user:

    • Works with biological sequences (DNA, RNA, protein)
    • Needs to read/write biological file formats (FASTA, FASTQ, GenBank, Newick, BIOM, etc.)
    • Performs sequence alignments or searches for motifs
    • Constructs or analyzes phylogenetic trees
    • Calculates diversity metrics (alpha/beta diversity, UniFrac distances)
    • Performs ordination analysis (PCoA, CCA, RDA)
    • Runs statistical tests on biological/ecological data (PERMANOVA, ANOSIM, Mantel)
    • Analyzes microbiome or community ecology data
    • Works with protein embeddings from language models
    • Needs to manipulate biological data tables

    Core Capabilities

    1. Sequence Manipulation

    Work with biological sequences using specialized classes for DNA, RNA, and protein data.

    Key operations:

    • Read/write sequences from FASTA, FASTQ, GenBank, EMBL formats
    • Sequence slicing, concatenation, and searching
    • Reverse complement, transcription (DNA→RNA), and translation (RNA→protein)
    • Find motifs and patterns using regex
    • Calculate distances (Hamming, k-mer based)
    • Handle sequence quality scores and metadata

    Common patterns:

    import skbio
    
    # Read sequences from file
    seq = skbio.DNA.read('input.fasta')
    
    # Sequence operations
    rc = seq.reverse_complement()
    rna = seq.transcribe()
    protein = rna.translate()
    
    # Find motifs
    motif_positions = seq.find_with_regex('ATG[ACGT]{3}')
    
    # Check for properties
    has_degens = seq.has_degenerates()
    seq_no_gaps = seq.degap()
    

    Important notes:

    • Use DNA, RNA, Protein classes for grammared sequences with validation
    • Use Sequence class for generic sequences without alphabet restrictions
    • Quality scores automatically loaded from FASTQ files into positional metadata
    • Metadata types: sequence-level (ID, description), positional (per-base), interval (regions/features)
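
    To make the operations above concrete, here is a library-free sketch of what reverse complement, transcription, and a Hamming-style distance compute on plain strings. The helper names are hypothetical and this is not scikit-bio's implementation; the scikit-bio classes additionally validate the alphabet and carry metadata.

```python
# Illustrative helpers on plain strings; names are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """DNA to RNA: replace thymine (T) with uracil (U)."""
    return dna.replace("T", "U")


def hamming_fraction(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    if not a:
        return 0.0  # convention chosen for this sketch only
    return sum(x != y for x, y in zip(a, b)) / len(a)
```

    For example, `reverse_complement("ATGC")` gives `"GCAT"`, and `hamming_fraction("ACGT", "ACGA")` gives `0.25` because one of four sites differs.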

    2. Sequence Alignment

    Perform pairwise and multiple sequence alignments using dynamic programming algorithms.

    Key capabilities:

    • Global alignment (Needleman-Wunsch with semi-global variant)
    • Local alignment (Smith-Waterman)
    • Configurable scoring schemes (match/mismatch, gap penalties, substitution matrices)
    • CIGAR string conversion
    • Multiple sequence alignment storage and manipulation with TabularMSA

    Common patterns:

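    import skbio
    from skbio.alignment import local_pairwise_align_ssw, TabularMSA
    
    # Pairwise local alignment; the call returns a tuple, not a single object
    # (the SSW wrapper is not available in every scikit-bio release)
    msa, score, start_end_positions = local_pairwise_align_ssw(seq1, seq2)
    
    # Access aligned sequences (the two rows of the returned TabularMSA)
    aligned_seq1, aligned_seq2 = msa
    
    # Read multiple alignment from file
    msa = TabularMSA.read('alignment.fasta', constructor=skbio.DNA)
    
    # Calculate consensus
    consensus = msa.consensus()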
    

    Important notes:

    • Use local_pairwise_align_ssw for local alignments (faster, SSW-based)
    • Use StripedSmithWaterman for protein alignments
    • Affine gap penalties recommended for biological sequences
    • Can convert between scikit-bio, BioPython, and Biotite alignment formats
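
    As a rough illustration of the dynamic programming behind local alignment, the sketch below computes a Smith-Waterman score in pure Python. It assumes a simple match/mismatch scheme and a linear gap penalty (not the affine penalties recommended above), and the function name and default scores are hypothetical. It returns only the best score, not the aligned sequences.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for ca in a:
        curr = [0]  # first column is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            cell = max(0, diag, up, left)  # 0 lets an alignment restart
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

    Keeping only two rows of the matrix is enough for the score; recovering the alignment itself would need the full matrix and a traceback.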

    3. Phylogenetic Trees

    Construct, manipulate, and analyze phylogenetic trees representing evolutionary relationships.

    Key capabilities:

    • Tree construction from distance matrices (UPGMA, WPGMA, Neighbor Joining, GME, BME)
    • Tree manipulation (pruning, rerooting, traversal)
    • Distance calculations (patristic, cophenetic, Robinson-Foulds)
    • ASCII visualization
    • Newick format I/O

    Common patterns:

    from skbio import TreeNode
    from skbio.tree import nj
    
    # Read tree from file
    tree = TreeNode.read('tree.nwk')
    
    # Construct tree from distance matrix
    tree = nj(distance_matrix)
    
    # Tree operations
    subtree = tree.shear(['taxon1', 'taxon2', 'taxon3'])
    tips = [node for node in tree.tips()]
    lca = tree.lowest_common_ancestor(['taxon1', 'taxon2'])
    
    # Calculate distances
    patristic_dist = tree.find('taxon1').distance(tree.find('taxon2'))
    tip_distances = tree.tip_tip_distances()  # DistanceMatrix of tip-to-tip (cophenetic) distances
    
    # Compare trees (Robinson-Foulds distance)
    rf_distance = tree.compare_rfd(other_tree)
    

    Important notes:

    • Use nj() for neighbor joining (classic phylogenetic method)
    • Use upgma() for UPGMA (assumes molecular clock)
    • GME and BME are highly scalable for large trees
    • Trees can be rooted or unrooted; some metrics require specific rooting
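
    The patristic distance used above is the sum of branch lengths on the path between two tips. A minimal sketch on a toy tree, stored as a hypothetical child-to-parent dictionary rather than a TreeNode, shows the idea:

```python
# Toy rooted tree: each node maps to (parent, branch length to parent).
# Newick equivalent: ((A:1.0,B:2.0)X:0.5,C:3.0)root;
TREE = {
    "A": ("X", 1.0),
    "B": ("X", 2.0),
    "X": ("root", 0.5),
    "C": ("root", 3.0),
    "root": (None, 0.0),
}


def path_to_root(tree, node):
    """Return {ancestor: distance from node}, including the node itself."""
    dist, path = 0.0, {}
    while node is not None:
        path[node] = dist
        parent, length = tree[node]
        dist += length
        node = parent
    return path


def patristic_distance(tree, tip1, tip2):
    """Sum of branch lengths on the path joining two tips."""
    up1 = path_to_root(tree, tip1)
    up2 = path_to_root(tree, tip2)
    # The lowest common ancestor minimises the combined distance.
    return min(up1[n] + up2[n] for n in up1 if n in up2)
```

    Here `patristic_distance(TREE, "A", "B")` is `3.0` (via X) and `patristic_distance(TREE, "A", "C")` is `4.5` (via the root).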

    4. Diversity Analysis

    Calculate alpha and beta diversity metrics for microbial ecology and community analysis.

    Key capabilities:

    • Alpha diversity: richness, Shannon entropy, Simpson index, Faith's PD, Pielou's evenness
    • Beta diversity: Bray-Curtis, Jaccard, weighted/unweighted UniFrac, Euclidean distances
    • Phylogenetic diversity metrics (require tree input)
    • Rarefaction and subsampling
    • Integration with ordination and statistical tests

    Common patterns:

    from skbio.diversity import alpha_diversity, beta_diversity
    import skbio
    
    # Alpha diversity
    alpha = alpha_diversity('shannon', counts_matrix, ids=sample_ids)
    faith_pd = alpha_diversity('faith_pd', counts_matrix, ids=sample_ids,
                              tree=tree, otu_ids=feature_ids)
    
    # Beta diversity
    bc_dm = beta_diversity('braycurtis', counts_matrix, ids=sample_ids)
    unifrac_dm = beta_diversity('unweighted_unifrac', counts_matrix,
                               ids=sample_ids, tree=tree, otu_ids=feature_ids)
    
    # Get available metrics
    from skbio.diversity import get_alpha_diversity_metrics
    print(get_alpha_diversity_metrics())
    

    Important notes:

    • Counts must be integers representing abundances, not relative frequencies
    • Phylogenetic metrics (Faith's PD, UniFrac) require tree and OTU ID mapping
    • Use partial_beta_diversity() for computing specific sample pairs only
    • Alpha diversity returns Series, beta diversity returns DistanceMatrix
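
    For intuition about what two of these metrics measure, here is a pure-Python sketch of Shannon entropy and Bray-Curtis dissimilarity on raw count vectors. The function names are hypothetical and the log base is an assumption (base 2 here; check the default of the library version you use).

```python
import math


def shannon(counts, base=2):
    """Shannon entropy H = -sum(p_i * log(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in props)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den
```

    Two equally abundant taxa give one bit of entropy, and two samples that share no taxa have a Bray-Curtis dissimilarity of 1.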

    5. Ordination Methods

    Reduce high-dimensional biological data to visualizable lower-dimensional spaces.

    Key capabilities:

    • PCoA (Principal Coordinate Analysis) from distance matrices
    • CA (Correspondence Analysis) for contingency tables
    • CCA (Canonical Correspondence Analysis) with environmental constraints
    • RDA (Redundancy Analysis) for linear relationships
    • Biplot projection for feature interpretation

    Common patterns:

    from skbio.stats.ordination import pcoa, cca, OrdinationResults
    
    # PCoA from distance matrix
    pcoa_results = pcoa(distance_matrix)
    pc1 = pcoa_results.samples['PC1']
    pc2 = pcoa_results.samples['PC2']
    
    # CCA with environmental variables
    cca_results = cca(species_matrix, environmental_matrix)
    
    # Save/load ordination results
    pcoa_results.write('ordination.txt')
    results = OrdinationResults.read('ordination.txt')
    

    Important notes:

    • PCoA works with any distance/dissimilarity matrix
    • CCA reveals environmental drivers of community composition
    • Ordination results include eigenvalues, proportion explained, and sample/feature coordinates
    • Results integrate with plotting libraries (matplotlib, seaborn, plotly)
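
    Classical PCoA first double-centres the squared distances and then eigendecomposes the result. The sketch below shows only the centring step in pure Python, with a hypothetical function name; the eigendecomposition is omitted, and library implementations use vectorised arrays instead of nested lists.

```python
def gower_center(dm):
    """Double-centre squared distances: B = -1/2 * J * D^2 * J."""
    n = len(dm)
    a = [[-0.5 * dm[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    # dm is symmetric, so column means equal row means.
    return [
        [a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]
```

    Every row of the centred matrix sums to zero, and its trace equals the total variance that the principal coordinates then partition.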

    6. Statistical Testing

    Perform hypothesis tests specific to ecological and biological data.

    Key capabilities:

    • PERMANOVA: test group differences using distance matrices
    • ANOSIM: alternative test for group differences
    • PERMDISP: test homogeneity of group dispersions
    • Mantel test: correlation between distance matrices
    • Bioenv: find environmental variables correlated with distances

    Common patterns:

    from skbio.stats.distance import permanova, anosim, mantel
    
    # Test if groups differ significantly
    permanova_results = permanova(distance_matrix, grouping, permutations=999)
    print(f"p-value: {permanova_results['p-value']}")
    
    # ANOSIM test
    anosim_results = anosim(distance_matrix, grouping, permutations=999)
    
    # Mantel test between two distance matrices
    mantel_results = mantel(dm1, dm2, method='pearson', permutations=999)
    print(f"Correlation: {mantel_results[0]}, p-value: {mantel_results[1]}")
    

    Important notes:

    • Permutation tests provide non-parametric significance testing
    • Use 999+ permutations for robust p-values
    • PERMANOVA sensitive to dispersion differences; pair with PERMDISP
    • Mantel tests assess matrix correlation (e.g., geographic vs genetic distance)
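
    The permutation logic behind PERMANOVA can be sketched in a few lines of pure Python: compute a pseudo-F statistic from within-group and total sums of squared distances, then recompute it under shuffled group labels. The function names, the nested-list input, and the fixed seed are assumptions made for illustration; this is not the library's code.

```python
import random


def pseudo_f(dm, groups):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    ss_total = sum(dm[i][j] ** 2 for i, j in pairs) / n
    ss_within = 0.0
    for g in labels:
        size = groups.count(g)
        ss_within += sum(
            dm[i][j] ** 2 for i, j in pairs if groups[i] == g and groups[j] == g
        ) / size
    ss_among = ss_total - ss_within
    return (ss_among / (len(labels) - 1)) / (ss_within / (n - len(labels)))


def permanova_sketch(dm, groups, permutations=999, seed=0):
    """Return (pseudo-F, permutation p-value) by shuffling group labels."""
    rng = random.Random(seed)
    observed = pseudo_f(dm, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dm, shuffled) >= observed:
            hits += 1
    # Adding 1 to both counts treats the observed labelling as one permutation.
    return observed, (hits + 1) / (permutations + 1)
```

    Because the p-value is `(hits + 1) / (permutations + 1)`, the smallest value it can take with 999 permutations is 0.001, which is one reason to use 999 or more.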

    7. File I/O and Format Conversion

    Read and write 19+ biological file formats with automatic format detection.

    Supported formats:

    • Sequences: FASTA, FASTQ, GenBank, EMBL, QSeq
    • Alignments: Clustal, PHYLIP, Stockholm
    • Trees: Newick
    • Tables: BIOM (HDF5 and JSON)
    • Distances: delimited square matrices
    • Analysis: BLAST+6/7, GFF3, Ordination results
    • Metadata: TSV/CSV with validation

    Common patterns:

    import skbio
    
    # Read a single sequence (format given explicitly; omit format= to let scikit-bio sniff it)
    seq = skbio.DNA.read('file.fasta', format='fasta')
    tree = skbio.TreeNode.read('tree.nwk')
    
    # Write to file
    seq.write('output.fasta', format='fasta')
    
    # Generator for large files (memory efficient)
    for seq in skbio.io.read('large.fasta', format='fasta', constructor=skbio.DNA):
        process(seq)
    
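    # Convert formats (pass the generator straight to write; a plain list has no registered writer)
    # FASTQ reads need a quality variant; 'illumina1.8' assumes Phred+33 encoding
    seqs = skbio.io.read('input.fastq', format='fastq', constructor=skbio.DNA,
                         variant='illumina1.8')
    skbio.io.write(seqs, format='fasta', into='output.fasta')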
    

    Important notes:

    • Use generators for large files to avoid memory issues
    • Format is auto-detected (sniffed) on read when the format parameter is omitted; writing requires an explicit format
    • Some objects can be written to multiple formats
    • Support for stdin/stdout piping with verify=False
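
    The memory benefit of generators comes from yielding one record at a time instead of building a list. A minimal pure-Python FASTA reader (a hypothetical helper, far less thorough than the library's parser) illustrates the pattern:

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # flush the last record


# In-memory stand-in for a file opened with open('large.fasta')
demo = io.StringIO(">seq1 first\nACGT\nTTAA\n>seq2\nGGCC\n")
records = list(iter_fasta(demo))
```

    Only the current record is held in memory while iterating; calling `list()` as in the demo is fine for small inputs but defeats the purpose on large files.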

    8. Distance Matrices

    Create and manipulate distance/dissimilarity matrices with statistical methods.

    Key capabilities:

    • Store symmetric (DistanceMatrix) or asymmetric (DissimilarityMatrix) data
    • ID-based indexing and slicing
    • Integration with diversity, ordination, and statistical tests
    • Read/write delimited text format

    Common patterns:

    from skbio import DistanceMatrix
    from skbio.stats.distance import permanova
    from skbio.stats.ordination import pcoa
    import numpy as np
    
    # Create from array
    data = np.array([[0, 1, 2], [1, 0, 3], [2, 3, 0]])
    dm = DistanceMatrix(data, ids=['A', 'B', 'C'])
    
    # Access distances
    dist_ab = dm['A', 'B']
    row_a = dm['A']
    
    # Read from file
    dm = DistanceMatrix.read('distances.txt')
    
    # Use in downstream analyses
    pcoa_results = pcoa(dm)
    permanova_results = permanova(dm, grouping)
    

    Important notes:

    • DistanceMatrix enforces symmetry and zero diagonal
    • DissimilarityMatrix allows asymmetric values
    • IDs enable integration with metadata and biological knowledge
    • Compatible with pandas, numpy, and scikit-learn
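
    The constraints described above (square, zero diagonal, symmetric, unique IDs) can be spelled out as explicit checks. This is a library-free sketch with hypothetical function names and an assumed numerical tolerance, not the validation code scikit-bio runs.

```python
def validate_distance_matrix(data, ids, tol=1e-12):
    """Raise ValueError unless data is square, hollow and symmetric."""
    n = len(ids)
    if len(set(ids)) != n:
        raise ValueError("IDs must be unique")
    if len(data) != n or any(len(row) != n for row in data):
        raise ValueError("data must be an n x n matrix matching the IDs")
    for i in range(n):
        if abs(data[i][i]) > tol:
            raise ValueError(f"non-zero diagonal at {ids[i]!r}")
        for j in range(i + 1, n):
            if abs(data[i][j] - data[j][i]) > tol:
                raise ValueError(f"asymmetric entry for {ids[i]!r}, {ids[j]!r}")


def lookup(data, ids, a, b):
    """ID-based access to a single distance."""
    index = {name: k for k, name in enumerate(ids)}
    return data[index[a]][index[b]]
```

    A matrix such as `[[0, 1], [2, 0]]` fails the symmetry check, which is the kind of input that belongs in a dissimilarity matrix instead.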

    9. Biological Tables

    Work with feature tables (OTU/ASV tables) common in microbiome research.

    Key capabilities:

    • BIOM format I/O (HDF5 and JSON)
    • Integration with pandas, polars, AnnData, numpy
    • Data augmentation techniques (phylomix, mixup, compositional methods)
    • Sample/feature filtering and normalization
    • Metadata integration

    Common patterns:

    from skbio import Table
    
    # Read BIOM table
    table = Table.read('table.biom')
    
    # Access data
    sample_ids = table.ids(axis='sample')
    feature_ids = table.ids(axis='observation')
    counts = table.matrix_data
    
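    # Filter (inplace=False returns a new table instead of modifying this one)
    filtered = table.filter(sample_ids_to_keep, axis='sample', inplace=False)
    
    # Convert to/from pandas (features as rows, samples as columns)
    df = table.to_dataframe(dense=True)
    table = Table(df.to_numpy(), observation_ids=list(df.index),
                  sample_ids=list(df.columns))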
    

    Important notes:

    • BIOM tables are standard in QIIME 2 workflows
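    • In BIOM tables, rows represent features (observations such as OTUs/ASVs) and columns represent samples; transpose before passing counts to functions that expect one sample per row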
    • Supports sparse and dense representations
    • Output format configurable (pandas/polars/numpy)
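
    BIOM-format tables are laid out with features as rows and samples as columns, whereas functions such as alpha_diversity expect one sample per row. The toy example below, using plain lists and made-up IDs and counts, sketches the transpose and a simple relative-abundance normalization.

```python
# Toy feature table in BIOM orientation: rows = features, columns = samples.
feature_ids = ["ASV1", "ASV2", "ASV3"]
sample_ids = ["S1", "S2"]
feature_by_sample = [
    [10, 0],
    [5, 5],
    [5, 15],
]

# Transpose so that each row holds one sample's counts.
sample_by_feature = [list(col) for col in zip(*feature_by_sample)]


def relative_abundance(row):
    """Scale one sample's counts so they sum to 1."""
    total = sum(row)
    return [c / total for c in row]


proportions = {
    sid: relative_abundance(row) for sid, row in zip(sample_ids, sample_by_feature)
}
```

    Keep the raw counts around as well: count-based diversity metrics need the integer abundances, not the proportions.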

    10. Protein Embeddings

    Work with protein language model embeddings for downstream analysis.

    Key capabilities:

    • Store embeddings from protein language models (ESM, ProtTrans, etc.)
    • Convert embeddings to distance matrices
    • Generate ordination objects for visualization
    • Export to numpy/pandas for ML workflows

    Common patterns:

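    from skbio.embedding import (ProteinEmbedding, ProteinVector,
                                 embed_vec_to_distances, embed_vec_to_ordination,
                                 embed_vec_to_numpy, embed_vec_to_dataframe)
    
    # Per-residue embedding: array of shape (sequence length, embedding dim)
    # paired with the protein sequence it was computed from (not a list of IDs)
    embedding = ProteinEmbedding(embedding_array, protein_sequence)
    
    # One vector per sequence; the utilities below operate on lists of vectors
    # (verify these names against your installed scikit-bio version)
    vectors = [ProteinVector(vec, seq) for vec, seq in zip(vector_arrays, sequences)]
    
    # Convert to distance matrix for analysis
    dm = embed_vec_to_distances(vectors, metric='euclidean')
    
    # Ordination of the embedding space for visualization
    ordination = embed_vec_to_ordination(vectors)
    
    # Export for machine learning
    array = embed_vec_to_numpy(vectors)
    df = embed_vec_to_dataframe(vectors)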
    

    Important notes:

    • Embeddings bridge protein language models with traditional bioinformatics
    • Compatible with scikit-bio's distance/ordination/statistics ecosystem
    • SequenceEmbedding and ProteinEmbedding provide specialized functionality
    • Useful for sequence clustering, classification, and visualization
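
    Turning embeddings into a distance matrix amounts to computing a pairwise metric between per-sequence vectors. The pure-Python sketch below assumes Euclidean distance and assumes that per-residue embeddings are reduced to one vector per sequence by mean pooling, which is a common choice but not something the text above prescribes; both function names are hypothetical.

```python
import math


def mean_pool(residue_vectors):
    """Average per-residue vectors into one per-sequence vector."""
    n = len(residue_vectors)
    return [sum(col) / n for col in zip(*residue_vectors)]


def pairwise_euclidean(vectors):
    """Square matrix of Euclidean distances between equal-length vectors."""
    return [[math.dist(u, v) for v in vectors] for u in vectors]
```

    The resulting square matrix is symmetric with a zero diagonal, so it has the shape expected by distance-based ordination and permutation tests.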

    Best Practices

    Installation

    pip install scikit-bio
    # Or with conda:
    conda install -c conda-forge scikit-bio
    

    Performance Considerations

    • Use generators for large sequence files to minimize memory usage
    • For massive phylogenetic trees, prefer GME or BME over NJ
    • Beta diversity calculations can be parallelized with partial_beta_diversity()
    • BIOM format (HDF5) more efficient than JSON for large tables

    Integration with Ecosystem

    • Sequences interoperate with Biopython via standard formats
    • Tables integrate with pandas, polars, and AnnData
    • Distance matrices compatible with scikit-learn
    • Ordination results visualizable with matplotlib/seaborn/plotly
    • Works seamlessly with QIIME 2 artifacts (BIOM, trees, distance matrices)

    Common Workflows

    1. Microbiome diversity analysis: Read BIOM table → Calculate alpha/beta diversity → Ordination (PCoA) → Statistical testing (PERMANOVA)
    2. Phylogenetic analysis: Read sequences → Align → Build distance matrix → Construct tree → Calculate phylogenetic distances
    3. Sequence processing: Read FASTQ → Quality filter → Trim/clean → Find motifs → Translate → Write FASTA
    4. Comparative genomics: Read sequences → Pairwise alignment → Calculate distances → Build tree → Analyze clades

    Reference Documentation

    For detailed API information, parameter specifications, and advanced usage examples, refer to references/api_reference.md, which covers:

    • Complete method signatures and parameters for all capabilities
    • Extended code examples for complex workflows
    • Troubleshooting common issues
    • Performance optimization tips
    • Integration patterns with other libraries

    Additional Resources

    • Official documentation: https://scikit.bio/docs/latest/
    • GitHub repository: https://github.com/scikit-bio/scikit-bio
    • Forum support: https://forum.qiime2.org (scikit-bio is part of the QIIME 2 ecosystem)
    Repository
    lifangda/claude-plugins