Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Accelerating the Agent Economy

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    davila7

    pydeseq2

    davila7/pydeseq2
    Data & Analytics
    19,892

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    Differential gene expression analysis (Python DESeq2). Identify DE genes from bulk RNA-seq counts, Wald tests, FDR correction, volcano/MA plots, for RNA-seq analysis.

    SKILL.md

    PyDESeq2

    Overview

    PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.

    When to Use This Skill

    This skill should be used when:

    • Analyzing bulk RNA-seq count data for differential expression
    • Comparing gene expression between experimental conditions (e.g., treated vs control)
    • Performing multi-factor designs accounting for batch effects or covariates
    • Converting R-based DESeq2 workflows to Python
    • Integrating differential expression analysis into Python-based pipelines
    • Users mention "DESeq2", "differential expression", "RNA-seq analysis", or "PyDESeq2"

    Quick Start Workflow

    For users who want to perform a standard differential expression analysis:

    import pandas as pd
    from pydeseq2.dds import DeseqDataSet
    from pydeseq2.ds import DeseqStats
    
    # 1. Load data
    counts_df = pd.read_csv("counts.csv", index_col=0).T  # Transpose to samples × genes
    metadata = pd.read_csv("metadata.csv", index_col=0)
    
    # 2. Filter low-count genes
    genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
    counts_df = counts_df[genes_to_keep]
    
    # 3. Initialize and fit DESeq2
    dds = DeseqDataSet(
        counts=counts_df,
        metadata=metadata,
        design="~condition",
        refit_cooks=True
    )
    dds.deseq2()
    
    # 4. Perform statistical testing
    ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
    ds.summary()
    
    # 5. Access results
    results = ds.results_df
    significant = results[results.padj < 0.05]
    print(f"Found {len(significant)} significant genes")
    

    Core Workflow Steps

    Step 1: Data Preparation

    Input requirements:

    • Count matrix: Samples × genes DataFrame with non-negative integer read counts
    • Metadata: Samples × variables DataFrame with experimental factors

    Common data loading patterns:

    # From CSV (typical format: genes × samples, needs transpose)
    counts_df = pd.read_csv("counts.csv", index_col=0).T
    metadata = pd.read_csv("metadata.csv", index_col=0)
    
    # From TSV
    counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
    
    # From AnnData
    import anndata as ad
    adata = ad.read_h5ad("data.h5ad")
    counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
    metadata = adata.obs
    

    Data filtering:

    # Remove low-count genes
    genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
    counts_df = counts_df[genes_to_keep]
    
    # Remove samples with missing metadata
    samples_to_keep = ~metadata.condition.isna()
    counts_df = counts_df.loc[samples_to_keep]
    metadata = metadata.loc[samples_to_keep]
    

    Step 2: Design Specification

    The design formula specifies how gene expression is modeled.

    Single-factor designs:

    design = "~condition"  # Simple two-group comparison
    

    Multi-factor designs:

    design = "~batch + condition"  # Control for batch effects
    design = "~age + condition"     # Include continuous covariate
    design = "~group + condition + group:condition"  # Interaction effects
    

    Design formula guidelines:

    • Use Wilkinson formula notation (R-style)
    • Put adjustment variables (e.g., batch) before the main variable of interest
    • Ensure variables exist as columns in the metadata DataFrame
    • Use appropriate data types (categorical for discrete variables)

    Step 3: DESeq2 Fitting

    Initialize the DeseqDataSet and run the complete pipeline:

    from pydeseq2.dds import DeseqDataSet
    
    dds = DeseqDataSet(
        counts=counts_df,
        metadata=metadata,
        design="~condition",
        refit_cooks=True,  # Refit after removing outliers
        n_cpus=1           # Parallel processing (adjust as needed)
    )
    
    # Run the complete DESeq2 pipeline
    dds.deseq2()
    

    What deseq2() does:

    1. Computes size factors (normalization)
    2. Fits genewise dispersions
    3. Fits dispersion trend curve
    4. Computes dispersion priors
    5. Fits MAP dispersions (shrinkage)
    6. Fits log fold changes
    7. Calculates Cook's distances (outlier detection)
    8. Refits if outliers detected (optional)

    Step 4: Statistical Testing

    Perform Wald tests to identify differentially expressed genes:

    from pydeseq2.ds import DeseqStats
    
    ds = DeseqStats(
        dds,
        contrast=["condition", "treated", "control"],  # Test treated vs control
        alpha=0.05,                # Significance threshold
        cooks_filter=True,         # Filter outliers
        independent_filter=True    # Filter low-power tests
    )
    
    ds.summary()
    

    Contrast specification:

    • Format: [variable, test_level, reference_level]
    • Example: ["condition", "treated", "control"] tests treated vs control
    • If None, uses the last coefficient in the design

    Result DataFrame columns:

    • baseMean: Mean normalized count across samples
    • log2FoldChange: Log2 fold change between conditions
    • lfcSE: Standard error of LFC
    • stat: Wald test statistic
    • pvalue: Raw p-value
    • padj: Adjusted p-value (FDR-corrected via Benjamini-Hochberg)

    Step 5: Optional LFC Shrinkage

    Apply shrinkage to reduce noise in fold change estimates:

    ds.lfc_shrink()  # Applies apeGLM shrinkage
    

    When to use LFC shrinkage:

    • For visualization (volcano plots, heatmaps)
    • For ranking genes by effect size
    • When prioritizing genes for follow-up experiments

    Important: Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.

    Step 6: Result Export

    Save results and intermediate objects:

    import pickle
    
    # Export results as CSV
    ds.results_df.to_csv("deseq2_results.csv")
    
    # Save significant genes only
    significant = ds.results_df[ds.results_df.padj < 0.05]
    significant.to_csv("significant_genes.csv")
    
    # Save DeseqDataSet for later use
    with open("dds_result.pkl", "wb") as f:
        pickle.dump(dds.to_picklable_anndata(), f)
    

    Common Analysis Patterns

    Two-Group Comparison

    Standard case-control comparison:

    dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
    dds.deseq2()
    
    ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
    ds.summary()
    
    results = ds.results_df
    significant = results[results.padj < 0.05]
    

    Multiple Comparisons

    Testing multiple treatment groups against control:

    dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
    dds.deseq2()
    
    treatments = ["treatment_A", "treatment_B", "treatment_C"]
    all_results = {}
    
    for treatment in treatments:
        ds = DeseqStats(dds, contrast=["condition", treatment, "control"])
        ds.summary()
        all_results[treatment] = ds.results_df
    
        sig_count = len(ds.results_df[ds.results_df.padj < 0.05])
        print(f"{treatment}: {sig_count} significant genes")
    

    Accounting for Batch Effects

    Control for technical variation:

    # Include batch in design
    dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition")
    dds.deseq2()
    
    # Test condition while controlling for batch
    ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
    ds.summary()
    

    Continuous Covariates

    Include continuous variables like age or dosage:

    # Ensure continuous variable is numeric
    metadata["age"] = pd.to_numeric(metadata["age"])
    
    dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition")
    dds.deseq2()
    
    ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
    ds.summary()
    

    Using the Analysis Script

    This skill includes a complete command-line script for standard analyses:

    # Basic usage
    python scripts/run_deseq2_analysis.py \
      --counts counts.csv \
      --metadata metadata.csv \
      --design "~condition" \
      --contrast condition treated control \
      --output results/
    
    # With additional options
    python scripts/run_deseq2_analysis.py \
      --counts counts.csv \
      --metadata metadata.csv \
      --design "~batch + condition" \
      --contrast condition treated control \
      --output results/ \
      --min-counts 10 \
      --alpha 0.05 \
      --n-cpus 4 \
      --plots
    

    Script features:

    • Automatic data loading and validation
    • Gene and sample filtering
    • Complete DESeq2 pipeline execution
    • Statistical testing with customizable parameters
    • Result export (CSV, pickle)
    • Optional visualization (volcano and MA plots)

    Refer users to scripts/run_deseq2_analysis.py when they need a standalone analysis tool or want to batch process multiple datasets.

    Result Interpretation

    Identifying Significant Genes

    # Filter by adjusted p-value
    significant = ds.results_df[ds.results_df.padj < 0.05]
    
    # Filter by both significance and effect size
    sig_and_large = ds.results_df[
        (ds.results_df.padj < 0.05) &
        (abs(ds.results_df.log2FoldChange) > 1)
    ]
    
    # Separate up- and down-regulated
    upregulated = significant[significant.log2FoldChange > 0]
    downregulated = significant[significant.log2FoldChange < 0]
    
    print(f"Upregulated: {len(upregulated)}")
    print(f"Downregulated: {len(downregulated)}")
    

    Ranking and Sorting

    # Sort by adjusted p-value
    top_by_padj = ds.results_df.sort_values("padj").head(20)
    
    # Sort by absolute fold change (use shrunk values)
    ds.lfc_shrink()
    ds.results_df["abs_lfc"] = abs(ds.results_df.log2FoldChange)
    top_by_lfc = ds.results_df.sort_values("abs_lfc", ascending=False).head(20)
    
    # Sort by a combined metric
    ds.results_df["score"] = -np.log10(ds.results_df.padj) * abs(ds.results_df.log2FoldChange)
    top_combined = ds.results_df.sort_values("score", ascending=False).head(20)
    

    Quality Metrics

    # Check normalization (size factors should be close to 1)
    print("Size factors:", dds.obsm["size_factors"])
    
    # Examine dispersion estimates
    import matplotlib.pyplot as plt
    plt.hist(dds.varm["dispersions"], bins=50)
    plt.xlabel("Dispersion")
    plt.ylabel("Frequency")
    plt.title("Dispersion Distribution")
    plt.show()
    
    # Check p-value distribution (should be mostly flat with peak near 0)
    plt.hist(ds.results_df.pvalue.dropna(), bins=50)
    plt.xlabel("P-value")
    plt.ylabel("Frequency")
    plt.title("P-value Distribution")
    plt.show()
    

    Visualization Guidelines

    Volcano Plot

    Visualize significance vs effect size:

    import matplotlib.pyplot as plt
    import numpy as np
    
    results = ds.results_df.copy()
    results["-log10(padj)"] = -np.log10(results.padj)
    
    plt.figure(figsize=(10, 6))
    significant = results.padj < 0.05
    
    plt.scatter(
        results.loc[~significant, "log2FoldChange"],
        results.loc[~significant, "-log10(padj)"],
        alpha=0.3, s=10, c='gray', label='Not significant'
    )
    plt.scatter(
        results.loc[significant, "log2FoldChange"],
        results.loc[significant, "-log10(padj)"],
        alpha=0.6, s=10, c='red', label='padj < 0.05'
    )
    
    plt.axhline(-np.log10(0.05), color='blue', linestyle='--', alpha=0.5)
    plt.xlabel("Log2 Fold Change")
    plt.ylabel("-Log10(Adjusted P-value)")
    plt.title("Volcano Plot")
    plt.legend()
    plt.savefig("volcano_plot.png", dpi=300)
    

    MA Plot

    Show fold change vs mean expression:

    plt.figure(figsize=(10, 6))
    
    plt.scatter(
        np.log10(results.loc[~significant, "baseMean"] + 1),
        results.loc[~significant, "log2FoldChange"],
        alpha=0.3, s=10, c='gray'
    )
    plt.scatter(
        np.log10(results.loc[significant, "baseMean"] + 1),
        results.loc[significant, "log2FoldChange"],
        alpha=0.6, s=10, c='red'
    )
    
    plt.axhline(0, color='blue', linestyle='--', alpha=0.5)
    plt.xlabel("Log10(Base Mean + 1)")
    plt.ylabel("Log2 Fold Change")
    plt.title("MA Plot")
    plt.savefig("ma_plot.png", dpi=300)
    

    Troubleshooting Common Issues

    Data Format Problems

    Issue: "Index mismatch between counts and metadata"

    Solution: Ensure sample names match exactly

    print("Counts samples:", counts_df.index.tolist())
    print("Metadata samples:", metadata.index.tolist())
    
    # Take intersection if needed
    common = counts_df.index.intersection(metadata.index)
    counts_df = counts_df.loc[common]
    metadata = metadata.loc[common]
    

    Issue: "All genes have zero counts"

    Solution: Check if data needs transposition

    print(f"Counts shape: {counts_df.shape}")
    # If genes > samples, transpose is needed
    if counts_df.shape[1] < counts_df.shape[0]:
        counts_df = counts_df.T
    

    Design Matrix Issues

    Issue: "Design matrix is not full rank"

    Cause: Confounded variables (e.g., all treated samples in one batch)

    Solution: Remove confounded variable or add interaction term

    # Check confounding
    print(pd.crosstab(metadata.condition, metadata.batch))
    
    # Either simplify design or add interaction
    design = "~condition"  # Remove batch
    # OR
    design = "~condition + batch + condition:batch"  # Model interaction
    

    No Significant Genes

    Diagnostics:

    # Check dispersion distribution
    plt.hist(dds.varm["dispersions"], bins=50)
    plt.show()
    
    # Check size factors
    print(dds.obsm["size_factors"])
    
    # Look at top genes by raw p-value
    print(ds.results_df.nsmallest(20, "pvalue"))
    

    Possible causes:

    • Small effect sizes
    • High biological variability
    • Insufficient sample size
    • Technical issues (batch effects, outliers)

    Reference Documentation

    For comprehensive details beyond this workflow-oriented guide:

    • API Reference (references/api_reference.md): Complete documentation of PyDESeq2 classes, methods, and data structures. Use when needing detailed parameter information or understanding object attributes.

    • Workflow Guide (references/workflow_guide.md): In-depth guide covering complete analysis workflows, data loading patterns, multi-factor designs, troubleshooting, and best practices. Use when handling complex experimental designs or encountering issues.

    Load these references into context when users need:

    • Detailed API documentation: Read references/api_reference.md
    • Comprehensive workflow examples: Read references/workflow_guide.md
    • Troubleshooting guidance: Read references/workflow_guide.md (see Troubleshooting section)

    Key Reminders

    1. Data orientation matters: Count matrices typically load as genes × samples but need to be samples × genes. Always transpose with .T if needed.

    2. Sample filtering: Remove samples with missing metadata before analysis to avoid errors.

    3. Gene filtering: Filter low-count genes (e.g., < 10 total reads) to improve power and reduce computational time.

    4. Design formula order: Put adjustment variables before the variable of interest (e.g., "~batch + condition" not "~condition + batch").

    5. LFC shrinkage timing: Apply shrinkage after statistical testing and only for visualization/ranking purposes. P-values remain based on unshrunken estimates.

    6. Result interpretation: Use padj < 0.05 for significance, not raw p-values. The Benjamini-Hochberg procedure controls false discovery rate.

    7. Contrast specification: The format is [variable, test_level, reference_level] where test_level is compared against reference_level.

    8. Save intermediate objects: Use pickle to save DeseqDataSet objects for later use or additional analyses without re-running the expensive fitting step.

    Installation and Requirements

    uv pip install pydeseq2
    

    System requirements:

    • Python 3.10-3.11
    • pandas 1.4.3+
    • numpy 1.23.0+
    • scipy 1.11.0+
    • scikit-learn 1.1.1+
    • anndata 0.8.0+

    Optional for visualization:

    • matplotlib
    • seaborn

    Additional Resources

    • Official Documentation: https://pydeseq2.readthedocs.io
    • GitHub Repository: https://github.com/owkin/PyDESeq2
    • Publication: Muzellec et al. (2023) Bioinformatics, DOI: 10.1093/bioinformatics/btad547
    • Original DESeq2 (R): Love et al. (2014) Genome Biology, DOI: 10.1186/s13059-014-0550-8
    Recommended Servers
    Open Targets
    Open Targets
    Codeinterpreter
    Codeinterpreter
    Vercel Grep
    Vercel Grep
    Repository
    davila7/claude-code-templates
    Files