Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Give agents more agency

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    Starlitnightly

    single-cell-annotation-skills-with-omicverse

    Starlitnightly/single-cell-annotation-skills-with-omicverse
    Data & Analytics
    834
    2 installs

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    Guide Claude through SCSA, MetaTiME, CellVote, CellMatch, GPTAnno, and weighted KNN transfer workflows for annotating single-cell modalities.

    SKILL.md

    Single-cell annotation skills with omicverse

    Overview

    Use this skill to reproduce and adapt the single-cell annotation playbook captured in omicverse tutorials: SCSA t_cellanno.ipynb, MetaTiME t_metatime.ipynb, CellVote t_cellvote.md & t_cellvote_pbmc3k.ipynb, CellMatch t_cellmatch.ipynb, GPTAnno t_gptanno.ipynb, and label transfer t_anno_trans.ipynb. Each section below highlights required inputs, training/inference steps, and how to read the outputs.

    Instructions

    1. SCSA automated cluster annotation

      • Data requirements: PBMC3k raw counts from 10x Genomics (pbmc3k_filtered_gene_bc_matrices.tar.gz) or the processed sample/rna.h5ad. Download instructions are embedded in the notebook; unpack to data/filtered_gene_bc_matrices/hg19/. Ensure an SCSA SQLite database is available (e.g. pySCSA_2024_v1_plus.db from the Figshare/Drive links listed in the tutorial) and point model_path to its location.
      • Preprocessing & model fit: Load with ov.io.read_10x_mtx, run QC (ov.pp.qc), normalization and HVG selection (ov.pp.preprocess), scaling (ov.pp.scale), PCA (ov.pp.pca), neighbors, Leiden clustering, and compute rank markers (sc.tl.rank_genes_groups). Instantiate scsa = ov.single.pySCSA(...) choosing target='cellmarker' or 'panglaodb', tissue scope, and thresholds (foldchange, pvalue).
      • Inference & interpretation: Call scsa.cell_anno(clustertype='leiden', result_key='scsa_celltype_cellmarker') or scsa.cell_auto_anno to append predictions to adata.obs. Compare to manual marker-based labels via ov.pl.embedding or sc.pl.dotplot, inspect marker dictionaries (ov.single.get_celltype_marker), and query supported tissues with scsa.get_model_tissue(). Use the ROI/ROE helpers (ov.utils.roe, ov.utils.plot_cellproportion) to validate abundance trends.
    2. MetaTiME tumour microenvironment states

      • Data requirements: Batched TME AnnData with an scVI latent embedding. The tutorial uses TiME_adata_scvi.h5ad from Figshare (https://figshare.com/ndownloader/files/41440050). If starting from counts, run scVI (scvi.model.SCVI) first to populate adata.obsm['X_scVI'].
      • Preprocessing & model fit: Optionally subset to non-malignant cells via adata.obs['isTME']. Rebuild neighbors on the latent representation (sc.pp.neighbors(adata, use_rep="X_scVI")) and embed with umap (adata.obsm['X_umap'] = ov.pp.umap(...)). Initialise TiME_object = ov.single.MetaTiME(adata, mode='table') and, if finer granularity is desired, over-cluster with TiME_object.overcluster(resolution=8, clustercol='overcluster').
      • Inference & interpretation: Run TiME_object.predictTiME(save_obs_name='MetaTiME') to assign minor states and Major_MetaTiME. Visualise using TiME_object.plot or sc.pl.embedding. Interpret the outputs by comparing cluster-level distributions and confirming that MetaTiME and Major_MetaTiME columns align with expected niches.
    3. CellVote consensus labelling

      • Data requirements: A clustered AnnData (e.g. PBMC3k stored as CELLVOTE_PBMC3K env var or data/pbmc3k.h5ad) plus at least two precomputed annotation columns (simulated in the tutorial as scsa_annotation, gpt_celltype, gbi_celltype). Prepare per-cluster marker genes via sc.tl.rank_genes_groups.
      • Preprocessing & model fit: After standard preprocessing (normalize, log1p, HVGs, PCA, neighbors, Leiden) build a marker dictionary marker_dict = top_markers_from_rgg(adata, 'leiden', topn=10) or via ov.single.get_celltype_marker. Instantiate cv = ov.single.CellVote(adata).
      • Inference & interpretation: Call cv.vote(clusters_key='leiden', cluster_markers=marker_dict, celltype_keys=[...], species='human', organization='PBMC', provider='openai', model='gpt-4o-mini'). Offline examples monkey-patch arbitration to avoid API calls; online voting requires valid credentials. Final consensus labels live in adata.obs['CellVote_celltype']. Compare each cluster’s majority vote with the input sources (adata.obs[['leiden', 'scsa_annotation', ...]]) to justify decisions.
    4. CellMatch ontology mapping

      • Data requirements: Annotated AnnData such as pertpy.dt.haber_2017_regions() with adata.obs['cell_label']. Download Cell Ontology JSON (cl.json) via ov.single.download_cl(...) or manual links, and optionally Cell Taxonomy resources (Cell_Taxonomy_resource.txt). Ensure access to a SentenceTransformer model (sentence-transformers/all-MiniLM-L6-v2, BAAI/bge-base-en-v1.5, etc.), downloading to local_model_dir if offline.
      • Preprocessing & model fit: Create the mapper with ov.single.CellOntologyMapper(cl_obo_file='new_ontology/cl.json', model_name='sentence-transformers/all-MiniLM-L6-v2', local_model_dir='./my_models'). Run mapper.map_adata(...) to assign ontology-derived labels/IDs, optionally enabling taxonomy matching (use_taxonomy=True after calling load_cell_taxonomy_resource).
      • Inference & interpretation: Explore mapping summaries (mapper.print_mapping_summary_taxonomy) and inspect embeddings coloured by cell_ontology, cell_ontology_cl_id, or enhanced_cell_ontology. Use helper queries such as mapper.find_similar_cells('T helper cell'), mapper.get_cell_info(...), and category browsing to validate ontology coverage.
    5. GPTAnno LLM-powered annotation

      • Data requirements: The same PBMC3k dataset (raw matrix or .h5ad) and cluster assignments. Access to an LLM endpoint—configure AGI_API_KEY for OpenAI-compatible providers (provider='openai', 'qwen', 'kimi', etc.), or supply a local model path for ov.single.gptcelltype_local.
      • Preprocessing & model fit: Follow the QC, normalization, HVG, scaling, PCA, neighbor, Leiden, and marker discovery steps described above (reusing outputs from the SCSA workflow). Build the marker dictionary automatically with ov.single.get_celltype_marker(adata, clustertype='leiden', rank=True, key='rank_genes_groups', foldchange=2, topgenenumber=5).
      • Inference & interpretation: Invoke ov.single.gptcelltype(...) specifying tissue/species context and desired provider/model. Post-process responses to keep clean labels (result[key].split(': ')[-1]...) and write them to adata.obs['gpt_celltype']. Compare embeddings (ov.pl.embedding(..., color=['leiden','gpt_celltype'])) to verify cluster identities. If operating offline, call ov.single.gptcelltype_local with a downloaded instruction-tuned checkpoint.
    6. Weighted KNN annotation transfer

      • Data requirements: Cross-modal GLUE outputs with aligned embeddings, e.g. data/analysis_lymph/rna-emb.h5ad (annotated RNA) and data/analysis_lymph/atac-emb.h5ad (query ATAC) where both contain obsm['X_glue'].
      • Preprocessing & model fit: Load both modalities, optionally concatenate for QC plots, and compute a shared low-dimensional embedding with ov.utils.mde. Train a neighbour model using ov.utils.weighted_knn_trainer(train_adata=rna, train_adata_emb='X_glue', n_neighbors=15).
      • Inference & interpretation: Transfer labels via labels, uncert = ov.utils.weighted_knn_transfer(query_adata=atac, query_adata_emb='X_glue', label_keys='major_celltype', knn_model=knn_transformer, ref_adata_obs=rna.obs). Store predictions in atac.obs['transf_celltype'] and uncertainties in atac.obs['transf_celltype_unc']; copy to major_celltype if you want consistent naming. Visualise (ov.pl.embedding) and inspect uncertainty to flag ambiguous cells.

    Defensive Validation Patterns

    # Before SCSA: verify rank_genes_groups has been computed
    assert 'rank_genes_groups' in adata.uns, \
        "Marker genes required. Run sc.tl.rank_genes_groups(adata, groupby='leiden') first."
    
    # Before any annotation: verify clustering exists
    assert 'leiden' in adata.obs.columns or 'louvain' in adata.obs.columns, \
        "Clustering required. Run ov.pp.leiden(adata) or sc.tl.leiden(adata) first."
    
    # Before CellVote: verify multiple annotation columns exist
    annotation_keys = ['scsa_annotation', 'gpt_celltype']  # adjust to actual keys
    for key in annotation_keys:
        assert key in adata.obs.columns, f"Annotation column '{key}' not found — run annotators first"
    

    Critical API Reference - EXACT Function Signatures

    pySCSA - IMPORTANT: Parameter is clustertype, NOT cluster

    CORRECT usage:

    # Step 1: Initialize pySCSA
    scsa = ov.single.pySCSA(
        adata,
        foldchange=1.5,
        pvalue=0.01,
        species='Human',
        tissue='All',
        target='cellmarker'  # or 'panglaodb'
    )
    
    # Step 2: Run annotation - NOTE: use clustertype='leiden', NOT cluster='leiden'!
    anno_result = scsa.cell_anno(clustertype='leiden', cluster='all')
    
    # Step 3: Add cell type labels to adata.obs
    scsa.cell_auto_anno(adata, clustertype='leiden', key='scsa_celltype')
    # Results are stored in adata.obs['scsa_celltype']
    

    WRONG - DO NOT USE:

    # WRONG! 'cluster' is NOT a valid parameter for cell_auto_anno!
    # scsa.cell_auto_anno(adata, cluster='leiden')  # ERROR!
    

    COSG Marker Genes - Results stored in adata.uns, NOT adata.obs

    CORRECT usage:

    # Step 1: Run COSG marker gene identification
    ov.single.cosg(adata, groupby='leiden', n_genes_user=50)
    
    # Step 2: Access results from adata.uns (NOT adata.obs!)
    marker_names = adata.uns['rank_genes_groups']['names']  # DataFrame with cluster columns
    marker_scores = adata.uns['rank_genes_groups']['scores']
    
    # Step 3: Get top markers for specific cluster
    cluster_0_markers = adata.uns['rank_genes_groups']['names']['0'][:10].tolist()
    
    # Step 4: To create celltype column, manually map clusters to cell types
    cluster_to_celltype = {
        '0': 'T cells',
        '1': 'B cells',
        '2': 'Monocytes',
    }
    adata.obs['cosg_celltype'] = adata.obs['leiden'].map(cluster_to_celltype)
    

    WRONG - DO NOT USE:

    # WRONG! COSG does NOT create adata.obs columns directly!
    # adata.obs['cosg_celltype']  # This key does NOT exist after running COSG!
    # adata.uns['cosg_celltype']  # This key also does NOT exist!
    

    Common Pitfalls to Avoid

    1. pySCSA parameter confusion:

      • clustertype = which obs column contains cluster labels (e.g., 'leiden')
      • cluster = which specific clusters to annotate ('all' or specific cluster IDs)
      • These are DIFFERENT parameters!
    2. COSG result access:

      • COSG is a marker gene finder, NOT a cell type annotator
      • Results are per-cluster gene rankings stored in adata.uns['rank_genes_groups']
      • To assign cell types, you must manually map clusters to cell types based on markers
    3. Result storage patterns in OmicVerse:

      • Cell type annotations → adata.obs['<key>']
      • Marker gene results → adata.uns['<key>'] (includes 'names', 'scores', 'logfoldchanges')
      • Differential expression → adata.uns['rank_genes_groups']

    Examples

    • "Run SCSA with both CellMarker and PanglaoDB references on PBMC3k, then benchmark against manual marker assignments before feeding the results into CellVote."
    • "Annotate tumour microenvironment states in the MetaTiME Figshare dataset, highlight Major_MetaTiME classes, and export the label distribution per patient."
    • "Download Cell Ontology resources, map haber_2017_regions clusters to ontology terms, and enrich ambiguous clusters using Cell Taxonomy hints."
    • "Propagate RNA-derived major_celltype labels onto GLUE-integrated ATAC cells and report clusters with high transfer uncertainty."

    References

    • Tutorials and notebooks: t_cellanno.ipynb, t_metatime.ipynb, t_cellvote.md, t_cellvote_pbmc3k.ipynb, t_cellmatch.ipynb, t_gptanno.ipynb, t_anno_trans.ipynb.
    • Sample data & assets: PBMC3k matrix from 10x Genomics, MetaTiME TiME_adata_scvi.h5ad (Figshare), SCSA database downloads, GLUE embeddings under data/analysis_lymph/, Cell Ontology cl.json, and Cell Taxonomy resource.
    • Quick copy commands: reference.md.
    Recommended Servers
    Bizfile MCP — Global Company Intelligence
    Bizfile MCP — Global Company Intelligence
    EasyWeek
    EasyWeek
    Apify
    Apify
    Repository
    starlitnightly/omicverse
    Files