Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Give agents more agency

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    gptomics

    bio-clip-seq-binding-site-annotation

    gptomics/bio-clip-seq-binding-site-annotation
    Data & Analytics
    178

    About

    SKILL.md

    Install

    • Telegram
      Telegram
    • Slack
      Slack
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    • Download skill
    ├─
    ├─
    └─

    About

    Annotate CLIP-seq binding sites to genomic features including 3'UTR, 5'UTR, CDS, introns, and ncRNAs. Use when characterizing where an RBP binds in transcripts.

    SKILL.md

    Version Compatibility

    Reference examples tested with: ChIPseeker 1.40+, RCAS 1.30+, GenomicFeatures 1.56+, GenomicRanges 1.56+, rbp-maps (Yeo github), bedtools 2.31+, pybedtools 0.10+, pyranges 0.0.129+.

    Before using code patterns, verify installed versions match. If versions differ:

    • R: packageVersion('<pkg>') then ?function_name to verify parameters
    • Python: pip show <package> then help(module.function) to check signatures
    • CLI: <tool> --version then <tool> --help to confirm flags

    If code throws unexpected errors, introspect the installed package and adapt the example to match the actual API rather than retrying. ChIPseeker 1.40+ changed the priority defaults; verify the priority vector before pipelining.

    Binding Site Annotation

    "Annotate where in transcripts my RBP binds" -> Map CLIP peaks or single-nucleotide crosslink sites to RNA features and report the per-feature distribution. The interpretation is RBP-class-specific: splicing factors (PTBP1, U2AF2, RBFOX) bind intron-exon junctions; mRNA-stability regulators (HuR, PUM2) bind 3' UTRs; translation factors (EIF3J, RPS19) bind 5' UTRs and CDS; and small-ncRNA-binding RBPs (NSUN2 tRNAs, LARP7 7SK, TROVE2 Y-RNAs) bind specific non-coding transcripts. A correct annotation pipeline (a) resolves overlapping features by priority, (b) preserves transcript-isoform context, (c) generates metagene distributions, and (d) flags repeat-element overlap separately.

    • R (peak-level, fast): ChIPseeker::annotatePeak(peaks, TxDb=txdb, level='gene', tssRegion=c(-100,100)) then plotAnnoPie(anno)
    • R (transcript-level with RNA-specific regions): RCAS runReport(queryRegions=peaks, gffData=gencode_gtf, genomeVersion='hg38')
    • CLI (regions only): bedtools intersect -s -wa -wb -a peaks.bed -b features.bed (manual hierarchy)
    • R (splicing-regulatory map for splice factors): RBP-Maps (yeolab/rbp-maps) generates the 1400 nt vectorized cassette-exon map used in Yeo lab ENCODE papers
    • CLI (metagene aggregation): deepTools computeMatrix or RSeQC geneBody_coverage.py for read profiles

    The Yeo lab convention for ENCODE eCLIP: ChIPseeker for global region distribution + RBP-Maps for position-resolved splicing context + custom analysis for repeat-element overlap. ChIPseeker alone over-counts intronic binding because TSS-region default extends too far for RNA features.

    Algorithmic Taxonomy

    Tool Input Annotation level Feature priority Output Strength Fails when
    ChIPseeker (R) Peak BED + TxDb Gene-level or transcript-level Promoter > 5' UTR > 3' UTR > Exon > Intron > Intergenic Per-peak annotation + pie chart + distance-to-TSS Mature, widely used, fast TSS-region default c(-3000, 3000) over-extends; promoter category meaningless for RNA
    RCAS (R) Peak BED + GFF Transcript-level Customizable HTML report + per-region table RNA-specific design; ncRNA-aware Slow; HTML-heavy; less actively maintained
    RBP-Maps (Yeo) eCLIP BAM + alternative splicing event tables Splice junction-level NA (positional metagene) Splicing regulatory map (1400nt vector around cassette exon) The standard for splicing-factor CLIP analysis Splicing-only; not for 3' UTR or ncRNA binders
    ASTERIX (Yeo) Peak BED Region NA Region distribution + region-specific motifs Companion to RBP-Maps for non-splicing regions Less granular than ChIPseeker
    HOMER annotatePeaks.pl Peak BED Gene-level Promoter > UTR > Exon > Intron > Intergenic Annotation TSV Familiar from ChIP-seq Same TSS issue as ChIPseeker; less RNA-aware
    bedtools intersect Peak BED + feature BED User-defined User-defined Per-feature overlap Most flexible Manual hierarchy logic must be written
    pyranges (Python) Peak BED + GTF Customizable User-defined DataFrame Fast, pythonic, scriptable Hand-written priority logic
    pybedtools Peak BED + GTF Customizable User-defined iterators Flexible, scripted Same as pyranges
    Peakhood (Reuter 2022) Peak BED + transcripts Transcript context NA Per-peak transcript context Resolves transcript-isoform ambiguity Specialized; not a global region tool

    Methodology evolves; verify the GTF source (GENCODE preferred over Ensembl for human; both have feature differences) and the priority hierarchy against published RBP literature.

    Critical Choice: Annotation Hierarchy

    RNA features are nested and overlapping. A 3' UTR peak is ALSO in an exon; an intronic peak near a splice site is ALSO in a transcribed-but-spliced region. Without a priority hierarchy the same peak gets counted multiple times.

    Standard CLIP hierarchy (most-specific first):

    1. 3' UTR (specific to mature mRNA biology)
    2. 5' UTR
    3. CDS (coding sequence)
    4. Exon (catch-all for non-UTR exonic)
    5. Splice site (within 50 nt of donor/acceptor; key for splicing factors)
    6. Intron
    7. Promoter (within 1 kb of TSS; usually not relevant for RNA-binding)
    8. 5' flanking / 3' flanking (within 1 kb of gene)
    9. ncRNA (lncRNA, snoRNA, snRNA, miRNA host)
    10. Repeat element (Alu, LINE, LTR, SINE) - separate axis from above
    11. Intergenic

    ChIPseeker's default level='gene' reports the FIRST hit by priority. For CLIP, set level='transcript' and explicitly customize tssRegion=c(-100, 100) (default c(-3000, 3000) is for ChIP).

    Why CLIP needs the tight TSS window: ChIPseeker was designed for ChIP-seq where the "Promoter" category captures transcription-factor binding within 3 kb upstream of a gene. For CLIP-seq, the binding substrate is RNA - the RBP cannot bind upstream of the TSS because there is no RNA upstream of the TSS. A 6 kb TSS window catches 5'-UTR peaks and unrelated upstream sequence, mislabeling 30-50% of CLIP peaks as "Promoter (2-3kb)" - implausible for an RNA-binding study. Tightening to c(-100, 100) constrains the "Promoter" category to the mRNA 5' end immediate vicinity and pushes deeper-5'-UTR peaks into the "5UTR" category where they belong.

    library(ChIPseeker)
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
    
    peaks <- readPeakFile('peaks.stringent.bed')
    
    # CLIP-appropriate annotation
    anno <- annotatePeak(
        peaks,
        TxDb = txdb,
        level = 'transcript',     # RNA-isoform-aware
        tssRegion = c(-100, 100), # tight; not a meaningful promoter for RNA binding
        genomicAnnotationPriority = c(
            'Promoter', '5UTR', '3UTR', 'Exon',
            'Intron', 'Downstream', 'Intergenic'
        )
    )
    
    print(anno)
    plotAnnoPie(anno)
    plotAnnoBar(anno)
    plotDistToTSS(anno)  # CLIP-specific: dist to TSS is dist to mRNA 5' end
    

    RBP Class -> Expected Annotation

    RBP class Expected dominant region RBP examples
    Splicing factors Intron (within 500 nt of splice junction) PTBP1, U2AF2, RBFOX1/2, SRSF1-9, HNRNPC, MBNL1
    3' UTR mRNA stability/decay 3' UTR HuR (ELAVL1), PUM2, AUF1, ZFP36, TIA1, NUDT21
    5' UTR translation regulation 5' UTR, CDS start EIF4A3, EIF3J, eIF2 subunits, LARP1, PCBP1/2
    Ribosome-associated CDS RPL/RPS proteins, RACK1
    Nuclear export CDS, 3' UTR NXF1, ALYREF, SRSF3
    m6A reader 3' UTR, stop codon region YTHDF1/2/3, IGF2BP1/2/3
    miRNA effector 3' UTR (seed-matched sites) AGO1/2/3/4, DICER1, TNRC6A/B/C
    snoRNA-targeting RBP snoRNA, rRNA DKC1, FBL, NOP56, NHP2
    ncRNA-binding Specific ncRNA target TROVE2 (Y RNA), LARP7 (7SK), SSB (vault)
    Mitochondrial chrM (mt-mRNA + 12S/16S rRNA) FASTKD2, LRPPRC, TFAM, MTPAP
    Histone mRNA Histone 3' end stem-loop SLBP
    Repeat-binding (TE) Alu, LINE-1, LTR, SINE MATR3, ZFP36, HNRNPK, HNRNPA2B1, FUS (LINE-1)
    LCD/IDR / phase separation Diverse (Alu, intron, 3' UTR) FUS, TDP-43, EWSR1, TAF15

    Mismatch between observed and expected suggests: (a) failed IP (cross-reactive antibody binding wrong RBP), (b) artifact (high-expression contamination), (c) genuinely new biology - investigate.

    RBP-Maps: Splicing Regulatory Maps

    For splicing factors (PTBP1, U2AF2, RBFOX, SRSF1, etc.), the Yeo lab RBP-Maps tool aggregates eCLIP signal in a 350 nt window flanking the upstream, cassette, and downstream exons, producing a 1400 nt vectorized regulatory map. This map shows the position-dependent enrichment of RBP binding relative to alternative splicing events and is THE standard for splicing-CLIP analysis.

    # Yeo lab rbp-maps Snakemake workflow
    git clone https://github.com/YeoLab/rbp-maps
    cd rbp-maps
    
    # Inputs:
    #   1. eCLIP BAM (rep1 + rep2)
    #   2. Cassette exon BED (from RNA-seq KD experiment; see clip-seq/differential-clip)
    #   3. Background "native" exons (constitutively spliced)
    
    python rbpmaps/RBPMaps.py \
        --ip rep1.bam rep2.bam \
        --input sminput.bam \
        --positive cassette_inc.bed \
        --negative cassette_exc.bed \
        --background native_constitutive.bed \
        --output rbpmaps_out/ \
        --normalization input_normalization
    

    Output: 1400 nt vector with eCLIP signal at every base, plotted as a metagene for cassette inclusion vs exclusion vs constitutive. A peak in the upstream-intron region (right of the cassette exon's 5' splice site) indicates intronic regulation; a peak in the cassette exon body indicates exonic regulation.

    Per-Tool Failure Modes

    ChIPseeker -- TSS-region default over-extends

    Trigger: Used ChIPseeker default tssRegion=c(-3000, 3000) for CLIP peaks.

    Mechanism: The 6 kb TSS window catches CLIP peaks deep in the 5' UTR or upstream introns and labels them "Promoter (2-3kb)". For RNA biology, the "promoter" category is meaningless - the RBP cannot bind pre-mRNA upstream of the TSS.

    Symptom: 30-50% of peaks labeled "Promoter (2-3kb)" - implausible for an RNA-binding study.

    Fix: Set tssRegion=c(-100, 100) or c(-50, 50). Remove the "Promoter" category from the priority vector entirely if it gives 0 hits.

    ChIPseeker -- Gene-level loses isoform context

    Trigger: Used level='gene' (default) on peaks that should be transcript-specific.

    Mechanism: Gene-level annotation picks the canonical/longest transcript and assigns the peak to that transcript's features. A peak in an alternatively-spliced exon may be labeled "intron" because the gene-level reference picks an isoform where the exon is skipped.

    Symptom: Splicing factor peaks heavily "intronic" when they should be exonic at alternative-cassette exons.

    Fix: Use level='transcript'. Match isoforms to RBP-Maps cassette-exon coordinates for splicing factors.

    RCAS -- Slow on large peak sets

    Trigger: RCAS report on > 100k peaks.

    Mechanism: RCAS generates an HTML report with many plots; rendering is slow.

    Symptom: Run time > 1 h; out-of-memory; HTML failing to render.

    Fix: Sub-sample peaks to top 10k; or use ChIPseeker for the global view and reserve RCAS for per-region deep dives.

    RBP-Maps -- Requires SE event table

    Trigger: Splicing-factor CLIP with no companion RNA-seq KD or no SE event BED.

    Mechanism: RBP-Maps's regulatory metagene only makes sense relative to cassette exons differentially regulated by the RBP. Without the SE event table, the tool falls back to plotting at all native exons (no differential signal).

    Symptom: Flat metagene; no position-dependent enrichment.

    Fix: Generate the cassette-exon table from companion RNA-seq KD experiments (rMATS / MAJIQ / LeafCutter; see alternative-splicing skills) OR use a published table from ENCODE shRNA RNA-seq.

    bedtools -- Manual hierarchy errors

    Trigger: Hand-written bedtools intersect chain for region annotation.

    Mechanism: Forgetting to enforce priority leads to double-counting: a 3' UTR peak intersects 3UTR.bed AND exon.bed.

    Symptom: Per-region counts sum to > total peak count by 20-50%.

    Fix: Process in priority order; remove already-annotated peaks from the remaining set with -v before next intersect. Or use ChIPseeker / RCAS which encode the hierarchy internally.

    Repeat-element overlap missed

    Trigger: Repeat-binding RBP (MATR3, HNRNPK) annotated without RepeatMasker BED.

    Mechanism: Genomic feature BEDs (UTR/CDS/intron) do not flag repeat instances within them. An MATR3 peak in an intronic LINE-1 is labeled "intron" when "LINE-1 within intron" is the biology.

    Symptom: Repeat-binding RBP looks like an intronic binder; functional analysis (GO terms) returns generic "RNA processing".

    Fix: Pass RepeatMasker BED (UCSC) as a separate axis; annotate each peak with BOTH region AND repeat-class. Cross-check repeat-class fraction against literature: > 15% Alu / LINE / LTR overlap indicates a repeat binder.

    Mitochondrial peaks excluded by default

    Trigger: Standard TxDb annotation drops chrM transcripts; FASTKD2 / LRPPRC peaks all "intergenic".

    Mechanism: Some TxDb objects (older releases) exclude mitochondrial chromosome; ChIPseeker reports those peaks as having no annotation.

    Symptom: chrM peaks labeled "Distal Intergenic" or simply missing from the annotation table.

    Fix: Verify TxDb includes chrM (genes(txdb, filter=list(tx_chrom='chrM'))); add custom mt-mRNA + mt-rRNA + mt-tRNA BED if missing.

    Strand information ignored

    Trigger: bedtools intersect -wa -wb without -s flag.

    Mechanism: CLIP is strand-specific. Without -s, peaks on one strand can be annotated to features on the other strand - especially for overlapping transcripts (sense/antisense pairs).

    Symptom: Antisense transcript peaks contaminate the sense annotation; per-region distribution shifted.

    Fix: Always pass -s to bedtools intersect for CLIP. ChIPseeker handles strand automatically when peaks have BED column 6.

    Metagene Analysis

    Metagene = aggregated profile of CLIP signal at each base of a normalized feature (e.g., 5' UTR, CDS, 3' UTR scaled to common length). Reveals position-dependent binding.

    Goal: Generate a position-resolved CLIP-signal profile aggregated across all 3' UTRs (or other features) to identify position-dependent binding modes (e.g., stop-codon-proximal m6A readers, polyA-proximal CPEB).

    Approach: Convert dedup BAM to strand-stranded bigWig, then use deepTools computeMatrix scale-regions to align signal across feature instances of variable length, and plotProfile / plotHeatmap for visualization.

    # deepTools computeMatrix for metagene
    # Step 1: peaks BED -> bedgraph
    bedtools genomecov -bg -strand + -ibam dedup.bam > clip_plus.bg
    bedtools genomecov -bg -strand - -ibam dedup.bam > clip_minus.bg
    # combine to bigwig
    bedGraphToBigWig clip_plus.bg chrom.sizes clip_plus.bw
    
    # Step 2: metagene over 3' UTRs
    computeMatrix scale-regions \
        -S clip_plus.bw clip_minus.bw \
        -R three_prime_UTRs.bed \
        --regionBodyLength 500 \
        --beforeRegionStartLength 100 \
        --afterRegionStartLength 100 \
        -o matrix.gz
    
    plotProfile -m matrix.gz -o metagene.png --perGroup
    plotHeatmap -m matrix.gz -o metagene_heatmap.png
    
    Metagene shape Biology
    Peak at stop codon region m6A reader (YTHDF1/2/3); IGF2BP1; some 3' UTR factors
    Peak in proximal 3' UTR (within 200 nt of stop) PUM2, IGF2BP, miRNA effectors
    Peak in distal 3' UTR (within 200 nt of poly-A) NUDT21, CPEB1, polyadenylation regulators
    Peak at 5' splice site flanking intron RBFOX, MBNL, HNRNPC
    Peak at 3' splice site / branch point U2AF2, PTBP1
    Peak at 5' UTR (within 50 nt of TSS) EIF3, EIF4A3, LARP1
    Flat across CDS Cytoplasmic ribosome-associated; less position-specific

    Decision Tree by Scenario

    Scenario Tool + parameters Why
    Global region distribution (3' UTR / CDS / intron) ChIPseeker tssRegion=c(-100,100) level='transcript' Mature, fast, ENCODE-comparable
    Splicing factor regulatory map RBP-Maps (Yeo) with RNA-seq KD cassette table The standard for splicing-CLIP
    RNA-feature aware (snoRNA, lncRNA-specific) RCAS or custom GTF + bedtools RCAS handles ncRNA features
    Transcript-isoform context resolved Peakhood + ChIPseeker level='transcript' Resolves which isoform the peak supports
    Metagene 3' UTR / 5' UTR deepTools computeMatrix + plotProfile Standard metagene visualization
    Repeat-element overlap (MATR3, HNRNPK) bedtools intersect with RepeatMasker Repeat axis is independent of region axis
    Mitochondrial RBP (FASTKD2) Custom chrM-aware annotation Standard TxDb may drop chrM
    Bulk RBP overview ChIPseeker plotAnnoPie + plotDistToTSS Quick orienting plots
    Compare two RBPs' region distributions ChIPseeker annotatePeakList + side-by-side bar Direct comparability
    miRNA target site (AGO-CLIP) bedtools intersect with 3' UTR + TargetScan seed scan miRNA biology specific
    m6A reader (YTHDF, IGF2BP) ChIPseeker + custom stop-codon-distance plot m6A biology stops-codon-proximal

    Reconciliation: When Annotations Disagree

    Pattern Likely cause Action
    ChIPseeker "Promoter" >> 5' UTR for CLIP TSS region too wide Tighten tssRegion=c(-100,100)
    ChIPseeker gene-level intronic; RBP-Maps shows exonic at cassettes Gene-level picks canonical isoform; RBP-Maps uses cassettes Trust RBP-Maps for splicing factors
    Many peaks "intergenic" near a known gene TxDb missing the transcript Verify GENCODE vs Ensembl GTF; add lincRNA / unannotated transcripts
    chrM peaks missing TxDb excluded chrM Add custom mt-mRNA BED
    Sum of per-region counts > total peaks Hierarchy not enforced Use ChIPseeker; or bedtools intersect -v chain
    Repeat-binding RBP looks "intronic" RepeatMasker axis not added Annotate peaks with RepMask intersect separately
    Metagene flat for known position-specific RBP Wrong feature BED (e.g., gene-body instead of transcript-isoform) Use isoform-aware feature BED
    RBP-Maps signal flat at cassette exons No RNA-seq KD reference table Generate cassettes from companion KD; use ENCODE shRNA tables

    Operational rule: Report (a) ChIPseeker global region pie, (b) RBP-Maps metagene for splicing factors, (c) repeat-element fraction separately, (d) transcript-isoform context with Peakhood if ambiguous. Three views together prevent misinterpretation.

    Common Errors

    Error / symptom Cause Solution
    "Promoter" >> 30% of peaks in CLIP TSS region default too wide tssRegion=c(-100,100)
    Strand-mixing artifacts Bedtools intersect without -s Always pass -s for CLIP
    3' UTR fraction < 10% for HuR Wrong txdb version (UTR poorly annotated) Use GENCODE v38+ TxDb
    chrM peaks lost TxDb excludes mitochondrial Add mt-transcript BED
    RBP-Maps flat metagene No cassette exon BED Generate from RNA-seq KD; or use ENCODE table
    Repeat fraction over-counted Same repeat instance counted multiple times bedtools merge repeat BED first
    ChIPseeker level='gene' collapses isoforms Default level=gene Set level='transcript'
    Annotation runs slow for large peak set RCAS HTML generation Use ChIPseeker for global; RCAS for top-K
    GO enrichment yields generic terms Repeat-binding RBP annotated as intronic Re-annotate with RepeatMasker axis
    Peak in unannotated lincRNA labeled intergenic Older GENCODE missing the lincRNA Update GENCODE; add manual transcript BED

    References

    • Yu G et al 2015 Bioinformatics 31:2382 (ChIPseeker)
    • Yusuf D et al 2017 BMC Bioinformatics 18:227 (RCAS)
    • Yee BA et al 2019 RNA 25:193 (RBP-Maps splicing regulatory maps)
    • Van Nostrand EL et al 2020 Nature 583:711 (ENCODE 150 RBP eCLIP, region distribution analysis)
    • Quinlan AR & Hall IM 2010 Bioinformatics 26:841 (bedtools)
    • Hentze MW et al 2018 Nat Rev Mol Cell Biol 19:327 (RBP function review)
    • Uhl M et al 2022 NAR Genom Bioinform 4:lqac020 (Peakhood transcript-context)
    • ENCODE eCLIP standards (encodeproject.org/eclip) - region distribution conventions

    Related Skills

    • clip-seq/clip-peak-calling - Upstream peak calls
    • clip-seq/crosslink-site-detection - Single-nt CL sites for fine-grained metagene
    • clip-seq/clip-motif-analysis - Motif discovery within annotated regions
    • clip-seq/differential-clip - Cross-condition annotated peaks for regulatory maps
    • clip-seq/ago-clip-mirna-targets - 3' UTR seed-matched site annotation
    • clip-seq/m6a-clip - Stop-codon-proximal m6A annotation
    • genome-intervals/gtf-gff-handling - GTF preparation for annotation
    • genome-intervals/interval-arithmetic - bedtools intersect patterns
    • alternative-splicing/differential-splicing - Cassette exon tables for RBP-Maps
    • chip-seq/peak-annotation - DNA-protein annotation analogue
    Recommended Servers
    Apify
    Apify
    vastlint - IAB XML VAST validator and linter
    vastlint - IAB XML VAST validator and linter
    Bright Data
    Bright Data
    Repository
    gptomics/bioskills
    Files