bio-chip-seq-motif-analysis

gptomics/bio-chip-seq-motif-analysis

Research

178

1 installs

De novo motif discovery and known motif enrichment analysis using HOMER and MEME-ChIP. Identify transcription factor binding motifs in ChIP-seq, ATAC-seq, or other genomic peak data...

SKILL.md

Version Compatibility

Reference examples tested with: BioPython 1.83+, bedtools 2.31+, matplotlib 3.8+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures
CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Motif Analysis

"Find enriched motifs in my ChIP-seq peaks" → Discover de novo DNA-binding motifs and test for known TF motif enrichment in peak sequences.

CLI: findMotifsGenome.pl peaks.bed hg38 output/ (HOMER), meme-chip -db JASPAR peaks.fa (MEME)

Identify DNA sequence motifs enriched in ChIP-seq or ATAC-seq peaks to discover transcription factor binding sites.

Tool Comparison

Tool	Strengths	Use Case
HOMER	Fast, comprehensive, built-in databases	General motif analysis
MEME-ChIP	Multiple algorithms, web interface	Publication-quality
MEME	De novo discovery only	Simple discovery
FIMO	Known motif scanning	Genome-wide scanning

HOMER

Installation

conda install -c bioconda homer

# Configure genome (required once)
perl /path/to/homer/configureHomer.pl -install hg38
perl /path/to/homer/configureHomer.pl -install mm10

De Novo Motif Discovery

Goal: Discover enriched DNA-binding motifs directly from ChIP-seq peak sequences.

Approach: Run findMotifsGenome.pl on a peak BED file with a specified fragment size, optionally providing background regions and target motif lengths.

# Basic motif finding
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200

# With background regions
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200 -bg background.bed

# Specify motif lengths to search
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200 -len 8,10,12

Key Options

Option	Description
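`-size <#>`	Width in bp of the region around each peak center used for motif finding (default 200)
`-size given`	Use the full width of each peak as supplied
`-bg <file>`	Custom background regions (BED); otherwise HOMER selects GC-matched genomic background
`-len <#,#,...>`	Motif lengths to search
`-mask`	Use the repeat-masked genome sequence
`-p <#>`	Number of CPUs
`-S <#>`	Number of motifs to find (default 25)
`-mis <#>`	Mismatches allowed during global optimization (default 2)
`-noweight`	Disable GC-content correction of the background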
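When the same options are reused across many peak sets, it can help to assemble the command in one place. The following is a minimal sketch, not part of HOMER: `build_homer_command` is a hypothetical helper that only builds the argument list from the flags in the table above and does not run anything.

```python
import shlex

def build_homer_command(peaks, genome, out_dir, size=200, lengths=(8, 10, 12),
                        background=None, cpus=1, mask=True):
    """Assemble a findMotifsGenome.pl argument list (hypothetical helper)."""
    # Positional arguments first, then the region size (an integer or "given")
    cmd = ["findMotifsGenome.pl", peaks, genome, out_dir, "-size", str(size)]
    if lengths:
        cmd += ["-len", ",".join(str(n) for n in lengths)]
    if background:
        cmd += ["-bg", background]
    if mask:
        cmd.append("-mask")
    cmd += ["-p", str(cpus)]
    return cmd

cmd = build_homer_command("peaks.bed", "hg38", "output_dir/", cpus=8)
print(shlex.join(cmd))
# To execute the command: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.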

Output Files

output_dir/
├── homerResults.html      # Main results page
├── knownResults.html      # Known motif enrichment
├── homerMotifs.all.motifs # All discovered motifs
├── knownResults.txt       # Known motif statistics
└── homerResults/
    └── motif1.motif       # Individual de novo motif files (motif1, motif2, ...)
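HOMER `.motif` files are plain text: a header line starting with `>` (consensus, name, log-odds threshold, then further statistics) followed by one row of A, C, G, T probabilities per position. A minimal sketch of a reader, assuming that layout; `parse_homer_motifs` and the example motif are illustrative only:

```python
def parse_homer_motifs(text):
    """Parse HOMER motif text into a list of dicts (hypothetical helper)."""
    motifs = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: >consensus <tab> name <tab> log-odds threshold <tab> ...
            fields = line[1:].split("\t")
            current = {
                "consensus": fields[0],
                "name": fields[1] if len(fields) > 1 else "",
                "threshold": float(fields[2]) if len(fields) > 2 else None,
                "matrix": [],
            }
            motifs.append(current)
        elif current is not None:
            # One row per position: probabilities for A, C, G, T
            current["matrix"].append([float(x) for x in line.split()])
    return motifs

# Made-up two-position motif used only to exercise the parser
example = ">CA\texample-motif\t6.5\n0.1\t0.7\t0.1\t0.1\n0.7\t0.1\t0.1\t0.1\n"
parsed = parse_homer_motifs(example)
print(parsed[0]["name"], len(parsed[0]["matrix"]))
```

The same function works on `homerMotifs.all.motifs`, which concatenates several motifs in one file.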

Known Motif Enrichment Only

# Skip de novo, only check known motifs
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200 -nomotif

Scan for Specific Motifs

# Find instances of motif in peaks
annotatePeaks.pl peaks.bed hg38 -m motif.motif > annotated.txt

# Scan genome for motif occurrences (-bed writes BED; the default output is HOMER peak format)
scanMotifGenomeWide.pl motif.motif hg38 -bed > motif_sites.bed

Motif Comparison

# Compare discovered motifs to HOMER's default known-motif library
compareMotifs.pl motifs.motif output_dir/
# To compare against a custom library, pass its path: -known <motif file>

Create Custom Motif

# From consensus sequence (arguments: consensus, mismatches allowed, motif name)
seq2profile.pl CACGTG 0 MYC > MYC.motif

# From aligned sequences
cat aligned_seqs.txt | alignAndConvert.pl - > custom.motif

MEME Suite

Installation

conda install -c bioconda meme

Extract Sequences from Peaks

# Get FASTA sequences under peaks
bedtools getfasta -fi genome.fa -bed peaks.bed -fo peaks.fa

# Recenter each peak on its midpoint and resize to 200 bp
awk 'BEGIN{OFS="\t"} {c=int(($2+$3)/2); s=c-100; if (s<0) s=0; print $1,s,c+100}' peaks.bed | \
    bedtools getfasta -fi genome.fa -bed - -fo peaks_centered.fa
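Recentering peaks to a fixed width around their midpoint can also be done in Python before calling `bedtools getfasta`. This is a minimal sketch with a hypothetical `recenter_intervals` helper. It clamps the start at 0 but, as an assumption of this sketch, does not clamp the end to the chromosome length.

```python
def recenter_intervals(bed_lines, width=200):
    """Return (chrom, start, end) tuples of fixed width around each midpoint."""
    half = width // 2
    out = []
    for line in bed_lines:
        # Skip blank lines and common BED header lines
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        center = (int(start) + int(end)) // 2
        # Intervals near the chromosome start come out shorter than `width`
        out.append((chrom, max(0, center - half), center + half))
    return out

peaks = ["chr1\t1000\t1500\tpeak1", "chr1\t10\t50\tpeak2"]
for chrom, start, end in recenter_intervals(peaks):
    print(chrom, start, end, sep="\t")
```

For narrowPeak input, the summit (start plus the offset in column 10) is usually a better center than the midpoint.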

MEME (De Novo Discovery)

# Basic de novo discovery
meme peaks.fa -dna -oc meme_output -mod zoops -nmotifs 10 -minw 6 -maxw 20

# With Markov background
fasta-get-markov peaks.fa > background.model
meme peaks.fa -dna -oc meme_output -bfile background.model -mod zoops -nmotifs 10

MEME Options

Option	Description
`-mod zoops`	Zero or one per sequence (default for ChIP)
`-mod oops`	Exactly one per sequence
`-mod anr`	Any number of repeats
`-nmotifs <#>`	Number of motifs to find
`-minw <#>`	Minimum motif width
`-maxw <#>`	Maximum motif width
`-revcomp`	Search both strands
`-bfile <file>`	Background model file

MEME-ChIP (Comprehensive Pipeline)

Goal: Run a comprehensive motif analysis pipeline combining de novo discovery, central enrichment testing, and database comparison.

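Approach: Provide peak FASTA sequences and a motif database to MEME-ChIP, which runs MEME, DREME (STREME in newer MEME Suite releases), CentriMo, TOMTOM, and FIMO in a single invocation. Input sequences should be centered on the peak and of similar length.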

# All-in-one ChIP-seq motif analysis
meme-chip -oc meme_chip_output -db motif_database.meme peaks.fa

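MEME-ChIP runs:

MEME - De novo discovery on the central portion of a subsample of the input sequences
DREME - Short motif discovery (replaced by STREME in newer MEME Suite releases)
CentriMo - Central enrichment analysis
TOMTOM - Compare discovered motifs to known motifs
FIMO - Find motif instances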

DREME (Short Motifs)

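# Find short enriched motifs (-p: peak sequences, -n: control sequences)
# DREME is deprecated in newer MEME Suite releases; if it is missing, check `streme --help` for the successor
dreme -oc dreme_output -p peaks.fa -n background.fa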

CentriMo (Central Enrichment)

# Test for central enrichment of known motifs (CentriMo takes double-dash options)
centrimo --oc centrimo_output peaks.fa motif_database.meme

TOMTOM (Motif Comparison)

# Compare discovered motifs to database
tomtom -oc tomtom_output discovered.meme database.meme

FIMO (Motif Scanning)

# Scan sequences for motif matches
fimo --oc fimo_output motif.meme sequences.fa

# Scan genome
fimo --oc fimo_output --max-stored-scores 1000000 motif.meme genome.fa
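FIMO writes its matches to `fimo.tsv` inside the output directory. A minimal sketch of reading that table with the standard library, assuming the MEME Suite 5.x column names (`motif_id`, `sequence_name`, `start`, `stop`, `strand`, `score`, `p-value`, `q-value`, `matched_sequence`) and trailing comment lines that start with `#`. `read_fimo_tsv` and the sample rows are hypothetical.

```python
import csv
import io

def read_fimo_tsv(handle, max_q=0.05):
    """Return FIMO hits whose q-value is at most max_q (hypothetical helper)."""
    # Drop blank lines and the '#' comment lines FIMO appends to the file
    rows = (ln for ln in handle if ln.strip() and not ln.startswith("#"))
    hits = []
    for row in csv.DictReader(rows, delimiter="\t"):
        q = row.get("q-value") or ""
        # Rows without a q-value are kept rather than silently discarded
        if q and float(q) > max_q:
            continue
        hits.append({
            "motif": row["motif_id"],
            "sequence": row["sequence_name"],
            "start": int(row["start"]),
            "stop": int(row["stop"]),
            "strand": row["strand"],
            "p_value": float(row["p-value"]),
        })
    return hits

# Made-up rows used only to exercise the parser
sample = (
    "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n"
    "M1\tTF1\tchr1:100-300\t12\t17\t+\t11.2\t1e-05\t0.01\tCACGTG\n"
    "M1\tTF1\tchr2:500-700\t40\t45\t-\t6.1\t1e-03\t0.5\tCACGTT\n"
    "\n# FIMO (Find Individual Motif Occurrences)\n"
)
hits = read_fimo_tsv(io.StringIO(sample))
print(len(hits), hits[0]["sequence"])
```

For a real run, pass `open("fimo_output/fimo.tsv")` instead of the `io.StringIO` sample.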

Motif Databases

HOMER Built-in

# List available motif sets
ls /path/to/homer/data/knownTFs/

# Vertebrate known motifs; -mknown expects the path to a motif file
findMotifsGenome.pl peaks.bed hg38 output/ -mknown /path/to/homer/data/knownTFs/vertebrates/known.motifs

JASPAR

# Download JASPAR motifs
wget https://jaspar.genereg.net/download/data/2024/CORE/JASPAR2024_CORE_vertebrates_non-redundant_pfms_meme.txt

# Use with MEME suite
meme-chip -db JASPAR2024_CORE_vertebrates_non-redundant_pfms_meme.txt peaks.fa

HOCOMOCO

# Download HOCOMOCO
wget https://hocomoco11.autosome.org/final_bundle/hocomoco11/core/HUMAN/mono/HOCOMOCOv11_core_HUMAN_mono_meme_format.meme

# Use with MEME suite
tomtom discovered.meme HOCOMOCOv11_core_HUMAN_mono_meme_format.meme

Python: Parse HOMER Results

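import pandas as pd

def parse_homer_known(results_file):
    '''Parse HOMER knownResults.txt into a DataFrame sorted by significance.'''
    df = pd.read_csv(results_file, sep='\t')
    # HOMER's header embeds the sequence counts, so rename the 9 columns to stable names
    df.columns = ['Motif', 'Consensus', 'P-value', 'Log P-value',
                  'q-value', 'Targets', 'Target%', 'Background', 'Background%']
    # Percentages are written as strings such as "12.34%"
    for col in ('Target%', 'Background%'):
        df[col] = df[col].astype(str).str.rstrip('%').astype(float)
    df['P-value'] = df['P-value'].astype(float)
    # Very small p-values can underflow to 0.0, so sort on the log p-value instead
    return df.sort_values('Log P-value')

known = parse_homer_known('output_dir/knownResults.txt')
print(known[['Motif', 'P-value', 'Target%']].head(20))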

Python: Parse MEME Results

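from Bio import motifs

def parse_meme_file(meme_file):
    '''Parse MEME XML output (meme.xml) with Biopython.'''
    # Recent Biopython releases read MEME's XML output for the 'meme' format;
    # if your version rejects the file, check help(motifs.parse)
    with open(meme_file) as f:
        record = motifs.parse(f, 'meme')
    return record

record = parse_meme_file('meme_output/meme.xml')
for m in record:
    print(f'{m.name}: {m.consensus}')
    print(m.counts)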

Complete Workflows

ChIP-seq Motif Analysis

Goal: Run a complete motif analysis workflow combining HOMER and MEME-ChIP on ChIP-seq peaks.

Approach: Run HOMER findMotifsGenome.pl for fast de novo and known motif discovery, then extract centered peak sequences and run MEME-ChIP for a complementary analysis.

#!/bin/bash
set -euo pipefail

PEAKS=$1   # narrowPeak or BED file
GENOME=$2  # hg38, mm10, etc.; expects ${GENOME}.fa in the working directory
OUTDIR=$3

mkdir -p "$OUTDIR"

# HOMER analysis
echo "Running HOMER..."
findMotifsGenome.pl "$PEAKS" "$GENOME" "${OUTDIR}/homer" \
    -size 200 -p 8 -mask

# Extract 200 bp sequences centered on each peak midpoint for MEME-ChIP
# (for narrowPeak input, start + column 10 gives the summit, a better center)
echo "Extracting sequences..."
awk 'BEGIN{OFS="\t"} {c=int(($2+$3)/2); s=c-100; if (s<0) s=0; print $1,s,c+100}' "$PEAKS" | \
    bedtools getfasta -fi "${GENOME}.fa" -bed - -fo "${OUTDIR}/peaks.fa"

# MEME-ChIP analysis
echo "Running MEME-ChIP..."
meme-chip -oc "${OUTDIR}/meme_chip" \
    -db /path/to/JASPAR.meme \
    "${OUTDIR}/peaks.fa"

echo "Done. Results in ${OUTDIR}/"

ATAC-seq Footprint Motifs

# Analyze motifs in footprint regions
findMotifsGenome.pl footprints.bed hg38 footprint_motifs/ \
    -size given -mask -p 8

# Compare to accessible regions background (separate output directory so the first run is not overwritten)
findMotifsGenome.pl footprints.bed hg38 footprint_motifs_vs_accessible/ \
    -size given -bg accessible_peaks.bed -mask -p 8

Visualization

HOMER Logo

# Generate sequence logo
motif2Logo.pl motif.motif > logo.eps

Plot with Python

import logomaker
import pandas as pd
import matplotlib.pyplot as plt

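def plot_motif(pwm_file):
    '''Plot a sequence logo from a HOMER motif file containing a single motif.'''
    # Skip the '>' header line; remaining rows are per-position A, C, G, T probabilities
    pwm = pd.read_csv(pwm_file, sep='\t', skiprows=1, header=None)
    pwm.columns = ['A', 'C', 'G', 'T']
    logo = logomaker.Logo(pwm, shade_below=0.5, fade_below=0.5)
    plt.show()
    return logo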
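Sequence logos are usually drawn in bits of information rather than raw probabilities. A minimal sketch of the per-position information content for a DNA probability matrix, IC = 2 + sum(p * log2 p), without any small-sample correction; `information_content` is a hypothetical helper, not part of HOMER or logomaker.

```python
import math

def information_content(matrix):
    """Per-position information content in bits for rows of A, C, G, T values."""
    ic = []
    for row in matrix:
        total = sum(row)
        # Normalize so that rows of counts work as well as rows of probabilities
        probs = [v / total for v in row]
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        # Maximum entropy for a 4-letter alphabet is log2(4) = 2 bits
        ic.append(2.0 - entropy)
    return ic

# Made-up matrix: one invariant position, one uninformative position
matrix = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
print(information_content(matrix))
```

Summing the returned values gives the total information content of the motif, which is one way to compare how specific two motifs are.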

Quality Metrics

Metric	Good	Concerning
P-value	< 1e-10	> 1e-5
Target %	> 20%	< 5%
Background %	< Target/2	Similar to Target
Bit score	> 10	< 5
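The thresholds in the table can be applied programmatically when triaging many motifs. The sketch below is one possible encoding, not an established scoring scheme: `rate_motif` is a hypothetical helper, and treating a background frequency of at least 80% of the target frequency as "similar to target" is an assumption of this example. The bit-score row is not covered.

```python
def rate_motif(p_value, target_pct, background_pct, similar_fraction=0.8):
    """Classify a motif as 'good', 'concerning', or 'intermediate'."""
    # Any single concerning criterion flags the motif
    concerning = (
        p_value > 1e-5
        or target_pct < 5
        # Assumption: background >= 80% of target counts as "similar to target"
        or background_pct >= similar_fraction * target_pct
    )
    if concerning:
        return "concerning"
    # All three 'good' criteria from the table must hold
    good = (
        p_value < 1e-10
        and target_pct > 20
        and background_pct < target_pct / 2
    )
    return "good" if good else "intermediate"

print(rate_motif(1e-50, 35.0, 5.0))
print(rate_motif(1e-3, 35.0, 5.0))
print(rate_motif(1e-8, 15.0, 5.0))
```

These cutoffs are rules of thumb and depend on peak count and background choice, so the labels are a starting point for inspection rather than a verdict.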

Common Issues

No Significant Motifs

Check peak quality (too few peaks?)
Try different peak sizes (-size)
Ensure genome build matches
Check for repeat masking issues

Too Many Motifs

Apply a stricter significance threshold (lower p-value cutoff)
Use -S to limit number of motifs
Filter by target percentage

Wrong Background

Use matched GC content background
Consider using input/control peaks
Try shuffled sequences
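A shuffled background keeps each sequence's length and base composition while destroying any motif. A minimal sketch of a mononucleotide shuffle follows; `shuffle_sequences` is a hypothetical helper. It does not preserve dinucleotide frequencies such as CpG content, and the MEME Suite's `fasta-shuffle-letters` utility may be a better fit when that matters.

```python
import random

def shuffle_sequences(seqs, seed=0):
    """Shuffle the letters within each sequence, preserving base composition."""
    rng = random.Random(seed)  # fixed seed keeps the background reproducible
    shuffled = []
    for s in seqs:
        chars = list(s)
        rng.shuffle(chars)
        shuffled.append("".join(chars))
    return shuffled

# Made-up sequences used only for illustration
peaks = ["ACGTACGTCACGTGAA", "GGGCCCATATCACGTG"]
background = shuffle_sequences(peaks)
for original, control in zip(peaks, background):
    print(original, "->", control)
```

The shuffled sequences can be written to FASTA and supplied as control sequences to the discovery tool.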

Related Skills

peak-calling - Generate input peaks
peak-annotation - Annotate peaks with genes
atac-seq/footprinting - TF footprint analysis
genome-intervals - BED file operations
