NewFlame, an assistant that learns and improves. Available on

bio-pathway-kegg-pathways

gptomics/bio-pathway-kegg-pathways

Research

178

About

SKILL.md

bio-pathway-kegg-pathways

gptomics/bio-pathway-kegg-pathways

Research

178

About

KEGG pathway and module enrichment analysis using clusterProfiler enrichKEGG and enrichMKEGG. Use when identifying metabolic and signaling pathways over-represented in a gene list...

SKILL.md

Version Compatibility

Reference examples tested with: clusterProfiler 4.18+, org.Hs.eg.db 3.18+, gson 0.1+ (snapshot pinning), SPIA 2.50+ and graphite 1.56+ (topology section).

Before using code patterns, verify installed versions match. If versions differ:

R: packageVersion('<pkg>') then ?function_name to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

KEGG is a LIVE DATABASE, not a package. enrichKEGG/enrichMKEGG/gseKEGG query the KEGG REST API (https://rest.kegg.jp/) at call time, so the same code on the same genes returns DIFFERENT pathways months apart as KEGG updates. For any reported result, pin the release with a gson snapshot (below) and record the access date; use_internal_data=TRUE does NOT pin the current KEGG (it loads the deprecated 2012 KEGG.db).

KEGG Pathway and Topology Enrichment

"Which KEGG pathways are perturbed in my data?" -> Join genes to KEGG's curated pathway/module gene sets (ORA or GSEA), or propagate fold-changes through KEGG's signed wiring (SPIA) - and pin the KEGG release, because the result is a timestamped query against a moving curation, not a fact about the biology.

R: enrichKEGG(gene, organism, keyType) | gseKEGG(geneList, organism) | spia(de, all, organism)

Scope: KEGG-specific enrichment across all three generations - membership ORA (enrichKEGG/enrichMKEGG), ranked GSEA (gseKEGG), and signed-topology perturbation (SPIA/graphite). KEGG ID mapping (organism codes, keyType, bitr_kegg, prokaryotic locus tags, KO routing), reproducibility/pinning, and pathview map overlay live here. The hypergeometric test and the universe problem -> go-enrichment. The GSEA running-sum engine and ranking-metric choice -> gsea. Reactome/WikiPathways gene sets -> reactome-pathways, wikipathways. Generic dot/cnet/emap plots -> enrichment-visualization. The DE list and fold-changes -> differential-expression/de-results.

The Single Most Important Modern Insight -- A KEGG Result Is a Timestamped Join Against a Moving, Partially-Paywalled Curation, Not a Fact About Biology

Two consequences follow, and both are invisible until someone reruns the analysis.

The query is live, so the result is irreproducible unless the release is pinned. enrichKEGG/gseKEGG/SPIA hit the KEGG REST API at call time; KEGG adds maps, re-annotates genes, and revises edges continuously, so identical code returns a different pathway list next quarter. The fix is a gson snapshot: gson_KEGG('hsa') downloads the current KEGG pathway/module sets into a GSON object, write.gson()/read.gson() persist it, and the generic enricher(gene, gson=k) / GSEA(geneList, gson=k) run frozen and offline against it. Record the access date. use_internal_data=TRUE is NOT this fix - it silently reaches for the deprecated 2012 KEGG.db, which is the wrong, stale snapshot.
KEGG is the only mainstream database shipping signed, directed signaling topology (KGML), which is why this skill owns the third generation of pathway analysis. ORA and GSEA treat a pathway as an unordered bag of exchangeable genes; SPIA asks a question they structurally cannot pose - given where each gene sits in the wiring and the sign of every edge, how perturbed is this pathway? That requires the topology only KEGG (and a few others via graphite) provides. The discipline: choose the generation by the question (membership? rank? signed perturbation?), match keyType/organism to the actual IDs (locus tags for bacteria, KO for non-model), set the universe to the genes that could have been called DE, and pin the release before publishing.

Tool Taxonomy (KEGG Across the Three Generations)

Method	Generation	Engine	Uses log2FC?	Uses topology/direction?	Suitable KEGG maps	Citation
enrichKEGG (ORA)	1st (over-representation)	hypergeometric	no (gene list)	no	all	Wu 2021 The Innovation 2:100141; Kanehisa & Goto 2000 Nucleic Acids Res 28:27
enrichMKEGG (ORA on modules)	1st	hypergeometric	no	no	modules (M-numbers)	Wu 2021 The Innovation 2:100141
gseKEGG (GSEA)	2nd (functional class scoring)	fgsea running sum	yes (ranking)	no	all (as sets)	Wu 2021 The Innovation 2:100141; engine -> gsea
SPIA	3rd (pathway topology)	pNDE (ORA) x pPERT (perturbation) -> pG	yes (named log2FC)	YES (signed KGML)	SIGNALING only	Tarca 2009 Bioinformatics 25:75; Draghici 2007 Genome Res 17:1537
graphite + runSPIA	3rd	SPIA over harmonized graphs	yes	YES	signaling (KEGG/Reactome)	Sales 2012 BMC Bioinformatics 13:20

The three-generations framing (ORA -> FCS -> pathway topology) is Khatri 2012 PLoS Comput Biol 8:e1002375; this skill is the KEGG instantiation of all three (the category README compares the generations across databases).

Decision Tree by Scenario

Scenario	Recommended	Why
Pre-selected gene list, "which KEGG pathways"	enrichKEGG (ORA), set the universe	no ranking available; membership test
All genes carry a DE statistic, no clear cutoff	gseKEGG -> gsea	uses the full ranking; no arbitrary cutoff
Want WHERE in a broad pathway the signal sits	enrichMKEGG (modules)	M-numbers are tighter functional units
Have named log2FC + want signed perturbation on a SIGNALING map	SPIA (or graphite + runSPIA)	propagates fold-changes through the wiring; uses direction
Metabolic-pathway question (glycolysis, TCA)	enrichKEGG / gseKEGG	metabolic maps are compound-mediated; SPIA is undefined there
Human / mouse / model eukaryote	bitr -> Entrez, keyType='ncbi-geneid'	KEGG gene ID == Entrez for these organisms
Bacterial / prokaryotic data	locus tags, keyType='kegg', NO OrgDb/bitr	bacterial KEGG IDs ARE locus tags; no org.*.eg.db exists
Non-model organism with no KEGG genome	map to KO, organism='ko'	the universal escape hatch into KEGG pathway space
Result must be reproducible / published	gson_KEGG snapshot + enricher/GSEA, record date	live unpinned queries drift; use_internal_data pins the WRONG 2012 db
Multiple conditions to compare side by side	compareCluster(fun='enrichKEGG')	one model, faceted dotplot; never compare raw p-values
Overlay per-gene data on the KEGG map image	pathview -> render	a KEGG-specific operation; generic plots -> enrichment-visualization
The DE list / fold-changes themselves	-> differential-expression/de-results	upstream, not enrichment

Prepare the Gene IDs (the Join That Decides Everything)

Goal: Get the query genes and the universe into the exact ID type KEGG expects for the organism, because every KEGG failure is a join failure.

Approach: For model eukaryotes convert SYMBOL/ENSEMBL to Entrez (KEGG's gene ID for hsa/mmu/rno) and pass keyType='ncbi-geneid'. For prokaryotes pass locus tags directly with keyType='kegg' and no OrgDb. Convert the universe the same way. Passing ENSEMBL/SYMBOL to enrichKEGG returns zero hits silently.

library(clusterProfiler)
library(org.Hs.eg.db)

de <- read.csv('de_results.csv')   # DE list source -> differential-expression/de-results
sig_symbols <- de$gene[de$padj < 0.05 & abs(de$log2FoldChange) > 1]   # padj is the DESeq2 adjusted-p column
sig_entrez  <- bitr(sig_symbols, fromType='SYMBOL', toType='ENTREZID', OrgDb=org.Hs.eg.db)$ENTREZID

# universe = genes that COULD have been called DE (non-NA test statistic), same ID type
universe <- bitr(de$gene[!is.na(de$pvalue)], fromType='SYMBOL', toType='ENTREZID', OrgDb=org.Hs.eg.db)$ENTREZID

bitr_kegg(geneID, fromType, toType, organism) converts among KEGG's own ID flavors ('kegg', 'ncbi-geneid', 'ncbi-proteinid', 'uniprot') via the REST conv endpoint - use it when starting from UniProt or NCBI protein IDs. Check KEGG coverage of an organism with search_kegg_organism('Pseudomonas aeruginosa', by='scientific_name').

Run KEGG ORA (enrichKEGG / enrichMKEGG)

Goal: Find KEGG pathways (or modules) over-represented among the query genes relative to the measured universe.

Approach: Run enrichKEGG with the correct organism code, keyType, and an explicit universe; enrichKEGG has no readable argument, so translate the geneID column to symbols afterward with setReadable (eukaryotes only).

kk <- enrichKEGG(gene=sig_entrez, organism='hsa', keyType='ncbi-geneid',
                 universe=universe, pvalueCutoff=0.05, pAdjustMethod='BH',
                 minGSSize=10, maxGSSize=500, qvalueCutoff=0.2)
kk <- setReadable(kk, OrgDb=org.Hs.eg.db, keyType='ENTREZID')   # eukaryotes only; no OrgDb -> keep raw IDs
head(as.data.frame(kk))   # ID, Description, GeneRatio, BgRatio, pvalue, p.adjust, qvalue, geneID, Count

mkk <- enrichMKEGG(gene=sig_entrez, organism='hsa', keyType='ncbi-geneid', universe=universe)   # KEGG MODULES (M-numbers)

Report p.adjust/qvalue, not raw pvalue. Fold enrichment = GeneRatio / BgRatio. enrichMKEGG tests smaller, sparser sets: higher resolution (which sub-process is hit) but lower power and many genes belong to no module.

Run KEGG GSEA (gseKEGG)

Goal: Find KEGG sets whose genes shift coordinately across the full ranking, with no cutoff.

Approach: Build a named numeric vector sorted DECREASING by the ranking metric, fix the seed (gseKEGG defaults seed=FALSE), then run gseKEGG. The running-sum engine and the ranking-metric choice are owned by gsea; only the KEGG arguments (organism, keyType) are KEGG-specific.

geneList <- de$log2FoldChange; names(geneList) <- de$entrez   # names = Entrez IDs
geneList <- sort(geneList[!is.na(geneList)], decreasing=TRUE)
set.seed(123)   # gseKEGG seed=FALSE by default; fix it so permutation p-values are reproducible
kk2 <- gseKEGG(geneList=geneList, organism='hsa', keyType='ncbi-geneid', minGSSize=10, maxGSSize=500, pvalueCutoff=0.05)

Run Signed-Topology Perturbation (SPIA) -- the Third Generation

Goal: Score how perturbed each SIGNALING pathway is given both the over-representation of DE genes and the propagation of their fold-changes through the signed wiring.

Approach: SPIA combines pNDE (the classical over-representation evidence) with pPERT (the probability of the observed total accumulated perturbation tA, computed by propagating log2 fold-changes through KGML activation/inhibition edges) into a single global pG, then FDR-corrects it. It needs a NAMED vector of DE fold-changes plus the universe, and is defined only for signaling maps. graphite is the modern route: it harmonizes node IDs, resolves complexes/families, removes compounds, and can run SPIA over Reactome topology too.

library(SPIA)
sig <- de[de$padj < 0.05, ]   # DE genes only
map <- bitr(sig$gene, 'SYMBOL', 'ENTREZID', org.Hs.eg.db)   # bitr drops/many-to-one: MERGE, never assign as names
de_vec <- setNames(sig$log2FoldChange[match(map$SYMBOL, sig$gene)], map$ENTREZID)
de_vec <- de_vec[!duplicated(names(de_vec))]
res <- spia(de=de_vec, all=universe, organism='hsa', nB=2000, plots=FALSE)   # nB=2000 bootstraps for pPERT
# output cols: Name, ID, pSize, NDE, pNDE, tA, pPERT, pG, pGFdr, pGFWER, Status, KEGGLINK
# Status reports inferred Activated / Inhibited from the sign of tA

# graphite route (decouples from KEGG's bundled data; works on Reactome too)
library(graphite)
db <- pathways('hsapiens', 'kegg')
db <- convertIdentifiers(db, 'ENTREZID')
prepareSPIA(db, 'kegg_hsa_spia')              # writes the pathway dataset file
gr <- runSPIA(de=de_vec, all=universe, 'kegg_hsa_spia')

SPIA aborts if more than ~1% of the DE IDs are absent from all, so build the universe from the same ID space. The standalone SPIA package also ships a frozen hsaSPIA data object that is an OLDER snapshot than a live enrichKEGG query - do not mix the two in one comparison.

Pin the KEGG Release for Reproducibility

Goal: Freeze the KEGG data a result depends on so the analysis is reproducible and runs offline.

Approach: Snapshot the current KEGG sets into a GSON object, persist it, and run enrichment against the snapshot with the generic enricher/GSEA (which accept a gson argument); record the access date. Do NOT use use_internal_data=TRUE for this.

library(gson)                                      # GSON class + write.gson/read.gson
k <- gson_KEGG('hsa')                              # gson_KEGG is exported by clusterProfiler; downloads current KEGG sets
k@accessed_date <- as.character(Sys.Date())        # the accessed_date slot survives write/read; a base attr() does not
write.gson(k, file.path(tempdir(), 'kegg_hsa.gson'))
k <- read.gson(file.path(tempdir(), 'kegg_hsa.gson'))

kk_pinned  <- enricher(sig_entrez, gson=k, universe=universe)   # frozen ORA, offline, reproducible
gsea_pinned <- GSEA(geneList, gson=k)                            # frozen GSEA against the snapshot

Compare Multiple Conditions

Goal: See shared and condition-specific KEGG pathways across groups in one faceted figure.

Approach: Pass named gene lists to compareCluster with fun='enrichKEGG'; it fits one model and produces a faceted dotplot. Compare pathway-ID SETS across conditions, never raw p-values (they depend on sample size, DE gene count, and the KEGG release).

clusters <- list(up=up_entrez, down=down_entrez)
ck <- compareCluster(geneClusters=clusters, fun='enrichKEGG', organism='hsa', keyType='ncbi-geneid')
ck <- setReadable(ck, OrgDb=org.Hs.eg.db, keyType='ENTREZID')
# dotplot(ck) -> enrichment-visualization for the plot grammar

Overlay Data on the KEGG Map (pathview)

pathview downloads a KEGG pathway's KGML and image, joins per-gene values to the nodes, and writes a colored map PNG/PDF (a KEGG-specific operation owned here; generic dot/cnet/emap plots route to enrichment-visualization). It writes files to the working directory and queries KEGG live.

library(pathview)
vals <- setNames(de$log2FoldChange, de$entrez)
pathview(gene.data=vals, pathway.id='hsa04110', species='hsa', gene.idtype='entrez')   # writes hsa04110.pathview.png

Per-Method Failure Modes

ENSEMBL/SYMBOL passed to enrichKEGG

Trigger: feeding OrgDb-style ENSEMBL or SYMBOL IDs to enrichKEGG/gseKEGG. Mechanism: KEGG's keyType is 'kegg'/'ncbi-geneid'/'ncbi-proteinid'/'uniprot', not an OrgDb keytype, so no IDs join. Symptom: zero enriched pathways, no error. Fix: bitr to Entrez and set keyType='ncbi-geneid' (eukaryotes), or pass locus tags with keyType='kegg' (prokaryotes).

Live-query result treated as reproducible

Trigger: reporting an enrichKEGG/gseKEGG/SPIA result without pinning the release. Mechanism: the REST query returns the CURRENT KEGG, which changes over time. Symptom: a rerun months later yields a different pathway list. Fix: snapshot with gson_KEGG, run enricher/GSEA against the gson, and record the access date.

use_internal_data=TRUE believed to pin current KEGG

Trigger: setting use_internal_data=TRUE for reproducibility. Mechanism: it loads the deprecated 2012 KEGG.db, not a current pin (and may simply fail). Symptom: stale or absent pathways unlike the live result. Fix: use a gson snapshot instead; treat KEGG.db as legacy-only.

SPIA on metabolic maps

Trigger: running SPIA/graphite topology on glycolysis or other metabolic maps. Mechanism: metabolic maps are compound-mediated and give no clean signed gene->gene graph. Symptom: meaningless perturbation scores. Fix: restrict SPIA to signaling maps; use enrichKEGG/gseKEGG for metabolism.

Whole-database universe in ORA

Trigger: omitting universe. Mechanism: the default background is all KEGG-annotated genes, biased toward well-studied, metabolically central genes. Symptom: inflated significance for pathways enriched in measured/expressed genes (the tissue-specificity artifact). Fix: set universe to the genes that could have been called DE, in the same ID type.

Locus-tag / strain mismatch in prokaryotes

Trigger: locus tags from a re-annotated genome or a different strain than KEGG's reference. Mechanism: the gene-ID join is exact; drifted locus tags do not match KEGG's pae/eco genome. Symptom: many genes silently dropped, weak or empty enrichment. Fix: confirm the organism code and reference genome with search_kegg_organism; align locus tags to KEGG's annotation, or route through KO.

bitr/OrgDb forced onto bacteria

Trigger: running bitr() or setReadable() on a prokaryote. Mechanism: no org.*.eg.db exists for most bacteria and there is no Entrez==KEGG identity. Symptom: bitr fails or empties the gene list; setReadable errors. Fix: pass locus tags directly with keyType='kegg'; keep raw IDs (no setReadable).

Quantitative Thresholds

Threshold	Source	Rationale
pvalueCutoff=0.05	enrichKEGG/gseKEGG default	filters on p.adjust by default; standard FDR gate
qvalueCutoff=0.2	clusterProfiler default	secondary q-value gate on enrichResult
pAdjustMethod='BH'	clusterProfiler default	Benjamini-Hochberg FDR; less conservative than Bonferroni for discovery
minGSSize=10	enrichKEGG default	drop tiny sets that overfit and give unstable p-values
maxGSSize=500	enrichKEGG default	drop very broad sets that always 'enrich'
nB=2000	SPIA default	bootstrap replicates for the pPERT null; raise for stable small p-values
SPIA aborts if >1% of DE IDs absent from `all`	Tarca 2009 Bioinformatics 25:75	the perturbation null requires the DE genes live in the universe
set.seed before gseKEGG/SPIA	reproducibility	gseKEGG seed=FALSE and SPIA bootstrap are stochastic; fix the seed
ID-conversion loss > ~15%	practice heuristic	report the bitr conversion rate; heavy loss makes the result unreliable

Common Errors

Error / symptom	Cause	Solution
enrichKEGG returns 0 pathways	ENSEMBL/SYMBOL passed, or wrong organism code, or KEGG API unreachable	bitr to Entrez + keyType='ncbi-geneid'; verify code with search_kegg_organism; check network
`setReadable` errors	no OrgDb for the organism (prokaryote)	skip setReadable; keep raw KEGG IDs
`gson=` rejected by enrichKEGG	enrichKEGG/gseKEGG have no gson argument	pass the gson to the generic enricher()/GSEA() instead
Different pathways on rerun	live KEGG changed between runs	pin with a gson snapshot and record the access date
SPIA: "more than 1% of de IDs not in all"	DE IDs not a subset of the universe	build de and all from the same ID space
SPIA gives nonsense on glycolysis	topology on a metabolic map	use enrichKEGG/gseKEGG; SPIA is signaling-only
Bacterial list gives 0 hits	Entrez/bitr forced onto a prokaryote	pass locus tags with keyType='kegg', no OrgDb

References

Kanehisa M, Goto S. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28:27-30.
Kanehisa M, Furumichi M, Sato Y, et al. 2023. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res 51:D587-D592.
Wu T, Hu E, Xu S, et al. 2021. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation 2:100141.
Tarca AL, Draghici S, Khatri P, et al. 2009. A novel signaling pathway impact analysis (SPIA). Bioinformatics 25:75-82.
Draghici S, Khatri P, Tarca AL, et al. 2007. A systems biology approach for pathway level analysis. Genome Res 17:1537-1545.
Sales G, Calura E, Cavalieri D, Romualdi C. 2012. graphite - a Bioconductor package to convert pathway topology to gene network. BMC Bioinformatics 13:20.
Luo W, Brouwer C. 2013. Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics 29:1830-1831.
Khatri P, Sirota M, Butte AJ. 2012. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 8:e1002375.

Related Skills

go-enrichment - Hypergeometric ORA and the background-universe problem
gsea - GSEA running-sum engine and ranking-metric choice (gseKEGG)
reactome-pathways - Reactome curated-pathway enrichment (reproducible local DB)
wikipathways - WikiPathways community-pathway enrichment
enrichment-visualization - Dot/bar/cnet/emap/ridge plots of enrichment results
differential-expression/de-results - Source of the gene list and the fold-changes
workflows/expression-to-pathways - End-to-end DE-to-enrichment pipeline

About

SKILL.md

About

KEGG pathway and module enrichment analysis using clusterProfiler enrichKEGG and enrichMKEGG. Use when identifying metabolic and signaling pathways over-represented in a gene list...

SKILL.md

Version Compatibility

Reference examples tested with: clusterProfiler 4.18+, org.Hs.eg.db 3.18+, gson 0.1+ (snapshot pinning), SPIA 2.50+ and graphite 1.56+ (topology section).

Before using code patterns, verify installed versions match. If versions differ:

R: packageVersion('<pkg>') then ?function_name to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

KEGG Pathway and Topology Enrichment

R: enrichKEGG(gene, organism, keyType) | gseKEGG(geneList, organism) | spia(de, all, organism)

The Single Most Important Modern Insight -- A KEGG Result Is a Timestamped Join Against a Moving, Partially-Paywalled Curation, Not a Fact About Biology

Two consequences follow, and both are invisible until someone reruns the analysis.

The query is live, so the result is irreproducible unless the release is pinned. enrichKEGG/gseKEGG/SPIA hit the KEGG REST API at call time; KEGG adds maps, re-annotates genes, and revises edges continuously, so identical code returns a different pathway list next quarter. The fix is a gson snapshot: gson_KEGG('hsa') downloads the current KEGG pathway/module sets into a GSON object, write.gson()/read.gson() persist it, and the generic enricher(gene, gson=k) / GSEA(geneList, gson=k) run frozen and offline against it. Record the access date. use_internal_data=TRUE is NOT this fix - it silently reaches for the deprecated 2012 KEGG.db, which is the wrong, stale snapshot.
KEGG is the only mainstream database shipping signed, directed signaling topology (KGML), which is why this skill owns the third generation of pathway analysis. ORA and GSEA treat a pathway as an unordered bag of exchangeable genes; SPIA asks a question they structurally cannot pose - given where each gene sits in the wiring and the sign of every edge, how perturbed is this pathway? That requires the topology only KEGG (and a few others via graphite) provides. The discipline: choose the generation by the question (membership? rank? signed perturbation?), match keyType/organism to the actual IDs (locus tags for bacteria, KO for non-model), set the universe to the genes that could have been called DE, and pin the release before publishing.

Tool Taxonomy (KEGG Across the Three Generations)

Method	Generation	Engine	Uses log2FC?	Uses topology/direction?	Suitable KEGG maps	Citation
enrichKEGG (ORA)	1st (over-representation)	hypergeometric	no (gene list)	no	all	Wu 2021 The Innovation 2:100141; Kanehisa & Goto 2000 Nucleic Acids Res 28:27
enrichMKEGG (ORA on modules)	1st	hypergeometric	no	no	modules (M-numbers)	Wu 2021 The Innovation 2:100141
gseKEGG (GSEA)	2nd (functional class scoring)	fgsea running sum	yes (ranking)	no	all (as sets)	Wu 2021 The Innovation 2:100141; engine -> gsea
SPIA	3rd (pathway topology)	pNDE (ORA) x pPERT (perturbation) -> pG	yes (named log2FC)	YES (signed KGML)	SIGNALING only	Tarca 2009 Bioinformatics 25:75; Draghici 2007 Genome Res 17:1537
graphite + runSPIA	3rd	SPIA over harmonized graphs	yes	YES	signaling (KEGG/Reactome)	Sales 2012 BMC Bioinformatics 13:20

Decision Tree by Scenario

Scenario	Recommended	Why
Pre-selected gene list, "which KEGG pathways"	enrichKEGG (ORA), set the universe	no ranking available; membership test
All genes carry a DE statistic, no clear cutoff	gseKEGG -> gsea	uses the full ranking; no arbitrary cutoff
Want WHERE in a broad pathway the signal sits	enrichMKEGG (modules)	M-numbers are tighter functional units
Have named log2FC + want signed perturbation on a SIGNALING map	SPIA (or graphite + runSPIA)	propagates fold-changes through the wiring; uses direction
Metabolic-pathway question (glycolysis, TCA)	enrichKEGG / gseKEGG	metabolic maps are compound-mediated; SPIA is undefined there
Human / mouse / model eukaryote	bitr -> Entrez, keyType='ncbi-geneid'	KEGG gene ID == Entrez for these organisms
Bacterial / prokaryotic data	locus tags, keyType='kegg', NO OrgDb/bitr	bacterial KEGG IDs ARE locus tags; no org.*.eg.db exists
Non-model organism with no KEGG genome	map to KO, organism='ko'	the universal escape hatch into KEGG pathway space
Result must be reproducible / published	gson_KEGG snapshot + enricher/GSEA, record date	live unpinned queries drift; use_internal_data pins the WRONG 2012 db
Multiple conditions to compare side by side	compareCluster(fun='enrichKEGG')	one model, faceted dotplot; never compare raw p-values
Overlay per-gene data on the KEGG map image	pathview -> render	a KEGG-specific operation; generic plots -> enrichment-visualization
The DE list / fold-changes themselves	-> differential-expression/de-results	upstream, not enrichment

Prepare the Gene IDs (the Join That Decides Everything)

Goal: Get the query genes and the universe into the exact ID type KEGG expects for the organism, because every KEGG failure is a join failure.

library(clusterProfiler)
library(org.Hs.eg.db)

de <- read.csv('de_results.csv')   # DE list source -> differential-expression/de-results
sig_symbols <- de$gene[de$padj < 0.05 & abs(de$log2FoldChange) > 1]   # padj is the DESeq2 adjusted-p column
sig_entrez  <- bitr(sig_symbols, fromType='SYMBOL', toType='ENTREZID', OrgDb=org.Hs.eg.db)$ENTREZID

# universe = genes that COULD have been called DE (non-NA test statistic), same ID type
universe <- bitr(de$gene[!is.na(de$pvalue)], fromType='SYMBOL', toType='ENTREZID', OrgDb=org.Hs.eg.db)$ENTREZID

Run KEGG ORA (enrichKEGG / enrichMKEGG)

Goal: Find KEGG pathways (or modules) over-represented among the query genes relative to the measured universe.

kk <- enrichKEGG(gene=sig_entrez, organism='hsa', keyType='ncbi-geneid',
                 universe=universe, pvalueCutoff=0.05, pAdjustMethod='BH',
                 minGSSize=10, maxGSSize=500, qvalueCutoff=0.2)
kk <- setReadable(kk, OrgDb=org.Hs.eg.db, keyType='ENTREZID')   # eukaryotes only; no OrgDb -> keep raw IDs
head(as.data.frame(kk))   # ID, Description, GeneRatio, BgRatio, pvalue, p.adjust, qvalue, geneID, Count

mkk <- enrichMKEGG(gene=sig_entrez, organism='hsa', keyType='ncbi-geneid', universe=universe)   # KEGG MODULES (M-numbers)

Run KEGG GSEA (gseKEGG)

Goal: Find KEGG sets whose genes shift coordinately across the full ranking, with no cutoff.

geneList <- de$log2FoldChange; names(geneList) <- de$entrez   # names = Entrez IDs
geneList <- sort(geneList[!is.na(geneList)], decreasing=TRUE)
set.seed(123)   # gseKEGG seed=FALSE by default; fix it so permutation p-values are reproducible
kk2 <- gseKEGG(geneList=geneList, organism='hsa', keyType='ncbi-geneid', minGSSize=10, maxGSSize=500, pvalueCutoff=0.05)

Run Signed-Topology Perturbation (SPIA) -- the Third Generation

Goal: Score how perturbed each SIGNALING pathway is given both the over-representation of DE genes and the propagation of their fold-changes through the signed wiring.

library(SPIA)
sig <- de[de$padj < 0.05, ]   # DE genes only
map <- bitr(sig$gene, 'SYMBOL', 'ENTREZID', org.Hs.eg.db)   # bitr drops/many-to-one: MERGE, never assign as names
de_vec <- setNames(sig$log2FoldChange[match(map$SYMBOL, sig$gene)], map$ENTREZID)
de_vec <- de_vec[!duplicated(names(de_vec))]
res <- spia(de=de_vec, all=universe, organism='hsa', nB=2000, plots=FALSE)   # nB=2000 bootstraps for pPERT
# output cols: Name, ID, pSize, NDE, pNDE, tA, pPERT, pG, pGFdr, pGFWER, Status, KEGGLINK
# Status reports inferred Activated / Inhibited from the sign of tA

# graphite route (decouples from KEGG's bundled data; works on Reactome too)
library(graphite)
db <- pathways('hsapiens', 'kegg')
db <- convertIdentifiers(db, 'ENTREZID')
prepareSPIA(db, 'kegg_hsa_spia')              # writes the pathway dataset file
gr <- runSPIA(de=de_vec, all=universe, 'kegg_hsa_spia')

Pin the KEGG Release for Reproducibility

Goal: Freeze the KEGG data a result depends on so the analysis is reproducible and runs offline.

library(gson)                                      # GSON class + write.gson/read.gson
k <- gson_KEGG('hsa')                              # gson_KEGG is exported by clusterProfiler; downloads current KEGG sets
k@accessed_date <- as.character(Sys.Date())        # the accessed_date slot survives write/read; a base attr() does not
write.gson(k, file.path(tempdir(), 'kegg_hsa.gson'))
k <- read.gson(file.path(tempdir(), 'kegg_hsa.gson'))

kk_pinned  <- enricher(sig_entrez, gson=k, universe=universe)   # frozen ORA, offline, reproducible
gsea_pinned <- GSEA(geneList, gson=k)                            # frozen GSEA against the snapshot

Compare Multiple Conditions

Goal: See shared and condition-specific KEGG pathways across groups in one faceted figure.

clusters <- list(up=up_entrez, down=down_entrez)
ck <- compareCluster(geneClusters=clusters, fun='enrichKEGG', organism='hsa', keyType='ncbi-geneid')
ck <- setReadable(ck, OrgDb=org.Hs.eg.db, keyType='ENTREZID')
# dotplot(ck) -> enrichment-visualization for the plot grammar

Overlay Data on the KEGG Map (pathview)

library(pathview)
vals <- setNames(de$log2FoldChange, de$entrez)
pathview(gene.data=vals, pathway.id='hsa04110', species='hsa', gene.idtype='entrez')   # writes hsa04110.pathview.png

Per-Method Failure Modes

ENSEMBL/SYMBOL passed to enrichKEGG

Live-query result treated as reproducible

use_internal_data=TRUE believed to pin current KEGG

SPIA on metabolic maps

Whole-database universe in ORA

Locus-tag / strain mismatch in prokaryotes

bitr/OrgDb forced onto bacteria

Quantitative Thresholds

Threshold	Source	Rationale
pvalueCutoff=0.05	enrichKEGG/gseKEGG default	filters on p.adjust by default; standard FDR gate
qvalueCutoff=0.2	clusterProfiler default	secondary q-value gate on enrichResult
pAdjustMethod='BH'	clusterProfiler default	Benjamini-Hochberg FDR; less conservative than Bonferroni for discovery
minGSSize=10	enrichKEGG default	drop tiny sets that overfit and give unstable p-values
maxGSSize=500	enrichKEGG default	drop very broad sets that always 'enrich'
nB=2000	SPIA default	bootstrap replicates for the pPERT null; raise for stable small p-values
SPIA aborts if >1% of DE IDs absent from `all`	Tarca 2009 Bioinformatics 25:75	the perturbation null requires the DE genes live in the universe
set.seed before gseKEGG/SPIA	reproducibility	gseKEGG seed=FALSE and SPIA bootstrap are stochastic; fix the seed
ID-conversion loss > ~15%	practice heuristic	report the bitr conversion rate; heavy loss makes the result unreliable

Common Errors

Error / symptom	Cause	Solution
enrichKEGG returns 0 pathways	ENSEMBL/SYMBOL passed, or wrong organism code, or KEGG API unreachable	bitr to Entrez + keyType='ncbi-geneid'; verify code with search_kegg_organism; check network
`setReadable` errors	no OrgDb for the organism (prokaryote)	skip setReadable; keep raw KEGG IDs
`gson=` rejected by enrichKEGG	enrichKEGG/gseKEGG have no gson argument	pass the gson to the generic enricher()/GSEA() instead
Different pathways on rerun	live KEGG changed between runs	pin with a gson snapshot and record the access date
SPIA: "more than 1% of de IDs not in all"	DE IDs not a subset of the universe	build de and all from the same ID space
SPIA gives nonsense on glycolysis	topology on a metabolic map	use enrichKEGG/gseKEGG; SPIA is signaling-only
Bacterial list gives 0 hits	Entrez/bitr forced onto a prokaryote	pass locus tags with keyType='kegg', no OrgDb

References

Kanehisa M, Goto S. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28:27-30.
Kanehisa M, Furumichi M, Sato Y, et al. 2023. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res 51:D587-D592.
Wu T, Hu E, Xu S, et al. 2021. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation 2:100141.
Tarca AL, Draghici S, Khatri P, et al. 2009. A novel signaling pathway impact analysis (SPIA). Bioinformatics 25:75-82.
Draghici S, Khatri P, Tarca AL, et al. 2007. A systems biology approach for pathway level analysis. Genome Res 17:1537-1545.
Sales G, Calura E, Cavalieri D, Romualdi C. 2012. graphite - a Bioconductor package to convert pathway topology to gene network. BMC Bioinformatics 13:20.
Luo W, Brouwer C. 2013. Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics 29:1830-1831.
Khatri P, Sirota M, Butte AJ. 2012. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 8:e1002375.

Related Skills

go-enrichment - Hypergeometric ORA and the background-universe problem
gsea - GSEA running-sum engine and ranking-metric choice (gseKEGG)
reactome-pathways - Reactome curated-pathway enrichment (reproducible local DB)
wikipathways - WikiPathways community-pathway enrichment
enrichment-visualization - Dot/bar/cnet/emap/ridge plots of enrichment results
differential-expression/de-results - Source of the gene list and the fold-changes
workflows/expression-to-pathways - End-to-end DE-to-enrichment pipeline