KEGG pathway and module enrichment analysis using clusterProfiler enrichKEGG and enrichMKEGG. Use when identifying metabolic and signaling pathways over-represented in a gene list...
Reference examples tested with: clusterProfiler 4.18+, org.Hs.eg.db 3.18+, gson 0.1+ (snapshot pinning), SPIA 2.50+ and graphite 1.56+ (topology section).
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('<pkg>') then ?function_name to verify parametersIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
KEGG is a LIVE DATABASE, not a package. enrichKEGG/enrichMKEGG/gseKEGG query the KEGG REST API (https://rest.kegg.jp/) at call time, so the same code on the same genes returns DIFFERENT pathways months apart as KEGG updates. For any reported result, pin the release with a gson snapshot (below) and record the access date; use_internal_data=TRUE does NOT pin the current KEGG (it loads the deprecated 2012 KEGG.db).
"Which KEGG pathways are perturbed in my data?" -> Join genes to KEGG's curated pathway/module gene sets (ORA or GSEA), or propagate fold-changes through KEGG's signed wiring (SPIA) - and pin the KEGG release, because the result is a timestamped query against a moving curation, not a fact about the biology.
enrichKEGG(gene, organism, keyType) | gseKEGG(geneList, organism) | spia(de, all, organism)Scope: KEGG-specific enrichment across all three generations - membership ORA (enrichKEGG/enrichMKEGG), ranked GSEA (gseKEGG), and signed-topology perturbation (SPIA/graphite). KEGG ID mapping (organism codes, keyType, bitr_kegg, prokaryotic locus tags, KO routing), reproducibility/pinning, and pathview map overlay live here. The hypergeometric test and the universe problem -> go-enrichment. The GSEA running-sum engine and ranking-metric choice -> gsea. Reactome/WikiPathways gene sets -> reactome-pathways, wikipathways. Generic dot/cnet/emap plots -> enrichment-visualization. The DE list and fold-changes -> differential-expression/de-results.
Two consequences follow, and both are invisible until someone reruns the analysis.
The query is live, so the result is irreproducible unless the release is pinned. enrichKEGG/gseKEGG/SPIA hit the KEGG REST API at call time; KEGG adds maps, re-annotates genes, and revises edges continuously, so identical code returns a different pathway list next quarter. The fix is a gson snapshot: gson_KEGG('hsa') downloads the current KEGG pathway/module sets into a GSON object, write.gson()/read.gson() persist it, and the generic enricher(gene, gson=k) / GSEA(geneList, gson=k) run frozen and offline against it. Record the access date. use_internal_data=TRUE is NOT this fix - it silently reaches for the deprecated 2012 KEGG.db, which is the wrong, stale snapshot.
KEGG is the only mainstream database shipping signed, directed signaling topology (KGML), which is why this skill owns the third generation of pathway analysis. ORA and GSEA treat a pathway as an unordered bag of exchangeable genes; SPIA asks a question they structurally cannot pose - given where each gene sits in the wiring and the sign of every edge, how perturbed is this pathway? That requires the topology only KEGG (and a few others via graphite) provides. The discipline: choose the generation by the question (membership? rank? signed perturbation?), match keyType/organism to the actual IDs (locus tags for bacteria, KO for non-model), set the universe to the genes that could have been called DE, and pin the release before publishing.
| Method | Generation | Engine | Uses log2FC? | Uses topology/direction? | Suitable KEGG maps | Citation |
|---|---|---|---|---|---|---|
| enrichKEGG (ORA) | 1st (over-representation) | hypergeometric | no (gene list) | no | all | Wu 2021 The Innovation 2:100141; Kanehisa & Goto 2000 Nucleic Acids Res 28:27 |
| enrichMKEGG (ORA on modules) | 1st | hypergeometric | no | no | modules (M-numbers) | Wu 2021 The Innovation 2:100141 |
| gseKEGG (GSEA) | 2nd (functional class scoring) | fgsea running sum | yes (ranking) | no | all (as sets) | Wu 2021 The Innovation 2:100141; engine -> gsea |
| SPIA | 3rd (pathway topology) | pNDE (ORA) x pPERT (perturbation) -> pG | yes (named log2FC) | YES (signed KGML) | SIGNALING only | Tarca 2009 Bioinformatics 25:75; Draghici 2007 Genome Res 17:1537 |
| graphite + runSPIA | 3rd | SPIA over harmonized graphs | yes | YES | signaling (KEGG/Reactome) | Sales 2012 BMC Bioinformatics 13:20 |
The three-generations framing (ORA -> FCS -> pathway topology) is Khatri 2012 PLoS Comput Biol 8:e1002375; this skill is the KEGG instantiation of all three (the category README compares the generations across databases).
| Scenario | Recommended | Why |
|---|---|---|
| Pre-selected gene list, "which KEGG pathways" | enrichKEGG (ORA), set the universe | no ranking available; membership test |
| All genes carry a DE statistic, no clear cutoff | gseKEGG -> gsea | uses the full ranking; no arbitrary cutoff |
| Want WHERE in a broad pathway the signal sits | enrichMKEGG (modules) | M-numbers are tighter functional units |
| Have named log2FC + want signed perturbation on a SIGNALING map | SPIA (or graphite + runSPIA) | propagates fold-changes through the wiring; uses direction |
| Metabolic-pathway question (glycolysis, TCA) | enrichKEGG / gseKEGG | metabolic maps are compound-mediated; SPIA is undefined there |
| Human / mouse / model eukaryote | bitr -> Entrez, keyType='ncbi-geneid' | KEGG gene ID == Entrez for these organisms |
| Bacterial / prokaryotic data | locus tags, keyType='kegg', NO OrgDb/bitr | bacterial KEGG IDs ARE locus tags; no org.*.eg.db exists |
| Non-model organism with no KEGG genome | map to KO, organism='ko' | the universal escape hatch into KEGG pathway space |
| Result must be reproducible / published | gson_KEGG snapshot + enricher/GSEA, record date | live unpinned queries drift; use_internal_data pins the WRONG 2012 db |
| Multiple conditions to compare side by side | compareCluster(fun='enrichKEGG') | one model, faceted dotplot; never compare raw p-values |
| Overlay per-gene data on the KEGG map image | pathview -> render | a KEGG-specific operation; generic plots -> enrichment-visualization |
| The DE list / fold-changes themselves | -> differential-expression/de-results | upstream, not enrichment |
Goal: Get the query genes and the universe into the exact ID type KEGG expects for the organism, because every KEGG failure is a join failure.
Approach: For model eukaryotes convert SYMBOL/ENSEMBL to Entrez (KEGG's gene ID for hsa/mmu/rno) and pass keyType='ncbi-geneid'. For prokaryotes pass locus tags directly with keyType='kegg' and no OrgDb. Convert the universe the same way. Passing ENSEMBL/SYMBOL to enrichKEGG returns zero hits silently.
library(clusterProfiler)
library(org.Hs.eg.db)
de <- read.csv('de_results.csv') # DE list source -> differential-expression/de-results
sig_symbols <- de$gene[de$padj < 0.05 & abs(de$log2FoldChange) > 1] # padj is the DESeq2 adjusted-p column
sig_entrez <- bitr(sig_symbols, fromType='SYMBOL', toType='ENTREZID', OrgDb=org.Hs.eg.db)$ENTREZID
# universe = genes that COULD have been called DE (non-NA test statistic), same ID type
universe <- bitr(de$gene[!is.na(de$pvalue)], fromType='SYMBOL', toType='ENTREZID', OrgDb=org.Hs.eg.db)$ENTREZID
bitr_kegg(geneID, fromType, toType, organism) converts among KEGG's own ID flavors ('kegg', 'ncbi-geneid', 'ncbi-proteinid', 'uniprot') via the REST conv endpoint - use it when starting from UniProt or NCBI protein IDs. Check KEGG coverage of an organism with search_kegg_organism('Pseudomonas aeruginosa', by='scientific_name').
Goal: Find KEGG pathways (or modules) over-represented among the query genes relative to the measured universe.
Approach: Run enrichKEGG with the correct organism code, keyType, and an explicit universe; enrichKEGG has no readable argument, so translate the geneID column to symbols afterward with setReadable (eukaryotes only).
kk <- enrichKEGG(gene=sig_entrez, organism='hsa', keyType='ncbi-geneid',
universe=universe, pvalueCutoff=0.05, pAdjustMethod='BH',
minGSSize=10, maxGSSize=500, qvalueCutoff=0.2)
kk <- setReadable(kk, OrgDb=org.Hs.eg.db, keyType='ENTREZID') # eukaryotes only; no OrgDb -> keep raw IDs
head(as.data.frame(kk)) # ID, Description, GeneRatio, BgRatio, pvalue, p.adjust, qvalue, geneID, Count
mkk <- enrichMKEGG(gene=sig_entrez, organism='hsa', keyType='ncbi-geneid', universe=universe) # KEGG MODULES (M-numbers)
Report p.adjust/qvalue, not raw pvalue. Fold enrichment = GeneRatio / BgRatio. enrichMKEGG tests smaller, sparser sets: higher resolution (which sub-process is hit) but lower power and many genes belong to no module.
Goal: Find KEGG sets whose genes shift coordinately across the full ranking, with no cutoff.
Approach: Build a named numeric vector sorted DECREASING by the ranking metric, fix the seed (gseKEGG defaults seed=FALSE), then run gseKEGG. The running-sum engine and the ranking-metric choice are owned by gsea; only the KEGG arguments (organism, keyType) are KEGG-specific.
geneList <- de$log2FoldChange; names(geneList) <- de$entrez # names = Entrez IDs
geneList <- sort(geneList[!is.na(geneList)], decreasing=TRUE)
set.seed(123) # gseKEGG seed=FALSE by default; fix it so permutation p-values are reproducible
kk2 <- gseKEGG(geneList=geneList, organism='hsa', keyType='ncbi-geneid', minGSSize=10, maxGSSize=500, pvalueCutoff=0.05)
Goal: Score how perturbed each SIGNALING pathway is given both the over-representation of DE genes and the propagation of their fold-changes through the signed wiring.
Approach: SPIA combines pNDE (the classical over-representation evidence) with pPERT (the probability of the observed total accumulated perturbation tA, computed by propagating log2 fold-changes through KGML activation/inhibition edges) into a single global pG, then FDR-corrects it. It needs a NAMED vector of DE fold-changes plus the universe, and is defined only for signaling maps. graphite is the modern route: it harmonizes node IDs, resolves complexes/families, removes compounds, and can run SPIA over Reactome topology too.
library(SPIA)
sig <- de[de$padj < 0.05, ] # DE genes only
map <- bitr(sig$gene, 'SYMBOL', 'ENTREZID', org.Hs.eg.db) # bitr drops/many-to-one: MERGE, never assign as names
de_vec <- setNames(sig$log2FoldChange[match(map$SYMBOL, sig$gene)], map$ENTREZID)
de_vec <- de_vec[!duplicated(names(de_vec))]
res <- spia(de=de_vec, all=universe, organism='hsa', nB=2000, plots=FALSE) # nB=2000 bootstraps for pPERT
# output cols: Name, ID, pSize, NDE, pNDE, tA, pPERT, pG, pGFdr, pGFWER, Status, KEGGLINK
# Status reports inferred Activated / Inhibited from the sign of tA
# graphite route (decouples from KEGG's bundled data; works on Reactome too)
library(graphite)
db <- pathways('hsapiens', 'kegg')
db <- convertIdentifiers(db, 'ENTREZID')
prepareSPIA(db, 'kegg_hsa_spia') # writes the pathway dataset file
gr <- runSPIA(de=de_vec, all=universe, 'kegg_hsa_spia')
SPIA aborts if more than ~1% of the DE IDs are absent from all, so build the universe from the same ID space. The standalone SPIA package also ships a frozen hsaSPIA data object that is an OLDER snapshot than a live enrichKEGG query - do not mix the two in one comparison.
Goal: Freeze the KEGG data a result depends on so the analysis is reproducible and runs offline.
Approach: Snapshot the current KEGG sets into a GSON object, persist it, and run enrichment against the snapshot with the generic enricher/GSEA (which accept a gson argument); record the access date. Do NOT use use_internal_data=TRUE for this.
library(gson) # GSON class + write.gson/read.gson
k <- gson_KEGG('hsa') # gson_KEGG is exported by clusterProfiler; downloads current KEGG sets
k@accessed_date <- as.character(Sys.Date()) # the accessed_date slot survives write/read; a base attr() does not
write.gson(k, file.path(tempdir(), 'kegg_hsa.gson'))
k <- read.gson(file.path(tempdir(), 'kegg_hsa.gson'))
kk_pinned <- enricher(sig_entrez, gson=k, universe=universe) # frozen ORA, offline, reproducible
gsea_pinned <- GSEA(geneList, gson=k) # frozen GSEA against the snapshot
Goal: See shared and condition-specific KEGG pathways across groups in one faceted figure.
Approach: Pass named gene lists to compareCluster with fun='enrichKEGG'; it fits one model and produces a faceted dotplot. Compare pathway-ID SETS across conditions, never raw p-values (they depend on sample size, DE gene count, and the KEGG release).
clusters <- list(up=up_entrez, down=down_entrez)
ck <- compareCluster(geneClusters=clusters, fun='enrichKEGG', organism='hsa', keyType='ncbi-geneid')
ck <- setReadable(ck, OrgDb=org.Hs.eg.db, keyType='ENTREZID')
# dotplot(ck) -> enrichment-visualization for the plot grammar
pathview downloads a KEGG pathway's KGML and image, joins per-gene values to the nodes, and writes a colored map PNG/PDF (a KEGG-specific operation owned here; generic dot/cnet/emap plots route to enrichment-visualization). It writes files to the working directory and queries KEGG live.
library(pathview)
vals <- setNames(de$log2FoldChange, de$entrez)
pathview(gene.data=vals, pathway.id='hsa04110', species='hsa', gene.idtype='entrez') # writes hsa04110.pathview.png
Trigger: feeding OrgDb-style ENSEMBL or SYMBOL IDs to enrichKEGG/gseKEGG. Mechanism: KEGG's keyType is 'kegg'/'ncbi-geneid'/'ncbi-proteinid'/'uniprot', not an OrgDb keytype, so no IDs join. Symptom: zero enriched pathways, no error. Fix: bitr to Entrez and set keyType='ncbi-geneid' (eukaryotes), or pass locus tags with keyType='kegg' (prokaryotes).
Trigger: reporting an enrichKEGG/gseKEGG/SPIA result without pinning the release. Mechanism: the REST query returns the CURRENT KEGG, which changes over time. Symptom: a rerun months later yields a different pathway list. Fix: snapshot with gson_KEGG, run enricher/GSEA against the gson, and record the access date.
Trigger: setting use_internal_data=TRUE for reproducibility. Mechanism: it loads the deprecated 2012 KEGG.db, not a current pin (and may simply fail). Symptom: stale or absent pathways unlike the live result. Fix: use a gson snapshot instead; treat KEGG.db as legacy-only.
Trigger: running SPIA/graphite topology on glycolysis or other metabolic maps. Mechanism: metabolic maps are compound-mediated and give no clean signed gene->gene graph. Symptom: meaningless perturbation scores. Fix: restrict SPIA to signaling maps; use enrichKEGG/gseKEGG for metabolism.
Trigger: omitting universe. Mechanism: the default background is all KEGG-annotated genes, biased toward well-studied, metabolically central genes. Symptom: inflated significance for pathways enriched in measured/expressed genes (the tissue-specificity artifact). Fix: set universe to the genes that could have been called DE, in the same ID type.
Trigger: locus tags from a re-annotated genome or a different strain than KEGG's reference. Mechanism: the gene-ID join is exact; drifted locus tags do not match KEGG's pae/eco genome. Symptom: many genes silently dropped, weak or empty enrichment. Fix: confirm the organism code and reference genome with search_kegg_organism; align locus tags to KEGG's annotation, or route through KO.
Trigger: running bitr() or setReadable() on a prokaryote. Mechanism: no org.*.eg.db exists for most bacteria and there is no Entrez==KEGG identity. Symptom: bitr fails or empties the gene list; setReadable errors. Fix: pass locus tags directly with keyType='kegg'; keep raw IDs (no setReadable).
| Threshold | Source | Rationale |
|---|---|---|
| pvalueCutoff=0.05 | enrichKEGG/gseKEGG default | filters on p.adjust by default; standard FDR gate |
| qvalueCutoff=0.2 | clusterProfiler default | secondary q-value gate on enrichResult |
| pAdjustMethod='BH' | clusterProfiler default | Benjamini-Hochberg FDR; less conservative than Bonferroni for discovery |
| minGSSize=10 | enrichKEGG default | drop tiny sets that overfit and give unstable p-values |
| maxGSSize=500 | enrichKEGG default | drop very broad sets that always 'enrich' |
| nB=2000 | SPIA default | bootstrap replicates for the pPERT null; raise for stable small p-values |
SPIA aborts if >1% of DE IDs absent from all |
Tarca 2009 Bioinformatics 25:75 | the perturbation null requires the DE genes live in the universe |
| set.seed before gseKEGG/SPIA | reproducibility | gseKEGG seed=FALSE and SPIA bootstrap are stochastic; fix the seed |
| ID-conversion loss > ~15% | practice heuristic | report the bitr conversion rate; heavy loss makes the result unreliable |
| Error / symptom | Cause | Solution |
|---|---|---|
| enrichKEGG returns 0 pathways | ENSEMBL/SYMBOL passed, or wrong organism code, or KEGG API unreachable | bitr to Entrez + keyType='ncbi-geneid'; verify code with search_kegg_organism; check network |
setReadable errors |
no OrgDb for the organism (prokaryote) | skip setReadable; keep raw KEGG IDs |
gson= rejected by enrichKEGG |
enrichKEGG/gseKEGG have no gson argument | pass the gson to the generic enricher()/GSEA() instead |
| Different pathways on rerun | live KEGG changed between runs | pin with a gson snapshot and record the access date |
| SPIA: "more than 1% of de IDs not in all" | DE IDs not a subset of the universe | build de and all from the same ID space |
| SPIA gives nonsense on glycolysis | topology on a metabolic map | use enrichKEGG/gseKEGG; SPIA is signaling-only |
| Bacterial list gives 0 hits | Entrez/bitr forced onto a prokaryote | pass locus tags with keyType='kegg', no OrgDb |