MS-DIAL-based metabolomics preprocessing as an alternative to XCMS. Covers peak detection, alignment, annotation, and export for downstream analysis...
Reference examples tested with: MS-DIAL 5.x (LC-MS) / MS-DIAL 4.x (GC-MS), pandas 2.2+, R 4.3+
Before using code patterns, verify installed versions match. If versions differ:
- MsdialConsoleApp with no arguments to print the current subcommand/flag list
- packageVersion('<pkg>') then ?function_name to verify parameters
- pip show pandas then help(module.function) to check signatures

The MS-DIAL GUI runs only on Windows; the console (MsdialConsoleApp) is the cross-platform headless entry. Which build supports a task is itself a constraint: MS-DIAL 5-alpha covers DI-MS, IM-MS, LC-MS, LC-IM-MS but NOT GC-MS - GC-EI stays in the MS-DIAL 4 lineage. If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt rather than retrying.
"Process my LC-MS run with MS-DIAL and give me a feature table" -> Pick peaks per file, deconvolve chimeric MS/MS into clean component spectra (MS2Dec), align across samples, gap-fill, then import the alignment result and filter it honestly.
- MsdialConsoleApp lcmsdda|lcmsdia|gcms -i <in> -o <out> -m <param.txt>
- read.csv(..., skip = 4, check.names = FALSE) to parse the alignment export
- pandas.read_csv(..., skiprows=4) for the same export

The same raw files through MS-DIAL versus XCMS yield different feature tables and different marker lists. Li 2018 benchmarked five tools on a 1,100-compound standard and found that while feature detection was broadly similar, quantification and the set of selected discriminating markers differed by tool. A metabolomics "hit" is conditional on (raw data + software + version + every parameter + fill/filter order), not on the raw files alone. MS-DIAL's specific differentiator is MS2Dec deconvolution: it reconstructs clean, library-matchable MS/MS spectra from chimeric DDA/DIA fragment data, which is what makes wide-window DIA (SWATH) tractable at all. Report the full processing specification as part of the result, and treat a finding that survives only one pipeline as a candidate, not a result.
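One way to make the processing specification travel with the result is to write a small sidecar record next to the alignment export. The sketch below is only an illustration: the `processing_record` and `write_record` helpers and every field name are hypothetical, not an MS-DIAL format, and the version string has to be supplied by hand because the sketch does not query the tool.

```python
import hashlib
import json
from pathlib import Path


def processing_record(param_file, software, version, steps):
    """Return a dict describing how a feature table was produced.

    Hypothetical helper: the field names are illustrative only.
    """
    param_bytes = Path(param_file).read_bytes()
    return {
        "software": software,
        "version": version,  # supplied by the caller, not detected
        "param_file": Path(param_file).name,
        # Hash of the exact parameter file, so a silent edit is detectable.
        "param_sha256": hashlib.sha256(param_bytes).hexdigest(),
        # Kept as an ordered list: fill/filter order is part of the result.
        "steps": list(steps),
    }


def write_record(record, out_path):
    """Write the record as sorted, indented JSON so diffs stay readable."""
    Path(out_path).write_text(json.dumps(record, indent=2, sort_keys=True))
```

Storing a hash instead of only the file name means two runs that claim the same parameter file can be checked for actually using the same parameters.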
| Axis | MS-DIAL | XCMS |
|---|---|---|
| Interface | Windows GUI + cross-platform console | R package (scriptable everywhere) |
| Core differentiator | MS2Dec MS/MS deconvolution (DDA + DIA) | centWave peak picking, full programmatic control |
| Annotation | Built-in (library + MS-FINDER + LipidBlast) | Separate (CAMERA, downstream tools) |
| Lipidomics | Strong (predicted-CCS / EAD structural elucidation in v5) | Manual |
| Reproducibility unit | Param file + GUI choices | Versioned R script |
| Best when | DIA data, lipidomics, GUI workflow, built-in IDs | Scripted pipelines, custom parameters, cohort scale |
Use MS-DIAL when DIA deconvolution or built-in lipid annotation is the point; use metabolomics/xcms-preprocessing for fully scripted, version-pinned cohort processing. The strongest untargeted claims replicate across both.
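Checking whether a feature replicates across both tools comes down to matching on m/z and retention time within tolerances. A minimal sketch, assuming each feature is reduced to an (m/z, RT in minutes) pair: the `match_features` name and the 10 ppm / 0.2 min defaults are illustrative assumptions, not recommendations from either tool.

```python
def ppm_error(mz_a, mz_b):
    """Signed mass difference of mz_a relative to mz_b, in parts per million."""
    return (mz_a - mz_b) / mz_b * 1e6


def match_features(msdial_feats, xcms_feats, ppm_tol=10.0, rt_tol=0.2):
    """Pair features from two tables that agree in m/z (ppm) and RT (minutes).

    Each input is a list of (mz, rt) tuples. Returns (i, j) index pairs. A
    feature may match more than one partner, which is itself worth inspecting.
    """
    pairs = []
    for i, (mz_a, rt_a) in enumerate(msdial_feats):
        for j, (mz_b, rt_b) in enumerate(xcms_feats):
            mass_ok = abs(ppm_error(mz_a, mz_b)) <= ppm_tol
            rt_ok = abs(rt_a - rt_b) <= rt_tol
            if mass_ok and rt_ok:
                pairs.append((i, j))
    return pairs
```

The all-pairs loop is quadratic and fine for a shortlist of candidate markers; a full feature table would call for sorting by m/z first.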
| Situation | Do | Why |
|---|---|---|
| LC-MS, top-N MS/MS (DDA) | lcmsdda console / GUI LC-MS DDA | Cleaner per-precursor MS2, but intensity-biased, stochastic coverage |
| LC-MS, wide-window MS/MS (DIA / SWATH) | lcmsdia (ABF input only) | Complete MS2 coverage; chimeric spectra REQUIRE MS2Dec to be usable |
| GC-EI run | gcms (MS-DIAL 4 build), or AMDIS/eRah | EI fragments every co-eluting compound; deconvolution IS detection (see below) |
| Headless / Linux cluster | MsdialConsoleApp with a -m param file | GUI is Windows-only; console is the reproducible batch path |
| Lipid-focused study | MS-DIAL + LipidBlast | -> metabolomics/lipidomics for lipid annotation mode |
| Already have an alignment CSV | skip processing, parse + filter | See import + honest-filter sections below |
In GC-EI, 70 eV ionization fragments every compound reproducibly, so the trace at any retention time is a superposition of fragments from several co-eluting molecules. Naive peak picking conflates them; deconvolution into component spectra IS the feature-detection step, then each component is matched against EI+RI libraries (NIST, FiehnLib). Cross-run/cross-lab alignment uses retention index (Kovats n-alkanes, or Fiehn FAME markers giving diagnostic m/z 74/87) rather than raw RT, because RT drifts with column aging. MS-DIAL 5-alpha explicitly excludes GC-MS; use the gcms token in a MS-DIAL 4 build, or AMDIS/eRah, for GC-EI work.
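The retention-index idea can be sketched as interpolation between the two n-alkanes that bracket a peak. The function below uses the linear form commonly applied to temperature-programmed runs (van den Dool and Kratz); the isothermal Kovats definition uses logarithms of adjusted retention times instead. The function name and the example ladder are illustrative assumptions, and this is not how MS-DIAL itself is implemented.

```python
def linear_retention_index(rt, alkanes):
    """Linear (temperature-programmed) retention index.

    `alkanes` maps carbon number -> retention time for the n-alkane ladder.
    The peak must elute between two ladder members; no extrapolation is done.
    """
    ladder = sorted(alkanes.items(), key=lambda kv: kv[1])
    for (n_lo, t_lo), (n_hi, t_hi) in zip(ladder, ladder[1:]):
        if t_lo <= rt <= t_hi:
            fraction = (rt - t_lo) / (t_hi - t_lo)
            return 100.0 * (n_lo + (n_hi - n_lo) * fraction)
    raise ValueError("retention time falls outside the alkane ladder")
```

Because the index is anchored to the ladder measured in the same run, a column that ages and shifts every retention time shifts the ladder too, and the index stays comparable.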
Goal: Process a folder of converted spectra into an alignment table without the GUI.
Approach: Pick the analysis-type token, point -i/-o/-m at input dir, output dir, and a method (parameter) file; keep -p only if the project should reopen in the GUI.
# DDA LC-MS: accepts netCDF/mzML/ABF. Output is *.msdial in the output dir.
MsdialConsoleApp lcmsdda -i ./LCMS_DDA/ -o ./LCMS_DDA_out/ -m ./Msdial-lcms-dda-Param.txt
# DIA/SWATH LC-MS: accepts ABF ONLY (convert vendor raw -> ABF first). MS2Dec is the point.
MsdialConsoleApp lcmsdia -i ./LCMS_DIA/ -o ./LCMS_DIA_out/ -m ./Msdial-lcms-dia-Param.txt
# GC-EI (MS-DIAL 4 build): retention-index alignment, quant-mass quantification.
MsdialConsoleApp gcms -i ./GCMS/ -o ./GCMS_out/ -m ./Msdial-GCMS-Param.txt -p
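For batch use, the same invocations can be assembled and launched from Python. This is a sketch that uses only the tokens and flags shown above (-i, -o, -m, -p); the helper names are hypothetical, and it assumes MsdialConsoleApp is on the PATH under that name.

```python
import subprocess


def build_msdial_command(mode, in_dir, out_dir, param_file,
                         keep_project=False, exe="MsdialConsoleApp"):
    """Assemble the console argument list for one analysis-type token."""
    if mode not in {"lcmsdda", "lcmsdia", "gcms"}:
        raise ValueError(f"unknown analysis-type token: {mode}")
    cmd = [exe, mode, "-i", str(in_dir), "-o", str(out_dir), "-m", str(param_file)]
    if keep_project:
        cmd.append("-p")  # keep the project so it can be reopened in the GUI
    return cmd


def run_msdial(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing a list instead of a shell string avoids quoting problems when directory names contain spaces.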
The parameter file is plain text (one Key=Value per line). The Minimum peak height key is the direct analog of an intensity floor and is instrument-dependent: the GUI default is tuned for a TOF and is often far too high (or its baseline assumption wrong) for an Orbitrap. Set the alignment reference to a pooled QC, never to file #1 by default.
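Because the parameter file is plain text, it can be read back and checked before a run. A minimal sketch, assuming one key-value pair per line: it splits on the first `=` as described above and also accepts `:`, since the separator may differ between builds (an assumption to verify against the example file shipped with the installed version). Treating `#` lines as comments is likewise an assumption, and "Mass slice width" in the usage below is just an example key.

```python
import re


def read_param_file(text):
    """Parse 'Key=Value' (or 'Key: Value') lines into a dict of strings.

    Blank lines and lines starting with '#' are skipped. Only the first
    separator splits the line, so values may contain '=' or ':' themselves.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = re.split(r"\s*[=:]\s*", line, maxsplit=1)
        if len(parts) == 2:
            params[parts[0]] = parts[1]
    return params


def check_peak_height(params, key="Minimum peak height"):
    """Return the intensity floor as a float, or raise if the key is absent."""
    if key not in params:
        raise KeyError(f"{key!r} not found in the parameter file")
    return float(params[key])
```

For example, `check_peak_height(read_param_file(open(path).read()))` makes the intensity floor an explicit, logged number instead of an unexamined default.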
Goal: Split the MS-DIAL alignment export into a feature-metadata frame and an intensity matrix.
Approach: The export carries four header rows above the real column header (sample class / file type / injection order / batch), so skip them; metadata columns precede the per-sample Area columns.
# MS-DIAL alignment export: real column header is on row 5, so skip the first 4 rows.
msdial <- read.csv('AlignResult.txt', sep = '\t', skip = 4, check.names = FALSE)
# Metadata columns appear before the per-sample intensity columns. Common ones:
# 'Alignment ID', 'Average Rt(min)', 'Average Mz', 'Metabolite name', 'Adduct type',
# 'Fill %', 'MS/MS assigned', 'Reference RT', 'Formula', 'Ontology', 'INCHIKEY',
# 'SMILES', 'Annotation tag (VS1.0)'. Sample columns are everything after these.
meta_cols <- c('Alignment ID', 'Average Rt(min)', 'Average Mz', 'Metabolite name',
'Adduct type', 'Fill %', 'MS/MS assigned', 'Annotation tag (VS1.0)')
meta_cols <- intersect(meta_cols, colnames(msdial))
sample_cols <- setdiff(colnames(msdial), colnames(msdial)[seq_len(max(match(meta_cols, colnames(msdial))))])
feature_info <- msdial[, meta_cols]
intensity <- as.matrix(msdial[, sample_cols])
rownames(intensity) <- msdial[['Alignment ID']]
Goal: Same split, in pandas.
Approach: skiprows=4 to land on the real header; slice metadata vs sample columns by position after the last known metadata column.
import pandas as pd
msdial = pd.read_csv('AlignResult.txt', sep='\t', skiprows=4)
meta_cols = ['Alignment ID', 'Average Rt(min)', 'Average Mz', 'Metabolite name', 'Adduct type', 'Fill %', 'MS/MS assigned', 'Annotation tag (VS1.0)']
meta_cols = [c for c in meta_cols if c in msdial.columns]
last_meta = max(msdial.columns.get_loc(c) for c in meta_cols)
sample_cols = msdial.columns[last_meta + 1:]
feature_info = msdial[meta_cols].copy()
# Per-sample intensities, indexed by Alignment ID.
intensity = msdial[sample_cols].copy()
intensity.index = msdial['Alignment ID']
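The positional split assumes that everything after the last recognised metadata column is a sample. If a build writes metadata columns the list does not know about after that point, they end up in the intensity block. The sketch below runs the same split on a synthetic stand-in for the export (made-up values, sample names, and header-row layout) and adds a dtype guard that catches text-valued columns.

```python
import io

import pandas as pd

# Synthetic stand-in: four header rows, then the real column header.
export = (
    "\t\t\tClass\tQC\tSample\n"
    "\t\t\tFile type\tQC\tSample\n"
    "\t\t\tInjection order\t1\t2\n"
    "\t\t\tBatch ID\t1\t1\n"
    "Alignment ID\tAverage Mz\tFill %\tMS/MS assigned\tqc_01\tsample_01\n"
    "0\t180.0634\t1\tTrue\t1500\t1720\n"
    "1\t300.1000\t0.5\tFalse\t0\t410\n"
)
msdial = pd.read_csv(io.StringIO(export), sep="\t", skiprows=4)

known_meta = ["Alignment ID", "Average Mz", "Fill %", "MS/MS assigned"]
meta_cols = [c for c in known_meta if c in msdial.columns]
last_meta = max(msdial.columns.get_loc(c) for c in meta_cols)
sample_cols = list(msdial.columns[last_meta + 1:])

# Guard: a non-numeric column here is metadata that slipped past the split.
non_numeric = [c for c in sample_cols
               if not pd.api.types.is_numeric_dtype(msdial[c])]
if non_numeric:
    raise ValueError(f"non-sample columns in intensity block: {non_numeric}")

intensity = msdial[sample_cols].set_index(msdial["Alignment ID"])
```

The guard does not catch numeric metadata (scores, for instance), so printing `sample_cols` once per new export version remains worthwhile.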
Goal: Keep features supported by real signal and known confidence, without overtrusting annotation tags.
Approach: Filter on Fill% (cross-sample presence), require MS/MS support for any feature called identified, and tie the annotation tag to a real MSI confidence level rather than treating a name as proof.
# Fill% is the fraction of samples with a DETECTED (not gap-filled) peak. Low Fill% means
# the feature exists mostly as gap-filled noise-floor integrals, which fabricate intensity
# (an honest 'below detection' becomes a positive number). 70% is a common floor.
# Check range(feature_info[['Fill %']]) first: if the export stores a 0-1 fraction, the floor is 0.7.
keep_fill <- feature_info[['Fill %']] >= 70
# An annotated name without MS/MS rests on accurate mass alone, so it is at best an MSI
# Level 3 putative ID. Require 'MS/MS assigned' to be TRUE before trusting any identity downstream.
# toupper() covers the column arriving as logical TRUE or as the text 'True'.
has_msms <- toupper(as.character(feature_info[['MS/MS assigned']])) == 'TRUE'
# Annotation tag confidence (do NOT treat a name as an identification). The exact tag
# vocabulary is MS-DIAL-version-dependent, so inspect unique(feature_info[['Annotation tag (VS1.0)']])
# and map the strings the build actually emits rather than hard-coding them:
# Metabolite / Lipid with MS/MS -> MSI Level 2 (spectral library match)
# Suggested* mass-only -> MSI Level 3 (putative, no MS/MS)
# Unknown -> unannotated feature
feature_info$msi_level <- ifelse(feature_info[['Annotation tag (VS1.0)']] %in% c('Metabolite', 'Lipid') & has_msms, 2,
ifelse(grepl('^Suggested', feature_info[['Annotation tag (VS1.0)']]), 3, NA))
filtered <- intensity[keep_fill, ]
Confidence-level honesty and orthogonal-evidence identification belong to metabolomics/metabolite-annotation; this skill only routes the tag to the right level. Fill% / blank / drift filtering interacts with normalization-qc - process blanks and pooled QCs through the SAME run, then filter the aligned table.
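The same Fill% filter and tag-to-level routing can be sketched in pandas. The table below is synthetic, the tag strings are placeholders for whatever the installed build emits, and the helper names are hypothetical. The scale detection in `fill_floor` (treating a column whose maximum is at most 1 as a 0-1 fraction) is an assumption added here as a guard, not documented MS-DIAL behavior.

```python
import pandas as pd

feature_info = pd.DataFrame({
    "Alignment ID": [0, 1, 2, 3],
    "Fill %": [1.0, 0.9, 0.3, 0.8],  # made-up values on a 0-1 scale
    "MS/MS assigned": [True, False, True, False],
    "Annotation tag (VS1.0)": ["Metabolite", "Suggested (mass only)",
                               "Lipid", "Unknown"],  # placeholder strings
})


def fill_floor(fill, percent=70.0):
    """Boolean mask for Fill% >= floor, for a 0-1 or a 0-100 column."""
    scale = 1.0 if fill.max() <= 1.0 else 100.0
    return fill >= percent / 100.0 * scale


def msi_level(tag, has_msms):
    """Map a tag plus MS/MS support to an MSI level, or None if neither rule applies."""
    if tag in {"Metabolite", "Lipid"} and has_msms:
        return 2
    if tag.startswith("Suggested"):
        return 3
    return None


keep_fill = fill_floor(feature_info["Fill %"])
feature_info["msi_level"] = [
    msi_level(tag, has_msms)
    for tag, has_msms in zip(feature_info["Annotation tag (VS1.0)"],
                             feature_info["MS/MS assigned"])
]
kept = feature_info[keep_fill]
```

Features with no level come out as missing values in the `msi_level` column, which keeps "unannotated" distinct from any numeric level.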
- Processing DIA/SWATH data with lcmsdda. lcmsdda does not deconvolve wide-isolation chimeric MS/MS, so fragments from co-isolated precursors stay mixed. Use lcmsdia (ABF input only); MS2Dec deconvolution is the entire reason to run DIA in MS-DIAL.
- Filtering on Annotation tag != Unknown and calling the survivors "identified." A Suggested* tag is an accurate-mass guess with no MS/MS; a named hit without MS/MS is MSI Level 3. Require MS/MS assigned == TRUE for any identity claim; map tags to MSI levels (see filtering section) and defer to metabolomics/metabolite-annotation.
- Running GC-EI data in MS-DIAL 5. Use the gcms token in a MS-DIAL 4 build (or AMDIS/eRah); align on Kovats/FAME retention index.

| Threshold | Source | Rationale |
|---|---|---|
| Fill% >= 70% | Common untargeted practice | Below this, the feature is mostly gap-filled noise-floor integrals, not measurements |
| QC CV (RSD) < 20-30% | Broadhurst 2018 | Technical reproducibility floor; drop features noisier than this in pooled QCs |
| D-ratio (sd_QC/sd_sample) < 0.5 | Broadhurst 2018 | Keeps features whose technical variance is well below biological variance |
| Blank filter: sample mean > 3-5x blank mean | Broadhurst 2018 | Removes background/contaminant features present in process blanks |
| ~10x more features than compounds | Mahieu 2017 | One metabolite makes adducts/isotopes/fragments; counting features over-counts hypotheses |
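The QC-based thresholds in the table reduce to three per-feature statistics. A sketch under assumptions: the intensity matrix has features as rows and injections as columns, the column groupings are supplied by the caller, and the function names and default cut-offs (30% RSD, D-ratio 0.5, 3x blank) are illustrative picks from the ranges listed above.

```python
import pandas as pd


def qc_metrics(intensity, qc_cols, sample_cols, blank_cols):
    """Per-feature QC RSD (%), D-ratio, and sample-to-blank mean ratio."""
    qc = intensity[qc_cols]
    samp = intensity[sample_cols]
    blank = intensity[blank_cols]
    out = pd.DataFrame(index=intensity.index)
    out["qc_rsd_pct"] = 100.0 * qc.std(axis=1) / qc.mean(axis=1)
    out["d_ratio"] = qc.std(axis=1) / samp.std(axis=1)
    # A blank mean of zero gives inf here, which passes the blank filter.
    out["blank_ratio"] = samp.mean(axis=1) / blank.mean(axis=1)
    return out


def passes(metrics, max_rsd=30.0, max_d_ratio=0.5, min_blank_ratio=3.0):
    """Boolean mask for features that clear all three thresholds."""
    return (
        (metrics["qc_rsd_pct"] < max_rsd)
        & (metrics["d_ratio"] < max_d_ratio)
        & (metrics["blank_ratio"] > min_blank_ratio)
    )
```

Standard deviations use the pandas default (sample SD, ddof=1); with very few QC injections these estimates are themselves noisy.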
| Error / symptom | Cause | Solution |
|---|---|---|
| All columns land in one field on import | Header offset wrong; tab-separated export read as CSV | skip=4 (R) / skiprows=4 (Python), set sep='\t' |
| lcmsdia rejects mzML input | DIA mode accepts ABF only | Convert vendor raw to ABF (Reifycs ABF converter) before lcmsdia |
| Annotation tag column not found | Header changes across versions (e.g. Annotation tag (VS1.0)) | Match by prefix / inspect colnames(); do not hard-code the suffix |
| No GC-MS option in MS-DIAL 5 | 5-alpha excludes GC-MS | Use a MS-DIAL 4 build's gcms token, or AMDIS/eRah |
| Console command not found on Linux | Expecting the GUI executable | The GUI is Windows-only; run MsdialConsoleApp (cross-platform) |
| Few features detected | Minimum peak height default too high for the instrument | Lower it toward the real baseline; defaults are TOF-tuned |
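Matching a header by prefix, as suggested for the annotation-tag column, takes only a few lines. The `find_column` helper is hypothetical; it raises instead of guessing when zero or several columns match.

```python
def find_column(columns, prefix):
    """Return the single column whose name starts with `prefix` (case-insensitive)."""
    wanted = prefix.lower()
    hits = [c for c in columns if c.strip().lower().startswith(wanted)]
    if len(hits) != 1:
        raise KeyError(f"expected one column starting with {prefix!r}, found {hits}")
    return hits[0]
```

With a pandas frame this would be used as `tag_col = find_column(msdial.columns, "Annotation tag")`, after which `msdial[tag_col]` works regardless of the version suffix.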