    gptomics/bio-metabolomics-msdial-preprocessing
    About

    MS-DIAL-based metabolomics preprocessing as alternative to XCMS. Covers peak detection, alignment, annotation, and export for downstream analysis...

    SKILL.md

    Version Compatibility

    Reference examples tested with: MS-DIAL 5.x (LC-MS) / MS-DIAL 4.x (GC-MS), pandas 2.2+, R 4.3+

    Before using code patterns, verify installed versions match. If versions differ:

    • CLI: run MsdialConsoleApp with no arguments to print the current subcommand/flag list
    • R: packageVersion('<pkg>') then ?function_name to verify parameters
    • Python: pip show pandas then help(module.function) to check signatures

    The MS-DIAL GUI runs only on Windows; the console (MsdialConsoleApp) is the cross-platform headless entry. Which build supports a task is itself a constraint: MS-DIAL 5-alpha covers DI-MS, IM-MS, LC-MS, LC-IM-MS but NOT GC-MS - GC-EI stays in the MS-DIAL 4 lineage. If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt rather than retrying.
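    The Python side of the version check can be scripted so it runs before any pattern is used. This is a minimal sketch, assuming only the pandas 2.2+ floor quoted above; `parse_version`, `meets_minimum`, and `installed_version` are hypothetical helper names, and the comparison looks at leading numeric parts only.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.2.1' or '2.2.0rc1' into a comparable tuple of leading integers."""
    parts = []
    for chunk in text.split("."):
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum (numeric parts only)."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package):
    """Return the installed version string, or None when the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pandas")
    if found is None:
        print("pandas is not installed")
    elif not meets_minimum(found, "2.2"):
        print(f"pandas {found} is older than the tested 2.2+; check signatures with help()")
    else:
        print(f"pandas {found} is within the tested range")
```

    A failed check is a prompt to introspect the installed package, not a reason to abort.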

    MS-DIAL Preprocessing

    "Process my LC-MS run with MS-DIAL and give me a feature table" -> Pick peaks per file, deconvolve chimeric MS/MS into clean component spectra (MS2Dec), align across samples, gap-fill, then import the alignment result and filter it honestly.

    • CLI: MsdialConsoleApp lcmsdda|lcmsdia|gcms -i <in> -o <out> -m <param.txt>
    • R: read.csv(..., skip = 4, check.names = FALSE) to parse the alignment export
    • Python: pandas.read_csv(..., skiprows=4) for the same export

    The Single Most Important Insight -- Preprocessing Software Is Not Neutral

    The same raw files through MS-DIAL versus XCMS yield different feature tables and different marker lists. Li 2018 benchmarked five tools on a 1,100-compound standard and found that while feature detection was broadly similar, quantification and the set of selected discriminating markers differed by tool. A metabolomics "hit" is conditional on (raw data + software + version + every parameter + fill/filter order), not on the raw files alone. MS-DIAL's specific differentiator is MS2Dec deconvolution: it reconstructs clean, library-matchable MS/MS spectra from chimeric DDA/DIA fragment data, which is what makes wide-window DIA (SWATH) tractable at all. Report the full processing specification as part of the result, and treat a finding that survives only one pipeline as a candidate, not a result.
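    One way to make the processing specification travel with the result is to write it out as a small record next to the feature table. The sketch below is illustrative only: `processing_spec` and its field names are hypothetical, and the version and parameter text are placeholders to be replaced with the values actually used.

```python
import hashlib
import json


def processing_spec(software, version, param_text, steps):
    """Bundle what a feature table depends on into one JSON-serializable record."""
    return {
        "software": software,
        "version": version,
        # Hash of the parameter file content, so any edited value changes the record.
        "param_sha256": hashlib.sha256(param_text.encode("utf-8")).hexdigest(),
        # Ordered list: the fill/filter order is part of the specification.
        "steps": list(steps),
    }


spec = processing_spec(
    software="MS-DIAL",
    version="5.x",  # placeholder; record the exact build that was run
    param_text="(contents of the method file go here)",  # placeholder
    steps=["peak picking", "MS2Dec", "alignment", "gap filling", "Fill% filter"],
)
print(json.dumps(spec, indent=2))
```

    Two feature tables with different `param_sha256` or `steps` values should not be compared as if they came from the same pipeline.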

    MS-DIAL vs XCMS

    Axis | MS-DIAL | XCMS
    Interface | Windows GUI + cross-platform console | R package (scriptable everywhere)
    Core differentiator | MS2Dec MS/MS deconvolution (DDA + DIA) | centWave peak picking, full programmatic control
    Annotation | Built-in (library + MS-FINDER + LipidBlast) | Separate (CAMERA, downstream tools)
    Lipidomics | Strong (predicted-CCS / EAD structural elucidation in v5) | Manual
    Reproducibility unit | Param file + GUI choices | Versioned R script
    Best when | DIA data, lipidomics, GUI workflow, built-in IDs | Scripted pipelines, custom parameters, cohort scale

    Use MS-DIAL when DIA deconvolution or built-in lipid annotation is the point; use metabolomics/xcms-preprocessing for fully scripted, version-pinned cohort processing. The strongest untargeted claims replicate across both.

    Decision Tree by Scenario

    Situation | Do | Why
    LC-MS, top-N MS/MS (DDA) | lcmsdda console / GUI LC-MS DDA | Cleaner per-precursor MS2, but intensity-biased, stochastic coverage
    LC-MS, wide-window MS/MS (DIA / SWATH) | lcmsdia (ABF input only) | Complete MS2 coverage; chimeric spectra REQUIRE MS2Dec to be usable
    GC-EI run | gcms (MS-DIAL 4 build), or AMDIS/eRah | EI fragments every co-eluting compound; deconvolution IS detection (see below)
    Headless / Linux cluster | MsdialConsoleApp with a -m param file | GUI is Windows-only; console is the reproducible batch path
    Lipid-focused study | MS-DIAL + LipidBlast | -> metabolomics/lipidomics for lipid annotation mode
    Already have an alignment CSV | skip processing, parse + filter | See import + honest-filter sections below

    Why GC-EI Is Different (and stays in MS-DIAL 4)

    In GC-EI, 70 eV ionization fragments every compound reproducibly, so the trace at any retention time is a superposition of fragments from several co-eluting molecules. Naive peak picking conflates them; deconvolution into component spectra IS the feature-detection step, after which each component is matched against EI+RI libraries (NIST, FiehnLib). Cross-run/cross-lab alignment uses retention index (Kovats n-alkanes, or Fiehn FAME markers giving diagnostic m/z 74/87) rather than raw RT, because RT drifts with column aging. MS-DIAL 5-alpha explicitly excludes GC-MS; use the gcms token in an MS-DIAL 4 build, or AMDIS/eRah, for GC-EI work.
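    To make the retention-index idea concrete, the sketch below converts a retention time into an index by interpolating between the two bracketing n-alkanes. It uses the linear (van den Dool and Kratz) form that suits temperature-programmed runs; the isothermal Kovats index uses log retention times instead. This is not necessarily how MS-DIAL computes RI internally. The function name and the ladder values are hypothetical.

```python
from bisect import bisect_right


def linear_retention_index(rt, ladder):
    """Linear retention index from an n-alkane ladder.

    ladder: list of (carbon_number, retention_time) pairs for the standards.
    Returns None when rt falls outside the ladder (no extrapolation).
    """
    if len(ladder) < 2:
        raise ValueError("need at least two alkane standards")
    ladder = sorted(ladder, key=lambda pair: pair[1])
    times = [t for _, t in ladder]
    if rt < times[0] or rt > times[-1]:
        return None
    # Index of the alkane eluting at or just before rt, clamped so a bracket exists.
    i = min(bisect_right(times, rt) - 1, len(ladder) - 2)
    (n_lo, t_lo), (n_hi, t_hi) = ladder[i], ladder[i + 1]
    return 100.0 * (n_lo + (n_hi - n_lo) * (rt - t_lo) / (t_hi - t_lo))


# Illustrative ladder: made-up retention times in minutes, not measured data.
alkanes = [(10, 5.0), (11, 6.0), (12, 7.5)]
print(linear_retention_index(5.5, alkanes))  # halfway between C10 and C11
print(linear_retention_index(7.5, alkanes))  # exactly on C12
```

    Because the index is anchored to the standards in each run, a uniform RT shift from column aging moves the ladder and the analyte together and leaves the index largely unchanged.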

    Run MS-DIAL Headless (console)

    Goal: Process a folder of converted spectra into an alignment table without the GUI.

    Approach: Pick the analysis-type token, point -i/-o/-m at input dir, output dir, and a method (parameter) file; keep -p only if the project should reopen in the GUI.

    # DDA LC-MS: accepts netCDF/mzML/ABF. Output is *.msdial in the output dir.
    MsdialConsoleApp lcmsdda -i ./LCMS_DDA/ -o ./LCMS_DDA_out/ -m ./Msdial-lcms-dda-Param.txt
    
    # DIA/SWATH LC-MS: accepts ABF ONLY (convert vendor raw -> ABF first). MS2Dec is the point.
    MsdialConsoleApp lcmsdia -i ./LCMS_DIA/ -o ./LCMS_DIA_out/ -m ./Msdial-lcms-dia-Param.txt
    
    # GC-EI (MS-DIAL 4 build): retention-index alignment, quant-mass quantification.
    MsdialConsoleApp gcms -i ./GCMS/ -o ./GCMS_out/ -m ./Msdial-GCMS-Param.txt -p
    

    The parameter file is plain text with one key-value pair per line. The Minimum peak height key acts as the intensity floor and is instrument-dependent: the GUI default is tuned for a TOF and is often a poor fit for an Orbitrap (far too high, or built on the wrong baseline assumption). Set the alignment reference to a pooled QC rather than leaving it on the first file by default.
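    For batch work the console call can be assembled from Python so the exact command is logged alongside the output. This is a sketch under stated assumptions: it uses only the tokens and flags that appear in the commands above, and `build_msdial_command` and `run_msdial` are hypothetical wrapper names, not part of MS-DIAL.

```python
import subprocess

ANALYSIS_TOKENS = {"lcmsdda", "lcmsdia", "gcms"}  # tokens named in this document


def build_msdial_command(token, in_dir, out_dir, method_file, keep_project=False,
                         executable="MsdialConsoleApp"):
    """Return the argument list for one console run; nothing is executed here."""
    if token not in ANALYSIS_TOKENS:
        raise ValueError(f"unknown analysis token: {token!r}")
    cmd = [executable, token, "-i", str(in_dir), "-o", str(out_dir), "-m", str(method_file)]
    if keep_project:
        cmd.append("-p")  # keep the project so it can be reopened in the GUI
    return cmd


def run_msdial(cmd):
    """Run the command, raising CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


cmd = build_msdial_command("lcmsdda", "./LCMS_DDA/", "./LCMS_DDA_out/",
                           "./Msdial-lcms-dda-Param.txt")
print(" ".join(cmd))
# run_msdial(cmd)  # uncomment where MsdialConsoleApp is on PATH
```

    Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.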

    Import the Alignment Result into R

    Goal: Split the MS-DIAL alignment export into a feature-metadata frame and an intensity matrix.

    Approach: The export carries four header rows above the real column header (sample class / file type / injection order / batch), so skip them; metadata columns precede the per-sample Area columns.

    # MS-DIAL alignment export: real column header is on row 5, so skip the first 4 rows.
    msdial <- read.csv('AlignResult.txt', sep = '\t', skip = 4, check.names = FALSE)
    
    # Metadata columns appear before the per-sample intensity columns. Common ones:
    # 'Alignment ID', 'Average Rt(min)', 'Average Mz', 'Metabolite name', 'Adduct type',
    # 'Fill %', 'MS/MS assigned', 'Reference RT', 'Formula', 'Ontology', 'INCHIKEY',
    # 'SMILES', 'Annotation tag (VS1.0)'. Sample columns are everything after these.
    meta_cols <- c('Alignment ID', 'Average Rt(min)', 'Average Mz', 'Metabolite name',
                   'Adduct type', 'Fill %', 'MS/MS assigned', 'Annotation tag (VS1.0)')
    meta_cols <- intersect(meta_cols, colnames(msdial))
    # Treat everything to the right of the last known metadata column as a sample column.
    # If the build emits further metadata columns after it, add them to meta_cols first,
    # otherwise they end up in the intensity matrix.
    last_meta <- max(match(meta_cols, colnames(msdial)))
    sample_cols <- colnames(msdial)[-seq_len(last_meta)]
    
    # drop = FALSE keeps a data frame / matrix even when only one column is selected.
    feature_info <- msdial[, meta_cols, drop = FALSE]
    intensity <- as.matrix(msdial[, sample_cols, drop = FALSE])
    rownames(intensity) <- msdial[['Alignment ID']]
    

    Import the Alignment Result into Python

    Goal: Same split, in pandas.

    Approach: skiprows=4 to land on the real header; slice metadata vs sample columns by position after the last known metadata column.

    import pandas as pd
    
    msdial = pd.read_csv('AlignResult.txt', sep='\t', skiprows=4)
    meta_cols = ['Alignment ID', 'Average Rt(min)', 'Average Mz', 'Metabolite name', 'Adduct type', 'Fill %', 'MS/MS assigned', 'Annotation tag (VS1.0)']
    meta_cols = [c for c in meta_cols if c in msdial.columns]
    last_meta = max(msdial.columns.get_loc(c) for c in meta_cols)
    sample_cols = msdial.columns[last_meta + 1:]
    
    feature_info = msdial[meta_cols].copy()
    intensity = msdial[sample_cols].copy()
    intensity.index = msdial['Alignment ID']
    

    Filter the Table Honestly

    Goal: Keep features supported by real signal and known confidence, without overtrusting annotation tags.

    Approach: Filter on Fill% (cross-sample presence), require MS/MS support for any feature called identified, and tie the annotation tag to a real MSI confidence level rather than treating a name as proof.

    # Fill% is the share of samples with a DETECTED (not gap-filled) peak. Low Fill% means
    # the feature exists mostly as gap-filled noise-floor integrals, which fabricate intensity
    # (an honest 'below detection' becomes a positive number). 70% is a common floor.
    fill <- as.numeric(feature_info[['Fill %']])
    # Assumption: some exports may store Fill % on a 0-1 scale; rescale so 70 still applies.
    if (max(fill, na.rm = TRUE) <= 1) fill <- fill * 100
    keep_fill <- !is.na(fill) & fill >= 70
    
    # An annotated name without MS/MS is at best an MSI Level 3 putative ID (accurate mass
    # only). Require 'MS/MS assigned == TRUE' before trusting any identity downstream.
    # read.csv may parse this column as logical or leave it as text, so normalise first.
    has_msms <- toupper(as.character(feature_info[['MS/MS assigned']])) == 'TRUE'
    
    # Annotation tag confidence (do NOT treat a name as an identification). The exact tag
    # vocabulary is MS-DIAL-version-dependent, so inspect unique(feature_info[['Annotation tag (VS1.0)']])
    # and map the strings the build actually emits rather than hard-coding them:
    #   Metabolite / Lipid  with MS/MS  -> MSI Level 2 (spectral library match)
    #   Suggested*          mass-only   -> MSI Level 3 (putative, no MS/MS)
    #   Unknown                         -> unannotated feature
    tag <- feature_info[['Annotation tag (VS1.0)']]
    feature_info$msi_level <- ifelse(tag %in% c('Metabolite', 'Lipid') & has_msms, 2,
                              ifelse(grepl('^Suggested', tag), 3, NA))
    
    filtered <- intensity[keep_fill, , drop = FALSE]
    

    Confidence-level honesty and orthogonal-evidence identification belong to metabolomics/metabolite-annotation; this skill only routes the tag to the right level. Fill% / blank / drift filtering interacts with metabolomics/normalization-qc: process blanks and pooled QCs through the SAME processing run as the samples, then filter the aligned table.
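    The same gating can be sketched in pandas for tables imported on the Python side. Treat it as an illustration, not a reference implementation: `honest_filter` is a hypothetical name, the tag strings are the ones listed in this section and should be checked against what the installed build emits, and the rescaling of a 0-1 Fill % column is an assumption added for robustness.

```python
import pandas as pd


def honest_filter(feature_info, intensity, min_fill_pct=70.0):
    """Return (annotated feature_info, Fill%-filtered intensity) for aligned rows."""
    info = feature_info.copy()
    fill = pd.to_numeric(info["Fill %"], errors="coerce")
    # Assumption: an export may store Fill % as a 0-1 fraction; rescale if so.
    if fill.max() <= 1.0:
        fill = fill * 100.0
    keep_fill = fill >= min_fill_pct

    # Accept both boolean and 'True'/'TRUE' text encodings of the MS/MS flag.
    has_msms = info["MS/MS assigned"].astype(str).str.upper().eq("TRUE")

    tag = info["Annotation tag (VS1.0)"].astype(str)
    info["msi_level"] = pd.NA
    info.loc[tag.str.startswith("Suggested"), "msi_level"] = 3
    info.loc[tag.isin(["Metabolite", "Lipid"]) & has_msms, "msi_level"] = 2
    info["keep_fill"] = keep_fill
    # Positional mask: feature_info and intensity share row order, not index labels.
    return info, intensity.loc[keep_fill.to_numpy()]


# Tiny made-up table for illustration only.
info = pd.DataFrame({
    "Alignment ID": [0, 1, 2],
    "Fill %": [90, 40, 100],
    "MS/MS assigned": ["True", "False", "True"],
    "Annotation tag (VS1.0)": ["Metabolite", "Suggested", "Unknown"],
})
intensity = pd.DataFrame({"S1": [10.0, 5.0, 7.0], "S2": [12.0, 0.0, 8.0]},
                         index=info["Alignment ID"])
annotated, kept = honest_filter(info, intensity)
print(annotated[["Alignment ID", "msi_level", "keep_fill"]])
print(kept)
```

    The MSI level is attached to every row rather than used as a filter, so unannotated but well-measured features stay in the table for statistics.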

    Per-Method Failure Modes

    DIA processed as DDA (wrong console token)

    • Trigger: Running SWATH/DIA data through lcmsdda.
    • Mechanism: lcmsdda does not deconvolve wide-isolation chimeric MS/MS, so fragments from co-isolated precursors stay mixed.
    • Symptom: Library matches to the wrong compound; "clean-looking" spectra that fail orthogonal confirmation.
    • Fix: Use lcmsdia (ABF input only); MS2Dec deconvolution is the entire reason to run DIA in MS-DIAL.

    Over-trusting the annotation tag

    • Trigger: Filtering on Annotation tag != Unknown and calling the survivors "identified."
    • Mechanism: A Suggested* tag is an accurate-mass guess with no MS/MS; a named hit without MS/MS is MSI Level 3.
    • Symptom: A marker list full of confident-sounding names that do not validate against standards.
    • Fix: Require MS/MS assigned == TRUE for any identity claim; map tags to MSI levels (see filtering section) and defer to metabolomics/metabolite-annotation.

    Gap-fill masquerading as measurement

    • Trigger: Treating low-Fill% features as quantitative.
    • Mechanism: Gap-filling integrates whatever signal sits in the m/z-RT box even when no peak exists, turning a true below-detection (MNAR, left-censored) value into a positive number.
    • Symptom: "Significant" features that are mostly gap-filled in one group; shrunken fold-changes for on/off markers.
    • Fix: Report per-feature filled fraction; gate on Fill%; for inferential stats prefer MNAR-aware imputation over naive fill (see metabolomics/normalization-qc).

    GC-EI run through an LC pipeline / MS-DIAL 5

    • Trigger: Sending GC-EI data to MS-DIAL 5-alpha or treating it like LC peak-pick-then-group.
    • Mechanism: MS-DIAL 5-alpha excludes GC-MS; EI needs component deconvolution, not adduct-style peak picking, and RI (not RT) alignment.
    • Symptom: No GC mode available; or conflated co-eluting compounds and cross-lab RT misalignment.
    • Fix: Use the gcms token in an MS-DIAL 4 build (or AMDIS/eRah); align on Kovats/FAME retention index.

    Quantitative Thresholds

    Threshold | Source | Rationale
    Fill% >= 70% | Common untargeted practice | Below this, the feature is mostly gap-filled noise-floor integrals, not measurements
    QC CV (RSD) < 20-30% | Broadhurst 2018 | Technical reproducibility floor; drop features noisier than this in pooled QCs
    D-ratio (sd_QC/sd_sample) < 0.5 | Broadhurst 2018 | Keeps features whose technical variance is well below biological variance
    Blank filter: sample mean > 3-5x blank mean | Broadhurst 2018 | Removes background/contaminant features present in process blanks
    ~10x more features than compounds | Mahieu 2017 | One metabolite makes adducts/isotopes/fragments; counting features over-counts hypotheses
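    The QC-based thresholds in the table reduce to three per-feature ratios. The sketch below computes them with pandas on a feature-by-sample table; the function names, column groupings, and demo numbers are hypothetical, and the defaults sit at the loose end of the quoted ranges (RSD 30%, blank ratio 3x).

```python
import pandas as pd


def qc_metrics(intensity, qc_cols, sample_cols, blank_cols):
    """Per-feature QC RSD (%), D-ratio (sd_QC / sd_sample), and sample/blank mean ratio."""
    qc, smp, blk = intensity[qc_cols], intensity[sample_cols], intensity[blank_cols]
    out = pd.DataFrame(index=intensity.index)
    out["qc_rsd_pct"] = 100.0 * qc.std(axis=1) / qc.mean(axis=1)
    out["d_ratio"] = qc.std(axis=1) / smp.std(axis=1)
    out["blank_ratio"] = smp.mean(axis=1) / blk.mean(axis=1)
    return out


def passes(metrics, max_rsd_pct=30.0, max_d_ratio=0.5, min_blank_ratio=3.0):
    """Boolean mask of features that clear all three thresholds."""
    return (
        (metrics["qc_rsd_pct"] < max_rsd_pct)
        & (metrics["d_ratio"] < max_d_ratio)
        & (metrics["blank_ratio"] > min_blank_ratio)
    )


# Made-up intensities: f1 is reproducible and above blank, f2 is noisy in the QCs.
table = pd.DataFrame(
    {"QC1": [100, 100], "QC2": [110, 200],
     "S1": [50, 90], "S2": [150, 110], "S3": [250, 100],
     "B1": [5, 80], "B2": [5, 90]},
    index=["f1", "f2"], dtype=float,
)
metrics = qc_metrics(table, ["QC1", "QC2"], ["S1", "S2", "S3"], ["B1", "B2"])
print(metrics.round(3))
print(passes(metrics))
```

    Standard deviations here use the pandas default (sample standard deviation), so at least two QC injections and two study samples are needed per feature.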

    Common Errors

    Error / symptom | Cause | Solution
    All columns land in one field on import | Header offset wrong; tab-separated export read as CSV | skip=4 (R) / skiprows=4 (Python), set sep='\t'
    lcmsdia rejects mzML input | DIA mode accepts ABF only | Convert vendor raw to ABF (Reifycs ABF converter) before lcmsdia
    Annotation tag column not found | Header changes across versions (e.g. Annotation tag (VS1.0)) | Match by prefix / inspect colnames(); do not hard-code the suffix
    No GC-MS option in MS-DIAL 5 | 5-alpha excludes GC-MS | Use the gcms token in an MS-DIAL 4 build, or AMDIS/eRah
    Console command not found on Linux | Expecting the GUI executable | The GUI is Windows-only; run MsdialConsoleApp (cross-platform)
    Few features detected | Minimum peak height default too high for the instrument | Lower it toward the real baseline; defaults are TOF-tuned
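    The "match by prefix" advice for version-dependent headers can be sketched as a small lookup that fails loudly instead of silently picking the wrong column. `find_column` is a hypothetical helper, and the header list is a shortened example, not a full export header.

```python
def find_column(columns, prefix):
    """Return the single column whose name starts with prefix (case-insensitive)."""
    wanted = prefix.lower()
    matches = [c for c in columns if str(c).strip().lower().startswith(wanted)]
    if not matches:
        raise KeyError(f"no column starts with {prefix!r}; available: {list(columns)}")
    if len(matches) > 1:
        raise KeyError(f"prefix {prefix!r} is ambiguous: {matches}")
    return matches[0]


header = ["Alignment ID", "Average Mz", "Fill %", "Annotation tag (VS1.0)"]
tag_col = find_column(header, "Annotation tag")
print(tag_col)
```

    With a pandas frame the call would be `find_column(msdial.columns, "Annotation tag")`, and the returned name can be used wherever the hard-coded header appears.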

    References

    • Tsugawa H, Cajka T, Kind T, Ma Y, Higgins B, Ikeda K, Kanazawa M, VanderGheynst J, Fiehn O, Arita M. MS-DIAL: data-independent MS/MS deconvolution for comprehensive metabolome analysis. Nat Methods. 2015; 12(6):523-526.
    • Tsugawa H, Ikeda K, Takahashi M, et al. A lipidome atlas in MS-DIAL 4. Nat Biotechnol. 2020; 38(10):1159-1163.
    • Takeda H, Takahashi M, Ikeda K, et al. MS-DIAL 5 multimodal mass spectrometry data mining unveils lipidome complexities. Nat Commun. 2024; 15:9903.
    • Li Z, Lu Y, Guo Y, Cao H, Wang Q, Shui W. Comprehensive evaluation of untargeted metabolomics data processing software in feature detection, quantification and discriminating marker selection. Anal Chim Acta. 2018; 1029:50-57.
    • Mahieu NG, Patti GJ. Systems-level annotation of a metabolomics data set reduces 25,000 features to fewer than 1,000 unique metabolites. Anal Chem. 2017; 89(19):10397-10406.
    • Broadhurst D, Goodacre R, Reinke SN, Kuligowski J, Wilson ID, Lewis MR, Dunn WB. Guidelines and considerations for the use of system suitability and quality control samples in mass spectrometry assays applied in untargeted clinical metabolomic studies. Metabolomics. 2018; 14(6):72.
    • Stein SE. An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data (AMDIS). J Am Soc Mass Spectrom. 1999; 10(8):770-781.

    Related Skills

    • metabolomics/xcms-preprocessing - Programmatic R preprocessing and the feature-table-as-artifact framing
    • metabolomics/lipidomics - Lipid annotation mode and LipidBlast workflows
    • metabolomics/metabolite-annotation - MSI confidence levels and orthogonal-evidence identification
    • metabolomics/normalization-qc - Drift correction, QC/CV/D-ratio filtering, MNAR-aware imputation
    Repository
    gptomics/bioskills