
    K-Dense-AI/datamol

    Datamol Cheminformatics Skill

    Overview

    Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native rdkit.Chem.Mol instances, ensuring full compatibility with the RDKit ecosystem.

    Key capabilities:

    • Molecular format conversion (SMILES, SELFIES, InChI)
    • Structure standardization and sanitization
    • Molecular descriptors and fingerprints
    • 3D conformer generation and analysis
    • Clustering and diversity selection
    • Scaffold and fragment analysis
    • Chemical reaction application
    • Visualization and alignment
    • Batch processing with parallelization
    • Cloud storage support via fsspec

    Installation and Setup

    Guide users to install datamol:

    uv pip install datamol
    

    Import convention:

    import datamol as dm
    

    Core Workflows

    1. Basic Molecule Handling

    Creating molecules from SMILES:

    import datamol as dm
    
    # Single molecule
    mol = dm.to_mol("CCO")  # Ethanol
    
    # From list of SMILES
    smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
    mols = [dm.to_mol(smi) for smi in smiles_list]
    
    # Error handling
    mol = dm.to_mol("invalid_smiles")  # Returns None
    if mol is None:
        print("Failed to parse SMILES")
    

    Converting molecules to SMILES:

    # Canonical SMILES
    smiles = dm.to_smiles(mol)
    
    # Isomeric SMILES (includes stereochemistry)
    smiles = dm.to_smiles(mol, isomeric=True)
    
    # Other formats
    inchi = dm.to_inchi(mol)
    inchikey = dm.to_inchikey(mol)
    selfies = dm.to_selfies(mol)
    

    Standardization and sanitization (always recommend for user-provided molecules):

    # Sanitize molecule
    mol = dm.sanitize_mol(mol)
    
    # Full standardization (recommended for datasets)
    mol = dm.standardize_mol(
        mol,
        disconnect_metals=True,
        normalize=True,
        reionize=True
    )
    
    # For SMILES strings directly
    clean_smiles = dm.standardize_smiles(smiles)
    

    2. Reading and Writing Molecular Files

    Refer to references/io_module.md for comprehensive I/O documentation.

    Reading files:

    # SDF files (most common in chemistry)
    df = dm.read_sdf("compounds.sdf", mol_column='mol')
    
    # SMILES files
    df = dm.read_smi("molecules.smi", smiles_column='smiles', mol_column='mol')
    
    # CSV with SMILES column
    df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
    
    # Excel files
    df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
    
    # Universal reader (auto-detects format)
    df = dm.open_df("file.sdf")  # Works with .sdf, .csv, .xlsx, .parquet, .json
    

    Writing files:

    # Save as SDF
    dm.to_sdf(mols, "output.sdf")
    # Or from DataFrame
    dm.to_sdf(df, "output.sdf", mol_column="mol")
    
    # Save as SMILES file
    dm.to_smi(mols, "output.smi")
    
    # Excel with rendered molecule images
    dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])
    

    Remote file support (S3, GCS, HTTP):

    # Read from cloud storage
    df = dm.read_sdf("s3://bucket/compounds.sdf")
    df = dm.read_csv("https://example.com/data.csv")
    
    # Write to cloud storage
    dm.to_sdf(mols, "s3://bucket/output.sdf")
    

    3. Molecular Descriptors and Properties

    Refer to references/descriptors_viz.md for detailed descriptor documentation.

    Computing descriptors for a single molecule:

    # Get standard descriptor set
    descriptors = dm.descriptors.compute_many_descriptors(mol)
    # Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1,
    #           'tpsa': 20.23, 'n_aromatic_atoms': 0, ...}
    

    Batch descriptor computation (recommended for datasets):

    # Compute for all molecules in parallel
    desc_df = dm.descriptors.batch_compute_many_descriptors(
        mols,
        n_jobs=-1,      # Use all CPU cores
        progress=True   # Show progress bar
    )
    

    Specific descriptors:

    # Aromaticity
    n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
    aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
    
    # Stereochemistry
    n_stereo = dm.descriptors.n_stereo_centers(mol)
    n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)
    
    # Flexibility
    n_rigid = dm.descriptors.n_rigid_bonds(mol)
    

    Drug-likeness filtering (Lipinski's Rule of Five):

    # Filter compounds
    def is_druglike(mol):
        desc = dm.descriptors.compute_many_descriptors(mol)
        return (
            desc['mw'] <= 500 and
            desc['logp'] <= 5 and
            desc['hbd'] <= 5 and
            desc['hba'] <= 10
        )
    
    druglike_mols = [mol for mol in mols if is_druglike(mol)]
    

    4. Molecular Fingerprints and Similarity

    Generating fingerprints:

    # ECFP (Extended Connectivity Fingerprint, default)
    fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)
    
    # Other fingerprint types
    fp_maccs = dm.to_fp(mol, fp_type='maccs')
    fp_topological = dm.to_fp(mol, fp_type='topological')
    fp_atompair = dm.to_fp(mol, fp_type='atompair')
    

    Similarity calculations:

    # Pairwise distances within a set
    distance_matrix = dm.pdist(mols, n_jobs=-1)
    
    # Distances between two sets
    distances = dm.cdist(query_mols, library_mols, n_jobs=-1)
    
    # Find most similar molecules
    from scipy.spatial.distance import squareform
    dist_matrix = squareform(dm.pdist(mols))
    # Lower distance = higher similarity (Tanimoto distance = 1 - Tanimoto similarity)
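
    The distance/similarity relationship noted above can be illustrated with a small stdlib-only sketch. The fingerprints here are hypothetical sets of on-bit indices, purely for illustration; dm.pdist and dm.cdist do the equivalent work over real fingerprints at scale.

    ```python
    def tanimoto_similarity(fp_a: set, fp_b: set) -> float:
        """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
        if not fp_a and not fp_b:
            return 1.0  # two empty fingerprints are conventionally identical
        intersection = len(fp_a & fp_b)
        union = len(fp_a | fp_b)
        return intersection / union

    # Hypothetical fingerprints: 2 shared bits out of 5 total
    fp1 = {1, 4, 7, 9}
    fp2 = {1, 4, 8}
    sim = tanimoto_similarity(fp1, fp2)
    dist = 1 - sim  # Tanimoto distance, the quantity dm.pdist/dm.cdist return
    print(round(sim, 3), round(dist, 3))  # 0.4 0.6
    ```

    A similarity of 0.4 thus corresponds to a distance of 0.6, which is why hit selection ranks by ascending distance.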
    

    5. Clustering and Diversity Selection

    Refer to references/core_api.md for clustering details.

    Butina clustering:

    # Cluster molecules by structural similarity
    clusters = dm.cluster_mols(
        mols,
        cutoff=0.2,    # Tanimoto distance threshold (0=identical, 1=completely different)
        n_jobs=-1      # Parallel processing
    )
    
    # Each cluster is a list of molecule indices
    for i, cluster in enumerate(clusters):
        print(f"Cluster {i}: {len(cluster)} molecules")
        cluster_mols = [mols[idx] for idx in cluster]
    

    Important: Butina clustering builds a full pairwise distance matrix, so it is practical for roughly 1,000 molecules but not for 10,000+.
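
    A quick back-of-envelope calculation shows why: a full matrix over N molecules requires N(N-1)/2 unique pairwise distances.

    ```python
    def n_pairwise(n: int) -> int:
        """Number of unique pairwise distances for n molecules."""
        return n * (n - 1) // 2

    print(f"{n_pairwise(1_000):,}")    # 499,500 distances: manageable
    print(f"{n_pairwise(100_000):,}")  # 4,999,950,000 distances: infeasible in memory
    ```

    The quadratic growth is what makes diversity selection the better tool for large libraries.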

    Diversity selection:

    # Pick diverse subset
    diverse_mols = dm.pick_diverse(
        mols,
        npick=100  # Select 100 diverse molecules
    )
    
    # Pick cluster centroids
    centroids = dm.pick_centroids(
        mols,
        npick=50   # Select 50 representative molecules
    )
    

    6. Scaffold Analysis

    Refer to references/fragments_scaffolds.md for complete scaffold documentation.

    Extracting Murcko scaffolds:

    # Get Bemis-Murcko scaffold (core structure)
    scaffold = dm.to_scaffold_murcko(mol)
    scaffold_smiles = dm.to_smiles(scaffold)
    

    Scaffold-based analysis:

    # Group compounds by scaffold
    from collections import Counter
    
    scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
    scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
    
    # Count scaffold frequency
    scaffold_counts = Counter(scaffold_smiles)
    most_common = scaffold_counts.most_common(10)
    
    # Create scaffold-to-molecules mapping
    scaffold_groups = {}
    for mol, scaf_smi in zip(mols, scaffold_smiles):
        if scaf_smi not in scaffold_groups:
            scaffold_groups[scaf_smi] = []
        scaffold_groups[scaf_smi].append(mol)
    

    Scaffold-based train/test splitting (for ML):

    # Ensure train and test sets have different scaffolds
    scaffold_to_mols = {}
    for mol, scaf in zip(mols, scaffold_smiles):
        if scaf not in scaffold_to_mols:
            scaffold_to_mols[scaf] = []
        scaffold_to_mols[scaf].append(mol)
    
    # Split scaffolds into train/test
    import random
    scaffolds = list(scaffold_to_mols.keys())
    random.shuffle(scaffolds)
    split_idx = int(0.8 * len(scaffolds))
    train_scaffolds = scaffolds[:split_idx]
    test_scaffolds = scaffolds[split_idx:]
    
    # Get molecules for each split
    train_mols = [mol for scaf in train_scaffolds for mol in scaffold_to_mols[scaf]]
    test_mols = [mol for scaf in test_scaffolds for mol in scaffold_to_mols[scaf]]
    

    7. Molecular Fragmentation

    Refer to references/fragments_scaffolds.md for fragmentation details.

    BRICS fragmentation (16 bond types):

    # Fragment molecule
    fragments = dm.fragment.brics(mol)
    # Returns: set of fragment SMILES with attachment points like '[1*]CCN'
    

    RECAP fragmentation (11 bond types):

    fragments = dm.fragment.recap(mol)
    

    Fragment analysis:

    # Find common fragments across compound library
    from collections import Counter
    
    all_fragments = []
    for mol in mols:
        frags = dm.fragment.brics(mol)
        all_fragments.extend(frags)
    
    fragment_counts = Counter(all_fragments)
    common_frags = fragment_counts.most_common(20)
    
    # Fragment-based scoring
    def fragment_score(mol, reference_fragments):
        mol_frags = dm.fragment.brics(mol)
        overlap = mol_frags.intersection(reference_fragments)
        return len(overlap) / len(mol_frags) if mol_frags else 0
    

    8. 3D Conformer Generation

    Refer to references/conformers_module.md for detailed conformer documentation.

    Generating conformers:

    # Generate 3D conformers
    mol_3d = dm.conformers.generate(
        mol,
        n_confs=50,           # Number to generate (auto if None)
        rms_cutoff=0.5,       # Filter similar conformers (Ångströms)
        minimize_energy=True,  # Minimize with UFF force field
        method='ETKDGv3'      # Embedding method (recommended)
    )
    
    # Access conformers
    n_conformers = mol_3d.GetNumConformers()
    conf = mol_3d.GetConformer(0)  # Get first conformer
    positions = conf.GetPositions()  # Nx3 array of atom coordinates
    

    Conformer clustering:

    # Cluster conformers by RMSD
    clusters = dm.conformers.cluster(
        mol_3d,
        rms_cutoff=1.0,
        centroids=False
    )
    
    # Get representative conformers
    centroids = dm.conformers.return_centroids(mol_3d, clusters)
    

    SASA calculation:

    # Calculate solvent accessible surface area
    sasa_values = dm.conformers.sasa(mol_3d, n_jobs=-1)
    
    # Access SASA from conformer properties
    conf = mol_3d.GetConformer(0)
    sasa = conf.GetDoubleProp('rdkit_free_sasa')
    

    9. Visualization

    Refer to references/descriptors_viz.md for visualization documentation.

    Basic molecule grid:

    # Visualize molecules
    dm.viz.to_image(
        mols[:20],
        legends=[dm.to_smiles(m) for m in mols[:20]],
        n_cols=5,
        mol_size=(300, 300)
    )
    
    # Save to file
    dm.viz.to_image(mols, outfile="molecules.png")
    
    # SVG for publications
    dm.viz.to_image(mols, outfile="molecules.svg", use_svg=True)
    

    Aligned visualization (for SAR analysis):

    # Align molecules by common substructure
    dm.viz.to_image(
        similar_mols,
        align=True,  # Enable MCS alignment
        legends=activity_labels,
        n_cols=4
    )
    

    Highlighting substructures:

    # Highlight specific atoms and bonds
    dm.viz.to_image(
        mol,
        highlight_atom=[0, 1, 2, 3],  # Atom indices
        highlight_bond=[0, 1, 2]      # Bond indices
    )
    

    Conformer visualization:

    # Display multiple conformers
    dm.viz.conformers(
        mol_3d,
        n_confs=10,
        align_conf=True,
        n_cols=3
    )
    

    10. Chemical Reactions

    Refer to references/reactions_data.md for reactions documentation.

    Applying reactions:

    from rdkit.Chem import rdChemReactions
    
    # Define reaction from SMARTS
    rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'
    rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
    
    # Apply to molecule
    reactant = dm.to_mol("CC(=O)O")  # Acetic acid
    product = dm.reactions.apply_reaction(
        rxn,
        (reactant,),
        sanitize=True
    )
    
    # Convert to SMILES
    product_smiles = dm.to_smiles(product)
    

    Batch reaction application:

    # Apply reaction to library
    products = []
    for mol in reactant_mols:
        try:
            prod = dm.reactions.apply_reaction(rxn, (mol,))
            if prod is not None:
                products.append(prod)
        except Exception as e:
            print(f"Reaction failed: {e}")
    

    Parallelization

    Datamol includes built-in parallelization for many operations. Use the n_jobs parameter:

    • n_jobs=1: Sequential (no parallelization)
    • n_jobs=-1: Use all available CPU cores
    • n_jobs=4: Use 4 cores

    Functions supporting parallelization:

    • dm.read_sdf(..., n_jobs=-1)
    • dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)
    • dm.cluster_mols(..., n_jobs=-1)
    • dm.pdist(..., n_jobs=-1)
    • dm.conformers.sasa(..., n_jobs=-1)

    Progress bars: Many batch operations also accept a progress=True parameter.
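
    To make the n_jobs convention concrete, here is a minimal stdlib sketch of the same semantics. This is illustrative only, not datamol's implementation (datamol's helpers are built on joblib and use processes rather than threads):

    ```python
    import os
    from concurrent.futures import ThreadPoolExecutor

    def parallel_map(fn, items, n_jobs=1):
        """Mirror the n_jobs convention: 1 = sequential, -1 = all cores, N = N workers."""
        if n_jobs == 1:
            # Sequential path: no pool overhead
            return [fn(x) for x in items]
        workers = os.cpu_count() if n_jobs == -1 else n_jobs
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Executor.map preserves input order in its results
            return list(pool.map(fn, items))

    print(parallel_map(lambda x: x * x, [1, 2, 3, 4], n_jobs=2))  # [1, 4, 9, 16]
    ```

    Order preservation matters here: descriptor rows and fingerprints must stay aligned with the input molecule list, which is why datamol's batch functions return results in input order.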

    Common Workflows and Patterns

    Complete Pipeline: Data Loading → Filtering → Analysis

    import datamol as dm
    import pandas as pd
    
    # 1. Load molecules
    df = dm.read_sdf("compounds.sdf")
    
    # 2. Standardize
    df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None)
    df = df[df['mol'].notna()]  # Remove failed molecules
    
    # 3. Compute descriptors
    desc_df = dm.descriptors.batch_compute_many_descriptors(
        df['mol'].tolist(),
        n_jobs=-1,
        progress=True
    )
    
    # 4. Filter by drug-likeness
    druglike = (
        (desc_df['mw'] <= 500) &
        (desc_df['logp'] <= 5) &
        (desc_df['hbd'] <= 5) &
        (desc_df['hba'] <= 10)
    )
    filtered_df = df[druglike]
    
    # 5. Cluster and select diverse subset
    diverse_mols = dm.pick_diverse(
        filtered_df['mol'].tolist(),
        npick=100
    )
    
    # 6. Visualize results
    dm.viz.to_image(
        diverse_mols,
        legends=[dm.to_smiles(m) for m in diverse_mols],
        outfile="diverse_compounds.png",
        n_cols=10
    )
    

    Structure-Activity Relationship (SAR) Analysis

    # Group by scaffold
    scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
    scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
    
    # Create DataFrame with activities
    sar_df = pd.DataFrame({
        'mol': mols,
        'scaffold': scaffold_smiles,
        'activity': activities  # User-provided activity data
    })
    
    # Analyze each scaffold series
    for scaffold, group in sar_df.groupby('scaffold'):
        if len(group) >= 3:  # Need multiple examples
            print(f"\nScaffold: {scaffold}")
            print(f"Count: {len(group)}")
            print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}")
    
            # Visualize with activities as legends
            dm.viz.to_image(
                group['mol'].tolist(),
                legends=[f"Activity: {act:.2f}" for act in group['activity']],
                align=True  # Align by common substructure
            )
    

    Virtual Screening Pipeline

    # 1. dm.cdist fingerprints the molecules internally, so no separate
    #    fingerprint-generation step is needed
    import numpy as np
    
    # 2. Calculate Tanimoto distances between query actives and library
    distances = dm.cdist(query_actives, library_mols, n_jobs=-1)
    
    # 3. Find closest matches (min distance to any query)
    min_distances = distances.min(axis=0)
    similarities = 1 - min_distances  # Convert distance to similarity
    
    # 4. Rank and select top hits
    top_indices = np.argsort(similarities)[::-1][:100]  # Top 100
    top_hits = [library_mols[i] for i in top_indices]
    top_scores = [similarities[i] for i in top_indices]
    
    # 5. Visualize hits
    dm.viz.to_image(
        top_hits[:20],
        legends=[f"Sim: {score:.3f}" for score in top_scores[:20]],
        outfile="screening_hits.png"
    )
    

    Reference Documentation

    For detailed API documentation, consult these reference files:

    • references/core_api.md: Core namespace functions (conversions, standardization, fingerprints, clustering)
    • references/io_module.md: File I/O operations (read/write SDF, CSV, Excel, remote files)
    • references/conformers_module.md: 3D conformer generation, clustering, SASA calculations
    • references/descriptors_viz.md: Molecular descriptors and visualization functions
    • references/fragments_scaffolds.md: Scaffold extraction, BRICS/RECAP fragmentation
    • references/reactions_data.md: Chemical reactions and toy datasets

    Best Practices

    1. Always standardize molecules from external sources:

      mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True)
      
    2. Check for None values after molecule parsing:

      mol = dm.to_mol(smiles)
      if mol is None:
          ...  # handle the invalid SMILES here
      
    3. Use parallel processing for large datasets:

      result = dm.operation(..., n_jobs=-1, progress=True)
      
    4. Leverage fsspec for cloud storage:

      df = dm.read_sdf("s3://bucket/compounds.sdf")
      
    5. Use appropriate fingerprints for similarity:

      • ECFP (Morgan): General purpose, structural similarity
      • MACCS: Fast, smaller feature space
      • Atom pairs: Considers atom pairs and distances
    6. Consider scale limitations:

      • Butina clustering: ~1,000 molecules (full distance matrix)
      • For larger datasets: Use diversity selection or hierarchical methods
    7. Scaffold splitting for ML: Ensure proper train/test separation by scaffold

    8. Align molecules when visualizing SAR series

    Error Handling

    # Safe molecule creation
    def safe_to_mol(smiles):
        try:
            mol = dm.to_mol(smiles)
            if mol is not None:
                mol = dm.standardize_mol(mol)
            return mol
        except Exception as e:
            print(f"Failed to process {smiles}: {e}")
            return None
    
    # Safe batch processing
    valid_mols = []
    for smiles in smiles_list:
        mol = safe_to_mol(smiles)
        if mol is not None:
            valid_mols.append(mol)
    

    Integration with Machine Learning

    # Feature generation
    X = np.array([dm.to_fp(mol) for mol in mols])
    
    # Or descriptors
    desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1)
    X = desc_df.values
    
    # Train model
    from sklearn.ensemble import RandomForestRegressor
    model = RandomForestRegressor()
    model.fit(X, y_target)
    
    # Predict
    predictions = model.predict(X_test)
    

    Troubleshooting

    Issue: Molecule parsing fails

    • Solution: Use dm.standardize_smiles() first or try dm.fix_mol()

    Issue: Memory errors with clustering

    • Solution: Use dm.pick_diverse() instead of full clustering for large sets

    Issue: Slow conformer generation

    • Solution: Reduce n_confs or increase rms_cutoff to generate fewer conformers

    Issue: Remote file access fails

    • Solution: Ensure fsspec and appropriate cloud provider libraries are installed (s3fs, gcsfs, etc.)

    Additional Resources

    • Datamol Documentation: https://docs.datamol.io/
    • RDKit Documentation: https://www.rdkit.org/docs/
    • GitHub Repository: https://github.com/datamol-io/datamol