    kthorn/checking-chembl-for-structured-sar-data
    Research
    About

    Check whether medicinal chemistry papers are in the ChEMBL database to access curated bioactivity data

    SKILL.md

    Checking ChEMBL for Structured SAR Data

    Overview

    ChEMBL is a manually curated database of ~99,000 medicinal chemistry papers with extracted, standardized bioactivity data. If a paper is in ChEMBL, you can access structured data without parsing PDFs.

    Core principle: Check ChEMBL first for medicinal chemistry papers. Curated data is more reliable than table parsing.

    When to Use

    Use this skill when:

    • Paper describes medicinal chemistry / drug discovery
    • Abstract mentions compound series, SAR, or activity data
    • Paper has IC50, MIC, Ki, EC50, or other bioactivity measurements
    • Before attempting to extract data from tables/figures
    • Paper scored ≥ 7 in relevance evaluation

    When NOT to use:

    • Non-medicinal chemistry papers (cell biology, genomics, etc.)
    • Papers without activity measurements
    • Reviews without primary data
    • Very recent papers (< 6 months, likely not curated yet)
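The criteria above can be condensed into a small pre-check. This is a minimal sketch, assuming the calling workflow already tracks a relevance score, the paper's age in months, and two flags; all parameter names are hypothetical and none come from ChEMBL.

```python
def should_check_chembl(score, months_since_publication,
                        is_review=False, has_activity_data=True):
    """Return True when a paper meets the 'When to Use' criteria.

    All parameter names are hypothetical stand-ins for whatever the
    relevance-evaluation step records about a paper.
    """
    if score < 7:
        return False  # below the relevance threshold
    if is_review or not has_activity_data:
        return False  # reviews and papers without measurements
    if months_since_publication < 6:
        return False  # likely not curated yet
    return True
```

For example, `should_check_chembl(9, 24)` passes, while a paper published three months ago is skipped regardless of score.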

    ChEMBL API Basics

    Base URL: https://www.ebi.ac.uk/chembl/api/data/

    No authentication required

    CRITICAL: ChEMBL can ONLY be queried by DOI, NOT by PMID

    • The API returns PMID in results, but does not accept it as a query parameter
    • Always use DOI for lookups: ?doi=10.1234/example
    • PMID queries will return 0 results even if paper exists in ChEMBL

    Two-step process:

    1. Check if paper (by DOI) is in ChEMBL
    2. If yes, retrieve bioactivity data

    Step 1: Check if Paper in ChEMBL

    Query by DOI (ONLY method that works):

    curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=DOI"
    

    ⚠️ IMPORTANT: Must use DOI, not PMID

    # ✅ CORRECT - Use DOI
    doi="10.1021/jm401507s"
    curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=$doi"
    
    # ❌ WRONG - PMID won't work (will return 0 results)
    pmid="24446688"
    curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?pubmed_id=$pmid"  # Does NOT work!
    

    If you only have PMID: Fetch DOI from PubMed first, then query ChEMBL with the DOI.
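One way to obtain the DOI is NCBI's E-utilities ESummary endpoint. The sketch below assumes the ESummary JSON response lists identifiers under `result.<pmid>.articleids`, with the DOI marked by `idtype` equal to `doi`; the function names are hypothetical. Verify the layout against a live response before relying on it.

```python
import json
import urllib.parse
import urllib.request

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def doi_from_esummary(payload, pmid):
    """Pull the DOI out of an ESummary JSON payload, or return None.

    Assumes the identifiers live under result[<pmid>]["articleids"].
    """
    record = payload.get("result", {}).get(str(pmid), {})
    for article_id in record.get("articleids", []):
        if article_id.get("idtype") == "doi":
            return article_id.get("value")
    return None


def fetch_doi_for_pmid(pmid, timeout=10):
    """Look up the DOI for a PMID (performs a network request)."""
    query = urllib.parse.urlencode(
        {"db": "pubmed", "id": pmid, "retmode": "json"}
    )
    url = f"{ESUMMARY_URL}?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return doi_from_esummary(json.load(resp), pmid)
```

Keeping the parsing in `doi_from_esummary` separate from the network call makes it easy to check against a saved response.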

    Response structure:

    {
      "documents": [
        {
          "document_chembl_id": "CHEMBL3120156",
          "doi": "10.1021/jm401507s",
          "title": "Discovery and development of simeprevir (TMC435), a HCV NS3/4A protease inhibitor.",
          "abstract": "Hepatitis C virus is a blood-borne infection...",
          "pubmed_id": 24446688,
          "journal": "J Med Chem",
          "year": 2014,
          "doc_type": "PUBLICATION"
        }
      ],
      "page_meta": {
        "total_count": 1
      }
    }
    

    Key fields:

    • document_chembl_id - Use this to retrieve activity data
    • doc_type - "PUBLICATION" (from literature) or "DATASET" (deposited)
    • pubmed_id - PMID is in the response, but cannot be used to query ChEMBL
    • If total_count = 0, paper not in ChEMBL

    Parse response:

    response=$(curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=$doi")
    
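    # Default to 0 so an empty or malformed response falls through to "Not in ChEMBL"
    count=$(echo "$response" | jq -r '.page_meta.total_count // 0')
    if [ "${count:-0}" -gt 0 ]; then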
      chembl_id=$(echo "$response" | jq -r '.documents[0].document_chembl_id')
      echo "✓ Found in ChEMBL: $chembl_id"
    else
      echo "✗ Not in ChEMBL"
    fi
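The same check can be written in Python. DOIs may contain characters that are not URL-safe, so this sketch percent-encodes the DOI before building the query. The function names are hypothetical; the response layout follows the example shown above.

```python
from urllib.parse import urlencode

DOCUMENT_URL = "https://www.ebi.ac.uk/chembl/api/data/document.json"


def document_url(doi):
    """Build the document lookup URL with the DOI percent-encoded."""
    query = urlencode({"doi": doi})
    return f"{DOCUMENT_URL}?{query}"


def extract_document_id(payload):
    """Return the first document_chembl_id, or None if nothing matched."""
    if payload.get("page_meta", {}).get("total_count", 0) == 0:
        return None
    documents = payload.get("documents") or []
    if not documents:
        return None
    return documents[0].get("document_chembl_id")
```

`document_url("10.1021/jm401507s")` encodes the slash as `%2F`, which the server decodes back to the original DOI.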
    

    Step 2: Get Activity Data Count

    Query activity endpoint:

    curl -s "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156&limit=1"
    

    Extract total count:

    activity_url="https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=$chembl_id&limit=1"
    activity_count=$(curl -s "$activity_url" | jq -r '.page_meta.total_count')
    
    echo "→ $activity_count bioactivity data points"
    

    Step 3: Report to User and Update Summary

    Report immediately:

    📄 [15/127] Screening: "Discovery and development of simeprevir"
       Abstract score: 9 → Fetching full text...
       ✓ ChEMBL: CHEMBL3120156 (101 activity data points)
       → IC50 data for HCV NS3 protease inhibitors available
    

    Add to SUMMARY.md:

    ### [Discovery and development of simeprevir (TMC435), a HCV NS3/4A protease inhibitor](https://doi.org/10.1021/jm401507s) (Score: 9)
    
    **DOI:** [10.1021/jm401507s](https://doi.org/10.1021/jm401507s)
    **PMID:** [24446688](https://pubmed.ncbi.nlm.nih.gov/24446688/)
    **ChEMBL:** [CHEMBL3120156](https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL3120156/) (101 data points)
    
    **Key Findings:**
    - IC50 data for HCV NS3/4A protease inhibitors (from ChEMBL)
    - Lead compound simeprevir (TMC435) approved for HCV treatment
    - Structures and full activity data: [ChEMBL API](https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156)
    
    **ChEMBL Activity Summary:**
    - IC50 values for HCV NS3/4A protease
    - PK parameters (AUC, Cmax, clearance)
    - DMPK assays (metabolic stability, permeability)
    

    Always include ChEMBL status:

    • If found: Add ChEMBL ID with link and data point count
    • If not found: Note "Not in ChEMBL" (still valuable information)
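A SUMMARY.md entry header in the layout above can be generated rather than typed by hand. This is a minimal sketch; `summary_entry` and its parameters are hypothetical names, and it covers only the header lines, not the findings sections.

```python
def summary_entry(title, doi, score, pmid=None,
                  chembl_id=None, activity_count=None):
    """Render the header lines of one SUMMARY.md entry."""
    lines = [
        f"### [{title}](https://doi.org/{doi}) (Score: {score})",
        "",
        f"**DOI:** [{doi}](https://doi.org/{doi})",
    ]
    if pmid:
        lines.append(
            f"**PMID:** [{pmid}](https://pubmed.ncbi.nlm.nih.gov/{pmid}/)"
        )
    if chembl_id:
        card = f"https://www.ebi.ac.uk/chembl/document_report_card/{chembl_id}/"
        lines.append(
            f"**ChEMBL:** [{chembl_id}]({card}) ({activity_count} data points)"
        )
    else:
        # Absence is still recorded, as required above
        lines.append("**ChEMBL:** Not in ChEMBL")
    return "\n".join(lines)
```

Because the "Not in ChEMBL" branch always emits a line, every entry carries a ChEMBL status.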

    Step 4: Update Tracking Files

    Add to papers-reviewed.json:

    {
      "10.1021/jm401507s": {
        "pmid": "24446688",
        "status": "relevant",
        "score": 9,
        "chembl_id": "CHEMBL3120156",
        "chembl_activities": 101,
        "has_structured_data": true
      }
    }
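Updating the tracking file can be scripted. The sketch below assumes papers-reviewed.json is a single JSON object keyed by DOI, as in the example above; `record_chembl_status` is a hypothetical helper, and it merges into an existing entry rather than replacing it.

```python
import json
from pathlib import Path


def record_chembl_status(path, doi, pmid, score, chembl_id=None,
                         activity_count=0, status="relevant"):
    """Merge one paper's ChEMBL status into the tracking file.

    Returns the updated entry. Existing keys for the DOI are preserved
    unless overwritten by the fields set here.
    """
    tracking_file = Path(path)
    if tracking_file.exists():
        papers = json.loads(tracking_file.read_text())
    else:
        papers = {}
    entry = papers.setdefault(doi, {})
    entry.update({
        "pmid": pmid,
        "status": status,
        "score": score,
        "chembl_id": chembl_id,
        "chembl_activities": activity_count,
        "has_structured_data": chembl_id is not None and activity_count > 0,
    })
    tracking_file.write_text(json.dumps(papers, indent=2))
    return entry
```

Passing `chembl_id=None` records a paper that was checked but not found, so the check is not repeated later.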
    

    Optional: Extract Structured Data

    For papers with rich ChEMBL data (>20 activities), consider extracting:

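    # Get all IC50 data (raise the limit or page through results if there are more than 100)
    curl -s "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156&standard_type=IC50&limit=100" > chembl_CHEMBL3120156_ic50.json
    
    # Summary statistics (records with a null standard_value are skipped so tonumber does not fail)
    jq -r '[.activities[] | .standard_value | select(. != null) | tonumber] | "Min: \(min), Max: \(max), Count: \(length)"' chembl_CHEMBL3120156_ic50.json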
    

    Report to user:

    📊 ChEMBL data extracted:
       - IC50 values for HCV NS3/4A protease
       - All structures downloaded
       - Data saved to: chembl_CHEMBL3120156_ic50.json
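The same statistics can be computed in Python. Activity records can carry a null `standard_value`, so this sketch skips anything that is missing or not numeric. `summarize_standard_values` is a hypothetical helper that takes the `activities` list from the saved JSON.

```python
def summarize_standard_values(activities):
    """Return count/min/max of numeric standard_value entries."""
    values = []
    for activity in activities:
        raw = activity.get("standard_value")
        if raw is None:
            continue  # no numeric value recorded
        try:
            values.append(float(raw))
        except (TypeError, ValueError):
            continue  # ignore values that are not numeric
    if not values:
        return {"count": 0, "min": None, "max": None}
    return {"count": len(values), "min": min(values), "max": max(values)}
```

Typical use would be `summarize_standard_values(json.load(open(path))["activities"])` on the file saved by the curl command.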
    

    Integration with Other Skills

    During evaluating-paper-relevance workflow:

    1. After abstract screening (score ≥7)
    2. Before deep dive into full text
    3. Check ChEMBL using this skill
    4. If found:
      • Note ChEMBL ID in SUMMARY.md
      • Extract activity data (faster than PDF parsing)
      • Still fetch full text for methods, discussion, context
    5. If not found:
      • Proceed with normal PDF evaluation
      • Parse tables manually if needed

    Workflow integration point:

    Stage 2: Deep Dive
    ├─ 1. Fetch Full Text (PMC → DOI → Unpaywall)
    ├─ 1.5. Check ChEMBL ← ADD THIS STEP
    │   ├─ Query by DOI
    │   ├─ If found: note ChEMBL ID + activity count
    │   └─ Report to user
    ├─ 2. Scan for Relevant Content
    └─ 3. Extract Findings
    

    Common Activity Types in ChEMBL

    | Type | Description | Units |
    |------|-------------|-------|
    | IC50 | Half-maximal inhibitory concentration | nM, µM |
    | MIC | Minimum inhibitory concentration | µg/mL, nM |
    | Ki | Inhibition constant | nM, µM |
    | EC50 | Half-maximal effective concentration | nM, µM |
    | Kd | Dissociation constant | nM, µM |
    | Potency | General potency measurement | Various |

    Filter by activity type:

    curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&standard_type=MIC"
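Before filtering, it can help to see which activity types a document actually contains. This is a minimal sketch that tallies the `standard_type` field over activity records already fetched; `count_by_standard_type` is a hypothetical helper name.

```python
from collections import Counter


def count_by_standard_type(activities):
    """Tally activity records by standard_type, most common first."""
    counts = Counter(
        activity.get("standard_type") or "unknown"
        for activity in activities
    )
    return dict(counts.most_common())
```

The result shows at a glance whether a paper is mostly IC50 data, mostly MIC data, or a mix that includes PK parameters.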
    

    ChEMBL Coverage

    ~99,000 documents (as of 2025)

    Well represented:

    • Medicinal chemistry papers
    • SAR studies with compound series
    • Lead optimization campaigns
    • Papers in major journals (J Med Chem, Bioorg Med Chem, Eur J Med Chem, etc.)

    Poorly represented:

    • Very recent papers (6-12 month curation lag)
    • Papers without extractable structures/activities
    • Non-drug-discovery research
    • Purely mechanistic studies

    Typical hit rate:

    • ~30-40% of medicinal chemistry papers
    • Higher for SAR-focused journals

    Advantages of ChEMBL Data

    vs. PDF table parsing:

    • ✓ Structures already extracted (SMILES format)
    • ✓ Units standardized (standard_units field; IC50s typically in nM)
    • ✓ Values validated and curated
    • ✓ Machine-readable JSON
    • ✓ No OCR errors
    • ✓ Linked to assay protocols
    • ✓ Queryable (filter by activity range, target, etc.)
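As an illustration of the "queryable" point, the sketch below builds an activity URL with an optional type and an upper bound on the value. It assumes the API accepts Django-style filter suffixes such as `__lte` on `standard_value`; confirm this against the ChEMBL API documentation before relying on it. `activity_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

ACTIVITY_URL = "https://www.ebi.ac.uk/chembl/api/data/activity.json"


def activity_query_url(document_chembl_id, standard_type=None,
                       max_value=None, limit=100):
    """Build an activity query URL with optional type and range filters."""
    params = {"document_chembl_id": document_chembl_id, "limit": limit}
    if standard_type:
        params["standard_type"] = standard_type
    if max_value is not None:
        # Assumed filter syntax: field name plus "__lte" suffix
        params["standard_value__lte"] = max_value
    return f"{ACTIVITY_URL}?{urlencode(params)}"
```

For example, `activity_query_url("CHEMBL3120156", "IC50", max_value=100)` would request IC50 records at or below 100 in the standardized unit.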

    When to still use PDF:

    • Full experimental procedures
    • Synthesis routes
    • Papers not in ChEMBL
    • Very recent papers
    • Context and interpretation

    Progress Reporting

    CRITICAL: Report ChEMBL check for every relevant paper

    Example workflow report:

    📄 [15/50] Screening: "Novel MmpL3 inhibitors..."
       Abstract score: 8 → Checking ChEMBL...
       ✓ ChEMBL: CHEMBL3456789 (34 data points)
       → Fetching full text...
       → Added to SUMMARY.md with ChEMBL link
    

    For papers not in ChEMBL:

    📄 [16/50] Screening: "Another paper..."
       Abstract score: 9 → Checking ChEMBL...
       ✗ Not in ChEMBL (likely too recent or review paper)
       → Fetching full text via Unpaywall...
    

    Helper Script Pattern

    For research sessions with many medicinal chemistry papers:

    Create check_chembl.py:

    #!/usr/bin/env python3
    import sys
    
    import requests
    
    def check_chembl(doi):
        """Check if DOI is in ChEMBL and return summary
    
        IMPORTANT: Must use DOI, not PMID. ChEMBL API does not accept PMID queries.
        """
    
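        # Query document (ONLY works with DOI); params= URL-encodes the DOI
        doc_url = "https://www.ebi.ac.uk/chembl/api/data/document.json"
        try:
            doc_response = requests.get(doc_url, params={"doi": doi}, timeout=10).json()
        except (requests.RequestException, ValueError):
            # Network failure or non-JSON body: return None so the caller sees a failed lookup
            return None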
    
        # Check if found
        if doc_response.get('page_meta', {}).get('total_count', 0) == 0:
            return {'in_chembl': False}
    
        doc = doc_response['documents'][0]
        chembl_id = doc['document_chembl_id']
    
        # Get activity count
        act_url = f"https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id={chembl_id}&limit=1"
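        try:
            act_response = requests.get(act_url, timeout=10).json()
            activity_count = act_response.get('page_meta', {}).get('total_count', 0)
        except (requests.RequestException, ValueError):
            # Count unavailable (network error or non-JSON body); fall back to 0
            activity_count = 0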
    
        return {
            'in_chembl': True,
            'chembl_id': chembl_id,
            'activity_count': activity_count,
            'doc_type': doc.get('doc_type'),
            'title': doc.get('title')
        }
    
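    if __name__ == "__main__":
        if len(sys.argv) != 2:
            sys.exit("Usage: check_chembl.py DOI")
        result = check_chembl(sys.argv[1])
    
        if result is None:
            # Distinguish a failed lookup from a paper that is genuinely absent
            print("✗ ChEMBL lookup failed (network or parse error)")
        elif result['in_chembl']:
            print(f"✓ {result['chembl_id']} ({result['activity_count']} activities)")
        else:
            print("✗ Not in ChEMBL")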
    

    Usage:

    python3 check_chembl.py "10.1021/jm401507s"
    # Output: ✓ CHEMBL3120156 (101 activities)
    

    Common Mistakes

    • Querying by PMID: Using PMID instead of DOI → Always returns 0 results; ChEMBL only accepts DOI queries
    • Skipping ChEMBL check: Not checking medicinal chemistry papers → Missing structured data that's already extracted
    • Checking non-medchem papers: Checking genomics/cell biology papers → Wasting time; they won't be in ChEMBL
    • Not reporting status: Silent ChEMBL checks → User can't see what's happening
    • Not adding to SUMMARY.md: Forgetting to include ChEMBL ID → Harder for user to access data later
    • Only using ChEMBL: Not fetching full text when paper is in ChEMBL → Missing context, methods, discussion
    • Parsing PDFs when in ChEMBL: Manually extracting tables when structured data is available → Wasting time and introducing errors

    Quick Reference

    | Task | Command |
    |------|---------|
    | Check if DOI in ChEMBL | curl "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=DOI" |
    | Get activity count | curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&limit=1" |
    | Get all activities | curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&limit=1000" |
    | Filter by activity type | curl "...activity.json?document_chembl_id=ID&standard_type=MIC" |
    | ChEMBL paper page | https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL_ID/ |
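A single request with `limit=1000` returns at most one page, so documents with more records need paging. The sketch below walks pages using `limit` and `offset` and stops once `page_meta.total_count` is reached. `iter_activities` and the caller-supplied `fetch_json` are hypothetical names; `fetch_json` could be something like `lambda url: requests.get(url, timeout=10).json()`.

```python
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data"


def iter_activities(document_chembl_id, fetch_json, page_size=1000):
    """Yield every activity record for a document, one page at a time.

    fetch_json is a caller-supplied function that takes a URL and
    returns the decoded JSON body as a dict.
    """
    offset = 0
    while True:
        url = (
            f"{BASE_URL}/activity.json"
            f"?document_chembl_id={document_chembl_id}"
            f"&limit={page_size}&offset={offset}"
        )
        page = fetch_json(url)
        activities = page.get("activities", [])
        yield from activities
        offset += len(activities)
        total = page.get("page_meta", {}).get("total_count", 0)
        # Stop on an empty page or once every record has been seen
        if not activities or offset >= total:
            break
```

Injecting `fetch_json` keeps the paging logic independent of the HTTP library and easy to exercise against canned responses.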

    Permissions

    Add to .claude/settings.local.json.template:

    "Bash(curl*https://www.ebi.ac.uk/chembl/api/data/*)",
    "WebFetch(domain:www.ebi.ac.uk)"
    

    Success Criteria

    ChEMBL check successful when:

    • Every medicinal chemistry paper (score ≥7) checked
    • ChEMBL status reported to user immediately
    • ChEMBL ID added to SUMMARY.md (if found)
    • Activity count noted in summary
    • papers-reviewed.json updated with ChEMBL status

    Next Steps

    After checking ChEMBL:

    • If found: Consider extracting structured data for highly relevant papers (≥9)
    • Continue with full text evaluation for context
    • For papers not in ChEMBL: Proceed with normal PDF/table parsing
    • Update SUMMARY.md with all findings

    Resources

    • Full Documentation: See docs/CHEMBL_INTEGRATION.md
    • ChEMBL API Docs: https://chembl.gitbook.io/chembl-interface-documentation/
    • ChEMBL Interface: https://www.ebi.ac.uk/chembl/
    Repository
    kthorn/research-superpower