    JSBtechnologies/article-extractor
    Writing

    About

    Extract clean article content from URLs (blog posts, articles, tutorials) and save as readable text.

    SKILL.md

    Article Extractor

    This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter, and saves the result as clean, readable text.

    When to Use This Skill

    Activate when the user:

    • Provides an article/blog URL and wants the text content
    • Asks to "download this article"
    • Wants to "extract the content from [URL]"
    • Asks to "save this blog post as text"
    • Needs clean article text without distractions

    How It Works

    Steps, in order:

    1. Check if tools are installed (reader or trafilatura)
    2. Download and extract article using best available tool
    3. Clean up the content (remove extra whitespace, format properly)
    4. Save to file with article title as filename
    5. Confirm location and show preview

    Installation Check

    Check for article extraction tools in this order:

    Option 1: reader (Recommended - Mozilla's Readability)

    command -v reader
    

    If not installed:

    npm install -g @mozilla/readability-cli
    # or
    npm install -g reader-cli
    

    Option 2: trafilatura (Python-based, very good)

    command -v trafilatura
    

    If not installed:

    pip3 install trafilatura
    

    Option 3: Fallback (curl + simple parsing)

    If neither tool is available, fall back to curl with basic text extraction (less reliable, but it needs only curl and Python 3).

    Extraction Methods

    Method 1: Using reader (Best for most articles)

    # Extract article
    reader "URL" > article.txt
    

    Pros:

    • Based on Mozilla's Readability algorithm
    • Excellent at removing clutter
    • Preserves article structure

    Method 2: Using trafilatura (Best for blogs/news)

    # Extract article
    trafilatura --URL "URL" --output-format txt > article.txt
    
    # Or with more options
    trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
    

    Pros:

    • Very accurate extraction
    • Good with various site structures
    • Handles multiple languages

    Options:

    • --no-comments: Skip comment sections
    • --no-tables: Skip data tables
    • --precision: Favor precision over recall
    • --recall: Extract more content (may include some noise)
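One way to keep these options straight is a small helper that maps an extraction mode to a set of flags. This is only a sketch: the function name `trafilatura_flags` and its mode names are hypothetical and not part of trafilatura; the flags it prints are the ones listed above.

```shell
# Hypothetical helper: print trafilatura flags for a given mode.
# "precision" and "recall" are illustrative mode names chosen here.
trafilatura_flags() {
    flags="--output-format txt --no-comments"
    case "$1" in
        precision) flags="$flags --precision" ;;  # less noise, may drop text
        recall)    flags="$flags --recall" ;;     # more text, may add noise
    esac
    printf '%s\n' "$flags"
}

# Usage (requires trafilatura and network access):
# trafilatura --URL "URL" $(trafilatura_flags precision) > article.txt
```

Leaving the mode empty yields the default balance between precision and recall.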

    Method 3: Fallback (curl + basic parsing)

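    # Download and extract basic content
    curl -s "URL" | python3 -c "
    from html.parser import HTMLParser
    import sys
    
    class ArticleExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.content = []
            self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
            self.content_tags = {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
            self.skip_depth = 0     # > 0 while inside a skipped element
            self.content_depth = 0  # > 0 while inside a content element
    
        def handle_starttag(self, tag, attrs):
            if tag in self.skip_tags:
                self.skip_depth += 1
            elif tag in self.content_tags:
                self.content_depth += 1
    
        def handle_endtag(self, tag):
            if tag in self.skip_tags and self.skip_depth > 0:
                self.skip_depth -= 1
            elif tag in self.content_tags and self.content_depth > 0:
                self.content_depth -= 1
    
        def handle_data(self, data):
            # Keep text only inside content elements and outside skipped ones
            if self.content_depth > 0 and self.skip_depth == 0 and data.strip():
                self.content.append(data.strip())
    
        def get_content(self):
            return '\n\n'.join(self.content)
    
    parser = ArticleExtractor()
    parser.feed(sys.stdin.read())
    print(parser.get_content())
    " > article.txt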
    

    Note: This is less reliable, but it needs only curl and Python 3.

    Getting Article Title

    Extract title for filename:

    Using reader:

    # reader outputs markdown with title at top
    TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
    

    Using trafilatura:

    # Get metadata including title
    TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
    

    Using curl (fallback):

    # Requires GNU grep (-P is not supported by BSD/macOS grep)
    TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | head -n 1 | sed 's/ - .*//' | sed 's/ | .*//')
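Where GNU grep is not available, the same title extraction can be sketched with tr and sed alone. The helper name `extract_title` is hypothetical; it reads HTML on stdin, assumes a lowercase `<title>` tag, and strips a trailing site name after " - " or " | " in the same way as the command above.

```shell
# Hypothetical helper: print the text of the first <title> element
# read from stdin, without a trailing " - Site" or " | Site" suffix.
extract_title() {
    # Join all lines first so a title split across lines still matches
    tr '\n' ' ' | sed -n '/<title/{
        s/<\/title>.*//
        s/.*<title[^>]*>//
        s/ - .*//
        s/ | .*//
        s/^ *//
        s/ *$//
        p
    }'
}

# Usage (requires network access):
# TITLE=$(curl -s "URL" | extract_title)
```

Cutting at the first `</title>` before removing the opening tag keeps later `<title>` elements (for example inside inline SVG) from being picked up instead.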
    

    Filename Creation

    Clean title for filesystem:

    # Get title
    TITLE="Article Title from Website"
    
    # Clean for filesystem (replace or remove special chars, limit length)
    # Note: tr needs -d to delete characters; an empty replacement set is an error
    FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')
    
    # Add extension
    FILENAME="${FILENAME}.txt"
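To see what this cleaning does to a concrete title, the steps can be wrapped in a function and called directly. This is a minimal sketch: the name `sanitize_filename` and the fallback name `article` for an empty title are assumptions made here, not part of the skill.

```shell
# Hypothetical helper: turn an article title into a safe .txt filename.
sanitize_filename() {
    # Replace / : | with '-', delete ? " < >, cap at 100 characters,
    # then trim leading and trailing spaces.
    cleaned=$(printf '%s' "$1" \
        | tr '/:|' '---' \
        | tr -d '?"<>' \
        | cut -c 1-100 \
        | sed 's/^ *//; s/ *$//')
    # Fall back to a generic name if nothing usable is left
    printf '%s.txt\n' "${cleaned:-article}"
}

sanitize_filename 'How to: Build a CLI? "Fast" | Blog'
# prints: How to- Build a CLI Fast - Blog.txt
```

An empty or fully stripped title produces `article.txt` instead of a bare `.txt`.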
    

    Complete Workflow

    ARTICLE_URL="https://example.com/article"
    
    # Check for tools
    if command -v reader &> /dev/null; then
        TOOL="reader"
        echo "Using reader (Mozilla Readability)"
    elif command -v trafilatura &> /dev/null; then
        TOOL="trafilatura"
        echo "Using trafilatura"
    else
        TOOL="fallback"
        echo "Using fallback method (may be less accurate)"
    fi
    
    # Extract article
    case $TOOL in
        reader)
            # Get content
            reader "$ARTICLE_URL" > temp_article.txt
    
            # Get title (first line after # in markdown)
            TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
            ;;
    
        trafilatura)
            # Get title from metadata
            METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
            TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
    
            # Get clean content
            trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
            ;;
    
        fallback)
            # Get title
            TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
            TITLE=${TITLE%% - *}  # Remove site name
            TITLE=${TITLE%% | *}  # Remove site name (alternate)
    
            # Get content (basic extraction)
            curl -s "$ARTICLE_URL" | python3 -c "
    from html.parser import HTMLParser
    import sys
    
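    class ArticleExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.content = []
            self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}
            self.content_tags = {'p', 'article', 'main', 'h1', 'h2', 'h3'}
            self.skip_depth = 0     # > 0 while inside a skipped element
            self.content_depth = 0  # > 0 while inside a content element
    
        def handle_starttag(self, tag, attrs):
            if tag in self.skip_tags:
                self.skip_depth += 1
            elif tag in self.content_tags:
                self.content_depth += 1
                if tag in {'h1', 'h2', 'h3'} and self.skip_depth == 0:
                    self.content.append('\n')  # extra spacing before headings
    
        def handle_endtag(self, tag):
            if tag in self.skip_tags and self.skip_depth > 0:
                self.skip_depth -= 1
            elif tag in self.content_tags and self.content_depth > 0:
                self.content_depth -= 1
    
        def handle_data(self, data):
            # Keep text only inside content elements and outside skipped ones
            if self.content_depth > 0 and self.skip_depth == 0 and data.strip():
                self.content.append(data.strip())
    
        def get_content(self):
            return '\n\n'.join(self.content)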
    
    parser = ArticleExtractor()
    parser.feed(sys.stdin.read())
    print(parser.get_content())
    " > temp_article.txt
            ;;
    esac
    
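    # Stop if nothing was extracted
    if [ ! -s temp_article.txt ]; then
        echo "Error: no content extracted from $ARTICLE_URL"
        rm -f temp_article.txt
        exit 1
    fi
    
    # Clean filename (tr -d deletes characters; fall back to "article" if the title is empty)
    FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-80 | sed 's/^ *//; s/ *$//')
    FILENAME="${FILENAME:-article}.txt"
    
    # Move to final filename
    mv temp_article.txt "$FILENAME"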
    
    # Show result
    echo "✓ Extracted article: $TITLE"
    echo "✓ Saved to: $FILENAME"
    echo ""
    echo "Preview (first 10 lines):"
    head -n 10 "$FILENAME"
    

    Error Handling

    Common Issues

    1. Tool not installed

    • Try alternate tool (reader → trafilatura → fallback)
    • Offer to install: "Install reader with: npm install -g reader-cli"

    2. Paywall or login required

    • Extraction tools may fail
    • Inform user: "This article requires authentication. Cannot extract."

    3. Invalid URL

    • Check URL format
    • Retry with redirects followed (for example, curl -L) if the first request returns nothing

    4. No content extracted

    • Site may use heavy JavaScript
    • Try fallback method
    • Inform user if extraction fails

    5. Special characters in title

    • Clean title for filesystem
    • Remove: /, :, ?, ", <, >, |
    • Replace with - or remove
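The URL and empty-output checks from issues 3 and 4 can be sketched as two small helpers. Both names (`is_valid_url`, `has_content`) are hypothetical, and the default threshold of 50 words is an arbitrary assumption for "looks like an article"; tune it as needed.

```shell
# Hypothetical helper: accept only http(s) URLs with something after the scheme
is_valid_url() {
    case "$1" in
        http://?*|https://?*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: succeed if the file is non-empty and has at least
# MIN_WORDS words (second argument, default 50; the default is an assumption)
has_content() {
    min_words="${2:-50}"
    [ -s "$1" ] || return 1
    # tr strips the padding some wc implementations add
    words=$(wc -w < "$1" | tr -d ' ')
    [ "$words" -ge "$min_words" ]
}

# Usage:
# is_valid_url "$URL" || { echo "Invalid URL: $URL"; exit 1; }
# has_content temp_article.txt || echo "Extraction looks empty; try another tool."
```

A word-count check catches pages where a tool exits successfully but returns only a cookie notice or a login prompt.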

    Output Format

    Saved File Contains:

    • Article title (if available)
    • Author (if available from tool)
    • Main article text
    • Section headings
    • No navigation, ads, or clutter

    What Gets Removed:

    • Navigation menus
    • Ads and promotional content
    • Newsletter signup forms
    • Related articles sidebars
    • Comment sections (optional)
    • Social media buttons
    • Cookie notices

    Tips for Best Results

    1. Use reader for most articles

    • Best all-around tool
    • Based on Firefox Reader View
    • Works on most news sites and blogs

    2. Use trafilatura for:

    • Academic articles
    • News sites
    • Blogs with complex layouts
    • Non-English content

    3. Fallback method limitations:

    • May include some noise
    • Less accurate paragraph detection
    • Better than nothing for simple sites

    4. Check extraction quality:

    • Always show preview to user
    • Ask if it looks correct
    • Offer to try different tool if needed

    Example Usage

    Simple extraction:

    # User: "Extract https://example.com/article"
    reader "https://example.com/article" > temp.txt
    TITLE=$(head -n 1 temp.txt | sed 's/^# //')
    FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
    mv temp.txt "$FILENAME"
    echo "✓ Saved to: $FILENAME"
    

    With error handling:

    if ! reader "$URL" > temp.txt 2>/dev/null; then
        if command -v trafilatura &> /dev/null; then
            trafilatura --URL "$URL" --output-format txt > temp.txt
        else
            echo "Error: Could not extract article. Install reader or trafilatura."
            exit 1
        fi
    fi
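The same idea generalizes to a loop that tries each tool in turn and stops at the first one that produces output. This is a sketch under assumptions: the wrapper names `run_reader` and `run_trafilatura` and the function `extract_with_fallback` are hypothetical, and the commands inside the wrappers are the ones used earlier in this document.

```shell
# Hypothetical wrappers: fail fast if the tool is not installed
run_reader() {
    command -v reader >/dev/null 2>&1 && reader "$1"
}
run_trafilatura() {
    command -v trafilatura >/dev/null 2>&1 &&
        trafilatura --URL "$1" --output-format txt
}

# Try each extractor in order; stop at the first one that exits
# successfully and leaves a non-empty output file.
# Prints the name of the tool that worked.
extract_with_fallback() {
    url=$1
    out=$2
    for tool in reader trafilatura; do
        if "run_$tool" "$url" > "$out" 2>/dev/null && [ -s "$out" ]; then
            echo "$tool"
            return 0
        fi
    done
    rm -f "$out"   # do not leave an empty file behind
    return 1
}

# Usage (requires network access):
# TOOL=$(extract_with_fallback "$URL" temp.txt) || { echo "Error: Could not extract article."; exit 1; }
```

Printing the tool name makes it easy to tell the user which extractor was used.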
    

    Best Practices

    • ✅ Always show preview after extraction (first 10 lines)
    • ✅ Verify extraction succeeded before saving
    • ✅ Clean filename for filesystem compatibility
    • ✅ Try fallback method if primary fails
    • ✅ Inform user which tool was used
    • ✅ Keep filename length reasonable (< 100 chars)

    After Extraction

    Display to user:

    1. "✓ Extracted: [Article Title]"
    2. "✓ Saved to: [filename]"
    3. Show preview (first 10-15 lines)
    4. File size and location
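The four items above could be printed by a helper along these lines. The name `report_result` is hypothetical, and the sketch assumes the file was saved in the current directory, as in the workflow earlier in this document.

```shell
# Hypothetical helper: summarize an extraction for the user.
# Assumes "$2" is a filename relative to the current directory.
report_result() {
    title=$1
    file=$2
    # tr strips the padding some wc implementations add
    size=$(wc -c < "$file" | tr -d ' ')
    echo "✓ Extracted: $title"
    echo "✓ Saved to: $(pwd)/$file ($size bytes)"
    echo ""
    echo "Preview (first 10 lines):"
    head -n 10 "$file"
}

# Usage:
# report_result "$TITLE" "$FILENAME"
```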

    Ask if needed:

    • "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
    • "Should I extract another article?"
    Repository
    jsbtechnologies/claude-skills