Extract clean article content from URLs (blog posts, articles, tutorials) and save as readable text.
This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter. Saves clean, readable text.
Activate when the user asks to extract, save, or clean up an article, blog post, or tutorial from a URL.
Check for article extraction tools in this order:

```bash
command -v reader
```

If not installed:

```bash
npm install -g @mozilla/readability-cli
# or
npm install -g reader-cli
```

```bash
command -v trafilatura
```

If not installed:

```bash
pip3 install trafilatura
```

If neither tool is available, fall back to basic `curl` + text extraction (less reliable, but it works without dependencies).
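The detection order above can also be sketched in Python using only the standard library (a minimal sketch; `pick_tool` is an illustrative helper name, not part of either tool):

```python
import shutil

def pick_tool():
    """Return the first available extractor, in preference order."""
    for tool in ('reader', 'trafilatura'):
        if shutil.which(tool) is not None:
            return tool
    return 'fallback'

print(pick_tool())
```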
```bash
# Extract article
reader "URL" > article.txt
```

Pros: uses Mozilla's Readability algorithm (the same engine as Firefox's Reader View) and outputs Markdown with the title as the first `#` heading, which makes title extraction easy.
```bash
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
```

Pros: fine-grained control over what gets extracted, plus JSON metadata output that includes the article title.

Options:

- `--no-comments`: Skip comment sections
- `--no-tables`: Skip data tables
- `--precision`: Favor precision over recall
- `--recall`: Extract more content (may include some noise)
```bash
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.skip_depth = 0
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.content_tags = {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.content_tags:
            self.in_content = True

    def handle_endtag(self, tag):
        # Reset state on close tags; without this, script/style/nav text leaks in
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.content_tags:
            self.in_content = False

    def handle_data(self, data):
        if self.in_content and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt
```

Note: This is less reliable but works without dependencies.
Extract title for filename:

```bash
# reader outputs Markdown with the title at the top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')

# trafilatura: get metadata (including title) as JSON
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")

# Fallback: grab the <title> tag and strip a trailing site name (requires GNU grep for -P)
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
```
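The fallback `<title>` cleanup can also be done in Python (a sketch; `extract_title` is a hypothetical helper, and the regex assumes a simple, well-formed `<title>` tag):

```python
import re

def extract_title(html, default='Article'):
    """Pull the <title> text and strip a trailing site name."""
    m = re.search(r'<title[^>]*>([^<]+)</title>', html, re.IGNORECASE)
    if not m:
        return default
    title = m.group(1).strip()
    # Site names are commonly appended after ' - ' or ' | '
    return title.split(' - ')[0].split(' | ')[0]

print(extract_title('<head><title>My Post - Example Blog</title></head>'))
# → My Post
```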
Clean title for filesystem:

```bash
# Get title
TITLE="Article Title from Website"

# Clean for filesystem: replace /, :, | with -, delete the rest, limit length
# (tr with an empty second set is an error; use tr -d to delete characters)
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
```
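A Python equivalent of the cleanup above (a minimal sketch; `clean_title` is an illustrative name, and the character set mirrors the shell pipeline):

```python
import re

def clean_title(title, max_len=100):
    """Turn an article title into a filesystem-safe filename."""
    s = title.replace('/', '-').replace(':', '-').replace('|', '-')
    s = re.sub(r'[?"<>]', '', s)      # characters with no good substitute
    return s[:max_len].rstrip() + '.txt'

print(clean_title('Async/Await in Python: A Guide?'))
# → Async-Await in Python- A Guide.txt
```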
```bash
ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt
        # Get title (first line, after the leading '# ' in Markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;
    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;
    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}   # Remove site name
        TITLE=${TITLE%% | *}   # Remove site name (alternate separator)
        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.skip_depth = 0
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}
        self.content_tags = {'p', 'article', 'main'}

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.content_tags:
            self.in_content = True
        if tag in {'h1', 'h2', 'h3'}:
            self.content.append('\n')

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.content_tags:
            self.in_content = False

    def handle_data(self, data):
        if self.in_content and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

# Clean filename: replace unsafe chars with -, delete the rest, trim, limit length
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
```
Handle these common failure cases:

1. Tool not installed
2. Paywall or login required
3. Invalid URL
4. No content extracted
5. Special characters in title: replace `/`, `:`, `?`, `"`, `<`, `>`, `|` with `-` or remove them

Best practices:

1. Use reader for most articles
2. Use trafilatura for: sites where reader struggles, or when you need metadata (JSON) or fine-grained filtering (comments, tables, precision/recall)
3. Fallback method limitations: no real content detection, so output may include navigation text or other noise
4. Check extraction quality: preview the first lines of the saved file before reporting success
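The quality check can be automated with a rough heuristic (a sketch only; the length threshold and paywall markers below are illustrative assumptions, not a fixed rule):

```python
def looks_extracted(text, min_chars=500):
    """Rough check that extraction produced real article text."""
    if len(text.strip()) < min_chars:
        return False  # probably empty, or just a cookie banner
    # Phrases that suggest a paywall or login wall leaked through (illustrative)
    markers = ('subscribe to continue', 'sign in to read', 'enable javascript')
    lower = text.lower()
    return not any(m in lower for m in markers)

print(looks_extracted('word ' * 200))
# → True
```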
Simple extraction:

```bash
# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
```
With error handling:

```bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
```
Display to user: the extracted title, the saved filename, and a short preview (first ~10 lines).
Ask if needed: