This skill should be used when users need to scrape content from websites, extract text from web pages, crawl and follow links, or download documentation from online sources...
Recursively scrape web pages with concurrent processing, extracting clean text content while following links. The scraper automatically handles URL deduplication, creates proper directory hierarchies based on URL structure, filters out unwanted content, and respects domain boundaries.
Use this skill when users request:
Install required dependencies:
pip install aiohttp beautifulsoup4 lxml aiofiles
These libraries provide:
- aiohttp - Async HTTP client for concurrent requests
- beautifulsoup4 - HTML parsing and content extraction
- lxml - Fast HTML/XML parser
- aiofiles - Async file I/O

Scrape a single page without following links:
python scripts/scrape.py <URL> <output-directory> --depth 0
Example:
python scripts/scrape.py https://example.com/article output/
This downloads only the specified page, extracts clean text content, and saves it to output/example.com/article.txt.
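The URL-to-path mapping can be sketched with a small helper (illustrative only; `url_to_output_path` is a hypothetical name, not the script's actual code):

```python
from urllib.parse import urlparse

def url_to_output_path(url: str, output_dir: str) -> str:
    """Mirror a URL's structure as a .txt file path under output_dir."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    if not segments or parsed.path.endswith("/"):
        segments.append("index")  # directory-style URLs become index.txt
    # Replace any original extension (e.g. .html) with .txt
    segments[-1] = segments[-1].rsplit(".", 1)[0] + ".txt"
    return "/".join([output_dir.rstrip("/"), parsed.netloc] + segments)
```

For example, `url_to_output_path("https://example.com/article", "output/")` yields `output/example.com/article.txt`.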
Scrape a page and follow links up to a specified depth:
python scripts/scrape.py <URL> <output-directory> --depth <N>
Example:
python scripts/scrape.py https://docs.example.com output/ --depth 2
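The --depth option bounds a breadth-first traversal of links; a sketch over a toy link graph (illustrative, not the script's implementation):

```python
from collections import deque

def crawl_order(start, links, max_depth):
    """Return URLs visited by a BFS that follows links up to max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth < max_depth:  # stop expanding once the depth budget is spent
            for nxt in links.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return order

# Toy link graph: page A links to B, B links to C
links = {"A": ["B"], "B": ["C"]}
```

With this graph, depth 0 visits only A, depth 1 adds B, and depth 2 reaches C.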
Depth levels:
- --depth 0 - Only the start URL(s)
- --depth 1 - Start URLs + all links on those pages
- --depth 2 - Start URLs + links + links found on those linked pages
- --depth 3+ - Continue following links to the specified depth

Prevent excessive scraping by setting a maximum page limit:
python scripts/scrape.py <URL> <output-directory> --depth 3 --max-pages 100
Example:
python scripts/scrape.py https://docs.example.com output/ --depth 3 --max-pages 50
Useful for:
Control the number of simultaneous requests for faster scraping:
python scripts/scrape.py <URL> <output-directory> --concurrent <N>
Example:
python scripts/scrape.py https://docs.example.com output/ --depth 2 --concurrent 20
Default is 10 concurrent requests. Increase for faster scraping, decrease for more conservative resource usage.
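Bounded concurrency of this kind is typically enforced with an asyncio.Semaphore; a minimal sketch (asyncio.sleep stands in for the real aiohttp request):

```python
import asyncio

async def fetch(url, sem, stats):
    async with sem:  # at most N coroutines run inside this block at once
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for session.get(url)
        stats["in_flight"] -= 1

async def scrape_all(urls, concurrent=10):
    """Fetch all URLs, never exceeding `concurrent` simultaneous requests."""
    sem = asyncio.Semaphore(concurrent)
    stats = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(fetch(u, sem, stats) for u in urls))
    return stats["peak"]
```

Running `asyncio.run(scrape_all(urls, concurrent=5))` never exceeds 5 in-flight requests, no matter how many URLs are queued.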
Guidelines:
- --concurrent 5 - conservative; gentler on the target server
- --concurrent 10 (default) - balanced for most sites
- --concurrent 20-30 - aggressive; for large sites that tolerate heavy traffic

By default, the scraper only follows links on the same domain as the start URL. This can be controlled:
Same domain only (default):
python scripts/scrape.py https://example.com output/ --depth 2
Follow external links:
python scripts/scrape.py https://example.com output/ --depth 2 --follow-external
Specify allowed domains:
python scripts/scrape.py https://example.com output/ --depth 2 --allowed-domains example.com docs.example.com blog.example.com
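The domain check presumably reduces to comparing each link's host against the allowed set; a sketch assuming exact-match semantics (which is why subdomains are listed explicitly above):

```python
from urllib.parse import urlparse

def domain_allowed(url: str, allowed_domains: set) -> bool:
    """True if the URL's host appears in the allowed set (exact match)."""
    return urlparse(url).netloc.lower() in allowed_domains

allowed = {"example.com", "docs.example.com", "blog.example.com"}
```

Under exact matching, an unlisted subdomain such as wiki.example.com would be rejected.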
Use --allowed-domains when:
Scrape from multiple starting points simultaneously:
python scripts/scrape.py <URL1> <URL2> <URL3> <output-directory>
Example:
python scripts/scrape.py https://example.com/docs https://example.com/guides https://example.com/tutorials output/ --depth 2
All start URLs are processed with the same configuration (depth, domain restrictions, etc.).
Customize HTTP request behavior:
python scripts/scrape.py <URL> <output-directory> --user-agent "MyBot/1.0" --timeout 60
Options:
- --user-agent - Custom User-Agent header (default: "Mozilla/5.0 (compatible; WebScraper/1.0)")
- --timeout - Request timeout in seconds (default: 30)

Example:
python scripts/scrape.py https://example.com output/ --depth 2 --user-agent "MyResearchBot/1.0 (+https://mysite.com/bot)" --timeout 45
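The script sends these options via aiohttp; for illustration, the same settings map onto a stdlib request like this (sketch only):

```python
import urllib.request

def build_request(url: str, user_agent: str = "Mozilla/5.0 (compatible; WebScraper/1.0)"):
    """Build a request carrying a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = build_request("https://example.com", user_agent="MyResearchBot/1.0")
# urllib.request.urlopen(req, timeout=45) would perform the fetch with a 45 s timeout
```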
Enable detailed logging to monitor scraping progress:
python scripts/scrape.py <URL> <output-directory> --verbose
Verbose mode shows:
The scraper creates a directory hierarchy that mirrors the URL structure:
output/
├── example.com/
│   ├── index.txt              # https://example.com/
│   ├── about.txt              # https://example.com/about
│   ├── docs/
│   │   ├── index.txt          # https://example.com/docs/
│   │   ├── getting-started.txt
│   │   └── api/
│   │       └── reference.txt
│   └── blog/
│       ├── post-1.txt
│       └── post-2.txt
├── docs.example.com/
│   └── guide.txt
└── _metadata.json
Each scraped page is saved as a text file with the following structure:
URL: https://example.com/docs/guide
Title: Getting Started Guide
Scraped: 2025-10-21T14:30:00
================================================================================
[Clean extracted text content]
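Rendering that header block is straightforward; a hypothetical helper matching the format above (`format_page` is illustrative, not the script's code):

```python
from datetime import datetime

def format_page(url: str, title: str, text: str, scraped_at: str = "") -> str:
    """Render a scraped page in the saved-file format shown above."""
    stamp = scraped_at or datetime.now().isoformat(timespec="seconds")
    header = f"URL: {url}\nTitle: {title}\nScraped: {stamp}\n"
    return header + "=" * 80 + "\n\n" + text

page = format_page("https://example.com/docs/guide", "Getting Started Guide",
                   "Welcome to the guide.", scraped_at="2025-10-21T14:30:00")
```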
_metadata.json contains scraping session information:
{
  "start_time": "2025-10-21T14:30:00",
  "end_time": "2025-10-21T14:35:30",
  "pages_scraped": 42,
  "total_visited": 45,
  "errors": {
    "https://example.com/broken": "HTTP 404",
    "https://example.com/slow": "Timeout"
  }
}
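A quick way to triage a finished session is to load _metadata.json and list failures (a sketch assuming the structure shown above):

```python
import json

def summarize_session(metadata_text: str) -> str:
    """Summarize page counts and errors from _metadata.json content."""
    meta = json.loads(metadata_text)
    lines = [f"Scraped {meta['pages_scraped']} of {meta['total_visited']} visited pages"]
    for url, err in meta.get("errors", {}).items():
        lines.append(f"FAILED {url}: {err}")
    return "\n".join(lines)

sample = '{"pages_scraped": 42, "total_visited": 45, "errors": {"https://example.com/broken": "HTTP 404"}}'
summary = summarize_session(sample)
```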
The scraper extracts clean text content by:
- Focusing on main content - Prioritizes <main>, <article>, or <body> tags
- Removing unwanted elements - Strips out non-content markup such as scripts, styles, and navigation
- Filtering common patterns - Removes recurring boilerplate text
- Preserving structure - Maintains line breaks between content blocks
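A minimal version of this pipeline with BeautifulSoup (a sketch; the script's actual tag and pattern lists may differ):

```python
from bs4 import BeautifulSoup

def extract_text(html: str) -> str:
    """Extract clean text, preferring <main>/<article> and dropping non-content tags."""
    soup = BeautifulSoup(html, "html.parser")  # lxml is faster when installed
    # Remove non-content elements before extracting text
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    # Prefer <main>, then <article>, then <body>
    root = soup.find("main") or soup.find("article") or soup.body or soup
    lines = (line.strip() for line in root.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)
```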
Common unwanted patterns automatically removed:
Scrape an entire documentation site with reasonable limits:
python scripts/scrape.py https://docs.example.com docs-archive/ --depth 3 --max-pages 200 --concurrent 15
Download all blog posts from a blog (following pagination):
python scripts/scrape.py https://blog.example.com blog-archive/ --depth 2 --max-pages 500
Gather text content from multiple related sources:
python scripts/scrape.py https://research.edu/papers https://research.edu/publications research-data/ --depth 2 --allowed-domains research.edu --concurrent 20
Test configuration on a small sample before full scrape:
python scripts/scrape.py https://largesite.com sample/ --depth 2 --max-pages 20 --verbose
Then run full scrape after confirming results:
python scripts/scrape.py https://largesite.com full-archive/ --depth 3 --max-pages 500 --concurrent 15
Scrape across multiple authorized domains:
python scripts/scrape.py https://main.example.com knowledge-base/ --depth 3 --allowed-domains main.example.com docs.example.com wiki.example.com --max-pages 300
When users request web scraping:
Identify the scope:
Configure the scraper:
Run with monitoring:
Verify output:
- Check _metadata.json for statistics

Process the content:
Command structure:
python scripts/scrape.py <URL> [URL2 ...] <output-dir> [options]
Essential options:
- -d, --depth N - Maximum link depth (default: 2)
- -m, --max-pages N - Maximum pages to scrape
- -c, --concurrent N - Concurrent requests (default: 10)
- -f, --follow-external - Follow external links
- -a, --allowed-domains - Specify allowed domains
- -v, --verbose - Detailed output
- -u, --user-agent - Custom User-Agent
- -t, --timeout - Request timeout in seconds

Get full help:
python scripts/scrape.py --help
- Test with --depth 1 --max-pages 10 before large scrapes
- Set --max-pages for initial runs
- Check _metadata.json for errors and statistics

Common issues:
- Missing dependencies: pip install aiohttp beautifulsoup4 lxml aiofiles
- Timeouts: increase --timeout or reduce --concurrent
- Slow scraping: increase --concurrent (e.g., 20-30)
- Resource pressure: reduce --concurrent or use --max-pages to chunk the work
- Too many pages queued: reduce --depth or enable same-domain-only (default)
- Failed pages: check the _metadata.json errors section for specific issues

Limitations:
- No built-in rate limiting (use --concurrent to control request rate)

After scraping, use the Read tool to load content into context:
# Read a specific scraped page
Read file_path: output/docs.example.com/guide.txt
# Search across all scraped content
Grep pattern: "API endpoint" path: output/ -r
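The same search can also be scripted in Python when those tools aren't available (sketch; the demo directory stands in for output/):

```python
import tempfile
from pathlib import Path

def search_scraped(root: str, needle: str) -> list:
    """Return paths of scraped .txt files whose content contains needle."""
    return sorted(
        str(p) for p in Path(root).rglob("*.txt")
        if needle in p.read_text(encoding="utf-8", errors="ignore")
    )

# Demo against a throwaway directory standing in for output/
demo = Path(tempfile.mkdtemp())
(demo / "a.txt").write_text("Docs about the API endpoint.", encoding="utf-8")
(demo / "b.txt").write_text("Unrelated page.", encoding="utf-8")
hits = search_scraped(str(demo), "API endpoint")
```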
The scraper tracks visited URLs in memory during a session but doesn't persist this between runs. To avoid re-downloading:
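One approach (not built into the script) is to persist the visited set to a JSON file between runs:

```python
import json, os, tempfile
from pathlib import Path

def load_visited(path: str) -> set:
    """Reload previously visited URLs, or an empty set on the first run."""
    p = Path(path)
    return set(json.loads(p.read_text())) if p.exists() else set()

def save_visited(path: str, visited: set) -> None:
    """Persist the visited set as sorted JSON for the next run."""
    Path(path).write_text(json.dumps(sorted(visited)))

# Round trip, using a temp file as a stand-in for e.g. output/visited.json
state = os.path.join(tempfile.mkdtemp(), "visited.json")
save_visited(state, {"https://example.com/", "https://example.com/about"})
visited = load_visited(state)
```

A wrapper could load this set before invoking the scraper's queue logic and save it afterward.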
Chain the scraper with other processing:
# Scrape then process with custom script
python scripts/scrape.py https://example.com output/ --depth 2
python your_analysis_script.py output/
The main web scraping tool implementing concurrent crawling, content extraction, and intelligent filtering. Key features:
- Uses asyncio and aiohttp for high-performance concurrent requests
- Tracks visited_urls and queued_urls sets to prevent re-downloading
- Writes _metadata.json with scraping statistics and errors

The script can be executed directly and includes comprehensive command-line help via --help.