
    event-scraper (MattB543/event-scraper)
    Category: Data & Analytics


    About

    Create new event scraping scripts for websites. Use when adding a new event source to the Asheville Event Feed. ALWAYS start by detecting the CMS/platform and trying known API endpoints first...

    SKILL.md

    Event Scraper Skill

    Create new event scrapers that integrate with the Asheville Event Feed codebase. This skill provides patterns and guidance for the full lifecycle: exploration, development, testing, and production integration.


    ⚠️ CRITICAL: API-First Approach

    Scrapers run automatically on Vercel, which does NOT support browser automation.

    You MUST find the site's API before considering any other approach. Modern websites almost always fetch event data from a backend API - your job is to find and use that same API.

    Priority Order (STRICTLY follow this order):

    1. 🥇 Known CMS API - Check the Quick API Lookup table below FIRST
    2. 🥈 Internal JSON API - Site's own API endpoints (found via page analysis)
    3. 🥉 Public API - Official documented API (Ticketmaster, Eventbrite, etc.)
    4. 🏅 HTML with JSON-LD - Structured data embedded in HTML pages
    5. ❌ Browser scraping - NOT SUPPORTED on Vercel!

    🚀 Quick API Endpoint Lookup (TRY THESE FIRST!)

    Before doing any exploration, check if the site uses a known CMS/platform and try these endpoints directly:

    | CMS/Plugin | Detection Signs | API Endpoint | Key Parameters |
    | --- | --- | --- | --- |
    | WordPress + Tribe Events | /wp-content/, "The Events Calendar" | /wp-json/tribe/events/v1/events | start_date, per_page, page |
    | WordPress + All-in-One Event Calendar | "All-in-One Event Calendar" | /wp-json/osec/v1/events | start, end |
    | WordPress REST | /wp-content/, /wp-admin/ | /wp-json/wp/v2/posts?type=event | per_page, page |
    | Squarespace | squarespace.com, static1.squarespace.com | {any-page}?format=json | Append to URL |
    | Next.js | /_next/, __NEXT_DATA__ | /_next/data/{buildId}/{page}.json | Check page source |
    | Eventbrite | eventbrite.com | Internal API (see eventbrite.ts) | Complex; see example |
    | Ticketmaster venues | Venue ticket sales | Discovery API | venueId, apikey |
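
    As a sketch, the detection signs in the table can be folded into one small helper. The marker strings below come straight from the table; anything beyond them is an assumption, not an exhaustive fingerprint list.

```typescript
// Map raw page HTML to a likely platform using the detection signs above.
// Order matters: Tribe sites also contain /wp-content/, so the more
// specific markers are checked first. Returns null when nothing matches.
type Platform = 'wordpress-tribe' | 'wordpress' | 'squarespace' | 'nextjs' | null;

function detectPlatform(html: string): Platform {
  if (html.includes('The Events Calendar') || html.includes('tribe-events')) {
    return 'wordpress-tribe';
  }
  if (html.includes('/wp-content/') || html.includes('/wp-json/')) {
    return 'wordpress';
  }
  if (html.includes('squarespace.com')) {
    return 'squarespace';
  }
  if (html.includes('/_next/') || html.includes('__NEXT_DATA__')) {
    return 'nextjs';
  }
  return null;
}
```

    A null result only means no known marker appeared; fall back to the WebFetch analysis in Phase 1.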

    Example: Detecting and Using Tribe Events API

    If you detect WordPress + Tribe Events, immediately try:

    GET https://example.com/wp-json/tribe/events/v1/events?start_date=2025-01-01&per_page=50&page=1
    

    This often returns rich JSON with all event data, proper timezone handling, and pagination.


    Required Output Format

    Every scraper MUST return ScrapedEvent[]:

    interface ScrapedEvent {
      sourceId: string;      // Unique ID from source platform (prefix with source, e.g., "mx-123")
      source: EventSource;   // Add to types.ts if new source
      title: string;
      description?: string;
      startDate: Date;       // UTC Date object - see Timezone Decision Tree
      location?: string;     // Format: "Venue, Address, City, State"
      zip?: string;          // Zip code (from API or fallback utilities)
      organizer?: string;
      price?: string;        // "Free", "$20", "$15 - $30", "Unknown"
      url: string;           // Unique event URL (used for deduplication)
      imageUrl?: string;
      interestedCount?: number;
      goingCount?: number;
      timeUnknown?: boolean; // True if source only provided date, no time
    }
    
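
    For illustration only, a populated ScrapedEvent might look like the object below. All values are made up; the conventions follow the field comments above.

```typescript
// Hypothetical event from a Tribe-style source, ID prefixed per convention.
// The UTC startDate corresponds to 7:00 PM Eastern (EST, UTC-5).
const example = {
  sourceId: 'mx-123',
  source: 'YOUR_SOURCE',
  title: 'Bluegrass Night',
  startDate: new Date('2025-12-17T00:00:00Z'),
  location: 'Turgua Brewing, 123 Main St, Fairview, NC',
  zip: '28730',
  price: 'Free',
  url: 'https://example.com/events/bluegrass-night',
  timeUnknown: false,
};
```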

    PHASE 1: EXPLORATION

    Step 1.1: Detect CMS/Platform

    Use WebFetch to analyze the target site:

    WebFetch URL: https://example.com/events/
    Prompt: "Analyze this page:
    1. What CMS/platform is it? (WordPress, Squarespace, Next.js, custom)
    2. Look for: wp-content, wp-json, squarespace, _next, __NEXT_DATA__
    3. Is there JSON-LD structured data in script tags?
    4. What event plugin is used? (Tribe Events, All Events Calendar, etc.)
    5. Any hints about API endpoints in the HTML?"
    

    Step 1.2: Try Known API Endpoints

    Based on CMS detection, immediately try the known API endpoints from the Quick Lookup table:

    WebFetch URL: https://example.com/wp-json/tribe/events/v1/events?per_page=5
    Prompt: "Analyze this API response:
    1. Is it returning JSON event data?
    2. What fields are available? (title, start_date, venue, cost, etc.)
    3. Is there timezone information?
    4. What pagination mechanism is used?
    5. List all available fields for each event"
    

    Step 1.3: Test API Parameters

    Once you find a working API, test common parameters:

    | Parameter | Common Names | Purpose |
    | --- | --- | --- |
    | Future filter | start_date, after, from, startDate | Only get future events |
    | Page size | per_page, limit, count, pageSize | Control results per page |
    | Pagination | page, offset, cursor, skip | Navigate pages |
    | Sort | orderby, sort, sortValue | Order results |

    WebFetch URL: https://example.com/wp-json/tribe/events/v1/events?start_date=2025-01-01&per_page=50
    Prompt: "Does this API support:
    1. start_date parameter for filtering future events?
    2. per_page parameter for controlling page size?
    3. What's the maximum per_page allowed?
    4. How does pagination work (page number, next_url, etc.)?"
    

    Step 1.4: Document Field Mapping

    Create a mental map of API fields to ScrapedEvent fields:

    | API Field | ScrapedEvent Field | Transform Needed |
    | --- | --- | --- |
    | id | sourceId | Prefix: "mx-${id}" |
    | title | title | decodeHtmlEntities() |
    | utc_start_date | startDate | new Date(utc + 'Z') |
    | cost | price | Use directly or "Unknown" |
    | venue.venue | location | Build string, decode entities |
    | venue.zip | zip | Use directly or fallback |
    | url | url | Use directly |

    ⏰ Timezone Decision Tree (CRITICAL!)

    Getting timezone right is crucial. Follow this decision tree:

    Does the API provide a UTC field (utc_start_date, utc_time, etc.)?
    ├─ YES → Use directly: new Date(utcField.replace(' ', 'T') + 'Z')
    │        This is the SIMPLEST and most reliable approach.
    │
    └─ NO → Does the API provide ISO 8601 with offset? (e.g., "2025-12-16T19:00:00-05:00")
            ├─ YES → Use directly: new Date(isoString)
            │
            └─ NO → Does the API provide local time + timezone name? (e.g., "America/New_York")
                    ├─ YES → Use parseAsEastern(dateStr, timeStr)
                    │
                    └─ NO → DANGER! Ambiguous local time.
                            - Assume Eastern for NC events
                            - Use parseAsEastern(dateStr, timeStr)
                            - Verify with test insertion!
    
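
    The tree above can be collapsed into one parsing helper, sketched below. `parseAsEastern` is the project utility from lib/utils/timezone; it is injected as a parameter here only so the sketch stays self-contained.

```typescript
// One helper implementing the decision tree above.
type ParseAsEastern = (dateStr: string, timeStr: string) => Date;

interface DateFields {
  utc?: string;           // e.g. "2025-12-17 00:00:00"
  isoWithOffset?: string; // e.g. "2025-12-16T19:00:00-05:00"
  localDate?: string;     // e.g. "2025-12-16"
  localTime?: string;     // e.g. "19:00:00"
}

function parseEventDate(fields: DateFields, parseAsEastern: ParseAsEastern): Date {
  // 1. UTC field: simplest and most reliable
  if (fields.utc) return new Date(fields.utc.replace(' ', 'T') + 'Z');
  // 2. ISO 8601 with offset: the Date constructor parses it directly
  if (fields.isoWithOffset) return new Date(fields.isoWithOffset);
  // 3/4. Local time only: assume Eastern for NC events, then verify
  if (fields.localDate && fields.localTime) {
    return parseAsEastern(fields.localDate, fields.localTime);
  }
  throw new Error('No usable date field on event');
}
```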

    Timezone Verification

    ALWAYS verify timezone handling by comparing:

    1. API's local time field (e.g., start_date: "2025-12-16 19:00:00")
    2. API's UTC field (e.g., utc_start_date: "2025-12-17 00:00:00")
    3. Your parsed Date displayed in Eastern (should match #1)

    Example verification:

    API local:  19:00:00 (7 PM Eastern)
    API UTC:    00:00:00 next day (midnight UTC = 7 PM EST, correct!)
    Our parsed: 7:00:00 PM Eastern ✓
    
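
    That comparison can be automated with a small check, assuming the API exposes a local HH:mm start time to compare against (a sketch, not a project utility):

```typescript
// Render a parsed Date back into Eastern wall-clock time and compare it
// to the local time string the API reported. A mismatch means the
// timezone handling is wrong.
function easternTimeOf(d: Date): string {
  return d.toLocaleTimeString('en-US', {
    timeZone: 'America/New_York',
    hour12: false,
    hour: '2-digit',
    minute: '2-digit',
  });
}

function timezoneLooksCorrect(parsed: Date, apiLocalHHmm: string): boolean {
  return easternTimeOf(parsed) === apiLocalHHmm;
}
```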

    📍 Location String Best Practices

    Location strings often have issues. Follow these rules:

    1. Always Decode HTML Entities

    const venueName = decodeHtmlEntities(venue.venue); // "Rock & Roll" not "Rock &amp; Roll"
    const address = decodeHtmlEntities(venue.address);
    

    2. Avoid Duplicate City Names

    APIs often include city in both venue name and city field:

    // BAD: "Turgua Brewing, Fairview, Fairview, NC"
    // GOOD: "Turgua Brewing, 123 Main St, Fairview, NC"
    
    if (venue.city && !venue.address?.includes(venue.city)) {
      parts.push(venue.city);
    }
    

    3. Standard Format

    // Format: "Venue, Address, City, State"
    const parts = [venueName];
    if (venue.address) parts.push(decodeHtmlEntities(venue.address));
    if (venue.city && !venue.address?.includes(venue.city)) {
      parts.push(venue.city);
    }
    if (venue.state) parts.push(venue.state);
    location = parts.join(', ');
    

    4. Zip Code Fallbacks

    let zip = venue?.zip || undefined;
    if (!zip && venue?.geo_lat && venue?.geo_lng) {
      zip = getZipFromCoords(venue.geo_lat, venue.geo_lng);
    }
    if (!zip && venue?.city) {
      zip = getZipFromCity(venue.city);
    }
    

    PHASE 2: DEVELOPMENT

    Step 2.1: Add Source Type

    Add to lib/scrapers/types.ts:

    export type EventSource = 'AVL_TODAY' | ... | 'YOUR_SOURCE';
    

    Step 2.2: Create Scraper

    Create lib/scrapers/yoursource.ts:

    import { ScrapedEvent } from './types';
    import { fetchWithRetry } from '@/lib/utils/retry';
    import { isNonNCEvent } from '@/lib/utils/geo';
    import { decodeHtmlEntities } from '@/lib/utils/parsers';
    import { getZipFromCoords, getZipFromCity } from '@/lib/utils/geo';
    import { getTodayStringEastern } from '@/lib/utils/timezone';
    
    const API_BASE = 'https://example.com/wp-json/tribe/events/v1/events';
    const PER_PAGE = 50;
    const MAX_PAGES = 40;
    const DELAY_MS = 200;
    
    const API_HEADERS = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept': 'application/json',
    };
    
    export async function scrapeYourSource(): Promise<ScrapedEvent[]> {
      console.log('[YourSource] Starting scrape...');
    
      const allEvents: ScrapedEvent[] = [];
      const today = getTodayStringEastern();
      let page = 1;
      let hasMore = true;
    
      while (hasMore && page <= MAX_PAGES) {
        try {
          const url = new URL(API_BASE);
          url.searchParams.set('start_date', today);
          url.searchParams.set('per_page', PER_PAGE.toString());
          url.searchParams.set('page', page.toString());
    
          console.log(`[YourSource] Fetching page ${page}...`);
    
          const response = await fetchWithRetry(
            url.toString(),
            { headers: API_HEADERS, cache: 'no-store' },
            { maxRetries: 3, baseDelay: 1000 }
          );
    
          const data = await response.json();
          const events = data.events || [];
    
          console.log(`[YourSource] Page ${page}: ${events.length} events`);
    
          for (const event of events) {
            const formatted = formatEvent(event);
            if (formatted) allEvents.push(formatted);
          }
    
          hasMore = !!data.next_rest_url && page < data.total_pages;
          page++;
    
          if (hasMore) await new Promise(r => setTimeout(r, DELAY_MS));
        } catch (error) {
          console.error(`[YourSource] Error on page ${page}:`, error);
          break;
        }
      }
    
      // Filter non-NC events
      const ncEvents = allEvents.filter(ev => !isNonNCEvent(ev.title, ev.location));
      console.log(`[YourSource] Found ${ncEvents.length} NC events`);
    
      return ncEvents;
    }
    
    // ApiEvent: define an interface matching the source API's event shape
    function formatEvent(event: ApiEvent): ScrapedEvent | null {
      // Parse UTC date (see Timezone Decision Tree)
      const startDate = new Date(event.utc_start_date.replace(' ', 'T') + 'Z');
    
      if (isNaN(startDate.getTime()) || startDate < new Date()) {
        return null;
      }
    
      // Build location (see Location Best Practices)
      const venue = event.venue;
      let location: string | undefined;
      if (venue?.venue) {
        const parts = [decodeHtmlEntities(venue.venue)];
        if (venue.address) parts.push(decodeHtmlEntities(venue.address));
        if (venue.city && !venue.address?.includes(venue.city)) parts.push(venue.city);
        if (venue.state) parts.push(venue.state);
        location = parts.join(', ');
      }
    
      // Zip with fallbacks
      let zip = venue?.zip || undefined;
      if (!zip && venue?.geo_lat && venue?.geo_lng) {
        zip = getZipFromCoords(venue.geo_lat, venue.geo_lng);
      }
    
      return {
        sourceId: `ys-${event.id}`,
        source: 'YOUR_SOURCE',
        title: decodeHtmlEntities(event.title),
        description: event.description ? decodeHtmlEntities(event.description) : undefined,
        startDate,
        location,
        zip,
        organizer: event.organizer?.[0]?.organizer,
        price: event.cost || 'Unknown',
        url: event.url,
        imageUrl: event.image?.url,
        timeUnknown: event.all_day || false,
      };
    }
    

    Step 2.3: Create Test Script

    Create scripts/scrapers/test-yoursource.ts:

    import 'dotenv/config';
    import * as fs from 'fs';
    import * as path from 'path';
    
    const DEBUG_DIR = path.join(process.cwd(), 'debug-scraper-yoursource');
    if (!fs.existsSync(DEBUG_DIR)) {
      fs.mkdirSync(DEBUG_DIR, { recursive: true });
    }
    
    async function main() {
      console.log('='.repeat(60));
      console.log('SCRAPER TEST - YourSource');
      console.log('='.repeat(60));
    
      // Import scraper
      const { scrapeYourSource } = await import('../../lib/scrapers/yoursource');
    
      // Run scraper
      const startTime = Date.now();
      const events = await scrapeYourSource();
      const duration = Date.now() - startTime;
    
      // Save results
      fs.writeFileSync(
        path.join(DEBUG_DIR, 'events.json'),
        JSON.stringify(events, null, 2)
      );
    
      // Display summary
      console.log(`\nCompleted in ${(duration / 1000).toFixed(1)}s`);
      console.log(`Found ${events.length} events`);
    
      // Field completeness
      const total = events.length || 1; // guard divide-by-zero on empty runs
      const withImages = events.filter(e => e.imageUrl).length;
      const withPrices = events.filter(e => e.price && e.price !== 'Unknown').length;
      const withZips = events.filter(e => e.zip).length;
    
      console.log(`\nField Completeness:`);
      console.log(`  Images: ${withImages}/${events.length} (${Math.round(withImages/total*100)}%)`);
      console.log(`  Prices: ${withPrices}/${events.length} (${Math.round(withPrices/total*100)}%)`);
      console.log(`  Zips: ${withZips}/${events.length} (${Math.round(withZips/total*100)}%)`);
    
      // Sample events with timezone verification
      console.log(`\nSample Events (verify timezone!):`);
      for (const e of events.slice(0, 5)) {
        console.log(`\n${e.title}`);
        console.log(`  UTC:     ${e.startDate.toISOString()}`);
        console.log(`  Eastern: ${e.startDate.toLocaleString('en-US', { timeZone: 'America/New_York' })}`);
        console.log(`  Location: ${e.location || 'N/A'}`);
        console.log(`  Price: ${e.price}`);
      }
    
      console.log(`\nDebug files saved to: ${DEBUG_DIR}`);
    }
    
    main().catch(console.error);
    

    Step 2.4: Add to package.json

    "test:yoursource": "npx tsx scripts/scrapers/test-yoursource.ts"
    

    PHASE 3: VALIDATION

    Run the test script and verify output:

    npm run test:yoursource
    

    Validation Checklist

    • Timezone correct: Eastern times match expected (7 PM event shows as 7 PM ET)
    • No HTML entities: Titles/locations decoded (& not &amp;)
    • No duplicate cities: Location format is clean
    • Prices reasonable: Mix of Free, $X, Unknown
    • Zip codes populated: Most events have zips
    • URLs unique: No duplicates
    • Future events only: No past dates
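
    The mechanical items on this checklist can be checked in code. A sketch (the entity regex and the EventLike shape are assumptions, not project utilities):

```typescript
// Run the mechanical checklist items over scraper output and collect
// human-readable problem descriptions instead of throwing.
interface EventLike { title: string; location?: string; url: string; startDate: Date; }

function validateEvents(events: EventLike[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  const now = new Date();
  for (const e of events) {
    // Undecoded HTML entities like &amp; or &#8211;
    if (/&[a-z#0-9]+;/i.test(e.title + (e.location ?? ''))) {
      problems.push(`HTML entity in: ${e.title}`);
    }
    // URLs must be unique (they drive deduplication)
    if (seen.has(e.url)) problems.push(`Duplicate URL: ${e.url}`);
    seen.add(e.url);
    // Future events only
    if (e.startDate < now) problems.push(`Past event: ${e.title}`);
  }
  return problems;
}
```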

    PHASE 4: DATABASE TESTING

    ⚠️ MANDATORY: You MUST Complete This Phase

    DO NOT declare production-ready until you have inserted test events into the real database and verified they display correctly.

    Scraper output validation alone is NOT sufficient. Database insertion can reveal:

    • Timezone conversion issues
    • Field truncation
    • Constraint violations
    • Display problems

    Step 4.1: Insert Test Events

    // scripts/scrapers/test-yoursource-db.ts
    import 'dotenv/config';
    import { db } from '../../lib/db';
    import { events } from '../../lib/db/schema';
    import { eq } from 'drizzle-orm';
    import { scrapeYourSource } from '../../lib/scrapers/yoursource';
    
    async function main() {
      // Check existing
      const existing = await db.select().from(events).where(eq(events.source, 'YOUR_SOURCE'));
      console.log(`Existing YOUR_SOURCE events: ${existing.length}`);
    
      // Scrape a few events
      const scraped = await scrapeYourSource();
      const testEvents = scraped.slice(0, 5);
    
      // Insert
      for (const event of testEvents) {
        await db.insert(events).values({
          ...event,
          tags: [],
          lastSeenAt: new Date(),
        }).onConflictDoUpdate({
          target: events.url,
          set: { lastSeenAt: new Date() },
        });
        console.log(`Inserted: ${event.title}`);
      }
    
      // Verify - THIS IS THE CRITICAL CHECK
      console.log('\n=== VERIFICATION ===\n');
      const inserted = await db.select().from(events).where(eq(events.source, 'YOUR_SOURCE'));
    
      for (const e of inserted) {
        console.log(`${e.title}`);
        console.log(`  DB Date:  ${e.startDate}`);
        console.log(`  Eastern:  ${e.startDate.toLocaleString('en-US', { timeZone: 'America/New_York' })}`);
        console.log(`  Location: ${e.location}`);
        console.log(`  Zip:      ${e.zip}`);
        console.log(`  Price:    ${e.price}`);
        console.log('');
      }
    
      console.log('To cleanup: DELETE FROM events WHERE source = \'YOUR_SOURCE\';');
    }
    
    main().catch(console.error);
    

    Step 4.2: Verify Checklist

    • Events inserted without errors
    • Dates display correctly in Eastern time
    • All fields populated as expected
    • No HTML entities in text
    • Zip codes present

    Step 4.3: Cleanup Test Data

    npx tsx -e "
    import 'dotenv/config';
    import { db } from './lib/db';
    import { events } from './lib/db/schema';
    import { eq } from 'drizzle-orm';
    db.delete(events).where(eq(events.source, 'YOUR_SOURCE')).then(() => console.log('Cleaned up'));
    "
    

    PHASE 5: PRODUCTION INTEGRATION

    Step 5.1: Update Cron Route

    Edit app/api/cron/scrape/route.ts:

    // Add import
    import { scrapeYourSource } from '@/lib/scrapers/yoursource';
    
    // Add to Promise.allSettled array
    const [..., yourSourceResult] = await Promise.allSettled([
      ...,
      scrapeYourSource(),
    ]);
    
    // Extract results
    const yourSourceEvents = yourSourceResult.status === 'fulfilled' ? yourSourceResult.value : [];
    
    // Log failures
    if (yourSourceResult.status === 'rejected')
      console.error('[Scrape] YourSource failed:', yourSourceResult.reason);
    
    // Add to stats
    stats.scraping.total = ... + yourSourceEvents.length;
    
    // Add to allEvents
    const allEvents = [..., ...yourSourceEvents];
    
    // Update log message
    console.log(`... YourSource: ${yourSourceEvents.length} ...`);
    

    Step 5.2: Verify TypeScript Compiles

    npx tsc --noEmit
    

    PHASE 6: CLEANUP

    # Remove debug folder
    rm -rf debug-scraper-yoursource
    
    # Remove test DB script if created
    rm scripts/scrapers/test-yoursource-db.ts
    

    Integration Checklist

    • Exploration

      • Detected CMS/platform
      • Tried known API endpoints
      • Tested API parameters (start_date, per_page, page)
      • Documented field mapping
      • Identified timezone handling approach
    • Development

      • Added source to types.ts
      • Created scraper file
      • Created test script
      • Added npm script
    • Validation

      • Timezone verified (Eastern times correct)
      • HTML entities decoded
      • Location strings clean (no duplicates)
      • Field completeness acceptable
    • Database Testing (MANDATORY)

      • Inserted test events
      • Verified dates in database
      • Confirmed all fields correct
      • Cleaned up test data
    • Production

      • Added to cron route
      • TypeScript compiles
      • Ready for deployment

    Common Utilities Reference

    Timezone

    import { getTodayStringEastern, parseAsEastern } from '@/lib/utils/timezone';
    
    // Get today's date in Eastern (for API start_date param)
    const today = getTodayStringEastern(); // "2025-12-16"
    
    // Parse ambiguous local time as Eastern
    const date = parseAsEastern('2025-12-25', '19:00:00');
    

    Price Formatting

    import { formatPrice } from '@/lib/utils/parsers';
    
    formatPrice(0);        // "Free"
    formatPrice(25.50);    // "$26"
    formatPrice(null);     // "Unknown"
    

    HTML Entities

    import { decodeHtmlEntities } from '@/lib/utils/parsers';
    
    decodeHtmlEntities('Rock &amp; Roll &#8211; Live');
    // "Rock & Roll – Live"
    

    Location Filtering

    import { isNonNCEvent } from '@/lib/utils/geo';
    
    // Returns true if event should be EXCLUDED (not in NC)
    if (isNonNCEvent(event.title, event.location)) continue;
    

    Zip Code Fallbacks

    import { getZipFromCoords, getZipFromCity } from '@/lib/utils/geo';
    
    let zip = venue.zip || getZipFromCoords(lat, lng) || getZipFromCity(city);
    

    Troubleshooting

    API Returns 403/429

    • Add realistic headers (User-Agent, Accept, Referer)
    • Increase delays between requests (200-500ms)
    • Some APIs require Referer header matching the site
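
    A header set along these lines often helps (the Referer value is an assumption; match it to the site you are scraping):

```typescript
// Headers that often get past basic bot filtering.
const POLITE_HEADERS: Record<string, string> = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Accept': 'application/json, text/plain, */*',
  'Referer': 'https://example.com/events/', // some APIs require this to match the site
};
```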

    Dates Off by Hours

    • Check Timezone Decision Tree above
    • Verify API returns UTC vs local time
    • Compare API local time with your parsed Eastern time

    Duplicate Events

    • Ensure url is unique per event
    • For recurring events, append date to URL: ${url}#${date}
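
    As a sketch, that URL-uniquifying step looks like:

```typescript
// Make recurring-event URLs unique by appending the occurrence date as a
// fragment. Note the fragment uses the UTC calendar date, which is
// deterministic for deduplication purposes.
function dedupeUrl(url: string, startDate: Date): string {
  const day = startDate.toISOString().slice(0, 10); // YYYY-MM-DD
  return `${url}#${day}`;
}
```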

    Missing Events

    • Check pagination (off-by-one errors)
    • Verify start_date parameter format
    • API may have max page limit

    HTML in Titles/Locations

    • Apply decodeHtmlEntities() to ALL text fields
    • Check for <br>, <p> tags that need stripping
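
    A minimal stand-in for that cleanup is sketched below. The project's decodeHtmlEntities handles many more entities; this version covers only a few common ones for illustration.

```typescript
// Strip markup first, then decode entities.
function stripHtml(input: string): string {
  return input
    .replace(/<br\s*\/?>/gi, ' ') // line breaks become spaces
    .replace(/<[^>]+>/g, '')      // drop remaining tags
    .replace(/\s{2,}/g, ' ')
    .trim();
}

function decodeBasicEntities(input: string): string {
  return input
    .replace(/&amp;/g, '&')
    .replace(/&#8211;/g, '\u2013') // en dash
    .replace(/&quot;/g, '"')
    .replace(/&#039;/g, "'");
}
```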

    Duplicate City in Location

    • Check if city already in address before appending
    • Common with APIs that include full address + separate city field
    Repository: mattb543/asheville-event-feed