Combines YouTube lecture transcripts (txt files) with the corresponding PDF slides to create an interactive HTML page. Each slide is matched to its transcript segments, and the synchronized content is organized by key concepts.
This skill processes lecture materials and generates an HTML page with:
The matching process involves these steps:
Run the conversion script to standardize the transcript timestamp format:
python scripts/convert_transcript.py <transcript_input.txt> <transcript_output.txt>
This script:
Run the analysis script to understand the lecture materials:
python scripts/analyze_content.py <transcript.txt> <slides.pdf> [output_analysis.json]
This script:
content_analysis.json with all information
What to do:
Create a mapping.json file that connects concepts to slides and transcript segments.
Option A: Let Claude create the mapping
After running the analysis script, ask Claude to create the mapping by providing:
content_analysis.json file
Claude will analyze the content and create a comprehensive mapping.
Option B: Manual creation
Use the template in content_analysis.json as a starting point. See references/mapping_schema.md for complete documentation.
[
{
"title": "Key concept or insight",
"slide_indices": [0, 1, 2],
"transcript_segments": [
{
"start_time": "MM:SS or HH:MM:SS",
"end_time": "MM:SS or HH:MM:SS",
"text": "Full transcript text from this time range"
}
]
}
]
Key points:
[HH:MM:SS] or [MM:SS]
See references/mapping_schema.md for detailed schema documentation and examples.
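Before running the generation script, it can help to sanity-check a hand-written mapping. The sketch below is not part of the skill's scripts; `validate_mapping` is a hypothetical helper that assumes every concept carries the three keys shown in the template above and that times are written as MM:SS or HH:MM:SS. The authoritative rules are in references/mapping_schema.md.

```python
import json
import re

# Accepts MM:SS, H:MM:SS, or HH:MM:SS (assumed from the template above).
TIME_RE = re.compile(r"^(?:\d{1,2}:)?\d{1,2}:\d{2}$")


def validate_mapping(mapping):
    """Return a list of problems; an empty list means none were found."""
    if not isinstance(mapping, list):
        return ["top level must be a list of concept objects"]
    problems = []
    for i, concept in enumerate(mapping):
        if not isinstance(concept, dict):
            problems.append(f"concept {i}: must be an object")
            continue
        title = concept.get("title")
        if not isinstance(title, str) or not title:
            problems.append(f"concept {i}: missing or empty title")
        indices = concept.get("slide_indices")
        # An empty list is allowed (discussion-only concepts).
        if not isinstance(indices, list) or any(
            not isinstance(n, int) or n < 0 for n in indices
        ):
            problems.append(f"concept {i}: slide_indices must be non-negative integers")
        segments = concept.get("transcript_segments")
        if not isinstance(segments, list):
            problems.append(f"concept {i}: transcript_segments must be a list")
            continue
        for j, segment in enumerate(segments):
            if not isinstance(segment, dict):
                problems.append(f"concept {i}, segment {j}: must be an object")
                continue
            for key in ("start_time", "end_time"):
                value = segment.get(key)
                if not isinstance(value, str) or not TIME_RE.match(value):
                    problems.append(f"concept {i}, segment {j}: bad {key}")
            if not isinstance(segment.get("text"), str):
                problems.append(f"concept {i}, segment {j}: text must be a string")
    return problems


example = json.loads("""
[{"title": "Backpropagation",
  "slide_indices": [15],
  "transcript_segments": [
    {"start_time": "20:00", "end_time": "22:30", "text": "..."}
  ]}]
""")
print(validate_mapping(example))
```

To check a file on disk, load it with `json.load` and pass the result to `validate_mapping`.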
Run the generation script to create the final HTML page:
python scripts/match_lecture_content.py <transcript.txt> <slides.pdf> <mapping.json> [output.html]
The script:
Output: lecture_output.html (or specified filename)
The transcript must use timestamp markers:
[00:15] Welcome to today's lecture on machine learning.
[00:45] We'll start by discussing supervised learning...
[02:30] Now let's look at an example with house prices...
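As an illustration of this format, the sketch below splits a transcript into (seconds, text) pairs. It is not the skill's own parser; `parse_transcript` is a hypothetical helper that assumes each timestamp marker sits at the start of a line.

```python
import re

# Matches [MM:SS], [H:MM:SS], or [HH:MM:SS] at the start of a line.
TIMESTAMP_RE = re.compile(r"^\[(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\]\s*(.*)$")


def parse_transcript(text):
    """Return a list of (seconds, text) pairs, one per timestamped line."""
    entries = []
    for line in text.splitlines():
        match = TIMESTAMP_RE.match(line.strip())
        if not match:
            continue  # skip lines without a leading timestamp marker
        hours, minutes, seconds, body = match.groups()
        total = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        entries.append((total, body))
    return entries


sample = """[00:15] Welcome to today's lecture on machine learning.
[00:45] We'll start by discussing supervised learning...
[02:30] Now let's look at an example with house prices..."""

entries = parse_transcript(sample)
for seconds, body in entries:
    print(seconds, body)
```

Converting each marker to seconds makes it straightforward to compare or sort segments regardless of which timestamp format the transcript uses.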
Supported timestamp formats:
[HH:MM:SS] - Hours, minutes, seconds
[MM:SS] - Minutes, seconds
[H:MM:SS] - Single-digit hours
The script automatically:
Good concept granularity:
Too broad:
Too narrow:
Concept spans non-contiguous slides:
{
"title": "Example: Housing Price Prediction",
"slide_indices": [5, 8, 12],
"transcript_segments": [...]
}
Multiple transcript segments per concept:
{
"title": "Backpropagation",
"slide_indices": [15],
"transcript_segments": [
{"start_time": "20:00", "end_time": "22:30", "text": "..."},
{"start_time": "23:00", "end_time": "25:45", "text": "..."}
]
}
No slides for a concept (discussion only):
{
"title": "Q&A: Common Misconceptions",
"slide_indices": [],
"transcript_segments": [...]
}
The scripts require PyMuPDF for PDF processing:
pip install pymupdf --break-system-packages
Claude handles installation automatically when needed.
Complete workflow example:
# Step 1: Analyze
python scripts/analyze_content.py lecture.txt slides.pdf analysis.json
# Step 2: Create mapping (manually or with Claude's help)
# Edit analysis.json or create new mapping.json
# Step 3: Generate HTML
python scripts/match_lecture_content.py lecture.txt slides.pdf mapping.json output.html
references/mapping_schema.md - Complete JSON schema documentation with examples
references/example_mapping.json - Sample mapping for a machine learning lecture
"PyMuPDF not installed"
Run: pip install pymupdf --break-system-packages
Timestamps don't match: Ensure timestamps in mapping.json exactly match those in the transcript file.
Slides not displaying: Verify slide_indices are 0-based (first slide = 0, not 1).
Text looks messy: The cleaning is automatic. If issues persist, check for unusual formatting in the transcript.
Missing concepts: Review the analysis output to ensure all relevant transcript segments and slides are covered.
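The timestamp and slide-index problems above can be caught before generating the page. The following sketch is only an illustration: `check_mapping` is a hypothetical helper, not one of the skill's scripts, and it assumes that "exactly match" means the mapping's time string appears verbatim inside a bracketed marker in the transcript, and that the slide count is known (for example from the analysis output).

```python
import re

# Captures the time inside markers such as [02:30] or [1:02:30].
MARKER_RE = re.compile(r"\[((?:\d{1,2}:)?\d{1,2}:\d{2})\]")


def check_mapping(mapping, transcript_text, slide_count):
    """Return problems with timestamps and 0-based slide indices."""
    known = set(MARKER_RE.findall(transcript_text))
    problems = []
    for concept in mapping:
        title = concept.get("title", "<untitled>")
        for index in concept.get("slide_indices", []):
            # Valid 0-based indices run from 0 to slide_count - 1.
            if not 0 <= index < slide_count:
                problems.append(f"{title}: slide index {index} is out of range")
        for segment in concept.get("transcript_segments", []):
            for key in ("start_time", "end_time"):
                stamp = segment.get(key)
                if stamp not in known:
                    problems.append(f"{title}: {key} {stamp!r} not in transcript")
    return problems


transcript = "[20:00] First part.\n[22:30] Second part.\n"
mapping = [{
    "title": "Backpropagation",
    "slide_indices": [15],
    "transcript_segments": [
        {"start_time": "20:00", "end_time": "22:31", "text": "..."}
    ],
}]

for problem in check_mapping(mapping, transcript, slide_count=15):
    print(problem)
```

With a 15-slide deck, index 15 is flagged because it would only be valid under 1-based numbering, and the end time 22:31 is flagged because the transcript has no such marker.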