Recover deleted GitHub content using the Wayback Machine and Archive.org APIs. Use when repositories, files, issues, PRs, or wiki pages have been deleted from GitHub but may persist in web archives...
Purpose: Recover deleted GitHub content (README files, issues, PRs, wiki pages, repository metadata) from the Internet Archive's Wayback Machine when content is no longer available on GitHub.
Complementary Skills:
The Wayback Machine Archives Web Pages, Not Git Repositories:
- You cannot `git clone` from archived content.

What CAN Be Recovered:
- Rendered pages: README and other file views, issues, PRs, wiki pages, repository metadata, and fork listings.

What CANNOT Be Recovered:
- Complete git history (check surviving forks for that).
- Raw file contents from `raw.githubusercontent.com`, which is rarely archived.
Check if a repository page was archived:

```bash
curl -s "https://archive.org/wayback/available?url=github.com/owner/repo" | jq
```

Search for all archived URLs under a repository:

```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/*&output=json&collapse=urlkey" | head -50
```

Access an archived snapshot:

```
https://web.archive.org/web/{TIMESTAMP}/https://github.com/owner/repo
```
Understanding GitHub's URL structure is essential for constructing archive queries.
| Content Type | URL Pattern |
|---|---|
| Homepage | github.com/{owner}/{repo} |
| Commits list | github.com/{owner}/{repo}/commits/{branch} |
| Individual commit | github.com/{owner}/{repo}/commit/{full-sha} |
| Fork network | github.com/{owner}/{repo}/network/members |
| Content Type | URL Pattern |
|---|---|
| File view | github.com/{owner}/{repo}/blob/{branch}/{path/to/file} |
| Directory view | github.com/{owner}/{repo}/tree/{branch}/{directory} |
| File history | github.com/{owner}/{repo}/commits/{branch}/{path/to/file} |
| Raw file | raw.githubusercontent.com/{owner}/{repo}/{branch}/{path} |
Note: blob = files, tree = directories. Raw URLs are rarely archived compared to rendered views.
| Content Type | URL Pattern |
|---|---|
| Pull request | github.com/{owner}/{repo}/pull/{number} |
| PR files | github.com/{owner}/{repo}/pull/{number}/files |
| PR commits | github.com/{owner}/{repo}/pull/{number}/commits |
| Issue | github.com/{owner}/{repo}/issues/{number} |
| Wiki page | github.com/{owner}/{repo}/wiki/{page-name} |
| Release | github.com/{owner}/{repo}/releases/tag/{tag-name} |
| All PRs | github.com/{owner}/{repo}/pulls?state=all |
| All issues | github.com/{owner}/{repo}/issues?state=all |
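For scripting, the patterns in the tables above can be kept as format strings. A minimal sketch (the dictionary keys and helper name are ours, not GitHub's):

```python
# Hypothetical helper mapping the URL patterns above to format strings.
GITHUB_PATTERNS = {
    "homepage": "github.com/{owner}/{repo}",
    "file": "github.com/{owner}/{repo}/blob/{branch}/{path}",
    "issue": "github.com/{owner}/{repo}/issues/{number}",
    "pull": "github.com/{owner}/{repo}/pull/{number}",
    "wiki": "github.com/{owner}/{repo}/wiki/{page}",
}

def github_url(kind: str, **parts: str) -> str:
    """Fill one of the URL patterns above with the given parts."""
    return GITHUB_PATTERNS[kind].format(**parts)

print(github_url("issue", owner="owner", repo="repo", number="123"))
# github.com/owner/repo/issues/123
```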
The Capture Index (CDX) API provides structured search across all archived URLs:

```
https://web.archive.org/cdx/search/cdx?url={URL}&output=json
```
| Parameter | Effect | Example |
|---|---|---|
| `matchType=exact` | Exact URL only (default) | Single page |
| `matchType=prefix` | All URLs starting with path | All repo content |
| `url=.../*` | Wildcard (same as prefix) | `github.com/owner/repo/*` |
| `from=YYYY` | Start date filter | `from=2023` |
| `to=YYYY` | End date filter | `to=2024` |
| `filter=statuscode:200` | Only successful captures | Skip redirects/errors |
| `collapse=timestamp:8` | One capture per day | Reduce duplicates |
| `collapse=urlkey` | Unique URLs only | List all archived pages |
| `limit=N` | Limit results | `limit=100` |
| `output=json` | JSON format | Machine-readable |
Find all archived pages under a repository:

```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/facebook/react/*&matchType=prefix&output=json&collapse=urlkey"
```

Find archived issues for a specific repository:

```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/issues/*&output=json&collapse=urlkey&filter=statuscode:200"
```

Find archived snapshots of a specific file:

```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/blob/*/path/to/file&output=json"
```

Check for archived snapshots near a specific date:

```bash
curl -s "https://archive.org/wayback/available?url=github.com/owner/repo&timestamp=20230615"
```
Example CDX response:

```json
[
  ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
  ["com,github)/owner/repo", "20230615142311", "https://github.com/owner/repo", "text/html", "200", "ABC123...", "12345"]
]
```
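The array-of-arrays format is easy to convert to dicts by zipping the header row with each data row. A minimal sketch using the sample response above (helper name is ours):

```python
# Sample CDX response: first row is the header, the rest are captures.
cdx_rows = [
    ["urlkey", "timestamp", "original", "mimetype",
     "statuscode", "digest", "length"],
    ["com,github)/owner/repo", "20230615142311",
     "https://github.com/owner/repo", "text/html", "200", "ABC123", "12345"],
]

def cdx_to_dicts(rows):
    """Zip the header row with each data row; empty if only the header."""
    if len(rows) <= 1:
        return []
    header = rows[0]
    return [dict(zip(header, row)) for row in rows[1:]]

snapshots = cdx_to_dicts(cdx_rows)
print(snapshots[0]["timestamp"])  # 20230615142311
```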
Scenario: Repository or file has been deleted, need to recover file contents.
Step 1: Search for blob URLs

```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/blob/*/README.md&output=json"
```

Step 2: Construct the archive URL from the timestamp

```
https://web.archive.org/web/20230615142311/https://github.com/owner/repo/blob/main/README.md
```

Step 3: Extract content manually, or use waybackpack

```bash
pip install waybackpack
waybackpack "https://github.com/owner/repo/blob/main/README.md" -d output_dir
```
Forensic Value: Recover documentation, configuration files, or evidence that existed at specific points in time.
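When extracting content in Step 3, the Wayback Machine's `id_` URL modifier serves the archived response as originally captured, without the archive's injected toolbar or rewritten links, which makes parsing easier. A minimal sketch (helper name is ours; timestamp and URL are placeholders):

```python
def raw_snapshot_url(timestamp: str, original_url: str) -> str:
    """Build a Wayback URL that returns the unmodified archived response
    (the 'id_' modifier suppresses the banner and link rewriting)."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"

url = raw_snapshot_url(
    "20230615142311",
    "https://github.com/owner/repo/blob/main/README.md",
)
print(url)
```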
Scenario: Issue or PR was deleted and you need the original content.
Step 1: Query for issue page snapshots

```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/issues/123*&output=json"
```

Step 2: Access the archived page

```
https://web.archive.org/web/{TIMESTAMP}/https://github.com/owner/repo/issues/123
```

Step 3: If the issue number is unknown, search the archived issue listing

```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/issues?state=all&output=json"
```
Note: Archive Team has actively crawled GitHub issues and PRs since 2020, so issue content has a higher recovery success rate than file contents.
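If only the listing page was archived, issue numbers can still be recovered from the `original` URLs in CDX results. A minimal sketch with illustrative URLs:

```python
import re

# Example "original" URLs as returned by a CDX query against .../issues/*
archived_urls = [
    "https://github.com/owner/repo/issues/123",
    "https://github.com/owner/repo/issues/123#issuecomment-1",
    "https://github.com/owner/repo/issues/45",
]

def issue_numbers(urls):
    """Extract the distinct issue numbers referenced by the URLs."""
    nums = set()
    for u in urls:
        m = re.search(r"/issues/(\d+)", u)
        if m:
            nums.add(int(m.group(1)))
    return sorted(nums)

print(issue_numbers(archived_urls))  # [45, 123]
```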
Scenario: Repository is deleted, but forks may contain the full git history.
Step 1: Search for the archived fork network page

```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/network/members&output=json"
```

Step 2: Access the archived network page

```
https://web.archive.org/web/{TIMESTAMP}/https://github.com/owner/repo/network/members
```

Step 3: Extract fork usernames from the archived page, then check whether the forks still exist

```bash
# Check if a fork still exists (200 = yes, 404 = no)
curl -s -o /dev/null -w "%{http_code}" https://github.com/forker/repo
```
Forensic Value: Active forks contain complete git history including all commits. This often yields better results than trying to recover individual files.
Scenario: Repository wiki has been deleted or made private.
Step 1: Search for wiki pages

```bash
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/wiki*&output=json&collapse=urlkey"
```

Step 2: Access the wiki home page or specific pages

```
https://web.archive.org/web/{TIMESTAMP}/https://github.com/owner/repo/wiki
https://web.archive.org/web/{TIMESTAMP}/https://github.com/owner/repo/wiki/Page-Name
```
The workflows above can be automated with a small helper class:

```python
import re
import requests  # third-party: pip install requests
from time import sleep
from typing import Dict, List, Optional


class WaybackGitHubRecovery:
    CDX_API = "https://web.archive.org/cdx/search/cdx"
    AVAILABILITY_API = "https://archive.org/wayback/available"
    ARCHIVE_URL = "https://web.archive.org/web"

    def check_availability(self, url: str,
                           timestamp: Optional[str] = None) -> Optional[Dict]:
        """Check if a URL has any archived snapshots."""
        params = {"url": url}
        if timestamp:
            params["timestamp"] = timestamp
        resp = requests.get(self.AVAILABILITY_API, params=params, timeout=30)
        data = resp.json()
        if data.get("archived_snapshots", {}).get("closest"):
            return data["archived_snapshots"]["closest"]
        return None

    def search_cdx(self, url: str, match_type: str = "prefix",
                   collapse: str = "urlkey", limit: int = 1000) -> List[Dict]:
        """Search the CDX API for archived URLs."""
        params = {
            "url": url,
            "output": "json",
            "matchType": match_type,
            "collapse": collapse,
            "filter": "statuscode:200",
            "limit": limit,
        }
        resp = requests.get(self.CDX_API, params=params, timeout=30)
        if not resp.text.strip():  # empty body when nothing is archived
            return []
        data = resp.json()
        if len(data) <= 1:  # only the header row
            return []
        headers = data[0]
        return [dict(zip(headers, row)) for row in data[1:]]

    def find_repository_content(self, owner: str, repo: str) -> Dict[str, List]:
        """Find all archived content for a repository."""
        base_url = f"github.com/{owner}/{repo}"
        queries = {
            "homepage": (base_url, "exact"),
            "issues": (f"{base_url}/issues/*", "prefix"),
            "pulls": (f"{base_url}/pull/*", "prefix"),
            "wiki": (f"{base_url}/wiki*", "prefix"),
            "files": (f"{base_url}/blob/*", "prefix"),
            "network": (f"{base_url}/network/members", "exact"),
        }
        results = {}
        for name, (target, match) in queries.items():
            results[name] = self.search_cdx(target, match_type=match)
            sleep(1)  # be polite to Archive.org's rate limits
        return results

    def get_archived_page(self, url: str, timestamp: str) -> Optional[str]:
        """Retrieve archived page content."""
        archive_url = f"{self.ARCHIVE_URL}/{timestamp}/{url}"
        resp = requests.get(archive_url, timeout=30)
        if resp.status_code == 200:
            return resp.text
        return None

    def find_forks(self, owner: str, repo: str) -> List[str]:
        """Find potential forks from the archived fork-network page."""
        network_results = self.search_cdx(
            f"github.com/{owner}/{repo}/network/members",
            match_type="exact",
        )
        forks: List[str] = []
        if network_results:
            latest = network_results[-1]  # most recent snapshot
            content = self.get_archived_page(
                f"https://github.com/{owner}/{repo}/network/members",
                latest["timestamp"],
            )
            if content:
                # Extract fork owners from links like href="/username/repo"
                # (simplified; robust extraction would use an HTML parser)
                pattern = rf'href="/([^/"]+)/{re.escape(repo)}"'
                matches = re.findall(pattern, content)
                forks = sorted(set(matches) - {owner})
        return forks


# Usage example
recovery = WaybackGitHubRecovery()

# Check if the repository homepage was archived
snapshot = recovery.check_availability("https://github.com/deleted-user/deleted-repo")
if snapshot:
    print(f"Archived at: {snapshot['url']}")
    print(f"Timestamp: {snapshot['timestamp']}")

# Find all archived content
content = recovery.find_repository_content("deleted-user", "deleted-repo")
print(f"Found {len(content['issues'])} archived issue pages")
print(f"Found {len(content['files'])} archived file pages")

# Find potential forks
forks = recovery.find_forks("deleted-user", "deleted-repo")
for fork in forks:
    print(f"Potential fork: github.com/{fork}/deleted-repo")
```
Limitations:
- `raw.githubusercontent.com` URLs are rarely archived; prefer the rendered `blob` views.
- Archive.org has undocumented rate limits: add delays between requests and use `collapse` parameters to reduce result counts.

Troubleshooting:
- No archived snapshots found: widen the query with a wildcard, e.g. `github.com/owner/repo/*`.
- Archived page shows a broken layout: the content is usually still present in the HTML even when CSS/JS assets were not archived.
- CDX API returns empty results: try `matchType=prefix` instead of `exact`, and drop `filter=statuscode:200` to see all captures.
- Rate limited by Archive.org: slow down, and use `collapse=timestamp:8` to reduce duplicates.
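The rate-limit advice above can be wrapped in a simple backoff helper. A minimal sketch (function names are ours; `fetch` stands in for any request that raises on failure, such as an HTTP 429):

```python
import time

def with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), doubling the delay after each failed attempt."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            sleep(delay)
            delay *= 2

# Demo with a fake fetch that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

result = with_backoff(flaky, sleep=lambda d: None)
print(result)  # ok
```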