Parse PDF, Office (Word/PPT/Excel), and image files into clean Markdown — with LaTeX formulas, tables, images, and OCR. One zero-dependency script, two backends, automatic routing.
# Parse a local file or URL — the Agent API needs no login
python3 scripts/mineru.py paper.pdf
# Pipe the Markdown straight back to an agent
python3 scripts/mineru.py paper.pdf --stdout
# Machine-readable status for tool pipelines
python3 scripts/mineru.py paper.pdf --json
No pip install, no API key. The free Agent API handles files ≤ 10 MB / ≤ 20 pages.
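For tool pipelines, the `--json` output can be consumed from Python with only the standard library. The sketch below assumes the script prints exactly one JSON document to stdout; the schema is not described here, so the decoded object is returned as-is without assuming any field names. `build_command` and `parse_to_status` are hypothetical helper names, not part of the tool.

```python
import json
import subprocess
import sys


def build_command(input_path, script="scripts/mineru.py"):
    """Return the argv for a --json run (flags taken from the usage above)."""
    return [sys.executable, script, input_path, "--json"]


def parse_to_status(input_path):
    """Run the parser and decode whatever JSON it prints to stdout.

    Assumption: stdout holds exactly one JSON document. No schema is assumed.
    """
    result = subprocess.run(
        build_command(input_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

Calling `parse_to_status("paper.pdf")` from the repository root would then give an agent a Python object to inspect instead of raw text.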
scripts/mineru.py carries PEP 723 inline metadata, so uv can run it directly with a uv-managed interpreter, with no venv and no pip install:
uv run scripts/mineru.py paper.pdf --stdout # zero-install run
uv run --no-project --with pytest pytest -q # dev suite via uv
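For readers new to PEP 723, inline metadata is a specially formatted comment block at the top of a script that runners such as uv read before execution. The sketch below shows the general shape for a script with no third-party dependencies. The `requires-python` value and the `main` stub are assumptions for illustration and are not copied from scripts/mineru.py.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = []
# ///
"""Hypothetical stand-in for a zero-dependency script run via `uv run`."""


def main(argv):
    # A real script would parse arguments here; this stub only echoes them.
    return "would parse: " + ", ".join(argv)


print(main(["paper.pdf"]))
```

Because `dependencies` is empty, uv only has to provide an interpreter that matches `requires-python`; nothing is installed.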
export MINERU_TOKEN="..." # https://mineru.net/apiManage/token
# Parallel batch a directory, resume on re-run
python3 scripts/mineru.py ./pdfs/ --output ./out/ --workers 8 --resume
# Export DOCX/HTML/LaTeX alongside Markdown (auto-routes to the Standard API)
python3 scripts/mineru.py report.pdf --format docx --format latex
When a token is set, the tool auto-routes: small single files still use the free Agent API; anything large (> 10 MB / > 20 pages), batched, or needing extra export formats uses the Standard API (≤ 200 MB / ≤ 200 pages). If the Agent API hits a size/page limit, it auto-escalates to the Standard API.
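The routing rule above can be sketched as a small decision function. This is an illustration of the described behavior, not the tool's actual code: `Job` and `choose_api` are hypothetical names, the sketch assumes the Standard API always requires a token, and it ignores `--split` and the runtime escalation path.

```python
from dataclasses import dataclass

# Limits quoted in the text above.
AGENT_MAX_MB, AGENT_MAX_PAGES = 10, 20
STANDARD_MAX_MB, STANDARD_MAX_PAGES = 200, 200


@dataclass
class Job:
    size_mb: float
    pages: int
    batch: bool = False          # more than one input
    extra_formats: bool = False  # e.g. --format docx


def choose_api(job, has_token):
    """Return "agent" or "standard", or raise if neither can take the job."""
    fits_agent = job.size_mb <= AGENT_MAX_MB and job.pages <= AGENT_MAX_PAGES
    needs_standard = job.batch or job.extra_formats or not fits_agent
    if not needs_standard:
        return "agent"
    if not has_token:
        # Assumption: the Standard API is unavailable without MINERU_TOKEN.
        raise ValueError("job needs the Standard API, which requires MINERU_TOKEN")
    if job.size_mb > STANDARD_MAX_MB or job.pages > STANDARD_MAX_PAGES:
        raise ValueError("job exceeds the Standard API limits")
    return "standard"
```

For example, `choose_api(Job(size_mb=50, pages=120), has_token=True)` selects the Standard API, while a 2 MB, 5-page file stays on the free Agent API whether or not a token is set.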
| Modality | Extensions | OCR |
|---|---|---|
| PDF | .pdf | --ocr |
| Image | .png .jpg .jpeg .jp2 .webp .gif .bmp | built-in |
| Word | .doc .docx | - |
| Slides | .ppt .pptx | - |
| Sheet | .xls .xlsx | - |
| HTML | .html (Standard API, MinerU-HTML model) | - |
INPUT...          One or more files, a directory, or a URL
--output, -o      Output directory (default: ./output)
--api             auto | agent | standard (default: auto)
--model           pipeline | vlm | MinerU-HTML (default: vlm)
--format          docx | html | latex (repeatable; forces Standard API)
--lang            OCR/document language (default: ch)
--ocr             Enable OCR for scanned documents
--pages           Page range, e.g. "1-10" or "2,4-6"
--workers, -w     Concurrent submit/upload/download slots (default: 8)
--resume          Skip inputs already parsed
--stdout          Print Markdown to stdout
--json            Print machine-readable status to stdout
--to SINK         Deliver into a content tool (repeatable); --list-sinks to enumerate
--obsidian PATH   Shortcut for --to obsidian with this vault
--engine          cloud | local | auto (local/auto parse born-digital PDFs offline)
--split           Split oversized PDFs past the page caps, parse parts, merge (needs pypdf)
--chunk           Emit heading-aware RAG chunks (.chunks.json + --json)
--doctor          Environment self-check and exit
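The `--pages` syntax ("1-10", "2,4-6") is easy to validate before submitting a job. The sketch below expands such a spec into explicit page numbers. `parse_pages` is a hypothetical helper: how the tool itself interprets the string is not documented here, and it may simply forward it to the API unchanged.

```python
def parse_pages(spec):
    """Expand a spec like "2,4-6" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty item in page spec: {spec!r}")
        start, sep, end = part.partition("-")
        first = int(start)                 # ValueError on non-numeric input
        last = int(end) if sep else first  # a bare number is a one-page range
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)
```

With this helper, `parse_pages("2,4-6")` yields pages 2, 4, 5, and 6, and malformed input such as `"5-"` or `"6-4"` raises `ValueError` instead of reaching the API.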
Expose MinerU over MCP (zero-dependency stdio JSON-RPC) so an MCP host can call it:
python3 scripts/mineru_mcp.py
Tools: mineru_parse, mineru_parse_to (parse + deliver to sinks), mineru_list_sinks.
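To check the server by hand, a client can speak MCP's stdio transport directly, which is newline-delimited JSON-RPC 2.0. The sketch below builds the opening handshake and a `tools/list` request. The protocol version string and client name are assumptions, and `rpc` and `handshake_lines` are hypothetical helper names. No `tools/call` example is given because the tools' argument schemas are not documented here.

```python
import json


def rpc(method, params=None, msg_id=None):
    """Encode one JSON-RPC 2.0 message as a single newline-terminated line."""
    message = {"jsonrpc": "2.0", "method": method}
    if msg_id is not None:   # notifications carry no id
        message["id"] = msg_id
    if params is not None:
        message["params"] = params
    return json.dumps(message) + "\n"


def handshake_lines():
    """Messages an MCP client sends to initialize a session and list tools."""
    return [
        rpc("initialize", {
            "protocolVersion": "2024-11-05",  # assumed; use what your host speaks
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.1"},
        }, msg_id=1),
        rpc("notifications/initialized"),
        rpc("tools/list", msg_id=2),
    ]
```

Writing these lines to the stdin of `python3 scripts/mineru_mcp.py` and reading stdout line by line should produce a `tools/list` response that names the tools listed above.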
Parse once and push the Markdown into content tools with --to, via each one's official path:
python3 scripts/mineru.py paper.pdf --to obsidian --to notion --to feishu
Targets: obsidian logseq siyuan notion linear yuque coda slack feishu confluence onenote ticktick dingtalk airtable wecom (all zero-dependency), plus roam and wps via optional extras.
Each target reads its config from env vars (run --list-sinks to enumerate them). Per-target auth, fidelity, and image notes are in references/integrations.md.
output/
└── document-name/
├── document-name.md # clean Markdown
└── images/ # extracted figures (Standard API)
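Downstream scripts can locate results from this layout without parsing any logs. The sketch below assumes `document-name` is the input filename without its extension, which the tree above suggests but does not state. `expected_markdown` and `pending` are hypothetical helpers, and `pending` is only a rough stand-in for what `--resume` might check.

```python
from pathlib import Path


def expected_markdown(input_path, output_dir="./output"):
    """Path where the Markdown for input_path should land, per the layout above."""
    stem = Path(input_path).stem  # assumption: directory is named after the stem
    return Path(output_dir) / stem / f"{stem}.md"


def pending(inputs, output_dir="./output"):
    """Inputs with no Markdown on disk yet."""
    return [p for p in inputs if not expected_markdown(p, output_dir).is_file()]
```

For instance, `expected_markdown("pdfs/paper.pdf", "out")` points at `out/paper/paper.md`.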
End-to-end latency for the official demo PDF via the free Agent API:
cold ≈ 14 s · warm ≈ 13 s (submit → poll → download). Batches scale with
--workers. Numbers come from the no-mock live benchmark in tests/test_live.py.
python3 -m pytest # fast unit suite (offline)
MINERU_LIVE=1 python3 -m pytest -m live -s # real API + benchmark (no mocks)
See references/api_reference.md. Official docs: https://mineru.net/apiManage/docs · Token: https://mineru.net/apiManage/token