xtract.bot
POST /api/document-ocr

Run OCR on a document or photo and return the recognised text + word/line bounding boxes. Works on PNG, JPEG, WebP, AVIF.

Extract text from images with Tesseract's LSTM engine (English language model bundled). Handles document scans, screenshots, and reasonably clean photos.

Inputs:

- `image`: PNG, JPEG, WebP, or AVIF bytes (the front Worker auto-converts non-PNG inputs before passing them to the OCR engine).
- `granularity`: how much detail to return. `text` (default, just the concatenated string), `word` (text + per-word boxes + confidence), `line` (text + per-line boxes + confidence), or `word+line` (everything).

Response modes (negotiated via the `Accept` header):

- `application/json` (default): `{text, durationMs, width, height}` plus `words[]` and/or `lines[]` when the requested granularity includes them. Each box is `{rect: {left, top, right, bottom}, text, confidence}`.
- `text/plain`: just the recognised text as a UTF-8 body, identical to the `text` field of the JSON envelope. Boxes are discarded in this mode (they only fit in JSON).

For best results, pre-process noisy scans with `document-prepare-for-ocr` first (deskew + binarize).

Phase B of `docs/EXPANSION_PLAN.md`. Only English is bundled at present; other languages can be added later by lazy-loading additional `.traineddata` files from R2.
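The JSON envelope described above can be modelled on the client roughly as follows. This is a minimal sketch, not the service's published types: the interface names (`Rect`, `OcrBox`, `OcrResponse`) and the helpers `filterByConfidence` and `rectSize` are hypothetical, and the scale of `confidence` is not documented here, so the threshold is left to the caller. The sample values are made up for illustration.

```typescript
// Hypothetical client-side types mirroring the documented JSON envelope.
interface Rect { left: number; top: number; right: number; bottom: number; }
interface OcrBox { rect: Rect; text: string; confidence: number; }
interface OcrResponse {
  text: string;
  durationMs: number;
  width: number;
  height: number;
  words?: OcrBox[]; // present when granularity is `word` or `word+line`
  lines?: OcrBox[]; // present when granularity is `line` or `word+line`
}

// Keep only boxes at or above a caller-chosen confidence threshold.
// Assumption: the threshold uses the same (undocumented) scale as `confidence`.
function filterByConfidence(boxes: OcrBox[] | undefined, minConfidence: number): OcrBox[] {
  return (boxes ?? []).filter((box) => box.confidence >= minConfidence);
}

// Convert an edge-based rect into a width/height pair.
function rectSize(rect: Rect): { width: number; height: number } {
  return { width: rect.right - rect.left, height: rect.bottom - rect.top };
}

// Illustrative response only; these numbers do not come from the service.
const sample: OcrResponse = {
  text: "Invoice #1O42",
  durationMs: 120,
  width: 800,
  height: 600,
  words: [
    { rect: { left: 10, top: 20, right: 110, bottom: 50 }, text: "Invoice", confidence: 0.98 },
    { rect: { left: 120, top: 20, right: 200, bottom: 50 }, text: "#1O42", confidence: 0.41 },
  ],
};

const confident = filterByConfidence(sample.words, 0.9);
const firstSize = rectSize(sample.words![0].rect);
```

Treating `words` and `lines` as optional keeps the same type usable for every `granularity` value, including the default `text`, where neither array is returned.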

Inputs

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `image`* | file | | Input document image (PNG/JPEG/WebP/AVIF). |
| `granularity` | enum (`text`, `word`, `line`, `word+line`) | `"text"` | Level of detail to return (boxes are only emitted in JSON mode). |
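A request using these inputs might be assembled as in the sketch below. Several details are assumptions rather than documented behaviour: that the endpoint accepts a `multipart/form-data` upload with fields named `image` and `granularity`, that the base URL is `https://xtract.bot`, and the helper name `buildOcrRequest`, which is hypothetical. The sketch only builds the request and sends nothing.

```typescript
type Granularity = "text" | "word" | "line" | "word+line";
type Mode = "json" | "text";

interface OcrRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: FormData };
}

// Build (but do not send) a request for POST /api/document-ocr.
// Assumptions: multipart/form-data upload, field names taken from the inputs table,
// and a base URL of https://xtract.bot.
function buildOcrRequest(
  image: Blob,
  granularity: Granularity = "text",
  mode: Mode = "json",
  baseUrl: string = "https://xtract.bot",
): OcrRequest {
  const form = new FormData();
  form.append("image", image, "input");
  form.append("granularity", granularity);

  return {
    url: `${baseUrl}/api/document-ocr`,
    init: {
      method: "POST",
      // The response mode is negotiated via the Accept header.
      // Content-Type is left unset so the runtime can add the multipart boundary.
      headers: { Accept: mode === "json" ? "application/json" : "text/plain" },
      body: form,
    },
  };
}

// Placeholder bytes stand in for a real image file.
const png = new Blob([new Uint8Array([1, 2, 3])], { type: "image/png" });
const wordRequest = buildOcrRequest(png, "word");
const plainRequest = buildOcrRequest(png, "text", "text");
```

Passing the result to `fetch(wordRequest.url, wordRequest.init)` would send the request. Since boxes are only emitted in JSON mode, combining `mode: "text"` with a box granularity returns the recognised text alone.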

Response

Modes: json, text. Cache: yes (24h TTL).