Run OCR on a document or photo and return the recognised text plus optional word/line bounding boxes. Works on PNG, JPEG, WebP, and AVIF.
Extract text from images with Tesseract's LSTM engine (English
language model bundled). Handles document scans, screenshots,
and reasonably clean photos.
Inputs:
- `image`: PNG, JPEG, WebP, or AVIF bytes (the front Worker
auto-converts non-PNG inputs via its image codecs before
feeding them to the OCR engine).
- `granularity`: how much detail to return. `text` (default,
just the concatenated string), `word` (text + per-word boxes
+ confidence), `line` (text + per-line boxes + confidence),
or `word+line` (everything).
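The mapping from `granularity` to the optional response arrays can be sketched as below. This is a minimal illustration, not the tool's own code: the helper name `expected_arrays` is hypothetical, and it assumes `word` yields `words[]` and `line` yields `lines[]`, as the list above implies.

```python
# Hypothetical helper (not part of the tool): which optional arrays a
# given granularity value is expected to produce in the JSON response.
VALID_GRANULARITIES = ("text", "word", "line", "word+line")


def expected_arrays(granularity: str = "text") -> set:
    """Return the names of the box arrays implied by `granularity`."""
    if granularity not in VALID_GRANULARITIES:
        raise ValueError(
            f"granularity must be one of {VALID_GRANULARITIES}, got {granularity!r}"
        )
    parts = granularity.split("+")
    arrays = set()
    if "word" in parts:
        arrays.add("words")
    if "line" in parts:
        arrays.add("lines")
    return arrays
```

With the default `text`, the set is empty, so a client should expect only the concatenated string.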
Response modes (negotiated via Accept header):
- `application/json` (default): `{text, durationMs, width,
height}` plus `words[]` and/or `lines[]` when granularity
requests them. Each box is `{rect: {left, top, right,
bottom}, text, confidence}`.
- `text/plain`: just the recognised text as a UTF-8 body —
same content as `text` in the JSON envelope. Boxes are
discarded in this mode (they only fit in JSON).
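A client might read the JSON envelope along these lines. This is a sketch under stated assumptions: the class and function names (`Box`, `OcrResult`, `parse_ocr_json`) are hypothetical, the sample payload is made up for illustration, and the units of `rect` and the scale of `confidence` are not specified above, so the code treats both as opaque numbers.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Box:
    # Fields mirror {rect: {left, top, right, bottom}, text, confidence}.
    left: float
    top: float
    right: float
    bottom: float
    text: str
    confidence: float  # scale not stated in the docs; kept as-is

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top


@dataclass
class OcrResult:
    text: str
    duration_ms: float
    width: int
    height: int
    words: list = field(default_factory=list)  # empty unless requested
    lines: list = field(default_factory=list)  # empty unless requested


def _parse_box(raw: dict) -> Box:
    rect = raw["rect"]
    return Box(rect["left"], rect["top"], rect["right"], rect["bottom"],
               raw["text"], raw["confidence"])


def parse_ocr_json(body: str) -> OcrResult:
    """Parse an application/json response body into an OcrResult."""
    data = json.loads(body)
    return OcrResult(
        text=data["text"],
        duration_ms=data["durationMs"],
        width=data["width"],
        height=data["height"],
        words=[_parse_box(b) for b in data.get("words", [])],
        lines=[_parse_box(b) for b in data.get("lines", [])],
    )


# Illustrative payload only; the values are invented for this example.
sample = json.dumps({
    "text": "Hello",
    "durationMs": 12,
    "width": 200,
    "height": 100,
    "words": [{
        "rect": {"left": 10, "top": 20, "right": 110, "bottom": 50},
        "text": "Hello",
        "confidence": 95,
    }],
})
result = parse_ocr_json(sample)
```

Using `data.get(..., [])` for `words` and `lines` keeps the parser usable across all four granularity values, since those arrays are only present when requested.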
For best results pre-process noisy scans with
`document-prepare-for-ocr` first (deskew + binarize).
This tool is Phase B of `docs/EXPANSION_PLAN.md`. Other languages can be
added later by lazy-loading additional `.traineddata` files from
R2 — currently only English is bundled.
Inputs

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| image* | file | — | Input document image (PNG/JPEG/WebP/AVIF). |
| granularity | enum (text \| word \| line \| word+line) | "text" | Level of detail to return (boxes are only emitted in JSON mode). |