xtract.bot
POST /api/pdf-extract-text

Extract plain text from a PDF over HTTP. Multi-page documents, encrypted-but-readable PDFs, and embedded fonts are all handled. Image-only / scanned PDFs return empty (use OCR for those).

Returns the readable text of a PDF, across multiple pages, multi-column layouts, and embedded fonts. Encrypted PDFs that allow text extraction (the common case for documents flagged as encrypted but not actually password-protected) work without extra configuration. Image-only / scanned PDFs (where the "text" is actually rendered into the page as bitmap pixels) return empty output; for those, run the file through `document-ocr` first.

Options:

- `keepLayout` (default false): preserve the original visual layout in the output. Useful for tables; can produce a lot of whitespace for prose.
- `quiet` (default false): suppress non-fatal warnings.
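Because image-only PDFs come back as empty output rather than as an error, a caller may want to check for blank text before falling back to `document-ocr`. The following is a minimal sketch, assuming the response body has already been captured in a shell variable; `is_blank_text` is a hypothetical helper, not part of the API.

```shell
#!/bin/sh
# Hypothetical helper (not part of the API): succeeds (exit 0) when the
# extracted text is empty or whitespace-only, as expected for a scanned PDF.
is_blank_text() {
  # Delete every whitespace character; anything left over is real text.
  stripped=$(printf '%s' "$1" | tr -d '[:space:]')
  [ -z "$stripped" ]
}

# Example: a response containing only spaces, tabs, and newlines.
text=$(printf ' \n\t\n')
if is_blank_text "$text"; then
  echo "no text layer found; run the file through document-ocr first"
fi
```

Checking for whitespace-only output, rather than a strictly empty string, is a design choice here: it also covers a response that contains nothing but page separators or blank lines.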

Inputs

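| Name | Type | Default | Description |
|---|---|---|---|
| `pdf`* | file | | PDF document bytes. |
| `keepLayout` | boolean | false | Passes `-layout` to the underlying extractor to preserve visual layout. |
| `quiet` | boolean | true | Passes `-q` to the underlying extractor to suppress non-fatal stderr output. |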
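The inputs above map onto a small JSON request body. As a rough sketch, the body could be assembled with a shell function like the one below; `build_payload` is a hypothetical helper, and its defaults simply mirror the sample request on this page (`keepLayout` false, `quiet` true) rather than any documented server behavior.

```shell
#!/bin/sh
# Hypothetical helper (not part of the API): print a JSON request body.
# Usage: build_payload <pdf-path> [keepLayout] [quiet]
# keepLayout and quiet must be the literal strings "true" or "false".
build_payload() {
  pdf_path=$1
  keep_layout=${2:-false}
  quiet=${3:-true}
  # base64 may wrap its output depending on the platform; strip newlines
  # so the encoded PDF is a single valid JSON string.
  pdf_b64=$(base64 < "$pdf_path" | tr -d '\n')
  printf '{"keepLayout": %s, "quiet": %s, "pdf": "%s"}\n' \
    "$keep_layout" "$quiet" "$pdf_b64"
}
```

For a table-heavy document, `build_payload report.pdf true > body.json` would write a layout-preserving request to a file (`report.pdf` is a placeholder name), which can then be sent with `curl --data-binary @body.json`. Sending from a file avoids passing a large base64 string on the command line.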

Response

Modes: text, json. Cache: yes (24h TTL).
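The sample request on this page selects the text mode with `Accept: text/plain`. How the json mode is selected is not stated here; the sketch below assumes it follows the same pattern with `Accept: application/json`. `accept_for_mode` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper (not part of the API): map a response mode name to an
# Accept header value. "text" matches the sample request on this page;
# "json" -> application/json is an assumption about the json mode.
accept_for_mode() {
  case "$1" in
    text) echo "text/plain" ;;
    json) echo "application/json" ;;
    *)    echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

In a curl call this would be used as `-H "Accept: $(accept_for_mode json)"`. The shape of the JSON response is not described in this section, so inspect a real response before parsing it.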

Code samples

Built from the hello example.

# Download or substitute the example input:
#   curl -O https://xtract.bot/examples/pdf-extract-text/hello.pdf
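# Base64-encode the PDF on a single line. `-w0` is a GNU coreutils flag;
# elsewhere, `base64 < hello.pdf | tr -d '\n'` gives the same result.
PDF=$(base64 -w0 < hello.pdf)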

curl -X POST https://api.xtract.bot/api/pdf-extract-text \
  -H "Content-Type: application/json" \
  -H "Accept: text/plain" \
  -H "X-Account-Id: $XTRACT_ACCOUNT_ID" \
  -H "X-Api-Key: $XTRACT_API_KEY" \
  -d '{
  "keepLayout": false,
  "quiet": true,
  "pdf": "'"$PDF"'"
}'