JSON Schema for Real-World Documents

Why schema-at-generation beats post-hoc cleanup for invoices, leases, and regulated forms.

10 min read

Definition: schema-at-generation

JSON Schema for document extraction means defining the exact shape of output records—field names, types, required properties, and enums—and validating against that schema while the model produces JSON, not in a separate cleanup step afterward. PaperIQ.ai lets tenants describe fields in plain English, generate a JSON Schema, and enforce validation during extraction where configured. The result is automation-ready JSON suitable for databases, spreadsheets, CRMs, and accounting systems—not a conversational summary that still needs manual reshaping.

Why summaries fail downstream systems

Generic AI outputs read well to humans but break automation. A lease “summary” might mention rent and term in prose; your portfolio database needs `monthly_rent` as a number, `lease_start` as an ISO date, and `tenant_legal_name` as a string with consistent spelling. Post-hoc cleanup—asking a model to “convert this paragraph to JSON”—introduces a second failure mode: the converter may invent fields, drop optional clauses, or round currency incorrectly. Operations teams then audit row-by-row, which defeats the purpose of automation. Schema-at-generation aligns the model’s task with the destination system from the start: produce records that already match your table columns.
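To make the difference concrete, here is a minimal sketch of the typed row a portfolio database might expect, using only the Python standard library. The `LeaseRecord` class and `to_lease_record` helper are hypothetical names chosen for illustration, not part of PaperIQ; the three field names come from the paragraph above.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class LeaseRecord:
    """Hypothetical target row; field names follow the prose above."""
    tenant_legal_name: str
    monthly_rent: Decimal  # exact decimal, avoids float rounding on currency
    lease_start: date      # parsed from an ISO 8601 date string


def to_lease_record(raw: dict) -> LeaseRecord:
    """Convert raw JSON values into typed fields, raising on bad input."""
    return LeaseRecord(
        tenant_legal_name=str(raw["tenant_legal_name"]).strip(),
        monthly_rent=Decimal(str(raw["monthly_rent"])),
        lease_start=date.fromisoformat(raw["lease_start"]),
    )


record = to_lease_record({
    "tenant_legal_name": "Acme Holdings LLC",
    "monthly_rent": "4250.00",
    "lease_start": "2024-03-01",
})
```

A prose summary gives this conversion nothing to work with: a missing key raises `KeyError`, and a placeholder such as "TBD" in `lease_start` raises `ValueError` instead of landing in a table column.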

A practical schema example (invoice)

An accounts-payable team might define:

• `vendor_name` (string, required)
• `invoice_number` (string, required)
• `invoice_date` (string, format date)
• `due_date` (string, format date)
• `currency` (string, enum USD|EUR|GBP)
• `subtotal`, `tax`, `total` (numbers)
• `line_items` (array of objects with `description`, `quantity`, `unit_price`, `amount`)

When extraction runs, validation errors surface missing totals or malformed dates before export, not after a CSV lands in SAP. That shift from “fix in Excel” to “reject and re-run” is how teams scale document volume without scaling headcount linearly.
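The invoice fields above could be written down and checked roughly as follows. This is a sketch under stated assumptions: the schema is a JSON Schema-style dict, the hand-rolled `validate_invoice` function covers only the `required`, `type`, `enum`, and `format: date` keywords, treating `total` as required is an assumption, and `line_items` is left out to keep the example short. A production pipeline would more likely use a full JSON Schema validator.

```python
from datetime import date

# Invoice schema from the list above, written as a JSON Schema-style dict.
# Treating "total" as required is an assumption made for this sketch.
INVOICE_SCHEMA = {
    "type": "object",
    "required": ["vendor_name", "invoice_number", "total"],
    "properties": {
        "vendor_name": {"type": "string"},
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string", "format": "date"},
        "due_date": {"type": "string", "format": "date"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "subtotal": {"type": "number"},
        "tax": {"type": "number"},
        "total": {"type": "number"},
    },
}


def _type_ok(value, expected):
    if expected == "string":
        return isinstance(value, str)
    if expected == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return True  # other types are not checked in this sketch


def validate_invoice(record: dict) -> list[str]:
    """Return a list of human-readable errors; an empty list means pass."""
    errors = []
    for name in INVOICE_SCHEMA["required"]:
        if name not in record:
            errors.append(f"{name}: required field is missing")
    for name, rules in INVOICE_SCHEMA["properties"].items():
        if name not in record:
            continue
        value = record[name]
        if not _type_ok(value, rules["type"]):
            errors.append(f"{name}: expected {rules['type']}")
            continue
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: {value!r} is not one of {rules['enum']}")
        if rules.get("format") == "date":
            try:
                date.fromisoformat(value)
            except ValueError:
                errors.append(f"{name}: {value!r} is not an ISO date")
    return errors


# A record with no total, a non-ISO date, and a currency outside the enum.
errors = validate_invoice({
    "vendor_name": "Northwind Supply",
    "invoice_number": "INV-1042",
    "invoice_date": "03/01/2024",
    "currency": "CAD",
})
```

Here `errors` holds three entries (the missing `total`, the malformed `invoice_date`, and the unsupported `currency`), which is the signal a job can use to reject the record and re-run extraction before anything is exported.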

Leases and contracts: nested structure matters

Real-world documents need nested JSON, not flat key-value guesses. Commercial leases often require:

• Parties block (landlord, tenant, guarantor)
• Premises address object
• Rent schedule array (base rent, escalations, abatements)
• Option clauses with dates and notice periods

JSON Schema expresses those relationships explicitly. A required array with a minimum item count cannot silently come back empty; date fields can reject “TBD” strings when format checks are enforced and your policy forbids placeholders. For regulated workflows, you can encode business rules at the schema layer: what must be present before a record is accepted.
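A nested lease schema and a small recursive check might look like the sketch below. The property names (`parties`, `rent_schedule`, `start_date`, `base_rent`) are illustrative assumptions, and the `validate` function is a hand-rolled stand-in that understands only the keywords used here, not a complete JSON Schema implementation. Note that in JSON Schema, `required` only demands that a property be present; rejecting an empty array takes `minItems`.

```python
from datetime import date

# Illustrative lease schema; property names are assumptions for this sketch.
LEASE_SCHEMA = {
    "type": "object",
    "required": ["parties", "rent_schedule"],
    "properties": {
        "parties": {
            "type": "object",
            "required": ["landlord", "tenant"],
            "properties": {
                "landlord": {"type": "string"},
                "tenant": {"type": "string"},
                "guarantor": {"type": "string"},
            },
        },
        "rent_schedule": {
            "type": "array",
            "minItems": 1,  # "required" alone would still accept []
            "items": {
                "type": "object",
                "required": ["start_date", "base_rent"],
                "properties": {
                    "start_date": {"type": "string", "format": "date"},
                    "base_rent": {"type": "number"},
                },
            },
        },
    },
}

PY_TYPES = {"object": dict, "array": list, "string": str, "number": (int, float)}


def validate(value, schema, path="$"):
    """Return error strings for the subset of keywords used above."""
    expected = PY_TYPES[schema["type"]]
    if not isinstance(value, expected) or isinstance(value, bool):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if schema["type"] == "object":
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}.{name}: required")
        for name, sub in schema.get("properties", {}).items():
            if name in value:
                errors += validate(value[name], sub, f"{path}.{name}")
    elif schema["type"] == "array":
        if len(value) < schema.get("minItems", 0):
            errors.append(f"{path}: fewer than {schema['minItems']} items")
        for i, item in enumerate(value):
            errors += validate(item, schema["items"], f"{path}[{i}]")
    elif schema.get("format") == "date":
        try:
            date.fromisoformat(value)
        except ValueError:
            errors.append(f"{path}: {value!r} is not an ISO date")
    return errors


# A draft with no tenant and a placeholder date in the rent schedule.
draft_errors = validate(
    {
        "parties": {"landlord": "Harbor Properties LP"},
        "rent_schedule": [{"start_date": "TBD", "base_rent": 4250}],
    },
    LEASE_SCHEMA,
)
```

Each error carries a path such as `$.rent_schedule[0].start_date`, so a reviewer can see which nested element failed instead of receiving a single pass/fail flag for the whole lease.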

Plain-English schema authoring

Engineering teams should not be the only ones who can define fields. PaperIQ supports describing what you need—“Extract landlord and tenant legal names, monthly base rent, lease commencement date, and renewal options with notice windows”—and turning that into schema scaffolding. Ops and domain experts iterate on field names that match how they already talk about the document. Engineering reviews types and enums. That collaboration reduces the classic IDP failure where IT ships a schema operators never adopted.

Validation during generation vs after

Validating after generation treats the model as a black box you mop up. Validating during generation (where configured) gives faster feedback loops: the pipeline knows a record failed schema checks and can surface errors in the job UI. PaperIQ does not claim perfect extraction on every document type out of the box. It claims a disciplined contract: your schema is the acceptance test. Measure success by schema pass rate on your representative PDF set, not by a vendor’s generic benchmark on someone else’s forms.
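Schema pass rate is a simple ratio, and it is more useful alongside a per-field error tally that shows which field to fix first. The sketch below assumes validation results are already available as a mapping from document name to a list of `(field, message)` pairs; that shape, the `summarize` function, and the file names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def summarize(results):
    """Return (pass_rate, errors_per_field) for a batch of validated documents.

    `results` maps a document name to a list of (field, message) pairs;
    an empty list means the record passed the schema.
    """
    if not results:
        raise ValueError("need at least one document to compute a pass rate")
    passed = sum(1 for errors in results.values() if not errors)
    per_field = Counter(
        field for errors in results.values() for field, _msg in errors
    )
    return passed / len(results), per_field


# Hypothetical validation results for four sample PDFs.
results = {
    "lease_01.pdf": [("lease_start", "not an ISO date")],
    "lease_02.pdf": [],
    "lease_03.pdf": [("lease_start", "missing"), ("monthly_rent", "expected number")],
    "lease_04.pdf": [],
}
pass_rate, per_field = summarize(results)
```

With this sample, two of four documents pass, and `lease_start` accounts for two of the three errors, which points at the date field as the first thing to review in the schema or the prompt.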

When JSON Schema is the wrong tool

Schema-first extraction is not ideal when:

• You only need a one-off narrative summary for a human reader
• Field definitions change on every document with no stable target system
• You have no representative sample set to test schema pass rates

In those cases, a lightweight parse or chat interface may suffice. PaperIQ targets teams with repeatable document classes and systems of record waiting for rows.

Next steps

Start with ten representative PDFs and the JSON you wish you already had in your database. Define the schema, run extraction, and track validation errors by field, not just whether the output looked right in a paragraph.

Related reading: MCP for Business Data Automation (push validated records into CRMs and ERPs) and our invoice extraction use case.


FAQ

Schema validation is optional and configurable. Teams that need automation-ready output typically enable schema-at-generation; exploratory workflows may start without strict validation.

PaperIQ supports export paths aligned with row/column workflows and downstream systems once records pass validation where configured.

OCR plus cleanup produces unstructured or loosely structured text first, then hopes a second model pass formats it correctly. Schema-at-generation makes the target record shape the primary task from the start.

