A few technical details about how this works.
Stack:

- Next.js
- Tailwind
- KaTeX for math rendering
- Supabase for storage
- deployed on Vercel
The pipeline is roughly:
image → vision model → Markdown + LaTeX → custom renderer
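The pipeline shape can be sketched as a couple of typed stages. This is a minimal sketch, not the actual code: the vision step is injected as a function so any model API can slot in, and `VisionModel`, `imageToHtml`, `renderMath`, and the prompt string are illustrative names I'm assuming here.

```typescript
// Sketch of the pipeline: image → vision model → Markdown+LaTeX → renderer.
// The vision model is injected, so the pipeline itself stays pure and testable.
type VisionModel = (imageBase64: string, systemPrompt: string) => Promise<string>;

// Stand-in for the real ~300-line system prompt.
const SYSTEM_PROMPT = "Transcribe exactly. Do not simplify, merge, or reorder.";

async function imageToHtml(
  imageBase64: string,
  model: VisionModel,
  renderMath: (markdown: string) => string, // e.g. a KaTeX-backed renderer
): Promise<string> {
  // 1. Vision model transcribes the page into Markdown + LaTeX.
  const markdown = await model(imageBase64, SYSTEM_PROMPT);
  // 2. Custom renderer turns that into HTML.
  return renderMath(markdown);
}
```

Injecting the model also makes it cheap to test the structural post-processing with a stubbed transcription instead of live API calls.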
The tricky part isn’t OCR itself — it's preserving structure.
Examples:
- consecutive equations with aligned `=` signs need to become a single `align` block
- handwritten tables must be reconstructed from vertical alignment patterns
- numbered problems must stay separate instead of merging
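The first case above can be handled as a post-processing pass over the model's output. A minimal sketch, assuming one `$$...$$` display equation per line (the function name and that convention are my assumptions, not the actual implementation):

```typescript
// Merge runs of consecutive single-line display equations ($$...$$)
// that contain `=` into one align block, so the `=` signs line up.
function mergeEquationRuns(markdown: string): string {
  const lines = markdown.split("\n");
  const out: string[] = [];
  let run: string[] = [];

  const flush = () => {
    if (run.length > 1) {
      // Put the alignment point at the first `=` of each equation.
      const body = run.map((eq) => eq.replace("=", "&=")).join(" \\\\\n");
      out.push(`$$\\begin{align}\n${body}\n\\end{align}$$`);
    } else if (run.length === 1) {
      out.push(`$$${run[0]}$$`);
    }
    run = [];
  };

  for (const line of lines) {
    const m = line.trim().match(/^\$\$(.+)\$\$$/);
    if (m && m[1].includes("=")) {
      run.push(m[1].trim()); // extend the current run of equations
    } else {
      flush(); // non-equation line ends the run
      out.push(line);
    }
  }
  flush();
  return out.join("\n");
}
```

KaTeX supports the `align` environment directly, so the merged block renders without further changes.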
The system prompt ended up at ~300 lines, mostly *negative constraints* like:
- don't simplify math
- don't merge derivation steps
- don't reorder columns
Without those rules the model constantly tries to "improve" the notes.
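An illustrative fragment of what such constraints look like (my own paraphrase, not the actual 300-line prompt):

```text
You are a transcription engine, not an editor.
- Transcribe every symbol exactly as written, even if it looks wrong.
- Do NOT simplify, evaluate, or rearrange any expression.
- Do NOT merge consecutive derivation steps into one line.
- Do NOT reorder table columns or rows; preserve blank cells.
- If a symbol is illegible, output [?] rather than guessing.
```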
One surprising lesson: prompt engineering for OCR is very different from chat prompting. You want the model to be extremely literal.
Still working on better handling for diagrams and messy annotations.
Curious if anyone here has worked on *math layout detection or document AI*.