  • mrajatnath 5 hours ago
    Rajat here.

    A few technical details about how this works.

    Stack:

    - Next.js
    - Tailwind
    - KaTeX for math rendering
    - Supabase storage
    - deployed on Vercel

    The pipeline is roughly:

    image → vision model → Markdown + LaTeX → custom renderer
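    A minimal sketch of how those stages might compose (the vision call is stubbed out here; the real model API and prompt are not shown):

    ```typescript
    // Hypothetical pipeline sketch: the vision step is injected as a function,
    // so the actual model call (whatever provider it is) stays swappable.
    type VisionModel = (imageBase64: string) => Promise<string>;

    async function transcribe(
      imageBase64: string,
      visionModel: VisionModel
    ): Promise<string> {
      // The model is asked to return Markdown with LaTeX math in $$ ... $$.
      const markdown = await visionModel(imageBase64);
      // Light post-processing before handing off to the custom renderer
      // (KaTeX does the actual math rendering downstream).
      return markdown.trim();
    }
    ```
    
    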

    The tricky part isn’t OCR itself — it's preserving structure.

    Examples:

    • consecutive equations with aligned `=` signs need to become a single `align` block
    • handwritten tables must be reconstructed from vertical alignment patterns
    • numbered problems must stay separate instead of merging
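    The first case could be handled with a post-processing pass like this sketch (not the actual renderer code; it merges runs of consecutive `$$...$$` lines containing `=` into one `aligned` block, which KaTeX supports inside display math):

    ```typescript
    // Merge consecutive display-math lines that each contain "=" into a
    // single aligned block so the equals signs line up when rendered.
    function mergeAlignedEquations(lines: string[]): string[] {
      const out: string[] = [];
      let run: string[] = [];

      const flush = () => {
        if (run.length > 1) {
          // Strip the $$ delimiters and mark the alignment point at "=".
          const rows = run.map((l) =>
            l.replace(/^\$\$|\$\$$/g, "").trim().replace("=", "&=")
          );
          out.push(`$$\\begin{aligned}${rows.join(" \\\\ ")}\\end{aligned}$$`);
        } else {
          out.push(...run); // a lone equation stays as-is
        }
        run = [];
      };

      for (const line of lines) {
        if (/^\$\$.*=.*\$\$$/.test(line.trim())) {
          run.push(line.trim());
        } else {
          flush();
          out.push(line);
        }
      }
      flush();
      return out;
    }
    ```
    
    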

    The system prompt ended up at ~300 lines, mostly *negative constraints* like:

    - don't simplify math
    - don't merge derivation steps
    - don't reorder columns

    Without those rules the model constantly tries to "improve" the notes.
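    A hypothetical excerpt of what that style of constraint looks like in practice (not the actual prompt, just the shape of it):

    ```text
    You are a literal transcription engine, not an editor.
    - Do NOT simplify or evaluate any mathematical expression.
    - Do NOT merge, reorder, or omit derivation steps.
    - Do NOT reorder table columns or rows.
    - Transcribe exactly what is written, even if it looks wrong.
    ```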

    One surprising lesson: prompt engineering for OCR is very different from chat prompts — you want the model to be extremely literal.

    Still working on better handling for diagrams and messy annotations.

    Curious if anyone here has worked on *math layout detection or document AI*.