2 points by whatsthatapp | 12 hours ago | 1 comment
  • whatsthatapp | 12 hours ago
    I've been working on a pipeline that chains vision analysis, parameterized narrative generation, and TTS into a single flow that completes in under 10 seconds (~2s vision, ~5s generation streamed, ~3s TTS). Shipped it as an iOS app.

    The pipeline:

    1. *Vision analysis* identifies the subject (landmark, artwork, food, signage, museum panel, etc.) and extracts contextual details from the image.

    2. *IPOP parameterization* selects the narrative angle. IPOP (Ideas, People, Objects, Physical) is adapted from the Smithsonian's visitor engagement research, which found museum visitors cluster into four types by what draws their attention. Users set their IPOP dimension weights in the app. A user weighted toward "People" gets the story of the craftsman who built a building. A user weighted toward "Ideas" gets the political history of the same building from the same photo.

    3. *Narrative generation* produces ~90 seconds of spoken text via LLM. The system uses SSE streaming so text renders on the client while generation is still running. The IPOP weights get injected into the system prompt alongside the vision output.

    4. *Non-repetition via session context.* This was the hardest part. If someone photographs three churches in a row, the system needs to find a genuinely different angle each time. The approach: maintain a sliding summary of prior outputs in the session. Before generating, the system checks which IPOP dimensions and narrative angles have already been used, then rotates to an unexplored dimension. So church #1 might get the political history, church #2 gets the story of the stonemason, church #3 gets the acoustic design. Without this, you get "built in 1342, baroque style" on repeat.

    5. *TTS* converts the narrative to audio with selectable voice models. Audio generates in the background while the user reads the streamed text.
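    A minimal sketch of step 2's dimension selection, assuming the weights live in a simple per-user dict (the names and schema here are illustrative, not the app's actual data model):

```python
import random

# The four IPOP dimensions from the Smithsonian visitor-type research.
IPOP = ["ideas", "people", "objects", "physical"]

def pick_dimension(weights, rng=random):
    """Sample one narrative dimension, proportional to the user's weights."""
    dims = [d for d in IPOP if weights.get(d, 0) > 0]
    return rng.choices(dims, weights=[weights[d] for d in dims], k=1)[0]
```

    A profile like `{"people": 0.6, "ideas": 0.2, "objects": 0.1, "physical": 0.1}` mostly gets people-centred narratives, with occasional draws from lower-weighted dimensions for variety.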
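    The SSE streaming in step 3 amounts to framing each generated chunk as a `text/event-stream` event; the `[DONE]` sentinel below is a common convention, not necessarily what the app's server emits:

```python
def sse_frames(chunk_iter):
    """Frame generated text chunks as Server-Sent Events."""
    for chunk in chunk_iter:
        yield f"data: {chunk}\n\n"   # one event per chunk; client appends to screen
    yield "data: [DONE]\n\n"         # end-of-stream sentinel
```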
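    The rotation in step 4 can be sketched as picking the highest-weighted IPOP dimension not yet used in the session; the real system also keeps a sliding summary of prior outputs, which this toy version omits:

```python
IPOP = ["ideas", "people", "objects", "physical"]

def next_angle(used, weights):
    """Pick the best-weighted IPOP dimension not yet covered this session."""
    unexplored = [d for d in IPOP if d not in used]
    if not unexplored:                 # all four used: start a new rotation
        unexplored = IPOP
    return max(unexplored, key=lambda d: weights.get(d, 0))

# Three churches in a row, photographed by the same user:
w = {"ideas": 0.5, "people": 0.3, "objects": 0.1, "physical": 0.1}
session = []
for _ in range(3):
    session.append(next_angle(session, w))
# session is now ["ideas", "people", "objects"]: political history first,
# then the stonemason's story, then the building itself.
```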
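    Putting the stages together: text chunks are forwarded as they arrive, and TTS only starts once the full text exists, overlapping with the user reading. A sketch with stand-in async callables (none of these names are the real services):

```python
import asyncio

async def run_pipeline(image, analyze, generate, synthesize, send):
    """analyze(image) -> scene facts; generate(facts) -> async text chunks;
    synthesize(text) -> audio bytes; send(kind, payload) -> push to client."""
    facts = await analyze(image)            # ~2s vision analysis
    chunks = []
    async for chunk in generate(facts):     # ~5s generation, streamed
        chunks.append(chunk)
        await send("text", chunk)           # client renders while the LLM runs
    text = "".join(chunks)
    audio = await synthesize(text)          # ~3s TTS, while the user reads
    await send("audio", audio)
    return text, audio
```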

    There's no pre-built content database. The system generates from the image and user profile at request time, which means it handles subjects it's never encountered before, though quality degrades for very obscure or poorly lit subjects.

    The IPOP dimension system is the part I'm least sure about. Four dimensions felt right based on the Smithsonian research, but I'm curious whether finer granularity (splitting "Ideas" into "historical" vs "conceptual," for example) would produce meaningfully different outputs or just add noise.

    iOS. https://whats-that.app