- Qwen3-VL 8B creates a verbose description + keywords
- A simpler CLIP encoder builds a second set of tags
- Description is placed into an image RAG
- Keywords are written into the file name itself, separated by underscores
- Description/Tags/Keywords are all embedded in EXIF data on the image
I've got close to 30k images, so doing this gives me several ways to search (natural language, keywords, etc.) and retrieve images quickly.
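
A rough sketch of the file-name step in Python, in case it helps anyone. This is only one way to do it, not the exact script: the function name `keyword_filename`, the `max_len` cap, and the choice to hyphenate multi-word keywords (so underscores only ever separate keywords) are all assumptions made for illustration.

```python
import re
from pathlib import Path


def keyword_filename(original: str, keywords: list[str], max_len: int = 120) -> str:
    """Build a file name such as 'red-barn_snow.jpg' from a keyword list.

    Hypothetical helper: keywords are lowercased, de-duplicated, and joined
    with underscores; spaces inside a keyword become hyphens.
    """
    suffix = Path(original).suffix.lower()

    # Normalise each keyword to a filesystem-safe slug, keeping first-seen order.
    slugs: list[str] = []
    for kw in keywords:
        slug = re.sub(r"[^a-z0-9]+", "-", kw.lower()).strip("-")
        if slug and slug not in slugs:
            slugs.append(slug)

    # Append slugs until the name would exceed max_len (assumed limit).
    stem = ""
    for slug in slugs:
        candidate = f"{stem}_{slug}" if stem else slug
        if len(candidate) + len(suffix) > max_len:
            break
        stem = candidate

    # Fall back to the original stem if no usable keyword was found.
    return (stem or Path(original).stem) + suffix


if __name__ == "__main__":
    print(keyword_filename("IMG_0042.JPG", ["Red Barn", "snow", "Snow", "winter!"]))
```

With names built like this, a plain glob such as `*_snow_*` or any file-manager search already works as a keyword lookup, independent of the RAG index or the EXIF fields.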