My example was an image segmentation model. I managed to create a dataset of 100,000+ images and was training UNets and other advanced models on it, always reaching a good validation loss, but my data was simply not diverse enough and I faced a lot of issues in actual deployment, where the data distribution kept changing on a day-to-day basis. Then I tried DINOv2 from Meta, fine-tuned it on 4 images, and it solved the problem, handling all the variations in lighting etc. with far higher accuracy than I ever achieved. It makes sense: DINOv2 was trained on 100M+ images, and I would never be able to compete with that.
In this case, the company still needed my expertise, because Meta just released the weights and so someone had to set up the fine-tuning pipeline. But I can imagine a fine-tuning API like OpenAI's requiring no expertise outside of simple coding. If AI results depend on scale, it naturally follows that only a few well-funded companies will build AI that actually works, and everyone else will just use their models. The only way this trend reverses is if compute becomes so cheap and ubiquitous that everyone can achieve the necessary scale.
We would still need the 100M+ images with accurate labels. That work can be performed collectively and open sourced, but it must be maintained etc. I don't think it will be easy.
How did Chinese companies do it, or is it a fabricated claim? https://slashdot.org/story/24/12/27/0420235/chinese-firm-tra...
This change of not needing ML engineers is not so much about the models as it is about easy API access for fine-tuning a model, it seems to me?
Of course it's great that the models have advanced and become better and more robust, though.
Simply yeeting every "object of interest" into DINOv2 and running any cheap classifier on that was a game changer.
Not using it to create segmentations (there are YOLO models that do that, so if you need a segmentation you can get it in one pass), no, just to get a single vector representing each crop.
Our goal was not only to know "this is a traffic sign", but also to do multilabel classification like "has graffiti", "has deformations", "shows discoloration" etc. If you store those vectors it becomes pretty trivial (and hella fast) to pass them off to a bunch of data scientists so they can let loose all the classifiers in sklearn on that. See [1] for a substantially similar example.
[1] https://blog.roboflow.com/how-to-classify-images-with-dinov2
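A minimal sketch of that setup, assuming the Hugging Face DINOv2 checkpoint and made-up file names and labels (neither the thread nor the article includes code, so this is only illustrative):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
from sklearn.linear_model import LogisticRegression

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

def embed(path: str) -> list[float]:
    """Return one vector per (already cropped) object: the DINOv2 CLS token."""
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[:, 0].squeeze().tolist()

# Hypothetical labelled crops for a single label ("has graffiti"); for multilabel
# you would simply fit one cheap classifier per label on the same embeddings.
X = [embed(p) for p in ["sign_001.jpg", "sign_002.jpg", "sign_003.jpg"]]
y = [1, 0, 1]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([embed("sign_new.jpg")]))
```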
I was able to turn around a segmentation and classifier demo in almost no time because they gave me fast segmentation from a text description, and then I trained a YOLO model on the results.
Or does it likely just work on real world photos and cartoons and stuff?
Many LLM use cases could be solved by a much smaller, specialized model and/or a bunch of if statements or regexes, but training the specialized model and coming up with the if statements requires programmer time, an ML engineer, human labelers, an eval pipeline, MLOps expertise to set up the GPUs, etc.
With an LLM, you spend 10 minutes integrating with the OpenAI API, which is something any programmer can do, and get results that are "good enough".
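For concreteness, a sketch of what that 10-minute integration tends to look like (the ticket-classification task, labels, and model name are made up for illustration):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(ticket: str) -> str:
    """Bucket a support ticket with a prompt instead of a trained model or regexes."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Classify the ticket as 'billing', 'bug' or 'other'. Reply with the label only."},
            {"role": "user", "content": ticket},
        ],
    )
    return resp.choices[0].message.content.strip()

print(classify("I was charged twice this month."))  # expected: billing
```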
If you're extremely cash-poor, time-rich and have the right expertise, making your own model makes sense. Otherwise, human time is more valuable than computer time.
> LLMs and the platforms powering them are quickly becoming one-stop shops for any ML-related tasks. From my perspective, the real revolution is not the chat ability or the knowledge embedded in these models, but rather the versatility they bring in a single system.
Why use another piece of software if an LLM is good enough?
Also privacy. Do museum visitors know their camera data is being sent to the United States? Is that even legal (without consent) where the museum is located? Yes, visitors are supposed to be pointing their phone at a wall, but I suspect there will often be other people in view.
If I have a project with low enough lifetime inputs, I'm not wasting my time labelling data and training a model. That time could be better spent working on something else. As long as the evaluation is thorough, it doesn't matter. But I still like doing some labelling manually to get a feel for the problem space.
A con of this approach would be that it requires maintenance if they ever decide to change the illustration positions.
Embed a QR code or simply a barcode somewhere and you're done. Maybe hide it like a watermark so it doesn't show to the naked eye; doing some Fourier transform in the app wouldn't require a network connection or a lot of processing power.
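A purely illustrative sketch of the frequency-domain idea (the tag positions, image size, and artwork IDs are invented; a robust watermark scheme would also need geometric normalization and error correction):

```python
import numpy as np
from PIL import Image

# Hypothetical scheme: each artwork carries a unique spatial-frequency "tag"
# printed into it; the app checks which tag's FFT bin carries the most energy.
TAGS = {  # artwork id -> (row, col) in the shifted magnitude spectrum
    "car_01": (40, 90),
    "car_02": (60, 120),
}

def detect(frame: Image.Image) -> str:
    gray = np.asarray(frame.convert("L").resize((256, 256)), dtype=float)
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray)))
    scores = {art: spectrum[r, c] for art, (r, c) in TAGS.items()}
    return max(scores, key=scores.get)
```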
We ran a benchmark of our system against an LLM call and the LLM performed much better for so much cheaper, in terms of dev time, complexity, and compute. Incredible time to be working in this space, seeing traditional problems eaten away by new paradigms.
With the limited training data they have I'm surprised they don't mention any attempts at synthetic training data. Make (or buy) a couple museum scenes in blender, hang one of the images there, take images from a lot of angles, repeat for more scenes, lighting conditions and all 350 images. Should be easy to script. Then train YOLO on those images, or if that still fails use their embedding approach with those training images.
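A rough sketch of how such a script might look in Blender's Python API, assuming a plane textured with one of the illustrations is already placed at the origin and the default camera exists (everything here is illustrative, not from the article):

```python
import math
import bpy

scene = bpy.context.scene
cam = bpy.data.objects["Camera"]
scene.render.image_settings.file_format = "JPEG"

# Sweep the camera in an arc around the artwork at roughly eye height,
# rendering one still per viewpoint; repeat per scene and lighting setup.
for i, angle in enumerate(range(-60, 61, 15)):
    rad = math.radians(angle)
    cam.location = (3 * math.sin(rad), -3 * math.cos(rad), 1.6)   # ~3 m away
    cam.rotation_euler = (math.radians(90), 0.0, rad)             # face the artwork
    scene.render.filepath = f"//renders/view_{i:02d}.jpg"
    bpy.ops.render.render(write_still=True)
```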
> “ To address this limitation, we turned to data augmentation, artificially creating new versions of each image by modifying colors, adding noise, applying distortion, or rotating images. By the end, we had generated 600 augmented images per car.”
What I am talking about is that they want to recognize scenes containing the images, but only have the images as training data. They have a good idea what those scenes will look like. Going there to take actual training pictures was evidently not viable, but generating approximations of them might have been.
One question I had was: knowing how difficult it was to train the model with the base images, and given that the client didn't have time to photograph them, did you consider flying someone out to the museum for a couple of days to photograph each illustration from several angles with the actual lighting throughout the day? Or potentially hiring a photographer near the museum to do that? It seems like a round-trip ticket plus a couple of nights in a hotel could have saved a lot of headache, providing more images to turn into synthetic training data. Even if you still had to resort to using 4o as a tiebreaker, it could be that you'd only need to present two candidates, as the third might have a much lower similarity score than the second. Good write-up either way.
Finding new geoglyphs from known examples.
And which would those be?
Anyway, great work, and thank you for taking the time to share it!
1. Measure the distance from the wall (standard image processing)
2. Use the rotations of the gyro sensors on the phone to conclude which car is being looked at
I wonder if this could be as accurate though
You'd somehow have to generate an embedding for each image, I presume.
For choosing a model, the article mentions the AWS Titan multimodal model, but you’d have to pay for API access to create the embeddings. Alternatively, self-hosting the CLIP model [0] to create embeddings would avoid API costs.
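A minimal sketch of the self-hosted option using the Hugging Face CLIP weights (the model name and file path are just examples):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("illustration.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    emb = model.get_image_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize so dot product = cosine similarity
```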
Follow-up question: Would the embeddings from the llama3.2-vision models be of higher quality (contain more information) than the original CLIP model?
The llama vision models use CLIP under the hood, but they add a projection head to align with the text model and the CLIP weights are mutated during alignment training, so I assume the llama vision embeddings would be of higher quality, but I don’t know for sure. Does anybody know?
(I would love to test this quality myself but Ollama does not yet support creating image embeddings from the llama vision models - a feature request with several upvotes has been opened [1].)
We have a good open-source repo here with a ColPali implementation: https://github.com/tjmlabs/ColiVara
I see the ColiVara-Eval repo in your link. If I understand correctly, ColQwen2 is the current leader followed closely by ColPali when applying those models for RAG with documents.
But how do those models compare to each other and to the llama3.2-vision embeddings when applied to, for example, sentiment analysis for photos? Do benchmarks like that exist?
The ColPali paper (1) does a good job explaining why you don't really want to use vision embeddings directly, and why you are much better off optimizing for RAG with a ColPali-like setup. Basically, a plain vision embedding is not optimized for textual understanding: it works if you are searching for the word "bird" and for images of birds, but it doesn't work well to pull up a document that is a paper about birds.
Congrats on shipping.
When "embeddings" are used to perform a closeness test, you are using a pretrained computer vision model behind the scenes. It is doing the vast majority of the work, filtering hundreds of images down to a handful.
Visual LLMs work on textual descriptions, which end up far too close for similar images. Regardless, more power to the team for finding something that works for them.
SOTA V-LLMs do not work on textual descriptions.
And someone who's not OpenAI buying into this naming convention is just doing unpaid propaganda.
> give the appearance of agi
Can you point out where specifically they're doing this? Best I can tell, they give a decent summary of the effectiveness of multi-modal LLMs with support for vision, and then talk about using them to solve an incredibly narrow task. The only diction I could see that hints at "agi" is when they describe the versatility of this approach; but how could you possibly argue against that? It's objectively more versatile (if also wasteful and more expensive).
My understanding was that there was a traditional CV library that was effectively producing an image-to-text description before passing it to the LLM. But the more I think about it, even that method would involve training for image detection to the point where objects are recognized as images, not as tokens.
So the GPT product is no longer an LLM, or purely text-based.
Can't say much for sure at this point with closed source; we will probably see competition catch up eventually and have more info then. At that point OpenAI will eventually release the text2img separately and dispense with the mysticism and AGI pretension.
My guess is that this is a separate image-to-text model (or image+text model) and it is slapped onto the main LLM code.
I don't think text is just another modality; it will probably always be the core.
I don't have a source on something this strategic and subjective, I just have a finger on the pulse: their robot demo that does laundry, their consistent talk about AGI, their mention of power-seeking in docs, their attempt to raise trillions for chip factories, their transition to for-profit. They are under huge pressure to be THE monopoly, and their risk is that GPT turns out to be a text-based local maximum and that intelligence is not a Sapir-Whorfian phenomenon.
P.S.: early docs from 2023 refer to the img2txt submodel as gpt4v; that's what we should call the submodule, in my opinion (if it is in fact the same piece of tech).
"You are given a reference and three candidates; which one of the candidates do you think is a match to the reference? Only output its identifier, or a code when none is found."
Not exactly that but something along those lines.
Then one "user" message per car (reference + candidates) with image + text indicating the type (reference or candidate) and an identifier (can be as simple as the index for the candidates).
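Something along these lines, sketched against the OpenAI Chat Completions API (file names, model, and the exact wording are placeholders, not the author's actual code):

```python
import base64
from openai import OpenAI

client = OpenAI()

def image_part(path: str) -> dict:
    """Encode a local image as a base64 data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

system_prompt = (
    "You are given a reference and three candidates. Which candidate matches "
    "the reference? Output only its identifier, or NONE if no candidate matches."
)

content = [{"type": "text", "text": "Reference:"}, image_part("reference.jpg")]
for i, path in enumerate(["cand_0.jpg", "cand_1.jpg", "cand_2.jpg"]):
    content += [{"type": "text", "text": f"Candidate {i}:"}, image_part(path)]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "system", "content": system_prompt},
              {"role": "user", "content": content}],
)
print(resp.choices[0].message.content)  # e.g. "1" or "NONE"
```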
The inaccuracy threshold seems fine for a museum, but in enterprise operations inaccuracy can mean lost revenue or, worse, lost trust and future business.
I'm struggling with some more advanced AI use cases in my collaborative work platform. I use AI (LLMs) for things like summarization, communication, and finding information using embeddings. However, sometimes it is completely wrong.
To test this I spent a few days (doing something unrelated) building up a recipes database and then trying to query it with things like "I want to make a quick and easy drink". I ran the data through classification and other steps to get the data as clean as I could. The results would still include fries or some other food when I was asking for drinks.
So I have to ask what the heck am I doing wrong? Again, for things like sending messages and reminders or coming up with descriptions, and finding old messages that match some input - no problem.
But if I have data that I'm augmenting with additional information (trying to attach more information that may be missing but is possible to deduce from what's available) to try and enable richer workflows, I'm always getting bitten in the butt. I feel like if I can figure this out I can provide way more value.
Not sure if what I said makes sense.
Not sure either. But here is the lesson from this and other sources: to improve the output, use a multi-step approach. Get the first answer (one or more) and pass it through one or more verification steps, like "for this *, is this * relevant?" Or: is it correct, does it solve the problem, etc. Then select the answer with the best scores on those filters. You see, it's very similar to the approach in the original post: get candidates first, then filter.
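A minimal sketch of that filter step (the query and candidates are made up; the first pass is assumed to come from your existing embedding search):

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

query = "I want to make a quick and easy drink"
candidates = ["Iced latte", "Oven fries", "Lemonade"]  # e.g. top hits from embedding search

# Verification pass: ask the model to judge each candidate against the query,
# then keep only the ones it calls relevant.
kept = [c for c in candidates
        if ask(f'For the request "{query}", is "{c}" a relevant result? Answer yes or no.')
           .lower().startswith("yes")]
print(kept)  # ideally the fries get filtered out
```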
I have a multi-step workflow already, but it is getting slow now, going through all those steps. And sometimes, if the result is wildly incorrect, it feels really bad.
Again, I'm not an expert in this field but I am trying to learn and improve my product.