Basically doc2vec and cosine similarity. Totally nonsensical matching outputs, to the point that matching on title-tag vectors or a precis was better, so now I'm curious whether we just did something wrong…
So if the resulting doc-to-doc similarities seemed nonsensical to you, there was likely a process error in model training or application.
But ever since learning about word2vec, I've been thinking that there must be a better way. "Push" a section a bit in the direction of the "formal" vector. Add a pinch of "brief", dial up the "humour" vector. I think it could make a very controllable and efficient writing tool.
[0] https://www.anthropic.com/research/persona-vectors [1] https://arxiv.org/abs/2507.21509
Not quite the usable commercial writing tool I want, but it shows that extracting a vector for a concept and applying it to the embedding is very useful.
It's also potentially a very effective AI alignment tool, as Anthropic mentions: steering or restricting the model's embedding loop instead of convincing it with a convoluted system prompt.
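For what it's worth, here's a minimal sketch of the extract-and-apply idea at the embedding layer, using contrastive example texts to get a "formality" direction. The model choice, example sentences, and steering strength are all illustrative, not anything from the linked papers:

    # Sketch: extract a "formality" direction from contrast examples and nudge
    # a section embedding along it. Everything here is illustrative.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedding model

    formal = ["We regret to inform you that the request cannot be accommodated.",
              "Please find the relevant documentation attached for your review."]
    casual = ["Sorry, can't do that one.",
              "Here's the doc, take a look."]

    def embed(texts):
        return model.encode(texts, normalize_embeddings=True)

    # Concept vector: difference of the mean embeddings of the two contrast sets.
    formality = embed(formal).mean(axis=0) - embed(casual).mean(axis=0)
    formality /= np.linalg.norm(formality)

    section = embed(["the api returns a 404 if the id is wrong"])[0]

    # "Dial up" formality by pushing the section embedding along the direction.
    alpha = 0.3  # steering strength, tune by eye
    steered = section + alpha * formality
    steered /= np.linalg.norm(steered)

The nudged vector is just a new point in embedding space, so on its own it's only useful as a query or comparison target (e.g. retrieving more formal phrasings); steering actual generation, as in the persona-vectors work, happens inside the model's activations rather than on a standalone text embedding.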
--
There are a few easy applications.
* When surfacing relevant documents, you can keep a list of the previously visited documents and boost in the "direction" the customer is headed (could be an average of the previous N docs, or weighted towards frequency; see the sketch after this list). But then you're just building a worse recsys for something where latency probably isn't that critical.
* If you know that for every feature you release you need an API doc, an FAQ, and usage samples for the different workflows or verticals you're targeting, you can represent each of these as f(doc) + f(topic) and search the existing doc set for matches. But then you can get much more deterministic workflows just from applying structure.
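A minimal sketch of the first idea, assuming unit-normalized embeddings and a plain in-memory matrix of doc vectors (the function and variable names are illustrative):

    # Sketch: bias retrieval toward the "direction" a reader is heading, by
    # mixing the current query embedding with the mean of the last N visited docs.
    import numpy as np

    def normalize(v):
        return v / np.linalg.norm(v)

    def boosted_query(query_vec, history_vecs, beta=0.25):
        """history_vecs: embeddings of the last N documents the reader opened."""
        if not len(history_vecs):
            return normalize(query_vec)
        drift = normalize(np.mean(history_vecs, axis=0))
        return normalize(query_vec + beta * drift)

    def top_k(query_vec, doc_vectors, k=5):
        # doc_vectors: (num_docs, dim), rows unit-normalized, so dot == cosine
        scores = doc_vectors @ query_vec
        return np.argsort(-scores)[:k]

The beta mixing weight is the whole "recsys, but worse" part: too low and it does nothing, too high and the reader's history drowns out the actual query.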
It's nice to have a super flexible tool in the toolbox, but I think a lot of text-based embedding applications (especially on out-of-domain data like long, unchunked technical docs) are better off being something else if you have the time.
This one sounds promising to me, thanks for the suggestion. We technical writers often build out "docs completeness" spreadsheets where we track how completely each product feature is covered, exactly as you described. E.g. the rows are features, column B is "Reference", column C is "Tutorial" etc. So cell B1 would contain the deeplink to the reference for some particular feature. When we inherit a huge, messy docs set (which is fairly common) it can take a very long time to build out a docs completeness dashboard. I think the embeddings workflow you're suggesting could speed up the initial population of these dashboards a lot.
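In case it's useful, here's a minimal sketch of that initial-population step, assuming the inherited docs set has already been scraped into (deeplink, text) pairs. The model choice, example URLs, and the 0.5 threshold are all placeholders to tune:

    # Sketch: pre-populate a docs-completeness grid by matching each
    # (feature, doc type) pair against embeddings of the existing doc set.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model

    def embed(texts):
        return model.encode(texts, normalize_embeddings=True)

    features = ["Webhooks", "Bulk export", "SSO"]      # rows of the dashboard
    doc_types = ["API reference", "Tutorial", "FAQ"]   # columns

    # Existing docs set: (deeplink, text) pairs scraped from the inherited site.
    docs = [("https://example.com/docs/webhooks-ref", "Webhooks API reference ..."),
            ("https://example.com/docs/sso-tutorial", "Setting up SSO step by step ...")]
    doc_vecs = embed([text for _, text in docs])

    feat_vecs = embed(features)
    type_vecs = embed(doc_types)

    grid = {}
    for feat, fv in zip(features, feat_vecs):
        for dtype, tv in zip(doc_types, type_vecs):
            q = fv + tv                  # rough f(feature) + f(doc type)
            q /= np.linalg.norm(q)
            scores = doc_vecs @ q        # cosine, since everything is unit length
            best = int(np.argmax(scores))
            # Fill the cell only when the best match clears a threshold;
            # empty cells are candidate coverage gaps to review by hand.
            grid[(feat, dtype)] = docs[best][0] if scores[best] > 0.5 else None

You'd still want a human pass over the result, but it turns "read everything and fill in the spreadsheet" into "spot-check the proposed links and chase the empty cells."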
voyage-3-large: 0.54
voyage-code-3: 0.62
qwen3-embedding:4b: 0.71
embeddinggemma: 0.84
voyage-3.5-lite: 0.94
text-embedding-3-small: 0.97
voyage-3.5: 1.01
text-embedding-3-large: 1.13
Shocked by the apparently bad performance of OpenAI's SOTA model. I've also always had a gut feeling that `voyage-3-large` may secretly be the best embedding model out there. Have I been vindicated? Make of it what you will... Also, `qwen3-embedding:4b` is my current favorite for local RAG, for good reason...
That it ever worked was simply that, among the universe of candidate answers, the right answer was closer to the arithmetic-result-point than other candidates – not necessarily close on any absolute scale. Especially in higher dimensions, everything gets very angularly far from everything else - the "curse of dimensionality".
But the relative differences may still be just as useful/effective. So the real evaluation of effectiveness can't be done with the raw value diff(king - man + woman, queen) alone: it needs to check whether that value is smaller than the distance to every other alternative to 'queen'.
(Also: canonically these exercises were done as cosine-similarities, not Euclidean/L2 distance. Rank orders will be roughly the same if all vectors normalized to the unit sphere before arithmetic & comparisons, but if you didn't do that, it would also make these raw 'distance' values less meaningful for evaluating this particular effect. The L2 distance could be arbitrarily high for two vectors with 0.0 cosine-difference!)
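A minimal sketch of that ranking check (the model and candidate vocabulary are illustrative; with unit-normalized vectors, cosine and L2 give the same ranking, since ||a - b||^2 = 2 - 2*cos(a, b)):

    # Sketch: instead of looking at the raw distance diff(king - man + woman, queen),
    # rank every candidate word by distance to the analogy point and see where
    # "queen" lands. The original words stay in the candidate pool on purpose.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    vocab = ["queen", "king", "woman", "man", "prince", "princess", "throne", "crown"]
    vecs = model.encode(["king", "man", "woman"] + vocab, normalize_embeddings=True)
    king, man, woman = vecs[:3]
    cand = vecs[3:]

    target = king - man + woman
    target /= np.linalg.norm(target)  # re-normalize before comparing

    sims = cand @ target              # cosine similarity on unit vectors
    for word, s in sorted(zip(vocab, sims), key=lambda x: -x[1]):
        print(f"{word:10s} cos={s:.3f}  L2={np.sqrt(max(0.0, 2 - 2 * s)):.3f}")
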
There you go: the closest 3 words (by L2) to the output vector for the following models, drawn from the 2,265 most common spoken English words, which also include "queen":
voyage-3-large: king (0.46), woman (0.47), young (0.52), ... queen (0.56)
ollama-qwen3-embedding:4b: king (0.68), queen (0.71), woman (0.81)
text-embedding-3-large: king (0.93), woman (1.08), queen (1.13)
All embeddings are normalized to unit length, so the L2 distances are comparable (they're monotonically related to cosine similarity).

> The widely known example only works because the implementation of the algorithm will exclude the original vector from the possible results!
I saw this issue in the "same topic, different domain" experiment when using EmbeddingGemma with the default task types. But when using custom task types, the vector arithmetic worked as expected. I didn't have to remove the original vector from the results or control for that in any way. So while the criticism is valid for word2vec, I'm skeptical that modern embedding models still have this issue.
Very curious to learn whether modern models are still better at some analogies (e.g. male/female) and worse at others, though. Is there any more recent research on that topic? The linked article is from 2019.