Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.
It's like the gardener at one of the Oxford colleges said - it's really easy to create these perfect lawns, just turn up every day and trim and water it - for a couple hundred years.
i know someone who worked in what we might call the 'astroturfing' space within the entertainment industry. after having a few discussions with him and with things like this[0] becoming more known, it's really difficult to afford any assumption of organic intent when money is on the line - especially at the scale that microsoft works at compared to something as comparatively quaint as the music industry.
[0] https://www.wired.com/story/geese-chaotic-good-marketing-ind...
https://github.com/microsoft/VibeVoice/commit/e73d1e17c3754f...
which is microsoft for "we removed two dead links". AI innovation knows no limits!
[1] https://doublepulsar.com/microsoft-vibing-capturing-screensh...
moving on....
I'm really disappointed with this model, to say the least.
I feel like a recurring pattern with Microsoft is to create something quickly, market it aggressively and push for everyone to use it immediately, and only once it is installed everywhere do people suddenly realize how terrible it is, but it's too late to change.
Vista on the other hand...
"Open-source" can be "anything you can go out and grab a copy of and use", which doesn't give you much legal certainty about any of it; reserve "free software" for the other, better thing.
AGPLv3 partially solves the issue by blocking people like Google from using it to build proprietary cloud services that take away their users' freedom. (It still doesn't solve the problem where providers use network effects to achieve the same end game.)
What in the world do you mean?
On the spectrum of "things that take away user freedom", withholding the source code is bad. Withholding the source code, the binaries and physical access to the computer is obviously much worse! This latter business model is heavily subsidized by GPLv3.
The first sentence of the GNU manifesto says this, and a few later sections of the document elaborate on the point:
https://www.gnu.org/gnu/manifesto.html
Note, in particular, footnote [1], which explains that it's OK for distributors to ask for payment, but that it's never OK for users to have to ask for permission to use the software, and the section "Why I Must Write GNU".
Since then, software service monopolies became common, and all of the most end-user-hostile systems on earth rely heavily on the GNU system. At this point, we're paying for permission to use those services with our money, our data, our democracy, etc.
I certainly cannot give you permission to use any of the GPLed services that I have used, or that I've been paid to extend. Therefore, I say the free software movement has lost its way.
I care that I know what I can DO with the project when I see it described as "open source".
Yes, the first of which is that you should be able to build it from source. Which requires the source code, and in this case data.
The problem with requiring "build from scratch" for open source models is that the number of interesting models with training data that can be openly licensed is close to zero.
If you trained your model on an unlicensed scrape of the web you can't release the data under an open source license!
The Open Source Initiative have a bunch of their thinking around this in their FAQ for the "Open Source AI definition": https://opensource.org/ai/faq#isn-t-training-data-required-t...
By this definition almost any binary can be "open source" since hex editors exist. (Or more usefully, you can use ghidra et al. to do more interesting changes.) I know GPL has a very specific view of things, but I'd like to quote an excerpt that I think is generally applicable from https://www.gnu.org/licenses/gpl-3.0.html -
> The “source code” for a work means the preferred form of the work for making modifications to it. “Object code” means any non-source form of a work.
Which is why I'm fine with "open weights", because that's saying the object code is under an open license.
> The problem with requiring "build from scratch" for open source models is that the number of interesting models with training data that can be openly licensed is close to zero.
So? If the number of open source models is zero, then the number of open source models is zero.
https://huggingface.co/allenai/OLMo-2-0325-32B
Here's one of their training mixes: https://huggingface.co/datasets/allenai/dolma3_pool - which includes 8 trillion tokens from Common Crawl.
Maybe we should have a little cue card for models: vendor/name, size, open weights, open source, permissive license.
It’s simple enough an idea.
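A minimal sketch of what such a cue card could look like as a data structure, assuming the five fields listed above. The class name `ModelCard`, the field names, and the example values are all hypothetical, and what counts as "open source" is left to whoever fills in the card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCard:
    """Hypothetical cue card; fields mirror the list in the comment above."""

    vendor: str
    name: str
    size: str                 # free-form, e.g. a parameter count
    open_weights: bool        # weights can be downloaded
    open_source: bool         # meaning is up to whoever fills in the card
    permissive_license: bool  # license allows broad reuse

    def summary(self) -> str:
        """Render the card as a single line of text."""

        def mark(flag: bool) -> str:
            return "yes" if flag else "no"

        return (
            f"{self.vendor}/{self.name} ({self.size}): "
            f"open weights: {mark(self.open_weights)}, "
            f"open source: {mark(self.open_source)}, "
            f"permissive license: {mark(self.permissive_license)}"
        )


# Placeholder values, not a claim about any real model.
card = ModelCard("example-vendor", "example-model", "7B", True, False, True)
print(card.summary())
```

Keeping the three openness flags separate means a model can be marked as open weights without also being marked as open source.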
That said, I entirely agree that MS is misrepresenting their openness here, which isn’t in the least surprising.
[1]: https://opensource.org/licenses [2]: https://opensource.org/osd
This ship has sailed. It’s now in the same category as hacker/cracker and the pronunciation of GIF.
I think it's right to push back whenever a huge tech corporation tries to build goodwill by falsely using terms like "open source."
Ironically, the roots of the Open Source movement are a direct response to the Free Software movement, largely because it was considered too ideological and unfriendly to corporate interests (i.e. monetization).
Neither did the inventors of AI. A third party published a document after corporations went with open weights = open source and a spoiler bloc in FOSS wanted all training data published.
> it's right to push back whenever a huge tech corporation tries to build goodwill by falsely using terms like "open source."
I think it’s counterproductive. Most people only see a squabble, which makes any ensuing points from the open-source community seem silly. Those who care can continue using the more-precise language they choose to.
Put another way, there is a difference between using terms like cracker and fully spelling out cryptocurrency, and telling people who use hacker and crypto more loosely that they’re wrong. They aren’t wrong and that isn’t meaningful feedback. At the same time, the person using the precise language isn’t wrong either.
> I think it’s counterproductive. Most people only see a squabble, which makes any ensuing points from the open-source community seem silly.
Only to people that truly don't care whether something's open source. In which case, Microsoft using the term (correctly or incorrectly) won't change their perception.
But the people who do care won't like to be misled by Microsoft. There's a reason the term is right in the headline: people respond to it.
I wish I had time to come up with a better example, but it's like if a AAA game company says they've released a "native Linux build," but really they're just packaging the Windows build with Wine.
99% of people won't care, neither about the news nor the deception. But for that last 1%, any goodwill garnered with the headline would be gone, and the game company are the ones who look foolish, not the people calling them out.
Because the other assumption I could have gone with is the less charitable take that they know GIS with a soft G doesn't sound like jizz, but they were just looking for a crude way to mock the soft G.
Some men just want to watch the world burn. At least it's mostly harmless fun anyway. It's even funnier when they bring up how my name is pronounced in defense of "jiff" and I tell them, so you're calling me the expert in "Gi" pronunciation then? :)
Way early on (spring 2023) people tried to stop it, but no luck.
A delusion is a false mental belief.
Basically, hallucinations are false external things, and delusions are false internal things. You hallucinate a pink elephant; you delude yourself into thinking Trump won 2020.
Then there's "Smart" in front of Car, Phone, TV, and so on... Meaning different things.
I do think "Open Weight" should be more commonly used. On the other hand, there are definitely communities that spring up and build the training and inference infrastructure around open models.
Maybe open inference?
But we often also get source code for fine-tuning the model.
So maybe it's closer to open source than to anything else?
Isn't it a bit like not calling a game open source because the engine tooling used to make it isn't open source and they didn't publish .psd files with asset designs?
As I teach this stuff to people newer to this tech, it's probably just easier and more helpful to refer to the wide array of "stuff you can just download and use yourself" as "open-source" and then after that, go deeper and talk about why Stallman was right, how "Free Software" was first, etc.
> “I think we see a path now where the world gets much more abundant and much better every year.” – Sam Altman
When explanations get posted directly in HN comments, I imagine someone somewhere in the world is able to learn in spite of their Internet restrictions/firewalls.
People will also post their own interpretations in response to comments, and quickly find out they missed something.
… But if you try to automate it, like include a summary under every HN post, you encourage laziness too much and are pre-chewing too heavily. Some balance here.
[on topic]
(OK I’m done making excuses, time to read the article… thanks for the encouragement!)
I thought this was not explained in the readme directly, but in fact I missed it. I wasn’t going to read Microsoft's entire changelog! But it was substantive, thanks to sibling commenter:
“2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.”
> We need to get beyond the arguments of slop vs sophistication and develop a new equilibrium in terms of our “theory of the mind” that accounts for humans being equipped with these new cognitive amplifier tools as we relate to each other
I think everyone else is relating to
https://futurism.com/artificial-intelligence/microsoft-bans-...
https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...
My conclusion back then (which came only from shallow research on the topic and zero real experience, mind you) was that Whisper + Pyannote was the "stable" approach.
Have the VibeVoice, Voxtral, Qwen or the Nemo solutions caught up in segmentation and speaker recognition?
- Cohere Transcribe (self hosted)
- Grok Speech To Text (they provide an API, only $0.10/hr!)
They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?
The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck