Quick Demo Video (50s): https://www.youtube.com/watch?v=HM_IQuuuPX8
The goal is to get closer to natural conversation speed. It uses audio chunk streaming over WebSockets, RealtimeSTT (based on Whisper), and RealtimeTTS (supporting engines like Coqui XTTSv2/Kokoro) to achieve around 500ms response latency, even when running larger local models like a 24B Mistral fine-tune via Ollama.
Key aspects: Designed for local LLMs (Ollama primarily, OpenAI connector included). Interruptible conversation. Smart turn detection to avoid cutting the user off mid-thought. Dockerized setup available for easier dependency management.
It requires a decent CUDA-enabled GPU for good performance due to the STT/TTS models.
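For anyone wondering what "audio chunk streaming over WebSockets" looks like in practice, here is a minimal sketch of the client side. The endpoint path, sample rate, and chunk size are my assumptions, not the repo's actual protocol:

```python
# Minimal sketch of streaming raw mic audio to a server over a WebSocket.
# Endpoint, sample rate, and chunk size are illustrative assumptions.
import asyncio
import sounddevice as sd
import websockets

SAMPLE_RATE = 16_000          # 16 kHz mono PCM
CHUNK_MS = 100                # send ~100 ms of audio per message

async def stream_mic(uri: str = "ws://localhost:8000/ws"):
    frames = SAMPLE_RATE * CHUNK_MS // 1000
    async with websockets.connect(uri) as ws:
        with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1,
                               dtype="int16", blocksize=frames) as mic:
            while True:
                chunk, _overflowed = mic.read(frames)
                await ws.send(bytes(chunk))   # server side runs STT -> LLM -> TTS

asyncio.run(stream_mic())
```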
Would love to hear your feedback on the approach, performance, potential optimizations, or any features you think are essential for a good local voice AI experience.
The code is here: https://github.com/KoljaB/RealtimeVoiceChat
In case it's not clear, I'm talking about the models referenced here. https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/a...
I tried to select kokoro in the python module but it says in the logs that only coqui is available. I do have to say the coqui models sound really good, it's just the type of voice that puts me off.
The default prompt is also way too "girlfriendy" but that was easily fixed. But for the voice, I simply don't know what the other options are for this engine.
PS: Forgive my criticism of the default voice but I'm really impressed with the responsiveness of this. It really responds so fast. Thanks for making this!
https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/a...
Create a subfolder in the app container: ./models/some_folder_name
Copy the files from your desired voice into that folder: config.json, model.pth, vocab.json and speakers_xtts.pth (you can copy the speakers_xtts.pth from Lasinya, it's the same for every voice).
Then change the specific_model="Lasinya" line in audio_module.py into specific_model="some_folder_name".
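For reference, the end result is something like this (a sketch only; the other arguments to CoquiEngine in audio_module.py may differ, and local_models_path is my assumption for how ./models is wired up):

```python
# Sketch of the relevant part of code/audio_module.py after the change.
from RealtimeTTS import CoquiEngine

engine = CoquiEngine(
    specific_model="some_folder_name",   # was "Lasinya"; must match the folder under ./models
    local_models_path="./models",        # assumed: the folder holding config.json, model.pth, vocab.json, speakers_xtts.pth
)
```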
If you change TTS_START_ENGINE to "kokoro" in server.py it's supposed to work. What happens then? Can you post the log message?
I didn't realise that you custom-made that voice. Would you have some links to other out-of-the-box voices for coqui? I'm having some trouble finding them. From the demo page, it seems the idea with that engine is that you clone someone else's voice, because I don't see any voices listed. I've never seen it before.
And yes, I switched to Kokoro now; I thought it was the default already, but then I saw there were 3 lines configuring the same thing. So that's working. Kokoro isn't quite as good as coqui though, which is why I'm wondering about that. I also used Kokoro on openwebui and I wasn't very happy with it there either. It's fast, but some pronunciation is weird. Also, it would be amazing to have bilingual TTS (English/Spanish in my case). And it looks like Coqui might be able to do that.
Coqui can do 17 languages. The problem with the RealtimeVoiceChat repo is turn detection: the model I use to determine if a partial sentence indicates a turn change is trained on an English corpus only.
2025-05-05 20:53:15,808] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
Error loading model for checkpoint ./models/Lasinya: This op had not been implemented on CPU backend.
The HIP backend can have a big prefill speed boost on some architectures (high-end RDNA3 for example). For everything else, I keep notes here: https://llm-tracker.info/howto/AMD-GPUs
(1) I assume these things can do multiple languages
(2) Given (1), can you strip all the languages you aren't using and speed things up?
I'd say probably not. You can't easily "unlearn" things from the model weights (and even if you could, that alone wouldn't help). You could retrain/finetune the model heavily on a single language, but again, that alone does not speed up inference.
To gain speed you'd have to bring the parameter count down and train the model from scratch on a single language only. That might work, but it's also quite probable that it introduces other issues in the synthesis. In a perfect world the model would use all those "free" parameters, no longer needed for other languages, for better synthesis of the single trained language. That might be true to a certain degree, but it's not exactly how AI parameter scaling works.
The core innovation is happening in TTS at the moment.
A couple of questions:
- Any thoughts about wake word engines, to have something that listens without consuming resources all the time? The landscape for open solutions doesn't seem good.
- Any plan to allow using external services for STT/TTS, for people who don't have a 4090 ready (at the cost of privacy, via SaaS providers)?
A good analogy here is to think of the computer (assistant) as another person in the room, busy with their own stuff but paying attention to the conversations happening around them, in case someone suddenly requests their assistance.
This, of course, could be handled by a more lightweight LLM running locally and listening for explicit mentions/addressing the computer/assistant, as opposed to some context-free wake words.
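A minimal sketch of that idea, assuming a local Ollama server on the default port and some small model pulled locally (the model name and the YES/NO protocol are just my assumptions):

```python
# Sketch: ask a small local model whether an utterance is addressed to the assistant.
import json
import urllib.request

def is_addressed_to_assistant(transcript: str) -> bool:
    prompt = (
        "You monitor a household conversation. Answer only YES or NO:\n"
        f"Is the following utterance addressed to the home assistant?\n\n{transcript}"
    )
    payload = {"model": "qwen2.5:0.5b", "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())["response"]
    return answer.strip().upper().startswith("YES")
```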
You have a wake word, but it can also speak to you based on automations. You come home and it could tell you that you're out of milk, and with a holiday coming up you should probably go shopping.
And having this as a small hardware device should not add meaningful latency.
Malware, bugs etc can happen.
And I also might not want to disable it for every guest either.
* unless you put the AI on a robot body, but that's then your own new and exciting problem.
I'm curious how fast it will run if we can get this running on a Mac. Any ballpark guess?
Which models are running in which places?
Cool utility!
Hopefully we get an open weights version of Sesame [1] soon. Keep watching for it, because that'd make a killer addition to your app.
I built something almost identical last week (closed source, not my IP) and I recommend: NeMo Parakeet (even faster than insanely_fast_whisper), F5-TTS (fast + very good quality voice cloning), and Qwen3-4B for the LLM (amazing quality).
Today's issue is that my Python version is 3.12 while the project requires >=3.9,<3.12. Installing Python 3.11 from the official website does nothing; I give up. It's a shame that the amazing work done by people like the OP gets underused because of this mess outside of their control.
"Just use docker". Have you tried using docker on windows? There's a reason I never do dev work on windows.
I spent most of my career in the JVM and Node, and despite the issues, never had to deal with this level of lack of compatibility.
I feel the same way when installing some Python library. There are a bunch of ways to manage dependencies, and I wish it were more standardized.
I prefer miniconda, but venv also does the job.
I finally added two scripts to my path for `python` and `pip` that automatically create and activate a virtual env at `./.venv` if there isn't one active already. It would be nice if something like that was just built into pip so there could be a single command to run like Ruby has now with Bundler.
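For the curious, a rough sketch of what such a wrapper can look like (my reconstruction, not the parent's actual scripts; save it as both `python` and `pip` somewhere earlier on PATH):

```python
#!/usr/bin/env python3
# Bootstrap ./.venv if it doesn't exist, then hand off to the venv's own binary.
import os
import subprocess
import sys

venv = os.path.join(os.getcwd(), ".venv")
if not os.path.isdir(venv):
    subprocess.run([sys.executable, "-m", "venv", venv], check=True)

name = os.path.basename(sys.argv[0])            # "python" or "pip", whichever was invoked
target = os.path.join(venv, "bin", name)
os.execv(target, [target, *sys.argv[1:]])       # replace this process with the venv binary
```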
However, sometimes repos require system level packages as well. Tried to run TRELLIS recently and gave up after 2h of tinkering around to get it to work in Windows.
Also, whenever I try to run some new repo locally, creating a new virtual environment takes a ton of disk space due to CUDA and PyTorch libraries. It adds up quickly to 100s of gigs since most projects use different versions of these libraries.
</rant> Sorry for the rant, can't help myself when it's Python package management...
Someone needs to make an LLM agent that just handles Python dependency hell.
If you just want to use windows, pyenv-win exists and works pretty well; just set a local version, then instantiate your venv.
uv does certainly feel like the future, but I have no interest in participating in a future VC rugpull.
Glad for this thread though since it looks like there's some tricks I haven't tried, plus since it seems a lot of other people have similar issues I feel less dumb.
Sadly it appears that people in the LLM space aren't really all that good at packaging their software (maybe, on purpose).
This seems to be somewhat of a Python side effect; the same goes for almost any Python project thrown together by people who haven't spent 10% of their life fighting dependency management in Python.
But I agree with uv being the best way. I'm not a "real" Python programmer, similar boat to the parent in that I just end up running a bunch of Python projects for various ML things, and also create some smaller projects myself. I've tried conda, micromamba, uv, and a bunch of stuff in-between; most of them break at one point or another, meanwhile uv gives me the two most important things in one neat package: flexible Python versions depending on the project, and easy management of venvs.
So for people who haven't given it a try yet, do! It does make using Python a lot easier when it comes to dependencies. These are the commands I tend to use according to my history; maybe it's useful as a sort of quickstart. I started using uv maybe 6 months ago, and this is a summary of literally everything I've used it for so far.
# create new venv in working directory with pip + specific python version
uv venv --seed --python=3.10
# activate the venv
source .venv/bin/activate
# on-the-fly install pip dependencies
uv pip install transformers
# write currently installed deps to file
uv pip freeze > requirements.txt
# Later...
# install deps from file
uv pip install -r requirements.txt
# run arbitrary file with venv in path etc
uv run my_app.py
# install a "tool" (like global CLIs) with a specific python version, and optional dependency version
uv tool install --force --python python3.12 aider-chat@latest
I tried to figure out why anyone would use pyproject.toml over requirements.txt (granted, they're just installing typical dependencies) and didn't come up with any good answer. Personally I haven't had any issues with requirements.txt, so I'm not sure what pyproject.toml would solve. I guess I'll change when/if I hit some bump in the road.
Astral uv and poetry both maintain the pyproject.toml for you -- and as a bonus, they maintain the virtualenv underneath.
Then for the complete python newbs, they can run 'uv sync' or 'poetry install' and they don't have to understand what a virtualenv is -- and they don't need root, and they don't have to worry about conflicts, or which virtualenv is which, etc.
So the simple case:
mkdir test
cd test
# init a new project with python 3.13
uv init -p 3.13
# Add project deps
uv add numpy
uv add ...
# Delete the venv
rm -rf .venv
# reinstall everything (with the exact versions)
uv sync
# Install a test package in your venv
uv pip install poetry
# force the virtualenv back into a sane state (removing poetry and all its deps)
uv sync
# update all deps
rm uv.lock
uv lock
Now cat your pyproject.toml, and you'll see something like this:

[project]
name = "test"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
"numpy>=2.2.5",
"pillow>=11.2.1",
"scipy>=1.15.2",
]
1. You can differentiate between dependency groups like build dependencies, dev dependencies, test dependencies, and regular dependencies. So if someone uses a dependency only in dev, previously you either had to install it manually or your requirements.txt installed it for you even though you didn't need it.
2. It adds a common format for project metadata (name, version, description, etc.) that other tools can use.
3. Adds a place where tool settings like those of a linter or a formatter can be stored (e.g. ruff and black)
4. Its format is standardized and allows it to be integrated with multiple build tools; TOML is a bit more standardized than whatever custom file syntax Python used before.
I use nix-shell when possible to specify my entire dev environment (including gnumake, gcc, down to utils like jq)
It often doesn't play well with venv and CUDA, which I get. I've succeeded in locking a CUDA env with a nix flake exactly once, then it broke, and I gave up and went back to venv.
Over the years I've used pip, pyenv, pipenv, poetry, conda, mamba, you name it. There are always weird edge cases, especially with publication code that ships some intersection of a requirements.txt, a pyproject.toml, a conda env, or nothing at all. There are always bizarro edge cases that make you forget whether you're using Python or Node /snark
I'll be happy to use the final tool to rule them all, but that's how they were all branded (even Nix; and I know poetry2nix is not the way).
AFAIK, it works as well with CUDA as any other similar tool. I personally haven't had any issues; most recently, last week I was working on a transformer model for categorizing video files, and it's all managed with uv, with PyTorch installed into the venv as normal.
Therefore, maybe it is a good idea to include those instructions.
Sorry but this sort of criticism is so contrived and low-effort. "Oh I tried compiling a language I don't know, using tooling I never use, using an OS I never use (and I hate too btw), and have no experience in any of it, oh and on a brand-new project that's kinda cutting-edge and doing something experimental with an AI/ML model."
I could copy-paste your entire thing, replace Windows with Mac, complain about homebrew that I have no idea how to use, developing an iMac app using SwiftUI in some rando editor (probably VSCode or VI), and it would still be the case. It says 0 about the ecosystem, 0 about the OS, 0 about the tools you use, 0 about you as a developer, and dare I say >0 about the usefulness of the comment.
A good ecosystem has lockfiles by default, python does not.
With these tools, the AI starts talking as soon as we stop. This happens in both text and voice chat tools.
I saw a demo on Twitter a few weeks back where the AI was waiting for the person to actually finish what he was saying. The length of pauses wasn't a problem. I don't know how complex that problem is, though. Probably another AI needs to analyse the input so far and decide whether it's a pause or not.
Neither do phone calls. Round trip latency can easily be 300ms, which we’ve all learned to adapt our speech to.
If you want to feel true luxury, find an old analog PSTN line. No compression artifacts or delays. Beautiful and seamless 50ms latency.
Digital was a terrible event for call quality.
AI has processing delay even if run locally. In telephony the delays are more speed-of-light dictated. But the impact on human interactive conversation is the same.
POTS is magical if you get it end to end, which I don't think is really a thing anymore. The last time I made a copper-to-copper call on POTS was in 2015! AT&T was charging nearly $40 per month for that analog line, so I shut it off. My VoIP line with long distance and international calling (which the POTS line didn't have) is $20/month with two phone numbers. And it's routed through a PBX I control.
I've found myself putting in filler words or holding a noise "Uhhhhhhhhh" while I'm trying to form a thought but I don't want the LLM to start replying. It's a really hard problem for sure. Similar to the problem of allowing for interruptions but not stopping if the user just says "Right!", "Yes", aka active listening.
One thing I love about MacWhisper (not special to just this STT tool) is that it's hold-to-talk, so I can stop talking for as long as I want and then start again without it deciding I'm done.
> The proposal examined here is that speakers use uh and um to announce that they are initiating what they expect to be a minor (uh), or major (um), delay in speaking. Speakers can use these announcements in turn to implicate, for example, that they are searching for a word, are deciding what to say next, want to keep the floor, or want to cede the floor. Evidence for the proposal comes from several large corpora of spontaneous speech. The evidence shows that speakers monitor their speech plans for upcoming delays worthy of comment. When they discover such a delay, they formulate where and how to suspend speaking, which item to produce (uh or um), whether to attach it as a clitic onto the previous word (as in “and-uh”), and whether to prolong it. The argument is that uh and um are conventional English words, and speakers plan for, formulate, and produce them just as they would any word.
[1]: https://www.sciencedirect.com/science/article/abs/pii/S00100...
You nearly have to do a hard reset to get things comfortable - walk out of the room, ring them back.
But some people are just out of sync with the world.
I remember my literature teacher telling us in class how we should avoid those filler words, and instead allow for some simple silences while thinking.
Although, to be fair, there are quite a few people in real life who use long filler words to keep anyone from interrupting them, and it's obnoxious.
"Hey Alexa, turn the lights to..." thinks for a second while I decide on my mood
"I don't know how to set lights to that setting"
"...blue... damnit."
But searching for "voice detection with pauses", it seems there's a lot of new contenders!
https://x.com/kwindla/status/1897711929617154148
this one is a fun approach too https://x.com/zan2434/status/1753660774541849020
We don't need to feel like we're talking to a real person yet.
The AI listens as long as you hold the button, and the device is efficient enough to carry with you 24/7.
Question about the Interrupt feature, how does it handle "Mmk", "Yes", "Of course", "cough", etc? Aside from the sycophancy from OpenAI's voice chat (no, not every question I ask is a "great question!") I dislike that a noise sometimes stops the AI from responding and there isn't a great way to get back on track, to pick up where you left off.
It's a hard problem, how do you stop replying quickly AND make sure you are stopping for a good reason?
Edit: just realized the irony but it's really a good question lol
Also, it took me longer than I care to admit to get your irony reference. Well done.
Edit: Just to expand on that in case it was not clear, this would be the ideal case I think:
LLM: You're going to want to start by installing XYZ, then you
Human: Ahh, right
LLM: Slight pause, makes sure that there is nothing more and checks if the reply is a follow up question/response or just active listening
LLM: ...Then you will want to...
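A crude sketch of how that kind of gating could look (my own illustration, not how the repo handles it; the word list and length cutoff are arbitrary):

```python
import re

# Short acknowledgements that shouldn't abort the reply (list is an assumption).
BACKCHANNEL = {"mmk", "mhm", "uh-huh", "ok", "okay", "yes", "yeah", "yep",
               "right", "sure", "ah", "ahh", "of course", "got it"}

def classify_interjection(text: str) -> str:
    words = re.findall(r"[a-z'-]+", text.lower())
    phrase = " ".join(words)
    if phrase in BACKCHANNEL or (len(words) <= 2 and all(w in BACKCHANNEL for w in words)):
        return "backchannel"   # keep talking, or pause briefly and then resume
    return "interrupt"         # stop TTS and hand the turn back to the user

print(classify_interjection("Ahh, right"))   # -> backchannel
print(classify_interjection("Wait, stop"))   # -> interrupt
```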
Never forget what AI stole from us. This used to be a compliment, a genuine appreciation of a good question well-asked. Now it's tainted with the slimy, servile, sycophantic stink of AI chat models.
- The median delay between speakers in a human to human conversation is zero milliseconds. In other words, about 1/2 the time, one speaker interrupts the other, making the delay negative.
- Humans don't care about delays when speaking to known AIs. They assume the AI will need time to think. Most users will qualify a 1000ms delay as acceptable and a 500ms delay as exceptional.
- Every voice assistant up to that point (and probably still today) has a minimum delay of about 300ms, because they all use silence detection to decide when to start responding, and you need about 300ms of silence to reliably differentiate that from a speaker's normal pause
- Alexa actually has a setting to increase this wait time for slower speakers.
You'll notice in this demo video that the AI never interrupts him, which is what makes it feel like a not quite human interaction (plus the stilted intonations of the voice).
Humans appear to process speech in a much more streaming way, constantly updating their parsing of the sentence until they have a high enough confidence level to respond, also using context clues and prior knowledge.
For a voice assistant to get the "human" levels, it will have to work more like this, where it processes the incoming speech in real time and responds when it's confident it has heard enough to understand the meaning.
> The person doing the speaking is thought to be communicating through the "front channel" while the person doing the listening is thought to be communicating through the "backchannel”
I'm not an expert on LLMs but that feels completely counter to how LLMs work (again, _not_ an expert). I don't know how we can "stream" the input and have the generation update/change in real time, at least not in 1 model. Then again, what is a "model"? Maybe your model fires off multiple generations internally and starts generating after every word, or at least starts asking sub-LLM models "Do I have enough to reply?" and once it does it generates a reply and interrupts.
I'm not sure how most apps handle the user interrupting, in regards to the conversation context. Do they stop generation but use what they have generated already in the context? Do they cut off where the LLM got interrupted? Something like "LLM: ..and then the horse walked... -USER INTERRUPTED-. User: ....". It's not a purely-voice-LLM issue but it comes up way more for that since rarely are you stopping generation (in the demo, that's been done for a while when he interrupts), just the TTS.
The only model that has attempted this (as far as I know) is Moshi from Kyutai. It solves it by having a fully-duplex architecture. The model is processing the audio from the user while generating output audio. Both can be active at the same time, talking over each other, like real conversations. It's still in research phase and the model isn't very smart yet, both in what it says and when it decides to speak. It just needs more data and more training.
That’s like a Black Mirror episode come to life.
If you load the system prompt with enough assumptions that it's a speech-impaired subtitle transcription that follows a dialogue, you might pull it off, but you'd likely need to fine-tune your model to play nicely with the TTS and the rest of the setup.
> Sequence-to-sequence models with soft attention had significant success in machine translation, speech recognition, and question answering. Though capable and easy to use, they require that the entirety of the input sequence is available at the beginning of inference, an assumption that is not valid for instantaneous translation and speech recognition. To address this problem, we present a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments. The online alignments model is able to start producing outputs without the need to first process the entire input sequence. A highly accurate online sequence-to-sequence model is useful because it can be used to build an accurate voice-based instantaneous translator. Our model uses hard binary stochastic decisions to select the timesteps at which outputs will be produced. The model is trained to produce these stochastic decisions using a standard policy gradient method. In our experiments, we show that this model achieves encouraging performance on TIMIT and Wall Street Journal (WSJ) speech recognition datasets.
Better solutions are possible but even tiny models are capable of being given a partial sentence and replying with a probability that the user is done talking.
The linked repo does this; it should work fine.
More advanced solutions are possible (you can train a model that does purely speech -> turn detection probability w/o an intermediate text step), but what the repo does will work well enough for many scenarios.
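Roughly, the shape of it looks like this (a sketch of the general idea, not the repo's actual code; the prompt, thresholds, and `ask_llm` hook are assumptions):

```python
# Score how likely the user is done talking, then scale the silence timeout by it.
def turn_end_probability(partial_transcript: str, ask_llm) -> float:
    prompt = (
        "Rate from 0 to 100 how likely it is that the speaker has finished their turn.\n"
        f'Partial transcript: "{partial_transcript}"\n'
        "Answer with a single number."
    )
    try:
        return max(0.0, min(1.0, float(ask_llm(prompt).strip()) / 100.0))
    except ValueError:
        return 0.5  # unparseable answer: fall back to a neutral guess

def silence_timeout(p_done: float) -> float:
    # Confident the user is finished -> respond quickly; unsure -> wait longer.
    if p_done > 0.9:
        return 0.25   # seconds; all values arbitrary
    if p_done > 0.6:
        return 0.6
    return 1.2
```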
"Knock-Knock. Who's there? Interrupting Cow. Interrupting cow who? Moo!
Note that the timing is everything here. You need to yell out your Moo before the other person finishes the Interrupting cow who? portion of the joke, thereby interrupting them. Trust me, it's hilarious! If you spend time with younger kids or with adults who need to lighten up (and who doesn't?!?), try this out on them and see for yourself."
Basically it is about the AI interrupting you, and at just the right moment too. Super hard to do from a technical perspective.
"Knock-knock."
"Who's there?"
"Interrupting cow."
"Interrupting co-"
"MOO!"
That right here is an anxiety trigger and would make me skip the place.
Nothing ruins the day more than arguing with a robot that keeps misinterpreting what you said.
With a human, I have to anticipate what order their POS system allows them to key things in, how many things I can buffer up with them in advance before they overflow and say "sorry, what size of coke was that, again", whether they prefer me to use the name of the item or the number of the item (based on what's easier to scan on the POS system). Because they're fatigued and have very little interest or attention to provide, having done this repetitive task far too many times, and too many times in a row.
That’s a much more serious anxiety trigger for me.
I kept expecting a twist though - the technology evoked in Parts 6 & 7 is exactly what I would imagine the end point of Manna to become. Using the "racks" would be so much cheaper than feeding people and having all those robots around.
Nothing ruins the day more than arguing with a HUMAN OPERATOR who keeps misinterpreting what you said.
:-)
Is that really a productive way to frame it? I would imagine there is some delay between one party hearing the part of the sentence that triggers the interruption, and them actually interrupting the other party. Shouldn't we quantify this?
I totally agree that the fact the AI doesn't interrupt you is what makes it seem non-human. Really, the models should have an extra head that predicts the probability of an interruption, and make one if it seems necessary.
- Expeditious
- Constructive
- Insightful
1. A special model that predicts when a conversation turn is coming up (e.g. when someone is going to stop speaking). Speech has a rhythm to it and pauses / ends of speech are actually predictable.
2. Generate a model response for every subsequent word that comes in (and throw away the previously generated response), so basically your time to speak after doing some other detection is basically zero.
3. Ask an LLM what it thinks the odds of the user being done talking is, and if it is a high probability, reduce delay timer down. (The linked repo does this)
I don't know of any up-to-date models for #1, but I haven't checked in over a year.
Tl;Dr the solution to problems involving AI models is more AI models.
True AI chat should know when to talk based on conversation and not things like silence.
Voice-to-text also strips a lot of context out of the conversation.
To properly learn more appropriate delays, it can be useful to find a proxy measure that can predict when a response can/should be given. For example, look at Kyutai’s use of change in perplexity in predictions from a text translation model for developing simultaneous speech-to-speech translation (https://github.com/kyutai-labs/hibiki).
What about on phone calls? When I'm on a call with customer support they definitely wait for it to be clear that I'm done talking before responding, just like AI does.
Fascinating. I wonder if this is some optimal information-theoretic equilibrium. If there's too much average delay, it means you're not preloading the most relevant compressed context. If there's too little average delay, it means you're wasting words.
I do care. Although 500ms is probably fine. But anything longer feels extremely clunky to the point of not being worth using.
As a submission for an AMD Hackathon, one big thing is that I tested all the components to work with RDNA3 cards. It's built to allow for swappable components for the SRT, LLM, TTS (the tricky stuff was making websockets work and doing some sentence-based interleaving to lower latency).
Here's a full write up on the project: https://www.hackster.io/lhl/voicechat2-local-ai-voice-chat-4...
(I don't really have time to maintain that project, but it can be a good starting point for anyone looking to hack their own thing together.)
• No compliments, flattery, or emotional rapport.
• Focus on clear reasoning and evidence.
• Be critical of users' assumptions when needed.
• Ask follow-up questions only when essential for accuracy.
However, I'm kinda concerned with crippling it by adding custom prompts. It's kinda hard to know how to use AI efficiently. But the glazing and random follow-up questions feel more like a result of some A/B testing UX-research rather than improving the results of the model.
It is something that local models I have tried do not do, unless you are being conversational with them. I imagine OpenAI gets a few more pennies if they add open-ended questions to the end of every reply, and that's why it's done. I get annoyed when people patronize me, and so too I get annoyed at a computer.
For folks that are curious about the state of the voice agents space, Daily (the WebRTC company) has a great guide [1], as well as an open-source framework that allows you to build AI voice chat similar to OP's with lots of utilities [2].
Disclaimer: I work at Cartesia, which services a lot of these voice agents use cases, and Daily is a friend.
[1]: https://voiceaiandvoiceagents.com [2]: https://docs.pipecat.ai/getting-started/overview
Code: https://github.com/livekit/agents/tree/main/livekit-plugins/... Blog: https://blog.livekit.io/using-a-transformer-to-improve-end-o...
Very cool project though. Maybe you can fine tune the prompt to change how chatty your AI is.
For this demo to be real-time, it relies on having a beefy enough GPU that it can push 30 seconds of audio through one of the more capable (therefore bigger) models in a couple of hundred milliseconds. It's basically throwing hardware at the problem to paper over the fact that Whisper is just the wrong architecture.
Don't get me wrong, it's great where it's great, but that's just not streaming.
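To make the "not streaming" point concrete, the usual workaround looks something like this: keep re-running the model over the whole growing buffer every few hundred milliseconds (a sketch using faster-whisper; model size and cadence are arbitrary):

```python
import numpy as np
from faster_whisper import WhisperModel

# Whisper has no incremental mode, so "streaming" means re-transcribing the buffer.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
buffer = np.zeros(0, dtype=np.float32)   # 16 kHz mono PCM

def on_audio_chunk(chunk: np.ndarray) -> str:
    """Called every few hundred ms with new audio; returns the full partial transcript."""
    global buffer
    buffer = np.concatenate([buffer, chunk])
    segments, _info = model.transcribe(buffer, beam_size=1)
    return "".join(segment.text for segment in segments)
```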
It interacts nearly like a human, can and does interrupt me once it has enough context in many situations, and has exceedingly low latency; using it for the first time was a fairly shocking experience for me.
Does a Translation step right after the ASR step make sense at all?
Any pointers (papers, repos) would be appreciated!
The endgame of this is surely a continuously running wave to wave model with no text tokens at all? Or at least none in the main path.
There was also a very prominent issue where the voices would be sped up if the text was over a few sentences long; the longer the text, the faster it was spoken. One suggestion was to split the conversation into chunks with only one or two "turns" per speaker, but then you'd hear two voices then two more, then two more… with no way to configure any of it.
Dia looked cool on the surface when it was released, but it was only a demo for now and not at all usable for any real use case, even for a personal app. I'm sure they'll get to these issues eventually, but most comments I've seen so far recommending it are from people who have not actually used it or they would know of these major limitations.
https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/s...
My favorite line:
"You ARE this charming, witty, wise girlfriend. Don't explain how you're talking or thinking; just be that person."
That said, it's not like we have any better alternatives at the moment, but just something I think about when I try to digest a meaty personality prompt.
``` *Persona Goal:* Embody a sharp, observant, street-smart girlfriend. Be witty and engaging, known for *quick-witted banter* with a *playfully naughty, sassy, bold, and cheeky edge.* Deliver this primarily through *extremely brief, punchy replies.* Inject hints of playful cynicism and underlying wisdom within these short responses. Tease gently, push boundaries slightly, but *always remain fundamentally likeable and respectful.* Aim to be valued for both quick laughs and surprisingly sharp, concise insights. Focus on current, direct street slang and tone (like 'hell yeah', 'no way', 'what's good?', brief expletives) rather than potentially dated or cliché physical idioms.
```
> street-smart
> sassy
> street slang
Those explain the AAVE
https://www.sesame.com/research/crossing_the_uncanny_valley_...
Once it can emulate a 13 year old talking to their parent I will then worry about AGI
Edit to add: this might not be true since whisper-large-v3-turbo got released. I've not tried that on a pi 5 yet.