dev.ua editorial team · AI Eng
9 March 2026, 13:12
How to make a Ukrainian-speaking voice AI agent for your business in 2026
First, what exactly can voice agents do? The same things as text agents, only by voice? Broadly, yes, with one clarification: voice agents can also place calls and hold a conversation, which opens up additional value for your business.
How to create a real-time streaming agent in 2026?
First, you need to understand how it works and what options are available. The solutions fall into two groups: speech-to-speech models, such as kyutai-labs/moshi, and a cascade pipeline, which is better suited to business tasks and is built from models by providers such as Deepgram, ElevenLabs, and many others.
The first option delivers remarkable latency: up to 200 ms end-to-end, from the user's last token to the agent's first token, which is sometimes faster than a real person responds. These models are also natively expressive, which makes the conversation feel more natural and human.
But today it is practically impossible to use these models for specific business tasks: they work speech-to-speech, and before a session starts you can load only a prompt, and that's it. No RAG, no tool calling, no ReAct. Simply put, it is a very cool chatterbox that cannot integrate with your database or CRM.
Cascade pipeline: how it works
So the main focus today is the second option, the cascade pipeline. Why cascade? Because it chains several cutting-edge technologies that run one after another and work together.
VAD — Voice Activity Detection
The pipeline starts with VAD (Voice Activity Detection). It determines where in the audio stream there is speech and where there is silence or noise. As a rule, these are small, fast neural networks, for example Silero VAD.
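To make the role of this stage concrete, here is a minimal sketch of the speech/silence decision. It is not Silero VAD and not a neural model: it swaps in a simple energy threshold as a stand-in, and the names `frame_rms`, `detect_speech`, `threshold`, and `hangover` are illustrative assumptions.

```python
import math
from typing import Iterable, List


def frame_rms(frame: List[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(
    frames: Iterable[List[float]],
    threshold: float = 0.02,  # assumed energy level that separates speech from noise
    hangover: int = 2,        # quiet frames still counted as speech after a loud one
) -> List[bool]:
    """Label each frame True (speech) or False (silence/noise)."""
    flags: List[bool] = []
    quiet_run = hangover  # start in the "silence" state
    for frame in frames:
        if frame_rms(frame) >= threshold:
            quiet_run = 0
            flags.append(True)
        else:
            quiet_run += 1
            # Short pauses inside a phrase stay marked as speech.
            flags.append(quiet_run <= hangover)
    return flags
```

A production VAD replaces the energy check with a model's speech probability, but the surrounding logic (a threshold plus a short hangover) has the same shape.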
STT (Speech-to-Text) is the biggest problem for Ukrainian
Next comes STT (Speech-to-Text): models that recognize speech and convert it into text. This is probably the biggest problem and difficulty for anyone building real-time voice AI agents.
Unfortunately, at the moment there are no open-source, production-level models that support real-time streaming and have a good WER (word error rate) for Ukrainian. What really stings is that the new Mistral Voxtral-transcribe-2 supports 12 languages, including, as you might guess, Russian, but not Ukrainian.
Therefore, you have to use the APIs of Deepgram, ElevenLabs, etc. In practice, this stage accounts for the largest share of the delay between the user's request and the agent's response: 400–600 ms.
LLM is the brain of an agent
Once we have the text, we use an LLM to answer the user's question or call tools. There are many more options here, but latency is again the catch.
Even the fairly simple and relatively fast GPT-4o has a latency of around 450–500 ms, which is too long for our purposes, and that is for a plain response, without tool calls or ReAct. Therefore, we currently use the Groq API with open models for the LLM stage, which brings the latency down to 200–300 ms.
Better still, you can run these models locally on a GPU, which gives even faster inference, but it hits the budget if call volume is low. Either way, we now have a text response for the user and need to generate a voice for it.
TTS — voice generation
Reasonably capable options have started to appear here, but I still haven't found a single production-ready model that can be self-hosted.
For the richest, most natural-sounding Ukrainian, the best I found is Respeecher AI, which has a model for English and Ukrainian. The voice quality is impressive, and it is clear that Ukrainians built it, but my logs show a delay of 400–500 ms, so I can't afford that luxury in production right now; the reason becomes clear a little further down.
Another solid option is Cartesia, which has one or two Ukrainian voices, but I don't like how they sound. That's why my favorite remains ElevenLabs eleven_flash_v2_5: fast (up to 200 ms), reliable, with lots of voices.
Cumulative latency — and why it’s a problem
So, if you add up the total delay from the moment the user stops speaking to the moment the agent starts responding, it can be a little alarming: it comes out to more than a second. Add SIP delay on top, and conversations with such an agent feel bad, users notice immediately, and you won't get far with a cart like that.
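The arithmetic behind that "more than a second" can be sketched from the per-stage figures quoted in the sections above. This is a rough budget, not a measurement: it assumes the stages run strictly one after another, treats "up to 200 ms" for TTS as a flat 200 ms, and leaves out VAD, network, and SIP delay. The names `FAST_STACK`, `SLOW_STACK`, and `total_latency_ms` are illustrative.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (low, high) latency in milliseconds

# Faster choices named in the article: hosted STT, Groq-hosted LLM, eleven_flash_v2_5.
FAST_STACK: Dict[str, Range] = {
    "stt": (400, 600),
    "llm": (200, 300),
    "tts": (200, 200),  # assumption: "up to 200 ms" taken as a flat 200 ms
}

# Slower choices named in the article: GPT-4o and Respeecher.
SLOW_STACK: Dict[str, Range] = {
    "stt": (400, 600),
    "llm": (450, 500),
    "tts": (400, 500),
}


def total_latency_ms(stages: Dict[str, Range]) -> Range:
    """Sum per-stage ranges, assuming no overlap between stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


for name, stack in (("fast", FAST_STACK), ("slow", SLOW_STACK)):
    low, high = total_latency_ms(stack)
    print(f"{name}: {low}-{high} ms before SIP delay")
```

Under these assumptions even the faster stack lands at roughly a second once the stages are simply chained, which is why the rest of the article focuses on removing and overlapping stages.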
Practical approaches to reducing end-to-end latency
The main gain comes from cutting one LLM call per turn. We replaced the two-stage pipeline (classify LLM → respond LLM) with smart_respond, a single LLM call that classifies the message and generates the response at the same time. This saves ~300 ms on every message that has no keyword match.
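The article does not show how smart_respond is implemented, so the following is only a guess at its general shape: one prompt asks the model for both an intent label and a reply in a single JSON object. The prompt text, the `SmartResponse` type, and the injected `llm` callable are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt: classification and answer requested in one completion.
SYSTEM_PROMPT = (
    "Classify the user's message and answer it in one step. "
    'Reply with JSON only: {"intent": "<label>", "reply": "<text>"}'
)


@dataclass
class SmartResponse:
    intent: str
    reply: str


def smart_respond(message: str, llm: Callable[[str, str], str]) -> SmartResponse:
    """One LLM call that returns both the intent and the reply.

    `llm` is any function (system_prompt, user_message) -> raw model text.
    """
    raw = llm(SYSTEM_PROMPT, message)
    try:
        data = json.loads(raw)
        return SmartResponse(intent=str(data["intent"]), reply=str(data["reply"]))
    except (json.JSONDecodeError, KeyError, TypeError):
        # The model ignored the format: fall back to speaking its raw text.
        return SmartResponse(intent="unknown", reply=raw.strip())
```

The fallback branch matters in a voice setting: a malformed JSON answer should still produce something to say rather than a second, slow retry.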
The keyword pre-filter stays as the first gate: regex patterns run in effectively 0 ms and intercept farewells, greetings, and obscene language without spending any time on the LLM.
The second layer of optimizations is at the connection and infrastructure level. Connection pooling (persistent httpx connections to the Groq API) removes the TCP/TLS handshake from each request, which saves ~20–50 ms.
resume_false_interruption=True does not reduce latency directly, but it removes wasted cycles in which the agent stopped because of background noise and started a new turn from scratch.
The turn detector (EOUModel) adds semantic understanding of end-of-phrase on top of VAD: instead of reacting to any 200 ms pause, the model assesses whether the person has actually finished a thought, which reduces fragmented responses to unfinished phrases.
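A keyword gate of this kind can be sketched in a few lines. The patterns, canned replies, and the `prefilter` name below are placeholders rather than the author's actual rules, and the obscene-language category is left out for brevity.

```python
import re
from typing import Optional

# Illustrative patterns only; a real deployment would maintain its own lists.
PATTERNS = {
    "greeting": re.compile(r"^\s*(hello|hi|привіт|добрий день)\b", re.IGNORECASE),
    "farewell": re.compile(r"\b(bye|goodbye|до побачення)\b", re.IGNORECASE),
}

# Canned replies returned without calling the LLM at all.
CANNED_REPLIES = {
    "greeting": "Hello! How can I help you?",
    "farewell": "Thank you for calling. Goodbye!",
}


def prefilter(text: str) -> Optional[str]:
    """Return a canned reply if a keyword pattern matches, else None."""
    for intent, pattern in PATTERNS.items():
        if pattern.search(text):
            return CANNED_REPLIES[intent]
    return None  # no match: the message goes on to the LLM
```

A `None` result is the "no keyword match" case, where the message falls through to the single LLM call.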
Streaming overlap: STT → LLM → TTS
The key idea is not to wait for the previous stage to finish, but to start the next one as soon as its first data arrives. In an ideal pipeline, all three models run concurrently, overlapping one another.
STT → LLM
Streaming STT emits interim results before the person has finished speaking. You can prepare the context and system prompt in advance and send the request as soon as the final transcript arrives.
A more aggressive approach is speculative inference: run the LLM on the interim text; if the final transcript matches, the answer is already ready, and if not, cancel and restart.
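Speculative inference as described here can be sketched with asyncio tasks. This is a minimal illustration under assumptions: `SpeculativeResponder`, `on_interim`, `on_final`, and the whitespace/case-insensitive match rule are hypothetical names and choices, and `llm` stands for any async function that turns a transcript into a reply.

```python
import asyncio
from typing import Any, Callable, Coroutine, Optional

LLMCall = Callable[[str], Coroutine[Any, Any, str]]


def normalize(text: str) -> str:
    """Compare transcripts ignoring case and extra whitespace."""
    return " ".join(text.lower().split())


class SpeculativeResponder:
    def __init__(self, llm: LLMCall) -> None:
        self._llm = llm
        self._task: Optional["asyncio.Task[str]"] = None
        self._text: Optional[str] = None

    def on_interim(self, text: str) -> None:
        """Start (or restart) an LLM call on the interim transcript."""
        key = normalize(text)
        if key == self._text:
            return  # already speculating on this exact text
        self._cancel()
        self._text = key
        self._task = asyncio.create_task(self._llm(text))

    async def on_final(self, text: str) -> str:
        """Reuse the speculative answer on a match, otherwise restart."""
        if self._task is not None and normalize(text) == self._text:
            task, self._task, self._text = self._task, None, None
            return await task
        self._cancel()
        return await self._llm(text)

    def _cancel(self) -> None:
        if self._task is not None:
            self._task.cancel()
        self._task = None
        self._text = None
```

The trade-off is cost: every mismatch is a discarded LLM call, so this only pays off when interim and final transcripts usually agree.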
LLM → TTS
This is the most important link. The LLM generates the response token by token. Instead of waiting for the complete response, TTS starts synthesizing speech as soon as the first sentence has accumulated. By the time the LLM finishes the second sentence, the user is already hearing the first.
The metric is audio TTFB (Time To First Byte): the time from the end of the user's speech to the first sound of the response.
Result: the user finishes speaking → STT finalizes → the LLM emits its first tokens after ~300 ms → TTS starts synthesizing immediately → the first audio chunk plays.
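The sentence-by-sentence handoff from LLM to TTS can be sketched as a small generator. The name `sentences_from_tokens` and the splitting rule (sentence-ending punctuation followed by whitespace) are assumptions for illustration; real pipelines often use a more careful sentence tokenizer.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace, so "3.5" is not split.
SENTENCE_END = re.compile(r"([.!?]+)\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break
            # Emit everything up to and including the punctuation.
            yield buffer[: match.end(1)].strip()
            buffer = buffer[match.end():]
    # Flush whatever remains when the LLM stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hel", "lo", "!", " How", " can", " I", " help", "?"]
    for sentence in sentences_from_tokens(stream):
        print("send to TTS:", sentence)
```

Each yielded sentence would be passed to the TTS client right away, so synthesis of the first sentence overlaps with generation of the second.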
Phrase caching
You can also cache frequently used phrases: the wav files are generated in advance and played to the user with near-zero delay, without waiting for TTS. This noticeably improves the conversation, especially at the start of the dialogue, when it is still an open question whether the person will hang up or keep listening.
With these techniques combined, a latency of 200–800 ms is realistic, which makes the conversation feel natural and comfortable.
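Phrase caching can be sketched as a dictionary in front of the TTS call. `PhraseCache`, `warm_up`, `get_audio`, and the injected `synthesize` function are hypothetical names; the cache here lives in memory, whereas the article describes pre-generated wav files.

```python
from typing import Callable, Dict, Iterable


def normalize(text: str) -> str:
    """Cache key that ignores case and extra whitespace."""
    return " ".join(text.lower().split())


class PhraseCache:
    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        # `synthesize` is any function text -> audio bytes (e.g. a TTS API call).
        self._synthesize = synthesize
        self._audio: Dict[str, bytes] = {}

    def warm_up(self, phrases: Iterable[str]) -> None:
        """Pre-generate audio for frequent phrases before any call starts."""
        for phrase in phrases:
            self._audio[normalize(phrase)] = self._synthesize(phrase)

    def get_audio(self, text: str) -> bytes:
        """Return cached audio if available, otherwise synthesize on demand."""
        cached = self._audio.get(normalize(text))
        if cached is not None:
            return cached
        return self._synthesize(text)
```

Warming the cache with the opening greeting is the natural first step, since that is the phrase every caller hears.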
aceverse.co: details of our service are available here