Summary
At VoiceBooker, we see every day that large language models do not behave like classical deterministic software. Their outputs are based on probabilities, which means that even with identical input, behavior can vary. That becomes visible very quickly in function calling. Modern models are much more reliable than earlier generations, but precise prompts, careful debugging, and an understanding of context interactions are still essential for predictable results. To keep AI applications stable and trustworthy, automated End-to-End-Tests help verify bot behavior systematically and catch regressions early.
Introduction
Anyone coming from classical software development often experiences a small culture shock when working with large language models (LLMs). For decades, we were used to building deterministic systems: if a program produces an error for a given input, we fix the code - and once the fix works, we can assume it will work again for the same input in the future. Same input, same output. That is the world software engineers have lived in for decades. With LLMs, reality is different.
From deterministic programs to probability machines
A large language model does not work like a classical program with fixed decision trees. Instead, it is driven by probabilities. Every answer, every decision, and every function call is the result of statistical scoring across possible next steps. That leads to a phenomenon that frustrates many developers at first: An identical conversation flow can produce different results across two runs. You can think of it like roulette: the setup is the same, yet the outcome may still vary. Developers who come from the world of APIs, databases, and business logic often find this lack of determinism unfamiliar or even problematic at first.
Our early experience with function calling
When we first tried in 2023 to handle appointment booking through an LLM, we were close to despair at times. The conversation flow was identical:
- The user states the desired appointment
- The bot gathers the required information
- The booking function should be called
Still, this happened: On the first test, the function was called correctly. On the second test, with almost the same flow, it suddenly was not executed at all. For someone who has built classical software, this initially looks like a bug. In reality, it is often not a system failure at all, but the natural result of a probabilistic model that interprets context and makes decisions based on likelihoods.
The good news: models are getting more reliable
Since the early GPT-3.5 days, a lot has changed. Modern models are much better at:
- Instruction following
- Tool usage
- Function calling
- Context understanding
- Following complex prompts
Predictability has improved massively as a result. Even so, there is still an important tradeoff between reliability and latency.
Why more determinism often means more latency
Many of today's strongest models work in a more structured and reliable way than their predecessors. The price is often additional reasoning overhead. Put simply: The model thinks longer, analyzes more context, and therefore makes more robust decisions. That increases the likelihood that a desired function is called correctly. At the same time, response time increases.
In voice applications in particular, that can be a problem. A phone assistant that needs several seconds to reason about every function call quickly feels unnatural to the caller. That is why many production systems deliberately use smaller and faster models. These models deliver excellent latency and can still call functions remarkably well - but only if prompts leave as little room for interpretation as possible. Even small context changes can cause a flow that previously worked perfectly to be interpreted differently.
When function calls suddenly stop happening
A typical pattern in practice: A bot works reliably for weeks. Then a seemingly harmless prompt change is introduced. Suddenly a certain function is only called occasionally - or not at all. In these situations, it helps not to blame the model too quickly. Often the root cause is an unexpected interaction between several instructions. A new rule may accidentally compete with an existing instruction. The model is then faced with conflicting goals and makes a perfectly understandable decision not to call the function.
Prompt debugging instead of code debugging
While classical software engineers debug code, LLM developers increasingly debug prompts. One approach that has worked well for us: We download the full transcript as JSON - including system prompts, tool definitions, and user inputs - and then let another LLM analyze why a certain behavior occurred. The results are often surprising. Very often the model identifies contradictions or ambiguities that we overlooked while writing the prompt. Quite often we end up agreeing with the model: On closer inspection, the instruction really was not precise enough or conflicted with another instruction. Interestingly, humans often react in the same way in such situations. If two instructions are formulated inconsistently, interpretive room is inevitable.
From unit tests to bot tests
Another important lesson from practice: Manual testing does not scale. Anyone who checks every bot change through manual chats or test calls quickly loses a lot of time. That is why we introduced End-to-End-Tests in VoiceBooker. A bot can call another bot or chat with it. For example, a test bot can be given this task:
Reserve a table for four people on Tuesday at 6 p.m.
Because the test bot itself is programmable, parameters such as date, time, or party size can be varied automatically. This effectively creates unit tests for conversational AI. The benefits are obvious:
- Repeatable test scenarios
- Automated regression tests
- Fast validation of prompt changes
- Early detection of unexpected behavior changes
Especially in chat mode, hundreds of test cases can be run in a short time, because throughput is mainly limited by the latency of the models being used.
Conclusion
The biggest mental shift when working with LLMs is probably letting go of the idea of absolute determinism. LLMs are not classical programs. They are probability machines. That does not mean reliable systems are impossible. With precise prompts, clean tool definitions, systematic prompt debugging, and automated End-to-End-Tests, it is now possible to build remarkably robust and predictable AI applications. The key difference from classical software development is that you are no longer optimizing only code - you are increasingly shaping the behavior of an intelligent system. And that is both the biggest challenge and the greatest fascination of modern AI development.

