Generative AI and Observability – How to Know What Good Looks Like

By Dom Couldwell, Head of Field Engineering EMEA, DataStax

Generative AI has huge potential. Already, McKinsey estimates that 65 percent of companies employ generative AI in their processes, roughly double the figure from the previous year’s survey. Three quarters of those questioned expect generative AI to have a significant or disruptive impact on their markets by reducing costs and driving growth.

However, the technologies involved in generative AI are new and evolving all the time. You have to knit together multiple components that all play specific and necessary roles in getting data ready for use with generative AI. When something changes as components are updated, your applications can break. The speed and frequency of these changes, not to mention how new the technology itself is, make it hard to pin down exactly what the fault is and how to fix it.

Dealing with these integration issues also takes you away from working on business problems and delivering value. So when you have a problem, how do you know where to look across your whole generative AI application?

Observability and generative AI

Understanding your application is hard because of all the moving parts involved. Explaining how your application delivered a result involves looking at a mix of application and data processes. For example, how long did it take to gather the data, feed it into the LLM and deliver the result back to the user? Getting this timing data is essential: if your vector search results come back too late, they won’t reach the LLM in time to be used in the response to the user.
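As a rough illustration of the kind of timing guardrail this implies, the sketch below wraps a retrieval call in a hard time budget and records how long it took. It assumes a placeholder callable, search_vector_store, standing in for whatever search API your stack actually exposes, and a budget of half a second that you would tune to your own latency target.

import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Reused pool so a timed-out search does not block the calling request.
_pool = ThreadPoolExecutor(max_workers=4)
RETRIEVAL_BUDGET_SECONDS = 0.5  # assumed budget; tune to your own latency target

def retrieve_with_budget(search_vector_store, query):
    # search_vector_store is a placeholder for your actual retrieval call.
    start = time.perf_counter()
    future = _pool.submit(search_vector_store, query)
    try:
        docs = future.result(timeout=RETRIEVAL_BUDGET_SECONDS)
    except TimeoutError:
        docs = []  # fall back to answering without the extra context
    elapsed = time.perf_counter() - start
    print(f"retrieval took {elapsed:.3f}s and returned {len(docs)} documents")
    return docs

Logging the elapsed time for every request, rather than just the timeouts, is what gives you the trend data to spot retrieval slowly drifting past its budget.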

In software development circles, observability is often defined as the combination of logs, traces and metric data that shows how applications perform. In classic mechanical engineering and control theory, observability is a measure of how well a system’s internal state can be inferred from its external outputs. In practice, looking at the initial requests and what gets returned provides data that can be used to judge performance.

Alongside this, there is the quality of the output to consider as well. Did the result answer the user’s question, and how accurate was the answer? Were there any hallucinations in the response that would affect the user? And where did those results come from? Tracking AI hallucination rates across different LLMs and services shows how those services compare, with reported levels of inaccuracy varying from around 2.5 percent to 22.4 percent.

All the steps involved in managing your data and your generative AI app can affect the quality and speed of response at runtime. For example, retrieval augmented generation (RAG) allows you to find and deliver company data in the right format to the LLM so that this context can produce a more relevant response. When you implement RAG, you have to look at how your system performs and how long it takes to prepare embeddings, carry out a search, and then deliver that data back as context.
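A minimal sketch of what that per-stage measurement could look like is below. The embed_query, search_vectors and call_llm arguments are assumed placeholders for your actual embedding service, vector store and LLM client; the point is simply to record the time each stage takes for every request.

import time

def timed(stage, fn, timings, *args, **kwargs):
    # Run one pipeline stage and record how long it took.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[stage] = time.perf_counter() - start
    return result

def answer_with_rag(question, embed_query, search_vectors, call_llm):
    timings = {}
    embedding = timed("embed", embed_query, timings, question)
    context = timed("search", search_vectors, timings, embedding)
    answer = timed("generate", call_llm, timings, question, context)
    # Report per-stage latencies alongside the answer so slow stages stand out.
    print({stage: f"{seconds * 1000:.0f}ms" for stage, seconds in timings.items()})
    return answer

Shipping those timings to whatever metrics system you already use then lets you compare embedding, search and generation latency over time rather than one request at a time.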

One example of how to get this data across generative AI applications is LangSmith. LangSmith collects data from across the whole AI deployment to help developers see where bottlenecks exist and how to fix them. Getting this kind of data helps to show where specific problems have come up, such as latency spikes, as well as wider application problems where integrations have broken because components have changed or been updated.
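As a hedged example of what this can look like in practice, the sketch below uses the langsmith Python SDK’s traceable decorator to capture nested traces for a toy retrieval-and-answer flow. The environment variable values and the function bodies are placeholders, not a real implementation; the nested calls show up as parent and child runs in LangSmith, which is where the latency breakdown comes from.

import os
from langsmith import traceable

os.environ["LANGCHAIN_TRACING_V2"] = "true"    # turn tracing on for this process
os.environ["LANGCHAIN_API_KEY"] = "<your-key>" # placeholder credential

@traceable(name="retrieve-context")
def retrieve(question):
    # Stand-in for your vector search call.
    return ["example context document"]

@traceable(name="rag-pipeline")
def answer(question):
    context = retrieve(question)
    # Stand-in for the LLM call; nested calls appear as child runs in the trace.
    return f"Answer based on {len(context)} context documents"

answer("Why is my RAG response slow?")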

Poor performance around supplying context data can mean that you have the wrong data approach in the first place. Take a product catalogue, where many product entries can be very similar. You don’t want to provide inaccurate context to your LLM, and sifting through near-identical entries can slow down your vector search, leading to poor performance. When this happens, you can improve the quality of your context data by combining knowledge graphs with RAG. This combination helps ensure the right information is supplied to the LLM and that performance stays fast enough. Understanding how those components work on their own and when they are integrated with others is essential, as the whole is much more than the sum of its parts.

Testing components across generative AI applications

For those who have started building generative AI applications, the emphasis is normally on the LLM. This is an essential component of the overall application, but it is not solely responsible for overall performance. Looking at the LLM alone is like analysing a car by considering only the engine, rather than the role of the wheels or the fuel. Getting accurate insights into how components operate across your GenAI application is the first step in preparing for future developments. Building more processes to explain how the AI system delivered results is the follow-up.

To build on this further, teams should consider what their future AI platform will look like. The issue here is that swapping out components and carrying out tests can be cumbersome. Implementing new infrastructure can take hours, pulling developers away from business problems even where savings or improvements could be made, and starting whole new implementations can make it hard to compare like with like.

With an AI Platform as a Service (PaaS) approach to generative AI and RAG applications, evaluating different components and comparing their performance should be simpler. Using open source integration tools like LangChain and Langflow, you should be able to swap different components into your AI application flow and then get performance information for each transaction. The challenge with integrations is keeping up with changes to the individual components, but with integration layers in place, that management headache is removed.
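A hedged sketch of that swap-ability, using LangChain’s expression language: the prompt and output parser stay fixed, and only the model component changes. The model name here is just an example of a candidate you might plug in.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)

def build_chain(llm):
    # The prompt and parser are reused; only the model component is swapped.
    return prompt | llm | StrOutputParser()

chain = build_chain(ChatOpenAI(model="gpt-4o-mini"))  # swap in any chat model here
print(chain.invoke({"context": "Our SLA is 500ms.", "question": "What is the SLA?"}))

The same build_chain call works with any chat model object that LangChain supports, which is what makes side-by-side evaluation practical.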

The other benefit of this platform approach is that it is easier to swap components and try out alternatives without having to recreate applications from scratch or carry out additional integration work for the new component. Trialling a new option, such as an alternative LLM or vector embedding service, can then be automated as part of the overall application workflow. After a trial, you can look at your data to gauge whether you have the right LLM, vector search approach or integrations for your applications, or whether you need to alter your approach.
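One rough way to automate that comparison: run the same sample questions through each candidate and record latency per candidate, with quality scoring sitting alongside. The candidates argument is assumed to map a label to any callable that takes a question and returns an answer, such as chains built as in the earlier sketch.

import time
from statistics import mean

def compare_candidates(candidates, sample_questions):
    results = {}
    for label, ask in candidates.items():
        latencies = []
        for question in sample_questions:
            start = time.perf_counter()
            ask(question)  # answers could also be stored for quality review
            latencies.append(time.perf_counter() - start)
        results[label] = {
            "mean_latency_s": round(mean(latencies), 3),
            "worst_latency_s": round(max(latencies), 3),
        }
    return results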

The future for GenAI observability

Hunting down the issues that affect generative AI applications involves going through all the components that make up the service. They all need to work together and provide the right result at the right pace in order to deliver what you need. Some of these platform elements will be outside your direct control, so you won’t get to see what goes on inside them. Without direct data on their performance, you will have to rely on observability for your generative AI applications in order to understand the results that your service delivers and the impact that your changes could have.

Tracking AI results and explaining where those responses came from will also be necessary for compliance reasons. The UK’s Artificial Intelligence bill calls for independent auditors to interrogate results from systems, while the European Union’s AI Act demands logging of transactions for traceability of results. Understanding AI results will be necessary for compliance, risk, and performance reasons. Yet much of the focus of these regulations is currently on the LLMs themselves, rather than on the full generative AI application. Without this understanding, regulation will lag behind the market. This will affect companies that want to innovate, as they may either risk being dragged back by regulation later or hold back from innovating in order to stay in step with the rules.

The overall lesson here is that your generative AI application has to deliver on the business goal that you set out to achieve. Understanding this relies on the data that you have available. Using this data effectively ensures you can make the right changes to how your application is structured, with the emphasis on delivering that business goal faster.