Generative AI Solution for Real-Time Inferencing Powered by NVIDIA AI Enterprise

Kinetica, the leader in real-time GPU-accelerated analytics, today announced at NVIDIA GTC a generative AI solution for enterprise customers that showcases the next step in the evolution of retrieval-augmented generation (RAG).

Generative AI applications use RAG to access and integrate up-to-date information from external knowledge bases, ensuring responses go beyond a large language model's (LLM) original training data. However, the prevalent methods of enriching context (through vector similarity searches) are inadequate for quantitative data, as they are designed primarily to understand textual content. Moreover, most (if not all) solutions incur significant lag because new data must be reindexed before it becomes available for a similarity search. As a result, these solutions cannot effectively support use cases that need to interface with real-time operational data.

Kinetica's solution, powered by NVIDIA NeMo (part of the NVIDIA AI Enterprise software platform) and NVIDIA accelerated computing infrastructure, addresses all of these concerns. It is founded on two critical components: low-latency vector search (leveraging NVIDIA RAPIDS RAFT technology) and the ability to perform real-time, complex data queries. This powerful combination enables enterprises to instantly enrich their generative AI applications with domain-specific analytical insights, derived directly from the latest operational data.

But Kinetica goes further. To truly understand data, AI needs context about the structure, relationships and meaning of tables and columns in an enterprise's data. Kinetica has built native database objects that allow users to define this semantic context for enterprise data. An LLM can use these objects to grasp the referential context it needs to interact with a database in a context-aware manner.
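
As a rough illustration of the idea, the sketch below shows how table and column descriptions could be rendered into text that an LLM receives alongside a question. It does not use Kinetica's database objects or syntax; the TableContext class, the render_context function, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableContext:
    """Hypothetical container for the semantic description of one table."""
    table: str
    description: str
    columns: dict = field(default_factory=dict)  # column name -> meaning


def render_context(contexts):
    """Render the descriptions as plain text suitable for an LLM prompt."""
    lines = []
    for ctx in contexts:
        lines.append(f"Table {ctx.table}: {ctx.description}")
        for name, meaning in ctx.columns.items():
            lines.append(f"  - {name}: {meaning}")
    return "\n".join(lines)


# Illustrative table and column names (assumptions, not a real schema).
telemetry = TableContext(
    table="radio_telemetry",
    description="One row per radio measurement report.",
    columns={
        "cell_id": "Identifier of the serving cell.",
        "rsrp_dbm": "Reference signal received power, in dBm.",
    },
)
context_text = render_context([telemetry])
print(context_text)
```

The point of the sketch is only that the model is given the meaning of tables and columns rather than bare names.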

"Kinetica's real-time RAG solution, powered by NVIDIA NeMo Retriever microservices, seamlessly integrates LLMs with real-time streaming data insights, overcoming the limitations of traditional approaches," said Nima Negahban, Cofounder and CEO, Kinetica. "This innovation helps enterprise clients and analysts gain business insights from operational data, like network data in telcos, using just plain English. All they have to do is ask questions, and we handle the rest."

All the features in Kinetica's generative AI solution are exposed to developers via a relational SQL API and LangChain plugins. This means that developers building applications can harness all the enterprise-grade features that come with a relational database. These include control over who can access the data (Role-Based Access Control), reduced data movement from existing data lakes and warehouses (query federation that allows push-down to existing data sources), and preservation of existing relational schemas.

"Data is the foundation of AI, and enterprises everywhere are eager to connect theirs to generative AI applications," said Ronnie Vasishta, Senior Vice President of Telecom, NVIDIA. "Kinetica uses the NVIDIA AI Enterprise software platform and accelerated computing infrastructure to infuse real-time data into LLMs, helping customers transform their productivity with generative AI."

How it works

Kinetica's real-time generative AI solution removes the requirement for reindexing vectors before they are available for query. Additionally, it can ingest vector embeddings 5X faster than the previous market leader, based on the popular VectorDBBench benchmark. Taken together, this provides best-in-class performance for vector similarity searches that can support real-time use cases.
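
To make the "no reindexing" property concrete, here is a minimal sketch of a vector store in which a newly inserted embedding is searchable immediately. It uses an exact brute-force scan with cosine similarity, not Kinetica's implementation or RAPIDS RAFT; the FlatVectorStore class is hypothetical, and it assumes non-zero vectors.

```python
import math


class FlatVectorStore:
    """Hypothetical in-memory store: exact scan, so there is no index to rebuild."""

    def __init__(self):
        self.rows = []  # list of (item_id, unit-length vector)

    @staticmethod
    def _normalize(vector):
        # Assumes a non-zero vector; a zero vector would divide by zero.
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    def insert(self, item_id, vector):
        # Appending is the only write step; the row is queryable right away.
        self.rows.append((item_id, self._normalize(vector)))

    def search(self, query, k=1):
        # Cosine similarity against every stored vector, highest first.
        q = self._normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), item_id)
            for item_id, v in self.rows
        ]
        scored.sort(reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]


store = FlatVectorStore()
store.insert("a", [1.0, 0.0, 0.0])
store.insert("b", [0.0, 1.0, 0.0])
first = store.search([0.9, 0.1, 0.0], k=1)   # best match among "a" and "b"

store.insert("c", [0.9, 0.1, 0.0])           # new data arrives
second = store.search([0.9, 0.1, 0.0], k=1)  # visible without any rebuild
print(first[0][0], second[0][0])
```

An exact scan costs time proportional to the number of stored vectors, which is the trade-off that approximate indexes are built to avoid; the sketch shows only the freshness property, not how to get it at scale.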

User-facing applications need to be responsive. The last thing users want when they are using a chat application is an endless spinning wheel. By executing analytical functions on large volumes of data in real time, Kinetica's solution provides the data runtime for generative AI applications that keeps the conversation flowing.

Under the hood, Kinetica uses the NVIDIA CUDA Toolkit to build vectorized database kernels that can harness the massive parallelism offered by NVIDIA GPUs. Kinetica has built a vast corpus of fully vectorized analytical functions. These cover the fundamental operations found in most analytical databases, such as filtering, joining, and aggregating data, as well as specialized functions tailored for spatial, time-series, and graph-based analytics.

Use cases

This analytical breadth across different domains is particularly handy for domain-specific generative AI applications. For instance, in telcos, Kinetica's generative AI solution can be used to explore and analyze packet capture (pcap) traces in real time. This requires extensive use of complex spatial joins, aggregations, and time-series operations.

Currently, network engineers use tools such as Wireshark to troubleshoot problems in the network. Although these tools are very good, they require a certain level of protocol expertise to be effective. With this real-time RAG solution, network engineers can ingest the network traffic and use generative AI to ask questions of the data in plain English.

Another implementation of this solution uses two data inputs: a stream of L2/L3 radio telemetry data and a vector table that stores telecom-specific rules and definitions, along with their embeddings. A domain-specific telco LLM that is trained on telecom data samples and schema is integrated with NVIDIA NeMo to create a chatbot application. The telco LLM converts user questions into a query that is executed in real time. The results of the query, along with any relevant business rules or definitions, are sent to NeMo, which then translates these results into a human-friendly response.
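
The flow described above (question to query, query to results, results plus rules to a response) can be sketched end to end as follows. Every piece here is a stand-in: an in-memory SQLite table replaces the streaming telemetry, a hard-coded function replaces the telco LLM, a keyword match replaces the vector search over rules, and the final step only assembles the text that would be sent to a response model. The schema, sample values, and rule text are invented for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the real-time database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE radio_telemetry (cell_id TEXT, rsrp_dbm REAL)")
conn.executemany(
    "INSERT INTO radio_telemetry VALUES (?, ?)",
    [("cell-1", -95.0), ("cell-1", -97.0),
     ("cell-2", -112.0), ("cell-2", -114.0)],  # made-up sample readings
)

# Made-up rule, keyed by a keyword. A real system would retrieve rules
# by embedding similarity rather than keyword lookup.
RULES = [("signal", "Example rule: average RSRP below -110 dBm is weak signal.")]


def question_to_sql(question):
    """Stand-in for the telco LLM: returns a fixed query for this demo."""
    return (
        "SELECT cell_id, AVG(rsrp_dbm) AS avg_rsrp "
        "FROM radio_telemetry GROUP BY cell_id "
        "ORDER BY avg_rsrp ASC LIMIT 1"
    )


def find_rules(question):
    """Stand-in for the vector search over rules and definitions."""
    words = question.lower().replace("?", "").split()
    return [text for keyword, text in RULES if keyword in words]


def build_prompt(question, rows, rules):
    """Assemble what would be handed to the response-generation model."""
    lines = ["Question: " + question, "Query results:"]
    lines += ["  " + ", ".join(str(value) for value in row) for row in rows]
    lines.append("Relevant rules:")
    lines += ["  " + rule for rule in rules]
    return "\n".join(lines)


question = "Which cell has the weakest signal?"
rows = conn.execute(question_to_sql(question)).fetchall()
rules = find_rules(question)
prompt = build_prompt(question, rows, rules)
print(prompt)
```

Keeping the three steps as separate functions mirrors the division of labor in the description: one model writes the query, the database computes the answer, and a second model phrases it.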