
Getting answers from your data now: Natural language-to-SQL

It’s time you had the ability to start a conversation with your data, and get answers to your questions immediately.

Generative AI has raised people’s expectations about how quickly they can get answers to the novel questions they pose.  What amazes them first is how fast GenAI responds.  When people see this, they wonder why they can’t have a similar experience when asking questions about their enterprise data.  They don’t want to document their requirements, fight for resources, and wait to be prioritized, only to find themselves waiting again while their database teams engineer the environment to answer yesterday’s questions.  They want to ask questions and get answers.

Ad hoc answers 

Most databases aren’t good at answering questions that no one anticipated beforehand.  Consider the traditional methods people have used for decades to optimize database processes and accelerate performance: indexing, denormalization, pre-aggregation, materialized views, partitioning, and summarization.  All of these exist to overcome the performance limitations of traditional analytic databases.

Pre-engineering data is essentially trading agility for performance.  Whenever you pre-engineer data to improve performance for the queries you know and expect, you create technical debt that makes it harder to ask new and untried questions.
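To make the trade-off concrete, here is a minimal pandas sketch with made-up sales data: the pre-built summary answers the question it was designed for instantly, but a brand-new question forces you back to the raw data, or into building yet another summary.

import pandas as pd

# Hypothetical raw fact table: one row per individual sale.
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "hour":    [9, 14, 9, 20],
    "amount":  [120.0, 80.0, 200.0, 45.0],
})

# Pre-engineering: a summary built for the question we expected
# ("revenue by region"), so that query is answered instantly.
revenue_by_region = sales.groupby("region", as_index=False)["amount"].sum()
print(revenue_by_region)

# A new, unanticipated question ("revenue by product and hour") cannot be
# answered from the summary at all; we have to go back to the raw data,
# or build yet another summary and take on more technical debt.
revenue_by_product_hour = sales.groupby(["product", "hour"], as_index=False)["amount"].sum()
print(revenue_by_product_hour)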

To deliver on the promise of supporting ad hoc queries, the database can’t rely on pre-engineering or preemptive indexing to ensure high performance.  What you do need is a kind of backstop.  Kinetica provides that backstop in the form of a compute architecture that can scan all the data quickly.  Kinetica was designed from the ground up to quickly answer unknown, complex ad hoc questions.

We brought this engine to the table before the generative AI discussion even began.

War room

Kinetica’s origins date back to 2009 at the US Dept. of Defense, in support of its mission to track the behavior and tactical movement patterns of suspected terrorists.  The DoD had tried all of the leading distributed analytic platforms, including Teradata, Hadoop, and others.  But those environments were too rigid to accommodate the dynamic nature of the questions that analysts were asking.

Our initial objective was to make it feasible to run a very complex set of filters and aggregates across a wide variety of data sets.  These massive data sets included cell phone records, satellite imagery, email communications, and more.  We needed to apply variable criteria over specified time ranges, while merging continuously growing data sets and joining them by given geospatial zones.  And we were doing all of this against real-time data.

The queries themselves had very high cardinality, meaning they worked with large numbers of unique values at all times.  They did not lend themselves well to traditional performance improvements like indexing.  DoD analysts needed a way to quickly find the needle in the haystack.  They needed to slice and dice, fuse, and filter their data sets across an unlimited number of dimensions, without knowing in advance what the access patterns would be.  Beyond traditional OLAP-style questions, analysts had questions involving movement over time and space.  They didn’t know what they were looking for; they were working off of clues, racing against the clock to better understand their enemy.  If the answers didn’t come quickly, the situation would change and the opportunity would be gone.
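Stripped of anything sensitive, the shape of that workload looked something like this illustrative pandas sketch (the data and criteria are invented, and a real zone would be a polygon rather than a bounding box): an improvised time window, a geospatial filter, and a high-cardinality distinct count.

import pandas as pd

# Hypothetical event stream: location pings with a timestamp and an entity id.
events = pd.DataFrame({
    "entity_id": ["e1", "e2", "e1", "e3"],
    "ts": pd.to_datetime(["2024-01-01 10:05", "2024-01-01 10:20",
                          "2024-01-01 11:40", "2024-01-02 09:00"]),
    "lat": [33.61, 33.70, 33.62, 34.10],
    "lon": [44.35, 44.40, 44.36, 43.90],
})

# Ad hoc criteria an analyst might improvise on the spot: a time window
# plus a geospatial zone (a simple bounding box standing in for a polygon).
in_window = (events["ts"] >= "2024-01-01 10:00") & (events["ts"] < "2024-01-01 12:00")
in_zone = events["lat"].between(33.5, 33.8) & events["lon"].between(44.3, 44.5)

# High-cardinality question: how many distinct entities were seen there and then?
matches = events[in_window & in_zone]
print(matches["entity_id"].nunique(), "distinct entities in that zone and window")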

We looked at this problem from an entirely different perspective.  Traditional data engineering techniques are designed to avoid scanning all the data, because looking at every record and performing complex joins have historically been expensive and slow tasks.  What if scanning all the data wasn’t expensive and slow?  Thinking along these lines, we built an engine that could scan data more quickly and efficiently than other databases.

Our solution for the DoD was vectorized processing.  We leveraged the power of GPUs, cutting the time needed to scan all the data in memory to far less than a traditional database spends traversing an index or running a pipeline.  It’s vectorized processing that gives us this backstop: the guarantee that every query will return results quickly, even without engineering the underlying data for performance.
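This isn’t Kinetica’s engine, but a small NumPy sketch conveys what “vectorized” means in practice: the same predicate and aggregation expressed as whole-array operations instead of a row-at-a-time loop, which is the style of execution a GPU can then spread across thousands of cores.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000_000                       # ten million rows held in memory
price = rng.uniform(0, 500, n)
qty = rng.integers(1, 20, n)

def scalar_scan(price, qty):
    # Row-at-a-time scan: apply the predicate and accumulate one row at a time.
    total = 0.0
    for p, q in zip(price, qty):
        if p > 400:
            total += p * q
    return total

def vectorized_scan(price, qty):
    # Vectorized scan: the same filter and aggregation as whole-array operations.
    mask = price > 400
    return float(np.sum(price[mask] * qty[mask]))

print(vectorized_scan(price, qty))                  # scans all ten million rows
print(scalar_scan(price[:100_000], qty[:100_000]))  # the loop is only practical on a slice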

Just add language

Speed and agility are at the core of Kinetica, making it highly complementary to LLMs tasked with converting natural language to SQL (NL2SQL).  Kinetica was the first to bring NL2SQL to an analytic database, with the launch of SQL-GPT last May.  Each time you give SQL-GPT a question, the LLM translates that question into a SQL query that can then be run within Kinetica.
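Conceptually, the loop looks like the Python sketch below.  Both callables are placeholders (with SQL-GPT, the translation happens inside the database); they simply mark where the LLM step and the database step occur.

from typing import Callable, List, Tuple

def answer_question(
    question: str,
    translate_to_sql: Callable[[str], str],     # placeholder for the LLM call
    run_query: Callable[[str], List[Tuple]],    # placeholder for the database call
) -> List[Tuple]:
    sql = translate_to_sql(question)            # natural language -> SQL
    return run_query(sql)                       # SQL -> result rows

# Toy stand-ins so the sketch runs end to end.
fake_llm = lambda q: "SELECT region, SUM(amount) FROM sales GROUP BY region"
fake_db = lambda sql: [("East", 200.0), ("West", 245.0)]

print(answer_question("What is revenue by region?", fake_llm, fake_db))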

Today, most of what we’re seeing around using LLMs to convert language into SQL is centered on the application layer.  These solutions are decoupled from the database that will have to execute the query.  We believe that approach is flawed.

Kinetica’s approach includes our own native LLM that’s already aware of Kinetica syntax and analytic functions.  So when you use Kinetica’s LLM, you’re leveraging semantics that already reflect a deep understanding of the nuances of both SQL and Kinetica’s own syntax for vector search, time-series, graph, and geospatial analytics.

To help developers build applications, we added constructs within Kinetica’s database engine that allow users to interact more natively with an LLM.  Through our SQL API, users can quickly and easily define context objects, provide few-shot training samples, and execute LLM output directly.
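The sketch below is illustrative only and is not Kinetica’s actual syntax (the SQL API and product documentation define that).  It shows, in plain Python, the kind of information a context object carries (table descriptions, rules, and few-shot question/SQL pairs) and how that information folds into the prompt the LLM sees.

# Hypothetical context: table descriptions, rules, and few-shot samples.
context = {
    "tables": {
        "trips": "Taxi trips with pickup_ts, dropoff_ts, pickup_point, fare",
    },
    "rules": [
        "Filter on pickup_ts whenever the question mentions a date range.",
    ],
    "samples": [
        ("How many trips were there yesterday?",
         "SELECT COUNT(*) FROM trips WHERE pickup_ts >= NOW() - INTERVAL '1' DAY"),
    ],
}

def build_prompt(context: dict, question: str) -> str:
    # Fold the context and few-shot samples into a single prompt for the LLM.
    lines = ["Tables:"]
    lines += [f"  {name}: {desc}" for name, desc in context["tables"].items()]
    lines += ["Rules:"] + [f"  - {rule}" for rule in context["rules"]]
    lines += ["Examples:"]
    lines += [f"  Q: {q}\n  SQL: {sql}" for q, sql in context["samples"]]
    lines += [f"Q: {question}", "SQL:"]
    return "\n".join(lines)

print(build_prompt(context, "What was the average fare last week?"))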

Typically, the LLMs you see used in AI today are tuned for creativity.  One way they go about this is through probabilistic sampling: rather than always picking the single most likely next word, they sample from a distribution of candidates, which amounts to making educated guesses.  What appears to be spontaneously produced text (not always saying the same thing every time) is a side effect of making different educated guesses with each iteration.
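A toy example makes the point.  The words and probabilities below are invented, but the mechanism is the same: because the next word is drawn by weighted chance, repeated runs can come out differently.

import numpy as np

rng = np.random.default_rng()

# Toy next-word distribution over four candidate words.
words = ["revenue", "sales", "income", "turnover"]
logits = np.array([2.1, 1.9, 0.7, 0.2])

def sample_next(logits, temperature=1.0):
    # An "educated guess": draw a word at random, weighted by its probability.
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return rng.choice(words, p=probs)

# Five runs of the same step can yield five different choices.
print([sample_next(logits) for _ in range(5)])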

That’s nice enough when you’re writing an essay.  When you’re generating SQL, you don’t want spontaneity; you require unwavering consistency.  You don’t want variations in your SQL; you want the generated code to be reliable and functional every time.

When Kinetica assembles a sequence of SQL commands for the output, rather than always choosing the single highest-probability word as the next word in the sequence, it explores multiple potential paths before picking the best one.  Language models that generate natural-language text tend to avoid these methods, because their output ends up repeating itself.  But when you’re creating SQL, you want that consistency.
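The sketch below is not Kinetica’s implementation; it is a generic beam-search illustration over a toy next-token model.  Keeping several candidate paths, scoring them, and returning the best one involves no random sampling, so it produces the same output on every run.

import math

def next_token_probs(prefix: tuple) -> dict:
    # A toy, deterministic "model": the probability of each candidate next token.
    if not prefix:
        return {"SELECT": 0.9, "WITH": 0.1}
    if prefix[-1] == "SELECT":
        return {"COUNT(*)": 0.5, "region,": 0.5}
    return {"FROM sales": 1.0}

def beam_search(width: int = 2, steps: int = 3):
    # Keep the `width` highest-scoring partial sequences at every step,
    # then return the single best sequence found.
    beams = [((), 0.0)]                      # (tokens so far, log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]

print(beam_search())    # identical output on every run; nothing is sampled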

Peanut butter and jelly

If NL2SQL is the peanut butter, the underlying database is the jelly.  When an organization opens the gates and lets users ask any question of their data, the SQL that gets produced will be complicated: queries with complex joins, correlated subqueries, and calls to sophisticated analytic functions such as graph algorithms.  It won’t be easily controlled or managed, and traditional methods of optimization and planning will no longer work.  You need a data engine with the horsepower to take on any query and return results built on up-to-date data, with a quick response time.

Now you can finally have a real conversation with your data.
