Vectorized processing is the secret sauce behind the fastest, most powerful analytics database, ever

James Dilworth | November 18, 2016

How do you uncover security threats from billions of different signals when time is of the essence? This was the challenge faced by the US Army Intelligence and Security Command eight years ago. It is the challenge that became the genesis of the Kinetica vectorized analytics database.

Threat identification requires searching for patterns within large volumes of data and from a wide variety of streaming data sources. The military needed a system that would provide for ad-hoc exploration of that data, while it is fresh, and without knowing in advance what questions would need to be asked.

The challenges faced by the military were in many ways a forerunner to those that are being encountered by all larger businesses today. Cyber data, IoT data, web data, and business transaction data are being generated at massive rates. Patterns and insights need to be recognized quickly, and the need for time relevancy is more important than ever.

The variety and high cardinality of streaming data pose particular challenges for analytical systems. While there are many databases that provide analytics at scale (such as Oracle and Teradata) or fast data access at scale (NoSQL), they have all struggled when complex querying on time-sensitive data is also required.

One example where this challenge arises is defending against a DDoS attack. Log data surges in from a wide variety of IPs, and you need a system that can look up the top 10,000 IPs by number of packets transferred. That is a grouped aggregation over a high-cardinality key followed by a sort, run on data that is still arriving, and it is a difficult workload for traditional analytics systems.
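To make the shape of that query concrete, here is a minimal sketch in plain Python. It is not Kinetica code; the function name, the record layout, and the sample addresses are all assumptions made for illustration. It sums packets per source IP and then keeps only the largest totals.

```python
from collections import Counter
import heapq


def top_ips_by_packets(log_records, n):
    """Return the n source IPs with the highest total packet counts.

    log_records is an iterable of (source_ip, packet_count) pairs.
    The record layout is an assumption made for this sketch.
    """
    totals = Counter()
    for source_ip, packet_count in log_records:
        totals[source_ip] += packet_count  # group by IP, sum packets
    # Partial sort: only the n largest totals are put in order.
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


# Made-up sample data for illustration only.
sample_log = [
    ("10.0.0.1", 120),
    ("10.0.0.2", 40),
    ("10.0.0.1", 300),
    ("10.0.0.3", 75),
    ("10.0.0.2", 10),
]
top_two = top_ips_by_packets(sample_log, 2)
print(top_two)
```

The logic is simple; the difficulty described above comes from running it over billions of rows with a very large number of distinct IPs while new rows keep arriving.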

As the military evaluated different systems, they saw the same pattern play out over and over again: as the queries became more advanced, even the most expensive databases couldn’t keep up. More varied queries required more indexes, which reduced ingestion capacity. The system would require more and more hardware until it eventually fell over.

A better solution was sorely needed.

Parallel processing on GPUs solves compute-intensive workloads

What if it were possible to build a database with so much raw compute power that you wouldn’t need to index as much data? This was the promise of GPUs, which typically have thousands of cores that excel at applying the same instruction to many data elements in parallel. This makes them well suited to the compute-intensive workloads that large data sets demand.

NVIDIA’s release of the CUDA API opened the door to using graphics cards for high-performance compute on standard hardware. With thousands of processing cores available on a single card, it would be possible to use brute force to solve the aggregation, sort, and grouping operations that are so compute-intensive on a CPU.

Taking advantage of GPUs for such operations required ground-up development of a new database. GPUs require specific programming, and operations need to be processed differently to take maximum advantage of their threading model. Data would be held in a columnar store, optimized for feeding the GPU as fast as possible.
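The appeal of a columnar layout can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Kinetica's storage format: the field names and the `to_columns` helper are hypothetical, and the code runs on the CPU. The point is only that an aggregate over one field touches one contiguous, typed array instead of walking every record.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
# Field names and values are made up for illustration.
rows = [
    {"src_ip": "10.0.0.1", "packets": 120, "bytes": 9000},
    {"src_ip": "10.0.0.2", "packets": 40, "bytes": 2500},
    {"src_ip": "10.0.0.3", "packets": 75, "bytes": 6100},
]


def to_columns(records, int_fields):
    """Pivot records into one contiguous 64-bit integer array per field."""
    return {
        name: array("q", (record[name] for record in records))
        for name in int_fields
    }


columns = to_columns(rows, ["packets", "bytes"])

# An aggregate over one field now reads a single dense array.
total_packets = sum(columns["packets"])
print(total_packets)
```

On a GPU, a dense array like `columns["packets"]` is the kind of input that can be split across many threads, each applying the same operation to its own slice.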

System memory enables linear scale out

Getting the most from the GPU requires that data be held in memory, but should it be held by memory on the GPU card, or in system memory?

The simplest method is to hold data in vRAM, the memory available on the GPU card itself. The biggest benefit is that data reaches the GPU cores very quickly, but the downside is that vRAM is expensive and limited in capacity: currently 24GB on an NVIDIA K80. Even on a high-end HPC system with 16 GPU cards, the maximum data set that could be worked on would still be under 400GB (16 × 24GB = 384GB). That’s not enough to tackle the largest analytics challenges of big business.

Alternatively, system RAM allows the database to scale capacity to much larger volumes and to scale out across many nodes of standard hardware. Data stored in main memory can be fed to the GPU. A typical set-up might be a machine with 1TB of memory and two GPU cards. Almost linear scalability makes it viable to do real-time analytics on a 10TB, or theoretically a 100TB, dataset by scaling out to 10 or 100 nodes, respectively.
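The capacity arithmetic behind these two options is easy to check. The sketch below uses only the figures quoted in the text (24GB per K80, 16 cards, 1TB of RAM per node). The function names are hypothetical, and the calculation deliberately ignores replication, working memory, and operating system overhead, which a real deployment would have to budget for.

```python
import math

VRAM_PER_CARD_GB = 24      # NVIDIA K80 figure quoted in the text
CARDS_PER_HPC_SYSTEM = 16  # high-end single system from the text
RAM_PER_NODE_TB = 1        # "typical" node from the text


def vram_capacity_gb(cards, gb_per_card=VRAM_PER_CARD_GB):
    """Upper bound on data held purely in GPU memory."""
    return cards * gb_per_card


def nodes_needed(dataset_tb, ram_per_node_tb=RAM_PER_NODE_TB):
    """Nodes required to hold a dataset in system RAM (no overhead assumed)."""
    return math.ceil(dataset_tb / ram_per_node_tb)


print(vram_capacity_gb(CARDS_PER_HPC_SYSTEM))  # GB available in vRAM only
print(nodes_needed(10), nodes_needed(100))     # nodes for 10TB and 100TB
```

Under these simplifying assumptions the vRAM-only ceiling is 384GB, while the system-RAM design grows by one node per terabyte.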

Multi-head ingest for high volumes of streaming data

With so much machine-generated data being created today, speed of ingest is of critical concern.

Typically, distributed databases perform writes through a head node, or a cluster of head nodes, which then decides where the data will go. However, peer-to-peer ingestion, where every node of the system shares the task of data ingest, provides higher throughput. The client API is given a routing list of all the nodes in the cluster, and each client can ingest directly to all the workers. There is no head-node bottleneck, and ingest capacity can be increased by adding more nodes. In recent trials, Kinetica was able to ingest almost 4 billion rows/minute on a 10-node cluster from Spark on Hadoop.
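One way a client could use such a routing list is sketched below. The text does not say how Kinetica assigns records to workers, so the hash-on-a-shard-key rule here is an assumption, as are the function names, the worker names, and the record fields. The sketch only shows the client-side step: grouping a batch by destination so that each group can be sent straight to its worker without passing through a head node.

```python
import zlib
from collections import defaultdict


def pick_worker(shard_key, routing_list):
    """Map a shard key to one worker (hypothetical hash-based rule).

    crc32 is used because it gives the same result in every process,
    unlike Python's built-in hash() for strings.
    """
    digest = zlib.crc32(shard_key.encode("utf-8"))
    return routing_list[digest % len(routing_list)]


def route_batch(records, routing_list, key_field):
    """Group records by destination worker, entirely on the client."""
    batches = defaultdict(list)
    for record in records:
        worker = pick_worker(record[key_field], routing_list)
        batches[worker].append(record)
    return dict(batches)


# Hypothetical worker names and records for illustration only.
routing_list = ["worker-0", "worker-1", "worker-2"]
records = [
    {"src_ip": "10.0.0.1", "packets": 120},
    {"src_ip": "10.0.0.2", "packets": 40},
    {"src_ip": "10.0.0.1", "packets": 300},
    {"src_ip": "10.0.0.3", "packets": 75},
]
batches = route_batch(records, routing_list, "src_ip")
for worker, batch in sorted(batches.items()):
    print(worker, len(batch))
```

Because every client can compute the destination on its own, adding workers adds ingest paths rather than adding load to a coordinator.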

But getting the data in is only half the story; making it available for read is the other half, and that’s where GPU acceleration leads to even more remarkable improvements.

The Holy Grail: Simultaneous ingest and query

With orders of magnitude more compute provided by GPUs, the data structures can be simpler and heavy indexing is unnecessary. This means the database does not have as much work to do for each update. Data structures are simply grids, with associated metadata, and the compute kernels only need to do fairly simple operations for each write. Complex queries on the new data can be performed immediately.
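The trade described here, cheap writes paid for with brute-force reads, can be sketched in plain Python. The `GridTable` class and its methods are hypothetical and run on the CPU; they are not Kinetica's data structures. A write is just an append to each column, and a query scans every row, so a newly written row is visible to the very next query.

```python
from array import array


class GridTable:
    """Hypothetical index-free table: one typed array per column."""

    def __init__(self):
        self.timestamps = array("q")  # 64-bit integers
        self.values = array("d")      # 64-bit floats

    def insert(self, timestamp, value):
        # A write is two appends; there is no index to maintain.
        self.timestamps.append(timestamp)
        self.values.append(value)

    def sum_between(self, start, end):
        # Brute-force scan of every row with start <= timestamp < end.
        return sum(
            value
            for timestamp, value in zip(self.timestamps, self.values)
            if start <= timestamp < end
        )


table = GridTable()
table.insert(1, 1.5)
table.insert(5, 2.5)
before = table.sum_between(0, 10)

table.insert(7, 4.0)               # new row arrives during querying
after = table.sum_between(0, 10)   # visible to the next scan
print(before, after)
```

The scan is the expensive part, and it is exactly the kind of uniform, per-row work that the article argues a GPU can absorb.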

CPU-bound in-memory systems such as SAP HANA or MemSQL require the use of complex data structures to speed up queries. This works at low node counts, but as the data under management grows, updating these data structures becomes increasingly costly, leading to growing latency between when data is written and when it can be read.

With GPUs you can see complex queries returned in milliseconds even as the dataset grows and more nodes are added. More and more customers are trying to solve problems where leading analytical databases can’t keep up with the ingest, and where queries run simultaneously with ingest cause the database to max out and crash.

Geospatial analysis

Vectorized processors are also ideal for processing data that is positioned in space and time. Geospatial indexes can become very complex, as systems typically partition data into separate buckets for different resolutions of the earth. A GPU can instead compute directly over massive amounts of geospatial data with ease. The GPU also opens the door to powerful visualization capabilities.
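As a rough illustration of computing on geospatial data directly, the sketch below filters points by a bounding box without any spatial index: every point is tested against the box. The function name and the sample coordinates are assumptions, the code runs on the CPU, and it does not handle boxes that cross the antimeridian.

```python
from array import array


def points_in_box(lats, lons, min_lat, max_lat, min_lon, max_lon):
    """Return the indexes of points inside the box, testing every point."""
    return [
        index
        for index, (lat, lon) in enumerate(zip(lats, lons))
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
    ]


# Made-up sample coordinates held as two columns.
lats = array("d", [37.77, 40.71, 51.51, 37.80])
lons = array("d", [-122.42, -74.01, -0.13, -122.27])

inside = points_in_box(lats, lons, 37.0, 38.0, -123.0, -122.0)
print(inside)
```

Each comparison is independent of the others, which is why this kind of filter maps naturally onto thousands of parallel cores.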

GPU-powered analytics is ready for the enterprise–and new workloads

After eight years of development, Kinetica is now at version 7.1 and is quite probably the fastest, most powerful OLAP engine available commercially.

A new era of real-time analytics capability, powered by GPUs, is now upon us. It will be exciting to watch how enterprises leverage these new capabilities.