
From Batch to Streaming: The Next Paradigm in Data Analytics

Kinetica is a data warehouse that can run complex queries on millions of rows of streaming data…in real time. Other solutions today take minutes, hours, and sometimes even days to run the types of mission-critical queries that Kinetica can handle in seconds or even milliseconds.

This type of speed provides a significant competitive advantage. Businesses that respond to opportunities and threats as they unfold, rather than after the fact, stay a step ahead.

So why can’t existing solutions deliver the kind of low-latency, high-performance analytics that streaming data warehouses like Kinetica can?

This post explores why the analysis of streaming data requires a shift away from the existing paradigms of batch and micro-batch processing towards a new paradigm of accelerated computing on modern hardware.

The early 2000s saw a shift away from centralized data warehouses to a distributed model. These new NoSQL database systems could store vast amounts of data distributed across cheap commodity computers, thereby lowering the cost of owning and analyzing massive datasets.

In a typical distributed system, users would write queries, which would then be mapped to the individual nodes that carry a part of the data. The intermediate results from each of these nodes would then be reduced to get the final result.

Apache Hadoop, the open source project that grew out of Google’s research papers on MapReduce and the Google File System, laid most of the groundwork for this. The majority of solutions that currently exist are based on the MapReduce and batch processing concepts introduced by Hadoop in the mid-2000s, and this remains the dominant paradigm in database technologies. So let’s take a closer look at a typical batch processing workflow.

Imagine you are a company that produces three types of products. You store your data using the Hadoop Distributed File System (HDFS), and you are interested in identifying product anomalies, which in this example are flagged by the color yellow.

You collect data from Monday through the end of the week on Friday.

At the end of the day on Friday, the analytics team fires off a scheduled MapReduce job. This is a complicated piece of code that is mapped to the different data partitions to identify anomalies in each; the intermediate results are then reduced to a final total, which goes into a report that is submitted on Monday.

This is a typical batch processing workflow, where the data is collected and analyzed in batches, long after the events of interest have occurred. 
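To make the pattern concrete, here is a minimal, self-contained sketch in Python of what the map and reduce steps of such a job might look like. The record layout, field names, and the convention that yellow marks an anomaly are illustrative assumptions for this example, not Hadoop or Kinetica specifics.

```python
from collections import Counter
from functools import reduce

# Illustrative records; in a real Hadoop job these would live in HDFS
# partitions spread across many nodes.
partition_1 = [{"product": "square", "color": "green"},
               {"product": "circle", "color": "yellow"}]
partition_2 = [{"product": "triangle", "color": "yellow"},
               {"product": "square", "color": "green"}]

def map_partition(records):
    """Map step: count anomalous (yellow) records per product in one partition."""
    counts = Counter()
    for record in records:
        if record["color"] == "yellow":
            counts[record["product"]] += 1
    return counts

def reduce_counts(left, right):
    """Reduce step: merge per-partition counts into a single total."""
    return left + right

# Each partition is mapped independently, then the intermediate results are reduced.
intermediate = [map_partition(p) for p in (partition_1, partition_2)]
anomaly_totals = reduce(reduce_counts, intermediate, Counter())
print(anomaly_totals)  # Counter({'circle': 1, 'triangle': 1})
```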

This approach was a remarkable success when initially proposed in the mid 2000s, particularly given its ability to handle the massive amounts of data generated as the internet expanded. But the world has since moved on. 

Today we produce a tremendous amount of streaming data, ranging from click events to IoT devices, from telemetry to geospatial services, and more. The last two years alone have generated 90% of all the data that exists today, and this pace will only increase as more devices and more people are connected.

While new solutions like Apache Kafka allow us to capture these massive streams of data, the analysis of this data still relies on the now old paradigm of batch processing.
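For example, a minimal sketch of landing such a stream with the kafka-python client might look like the following, assuming a broker at localhost:9092 and a topic named product_events (both hypothetical). Note that the events are only being captured here; in a batch-oriented pipeline they still sit in storage until the next scheduled job runs.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical broker address and topic name.
consumer = KafkaConsumer(
    "product_events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Events arrive continuously, but here they are simply landed to local storage
# (a stand-in for HDFS) to be analyzed by a later batch job.
with open("landed_events.jsonl", "a") as sink:
    for message in consumer:
        sink.write(json.dumps(message.value) + "\n")
```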

The biggest drawback of this approach is that every second wasted after an event has occurred represents money and competitive advantage lost. For instance, in our example, we would have liked to identify the anomalies as they were occurring, rather than days after they did. 

The best solutions today can cut this delay down from a week to perhaps a day or even half a day. And they have to rely on a patchwork of tools and infrastructure to accomplish this. Even after disregarding the cost and complexity of achieving this, the delay is simply too long.

Trying to apply batch processing to handle streaming analytics can often feel like you are trying to jam a square peg into a round hole. And it is no surprise that enterprises often look to Kinetica when conventional solutions reach their breaking point while trying to solve streaming analytical problems that Kinetica can handle in a matter of seconds or minutes.

So what is the secret?

The key to Kinetica’s computational speed and power is that it has been vectorized from the ground up to harness the capabilities of modern hardware.

Vectorization, quite simply, is the ability to apply the same instruction to multiple pieces of data simultaneously.

This is orders of magnitude faster than the conventional sequential model where each piece of data is handled one after another in sequence.
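To make this concrete, here is a small illustrative sketch in Python, using NumPy as a stand-in for hardware-level vectorization: the same scaled sum is computed element by element in a loop and then as a single vectorized operation. This is an analogy for the idea, not Kinetica’s internal implementation.

```python
import numpy as np

values = np.random.rand(1_000_000)

# Sequential model: handle each piece of data one after another.
def scaled_sum_sequential(data):
    total = 0.0
    for x in data:
        total += x * 2.0
    return total

# Vectorized model: apply the same multiply-and-add to many elements at once,
# letting optimized, SIMD-friendly kernels do the work.
def scaled_sum_vectorized(data):
    return float(np.sum(data * 2.0))

# On typical hardware the vectorized version is dramatically faster.
print(scaled_sum_vectorized(values))
```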

Most of the analytical code out there is written in this sequential mode. This is understandable, since until about a decade ago, CPU and GPU hardware could not really support vectorization for data analysis.

Analytical functions in Kinetica, on the other hand, have been written from scratch to make use of the vectorization capabilities of modern hardware. And the great thing about Kinetica is that you don’t have to write a single piece of vectorized code yourself. You can simply write your queries in data science languages that you are familiar with, and under the hood Kinetica will deliver your results at mind-boggling speeds.
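As an illustration of what that looks like from the user’s side, the sketch below submits a plain SQL query from Python over ODBC. It assumes Kinetica’s ODBC driver is installed and a data source named kinetica has been configured, and the table and column names are hypothetical; the vectorized execution happens inside the engine, not in this client code.

```python
import pyodbc  # pip install pyodbc

# Hypothetical ODBC data source pointing at a Kinetica instance.
connection = pyodbc.connect("DSN=kinetica")
cursor = connection.cursor()

# An ordinary SQL query over a hypothetical table of streaming product events.
# The engine, not this client, is responsible for running it with vectorized kernels.
cursor.execute(
    "SELECT product_type, COUNT(*) AS anomalies "
    "FROM product_events "
    "WHERE color = 'yellow' "
    "GROUP BY product_type"
)

for product_type, anomalies in cursor.fetchall():
    print(product_type, anomalies)
```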

While vectorization on its own can deliver blazing-fast computational performance, Kinetica does not stop there. It adds further layers of speed and performance: a memory-first approach that prioritizes the use of faster system memory and VRAM (chip memory co-located with the GPU), a proprietary columnar data storage format, and intelligent data distribution and localization features that minimize the movement of data between nodes in a cluster.

Finally, Kinetica’s integrated analytical suite, which has also been written from scratch, ensures that you can harness these capabilities and perform all of your computational tasks in one place, without having to bring in external solutions.

Technologies like Kinetica represent the next generation of event-driven databases, built for modern hardware, that can respond to events as they occur rather than after the fact. And they accomplish all of this at lower cost and complexity.

Interested in learning more about streaming analytics with Kinetica? Visit the links below:

Kinetica Streaming Analytics 

What is a Streaming Data Warehouse?

Demo: Dynamic Common Operational Picture Analysis

Hari Subhash is Sr. Manager of Technical Training and Enablement at Kinetica.
