Do you need a streaming database?
Streaming databases are close cousins to time-series databases (think TimescaleDB) or log databases (think Splunk). All are designed to track a series of events and enable queries that can search and produce statistical profiles of blocks of time in near real-time. Streaming databases can ingest data streams and enable querying across larger windows and with greater context than is possible when analyzing data in motion, all within a compressed latency profile.
How are Streaming Databases different from Conventional Analytic Databases like Snowflake, Redshift, BigQuery, and Oracle?
Conventional analytic databases are batch-oriented, meaning data is loaded periodically in defined windows. Many conventional databases support frequent loading periods, known as micro-batch. This is in contrast to streaming databases, which receive new data continuously as it is generated, with no queuing prior to load.
Further, conventional analytic databases lock the tables that are being loaded. While locked during the load process, the tables remain available to query by the end user or application, but those queries do not reflect the newly loaded data until the batch load completes and the table is unlocked. This has some advantages with respect to ACID compliance but comes at the cost of additional latency.
Another contrast is how queries are optimized for performance. Conventional analytic databases rely on extensive transformations, data engineering, or pipelines to structure the data in a performant way. They also use indexes and materialized summaries to achieve query performance. However, all of these techniques require time to prep the data, which makes the data stale. Streaming databases solve for data freshness and use alternative techniques to achieve performance while keeping the data fresh. These techniques differ between streaming databases, ranging from the less desirable approach of limiting the scope of questions that can be asked by supporting only simple data structures, to next-generation compute architectures that deliver radically faster query speeds without the need for extensive data engineering.
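The staleness problem described above can be sketched in a few lines of Python. This is a toy illustration, not the behavior of any specific database engine: a precomputed summary answers queries fast, but it only reflects rows present at its last rebuild, so newly arrived rows are invisible until the next refresh.

```python
# Toy model of a batch-maintained materialized summary (illustrative only):
# queries read the summary, so new rows are invisible until a rebuild.

class BatchSummary:
    def __init__(self):
        self.rows = []      # raw fact table
        self.total = 0      # materialized summary: sum of amounts

    def insert(self, amount):
        self.rows.append(amount)    # new data lands immediately...

    def query_total(self):
        return self.total           # ...but queries see the stale summary

    def rebuild(self):
        self.total = sum(self.rows) # periodic batch refresh

db = BatchSummary()
db.insert(100)
db.rebuild()
db.insert(50)               # arrives after the last rebuild
stale = db.query_total()    # does not include the new row yet
db.rebuild()
fresh = db.query_total()    # reflects all rows only after the refresh
```

The gap between `stale` and `fresh` is exactly the window of data staleness that a streaming database is engineered to eliminate.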
How are Streaming Databases different from Stream Processing Platforms like Kafka, Amazon Kinesis, Google Dataflow, and Azure Stream Analytics?
Streaming databases analyze persisted data, whereas stream processing platforms analyze data in motion. Analyzing data in motion has the advantage of near zero latency. Think: “Alert me whenever there’s a transaction over $100 originating in Liberia.” If the transactions in the streaming queue contain the amount and origin location, then a simple rule can flag the transactions of interest.
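A rule like the one above is stateless: each event carries everything needed to evaluate it. A minimal sketch in Python (illustrative, not any particular platform's API):

```python
# Stateless stream rule: each transaction is evaluated on its own,
# with no persisted history required (illustrative sketch).

def flag_large_foreign(txn):
    """Flag transactions over $100 originating in Liberia."""
    return txn["amount"] > 100 and txn["origin"] == "Liberia"

# Hypothetical sample events standing in for a live stream.
stream = [
    {"id": 1, "amount": 250.0, "origin": "Liberia"},
    {"id": 2, "amount": 40.0,  "origin": "Liberia"},
    {"id": 3, "amount": 900.0, "origin": "Germany"},
]

flagged = [t["id"] for t in stream if flag_large_foreign(t)]
```

Because no lookup outside the event itself is needed, this kind of rule runs comfortably inside a stream processing platform.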
However, as the question becomes more sophisticated, greater history and context often need to be included, which makes it impossible to perform the analytic within the stream processing platform and necessitates persisting the data. Think: “Alert me whenever there’s a transaction that is over 5X larger than the previous transaction.” If the previous transaction occurred a week ago, that data is no longer in the stream. Think: “Alert me whenever there’s a drone that is within 2 kilometers of a restricted airspace.” Breadcrumb readings from radar of a drone flight may be in the stream queue, but the geofenced restricted airspaces are persisted elsewhere.
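The 5X rule illustrates the shift from stateless to stateful: the previous transaction may have left the stream long ago, so some state must be persisted outside the stream window. A minimal Python sketch of the idea (hypothetical data, not a product API), keeping the last amount per account:

```python
# Stateful stream rule: flagging a transaction requires history that
# may no longer be in the stream, so we keep per-account state in a
# store that outlives the stream window (illustrative sketch).

last_seen = {}  # persisted state: account -> amount of previous transaction

def flag_spike(txn):
    """Flag a transaction more than 5X larger than the account's previous one."""
    prev = last_seen.get(txn["account"])
    last_seen[txn["account"]] = txn["amount"]
    return prev is not None and txn["amount"] > 5 * prev

# Hypothetical events; the first could be days older than the rest.
stream = [
    {"account": "a", "amount": 20.0},   # first sighting: nothing to compare
    {"account": "a", "amount": 150.0},  # 7.5X the previous amount
    {"account": "a", "amount": 160.0},  # roughly level with the previous
]

spikes = [flag_spike(t) for t in stream]
```

In a real deployment that `last_seen` state is the persisted table a streaming database maintains; the geofencing example works the same way, joining in-flight positions against airspace polygons persisted elsewhere.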
It’s very common for data from streaming platforms to get persisted in a conventional analytic database, but if you want to make real-time decisions in context to prevent fraud or maintain safe airspaces, you’ll need to perform that workload in a streaming database.
Business Use Cases for a Streaming Database
There are many examples of important use cases for streaming databases, including:
- Real-time alerting for market changes
- Preventive maintenance and network optimization
- Real-time machine learning inference
- Time-critical services like Uber or Lyft
- Monitoring video and other sensor feeds for anomalies
- Continuous analysis of scientific experiments
- Supply chain transparency
- Fleet optimization
- Common operating picture
How does Kinetica compare to other streaming databases like Materialize, Imply.io, Clickhouse, Pinot, and Rockset?
Streaming databases are typically evaluated on data latency and query latency, which together serve as a useful framework for comparing options. Data latency is how long it takes for data to be loaded and available for query. Query latency is how long the query takes to run.
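The two latencies can be made concrete with a toy timeline (the numbers below are hypothetical, chosen only to show how each interval is measured):

```python
# Hypothetical timeline (seconds) separating the two latencies:
# data latency spans event creation to queryability;
# query latency spans query start to result.

event_created_at   = 0.00   # event is generated at the source
event_queryable_at = 0.25   # event becomes visible to queries
query_started_at   = 0.30
query_finished_at  = 0.34

data_latency  = event_queryable_at - event_created_at
query_latency = query_finished_at - query_started_at
```

A database can excel at one and lag at the other, which is why the two are evaluated separately.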
Kinetica is one of the few analytic databases with native integration with Kafka, which results in faster loading than the JDBC/ODBC connections found in most streaming databases. Kinetica is designed for headless, distributed ingest, resulting in dramatically faster loading of massive data sets. As with most streaming databases, Kinetica employs a lockless architecture that ensures data is made available for query as fast as it can be streamed.
Once the data is loaded, Kinetica’s fully vectorized query engine crushes other databases in independent TPC-DS benchmarks. Most recently, Radiant Advisors compared Clickhouse with Kinetica using TPC-DS. Not only did Clickhouse fail to execute the vast majority of TPC-DS queries, but on the queries it could execute, Kinetica was 50X faster. Vectorization not only speeds up queries but also avoids the index and summary rebuilds that add to overall latency.
All streaming databases can credibly make claims about real-time analytics. They begin to separate and differentiate when you look at the real-time analytic use case. Clickhouse, Pinot, Rockset, Materialize, and others can achieve impressive low latency so long as the query is simple or moderately complex. They don’t like joins, and when a join is present it is usually quite simple. Kinetica can perform the time-series, spatial, and n-way joins that are increasingly common with sensor streams, where other streaming databases simply can’t deliver an acceptable latency profile. This again goes back to the engineering differentiation that comes from a fully vectorized database.
Try Kinetica yourself. Kinetica Cloud is free for use on datasets up to 10GB. Set up your own Kinetica database today.