Scaling Streaming Analytics to Petabytes
Organizations need continuous insight into their services, their users, and the enormous volume of data that powers those services. In this article, we explore how Kinetica delivers high-performance analytics on streaming data at massive scale.
Kinetica helps its customers solve a variety of problems, including dynamic inventory replenishment, 4G and 5G network planning, mail routing optimization, drug discovery acceleration, scalable ocean trash detection, and many more. All of these use cases require high-performance analytics on data at a scale that ranges from terabytes to petabytes.
Kinetica takes several approaches to enable analytics at this scale:
- Distributed Data and Parallel Query Execution
When data is ingested, it is automatically distributed across Kinetica servers (physical or virtual), and across its worker ranks. During query execution, queries are broken down and executed on the ranks in parallel. Kinetica intelligently executes query operations on either multi-core CPUs or GPUs to achieve the best performance possible.
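The scatter-and-merge pattern described above can be illustrated with a small, generic sketch. This is not Kinetica's implementation or API. It assumes a hypothetical cluster of four worker ranks (`NUM_RANKS`), round-robin distribution at ingest, and threads standing in for separate rank processes. Each rank aggregates only its own rows, and the partial results are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_RANKS = 4  # hypothetical number of worker ranks, chosen for illustration


def distribute(rows, num_ranks=NUM_RANKS):
    """Spread rows across ranks round-robin, as a simple ingest step might."""
    ranks = [[] for _ in range(num_ranks)]
    for i, row in enumerate(rows):
        ranks[i % num_ranks].append(row)
    return ranks


def partial_sum(rank_rows):
    """Per-rank work: sum amounts by region using only the rank's local rows."""
    totals = Counter()
    for region, amount in rank_rows:
        totals[region] += amount
    return totals


def parallel_sum_by_region(ranks):
    """Run the partial aggregation on every rank, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(ranks)) as pool:
        partials = list(pool.map(partial_sum, ranks))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)


rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
result = parallel_sum_by_region(distribute(rows))
print(result)
```

The final merge only touches one small dictionary per rank, so the cost of combining results stays low even when the per-rank data is large.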
- Data Localization
Data can be stored in Kinetica with a shard key, which enforces data locality for tables that will be joined together. Tables with a common shard key are distributed and stored together, which guarantees that worker ranks do not have to move data between server nodes to perform a join. For small dimension tables, Kinetica can instead replicate the table to every rank to ensure data locality.
Distributed data, query parallelization, and data localization allow work to be broken down into parallel tasks. Each task operates on a small subset of the query’s workload with minimal data movement. However, the speed at which data can be processed on each node still depends on the I/O speed for bringing data from storage into RAM, and on the amount of RAM available. Some data platforms opt for an in-memory data strategy, where all required data resides in RAM. This dramatically reduces I/O at runtime, but the scale of the data is constrained by the available memory, which is not cost-effective for massive data volumes. When dealing with petabytes of data, options such as HDFS, object storage (such as S3), and blob storage are much more cost-effective, but they pay a penalty in slower I/O.
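To make the shard-key idea concrete, here is a minimal sketch under stated assumptions. It is not Kinetica's sharding scheme. It assumes a hypothetical three-rank cluster and uses a CRC32 hash (`rank_for`) to place rows. Because both tables are sharded on the same customer ID, every order lands on the same rank as its customer, and each rank can join using only its local rows.

```python
import zlib
from collections import defaultdict

NUM_RANKS = 3  # hypothetical number of worker ranks, chosen for illustration


def rank_for(key, num_ranks=NUM_RANKS):
    """Deterministic hash: the same shard key always maps to the same rank."""
    return zlib.crc32(str(key).encode("utf-8")) % num_ranks


def shard(rows, key_index, num_ranks=NUM_RANKS):
    """Place each row on the rank selected by its shard key column."""
    ranks = defaultdict(list)
    for row in rows:
        ranks[rank_for(row[key_index], num_ranks)].append(row)
    return ranks


def local_join(orders_by_rank, customers_by_rank, num_ranks=NUM_RANKS):
    """Join orders to customers rank by rank, never reading another rank's rows."""
    joined = []
    for rank in range(num_ranks):
        lookup = dict(customers_by_rank.get(rank, []))  # customer_id -> name
        for order_id, customer_id, amount in orders_by_rank.get(rank, []):
            if customer_id in lookup:
                joined.append((order_id, lookup[customer_id], amount))
    return joined


customers = [(1, "Ada"), (2, "Grace"), (3, "Linus")]          # (customer_id, name)
orders = [(100, 1, 25.0), (101, 3, 9.5), (102, 1, 4.0), (103, 2, 12.0)]

joined = local_join(shard(orders, key_index=1), shard(customers, key_index=0))
print(sorted(joined))
```

If the two tables were sharded on different keys, the per-rank lookup would miss matches and rows would have to be shuffled between ranks before joining. Replicating a small dimension table to every rank avoids that shuffle in a different way, at the cost of extra copies.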
This is a catch-22 for high-speed query performance. Ideally you want the data in RAM, but for cost-effectiveness you prefer less expensive storage such as HDFS, cloud object storage, or blob storage.
- Memory-First Tiered Storage
Kinetica combines the best qualities of in-memory performance and cold-storage affordability with its memory-first, tiered storage architecture.
“Memory-first” means that, to the extent possible, data is pre-loaded into RAM. However, you don’t need a petabyte of RAM, thanks to Kinetica’s automated tiered storage system. Kinetica automatically moves data between cloud storage, local persistence, and in-memory caching to get the best performance during query processing, letting customers take advantage of inexpensive “data lake” storage while still optimizing data access for faster queries. This allows customers to blend more data sets in their analytics to compute larger aggregations with streaming data, train more accurate models for machine learning, and analyze more detailed location information than other platforms.
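The general idea of a memory-first tier backed by cheaper storage can be sketched in a few lines. This is a simplified illustration, not Kinetica's tiering logic. The `TieredStore` class, its two tiers, and the least-recently-used eviction policy are all assumptions made for the example. A plain dictionary stands in for disk or object storage.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a bounded RAM tier in front of a cold tier."""

    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity
        self.ram = OrderedDict()  # hot tier, ordered least to most recently used
        self.cold = {}            # stand-in for disk or object storage
        self.cold_reads = 0       # counts how often the slow tier was hit

    def put(self, key, value):
        """Write to RAM first, spilling older entries if RAM is full."""
        self.ram[key] = value
        self.ram.move_to_end(key)
        self.cold.pop(key, None)  # drop any stale copy in the cold tier
        self._evict()

    def get(self, key):
        """Serve from RAM when possible; otherwise promote from the cold tier."""
        if key in self.ram:
            self.ram.move_to_end(key)
            return self.ram[key]
        value = self.cold.pop(key)  # raises KeyError if the key is unknown
        self.cold_reads += 1
        self.ram[key] = value
        self._evict()
        return value

    def _evict(self):
        """Move least recently used entries to the cold tier until RAM fits."""
        while len(self.ram) > self.ram_capacity:
            old_key, old_value = self.ram.popitem(last=False)
            self.cold[old_key] = old_value


store = TieredStore(ram_capacity=2)
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    store.put(key, value)

first = store.get("a")   # "a" was spilled, so this read goes to the cold tier
second = store.get("a")  # now served from RAM
print(first, second, store.cold_reads, list(store.ram), store.cold)
```

Only the first read of a spilled key pays the slow-tier cost. After promotion, repeated reads stay in RAM until the entry is evicted again.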
- High-Speed Analytics of Unmanaged Data Sources
You may already have rich historical data sources, but in order to leverage them in combination with streaming data, you will need a high-speed streaming analytics platform like Kinetica. For example, your ERP and CRM systems may have data about your customers’ profiles stored in a data lake, but those systems alone cannot analyze the data in the context of your customers’ live location and activity.
Data hosted within data lakes can be referenced as external tables, which can be joined seamlessly with data stored in Kinetica, and aggregated with high-velocity streaming data in the same query. Unlock the hidden value in siloed data sets managed by other systems without the need to manually load them into Kinetica’s primary storage, and accelerate your data engineering and data science workflows.
The combination of Kinetica’s memory-first, tiered storage architecture and its ability to read from data lakes at very high speed extends Kinetica’s high-performance streaming analytics to petabyte scale, with several benefits:
- Streaming aggregations can incorporate even larger historical aggregations for increased currency and accuracy.
- Machine learning use cases have even larger datasets for rapid training and feature generation, further improving accuracy.
- Geospatial use cases can extend to regional scale with unprecedented detail for drilling down into hyper-local analysis, and companies can better utilize location data stored in their data lakes for blended queries.
- External tables allow data scientists to explore foreign data sets in context with their streaming analytics within Kinetica in order to discover new insights. These insights can lead to new or improved AI models and analytics that become the basis for new or enhanced business services.
To learn more about the above use cases and Kinetica’s streaming analytics capabilities, sign up for a free trial or contact us.