Kinetica In Motion

12 Features to Look for When Choosing a GPU-Accelerated Analytics Database

Amit Vij
August 2, 2017

GPU acceleration is revolutionizing high-performance computing. The use of GPUs for processing-intensive workloads is on the rise, particularly in verticals such as finance, retail, logistics, health/pharma, and government. GPU acceleration is opening new possibilities for machine learning, deep learning, and data visualization, as well as simply faster queries, joins, and row-by-row math.

If you’re investigating whether a GPU database can meet your needs, you’ll need to make sure it has the features and functionality to deliver the full solution. Several companies offer GPU-accelerated databases, and it can be challenging to understand how they differ and which one is the right fit for your organization. To help with that decision, here’s a checklist for evaluating GPU-accelerated relational databases, along with information on how Kinetica fits into the picture.

Product Maturity and Enterprise Readiness

One of the most important aspects of choosing a GPU database is product maturity. While a variety of GPU-compatible databases can provide first-hand exposure to the benefits of parallelized processing, a production deployment calls for a more robust suite of features.

Production-viable databases are not built overnight, and existing deployments with large-scale customers provide assurance that enterprise features have been tested and are in use.

Kinetica is the first database built from the ground up to harness the parallel processing power of GPUs and many-core devices, using vectorized memory structures and SIMD processing. Product development began in 2009, when Kinetica was designed to track and analyze terrorist and other national security threats in real time to meet the needs of US Army Intelligence.

More recently, Kinetica has been put into use at organizations such as USPS, GSK, PG&E, and ScotiaBank, along with a range of other financial, retail, and logistics companies.

How will you interact with the database?

SQL Semantics

SQL is the lingua franca of relational databases: it drives integration across an enterprise’s IT stack and enables business users, analysts, developers, data scientists, DBAs, and system architects to leverage existing skills to access and analyze data.

If you’ll be interfacing with the database through SQL, you’ll want the solution to support ANSI SQL standards and semantics for:

CRUD operations: Support for create, read, update, and delete operations for full data mutability and comprehensive data management.
SQL constructs such as UNION, DISTINCT, GROUP BY, EXCEPT, and JOIN: Support for all join types (inner, full, left, and right), including joins across more than two tables and on more than one column.
SUBQUERY: Subqueries are queries nested inside other SQL statements. They make SQL more flexible and expressive while keeping individual statements simple. Make sure the database supports subqueries so that complex business logic can be easily expressed.
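
One way to turn this checklist into something testable is a short smoke test that exercises each construct. The sketch below uses Python’s built-in sqlite3 module purely as a stand-in engine; the table, columns, and values are hypothetical, and the same statements would be pointed at the database under evaluation through its own driver.

```python
import sqlite3

# In-memory SQLite is only a stand-in; swap in the connection for the
# database you are evaluating. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create and insert (the C in CRUD).
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO orders (id, region, amount) VALUES (?, ?, ?)",
    [(1, "east", 120.0), (2, "west", 80.0), (3, "east", 40.0), (4, "north", 200.0)],
)

# Update and delete (the U and D in CRUD).
cur.execute("UPDATE orders SET amount = 100.0 WHERE id = 2")
cur.execute("DELETE FROM orders WHERE id = 3")

# GROUP BY with a subquery: regions whose total exceeds the average order amount.
cur.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
    HAVING SUM(amount) > (SELECT AVG(amount) FROM orders)
    ORDER BY region
    """
)
above_average = cur.fetchall()

# DISTINCT and EXCEPT: regions with orders, minus regions with a large order.
cur.execute(
    """
    SELECT DISTINCT region FROM orders
    EXCEPT
    SELECT region FROM orders WHERE amount >= 150.0
    ORDER BY region
    """
)
other_regions = [row[0] for row in cur.fetchall()]
conn.close()

print(above_average)
print(other_regions)
```

If any statement in a test like this fails against a candidate database, that is a quick signal that its SQL coverage is narrower than advertised.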

APIs

Business analysts like the simplicity of SQL, but developers, programmers, and data scientists often want a RESTful API that provides richer programmatic access to the database.

You can interact with Kinetica through either SQL or the native REST API. The API enables much more sophisticated queries and operations with the database. Language-specific bindings are available for Java, Python, C++, JavaScript, and Node.js. Kinetica APIs and connectors are open source and available on GitHub.

Connectors

Any database lives within an ecosystem of tools that ultimately need to be assembled to build a full solution. Suitable connectors make it simpler to get that solution up and running.

Kinetica comes with connectors for Apache Hadoop, NiFi, Spark and Spark Streaming, Storm, and Kafka, in addition to commercial tools such as FME.

Kinetica can also be connected to popular BI tools such as Tableau, MicroStrategy, and Power BI through its certified ODBC/JDBC connectors. Kinetica’s connectors and full SQL-92 query support make it relatively painless to swap in Kinetica as a replacement data source for faster, accelerated BI.

Robustness & Security

System RAM or VRAM?

To feed the high-speed processing available on the GPU, all GPU-accelerated databases store data in memory. This could be either system memory or VRAM on the GPU itself. The benefit of storing data in VRAM is that transfer times are very fast. The downside is that VRAM is expensive and limited in capacity: currently only 16 GB on an NVIDIA P100.

Kinetica is able to utilize both system RAM and VRAM. System RAM allows the database to scale to much larger volumes, into the terabytes. Data stored in main memory can be efficiently fed to the GPU, and this process is even more efficient with the NVLink architecture.
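
To see why capacity matters, here is a rough back-of-the-envelope calculation of how many 16 GB cards it would take to hold a dataset entirely in VRAM. It is a simplification that assumes decimal units and ignores working memory for intermediate results; the dataset sizes are hypothetical.

```python
import math

VRAM_PER_GPU_GB = 16  # NVIDIA P100 figure quoted above


def gpus_needed(dataset_gb: float, vram_per_gpu_gb: float = VRAM_PER_GPU_GB) -> int:
    """Minimum number of GPUs whose combined VRAM could hold the dataset."""
    return math.ceil(dataset_gb / vram_per_gpu_gb)


# Decimal units: 1 TB = 1000 GB. Dataset sizes are illustrative only.
for dataset_tb in (1, 10):
    print(f"{dataset_tb} TB -> {gpus_needed(dataset_tb * 1000)} GPUs of VRAM")
```

Under these assumptions, a 1 TB dataset already needs 63 cards and a 10 TB dataset needs 625, which is why spilling to system RAM is the practical route to larger volumes.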

Security

Enterprise-grade security is critical to a business adopting any data and analytics solution. Storing, managing, and analyzing sensitive data such as customer details, social security numbers, and credit card numbers requires a solution that provides comprehensive security to avoid data breaches and inadvertent use of information.

Enterprise-grade security for GPU databases should include:

Authentication: Kinetica integrates with OpenLDAP, Kerberos, or Active Directory for authentication.
Authorization: Kinetica supports RBAC (Role-Based Access Control) for database and table authorization. Grant semantics make it easy to set up and manage users’ access to data.
Encryption: Kinetica supports encryption of data in motion and at rest with SSL, TLS, and AES-256.

Data Persistence

The benefit of storing data in memory is speed. The weakness is that data can be lost if the power goes off or something causes the database to crash. You can have your cake and eat it too if the database is able to work on data in memory and persist that data to disk for reliability.

Kinetica is able to manage data in both GPU VRAM and system memory in addition to persisting that data to disk. Many other GPU databases lack persistence, which means that if your GPU database stops, your data is gone. You would need to reload your data from the original source.
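
The general idea of combining in-memory speed with on-disk durability can be sketched with an append-only log that is replayed on startup. This is a generic illustration and not a description of how Kinetica persists data; the class name, file name, and keys are all hypothetical.

```python
import json
import os
import tempfile


class PersistentStore:
    """In-memory dict backed by an append-only log file (illustrative only)."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.data = {}
        # On startup, replay the log to rebuild the in-memory state.
        if os.path.exists(log_path):
            with open(log_path, "r", encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    self.data[record["key"]] = record["value"]

    def put(self, key: str, value) -> None:
        # Write to disk first, then update memory, so a crash cannot lose
        # a write that has already been acknowledged.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

    def get(self, key: str):
        # Reads are served from memory only.
        return self.data.get(key)


log_file = os.path.join(tempfile.mkdtemp(), "store.log")
store = PersistentStore(log_file)
store.put("sensor-1", 21.5)
store.put("sensor-1", 22.0)

# Simulate a crash: discard the in-memory object and start a new one.
del store
recovered = PersistentStore(log_file)
print(recovered.get("sensor-1"))
```

Without the log, the second instance would start empty and the data would have to be reloaded from the original source.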

Multi-Table/Multi-Column Joins

Chances are that if you’re looking at GPU acceleration for your database, you have more complex schemas with data spread across multiple tables. In these scenarios, the ability to perform fast JOINs is essential. Many other GPU databases lack the maturity to perform joins. Even the closest GPU alternative to Kinetica supports JOINs across only two tables, and only on one column per table. Kinetica, on the other hand, offers full join support with left, right, outer, and inner JOINs across many columns and tables.
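
For a concrete picture of what a multi-table, multi-column join looks like, the sketch below joins three hypothetical tables, matching two of them on a pair of columns. It uses Python’s built-in sqlite3 module as a stand-in engine and sticks to inner and left joins, since older SQLite builds do not support right or full outer joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema and data, for illustration only.
cur.executescript(
    """
    CREATE TABLE shipments (region TEXT, day TEXT, carrier_id INTEGER, parcels INTEGER);
    CREATE TABLE weather   (region TEXT, day TEXT, conditions TEXT);
    CREATE TABLE carriers  (carrier_id INTEGER, name TEXT);

    INSERT INTO shipments VALUES ('east', '2017-08-01', 1, 500),
                                 ('east', '2017-08-02', 2, 300),
                                 ('west', '2017-08-01', 1, 450);
    INSERT INTO weather   VALUES ('east', '2017-08-01', 'storm'),
                                 ('west', '2017-08-01', 'clear');
    INSERT INTO carriers  VALUES (1, 'Acme Freight'), (2, 'Bolt Logistics');
    """
)

# Three tables; the shipments-to-weather join matches on two columns.
# LEFT JOIN keeps shipments that have no weather record.
cur.execute(
    """
    SELECT s.region, s.day, c.name, s.parcels, w.conditions
    FROM shipments AS s
    INNER JOIN carriers AS c ON c.carrier_id = s.carrier_id
    LEFT JOIN weather AS w ON w.region = s.region AND w.day = s.day
    ORDER BY s.region, s.day
    """
)
rows = cur.fetchall()
conn.close()

for row in rows:
    print(row)
```

A database limited to two-table, single-column joins could not run this query as written; the logic would have to be split into several steps or pushed into application code.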

Data integrity

Data integrity is critical to ensuring the accuracy of the data in a relational database. No business should ever use a technology that can’t guarantee data accuracy. Kinetica supports data integrity constructs such as primary keys.
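
As a minimal illustration of what a primary key guarantees, the sketch below tries to insert two rows with the same key and shows the second being rejected. It uses Python’s built-in sqlite3 module as a stand-in, with a hypothetical table; the exact error raised will differ from one database to another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO accounts VALUES (1, 'alice')")

try:
    # Same primary key again: the database must refuse the row.
    conn.execute("INSERT INTO accounts VALUES (1, 'bob')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
conn.close()

print(duplicate_rejected, row_count)
```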

Scale-out Distributed Architecture

As the volume of data grows and processing demands increase, it becomes necessary to evaluate how your analytics system will scale.

Scaling up by adding more GPU cards, more memory, and premium hardware will only take you so far. A database that can distribute the data and workload across multiple machines makes it possible to scale out without limit on affordable hardware. When looking at a scale-out architecture, there are a couple of things to keep in mind:

Master Node or Fully Distributed? Simple distributed systems rely on a single dictionary node to centrally manage metadata. This can create performance bottlenecks when leaf nodes overwhelm the master node with lookup requests, particularly for JOINs. A single master node also introduces a single point of failure. More robust solutions feature headless processing with distributed metadata, where lookup requests are spread across the whole cluster. Such an architecture is found in systems built natively for distributed processing, and Kinetica uses such a clustered multi-node system. This architecture also enables Kinetica to spread the task of ingesting data across all the nodes in the cluster, making it well suited to large streams of IoT data and to use cases where data needs to be quickly hydrated from operational systems. An example of this in action is at USPS, where a distributed Kinetica cluster holding tens of terabytes of data serves thousands of concurrent sessions.
Sharding and parallel data ingest: Distributed databases use algorithmic sharding to place data across multiple nodes and leverage headless parallel processing for faster ingest and query performance. Pick a database that manages sharding intelligently, without relying on ETL tools, so that data is ingested, distributed, and processed in parallel without bottlenecks.
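
The idea behind algorithmic sharding can be sketched in a few lines: every node applies the same hash function to a shard key, so any node can work out where a record lives without asking a master. This is a generic illustration and not Kinetica’s actual sharding scheme; the node names, key field, and records are hypothetical.

```python
import hashlib
from collections import defaultdict

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster


def shard_for(shard_key: str, nodes=NODES) -> str:
    """Map a shard key to a node using a stable hash."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def route(records, key_field: str):
    """Group incoming records by owning node so each node can ingest in parallel."""
    batches = defaultdict(list)
    for record in records:
        batches[shard_for(str(record[key_field]))].append(record)
    return dict(batches)


# Hypothetical IoT readings keyed by device.
readings = [{"device_id": f"dev-{i}", "temp": 20 + i % 5} for i in range(1000)]
batches = route(readings, "device_id")

for node in NODES:
    print(node, len(batches.get(node, [])))
```

Simple modulo hashing like this reshuffles most keys when the number of nodes changes, so production systems typically use more elaborate placement schemes; the point here is only that placement is computed, not looked up.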

High Availability

What will happen if an individual node within a cluster fails, or if the whole cluster goes offline? While third-party tools can mitigate the damage, they create additional risks and challenges around security, upgrades, version incompatibility, and advanced storage, and they demand additional skills. Built-in HA and automated replication ensure fault tolerance and systems that are simpler to maintain.

Kinetica offers Active/Active HA and automatic replication between clusters. This eliminates the single point of failure in a given cluster and provides reliable and fast recovery from failure.

It’s one thing to market a database as having ‘high availability’, and another to deliver it. USPS went into production with Kinetica in November 2014, and database access has not gone down for a second since the initial production deployment, meeting targets of five-nines (99.999%) availability. Teams are able to patch and upgrade systems while Kinetica remains up and able to ingest, process, and simultaneously provide results, with over 15,000 concurrent sessions hitting the system during peak hours every day.
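
To put an availability target such as five nines in perspective, it helps to convert the percentage into permitted downtime per year. The short calculation below does that for a few common targets; it is plain arithmetic and assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Maximum yearly downtime allowed by an availability target (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
for label, target in targets:
    print(f"{label}: {downtime_minutes_per_year(target):.2f} minutes/year")
```

Five nines leaves a little over five minutes of downtime per year, which is why patching and upgrading without taking the database offline matters so much.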

Advanced Capabilities

In-Database Analytics

As questions asked of the data become more complex, data scientists look to more advanced tools for algorithms and modeling. A user-defined function (UDF) framework enables custom algorithms and code to run on data directly within the database. For workloads that are already computationally intensive, particularly for machine learning, in-database analytics on a GPU-accelerated database opens new opportunities that were previously unthinkable.

UDFs within Kinetica make it possible to run custom compute as well as data processing within the database. In essence, this provides a highly flexible means of doing advanced compute-to-grid analytics. BI and AI workloads can run together on the same GPU-accelerated database platform. User-defined functions can be written in C++, Java, or Python. Kinetica also ships with bundled TensorFlow for machine learning and deep learning use cases. Customers such as GlaxoSmithKline, ScotiaBank, and one of the world’s largest retailers leverage the UDF framework for advanced predictive analytics use cases.
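
The general shape of a UDF, custom code registered with the database and then invoked from inside a query, can be illustrated with Python’s built-in sqlite3 module. This is not Kinetica’s UDF API; the function name, table, and threshold below are hypothetical and serve only to show the pattern of moving the computation to the data.

```python
import math
import sqlite3


def zscore(value: float, mean: float, stddev: float) -> float:
    """Custom scoring logic we want to run next to the data."""
    return 0.0 if stddev == 0 else (value - mean) / stddev


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it as ZSCORE(value, mean, stddev).
conn.create_function("ZSCORE", 3, zscore)

conn.execute("CREATE TABLE txns (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO txns (amount) VALUES (?)",
    [(10.0,), (12.0,), (11.0,), (95.0,)],
)

# Aggregates are computed in the database; only two numbers come back.
mean, mean_sq = conn.execute(
    "SELECT AVG(amount), AVG(amount * amount) FROM txns"
).fetchone()
stddev = math.sqrt(mean_sq - mean ** 2)

# The UDF runs inside the query; only flagged rows return to the client.
outliers = conn.execute(
    "SELECT id, amount FROM txns WHERE ZSCORE(amount, ?, ?) > 1.5",
    (mean, stddev),
).fetchall()
conn.close()

print(outliers)
```

The alternative, exporting every row to a separate analytics tool and scoring it there, moves far more data and adds another system to maintain.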

Geospatial Support

Many modern streaming datasets include time and location data. This type of data poses unique challenges for analysts, both for timely queries, and for visualization of large datasets. Queries on such geospatial datasets are ideal for parallelized compute on the GPU, but the system needs to be built with geospatial workloads in mind. If your use cases involve working with geo-tagged data, look for the following features:

Native Geospatial Object Support: Individual points can be stored by most systems. More advanced systems support more complex geospatial objects such as lines and multi-lines, polygons and multipolygons, geometry collections, tracks, and labels. Kinetica supports complex vector data types and stores geospatial data as Well Known Text (WKT), a geospatial standard.
Native Geospatial Functions: Invariably, with geospatial data, you’ll want to look for patterns within specified or arbitrary regions. Exporting data to a separate geospatial system for such analysis is cumbersome and slow. Look for a database that can run geospatial functions such as GEO-JOIN, filters, aggregation, geofencing, and video generation for quicker and more advanced analysis.
Geo Visualization: Visualizing large datasets is a complex challenge for web-based clients. Kinetica includes a geospatial visualization pipeline that can quickly render the results of geospatial queries into heatmaps or vector feature overlays. Kinetica supports Web Map Service (WMS) and Keyhole Markup Language (KML), both OGC standards. Kinetica also supports rendering complex symbologies and annotations.
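
To make WKT and geofencing concrete, here is a small pure-Python sketch that parses a simple WKT polygon and tests which points fall inside it using ray casting. It handles only single-ring polygons and is not how Kinetica implements its geospatial functions; the fence and the points are hypothetical.

```python
import re


def parse_wkt_polygon(wkt: str):
    """Parse the outer ring of a simple 'POLYGON ((x y, x y, ...))' string."""
    match = re.fullmatch(
        r"\s*POLYGON\s*\(\(\s*([^()]+?)\s*\)\)\s*", wkt, flags=re.IGNORECASE
    )
    if match is None:
        raise ValueError("only single-ring POLYGON WKT is handled in this sketch")
    ring = []
    for pair in match.group(1).split(","):
        x, y = pair.split()
        ring.append((float(x), float(y)))
    return ring


def point_in_polygon(x: float, y: float, ring) -> bool:
    """Ray casting: toggle on each edge crossed by a ray going right from the point."""
    inside = False
    # WKT rings repeat the first vertex at the end, so consecutive pairs
    # cover every edge.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


fence = parse_wkt_polygon("POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))")
points = [(5.0, 5.0), (12.0, 3.0), (9.5, 0.5)]
inside_fence = [p for p in points if point_in_polygon(p[0], p[1], fence)]

print(inside_fence)
```

Each point is tested independently of the others, which is what makes this kind of filter a natural fit for parallel execution on a GPU.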

Kinetica’s enterprise-grade GPU database has been in production for several years and is both reliable and proven in large-scale deployments. Everything is tightly coupled in a single piece of technology, making for an easy deployment for customers, whether on-premises using industry-standard hardware or with GPU instances in the cloud.

Contact us to learn more about how to take maximum advantage of GPU-accelerated in-database compute.
