These days, practically any BI tool might claim to query billions of rows. However, I’ve yet to see any BI tool, out of the box, do this in sub-second time using SQL query…until now. As a matter of fact, in my first four months with Kinetica, we’ve been able to do this with Tableau, Power BI, QlikView, Spotfire, Domo, and even Microsoft Excel—and we’re just getting warmed up.
To fully appreciate how any BI tool can do this, let’s run through a few key concepts:
- Distributed, in-memory, scale-out architecture: BI query performance improves significantly when the underlying database can run SQL on very large data sets that have been distributed and stored in RAM across multiple computers. This is called data sharding, and it makes it possible for queries to run across multiple computers in parallel when BI tools refresh reports and dashboards, or make ad-hoc queries.
- GPU parallel processing architecture: BI query performance shifts into an entirely new gear when each computer can also take advantage of the massively parallel compute processing power of GPUs. With Kinetica, SQL queries are able to leverage this enhanced parallel compute automatically before any additional database tuning. The BI tool uses the same tables, rows, and columns it would expect from any traditional relational database to generate SQL. Only now it can scale up to query much larger and broader data sets than previously possible, making big data insights completely accessible to the business from the tools they already know how to use. Fitting right into existing BI technology investments is key.
- The concept of a speed layer: Think about a database platform that fits within your existing BI landscape as an immediate self-service “speed layer.” A speed layer isn’t just about faster query. To be truly effective, it needs to compress the time and effort to work with these large data sets from BI and/or data science end to end, while making it easier for both to work together seamlessly. Think of the productivity gains when: you can immediately publish and query billions of rows, find that it’s not quite what you need for your report, dashboard, or algorithm (which never happens), just delete, re-publish a new billion row data set, and re-connect everything in minutes, without the constraints, overhead, and complexity of a traditional data warehouse or data appliance.
In the following example, we’ll use a billion-row financial data set that is refreshing 100 to 300 thousand rows per second via a Hadoop MapReduce job running in the background. (Yes, you can also run queries while streaming live updates to the tables!). We can use Tableau for a live query example, and Power BI for an ingest-only query example:
This clip shows the Kinetica administration panel, with a modest 3-node cluster, serving billions of rows on less than a 1TB of RAM, refreshing live directly from Hadoop. These demos are also querying live in the background.
Tableau imports the table metadata, defines your joins, and as the reports are built, SQL query is automatically being generated and sent for processing on the database. Queries process in parallel across a cluster and leverage the power of GPU compute for aggregations. Summary results are streamed back immediately on billions of rows. Again, this is all happening while data is also refreshing in the database.
Tableau does a really nice job of building SQL into a summary query, effectively grouping columns to measures so that summary results are returned and can be visualized very fast. In this example, the result returns about 20 thousand rows in a second, aggregating on the fly from billions of rows.
The combination of Tableau’s ability to send the SQL directly to the database, and the database’s ability to compute large scale aggregation in seconds, will make billion-row query appear seamless from Tableau for users.
Other BI tools may need to first ingest all of the data in the tables in order to run BI queries on its proprietary data store. However, even with tools that start as ingest-only, performance gains can still be immediately realized with a speed layer.
For the next example, we are using Power BI. Without driver certification, the tables will be ingested in their entirety before we can start to build reports and dashboards. With the stated limit of 10GB storage with Power BI, taking into account the time to ingest the entire set of rows and the fact that billions of rows will likely exceed 10GB of storage, we need to give Power BI a little help with the SQL query so that it summarizes and groups within the database. The gains in data scale and query speed should make it more than worthwhile.
Here we see Power BI serving effectively the same scale and speed for the report. The database continues to refresh live data updates. In this example, we are aggregating down to the day level (vs. year in Tableau) and still returning in seconds on billions of rows. This result returns about 40 thousand rows.
When considering all of the complexity, time, resources and cost of scaling up and out traditional BI landscapes, in order to enable the business with billion-row query scale from BI, the immediate benefits of bringing this modern software and hardware database acceleration to any BI environment will be easier and faster than the alternative.
The BI tool is no longer the bottleneck! Companies can now fully leverage their investments in BI technology, processes, and skills to deliver unprecedented SQL-query scale from any BI tool on billions of rows.