Querying with SQL

Data in Kinetica can be queried with SQL as well as through the REST API. One of the unique capabilities of Kinetica is the speed at which Kinetica can query data, and filter/aggregate the data based on geospatial coordinates or other constraints.

In this tutorial, we are going to walk you through some examples of querying the taxi dataset using SQL. Later examples will include some more complex spatial joins to show taxi rides by neighborhood.

You should have loaded sample datasets during the Kinetica Walkthrough and Import Data tutorials. If you have not yet loaded the sample data, please use GAdmin to load at least 500K rows of the NYC Taxi dataset.

Kinetica comes with three SQL editors for you to choose from:

  • KiSQL in GAdmin A Web-based SQL editor with full DML and DDL capabilities
  • KiSQL Command-line editor Write queries from the command line. KiSQL is available in /opt/gpudb/kitools
  • SQL Lab in Reveal Read-only SQL editor for analytics… introduced in the walkthrough tutorial

In addition, you could connect your own favorite SQL editors, such as SQuirrelSQL, DBeaver or DbVisualizer, with the Kinetica database using Kinetica’s JDBC or ODBC drivers.

Explore the Schema

The dataset contains data on all the taxi trips taken in the city over several years including vendor, pickup and dropoff time and location, distance, fare, payment type, tolls, etc.

Let’s get a feel for the data and the schema before we start the data exploration.

  • In GAdmin, choose Data > Tables and then select the MASTER collection
  • Select the nyctaxi table from the list of tables.
  • Browse the data and display its schema (the columns and data types) by clicking the ℹ Info button near the top.

Now that you understand the format of the data, let’s use SQL to explore and ask some questions:

  • How many trips were made by each vendor?
  • How many passengers travel in groups of one, two, or more?
  • Were fares larger when passengers were traveling in a group

We recommend the SQL Lab in Reveal to execute the queries in this step.

Basic SQL Queries

Let’s look at 10 rows of the dataset:

SELECT FROM nyctaxi LIMIT 10

Now let’s ask some deeper questions. While you execute these queries, notice the responsiveness of the Kinetica database – response times are shown on the right hand side in Reveal. The first step in analyzing data is to ask relevant questions.  We want to know the answers to the following questions.

Note: The outputs in this tutorial are based on a nyctaxi table containing 500,000 rows.

Q1: How many trips did each vendor make, between which timeframes?

SQL:

SELECT vendor_id AS Vendor_ID, 
    MIN(pickup_datetime) AS Data_Begin_Date, 
    MAX(dropoff_datetime) AS Data_End_Date, 
    COUNT(*) AS Total_Trips
FROM nyctaxi
GROUP BY vendor_id

Execute the above query and you should see a result set similar to this:

Output:

+-------------+------------------------------+------------------------------+---------------+
| vendor_id   |              Data_Begin_Date |                Data_End_Date |   Total_Trips |
+-------------+------------------------------+------------------------------+---------------+
| CMT         |   2009-01-01 00:02:41.000000 |   2014-08-21 15:00:01.000000 |         86513 |
| DDS         |   2009-01-01 00:33:51.000000 |   2009-12-31 23:54:47.000000 |          3792 |
| VTS         |   2009-01-01 10:04:00.000000 |   2014-08-07 21:03:00.000000 |        124067 |
| NYC         |   2015-01-01 00:04:33.000000 |   2015-06-29 01:08:07.000000 |        148750 |
| YCAB        |   2015-01-01 00:10:31.000000 |   2015-06-28 04:23:52.000000 |        136878 |
+-------------+------------------------------+------------------------------+---------------+
Rows read = 5
Exec time 0.067 Fetch time 0.032

You should be noticing sub-second response times for the above query, even on a CPU-only machine.

Q2: What is the average fare paid by passengers by the size of their group?

Here’s the SQL:

SELECT passenger_count AS Total_Passengers, 
    AVG(total_amount) AS Average_Fare
FROM nyctaxi
GROUP BY passenger_count
ORDER BY passenger_count ASC

You should see a result set similar to this:

Output:

+--------------------+----------------------+
|   Total_Passengers |         Average_Fare |
+--------------------+----------------------+
|                  0 |   11.035732976897224 |
|                  1 |   14.298085961326045 |
|                  2 |   14.800295746100266 |
|                  3 |    14.41203666847088 |
|                  4 |   14.558919232581234 |
|                  5 |   13.959170366554753 |
|                  6 |   15.107998761380346 |
+--------------------+----------------------+

Q3: How many trips were there each year, by group?

Here’s the SQL:

SELECT YEAR(pickup_datetime) AS Pickup_Year, 
     passenger_count AS Total_Passengers, 
     COUNT(*) AS Total_Trips
FROM nyctaxi
GROUP BY 1, 2
ORDER BY 1, 2

Execute the above query and you should see a result set similar to this. The output below is truncated to preview in this document.

Output:

+------------+-----------------+---------------------+
| Pickup_Year| Total_Passengers| Total_Trips         |
+------------+-----------------+---------------------+
| 2009       | 0               | 1                   |
| 2009       | 1               | 41834               |
| 2009       | 2               | 9952                |
| 2009       | 3               | 2799                |
| 2009       | 4               | 1361                |
| 2009       | 5               | 5419                |
| 2009       | 6               | 275                 |
...

Q4: What is the number of trips made and the distance travelled by year by different groups of travelers?

SQL:

SELECT
YEAR(pickup_datetime) AS Pickup_Year,
    CAST(trip_distance AS int) AS Trip_Distance_Miles,
    passenger_count AS Total_Passengers,
COUNT(*) AS Total_Trips
FROM nyctaxi
GROUP BY 1, 2, 3
ORDER BY 1, 4 DESC

Execute the above query and you should see a result set similar to this. The output below is truncated for documentation purposes.

Output:

+------------+--------------+-----------------+---------------------+
| Pickup_Year| Trip_Distance| Total_Passengers| Total_Trips         |
+------------+--------------+-----------------+---------------------+
| 2009       | 1            | 1               | 16200               |
| 2009       | 2            | 1               | 10232               |
| 2009       | 3            | 1               | 4911                |
| 2009       | 1            | 2               | 3709                |
| 2009       | 4            | 1               | 2494                |
| 2009       | 2            | 2               | 2451                |
| 2009       | 0            | 1               | 2265                |
| 2009       | 1            | 5               | 2038                |
| 2009       | 5            | 1               | 1466                |
| 2009       | 2            | 5               | 1306                |
| 2009       | 3            | 2               | 1249                |
| 2009       | 1            | 3               | 1074                |

Advanced Spatial Queries

To perform the below query, you should have imported the nyc_neighborhood dataset in the Import Tutorial. If you have not, please download the NYC Neighborhoods CSV and use the GAdmin Import tool to add the nyc_neighborhood table to the database.

Ensure you’ve loaded the nyc_neighborhood table as a replicated table. This means it can be joined to another table on any column (in contrast to a sharded table, which must be joined on the shard key/keys only). Replicated tables are suitable for smaller datasets, such as nyc_neighborhood which only has 196 rows.

Kinetica is ideal for querying geospatial data. Each taxi pickup and dropoff has a longitude and latitude. But what if we want to know ask many pick-ups were made in Brooklyn or at JFK?

The nyc_neighborhood table contains geospatial coordinates that define New York’s districts and boroughs which we can JOIN with the nyctaxi table to help us answer this question. The coordinates take the form of polygons that define the various named areas which are stored in the industry standard “WKT” (Well Known Text) format.

Q5: What are the number of taxi pickups by hour for a given neighborhood?

In this example, Let’s look at districts (NTAName) with “Airport” in them. We will apply a little formatting to present the pickup timestamp in neat hourly intervals.

SELECT 
   RIGHT('0' || STRING(HOUR(t.pickup_datetime)), 2) || ':00:00' AS Pickup_Hour, 
   COUNT(*) AS Total_Pickups
FROM 
   nyctaxi t 
   JOIN nyc_neighborhood tz
      ON STXY_Intersects(t.pickup_longitude, t.pickup_latitude, tz.geom) = 1
WHERE 
   NTAName LIKE '%Airport%'
GROUP BY 
   HOUR(t.pickup_datetime)
ORDER BY 
    Pickup_Hour

This is a great example of how to use geospatial functions within complex query joins for sophisticated geospatial analytics.

Output:

The output shows taxi rides really increase at JFK after 2pm:

+---------------+-----------------+
| Pickup_Hour   |   Total_Pickups |
+---------------+-----------------+
| 00:00:00      |             680 |
| 01:00:00      |             279 |
| 02:00:00      |              96 |
| 03:00:00      |              43 |
| 04:00:00      |              29 |
| 05:00:00      |             197 |
| 06:00:00      |             361 |
| 07:00:00      |             496 |
| 08:00:00      |             647 |
| 09:00:00      |             670 |
| 10:00:00      |             772 |
| 11:00:00      |             853 |
| 12:00:00      |             865 |
| 13:00:00      |             947 |
| 14:00:00      |            1087 |
| 15:00:00      |            1214 |
| 16:00:00      |            1229 |
| 17:00:00      |            1299 |
| 18:00:00      |            1162 |
| 19:00:00      |            1175 |
| 20:00:00      |            1207 |
| 21:00:00      |            1314 |
| 22:00:00      |            1234 |
| 23:00:00      |            1122 |
+---------------+-----------------+

We’ll dive deeper into the geospatial capabilities in the tutorial, Visualizing Large Geospatial Datasets.

Summary

In this tutorial, you seen some of the capabilities of SQL within Kinetica. It should have also given you a taste for the speed of aggregate queries. With most of the above queries, you should see sub-second response times, as all the data is pinned to the memory.

In the next tutorial, we will walk you through interesting ways to visualize the data in Reveal and create a custom dashboard.

You can read more about the SQL syntax and capabilities in the following document: SQL Documentation and Advanced SQL Tutorials

Next: Query Data with REST API and Python »

Kinetica Trial Feedback