Querying with SQL

Data in Kinetica can be queried with SQL. One of the unique capabilities of Kinetica is the speed at which Kinetica can query and perform filtering & aggregation operations on the data.

In this tutorial, we are going to walk you through some examples of querying the taxi dataset using SQL.

Though this tutorial focuses on using the SQL tool in GAdmin, you can query using the KiSQL Command-line editor (KiSQL is available in /opt/gpudb/kitools) or you could connect your own favorite SQL editors, such as SQuirrelSQL, DBeaver or DbVisualizer, with the Kinetica database using Kinetica’s JDBC or ODBC drivers.

Queries

Kinetica is a SQL-92-compliant database. The queries below provide a sense of the breadth of Kinetica’s full SQL capabilities. Since these example queries will be accessing a table that is the target of the streaming ingest from the previous section, they will have filters applied to only access the original static taxi data loaded via Advanced Import. To run these queries using the SQL tool in GAdmin:

  1. Navigate to GAdmin (http://<kinetica-host>:8080/)
  2. Click Query > SQL.

Q1: What are the shortest, average, and longest trip length for all vendors and each individual vendor?

Using an aggregate group by and a roll-up, we can easily calculate the shortest, average, and longest trip length and group the statistics by vendor.

SELECT
    NVL(vendor_id, '<ALL VENDORS>') AS Vendor_Name,
    DECIMAL(MIN(trip_distance)) AS Shortest_Trip,
    DECIMAL(AVG(trip_distance)) AS Average_Trip,
    DECIMAL(MAX(trip_distance)) AS Longest_Trip,
    COUNT(*) AS Total_Trips
FROM taxi_data
WHERE pickup_datetime < '2019-01-01'
GROUP BY ROLLUP(vendor_id)
ORDER BY 1;

Output:

+-----------------+-----------------+----------------+----------------+---------------+
| Vendor_Name     |   Shortest_Trip |   Average_Trip |   Longest_Trip |   Total_Trips |
+-----------------+-----------------+----------------+----------------+---------------+
| <ALL VENDORS>   |          0.0100 |         2.8898 |        55.8700 |        478843 |
| CMT             |          0.1000 |         2.7427 |        50.0000 |         82452 |
| DDS             |          0.1000 |         2.9158 |        46.0000 |          3584 |
| NYC             |          0.0100 |         3.0071 |        55.8700 |        145395 |
| VTS             |          0.0100 |         2.8620 |        53.7500 |        116597 |
| YCAB            |          0.1000 |         2.8762 |        50.9000 |        130815 |
+-----------------+-----------------+----------------+----------------+---------------+

Q2: Which number of passengers per trip gives the highest tip as a percentage of fare, across each of the cab vendors?

Using subqueries and a pivot table, we can calculate per passenger number “group” (1 through 6) which tipped the most (based on percentage of fare amount) for each vendor.

SELECT
    passenger_count AS Passenger_Count,
    DECIMAL(cmt_atp) AS CMT,
    DECIMAL(dds_atp) AS DDS,
    DECIMAL(nyc_atp) AS NYC,
    DECIMAL(vts_atp) AS VTS,
    DECIMAL(ycab_atp) AS YCAB
FROM
(
    SELECT
        vendor_id,
        passenger_count,
        fare_amount,
        tip_amount
    FROM taxi_data t
    WHERE pickup_datetime < '2019-01-01'
)
PIVOT
(
    AVG(tip_amount / fare_amount * 100) AS atp
    FOR vendor_id IN ('CMT', 'DDS', 'NYC', 'VTS', 'YCAB')
)
ORDER BY passenger_count;

Output:

+-------------------+----------+-----------+-----------+-----------+-----------+
|   Passenger_Count |      CMT |       DDS |       NYC |       VTS |      YCAB |
+-------------------+----------+-----------+-----------+-----------+-----------+
|                 1 |   7.8223 |    4.4903 |   13.2567 |    9.1552 |   13.6989 |
|                 2 |   5.9303 |    3.8937 |   13.0845 |    8.3567 |   12.3209 |
|                 3 |   5.0305 |    1.9208 |   12.9826 |    8.2001 |   11.1806 |
|                 4 |   3.4913 |    2.3227 |   12.8359 |    7.6867 |    8.8138 |
|                 5 |   4.5456 |    0.0000 |   13.1710 |    8.3807 |    8.6874 |
|                 6 |   2.4193 |   27.0270 |   13.0843 |   10.3083 |   12.0803 |
+-------------------+----------+-----------+-----------+-----------+-----------+

Q3: How can I calculate the revenue and trip length for each trip and the running total revenue over a day for a single vendor?

Calculating the revenue and trip length is simple, but calculating the running total venue over a given day for a specific vendor requires a window function.

SELECT
    CAST(pickup_datetime AS TIME) AS Pickup_Time,
    CAST(dropoff_datetime AS TIME) AS Dropoff_Time,
    DECIMAL(total_amount) AS Trip_Revenue,
    SUM(DECIMAL(total_amount)) OVER
        (
            PARTITION BY null
            ORDER BY pickup_datetime
        ) AS Total_Days_Revenue
FROM taxi_data
WHERE vendor_id = 'DDS' AND pickup_datetime >= '2009-12-01' AND pickup_datetime < '2009-12-02'
ORDER BY
    pickup_time;

Output:

+---------------+----------------+----------------+----------------------+
|   Pickup_Time |   Dropoff_Time |   Trip_Revenue |   Total_Days_Revenue |
+---------------+----------------+----------------+----------------------+
|      00:07:22 |       00:11:55 |         6.3000 |               6.3000 |
|      00:59:44 |       01:08:58 |        10.3000 |              16.6000 |
|      06:18:48 |       06:23:20 |         6.4500 |              23.0500 |
|      06:35:59 |       06:58:27 |        19.7999 |              42.8499 |
|      10:16:38 |       10:19:55 |         4.1999 |              47.0498 |
|      12:07:47 |       12:12:44 |         7.0000 |              54.0498 |
|      13:51:36 |       13:55:26 |         4.1999 |              58.2497 |
|      14:06:34 |       14:47:06 |        59.1700 |             117.4197 |
|      14:57:29 |       15:05:53 |         6.6000 |             124.0197 |
|      15:53:37 |       17:25:22 |        55.0700 |             179.0897 |
|      15:58:54 |       16:08:30 |         7.0000 |             186.0897 |
|      17:02:04 |       17:16:15 |        10.3999 |             196.4896 |
|      17:52:37 |       18:04:13 |         9.6000 |             206.0896 |
|      18:26:29 |       18:47:41 |        15.6000 |             221.6896 |
|      18:38:57 |       18:50:33 |         9.2000 |             230.8896 |
|      19:18:18 |       19:27:02 |         9.2000 |             240.0896 |
|      19:45:05 |       19:55:49 |         8.3999 |             248.4895 |
|      19:45:18 |       19:52:29 |        10.0000 |             258.4895 |
|      19:46:53 |       19:53:32 |         6.4000 |             264.8895 |
|      20:27:32 |       20:35:21 |         7.5000 |             272.3895 |
|      20:52:22 |       21:02:24 |         9.1000 |             281.4895 |
|      21:40:08 |       21:44:24 |         4.6000 |             286.0895 |
|      22:27:39 |       22:56:22 |         7.5000 |             293.5895 |
|      23:05:04 |       23:18:40 |        13.7000 |             307.2895 |
|      23:06:16 |       23:19:40 |        16.4000 |             323.6895 |
|      23:08:29 |       23:56:34 |        42.0700 |             365.7595 |
|      23:52:01 |       23:59:40 |         7.9000 |             373.6595 |
+---------------+----------------+----------------+----------------------+

You can show how all vendors fared over the same timeframe if you partition by vendor:

SELECT
    vendor_id AS Vendor_Name,
    CAST(pickup_datetime AS TIME) AS Pickup_Time,
    CAST(dropoff_datetime AS TIME) AS Dropoff_Time,
    DECIMAL(total_amount) AS Trip_Revenue,
    SUM(DECIMAL(total_amount)) OVER
        (
            PARTITION BY vendor_id
            ORDER BY pickup_datetime
        ) AS Total_Days_Revenue
FROM taxi_data
WHERE pickup_datetime >= '2009-12-01' AND pickup_datetime < '2009-12-02'
ORDER BY
    vendor_name,
    pickup_time;

Materialized Views

Kinetica provides support for materialized views, which can allow queries to be more performant by caching & maintaining the result set. To access the SQL tool in GAdmin:

  1. Navigate to GAdmin (http://<kinetica-host>:8080/)
  2. Click Query > SQL.

We can convert the first query in this section to a materialized view by adding the CREATE MATERIALIZED VIEW statement. We’ll also remove the WHERE clause to pull in the streaming records from the Importing Data section.

CREATE MATERIALIZED VIEW taxi_data_summary REFRESH ON CHANGE AS
SELECT
    NVL(vendor_id, '<ALL VENDORS>') AS Vendor_Name,
    DECIMAL(MIN(trip_distance)) AS Shortest_Trip,
    DECIMAL(AVG(trip_distance)) AS Average_Trip,
    DECIMAL(MAX(trip_distance)) AS Longest_Trip,
    COUNT(*) AS Total_Trips
FROM taxi_data
GROUP BY ROLLUP(vendor_id)

We can then simply query that view repeatedly to see the summary statistics change over time:

SELECT *
FROM taxi_data_summary
ORDER BY vendor_name

Kinetica Trial Feedback