Queries with the REST API and Python

In the previous Query with SQL tutorial, you learned how to use SQL to query, insert, and update data in Kinetica. In this tutorial, you’ll learn how to use the REST API and Kinetica’s Python library to perform similar queries.

In this tutorial, we will:

  • Install the Python API for Kinetica (“gpudb”)
  • Connect to a Kinetica server
  • Execute some queries
  • Display some results

REST API and Language Specific APIs

Kinetica natively communicates through a REST-based API framework. Several language-specific wrappers are available that make it easy to connect to Kinetica from Java, Python, JavaScript, node.js, C#, and C++.

The Python API provides additional capabilities that are not available with SQL. Python is also an extremely popular language with data scientists and AI research teams: it provides libraries like numpy, scikit-learn, and pandas, which are foundational packages for data science, and popular open source machine learning frameworks like TensorFlow and PyTorch use Python as their primary language. Kinetica’s Python API makes it easy to pass data between the Kinetica database and these machine learning libraries.

This tutorial is also available as a Jupyter Notebook. Download Here.
 

Installing the Python API

Download the Kinetica client-side Python API from GitHub. The Python API also depends on the ‘avro’ library. Use pip to install from the repo directory:

sudo pip install .

To validate that the Kinetica Python API is installed, run this command – there should be no error and no output:

python -c "import gpudb"

The Kinetica Python libraries work with Python 2.7 and up. For troubleshooting and alternative download instructions, please see the full Python install documentation.

Exploring the NYC Taxi Data with the Python API

The first step is to import the required packages. gpudb is the API package for interacting with Kinetica:

import gpudb
import collections
import json

To begin interacting with Kinetica, you must first instantiate an object of the GPUdb class by providing the connection URL, including the host and port of the database server.

h_db = gpudb.GPUdb(encoding='BINARY', host='172.16.81.133', port='9191')

* Replace the IP address with the correct IP or hostname for where your Kinetica server is running.

Next, let’s use the gpudb object to retrieve a single row from the table.

nyctaxiresponse = h_db.get_records(table_name='nyctaxi', offset=0, limit=1, 
    encoding="json")['records_json']

print (nyctaxiresponse)

This is similar to executing select * from nyctaxi limit 1 in SQL.

  • We use the get_records() function to fetch the records
  • We select the “records_json” element from the JSON response
  • The result is held in a variable called nyctaxiresponse

You should see something similar to this:

['{  
   "vendor_id":"NYC",
   "pickup_datetime":1429168062000,
   "dropoff_datetime":1429168209000,
   "passenger_count":1,
   "trip_distance":0.8,
   "pickup_longitude":-73.99353,
   "pickup_latitude":40.74739,
   "rate_code_id":1,
   "store_and_fwd_flag":"N",
   "dropoff_longitude":-73.99023,
   "dropoff_latitude":40.73934,
   "payment_type":"Cash",
   "fare_amount":4.5,
   "surcharge":0,
   "mta_tax":0.5,
   "tip_amount":0,
   "tolls_amount":0,
   "total_amount":5.3,
   "cab_type":0
}']
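Because records_json returns each record as a JSON string, you can turn it into a Python dictionary with json.loads. Here’s a minimal sketch using values copied from the sample record above (hard-coded rather than fetched live):

```python
import json

# One record string, as it would appear in the 'records_json' list
# (trimmed to a few fields from the sample output)
record_str = '{"vendor_id": "NYC", "passenger_count": 1, "trip_distance": 0.8, "total_amount": 5.3}'

# Parse the JSON string into a plain Python dict
record = json.loads(record_str)

print(record["vendor_id"], record["total_amount"])
```

From here the record behaves like any other dictionary, so individual fields are easy to feed into downstream Python code.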

Aggregation Functions

Aggregate functions – counts, sums, averages, group by, etc. – summarize information based on certain criteria. Let’s compute the average fare by passenger count using the aggregate_group_by() function.

response = h_db.aggregate_group_by(table_name="nyctaxi", 
    column_names=["passenger_count",  'avg(total_amount) as Avg_Fare'],
    offset=0, limit=1000, encoding="json", 
    options={"sort_order": "ascending", "sort_by": "value"})['json_encoded_response']
print(response)

The response is:

{  
   "column_1":[  
      0,
      5,
      1,
      3,
      4,
      2,
      6
   ],
   "column_2":[  
      11.03573297689722,
      13.95917036655475,
      14.29808596132604,
      14.41203666847088,
      14.55891923258123,
      14.80029574610027,
      15.10799876138035
   ],
   "column_headers":[  
      "passenger_count",
      "Avg_Fare"
   ],
   "column_datatypes":[  
      "int8",
      "double"
   ]
}

If you omit the ['json_encoded_response'] selection at the end and keep the full response object, the results come back with a dynamic schema. To display formatted output with nice headings, you can then use the parse_dynamic_response() method.

data = h_db.parse_dynamic_response(response)['response']

print("{:<15s} {:<12s}".format("Passenger Count", "Average Fare"))
print("{:=<15s} {:=<12s}".format("", ""))
for rows in zip(data["passenger_count"], data["Avg_Fare"]):
    print("{:<15d} {:4.2f}".format(*rows))

The output is now:

Passenger Count Average Fare
=============== ============
0               11.04
5               13.96
1               14.30
3               14.41
4               14.56
2               14.80
6               15.11
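The rows above come back sorted by average fare ("sort_by": "value"). If you’d rather display them ordered by passenger count, you can sort the zipped column lists client-side. A sketch using the values from the output above (hard-coded here rather than fetched from the server):

```python
# Column-oriented results, copied from the output above
passenger_count = [0, 5, 1, 3, 4, 2, 6]
avg_fare = [11.04, 13.96, 14.30, 14.41, 14.56, 14.80, 15.11]

# Pair the columns into rows, then sort rows by passenger count
rows = sorted(zip(passenger_count, avg_fare))

for count, fare in rows:
    print("{:<15d} {:4.2f}".format(count, fare))
```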

Another Example

Now let’s query for the number of trips made each year by passenger rider groups:

response = h_db.aggregate_group_by(table_name="nyctaxi", column_names=[
       "YEAR(pickup_datetime) AS Pickup_Year", 
       "passenger_count as Total_Passengers", 
       "count(*) as Total_Trips"
    ],offset=0, limit=10, encoding="json", 
    options={ "sort_order":"ascending", "sort_by":"value"})['json_encoded_response']
print(response)

The output is:

{  
   "column_1":[  
      2009,
      2012,
      2010,
      2015,
      2011,
      2011,
      2009,
      2010,
      2012,
      2014
   ],
   "column_2":[  
      0,
      0,
      0,
      0,
      6,
      0,
      6,
      6,
      4,
      4
   ],
   "column_3":[  
      1,
      1,
      7,
      118,
      119,
      255,
      275,
      291,
      383,
      389
   ],
   "column_headers":[  
      "Pickup_Year",
      "Total_Passengers",
      "Total_Trips"
   ],
   "column_datatypes":[  
      "int",
      "int8",
      "long"
   ]
}
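Dynamic-schema responses like the one above are column-oriented: parallel lists under column_1, column_2, and so on, with the column names in column_headers. A small helper (hypothetical, not part of gpudb) can pivot such a response into row dictionaries for easier inspection:

```python
def columns_to_rows(resp):
    """Pivot a column-oriented response dict into a list of row dicts."""
    headers = resp["column_headers"]
    columns = [resp["column_%d" % (i + 1)] for i in range(len(headers))]
    # zip(*columns) walks the parallel lists row by row
    return [dict(zip(headers, values)) for values in zip(*columns)]

# A trimmed version of the response shown above
resp = {
    "column_1": [2009, 2012, 2010],
    "column_2": [0, 0, 0],
    "column_3": [1, 1, 7],
    "column_headers": ["Pickup_Year", "Total_Passengers", "Total_Trips"],
}

rows = columns_to_rows(resp)
print(rows[0])
```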

Aggregate Statistics

We can also compute various aggregated statistics on a column. Statistics are a key part of machine learning work, and having a handy way to calculate them makes it easier to build solutions. Using aggregate_statistics(), let’s get the count, mean, standard deviation, variance, and skew of the “fare_amount” column, across all rows, in one call.

response = h_db.aggregate_statistics(table_name="nyctaxi", 
    column_name="fare_amount",
    stats = "count, mean, stdv, variance, skew",
    options={})['stats']
print(response)

Output is:

{  
   'count':500000.0,
   'mean':11.843845542878993,
   'skew':5.762956053236264,
   'stdv':9.97504765789146,
   'variance':99.5015757772059
}
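As a quick sanity check on these statistics, the reported standard deviation should be the square root of the reported variance. A pure-Python check using the values copied from the output above:

```python
import math

# Values copied from the aggregate_statistics output above
variance = 99.5015757772059
stdv = 9.97504765789146

# sqrt(variance) should match stdv up to floating-point rounding
print(abs(math.sqrt(variance) - stdv))
```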

Summary

We hope this has given you a taste for how to access Kinetica using Python. Similar client APIs are available for Java, C#, C++, node.js and JavaScript.

Next: Visualizing Geospatial Queries »
