Importing Data

By now, you should have installed the product and had a quick overview of the tools used to manage it and query data. In this tutorial, we’ll show you some of the ways to load data; the datasets loaded here will be used in later tutorials.

You’ll learn how to use two import methods:

  • Drag and Drop for quickly importing small sets of data
  • Advanced Import with the Kinetica Input/Output (KIO) Tool for batch importing large sets of data with more control over the process

Create Target Tables

Before importing, we’ll create both target tables, adding Kinetica-specific column attributes that save space and improve performance.

To create the tables:

  1. Navigate to GAdmin (http://<kinetica-host>:8080/).
  2. Click Query > SQL.
  3. Enter the following CREATE TABLE statements into the SQL Statements text area.

    CREATE TABLE taxi_data
    (
        vendor_id          VARCHAR(4)  NOT NULL,
        pickup_datetime    TIMESTAMP   NOT NULL,
        dropoff_datetime   TIMESTAMP   NOT NULL,
        passenger_count    TINYINT     NOT NULL,
        trip_distance      REAL        NOT NULL,
        pickup_longitude   REAL        NOT NULL,
        pickup_latitude    REAL        NOT NULL,
        rate_code_id       SMALLINT    NOT NULL,
        store_and_fwd_flag VARCHAR(1)  NOT NULL,
        dropoff_longitude  REAL        NOT NULL,
        dropoff_latitude   REAL        NOT NULL,
        payment_type       VARCHAR(16) NOT NULL,
        fare_amount        REAL        NOT NULL,
        surcharge          REAL        NOT NULL,
        mta_tax            REAL        NOT NULL,
        tip_amount         REAL        NOT NULL,
        tolls_amount       REAL        NOT NULL,
        total_amount       REAL        NOT NULL,
        cab_type           TINYINT     NOT NULL
    );
    
    CREATE TABLE nyct2010
    (
        gid        BIGINT       NOT NULL,
        geom       GEOMETRY     NOT NULL,
        CTLabel    DOUBLE       NOT NULL,
        BoroCode   BIGINT       NOT NULL,
        BoroName   VARCHAR(256) NOT NULL,
        CT2010     BIGINT       NOT NULL,
        BoroCT2010 BIGINT       NOT NULL,
        CDEligibil VARCHAR(64)  NOT NULL,
        NTACode    VARCHAR(64)  NOT NULL,
        NTAName    VARCHAR(64)  NOT NULL,
        PUMA       BIGINT       NOT NULL,
        Shape_Leng DOUBLE       NOT NULL,
        Shape_Area DOUBLE       NOT NULL
    );
    
  4. Click Run SQL to create the tables.
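
To confirm that both tables were created, you can run a quick sanity check in the same SQL Statements text area. The queries below are a minimal sketch using standard SQL against the tables defined above; both should return a count of 0, since no data has been imported yet.

    -- Both tables exist but are still empty at this point
    SELECT COUNT(*) FROM taxi_data;
    SELECT COUNT(*) FROM nyct2010;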

Drag and Drop

Kinetica allows you to import data by dragging and dropping CSV, ORC, Apache Parquet, or ZIP files (containing shapefiles). Drag-and-drop importing attempts to use the file’s first record as a header, and the file’s name is used as the table’s name in Kinetica. The Drag-and-Drop documentation covers additional details and limitations.

We’re going to use drag-and-drop to import the nyct2010 CSV file, the NYC Neighborhood Tabulation Areas dataset, which we’ll refer to as the NTA dataset throughout the rest of the tutorials. This file provides geospatial boundaries for neighborhoods in New York City. The file can be downloaded using the following link: nyct2010.csv

Download and save the CSV file locally (if you are using Linux, we suggest saving it to /tmp, where it can be read by Kinetica). Now let’s import the data:

  1. Navigate to GAdmin (http://<kinetica-host>:8080/).
  2. Click Data > Import.
  3. Drag the nyct2010 file from a file explorer window into the drop area of GAdmin. You can also click Choose File in GAdmin to select the nyct2010 file from your disk.

The data will begin importing. Once the import completes, click View Table to view the table, or navigate to it via Data > Table.
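
To spot-check the imported data beyond browsing the table, you can run a simple aggregate under Query > SQL. This is a minimal sketch using standard SQL and the BoroName column defined in the nyct2010 table above; the exact counts depend on the contents of the source file.

    -- Count census tracts per borough to verify the NTA import
    SELECT BoroName, COUNT(*) AS tract_count
    FROM nyct2010
    GROUP BY BoroName;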

Advanced Import with KIO

Kinetica also allows you to ingest data from a variety of sources, including Sybase IQ, Oracle, PostgreSQL, and AWS S3, using the KIO tool. While KIO has a command-line interface, this tutorial uses the GAdmin KIO user interface, known as Advanced Import. Advanced Import is well suited to importing larger files and offers more control over the incoming schema. We’re going to load a historical NYC taxi dataset from an Apache Parquet file in a public AWS S3 bucket. You can read more about the full public dataset, which contains taxi trip information, at nyc.gov.

To import the NYC taxi dataset using Advanced Import:

  1. Navigate to GAdmin (http://<kinetica-host>:8080/).
  2. Click Data > Import.
  3. Click Advanced Import.
  4. In the Source section:

    1. Select AWS S3 for the Datasource.
    2. Select Parquet for the Format.
    3. Input the following bucket File Path for the NYC taxi dataset: /kinstart/taxi_data.parquet
    4. Select the Use Odo Engine checkbox.
  5. In the Target section, input taxi_data for the Table name.
  6. Click Transfer Dataset.

The Transfer Status window appears, showing the progress of the import. Once the transfer completes, click View Table to view the table.
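
As with the NTA dataset, a quick aggregate query under Query > SQL makes a useful spot check. This sketch uses only standard SQL and columns defined in the taxi_data table above; the exact figures depend on the contents of the Parquet file.

    -- Summarize trips by payment type to verify the taxi import
    SELECT payment_type,
           COUNT(*)         AS trips,
           AVG(fare_amount) AS avg_fare
    FROM taxi_data
    GROUP BY payment_type;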

Next: Querying with SQL »

Additional Ingestion Methods

The two methods discussed above are among the most common ways to get your data into Kinetica. The list below covers additional ingestion methods and their use cases.

  • Use Multi-head Ingest to insert sharded data directly into the nodes of your cluster, bypassing the head node and enabling the best ingest speeds.
  • Use the Spark Connector to enable quick ingestion of large data sets and to stream data out of Kinetica.
