By now you should have installed the product and have had the quick overview of the tools you can use to manage it and query data. At some point you’ll want to load your own data. In this tutorial we’ll show you some of the ways to do this. We’ll then load a geospatial dataset that will be used in combination with the NYC Taxi dataset in future tutorials.
There are 2 main steps for importing a dataset:
- Prepare your Data Header files (Schema definition)
- Import the CSV file
Kinetica allows schema definitions in the headers of the CSV file. This makes it easy to ingest data into the database. For example, you have to define a field called “Fare Amount.” There are four parts to the definition.
- Name or Label of the field, a.k.a. column name
- Generic Data type – Data types of the column string, long, float etc.
- Data storage and handling definition (Optional) – The default value is data, which directs the data to be stored in memory and can be part of the query
- Native Types (e.g., Char16, Char32, etc.) (Optional) – This part defines the primitive data types that define the column to a narrower specificity
The Schema documentation covers all the different types, native types, storage, and handling options. For this tutorial, you can skip the details and move on.
You have 3 columns in your dataset.
- vendor_id (String)
- fare_amount (Integer)
- pickup_date (String, to be stored in-memory, representing date-time)
The header of the CSV file would look like this:
The “|” pipe separator is used as a demarcation within the field definition.
In the above example, you can see that we have omitted the storage and handling field for vendor_id and fare_amount, which will default to “data.” The native type when not specified is the default native type for that generic type. For the pickup_date, we are specifying “datetime” so that we can run datetime functions on the column if needed.
Importing CSV files using GAdmin
We’re going to supplement the NYC Taxi Dataset with a new dataset. This file provides geospatial boundaries for neighborhoods in New York. The files can be downloaded using the following links:
- NYC Neighborhoods
Download and save the above CSV file locally to your disk (if you are using Linux we suggest saving it into /tmp where it can be read by Kinetica’s ODBC server process). If you inspect the file, you will see some example of how the headers have been defined.
Now let’s import the data.
- Navigate to GAdmin (http://<host>:8080/) and click on
Import. Select the
Advanced CSV Importoption.
- Select the nyct2010 CSV file from your disk
- Set the Table Name to nyc_neighborhood
- IMPORTANT: Set Replicated to Yes
The data will begin importing into the specified table. Once completed, you should be able to check for the new Table under
You can click the nyc_neighborhood table to view the data.
Congrats! You have just imported a sample CSV file using GAdmin and created a new table.
Importing CSV with SQL
Kinetica also allows you to create a table from a CSV file using the SQL CTAS (Create Table AS) construct. For the NYC Neighborhoods dataset, the import might look as follows:
CREATE REPLICATED TABLE nyc_neighborhood AS SELECT * FROM FILE."/tmp/nyct2010.csv";
Note the syntax: there is a dot (.) after the “FILE” keyword. The keywords are case insensitive.
Once complete, you should see the following tables:
Read more about working with files in SQL here.
So far, we have been working with a single node instance of Kinetica. In a production environment, usually you would have a cluster of Kinetica nodes.
For performance and parallel throughput, Kinetica allows you to import the data directly to the worker nodes in parallel. This allows for ingest of billions of rows per minute.
To learn more about the BulkIngestor, please consult the Ingestion Documentation.
In this section, you learned about the various loaders and you imported a few sample CSV files. In the next section, you will use SQL to query the data imported.