Intersect

An intersect is a representation of all rows that appear in both of a pair of specified data sets (tables or views). See INTERSECT for the SQL version of this operation.

An intersect is performed via the /create/union endpoint using the intersect or intersect_all mode:

  • Intersect -- all unique rows that exist in both specified data sets
  • Intersect All -- all rows (including duplicates) that exist in both specified data sets

You can only perform an intersect two data sets, and the columns between the two must have similar data types. Kinetica will cast compatible data types as depicted here.

Performing an intersect creates a separate memory-only table containing the results. Intersects can be persisted (like tables) using the persist option.

A memory-only table name must adhere to the standard naming criteria. Each memory-only table exists within a schema and follows the standard name resolution rules for tables.

Note that if the source tables or views are replicated, the results of the intersect will also be replicated. If the included tables or views are sharded, the resulting memory-only table from the intersect will also be sharded; this also means that if a non-sharded table or view is included, the resulting view will also not be sharded.

Limitations on using intersect are discussed in further detail in the Limitations section.

Performing an Intersect

To perform an intersect of two data sets, the /create/union endpoint requires five parameters:

  1. the name of the memory-only table to be created
  2. the list of the two member data sets to be used in the intersect operation; the result will contain all of the elements from the first data set that are also in the second one
  3. the list of columns from each of the given data sets to be used in the intersect operation
  4. the list of column names to be output to the resulting memory-only table
  5. the intersect mode specified in the options input parameter

Example

In Python, an intersect between the lunch_menu table and the dinner_menu table would look like:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
h_db.create_union(
    table_name = "example.lunch_and_dinner_menu",
    table_names = ["example.lunch_menu", "example.dinner_menu"],
    input_column_names = [
        ["food_name", "category", "price"],
        ["food_name", "category", "price"]
    ],
    output_column_names = ["lunch_and_dinner_food_name", "category", "price"],
    options = {"mode": "intersect"}
)

The results from the above call would contain only the menu items (excluding duplicates) found in the extracted columns from both lunch_menu and dinner_menu.

Note

Since the example includes price and all columns selected must match between the two sets for an item to be included, a lunch item that is priced differently as a dinner item would not appear in the result set.

Operating on an Intersect

Examples

The intersect created in the Performing an Intersect section can be filtered for lunch & dinner food items that are in the sandwich category:

1
2
3
4
5
h_db.filter(
    table_name = "example.lunch_and_dinner_menu",
    view_name = "example.lunch_and_dinner_sandwiches",
    expression = "category = 'sandwich'"
)

When executed against an intersect, the /filter endpoint produces filtered view. A chain of these filters could be used to create more and more restrictive filtered views from the original intersect operation.

To retrieve records from the intersect results in Python:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
product_table = gpudb.GPUdbTable(
    _type = None,
    name = "example.lunch_and_dinner_menu",
    db = h_db
)
product_table.get_records_by_column(
    [
        "lunch_and_dinner_food_name",
        "category",
        "price"
    ],
    options = {"order_by": "lunch_and_dinner_food_name"},
    print_data = True
)

Limitations

  • Performing an intersect between two data sets results in an entirely new data set, so be mindful of the memory usage implications.
  • All data sets have to be replicated or not replicated, e.g., you cannot intersect replicated and non-replicated data sets.
  • If attempting to intersect sharded data sets, all data sets have to be sharded similarly (if all data is not on the same processing node, the intersect can't be calculated properly).
  • The result of an intersect operation does not get updated if source data set(s) are updated.
  • The input_column_name parameter vector size needs to match the number of data sets listed, i.e. if you want to intersect a data set to itself, the data set will need to be listed twice in the table_names parameter.
  • The input_column_name parameter vectors need to be listed in the same order as their source data sets, e.g., if two data sets are listed in the table_names parameter, the first data set's columns should be listed first in the input_column_name parameter, etc.
  • Store-only columns and non-charN string column types cannot have intersect operations applied to them.
  • The result of an intersect is transient, by default, and will expire after the default TTL setting.
  • The result of an intersect is not persisted, by default, and will not survive a database restart; specifying a persist option of true will make the table permanent and not expire.