Version:

Merge Records Tables

Merging Records is similar to a union in that multiple tables or views are merged together to create a new merge records table. Since a table is created by the merge records operation, records can be added, modified, or deleted. If you want to duplicate the records in a table or view, merging a table or view with itself is also possible, but the data set will need to be listed twice in the source_table_names parameter. The endpoint is also one of the few to allow operations involving unrestricted string columns, meaning you can merge any number of unrestricted string columns together.

Merge records tables cannot be sharded, regardless of the composition of the source data sets being merged. Merge records tables are also persisted by default, even if only views are merged.

Note that it is possible to create n new columns in a merge records table from different columns in separate tables, but this would add nulls where there normally would be corresponding values.

For example, if you have two tables that look like this:

tableA

col_x col_y
3 7
4 8

tableB

col_z col_q
1 9
2 10

And you performed the following merge (in Python):

gpudb.merge_records(
  table_name = "tableAB",
  source_table_names = ["tableA","tableB"]
  field_maps = [
    {"new_col1":"col_x"},
    {"new_col2":"col_q"}
  ]
)

The new table would look like this:

new_col1 new_col2
3 <null>
4 <null>
<null> 9
<null> 10

Creating a Merge Records Table

To create a merge records table using the /merge/records endpoint requires three parameters:

  1. the name of the table to be created
  2. the name(s) of the source dataset(s)
  3. maps containing the new columns mapped to columns (or column expressions) from the source dataset(s)

Examples

The example below outlines merging together june_weather and july_weather tables into the merge records table summer_weather in Python

gpudb.merge_records(
  table_name = "summer_weather",
  source_table_names = ["june_weather","july_weather"],
  field_maps = [
    {"date":"july_date_of_record", "avg_summer_temps":"july_avg_temp", "time":"july_time_of_record"},
    {"date":"june_date_of_record", "avg_summer_temps":"june_avg_temp", "time":"june_time_of_record"}
  ]
)

Limitations and Cautions

  • A merge records table is persisted by default, which means its TTL will never expire. Note that the merge records table is not protected, which means it can still be deleted.

  • The merge records table will only inherit nullability from its source tables. Other column properties (e.g., compression, store only, etc.), any keys (shard or primary), or replication will not be inherited.

  • The /merge/records endpoint won't merge two columns from one dataset into the same new column in the same map, e.g., the value in the below field_maps parameter is valid, but the values in the tableA_col1 column won't be available in the merge records table

    field_maps = [
      {"merged_col1":"tableA_col1", "merged_col1":"tableA_col2"}
    ]
    
  • Columns being merged must match types, e.g., a column type of int16 will merge successfully with another column type of int16, but attempting to merge with a type of int, int8, double, etc., will not work.

  • Store-only columns cannot have merge records operations applied to them.

  • Collections cannot have merge records operations applied to them.

  • The merge records table does not get updated if source data set(s) are updated.

  • The number of data sets in the source_table_names parameter needs to match the number of maps included in the field_maps parameter and vice versa.

  • The order of the maps in the field_maps parameter need to match the order of the data sets listed in the source_table_names parameter, e.g. the following is not valid:

    gpudb.merge_records(
      table_name = "merged_table",
      source_table_names = ["table1","table2"],
      field_maps = [
        {"merged_col1":"col1_table2"},
        {"merged_col1":"col1_table1"}
      ]
    )