Amazon Redshift

DataCater can stream all data change events processed by a data pipeline to the data warehouse Amazon Redshift.


Requirements

Please make sure that you have created a user in Redshift that can insert data into the target Redshift table.
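
As a minimal sketch of such a setup (the user, schema, and table names below are placeholders chosen for illustration):

    -- Create a dedicated user for the sink connector
    CREATE USER datacater_sink PASSWORD '<secure-password>';

    -- Allow the user to resolve the schema and insert into the target table
    GRANT USAGE ON SCHEMA public TO datacater_sink;
    GRANT INSERT ON TABLE public.pipeline_events TO datacater_sink;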


Configuration

This sink connector supports the following configuration options:

Redshift endpoint

The endpoint of the Redshift database, as provided in the AWS console. Typically in the format [cluster_id].[zone].redshift.amazonaws.com:[port]/[database_name].
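
For example, a hypothetical cluster named examplecluster in eu-central-1, serving a database named dev on the default port 5439, would have an endpoint similar to:

    examplecluster.abc123xyz789.eu-central-1.redshift.amazonaws.com:5439/dev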

Username

The name of the Redshift user.

Password

The password of the Redshift user.

Schema name

The name of the Redshift schema.

Table name

The name of the Redshift table. You may retrieve the list of tables available in the given Redshift database and schema by clicking on Fetch table names.


Data Types

When retrieving the schema of the Redshift table, DataCater performs the following mapping between Redshift's data types and DataCater's data types:

Redshift data type             DataCater data type
BIGINT                         Long
BOOL                           Boolean
BOOLEAN                        Boolean
BPCHAR                         String
CHAR                           String
CHARACTER                      String
CHARACTER VARYING              String
DATE                           Date
DECIMAL                        Double
DOUBLE PRECISION               Double
FLOAT                          Float
FLOAT4                         Float
FLOAT8                         Float
GEOMETRY                       String
INTEGER                        Int
INT                            Int
INT2                           Int
INT4                           Int
INT8                           Int
NCHAR                          String
NUMERIC                        Double
NVARCHAR                       String
REAL                           Float
SMALLINT                       Int
TEXT                           String
TIME                           Time
TIMETZ                         String
TIME WITH TIME ZONE            String
TIMESTAMP                      Timestamp
TIMESTAMP WITHOUT TIME ZONE    Timestamp
TIMESTAMPTZ                    String
TIMESTAMP WITH TIME ZONE       String
VARCHAR                        String
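
To illustrate this mapping, consider the following hypothetical table definition (table and column names are placeholders); the comment next to each column shows the DataCater data type it would be mapped to:

    CREATE TABLE public.sensor_readings (
        id         BIGINT,            -- Long
        active     BOOLEAN,           -- Boolean
        label      VARCHAR(255),      -- String
        reading    DOUBLE PRECISION,  -- Double
        ratio      REAL,              -- Float
        read_date  DATE,              -- Date
        read_time  TIME,              -- Time
        created_at TIMESTAMP,         -- Timestamp
        created_tz TIMESTAMPTZ        -- String
    );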

Change Events

DataCater uses Amazon Redshift in append-only mode and streams all types of change events, i.e., insertions, updates, and deletions, as separate records.

Each event processed by a pipeline produces a new row in Redshift, resembling a change log of the data source. Let's assume that you stream data from a MySQL table (data source) to a Redshift table (data sink). When you INSERT a new row in MySQL, perform two UPDATEs on the row, and eventually DELETE the row from MySQL, DataCater will create four rows in Redshift, one for each state of the row.

INSERTs and UPDATEs extract the full row state from the data source, process it with your data pipeline, and insert the processed row into the Redshift table. DELETEs are handled similarly, except that they insert a row in which all columns are set to NULL, apart from the Redshift column filled with the primary key of the data source.
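
The following sketch illustrates this behavior for a hypothetical MySQL table users(id, name) that is streamed to a Redshift table users_log with the same columns; the comments show the rows that DataCater appends:

    -- Statements executed on the MySQL data source:
    INSERT INTO users (id, name) VALUES (1, 'Alice');
    UPDATE users SET name = 'Alicia' WHERE id = 1;
    UPDATE users SET name = 'Alice M.' WHERE id = 1;
    DELETE FROM users WHERE id = 1;

    -- Resulting change log in the Redshift table users_log:
    -- id | name
    -- ---+----------
    --  1 | Alice        (insertion)
    --  1 | Alicia       (first update)
    --  1 | Alice M.     (second update)
    --  1 | NULL         (deletion: only the primary key is kept)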

While change logs of data sources may be useful for some use cases, you often want to get the current state of a data set. When mapping the schema of the data pipeline to the schema of the Redshift table, DataCater allows you to fill timestamp columns of the Redshift table with the processing time of the change events. Processing times are available not only for insertions and updates but also for deletions, whose rows have all columns set to NULL except the Redshift column filled with the primary key of the data source and the timestamp column(s) filled with the processing time of the change event.

This approach allows you to define a view on the Redshift table that groups all rows by the primary key and, for each key, picks the most recent state based on the timestamp column holding the processing time of the change event.
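
As a sketch, extending the hypothetical users_log example from above with a timestamp column processed_at that is filled with the processing time of each change event (all names are placeholders), such a view could look as follows:

    CREATE VIEW users_current AS
    SELECT id, name, processed_at
    FROM (
        SELECT id,
               name,
               processed_at,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY processed_at DESC) AS row_num
        FROM users_log
    ) ranked
    -- Keep only the most recent state per primary key
    WHERE row_num = 1;

Rows that represent deletions surface in this view with all non-key columns set to NULL; if the view should hide deleted records, you can additionally filter them out, e.g., with WHERE row_num = 1 AND name IS NOT NULL.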