Amazon Redshift

DataCater can stream all data change events processed by a data pipeline to the data warehouse Amazon Redshift.

Requirements

Please make sure that you have created an user in Redshift, which can insert data into the used Redshift table.

Configuration

This sink connector supports the following configuration options:

Redshift endpoint

The endpoint of the Redshift database, as provided in the AWS console. Typically in the format [cluster_id].[zone].redshift.amazonaws.com:[port]/[database_name].

Username

The name of the Redshift user.

Password

The password of the Redshift user.

Schema name

The name of the Redshift schema.

Table name

The name of the Redshift table. You may retrieve the list of tables available in the given Redshift database and schema by clicking on Fetch table names.

Data Types

When retrieving the schema of the Redshift table, DataCater performs the following mapping between Redshifts's data types and DataCater's data types:

Redshift data type	DataCater data type
BIGINT	Long
BOOL	Boolean
BOOLEAN	Boolean
BPCHAR	Boolean
CHAR	String
CHARACTER	String
CHARACTER VARYING	String
DATE	Date
DECIMAL	Double
DOUBLE PRECISION	Double
FLOAT	Float
FLOAT4	Float
FLOAT8	Float
GEOMETRY	String
INTEGER	Int
INT	Int
INT2	Int
INT4	Int
INT8	Int
NCHAR	String
NUMERIC	Double
NVARCHAR	String
REAL	Float
SMALLINT	Int
TEXT	String
TIME	Time
TIMETZ	String
TIME WITH TIME ZONE	String
TIMESTAMP	Timestamp
TIMESTAMP WITHOUT TIME ZONE	Timestamp
TIMESTAMPTZ	String
TIMESTAMP WITH TIME ZONE	String
VARCHAR	String

Change Events

DataCater uses Amazon Redshift in append-only mode and streams all types of change events, i.e., insertions, updates, and deletions, as separate records.

Each event processed by a pipeline produces a new row in Redshift, resembling a change log of the data source. Let's assume that you stream data from a MySQL table (data source) to a Redshift table (data sink). When you INSERT a new row in MySQL, perform two UPDATEs on the row, and eventually DELETE the row from MySQL, DataCater will create four rows in Redshift, one for each state of the row.

INSERTs and UPDATEs extract the full row state from the data source, process it with your data pipeline and insert the processed row into the Redshift table. DELETIONS are implemented similarly, except that they cause the insertion of a row with all columns set to NULL except the Redshift column filled with the primary key of the data source.

While change logs of data sources may be useful for some use cases, you often want to get the current state of a data set. When mapping the schema of the data pipeline to the schema of the Redshift table, DataCater allows to fill timestamp columns of the Redshift table with the processing time of the change events. Processing times are not only available for insertions and updates, but also for deletions, which have all columns set to NULL except the Redshift column filled with the primary key of the data source and the timestamp column(s) filled with the processing time of the change event.

This approach allows you to define a view on the Redshift table, which groups all rows by the primary key and, for each row, picks the most recent state based on the timestamp column filled with the processing time of the change event.

Getting started

Data Pipelines