DataCater can stream all data change events processed by a data pipeline to the data warehouse Amazon Redshift.
Please make sure that you have created an user in Redshift, which can insert data into the used Redshift table.
This sink connector supports the following configuration options:
The endpoint of the Redshift database, as provided in the AWS console. Typically in the format [cluster_id].[zone].redshift.amazonaws.com:[port]/[database_name].
The name of the Redshift user.
The password of the Redshift user.
The name of the Redshift schema.
The name of the Redshift table. You may retrieve the list of tables available in the given Redshift database and schema by clicking on Fetch table names.
When retrieving the schema of the Redshift table, DataCater performs the following mapping between Redshifts's data types and DataCater's data types:
Redshift data type | DataCater data type |
---|---|
BIGINT | Long |
BOOL | Boolean |
BOOLEAN | Boolean |
BPCHAR | Boolean |
CHAR | String |
CHARACTER | String |
CHARACTER VARYING | String |
DATE | Date |
DECIMAL | Double |
DOUBLE PRECISION | Double |
FLOAT | Float |
FLOAT4 | Float |
FLOAT8 | Float |
GEOMETRY | String |
INTEGER | Int |
INT | Int |
INT2 | Int |
INT4 | Int |
INT8 | Int |
NCHAR | String |
NUMERIC | Double |
NVARCHAR | String |
REAL | Float |
SMALLINT | Int |
TEXT | String |
TIME | Time |
TIMETZ | String |
TIME WITH TIME ZONE | String |
TIMESTAMP | Timestamp |
TIMESTAMP WITHOUT TIME ZONE | Timestamp |
TIMESTAMPTZ | String |
TIMESTAMP WITH TIME ZONE | String |
VARCHAR | String |
DataCater uses Amazon Redshift in append-only mode and streams all types of change events, i.e., insertions, updates, and deletions, as separate records.
Each event processed by a pipeline produces a new row in Redshift, resembling a change log of the data source. Let's assume that you stream data from a MySQL table (data source) to a Redshift table (data sink). When you INSERT a new row in MySQL, perform two UPDATEs on the row, and eventually DELETE the row from MySQL, DataCater will create four rows in Redshift, one for each state of the row.
INSERTs and UPDATEs extract the full row state from the data source, process it with your data pipeline and insert the processed row into the Redshift table. DELETIONS are implemented similarly, except that they cause the insertion of a row with all columns set to NULL except the Redshift column filled with the primary key of the data source.
While change logs of data sources may be useful for some use cases, you often want to get the current state of a data set. When mapping the schema of the data pipeline to the schema of the Redshift table, DataCater allows to fill timestamp columns of the Redshift table with the processing time of the change events. Processing times are not only available for insertions and updates, but also for deletions, which have all columns set to NULL except the Redshift column filled with the primary key of the data source and the timestamp column(s) filled with the processing time of the change event.
This approach allows you to define a view on the Redshift table, which groups all rows by the primary key and, for each row, picks the most recent state based on the timestamp column filled with the processing time of the change event.