DataCater can stream all data change events processed by a data pipeline to the cloud data warehouse Google Cloud BigQuery.
This connector uses BigQuery's streaming inserts feature, which allows records to be inserted one at a time in a scalable manner.
Streaming inserts are not available in the free tier of Google Cloud and require billing to be enabled.
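To illustrate how streaming inserts work, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and row contents are placeholders, and the snippet shows the general mechanism rather than DataCater's internal implementation:

```python
# Minimal sketch of a BigQuery streaming insert, assuming the
# google-cloud-bigquery client library and an already existing table.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# Fully-qualified table ID (project, dataset, and table are placeholders).
table_id = "my-project.my_dataset.my_table"

# Each dict is one record; streaming inserts accept one or more rows per call.
rows = [{"id": 1, "name": "Jane", "updated_at": "2021-01-01T12:00:00Z"}]

errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```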
Please make sure that you have created a service account in Google Cloud and assigned it the primitive IAM role BigQueryEditor.
This sink connector supports the following configuration options:
- The content of the JSON-based credentials file provided by Google Cloud for the service account. The service account must be assigned either the primitive IAM role BigQueryEditor or the predefined IAM role bigquery.dataEditor.
- The name of the BigQuery project. We try to extract the name of the BigQuery project automatically from the provided Google Cloud credentials.
- The name of the BigQuery dataset.
- Whether to write data into an existing BigQuery table. If enabled, DataCater writes data into an existing BigQuery table, which can be configured using the subsequent options. If disabled, DataCater automatically creates a table in the BigQuery dataset when the deployment of the pipeline starts for the first time, names it datacater-pipeline-$pipelineId unless a table name is provided, and writes data into it. Note that subsequent schema changes need to be performed manually in BigQuery.
- The name of the BigQuery table.
- The partitioning approach used by the BigQuery table. Please choose between By ingestion time (daily), By timestamp column (daily), and No partitioning. If you choose By timestamp column (daily), please also provide the name of the column used for partitioning. If you have fetched the table names, we try to automatically fill out the partitioning approach. Note that, at the moment, this connector does not support hourly partitioning (see the sketch after this list for an example of daily partitioning).
- The name of the column used for partitioning the BigQuery table. This configuration option is only available when the option Table partitioning is set to By timestamp column (daily). If you have fetched the table names, we try to automatically fill out the partitioning column.
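To make these options more concrete, the following sketch shows how they map onto BigQuery concepts using the google-cloud-bigquery Python client: loading the service account credentials, deriving the project, and creating a table that is partitioned daily by a timestamp column. All file, dataset, table, and column names are placeholders, and the snippet is illustrative rather than DataCater's own code:

```python
# Sketch: mapping the configuration options above onto BigQuery concepts.
from google.cloud import bigquery
from google.oauth2 import service_account

# JSON-based credentials file of the service account (hypothetical path).
credentials = service_account.Credentials.from_service_account_file(
    "service-account.json"
)
# The project name can usually be taken from the credentials file.
client = bigquery.Client(project=credentials.project_id, credentials=credentials)

# Table partitioned daily by a timestamp column, mirroring
# "By timestamp column (daily)" with "updated_at" as the partitioning column.
schema = [
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("updated_at", "TIMESTAMP"),
]
table = bigquery.Table("my-project.my_dataset.my_table", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="updated_at",  # omit `field` to partition by ingestion time instead
)
client.create_table(table, exists_ok=True)
```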
When retrieving the schema of the BigQuery table, DataCater performs the following mapping between BigQuery's data types and DataCater's data types:
| BigQuery data type | DataCater data type |
| --- | --- |
| BOOLEAN | Boolean |
| BYTES | String |
| DATE | Date |
| DATETIME | Timestamp |
| FLOAT | Double |
| INTEGER | Long |
| NUMERIC | Double |
| STRING | String |
| TIME | Time |
| TIMESTAMP | Timestamp |
Note that, at the moment, DataCater does not support the BigQuery data types ARRAY, GEOGRAPHY and STRUCT.
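For reference, the mapping above can also be expressed programmatically; the following sketch is purely illustrative and does not reflect DataCater's internal representation:

```python
# The type mapping from the table above as a plain Python dict.
BIGQUERY_TO_DATACATER = {
    "BOOLEAN": "Boolean",
    "BYTES": "String",
    "DATE": "Date",
    "DATETIME": "Timestamp",
    "FLOAT": "Double",
    "INTEGER": "Long",
    "NUMERIC": "Double",
    "STRING": "String",
    "TIME": "Time",
    "TIMESTAMP": "Timestamp",
}

def map_bigquery_type(bq_type: str) -> str:
    """Return the DataCater type for a BigQuery type, or raise for unsupported ones."""
    try:
        return BIGQUERY_TO_DATACATER[bq_type]
    except KeyError:
        # ARRAY, GEOGRAPHY, and STRUCT are currently not supported.
        raise ValueError(f"Unsupported BigQuery data type: {bq_type}")
```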
Although Google Cloud BigQuery is an append-only data store, DataCater can stream all types of change events, i.e., insertions, updates, and deletions, to BigQuery.
Each event processed by a pipeline produces a new row in BigQuery, resembling a change log of the data source. Let's assume that you stream data from a MySQL database (data source) to a BigQuery table (data sink). When you INSERT a new row in MySQL, perform two UPDATEs on the row, and eventually DELETE the row from MySQL, DataCater will create four rows in BigQuery, one for each state of the row.
For INSERTs and UPDATEs, DataCater extracts the full row state from the data source, processes it with your data pipeline, and inserts the processed row into the BigQuery table. DELETEs are handled similarly, except that they cause the insertion of a row in which all columns are set to NULL except the BigQuery column holding the primary key of the data source.
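As a hypothetical illustration of the resulting change log, consider a MySQL table with the primary key column id and a single additional column name; the column names and values below are made up for illustration:

```python
# Change-log rows created in BigQuery for one MySQL row that is inserted,
# updated twice, and finally deleted (hypothetical columns and values).
change_log = [
    {"id": 42, "name": "Jane"},         # INSERT
    {"id": 42, "name": "Jane Doe"},     # first UPDATE
    {"id": 42, "name": "Jane A. Doe"},  # second UPDATE
    {"id": 42, "name": None},           # DELETE: only the primary key is kept
]
```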
While change logs of data sources may be useful for some use cases, you often want to get the current state of a data set. When mapping the schema of the data pipeline to the schema of the BigQuery table, DataCater allows you to fill timestamp columns of the BigQuery table with the processing time of the change events. Processing times are available not only for insertions and updates, but also for deletions, which set all columns to NULL except the BigQuery column holding the primary key of the data source and the timestamp column(s) holding the processing time of the change event.
This approach allows you to define a view on the BigQuery table that groups all rows by the primary key and, for each primary key, picks the most recent state based on the timestamp column holding the processing time of the change event.
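A minimal sketch of such a view, assuming a BigQuery table my_dataset.my_table with the primary key column id and a timestamp column processed_at holding the processing time of each change event (all names are placeholders, not DataCater defaults):

```python
# Sketch: "latest state" view on top of the change-log table.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials
client.query(
    """
    CREATE OR REPLACE VIEW `my_dataset.my_table_latest` AS
    SELECT * EXCEPT (row_num)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY processed_at DESC) AS row_num
      FROM `my_dataset.my_table`
    )
    WHERE row_num = 1
    """
).result()
```

Note that deleted rows still appear in such a view as a row with only the primary key and the processing time set; if you want deletions to disappear entirely, you can extend the view to filter out rows whose non-key columns are all NULL.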