Google Cloud BigQuery

DataCater can stream all data change events processed by a data pipeline to the cloud data warehouse Google Cloud BigQuery.

This connector uses BigQuery's streaming inserts feature, which allows inserting records one at a time in a scalable manner.
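
For illustration, a streaming insert of a single record could look as follows when using the Python client library google-cloud-bigquery; the table ID and record contents are made-up placeholders, and DataCater performs these inserts for you behind the scenes.

    from google.cloud import bigquery

    # Placeholder table ID and record; DataCater handles streaming inserts internally.
    client = bigquery.Client()
    rows_to_insert = [
        {"id": 42, "email": "jane@example.com", "updated_at": "2021-01-01T12:00:00Z"},
    ]

    # insert_rows_json performs a streaming insert and returns a list of errors,
    # which is empty if the insert succeeded.
    errors = client.insert_rows_json("my-project.my_dataset.my_table", rows_to_insert)
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")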


Requirements

Streaming inserts are not available for the free tier of Google Cloud and require users to have billing enabled.

Please make sure that you have created a service account in Google Cloud and assigned it the primitive IAM role BigQueryEditor.


Configuration

This sink connector supports the following configuration options:

Google Cloud credentials (JSON)

The content of the JSON-based credentials file provided by Google Cloud for the service account. The service account must have been assigned either the primitive IAM role BigQueryEditor or the predefined IAM role bigquery.dataEditor.
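
As a rough sketch of how such a credentials file is typically used, the following example builds a BigQuery client from a service account JSON file with the Python libraries google-auth and google-cloud-bigquery; the file path is a placeholder, and DataCater only needs the file's content pasted into this configuration option.

    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Placeholder path to the downloaded service account key file.
    credentials = service_account.Credentials.from_service_account_file(
        "/path/to/service-account.json"
    )

    # All BigQuery API calls made through this client are authenticated
    # with the service account.
    client = bigquery.Client(credentials=credentials, project=credentials.project_id)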

Project name

The name of the BigQuery project. We try to automatically extract the name of the BigQuery project from the provided Google Cloud credentials.
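
Service account credentials files contain a project_id field, which is presumably what the automatic extraction relies on; a minimal sketch, assuming a placeholder file path:

    import json

    # The service account key file carries the project ID in its project_id field.
    with open("/path/to/service-account.json") as f:
        project_name = json.load(f)["project_id"]

    print(project_name)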

Dataset name

The name of the BigQuery dataset.

Write data into existing BigQuery table

If enabled, DataCater will write data into an existing BigQuery table, which can be configured using the subsequent options.

If disabled, DataCater will automatically create a table in the BigQuery dataset when starting the deployment of the pipeline for the first time and write data into it. The table is named datacater-pipeline-$pipelineId unless a table name is provided. Note that subsequent schema changes need to be performed manually in BigQuery.
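
As a rough approximation of this behavior, the following sketch creates a table with the Python client library if it does not exist yet; the table ID, the schema, and the existence check are illustrative assumptions, not DataCater's actual implementation.

    from google.cloud import bigquery
    from google.cloud.exceptions import NotFound

    client = bigquery.Client()

    # Illustrative table ID; DataCater derives the name from the pipeline ID.
    table_id = "my-project.my_dataset.datacater-pipeline-42"

    try:
        client.get_table(table_id)  # table already exists, nothing to do
    except NotFound:
        # Illustrative schema; the real schema follows the pipeline's output.
        schema = [
            bigquery.SchemaField("id", "INTEGER"),
            bigquery.SchemaField("email", "STRING"),
        ]
        client.create_table(bigquery.Table(table_id, schema=schema))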

Table name

The name of the BigQuery table.

Table partitioning

The partitioning approach used by the BigQuery table.

Please choose one of By ingestion time (daily), By timestamp column (daily), or No partitioning. If you choose By timestamp column (daily), please also provide the name of the column used for partitioning.

If you have fetched the table names, we try to automatically fill out the partitioning approach.

Note that, at the moment, this connector does not support hourly partitioning.

Timestamp column to be used for partitioning

The name of the column used for partitioning the BigQuery table. This configuration option is only available when using By timestamp column (daily) for the option Table partitioning.

If you have fetched the table names, we try to automatically fill out the partitioning column.
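
For reference, the two daily partitioning approaches map to BigQuery's time partitioning as sketched below with the Python client library; the table ID, the schema, and the column name updated_at are illustrative assumptions.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Illustrative schema with a timestamp column that could serve as partitioning column.
    schema = [
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("updated_at", "TIMESTAMP"),
    ]
    table = bigquery.Table("my-project.my_dataset.my_table", schema=schema)

    # By ingestion time (daily): omit the field argument.
    # By timestamp column (daily): pass the partitioning column as field.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="updated_at",
    )

    client.create_table(table)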


Data Types

When retrieving the schema of the BigQuery table, DataCater performs the following mapping between BigQuery's data types and DataCater's data types:

BigQuery data type    DataCater data type
BOOLEAN               Boolean
BYTES                 String
DATE                  Date
DATETIME              Timestamp
FLOAT                 Double
INTEGER               Long
NUMERIC               Double
STRING                String
TIME                  Time
TIMESTAMP             Timestamp

Note that, at the moment, DataCater does not support the BigQuery data types ARRAY, GEOGRAPHY and STRUCT.
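
The BigQuery side of this mapping can be inspected by retrieving the table schema with the Python client library, as in the following sketch; the table ID is a placeholder.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder table ID; get_table returns the table's metadata, including its schema.
    table = client.get_table("my-project.my_dataset.my_table")

    for field in table.schema:
        # field.field_type holds the BigQuery data type, e.g., INTEGER or TIMESTAMP.
        print(field.name, field.field_type)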


Change Events

Although Google Cloud BigQuery is an append-only data store, DataCater can stream all types of change events, i.e., insertions, updates, and deletions, to BigQuery.

Each event processed by a pipeline produces a new row in BigQuery, resembling a change log of the data source. Let's assume that you stream data from a MySQL database (data source) to a BigQuery table (data sink). When you INSERT a new row in MySQL, perform two UPDATEs on the row, and eventually DELETE the row from MySQL, DataCater will create four rows in BigQuery, one for each state of the row.

INSERTs and UPDATEs extract the full row state from the data source, process it with your data pipeline, and insert the processed row into the BigQuery table. DELETEs are handled similarly, except that they insert a row in which all columns are set to NULL, except for the BigQuery column holding the primary key of the data source.

While change logs of data sources may be useful for some use cases, you often want to get the current state of a data set. When mapping the schema of the data pipeline to the schema of the BigQuery table, DataCater allows you to fill timestamp columns of the BigQuery table with the processing time of the change events. Processing times are available not only for insertions and updates but also for deletions, which have all columns set to NULL except the BigQuery column filled with the primary key of the data source and the timestamp column(s) filled with the processing time of the change event.

This approach allows you to define a view on the BigQuery table that groups all rows by the primary key and, for each key, picks the most recent state based on the timestamp column filled with the processing time of the change event.
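
A minimal sketch of such a view, created with the Python client library and standard SQL; the table name, the primary key column id, and the processing-time column processed_at are illustrative assumptions.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder names; id is the primary key column and processed_at the
    # timestamp column filled with the processing time of the change event.
    view = bigquery.Table("my-project.my_dataset.my_table_current_state")
    view.view_query = """
        SELECT * EXCEPT (row_num)
        FROM (
          SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY id ORDER BY processed_at DESC) AS row_num
          FROM `my-project.my_dataset.my_table`
        )
        WHERE row_num = 1
    """

    client.create_table(view)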