Google Cloud BigQuery

Use change data capture to stream data from Google Cloud BigQuery tables to any data sink and transform them on the way.

Requirements

Please make sure that you have created a service account in Google Cloud, which is assigned to the primitive IAM role BigQuery Data Viewer on the level of the dataset and the primitive IAM roles BigQuery Job User and BigQuery Resource Viewer on the level of the project.

Configuration

This source connector supports the following configuration options:

Google Cloud credentials (JSON)

The content of the JSON-based credentials file provided by Google Cloud for the service account. The service account must have been assigned to the primitive IAM roles BigQuery Data Viewer (dataset level), BigQuery Job User (project level), and BigQuery ResourceViewer (project level).

Service account e-mail address

The e-mail address of the service account. We try to automatically extract the e-mail address from the provided Google Cloud credentials.

Project name

The name of the BigQuery project. We try to automatically extract the name of the BigQuery project from the provided Google Cloud credentials.

Dataset name

The name of the BigQuery dataset.

Table name

The name of the BigQuery table (or view). You may retrieve the list of tables (and views) available in the given BigQuery project and dataset by clicking on Fetch table names.

Change Data Capture mode

You may choose one of the following modes for change data capture:

BULK: Recurringly load all data from the BigQuery table. Basically no change data capture at all.
INCREMENTING: Use the primary key column, specified using the Primary key column configuration option, to recurringly extract new records. This mode does only extract INSERTs, but skips UPDATEs and DELETEs.
TIMESTAMP: Use the timestamp column, specified using the Timestamp column configuration option, to recurringly extract new and updated records. This mode does only extract INSERTs and UPDATEs, but skips DELETEs.
TIMESTAMP/INCREMENTING: Use the primary key column and the timestamp column, specified using the Primary key column and Timestamp column configuration options, to recurringly extract new and updated records. This mode does only extract INSERTs and UPDATEs, but skips DELETEs.

Primary key column

BigQuery does not natively support the concept of primary keys. Specifying a column, which can be used for uniquely identifying a row in BigQuery, allows DataCater to detect new records. Please make sure that this colum is never NULL.

Timestamp column

DataCater can use a timestamp column, which stores the time of the most recent update of a record, to detect record updates. Specifying the timestamp column is required when using TIMESTAMP or TIMESTAMP/INCREMENTING as Change Data Capture mode.

Sync interval

The interval in milliseconds between synchronizations of the BigQuery table and DataCater (default: 60000).

Data Types

The following table shows the mapping between BigQuery data types and the data types used by DataCater.

BigQuery data type	DataCater data type
ARRAY	String
BOOLEAN	Boolean
BYTES	String
DATE	Date
DATETIME	Timestamp
FLOAT	Double
INTEGER	Long
NUMERIC	Double
STRING	String
STRUCT	String
TIME	Time
TIMESTAMP	Timestamp

DataCater extracts the BigQuery data types ARRAY and STRUCT as JSON-formatted strings.

Installation (On-Premise only)

For licensing reasons, we are not allowed to ship the official BigQuery JAR file, which is why we need to ask you to manually download the JAR file and mount it into the folder /kafka/connect/kafka-connect-jdbc of the Kafka Connect containers - Thanks for your understanding.

Getting started

Data Pipelines