Google Drive

Use change data capture to stream flat files from a folder in Google Drive to any data sink and transform them on the way.


Change Data Capture

At startup, the connector extracts data from all matching files in the given folder. After this initial sync, it watches the folder for new or updated files and syncs only those changes.


Requirements

Please create a service account in Google Cloud and share the Google Drive folder with it. Please also make sure that the Drive API is enabled.


Configuration

This source connector supports the following configuration options:

Google Cloud credentials (JSON)

The content of the JSON-based credentials file provided by Google Cloud for the service account.

Folder name

The name of the Google Drive folder from which files are extracted. Please click the Fetch folder name button to fetch the list of available folders.

File name filter

Regular expression applied to files from the Google Drive folder. Only files with a name matching the regular expression will be extracted. Default value: .* (matches all file names).
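
As an illustration of how such a filter behaves, the following Python sketch applies a regular expression to a list of file names. It assumes that the whole file name must match the pattern; the connector may use different matching semantics or a different regular expression dialect. The function name filter_file_names and the sample file names are hypothetical.

```python
import re


def filter_file_names(names, pattern=".*"):
    """Keep only the names that match the pattern in full.

    Assumption: full-match semantics. The default pattern `.*` keeps
    every file name.
    """
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


# Hypothetical folder content.
names = ["orders_2023.csv", "orders_2024.csv", "notes.txt"]

# Only yearly order exports are kept.
print(filter_file_names(names, r"orders_\d{4}\.csv"))

# The default pattern keeps all files.
print(filter_file_names(names))
```

Note that the dot before the file extension is escaped; an unescaped dot matches any character.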

File format

The format of the extracted files. At the moment, this connector only supports CSV files, JSON files, and Google Sheets.

Generate attribute names from header row

Only available for the file formats CSV and Google Sheets. Whether to use the first row of the file for extracting attribute names. If this option is set to false, DataCater generates attribute names based on the index of the attribute and names them column_1, column_2, etc.
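
The effect of this option can be sketched in Python as follows. This is not DataCater's implementation; the function name read_rows and its header_row parameter are hypothetical and only mirror the naming scheme described above.

```python
import csv
import io


def read_rows(text, header_row=True):
    """Turn CSV text into a list of records (dicts).

    With header_row=True, the first row provides the attribute names.
    Otherwise, names are generated from the 1-based column index:
    column_1, column_2, ...
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if header_row:
        names, data = rows[0], rows[1:]
    else:
        names = [f"column_{i}" for i in range(1, len(rows[0]) + 1)]
        data = rows
    return [dict(zip(names, row)) for row in data]


sample = "id,name\n1,Ada\n"
print(read_rows(sample, header_row=True))
print(read_rows(sample, header_row=False))
```

With the option disabled, the first row is treated as a regular record rather than being dropped.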

CSV delimiter value

Only available for the file format CSV. The character that delimits different columns (default: ,).

CSV quote character

Only available for the file format CSV. Character used for quotes (default: ").

CSV quote escape character

Only available for the file format CSV. Character used for escaping quotes (default: ").

CSV line separator

Only available for the file format CSV. String used for separating multiple lines (default: \n).

CSV comment character

Only available for the file format CSV. Character used for comments. It must appear at the beginning of a line (default: #).

Number of lines to skip from beginning

Only available for the file formats CSV and Google Sheets. Number of lines to skip at the beginning of the file before parsing starts (default: 0).
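
To show how the CSV options above interact, here is a minimal Python sketch built on the standard csv module. The function parse_csv and its parameter names are hypothetical. Two behaviours are assumptions, not documented facts: leading lines are skipped before comment lines are dropped, and the quote escape character equals the quote character, which is the default. For brevity, the sketch does not support line separators inside quoted values.

```python
import csv


def parse_csv(text, delimiter=",", quotechar='"', line_separator="\n",
              comment="#", skip_lines=0):
    """Parse CSV text using the options described above (simplified)."""
    # Split into lines and skip the requested number of leading lines.
    lines = text.split(line_separator)[skip_lines:]
    # Drop empty lines and lines that start with the comment character.
    lines = [line for line in lines if line and not line.startswith(comment)]
    # doublequote=True means a quote is escaped by doubling it,
    # i.e. the escape character is the quote character itself.
    reader = csv.reader(lines, delimiter=delimiter, quotechar=quotechar,
                        doublequote=True)
    return list(reader)


text = 'exported by tool\n# comment line\nid;name\n1;"Doe; ""JD"""\n'
print(parse_csv(text, delimiter=";", skip_lines=1))
```

In this example, the first line is skipped, the comment line is ignored, and the quoted value keeps both the delimiter and the escaped quotes.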

JSON Pointer to records list

Only available for the file format JSON. By default, DataCater expects either a single JSON object or an array of JSON objects, each representing a record, at the root level of the respective file. For all other cases, you may provide a JSON pointer to the location of the data of interest within a possibly deeply nested JSON structure.
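
The following Python sketch shows how a JSON pointer selects the list of records from a nested document. It assumes the pointer syntax of RFC 6901 (for example /result/items); whether the connector follows that syntax exactly is an assumption. The function resolve_pointer and the sample document are hypothetical.

```python
import json


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 style JSON pointer against a parsed document."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON pointer must start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape in this order: ~1 stands for '/', ~0 stands for '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


# Hypothetical file content with the records nested below the root.
doc = json.loads(
    '{"meta": {"page": 1}, "result": {"items": [{"id": 1}, {"id": 2}]}}'
)

records = resolve_pointer(doc, "/result/items")
print(records)
```

Here, the records are not at the root level, so the pointer /result/items tells the parser where the array of records lives.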

One object per file

Only available for the file format JSON. Whether each JSON file contains a single object (true) or an array of objects (false). Default: false.

Primary key column

Name of the attribute that uniquely identifies records, similar to a primary key in a database system.

Sync interval (s)

The interval in seconds between synchronizations of the Google Drive folder with DataCater (default: 120). When synchronizing, DataCater consumes only those files from the Google Drive folder that have not yet been processed by the pipeline, which provides a basic form of change data capture.
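
The polling behaviour can be sketched in Python as shown below. This is a simplified illustration, not the connector's implementation. It assumes that a file counts as unprocessed if it is new or if its modification timestamp has changed since the last sync. The names select_changed_files, sync, list_files, and handle_file are hypothetical.

```python
import time


def select_changed_files(listing, seen):
    """Return the ids of files that are new or modified.

    listing: dict mapping file id -> modification timestamp (current state)
    seen:    dict with the state from earlier syncs; updated in place
    """
    changed = [fid for fid, mtime in listing.items() if seen.get(fid) != mtime]
    seen.update(listing)
    return changed


def sync(list_files, handle_file, interval_s=120, max_rounds=None):
    """Poll the folder every interval_s seconds and handle changed files."""
    seen = {}
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for fid in select_changed_files(list_files(), seen):
            handle_file(fid)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval_s)


# Two simulated folder listings: file "b" is updated and "c" is added.
listings = iter([{"a": 1, "b": 1}, {"a": 1, "b": 2, "c": 1}])
handled = []
sync(lambda: next(listings), handled.append, interval_s=0, max_rounds=2)
print(handled)
```

The first round handles all files, which corresponds to the initial sync; later rounds handle only the files that changed.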


Data Types

DataCater imports all columns of CSV files and Google Sheets as attributes of type string.

DataCater automatically extends the set of attributes with the attribute __datacater_file_name and fills it with the name of the file.
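
As a small illustration of both points, the Python sketch below converts all values of a record to strings and adds the __datacater_file_name attribute. The function to_records and the sample data are hypothetical; only the attribute name is taken from the description above.

```python
def to_records(rows, file_name):
    """Represent all values as strings and attach the source file name."""
    records = []
    for row in rows:
        record = {key: str(value) for key, value in row.items()}
        record["__datacater_file_name"] = file_name
        records.append(record)
    return records


rows = [{"id": 1, "price": 9.5}]
print(to_records(rows, "orders_2024.csv"))
```

Because values arrive as strings, numeric or date attributes need to be converted in a later transformation step if a typed representation is required.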