AWS S3

Use change data capture to stream files from AWS S3 buckets to any data sink and transform them on the way.

Change Data Capture

At startup, the connector extracts data from all matching files in the given bucket. After this initial sync, it watches the bucket for new files and syncs only files that it has not yet processed.

Requirements

Please provide an AWS access key ID and the corresponding AWS secret access key. The key pair must have permission to list the objects in your bucket (s3:ListBucket) and to retrieve objects (s3:GetObject).
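As an illustration of the two permissions named above, the following sketch builds a minimal IAM policy document in Python. The function name build_read_only_policy and the bucket name my-example-bucket are hypothetical and not part of DataCater. The sketch assumes that access should be limited to a single bucket; adapt it to your own security requirements.

```python
import json


def build_read_only_policy(bucket_name: str) -> dict:
    """Return an IAM policy document that allows listing and reading one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # s3:GetObject is granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


# Hypothetical bucket name, used only for illustration.
print(json.dumps(build_read_only_policy("my-example-bucket"), indent=2))
```

The printed JSON can be attached to the IAM user that owns the access key pair.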

Configuration

This source connector supports the following configuration options:

AWS region

Your AWS region (default: eu-central-1).

AWS access key ID

The AWS access key ID to use for authentication.

AWS secret access key

The AWS secret access key to use for authentication.

Bucket name

The name of the AWS S3 bucket from which files are extracted.

File name filter

Regular expression applied to the names of the files in the AWS S3 bucket. Only files whose name matches the regular expression will be extracted. Default value: .* (matches all file names).
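To try out a filter expression before saving it, a small local check such as the following may help. This is only a sketch: it uses Python's re module and assumes that the expression must match the whole file name, whereas the connector may use a different regular expression dialect or matching mode. The function filter_file_names and the sample file names are hypothetical.

```python
import re


def filter_file_names(file_names, pattern=".*"):
    """Keep only the names that the pattern matches completely (assumption)."""
    compiled = re.compile(pattern)
    return [name for name in file_names if compiled.fullmatch(name)]


# Hypothetical file names, used only for illustration.
names = ["orders_2023.csv", "orders_2024.csv", "customers.parquet", "readme.txt"]

# Select only the yearly order exports.
print(filter_file_names(names, r"orders_\d{4}\.csv"))
# → ['orders_2023.csv', 'orders_2024.csv']

# The default pattern .* keeps every file.
print(len(filter_file_names(names)))
# → 4
```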

File format

The format of the extracted files. At the moment, this connector only supports CSV and Parquet files.

CSV delimiter value

Only available for the file type CSV. The character that delimits different columns (default: ,).

Generate attribute names from CSV header row

Only available for the file type CSV. Whether to use the first row of the CSV file for extracting attribute names or not. If this option is set to false, DataCater will generate attribute names based on the index of the attribute, and name them column_1, column_2, etc.
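The effect of the two CSV options can be sketched locally with Python's csv module. The code below is not DataCater's implementation; the function read_csv_records and the sample data are hypothetical. It only mirrors the documented behavior: a configurable delimiter, and attribute names taken either from the header row or generated as column_1, column_2, etc.

```python
import csv
import io


def read_csv_records(text, delimiter=",", header=True):
    """Parse CSV text into a list of records (dicts of attribute name to value)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    if header:
        # Use the first row as attribute names and the rest as data.
        names, data = rows[0], rows[1:]
    else:
        # Generate names from the 1-based index of each attribute.
        names = [f"column_{i}" for i in range(1, len(rows[0]) + 1)]
        data = rows
    return [dict(zip(names, row)) for row in data]


# Hypothetical sample file that uses ";" as the delimiter.
sample = "id;name\n1;Ada\n2;Grace\n"

print(read_csv_records(sample, delimiter=";", header=True))
# → [{'id': '1', 'name': 'Ada'}, {'id': '2', 'name': 'Grace'}]

print(read_csv_records(sample, delimiter=";", header=False)[0])
# → {'column_1': 'id', 'column_2': 'name'}
```

Note that with the header option disabled, the first row is treated as a regular record.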

Primary key column

Name of the attribute that uniquely identifies records, similar to a primary key in a database system.

Sync interval (s)

The interval in seconds between two synchronizations of the AWS S3 bucket with DataCater (default: 120). When synchronizing, DataCater consumes only those files from the AWS S3 bucket that have not yet been processed by the pipeline, which provides a basic, file-level form of change data capture.
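The idea of consuming only unprocessed files can be sketched as follows. This is a simplified illustration, not DataCater's implementation: it assumes that files are identified by their name only, while the connector might also take other properties (for instance, the modification time) into account. The functions select_unprocessed and run_sync are hypothetical.

```python
def select_unprocessed(listed_keys, processed_keys):
    """Return the listed files that have not been processed yet, in listing order."""
    return [key for key in listed_keys if key not in processed_keys]


def run_sync(listed_keys, processed_keys):
    """Simulate one sync: pick new files and remember them as processed."""
    new_keys = select_unprocessed(listed_keys, processed_keys)
    processed_keys.update(new_keys)
    return new_keys


processed = set()

# Initial sync: every file in the bucket is new.
print(run_sync(["a.csv", "b.csv"], processed))
# → ['a.csv', 'b.csv']

# Next sync, one interval later: only the newly added file is consumed.
print(run_sync(["a.csv", "b.csv", "c.csv"], processed))
# → ['c.csv']
```

In a real deployment, each call to run_sync would happen once per sync interval, with the file listing fetched from the bucket.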

Data Types

DataCater imports all columns of a CSV file as attributes of type String.

For Parquet files, DataCater performs the following mapping between Parquet data types and DataCater data types:

Parquet data type	DataCater data type
ARRAY	String
BOOLEAN	Boolean
BYTES	String
DOUBLE	Double
ENUM	String
FIXED	Double (when using decimal as logical type) or String
FLOAT	Float
INT	Int
LONG	Long
MAP	String
RECORD	String
STRING	String

DataCater automatically extends the set of attributes with the attribute __datacater_file_name and fills it with the name of the file.
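For reference, the mapping above and the added file name attribute can be expressed as a small lookup. The sketch below simply restates the table in Python; the names PARQUET_TO_DATACATER, datacater_type, and with_file_name are hypothetical and not part of DataCater's API.

```python
# Mapping taken from the table above. FIXED is handled separately because
# its target type depends on the logical type.
PARQUET_TO_DATACATER = {
    "ARRAY": "String",
    "BOOLEAN": "Boolean",
    "BYTES": "String",
    "DOUBLE": "Double",
    "ENUM": "String",
    "FLOAT": "Float",
    "INT": "Int",
    "LONG": "Long",
    "MAP": "String",
    "RECORD": "String",
    "STRING": "String",
}


def datacater_type(parquet_type, logical_type=None):
    """Look up the DataCater data type for a Parquet data type."""
    parquet_type = parquet_type.upper()
    if parquet_type == "FIXED":
        # FIXED maps to Double for the decimal logical type, otherwise to String.
        return "Double" if logical_type == "decimal" else "String"
    return PARQUET_TO_DATACATER[parquet_type]


def with_file_name(record, file_name):
    """Return a copy of the record extended with the source file name."""
    return {**record, "__datacater_file_name": file_name}


print(datacater_type("LONG"))              # → Long
print(datacater_type("FIXED", "decimal"))  # → Double
print(with_file_name({"id": "1"}, "orders_2024.csv"))
# → {'id': '1', '__datacater_file_name': 'orders_2024.csv'}
```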
