Changelog
2022.10
- October 17th, 2022
New features:
- PostgreSQL source connector: Allow to exclude columns from data extraction
- Google Cloud BigQuery sink connector: Allow to choose table name for auto-managed tables
Bug fix:
- PostgreSQL source connector: Support decimal and numeric fields that specify precision and scale
2022.09
- September 13th, 2022
New features:
- Upgrade Java Kubernetes client to version 15.0.1
- PostgreSQL source connector: Support array columns
2022.08
- August 30th, 2022
New features:
- REST/HTTP source connector: Support offset-based pagination
- HubSpot sink connector: Support private apps
- PostgreSQL source connector: Allow to override SELECT statement used for snapshotting
- Python transforms: Provide line number for exceptions
- Reduce number of calls to Kubernetes API
2022.07
- July 15th, 2022
New features:
- REST/HTTP sink connector: Support OAuth2 and time-restricted access tokens
- Kubernetes: Specify resource limits and requests for pipeline pods
New connectors:
2022.06
- June 15th, 2022
New features:
- Introduce Streams resource
- HubSpot sink connector: Support custom objects
- Support flat file uploads of up to 250MB
- REST/HTTP source connector: Support arrays as records
- Kubernetes: Run platform service as Stateful Set for HA
Bug fixes:
- Assign pipeline deployments to correct node pool, if a
node selector is configured
- RSS source connector: Fix full re-sync by producing correct tombstone record
2022.05
- May 9th, 2022
New features:
- Google Drive source connector: Support JSON and Google Sheet files
- REST/HTTP sink connector: Support PATCH verb
- Remove key
sourceSchema
from YAML descriptions
- Attach YAML descriptions to pipeline deployments
- Add Python module
beautifulsoup4
to Python transforms
2022.04
- April 5th, 2022
New features:
- Allow to sync DELETE events to REST sink
- Support time(stamp) values with precision
- Error page for JavaScript errors
- Kubernetes deployment improvements
- Profile integers in JSON sources as longs
Bug fixes:
- Autocreate BigQuery schema with attributes from joined data source
- Get rid of ghost indices created by Elasticsearch sink
2022.03
- March 8th, 2022
New features:
- Introduce self-service sign ups
- When fetching table or column names fails for a data source or data sink, show the exact error message in the frontend.
- Allow to create new pipelines from the detail page of a data source
- Remove projects from top-level navigation
- Show welcome guide to new users
Bug fixes:
- Fix sampling of headerless CSV files in the data
source connectors for AWS S3, FTP/SFTP, Google Cloud Storage, and Google Drive.
2022.02
- February 7th, 2022
New features:
- Allow to export pipelines in declarative YAML format
- Move main navigation to the top
- Rely only on the Kafka Connect API for monitoring health of data
sources/sinks
- Google Cloud Storage source connector: Extend CSV parsing options
Bug fixes:
- Fix inconsistency in casting strings to time/timestamp objects in the pipeline designer's preview
- Propagate updates in data source configs to join connectors of consuming pipelines
2022.01
- January 7th, 2022
New features:
- Allow to duplicate pipelines with one button click.
- Improve debugging of failed Python transformations by
providing information on the pipeline step and attribute,
where the Python transformation failed, in the deployment log.
- Unify health checks of data sources/sinks and the associated Kafka Connect connectors.
- Allow to configure SSL/TLS-related settings of the mailer through environment variables of the platform container.
- Allow to manually recreate source connectors from the UI.
- RSS source connector: Support the enclosures tag.
- REST/HTTP source connector: Support HTTP request headers with comma-separated lists as values.
- Reduce sensitivity when detecting the health of the connectors to reduce alert fatigue.
- Rename pipeline steps to transformation steps to use a consistent naming.
- SFTP source connector: Support password-based authentication.
- FTP/SFTP source connector: Support headerless CSV files.
Bug fixes:
- PostgreSQL source connector: Support timestamp/time values with time zone information.
- PostgreSQL source connector: Fix out-of-memory errors occasionally happening when extracting data from big tables.
- Google Cloud BigQuery source connector: Escape data set and table name.
- Fix bug in detecting health of connectors: When a connector throws an exception while a connection test is being performed, it can be considered as failed.
- FTP/SFTP source connector: Fix bug in extracting attribute names from CSV header row.
2021.12
- December 1st, 2021
New features:
- PostgreSQL source connector: Add support for connecting via a JDBC driver. This allows us to support PostgreSQL installations, which do not offer logical replication.
- REST/HTTP source connector: Allow to treat empty strings as NULL values.
- FTP source connector: Fix connection issue.
New connectors:
Bug fixes:
- FTP source connector: Fix connection issue.
- Google Drive source connector and Google Cloud Storage source connector: Fix bug in detecting primary keys, when auto-generating attribute names.
2021.11
- November 5th, 2021
New features:
- In the configuration forms of source and sink connectors, explain the reasons for a failed connection tests.
- For the FTP/SFTP source connector, support full re-syncs.
- For the MySQL source connector, allow to manually specify the primary key column.
- For the REST source connector, allow to specify the timeout of HTTP requests.
New connectors:
Bug fixes:
- Fix UI bug in managing notification settings of project members.
- For the CSV source connector, fix a bug in specifying the number of to-be-skipped lines..
2021.10
- October 11th, 2021
New features:
- For the BigQuery sink connector, support automatic creation of tables.
- Do no longer require to re-enter the password when
performing a test connection for an existing data source or
data sink.
- For the REST source connector, support UNIX timestamps for capturing change information.
- For the REST source connector, use
1970-01-01T00:00:00Z. as default value for the initial
timestamp offset.
- In DataCater Self-Managed, allow to configure Kafka-related settings via the
following environment variables: KAFKA_TOPICS_CLEANUP_POLICY,
KAFKA_TOPICS_PARTITIONS, KAFKA_TOPICS_REPLICATIONFACTOR,
KAFKA_TOPICS_RETENTION_BYTES, and
KAFKA_TOPICS_RETENTION_MS.
New connectors:
2021.09
- September 1st, 2021
New features:
- Advanced validation of data source and data sink configs.
- Support sending health notifications to Slack channels (only available for projects).
- Bump maximum event size from 1MB to 3MB.
- Improve management of project settings. For instance,
project admins can now manage the individual notification settings of project
members.
- Support automated handling of time-based access tokens for REST endpoint data source.
- Allow to drop primary key columns in pipelines.
- Redirect to project page after deleting project resources, like pipelines.
2021.08
- August 2nd, 2021
New features:
- When creating or editing a connector, show the link to its
documentation.
- For flat files, automatically create the attribute __datacater_file_name and fill it
with the name of the flat file.
- Truncate log messages longer than 5,000 characters.
- Enforce UTF-8 encoding for FTP/SFTP source connector.
- For REST data source, allow to extract records from a JSON
object, in addition to a JSON array. Treat all object keys as
separate records.
New connectors:
Bug fixes:
- Fix encoding bug in CSV/JSON/XML data sinks.
- Escape HTML tags in the deployment logs.
2021.07
- July 2nd, 2021
New features:
- Support timestamp formats without timezones for REST data
sources.
- Allow to use MySQL and PostgreSQL sinks in append-only
mode (configurable via configuration option insert mode).
- Show build time for deployments.
- Set default retention for Kafka topics to 1 day or 100MB.
- Install Python module feedparser.
New connectors:
Bug fixes:
- Treat NUMERIC and DECIMAL fields as doubles for the
JDBC-based MySQL source connector.
2021.06
- June 4th, 2021
New features:
- Allow to reset the offset of the sink connector on the
Deployments page.
- Provide the environment variable DATACATER_ENVIRONMENT to
user-defined transformation functions, which holds either
preview (Pipeline Designer)
or production (Deployment).
- In the AX Semantics sink connector, perform a commit of the sink connector offset after each processed record to prevent timeout issues.
- Allow specifying multiple collection IDs as a
comma-separated list for the AX
Semantics sink connector.
- Add the Python modules geopy and Shapely.
- Automatically create the attribute __datacater_file_name for flat file
sources, which contains the name of the uploaded CSV, JSON, or
XML file.
-
Show a maximum of 1,500 characters in the cells of the
pipeline designer to prevent performance issues with very
long text values.
-
Provide the timestamp variables now, today, tomorrow, dayAfterTomorrow, yesterday, and dayBeforeYesterday in the config options of the REST endpoint source connector.
Bug fixes:
- Fix bug in re-uploading flat files in the pipeline
designer.
2021.05
- May 3rd, 2021
New features:
-
Use HTML layouts for notification mails.
-
Allow to reset offsets of pipelines, which will skip all
unprocessed events in the Kafka source topic of the
pipeline.
-
Update Debezium-based connectors (MySQL source, PostgreSQL
source) to version 1.5.
-
Add the non-standard Python module pytz to the UDF runner.
-
Show only the last 200 lines of the deployment logs, by
default.
-
Validate CRON expressions provided in the config for the REST source
connector.
New connectors:
2021.04
- April 5th, 2021
New features:
-
Update user-defined transformations to Python 3.7.3
and pre-install the following non-standard Python modules:
langdetect, nested-lookup, nltk, numpy, requests, requests-cache, and spacy.
-
Allow users to unsubscribe from notifications in projects.
-
Support specifying sync intervals as CRON expressions for
the REST source connector.
-
Show the lag of the pipeline and the sink connector on the
deployments page:
-
The lag of the pipeline equals the number of records,
which have been extracted by the source connector but
have not yet been processed by the pipeline.
-
The lag of the sink connector equals the number of
records,
which have been processed by the pipeline but have not
yet been published by the sink connector.
-
Write errors of deployments to stderr.
-
Allow to filter attributes of the data sink while mapping a
pipeline to a data sink.
-
Simplify the parsing of XML files. We recommend to use
Python's xml.etree.ElementTree module, available in the user-defined
transformations, for parsing deeply-nested XML structures.
-
Support DELETE as HTTP method in the REST sink connector.
-
Skip unparseable DDL statements in the
MySQL source connector.
Bug fixes:
-
Using three double quotes (""") in user-defined
transformations leads to failures in creating
deployments.
-
Deleting pipelines might lead to a sign out of the user.
-
Deployment logs are not always correctly reset when
switching between pipelines.
-
Data sinks can be changed while a deployment is running.
2021.03
- March 2nd, 2021
New connectors:
New features:
-
Allow to manage pipelines, data sources, and data sinks in
projects. Projects enable collaboration in teams and allow users to share ressources with colleagues.
Projects can be created by DataCater admin users in the
admin UI. When adding members to a projects, one may choose
between the following three roles:
-
A Viewer gets read access to
all project ressources,
-
an Editor gets, in addition to read
access, also write access to
all projects ressources, but cannot neither delete ressources nor manage the project,
-
an Administrator can, in addition to the permissions of the Editor, delete ressources, manage project memberships, and administrate the project.
-
Simplify parsing of JSON data sources: Parse JSON arrays and
objects as strings.
-
Include the ID of a deployment in the name of the container
to ease pipeline-level monitoring. Containers are named
datacater_deployment-ID,
where ID equals the ID
of the deployment.
2021.02
- February 1st, 2021
New features:
- Improve navigation of the Pipeline Designer.
-
User-defined transformations can take the whole record, provided as a Python Dictionary object, as second parameter:
def transform(value, row):
return value.replace("###name###", row["name"])
-
Add failure reason to notification e-mails about failed connectors to speed up debugging.
-
Allow to configure whether source or sink connector shall
be automatically restarted in case of failures. This
configuration option is enabled by default and can be changed by
editing the respective data source or data sink.
Bug fixes:
-
Fix bug in applying Replace with attribute transformation
to date, time, and timestamp values.
2021.01
- January 4th, 2021
New connectors:
New transformation functions:
New features:
-
Improve interactivity of creating pipelines.
-
Improve monitoring of Kafka Connect connectors.
-
Send notification via e-mail when a pipeline source or sink
connector fails.
Bug fixes:
-
Fix bug in deleting pipeline sink connectors.
2020.12
- December 1st, 2020
New features:
-
Allow editing data sources and data sinks without
manually re-entering the password.
-
Trim hostnames, database names, table names of data
sources and data sinks to sanitize user input.
-
MySQL source connector: Do not monitor the schemas of tables
other than the monitored one. As a consequence, when
changing the table name of a MySQL data source, the data
processing of all consuming pipelines must be manually reset
for re-fetching the schema of the new table.
-
Show dedicated error page, when accessing unavailable
resources, e.g., data sources, or pipelines.
2020.11
- November 2nd, 2020
New connectors:
New features:
-
Add support for left outer and inner joins.
-
When starting the DataCater application, automatically
restart all pipeline containers that are still marked as
running. This is helpful in situations, where the DataCater
application is being restarted after not being shutdown gracefully (e.g., power outage).
-
Automatically drop used PostgreSQL replication slots
once they are no longer needed, i.e., when deleting the
pipeline.
-
Allow naming deployments.
-
Add widget showing running pipelines to start page.
-
Allow providing custom primary keys for flat file sources (CSV,
JSON, and XML).
-
Publish attributes of type timestamp
with milliseconds precision to data sinks.
Bug fixes:
-
Fix bug in parsing primary keys from MySQL: Columns with a
uniqueness constraint were falsely detected as primary keys.
2020.10
- October 1st, 2020
New connectors:
New features:
-
If the profiling of a data sink, which is performed when
assigning a certain data sink to a pipeline, fails show the
failure message in the UI.
-
Support configuration of the logical replication plugin to
be used for the PostgreSQL source
connector.
-
Support configuration of the server
timezone for the MySQL data source
connector and the MySQL data sink
connector.
-
Show deletions of data sources, data sinks, and pipelines in
the activity stream.
-
Do not empty the data sink when resetting the data
processing.
In most cases, this does not change anything, because we use
upserts and simply overwrite already-processed records.
If you made changes to the primary key between the first
execution of a data pipeline and the reset of the data
processing, you may need to
manually remove data from the data sink before resetting the
processing.
-
Show the processing status of ingested CSV and JSON files.
-
Manually create Kafka topics for data pipelines to support
pipeline-level settings for Kafka configuration options, such as the replication
factor.
-
Provide more information in logs of running data
pipelines.
Bug fixes:
-
Fix bug in retrieving the schema from a BigQuery table,
where the schema was edited after the initial creation.
-
Fix bug in processing date, time, and timestamp fields with
the PostgreSQL
sink connector.
2020.09
- September 1st, 2020
New connectors:
New features:
- Allow user-defined transformation functions to take
another attribute as a parameter.
- Use chunked transfer encoding for serving flat file sinks, which strongly improves the handling of large data sets.
- Return empty files when trying to download empty flat file
sinks.
- Allow retrieving available tables for MySQL data source,
PostgreSQL data source, MySQL data sink, PostgreSQL data sink,
and BigQuery data sink.
- Improve connection test for PostgreSQL sink.
- Improve handling of date, time, and timestamp
attributes.
- Allow more characters in attribute names: numbers, whitespaces, and German umlauts.
- Show error message when building, starting, or stopping a
deployment fails.
- Validate attribute names.
- Show the current release name of DataCater in navigation.
Bug fixes:
- Fix bug in persisting data sink mapping.
- Fix bug in reassigning data sink connectors to pipeline.
2020.08
- August 3rd, 2020
New features:
- Improve internal management of Kafka Connect connectors.
- Improve layout of the admin interface for managing user
accounts.
- Remove dependency on Elasticsearch for the storage of sample data.
- Add support for managing flat files as regular data
sources.
- Improve visual feedback for successful uploads of flat
files.
- Move management of deployments to Pipeline Designer.
- Replace calls to window.alert() with modern modal
dialogs.
- Add health check for data sources and data sinks.
Bug fixes:
- Fix bug in loading sample records after creating new
pipelines.
2020.07
- July 1st, 2020
Initial release of DataCater! 🥳