Code Transformations

DataCater offers code-based (or user-defined) transformation functions as a powerful and flexible way to implement custom data preparation logic.

Code-based transformations can be implemented in Python and applied to attributes of the data types boolean, double, float, int, long, or string.

They are implemented as a Python function, which takes two parameters:

The first parameter is the value of the attribute that the function is applied to.
The second parameter is the entire row provided as a Python dict, which lets you access all attributes of the data set by name.

The following code listing shows the structure of a code-based transformation function:

def transform(value, row):
  return value

While you may choose arbitrary names for the parameters (default: value and row), the function itself must be called transform.

The returned value must be of the same type as the attribute that the transformation function is applied to.
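As an illustration of the type rule above, the following minimal sketch is applied to an attribute of type double and returns a float again. The attribute name quantity is hypothetical, and the check for None assumes that missing values may be passed to the function, which the documentation does not state:

```python
# value is a double (hypothetical attribute, e.g. a unit price)
def transform(value, row):
  # Assumption: missing values may arrive as None; pass them through unchanged.
  if value is None:
    return value
  # "quantity" is a hypothetical int attribute of the same row.
  total = value * row["quantity"]
  # Cast explicitly so that the result stays a float, matching the attribute type.
  return float(total)
```

Casting explicitly guards against intermediate results of another type (for instance, an int) being returned by accident.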

Example

The following user-defined transformation function is applied to an attribute of type string and replaces each occurrence of the placeholder ###name### with the value of the attribute called name:

# value is a string
def transform(value, row):
  return value.replace("###name###", row["name"])
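
Because a transformation is a plain Python function, it can be tried out locally before it is added to a pipeline. The sketch below calls the function from the example with a hand-written row; the attribute greeting and the sample values are made up for illustration:

```python
# value is a string
def transform(value, row):
  return value.replace("###name###", row["name"])

# Hypothetical sample row; in DataCater, the row is provided by the pipeline.
sample_row = {"name": "Ada", "greeting": "Hello, ###name###!"}

# Apply the function to the attribute "greeting" of the sample row.
result = transform(sample_row["greeting"], sample_row)
print(result)  # Hello, Ada!
```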

Python version

The current release of DataCater ships with Python 3.7.3 and uses vanilla CPython.

Python modules

At the moment, the following non-standard Python modules are available in code-based transformations:

Example usage of Python module

The following function uses the langdetect module to automatically detect the language of a string value:

from langdetect import detect

# value is a string
def transform(value, row):
  return detect(value)

Timeouts

We use the following execution timeouts for code-based transformation functions:

For previews in the pipeline designer, they have a timeout of 10 seconds.
In deployments, they have a timeout of 60 seconds.
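
One possible way to stay within these timeouts is to avoid recomputing expensive results for values that have already been seen. The following sketch caches results with functools.lru_cache from the standard library. It assumes that module-level state is kept between function calls, which is not confirmed here, and normalize is a hypothetical stand-in for a costly, deterministic computation:

```python
from functools import lru_cache

# Hypothetical helper standing in for an expensive, deterministic computation.
@lru_cache(maxsize=1024)
def normalize(text):
  # Lowercase the text and collapse runs of whitespace into single spaces.
  return " ".join(text.lower().split())

# value is a string
def transform(value, row):
  # Repeated values are served from the cache instead of being recomputed.
  return normalize(value)
```

Caching only helps when the computation depends solely on its arguments and these arguments are hashable, as is the case for strings.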

Execution environment

In some cases, you may not want to execute all code in the pipeline designer to keep previews interactive and fast.

Using the environment variable DATACATER_ENVIRONMENT, you can distinguish between the pipeline designer and deployments. In the pipeline designer, it is set to preview, while in deployments it is set to production.

The following code listing shows how to access the environment variable:

import os

env = os.environ['DATACATER_ENVIRONMENT']

def transform(value, row):
  if env == "production":
    # execute complex code block
    return value
  else:
    # fast computations to keep previews interactive
    return value
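
When the same function is also run outside of DataCater, for instance in a local test, DATACATER_ENVIRONMENT may not be set and os.environ['DATACATER_ENVIRONMENT'] raises a KeyError. A minimal variant, assuming that falling back to preview is acceptable in that situation, reads the variable with a default. The helper current_environment and the string clean-up in the production branch are illustrative only:

```python
import os

# Hypothetical helper: returns the environment name and falls back to
# "preview" when DATACATER_ENVIRONMENT is not set (e.g. in local tests).
def current_environment(default="preview"):
  return os.environ.get("DATACATER_ENVIRONMENT", default)

# value is a string
def transform(value, row):
  if current_environment() == "production":
    # Stand-in for a more complex code block.
    return value.strip().lower()
  else:
    # Fast path to keep previews interactive.
    return value
```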

Limitations

At the moment, code-based transformations have the following known limitations:

They cannot change the data type of an attribute.
