Using Python vs. SQL for Data Pipelines

Learn how to decide between Python and SQL for building data pipelines.

By Michael Aboagye

Businesses employ data pipelines as the glue between their IT systems. They use data pipelines to (1) retrieve information from data source systems, (2) transform, filter, and aggregate the data, and (3) publish the prepared data to data sink systems or directly make them available to consumers.

Data-driven organizations use data pipelines for a plethora of use cases; the following are popular examples:

  • Real-time market intelligence: Streaming data pipelines enable businesses to analyze market trends and act on customer needs in real time, giving them a competitive edge.
  • Customer success: Knowing whether customers are satisfied with your products is an essential litmus test for companies. Businesses rely on real-time data pipelines to gather customer feedback and complaints as they arrive and to provide instant answers or solutions where possible, potentially improving customer retention.
  • Fraud detection: Banks and financial institutions use streaming data pipelines to detect fraudulent activities in real time and employ countermeasures instantly.

Today’s most popular programming languages for implementing the transformation stage of data pipelines are SQL and Python. In this article, I take a look at the differences between the two languages in the context of data pipelines and discuss when to choose which approach, a decision mainly driven by the use case at hand.

Implementing data pipelines with SQL

The Structured Query Language (SQL) is probably the most popular interface to relational database systems. Originally developed for querying data stored in the relational model, it has become increasingly popular for modeling and transforming data in recent years. SQL is rather easy to learn, very precise when it comes to joining or aggregating data, and completely declarative, giving the underlying execution engine much room for optimizing the data pipeline. Popular technologies for defining data pipelines with SQL are dbt, Flink SQL, and ksqlDB.

Implementing data pipelines with Python

Often described as the lingua franca of the data community, Python is a precise, easy-to-learn programming language with an excellent ecosystem for working with data. It provides a rich set of modules and libraries for performing data preparation tasks, such as validating credit card numbers or detecting the language of a text. While frameworks may provide interfaces for aggregating and joining data, Python is naturally very strong when it comes to transforming or filtering data in a precise manner. Exemplary Python-based data pipelining technologies are Apache Airflow, Kubeflow, and DataCater.
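To illustrate the kind of data preparation task mentioned above, here is a minimal sketch of validating credit card numbers in plain Python using the Luhn checksum (in practice, a dedicated library would typically be used instead):

```python
def luhn_valid(number: str) -> bool:
    """Validate a card number with the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result
    # has two digits, then sum everything up.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_valid("4242424242424242"))  # → True (a well-known test number)
```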

Best data pipeline use cases for SQL and Python

While in general both SQL and Python are suitable for implementing the transformation stage of data pipelines, each is best suited for different use cases:

Aggregating and joining data

SQL supports aggregating and joining data natively, using keywords such as JOIN and GROUP BY, while Python requires a custom implementation or the use of frameworks.

Let us assume that we have a data set with warehouse items. We could define the following SQL query to count the number of in-stock items for each item category:

SELECT category_name, COUNT(*) FROM warehouse_items GROUP BY category_name;
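For comparison, the same per-category count has to be written by hand in plain Python, for instance with collections.Counter (the records below are a toy stand-in for the warehouse_items table from the SQL example):

```python
from collections import Counter

# Toy stand-in for the warehouse_items table
warehouse_items = [
    {"category_name": "tools", "price": 120},
    {"category_name": "tools", "price": 80},
    {"category_name": "garden", "price": 150},
]

# Equivalent of: SELECT category_name, COUNT(*) ... GROUP BY category_name
counts = Counter(item["category_name"] for item in warehouse_items)
print(counts)  # → Counter({'tools': 2, 'garden': 1})
```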

Filtering data

Data pipelines often need to trim down a data set to a subset of interest. While SQL provides the WHERE clause for filtering data, Python allows us to define completely custom filters while potentially making use of modules.

SQL might be sufficient when dealing with standard filters, such as selecting only warehouse items from the price range USD 100 to USD 200:

SELECT * FROM warehouse_items WHERE price BETWEEN 100 AND 200;

When it comes to coping with more advanced filtering requirements, Python is the clear winner because it allows us to implement custom filtering logic. Let us assume that we want to filter for in-stock warehouse items and need to send an HTTP request to an API to check the stock status. While this cannot be implemented with SQL, we could use the Python library requests to handle the requirement in a few lines of code.
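A sketch of such a custom filter is shown below. The is_in_stock function is a hypothetical stand-in for the HTTP lookup; a real pipeline would replace its body with, for instance, a requests.get call against the inventory API:

```python
def is_in_stock(item_id: int) -> bool:
    # Hypothetical stand-in for an HTTP call, e.g. with requests:
    #   requests.get(f"https://inventory.example.com/items/{item_id}").json()["in_stock"]
    return item_id % 2 == 0  # dummy logic for illustration

warehouse_items = [
    {"id": 1, "name": "hammer"},
    {"id": 2, "name": "drill"},
    {"id": 3, "name": "saw"},
]

# Keep only the items the (stubbed) API reports as in stock
in_stock_items = [item for item in warehouse_items if is_in_stock(item["id"])]
print([item["name"] for item in in_stock_items])  # → ['drill']
```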

Transforming data

Typically, SQL is quite limited when it comes to transforming data. It provides only a few transforms, e.g., for changing the type of fields. While SQL-based products, like Flink SQL or ksqlDB, offer user-defined functions as a means of extending SQL’s transformation capabilities, they are typically very complicated to use in practice.

Python is much more flexible and allows us to build custom transformations in a few lines of code. Furthermore, it lets us use non-standard modules to, for instance, enrich data with the results of API calls.
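As a minimal sketch of such a custom transformation, assuming records arrive as dictionaries: the function below casts the price field to a float and derives a new price_band field, a piece of branching logic that is cumbersome to express in plain SQL:

```python
def transform(record: dict) -> dict:
    """Cast the price to float and derive a price band."""
    price = float(record["price"])
    band = "budget" if price < 100 else "mid-range" if price < 200 else "premium"
    # Return a new record instead of mutating the input
    return {**record, "price": price, "price_band": band}

print(transform({"name": "drill", "price": "149.90"}))
# → {'name': 'drill', 'price': 149.9, 'price_band': 'mid-range'}
```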


Summary

Data pipelines are an essential piece in the architecture of modern, data-driven organizations. They integrate different IT systems and enable businesses to turn raw data into value.

In this article, I discussed the traits and differences of using SQL and Python for implementing data pipelines. The general recommendation is that SQL excels for aggregating or joining data sets in data pipelines, while Python is much more flexible for implementing custom filters or transformations.