Community Meetup #3: Streaming and Batching with DataCater and dbt

Check this recap for a short summary of the third DataCater community meetup.

By Jonas Zech

On February 8, we were happy to host our third community meetup. Here is a short recap of what happened.

In our bimonthly community meetup, we connect with our users and extended community to chat about all things data. Olivier, responsible for Partnerships and Community at DataCater, moderated the meetup. Our CTO Hakan shared insights into the latest product developments and the future roadmap. The highlight of the meetup was the guest talk: Dr. Tim Jungnickel (Head of Engineering, Native Instruments) presented the data infrastructure he and his team built from scratch.

Streaming and Batching with DataCater and dbt (Dr. Tim Jungnickel)

Tim started his presentation with a pitch of what Native Instruments, a Berlin-based company with over 500 employees, is about. Native Instruments' core business includes DJ software, hardware for producing music, and, most importantly, software instruments. He gave a cool example: digitizing the Stradivari violin, one of the most exclusive violins in the world. Without Native Instruments, most people would never have the chance to use the sound of this rare violin in their tracks. With Native Instruments' software, everyone can play exclusive instruments, like the Stradivari violin, digitally and with high sound quality. A really cool business case! In 2018, Tim and his team set out to build a new, cloud-based data infrastructure for Native Instruments to better support their e-commerce activities with data.

While they were able to build a cloud-based data architecture on Google Cloud Platform, most of their data sources were still hosted on-premises. Their first approach to connecting their on-premises PostgreSQL database with their BigQuery-powered data lake on Google Cloud was to load database dumps from the on-premises database into a Google Cloud SQL instance. In a second step, the data was loaded from Cloud SQL into BigQuery. They chose Apache Airflow to orchestrate these batch jobs. This setup was very fragile and slow, taking more than six hours for a single batch load, so they needed something more efficient: DataCater.
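To make the batch setup more tangible, here is a minimal sketch of what such an Airflow-orchestrated two-step load could look like. The bucket, Cloud SQL instance, database, dataset, and connection names are made up for illustration; the actual pipelines at Native Instruments were certainly more involved.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hypothetical resource names, for illustration only.
    BUCKET = "gs://example-dumps"
    CLOUDSQL_INSTANCE = "example-cloudsql"
    DATABASE = "shop"

    with DAG(
        dag_id="onprem_to_bigquery_batch",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Step 1: dump the on-premises PostgreSQL database and stage it in Cloud Storage.
        dump = BashOperator(
            task_id="dump_onprem_postgres",
            bash_command=(
                f"pg_dump --no-owner {DATABASE} | gzip | "
                f"gsutil cp - {BUCKET}/{DATABASE}.sql.gz"
            ),
        )
        # Step 2: import the staged dump into the Cloud SQL instance.
        load_cloudsql = BashOperator(
            task_id="import_into_cloudsql",
            bash_command=(
                f"gcloud sql import sql {CLOUDSQL_INSTANCE} "
                f"{BUCKET}/{DATABASE}.sql.gz --database={DATABASE} --quiet"
            ),
        )
        # Step 3: copy the data from Cloud SQL into BigQuery via a federated query.
        load_bigquery = BashOperator(
            task_id="load_into_bigquery",
            bash_command=(
                "bq query --use_legacy_sql=false "
                "'CREATE OR REPLACE TABLE analytics.orders AS "
                "SELECT * FROM EXTERNAL_QUERY(\"example.eu.cloudsql-conn\", "
                "\"SELECT * FROM orders\")'"
            ),
        )

        dump >> load_cloudsql >> load_bigquery

Even with Airflow handling the orchestration, every run of a pipeline like this moves full table contents, which is where runtimes of several hours and the fragility come from.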

DataCater makes change data capture (CDC) and streaming data pipelines accessible to data teams. Using DataCater, they were able to transfer only the relevant, changed data in a streaming manner. That is, of course, much more efficient and leads to more predictable and consistent resource usage. The latter is beneficial on cloud platforms, since cloud providers typically offer discounts for usage commitments.
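For readers unfamiliar with CDC, the sketch below illustrates the general idea using PostgreSQL logical replication with psycopg2: instead of re-reading whole tables, a consumer subscribes to a replication slot and receives only the individual change events. This is a generic illustration of the concept, not DataCater's implementation; the host, database, and slot names are made up.

    import psycopg2
    from psycopg2.extras import LogicalReplicationConnection

    # Hypothetical connection details, for illustration only.
    conn = psycopg2.connect(
        "host=onprem-db dbname=shop user=replicator",
        connection_factory=LogicalReplicationConnection,
    )
    cur = conn.cursor()

    # A replication slot (created once, e.g. with the wal2json output plugin)
    # tracks which changes have already been consumed.
    # cur.create_replication_slot("cdc_demo", output_plugin="wal2json")

    # Stream decoded change events (inserts, updates, deletes) from the slot.
    cur.start_replication(slot_name="cdc_demo", decode=True)

    def consume(msg):
        # msg.payload describes one batch of row-level changes.
        print(msg.payload)
        # Acknowledge the change so the database can recycle its WAL segments.
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(consume)

Because only deltas travel over the wire, the load on the source database and the network grows with the change rate rather than with the table size, which is what makes resource usage so much more predictable.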

Check out the recording of his great presentation:

Big thanks to every attendee; we are always happy to see new names in our community calls.

And, of course, thanks to our speakers Olivier, Hakan, and Tim for their presentations. It was a great meetup.

Our next community meetup will take place in two months. Follow our LinkedIn profile to keep track of our announcements.