
Supercharging dbt: performance, working at scale & trade-offs (1/3)

Become a dbt wizard!
As of 2025, dbt has become the go-to for data transformations. But as teams grow and pipelines expand, new challenges start to emerge: performance bottlenecks, long runtimes, and rising costs.

In this series of articles, I’ll explore what breaks at scale, and how to fix it.

Over the past few years, dbt has become the go-to standard for data engineering and analytics. And for good reason. dbt Labs, the company behind the product, was the first to introduce a structured SQL-based framework to manage data transformations — something the modern data stack desperately needed.

Before dbt, every company built their own tools

Before dbt came along, most companies built custom transformation frameworks—typically a mix of Airflow, raw SQL scripts, and internal best practices. These setups worked… to an extent. But they were often brittle, not very flexible, and far too technical for analysts to contribute to without heavy engineering support.

Then came dbt. It bridged the gap between engineering and analytics, making SQL workflows more approachable, version-controlled, and testable. Naturally, it took off.

The new problem: the illusion of governance

But here’s what I’ve observed throughout my freelance career: as much as dbt enables teams, it also often becomes a mess.

Once it’s in place, dbt projects can quickly spiral into a spaghetti bowl of models — unmaintained, undocumented, and often disconnected from real business needs. Governance falls apart. Ownership blurs. You end up with hundreds (sometimes thousands) of models that no one understands anymore.

In short: dbt can become a liability if not handled carefully.

I still love dbt — but it needs discipline

Let me be clear: I love dbt. The core idea of providing a standardized transformation framework, so that every company stops reinventing the wheel, is brilliant. As a data practitioner, I don’t want to write the same logic over and over again. And dbt helps avoid that.

But loving a tool doesn’t mean ignoring its flaws.

The two million-dollar questions are:

  • How do we scale dbt without losing control?
  • How do we manage thousands of models while keeping delivery sharp, governance clear, and business alignment intact?

That’s what this series of articles is about

With this article, I begin a series of three that will be my attempt to explore those questions. I don’t pretend to have all the answers. But if you come away from this article with even one useful idea for managing dbt better, then it’s done its job.

Let’s dive in.


This series is divided into 3 parts to keep things easy to read:

I. Overclocking dbt [*] (1/3)
II. Keeping quality as high as possible while maintaining governance, (2/3)
III. Interfacing dbt & Airflow (3/3)

Here, we'll focus on the first part - Overclocking dbt.


[*] Kudos to Chris Dong for coming up with this formulation, here's his super interesting article:

Overclocking dbt: Discord’s Custom Solution in Processing Petabytes of Data
Explore how Discord supercharged dbt with a tailored solution designed for performance, developer productivity, and data quality.

I. Overclocking dbt

Let’s start with a reality check: for the vast majority of companies, think 90 to 95%, building a fully custom solution like Discord’s simply isn’t realistic. And frankly, it’s not necessary.

In most cases, dbt already covers everything you need out of the box - you just need to tweak it a bit.

The 4 biggest dbt performance killers (and how to fix them)

1 - Full table scans

As always, avoid SELECT * at all costs:

-- ❌ Avoid this - even though dbt Labs clearly pushes for this syntax

with 
  cte as (
    select * from {{ ref('some_table') }}
  )

select * from cte

-- ✅ Do this instead
with 
  cte as (
    select 
      column1
      , column2
      , column3
    from
      {{ ref('some_table') }}
    where 
      some_filter
  )

select * from cte

Optimize your table scans

2 - Full instead of incremental tables

To me, the main dbt performance killer is dbt Labs' own advice to use full tables instead of incremental tables.

I don't agree with this idea. Working with incremental models is not always more complicated - especially when you handle the whole lifecycle of data (ingestion to fact / dim tables) - and it certainly brings more benefits: cost reduction, faster pipelines, easy and cost-effective backfills, etc.

If you're a traditional dbt practitioner, you probably won't like this next sentence – which is the essence of the approach I propose:

📌
No event-type model should ever be run without any time filtering logic

Note: this is not always true for dimension-type tables.

That's it. It's not fancy – there's no AI in there – but it works well. To decrease your processing costs and time, you need to reduce the amount of data you're moving.

Seems basic, right? In my experience, not so much.

Now, incremental tables require more team alignment than full tables – they certainly don't work out of the box, and can become messy if not done right.

This raises the question: how should you set up your dbt incremental logic? Well, let's dive in.

  • Capitalize on dbt vars,

To make incremental models work consistently across your project, you need a pattern for time-based filtering. That's where vars come in.

What's great about dbt is how easily you can add vars to your pipelines. For instance:

select
  column1
  , column2
  , column3
  , column4
  , column5
  , ingestion_date
  , updated_date
from
  {{ source('dataset', 'source') }}
where
    ingestion_date 
      between {{ var('event_time_start') }} and {{ var('event_time_end') }}

your_model.sql

You can then use those vars in your dbt invocation. Since the vars are injected as raw SQL, pass literal dates as SQL date expressions (BigQuery syntax here):

dbt run -s your_model --vars '{event_time_start: "date(2025, 1, 1)", event_time_end: "date(2025, 2, 1)"}'

Invoke dbt with vars

My advice would be to wrap your data sources with these variables (always the same variables for tables at the same grain). This effectively prevents anyone from using a data source without some kind of date filtering logic.
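
One way to enforce this, as a minimal sketch, is a small wrapper macro around source() that always applies the time window. The macro name filtered_source, its arguments, and the default ingestion_date column are hypothetical - adapt them to your project:

-- Hypothetical helper: wraps source() so the time window is always applied
-- (the calling model still selects explicit columns; columnar engines prune unused ones)
{% macro filtered_source(dataset, table, date_column='ingestion_date') %}
    (
        select *
        from {{ source(dataset, table) }}
        where {{ date_column }}
            between {{ var('event_time_start') }} and {{ var('event_time_end') }}
    )
{% endmacro %}

filtered_source.sql (hypothetical)

Models would then call {{ filtered_source('dataset', 'source') }} instead of {{ source('dataset', 'source') }}, and a code-review rule (or a simple CI grep) can forbid direct source() calls in event models.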

Then, I would set up defaults in the dbt_project.yml:

vars:
  event_time_start: "current_date() - 7"
  event_time_end: "current_date()"

dbt_project.yml

Depending on your warehouse, you may have to adapt the syntax here.
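
For instance, on Snowflake the same defaults could be written with dateadd (a sketch - adjust to your warehouse's date functions):

vars:
  event_time_start: "dateadd(day, -7, current_date)"
  event_time_end: "current_date"

dbt_project.yml (Snowflake example)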

  • Handle interval logic based on env

If you want to go even further, you could also set up the variables like so (still in the dbt_project.yml):

vars:
  event_time_end: "current_date()"
  event_time_start: "{{ get_default__event_time_start(var('event_time_end')) }}"

dbt_project.yml

This would use the macro get_default__event_time_start, which could decide on a time interval based on your environment:

{% macro get_default__event_time_start(event_time_end=None) %}
  {% if event_time_end is none %}
    {% do exceptions.raise_compiler_error("event_time_end cannot be None") %}
  {% endif %}

  {% if target.name == "dev" %}
    {% set interval = 7 %}
  {% elif target.name == "preprod" %}
    {% set interval = 60 %}
  {% else %}
    {# prod, and any other target #}
    {% set interval = 90 %}
  {% endif %}

  {% do return(event_time_end | string ~ " - " ~ interval | string) %}
{% endmacro %}

get_default__event_time_start.sql

This macro is equivalent to having:

event_time_start: "current_date() - 7"
event_time_end: "current_date()"

On dev

event_time_start: "current_date() - 60"
event_time_end: "current_date()"

On preprod

event_time_start: "current_date() - 90"
event_time_end: "current_date()"

On prod

I would advise setting this up only on dev / preprod, as it can lead to unforeseen issues in prod (for example, forgetting to set your vars explicitly).

These macros effectively deter your team from working with too much data in a given environment, while keeping things flexible and hassle-free.
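
If you followed that advice and set no prod defaults, you can also fail fast with a clear message when the vars are missing. Here's a sketch - the macro name assert_event_window is hypothetical:

{% macro assert_event_window() %}
  {# Assumes no prod defaults exist in dbt_project.yml, so missing vars should fail loudly #}
  {% if target.name == "prod" and (var('event_time_start', none) is none or var('event_time_end', none) is none) %}
    {% do exceptions.raise_compiler_error(
      "event_time_start / event_time_end must be passed explicitly with --vars on prod"
    ) %}
  {% endif %}
{% endmacro %}

assert_event_window.sql (hypothetical)

Calling {{ assert_event_window() }} at the top of your event models (it renders to nothing) turns a silent full scan into an explicit compilation error.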

3 - Use views for easy debugging

Barring any performance issues in your pipeline, you should use views instead of ephemeral models for easy debugging.

Of course, when your pipeline takes too much time or I/O, it may be a good idea to look at the materializations between your source and final table.

For instance, consider this pipeline:

Not optimized dbt lineage

You may want to materialize some views into tables to tackle performance issues:

Previous pipeline optimized

The official dbt recommendation is that whenever you're chaining more than 4 views, you'd probably gain performance by materializing some of them as tables.

In my personal experience, it depends on the volume of data you're working with.

Test things out and check the associated query profiles – it may surprise you!
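
As a minimal sketch (the model and column names are hypothetical), switching a hot intermediate model from the default view materialization to a table is a one-line config change:

-- Materialize as a table when too many chained views slow the pipeline down
{{ config(materialized='table') }}

select
  order_id
  , customer_id
  , order_amount
from
  {{ ref('stg_orders') }}

int_orders_enriched.sql (hypothetical)

You can also keep view as the folder-level default in dbt_project.yml and override only the models that actually need it.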

4 - Set up incremental materialization in your final tables

Depending on your cloud provider and adapter, you may (or may not) have access to certain incremental strategies:

About incremental strategy | dbt Developer Hub
Incremental strategies for materializations optimize performance by defining how to handle new and changed data.

See the dbt documentation for more information.

Available incremental strategies based on your cloud provider

Here's how you could set this up in your project on BigQuery:

{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = ['primary_key'],
    incremental_predicates = [
        "DBT_INTERNAL_DEST.ingestion_date between " ~ var('event_time_start') ~ " and " ~ var('event_time_end')
    ]
) }}
-- The incremental predicate restricts the merge scan of the existing table to the considered interval

final_table.sql

This config makes sure that you only scan the data you need to scan and only merge the rows you need to merge. It is both elegant and efficient, which keeps your costs low.
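
To make it concrete, here's a minimal sketch of the select that could sit below that config, combining the var-based window with is_incremental() (the upstream model and columns are illustrative):

select
  primary_key
  , column1
  , column2
  , ingestion_date
from
  {{ ref('stg_events') }}  -- hypothetical upstream model
where
  true
  {% if is_incremental() %}
  -- on incremental runs, only the configured window is selected and merged;
  -- a --full-refresh rebuilds the whole table
  and ingestion_date between {{ var('event_time_start') }} and {{ var('event_time_end') }}
  {% endif %}

final_table.sql (continued)

Rows in that window are merged into the existing table on primary_key, while the incremental predicate keeps the merge from scanning partitions outside the window.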


Conclusion

Key Takeaways

  • Avoid full-table scans: filter early and often,
  • Default to incremental materializations as much as possible,
  • Materialize views as tables when performance requires it,
  • Use vars + macros to enforce lightweight filtering on ingestion date whenever possible.


As your dbt project grows, so will the number of models, and inevitably, the technical debt. That’s just the nature of software projects (unfortunately).

What I’ve learned during my freelancing career is this: a team is only as effective as the systems and guardrails it puts in place early on.

If you want to move fast later, you need to go slow at first: take the time to deeply understand your current and future use cases, and design your systems accordingly.

That’s why my strongest advice is this: before scaling your dbt project, invest time in writing down clear conventions, defining best practices, and aligning the team.
It’s not glamorous, but it’s the foundation that keeps everything from collapsing when scale hits.



Interested in my expertise?

Struggling with your data analytics project? Let's fix it together.
I help teams solve tough data problems — fast.

Let's talk!