
Prerequisites:

  1. You are using dbt to orchestrate your BigQuery workloads.

Output:

  • Cost visibility broken down by dbt model, team, package, or any custom label.
  • Ability to identify which models drive auto-scaling or high processing costs.
  • Foundation for cost allocation and optimization.

Steps:

Step 1: Label your dbt operations in BigQuery

Configure automated labels in your dbt_project.yml so every query is tagged with DBT native metadata:
query-comment:
  comment: "{{ bq_labels(node, var) }}"
  job-label: True
Then create a bq_labels macro that populates the labels automatically. This is where you choose your labeling strategy; here is all the information you can use:
Node:
  • node.original_file_path: the original file path of the node
  • node.name: the name of the node
  • node.resource_type: the resource type of the node (model, test, seed, snapshot, ...)
  • node.package_name: the dbt package name
  • node.database: for BigQuery, the name of the project you write to
  • node.schema: the dbt schema of the node
Macros:
  • env_var, ex: env_var('username', 'cloud_run_task'): any environment variable available at runtime
  • var, ex: var('customer', 'test'): any dbt variable available at runtime
Target:
  • target.get('profile_name'): the profile name of the target (useful if you split your ingestion / full refresh / backfill jobs per target)
  • target.get('target_name'): the target name of the target
{% macro bq_labels(node, var) %}
    {%- set comment_dict = {} -%}
    {%- do comment_dict.update(
        app='background_tasks',
        customer=var('customer', 'test'),
        repository="analytics",
        processed_by=env_var('username','cloud_run_task'),
        profile_name=target.get('profile_name'),
        target_name=target.get('target_name')
    ) -%}
    {%- if node is not none -%}
      {%- do comment_dict.update(
        file=node.original_file_path,
        node_id=node.unique_id,
        node_name=node.name,
        resource_type=node.resource_type,
        package_name=node.package_name,
        database=node.database,
        schema=node.schema
      ) -%}
    {%- endif -%}
    {% do return(tojson(comment_dict)) %}
{% endmacro %}
This macro labels every query executed during the dbt run.
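To confirm the labels are actually attached to your jobs, you can run a quick check against INFORMATION_SCHEMA (a sketch assuming the node_name label produced by the macro above and an EU region; adjust the project and region to yours):
SELECT
    job_id,
    labels
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE EXTRACT(DATE FROM creation_time) = CURRENT_DATE()
    AND EXISTS (SELECT 1 FROM UNNEST(labels) WHERE key = 'node_name')
LIMIT 10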

Step 2: Label your storage resources

In BigQuery, storage costs are visible at the dataset level (via resource_id / resource_name in the billing export), but they are not visible per table at the billing level. To add labels at the table level automatically, apply labels at the model level in dbt so the corresponding BigQuery tables are tagged:
{{
  config(
    tags=["ingestion"],
    materialized='table',
    labels={'maintained_by': 'team_A'}
  )
}}
You can also set labels at the dbt_project.yml level. Note that project-level labels will override model-level labels if both are set.
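To verify that a label actually landed on a table, you can query the dataset-level TABLE_OPTIONS view (a sketch assuming a hypothetical my_dataset dataset and the maintained_by label from the example above):
SELECT
    table_name,
    option_value
FROM `<your_project>`.my_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE option_name = 'labels'
    AND option_value LIKE '%maintained_by%'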

Step 3: Visualize your labeled job costs without Costory:

Use BigQuery's INFORMATION_SCHEMA to analyze costs by label (replace "label_name" with the label key you want to analyze, for example node_name):
SELECT
    ARRAY((SELECT value FROM UNNEST(labels) WHERE key = "label_name"))[SAFE_OFFSET(0)] AS label,
    SUM(total_slot_ms) AS total_slot_ms,
    SUM(total_bytes_billed) AS total_bytes_billed
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE EXTRACT(DATE FROM creation_time) = CURRENT_DATE()
    AND ARRAY((SELECT value FROM UNNEST(labels) WHERE key = "label_name"))[SAFE_OFFSET(0)] IS NOT NULL
GROUP BY 1
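If you are on on-demand pricing, the same view can also give an approximate spend per label (a sketch: 6.25 USD per TiB is the common on-demand list price, so replace it with the price for your region, and note that this ignores editions / reservations):
SELECT
    ARRAY((SELECT value FROM UNNEST(labels) WHERE key = "label_name"))[SAFE_OFFSET(0)] AS label,
    ROUND(SUM(total_bytes_billed) / POWER(1024, 4) * 6.25, 2) AS approx_on_demand_usd
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY 1
ORDER BY approx_on_demand_usd DESC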
To query storage-level labels on tables:
SELECT
    *,
    ARRAY((SELECT AS STRUCT json[SAFE_OFFSET(0)] AS key, json[SAFE_OFFSET(1)] AS value
           FROM UNNEST(ARRAY(
               (SELECT AS STRUCT JSON_VALUE_ARRAY(CONCAT('[', elements, ']')) AS json
                -- capture only the "key", "value" pair inside each STRUCT(...) literal
                FROM UNNEST(REGEXP_EXTRACT_ALL(option_value, r'STRUCT\(("[^"]+", "[^"]+")\)')) elements)
           )))) AS labels
FROM
    `<your_project>`.`region-eu`.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE option_name = 'labels'
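Once the table labels are extracted, you can also attribute storage to a label by joining with TABLE_STORAGE (a sketch assuming the maintained_by label from step 2 and an EU region; it pulls the label value with a simple regex rather than the full parsing above):
WITH table_labels AS (
    SELECT
        table_catalog,
        table_schema,
        table_name,
        REGEXP_EXTRACT(option_value, r'STRUCT\("maintained_by", "([^"]+)"\)') AS maintained_by
    FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.TABLE_OPTIONS
    WHERE option_name = 'labels'
)
SELECT
    l.maintained_by,
    ROUND(SUM(s.total_logical_bytes) / POWER(1024, 3), 2) AS total_logical_gib
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_PROJECT AS s
JOIN table_labels AS l
    ON s.project_id = l.table_catalog
    AND s.table_schema = l.table_schema
    AND s.table_name = l.table_name
GROUP BY 1
ORDER BY total_logical_gib DESC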

Step 4: Visualize your labeled job costs with Costory:

  1. Ingest the labels into Costory using the feature engineering UI.
  2. If some of your BigQuery labels should be merged into an existing label, use the merge feature to combine two labels into one: for example k8s_label_app and app.
  3. Create a virtual dimension to correctly allocate the cost, or explore costs directly within the explorer.

What works best?

In terms of labels:
  • Ingest node_name for fine-grained cost investigation: to understand exactly which model is driving the cost.
  • Use package_name or schema to map each schema to the correct team / owner.
  • Use target_name to split daily / full refresh / backfill jobs per target (see the query just below).
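For example, a single query can break slot usage down by both keys at once (a sketch reusing the package_name and target_name labels set by the macro in step 1):
SELECT
    (SELECT value FROM UNNEST(labels) WHERE key = 'package_name') AS package_name,
    (SELECT value FROM UNNEST(labels) WHERE key = 'target_name') AS target_name,
    SUM(total_slot_ms) AS total_slot_ms
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE EXTRACT(DATE FROM creation_time) = CURRENT_DATE()
GROUP BY 1, 2
ORDER BY total_slot_ms DESC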
In terms of alerts:
  • Set an alert on ingestion costs very close to the daily average: the daily ingestion job should stay near that average.
  • For full refresh jobs, evaluate alerts over a longer period to avoid spike effects (see the trend query after this list).
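To size those thresholds, a daily trend of bytes billed per target helps (a sketch assuming a hypothetical 'ingestion' value for the target_name label; JOBS_BY_PROJECT keeps roughly the last 180 days of history):
SELECT
    EXTRACT(DATE FROM creation_time) AS day,
    ROUND(SUM(total_bytes_billed) / POWER(1024, 4), 3) AS tib_billed
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    AND (SELECT value FROM UNNEST(labels) WHERE key = 'target_name') = 'ingestion'
GROUP BY day
ORDER BY day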
For storage costs:
  • Don't forget to evaluate storage costs frequently: after 90 days storage costs decrease (long-term storage pricing), but if you never read the data you should rely on an expiration policy instead.
  • Since 2023, you can choose the storage billing model per dataset (https://docs.cloud.google.com/bigquery/docs/updating-datasets#update_storage_billing_models):
    • Logical storage: you pay based on the logical size (before compression).
    • Physical storage: you pay based on the physical size (after compression).
    You can compare both costs with the following SQL query: run it per region and change the region-us qualifier:
    DECLARE active_logical_gib_price FLOAT64 DEFAULT 0.02;
    DECLARE long_term_logical_gib_price FLOAT64 DEFAULT 0.01;
    DECLARE active_physical_gib_price FLOAT64 DEFAULT 0.04;
    DECLARE long_term_physical_gib_price FLOAT64 DEFAULT 0.02;
    WITH storage_sizes AS (
        SELECT
            table_schema AS dataset_name,
            -- Logical Bytes
            SUM(active_logical_bytes) / POWER(1024, 3) AS active_logical_gib,
            SUM(long_term_logical_bytes) / POWER(1024, 3) AS long_term_logical_gib,
            -- Physical Bytes (includes compression)
            SUM(active_physical_bytes) / POWER(1024, 3) AS active_physical_gib,
            SUM(long_term_physical_bytes) / POWER(1024, 3) AS long_term_physical_gib,
            SUM(time_travel_physical_bytes) / POWER(1024, 3) AS time_travel_physical_gib,
            SUM(fail_safe_physical_bytes) / POWER(1024, 3) AS fail_safe_physical_gib
        FROM
            `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_PROJECT -- Change 'region-us' to your region
        WHERE table_type = 'BASE TABLE'
        GROUP BY 1
    )
    SELECT
        dataset_name,
        -- Logical Model Cost (Time Travel/Fail-safe are FREE in logical)
        ROUND((active_logical_gib * active_logical_gib_price) + (long_term_logical_gib * long_term_logical_gib_price), 2) AS monthly_cost_logical,
        -- Physical Model Cost (Time Travel/Fail-safe are CHARGED at active physical rate)
        ROUND(
            (active_physical_gib * active_physical_gib_price) +
            (long_term_physical_gib * long_term_physical_gib_price) +
            ((time_travel_physical_gib + fail_safe_physical_gib) * active_physical_gib_price), 2
        ) AS monthly_cost_physical,
        -- Compression Ratio
        ROUND(SAFE_DIVIDE(active_logical_gib + long_term_logical_gib, active_physical_gib + long_term_physical_gib), 2) AS compression_ratio
    FROM
        storage_sizes
    ORDER BY
        monthly_cost_logical DESC;
    
    Tabular data with frequently repeated values usually compresses very well (repeated values are encoded much like an enum), so the physical cost is often lower than the logical cost.
If you find a dataset that would be significantly cheaper under the physical model, switch it:
  • Console: go to the Dataset Details tab > Edit Details > Advanced Options > change the Storage Billing Model to Physical.
  • SQL:
ALTER SCHEMA my_dataset SET OPTIONS(storage_billing_model = 'physical');
Important Restrictions:
  • 24-Hour Delay: Changes take about 24 hours to affect billing.
  • 14-Day Lock: Once you switch, you must wait 14 days before you can switch back to the previous model.