
Prerequisites:

  1. You are using dbt to orchestrate your BigQuery workloads.

Output:

  • Cost visibility broken down by dbt model, team, package, or any custom label.
  • Ability to identify which models drive auto-scaling or high processing costs.
  • Foundation for cost allocation and optimization.

Steps:

Step 1: Label your dbt operations in BigQuery

Configure automated labels in your dbt_project.yml so every query is tagged with DBT native metadata:
query-comment:
  comment: "{{ bq_labels(node, var) }}"
  job-label: True
Then create a bq_labels macro that populates the labels automatically. This is where you choose your labeling strategy; here is all the information you can use:
Node:
  • node.original_file_path: the original file path of the node
  • node.name: the name of the node
  • node.resource_type: the resource type of the node (model, test, seed, snapshot, ...)
  • node.package_name: the dbt package name
  • node.database: for BigQuery, the name of the project you write to
  • node.schema: the dbt schema of the node
Macros:
  • env_var, ex: env_var('username', 'cloud_run_task'): any environment variable available at runtime
  • var, ex: var('customer', 'test'): any dbt variable available at runtime
Target:
  • target.get('profile_name'): the profile name of the target (useful if you split your ingestion / full refresh / backfill jobs per target)
  • target.get('target_name'): the target name of the target
{% macro bq_labels(node, var) %}
    {%- set comment_dict = {} -%}
    {%- do comment_dict.update(
        app='background_tasks',
        customer=var('customer', 'test'),
        repository="analytics",
        processed_by=env_var('username','cloud_run_task'),
        profile_name=target.get('profile_name'),
        target_name=target.get('target_name')
    ) -%}
    {%- if node is not none -%}
      {%- do comment_dict.update(
        file=node.original_file_path,
        node_id=node.unique_id,
        node_name=node.name,
        resource_type=node.resource_type,
        package_name=node.package_name,
        database=node.database,
        schema=node.schema
      ) -%}
    {%- endif -%}
    {% do return(tojson(comment_dict)) %}
{% endmacro %}
This macro labels every query executed during the dbt run.
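To confirm the labels are actually attached to your jobs, you can run a quick check against INFORMATION_SCHEMA (a sketch assuming the node_name label produced by the macro above and an EU region; adjust the project and region to yours):
SELECT
    job_id,
    labels
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE EXTRACT(DATE FROM creation_time) = CURRENT_DATE()
    AND EXISTS (SELECT 1 FROM UNNEST(labels) WHERE key = 'node_name')
LIMIT 10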

Step 2: Label your storage resources

In BigQuery, storage costs are visible at the dataset level (via resource_id / resource_name in the billing export), but they are not visible per table at the billing level. To add labels at the table level automatically, apply labels at the model level in dbt so the corresponding BigQuery tables are tagged:
{{
  config(
    tags=["ingestion"],
    materialized='table',
    labels={'maintained_by': 'team_A'}
  )
}}
You can also set labels at the dbt_project.yml level. Note that project-level labels will override model-level labels if both are set.
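To verify that a label actually landed on a table, you can query the dataset-level TABLE_OPTIONS view (a sketch assuming a hypothetical my_dataset dataset and the maintained_by label from the example above):
SELECT
    table_name,
    option_value
FROM `<your_project>`.my_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE option_name = 'labels'
    AND option_value LIKE '%maintained_by%'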

Step 3: Visualize your labeled job costs without Costory:

Use BigQuery's INFORMATION_SCHEMA to analyze costs by label (replace "label_name" with the label key you want to analyze, for example node_name):
SELECT
    ARRAY((SELECT value FROM UNNEST(labels) WHERE key = "label_name"))[SAFE_OFFSET(0)] AS label,
    SUM(total_slot_ms) AS total_slot_ms,
    SUM(total_bytes_billed) AS total_bytes_billed
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE EXTRACT(DATE FROM creation_time) = CURRENT_DATE()
    AND ARRAY((SELECT value FROM UNNEST(labels) WHERE key = "label_name"))[SAFE_OFFSET(0)] IS NOT NULL
GROUP BY 1
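If you are on on-demand pricing, the same view can also give an approximate spend per label (a sketch: 6.25 USD per TiB is the common on-demand list price, so replace it with the price for your region, and note that this ignores editions / reservations):
SELECT
    ARRAY((SELECT value FROM UNNEST(labels) WHERE key = "label_name"))[SAFE_OFFSET(0)] AS label,
    ROUND(SUM(total_bytes_billed) / POWER(1024, 4) * 6.25, 2) AS approx_on_demand_usd
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY 1
ORDER BY approx_on_demand_usd DESC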
To query storage-level labels on tables:
SELECT
    *,
    ARRAY((SELECT AS STRUCT json[SAFE_OFFSET(0)] AS key, json[SAFE_OFFSET(1)] AS value
           FROM UNNEST(ARRAY(
               (SELECT AS STRUCT JSON_VALUE_ARRAY(CONCAT('[', elements, ']')) AS json
                -- capture only the "key", "value" pair inside each STRUCT(...) literal
                FROM UNNEST(REGEXP_EXTRACT_ALL(option_value, r'STRUCT\(("[^"]+", "[^"]+")\)')) elements)
           )))) AS labels
FROM
    `<your_project>`.`region-eu`.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE option_name = 'labels'
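Once the table labels are extracted, you can also attribute storage to a label by joining with TABLE_STORAGE (a sketch assuming the maintained_by label from step 2 and an EU region; it pulls the label value with a simple regex rather than the full parsing above):
WITH table_labels AS (
    SELECT
        table_catalog,
        table_schema,
        table_name,
        REGEXP_EXTRACT(option_value, r'STRUCT\("maintained_by", "([^"]+)"\)') AS maintained_by
    FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.TABLE_OPTIONS
    WHERE option_name = 'labels'
)
SELECT
    l.maintained_by,
    ROUND(SUM(s.total_logical_bytes) / POWER(1024, 3), 2) AS total_logical_gib
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_PROJECT AS s
JOIN table_labels AS l
    ON s.project_id = l.table_catalog
    AND s.table_schema = l.table_schema
    AND s.table_name = l.table_name
GROUP BY 1
ORDER BY total_logical_gib DESC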

Step 4: Visualize your labeled job costs with Costory:

  1. Ingest the labels into Costory using the feature engineering UI.
  2. If some of your BigQuery labels should be merged into an existing label, use the merge feature to combine two labels into one: for example k8s_label_app and app.
  3. Create a virtual dimension to correctly allocate the cost, or explore costs directly within the explorer.

What works best?

In terms of labels:
  • Ingest node_name for fine-grained cost investigation: to understand exactly which model is driving the cost.
  • Use package_name or schema to map each schema to the correct team / owner.
  • Use target_name to split daily / full refresh / backfill jobs per target (see the query just below).
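For example, a single query can break slot usage down by both keys at once (a sketch reusing the package_name and target_name labels set by the macro in step 1):
SELECT
    (SELECT value FROM UNNEST(labels) WHERE key = 'package_name') AS package_name,
    (SELECT value FROM UNNEST(labels) WHERE key = 'target_name') AS target_name,
    SUM(total_slot_ms) AS total_slot_ms
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE EXTRACT(DATE FROM creation_time) = CURRENT_DATE()
GROUP BY 1, 2
ORDER BY total_slot_ms DESC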
In terms of alerts:
  • Set an alert on ingestion costs very close to the daily average: the daily ingestion job should stay near that average.
  • For full refresh jobs, evaluate alerts over a longer period to avoid spike effects (see the trend query after this list).
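To size those thresholds, a daily trend of bytes billed per target helps (a sketch assuming a hypothetical 'ingestion' value for the target_name label; JOBS_BY_PROJECT keeps roughly the last 180 days of history):
SELECT
    EXTRACT(DATE FROM creation_time) AS day,
    ROUND(SUM(total_bytes_billed) / POWER(1024, 4), 3) AS tib_billed
FROM `<your_project>`.`region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    AND (SELECT value FROM UNNEST(labels) WHERE key = 'target_name') = 'ingestion'
GROUP BY day
ORDER BY day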
For storage costs:
  • Don't forget to evaluate storage costs frequently: after 90 days storage costs decrease (long-term storage pricing), but if you never read the data you should rely on an expiration policy instead.
  • Since 2023, you can choose the storage billing model per dataset (https://docs.cloud.google.com/bigquery/docs/updating-datasets#update_storage_billing_models):
    • Logical storage: you pay based on the logical size (before compression).
    • Physical storage: you pay based on the physical size (after compression).
    You can compare both costs with the following SQL query: run it per region and change the region-us qualifier:
    DECLARE active_logical_gib_price FLOAT64 DEFAULT 0.02;
    DECLARE long_term_logical_gib_price FLOAT64 DEFAULT 0.01;
    DECLARE active_physical_gib_price FLOAT64 DEFAULT 0.04;
    DECLARE long_term_physical_gib_price FLOAT64 DEFAULT 0.02;
    WITH storage_sizes AS (
        SELECT
            table_schema AS dataset_name,
            -- Logical Bytes
            SUM(active_logical_bytes) / POWER(1024, 3) AS active_logical_gib,
            SUM(long_term_logical_bytes) / POWER(1024, 3) AS long_term_logical_gib,
            -- Physical Bytes (includes compression)
            SUM(active_physical_bytes) / POWER(1024, 3) AS active_physical_gib,
            SUM(long_term_physical_bytes) / POWER(1024, 3) AS long_term_physical_gib,
            SUM(time_travel_physical_bytes) / POWER(1024, 3) AS time_travel_physical_gib,
            SUM(fail_safe_physical_bytes) / POWER(1024, 3) AS fail_safe_physical_gib
        FROM
            `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_PROJECT -- Change 'region-us' to your region
        WHERE table_type = 'BASE TABLE'
        GROUP BY 1
    )
    SELECT
        dataset_name,
        -- Logical Model Cost (Time Travel/Fail-safe are FREE in logical)
        ROUND((active_logical_gib * active_logical_gib_price) + (long_term_logical_gib * long_term_logical_gib_price), 2) AS monthly_cost_logical,
        -- Physical Model Cost (Time Travel/Fail-safe are CHARGED at active physical rate)
        ROUND(
            (active_physical_gib * active_physical_gib_price) +
            (long_term_physical_gib * long_term_physical_gib_price) +
            ((time_travel_physical_gib + fail_safe_physical_gib) * active_physical_gib_price), 2
        ) AS monthly_cost_physical,
        -- Compression Ratio
        ROUND(SAFE_DIVIDE(active_logical_gib + long_term_logical_gib, active_physical_gib + long_term_physical_gib), 2) AS compression_ratio
    FROM
        storage_sizes
    ORDER BY
        monthly_cost_logical DESC;
    
    Tabular data with frequently repeated values usually compresses very well (repeated values are encoded much like an enum), so the physical cost is often lower than the logical cost.
If you find a dataset that would be significantly cheaper under the physical model, switch it:
  • Console: go to the Dataset Details tab > Edit Details > Advanced Options > change the Storage Billing Model to Physical.
  • SQL:
ALTER SCHEMA my_dataset SET OPTIONS(storage_billing_model = 'physical');
Important Restrictions:
  • 24-Hour Delay: Changes take about 24 hours to affect billing.
  • 14-Day Lock: Once you switch, you must wait 14 days before you can switch back to the previous model.