Tired of stacking UNION ALL, have you tried BigQuery GROUPING SETS?

William W
5 min readJan 27, 2025

--

In the world of data analysis, the GROUP BY clause is a fundamental tool for aggregating data. It allows us to summarize information based on specific columns, providing valuable insights. However, traditional GROUP BY can sometimes feel limiting when we need to generate multiple aggregations at different levels of granularity within a single query. This is where GROUPING SETS in BigQuery shines, offering a more flexible and powerful approach to data aggregation.

Improve stacked multi-granularity aggregation query readability by using GROUPING SETS

The Limitations of Traditional GROUP BY

Let’s imagine a scenario where we’re analyzing sales data for an e-commerce platform. We have a table with columns like product_category, region, and sales. Using a standard GROUP BY, we can easily find the total sales per product category:

SELECT product_category, SUM(sales) AS total_sales
FROM sales_table
GROUP BY product_category;

Similarly, we can find the total sales per region:

SELECT region, SUM(sales) AS total_sales
FROM sales_table
GROUP BY region;

But what if we want both these aggregations and the overall total sales in a single query? We’d traditionally have to resort to multiple queries or complex UNION ALL statements, which can be inefficient.

Introducing: GROUPING SETS

GROUPING SETS provides a concise and efficient way to generate multiple GROUP BY results in one go. It allows us to specify multiple sets of columns to group by, effectively creating multiple aggregations within a single result set.

Here’s how we can use GROUPING SETS to achieve our desired outcome:

SELECT
product_category,
region,
SUM(sales) AS total_sales
FROM
sales_table
GROUP BY GROUPING SETS((product_category), (region), ());

In this query:

  • (product_category): This grouping set calculates the total sales for each product category.
  • (region): This grouping set calculates the total sales for each region.
  • (): This empty grouping set calculates the overall total sales across all categories and regions.

The result will be a table with three types of rows:

  • Rows with a specific product_category and a NULL region, showing sales per category.
  • Rows with a specific region and a NULL product_category, showing sales per region.
  • A single row with NULL for both product_category and region, showing the grand total sales.

Beyond Simple Aggregations

GROUPING SETS can be combined with other aggregation functions like AVG, COUNT, MIN, and MAX to provide even more comprehensive analysis. It can also handle more complex grouping sets involving multiple columns:

SELECT
product_category,
region,
YEAR(order_date) as order_year,
SUM(sales) AS total_sales
FROM
sales_table
GROUP BY GROUPING SETS((product_category, region), (order_year), ());

This query would generate aggregations by product category and region combinations, by year, and the grand total.

What are the Drawbacks of GROUPING SETS?

While GROUPING SETS offers significant advantages in terms of query conciseness and expressiveness, it's essential to be aware of its limitations:

  1. Performance Considerations: It’s important to understand that GROUPING SETS doesn't inherently provide a performance boost compared to equivalent queries using UNION ALL. Under the hood, BigQuery often translates GROUPING SETS into a series of individual aggregations that are then combined. Therefore, the primary benefit of GROUPING SETS from a performance perspective is improved query readability and maintainability, which can indirectly lead to better performance by reducing the likelihood of errors in complex UNION ALL statements. However, it's not a magic bullet for faster execution.
  2. Explicit Granularity Management: GROUPING SETS defines the grouping combinations, but it doesn't directly provide a way to label or categorize the resulting aggregations. To distinguish between the different levels of granularity (e.g., whether a row represents an aggregation by product category, region, or the grand total), we rely on the GROUPING() function. While effective, this introduces a slight overhead. The GROUPING() function calculates a bitmask for each row, indicating which columns were used in the grouping. This extra processing, while usually minimal, can add a small performance cost, especially with very large datasets or complex grouping sets.

Diving Deeper into “Explicit Granularity Management” in GROUPING SETS

As mentioned above, the challenge of explicitly mention granularity in GROUPING SETS can be resolved by using GROUPING function and mapping. It can be done like this:

WITH pre_grouping AS (
SELECT
product_category,
region,
SUM(sales) AS total_sales,
GROUPING(product_category) AS grouping_product_category,
GROUPING(region) AS grouping_region
FROM
sales_table
GROUP BY GROUPING SETS((product_category, region), (region), ());
)

SELECT
CASE
WHEN grouping_product_category = 0
AND grouping_region = 0
THEN "product_category, region granularity"
WHEN product_name_agg = 0
THEN "region granularity"
ELSE "ALL granularity"
END AS grouping_set,
COALESCE(product_category, 'ALL') AS product_category,
COALESCE(region, 'ALL') AS region,
total_sales
FROM pre_grouping

Unfortunately, explicitly mentioning granularity through the approach mentioned above might introduce an extra computation overhead. On my current company, we did try to implement this pattern and tested the performance for each approach. In this performance testing, we’re comparing the performance of “UNION ALL” pattern vs “GROUPING SETS w. explicit mentioned granularity”. Bellow are the results:

1-day incremental run simulation:

existing: union all pattern

  • duration: 19s
  • slot time consumed: 1h54m

new: grouping sets w. explicit granularity pattern

  • duration: 30s
  • slot time consumed: 2h35m

30-days backfill run simulation:

existing: union all pattern

  • duration: 5m7s
  • slot time consumed: 3d18h

new: grouping sets w. explicit granularity pattern

  • duration 8m23s
  • slot time consumed 8d1h

Thinking about it, since our aggregate now needs to add several GROUPING() fields for tagging, it seems expected that the performance won’t be as good as the UNION ALL pattern. We noticed this by the slightly increased number of bytes shuffled (around ~10–15%).

Conclusion

In conclusion, BigQuery’s GROUPING SETS offers a compelling alternative to complex UNION ALL queries for generating aggregations at multiple granularities. By specifying different grouping combinations within a single GROUP BY clause, GROUPING SETS significantly improves query readability and maintainability. This leads to cleaner, more concise code, reducing the risk of errors often associated with lengthy and intricate UNION ALL statements.

However, it’s crucial to acknowledge the limitations. While GROUPING SETS enhances query structure, it doesn't inherently boost performance compared to optimized UNION ALL queries. BigQuery often processes GROUPING SETS by performing individual aggregations and then combining the results. Furthermore, managing explicit granularity with GROUPING SETS requires the use of the GROUPING() function, which introduces a small computational overhead. As demonstrated by performance testing, explicitly labeling granularity with GROUPING() and subsequent CASE statements can indeed lead to increased execution time and slot consumption compared to equivalent UNION ALL implementations, due to the additional processing required for generating and evaluating the grouping bitmasks.

Therefore, the decision to use GROUPING SETS should be based on a careful evaluation of the specific use case. If query readability and maintainability are paramount, and the performance overhead associated with GROUPING() is acceptable, GROUPING SETS is an excellent choice. However, if performance is the primary concern, especially with very large datasets or complex aggregations, a well-optimized UNION ALL approach might still be preferable. It's recommended to conduct thorough testing to determine the optimal solution for each specific scenario. Ultimately, GROUPING SETS is a valuable tool in the BigQuery arsenal, offering a more elegant and often more manageable way to handle multi-granularity aggregations.

Reference

Google Cloud. (2024). SQL in Bigquery — Query Syntax. Google Cloud.https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#group_by_grouping_sets

--

--

William W
William W

Written by William W

I love to share about Data Engineering & Software Engineering concepts.

No responses yet