Tired of stacking UNION ALL, have you tried BigQuery GROUPING SETS?
In the world of data analysis, the GROUP BY
clause is a fundamental tool for aggregating data. It allows us to summarize information based on specific columns, providing valuable insights. However, traditional GROUP BY
can sometimes feel limiting when we need to generate multiple aggregations at different levels of granularity within a single query. This is where GROUPING SETS
in BigQuery shines, offering a more flexible and powerful approach to data aggregation.
The Limitations of Traditional GROUP BY
Let’s imagine a scenario where we’re analyzing sales data for an e-commerce platform. We have a table with columns like product_category
, region
, and sales
. Using a standard GROUP BY
, we can easily find the total sales per product category:
SELECT product_category, SUM(sales) AS total_sales
FROM sales_table
GROUP BY product_category;
Similarly, we can find the total sales per region:
SELECT region, SUM(sales) AS total_sales
FROM sales_table
GROUP BY region;
But what if we want both these aggregations and the overall total sales in a single query? We’d traditionally have to resort to multiple queries or complex UNION ALL
statements, which can be inefficient.
Introducing: GROUPING SETS
GROUPING SETS
provides a concise and efficient way to generate multiple GROUP BY
results in one go. It allows us to specify multiple sets of columns to group by, effectively creating multiple aggregations within a single result set.
Here’s how we can use GROUPING SETS
to achieve our desired outcome:
SELECT
product_category,
region,
SUM(sales) AS total_sales
FROM
sales_table
GROUP BY GROUPING SETS((product_category), (region), ());
In this query:
(product_category)
: This grouping set calculates the total sales for each product category.(region)
: This grouping set calculates the total sales for each region.()
: This empty grouping set calculates the overall total sales across all categories and regions.
The result will be a table with three types of rows:
- Rows with a specific
product_category
and aNULL
region
, showing sales per category. - Rows with a specific
region
and aNULL
product_category
, showing sales per region. - A single row with
NULL
for bothproduct_category
andregion
, showing the grand total sales.
Beyond Simple Aggregations
GROUPING SETS
can be combined with other aggregation functions like AVG
, COUNT
, MIN
, and MAX
to provide even more comprehensive analysis. It can also handle more complex grouping sets involving multiple columns:
SELECT
product_category,
region,
YEAR(order_date) as order_year,
SUM(sales) AS total_sales
FROM
sales_table
GROUP BY GROUPING SETS((product_category, region), (order_year), ());
This query would generate aggregations by product category and region combinations, by year, and the grand total.
What are the Drawbacks of GROUPING SETS?
While GROUPING SETS
offers significant advantages in terms of query conciseness and expressiveness, it's essential to be aware of its limitations:
- Performance Considerations: It’s important to understand that
GROUPING SETS
doesn't inherently provide a performance boost compared to equivalent queries usingUNION ALL
. Under the hood, BigQuery often translatesGROUPING SETS
into a series of individual aggregations that are then combined. Therefore, the primary benefit ofGROUPING SETS
from a performance perspective is improved query readability and maintainability, which can indirectly lead to better performance by reducing the likelihood of errors in complexUNION ALL
statements. However, it's not a magic bullet for faster execution. - Explicit Granularity Management:
GROUPING SETS
defines the grouping combinations, but it doesn't directly provide a way to label or categorize the resulting aggregations. To distinguish between the different levels of granularity (e.g., whether a row represents an aggregation by product category, region, or the grand total), we rely on theGROUPING()
function. While effective, this introduces a slight overhead. TheGROUPING()
function calculates a bitmask for each row, indicating which columns were used in the grouping. This extra processing, while usually minimal, can add a small performance cost, especially with very large datasets or complex grouping sets.
Diving Deeper into “Explicit Granularity Management” in GROUPING SETS
As mentioned above, the challenge of explicitly mention granularity in GROUPING SETS can be resolved by using GROUPING function and mapping. It can be done like this:
WITH pre_grouping AS (
SELECT
product_category,
region,
SUM(sales) AS total_sales,
GROUPING(product_category) AS grouping_product_category,
GROUPING(region) AS grouping_region
FROM
sales_table
GROUP BY GROUPING SETS((product_category, region), (region), ());
)
SELECT
CASE
WHEN grouping_product_category = 0
AND grouping_region = 0
THEN "product_category, region granularity"
WHEN product_name_agg = 0
THEN "region granularity"
ELSE "ALL granularity"
END AS grouping_set,
COALESCE(product_category, 'ALL') AS product_category,
COALESCE(region, 'ALL') AS region,
total_sales
FROM pre_grouping
Unfortunately, explicitly mentioning granularity through the approach mentioned above might introduce an extra computation overhead. On my current company, we did try to implement this pattern and tested the performance for each approach. In this performance testing, we’re comparing the performance of “UNION ALL” pattern vs “GROUPING SETS w. explicit mentioned granularity”. Bellow are the results:
1-day incremental run simulation:
existing: union all pattern
- duration:
19s
- slot time consumed:
1h54m
new: grouping sets w. explicit granularity pattern
- duration:
30s
- slot time consumed:
2h35m
30-days backfill run simulation:
existing: union all pattern
- duration:
5m7s
- slot time consumed:
3d18h
new: grouping sets w. explicit granularity pattern
- duration
8m23s
- slot time consumed
8d1h
Thinking about it, since our aggregate now needs to add several GROUPING() fields for tagging, it seems expected that the performance won’t be as good as the UNION ALL pattern. We noticed this by the slightly increased number of bytes shuffled (around ~10–15%).
Conclusion
In conclusion, BigQuery’s GROUPING SETS
offers a compelling alternative to complex UNION ALL
queries for generating aggregations at multiple granularities. By specifying different grouping combinations within a single GROUP BY
clause, GROUPING SETS
significantly improves query readability and maintainability. This leads to cleaner, more concise code, reducing the risk of errors often associated with lengthy and intricate UNION ALL
statements.
However, it’s crucial to acknowledge the limitations. While GROUPING SETS
enhances query structure, it doesn't inherently boost performance compared to optimized UNION ALL
queries. BigQuery often processes GROUPING SETS
by performing individual aggregations and then combining the results. Furthermore, managing explicit granularity with GROUPING SETS
requires the use of the GROUPING()
function, which introduces a small computational overhead. As demonstrated by performance testing, explicitly labeling granularity with GROUPING()
and subsequent CASE
statements can indeed lead to increased execution time and slot consumption compared to equivalent UNION ALL
implementations, due to the additional processing required for generating and evaluating the grouping bitmasks.
Therefore, the decision to use GROUPING SETS
should be based on a careful evaluation of the specific use case. If query readability and maintainability are paramount, and the performance overhead associated with GROUPING()
is acceptable, GROUPING SETS
is an excellent choice. However, if performance is the primary concern, especially with very large datasets or complex aggregations, a well-optimized UNION ALL
approach might still be preferable. It's recommended to conduct thorough testing to determine the optimal solution for each specific scenario. Ultimately, GROUPING SETS
is a valuable tool in the BigQuery arsenal, offering a more elegant and often more manageable way to handle multi-granularity aggregations.
Reference
Google Cloud. (2024). SQL in Bigquery — Query Syntax. Google Cloud.https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#group_by_grouping_sets