BigQuery “NOT IN” vs “NOT EXISTS” vs “EXCEPT DISTINCT” understand the differences in 3 minutes!

William W
4 min readJan 27, 2025

--

As a data engineer, writing efficient SQL queries is a crucial part of my daily work. But, sometimes you’re also expected to have a certain deliverey speed when debuging some data that aren’t work as expected. This is exactly what happened last year when I’m working on bot traffic exclusion project. Long story short, we’ve implemented new exclusion logic to our production datawarehouse yesterday, but we got a massive reduction to our traffic (we’re talking somewhere between ~70–75% reduction in traffic). Rushed by the limited time, we’re trying to validate whether this reduction is expected or not. That’s where the data exclusion problem came in.

Understanding about the bot exclusion problems

Visualization of Bot Traffic Logic Implementation

There are 2 tables within my project, first one called “Raw” table and the second table is “Cleansed”. We implemented a bot cleansing logic between them.

Back to the issue, to validate and pin point the root cause, we planned to gather all cookie_id who were exist in the “Raw” data but doesn’t exist on the “Cleansed”. When it comes to excluding data from a result set in BigQuery, there are several options available:

  1. Not In
    reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/operators#in_operators
  2. Not Exists
    reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/operators#exists_operator
  3. Except Distinct
    reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#except

as you may expected, each of them have their own behavior. To better understand each of them, we gonna see the implementation for each approach & how they would affect the result.

This is my literal suffering last year. Hope you don’t experience it this year :”)

The “NOT IN”

The NOT IN operator checks if a value is not present in a list or subquery. This is actually the most basic common operator we intuitively used when faced with data exclusion problem. Bellow are the implementation example:

-- [Option 1] NOT IN
SELECT DISTINCT raw.cookie_id
FROM `bq-project-name.bq-dataset.search_traffic` AS raw
WHERE raw._partitiondate >= '2024-04-30'
AND raw.cookie_id NOT IN (
SELECT DISTINCT cleansed.cookie_id
FROM `bq-project-name.bq-dataset.cleaned_search_traffic` AS cleansed
WHERE cleansed.partition_date >= '2024-04-30'
);
-- result: no result

When I tried to execute the query above, it shows no result. It happens because of how the NOT IN operator works. Referring to this thread raised by @praxiteles at Stackoverflow, I noticed If the subquery used with NOT IN returns any NULL values, the entire query might return no results due to how it handles NULL comparisons.

The “NOT EXISTS”

The NOT EXISTS operator checks for the absence of rows in a subquery. But this time it would consider NULL values as an actual valid value for comparisson.

-- [Option 2] NOT EXISTS
SELECT DISTINCT raw.cookie_id
FROM `bq-project-name.bq-dataset.search_traffic` AS raw
WHERE raw._partitiondate >= '2024-04-30'
AND NOT EXISTS (
SELECT DISTINCT cleansed.cookie_id
FROM `bq-project-name.bq-dataset.cleaned_search_traffic` AS cleansed
WHERE cleansed.partition_date >= '2024-04-30'
);
-- result: no result

At first, I was puting a high hope to this method, but just like how the typical asian parents saw their kids, I saw this one as a dissapointment. The query above somehow still return no result. I keep diving for an explanation and noticed that when the anchor table (the table we used as the main reference, which in this case is the “Raw” table) is having a NULL value, this NOT EXISTS pattern will ultimately failed to execute a proper comparisson.

The “EXCEPT DISTINCT”

This operator might sounds unfamiliar for some people, but you better starts to learn more about it, because it’s amazing. To put it simple, the EXCEPT DISTINCT operator returns distinct rows from the left-hand query that are not present in the right-hand query.

-- [Method 3] Except Distinct
SELECT DISTINCT raw.cookie_id
FROM `bq-project-name.bq-dataset.search_traffic` AS raw
WHERE raw._partitiondate >= '2024-04-30'
EXCEPT DISTINCT
SELECT cleansed.cookie_id
FROM `bq-project-name.bq-dataset.cleaned_search_traffic` AS cleansed
WHERE cleansed.partition_date >= '2024-04-30';
-- result: count 82052

Third time is indeed a charm, we finally managed to gather all cookie_id who were exist in the “Raw” data but doesn’t exist on the “Cleansed”.

Conclusion

From this problem, we’ve learnt the difference between each data exclusion operator in BigQuery. Overall, you can follow this simple rule when deciding which operator you’re trying to use:

  • Use NOT IN if there are absolutely no NULL values in either tables.
  • Use NOT EXISTS if there are NULL in the comparator CTE, but there are no NULL in the anchor.
  • Use EXCEPT DISTINCT if there are NULL in both tables.

Got any further questions or want to share some of your thoughts? Feel free to reach me out on hello@williamwibowo.com

References

  1. Google Cloud. (2024). SQL in Bigquery — Operators. Google Cloud. https://cloud.google.com/bigquery/docs/reference/standard-sql/operators
  2. Google Cloud. (2024). SQL in Bigquery — Query Syntax. Google Cloud. https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#except
  3. Praxiteles & Hoffa, Felipe (2020). Why is this BigQuery WHERE NOT IN statement giving no results?. Stackoverflow.https://stackoverflow.com/questions/62057909/why-is-this-bigquery-where-not-in-statement-giving-no-results

--

--

William W
William W

Written by William W

I love to share about Data Engineering & Software Engineering concepts.

No responses yet