Nick White

If you’ve ever used BigQuery, you’ll know it’s an absolute beast of a warehouse, and its flavour of SQL (GoogleSQL) is fantastically deep and useful once you get into it.

However, when I got started, I ended up writing code that was more verbose than it needed to be because I didn’t know all the syntax (especially features that were added to the language more recently) and was copying obsolete patterns from old-ish parts of the codebase.

Recently I’ve been trying to right this wrong by spending more time reading the BigQuery docs, and I’ve pulled together a rough collection of the most useful insights from this process below.

Just use AI?

Large language models are fantastic tools for discovering this kind of thing. I like to paste in a query I’ve already written and ask if there are any newer language features I could make use of to improve the query. That’s how I discovered many of these.

However, I generally do not ask AI to generate queries directly. I think the “facepalm moment” of realising there is a better way is a much more powerful learning stimulus if you’ve been through the struggle of the old way first. Also, in my domain, very subtle errors can cause huge data quality problems down the line, so it would be risky to use LLM-generated queries directly.

The Features

HAVING

Allows you to query on an aggregate without using a CTE.

Old:

WITH department_salaries AS (
  SELECT
    department,
    AVG(salary) as avg_salary
  FROM employees
  GROUP BY department
)
SELECT
  department,
  avg_salary
FROM department_salaries
WHERE avg_salary > 100000;

New:

SELECT
  department,
  AVG(salary) as avg_salary,
  COUNT(*) as employee_count
FROM employees
GROUP BY department
HAVING AVG(salary) > 100000;

Docs on HAVING

QUALIFY

Allows you to filter on a window function without a CTE.

Old:

WITH ranked_employees AS (
  SELECT
    department,
    employee_name,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank
  FROM employees
)
SELECT * EXCEPT (rank)
FROM ranked_employees
WHERE rank <= 2;

New:

 SELECT
  department,
  employee_name,
  salary
FROM employees
QUALIFY
	RANK() OVER (PARTITION BY department ORDER BY salary DESC) <= 2;

Docs on QUALIFY

Timestamp +/- operator overloading

I find this nicer because the polarity (plus or minus) is no longer hidden in the function name.

You can also use the WEEK interval with operators, which oddly doesn’t work with TIMESTAMP_ADD/TIMESTAMP_SUB.

Old:

SELECT
  TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 12 HOUR) as twelve_hours_ago,
  TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) as one_week_from_now;

New (note we can also use the WEEK interval):

SELECT
  CURRENT_TIMESTAMP() - INTERVAL 12 HOUR as twelve_hours_ago,
  CURRENT_TIMESTAMP() + INTERVAL 1 WEEK as one_week_from_now;

However, there is a gotcha with DATE:

SELECT
	-- This returns a DATETIME, not a DATE!!
  CURRENT_DATE() - INTERVAL 1 DAY as dt_yesterday,

  -- This returns DATE but the syntax feels implicit/weird
  CURRENT_DATE() - 1 AS date_yesterday_weird_syntax,

  -- This returns a DATE, but is almost as verbose as DATE_SUB()
  DATE(CURRENT_DATE() - INTERVAL 1 DAY) as date_yesterday

Docs on interval arithmetic

IF() and IFNULL() functions

CASE WHEN is great for heavily branched logic, but for a single condition IF() is usually easier to read. There is also the shorthand IFNULL() for null checks.

Old:

SELECT 
  CASE WHEN weight > 5 THEN 'large' ELSE 'small' END as size,
  CASE WHEN type IS NULL then 'unknown' ELSE type AS type;

New:

SELECT 
  IF(weight > 5, 'large', 'small'),
  IFNULL(type, 'unknown') as type;

Docs on IF()

Docs on IFNULL()

IS DISTINCT FROM

This is a much more readable way to check for inequality, handling nulls as a regular value, not the insane way that they are handled by default.

Old:

SELECT
  employee_id,
  COALESCE((backup_phone IS NULL AND primary_phone IS NULL) OR backup_phone = primary_phone, FALSE) as same_phone_numbers
FROM employee_details;

New:

SELECT
  employee_id,
  backup_phone IS NOT DISTINCT FROM primary_phone as same_phone_numbers
FROM employee_details;

Docs on IS DISTINCT FROM

EXCEPT DISTINCT

Old:

SELECT * FROM table1 
LEFT JOIN table2 USING (id)
WHERE table2.id IS NULL;

New:

SELECT * FROM table1
EXCEPT DISTINCT
SELECT * FROM table2;

Docs on EXCEPT

BigQuery syntax I wish I'd known sooner

Just use AI?

The Features

HAVING

QUALIFY

Timestamp +/- operator overloading

IF() and IFNULL() functions

IS DISTINCT FROM

EXCEPT DISTINCT