Source: https://docs.telemetry.mozilla.org/concepts/sql_style
Always use uppercase for reserved keywords like SELECT
, WHERE
, or AS
.
first_name
.
Do not use camelCase.cardinality
, approx_distinct
, or substr
,
are identifiers
and should be treated like variable names.When choosing between explicit or implicit syntax, prefer explicit.
Always include the AS
keyword when aliasing a variable or table name,
it's easier to read when explicit.
Good
SELECT
date(submission_timestamp) AS day
FROM
telemetry.main
LIMIT
10
Bad
SELECT
date(submission_timestamp) day
FROM
telemetry.main
LIMIT
10
Always include the JOIN
type rather than relying on the default join.
Good
-- BigQuery Standard SQL Syntax
SELECT
submission_date,
experiment.key AS experiment_id,
experiment.value AS experiment_branch,
count(*) AS count
FROM
telemetry.clients_daily
CROSS JOIN
UNNEST(experiments.key_value) AS experiment
WHERE
submission_date > '2019-07-01'
AND sample_id = '10'
GROUP BY
submission_date,
experiment_id,
experiment_branch
Bad
-- BigQuery Standard SQL Syntax
SELECT
submission_date,
experiment.key AS experiment_id,
experiment.value AS experiment_branch,
count(*) AS count
FROM
telemetry.clients_daily,
UNNEST(experiments.key_value) AS experiment -- Implicit JOIN
WHERE
submission_date > '2019-07-01'
AND sample_id = '10'
GROUP BY
1, 2, 3 -- Implicit grouping column names
In the previous example, implicit grouping columns were discouraged, but there are cases where it makes sense.
In some SQL flavors (such as Presto) grouping elements must refer to the expression before any aliasing is done. If you are grouping by a complex expression it may be desirable to use implicit grouping columns rather than repeating the expression.
Good
-- BigQuery SQL Syntax
SELECT
submission_date,
normalized_channel IN ('nightly', 'aurora', 'beta') AS is_prerelease,
count(*) AS count
FROM
telemetry.clients_daily
WHERE
submission_date > '2019-07-01'
GROUP BY
submission_date,
is_prerelease -- Grouping by aliases is supported in BigQuery
Good
-- Presto SQL Syntax
SELECT
submission_date,
normalized_channel IN ('nightly', 'aurora', 'beta') AS is_prerelease,
count(*) AS count
FROM
telemetry.clients_daily
WHERE
submission_date > '20190701'
GROUP BY
1, 2 -- Implicit grouping avoids repeating expressions
Bad
-- Presto SQL Syntax
SELECT
submission_date,
normalized_channel IN ('nightly', 'aurora', 'beta') AS is_prerelease,
count(*) AS count
FROM
telemetry.clients_daily
WHERE
submission_date > '20190701'
GROUP BY
submission_date,
normalized_channel IN ('nightly', 'aurora', 'beta')
Root keywords should all start on the same character boundary. This is counter to the common "rivers" pattern described here.
Good:
SELECT
client_id,
submission_date
FROM
main_summary
WHERE
sample_id = '42'
AND submission_date > '20180101'
LIMIT
10
Bad:
SELECT client_id,
submission_date
FROM main_summary
WHERE sample_id = '42'
AND submission_date > '20180101'
Root keywords should be on their own line. For example:
Good:
SELECT
client_id,
submission_date
FROM
main_summary
WHERE
submission_date > '20180101'
AND sample_id = '42'
LIMIT
10
It's acceptable to include an argument on the same line as the root keyword, if there is exactly one argument.
Acceptable:
SELECT
client_id,
submission_date
FROM main_summary
WHERE
submission_date > '20180101'
AND sample_id = '42'
LIMIT 10
Do not include multiple arguments on one line.
Bad:
SELECT client_id, submission_date
FROM main_summary
WHERE
submission_date > '20180101'
AND sample_id = '42'
LIMIT 10
Bad
SELECT
client_id,
submission_date
FROM main_summary
WHERE submission_date > '20180101'
AND sample_id = '42'
LIMIT 10
The ON
and USING
keywords should start on a new line indented one level
more than the join keyword and be followed by the join conditions starting on
the same line. For example:
Good:
...
FROM
telemetry_stable.main_v4
LEFT JOIN
static.normalized_os_name
ON main_v4.environment.system.os.name = normalized_os_name.os_name
Bad:
...
FROM
telemetry_stable.main_v4
LEFT JOIN
static.normalized_os_name ON main_v4.environment.system.os.name = normalized_os_name.os_name
Bad:
...
FROM
telemetry_stable.main_v4
LEFT JOIN
static.normalized_os_name
ON
main_v4.environment.system.os.name = normalized_os_name.os_name
If parentheses span multiple lines:
For example:
Good
WITH sample AS (
SELECT
client_id,
FROM
main_summary
WHERE
sample_id = '42'
)
Bad (Terminating parenthesis on shared line)
WITH sample AS (
SELECT
client_id,
FROM
main_summary
WHERE
sample_id = '42')
Bad (No indent)
WITH sample AS (
SELECT
client_id,
FROM
main_summary
WHERE
sample_id = '42'
)
AND
and OR
should always be at the beginning of the line.
For example:
Good
...
WHERE
submission_date > 20180101
AND sample_id = '42'
Bad
...
WHERE
submission_date > 20180101 AND
sample_id = '42'
Do not use nested queries. Instead, use common table expressions to improve readability.
Good:
WITH sample AS (
SELECT
client_id,
submission_date
FROM
main_summary
WHERE
sample_id = '42'
)
SELECT *
FROM sample
LIMIT 10
Bad:
SELECT *
FROM (
SELECT
client_id,
submission_date
FROM
main_summary
WHERE
sample_id = '42'
)
LIMIT 10
This document was heavily influenced by https://www.sqlstyle.guide/
Changes to the style guide should be reviewed by at least one member of both the Data Engineering team and the Data Science team.