Migrating from legacy SQL

BigQuery supports two SQL dialects: standard SQL and legacy SQL. This topic explains the differences between the two dialects, including syntax, functions, and semantics, and gives examples of some of the highlights of standard SQL.

Comparison of legacy and standard SQL

Previously, BigQuery executed queries using a non-standard SQL dialect known as BigQuery SQL. With the launch of BigQuery 2.0, BigQuery released support for standard SQL, and renamed BigQuery SQL to legacy SQL. Standard SQL is the preferred SQL dialect for querying data stored in BigQuery.

Do I have to migrate to standard SQL?

Migration from legacy SQL to standard SQL is recommended but not required. For example, suppose that you execute many queries that use legacy SQL, but you want to take advantage of a standard SQL feature for a new query. You can create new queries using standard SQL that run alongside queries using legacy SQL.

Enabling standard SQL

You have a choice of whether to use legacy or standard SQL when you execute a query through BigQuery. See enabling standard SQL for steps to enable standard SQL in the BigQuery UI, CLI, API, or whichever interface you are using.

Advantages of standard SQL

Standard SQL is compliant with the SQL 2011 standard, and has extensions that support querying nested and repeated data. It has several advantages over legacy SQL, including:

Composability using WITH clauses and SQL functions
Subqueries in the SELECT list and WHERE clause
Correlated subqueries
ARRAY and STRUCT data types
Inserts, updates, and deletes
COUNT(DISTINCT <expr>) is exact and scalable, providing the accuracy of EXACT_COUNT_DISTINCT without its limitations
Automatic predicate push-down through JOINs
Complex JOIN predicates, including arbitrary expressions

For examples that demonstrate some of these features, see Standard SQL highlights.

Type differences

Legacy SQL types have an equivalent in standard SQL and vice versa. In some cases, the type has a different name. The following table lists each legacy SQL data type and its standard SQL equivalent.

Legacy SQL	Standard SQL	Notes
`BOOL`	`BOOL`
`INTEGER`	`INT64`
`FLOAT`	`FLOAT64`
`STRING`	`STRING`
`BYTES`	`BYTES`
`RECORD`	`STRUCT`
`REPEATED`	`ARRAY`
`TIMESTAMP`	`TIMESTAMP`	Standard SQL has a stricter `TIMESTAMP` range
`DATE`	`DATE`	Legacy SQL has limited support for `DATE`
`TIME`	`TIME`	Legacy SQL has limited support for `TIME`
`DATETIME`	`DATETIME`	Legacy SQL has limited support for `DATETIME`

For more information on the standard SQL type system, see the standard SQL data types reference. For more information on data types in BigQuery, see the BigQuery data types reference.

Label differences

Legacy SQL has the concept of "labels" for fields, which are:

REQUIRED - A value stored in this field cannot be NULL
NULLABLE - A value stored in this field can be NULL
REPEATED - A value of this field comprises any number of elements. Elements cannot be NULL

The following table describes the availability of these concepts in standard SQL.

Legacy SQL label	Supported in standard SQL?	Notes
`REQUIRED`	No	Convert `REQUIRED` fields to `NULLABLE` to append to tables that include them
`NULLABLE`	Yes	All standard SQL types are `NULLABLE`
`REPEATED`	Yes; equivalent to `ARRAY`	None needed

For more information on REPEATED field handling, see Differences in repeated field handling.

Converting `REQUIRED` fields to `NULLABLE`

The legacy SQL concept of REQUIRED is not present in standard SQL, and a value of any type can be NULL. To append to a table that has REQUIRED fields using standard SQL, you must first change those fields to be NULLABLE instead. There are multiple ways to change REQUIRED fields to NULLABLE:

BigQuery UI

Navigate to the table to which you want to append. Under the "Schema" tab, click the arrow next to REQUIRED, then select "All REQUIRED to NULLABLE".

BigQuery API

Using the jobs API, pass the ALLOW_FIELD_RELAXATION option as part of the configuration.query.schemaUpdateOptions list. See the jobs API reference for more details.

After changing all REQUIRED fields in your table to NULLABLE, you can now append to it using standard SQL.

`REPEATED` fields

The legacy SQL concept of REPEATED is equivalent to the ARRAY type in standard SQL. For example, REPEATED INTEGER is equivalent to ARRAY<INT64> in standard SQL.

Standard SQL supports NULL array elements, but raises an error if there is a NULL array element in the query result. If there is a NULL array column in the query result, standard SQL stores it as an empty array.

For more information on the standard SQL type system, see the standard SQL data types reference. For more information on data types in BigQuery, see the BigQuery data types reference.

Syntax differences

Escaping reserved keywords and invalid identifiers

In legacy SQL, you escape reserved keywords and identifiers that contain invalid characters such as a space or hyphen - using square brackets []. In standard SQL, you escape such keywords and identifiers using backticks `. For example:

SELECT
  word,
  SUM(word_count) AS word_count
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE word IN ('me', 'I', 'you')
GROUP BY word;

For a list of reserved keywords and what constitutes valid identifiers, see Lexical Structure.

Project-qualified table names

In legacy SQL, to query a table with a project-qualified name, you use a semi-colon, :, as a separator. For example:

SELECT
  word
FROM
  [bigquery-public-data:samples.shakespeare]
LIMIT 1;

In standard SQL, you use a period, ., instead. For example:

SELECT
  word
FROM
  `bigquery-public-data.samples.shakespeare`
LIMIT 1;

If your project name includes a domain, such as example.com:myproject, you use example.com:myproject as the project name, including the :.

Table decorators and wildcard functions

Standard SQL does not support the TABLE_DATE_RANGE, TABLE_DATE_RANGE_STRICT, or TABLE_QUERY functions.

You can achieve the same semantics of TABLE_DATE_RANGE and TABLE_QUERY using a filter on the _TABLE_SUFFIX pseudocolumn. For example, consider the following legacy SQL query, which counts the number of rows across 2010 and 2011 in the National Oceanic and Atmospheric Administration GSOD (global summary of the day) tables:

SELECT COUNT(*)
FROM TABLE_QUERY([bigquery-public-data:noaa_gsod],
                 'table_id IN ("gsod2010", "gsod2011")');

An equivalent query using standard SQL is:

SELECT COUNT(*)
FROM `bigquery-public-data.noaa_gsod.*`
WHERE _TABLE_SUFFIX IN ("gsod2010", "gsod2011");

For more information, including examples that use TABLE_DATE_RANGE, see Migrating legacy SQL wildcard functions

Trailing commas in the `SELECT` list

Unlike legacy SQL, standard SQL does not permit trailing commas prior to the FROM clause. For example, the following query is invalid:

SELECT
  word,
  corpus,  -- Error due to trailing comma
FROM
  `bigquery-public-data.samples.shakespeare`
LIMIT 1;

To correct the error, remove the comma after corpus:

SELECT
  word,
  corpus
FROM
  `bigquery-public-data.samples.shakespeare`
LIMIT 1;

Comma operator with tables

In legacy SQL, the comma operator , has the non-standard meaning of UNION ALL when applied to tables. In standard SQL, the comma operator has the standard meaning of JOIN. For example, consider the following legacy SQL query:

SELECT
  x,
  y
FROM
  (SELECT 1 AS x, "foo" AS y),
  (SELECT 2 AS x, "bar" AS y);

This is equivalent to the standard SQL query:

SELECT
  x,
  y
FROM
  (SELECT 1 AS x, "foo" AS y UNION ALL
   SELECT 2 AS x, "bar" AS y);

Note also that in standard SQL, UNION ALL associates columns by position rather than by name. The above query is equivalent to:

SELECT
  x,
  y
FROM
  (SELECT 1 AS x, "foo" AS y UNION ALL
   SELECT 2, "bar");

A common usage of the comma operator in standard SQL is to JOIN with an array. For example:

WITH T AS (
  SELECT 0 AS x, [1, 2, 3] AS arr UNION ALL
  SELECT 1, [4, 5])
SELECT
  x,
  y
FROM
  T,
  UNNEST(arr) AS y;

This returns the cross product of the table T with the elements of arr. You can also express the query in standard SQL as:

WITH T AS (
  SELECT 0 AS x, [1, 2, 3] AS arr UNION ALL
  SELECT 1, [4, 5])
SELECT
  x,
  y
FROM
  T
JOIN
  UNNEST(arr) AS y;

In this query, JOIN has the same meaning as the , comma operator separating T and UNNEST(arr) AS y in the example above it.

Logical views

You cannot query a logical view defined with legacy SQL using standard SQL and vice versa due to differences in syntax and semantics between the dialects. Instead, you would need to create a new view that uses standard SQL--possibly under a different name--to replace a view that uses legacy SQL.

As an example, suppose that view V is defined using legacy SQL as:

SELECT *, UTC_USEC_TO_DAY(timestamp_col) AS day
FROM MyTable;

Suppose that view W is defined using legacy SQL as:

SELECT user, action, day
FROM V;

Suppose that you execute the following legacy SQL query daily, but you want to migrate it to use standard SQL instead:

SELECT EXACT_COUNT_DISTINCT(user), action, day
FROM W
GROUP BY action, day;

One possible migration path is to create new views using different names. The steps involved are:

Create a view named V2 using standard SQL with the following contents:

SELET *, EXTRACT(DAY FROM timestamp_col) AS day
FROM MyTable;

Create a view named W2 using standard SQL with the following contents:

SELECT user, action, day
FROM V2;

Change your query that executes daily to use standard SQL and refer to W2 instead:

SELECT COUNT(DISTINCT user), action, day
FROM W2
GROUP BY action, day;

Another option is to delete views V and W, then recreate them using standard SQL under the same names. With this option, you would need to migrate all of your queries that reference V or W to use standard SQL at the same time, however.

Function comparison

The following is a partial list of legacy SQL functions and their standard SQL equivalents.

Legacy SQL	Standard SQL	Notes
`INTEGER(x)`	`SAFE_CAST(x AS INT64)`
`CAST(x AS INTEGER)`	`SAFE_CAST(x AS INT64)`
`DATEDIFF(t1, t2)`	`TIMESTAMP_DIFF(t1, t2, DAY)`
`NOW`	`CURRENT_TIMESTAMP`
`STRFTIME_UTC_USEC(t, fmt)`	`FORMAT_TIMESTAMP(fmt, t)`
`UTC_USEC_TO_DAY(t)`	`TIMESTAMP_TRUNC(t, DAY)`
`REGEXP_MATCH(s, pattern)`	`REGEXP_CONTAINS(s, pattern)`
`IS_NULL(x)`	`x IS NULL`
`LEFT(s, len)`	`SUBSTR(s, 0, len)`
`RIGHT(s, len)`	`SUBSTR(s, -len)`
`s CONTAINS "foo"`	`STRPOS(s, "foo") > 0` or `s LIKE '%foo%'`
`x % y`	`MOD(x, y)`
`NEST(x)`	`ARRAY_AGG(x)`
`ANY(x)`	`ANY_VALUE(x)`
`GROUP_CONCAT_UNQUOTED(s, sep)`	`STRING_AGG(s, sep)`
`SOME(x)`	`IFNULL(LOGICAL_OR(x), false)`
`EVERY(x)`	`IFNULL(LOGICAL_AND(x), true)`
`COUNT(DISTINCT x)`	`APPROX_COUNT_DISTINCT(x)`	see notes below
`EXACT_COUNT_DISTINCT(x)`	`COUNT(DISTINCT x)`	see notes below
`QUANTILES(x, buckets + 1)`	`APPROX_QUANTILES(x, buckets)`
`TOP(x, num), COUNT(*)`	`APPROX_TOP_COUNT(x, num)`
`NTH(index, arr) WITHIN RECORD`	`arr[SAFE_ORDINAL(index)]`
`COUNT(arr) WITHIN RECORD`	`ARRAY_LENGTH(arr)`
`HOST(url)`	`NET.HOST(url)`	see differences below
`TLD(url)`	`NET.PUBLIC_SUFFIX(url)`	see differences below
`DOMAIN(url)`	`NET.REG_DOMAIN(url)`	see differences below

For more information on standard SQL functions, see the Functions & Operators topic.

`COUNT` function comparison

Both legacy SQL and standard SQL contain a COUNT function. However, each function behaves differently, depending on the SQL dialect you use.

In legacy SQL, COUNT(DISTINCT x) returns an approximate count. In standard SQL, it returns an exact count. For an approximate count of distinct values that runs faster and requires fewer resources, use APPROX_COUNT_DISTINCT.

URL function comparison

Both legacy SQL and standard SQL contain functions for parsing URLs. In legacy SQL, these functions are HOST(url), TLD(url), and DOMAIN(url). In standard SQL, these functions are NET.HOST(url), NET.PUBLIC_SUFFIX(url), and NET.REG_DOMAIN(url).

Improvements over legacy SQL functions

Standard SQL URL functions can parse URLs starting with "//".
When the input is not compliant with RFC 3986 or is not a URL (for example, "mailto:?to=&subject=&body="), different rules are applied to parse the input. In particular, Standard SQL URL functions can parse non-standard inputs without "//", such as "www.google.com". For best results, it is recommended that you ensure that inputs are URLs and comply with RFC 3986.
NET.PUBLIC_SUFFIX returns results without leading dots. For example, it returns "com" instead of ".com". This complies with the format in the public suffix list.
NET.PUBLIC_SUFFIX and NET.REG_DOMAIN support uppercase letters and internationalized domain names. TLD and DOMAIN do not support them (might return unexpected results).

Minor differences on edge cases

If the input does not contain any suffix in the public suffix list, NET.PUBLIC_SUFFIX and NET.REG_DOMAIN return NULL, while TLD and DOMAIN return non-NULL values as best effort guesses.
If the input contains only a public suffix without a preceding label (for example, "http://com"), NET.PUBLIC_SUFFIX returns the public suffix, while TLD returns an empty string. Similarly, NET.REG_DOMAIN returns NULL, while DOMAIN returns the public suffix.
For inputs with IPv6 hosts, NET.HOST does not remove brackets from the result, as specified by RFC 3986.
For inputs with IPv4 hosts, NET.REG_DOMAIN returns NULL, while DOMAIN returns the first 3 octets.

Examples

In the table below, gray text color indicates results that are the same between legacy and standard SQL.

URL (description)	HOST	NET.HOST	TLD	NET.PUBLIC _SUFFIX	DOMAIN	NET.REG_DOMAIN
"//google.com" (starting with "//")	NULL	"google.com"	NULL	"com"	NULL	"google.com"
"google.com" (non-standard; no "//")	NULL	"google.com"	NULL	"com"	NULL	"google.com"
"http://user:pass@[email protected]" (non-standard with multiple "@")	"[email protected]"	"x.com"	".com"	"com"	"[email protected]"	"x.com"
"http://foo.com:1:2" (non-standard with multiple ":")	"foo.com:1"	"foo.com"	".com:1"	"com"	"foo.com"	"foo.com"
"http://x.Co.uk" (upper case letters)	"x.Co.uk"	"x.Co.uk"	".uk"	"Co.uk"	"Co.uk"	"x.Co.uk"
"http://a.b" (public suffix not found)	"a.b"	"a.b"	".b"	NULL	"a.b"	NULL
"http://com" (host contains only a public suffix)	"com"	"com"	""	"com"	"com"	NULL
"http://[::1]" (IPv6 host; no public suffix)	"::1"	"[::1]"	""	NULL	"::1"	NULL
"http://1.2.3.4" (IPv4 host; no public suffix)	"1.2.3.4"	"1.2.3.4"	""	NULL	"1.2.3"	NULL

Differences in repeated field handling

A REPEATED type in legacy SQL is equivalent to an ARRAY of that type in standard SQL. The following section discusses some of the differences in operations on repeated fields between legacy and standard SQL.

Selecting nested repeated leaf fields

Using legacy SQL, you can "dot" into a nested repeated field without needing to consider where the repetition occurs. In standard SQL, attempting to "dot" into a nested repeated field results in an error. For example:

SELECT
  repository.url,
  payload.pages.page_name
FROM
  `bigquery-public-data.samples.github_nested`
LIMIT 5;

Attempting to execute this query returns:

Cannot access field page_name on a value with type
ARRAY<STRUCT<action STRING, html_url STRING, page_name STRING, ...>>

To correct the error and return an array of page_names in the result, use an ARRAY subquery instead. For example:

SELECT
  repository.url,
  ARRAY(SELECT page_name FROM UNNEST(payload.pages)) AS page_names
FROM
  `bigquery-public-data.samples.github_nested`
LIMIT 5;

For more information on arrays and ARRAY subqueries, see the Working with Arrays topic.

Filtering repeated fields

Using legacy SQL, you can filter repeated fields directly using a WHERE clause. In standard SQL, you can express similar logic with a JOIN comma operator followed by a filter. For example, consider the following legacy SQL query:

SELECT
  payload.pages.title
FROM
  [bigquery-public-data:samples.github_nested]
WHERE payload.pages.page_name IN ('db_jobskill', 'Profession');

This query returns all titles of pages for which the page_name is either db_jobskill or Profession. You can express a similar query in standard SQL as:

SELECT
  page.title
FROM
  `bigquery-public-data.samples.github_nested`,
  UNNEST(payload.pages) AS page
WHERE page.page_name IN ('db_jobskill', 'Profession');

One difference between the preceding legacy SQL and standard SQL queries is that if you unset the "Flatten Results" option and execute the legacy SQL query, payload.pages.title is REPEATED in the query result. To achieve the same semantics in standard SQL and return an array for the title column, use an ARRAY subquery instead:

SELECT
  title
FROM (
  SELECT
    ARRAY(SELECT title FROM UNNEST(payload.pages)
          WHERE page_name IN ('db_jobskill', 'Profession')) AS title
  FROM
    `bigquery-public-data.samples.github_nested`)
WHERE ARRAY_LENGTH(title) > 0;

This query creates an array of titles where the page_name is either 'db_jobskill' or 'Profession', then filters any rows where the array did not match that condition using ARRAY_LENGTH(title) > 0.

For more information on arrays, see the Working with Arrays topic.

Structure of selected nested leaf fields

Legacy SQL preserves the structure of nested leaf fields in the SELECT list when the "Flatten Results" option is unset, whereas standard SQL does not. For example, consider the following legacy SQL query:

SELECT
  repository.url,
  repository.has_downloads
FROM
  [bigquery-public-data.samples.github_nested]
LIMIT 5;

This query returns url and has_downloads within a record named repository when "Flatten Results" is unset. Now consider the following standard SQL query:

SELECT
  repository.url,
  repository.has_downloads
FROM
  `bigquery-public-data.samples.github_nested`
LIMIT 5;

This query returns url and has_downloads as top-level columns; they are not part of a repository record or struct. To return them as part of a struct, use the STRUCT operator:

SELECT
  STRUCT(
    repository.url,
    repository.has_downloads) AS repository
FROM
  `bigquery-public-data.samples.github_nested`
LIMIT 5;

Removing repetition with `FLATTEN`

Standard SQL does not have a FLATTEN function as in legacy SQL, but you can achieve similar semantics using the JOIN (comma) operator. For example, consider the following legacy SQL query:

SELECT
  repository.url,
  payload.pages.page_name
FROM
  FLATTEN([bigquery-public-data:samples.github_nested], payload.pages.page_name)
LIMIT 5;

You can express a similar query in standard SQL as follows:

SELECT
  repository.url,
  page.page_name
FROM
  `bigquery-public-data.samples.github_nested`,
  UNNEST(payload.pages) AS page
LIMIT 5;

Or, equivalently, use JOIN rather than the comma , operator:

SELECT
  repository.url,
  page.page_name
FROM
  `bigquery-public-data.samples.github_nested`
JOIN
  UNNEST(payload.pages) AS page
LIMIT 5;

One important difference is that the legacy SQL query returns a row where payload.pages.page_name is NULL if payload.pages is empty. The standard SQL query, however, does not return a row if payload.pages is empty. To achieve exactly the same semantics, use a LEFT JOIN or LEFT OUTER JOIN. For example:

SELECT
  repository.url,
  page.page_name
FROM
  `bigquery-public-data.samples.github_nested`
LEFT JOIN
  UNNEST(payload.pages) AS page
LIMIT 5;

For more information on arrays, see the Working with Arrays topic. For more information on UNNEST, see the UNNEST topic.

Filtering rows with `OMIT RECORD IF`

The OMIT IF clause from legacy SQL allows you to filter rows based on a condition that can apply to repeated fields. In standard SQL, you can model an OMIT IF clause with an EXISTS clause, IN clause, or simple filter. For example, consider the following legacy SQL query:

SELECT
  repository.url,
FROM
  [bigquery-public-data:samples.github_nested]
OMIT RECORD IF
  EVERY(payload.pages.page_name != 'db_jobskill'
        AND payload.pages.page_name != 'Profession');

The analogous standard SQL query is:

SELECT
  repository.url
FROM
  `bigquery-public-data.samples.github_nested`
WHERE EXISTS (
  SELECT 1 FROM UNNEST(payload.pages)
  WHERE page_name = 'db_jobskill'
    OR page_name = 'Profession');

Here the EXISTS clause evaluates to true if there is at least one element of payload.pages where the page name is 'db_jobskill' or 'Profession'.

Alternatively, suppose that the legacy SQL query uses IN:

SELECT
  repository.url,
FROM
  [bigquery-public-data:samples.github_nested]
OMIT RECORD IF NOT
  SOME(payload.pages.page_name IN ('db_jobskill', 'Profession'));

In standard SQL, you can express the query using an EXISTS clause with IN:

SELECT
  repository.url
FROM
  `bigquery-public-data.samples.github_nested`
WHERE EXISTS (
  SELECT 1 FROM UNNEST(payload.pages)
  WHERE page_name IN ('db_jobskill', 'Profession'));

Consider the following legacy SQL query that filters records with 80 or fewer pages:

SELECT
  repository.url,
FROM
  [bigquery-public-data:samples.github_nested]
OMIT RECORD IF
  COUNT(payload.pages.page_name) <= 80;

In this case, you can use a filter with ARRAY_LENGTH in standard SQL:

SELECT
  repository.url
FROM
  `bigquery-public-data.samples.github_nested`
WHERE
  ARRAY_LENGTH(payload.pages) > 80;

Note that the ARRAY_LENGTH function applies to the repeated payload.pages field directly rather than the nested field payload.pages.page_name as in the legacy SQL query.

For more information on arrays and ARRAY subqueries, see the Working with Arrays topic.

Semantic differences

The semantics of some operations differ between legacy and standard SQL.

Automatic data type coercions

Both legacy and standard SQL support coercions (automatic conversions) between certain data types. For example, BigQuery coerces a value of type INT64 to FLOAT64 if the query passes it to a function that requires FLOAT64 as input. Standard SQL does not support the following coercions that legacy SQL supports. Instead, you must use an explicit CAST.

INT64 literal to TIMESTAMP. Instead, use TIMESTAMP_MICROS(micros_value).
STRING literal to INT64, FLOAT64, or BOOL. Instead, use CAST(str AS INT64), CAST(str AS FLOAT64), or CAST(str AS BOOL).
STRING to BYTES. Instead, use CAST(str AS BYTES).

Runtime errors

Some functions in legacy SQL return NULL for invalid input, potentially masking problems in queries or in data. Standard SQL is generally more strict, and raises an error if an input is invalid.

For all mathematical functions and operators, legacy SQL does not check for overflows. Standard SQL adds overflow checks, and raises an error if a computation overflows. This includes the +, -, * operators, the SUM, AVG, and STDDEV aggregate functions, and others.
Standard SQL raises an error upon division by zero, whereas legacy SQL returns NULL. To return NULL for division by zero in standard SQL, use SAFE_DIVIDE.
Standard SQL raises an error for CASTs where the input format is invalid or out of range for the target type, whereas legacy SQL returns NULL. To avoid raising an error for an invalid cast in standard SQL, use SAFE_CAST.

Nested repeated results

Queries executed using standard SQL preserve any nesting and repetition of the columns in the result, and the "Flatten Results" option has no effect. To return top-level columns for nested fields, use the .* operator on struct columns. For example:

SELECT
  repository.*
FROM
  `bigquery-public-data.samples.github_nested`
LIMIT 5;

To return top-level columns for repeated nested fields (ARRAYs of STRUCTs), use a JOIN to take the cross product of the table's rows and the elements of the repeated nested field. For example:

SELECT
  repository.url,
  page.*
FROM
  `bigquery-public-data.samples.github_nested`
JOIN
  UNNEST(payload.pages) AS page
LIMIT 5;

For more information on arrays and ARRAY subqueries, see the Working with Arrays topic.

Differences in user-defined JavaScript functions

The User-Defined Functions topic documents how to use JavaScript user-defined functions with standard SQL. This section explains some of the key differences between user-defined functions in legacy and standard SQL.

Functions in the query text

With standard SQL, you use CREATE TEMPORARY FUNCTION as part of the query body rather than specifying user-defined functions separately. Consider the following standard SQL query:

-- Computes the harmonic mean of the elements in 'arr'.
-- The harmonic mean of x_1, x_2, ..., x_n can be expressed as:
--   n / ((1 / x_1) + (1 / x_2) + ... + (1 / x_n))
CREATE TEMPORARY FUNCTION HarmonicMean(arr ARRAY<FLOAT64>)
  RETURNS FLOAT64 LANGUAGE js AS """
var sum_of_reciprocals = 0;
for (var i = 0; i < arr.length; ++i) {
  sum_of_reciprocals += 1 / arr[i];
}
return arr.length / sum_of_reciprocals;
""";

WITH T AS (
  SELECT GENERATE_ARRAY(1.0, x * 4, x) AS arr
  FROM UNNEST([1, 2, 3, 4, 5]) AS x
)
SELECT arr, HarmonicMean(arr) AS h_mean
FROM T;

This query defines a JavaScript function named HarmonicMean and then applies it to the array column arr from T.

For more information on user-defined functions, see the User-Defined Functions topic.

Functions operate on values rather than rows

In legacy SQL, JavaScript functions operate on rows from a table. In standard SQL, as in the example above, JavaScript functions operate on values. To pass a row value to a JavaScript function using standard SQL, define a function that takes a struct of the same row type as the table. For example:

-- Takes a struct of x, y, and z and returns a struct with a new field foo.
CREATE TEMPORARY FUNCTION AddField(s STRUCT<x FLOAT64, y BOOL, z STRING>)
  RETURNS STRUCT<x FLOAT64, y BOOL, z STRING, foo STRING> LANGUAGE js AS """
var new_struct = new Object();
new_struct.x = s.x;
new_struct.y = s.y;
new_struct.z = s.z;
if (s.y) {
  new_struct.foo = 'bar';
} else {
  new_struct.foo = 'baz';
}

return new_struct;
""";

WITH T AS (
  SELECT x, MOD(off, 2) = 0 AS y, CAST(x AS STRING) AS z
  FROM UNNEST([5.0, 4.0, 3.0, 2.0, 1.0]) AS x WITH OFFSET off
)
SELECT AddField(t).*
FROM T AS t;

This query defines a JavaScript function that takes a struct with the same row type as T and creates a new struct with an additional field named foo. The SELECT statement passes the row t as input to the function and uses .* to return the fields of the resulting struct in the output.

Standard SQL highlights

This section discusses some of the highlights of standard SQL compared to legacy SQL.

Composability using `WITH` clauses

Some of the standard SQL examples on this page make use of a WITH clause, which enables extraction or reuse of named subqueries. For example:

WITH T AS (
  SELECT x FROM UNNEST([1, 2, 3, 4]) AS x
)
SELECT x / (SELECT SUM(x) FROM T) AS weighted_x
FROM T;

This query defines a names subquery T that contains x values of 1, 2, 3, and 4. It selects x values from T and divides them by the sum of all x values in T. This query is equivalent to a query where the contents of T are inline:

SELECT
  x / (SELECT SUM(x)
       FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x)) AS weighted_x
FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x);

As another example, consider this query, which uses multiple names subqueries:

WITH T AS (
  SELECT x FROM UNNEST([1, 2, 3, 4]) AS x
),
TPlusOne AS (
  SELECT x + 1 AS y
  FROM T
),
TPlusOneTimesTwo AS (
  SELECT y * 2 AS z
  FROM TPlusOne
)
SELECT z
FROM TPlusOneTimesTwo;

This query defines a sequence of transformations of the original data, followed by a SELECT statement over TPlusOneTimesTwo. This query is equivalent to the following query, which inlines the computations:

SELECT (x + 1) * 2 AS z
FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x);

For more information, see the WITH clause topic in the documentation.

Composability using SQL functions

Standard SQL supports user-defined SQL functions. You can use user-defined SQL functions to define common expressions and then reference them from the query. For example:

-- Computes the harmonic mean of the elements in 'arr'.
-- The harmonic mean of x_1, x_2, ..., x_n can be expressed as:
--   n / ((1 / x_1) + (1 / x_2) + ... + (1 / x_n))
CREATE TEMPORARY FUNCTION HarmonicMean(arr ARRAY<FLOAT64>) AS
(
  ARRAY_LENGTH(arr) / (SELECT SUM(1 / x) FROM UNNEST(arr) AS x)
);

WITH T AS (
  SELECT GENERATE_ARRAY(1.0, x * 4, x) AS arr
  FROM UNNEST([1, 2, 3, 4, 5]) AS x
)
SELECT arr, HarmonicMean(arr) AS h_mean
FROM T;

This query defines a SQL function named HarmonicMean and then applies it to the array column arr from T.

Subqueries in more places

Standard SQL supports subqueries in the SELECT list, WHERE clause, and anywhere else in the query that expects an expression. For example, consider the following standard SQL query that computes the fraction of warm days in Seattle in 2015:

WITH SeattleWeather AS (
  SELECT *
  FROM `bigquery-public-data.noaa_gsod.gsod2015`
  WHERE stn = '994014'
)
SELECT
  COUNTIF(max >= 70) /
    (SELECT COUNT(*) FROM SeattleWeather) AS warm_days_fraction
FROM SeattleWeather;

The Seattle weather station has an ID of '994014'. The query computes the number of warm days based on those where the temperature reached 70 degrees Fahrenheit, or approximately 21 degrees Celsius, divided by the total number of recorded days for that station in 2015.

Correlated subqueries

In standard SQL, subqueries can reference correlated columns; that is, columns that originate from the outer query. For example, consider the following standard SQL query:

WITH WashingtonStations AS (
  SELECT weather.stn AS station_id, ANY_VALUE(station.name) AS name
  FROM `bigquery-public-data.noaa_gsod.stations` AS station
  INNER JOIN `bigquery-public-data.noaa_gsod.gsod2015` AS weather
  ON station.usaf = weather.stn
  WHERE station.state = 'WA' AND station.usaf != '999999'
  GROUP BY station_id
)
SELECT washington_stations.name,
  (SELECT COUNT(*)
   FROM `bigquery-public-data.noaa_gsod.gsod2015` AS weather
   WHERE washington_stations.station_id = weather.stn
   AND max >= 70) AS warm_days
FROM WashingtonStations AS washington_stations
ORDER BY warm_days DESC;

This query computes the names of weather stations in Washington state and the number of days in 2015 that the temperature reached 70 degrees Fahrenheit, or approximately 21 degrees Celsius. Notice that there is a subquery in the SELECT list, and that the subquery references washington_stations.station_id from the outer scope, namely FROM WashingtonStations AS washington_stations.

Arrays and structs

ARRAY and STRUCT are powerful concepts in standard SQL. As an example that uses both, consider the following query, which computes the top two articles for each day in the HackerNews dataset:

WITH TitlesAndScores AS (
  SELECT
    ARRAY_AGG(STRUCT(title, score)) AS titles,
    EXTRACT(DATE FROM time_ts) AS date
  FROM `bigquery-public-data.hacker_news.stories`
  WHERE score IS NOT NULL AND title IS NOT NULL
  GROUP BY date)
SELECT date,
  ARRAY(SELECT AS STRUCT title, score
        FROM UNNEST(titles)
        ORDER BY score DESC
        LIMIT 2)
  AS top_articles
FROM TitlesAndScores
ORDER BY date DESC;

The WITH clause defines TitlesAndScores, which contains two columns. The first is an array of structs, where one field is an article title and the second is a score. The ARRAY_AGG expression returns an array of these structs for each day.

The SELECT statement following the WITH clause uses an ARRAY subquery to sort and return the top two articles within each array in accordance with the score, then returns the results in descending order by date.

For more information on arrays and ARRAY subqueries, see the Working with Arrays topic. See also the references for arrays and structs.

Comparison of legacy and standard SQL

Do I have to migrate to standard SQL?

Enabling standard SQL

Advantages of standard SQL

Type differences

Label differences

Converting REQUIRED fields to NULLABLE

BigQuery UI

BigQuery API

REPEATED fields

Syntax differences

Escaping reserved keywords and invalid identifiers

Project-qualified table names

Table decorators and wildcard functions

Trailing commas in the SELECT list

Comma operator with tables

Logical views

Function comparison

COUNT function comparison

URL function comparison

Differences in repeated field handling

Selecting nested repeated leaf fields

Filtering repeated fields

Structure of selected nested leaf fields

Removing repetition with FLATTEN

Filtering rows with OMIT RECORD IF

Semantic differences

Automatic data type coercions

Runtime errors

Nested repeated results

Differences in user-defined JavaScript functions

Functions in the query text

Functions operate on values rather than rows

Standard SQL highlights

Composability using WITH clauses

Composability using SQL functions

Subqueries in more places

Correlated subqueries

Arrays and structs

Send feedback about...

Converting `REQUIRED` fields to `NULLABLE`

`REPEATED` fields

Trailing commas in the `SELECT` list

`COUNT` function comparison

Removing repetition with `FLATTEN`

Filtering rows with `OMIT RECORD IF`

Composability using `WITH` clauses