This document describes how to deal with BigQuery data, such as setting parameters and handling nested and repeated fields.
Parameters
REST API methods accept three types of parameters: path parameters, query parameters, and body parameters. The following method signature demonstrates all three parameter types:
PUT https://www.googleapis.com/bigquery/v2/projects/{projectId}/datasets/{datasetId}/tables/{tableId}?userIp="192.0.2.211"
{
  "friendlyName": string,
  "description": string
}
- projectId, datasetId, and tableId are all path parameters
- userIp is a query parameter
- friendlyName and description are both body parameters
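As a rough illustration of where each parameter type ends up in a request, the sketch below assembles the method, URL, and JSON body for the signature above using only the Python standard library. The helper name build_update_request is hypothetical and is not part of any BigQuery client library. It only builds the pieces of the request and does not send anything.

```python
import json
from urllib.parse import quote, urlencode

BASE_URL = "https://www.googleapis.com/bigquery/v2"


def build_update_request(project_id, dataset_id, table_id, query_params, body_params):
    """Return (method, url, body) for a table update call (hypothetical helper)."""
    # Path parameters are substituted into the URL path.
    path = "/projects/{}/datasets/{}/tables/{}".format(
        quote(project_id, safe=""),
        quote(dataset_id, safe=""),
        quote(table_id, safe=""),
    )
    url = BASE_URL + path
    # Query parameters are appended after the '?'.
    if query_params:
        url += "?" + urlencode(query_params)
    # Body parameters are serialized as the JSON request body.
    body = json.dumps(body_params)
    return "PUT", url, body


method, url, body = build_update_request(
    "1234", "5678", "9012",
    {"userIp": "192.0.2.211"},
    {"friendlyName": "Donut Count", "description": "Worldwide donut usage count"},
)
```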
The API documentation lists all the query parameters defined specifically by BigQuery. Query parameters that apply to all operations are shown below.
Setting Parameters
Client libraries differ in how they set each type of parameter. For example, when using the Python client, you set path and query parameters the same way (as keyword arguments), but pass body parameters through a separate body argument:
updateResponse = tableCollection.update(
    projectId='1234',                      # Path param
    datasetId='5678',                      # Path param
    tableId='9012',                        # Path param
    userIp='192.0.2.211',                  # Query param
    body={'friendlyName': 'Donut Count',   # Body params
          'description': 'Worldwide donut usage count'})
Query parameters that apply to all Google BigQuery API operations are shown in the table below.
Notes (on API keys and auth tokens):
- The key parameter is required with every request, unless you provide an OAuth 2.0 token with the request.
- You must send an authorization token with every request that requires an OAuth scope. OAuth 2.0 is the only supported authorization protocol.
- You can provide an OAuth 2.0 token with any request in one of two ways:
  - Using the access_token query parameter like this: ?access_token=oauth2-token
  - Using the HTTP Authorization header like this: Authorization: Bearer oauth2-token
All parameters are optional except where noted.
| Parameter | Meaning | Notes |
|---|---|---|
| access_token | OAuth 2.0 token for the current user. | |
| callback | Callback function. | |
| fields | Selector specifying a subset of fields to include in the response. | |
| key | API key. (REQUIRED*) | |
| prettyPrint | Returns response with indentations and line breaks. | |
| quotaUser | Alternative to userIp. | |
| userIp | IP address of the end user for whom the API call is being made. | |
Paging Through list Results
All collection.list methods return paginated results under certain circumstances. The number of results per page is controlled by the maxResults property.
| Method | Pagination criteria | Default maxResults value | Maximum maxResults value |
|---|---|---|---|
| Tabledata.list | Returns paginated results if the response size is more than 10 MB of serialized JSON or more than maxResults rows. | 100,000 | 100,000 |
| All other collection.list methods | Returns paginated results if the response is more than maxResults rows. | 50 | 1,000 |
If you set maxResults to a value greater than the maximum value listed above, the results are paginated based on the maximum value.
A page is a subset of the total number of rows. If your results are more
than one page of data, the result data will have a pageToken property.
To retrieve the next page of results, make another list call and
include the token value as a URL parameter named pageToken.
The bigquery.tabledata.list method, which is used to page through table data, uses a row offset value or a page token. See Browsing Through Table Data for information.
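The paging loop described above can be sketched as a small generator. This is a minimal sketch rather than a client library implementation: list_call stands for any function that performs one list request for a given page token, and fake_list_call is a hypothetical in-memory stand-in so the example runs without network access. The response field names (rows and pageToken) follow the description above and are assumptions. Pass different names if the method you call uses other field names.

```python
def iterate_rows(list_call, rows_field="rows", token_field="pageToken"):
    """Yield every row from a paginated list method, one page at a time."""
    page_token = None
    while True:
        response = list_call(page_token)
        for row in response.get(rows_field, []):
            yield row
        # A missing or empty token means this was the last page.
        page_token = response.get(token_field)
        if not page_token:
            break


def fake_list_call(page_token, _data=tuple(range(5)), _page_size=2):
    """Hypothetical stand-in for a list request; the token is a row offset."""
    start = int(page_token) if page_token else 0
    end = start + _page_size
    response = {"rows": list(_data[start:end])}
    if end < len(_data):
        response["pageToken"] = str(end)
    return response


all_rows = list(iterate_rows(fake_list_call))
```

With a real client, list_call would wrap the actual request and pass the token as the pageToken URL parameter.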
Client libraries for C#, Java, Go, Node.js, PHP, Python, and Ruby can page through BigQuery results. For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries. Several of these libraries paginate for you:
- Go: The Google Cloud Client Library for Go automatically paginates by default, so you do not need to implement pagination yourself.
- Node.js: The Google Cloud Client Library for Node.js automatically paginates by default, so you do not need to implement pagination yourself.
- PHP: Pagination happens automatically in the Google Cloud Client Library for PHP using the generator function rows, which fetches the next page of results during iteration.
- Ruby: Pagination happens automatically in the Google Cloud Client Library for Ruby using Table#data and Data#next.
Requesting Arbitrary Pages, and Avoiding Redundant List Calls
When you page backwards or jump to arbitrary pages using cached pageToken values, the data in those pages might have changed since you last viewed it, with no clear indication of the change. To mitigate this, you can use the ETag property.
Every collection.list method (except for Tabledata) returns an ETag property in the result. This property is a hash of the page results that can be used to verify whether the page has changed since the last request. When you make a request to BigQuery with an ETag value, BigQuery compares that value to the ETag value returned by the API and responds based on whether the ETag values match. You can use ETags to help avoid redundant list calls in the following ways:
- If you only want to return list values if the values have changed:
  Make a list call with a previously stored ETag using the HTTP "if-none-match" header. If the ETag you provide doesn't match the ETag on the server, BigQuery returns a page of new list values. If the ETags do match, BigQuery returns an HTTP 304 "Not Modified" result and no values. An example of this might be a webpage where users periodically fill in information that is stored in BigQuery. By using the if-none-match header with ETags, you avoid making redundant list calls to BigQuery when your data has not changed.
- If you only want to return list values if the values have not changed:
  Use the HTTP "if-match" header. BigQuery matches the ETag values and returns the page of results if the results have not changed, or returns a 412 "Precondition Failed" result if the page has changed.
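The two conditional behaviors above can be illustrated with a small local simulation. This sketch models the server side only and makes no real API call. The function names are hypothetical, and the SHA-256 hash is an assumption used for illustration, since the text does not specify how BigQuery computes its ETag values.

```python
import hashlib
import json


def compute_etag(page):
    """Hypothetical ETag: a hash of the serialized page results."""
    payload = json.dumps(page, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def conditional_list(current_page, if_none_match=None, if_match=None):
    """Return (status_code, page_or_None) following the rules described above."""
    server_etag = compute_etag(current_page)
    if if_none_match is not None and if_none_match == server_etag:
        return 304, None  # Not Modified: the page is unchanged, so no values
    if if_match is not None and if_match != server_etag:
        return 412, None  # Precondition Failed: the page has changed
    return 200, current_page


page = [{"id": "a"}]
stored_etag = compute_etag(page)           # saved from an earlier list call
changed_page = [{"id": "a"}, {"id": "b"}]  # the data after a later update

unchanged_result = conditional_list(page, if_none_match=stored_etag)
changed_result = conditional_list(changed_page, if_none_match=stored_etag)
stale_result = conditional_list(changed_page, if_match=stored_etag)
```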
Nested and Repeated Data
BigQuery supports loading and exporting nested and repeated data in the form of JSON and Avro files. Once loaded, you can query your repeated and nested data using the FLATTEN and WITHIN SQL functions.
You can try running these queries yourself by downloading the following sample files and uploading them to your BigQuery account:
FLATTEN
When you query nested data, BigQuery automatically flattens the table data for you. For example, let's take a look at personsDataSchema.json:
Last modified Schema Total Rows Total Bytes Expiration
----------------- ----------------------------------- ------------ ------------- ------------
27 Sep 10:01:06 |- kind: string 4 794
|- fullName: string (required)
|- age: integer
|- gender: string
+- phoneNumber: record
| |- areaCode: integer
| |- number: integer
+- children: record (repeated)
| |- name: string
| |- gender: string
| |- age: integer
+- citiesLived: record (repeated)
| |- place: string
| +- yearsLived: integer (repeated)
Notice that there are several repeated and nested fields. If you run the following query:
SELECT
  fullName AS name,
  age,
  gender,
  citiesLived.place,
  citiesLived.yearsLived
FROM [dataset.tableId]
BigQuery returns your data with a flattened output:
+---------------+-----+--------+-------------------+------------------------+
| name          | age | gender | citiesLived_place | citiesLived_yearsLived |
+---------------+-----+--------+-------------------+------------------------+
| John Doe      | 22  | Male   | Seattle           | 1995                   |
| John Doe      | 22  | Male   | Stockholm         | 2005                   |
| Mike Jones    | 35  | Male   | Los Angeles       | 1989                   |
| Mike Jones    | 35  | Male   | Los Angeles       | 1993                   |
| Mike Jones    | 35  | Male   | Los Angeles       | 1998                   |
| Mike Jones    | 35  | Male   | Los Angeles       | 2002                   |
| Mike Jones    | 35  | Male   | Washington DC     | 1990                   |
| Mike Jones    | 35  | Male   | Washington DC     | 1993                   |
| Mike Jones    | 35  | Male   | Washington DC     | 1998                   |
| Mike Jones    | 35  | Male   | Washington DC     | 2008                   |
| Mike Jones    | 35  | Male   | Portland          | 1993                   |
| Mike Jones    | 35  | Male   | Portland          | 1998                   |
| Mike Jones    | 35  | Male   | Portland          | 2003                   |
| Mike Jones    | 35  | Male   | Portland          | 2005                   |
| Mike Jones    | 35  | Male   | Austin            | 1973                   |
| Mike Jones    | 35  | Male   | Austin            | 1998                   |
| Mike Jones    | 35  | Male   | Austin            | 2001                   |
| Mike Jones    | 35  | Male   | Austin            | 2005                   |
| Anna Karenina | 45  | Female | Stockholm         | 1992                   |
| Anna Karenina | 45  | Female | Stockholm         | 1998                   |
| Anna Karenina | 45  | Female | Stockholm         | 2000                   |
| Anna Karenina | 45  | Female | Stockholm         | 2010                   |
| Anna Karenina | 45  | Female | Russia            | 1998                   |
| Anna Karenina | 45  | Female | Russia            | 2001                   |
| Anna Karenina | 45  | Female | Russia            | 2005                   |
| Anna Karenina | 45  | Female | Austin            | 1995                   |
| Anna Karenina | 45  | Female | Austin            | 1999                   |
+---------------+-----+--------+-------------------+------------------------+
In this example, citiesLived.place is now citiesLived_place and citiesLived.yearsLived is now citiesLived_yearsLived.
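To make the shape of the flattened output concrete, the sketch below reproduces it in plain Python for one record. It only illustrates the row expansion and is not how BigQuery implements flattening. The function name flatten_record is hypothetical, the record values are taken from the John Doe rows above, and records with an empty citiesLived list are simply skipped here for brevity.

```python
def flatten_record(record):
    """Yield one flat row per (citiesLived entry, yearsLived value) pair."""
    for city in record["citiesLived"]:
        for year in city["yearsLived"]:
            yield {
                "name": record["fullName"],
                "age": record["age"],
                "gender": record["gender"],
                # Nested field paths become underscore-joined column names.
                "citiesLived_place": city["place"],
                "citiesLived_yearsLived": year,
            }


john = {
    "fullName": "John Doe",
    "age": 22,
    "gender": "Male",
    "citiesLived": [
        {"place": "Seattle", "yearsLived": [1995]},
        {"place": "Stockholm", "yearsLived": [2005]},
    ],
}

rows = list(flatten_record(john))
```

Each nested value ends up on its own row, with the non-repeated fields (name, age, gender) duplicated on every row.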
Although BigQuery can automatically flatten nested fields, you may need to explicitly call FLATTEN when dealing with more than one repeated field. For example, if you try to run the following query:
SELECT fullName, age
FROM [dataset.tableId]
WHERE (citiesLived.yearsLived > 1995) AND (children.age > 3)
BigQuery returns an error similar to the following:
Cannot query the cross product of repeated fields children.age and citiesLived.yearsLived
To query across more than one repeated field, you need to flatten one of the fields:
SELECT fullName, age, gender, citiesLived.place
FROM (FLATTEN([dataset.tableId], children))
WHERE (citiesLived.yearsLived > 1995) AND (children.age > 3)
GROUP BY fullName, age, gender, citiesLived.place
Which returns:
+------------+-----+--------+-------------------+
| fullName   | age | gender | citiesLived_place |
+------------+-----+--------+-------------------+
| John Doe   | 22  | Male   | Stockholm         |
| Mike Jones | 35  | Male   | Los Angeles       |
| Mike Jones | 35  | Male   | Washington DC     |
| Mike Jones | 35  | Male   | Portland          |
| Mike Jones | 35  | Male   | Austin            |
+------------+-----+--------+-------------------+
WITHIN Clause
The WITHIN keyword works with aggregate functions to aggregate across children and repeated fields within records and nested fields. When you specify the WITHIN keyword, you need to specify the scope over which you want to aggregate:
- WITHIN RECORD: Aggregates data in the repeated values within the record.
- WITHIN node_name: Aggregates data in the repeated values within the specified node, where a node is a parent node of the field in the aggregation function.
Suppose that you want to find the number of children each person in our previous example has. To do so, you can count the number of children.name each record has:
SELECT fullName, COUNT(children.name) WITHIN RECORD AS numberOfChildren FROM [dataset.tableId];
You get the following result:
+---------------+------------------+
| fullName      | numberOfChildren |
+---------------+------------------+
| John Doe      | 2                |
| Jane Austen   | 2                |
| Mike Jones    | 3                |
| Anna Karenina | 0                |
+---------------+------------------+
To compare, try listing all of the children's names:
SELECT fullName, children.name FROM [dataset.tableId]
+---------------+---------------+
| fullName      | children_name |
+---------------+---------------+
| John Doe      | Jane          |
| John Doe      | John          |
| Jane Austen   | Josh          |
| Jane Austen   | Jim           |
| Mike Jones    | Earl          |
| Mike Jones    | Sam           |
| Mike Jones    | Kit           |
| Anna Karenina | None          |
+---------------+---------------+
This matches our WITHIN RECORD query results: John Doe has two children named Jane and John, Jane Austen has two children named Josh and Jim, Mike Jones has three children named Earl, Sam, and Kit, and Anna Karenina doesn't have any children.
Now, suppose that you want to find the number of times a person has lived in different places. You can use the WITHIN clause to aggregate across one particular node:
SELECT
  fullName,
  COUNT(citiesLived.place) WITHIN RECORD AS numberOfPlacesLived,
  citiesLived.place,
  COUNT(citiesLived.yearsLived) WITHIN citiesLived AS numberOfTimesInEachCity
FROM [dataset.tableId];
+---------------+---------------------+-------------------+-------------------------+
| fullName      | numberOfPlacesLived | citiesLived_place | numberOfTimesInEachCity |
+---------------+---------------------+-------------------+-------------------------+
| John Doe      | 2                   | Seattle           | 1                       |
| John Doe      | 2                   | Stockholm         | 1                       |
| Mike Jones    | 4                   | Los Angeles       | 4                       |
| Mike Jones    | 4                   | Washington DC     | 4                       |
| Mike Jones    | 4                   | Portland          | 4                       |
| Mike Jones    | 4                   | Austin            | 4                       |
| Anna Karenina | 3                   | Stockholm         | 4                       |
| Anna Karenina | 3                   | Russia            | 3                       |
| Anna Karenina | 3                   | Austin            | 2                       |
+---------------+---------------------+-------------------+-------------------------+
This query does the following:
- Performs a WITHIN RECORD on citiesLived.place and counts the number of places each person has lived in.
- Performs a WITHIN on citiesLived.yearsLived and counts the number of times each person has lived in each city (counting just across citiesLived).
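The two aggregation scopes can be mimicked in plain Python to show what each count ranges over. This is only a sketch of the semantics, not BigQuery's implementation. The function name scoped_counts is hypothetical, and the record is built from the Anna Karenina rows shown earlier.

```python
def scoped_counts(record):
    """Return one row per city with record-scoped and node-scoped counts."""
    cities = record.get("citiesLived", [])
    # Scope = the whole record: count every citiesLived.place value.
    places_lived = sum(1 for city in cities if city.get("place") is not None)
    rows = []
    for city in cities:
        rows.append({
            "fullName": record["fullName"],
            "numberOfPlacesLived": places_lived,
            "citiesLived_place": city["place"],
            # Scope = one citiesLived node: count only its yearsLived values.
            "numberOfTimesInEachCity": len(city.get("yearsLived", [])),
        })
    return rows


anna = {
    "fullName": "Anna Karenina",
    "citiesLived": [
        {"place": "Stockholm", "yearsLived": [1992, 1998, 2000, 2010]},
        {"place": "Russia", "yearsLived": [1998, 2001, 2005]},
        {"place": "Austin", "yearsLived": [1995, 1999]},
    ],
}

result = scoped_counts(anna)
```

The record-scoped count is the same on every row for a person, while the node-scoped count varies per city, as in the query output above.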
Using scoped aggregation over nested and repeated fields is one of BigQuery's most powerful features, which can often eliminate expensive joins in queries.