This page describes how to query data stored outside of BigQuery.
An external data source (also known as a federated data source) is a data source that you can query directly even though the data is not stored in BigQuery. Instead of loading or streaming the data, you create a table that references the external data source. You can create a permanent external table or a temporary table.
BigQuery offers support for querying data directly from Google Cloud Storage and Google Drive. Use cases include:
- Loading and cleaning your data in one pass by querying the data from an external data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
- Having a small amount of frequently changing data that you join with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated.
Limitations include the following:
- BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
- If query speed is a priority, load the data into BigQuery instead of setting up an external data source. The performance of a query that includes an external data source depends on the external storage type. For example, querying data stored in Google Cloud Storage is faster than querying data stored in Google Drive. In general, query performance for external data sources should be equivalent to reading the data directly from the external storage.
- You cannot run a BigQuery job that exports data from an external data source.
- You cannot use the TableDataList JSON API method to retrieve data from tables that reside in an external data source. For more information, see Tabledata: list.
BigQuery supports the following data-source formats:
| Google Cloud Storage | Google Drive |
|---|---|
| Comma-separated values (CSV) | Comma-separated values (CSV) |
| JSON (newline-delimited) | JSON (newline-delimited) |
| Avro files | Avro files |
| Google Cloud Datastore backup files | Google Sheets (first tab only) |
To directly query an external data source, provide the Cloud Storage or Google Drive URI path to your data and create an external table that references that data source. For CSV and JSON files, you can also include the table schema. If you omit the schema, BigQuery attempts to automatically detect the schema. For Avro and Cloud Datastore backup files, BigQuery reads the schema from the data source.
Before you begin
- Identify a data file that you plan to use as your external data source.
- If you plan to use the bq command-line tool instead of the web UI, ensure that you have activated BigQuery and installed the gcloud SDK. For more information, see the Before you begin section of the command-line tool quickstart.
- The command-line tool examples on this page assume that you have set a default project so that you can omit the project_id flag. For information on how to set a default project, see the gcloud SDK documentation for gcloud config set. A minimal example follows this list.
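For example, you can set a default project with the following command, where [PROJECT_ID] is your project ID:

gcloud config set project [PROJECT_ID]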
Creating external tables
An external table is a permanent table that references an external data source, which means that the data is not stored in BigQuery. To create an external table, use the external table type. Because the table is permanent, you can share the table with others who also have access to the underlying external data source.
Creating an external table using Cloud Storage
Web UI
1. Go to the BigQuery web UI.
2. In the navigation, hover on a dataset ID, click the down arrow icon next to the ID, and click Create new table.
3. Under Source data, select Google Cloud Storage for the Location, and then enter the URI to your Cloud Storage file. For example, enter gs://mybucket/yob2010.csv for a file named yob2010.csv in the mybucket Cloud Storage bucket.
4. Choose the format of your data. Valid formats include the following:
   - Comma-separated values (CSV)
   - JSON (newline delimited)
   - Avro
   - Cloud Datastore backup
5. Under Destination Table, input a table ID, then select External table for the Table type field.
6. For CSV and JSON files, input the table schema. For Cloud Datastore backups and Avro files, you can omit the table schema and let BigQuery detect the schema.
7. Click the Submit button to complete the process.
You can then run a query against the table as if it were a native BigQuery table, subject to the limitations of external data sources.
After your query completes, you can download the results as CSV or JSON, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.
CLI
To follow along with the example below:

1. Create a CSV file named fed-sample.csv that contains the following sample data:

   1,Alice,10,one
   2,Alex,11,one
   3,Bob,11,two

2. Upload the file to a Cloud Storage bucket using the gsutil tool (replace [MY_BUCKET] with the name of a Cloud Storage bucket):

   gsutil cp fed-sample.csv gs://[MY_BUCKET]/fed-sample.csv

To create an external table:

1. Create the table by running the following command with your own values for the dataset ID ([DATASET_ID]):

   bq mk --external_table_definition=[SCHEMA]@[SOURCE_FORMAT]=[CLOUD_STORAGE_URI] [DATASET_ID].[TABLE_NAME]

   Where [SOURCE_FORMAT] is one of: CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP.

   For example, the following command uses the fed-sample.csv file that is part of a dataset named mydataset:

   bq mk --external_table_definition=id:integer,name:string,num:integer,ord:string@CSV=gs://[MY_BUCKET]/fed-sample.csv mydataset.fed_sample

   Where [MY_BUCKET] is a valid Cloud Storage bucket name.

   If your schema is more complicated, you can store it in a separate file in JSON format. For example, the following JSON object represents the schema for the fed-sample.csv data:

   [
     { "name": "id", "type": "integer", "mode": "nullable" },
     { "name": "name", "type": "string", "mode": "nullable" },
     { "name": "num", "type": "integer", "mode": "nullable" },
     { "name": "ord", "type": "string", "mode": "nullable" }
   ]

   If the JSON schema is stored in a file named fed-sample-schema.json in your /tmp directory, you can refer to that file instead of supplying the schema inline:

   bq mk --external_table_definition=/tmp/fed-sample-schema.json@CSV=gs://[MY_BUCKET]/fed-sample.csv mydataset.fed_sample

2. Run a query against the table by running the following command:

   bq query --project_id=[PROJECT_ID] 'select name from [DATASET_ID].fed_sample where name contains "Alex";'

   Where [PROJECT_ID] is a valid project ID.
API
The sourceUris must be fully qualified, in the format gs://[BUCKET]/[OBJECT].
Specify the data format by setting the externalDataConfiguration.sourceFormat property. Valid values include:

- CSV
- NEWLINE_DELIMITED_JSON
- AVRO
- DATASTORE_BACKUP
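For example, the following is a minimal sketch of the externalDataConfiguration portion of a table resource passed to tables.insert, based on the fed-sample.csv example above (the bucket name is a placeholder):

{
  "externalDataConfiguration": {
    "sourceFormat": "CSV",
    "sourceUris": [
      "gs://[MY_BUCKET]/fed-sample.csv"
    ],
    "schema": {
      "fields": [
        { "name": "id", "type": "integer" },
        { "name": "name", "type": "string" },
        { "name": "num", "type": "integer" },
        { "name": "ord", "type": "string" }
      ]
    }
  }
}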
Creating an external table using Google Drive
Before you begin, ensure that you have enabled Google Drive access and the Google Drive API.

To specify a Google Drive file, use one of these URI formats:

| Source Format | URL Format |
|---|---|
| CSV, JSON, Avro | https://drive.google.com/open?id=[FILE_ID] |
| Google Sheets | https://docs.google.com/spreadsheets/d/[FILE_ID]/edit#gid=[IDENTIFIER] |
Web UI
1. Go to the BigQuery web UI.
2. In the navigation, hover on a dataset ID, click the down arrow icon next to the ID, and click Create new table.
3. Under Source data, select Google Drive for the Location, and then enter the URI to your Google Drive file.
4. Select the format of your data. For Google Sheets, select the CSV option. Valid formats include the following:
   - Comma-separated values (CSV)
   - JSON (newline delimited)
   - Avro
   - Google Sheets (first tab only)
5. Under Destination Table, input a table ID, then select External table for the Table type field.
6. For CSV and JSON files, input the table schema. For Avro files, you can omit the table schema and let BigQuery detect the schema.
7. Click the Submit button to complete the process.
You can then run a query against the table as if it were a native BigQuery table, subject to the limitations of external data sources.
After your query completes, you can download the results as CSV or JSON, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.
CLI
To follow along with the example below:

1. Create a CSV file named fed-sample.csv that contains the following sample data:

   1,Alice,10,one
   2,Alex,11,one
   3,Bob,11,two

2. Upload the file to Google Drive.

To create an external table:

1. Create the table by running the following command with your own values:

   bq mk --external_table_definition=[SCHEMA]@[SOURCE_FORMAT]=[GOOGLE_DRIVE_URL] [DATASET_ID].[TABLE_NAME]

   Where [SOURCE_FORMAT] is one of: CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.

   For example, the following command uses the fed-sample.csv file that is part of a dataset named mydataset:

   bq mk --external_table_definition=id:integer,name:string,num:integer,ord:string@CSV=[GOOGLE_DRIVE_URL] mydataset.fed_sample

2. Run a query against the table by running the following command:

   bq query 'select name from [DATASET_ID].fed_sample where name contains "Alex";'

   Where [DATASET_ID] is a valid dataset ID.
API
The sourceUris must be fully qualified, in the format https://drive.google.com/open?id=[FILE_ID] or https://docs.google.com/spreadsheets/d/[FILE_ID]/edit#gid=[IDENTIFIER].
Specify the data format by setting the externalDataConfiguration.sourceFormat property. Valid values include:

- CSV
- NEWLINE_DELIMITED_JSON
- AVRO
- GOOGLE_SHEETS
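For example, a minimal sketch of the configuration for a Google Sheets source might look like the following (the file ID and identifier are placeholders):

{
  "externalDataConfiguration": {
    "sourceFormat": "GOOGLE_SHEETS",
    "autodetect": true,
    "sourceUris": [
      "https://docs.google.com/spreadsheets/d/[FILE_ID]/edit#gid=[IDENTIFIER]"
    ]
  }
}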
Querying external data using temporary tables
To directly query an external data source without creating a permanent table, run a query that creates a temporary table. Querying a temporary table is useful for one-time, ad-hoc queries over remote data, or for extract, transform, and load (ETL) processes.
The example below creates a table definition file that contains the schema. The example then uses the definition file and the data source URIs to create and query a temporary table.
Querying data from Cloud Storage
CLI
1. Create a table definition by running one of the following commands, which creates a table definition and pipes the output to the file /tmp/wikipedia:

   - CSV:

     bq mkdef --source_format=CSV "gs://[BUCKET_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

     The table is in CSV format and has four fields: id, name, category, and depth.

   - JSON:

     bq mkdef --source_format=NEWLINE_DELIMITED_JSON "gs://[BUCKET_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

     The table is in JSON (newline delimited) format and has four fields: id, name, category, and depth.

   - Avro:

     bq mkdef --source_format=AVRO "gs://[BUCKET_URI]" > /tmp/wikipedia

     If you use an Avro file as your external data source, BigQuery can detect the schema if you don't specify one.

   - Cloud Datastore:

     bq mkdef --source_format=DATASTORE_BACKUP "gs://[BUCKET_URI]" > /tmp/wikipedia

     If you use a Cloud Datastore backup as your external data source, BigQuery can auto-detect the schema if you don't specify one.

   The [BUCKET_URI] comprises your bucket name and your object (filename). For example, if the Cloud Storage bucket is named mybucket and the data file is named fed-sample.csv, the bucket URI would be gs://mybucket/fed-sample.csv.

   If your table is separated into multiple files that share a common base name, you can refer to the files using a wildcard: append an asterisk (*) to the base name and enclose the bucket and filename in quotes. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the bucket URI would be "gs://mybucket/fed-sample*". You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.

2. Run a query against the table by running the following command:

   bq query --project_id=[PROJECT_ID] --external_table_definition=wikipedia::/tmp/wikipedia 'select name from wikipedia where name contains "Alex";'

   This command runs a query that references the table as wikipedia and provides the table definition with the file created in the previous step.
API
Create a query. See Querying Data for information about calling jobs.query and jobs.insert.
Specify the external data source by setting the configuration.query.tableDefinitions property.
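As a sketch, a jobs.insert request body that defines and queries a temporary external table named wikipedia (mirroring the CLI example above) might look like the following:

{
  "configuration": {
    "query": {
      "query": "select name from wikipedia where name contains \"Alex\"",
      "tableDefinitions": {
        "wikipedia": {
          "sourceFormat": "CSV",
          "sourceUris": [
            "gs://[BUCKET_URI]"
          ],
          "schema": {
            "fields": [
              { "name": "id", "type": "integer" },
              { "name": "name", "type": "string" },
              { "name": "category", "type": "integer" },
              { "name": "depth", "type": "string" }
            ]
          }
        }
      }
    }
  }
}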
Querying data from Google Drive
Before you begin, ensure that you have enabled Google Drive access.
To specify a Google Drive file, use one of these URI formats:

| Source Format | URL Format |
|---|---|
| CSV, JSON, Avro | https://drive.google.com/open?id=[FILE_ID] |
| Google Sheets | https://docs.google.com/spreadsheets/d/[FILE_ID]/edit#gid=[IDENTIFIER] |
CLI
1. Create a table definition by running one of the following commands, which creates a table definition and pipes the output to the file /tmp/wikipedia:

   - CSV:

     bq mkdef --source_format=CSV "[GOOGLE_DRIVE_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

     The table is in CSV format and has four fields: id, name, category, and depth.

   - JSON:

     bq mkdef --source_format=NEWLINE_DELIMITED_JSON "[GOOGLE_DRIVE_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

     The table is in JSON (newline delimited) format and has four fields: id, name, category, and depth.

   - Avro:

     bq mkdef --source_format=AVRO "[GOOGLE_DRIVE_URI]" > /tmp/wikipedia

     If you use an Avro file as your external data source, BigQuery can detect the schema if you don't specify one.

   - Google Sheets:

     bq mkdef --source_format=GOOGLE_SHEETS "[GOOGLE_DRIVE_URI]" > /tmp/wikipedia

     The source is the first tab of a Google Sheets file.

2. Run a query against the table by running the following command:

   bq query --project_id=[PROJECT_ID] --external_table_definition=wikipedia::/tmp/wikipedia 'select name from wikipedia where name contains "Alex";'

   This command runs a query that references the table as wikipedia and provides the table definition with the file created in the previous step.
API
Create a query. See Querying Data for information about calling jobs.query and jobs.insert.
Specify the external data source by setting the configuration.query.tableDefinitions property.
Creating table definition files
A table definition file contains metadata about a table, such as the table's source format and related properties. Table definition files can be used to create permanent external tables or temporary external tables. The command-line tool and the API support the creation and use of table definition files.
To create and use a table definition:
CLI
1. Use the command-line tool's mkdef command to create a table definition, or create the table definition manually. The mkdef command generates a table definition file in JSON format.
2. If necessary, manually edit the table definition file to modify, add, or delete custom parameters.
3. Create an external table by using bq mk, or query a temporary external table by using bq query, passing the table definition file as the value for the --external_table_definition flag.
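Putting these steps together, a minimal sketch using the fed-sample.csv example from earlier (the bucket name is a placeholder):

# Generate a table definition file in JSON format.
bq mkdef --source_format=CSV "gs://[MY_BUCKET]/fed-sample.csv" id:integer,name:string,num:integer,ord:string > /tmp/fed-sample-def

# Query a temporary external table using the definition file.
bq query --external_table_definition=fed_sample::/tmp/fed-sample-def 'select name from fed_sample where name contains "Alex";'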
API
If you are using the BigQuery API, define a table definition as part of your job configuration. For more information, see configuration.query.tableDefinitions in the API Reference.
Many table definition configuration settings apply to more than one source format. For detailed information about settings and source formats, see the API Reference descriptions for each of the following settings:
| General table definition settings | |
|---|---|
| autodetect | Type: boolean. Default value: true (schema auto-detection is enabled by default). Description: See CSV and JSON schema auto-detection. API Reference: configuration.query.tableDefinitions.(key).autodetect |
| compression | Type: string. Default value: NONE. Valid values: NONE, GZIP. API Reference: configuration.query.tableDefinitions.(key).compression |
| ignoreUnknownValues | Type: boolean. Default value: false. API Reference: configuration.query.tableDefinitions.(key).ignoreUnknownValues |
| maxBadRecords | Type: integer. Default value: 0. API Reference: configuration.query.tableDefinitions.(key).maxBadRecords |
| schema | Type: nested object. API Reference: configuration.query.tableDefinitions.(key).schema |
| sourceFormat | Type: string. API Reference: configuration.query.tableDefinitions.(key).sourceFormat |
| sourceUris | Type: list. API Reference: configuration.query.tableDefinitions.(key).sourceUris |
For the complete list of table definition configuration settings, see configuration.query.tableDefinitions in the API Reference.
Table definitions based on CSV source files
The following command creates a table definition based on a CSV file named fed-sample.csv in a Cloud Storage bucket named [BUCKET], where the schema is stored in /tmp/fed-sample-schema.json and the generated definition file is stored in /tmp/fed-sample-def:
bq mkdef --source_format=CSV 'gs://[BUCKET]/fed-sample.csv' /tmp/fed-sample-schema.json > /tmp/fed-sample-def
The contents of fed-sample-def (schema values omitted for brevity):
{
"csvOptions": {
"allowJaggedRows": false,
"allowQuotedNewlines": false,
"encoding": "UTF-8",
"fieldDelimiter": ",",
"quote": "\"",
"skipLeadingRows": 0
},
"schema": {
"fields": [
...
]
},
"sourceFormat": "CSV",
"sourceUris": [
"gs://[BUCKET]/fed-sample.csv"
]
}
You can modify a table definition file so that subsequent calls to bq mk or
bq query use the updated values. For example, if you decide
that you want to allow jagged rows, you can modify that property in fed-sample-def:
{
  "csvOptions": {
    "allowJaggedRows": true,
    ...
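After saving the change, subsequent calls pick up the updated definition. For example, to create a permanent external table from the edited file (the dataset name is a placeholder):

bq mk --external_table_definition=/tmp/fed-sample-def mydataset.fed_sample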
The following options are available only for table definitions based on CSV files:
| CSV table definition options | |
|---|---|
| allowJaggedRows | Type: boolean. Default value: false. API Reference: configuration.query.tableDefinitions.(key).csvOptions.allowJaggedRows |
| allowQuotedNewlines | Type: boolean. Default value: false. Description: CSV files with newline characters are difficult to process in parallel. Setting this to true can reduce the performance gains associated with parallel processing. API Reference: configuration.query.tableDefinitions.(key).csvOptions.allowQuotedNewlines |
| encoding | Type: string. Default value: UTF-8. Valid values: UTF-8, ISO-8859-1. API Reference: configuration.query.tableDefinitions.(key).csvOptions.encoding |
| fieldDelimiter | Type: string. Default value: "," (comma). API Reference: configuration.query.tableDefinitions.(key).csvOptions.fieldDelimiter |
| quote | Type: string. Default value: " (quotation mark). API Reference: configuration.query.tableDefinitions.(key).csvOptions.quote |
| skipLeadingRows | Type: long. Default value: 0. API Reference: configuration.query.tableDefinitions.(key).csvOptions.skipLeadingRows |
Table definitions based on JSON source files
The following command creates a table definition based on a JSON file named fed-sample.json in a Cloud Storage bucket named [BUCKET], where the schema is stored in /tmp/fed-sample-schema.json and the generated definition file is stored in /tmp/fed-sample-json-def:
bq mkdef --source_format=NEWLINE_DELIMITED_JSON 'gs://[BUCKET]/fed-sample.json' /tmp/fed-sample-schema.json > /tmp/fed-sample-json-def
The contents of fed-sample-json-def (schema values omitted for brevity):
{
"autodetect": true,
"schema": {
"fields": [
...
]
},
"sourceFormat": "NEWLINE_DELIMITED_JSON",
"sourceUris": [
"gs://[BUCKET]/fed-sample.json"
]
}
There are no configuration settings specific to JSON source files, but many of the general settings apply to JSON files. For more information, see General table definition settings.
Table definitions based on Google Sheets source files
The following command creates a table definition based on a Google Sheet stored on Google Drive specified by [GOOGLE_DRIVE_URI], where the schema is auto-detected and the generated definition file is stored in /tmp/google-sheet-def:
bq mkdef --source_format=GOOGLE_SHEETS "[GOOGLE_DRIVE_URI]" > /tmp/google-sheet-def
The contents of google-sheet-def:
{
"autodetect": true,
"sourceFormat": "GOOGLE_SHEETS",
"sourceUris": [
"[GOOGLE_DRIVE_URI]"
]
}
You can modify the table definition to skip leading rows by adding a new
googleSheetsOptions section. The following example modifies the table
definition so that it skips the first 3 rows:
{
"autodetect": true,
"sourceFormat": "GOOGLE_SHEETS",
"sourceUris": [
"[GOOGLE_DRIVE_URI]"
],
"googleSheetsOptions": {
"skipLeadingRows": 3
}
}
| Google Sheets table definition settings | |
|---|---|
| autodetect | Type: boolean. Default value: true (schema auto-detection is enabled by default). API Reference: configuration.query.tableDefinitions.(key).autodetect |
| skipLeadingRows | Type: long. Default value: 0. API Reference: configuration.query.tableDefinitions.(key).googleSheetsOptions.skipLeadingRows |
For a list of general configuration settings, see General table definition settings.
Enabling Google Drive access
Web UI
Follow the web-based authentication steps when creating the table in the web UI.
CLI
To enable Google Drive access:

1. Use the Google Cloud Platform Console to enable the Google Drive API for the project you plan to use.
2. Ensure that you have the latest version of the command-line tool:

   gcloud components update

3. Authenticate with Google Drive:

   gcloud auth login --enable-gdrive-access
API
If you are using the BigQuery API:
Use the Google Cloud Platform Console to enable the Google Drive API for the project making the API call.
Request the OAuth scope for Google Drive in addition to the scope for BigQuery.
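For example, the scope list for such a request might include both of the following (the exact Drive scope depends on the access your application needs; the read-only Drive scope is shown here as an assumption):

https://www.googleapis.com/auth/bigquery
https://www.googleapis.com/auth/drive.readonly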
Avro and Cloud Datastore backup schema detection
If you use an Avro file or a Cloud Datastore backup as an external data source, BigQuery detects the schema for you.
When BigQuery detects schemas, it might, on rare occasions, change a field name to make it compatible with BigQuery SQL syntax.
To see the detected schema, create an external table and use the bq show
command. Alternatively, you can use the BigQuery web UI to see the schema.
For information about data type conversion, see:
- Data type conversion from Loading Data from Cloud Datastore.
- Avro format data conversion table.
CSV and JSON schema auto-detection
If you specify a CSV or JSON file without including a schema description, BigQuery makes a best-effort attempt to automatically infer the schema. Automatic detection is currently available in the command-line tool and the BigQuery API.
BigQuery starts the inference process by selecting a random file in the data set and scanning up to 100 rows of the file to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in that sample.
Schema auto-detection is enabled by default. To disable auto-detection, you can
pass the --noautodetect flag when creating the table definition using mkdef
or modify an existing table definition file by changing the value of the
autodetect property to false. For example, the following excerpt from a
table definition JSON file disables auto-detection:
{
"autodetect": false,
...
}
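For example, a sketch of generating a definition with auto-detection disabled, reusing the earlier fed-sample.csv example (note that with auto-detection disabled for CSV, you supply the schema yourself):

bq mkdef --noautodetect --source_format=CSV "gs://[MY_BUCKET]/fed-sample.csv" id:integer,name:string,num:integer,ord:string > /tmp/fed-sample-def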
Other automatic detection details include:
- Compression: BigQuery recognizes gzip-compatible file compression when opening a file.
- CSV Delimiter: BigQuery detects the following delimiters:
  - comma (,)
  - pipe (|)
  - tab (\t)
- CSV Header: BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines do not, BigQuery assumes that the first row is a header row.
- CSV Quoted new lines: BigQuery detects quoted newline characters within a CSV field and does not interpret the quoted newline character as a row boundary.
- Timestamps: BigQuery detects a wide array of timestamp formats, including, but not limited to:
  - `yyyy-mm-dd` in any order
  - `yyyy-mm-dd hh:mm:ss`
  - `yyyy-mm-dd hh:mm:ss.sss`
  - Date separators can be "-", "/", or "."
  - A time zone can be appended using an offset or name
The _FILE_NAME pseudo column
Tables based on external data sources provide a pseudo column named _FILE_NAME.
This column contains the fully qualified path to the file to which the row
belongs. This column is available only in external tables that reference Google
Cloud Storage and Google Drive.
The _FILE_NAME column name is reserved, which means that you cannot
create a column by that name in any of your tables. To select the value of
_FILE_NAME, you must use an alias. The following example demonstrates
selecting _FILE_NAME by assigning the alias fn to the pseudo column.
1. Create a table definition for the table wikipedia using the URI for the files from a Cloud Storage bucket. For example:

   bq mkdef --source_format=CSV "gs://[BUCKET_URI]" id:integer,name:string,category:integer,depth:string > /tmp/wikipedia

   The [BUCKET_URI] comprises your bucket name and your filename. For example, if the Cloud Storage bucket is named mybucket and the data file is named fed-sample.csv, the bucket URI would be gs://mybucket/fed-sample.csv.

2. Query the _FILE_NAME pseudo column:

   bq query --project_id=[PROJECT_ID] --external_table_definition=wikipedia::/tmp/wikipedia 'select name, _FILE_NAME as fn from wikipedia where name contains "Alex";'
Limits
The limits for external data sources are the same as the limits for load jobs, as described in the Load jobs section on the Quota Policy page.