A public dataset is any dataset that is stored in BigQuery and made available to the general public. This page lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1 TB per month is free, subject to query pricing details).
Public datasets hosted by BigQuery
-
GDELT Book Corpus
A dataset that contains 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes).
-
GitHub Data
This public dataset contains GitHub activity data for more than 2.8 million open source GitHub repositories, more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files.
-
Hacker News
A dataset that contains all stories and comments from Hacker News since its launch in 2006.
-
IRS Form 990 Data
A dataset that contains financial information about nonprofit/exempt organizations in the United States, gathered by the Internal Revenue Service (IRS) using Form 990.
-
Medicare Data
This public dataset summarizes the utilization and payments for procedures, services, and prescription drugs provided to Medicare beneficiaries by specific inpatient and outpatient hospitals, physicians, and other suppliers.
-
Major League Baseball Data
This public dataset contains pitch-by-pitch activity data for Major League Baseball (MLB) in 2016.
-
NOAA GHCN
This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. This dataset draws from more than 20 sources, including some data from every year since 1763.
-
NOAA GSOD
This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and 2016, collected from over 9000 stations.
-
NYC 311 Service Requests (New)
This public data includes all 311 service requests from 2010 to the present, and is updated daily. 311 is a non-emergency number that provides access to non-emergency municipal services.
-
NYC Citi Bike Trips (New)
Data collected by the NYC Citi Bike bicycle sharing program that includes trip records for 10,000 bikes and 600 stations across Manhattan, Brooklyn, Queens, and Jersey City since Citi Bike launched in September 2013.
-
NYC TLC Trips
Data collected by the NYC Taxi and Limousine Commission (TLC) that includes trip records from all trips completed in yellow and green taxis in NYC from 2009 to 2015.
-
NYC Tree Census (New)
The NYC street tree data includes data from the 1995, 2005 and 2015 Street Tree Censuses, which are conducted by volunteers organized by the NYC Department of Parks and Recreation.
-
NYPD Motor Vehicle Collisions (New)
This dataset includes details of Motor Vehicle Collisions in New York City provided by the Police Department (NYPD) from 2012 to the present.
-
Open Images Data
This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.
-
Stack Overflow Data (New)
This public dataset contains an archive of Stack Overflow content, including posts, votes, tags, and badges.
-
USA Bureau of Labor Statistics (New)
This dataset includes economic statistics on inflation, prices, unemployment, and pay & benefits provided by the Bureau of Labor Statistics (BLS).
-
USA Disease Surveillance
A dataset published by the US Department of Health and Human Services that includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013.
-
USA Names
A Social Security Administration dataset that contains all names from Social Security card applications for births that occurred in the United States after 1879.
How to query public datasets using BigQuery
BigQuery is a fully managed data warehouse and analytics platform. The public datasets listed on this page are available for you to analyze using SQL queries. You can access BigQuery public datasets using the web UI, the command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.
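As a rough illustration of the client-library route, the following Python sketch runs a SQL query against one of the public datasets. It assumes the `google-cloud-bigquery` package is installed and that a default project and credentials are configured. The table name `bigquery-public-data.usa_names.usa_1910_current` and its `name` and `number` columns are assumptions that do not appear on this page, and the helper names `build_top_names_query` and `run_query` are hypothetical.

```python
def build_top_names_query(limit=10):
    """Return SQL for the most common names in the USA Names dataset.

    Assumption: the table path and the `name` / `number` columns are
    illustrative; check the dataset's schema before relying on them.
    """
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT name, SUM(number) AS total\n"
        "FROM `bigquery-public-data.usa_names.usa_1910_current`\n"
        "GROUP BY name\n"
        "ORDER BY total DESC\n"
        f"LIMIT {limit}"
    )


def run_query(sql):
    """Run `sql` with the Python client library and return rows as dicts."""
    # Imported here so the query builder works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project and credentials
    rows = client.query(sql).result()  # blocks until the query job finishes
    return [dict(row.items()) for row in rows]
```

Calling `run_query(build_top_names_query(limit=5))` would then return up to five rows, each a dictionary with `name` and `total` keys, provided the assumed table and columns exist.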
The first terabyte of data processed per month is free, so you can start querying datasets without enabling billing. To get started running some sample queries, select or create a project and then run the example queries on the NOAA GSOD weather dataset.
- Select or create a Cloud Platform Console project.
- Go to the NOAA GSOD dataset in the BigQuery Web UI.
- Click the COMPOSE QUERY button.
- Copy and paste the SQL examples on the NOAA GSOD page.
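Because only the first terabyte processed each month is free, it can help to check how much data a query would scan before running it. The sketch below uses a dry run for that purpose. It assumes the `google-cloud-bigquery` package and configured credentials, treats 1 TB as 2**40 bytes (an assumption; the query pricing details are authoritative), and uses the assumed table name `bigquery-public-data.noaa_gsod.gsod2016` with assumed columns `stn` and `temp`. The function names are hypothetical.

```python
FREE_BYTES_PER_MONTH = 2 ** 40  # assumption: 1 TB counted as 2**40 bytes

# Assumption: table and column names are illustrative, not taken from this page.
GSOD_SQL = (
    "SELECT stn, MAX(temp) AS max_mean_temp\n"
    "FROM `bigquery-public-data.noaa_gsod.gsod2016`\n"
    "GROUP BY stn"
)


def free_tier_fraction(bytes_processed, free_bytes=FREE_BYTES_PER_MONTH):
    """Return the share of the monthly free allowance a query would consume."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / free_bytes


def estimate_bytes(sql):
    """Dry-run `sql` and return the number of bytes it would process."""
    # Imported lazily so the arithmetic helper works without the library.
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # a dry run does not execute the query
    return job.total_bytes_processed
```

For example, `free_tier_fraction(estimate_bytes(GSOD_SQL))` gives the fraction of the assumed monthly allowance that the query would use.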
Other Public Datasets
There are many other public datasets available for you to query. Some of these are also hosted by Google, but many more are hosted by third parties. You can share any of your datasets with the public by changing the sharing permissions associated with your dataset. For more information about sharing datasets, see Access Control.
- Sample Tables
- Google Genomics Public Data
- Datasets publicly available on Google BigQuery (reddit.com)
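Changing a dataset's sharing permissions, as mentioned above, can also be done from code. The following is a minimal sketch using the Python client library, assuming the `google-cloud-bigquery` package and credentials that are allowed to update the dataset. The dataset ID `my-project.my_dataset` is a placeholder and the function names are hypothetical. Granting the `READER` role to the `allAuthenticatedUsers` special group is one assumed way to make a dataset publicly readable; see Access Control for the authoritative options.

```python
def public_reader_fields():
    """Return (role, entity_type, entity_id) for a public read-only grant."""
    # Assumption: this role and special group are an appropriate public grant.
    return ("READER", "specialGroup", "allAuthenticatedUsers")


def share_with_public(dataset_id):
    """Add a public read-only access entry to a dataset and return its entries."""
    # Imported lazily so the helper above works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    dataset = client.get_dataset(dataset_id)  # e.g. "my-project.my_dataset"
    role, entity_type, entity_id = public_reader_fields()
    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(role=role, entity_type=entity_type, entity_id=entity_id)
    )
    dataset.access_entries = entries
    # Send only the changed field to the API.
    dataset = client.update_dataset(dataset, ["access_entries"])
    return dataset.access_entries
```

The sketch appends an entry without checking whether an equivalent grant already exists, so a real script would probably want to inspect `dataset.access_entries` first.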
How to list your public dataset on BigQuery
If you have any questions about listing a public dataset on this page, please contact us at [email protected].