GitHub
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can start exploring this data in the BigQuery console:
Go to the GitHub public dataset
Sample Queries
Here are some examples of SQL queries you can run on this data in BigQuery. For more tips, updated resources, and community content, see the updated resource list at https://medium.com/@hoffa/b3576fd2b150
Most commonly used Go packages
SELECT
REGEXP_EXTRACT(line, r'"([^"]+)"') AS url,
COUNT(*) AS count
FROM
FLATTEN( (
SELECT
SPLIT(SPLIT(REGEXP_EXTRACT(content, r'.*import\s*[(]([^)]*)[)]'), '\n'), ';') AS line,
FROM (
SELECT
id,
content
FROM
[bigquery-public-data:github_repos.sample_contents]
WHERE
REGEXP_MATCH(content, r'.*import\s*[(][^)]*[)]')) AS C
JOIN (
SELECT
id
FROM
[bigquery-public-data:github_repos.sample_files]
WHERE
path LIKE '%.go'
GROUP BY
id) AS F
ON
C.id = F.id), line)
GROUP BY
url
HAVING
url IS NOT NULL
ORDER BY
count DESC
LIMIT 10

Most commonly used Java packages
This query uses one of the smaller sample tables to find the most popular Java packages.
SELECT
package,
COUNT(*) count
FROM (
SELECT
REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') package,
id
FROM (
SELECT
SPLIT(content, '\n') line,
id
FROM
[bigquery-public-data:github_repos.sample_contents]
WHERE
content CONTAINS 'import'
AND sample_path LIKE '%.java'
HAVING
LEFT(line, 6)='import' )
GROUP BY
package,
id )
GROUP BY
1
ORDER BY
count DESC
LIMIT
40;

How many times 'This should never happen' appears
This query uses a smaller sample table to find how many times the comment "this should never happen" is present.
SELECT SUM(copies) FROM [bigquery-public-data:github_repos.sample_contents] WHERE NOT binary AND content CONTAINS 'This should never happen'
Hint: if you run this query against the entire dataset the answer is around 1,000,000!

About the dataset
Dataset Source: GitHub
Category: Technology, Social
Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://help.github.com/articles/github-terms-of-service/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
View in BigQuery: Go to GitHub data