Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers. Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.
You can start exploring this data in the BigQuery console:
Sample queries
Here are some examples of SQL queries you can run on this data in BigQuery.
These samples use BigQuery’s support for standard SQL. Use the #standardSQL
tag to let BigQuery know you want to use standard SQL. For more information
about the #standardSQL prefix, see
Setting a query prefix.
What is the percentage of questions that have been answered over the years?
This short query looks at how many questions were posted on Stack Overflow from 2009 to 2015, and what percentage of them have been answered annually. It shows that the total number of questions posted are increasing annually, but the percentage of them answered are decreasing annually.
Web UI
Open the following query in the Web UI
#standardSQL SELECT EXTRACT(YEAR FROM creation_date) AS Year, COUNT(*) AS Number_of_Questions, ROUND(100 * SUM(IF(answer_count > 0, 1, 0)) / COUNT(*), 1) AS Percent_Questions_with_Answers FROM `bigquery-public-data.stackoverflow.posts_questions` GROUP BY Year HAVING Year > 2008 AND Year < 2016 ORDER BY Year
Command-line
bq query --use_legacy_sql=false '
SELECT
EXTRACT(YEAR FROM creation_date) AS Year,
COUNT(*) AS Number_of_Questions,
ROUND(100 * SUM(IF(answer_count > 0, 1, 0)) / COUNT(*), 1) AS Percent_Questions_with_Answers
FROM
`bigquery-public-data.stackoverflow.posts_questions`
GROUP BY
Year
HAVING
Year > 2008 AND Year < 2016
ORDER BY
Year
'
The results are shown here:
+------+---------------------+--------------------------------+
| Year | Number_of_Questions | Percent_Questions_with_Answers |
+------+---------------------+--------------------------------+
| 2009 | 345864 | 99.5 |
| 2010 | 702964 | 98.1 |
| 2011 | 1213146 | 96.3 |
| 2012 | 1664204 | 93.6 |
| 2013 | 2076336 | 90.9 |
| 2014 | 2179015 | 87.6 |
| 2015 | 2388670 | 79.5 |
+------+---------------------+--------------------------------+
What is the reputation and badge count of users across different tenures on StackOverflow?
This query breaks down Stack Overflow users into different cohorts by the number of years they’ve been on the platform, and computes the average reputation and number of badges for each cohort. It’s not surprising that users who have been on StackOverflow longer would have higher reputation and number of badges on average. In addition, it’s interesting to see that users typically only begin to have multiple badges after two years on StackOverflow.
Web UI
Open the following query in the Web UI
#standardSQL
SELECT User_Tenure,
COUNT(1) AS Num_Users,
ROUND(AVG(reputation)) AS Avg_Reputation,
ROUND(AVG(num_badges)) AS Avg_Num_Badges
FROM (
SELECT users.id AS user,
ROUND(TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), ANY_VALUE(users.creation_date), DAY)/365) AS user_tenure,
ANY_VALUE(users.reputation) AS reputation,
SUM(IF(badges.user_id IS NULL, 0, 1)) AS num_badges
FROM bigquery-public-data.stackoverflow.users users
LEFT JOIN bigquery-public-data.stackoverflow.badges badges
ON users.id = badges.user_id
GROUP BY user
)
GROUP BY User_Tenure
ORDER BY User_Tenure
Command-line
bq query --use_legacy_sql=false '
SELECT User_Tenure,
COUNT(1) AS Num_Users,
ROUND(AVG(reputation)) AS Avg_Reputation,
ROUND(AVG(num_badges)) AS Avg_Num_Badges
FROM (
SELECT users.id AS user,
ROUND(TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), ANY_VALUE(users.creation_date), DAY)/365) AS user_tenure,
ANY_VALUE(users.reputation) AS reputation,
SUM(IF(badges.user_id IS NULL, 0, 1)) AS num_badges
FROM `bigquery-public-data.stackoverflow.users` users
LEFT JOIN `bigquery-public-data.stackoverflow.badges` badges
ON users.id = badges.user_id
GROUP BY user
)
GROUP BY User_Tenure
ORDER BY User_Tenure
'
The results are shown here:
+-------------+-----------+----------------+----------------+
| User_Tenure | Num_Users | Avg_Reputation | Avg_Num_Badges |
+-------------+-----------+----------------+----------------+
| 0.0 | 46218 | 2.0 | 1.0 |
| 1.0 | 1401011 | 8.0 | 1.0 |
| 2.0 | 1198233 | 20.0 | 1.0 |
| 3.0 | 1175131 | 37.0 | 2.0 |
| 4.0 | 940001 | 82.0 | 3.0 |
| 5.0 | 471452 | 234.0 | 7.0 |
| 6.0 | 272851 | 456.0 | 10.0 |
| 7.0 | 129695 | 1017.0 | 17.0 |
| 8.0 | 42666 | 4031.0 | 45.0 |
+-------------+-----------+----------------+----------------+
What are 10 of the “easier” gold badges to earn?
This query investigates which gold badges are easier to obtain. It generates the top 10 gold badges, ranked by how many users got them as their first ever gold badges. We also extract how many days it took for these gold badges to be obtained on average.
Web UI
Open the following query in the Web UI
#standardSQL
SELECT badge_name AS First_Gold_Badge,
COUNT(1) AS Num_Users,
ROUND(AVG(tenure_in_days)) AS Avg_Num_Days
FROM
(
SELECT
badges.user_id AS user_id,
badges.name AS badge_name,
TIMESTAMP_DIFF(badges.date, users.creation_date, DAY) AS tenure_in_days,
ROW_NUMBER() OVER (PARTITION BY badges.user_id
ORDER BY badges.date) AS row_number
FROM
bigquery-public-data.stackoverflow.badges badges
JOIN
bigquery-public-data.stackoverflow.users users
ON badges.user_id = users.id
WHERE badges.class = 1
)
WHERE row_number = 1
GROUP BY First_Gold_Badge
ORDER BY Num_Users DESC
LIMIT 10
Command-line
bq query --use_legacy_sql=false '
SELECT badge_name AS First_Gold_Badge,
COUNT(1) AS Num_Users,
ROUND(AVG(tenure_in_days)) AS Avg_Num_Days
FROM
(
SELECT
badges.user_id AS user_id,
badges.name AS badge_name,
TIMESTAMP_DIFF(badges.date, users.creation_date, DAY) AS tenure_in_days,
ROW_NUMBER() OVER (PARTITION BY badges.user_id
ORDER BY badges.date) AS row_number
FROM
`bigquery-public-data.stackoverflow.badges` badges
JOIN
`bigquery-public-data.stackoverflow.users` users
ON badges.user_id = users.id
WHERE badges.class = 1
)
WHERE row_number = 1
GROUP BY First_Gold_Badge
ORDER BY Num_Users DESC
LIMIT 10
'
The results are shown here:
+------------------+-----------+--------------+
| First_Gold_Badge | Num_Users | Avg_Num_Days |
+------------------+-----------+--------------+
| Famous Question | 144416 | 1094.0 |
| Fanatic | 12992 | 579.0 |
| Unsung Hero | 11392 | 555.0 |
| Great Answer | 8722 | 1263.0 |
| Electorate | 4722 | 769.0 |
| Populist | 4631 | 1147.0 |
| Steward | 743 | 924.0 |
| Great Question | 537 | 684.0 |
| Copy Editor | 232 | 533.0 |
| Marshal | 136 | 553.0 |
+------------------+-----------+--------------+
Which day of the week has most questions answered within an hour?
In this query we look at which day of the week is the best to ask questions if one would like to get an answer very quickly. The query returns day of the week as integers from 1 to 7 (1 = Sunday, 2 = Monday, etc), and the number of questions on each day in 2016. For each day, we also query for how many of these questions received an answer within 1 hour of submission, and the corresponding percentage. The volume of questions and answers is the highest in the middle of the week (Tue, Wed, and Thur), and percentage of questions being answered within 1 hour is lower on Sat and Sun, but not by much compared to weekdays.
Web UI
Open the following query in the Web UI
#standardSQL
SELECT
Day_of_Week,
COUNT(1) AS Num_Questions,
SUM(answered_in_1h) AS Num_Answered_in_1H,
ROUND(100 * SUM(answered_in_1h) / COUNT(1),1) AS Percent_Answered_in_1H
FROM
(
SELECT
q.id AS question_id,
EXTRACT(DAYOFWEEK FROM q.creation_date) AS day_of_week,
MAX(IF(a.parent_id IS NOT NULL AND
(UNIX_SECONDS(a.creation_date)-UNIX_SECONDS(q.creation_date))/(60*60) <= 1, 1, 0)) AS answered_in_1h
FROM
bigquery-public-data.stackoverflow.posts_questions q
LEFT JOIN
bigquery-public-data.stackoverflow.posts_answers a
ON q.id = a.parent_id
WHERE EXTRACT(YEAR FROM a.creation_date) = 2016
AND EXTRACT(YEAR FROM q.creation_date) = 2016
GROUP BY question_id, day_of_week
)
GROUP BY
Day_of_Week
ORDER BY
Day_of_Week
Command-line
bq query --use_legacy_sql=false '
SELECT
Day_of_Week,
COUNT(1) AS Num_Questions,
SUM(answered_in_1h) AS Num_Answered_in_1H,
ROUND(100 * SUM(answered_in_1h) / COUNT(1),1) AS Percent_Answered_in_1H
FROM
(
SELECT
q.id AS question_id,
EXTRACT(DAYOFWEEK FROM q.creation_date) AS day_of_week,
MAX(IF(a.parent_id IS NOT NULL AND
(UNIX_SECONDS(a.creation_date)-UNIX_SECONDS(q.creation_date))/(60*60) <= 1, 1, 0)) AS answered_in_1h
FROM
`bigquery-public-data.stackoverflow.posts_questions` q
LEFT JOIN
`bigquery-public-data.stackoverflow.posts_answers` a
ON q.id = a.parent_id
WHERE EXTRACT(YEAR FROM a.creation_date) = 2016
AND EXTRACT(YEAR FROM q.creation_date) = 2016
GROUP BY question_id, day_of_week
)
GROUP BY
Day_of_Week
ORDER BY
Day_of_Week
'
The results are shown here:
+-------------+---------------+--------------------+------------------------+
| Day_of_Week | Num_Questions | Num_Answered_in_1H | Percent_Answered_in_1H |
+-------------+---------------+--------------------+------------------------+
| 1 | 76315 | 44978 | 58.9 |
| 2 | 132876 | 80031 | 60.2 |
| 3 | 145685 | 88401 | 60.7 |
| 4 | 149534 | 90392 | 60.4 |
| 5 | 146850 | 88898 | 60.5 |
| 6 | 133790 | 81357 | 60.8 |
| 7 | 78809 | 46306 | 58.8 |
+-------------+---------------+--------------------+------------------------+
About the data
Dataset Source: https://archive.org/download/stackexchange
Category: Encyclopedic, Research
Use: cc-by-sa 3.0
Update Frequency: Quarterly
View in BigQuery: Go to Stack Overflow dataset