Stack Overflow Data

Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers. Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.

You can start exploring this data in the BigQuery console:

Go to Stack Overflow Dataset

Sample queries

Here are some examples of SQL queries you can run on this data in BigQuery.

These samples use BigQuery’s support for standard SQL. Use the #standardSQL tag to let BigQuery know you want to use standard SQL. For more information about the #standardSQL prefix, see Setting a query prefix.

What is the percentage of questions that have been answered over the years?

This short query looks at how many questions were posted on Stack Overflow from 2009 to 2015, and what percentage of them have been answered annually. It shows that the total number of questions posted are increasing annually, but the percentage of them answered are decreasing annually.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  EXTRACT(YEAR FROM creation_date) AS Year,
  COUNT(*) AS Number_of_Questions,
  ROUND(100 * SUM(IF(answer_count > 0, 1, 0)) / COUNT(*), 1) AS Percent_Questions_with_Answers
FROM
  `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY
  Year
HAVING
  Year > 2008 AND Year < 2016
ORDER BY
  Year

Command-line

bq query --use_legacy_sql=false '
SELECT
  EXTRACT(YEAR FROM creation_date) AS Year,
  COUNT(*) AS Number_of_Questions,
  ROUND(100 * SUM(IF(answer_count > 0, 1, 0)) / COUNT(*), 1) AS Percent_Questions_with_Answers
FROM
  `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY
  Year
HAVING
  Year > 2008 AND Year < 2016
ORDER BY
  Year
'

The results are shown here:

+------+---------------------+--------------------------------+
| Year | Number_of_Questions | Percent_Questions_with_Answers |
+------+---------------------+--------------------------------+
| 2009 |              345864 |                           99.5 |
| 2010 |              702964 |                           98.1 |
| 2011 |             1213146 |                           96.3 |
| 2012 |             1664204 |                           93.6 |
| 2013 |             2076336 |                           90.9 |
| 2014 |             2179015 |                           87.6 |
| 2015 |             2388670 |                           79.5 |
+------+---------------------+--------------------------------+

What is the reputation and badge count of users across different tenures on StackOverflow?

This query breaks down Stack Overflow users into different cohorts by the number of years they’ve been on the platform, and computes the average reputation and number of badges for each cohort. It’s not surprising that users who have been on StackOverflow longer would have higher reputation and number of badges on average. In addition, it’s interesting to see that users typically only begin to have multiple badges after two years on StackOverflow.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT User_Tenure,
       COUNT(1) AS Num_Users,
       ROUND(AVG(reputation)) AS Avg_Reputation,
       ROUND(AVG(num_badges)) AS Avg_Num_Badges
FROM (
  SELECT users.id AS user,
         ROUND(TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), ANY_VALUE(users.creation_date), DAY)/365) AS user_tenure,
         ANY_VALUE(users.reputation) AS reputation,
         SUM(IF(badges.user_id IS NULL, 0, 1)) AS num_badges
  FROM bigquery-public-data.stackoverflow.users users
  LEFT JOIN bigquery-public-data.stackoverflow.badges badges
  ON users.id = badges.user_id
  GROUP BY user
)
GROUP BY User_Tenure
ORDER BY User_Tenure

Command-line

bq query --use_legacy_sql=false '
SELECT User_Tenure,
       COUNT(1) AS Num_Users,
       ROUND(AVG(reputation)) AS Avg_Reputation,
       ROUND(AVG(num_badges)) AS Avg_Num_Badges
FROM (
  SELECT users.id AS user,
         ROUND(TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), ANY_VALUE(users.creation_date), DAY)/365) AS user_tenure,
         ANY_VALUE(users.reputation) AS reputation,
         SUM(IF(badges.user_id IS NULL, 0, 1)) AS num_badges
  FROM `bigquery-public-data.stackoverflow.users` users
  LEFT JOIN `bigquery-public-data.stackoverflow.badges` badges
  ON users.id = badges.user_id
  GROUP BY user
)
GROUP BY User_Tenure
ORDER BY User_Tenure
'

The results are shown here:

+-------------+-----------+----------------+----------------+
| User_Tenure | Num_Users | Avg_Reputation | Avg_Num_Badges |
+-------------+-----------+----------------+----------------+
|         0.0 |     46218 |            2.0 |            1.0 |
|         1.0 |   1401011 |            8.0 |            1.0 |
|         2.0 |   1198233 |           20.0 |            1.0 |
|         3.0 |   1175131 |           37.0 |            2.0 |
|         4.0 |    940001 |           82.0 |            3.0 |
|         5.0 |    471452 |          234.0 |            7.0 |
|         6.0 |    272851 |          456.0 |           10.0 |
|         7.0 |    129695 |         1017.0 |           17.0 |
|         8.0 |     42666 |         4031.0 |           45.0 |
+-------------+-----------+----------------+----------------+

What are 10 of the “easier” gold badges to earn?

This query investigates which gold badges are easier to obtain. It generates the top 10 gold badges, ranked by how many users got them as their first ever gold badges. We also extract how many days it took for these gold badges to be obtained on average.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT badge_name AS First_Gold_Badge,
       COUNT(1) AS Num_Users,
       ROUND(AVG(tenure_in_days)) AS Avg_Num_Days
FROM
(
  SELECT
    badges.user_id AS user_id,
    badges.name AS badge_name,
    TIMESTAMP_DIFF(badges.date, users.creation_date, DAY) AS tenure_in_days,
    ROW_NUMBER() OVER (PARTITION BY badges.user_id
                       ORDER BY badges.date) AS row_number
  FROM
    bigquery-public-data.stackoverflow.badges badges
  JOIN
    bigquery-public-data.stackoverflow.users users
  ON badges.user_id = users.id
  WHERE badges.class = 1
)
WHERE row_number = 1
GROUP BY First_Gold_Badge
ORDER BY Num_Users DESC
LIMIT 10

Command-line

bq query --use_legacy_sql=false '
SELECT badge_name AS First_Gold_Badge,
       COUNT(1) AS Num_Users,
       ROUND(AVG(tenure_in_days)) AS Avg_Num_Days
FROM
(
  SELECT
    badges.user_id AS user_id,
    badges.name AS badge_name,
    TIMESTAMP_DIFF(badges.date, users.creation_date, DAY) AS tenure_in_days,
    ROW_NUMBER() OVER (PARTITION BY badges.user_id
                       ORDER BY badges.date) AS row_number
  FROM
    `bigquery-public-data.stackoverflow.badges` badges
  JOIN
    `bigquery-public-data.stackoverflow.users` users
  ON badges.user_id = users.id
  WHERE badges.class = 1
)
WHERE row_number = 1
GROUP BY First_Gold_Badge
ORDER BY Num_Users DESC
LIMIT 10
'

The results are shown here:

+------------------+-----------+--------------+
| First_Gold_Badge | Num_Users | Avg_Num_Days |
+------------------+-----------+--------------+
| Famous Question  |    144416 |       1094.0 |
| Fanatic          |     12992 |        579.0 |
| Unsung Hero      |     11392 |        555.0 |
| Great Answer     |      8722 |       1263.0 |
| Electorate       |      4722 |        769.0 |
| Populist         |      4631 |       1147.0 |
| Steward          |       743 |        924.0 |
| Great Question   |       537 |        684.0 |
| Copy Editor      |       232 |        533.0 |
| Marshal          |       136 |        553.0 |
+------------------+-----------+--------------+

Which day of the week has most questions answered within an hour?

In this query we look at which day of the week is the best to ask questions if one would like to get an answer very quickly. The query returns day of the week as integers from 1 to 7 (1 = Sunday, 2 = Monday, etc), and the number of questions on each day in 2016. For each day, we also query for how many of these questions received an answer within 1 hour of submission, and the corresponding percentage. The volume of questions and answers is the highest in the middle of the week (Tue, Wed, and Thur), and percentage of questions being answered within 1 hour is lower on Sat and Sun, but not by much compared to weekdays.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  Day_of_Week,
  COUNT(1) AS Num_Questions,
  SUM(answered_in_1h) AS Num_Answered_in_1H,
  ROUND(100 * SUM(answered_in_1h) / COUNT(1),1) AS Percent_Answered_in_1H
FROM
(
  SELECT
    q.id AS question_id,
    EXTRACT(DAYOFWEEK FROM q.creation_date) AS day_of_week,
    MAX(IF(a.parent_id IS NOT NULL AND
           (UNIX_SECONDS(a.creation_date)-UNIX_SECONDS(q.creation_date))/(60*60) <= 1, 1, 0)) AS answered_in_1h
  FROM
    bigquery-public-data.stackoverflow.posts_questions q
  LEFT JOIN
    bigquery-public-data.stackoverflow.posts_answers a
  ON q.id = a.parent_id
  WHERE EXTRACT(YEAR FROM a.creation_date) = 2016
    AND EXTRACT(YEAR FROM q.creation_date) = 2016
  GROUP BY question_id, day_of_week
)
GROUP BY
  Day_of_Week
ORDER BY
  Day_of_Week

Command-line

bq query --use_legacy_sql=false '
SELECT
  Day_of_Week,
  COUNT(1) AS Num_Questions,
  SUM(answered_in_1h) AS Num_Answered_in_1H,
  ROUND(100 * SUM(answered_in_1h) / COUNT(1),1) AS Percent_Answered_in_1H
FROM
(
  SELECT
    q.id AS question_id,
    EXTRACT(DAYOFWEEK FROM q.creation_date) AS day_of_week,
    MAX(IF(a.parent_id IS NOT NULL AND
           (UNIX_SECONDS(a.creation_date)-UNIX_SECONDS(q.creation_date))/(60*60) <= 1, 1, 0)) AS answered_in_1h
  FROM
    `bigquery-public-data.stackoverflow.posts_questions` q
  LEFT JOIN
    `bigquery-public-data.stackoverflow.posts_answers` a
  ON q.id = a.parent_id
  WHERE EXTRACT(YEAR FROM a.creation_date) = 2016
    AND EXTRACT(YEAR FROM q.creation_date) = 2016
  GROUP BY question_id, day_of_week
)
GROUP BY
  Day_of_Week
ORDER BY
  Day_of_Week
'

The results are shown here:

+-------------+---------------+--------------------+------------------------+
| Day_of_Week | Num_Questions | Num_Answered_in_1H | Percent_Answered_in_1H |
+-------------+---------------+--------------------+------------------------+
|           1 |         76315 |              44978 |                   58.9 |
|           2 |        132876 |              80031 |                   60.2 |
|           3 |        145685 |              88401 |                   60.7 |
|           4 |        149534 |              90392 |                   60.4 |
|           5 |        146850 |              88898 |                   60.5 |
|           6 |        133790 |              81357 |                   60.8 |
|           7 |         78809 |              46306 |                   58.8 |
+-------------+---------------+--------------------+------------------------+

About the data

Dataset Source: https://archive.org/download/stackexchange

Category: Encyclopedic, Research

Use: cc-by-sa 3.0

Update Frequency: Quarterly

View in BigQuery: Go to Stack Overflow dataset

Sample queries

What is the percentage of questions that have been answered over the years?

Web UI

Command-line

What is the reputation and badge count of users across different tenures on StackOverflow?

Web UI

Command-line

What are 10 of the “easier” gold badges to earn?

Web UI

Command-line

Which day of the week has most questions answered within an hour?

Web UI

Command-line

About the data

Send feedback about...