Introduction
The following data sources have been collected and categorized for your convenience. The list is limited to sources whose data can be imported from CSV files through a reasonably simple process. Most of the data sets listed below are free; some, however, are not.
If (R!) appears after a source, the data are already in R format or there are R commands for importing them directly from within R. (See http://www.quantmod.com/examples/intro/ for some code.)
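For sources that only offer a CSV download, the import step usually looks much the same whatever the tool. The following is a minimal sketch in Python (standard library only), not a recipe taken from any of the sites listed. The sample text, its column names, and the helper `load_rows` are made up for illustration; a real file from one of the sources below will have its own header and may need extra cleaning.

```python
import csv
import io

# Made-up CSV text standing in for a file downloaded from one of the sources.
# The column names and values are purely illustrative.
SAMPLE_CSV = """name,year,value
alpha,2000,1.5
beta,2001,2.5
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = load_rows(SAMPLE_CSV)

# csv returns every field as a string, so numeric columns need explicit conversion.
values = [float(row["value"]) for row in rows]

print(len(rows), "rows; largest value:", max(values))
```

To read a downloaded file instead of an in-memory string, open it with `open(path, newline="")` and pass the file object to `csv.DictReader` in place of the `io.StringIO` wrapper.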
Data Science
This section contains data sets used in the book "Doing Data Science" by Rachel Schutt and Cathy O'Neil (O'Reilly, 2014).
- Datasets on the book site: https://github.com/oreillymedia/doing_data_science
- Enron Email Dataset: http://www.cs.cmu.edu/~enron/
- Titanic Survival Data Set: http://bit.ly/1kJ4pkF
- Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/
Economics
- American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
- Gapminder: http://www.gapminder.org/data/
- UMD: http://inforumweb.umd.edu/econdata/econdata.html
- World Bank: http://data.worldbank.org/indicator
Finance
- CBOE Futures Exchange: http://cfe.cboe.com/Data/
- Google Finance: https://www.google.com/finance (R!)
- Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
- St Louis Fed: http://research.stlouisfed.org/fred2/ (R!)
- NASDAQ: https://data.nasdaq.com/
- OANDA: http://www.oanda.com/ (R!)
- Quandl: http://www.quandl.com/
- Yahoo Finance: http://finance.yahoo.com/ (R!)
Government
- Archived national government statistics: http://www.archive-it.org/
- Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
- Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
- DataMarket: http://datamarket.com/
- FDA: https://open.fda.gov/index.html
- Fed Stats: http://fedstats.sites.usa.gov/
- HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
- London, U.K. data: http://data.london.gov.uk/dataset
- New Zealand: http://www.stats.govt.nz/tools_and_services/tools/TableBuilder/tables-by...
- NYC data: http://nycplatform.socrata.com/
- OECD: http://www.oecd.org/statistics/
- RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
- San Francisco Data sets: http://datasf.org/
- U.K. Government Data: http://data.gov.uk/data
- United Nations: http://data.un.org/
- U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset
- U.S. Federal Government Agencies: http://www.data.gov/metric
- US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm
- The World Bank: http://wdronline.worldbank.org/
- UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/r/2013/02/05/2011-census-open-atlas-project/
- US Bureau of Labor Statistics data: the R package rUnemploymentData
- Utah Open Data Catalog: http://www.utah.gov/data/
Health Care
- Gapminder: http://www.gapminder.org/data/
Machine Learning
- Amazon Web Services Data: http://aws.amazon.com/datasets
- Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
- AppliedPredictiveModeling (R package): http://bit.ly/16wyvkG
- Australian Weather: http://www.bom.gov.au/climate/dwo/
- Causality Workbench: http://www.causality.inf.ethz.ch/repository.php
- Kaggle competition data: http://www.kaggle.com/
- KDNuggets competition site: http://www.kdnuggets.com/datasets/
- The Koblenz Network Collection: http://konect.uni-koblenz.de/
- Machine Learning Data Set Repository: http://mldata.org/
- Medicare Data File: http://go.cms.gov/19xxPN4
- Microsoft Research: http://research.microsoft.com/apps/dp/dl/downloads.aspx
- Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
- More song datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
- MovieLens Data Sets: http://datahub.io/dataset/movielens
- NYC Taxi Data (2010-2013): http://publish.illinois.edu/dbwork/open-data/
- RDataMining.com R and Data Mining ebook data: http://www.rdatamining.com/data
- Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
- UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
- 53.5 billion clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset
Networks
- Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
Public Domain Collections
- Data360: http://www.data360.org/index.aspx
- Factual: http://www.factual.com/
- Freebase: http://www.freebase.com/
- Google: http://www.google.com/publicdata/directory
- infochimps: http://www.infochimps.com/
- Quora: http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
- RS Collection 100+: http://rs.io/2014/05/29/list-of-data-sets.html
- Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html (R!)
- SourceForge Research Data: http://www.nd.edu/~oss/Data/data.html
- StatSci.org: http://www.statsci.org/datasets.html
- UFO Reports: http://www.nuforc.org/webreports.html
- WikiLeaks 9/11 pager intercepts: http://911.wikileaks.org/files/index.html
- The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
Science
- Agricultural Experiments: http://www.inside-r.org/packages/cran/agridat/docs/agridat (R!)
- Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
- Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
- Geo Spatial Data: http://geodacenter.asu.edu/datalist/
- Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
- MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
- NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html
- NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/ (R!)
- Protein structure: http://www.infobiotic.net/PSPbenchmarks/
- Public Gene Data: http://www.pubgene.org/
- Stanford Microarray Data: http://smd.stanford.edu//
Social Sciences
- General Social Survey: http://www3.norc.org/GSS+Website/
- ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
- Pew Research: http://www.pewinternet.org/datasets/pages/2/
- SNAP: http://snap.stanford.edu/data/index.html
- UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
- Upjohn Institute: search for data at http://www.upjohn.org
Time Series
- Time Series Data Library: http://robjhyndman.com/TSDL/
Universities
- Carnegie Mellon University Enron email: http://www.cs.cmu.edu/~enron/
- Keel Repository: http://sci2s.ugr.es/keel/datasets.php
- Ohio State University Financial data: http://fisher.osu.edu/fin/fdf/osudata.htm
- UC Berkeley: http://ucdata.berkeley.edu/
- UCLA: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
- UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/
- University of Toronto: http://www.cs.toronto.edu/~delve/data/datasets.html
Microsoft R Server .XDF Datasets
- The XDF Collection for Microsoft R Server: http://www.revolutionanalytics.com/subscriptions/datasets/