Download & Streaming : Web Crawls : Internet Archive

7.9B 7.9B

Internet Archive Web Crawls

905,844

ITEMS

7.9B

VIEWS

Jun 11, 2010 06/10

eye 7.9B

The Internet Archive discovers and captures web pages through many different web crawls. At any given time several distinct crawls are running, some lasting for months and others running every day or at longer intervals. View the web archive through the Wayback Machine.
Topic: webwidecrawl

3.3B 3.3B

Alexa Crawls

145,753

ITEMS

3.3B

VIEWS

Nov 16, 2010 11/10

eye 3.3B

Starting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period.
Topics: web crawl, Alexa

3.2B 3.2B

Worldwide Web Crawls

463,743

ITEMS

3.2B

VIEWS

Oct 5, 2010 10/10

eye 3.2B

Wide crawls of the Internet conducted by Internet Archive. Please visit the Wayback Machine to explore archived web sites. Since September 10th, 2010, the Internet Archive has been running Worldwide Web Crawls of the global web, capturing web elements, pages, sites and parts of sites. Each Worldwide Web Crawl was initiated from one or more lists of URLs that are known as "Seed Lists". Descriptions of the Seed Lists associated with each crawl may be provided as part of the metadata for...

1.9B 1.9B

Survey Crawls

65,529

ITEMS

1.9B

VIEWS

Nov 17, 2012 11/12

eye 1.9B

Survey crawls are run about twice a year, on average, and attempt to capture the content of the front page of every web host ever seen by the Internet Archive since 1996.
Topic: survey crawls

1.7B 1.7B

Live Web Proxy Crawls

17,083

ITEMS

1.7B

VIEWS

Apr 26, 2011 04/11

eye 1.7B

Content crawled via the Wayback Machine Live Proxy, mostly by the Save Page Now feature on web.archive.org. The liveweb proxy is a component of the Internet Archive's Wayback Machine project. It captures the content of a web page in real time, archives it into an ARC or WARC file, and returns the ARC/WARC record to the Wayback Machine for processing. The recorded ARC/WARC file becomes part of the Wayback Machine in due course.
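As a rough illustration of the "archive it into a WARC file" step described above, the sketch below wraps an already-captured HTTP response in a single WARC/1.0 response record. It is not the liveweb proxy's actual code: the function name `build_warc_response_record` is hypothetical, only a minimal set of WARC header fields is emitted, and a canned response stands in for a live capture.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_payload: bytes) -> bytes:
    """Wrap raw HTTP response bytes in one WARC/1.0 'response' record.

    Hypothetical helper for illustration only; real tools emit more
    header fields (digests, IP address, and so on).
    """
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    # A blank line separates the WARC header from the record block;
    # two CRLFs terminate the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_payload + b"\r\n\r\n"


# A canned HTTP response stands in for content fetched in real time.
sample_http = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
record = build_warc_response_record("http://example.com/", sample_http)
```

Appending such records to a file one after another gives an uncompressed WARC file; production archives are typically gzip-compressed per record, which this sketch leaves out.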

812M 812M

Archive-It Digital Collection

252,604

ITEMS

812M

VIEWS

Dec 14, 2010 12/10

eye 812M

Archive-It is a subscription web archiving service of the Internet Archive that helps organizations harvest, build, and preserve collections of digital content. Partners create domain-specific collections of web captures that can be searched on Archive-It. Content is hosted and stored at the Internet Archive data centers. Archive-It works with more than 400 partner organizations in 48 U.S. states and 16 countries worldwide including: College and University Libraries State Archives, Libraries,...
Topic: Colleges, Universities, Libraries, Archives, NGOs, Museums

801.4M 801M

Archive-It Partners

247,928

ITEMS

801.4M

VIEWS

Oct 20, 2015 10/15

eye 801.4M

Archive-It is the leading web archiving service for collecting and accessing cultural heritage on the web and is a service of Internet Archive used by libraries, archives, governments, non-profits, and other organizations to build collections of web materials.
Topic: TK

623.8M 624M

Focused Crawls

164,586

ITEMS

623.8M

VIEWS

Nov 4, 2011 11/11

by Internet Archive

eye 623.8M

Focused crawls are collections of frequently-updated webcrawl data from narrow (as opposed to broad or wide) web crawls, often focused on a single domain or subdomain.
Topic: webcrawl

615M 615M

Survey Crawl April 2013

16,282

ITEMS

615M

VIEWS

Nov 17, 2012 11/12

eye 615M

Survey crawl of domains started April 2013. This data is currently not publicly accessible.

508.1M 508M

Custom Crawl Services

52,853

ITEMS

508.1M

VIEWS

Apr 8, 2011 04/11

by Internet Archive

eye 508.1M

National library harvesting.
Topic: ccs

476.3M 476M

web-group-internal

31,761

ITEMS

476.3M

VIEWS

Jul 21, 2011 07/11

eye 476.3M

miscellaneous data
Topic: brad tofel

446.6M 447M

Fix Broken Links Web Crawls

57,476

ITEMS

446.6M

VIEWS

Sep 12, 2013 09/13

eye 446.6M

These crawls are part of an effort to archive pages as they are created and archive the pages that they refer to. That way, as the pages that are referenced are changed or taken from the web, a link to the version that was live when the page was written will be preserved. Then the Internet Archive hopes that references to these archived pages will be put in place of a link that would otherwise be broken, or a companion link to allow people to see what was originally intended by a page's...

442.1M 442M

Wide Crawl started April 2013

25,005

ITEMS

442.1M

VIEWS

Apr 18, 2013 04/13

eye 442.1M

Web wide crawl with initial seedlist and crawler configuration from April 2013.

424.4M 424M

Top Domains

86,227

ITEMS

424.4M

VIEWS

Nov 29, 2011 11/11

eye 424.4M

A daily collection of thousands of the most popular web sites according to Alexa.com's top sites rankings.
Topics: daily, popular sites, Alexa

409.6M 410M

Wayback Indexes

554

ITEMS

409.6M

VIEWS

Apr 4, 2012 04/12

eye 409.6M

Wayback indexes. This data is currently not publicly accessible.

390.2M 390M

Survey Crawl December 2014

13,890

ITEMS

390.2M

VIEWS

Dec 17, 2014 12/14

eye 390.2M

Survey crawl of domains started December 2014. This data is currently not publicly accessible.

385.1M 385M

alexa_2007

7,636

ITEMS

385.1M

VIEWS

Jul 12, 2012 07/12

eye 385.1M

This data is currently not publicly accessible.

339.1M 339M

Archive Team

225,986

ITEMS

339.1M

VIEWS

May 4, 2011 05/11

eye 339.1M

Formed in 2009, the Archive Team (not to be confused with the archive.org Archive-It Team) is a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage. The group is composed entirely of volunteers and interested parties, and has expanded into a large number of related projects for saving online and digital history. History is littered with hundreds of conflicts over the future of a community, group, location or...

321M 321M

Wiki Collections

1,181,942

ITEMS

321M

VIEWS

Apr 15, 2013 04/13

eye 321M

Collections of Wiki data
Topics: crawls, data, wiki

317M 317M

Wikipedia Outlinks

24,477

ITEMS

317M

VIEWS

May 13, 2011 05/11

eye 317M

Crawl of outlinks from wikipedia.org. These files are currently not publicly accessible. From Wikipedia: Wikipedia is a multilingual, web-based, free-content encyclopedia project operated by the Wikimedia Foundation and based on an openly editable model. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links to guide the...

309.8M 310M

Wide Crawl started June 2014

45,313

ITEMS

309.8M

VIEWS

Jun 6, 2014 06/14

eye 309.8M

Web wide crawl with initial seedlist and crawler configuration from June 2014.

276.7M 277M

Wide Crawl started August 2013

21,911

ITEMS

276.7M

VIEWS

Jul 30, 2013 07/13

eye 276.7M

Web wide crawl with initial seedlist and crawler configuration from August 2013.

276.7M 277M

Wide Crawl Number 12 - started March, 14th 2015

49,621

ITEMS

276.7M

VIEWS

Jan 9, 2015 01/15

eye 276.7M

Web wide crawl with initial seedlist and crawler configuration from January 2015.

269.8M 270M

alexa_2006

6,507

ITEMS

269.8M

VIEWS

Jul 12, 2012 07/12

eye 269.8M

This data is currently not publicly accessible.

267.2M 267M

Wide Crawl started January 2012

30,362

ITEMS

267.2M

VIEWS

Dec 30, 2011 12/11

eye 267.2M

Web wide crawl with initial seedlist and crawler configuration from January 2012 using HQ software.

245.6M 246M

Wide Crawl Number 14 started March 2016

71,730

ITEMS

245.6M

VIEWS

Mar 4, 2016 03/16

eye 245.6M

Web wide crawl.

237M 237M

Wide Crawl started April 2012

39,252

ITEMS

237M

VIEWS

Mar 31, 2012 03/12

eye 237M

Web wide crawl with initial seedlist and crawler configuration from April 2012.

234.9M 235M

Wikipedia Outbound Links

14,034

ITEMS

234.9M

VIEWS

Sep 23, 2013 09/13

eye 234.9M

This is a collection of web page captures from links added to, or changed on, Wikipedia pages. The idea is to make Wikipedia outlinks reliable, so that if the pages referenced by Wikipedia articles are changed or go away, a reader can still find what was originally referred to. This is part of the Internet Archive's attempt to rid the web of broken links.
Topics: Wikipedia, Wikimedia

233.7M 234M

Survey Crawl

12,622

ITEMS

233.7M

VIEWS

Jan 9, 2015 01/15

eye 233.7M

Survey crawl of domains. This data is currently not publicly accessible.

217.9M 218M

Survey Crawl started July 2015

10,137

ITEMS

217.9M

VIEWS

Jan 9, 2015 01/15

eye 217.9M

Survey crawl of domains. This data is currently not publicly accessible.

196.5M 196M

Survey Crawl May 2014

6,909

ITEMS

196.5M

VIEWS

Apr 25, 2014 04/14

eye 196.5M

Survey crawl of domains started May 2014. This data is currently not publicly accessible.

185.4M 185M

ArchiveBot: The Archive Team Crowdsourced Crawler

5,959

ITEMS

185.4M

VIEWS

Apr 8, 2014 04/14

eye 185.4M

ArchiveBot is an IRC bot designed to automate the archival of smaller websites (e.g. up to a few hundred thousand URLs). You give it a URL to start at, and it grabs all content under that URL, records it in a WARC, and then uploads that WARC to ArchiveTeam servers for eventual injection into the Internet Archive (or other archive sites). To use ArchiveBot, drop by #archivebot on EFNet. To interact with ArchiveBot, you issue commands by typing them into the channel. Note you will need channel...
Topics: archiveteam, archivebot, webcrawl, robot, love

173.2M 173M

Wide Crawl started October 2010

15,839

ITEMS

173.2M

VIEWS

Oct 5, 2010 10/10

eye 173.2M

Web wide crawl with initial seedlist and crawler configuration from October 2010.

170.7M 171M

Wide Crawl Started January 2013

15,138

ITEMS

170.7M

VIEWS

Jan 1, 2013 01/13

eye 170.7M

Wide crawls of the Internet conducted by Internet Archive. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.

170.3M 170M

Wide Crawl started September 2012

22,402

ITEMS

170.3M

VIEWS

Aug 24, 2012 08/12

eye 170.3M

Web wide crawl with initial seedlist and crawler configuration from September 2012.

166M 166M

Wide Crawl Number 13

Crawl performed by Internet Archive. This data is currently not publicly accessible.

49.5M 50M

Archive Team: The News Roundup

26,298

ITEMS

49.5M

VIEWS

Jan 21, 2016 01/16

by Archive Team

eye 49.5M

Archive Team now searches many, many news sites, including extensive worldwide and obscure sources, to capture unique news stories for history.

48.6M 49M

Alexa Crawls EA

1,315

ITEMS

48.6M

VIEWS

Jul 12, 2012 07/12

eye 48.6M

Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
Topic: crawldata

48.4M 48M

Elections Web

1,609

ITEMS

48.4M

VIEWS

Oct 20, 2012 10/12

eye 48.4M

This collection contains collaborative Election crawls performed by IA.
Topics: elections, web

48.4M 48M

Election Crawl 2012

1,608

ITEMS

48.4M

VIEWS

Oct 20, 2012 10/12

eye 48.4M

This crawl was performed in Summer & Fall of 2012 to archive the US Federal Elections.
Topics: US, federal, elections, web, 2012

48.3M 48M

Alexa Crawls DO

493

ITEMS

48.3M

VIEWS

Jul 11, 2012 07/12

eye 48.3M

Crawl data donated by Alexa Internet. This data is currently not publicly accessible.


Subject	Poster	Replies	Date
404 - Redir question	devuser	0	Mar 25, 2017 9:13pm
Why does some censorship exist?	Animedude5555	0	Mar 7, 2017 4:22pm
Please add Electric Furnace	Mrout	0	Mar 6, 2017 3:01am
Retrieve photo's old Hyves profile	stehof	0	Feb 26, 2017 2:52pm
test only	SeaDoo	0	Nov 23, 2016 6:49am
Site specific search options?	JaneLeia	0	Sep 12, 2016 12:37am
Site Removal Please	MGMidget1234	0	Jun 9, 2016 8:57am
Site Removal Request	4687431212	1	May 1, 2016 8:23pm
Re: Site Removal Request	4687431212	0	May 1, 2016 10:41pm
Takedown request	victorlsxiv	0	Apr 24, 2016 7:00am
only two hours left of April20 (420): everybody Wayback cannabis homepages	EarthFurst	2	Apr 20, 2016 4:04pm
Re: only two hours left of April20 (420): everybody Wayback cannabis homepages	EarthFurst	0	Apr 20, 2016 3:31pm
Re: only two hours left of April20 (420): everybody Wayback cannabis homepages	EarthFurst	0	Apr 20, 2016 5:04pm
"archived" pages disappearing from Wayback: reference at archive.is	EarthFurst	1	Apr 20, 2016 12:21pm
Re: 'archived' pages disappearing from Wayback: reference at archive.is	Jeff Kaplan	1	Apr 20, 2016 4:08pm
Re: 'archived' pages disappearing from Wayback: reference at archive.is	EarthFurst	1	Apr 22, 2016 2:37am
Re: 'archived' pages disappearing from Wayback: reference at archive.is	Jeff Kaplan	0	Apr 22, 2016 10:03am
The Wayback Machine Forum is "(closed)", but nothing will stop me from adding this post– BELIEVE IT!	pegzmasta	1	Apr 6, 2016 6:27pm
Re: Original Archive is '(closed)'	PDpolice	1	Apr 6, 2016 6:14pm
Re: Original Archive is '(closed)'	pegzmasta	0	Apr 7, 2016 2:37pm
Multiple Set-Cookie Headers: Wayback	River_Delta_CA_USA	0	Apr 4, 2016 10:17am
Hi, Wayback– Problem Solved!	pegzmasta	1	Apr 3, 2016 11:13am
This Is Only a Test	Dupenhagen Moonbat	1	Apr 6, 2016 4:46pm
Re: This Is Only a Test	pegzmasta	0	Apr 6, 2016 5:27pm
how to query for all the websites that end in ".com.br"?	LucasMation	1	Mar 31, 2016 6:20am
Re: how to query for all the websites that end in '.com.br'?	pegzmasta	1	Apr 1, 2016 10:13am
Re: how to query for all the websites that end in '.com.br'?	LucasMation	1	Apr 1, 2016 12:03pm
Re: how to query for all the websites that end in '.com.br'?	pegzmasta	0	Apr 1, 2016 12:19pm
Challenge: Read, Reply, and Correct! [The Internet Archive is tasked with preserving content on the Internet, but will it preserve and fix it's own forums?]	pegzmasta	0	Mar 16, 2016 2:35pm
How long does it take to get a response from [email protected]?	juwhyonee	1	Feb 26, 2016 10:26am
Re: How long does it take to get a response from [email protected]?	aanon	0	May 3, 2016 5:45am
problem with waybacks of comicbookresources.com homepage after 2013	EarthFurst	0	Feb 18, 2016 1:47am
my website is not archiving	jon617	0	Jan 7, 2016 4:11pm
So does excluding via robots actually delete or not?	talkingnewspapers	0	Jan 7, 2016 9:46am
Crawl and archive a whole website recursively	maltris	1	Jan 7, 2016 2:26am
Re: Crawl and archive a whole website recursively	B4CK and F0RTH	0	Dec 29, 2016 8:22pm
My Website Is Not Crawled Despite Removing Restrictions From Robots.txt	leodwight	0	Jan 4, 2016 7:56pm
What is the algorithm for deciding when to not crawl a page anymore?	zwol	0	Dec 4, 2015 9:37am
End of an era: Imageshack deletes free accounts	Javik	0	Nov 28, 2015 12:55pm
Wayback machine rebuild suggestions	Archive Lover1	1	Oct 23, 2015 8:44am
Re: Wayback machine rebuild suggestions	h891322	0	Dec 12, 2015 5:55am


Web Crawls

3,283,133 RESULTS rss

