How does webarchive get content

PeterX
Member
Posts: 590
Joined: Fri Nov 22, 2019 5:46 am

How does webarchive get content

Post by PeterX »

How and where does the web archive get the contents of old, lost web pages?

The page(s) must obviously be stored by someone _before_ they disappear.

I mean, does some person or software store random web pages and check whether they disappear? Or does someone save web pages on their own and later contribute them to the archive?

And how do they know in advance which pages will disappear?

Greetings
Peter
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: How does webarchive get content

Post by nullplan »

The web archive employs programs (I believe those are called "spiders") to download pages along with their embedded assets, and it has a ton of storage somewhere to keep all of it. Obviously they don't know which sites are going to disappear. They just sample pages by some algorithm. If they managed to sample a page you want before it got memory holed, then you are in luck. You can also somehow request that a certain page be added.
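
To make the "spider" idea a bit more concrete, here is a rough sketch of the one step every crawler needs: taking a downloaded page and pulling out the links to visit next plus the assets to store with it. This is not the archive's actual code; the class and function names are made up for illustration, and treating every `<link href>` as an asset is a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndAssetCollector(HTMLParser):
    """Hypothetical helper: collects links and embedded assets from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []   # pages a spider could visit next
        self.assets = []  # images, scripts, stylesheets to store with the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("img", "script") and attrs.get("src"):
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            # Simplification: assumes every <link href> points at an asset.
            self.assets.append(urljoin(self.base_url, attrs["href"]))


def extract(base_url, html):
    """Return (links, assets) as absolute URLs for one already-downloaded page."""
    collector = LinkAndAssetCollector(base_url)
    collector.feed(html)
    return collector.links, collector.assets
```

A real spider would wrap this in a loop: download a page, store it together with its assets, then queue the extracted links according to whatever sampling algorithm it uses.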
Carpe diem!
xenos
Member
Posts: 1121
Joined: Thu Aug 11, 2005 11:00 pm
Libera.chat IRC: xenos1984
Location: Tartu, Estonia

Re: How does webarchive get content

Post by xenos »

One can also manually save pages with web.archive.org/save/URL.
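
If you want to build that address in a script, a minimal sketch could look like this. The function name is made up, and the `https://` scheme in front is my assumption, since the pattern above is written without one.

```python
# Assumed prefix: the save pattern from the post, with an https:// scheme added.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_url(page_url):
    """Build the save address for page_url by appending it to the prefix."""
    return SAVE_PREFIX + page_url
```

For example, `save_url("http://example.com/page")` gives the address you would then open in a browser to request the capture.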
Programmers' Hardware Database // GitHub user: xenos1984; OS project: NOS