When Online Content Disappears (2024)

38% of webpages that existed in 2013 are no longer accessible a decade later

How we did this

Pew Research Center conducted this analysis to examine how often online content that once existed becomes inaccessible. One part of the study looks at a representative sample of webpages that existed over the past decade to see how many are still accessible today. For this analysis, we collected a sample of pages from the Common Crawl web repository for each year from 2013 to 2023. We then tried to access those pages to see how many still exist.

A second part of the study looks at the links on existing webpages to see how many of those links are still functional. We did this by collecting a large sample of pages from government websites, news websites and the online encyclopedia Wikipedia.

We identified relevant news domains using data from the audience metrics company comScore and relevant government domains (at multiple levels of government) using data from get.gov, the official administrator for the .gov domain. We collected the news and government pages via Common Crawl and the Wikipedia pages from an archive maintained by the Wikimedia Foundation. For each collection, we identified the links on those pages and followed them to their destination to see what share of those links point to sites that are no longer accessible.

A third part of the study looks at how often individual posts on social media sites are deleted or otherwise removed from public view. We did this by collecting a large sample of public tweets on the social media platform X (then known as Twitter) in real time using the Twitter Streaming API. We then tracked the status of those tweets for a period of three months using the Twitter Search API to monitor how many were still publicly available. Refer to the report methodology for more details.

The internet is an unimaginably vast repository of modern life, with hundreds of billions of indexed webpages. But even as users across the world rely on the web to access books, images, news articles and other resources, this content sometimes disappears from view.

A new Pew Research Center analysis shows just how fleeting online content actually is:

  • A quarter of all webpages that existed at one point between 2013 and 2023 are no longer accessible, as of October 2023. In most cases, this is because an individual page was deleted or removed on an otherwise functional website.
  • For older content, this trend is even starker. Some 38% of webpages that existed in 2013 are not available today, compared with 8% of pages that existed in 2023.

This “digital decay” occurs in many different online spaces. We examined the links that appear on government and news websites, as well as in the “References” section of Wikipedia pages as of spring 2023. This analysis found that:

  • 23% of news webpages contain at least one broken link, as do 21% of webpages from government sites. News sites with a high level of site traffic and those with less are about equally likely to contain broken links. Local-level government webpages (those belonging to city governments) are especially likely to have broken links.
  • 54% of Wikipedia pages contain at least one link in their “References” section that points to a page that no longer exists.

To see how digital decay plays out on social media, we also collected a real-time sample of tweets during spring 2023 on the social media platform X (then known as Twitter) and followed them for three months. We found that:

  • Nearly one-in-five tweets are no longer publicly visible on the site just months after being posted. In 60% of these cases, the account that originally posted the tweet was made private, suspended or deleted entirely. In the other 40%, the account holder deleted the individual tweet, but the account itself still existed.
  • Certain types of tweets tend to go away more often than others. More than 40% of tweets written in Turkish or Arabic are no longer visible on the site within three months of being posted. And tweets from accounts with the default profile settings are especially likely to disappear from public view.

How this report defines inaccessible links and webpages

There are many ways of defining whether something on the internet that used to exist is now inaccessible to people trying to reach it today. For instance, “inaccessible” could mean that:

  • The page no longer exists on its host server, or the host server itself no longer exists. Someone visiting this type of page would typically receive a variation on the “404 Not Found” server error instead of the content they were looking for.
  • The page address exists but its content has been changed – sometimes dramatically – from what it was originally.
  • The page exists but certain users – such as those with blindness or other visual impairments – might find it difficult or impossible to read.

For this report, we focused on the first of these: pages that no longer exist. The other definitions of accessibility are beyond the scope of this research.

Our approach is a straightforward way of measuring whether something online is accessible or not. But even so, there is some ambiguity.

First, there are dozens of status codes indicating a problem that a user might encounter when they try to access a page. Not all of them definitively indicate whether the page is permanently defunct or just temporarily unavailable. Second, for security reasons, many sites actively try to prevent the sort of automated data collection that we used to test our full list of links.

For these reasons, we used the most conservative estimate possible for deciding whether a site was actually accessible or not. We counted pages as inaccessible only if they returned one of nine error codes that definitively indicate that the page and/or its host server no longer exist or have become nonfunctional – regardless of how they are being accessed, and by whom. The full list of error codes that we included in our definition is in the methodology.
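As a rough illustration of this conservative rule, the sketch below counts a page as inaccessible only when its status code falls in a fixed "definitively dead" set. The set used here (404 and 410) is an assumption for illustration only; the report's nine codes are listed in its methodology and are not reproduced here. The function name `is_inaccessible` is likewise hypothetical.

```python
# ASSUMPTION: a stand-in set of "definitively dead" status codes.
# The report's actual nine codes are in its methodology, not in this text.
ASSUMED_DEAD_CODES = frozenset({404, 410})


def is_inaccessible(status_code, dead_codes=ASSUMED_DEAD_CODES):
    """Return True only when the status code is in the conservative dead set.

    Anything else, including a missing status (None) or ambiguous codes
    such as 403, 429 or 503, is treated as not proven inaccessible.
    """
    return status_code is not None and status_code in dead_codes


if __name__ == "__main__":
    for code in (200, 403, 404, 410, 503, None):
        print(code, is_inaccessible(code))
```

Under this rule, a page that blocks automated requests or is temporarily down is not counted as broken, so the resulting share is a floor rather than a best guess.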

Here are some of the findings from our analysis of digital decay in various online spaces.

Webpages from the last decade

To conduct this part of our analysis, we collected a random sample of just under 1 million webpages from the archives of Common Crawl, an internet archive service that periodically collects snapshots of the internet as it exists at different points in time. We sampled pages collected by Common Crawl each year from 2013 through 2023 (approximately 90,000 pages per year) and checked to see if those pages still exist today.

We found that 25% of all the pages we collected from 2013 through 2023 were no longer accessible as of October 2023. This figure is the sum of two different types of broken pages: 16% of pages are individually inaccessible but come from an otherwise functional root-level domain; the other 9% are inaccessible because their entire root domain is no longer functional.
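The two-way split described above can be sketched as a small tally. This is a minimal example under stated assumptions: each page is represented by two booleans (whether the page responded, and whether its root domain responded), and the names `classify_page` and `breakdown` are hypothetical rather than taken from the report.

```python
from collections import Counter

CATEGORIES = ("accessible", "page_broken", "domain_broken")


def classify_page(page_ok, domain_ok):
    """Label one page by whether it, and its root domain, still respond."""
    if page_ok:
        return "accessible"
    return "page_broken" if domain_ok else "domain_broken"


def breakdown(results):
    """Return the share of pages in each category.

    results: iterable of (page_ok, domain_ok) boolean pairs.
    """
    counts = Counter(classify_page(p, d) for p, d in results)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pages to summarize")
    return {name: counts[name] / total for name in CATEGORIES}


if __name__ == "__main__":
    # Synthetic data shaped to match the reported split: 75 / 16 / 9.
    sample = [(True, True)] * 75 + [(False, True)] * 16 + [(False, False)] * 9
    shares = breakdown(sample)
    print(shares, shares["page_broken"] + shares["domain_broken"])
```

The overall inaccessible share is the sum of the two broken categories, which is how 16% and 9% combine into the 25% figure.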

Not surprisingly, the older snapshots in our collection had the largest share of inaccessible links. Of the pages collected from the 2013 snapshot, 38% were no longer accessible in 2023. But even for pages collected in the 2021 snapshot, about one-in-five were no longer accessible just two years later.

Links on government websites

We sampled around 500,000 pages from government websites using the Common Crawl March/April 2023 snapshot of the internet, including a mix of different levels of government (federal, state, local and others). We found every link on each page and followed a random selection of those links to their destination to see if the pages they refer to still exist.

Across the government websites we sampled, there were 42 million links. The vast majority of those links (86%) were internal, meaning they link to a different page on the same website. An explainer resource on the IRS website that links to other documents or forms on the IRS site would be an example of an internal link.
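A minimal sketch of how links on a page might be extracted and sorted into internal and external, using only Python's standard library. It assumes that "the same website" means an identical hostname, which is a simplification and not necessarily the rule the report used; `LinkCollector`, `split_links` and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


def split_links(page_url, html):
    """Return (internal, external) lists of http(s) links found in html.

    ASSUMPTION: a link is internal when its hostname equals the page's.
    """
    collector = LinkCollector(page_url)
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    internal, external = [], []
    for link in collector.links:
        parts = urlsplit(link)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        target = internal if parts.hostname == page_host else external
        target.append(link)
    return internal, external


if __name__ == "__main__":
    html = '<a href="/forms/a.pdf">Form</a> <a href="https://example.org/x">Out</a>'
    print(split_links("https://www.example.gov/help", html))
```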

Around three-quarters of government webpages we sampled contained at least one on-page link. The typical (median) page contains 50 links, but many pages contain far more. A page in the 90th percentile contains 190 links, and a page in the 99th percentile (that is, the top 1% of pages by number of links) has 740 links.
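To make the percentile language concrete, here is a small sketch that computes the median, 90th and 99th percentiles of per-page link counts. It uses the nearest-rank definition of a percentile, which is an assumption; the report does not say which definition it used. The link counts below are made up.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical link counts for 100 pages: 1, 2, ..., 100.
    link_counts = list(range(1, 101))
    for pct in (50, 90, 99):
        print(pct, percentile(link_counts, pct))
```

Because link counts are heavily skewed, the 99th percentile can sit far above the median, which is why the report gives all three figures rather than a single average.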

Other facts about government webpage links:

  • The vast majority go to secure HTTP pages (and have a URL starting with “https://”).
  • 6% go to a static file, like a PDF document.
  • 16% now redirect to a different URL than the one they originally pointed to.

When we followed these links, we found that 6% point to pages that are no longer accessible. Similar shares of internal and external links are no longer functional.

Overall, 21% of all the government webpages we examined contained at least one broken link. Across every level of government we looked at, there were broken links on at least 14% of pages; city government pages had the highest rates of broken links.
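The link-level figure (share of links that are broken) and the page-level figure (share of pages with at least one broken link) measure different things, and the second is typically much larger because pages carry many links. A minimal sketch with made-up data, where `broken_link_rates` is a hypothetical helper:

```python
def broken_link_rates(pages):
    """Return (share of links broken, share of pages with any broken link).

    pages: list of pages, each a list of booleans where True marks a
    broken link.
    """
    total_links = sum(len(page) for page in pages)
    if not pages or total_links == 0:
        raise ValueError("need at least one page and one link")
    broken_links = sum(sum(page) for page in pages)
    pages_with_broken = sum(1 for page in pages if any(page))
    return broken_links / total_links, pages_with_broken / len(pages)


if __name__ == "__main__":
    # Four hypothetical pages of ten links each; one page has one broken link.
    pages = [[False] * 9 + [True]] + [[False] * 10 for _ in range(3)]
    link_rate, page_rate = broken_link_rates(pages)
    print(link_rate, page_rate)
```

In this toy case only 2.5% of links are broken, yet a quarter of the pages are affected.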

Links on news websites

For this analysis, we sampled 500,000 pages from 2,063 websites classified as “News/Information” by the audience metrics firm comScore. The pages were collected from the Common Crawl March/April 2023 snapshot of the internet.

Across the news sites sampled, this collection contained more than 14 million links pointing to an outside website. Some 94% of these pages contain at least one external-facing link. The median page contains 20 links, and pages in the top 10% by link count have 56 links.

Like government websites, the vast majority of these links go to secure HTTP pages (those with a URL beginning with “https://”). Around 12% of links on these news sites point to a static file, like a PDF document. And 32% of links on news sites redirected to a different URL than the one they originally pointed to – slightly less than the 39% of external links on government sites that redirect.

When we tracked these links to their destination, we found that 5% of all links on news site pages are no longer accessible. And 23% of all the pages we sampled contained at least one broken link.

Broken links are about as prevalent on the most-trafficked news websites as they are on the least-trafficked sites. Some 25% of pages on news websites in the top 20% by site traffic have at least one broken link. That is nearly identical to the 26% of pages on sites in the bottom 20% by site traffic.

Reference links on Wikipedia

For this analysis, we collected a random sample of 50,000 English-language Wikipedia pages and examined the links in their “References” section. The vast majority of these pages (82%) contain at least one reference link – that is, one that directs the reader to a webpage other than Wikipedia itself.

In total, there are just over 1 million reference links across all the pages we collected. The typical page has four reference links.

The analysis indicates that 11% of all references linked on Wikipedia are no longer accessible. On about 2% of source pages containing reference links, every link on the page was broken or otherwise inaccessible, while another 53% of pages contained at least one broken link.

Posts on Twitter

For this analysis, we collected nearly 5 million tweets posted from March 8 to April 27, 2023, on the social media platform X, which at the time was known as Twitter. We did this using Twitter’s Streaming API, collecting 3,000 public tweets every 30 minutes in real time. This provided us with a representative sample of all tweets posted on the platform during that period. We monitored those tweets until June 15, 2023, and checked each day to see if they were still available on the site or not.

At the end of the observation period, we found that 18% of the tweets from our initial collection window were no longer publicly visible on the site. In a majority of cases, this was because the account that originally posted the tweet was made private, suspended or deleted entirely. For the remaining tweets, the account that posted the tweet was still visible on the site, but the individual tweet had been deleted.

Which tweets tend to disappear?

Tweets were especially likely to be deleted or removed over the course of our collection period if they were:

  • Written in certain languages. Nearly half of all the Turkish-language tweets we collected – and a slightly smaller share of those written in Arabic – were no longer available at the end of the tracking period.
  • Posted by accounts using the site’s default profile settings. More than half of tweets from accounts using the default profile image were no longer available at the end of the tracking period, as were more than a third from accounts with a default bio field. Tweets from these accounts tend to disappear because the entire account has been deleted or made private, as opposed to the individual tweet being deleted.
  • Posted by unverified accounts.

We also found that removed or deleted tweets tended to come from newer accounts with relatively few followers and modest activity on the site. On average, tweets that were no longer visible on the site were posted by accounts around eight months younger than those whose tweets stayed on the site.

And when we analyzed the types of tweets that were no longer available, we found that retweets, quote tweets and original tweets did not differ much from the overall average. But replies were relatively unlikely to be removed – just 12% of replies were inaccessible at the end of our monitoring period.

Most tweets that are removed from the site tend to disappear soon after being posted. In addition to looking at how many tweets from our collection were still available at the end of our tracking period, we conducted a survival analysis to see how long these tweets tended to remain available. We found that:

  • 1% of tweets are removed within one hour
  • 3% within a day
  • 10% within a week
  • 15% within a month

Put another way: Half of tweets that are eventually removed from the platform are unavailable within the first six days of being posted. And 90% of these tweets are unavailable within 46 days.
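The two views above (share of all tweets removed within a window, and time by which half of removed tweets are gone) can be sketched with a simple empirical tally. This is a simplified stand-in, not the survival analysis the report ran: it ignores censoring and assumes every tweet was observed for the full period. The lifetimes are made up, and both function names are hypothetical.

```python
import statistics


def share_removed_within(lifetimes_hours, window_hours):
    """Share of all tweets removed at or before window_hours.

    lifetimes_hours: hours until removal, or None if never removed.
    """
    if not lifetimes_hours:
        raise ValueError("no tweets to summarize")
    removed = sum(
        1 for t in lifetimes_hours if t is not None and t <= window_hours
    )
    return removed / len(lifetimes_hours)


def median_time_to_removal(lifetimes_hours):
    """Median hours to removal, among tweets that were removed at all."""
    removed = [t for t in lifetimes_hours if t is not None]
    if not removed:
        raise ValueError("no removed tweets")
    return statistics.median(removed)


if __name__ == "__main__":
    # Ten hypothetical tweets; four are removed, six stay up.
    lifetimes = [0.5, 12, 100, 200] + [None] * 6
    for label, hours in (("hour", 1), ("day", 24), ("week", 168)):
        print(label, share_removed_within(lifetimes, hours))
    print("median hours to removal:", median_time_to_removal(lifetimes))
```

The first function uses all tweets as its denominator, while the second looks only at tweets that were eventually removed, matching the shift in framing between the list and the "put another way" sentence.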

Tweets don’t always disappear forever, though. Some 6% of the tweets we collected disappeared and then became available again at a later point. This could be due to an account going private and then returning to public status, or to the account being suspended and later reinstated. Of those “reappeared” tweets, the vast majority (90%) were still accessible on Twitter at the end of the monitoring period.
