
How to Find All Existing and Archived URLs on a Website

Tom Capper


There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you’re searching for. For instance, you may want to:

  • Identify every indexed URL to analyze issues like cannibalization or index bloat
  • Collect current and historic URLs Google has seen, especially for site migrations
  • Find all 404 URLs to recover from post-migration errors

In each scenario, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and difficult to extract data from.

In this post, I’ll walk you through some tools you can use to build your URL list before deduplicating the data in a spreadsheet or Jupyter Notebook, depending on your website’s size.

Old sitemaps and crawl exports

If you’re looking for URLs that disappeared from the live site recently, there’s a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven’t already, check for these files; they can often provide what you need. But, if you’re reading this, you probably did not get so lucky.

Archive.org

Archive.org, which is funded by donations, is an invaluable tool for SEO tasks. If you search for a domain and select the “URLs” option, you can access a list of up to 10,000 URLs.

However, there are a few limitations:

  • URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
  • Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
  • No export option: There isn’t a built-in way to export the list.

To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn’t indicate whether Google indexed a URL—but if Archive.org found it, there’s a good chance Google did, too.
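If you’re comfortable with a little scripting, another way around the URL cap and the missing export is the Wayback Machine’s CDX API, which returns captured URLs in bulk. Here’s a minimal Python sketch; the endpoint and parameters below are the public CDX API’s, but check Archive.org’s documentation for current limits and rate restrictions:

```python
import requests

def wayback_urls(domain, limit=50000):
    """Fetch archived URLs for a domain from the Wayback Machine CDX API."""
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": f"{domain}/*",   # match all paths under the domain
            "output": "json",
            "fl": "original",       # only return the original URL field
            "collapse": "urlkey",   # deduplicate repeat captures of the same URL
            "limit": limit,
        },
        timeout=120,
    )
    resp.raise_for_status()
    rows = resp.json()
    # The first row is the header ("original"); the rest are one-element rows.
    return [row[0] for row in rows[1:]]

urls = wayback_urls("example.com")
print(len(urls), "archived URLs found")
```

You’ll still want to filter out malformed URLs and resource files afterwards, but this gets you a raw list without any manual scraping.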

Moz Pro

While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your site in the process.

How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re dealing with a massive website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.

It’s important to note that Moz Pro doesn’t confirm if URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.
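If the export is too large to sift through by hand, a short script can reduce it to the unique target URLs on your site. A minimal pandas sketch, assuming a CSV export with a “Target URL” column (the column name is an assumption; adjust it to match whatever header your export actually uses):

```python
import pandas as pd

# Sketch: pull unique target URLs from an inbound-links CSV export.
# The file name and the "Target URL" column name are placeholders --
# adjust both to match your actual export.
links = pd.read_csv("inbound_links_export.csv")
target_urls = (
    links["Target URL"]
    .dropna()
    .str.strip()
    .drop_duplicates()
    .sort_values()
)
target_urls.to_csv("moz_target_urls.csv", index=False)
print(f"{len(target_urls)} unique target URLs")
```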

Google Search Console

Google Search Console offers several valuable sources for building your list of URLs.

Links reports:

Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t apply to the export, you might need to rely on browser scraping tools—limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:

This export gives you a list of pages receiving search impressions. While the export is limited, you can use Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
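For the API route, the Search Analytics query method lets you page through every URL that received impressions, 25,000 rows at a time. A minimal sketch with the Google API Python client, assuming you’ve already authorized OAuth credentials with access to the property:

```python
from googleapiclient.discovery import build

# Assumes `creds` is an authorized OAuth credentials object with
# access to the Search Console property (credential setup not shown).
def pages_with_impressions(creds, site_url, start_date, end_date):
    service = build("searchconsole", "v1", credentials=creds)
    urls, start_row = [], 0
    while True:
        body = {
            "startDate": start_date,
            "endDate": end_date,
            "dimensions": ["page"],
            "rowLimit": 25000,   # API maximum per request
            "startRow": start_row,
        }
        rows = (
            service.searchanalytics()
            .query(siteUrl=site_url, body=body)
            .execute()
            .get("rows", [])
        )
        if not rows:
            break
        urls.extend(row["keys"][0] for row in rows)
        start_row += len(rows)
    return urls
```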

Indexing → Pages report:

This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics

The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.

Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click “Create a new segment.”

Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/

Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
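If you’d rather pull the same data programmatically, the GA4 Data API exposes the pagePath dimension, and you can filter it in much the same way as the segment above. A minimal sketch with the official google-analytics-data Python client, assuming a service account with read access to the property; the property ID below is a placeholder:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account
# with read access to the GA4 property.
client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",  # placeholder: your GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    # Narrow to blog URLs, mirroring the /blog/ segment example above.
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
blog_urls = [row.dimension_values[0].value for row in response.rows]
```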

Server log files

Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path queried by users, Googlebot, or other bots during the recorded period.

Considerations:

  • Data size: Log files can be massive, so many sites only retain the last two weeks of data.
  • Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process; a minimal parsing sketch follows below.
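To make the parsing step concrete, here’s a minimal sketch that reads an access log in the common/combined format and collects the unique paths requested. Log formats vary by server and CDN, so treat the regex as a starting point rather than a drop-in parser:

```python
import re
from urllib.parse import urljoin

# Matches the request line in common/combined log format, e.g.
# "GET /blog/post-1?utm=x HTTP/1.1" -> captures the requested path.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[\d.]+"')

def unique_paths(log_path, base="https://example.com"):
    paths = set()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = REQUEST_RE.search(line)
            if match:
                # Strip query strings so /page?a=1 and /page?a=2 collapse.
                paths.add(match.group("path").split("?")[0])
    return sorted(urljoin(base, p) for p in paths)

urls = unique_paths("access.log")
print(f"{len(urls)} unique URL paths seen in the log")
```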

Combine, and good luck

Once you’ve gathered URLs from all these sources, it’s time to combine them. If your site is small enough, use Excel; for larger datasets, tools like Google Sheets or a Jupyter Notebook are a better fit. Ensure all URLs are consistently formatted, then deduplicate the list.
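If you end up in a Jupyter Notebook, the combining step might look something like the following pandas sketch. The file names are placeholders for single-column URL exports from the sources above, and the normalization rules should be adjusted to match how your site actually handles URLs:

```python
import pandas as pd

# Placeholder file names -- one single-column CSV of URLs per source.
sources = [
    "wayback_urls.csv",
    "moz_target_urls.csv",
    "gsc_pages.csv",
    "ga4_pages.csv",
    "log_paths.csv",
]

frames = [pd.read_csv(path, header=None, names=["url"]) for path in sources]
urls = pd.concat(frames, ignore_index=True)["url"].dropna().astype(str)

# Light normalization so the same page isn't counted twice in different
# forms: trim whitespace, drop fragments, strip trailing slashes.
urls = (
    urls.str.strip()
        .str.replace(r"#.*$", "", regex=True)
        .str.rstrip("/")
)

master = urls.drop_duplicates().sort_values()
master.to_csv("all_known_urls.csv", index=False)
print(f"{len(master)} unique URLs across all sources")
```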

And voilà—you now have a comprehensive list of current, old, and archived URLs. Good luck!
