Published on January 29, 2024

Enterprise Guide to Finding Pages That Deplete Your Crawl Budget

Enterprise Guide to Finding Pages That Deplete Your Crawl Budget

Struggling with your enterprise website’s crawl budget?

Discover the challenges and practical tips to optimize your website’s crawl budget effectively.

With millions of URLs to analyze, it can be daunting to get started, but with a clear plan and exact targets, you can move faster and more efficiently. Here, we define the type of pages you need to identify, how to find them at scale, and how to overcome the unique challenges enterprise websites face during a crawl budget audit.

The 5 Most Common Pages Depleting Your Crawl Budget

Now that we have a better understanding of what this process will need from us, we can identify the different pages you’ll want to find that are depleting your crawl budget:

1. Soft-404 Errors

Soft-404 errors occur when a page no longer exists (so it should be displaying a 404 not found error) appears like existing because the server returns a 200-ok status code or it’s displaying content it shouldn’t be there anymore.

This error is basically a mismatch between what the search engine expected and what the server actually returns.

For example, another way it can happen is if the page is returning a 200 status code, but the content is blank, from which search engines would interpret it as actually a (soft) 404 error.

Finding these pages is actually quite simple because Google Search Console has a specific soft 404 error report.

Why Pages Aren't Indexed

Although crawlers will eventually stop crawling pages with a “Not found (404)” status code, soft 404 pages will keep wasting your crawl budget until the error is fixed.

Note: It’s a good idea to check your “Not found (404)” report and make sure that these pages are returning the correct status code, just to be sure.

For a more detailed step-by-step guide, follow our tutorial on how to fix soft 404 errors.

2. Redirection Chains

A redirection chain is created when Page A redirects to Page B, which then redirects to Page C, and Page C redirects to Page A, forming an infinite loop that traps Googlebot and depletes your crawl budget.

These chains usually happen during site migrations or large technical changes, so it’s more common than you might think.

To quickly find redirect chains on your site, crawl your site with Screaming Frog and export the redirect chain report.

Screaming Frog Audit - Dashboard

However, crawling your entire site would require a lot of storage and resources from your machine to handle.

The best approach is to crawl your site using the sections you already created. This will take some of the pressure off your machine as well as reduce the amount of time it would take to crawl millions of URLs in one go.

Note: You have to remember that Screaming Frog will also crawl resources like JS files and CSS, taking more time and resources from your machine.

To submit a list for Screaming Frog to crawl, switch to List Mode.

Screaming Frog Crawl Data

Follow our tutorial on fixing redirect chains for clear step-by-step instructions.

3. Duplicate Content

According to Google, “duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content in the same language or are appreciably similar.”

These pages can take many forms like faceted navigation, several URL versions being available, misimplemented Hreflang alternate URLs for localization, or just generic category templates showing almost the same content.

Because Google perceives them as essentially the same page, it will crawl all of these pages but won’t index them or treat them as spam and hurt your overall SEO scores.

In terms of crawl budget, all of these pages are taking resources away from the main pages on your site, wasting precious crawling time. This gets even worse if the pages flagged as duplicate content require JavaScript rendering, which will 9X the time needed to crawl them.

The best way to find duplicate pages is by checking Google Search Console’s “duplicate without user-selected canonical” report, which lists all the pages Google assumes are duplicated pages.

Google Search Console Feedback

You can contrast this information with Screaming Frog by crawling your site locally and then checking the “near duplicate” report.

Duplicate content audit in Screaming Frog

Check our guide on fixing duplicate content issues for a step-by-step tutorial.

4. Slow or Unresponsive URLs

If we understand crawl budget as the number of HTTP requests Googlebot can send to your website (pages crawled) in a specific period of time (crawling session) without overwhelming your site (crawl capacity limit), then we have to assume page speed plays a big role in crawl budget efficiency.

From the moment Google requests your page to the moment it finally renders and processes the page, we can say it is still crawling that URL. The longer your page takes to render, the more crawl budget it consumes.

There are two main strategies that you can use to identify slow pages:

  • Connect PageSpeed Insights with Screaming Frog 

Screaming Frog allows you to connect several SEO tools APIs to gather more information from the URLs you’re crawling.

PageSpeed Insights

Connecting PageSpeed Insights will provide you with all the performance information at a larger scale and directly on the same crawl report.

If you’ve already crawled your site section, don’t worry, Screaming Frog allows you to request data from the APIs you’ve connected retroactively. Avoiding you the pain of crawling thousands of URLs again.

Example of how to request API data in Screaming Frog

  • Identify slow pages through log file analysis

Log files contain real-life data from your server activity, including the time it took for your server to deliver your page to search engines.

The two metrics you want to pay attention to are:

  • Average Bytes – the average size in bytes of all log events for the URL.
  • Average Response Time (ms) – The length of time taken (measured in milliseconds) to download the URL on average.

As a benchmark, your pages should completely load in 2 seconds or less for a good user experience, but Googlebot will stop crawling any URL with a server response longer than two seconds.

Here’s a complete guide to page speed optimization for enterprise sites you can follow to start improving your page performance on both mobile and desktop.

5. Pages with No Search Value in Your Sitemap

Your sitemap acts as a priority list for Google. This file is used to tell Google about the most important pages on your site that require its attention.

However, in many cases, developers and SEOs tend to list all URLs within the sitemap, even those without any relevancy for search.

It’s important to remember that your sitemap won’t replace a well-planned site structure. For example, sitemaps won’t solve issues like orphan pages (URLs without any internal links pointing to them).

Also, just because a URL is in your sitemap doesn’t mean it’ll get crawled. So you have to be mindful about what pages you’re adding to the file, or you could be taking resources away from your priority pages.

Here are the pages you want to identify and remove from your sitemap:

  • Pages tagged as noindex 

This tag tells Google not to index the URL (it won’t show in search results), but Google will still use crawl budget to request, and render that page before dropping it due to the tag.

If you’re adding this URL to your sitemap, you’re wasting crawl budget on pages that have no search relevance.

To find them, you can crawl your site using Screaming Frog and go to the directive report:

Noindex report in Screaming Frog

You can also find these pages within Search Console’s indexing report.

Page indexing feedback in Google Search Console

  • Landing pages and content without ranking potential

There are landing pages and content that just won’t rank, and from millions of URLs, you are bound to have a few of those.

It doesn’t mean there’s no value in having them indexed, it just means there’s no value in having them discovered through the sitemap.

Your sitemap should be a curated list of highly valuable pages for search. Those that move the needle in traffic and conversions.

A great way to find underperforming pages at scale is by connecting your Google Analytics and Search Console directly to Screaming Frog. This will allow you to spot pages with zero rankings quickly.

You can also connect Ahrefs’ API to get backlink data on the URLs you’re analyzing.

  • Error pages, faceted navigation, and session identifiers

It goes without saying, but you also need to clean your sitemap from any page returning 4XX and 5XX status codes, URLs with filters, session identifiers, canonicalized URLs, etc.

Note: You don’t need to block canonicalized pages through the Robot.txt file.

For example, there’s no value in adding the entire pagination of a product category. You only need the first page of the pagination, and Google will find the rest while crawling it.

You’ve already found most of these pages in the previous steps. Now check if any of them is in your sitemap and get them out of there.

The Unique Challenges of Enterprise Crawl Budget Optimization

Although enterprise sites must follow the same best practices as any other type of website, enterprise sites represent a unique challenge because of their large scale and complex structure. To ensure we’re on the same page, here are the three main challenges you’ll encounter during your enterprise crawl budget optimization process:

  • High Number of URLs – one common denominator among enterprise sites is their size. In most cases, enterprise websites have over 1M URLs, making it harder for any team to stay on top of each and every one of them. Plus, it also makes it easy for teams to overlook some URLs and make mistakes. 
  • Deadlines and Resources – something that’s very common in this kind of project is having unrealistic deadlines with limited resources. These resources are both time, people, and tools. Crawl budget optimization is a highly technical and collaborative endeavor, and for teams to be able to manage it successfully, they need to automate certain tasks with tools, count with different experts like writers, SEOs, and developers, and have enough time to implement safely.
  • Implementation and Testing – connected to the previous point, teams usually don’t have enough time for testing, which can have terrible results if your teams have to rush through implementation. Because of the complex structure of your site, even a small change might affect hundreds or thousands of URLs, creating issues like rank drops.

That said, making changes to your site shouldn’t be scary as long as you develop a solid plan, provide the right resources (including time, tools, and experts), and be conscious that testing is a necessary part of the process to avoid losing traffic and conversions.

3 Tips to Plan Your Enterprise Crawl Budget Audit

Every plan has to be tailored-made to fit your own organization, business needs, and site structure, so there’s no one-fit-all strategy you can find online.

Still, there are three simple tips you can use to make your crawl budget audit easier to manage and implement:

1. Find a Logical Way to Break Down Your Site

You can group similar pages following your website structure (which is the hierarchical order pages are linked between each other in relation to the homepage), creating individual sections that are easier to understand and evaluate.

Web Structure Example

Because every website has a unique structure, there isn’t a single best way to separate every site. Nevertheless, you can use any of the following ideas to get started:

  • Group your pages by type – not all pages are built equally, but there are groups of pages sharing a similar approach. For example, your main landing pages might use the same template across all categories, so evaluating these pages together will give you insights into a bigger portion of your site faster than if you analyze them one by one.
  • Group your pages by cluster – another way to go about this is by identifying topic clusters. For example, you might have category pages, landing pages, and content pages talking about similar topics and, most likely than not, linking to each other, so it makes sense for a team to focus on the whole.
  • Group your pages by hierarchy – as you can see in the image above, your structure has a “natural” flow based on the way you’ve structured your website. By building a graphical representation of your pages, you can separate the different sections of your pages based on this structure – however, for this to work, your website needs a clear structure, if it doesn’t, it’s better to go with any of the previous ideas.

2. Delegate Each Site Section to an SEO Team

Once you know how you want to divide your website, it’s time to get all URLs into a central spreadsheet and then allocate each site section to a particular team – you can copy all the URLs assigned to a team into a different spreadsheet.

When we talk about teams, it doesn’t necessarily mean having three different experts per section of your page – although that would be ideal.

Instead, you’ll want to have one SEO professional for each section of the site doing the audit. These are the ones that’ll provide all the guidelines, assign tasks, and have oversight of the entire project.

For technical optimization like page speed, you’ll want one or two dedicated developers. This will ensure that you’re not taking time away from the product.

Lastly, you’ll want at least two writers to help you merge content together where needed.

By creating these teams, you can have a smoother process that doesn’t take away resources from other marketing areas or engineering, and because they’ll focus solely on this project, it can be managed faster.

3. Start Testing

You can use a project management tool like Asana or Trello to create a centralized operation, allowing data to flow throughout the project.

Once the first tasks are assigned (implementation tasks), it’s important to give a testing period to see how your rankings are being affected by changes. In other words, instead of rolling all changes at the same time, it’s better to have a couple of months to take a small sample of URLs and make the changes.

If the effects are negative, you’ll have the time to revert the changes back and study what happened without putting the entire site at risk.

 

Wrapping Up: Take Care of Your Pages’ Rendering

Rendering has a direct impact on your crawl budget and is arguably the biggest. When Google finds a JS file within a page, it’ll need to add an additional rendering step to access the entire content of the URL.

This rendering step makes the crawling process take 9x longer than HTML static pages and can deplete your crawl budget between just a few URLs, leaving the rest of your site undiscovered.

Furthermore, many things can go wrong during the rendering process, like timeouts or partially rendered content, which create all sorts of SEO issues like missing meta tags, missing content, thin pages, or URLs getting ignored.

To find these pages, you’ll need to identify all URLs using JS to inject or modify the pages’ content. Most commonly, these are pages showing dynamic data, product listings, and dynamically generated content.

However, if your website is built using a JavaScript framework like React, Angular, or Vue, you’ll experience this rendering issue across the entire site.

The good news is Prerender can solve all rendering issues from its root, increase crawl budget, and speed up the crawling process with a simple plug-and-play installation.

Prerender crawls your pages and generates and caches a 100% index-ready snapshot – a completely rendered version – of your page.

When a search engine requests your URL again, Prerender will deliver the snapshot in 0.03 seconds, taking the rendering process off Google’s shoulders and allowing it to crawl more pages faster.

Explore more about crawl budget optimization:

Picture of Leo Rodriguez

Leo Rodriguez

Leo is a technical content writer based in Italy with over 4 years of experience in technical SEO. He’s currently working at saas.group as a content marketing manager. Contact him on LinkedIn.

Table of Contents

Prerender’s Newsletter

We’ll not send more than one email per week.

More From Our Blog

What is INP? How does it impact your site’s configuration and SEO results.
We compare the features and pricing of 4 SEO solutions similar to Botify.

Increased Traffic and
Sales Awaits

Unlock missed opportunities and reach your full SEO potential. When more web pages are crawled, it’s easier to index more of your site and boost SEO performance. Get started with 1,000 URLs free.