Before your pages can rank on Google, they must be discovered and indexed.
The first step to achieving this is known as crawling.
When crawling, Google (or any other search engine) navigates to your pages through links. A bot grabs all the content on the page, identifies the links within it, and follows them too. However, what many don’t know is that Google’s resources aren’t unlimited. To crawl pages from all over the world, Google has to make compromises and set a unique budget for every site based on several variables.
In today’s guide, we’re going to explore the seven factors that heavily influence your crawl budget and how, so you can be better prepared to identify and deal with them.
1. Faceted Navigation and Infinite Filtering Combinations
Faceted navigation (or faceted search) is an in-page navigation system that allows users to filter items.
When a user picks one of these options, the page will only display the items that match the parameters selected, making it easier for the consumer to find whatever they’re looking for.
This change is commonly made in two ways:
However, what actually interests us is how the URL reacts to this change.
Because of the nature of faceted navigation, in most cases it generates new URLs based on the facets the user picks. There are three common facet implementations:
- Applying new parameters at the end of the URL
- Adding a hash to the URL identifying the facets selected
- Generating a new static URL
In the example above, the URL looks as follows:
Let’s see what happens if we apply a few facets to it:
- Shop by > Men’s: https://www.zumiez.com/snow/snowboards.html?en_gender_styles=Men%27s
- Adding Brand > Burton: https://www.zumiez.com/snow/snowboards.html?brand=Burton&en_gender_styles=Men%27s
- Adding Shape > Directional: https://www.zumiez.com/snow/snowboards.html?brand=Burton&en_gender_styles=Men%27s&en_shape=Directional
As you can see, there’s a new URL variation for each facet we chose, which means there’s a new URL to be crawled by search engines.
A crawler works very straightforwardly: it finds a link > follows that link to a page > gathers all links within that page > and follows them too. This process is very fast for small sites, so there’s no need to worry about wasting the crawl budget.
If your site has around 500 pages, Googlebot can crawl it pretty fast. But what happens when faceted URLs enter the mix, and now you have ten variations for every page? Now a 500-page site becomes a 5,000-page site, and your crawl budget gets wasted. You neither want nor need Google to crawl the same content ten times.
In fact, it’s more likely that only a few important pages will get crawled, as every variation eats into the crawl budget, making new and more important pages take longer to be crawled and indexed.
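To get a feel for how quickly facets multiply, here is a minimal sketch that counts the filtered URLs a single listing page could produce. It assumes each facet is either left unset or set to exactly one value; the facet names come from the example URLs above, while most of the values are made up for illustration.

```python
def count_facet_urls(facets: dict[str, list[str]]) -> int:
    """Count the filtered URL variations of one listing page.

    Assumes each facet is either unselected or set to a single value.
    """
    total = 1
    for values in facets.values():
        total *= len(values) + 1  # +1 for "facet not selected"
    return total - 1  # leave out the unfiltered base URL


# Hypothetical facet values, used only to illustrate the growth.
facets = {
    "en_gender_styles": ["Men's", "Women's", "Kids'"],
    "brand": ["Burton", "Brand B", "Brand C", "Brand D"],
    "en_shape": ["Directional", "Twin", "Directional Twin"],
}

print(count_facet_urls(facets))  # 4 * 5 * 4 - 1 = 79
```

Three small facets already turn one page into 79 extra crawlable URLs, and the number grows further if the platform also treats a different parameter order as a different URL.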
Subfolders, Tags, and Filters
In the case of Magento 2, the ecommerce platform, implementing filters will result in these faceted URLs potentially bloating your site. However, a problem more specific to Magento is that one URL can be part of several subfolders through categories, generating multiple URL variations that display the same content and waste the crawl budget. Shopify users suffer from something similar, but caused by tags.
Tags are usually added at the URL’s end but don’t change much on the page. Like categories, the same product/page can use several tags, creating multiple URL versions that will get crawled without adding any SEO value.
2. Session Identifiers and Tracking IDs
Session IDs and tracking IDs are parameters used for analytics or, in some cases, to remember certain user preferences. When these are implemented through the URL, the server will generate multiple instances of the same page, thus increasing the number of duplicate pages crawlers need to handle.
Unlike with faceted navigation, Google is able to recognize these parameters as irrelevant and choose the original URL, so at least in terms of indexing, it is not very common for irrelevant versions to be indexed or ranked over the canonical version.
However, they still create duplicate content issues, and Google can even perceive your site as spammy. This will, in turn, reduce the crawl budget allocated to your site, as Google doesn’t want to waste resources on low-quality websites, and the budget you do get is spent on irrelevant instances of every page.
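One way to see how many of your URLs are just tracking variants is to normalize them before comparing. The sketch below uses Python's standard `urllib.parse` module; the parameter names in `TRACKING_PARAMS` and `TRACKING_PREFIXES` are assumptions, so swap in whatever your own site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names; adjust to the ones your analytics setup adds.
TRACKING_PARAMS = {"sessionid", "sid", "gclid", "fbclid"}
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url: str) -> str:
    """Return the URL without session or tracking query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


urls = [
    "https://example.com/shoes?color=red&utm_source=news",
    "https://example.com/shoes?sessionid=abc123&color=red",
    "https://example.com/shoes?color=red",
]
# All three variants collapse into a single URL.
print({strip_tracking(u) for u in urls})
```

Running a crawl export or server log through a function like this shows how many distinct pages you really have versus how many URLs crawlers are being offered.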
3. Broken Links and Redirect Chains
As mentioned, Google will follow all links and will only leave once the crawl budget for the session is exhausted. A crawl budget is about server resources and time, so every redirect or broken link is time wasted.
One of the most problematic scenarios is when redirect chains are generated. For every redirect chain, the crawler must jump from link to link until it finally arrives at its destination. As you may have noticed when being redirected yourself, each jump can take anywhere from a few milliseconds to a couple of seconds, which in machine time is very long.
Now, imagine how long it would take for the crawler to get to your content when it has to jump three, four, or seven times. And if you have several chains on your website, you can easily burn your crawl budget just in these chains.
Infinite Redirect Loops
The most dangerous of these chains are the infinite redirect loops.
Basically, each redirect chain should ultimately arrive at a final destination. But what happens if, because of a technical mistake, the final destination just redirects back to the beginning of the chain?
To visualize this, you can imagine that Page A redirects to Page B > Page B redirects to Page C > and Page C redirects to Page A. This creates a closed loop where the destination page never resolves.
When the crawler goes through this process and realizes it’s stuck within a loop, it’ll break the connection and, most likely, leave your site, leaving the rest of your pages uncrawled.
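Chains and loops like the ones described above can be found offline if you export your redirect rules as a simple source-to-target mapping. The following is a minimal sketch under that assumption; the `redirects` dictionary, the paths in it, and the `max_hops` limit are illustrative and do not reflect how any particular crawler behaves.

```python
from typing import Optional


def resolve_redirects(
    start: str, redirects: dict[str, str], max_hops: int = 10
) -> tuple[Optional[str], list[str]]:
    """Follow redirects from `start`.

    Returns (final_url, path). final_url is None when the chain loops
    back on itself or needs more than `max_hops` jumps.
    """
    path = [start]
    seen = {start}
    current = start
    while current in redirects:
        nxt = redirects[current]
        path.append(nxt)
        if nxt in seen or len(path) - 1 > max_hops:
            return None, path  # loop detected or chain too long
        seen.add(nxt)
        current = nxt
    return current, path


chain = {"/old": "/newer", "/newer": "/newest"}
loop = {"/a": "/b", "/b": "/c", "/c": "/a"}

print(resolve_redirects("/old", chain))  # ('/newest', ['/old', '/newer', '/newest'])
print(resolve_redirects("/a", loop))     # (None, ['/a', '/b', '/c', '/a'])
```

Any path longer than two entries is a chain worth flattening into a single redirect, and any `None` result points to a loop or an excessively long chain that needs fixing.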
4. Unoptimized Sitemaps
Sitemaps are text files that provide URLs and information about them (like hreflang variations) to search engines like Google and Bing. In terms of crawling, sitemaps can help search engines determine which URLs to prioritize crawling and help them better understand the relationship between them.
The key term here is prioritization.
Many webmasters create sitemaps without any strategy in mind, adding all existing URLs to the file and potentially having Google crawl irrelevant page variations, low-quality pages, or non-indexable pages. This issue can also appear when you only rely on your CMS capabilities to generate an automatic sitemap.
Instead, the sitemap should be a vehicle to prioritize the main pages that we want to index and rank. For example, adding conversion-only pages (without ranking intention) to the sitemap is a waste of resources, as having those pages indexed won’t make any difference.
Make sure to only add pages that return a 200 HTTP status code, are relevant for your rankings, and are the canonical URL. Search engines will focus on these pages and use your crawl budget more wisely – it’s also a great way to help orphan pages, or pages deep in the hierarchy, be discovered.
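As a rough illustration of these rules, the sketch below filters a list of pages and writes only the eligible ones to a sitemap. The `Page` record and its fields are hypothetical stand-ins for whatever your crawl export contains; only the `urlset`, `url`, and `loc` elements and the namespace come from the sitemap protocol.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class Page:  # hypothetical record, e.g. one row of a crawl export
    url: str
    status: int
    canonical: str
    indexable: bool = True


def build_sitemap(pages: list[Page]) -> str:
    """Build sitemap XML from pages that return 200, are indexable, and are canonical."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        if page.status == 200 and page.indexable and page.canonical == page.url:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = page.url
    return ET.tostring(urlset, encoding="unicode")


pages = [
    Page("https://example.com/", 200, "https://example.com/"),
    Page("https://example.com/old", 301, "https://example.com/"),
    Page("https://example.com/shoes?sort=price", 200, "https://example.com/shoes"),
    Page("https://example.com/thank-you", 200, "https://example.com/thank-you", indexable=False),
]
print(build_sitemap(pages))
```

Only the homepage survives here: the redirect, the parameter variant, and the non-indexable conversion page are all left out.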
5. Site Architecture
The site architecture of your website is the organization of pages based on taxonomies and internal linking. You can imagine your website’s starting point as the homepage and how, from there, the rest of the website is organized in a hierarchical structure.
So how does it relate to the crawl budget? At its most basic, your site is just a combination of different pages connected by hyperlinks and organized under one domain name. Search engines start crawling from your homepage and then move down through every link they can find.
Through this process, crawlers get to map your website to understand its structure, the relationship between the pages (for example, categorizing them by topics) and get a sense of their importance – more specifically, the closer the page is to the homepage, the more important it is.
This means that pages closer to the homepage are prioritized for crawling. As pages sit deeper in the structure, crawlers take longer to find them and deem them less important, so they are crawled less frequently.
It also means that there’s an internal linking dependency. If Page H is only linked from Page G and Page G doesn’t get crawled, Page H won’t be crawled either.
A well-designed site architecture will ensure that all pages are discoverable through the crawling process and help with page prioritization.
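The idea of depth and linking dependency can be checked with a breadth-first walk over your internal links. This is a minimal sketch that assumes you already have the link graph as a dictionary (for example, from a crawl export); the paths are made up.

```python
from collections import deque


def click_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Return the minimum number of clicks from `home` to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical internal link graph: page -> pages it links to.
links = {
    "/": ["/a", "/b"],
    "/a": ["/g"],
    "/g": ["/h"],
    "/orphan": ["/a"],  # links out, but nothing links to it
}

depths = click_depths(links, "/")
unreachable = set(links) - set(depths)

print(depths)       # {'/': 0, '/a': 1, '/b': 1, '/g': 2, '/h': 3}
print(unreachable)  # {'/orphan'}
```

Pages with a high depth are candidates for better internal linking, and anything in `unreachable` cannot be discovered by following links from the homepage at all.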
6. Page Authority / Backlink Profile
In simple terms, a backlink is a link pointing to a page on your domain from an external domain. These inbound links act like votes for your pages and signal to Google that your content is trustworthy, increasing its authority.
While backlinks are more commonly talked about as a ranking factor, they can also help your website increase its crawl budget.
One of the factors in calculating your crawl budget is crawl demand, which determines how often a URL should be recrawled, and one of the main variables taken into account in this process is popularity – which is calculated by the number of internal links and backlinks pointing to the URL.
If your page is getting a lot of link equity – both internally and externally – it means it’s worth crawling more often, and as more and more of your pages need to get crawled more often, your site’s crawl budget will increase.
Another way backlinks help your site get crawled is by letting crawlers discover your URLs from other websites. If a crawler finds your URL on an outside source, it will follow that link and then all the links on that page, creating a healthy crawling effect that grows with every high-quality backlink you earn.
Note: Backlinks from authoritative pages, generating a healthy amount of traffic, are more valuable than a massive number of low-quality backlinks from pages without any traffic.
7. Site Speed and Hosting Setup
Site speed is one of the most talked-about technical SEO topics because it heavily affects user experience and can definitely make a difference in your SEO performance. It’s such an important metric that Google has broken it down into different aspects in its Core Web Vitals scores.
Understanding that in the crawling process, Google needs to send an HTTP request to the server, download all necessary files and then move to the next link paints a clear picture of why slow response times from your website can be an issue.
Every second the crawler has to wait for the page to respond, it’s a second your crawl budget gets consumed just waiting.
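A bit of back-of-the-envelope arithmetic makes the point. The sketch below is a toy model only: it assumes a fixed time window and strictly sequential requests, which is a simplification for illustration and not a description of how Googlebot actually schedules its crawling.

```python
def pages_crawled(time_budget_ms: int, avg_response_ms: int) -> int:
    """Toy model: pages fetched one after another within a fixed time window."""
    return time_budget_ms // avg_response_ms


one_minute = 60_000  # milliseconds

print(pages_crawled(one_minute, 200))    # 300 pages at 200 ms per page
print(pages_crawled(one_minute, 2_000))  # 30 pages at 2 s per page
```

Under these assumptions, a tenfold increase in response time means a tenfold drop in the number of pages that fit in the same window.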
It’s important to note that this is not just about how your website is built (the code). Your hosting service can also make your website unresponsive, even to the point of timing out.
There are several components to it, but a good place to start is the bandwidth. The hosting plan’s bandwidth determines how much information the server can transfer. Think of this like your internet speed; if you’re a streamer with a slow internet connection, no matter how fast your audience’s internet speed is, it will be a slow, painful transmission.
But it’s not just about consuming your crawl budget. Google will avoid overwhelming your server at all costs. After all, the crawling process is traffic coming to your site. If your website times out or takes way too long to load, Google will lower your crawl budget, assuming your server can’t handle the current crawl frequency.
How to Improve Your Crawl Budget with Prerender?
It’s very common for SPAs to:
- Generate hash and faceted URLs, creating, in some cases, a massive number of URLs for Google to crawl
To help you optimize your crawl budget without resorting to time-consuming workarounds, Prerender fetches your URLs, caches all necessary resources, creates a snapshot of all fully rendered pages, and then serves the static HTML to search engines. Crawlers won’t need to deal with any complexities or wait for your pages to load. Prerender provides them with the HTML page so they can move on to the next one without creating bottlenecks.