Search engines crawl the web to find pages, save them to an index, and later show them to users when a page looks likely to help answer a query.
The crawling process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As Google’s crawlers visit these websites, they use links on those sites to discover other pages. Their software pays special attention to new sites, changes to existing sites, and dead links.
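The link-following idea above can be sketched in a few lines. This is only a toy illustration, not how Google's crawlers are actually built: the `crawl` function, the `fetch` callable, and the in-memory `PAGES` dictionary are all hypothetical stand-ins for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch):
    """Breadth-first discovery starting from a list of seed URLs.

    `fetch` is a hypothetical callable that returns the HTML for a URL,
    or None when the page is missing (a dead link).
    """
    queue = deque(seeds)
    seen = set(seeds)
    discovered, dead = [], []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            dead.append(url)
            continue
        discovered.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return discovered, dead


# Toy in-memory "web" standing in for real HTTP requests (an assumption).
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/old">Old</a>',
    "https://example.com/about": '<a href="/">Home</a>',
}

discovered, dead = crawl(["https://example.com/"], PAGES.get)
```

Starting from one seed URL, the sketch finds the linked About page and records the link to a missing page as dead, which mirrors the three things the crawler cares about: new pages, known pages, and dead links.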
These changes are registered in a few ways:
- Webmasters can leverage Google Search Console to request indexing of individual URLs
- Webmasters can update XML sitemaps with new URLs
- Webmasters can provide detailed instructions about how to process pages on their sites, for example through a robots.txt file
PRO TIP: XML Sitemaps are an SEO’s best friend for a few reasons. Multiple sitemaps allow you to organize your web pages into product groupings, categories, content types, sales funnel positions, test groupings, and so much more.
PRO TIP: Update modification times in the XML sitemap only when substantial changes are made to those web pages.
PRO TIP: <priority> doesn’t matter to Google in particular; Google ignores it.
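To make the sitemap tips concrete, here is a minimal sketch that builds a sitemap with Python's standard library. The `build_sitemap` helper, the example URLs, and the dates are hypothetical; the sketch writes a `<lastmod>` value for each page and leaves `<priority>` out entirely.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for (url, last_modified_date) pairs.

    Dates are ISO 8601 strings and should change only when the page
    itself changes substantially.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        # No <priority> element is written.
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages for one product grouping.
sitemap_xml = build_sitemap([
    ("https://example.com/products/widget", "2024-01-15"),
    ("https://example.com/products/gadget", "2024-02-03"),
])
```

One file per grouping (products, blog posts, test pages) could be generated the same way and then listed in a sitemap index file.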
A quick note on robots.txt files
Please don’t leave these blank. Please. At the very least, include a link to the XML sitemap index file in the robots.txt file. Beyond that, the robots.txt file is the fastest and (often) easiest way to limit which pages get crawled.
Common example: A blank robots.txt file exists on a site where the SEO team and the Paid Marketing team both have pages that serve the same purpose. Both are indexable by default, which creates a duplicate content issue. If the robots.txt file had a “Disallow” rule at the subfolder level for the paid marketing versions of the pages, crawlers would skip them and the duplication issue would not arise, saving everyone headaches.
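As a rough sketch of that fix, the snippet below checks a small robots.txt against two URLs using Python's built-in `urllib.robotparser`. The `/paid/` subfolder, the page URLs, and the `sitemap_index.xml` filename are assumptions made up for illustration; substitute whatever paths your site really uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the paid marketing subfolder for all
# crawlers and point them at the XML sitemap index.
ROBOTS_TXT = """\
User-agent: *
Disallow: /paid/

Sitemap: https://example.com/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The SEO version of the page stays crawlable.
organic_ok = parser.can_fetch("Googlebot", "https://example.com/widgets/blue-widget")

# The paid marketing duplicate under /paid/ is blocked.
paid_ok = parser.can_fetch("Googlebot", "https://example.com/paid/blue-widget")

# Sitemap URLs declared in the file (site_maps() requires Python 3.8+).
sitemaps = parser.site_maps()
```

Running a check like this before deploying a robots.txt change is a cheap way to confirm that the rule blocks only the subfolder you intended.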