Google has made changes to a few of its Google Search help documents over the past couple of days. The updated docs cover HTTP status codes, Googlebot, and job postings. Note that the HTTP status code material is not new; the content has just been moved from one location to another.
For Googlebot, Google now specifies how many bytes of textual content, such as HTML, Googlebot will crawl. Here are the new lines of text:
Googlebot can crawl the first 15MB of content in an HTML file or compatible text file. After the first 15MB of the file, Googlebot stops crawling and only considers the first 15MB of content for indexing.
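As a rough illustration of that limit, the sketch below measures how much of a file would fall after the first 15MB. It assumes 1MB means 1024 × 1024 bytes, which the quoted text does not spell out, and the function names are made up for this example.

```python
# Assumption: 15MB is treated as 15 * 1024 * 1024 bytes.
GOOGLEBOT_TEXT_LIMIT_BYTES = 15 * 1024 * 1024


def crawlable_prefix(content: bytes, limit: int = GOOGLEBOT_TEXT_LIMIT_BYTES) -> bytes:
    """Return the leading part of the file that fits within the limit."""
    return content[:limit]


def bytes_beyond_limit(content: bytes, limit: int = GOOGLEBOT_TEXT_LIMIT_BYTES) -> int:
    """Return how many bytes come after the first `limit` bytes (0 if none)."""
    return max(0, len(content) - limit)
```

Calling `bytes_beyond_limit(open("page.html", "rb").read())` on a saved page (the file name is only a placeholder) gives a quick sense of whether any content sits past the cutoff.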
On jobs, Google clarified that when using the jobLocation property, you must also include the addressCountry property.
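To make the requirement concrete, here is a minimal sketch that checks a JobPosting structured data object, represented as a Python dict, for a missing addressCountry. The sample values and the helper name `missing_address_country` are illustrative assumptions, not something taken from Google's documentation.

```python
def missing_address_country(job_posting: dict) -> bool:
    """Return True if any jobLocation entry lacks address.addressCountry.

    A posting with no jobLocation at all returns False, since the rule
    only applies when jobLocation is used.
    """
    locations = job_posting.get("jobLocation", [])
    if isinstance(locations, dict):  # a single Place rather than a list
        locations = [locations]
    for place in locations:
        address = place.get("address") if isinstance(place, dict) else None
        if not isinstance(address, dict) or not address.get("addressCountry"):
            return True
    return False


# Illustrative markup only; the values are placeholders.
example = {
    "@context": "https://schema.org/",
    "@type": "JobPosting",
    "title": "Example role",
    "jobLocation": {
        "@type": "Place",
        "address": {
            "@type": "PostalAddress",
            "addressLocality": "Example City",
            "addressCountry": "US",
        },
    },
}
```

Here `missing_address_country(example)` returns `False`, because the address includes a country.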
HTTP status codes
The HTTP status codes document added a large section on soft 404 errors that did not exist in the old version. Here is what has moved into this document:
FWIW the soft-404 docs just moved, they are… not new 🙂
— 🐝 johnmu.csv (personal) 🐝 (@JohnMu) June 23, 2022
Soft 404 errors
A soft 404 error occurs when a URL returns a page telling the user that the page does not exist and also a 200 (success) status code. In some cases, it may be a page with no main content or an empty page. These pages may be generated for various reasons by your website’s web server or content management system, or by the user’s browser. For instance:
- A missing server-side include file.
- A broken connection to the database.
- An empty internal search results page.
It’s a bad user experience to return a 200 (success) status code and then display or suggest an error message or some kind of error on the page. Users may think the page is a live working page, but then are presented with some kind of error. Such pages are excluded from Search.
When Google’s algorithms detect that the page is actually an error page based on its content, Search Console shows a soft 404 error in the site’s Index Coverage report.
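Google does not publish how its algorithms make this call, so the following is only a crude heuristic for auditing your own pages: it flags a response that returns 200 but has almost no text or contains a typical error phrase. The phrase list, the 50-character threshold, and the function name are all assumptions chosen for illustration.

```python
import re

# Assumed phrases; adjust to the wording your own site uses.
ERROR_PHRASES = ("page not found", "does not exist", "no results found")


def looks_like_soft_404(status_code: int, html: str, min_text_chars: int = 50) -> bool:
    """Flag 200 responses that are nearly empty or read like an error page."""
    if status_code != 200:
        return False  # a real 404 or 410 is not a *soft* 404
    # Crude tag stripping; script and style contents are not removed.
    text = re.sub(r"<[^>]+>", " ", html)
    text = " ".join(text.split()).lower()
    if len(text) < min_text_chars:
        return True
    return any(phrase in text for phrase in ERROR_PHRASES)
```

A check like this is best run against the rendered HTML, since an empty internal search results page or a failed database connection often only shows up after the page has been built.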
Fix soft 404 errors
Depending on the state of the page and the desired outcome, you can resolve soft 404 errors in several ways. Try to determine which solution would be best for your users.
The page and content are no longer available
If you deleted the page and there is no replacement page on your site with similar content, return a 404 (not found) or 410 (gone) response (status) code for the page. These status codes tell search engines that the page does not exist and that the content should not be indexed.
If you have access to your server’s configuration files, you can make these error pages useful to users by customizing them. A good custom 404 page helps people find the information they’re looking for and also provides other useful content that encourages people to explore your site further. Here are some tips for designing a useful custom 404 page:
- Make it clear to visitors that the page they are looking for cannot be found. Use friendly and inviting language.
- Make sure your 404 page looks the same (including the navigation) as the rest of your site.
- Consider adding links to your most popular articles or publications, as well as a link to your site’s home page.
- Consider providing users with a way to report a broken link.
Custom 404 pages are created for users only. Since these pages are useless from a search engine’s perspective, make sure the server returns an HTTP 404 status code to prevent the pages from being indexed.
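The point about pairing a friendly page with a real 404 status can be sketched with Python's standard `http.server` module. This is a minimal illustration, not a production setup; `KNOWN_PAGES`, `NOT_FOUND_HTML`, `respond`, and `SiteHandler` are hypothetical names invented for the example.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler

# Hypothetical site content; a real site would look pages up elsewhere.
KNOWN_PAGES = {"/": b"<h1>Home</h1>"}

# A friendly custom error page with a link back to the home page.
NOT_FOUND_HTML = (
    b"<h1>Sorry, we could not find that page</h1>"
    b'<p>Try the <a href="/">home page</a> instead.</p>'
)


def respond(path):
    """Return (status, body); unknown paths get the custom page with a real 404."""
    if path in KNOWN_PAGES:
        return HTTPStatus.OK, KNOWN_PAGES[path]
    return HTTPStatus.NOT_FOUND, NOT_FOUND_HTML


class SiteHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = respond(self.path)
        self.send_response(status)  # 404 for missing pages, never 200
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, pass `SiteHandler` to `http.server.HTTPServer(("127.0.0.1", 8000), SiteHandler)` and call `serve_forever()`. A request for a missing path then shows the custom page while still reporting a 404 status.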
The page or content is now somewhere else
If your page has moved or has a clear replacement on your site, return a 301 (permanent redirect) to redirect the user. It won’t interrupt their browsing experience, and it’s also a great way to let search engines know about the page’s new location. Use the URL Inspection tool to check if your URL actually returns the correct code.
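For the moved-page case, a minimal sketch of the lookup might look like the following. The `MOVED_PAGES` mapping and the `redirect_response` helper are hypothetical; in practice redirects are usually configured in the web server or CMS rather than in application code.

```python
# Hypothetical mapping of old paths to their replacements.
MOVED_PAGES = {"/old-guide": "/new-guide"}


def redirect_response(path):
    """Return (301, headers) for a moved page, or None if no redirect applies."""
    new_path = MOVED_PAGES.get(path)
    if new_path is None:
        return None
    # 301 signals a permanent move; Location carries the new address.
    return 301, {"Location": new_path}
```

With this mapping, `redirect_response("/old-guide")` yields a 301 with a `Location` header pointing at `/new-guide`, and any other path returns `None` so normal handling can continue.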
The page and content still exist
If an otherwise fine page was flagged with a soft 404 error, it likely didn’t load properly for Googlebot, was missing critical resources, or displayed a prominent error message during rendering. Use the URL Inspection tool to examine the rendered content and the returned HTTP code. If the rendered page is empty, nearly empty, or the content has an error message, your page may be referencing many resources that cannot be loaded (images, scripts, and other non-text elements), which can be interpreted as a soft 404. Reasons that resources cannot be loaded include blocked resources (blocked by robots.txt), having too many resources on a page, various server errors, and slow-loading or very large resources.
Hat tip about this from Kenichi Suzuki on Twitter.
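For the blocked-resources case mentioned above, one way to spot candidates is to test each resource URL against your robots.txt rules. The sketch below uses Python's standard `urllib.robotparser`; note that this parser does not necessarily match Googlebot's own rule handling (for example, wildcard patterns), so treat the result as a hint rather than a verdict. The function name and sample URLs are assumptions.

```python
from urllib.robotparser import RobotFileParser


def blocked_resources(robots_txt: str, resource_urls, user_agent: str = "Googlebot"):
    """Return the resource URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in resource_urls if not parser.can_fetch(user_agent, url)]


# Illustrative rules and URLs only.
sample_robots = "User-agent: *\nDisallow: /assets/js/\n"
sample_urls = [
    "https://example.com/assets/js/app.js",
    "https://example.com/assets/css/site.css",
]
```

With these sample rules, `blocked_resources(sample_robots, sample_urls)` returns only the JavaScript URL, which is the kind of resource that can leave a rendered page looking empty.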
These are the changes spotted in recent days in Google’s help documentation.
Discussion forum on Twitter.