Enhance Search Engine Indexing with sitemap.xml

I had this article in progress for several years and sent parts of it to colleagues as needed. Martin Pešout pointed me to ContentKing and their great articles. Feel free to read their XML Sitemap article, or the summary below.

What not to forget

  • Include only SEO-relevant URLs. Google treats the URLs listed in sitemap.xml as the important ones and prioritizes them when crawling and indexing.
  • It is also a good place to list URLs you want dropped from the index. Removed URLs listed here are recrawled and processed quickly.
  • The loc tag is required. It must contain an absolute and canonical URL (of course, the same one as in the page's canonical link tag). A self-canonical URL, where the page points to itself as canonical, also counts as canonical.
  • On multilingual sites, do not forget to state the hreflang language alternates. Don't duplicate hreflang in the page head. Use only one solution: either tags in the page head or the sitemap.xml.
  • The lastmod tag is optional but significant, because it tells the robot that the content changed and is therefore worth reindexing. Change the date only when the content has changed significantly, so not when you only fix typos :-) Google tends to stop trusting lastmod on pages that are frequently marked as updated with minimal change.
  • The changefreq and priority tags are not necessary when lastmod is used correctly.
  • Check the specifications for the image and video sitemap extensions as well. On larger projects, consider leaving images and videos out of the sitemap for the sake of crawl budget and describing them with JSON-LD instead.
  • Do not put article (news) URLs in the sitemap; use RSS / Atom feeds for them instead. Don't forget to link the feeds from the corresponding link tags in the page head.
  • For larger projects, check the current specifications: each sitemap file is limited to 50 MB uncompressed (gzip compression, e.g. index.xml.gz, can be used) and to a maximum of 50,000 URLs. Beyond that, split the sitemap into several files and reference them from sitemap-index.xml (sitemap-index.xml.gz).
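
To make the points above more concrete, here is a minimal Python sketch of generating sitemap files with loc, lastmod, and hreflang alternates, split at 50,000 URLs and tied together by a sitemap index. The shape of the page dictionaries, the function names, the sitemap-N.xml file names, and the example.com URLs are all assumptions made for illustration; the sketch does not check the 50 MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit mentioned in the list above

# Serialize sitemap elements without a prefix and hreflang links as xhtml:link.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def build_urlset(pages):
    """Build one <urlset> from dicts with 'loc' and optional 'lastmod' / 'alternates'."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # loc is required: an absolute, canonical URL.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page["loc"]
        # lastmod is optional; set it only for significant content changes.
        if page.get("lastmod"):
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = page["lastmod"]
        # hreflang alternates: a mapping of language code to URL.
        for lang, href in page.get("alternates", {}).items():
            ET.SubElement(
                url, f"{{{XHTML_NS}}}link",
                rel="alternate", hreflang=lang, href=href,
            )
    return urlset


def build_sitemaps(pages, base_url):
    """Split pages into files of at most 50,000 URLs; return (index, urlsets)."""
    chunks = [
        pages[i:i + MAX_URLS_PER_FILE]
        for i in range(0, len(pages), MAX_URLS_PER_FILE)
    ]
    urlsets = [build_urlset(chunk) for chunk in chunks]
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for n in range(1, len(urlsets) + 1):
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        # Hypothetical naming scheme for the individual sitemap files.
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = f"{base_url}/sitemap-{n}.xml"
    return index, urlsets


if __name__ == "__main__":
    pages = [{
        "loc": "https://example.com/",
        "lastmod": "2024-01-15",
        "alternates": {"en": "https://example.com/", "cs": "https://example.com/cs/"},
    }]
    index, urlsets = build_sitemaps(pages, "https://example.com")
    print(ET.tostring(urlsets[0], encoding="unicode"))
    print(ET.tostring(index, encoding="unicode"))
```

Each returned element can be written to disk with ET.ElementTree(element).write(path, encoding="utf-8", xml_declaration=True) and gzipped afterwards if needed.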

What should not go into the sitemap

  • Non-canonical pages
  • Duplicate pages
  • Paginated pages from page 2 onward
  • URLs with parameters or session IDs
  • Search results (internal)
  • Various versions created for sharing (abbreviated for Twitter, e‑mail, etc.)
  • URLs created by filtering that are not important for indexing (see SEO formulas and noindex)
  • Archived pages
  • Any 3xx redirects, missing 4xx pages, or 5xx errors
  • Pages blocked in robots.txt
  • Pages marked noindex
  • Pages after submitting the form, etc.
  • Pages relevant only to users, such as login, contact form, privacy policy, etc.
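
Several of these rules can be checked mechanically before a URL is written to the sitemap. The following Python sketch is one possible filter, not a complete one: the field names of the page dictionary (url, status, canonical, noindex, blocked_by_robots) are hypothetical, and it assumes that pagination, internal search, filtering, and session IDs all show up as query parameters, which may not hold on every site.

```python
from urllib.parse import urlsplit


def belongs_in_sitemap(page):
    """Return True if a crawled page passes the basic exclusion rules above.

    `page` is a hypothetical dict with 'url' and 'status', plus optional
    'canonical', 'noindex' and 'blocked_by_robots' keys.
    """
    # 3xx redirects, 4xx missing pages and 5xx errors are excluded.
    if page["status"] != 200:
        return False
    # Pages marked noindex or blocked in robots.txt are excluded.
    if page.get("noindex") or page.get("blocked_by_robots"):
        return False
    # Non-canonical pages: the canonical URL points somewhere else.
    canonical = page.get("canonical")
    if canonical and canonical != page["url"]:
        return False
    # Assumption: parameters, session IDs, pagination (?page=2) and
    # internal search results all appear in the query string.
    if urlsplit(page["url"]).query:
        return False
    return True


if __name__ == "__main__":
    crawled = [
        {"url": "https://example.com/products", "status": 200},
        {"url": "https://example.com/products?page=2", "status": 200},
        {"url": "https://example.com/old", "status": 301},
    ]
    print([p["url"] for p in crawled if belongs_in_sitemap(p)])
```

Rules that depend on the purpose of a page, such as login pages, form confirmation pages, or the privacy policy, cannot be derived from the URL alone and would need an explicit exclusion list.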

Conclusion

I hope this summary helps you find your way around the topic of sitemaps :-)